summaryrefslogtreecommitdiff
path: root/block
AgeCommit message (Collapse)Author
2021-05-15Merge tag 'block-5.13-2021-05-14' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block fixes from Jens Axboe: - Fix for shared tag set exit (Bart) - Correct ioctl range for zoned ioctls (Damien) - Removed dead/unused function (Lin) - Fix perf regression for shared tags (Ming) - Fix out-of-bounds issue with kyber and preemption (Omar) - BFQ merge fix (Paolo) - Two error handling fixes for nbd (Sun) - Fix weight update in blk-iocost (Tejun) - NVMe pull request (Christoph): - correct the check for using the inline bio in nvmet (Chaitanya Kulkarni) - demote unsupported command warnings (Chaitanya Kulkarni) - fix corruption due to double initializing ANA state (me, Hou Pu) - reset ns->file when open fails (Daniel Wagner) - fix a NULL deref when SEND is completed with error in nvmet-rdma (Michal Kalderon) - Fix kernel-doc warning (Bart) * tag 'block-5.13-2021-05-14' of git://git.kernel.dk/linux-block: block/partitions/efi.c: Fix the efi_partition() kernel-doc header blk-mq: Swap two calls in blk_mq_exit_queue() blk-mq: plug request for shared sbitmap nvmet: use new ana_log_size instead the old one nvmet: seset ns->file when open fails nbd: share nbd_put and return by goto put_nbd nbd: Fix NULL pointer in flush_workqueue blkdev.h: remove unused codes blk_account_rq block, bfq: avoid circular stable merges blk-iocost: fix weight updates of inner active iocgs nvmet: demote fabrics cmd parse err msg to debug nvmet: use helper to remove the duplicate code nvmet: demote discovery cmd parse err msg to debug nvmet-rdma: Fix NULL deref when SEND is completed with error nvmet: fix inline bio check for passthru nvmet: fix inline bio check for bdev-ns nvme-multipath: fix double initialization of ANA state kyber: fix out of bounds access when preempted block: uapi: fix comment about block device ioctl
2021-05-14block/partitions/efi.c: Fix the efi_partition() kernel-doc headerBart Van Assche
Fix the following kernel-doc warning: block/partitions/efi.c:685: warning: wrong kernel-doc identifier on line: * efi_partition(struct parsed_partitions *state) Cc: Alexander Viro <viro@math.psu.edu> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20210513171708.8391-1-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-05-14blk-mq: Swap two calls in blk_mq_exit_queue()Bart Van Assche
If a tag set is shared across request queues (e.g. SCSI LUNs) then the block layer core keeps track of the number of active request queues in tags->active_queues. blk_mq_tag_busy() and blk_mq_tag_idle() update that atomic counter if the hctx flag BLK_MQ_F_TAG_QUEUE_SHARED is set. Make sure that blk_mq_exit_queue() calls blk_mq_tag_idle() before that flag is cleared by blk_mq_del_queue_tag_set(). Cc: Christoph Hellwig <hch@infradead.org> Cc: Ming Lei <ming.lei@redhat.com> Cc: Hannes Reinecke <hare@suse.com> Fixes: 0d2602ca30e4 ("blk-mq: improve support for shared tags maps") Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20210513171529.7977-1-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-05-14blk-mq: plug request for shared sbitmapMing Lei
In case of shared sbitmap, request won't be held in plug list any more sine commit 32bc15afed04 ("blk-mq: Facilitate a shared sbitmap per tagset"), this way makes request merge from flush plug list & batching submission not possible, so cause performance regression. Yanhui reports performance regression when running sequential IO test(libaio, 16 jobs, 8 depth for each job) in VM, and the VM disk is emulated with image stored on xfs/megaraid_sas. Fix the issue by recovering original behavior to allow to hold request in plug list. Cc: Yanhui Ma <yama@redhat.com> Cc: John Garry <john.garry@huawei.com> Cc: Bart Van Assche <bvanassche@acm.org> Cc: kashyap.desai@broadcom.com Fixes: 32bc15afed04 ("blk-mq: Facilitate a shared sbitmap per tagset") Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20210514022052.1047665-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-05-12block, bfq: avoid circular stable mergesPaolo Valente
BFQ may merge a new bfq_queue, stably, with the last bfq_queue created. In particular, BFQ first waits a little bit for some I/O to flow inside the new queue, say Q2, if this is needed to understand whether it is better or worse to merge Q2 with the last queue created, say Q1. This delayed stable merge is performed by assigning bic->stable_merge_bfqq = Q1, for the bic associated with Q1. Yet, while waiting for some I/O to flow in Q2, a non-stable queue merge of Q2 with Q1 may happen, causing the bic previously associated with Q2 to be associated with exactly Q1 (bic->bfqq = Q1). After that, Q2 and Q1 may happen to be split, and, in the split, Q1 may happen to be recycled as a non-shared bfq_queue. In that case, Q1 may then happen to undergo a stable merge with the bfq_queue pointed by bic->stable_merge_bfqq. Yet bic->stable_merge_bfqq still points to Q1. So Q1 would be merged with itself. This commit fixes this error by intercepting this situation, and canceling the schedule of the stable merge. Fixes: 430a67f9d616 ("block, bfq: merge bursts of newly-created queues") Signed-off-by: Pietro Pedroni <pedroni.pietro.96@gmail.com> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Link: https://lore.kernel.org/r/20210512094352.85545-2-paolo.valente@linaro.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-05-11blk-iocost: fix weight updates of inner active iocgsTejun Heo
When the weight of an active iocg is updated, weight_updated() is called which in turn calls __propagate_weights() to update the active and inuse weights so that the effective hierarchical weights are update accordingly. The current implementation is incorrect for inner active nodes. For an active leaf iocg, inuse can be any value between 1 and active and the difference represents how much the iocg is donating. When weight is updated, as long as inuse is clamped between 1 and the new weight, we're alright and this is what __propagate_weights() currently implements. However, that's not how an active inner node's inuse is set. An inner node's inuse is solely determined by the ratio between the sums of inuse's and active's of its children - ie. they're results of propagating the leaves' active and inuse weights upwards. __propagate_weights() incorrectly applies the same clamping as for a leaf when an active inner node's weight is updated. Consider a hierarchy which looks like the following with saturating workloads in AA and BB. R / \ A B | | AA BB 1. For both A and B, active=100, inuse=100, hwa=0.5, hwi=0.5. 2. echo 200 > A/io.weight 3. __propagate_weights() update A's active to 200 and leave inuse at 100 as it's already between 1 and the new active, making A:active=200, A:inuse=100. As R's active_sum is updated along with A's active, A:hwa=2/3, B:hwa=1/3. However, because the inuses didn't change, the hwi's remain unchanged at 0.5. 4. The weight of A is now twice that of B but AA and BB still have the same hwi of 0.5 and thus are doing the same amount of IOs. Fix it by making __propgate_weights() always calculate the inuse of an active inner iocg based on the ratio of child_inuse_sum to child_active_sum. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Dan Schatzberg <dschatzberg@fb.com> Fixes: 7caa47151ab2 ("blkcg: implement blk-iocost") Cc: stable@vger.kernel.org # v5.4+ Link: https://lore.kernel.org/r/YJsxnLZV1MnBcqjj@slm.duckdns.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-05-11kyber: fix out of bounds access when preemptedOmar Sandoval
__blk_mq_sched_bio_merge() gets the ctx and hctx for the current CPU and passes the hctx to ->bio_merge(). kyber_bio_merge() then gets the ctx for the current CPU again and uses that to get the corresponding Kyber context in the passed hctx. However, the thread may be preempted between the two calls to blk_mq_get_ctx(), and the ctx returned the second time may no longer correspond to the passed hctx. This "works" accidentally most of the time, but it can cause us to read garbage if the second ctx came from an hctx with more ctx's than the first one (i.e., if ctx->index_hw[hctx->type] > hctx->nr_ctx). This manifested as this UBSAN array index out of bounds error reported by Jakub: UBSAN: array-index-out-of-bounds in ../kernel/locking/qspinlock.c:130:9 index 13106 is out of range for type 'long unsigned int [128]' Call Trace: dump_stack+0xa4/0xe5 ubsan_epilogue+0x5/0x40 __ubsan_handle_out_of_bounds.cold.13+0x2a/0x34 queued_spin_lock_slowpath+0x476/0x480 do_raw_spin_lock+0x1c2/0x1d0 kyber_bio_merge+0x112/0x180 blk_mq_submit_bio+0x1f5/0x1100 submit_bio_noacct+0x7b0/0x870 submit_bio+0xc2/0x3a0 btrfs_map_bio+0x4f0/0x9d0 btrfs_submit_data_bio+0x24e/0x310 submit_one_bio+0x7f/0xb0 submit_extent_page+0xc4/0x440 __extent_writepage_io+0x2b8/0x5e0 __extent_writepage+0x28d/0x6e0 extent_write_cache_pages+0x4d7/0x7a0 extent_writepages+0xa2/0x110 do_writepages+0x8f/0x180 __writeback_single_inode+0x99/0x7f0 writeback_sb_inodes+0x34e/0x790 __writeback_inodes_wb+0x9e/0x120 wb_writeback+0x4d2/0x660 wb_workfn+0x64d/0xa10 process_one_work+0x53a/0xa80 worker_thread+0x69/0x5b0 kthread+0x20b/0x240 ret_from_fork+0x1f/0x30 Only Kyber uses the hctx, so fix it by passing the request_queue to ->bio_merge() instead. BFQ and mq-deadline just use that, and Kyber can map the queues itself to avoid the mismatch. Fixes: a6088845c2bf ("block: kyber: make kyber more friendly with merging") Reported-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Omar Sandoval <osandov@fb.com> Link: https://lore.kernel.org/r/c7598605401a48d5cfeadebb678abd10af22b83f.1620691329.git.osandov@fb.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-05-09Merge tag 'block-5.13-2021-05-09' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block fix from Jens Axboe: "Turns out the bio max size change still has issues, so let's get it reverted for 5.13-rc1. We'll shake out the issues there and defer it to 5.14 instead" * tag 'block-5.13-2021-05-09' of git://git.kernel.dk/linux-block: Revert "bio: limit bio max size"
2021-05-08Revert "bio: limit bio max size"Jens Axboe
This reverts commit cd2c7545ae1beac3b6aae033c7f31193b3255946. Alex reports that the commit causes corruption with LUKS on ext4. Revert it for now so that this can be investigated properly. Link: https://lore.kernel.org/linux-block/1620493841.bxdq8r5haw.none@localhost/ Reported-by: Alex Xu (Hello71) <alex_y_xu@yahoo.ca> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-05-07Merge tag 'block-5.13-2021-05-07' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block fixes from Jens Axboe: - dasd spelling fixes (Bhaskar) - Limit bio max size on multi-page bvecs to the hardware limit, to avoid overly large bio's (and hence latencies). Originally queued for the merge window, but needed a fix and was dropped from the initial pull (Changheun) - NVMe pull request (Christoph): - reset the bdev to ns head when failover (Daniel Wagner) - remove unsupported command noise (Keith Busch) - misc passthrough improvements (Kanchan Joshi) - fix controller ioctl through ns_head (Minwoo Im) - fix controller timeouts during reset (Tao Chiu) - rnbd fixes/cleanups (Gioh, Md, Dima) - Fix iov_iter re-expansion (yangerkun) * tag 'block-5.13-2021-05-07' of git://git.kernel.dk/linux-block: block: reexpand iov_iter after read/write nvmet: remove unsupported command noise nvme-multipath: reset bdev to ns head when failover nvme-pci: fix controller reset hang when racing with nvme_timeout nvme: move the fabrics queue ready check routines to core nvme: avoid memset for passthrough requests nvme: add nvme_get_ns helper nvme: fix controller ioctl through ns_head bio: limit bio max size RDMA/rtrs: fix uninitialized symbol 'cnt' s390: dasd: Mundane spelling fixes block/rnbd: Remove all likely and unlikely block/rnbd-clt: Check the return value of the function rtrs_clt_query block/rnbd: Fix style issues block/rnbd-clt: Change queue_depth type in rnbd_clt_session to size_t
2021-05-06include: remove pagemap.h from blkdev.hMatthew Wilcox (Oracle)
My UEK-derived config has 1030 files depending on pagemap.h before this change. Afterwards, just 326 files need to be rebuilt when I touch pagemap.h. I think blkdev.h is probably included too widely, but untangling that dependency is harder and this solves my problem. x86 allmodconfig builds, but there may be implicit include problems on other architectures. Link: https://lkml.kernel.org/r/20210309195747.283796-1-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Dan Williams <dan.j.williams@intel.com> [nvdimm] Acked-by: Jens Axboe <axboe@kernel.dk> [block] Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Coly Li <colyli@suse.de> [bcache] Acked-by: Martin K. Petersen <martin.petersen@oracle.com> [scsi] Reviewed-by: William Kucharski <william.kucharski@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-03bio: limit bio max sizeChangheun Lee
bio size can grow up to 4GB when muli-page bvec is enabled. but sometimes it would lead to inefficient behaviors. in case of large chunk direct I/O, - 32MB chunk read in user space - all pages for 32MB would be merged to a bio structure if the pages physical addresses are contiguous. it makes some delay to submit until merge complete. bio max size should be limited to a proper size. When 32MB chunk read with direct I/O option is coming from userspace, kernel behavior is below now in do_direct_IO() loop. it's timeline. | bio merge for 32MB. total 8,192 pages are merged. | total elapsed time is over 2ms. |------------------ ... ----------------------->| | 8,192 pages merged a bio. | at this time, first bio submit is done. | 1 bio is split to 32 read request and issue. |---------------> |---------------> |---------------> ...... |---------------> |--------------->| total 19ms elapsed to complete 32MB read done from device. | If bio max size is limited with 1MB, behavior is changed below. | bio merge for 1MB. 256 pages are merged for each bio. | total 32 bio will be made. | total elapsed time is over 2ms. it's same. | but, first bio submit timing is fast. about 100us. |--->|--->|--->|---> ... -->|--->|--->|--->|--->| | 256 pages merged a bio. | at this time, first bio submit is done. | and 1 read request is issued for 1 bio. |---------------> |---------------> |---------------> ...... |---------------> |--------------->| total 17ms elapsed to complete 32MB read done from device. | As a result, read request issue timing is faster if bio max size is limited. Current kernel behavior with multipage bvec, super large bio can be created. And it lead to delay first I/O request issue. Signed-off-by: Changheun Lee <nanich.lee@samsung.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20210503095203.29076-1-nanich.lee@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-30cgroup: rstat: punt root-level optimization to individual controllersJohannes Weiner
Current users of the rstat code can source root-level statistics from the native counters of their respective subsystem, allowing them to forego aggregation at the root level. This optimization is currently implemented inside the generic rstat code, which doesn't track the root cgroup and doesn't invoke the subsystem flush callbacks on it. However, the memory controller cannot do this optimization, because cgroup1 breaks out memory specifically for the local level, including at the root level. In preparation for the memory controller switching to rstat, move the optimization from rstat core to the controllers. Afterwards, rstat will always track the root cgroup for changes and invoke the subsystem callbacks on it; and it's up to the subsystem to special-case and skip aggregation of the root cgroup if it can source this information through other, cheaper means. This is the case for the io controller and the cgroup base stats. In their respective flush callbacks, check whether the parent is the root cgroup, and if so, skip the unnecessary upward propagation. The extra cost of tracking the root cgroup is negligible: on stat changes, we actually remove a branch that checks for the root. The queueing for a flush touches only per-cpu data, and only the first stat change since a flush requires a (per-cpu) lock. Link: https://lkml.kernel.org/r/20210209163304.77088-6-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Tejun Heo <tj@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Michal Koutný <mkoutny@suse.com> Cc: Roman Gushchin <guro@fb.com> Cc: Shakeel Butt <shakeelb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-04-28Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsiLinus Torvalds
Pull SCSI updates from James Bottomley: "This consists of the usual driver updates (ufs, target, tcmu, smartpqi, lpfc, zfcp, qla2xxx, mpt3sas, pm80xx). The major core change is using a sbitmap instead of an atomic for queue tracking" * tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (412 commits) scsi: target: tcm_fc: Fix a kernel-doc header scsi: target: Shorten ALUA error messages scsi: target: Fix two format specifiers scsi: target: Compare explicitly with SAM_STAT_GOOD scsi: sd: Introduce a new local variable in sd_check_events() scsi: dc395x: Open-code status_byte(u8) calls scsi: 53c700: Open-code status_byte(u8) calls scsi: smartpqi: Remove unused functions scsi: qla4xxx: Remove an unused function scsi: myrs: Remove unused functions scsi: myrb: Remove unused functions scsi: mpt3sas: Fix two kernel-doc headers scsi: fcoe: Suppress a compiler warning scsi: libfc: Fix a format specifier scsi: aacraid: Remove an unused function scsi: core: Introduce enum scsi_disposition scsi: core: Modify the scsi_send_eh_cmnd() return value for the SDEV_BLOCK case scsi: core: Rename scsi_softirq_done() into scsi_complete() scsi: core: Remove an incorrect comment scsi: core: Make the scsi_alloc_sgtables() documentation more accurate ...
2021-04-28Merge tag 'for-5.13/block-2021-04-27' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block updates from Jens Axboe: "Pretty quiet round this time, which is nice. In detail: - Series revamping bounce buffer support (Christoph) - Dead code removal (Christoph, Bart) - Partition iteration revamp, now using xarray (Christoph) - Passthrough request scheduler improvements (Lin) - Series of BFQ improvements (Paolo) - Fix ioprio task iteration (Peter) - Various little tweaks and fixes (Tejun, Saravanan, Bhaskar, Max, Nikolay)" * tag 'for-5.13/block-2021-04-27' of git://git.kernel.dk/linux-block: (41 commits) blk-iocost: don't ignore vrate_min on QD contention blk-mq: Fix spurious debugfs directory creation during initialization bfq/mq-deadline: remove redundant check for passthrough request blk-mq: bypass IO scheduler's limit_depth for passthrough request block: Remove an obsolete comment from sg_io() block: move bio_list_copy_data to pktcdvd block: remove zero_fill_bio_iter block: add queue_to_disk() to get gendisk from request_queue block: remove an incorrect check from blk_rq_append_bio block: initialize ret in bdev_disk_changed block: Fix sys_ioprio_set(.which=IOPRIO_WHO_PGRP) task iteration block: remove disk_part_iter block: simplify diskstats_show block: simplify show_partition block: simplify printk_all_partitions block: simplify partition_overlaps block: simplify partition removal block: take bd_mutex around delete_partitions in del_gendisk block: refactor blk_drop_partitions block: move more syncing and invalidation to delete_partition ...
2021-04-27Merge tag 'cfi-v5.13-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull CFI on arm64 support from Kees Cook: "This builds on last cycle's LTO work, and allows the arm64 kernels to be built with Clang's Control Flow Integrity feature. This feature has happily lived in Android kernels for almost 3 years[1], so I'm excited to have it ready for upstream. The wide diffstat is mainly due to the treewide fixing of mismatched list_sort prototypes. Other things in core kernel are to address various CFI corner cases. The largest code portion is the CFI runtime implementation itself (which will be shared by all architectures implementing support for CFI). The arm64 pieces are Acked by arm64 maintainers rather than coming through the arm64 tree since carrying this tree over there was going to be awkward. CFI support for x86 is still under development, but is pretty close. There are a handful of corner cases on x86 that need some improvements to Clang and objtool, but otherwise works well. Summary: - Clean up list_sort prototypes (Sami Tolvanen) - Introduce CONFIG_CFI_CLANG for arm64 (Sami Tolvanen)" * tag 'cfi-v5.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: arm64: allow CONFIG_CFI_CLANG to be selected KVM: arm64: Disable CFI for nVHE arm64: ftrace: use function_nocfi for ftrace_call arm64: add __nocfi to __apply_alternatives arm64: add __nocfi to functions that jump to a physical address arm64: use function_nocfi with __pa_symbol arm64: implement function_nocfi psci: use function_nocfi for cpu_resume lkdtm: use function_nocfi treewide: Change list_sort to use const pointers bpf: disable CFI in dispatcher functions kallsyms: strip ThinLTO hashes from static functions kthread: use WARN_ON_FUNCTION_MISMATCH workqueue: use WARN_ON_FUNCTION_MISMATCH module: ensure __cfi_check alignment mm: add generic function_nocfi macro cfi: add __cficanonical add support for Clang CFI
2021-04-26blk-iocost: don't ignore vrate_min on QD contentionTejun Heo
ioc_adjust_base_vrate() ignored vrate_min when rq_wait_pct indicates that there is QD contention. The reasoning was that QD depletion always reliably indicates device saturation and thus it's safe to override user specified vrate_min. However, this sometimes leads to unnecessary throttling, especially on really fast devices, because vrate adjustments have delays and inertia. It also confuses users because the behavior violates the explicitly specified configuration. This patch drops the special case handling so that vrate_min is always applied. Signed-off-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/YIIo1HuyNmhDeiNx@slm.duckdns.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-21block: return -EBUSY when there are open partitions in blkdev_reread_partChristoph Hellwig
The switch to go through blkdev_get_by_dev means we now ignore the return value from bdev_disk_changed in __blkdev_get. Add a manual check to restore the old semantics. Fixes: 4601b4b130de ("block: reopen the device in blkdev_reread_part") Reported-by: Karel Zak <kzak@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20210421160502.447418-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-16blk-mq: Fix spurious debugfs directory creation during initializationSaravanan D
blk_mq_debugfs_register_sched_hctx() called from device_add_disk()->elevator_init_mq()->blk_mq_init_sched() initialization sequence does not have relevant parent directory setup and thus spuriously attempts "sched" directory creation from root mount of debugfs for every hw queue detected on the block device dmesg ... debugfs: Directory 'sched' with parent '/' already present! debugfs: Directory 'sched' with parent '/' already present! . . debugfs: Directory 'sched' with parent '/' already present! ... The parent debugfs directory for hw queues get properly setup device_add_disk()->blk_register_queue()->blk_mq_debugfs_register() ->blk_mq_debugfs_register_hctx() later in the block device initialization sequence. A simple check for debugfs_dir has been added to thwart premature debugfs directory/file creation attempts. Signed-off-by: Saravanan D <saravanand@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-16bfq/mq-deadline: remove redundant check for passthrough requestLin Feng
Since commit 01e99aeca39796003 'blk-mq: insert passthrough request into hctx->dispatch directly', passthrough request should not appear in IO-scheduler any more, so blk_rq_is_passthrough checking in addon IO schedulers is redundant. (Notes: this patch passes generic IO load test with hdds under SAS controller and hdds under AHCI controller but obviously not covers all. Not sure if passthrough request can still escape into IO scheduler from blk_mq_sched_insert_requests, which is used by blk_mq_flush_plug_list and has lots of indirect callers.) Signed-off-by: Lin Feng <linf@wangsu.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-16blk-mq: bypass IO scheduler's limit_depth for passthrough requestLin Feng
Commit 01e99aeca39796003 ("blk-mq: insert passthrough request into hctx->dispatch directly") gives high priority to passthrough requests and bypass underlying IO scheduler. But as we allocate tag for such request it still runs io-scheduler's callback limit_depth, while we really want is to give full sbitmap-depth capabity to such request for acquiring available tag. blktrace shows PC requests(dmraid -s -c -i) hit bfq's limit_depth: 8,0 2 0 0.000000000 39952 1,0 m N bfq [bfq_limit_depth] wr_busy 0 sync 0 depth 8 8,0 2 1 0.000008134 39952 D R 4 [dmraid] 8,0 2 2 0.000021538 24 C R [0] 8,0 2 0 0.000035442 39952 1,0 m N bfq [bfq_limit_depth] wr_busy 0 sync 0 depth 8 8,0 2 3 0.000038813 39952 D R 24 [dmraid] 8,0 2 4 0.000044356 24 C R [0] This patch introduce a new wrapper to make code not that ugly. Signed-off-by: Lin Feng <linf@wangsu.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20210415033920.213963-1-linf@wangsu.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-13block: Remove an obsolete comment from sg_io()Bart Van Assche
Commit b7819b925918 ("block: remove the blk_execute_rq return value") changed the return type of blk_execute_rq() from int into void. That change made a comment in sg_io() obsolete. Hence remove that comment. Cc: Christoph Hellwig <hch@lst.de> Cc: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Link: https://lore.kernel.org/r/20210413034142.23460-1-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-12block: move bio_list_copy_data to pktcdvdChristoph Hellwig
bio_list_copy_data is only used by pktcdvd, so move it there. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20210412134658.2623190-2-hch@lst.de Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-12block: remove zero_fill_bio_iterChristoph Hellwig
zero_fill_bio_iter is only used to implement zero_fill_bio, so remove the indirection. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20210412134658.2623190-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-12block: remove an incorrect check from blk_rq_append_bioChristoph Hellwig
blk_rq_append_bio is also used for the copy case, not just the map case, so tis debug check is not correct. Fixes: 393bb12e0058 ("block: stop calling blk_queue_bounce for passthrough requests") Reported-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: Guenter Roeck <linux@roeck-us.net> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Link: https://lore.kernel.org/r/20210409150447.1977410-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-08treewide: Change list_sort to use const pointersSami Tolvanen
list_sort() internally casts the comparison function passed to it to a different type with constant struct list_head pointers, and uses this pointer to call the functions, which trips indirect call Control-Flow Integrity (CFI) checking. Instead of removing the consts, this change defines the list_cmp_func_t type and changes the comparison function types of all list_sort() callers to use const pointers, thus avoiding type mismatches. Suggested-by: Nick Desaulniers <ndesaulniers@google.com> Signed-off-by: Sami Tolvanen <samitolvanen@google.com> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Kees Cook <keescook@chromium.org> Tested-by: Nick Desaulniers <ndesaulniers@google.com> Tested-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20210408182843.1754385-10-samitolvanen@google.com
2021-04-08block: Fix sys_ioprio_set(.which=IOPRIO_WHO_PGRP) task iterationPeter Zijlstra
do_each_pid_thread() { } while_each_pid_thread() is a double loop and thus break doesn't work as expected. Also, it should be used under tasklist_lock because otherwise we can race against change_pid() for PGID/SID. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/YG7Q5C4Rb5dx5GFx@hirez.programming.kicks-ass.net Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-08block: remove disk_part_iterChristoph Hellwig
Just open code the xa_for_each in the remaining user. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20210406062303.811835-12-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-08block: simplify diskstats_showChristoph Hellwig
Just use xa_for_each to iterate over the partitions as there is no need to grab a reference to each partition. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20210406062303.811835-11-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-08block: simplify show_partitionChristoph Hellwig
Just use xa_for_each to iterate over the partitions as there is no need to grab a reference to each partition. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20210406062303.811835-10-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-08block: simplify printk_all_partitionsChristoph Hellwig
Just use xa_for_each to iterate over the partitions as there is no need to grab a reference to each partition. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20210406062303.811835-9-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-08block: simplify partition_overlapsChristoph Hellwig
Just use xa_for_each to iterate over the partitions as there is no need to grab a reference to each partition. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20210406062303.811835-8-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-08block: simplify partition removalChristoph Hellwig
Always look up the first available entry instead of the complicated stateful traversal. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20210406062303.811835-7-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-08block: take bd_mutex around delete_partitions in del_gendiskChristoph Hellwig
There is nothing preventing an ioctl from trying do delete partition concurrenly with del_gendisk, so take open_mutex to serialize against that. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20210406062303.811835-6-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-08block: refactor blk_drop_partitionsChristoph Hellwig
Move the busy check and disk-wide sync into the only caller, so that the remainder can be shared with del_gendisk. Also pass the gendisk instead of the bdev as that is all that is needed. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20210406062303.811835-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-08block: move more syncing and invalidation to delete_partitionChristoph Hellwig
Move the calls to fsync_bdev and __invalidate_device from del_gendisk to delete_partition. For the other two callers that check that there are no openers for the delete partitions(s) the callouts are a no-op as no file system can be mounted, but this keeps all the cleanup in one place. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20210406062303.811835-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-08block: remove invalidate_partitionChristoph Hellwig
invalidate_partition has two callers, one of which already performs the remove_inode_hash just after the call. Just open code the function in the two callsites. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20210406062303.811835-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-08dasd: use bdev_disk_changed instead of blk_drop_partitionsChristoph Hellwig
Use the more general interface - the behavior is the same except that now a change uevent is sent, which is the right thing to do when the device becomes unusable. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Stefan Haberland <sth@linux.ibm.com> Link: https://lore.kernel.org/r/20210406062303.811835-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-07blk-zoned: Remove the definition of blk_zone_start()Bart Van Assche
Commit e76239a3748c ("block: add a report_zones method") removed the last blk_zone_start() call. Hence also remove the definition of this function. Cc: Christoph Hellwig <hch@lst.de> Cc: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20210406200820.15180-1-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-07blk-mq: set default elevator as deadline in case of hctx shared tagsetMing Lei
Yanhui found that write performance is degraded a lot after applying hctx shared tagset on one test machine with megaraid_sas. And turns out it is caused by none scheduler which becomes default elevator caused by hctx shared tagset patchset. Given more scsi HBAs will apply hctx shared tagset, and the similar performance exists for them too. So keep previous behavior by still using default mq-deadline for queues which apply hctx shared tagset, just like before. Fixes: 32bc15afed04 ("blk-mq: Facilitate a shared sbitmap per tagset") Reported-by: Yanhui Ma <yama@redhat.com> Cc: John Garry <john.garry@huawei.com> Cc: Hannes Reinecke <hare@suse.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: John Garry <john.garry@huawei.com> Link: https://lore.kernel.org/r/20210406031933.767228-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-06block: stop calling blk_queue_bounce for passthrough requestsChristoph Hellwig
Instead of overloading the passthrough fast path with the deprecated block layer bounce buffering let the users that combine an old undermaintained driver with a highmem system pay the price by always falling back to copies in that case. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20210331073001.46776-9-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-06block: refactor the bounce buffering codeChristoph Hellwig
Get rid of all the PFN arithmetics and just use an enum for the two remaining options, and use PageHighMem for the actual bounce decision. Add a fast path to entirely avoid the call for the common case of a queue not using the legacy bouncing code. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20210331073001.46776-8-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-06block: remove BLK_BOUNCE_ISA supportChristoph Hellwig
Remove the BLK_BOUNCE_ISA support now that all users are gone. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20210331073001.46776-7-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-06blk-mq: Always use blk_mq_is_sbitmap_sharedNikolay Borisov
Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Link: https://lore.kernel.org/r/20210311081713.2763171-1-nborisov@suse.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-06block: add sysfs entry for virt boundary maskMax Gurtovoy
This entry will expose the bio vector alignment mask for a specific block device. Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20210405132012.12504-1-mgurtovoy@nvidia.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-02block: remove the unused RQF_ALLOCED flagChristoph Hellwig
Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-31block: only update parent bi_status when bio failYufen Yu
For multiple split bios, if one of the bio is fail, the whole should return error to application. But we found there is a race between bio_integrity_verify_fn and bio complete, which return io success to application after one of the bio fail. The race as following: split bio(READ) kworker nvme_complete_rq blk_update_request //split error=0 bio_endio bio_integrity_endio queue_work(kintegrityd_wq, &bip->bip_work); bio_integrity_verify_fn bio_endio //split bio __bio_chain_endio if (!parent->bi_status) <interrupt entry> nvme_irq blk_update_request //parent error=7 req_bio_endio bio->bi_status = 7 //parent bio <interrupt exit> parent->bi_status = 0 parent->bi_end_io() // return bi_status=0 The bio has been split as two: split and parent. When split bio completed, it depends on kworker to do endio, while bio_integrity_verify_fn have been interrupted by parent bio complete irq handler. Then, parent bio->bi_status which have been set in irq handler will overwrite by kworker. In fact, even without the above race, we also need to conside the concurrency beteen mulitple split bio complete and update the same parent bi_status. Normally, multiple split bios will be issued to the same hctx and complete from the same irq vector. But if we have updated queue map between multiple split bios, these bios may complete on different hw queue and different irq vector. Then the concurrency update parent bi_status may cause the final status error. Suggested-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Yufen Yu <yuyufen@huawei.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20210331115359.1125679-1-yuyufen@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-27block: don't create too many partitionsMing Lei
Commit a33df75c6328 ("block: use an xarray for disk->part_tbl") drops the check on max supported number of partitionsr, and allows partition with bigger partition numbers to be added. However, ->bd_partno is defined as u8, so partition index of xarray table may not match with ->bd_partno. Then delete_partition() may delete one unmatched partition, and caused use-after-free. Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reported-by: syzbot+8fede7e30c7cee0de139@syzkaller.appspotmail.com Fixes: a33df75c6328 ("block: use an xarray for disk->part_tbl") Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-25block, bfq: merge bursts of newly-created queuesPaolo Valente
Many throughput-sensitive workloads are made of several parallel I/O flows, with all flows generated by the same application, or more generically by the same task (e.g., system boot). The most counterproductive action with these workloads is plugging I/O dispatch when one of the bfq_queues associated with these flows remains temporarily empty. To avoid this plugging, BFQ has been using a burst-handling mechanism for years now. This mechanism has proven effective for throughput, and not detrimental for service guarantees. This commit pushes this mechanism a little bit further, basing on the following two facts. First, all the I/O flows of a the same application or task contribute to the execution/completion of that common application or task. So the performance figures that matter are total throughput of the flows and task-wide I/O latency. In particular, these flows do not need to be protected from each other, in terms of individual bandwidth or latency. Second, the above fact holds regardless of the number of flows. Putting these two facts together, this commits merges stably the bfq_queues associated with these I/O flows, i.e., with the processes that generate these IO/ flows, regardless of how many the involved processes are. To decide whether a set of bfq_queues is actually associated with the I/O flows of a common application or task, and to merge these queues stably, this commit operates as follows: given a bfq_queue, say Q2, currently being created, and the last bfq_queue, say Q1, created before Q2, Q2 is merged stably with Q1 if - very little time has elapsed since when Q1 was created - Q2 has the same ioprio as Q1 - Q2 belongs to the same group as Q1 Merging bfq_queues also reduces scheduling overhead. A fio test with ten random readers on /dev/nullb shows a throughput boost of 40%, with a quadcore. Since BFQ's execution time amounts to ~50% of the total per-request processing time, the above throughput boost implies that BFQ's overhead is reduced by more than 50%. Tested-by: Jan Kara <jack@suse.cz> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Link: https://lore.kernel.org/r/20210304174627.161-7-paolo.valente@linaro.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-25block, bfq: keep shared queues out of the waker mechanismPaolo Valente
Shared queues are likely to receive I/O at a high rate. This may deceptively let them be considered as wakers of other queues. But a false waker will unjustly steal bandwidth to its supposedly woken queue. So considering also shared queues in the waking mechanism may cause more control troubles than throughput benefits. This commit keeps shared queues out of the waker-detection mechanism. Tested-by: Jan Kara <jack@suse.cz> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Link: https://lore.kernel.org/r/20210304174627.161-6-paolo.valente@linaro.org Signed-off-by: Jens Axboe <axboe@kernel.dk>