summaryrefslogtreecommitdiff
path: root/block
AgeCommit message (Collapse)Author
2020-09-28blk-mq: add cond_resched() in __blk_mq_alloc_rq_maps()Xianting Tian
We found blk_mq_alloc_rq_maps() takes more time in kernel space when testing nvme device hot-plugging. The test and anlysis as below. Debug code, 1, blk_mq_alloc_rq_maps(): u64 start, end; depth = set->queue_depth; start = ktime_get_ns(); pr_err("[%d:%s switch:%ld,%ld] queue depth %d, nr_hw_queues %d\n", current->pid, current->comm, current->nvcsw, current->nivcsw, set->queue_depth, set->nr_hw_queues); do { err = __blk_mq_alloc_rq_maps(set); if (!err) break; set->queue_depth >>= 1; if (set->queue_depth < set->reserved_tags + BLK_MQ_TAG_MIN) { err = -ENOMEM; break; } } while (set->queue_depth); end = ktime_get_ns(); pr_err("[%d:%s switch:%ld,%ld] all hw queues init cost time %lld ns\n", current->pid, current->comm, current->nvcsw, current->nivcsw, end - start); 2, __blk_mq_alloc_rq_maps(): u64 start, end; for (i = 0; i < set->nr_hw_queues; i++) { start = ktime_get_ns(); if (!__blk_mq_alloc_rq_map(set, i)) goto out_unwind; end = ktime_get_ns(); pr_err("hw queue %d init cost time %lld ns\n", i, end - start); } Test nvme hot-plugging with above debug code, we found it totally cost more than 3ms in kernel space without being scheduled out when alloc rqs for all 16 hw queues with depth 1023, each hw queue cost about 140-250us. The cost time will be increased with hw queue number and queue depth increasing. And in an extreme case, if __blk_mq_alloc_rq_maps() returns -ENOMEM, it will try "queue_depth >>= 1", more time will be consumed. [ 428.428771] nvme nvme0: pci function 10000:01:00.0 [ 428.428798] nvme 10000:01:00.0: enabling device (0000 -> 0002) [ 428.428806] pcieport 10000:00:00.0: can't derive routing for PCI INT A [ 428.428809] nvme 10000:01:00.0: PCI INT A: no GSI [ 432.593374] [4688:kworker/u33:8 switch:663,2] queue depth 30, nr_hw_queues 1 [ 432.593404] hw queue 0 init cost time 22883 ns [ 432.593408] [4688:kworker/u33:8 switch:663,2] all hw queues init cost time 35960 ns [ 432.595953] nvme nvme0: 16/0/0 default/read/poll queues [ 432.595958] [4688:kworker/u33:8 switch:700,2] queue depth 1023, nr_hw_queues 16 [ 432.596203] hw queue 0 init cost time 242630 ns [ 432.596441] hw queue 1 init cost time 235913 ns [ 432.596659] hw queue 2 init cost time 216461 ns [ 432.596877] hw queue 3 init cost time 215851 ns [ 432.597107] hw queue 4 init cost time 228406 ns [ 432.597336] hw queue 5 init cost time 227298 ns [ 432.597564] hw queue 6 init cost time 224633 ns [ 432.597785] hw queue 7 init cost time 219954 ns [ 432.597937] hw queue 8 init cost time 150930 ns [ 432.598082] hw queue 9 init cost time 143496 ns [ 432.598231] hw queue 10 init cost time 147261 ns [ 432.598397] hw queue 11 init cost time 164522 ns [ 432.598542] hw queue 12 init cost time 143401 ns [ 432.598692] hw queue 13 init cost time 148934 ns [ 432.598841] hw queue 14 init cost time 147194 ns [ 432.598991] hw queue 15 init cost time 148942 ns [ 432.598993] [4688:kworker/u33:8 switch:700,2] all hw queues init cost time 3035099 ns [ 432.602611] nvme0n1: p1 So use this patch to trigger schedule between each hw queue init, to avoid other threads getting stuck. It is not in atomic context when executing __blk_mq_alloc_rq_maps(), so it is safe to call cond_resched(). Signed-off-by: Xianting Tian <tian.xianting@h3c.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-26Merge tag 'scsi-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi Pull SCSI fixes from James Bottomley: "Three fixes: one in drivers (lpfc) and two for zoned block devices. The latter also impinges on the block layer but only to introduce a new block API for setting the zone model rather than fiddling with the queue directly in the zoned block driver" * tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: scsi: sd: sd_zbc: Fix ZBC disk initialization scsi: sd: sd_zbc: Fix handling of host-aware ZBC disks scsi: lpfc: Fix initial FLOGI failure due to BBSCN not supported
2020-09-25iocost: consider iocgs with active delays for debt forgivenessTejun Heo
An iocg may have 0 debt but non-zero delay. The current debt forgiveness logic doesn't act on such iocgs. This can lead to unexpected behaviors - an iocg with a little bit of debt will have its delay canceled through debt forgiveness but one w/o any debt but active delay will have to wait out until its delay decays out. This patch updates the debt handling logic so that it treats delays the same as debts. If either debt or delay is active, debt forgiveness logic kicks in and acts on both the same way. Also, avoid turning the debt and delay directly to zero as that can confuse state transitions. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-25iocost: add iocg_forgive_debt tracepointTejun Heo
Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-25iocost: reimplement debt forgiveness using average usageTejun Heo
Debt forgiveness logic was counting the number of consecutive !busy periods as the trigger condition. While this usually works, it can easily be thrown off by temporary fluctuations especially on configurations w/ short periods. This patch reimplements debt forgiveness so that: * Use the average usage over the forgiveness period instead of counting consecutive periods. * Debt is reduced at around the target rate (1/2 every 100ms) regardless of ioc period duration. * Usage threshold is raised to 50%. Combined with the preceding changes and the switch to average usage, this makes debt forgivness a lot more effective at reducing the amount of unnecessary idleness. * Constants are renamed with DFGV_ prefix. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-25iocost: recalculate delay after debt reductionTejun Heo
Debt sets the initial delay duration which is decayed over time. The current debt reduction halved the debt but didn't change the delay. It prevented future debts from increasing delay but didn't do anything to lower the existing delay, limiting the mechanism's ability to reduce unnecessary idling. Reset iocg->delay to 0 after debt reduction so that iocg_kick_waitq() recalculates new delay value based on the reduced debt amount. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-25iocost: replace nr_shortages cond in ioc_forgive_debts() with busy_level oneTejun Heo
Debt reduction was blocked if any iocg was short on budget in the past period to avoid reducing debts while some iocgs are saturated. However, this ends up unnecessarily blocking debt reduction due to temporary local imbalances when the device is generally being underutilized, while also failing to block when the underlying device is overwhelmed and the usage becomes low from high latency. Given that debt accumulation mostly happens with swapout bursts which can significantly deteriorate the underlying device's latency response, the current logic is not great. Let's replace it with ioc->busy_level based condition so that we block debt reduction when the underlying device is being saturated. ioc_forgive_debts() call is moved after busy_level determination. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-25iocost: factor out ioc_forgive_debts()Tejun Heo
Debt reduction logic is going to be improved and expanded. Factor it out into ioc_forgive_debts() and generalize the comment a bit. No functional change. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-25block: add QUEUE_FLAG_NOWAITMike Snitzer
Add QUEUE_FLAG_NOWAIT to allow a block device to advertise support for REQ_NOWAIT. Bio-based devices may set QUEUE_FLAG_NOWAIT where applicable. Update QUEUE_FLAG_MQ_DEFAULT to include QUEUE_FLAG_NOWAIT. Also update submit_bio_checks() to verify it is set for REQ_NOWAIT bios. Reported-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Suggested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-25block: use bd_partno in bdevnameChristoph Hellwig
No need to go through the hd_struct to find the partition number. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-25block: add a bdev_is_partition helperChristoph Hellwig
Add a littler helper to make the somewhat arcane bd_contains checks a little more obvious. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Ulf Hansson <ulf.hansson@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-24Merge branch 'for-5.10/block' into for-5.10/driversJens Axboe
* for-5.10/block: (140 commits) bdi: replace BDI_CAP_NO_{WRITEBACK,ACCT_DIRTY} with a single flag bdi: invert BDI_CAP_NO_ACCT_WB bdi: replace BDI_CAP_STABLE_WRITES with a queue and a sb flag mm: use SWP_SYNCHRONOUS_IO more intelligently bdi: remove BDI_CAP_SYNCHRONOUS_IO bdi: remove BDI_CAP_CGROUP_WRITEBACK block: lift setting the readahead size into the block layer md: update the optimal I/O size on reshape bdi: initialize ->ra_pages and ->io_pages in bdi_init aoe: set an optimal I/O size bcache: inherit the optimal I/O size drbd: remove dead code in device_to_statistics fs: remove the unused SB_I_MULTIROOT flag block: mark blkdev_get static PM: mm: cleanup swsusp_swap_check mm: split swap_type_of PM: rewrite is_hibernate_resume_dev to not require an inode mm: cleanup claim_swapfile ocfs2: cleanup o2hb_region_dev_store dasd: cleanup dasd_scan_partitions ...
2020-09-24bdi: replace BDI_CAP_STABLE_WRITES with a queue and a sb flagChristoph Hellwig
The BDI_CAP_STABLE_WRITES is one of the few bits of information in the backing_dev_info shared between the block drivers and the writeback code. To help untangling the dependency replace it with a queue flag and a superblock flag derived from it. This also helps with the case of e.g. a file system requiring stable writes due to its own checksumming, but not forcing it on other users of the block device like the swap code. One downside is that we an't support the stable_pages_required bdi attribute in sysfs anymore. It is replaced with a queue attribute which also is writable for easier testing. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-24bdi: remove BDI_CAP_CGROUP_WRITEBACKChristoph Hellwig
Just checking SB_I_CGROUPWB for cgroup writeback support is enough. Either the file system allocates its own bdi (e.g. btrfs), in which case it is known to support cgroup writeback, or the bdi comes from the block layer, which always supports cgroup writeback. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-24block: lift setting the readahead size into the block layerChristoph Hellwig
Drivers shouldn't really mess with the readahead size, as that is a VM concept. Instead set it based on the optimal I/O size by lifting the algorithm from the md driver when registering the disk. Also set bdi->io_pages there as well by applying the same scheme based on max_sectors. To ensure the limits work well for stacking drivers a new helper is added to update the readahead limits from the block limits, which is also called from disk_stack_limits. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Mike Snitzer <snitzer@redhat.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Acked-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-24bdi: initialize ->ra_pages and ->io_pages in bdi_initChristoph Hellwig
Set up a readahead size by default, as very few users have a good reason to change it. This means code, ecryptfs, and orangefs now set up the values while they were previously missing it, while ubifs, mtd and vboxsf manually set it to 0 to avoid readahead. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: David Sterba <dsterba@suse.com> [btrfs] Acked-by: Richard Weinberger <richard@nod.at> [ubifs, mtd] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-23block: cleanup blkdev_bszsetChristoph Hellwig
Use blkdev_get_by_dev instead of bdgrab + blkdev_get. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-23block: cleanup partition scanning in register_diskChristoph Hellwig
Use blkdev_get_by_dev instead of open coding it using bdget_disk + blkdev_get, and split the code to read the partition table into a separate helper to make it a little more obvious. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-23block: move the NEED_PART_SCAN flag to struct gendiskChristoph Hellwig
We can only scan for partitions on the whole disk, so move the flag from struct block_device to struct gendisk. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-23block: allow 'chunk_sectors' to be non-power-of-2Mike Snitzer
It is possible, albeit more unlikely, for a block device to have a non power-of-2 for chunk_sectors (e.g. 10+2 RAID6 with 128K chunk_sectors, which results in a full-stripe size of 1280K. This causes the RAID6's io_opt to be advertised as 1280K, and a stacked device _could_ then be made to use a blocksize, aka chunk_sectors, that matches non power-of-2 io_opt of underlying RAID6 -- resulting in stacked device's chunk_sectors being a non power-of-2). Update blk_queue_chunk_sectors() and blk_max_size_offset() to accommodate drivers that need a non power-of-2 chunk_sectors. Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-23block: use lcm_not_zero() when stacking chunk_sectorsMike Snitzer
Like 'io_opt', blk_stack_limits() should stack 'chunk_sectors' using lcm_not_zero() rather than min_not_zero() -- otherwise the final 'chunk_sectors' could result in sub-optimal alignment of IO to component devices in the IO stack. Also, if 'chunk_sectors' isn't a multiple of 'physical_block_size' then it is a bug in the driver and the device should be flagged as 'misaligned'. Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-23block: fix bmd->is_null_mapped initializationChristoph Hellwig
bmd is allocated using kmalloc in bio_alloc_map_data, so make sure is_null_mapped is properly initialized to false for the !null_mapped case. Fixes: f3256075ba49 ("block: remove the BIO_NULL_MAPPED flag") Reported-by: Marc Hartmayer <mhartmay@linux.ibm.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-23block: drop double zeroingJulia Lawall
sg_init_table zeroes its first argument, so the allocation of that argument doesn't have to. the semantic patch that makes this change is as follows: (http://coccinelle.lip6.fr/) // <smpl> @@ expression x; @@ x = - kzalloc + kmalloc (...) ... sg_init_table(x,...) // </smpl> Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-15scsi: sd: sd_zbc: Fix handling of host-aware ZBC disksDamien Le Moal
When CONFIG_BLK_DEV_ZONED is disabled, allow using host-aware ZBC disks as regular disks. In this case, ensure that command completion is correctly executed by changing sd_zbc_complete() to return good_bytes instead of 0 and causing a hang during device probe (endless retries). When CONFIG_BLK_DEV_ZONED is enabled and a host-aware disk is detected to have partitions, it will be used as a regular disk. In this case, make sure to not do anything in sd_zbc_revalidate_zones() as that triggers warnings. Since all these different cases result in subtle settings of the disk queue zoned model, introduce the block layer helper function blk_queue_set_zoned() to generically implement setting up the effective zoned model according to the disk type, the presence of partitions on the disk and CONFIG_BLK_DEV_ZONED configuration. Link: https://lore.kernel.org/r/20200915073347.832424-2-damien.lemoal@wdc.com Fixes: b72053072c0b ("block: allow partitions on host aware zone devices") Cc: <stable@vger.kernel.org> Reported-by: Borislav Petkov <bp@alien8.de> Suggested-by: Christoph Hellwig <hch@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2020-09-14blk-throttle: Avoid checking bps/iops limitation if bps or iops is unlimitedBaolin Wang
Do not need check the bps or iops limitation if bps or iops is unlimited. Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-14blk-throttle: Avoid calculating bps/iops limitation repeatedlyBaolin Wang
The tg_may_dispatch() will call tg_with_in_bps_limit() and tg_with_in_iops_limit() to check if we can dispatch a bio or not, which will calculate bps/iops limitation multiple times. But tg_may_dispatch() is always called under queue lock, which means the bps/iops limitation will not change in tg_may_dispatch(). So we can calculate the bps/iops limitation only once, and pass them to tg_with_in_bps_limit() and tg_with_in_iops_limit() to avoid calculating bps/iops limitation repeatedly. Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-14blk-throttle: Define readable macros instead of static variablesBaolin Wang
The 'throtl_grp_quantum' and 'throtl_quantum' are both read-only variables, thus better to use readable macros instead of static variables, which can also save some spaces for .bss area. Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-14blk-throttle: Use readable READ/WRITE macrosBaolin Wang
Use readable READ/WRITE macros instead of magic numbers. Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-14blk-throttle: Fix some comments' typosBaolin Wang
Fix some comments' typos. Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-14iocost: fix infinite loop bug in adjust_inuse_and_calc_cost()Tejun Heo
adjust_inuse_and_calc_cost() is responsible for reducing the amount of donated weights dynamically in period as the budget runs low. Because we don't want to do full donation calculation in period, we keep latching up inuse by INUSE_ADJ_STEP_PCT of the active weight of the cgroup until the resulting hweight_inuse is satisfactory. Unfortunately, the adj_step calculation was reading the active weight before acquiring ioc->lock. Because the current thread could have lost race to activate the iocg to another thread before entering this function, it may read the active weight as zero before acquiring ioc->lock. When this happens, the adj_step is calculated as zero and the incremental adjustment loop becomes an infinite one. Fix it by fetching the active weight after acquiring ioc->lock. Fixes: b0853ab4a238 ("blk-iocost: revamp in-period donation snapbacks") Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-11blk-iocost: fix divide-by-zero in transfer_surpluses()Tejun Heo
Conceptually, root_iocg->hweight_donating must be less than WEIGHT_ONE but all hweight calculations round up and thus it may end up >= WEIGHT_ONE triggering divide-by-zero and other issues. Bound the value to avoid surprises. Fixes: e08d02aa5fc9 ("blk-iocost: implement Andy's method for donation weight updates") Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-11block: introduce part_[begin|end]_io_acctSong Liu
These functions can be used to enable iostat for partitions on devices like md, bcache. Signed-off-by: Song Liu <songliubraving@fb.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-11Merge tag 'block-5.9-2020-09-11' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block fixes from Jens Axboe: - Fix a regression in bdev partition locking (Christoph) - NVMe pull request from Christoph: - cancel async events before freeing them (David Milburn) - revert a broken race fix (James Smart) - fix command processing during resets (Sagi Grimberg) - Fix a kyber crash with requeued flushes (Omar) - Fix __bio_try_merge_page() same_page error for no merging (Ritesh) * tag 'block-5.9-2020-09-11' of git://git.kernel.dk/linux-block: block: Set same_page to false in __bio_try_merge_page if ret is false nvme-fabrics: allow to queue requests for live queues block: only call sched requeue_request() for scheduled requests nvme-tcp: cancel async events before freeing event struct nvme-rdma: cancel async events before freeing event struct nvme-fc: cancel async events before freeing event struct nvme: Revert: Fix controller creation races with teardown flow block: restore a specific error code in bdev_del_partition
2020-09-11blk-mq: always allow reserved allocation in hctx_may_queueMing Lei
NVMe shares tagset between fabric queue and admin queue or between connect_q and NS queue, so hctx_may_queue() can be called to allocate request for these queues. Tags can be reserved in these tagset. Before error recovery, there is often lots of in-flight requests which can't be completed, and new reserved request may be needed in error recovery path. However, hctx_may_queue() can always return false because there is too many in-flight requests which can't be completed during error handling. Finally, nothing can proceed. Fix this issue by always allowing reserved tag allocation in hctx_may_queue(). This is reasonable because reserved tags are supposed to always be available. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Cc: David Milburn <dmilburn@redhat.com> Cc: Ewan D. Milne <emilne@redhat.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-11block: remove duplicate include statement in scsi_ioctl.cTian Tao
scsi/sg.h is included more than once, Remove the one that isn't necessary. Signed-off-by: Tian Tao <tiantao6@hisilicon.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-10blkcg: add plugging support for punt bioXianting Tian
The test and the explaination of the patch as bellow. Before test we added more debug code in blkg_async_bio_workfn(): int count = 0 if (bios.head && bios.head->bi_next) { need_plug = true; blk_start_plug(&plug); } while ((bio = bio_list_pop(&bios))) { /*io_punt is a sysctl user interface to control the print*/ if(io_punt) { printk("[%s:%d] bio start,size:%llu,%d count=%d plug?%d\n", current->comm, current->pid, bio->bi_iter.bi_sector, (bio->bi_iter.bi_size)>>9, count++, need_plug); } submit_bio(bio); } if (need_plug) blk_finish_plug(&plug); Steps that need to be set to trigger *PUNT* io before testing: mount -t btrfs -o compress=lzo /dev/sda6 /btrfs mount -t cgroup2 nodev /cgroup2 mkdir /cgroup2/cg3 echo "+io" > /cgroup2/cgroup.subtree_control echo "8:0 wbps=1048576000" > /cgroup2/cg3/io.max #1000M/s echo $$ > /cgroup2/cg3/cgroup.procs Then use dd command to test btrfs PUNT io in current shell: dd if=/dev/zero of=/btrfs/file bs=64K count=100000 Test hardware environment as below: [root@localhost btrfs]# lscpu Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian CPU(s): 32 On-line CPU(s) list: 0-31 Thread(s) per core: 2 Core(s) per socket: 8 Socket(s): 2 NUMA node(s): 2 Vendor ID: GenuineIntel With above debug code, test command and test environment, I did the tests under 3 different system loads, which are triggered by stress: 1, Run 64 threads by command "stress -c 64 &" [53615.975974] [kworker/u66:18:1490] bio start,size:45583056,8 count=0 plug?1 [53615.975980] [kworker/u66:18:1490] bio start,size:45583064,8 count=1 plug?1 [53615.975984] [kworker/u66:18:1490] bio start,size:45583072,8 count=2 plug?1 [53615.975987] [kworker/u66:18:1490] bio start,size:45583080,8 count=3 plug?1 [53615.975990] [kworker/u66:18:1490] bio start,size:45583088,8 count=4 plug?1 [53615.975993] [kworker/u66:18:1490] bio start,size:45583096,8 count=5 plug?1 ... ... [53615.977041] [kworker/u66:18:1490] bio start,size:45585480,8 count=303 plug?1 [53615.977044] [kworker/u66:18:1490] bio start,size:45585488,8 count=304 plug?1 [53615.977047] [kworker/u66:18:1490] bio start,size:45585496,8 count=305 plug?1 [53615.977050] [kworker/u66:18:1490] bio start,size:45585504,8 count=306 plug?1 [53615.977053] [kworker/u66:18:1490] bio start,size:45585512,8 count=307 plug?1 [53615.977056] [kworker/u66:18:1490] bio start,size:45585520,8 count=308 plug?1 [53615.977058] [kworker/u66:18:1490] bio start,size:45585528,8 count=309 plug?1 2, Run 32 threads by command "stress -c 32 &" [50586.290521] [kworker/u66:6:32351] bio start,size:45806496,8 count=0 plug?1 [50586.290526] [kworker/u66:6:32351] bio start,size:45806504,8 count=1 plug?1 [50586.290529] [kworker/u66:6:32351] bio start,size:45806512,8 count=2 plug?1 [50586.290531] [kworker/u66:6:32351] bio start,size:45806520,8 count=3 plug?1 [50586.290533] [kworker/u66:6:32351] bio start,size:45806528,8 count=4 plug?1 [50586.290535] [kworker/u66:6:32351] bio start,size:45806536,8 count=5 plug?1 ... ... [50586.299640] [kworker/u66:5:32350] bio start,size:45808576,8 count=252 plug?1 [50586.299643] [kworker/u66:5:32350] bio start,size:45808584,8 count=253 plug?1 [50586.299646] [kworker/u66:5:32350] bio start,size:45808592,8 count=254 plug?1 [50586.299649] [kworker/u66:5:32350] bio start,size:45808600,8 count=255 plug?1 [50586.299652] [kworker/u66:5:32350] bio start,size:45808608,8 count=256 plug?1 [50586.299663] [kworker/u66:5:32350] bio start,size:45808616,8 count=257 plug?1 [50586.299665] [kworker/u66:5:32350] bio start,size:45808624,8 count=258 plug?1 [50586.299668] [kworker/u66:5:32350] bio start,size:45808632,8 count=259 plug?1 3, Don't run thread by stress [50861.355246] [kworker/u66:19:32376] bio start,size:13544504,8 count=0 plug?0 [50861.355288] [kworker/u66:19:32376] bio start,size:13544512,8 count=0 plug?0 [50861.355322] [kworker/u66:19:32376] bio start,size:13544520,8 count=0 plug?0 [50861.355353] [kworker/u66:19:32376] bio start,size:13544528,8 count=0 plug?0 [50861.355392] [kworker/u66:19:32376] bio start,size:13544536,8 count=0 plug?0 [50861.355431] [kworker/u66:19:32376] bio start,size:13544544,8 count=0 plug?0 [50861.355468] [kworker/u66:19:32376] bio start,size:13544552,8 count=0 plug?0 [50861.355499] [kworker/u66:19:32376] bio start,size:13544560,8 count=0 plug?0 [50861.355532] [kworker/u66:19:32376] bio start,size:13544568,8 count=0 plug?0 [50861.355575] [kworker/u66:19:32376] bio start,size:13544576,8 count=0 plug?0 [50861.355618] [kworker/u66:19:32376] bio start,size:13544584,8 count=0 plug?0 [50861.355659] [kworker/u66:19:32376] bio start,size:13544592,8 count=0 plug?0 [50861.355740] [kworker/u66:0:32346] bio start,size:13544600,8 count=0 plug?1 [50861.355748] [kworker/u66:0:32346] bio start,size:13544608,8 count=1 plug?1 [50861.355962] [kworker/u66:2:32347] bio start,size:13544616,8 count=0 plug?0 [50861.356272] [kworker/u66:7:31962] bio start,size:13544624,8 count=0 plug?0 [50861.356446] [kworker/u66:7:31962] bio start,size:13544632,8 count=0 plug?0 [50861.356567] [kworker/u66:7:31962] bio start,size:13544640,8 count=0 plug?0 [50861.356707] [kworker/u66:19:32376] bio start,size:13544648,8 count=0 plug?0 [50861.356748] [kworker/u66:15:32355] bio start,size:13544656,8 count=0 plug?0 [50861.356825] [kworker/u66:17:31970] bio start,size:13544664,8 count=0 plug?0 Analysis of above 3 test results with different system load: >From above test, we can see more and more continuous bios can be plugged with system load increasing. When run "stress -c 64 &", 310 continuous bios are plugged; When run "stress -c 32 &", 260 continuous bios are plugged; When don't run stress, at most only 2 continuous bios are plugged, in most cases, bio_list only contains one single bio. How to explain above phenomenon: We know, in submit_bio(), if the bio is a REQ_CGROUP_PUNT io, it will queue a work to workqueue blkcg_punt_bio_wq. But when the workqueue is scheduled, it depends on the system load. When system load is low, the workqueue will be quickly scheduled, and the bio in bio_list will be quickly processed in blkg_async_bio_workfn(), so there is less chance that the same io submit thread can add multiple continuous bios to bio_list before workqueue is scheduled to run. The analysis aligned with above test "3". When system load is high, there is some delay before the workqueue can be scheduled to run, the higher the system load the greater the delay. So there is more chance that the same io submit thread can add multiple continuous bios to bio_list. Then when the workqueue is scheduled to run, there are more continuous bios in bio_list, which will be processed in blkg_async_bio_workfn(). The analysis aligned with above test "1" and "2". According to test, we can get io performance improved with the patch, especially when system load is higher. Another optimazition is to use the plug only when bio_list contains at least 2 bios. Signed-off-by: Xianting Tian <tian.xianting@h3c.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-10block: add a bdev_check_media_change helperChristoph Hellwig
Like check_disk_changed, except that it does not call ->revalidate_disk but leaves that to the caller. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-09block: Set same_page to false in __bio_try_merge_page if ret is falseRitesh Harjani
If we hit the UINT_MAX limit of bio->bi_iter.bi_size and so we are anyway not merging this page in this bio, then it make sense to make same_page also as false before returning. Without this patch, we hit below WARNING in iomap. This mostly happens with very large memory system and / or after tweaking vm dirty threshold params to delay writeback of dirty data. WARNING: CPU: 18 PID: 5130 at fs/iomap/buffered-io.c:74 iomap_page_release+0x120/0x150 CPU: 18 PID: 5130 Comm: fio Kdump: loaded Tainted: G W 5.8.0-rc3 #6 Call Trace: __remove_mapping+0x154/0x320 (unreliable) iomap_releasepage+0x80/0x180 try_to_release_page+0x94/0xe0 invalidate_inode_page+0xc8/0x110 invalidate_mapping_pages+0x1dc/0x540 generic_fadvise+0x3c8/0x450 xfs_file_fadvise+0x2c/0xe0 [xfs] vfs_fadvise+0x3c/0x60 ksys_fadvise64_64+0x68/0xe0 sys_fadvise64+0x28/0x40 system_call_exception+0xf8/0x1c0 system_call_common+0xf0/0x278 Fixes: cc90bc68422 ("block: fix "check bi_size overflow before merge"") Reported-by: Shivaprasad G Bhat <sbhat@linux.ibm.com> Suggested-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Anju T Sudhakar <anju@linux.vnet.ibm.com> Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-08block: only call sched requeue_request() for scheduled requestsOmar Sandoval
Yang Yang reported the following crash caused by requeueing a flush request in Kyber: [ 2.517297] Unable to handle kernel paging request at virtual address ffffffd8071c0b00 ... [ 2.517468] pc : clear_bit+0x18/0x2c [ 2.517502] lr : sbitmap_queue_clear+0x40/0x228 [ 2.517503] sp : ffffff800832bc60 pstate : 00c00145 ... [ 2.517599] Process ksoftirqd/5 (pid: 51, stack limit = 0xffffff8008328000) [ 2.517602] Call trace: [ 2.517606] clear_bit+0x18/0x2c [ 2.517619] kyber_finish_request+0x74/0x80 [ 2.517627] blk_mq_requeue_request+0x3c/0xc0 [ 2.517637] __scsi_queue_insert+0x11c/0x148 [ 2.517640] scsi_softirq_done+0x114/0x130 [ 2.517643] blk_done_softirq+0x7c/0xb0 [ 2.517651] __do_softirq+0x208/0x3bc [ 2.517657] run_ksoftirqd+0x34/0x60 [ 2.517663] smpboot_thread_fn+0x1c4/0x2c0 [ 2.517667] kthread+0x110/0x120 [ 2.517669] ret_from_fork+0x10/0x18 This happens because Kyber doesn't track flush requests, so kyber_finish_request() reads a garbage domain token. Only call the scheduler's requeue_request() hook if RQF_ELVPRIV is set (like we do for the finish_request() hook in blk_mq_free_request()). Now that we're handling it in blk-mq, also remove the check from BFQ. Reported-by: Yang Yang <yang.yang@vivo.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-08block: make QUEUE_SYSFS_BIT_FNS more usefulChristoph Hellwig
Switch to the naming used by the other entries so that we can use the QUEUE_RW_ENTRY helper. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-08block: add helper macros for queue sysfs entriesChristoph Hellwig
Add two helpers macros to avoid boilerplate code for the queue sysfs entries. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-08block: restore a specific error code in bdev_del_partitionChristoph Hellwig
mdadm relies on the fact that deleting an invalid partition returns -ENXIO or -ENOTTY to detect if a block device is a partition or a whole device. Fixes: 08fc1ab6d748 ("block: fix locking in bdev_del_partition") Reported-by: kernel test robot <rong.a.chen@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-07block: Remove unused blk_mq_sched_free_hctx_data()Baolin Wang
Now we usually free the hctx->sched_data by e->type->ops.exit_hctx(), and no users will use blk_mq_sched_free_hctx_data() function. Remove it. Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-07block: Do not discard buffers under a mounted filesystemJan Kara
Discarding blocks and buffers under a mounted filesystem is hardly anything admin wants to do. Usually it will confuse the filesystem and sometimes the loss of buffer_head state (including b_private field) can even cause crashes like: BUG: unable to handle kernel NULL pointer dereference at 0000000000000008 PGD 0 P4D 0 Oops: 0002 [#1] SMP PTI CPU: 4 PID: 203778 Comm: jbd2/dm-3-8 Kdump: loaded Tainted: G O --------- - - 4.18.0-147.5.0.5.h126.eulerosv2r9.x86_64 #1 Hardware name: Huawei RH2288H V3/BC11HGSA0, BIOS 1.57 08/11/2015 RIP: 0010:jbd2_journal_grab_journal_head+0x1b/0x40 [jbd2] ... Call Trace: __jbd2_journal_insert_checkpoint+0x23/0x70 [jbd2] jbd2_journal_commit_transaction+0x155f/0x1b60 [jbd2] kjournald2+0xbd/0x270 [jbd2] So if we don't have block device open with O_EXCL already, claim the block device while we truncate buffer cache. This makes sure any exclusive block device user (such as filesystem) cannot operate on the device while we are discarding buffer cache. Reported-by: Ye Bin <yebin10@huawei.com> Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> [axboe: fix !CONFIG_BLOCK error in truncate_bdev_range()] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-04Merge tag 'block-5.9-2020-09-04' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block fixes from Jens Axboe: "A bit larger than usual this week, mostly due to the NVMe fixes arriving late for -rc3 and hence didn't make last weeks pull request. - NVMe: - instance leak and io boundary fixes from Keith - fc locking fix from Christophe - various tcp/rdma reset during traffic fixes from Sagi - pci use-after-free fix from Tong - tcp target null deref fix from Ziye - Locking fix for partition removal (Christoph) - Ensure bdi->io_pages is always set (me) - Fixup for hd struct reference (Ming) - Fix for zero length bvecs (Ming) - Two small blk-iocost fixes (Tejun)" * tag 'block-5.9-2020-09-04' of git://git.kernel.dk/linux-block: block: allow for_each_bvec to support zero len bvec blk-stat: make q->stats->lock irqsafe blk-iocost: ioc_pd_free() shouldn't assume irq disabled block: fix locking in bdev_del_partition block: release disk reference in hd_struct_free_work block: ensure bdi->io_pages is always initialized nvme-pci: cancel nvme device request before disabling nvme: only use power of two io boundaries nvme: fix controller instance leak nvmet-fc: Fix a missed _irqsave version of spin_lock in 'nvmet_fc_fod_op_done()' nvme: Fix NULL dereference for pci nvme controllers nvme-rdma: fix reset hang if controller died in the middle of a reset nvme-rdma: fix timeout handler nvme-rdma: serialize controller teardown sequences nvme-tcp: fix reset hang if controller died in the middle of a reset nvme-tcp: fix timeout handler nvme-tcp: serialize controller teardown sequences nvme: have nvme_wait_freeze_timeout return if it timed out nvme-fabrics: don't check state NVME_CTRL_NEW for request acceptance nvmet-tcp: Fix NULL dereference when a connect data comes in h2cdata pdu
2020-09-03blk-mq, elevator: Count requests per hctx to improve performanceKashyap Desai
High CPU utilization on "native_queued_spin_lock_slowpath" due to lock contention is possible for mq-deadline and bfq IO schedulers when nr_hw_queues is more than one. It is because kblockd work queue can submit IO from all online CPUs (through blk_mq_run_hw_queues()) even though only one hctx has pending commands. The elevator callback .has_work for mq-deadline and bfq scheduler considers pending work if there are any IOs on request queue but it does not account hctx context. Add a per-hctx 'elevator_queued' count to the hctx to avoid triggering the elevator even though there are no requests queued. [jpg: Relocated atomic_dec() in dd_dispatch_request(), update commit message per Kashyap] Signed-off-by: Kashyap Desai <kashyap.desai@broadcom.com> Signed-off-by: Hannes Reinecke <hare@suse.de> Signed-off-by: John Garry <john.garry@huawei.com> Tested-by: Douglas Gilbert <dgilbert@interlog.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-03blk-mq: Record active_queues_shared_sbitmap per tag_set for when using ↵John Garry
shared sbitmap For when using a shared sbitmap, no longer should the number of active request queues per hctx be relied on for when judging how to share the tag bitmap. Instead maintain the number of active request queues per tag_set, and make the judgement based on that. Originally-from: Kashyap Desai <kashyap.desai@broadcom.com> Signed-off-by: John Garry <john.garry@huawei.com> Tested-by: Don Brace<don.brace@microsemi.com> #SCSI resv cmds patches used Tested-by: Douglas Gilbert <dgilbert@interlog.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-03blk-mq: Record nr_active_requests per queue for when using shared sbitmapJohn Garry
The per-hctx nr_active value can no longer be used to fairly assign a share of tag depth per request queue for when using a shared sbitmap, as it does not consider that the tags are shared tags over all hctx's. For this case, record the nr_active_requests per request_queue, and make the judgement based on that value. Co-developed-with: Kashyap Desai <kashyap.desai@broadcom.com> Signed-off-by: John Garry <john.garry@huawei.com> Tested-by: Don Brace<don.brace@microsemi.com> #SCSI resv cmds patches used Tested-by: Douglas Gilbert <dgilbert@interlog.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-03blk-mq: Relocate hctx_may_queue()John Garry
blk-mq.h and blk-mq-tag.h include on each other, which is less than ideal. Locate hctx_may_queue() to blk-mq.h, as it is not really tag specific code. In this way, we can drop the blk-mq-tag.h include of blk-mq.h Signed-off-by: John Garry <john.garry@huawei.com> Tested-by: Douglas Gilbert <dgilbert@interlog.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-03blk-mq: Facilitate a shared sbitmap per tagsetJohn Garry
Some SCSI HBAs (such as HPSA, megaraid, mpt3sas, hisi_sas_v3 ..) support multiple reply queues with single hostwide tags. In addition, these drivers want to use interrupt assignment in pci_alloc_irq_vectors(PCI_IRQ_AFFINITY). However, as discussed in [0], CPU hotplug may cause in-flight IO completion to not be serviced when an interrupt is shutdown. That problem is solved in commit bf0beec0607d ("blk-mq: drain I/O when all CPUs in a hctx are offline"). However, to take advantage of that blk-mq feature, the HBA HW queuess are required to be mapped to that of the blk-mq hctx's; to do that, the HBA HW queues need to be exposed to the upper layer. In making that transition, the per-SCSI command request tags are no longer unique per Scsi host - they are just unique per hctx. As such, the HBA LLDD would have to generate this tag internally, which has a certain performance overhead. However another problem is that blk-mq assumes the host may accept (Scsi_host.can_queue * #hw queue) commands. In commit 6eb045e092ef ("scsi: core: avoid host-wide host_busy counter for scsi_mq"), the Scsi host busy counter was removed, which would stop the LLDD being sent more than .can_queue commands; however, it should still be ensured that the block layer does not issue more than .can_queue commands to the Scsi host. To solve this problem, introduce a shared sbitmap per blk_mq_tag_set, which may be requested at init time. New flag BLK_MQ_F_TAG_HCTX_SHARED should be set when requesting the tagset to indicate whether the shared sbitmap should be used. Even when BLK_MQ_F_TAG_HCTX_SHARED is set, a full set of tags and requests are still allocated per hctx; the reason for this is that if tags and requests were only allocated for a single hctx - like hctx0 - it may break block drivers which expect a request be associated with a specific hctx, i.e. not always hctx0. This will introduce extra memory usage. This change is based on work originally from Ming Lei in [1] and from Bart's suggestion in [2]. [0] https://lore.kernel.org/linux-block/alpine.DEB.2.21.1904051331270.1802@nanos.tec.linutronix.de/ [1] https://lore.kernel.org/linux-block/20190531022801.10003-1-ming.lei@redhat.com/ [2] https://lore.kernel.org/linux-block/ff77beff-5fd9-9f05-12b6-826922bace1f@huawei.com/T/#m3db0a602f095cbcbff27e9c884d6b4ae826144be Signed-off-by: John Garry <john.garry@huawei.com> Tested-by: Don Brace<don.brace@microsemi.com> #SCSI resv cmds patches used Tested-by: Douglas Gilbert <dgilbert@interlog.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>