summaryrefslogtreecommitdiff
path: root/block
AgeCommit message (Collapse)Author
2021-04-08block: take bd_mutex around delete_partitions in del_gendiskChristoph Hellwig
There is nothing preventing an ioctl from trying do delete partition concurrenly with del_gendisk, so take open_mutex to serialize against that. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20210406062303.811835-6-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-08block: refactor blk_drop_partitionsChristoph Hellwig
Move the busy check and disk-wide sync into the only caller, so that the remainder can be shared with del_gendisk. Also pass the gendisk instead of the bdev as that is all that is needed. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20210406062303.811835-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-08block: move more syncing and invalidation to delete_partitionChristoph Hellwig
Move the calls to fsync_bdev and __invalidate_device from del_gendisk to delete_partition. For the other two callers that check that there are no openers for the delete partitions(s) the callouts are a no-op as no file system can be mounted, but this keeps all the cleanup in one place. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20210406062303.811835-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-08block: remove invalidate_partitionChristoph Hellwig
invalidate_partition has two callers, one of which already performs the remove_inode_hash just after the call. Just open code the function in the two callsites. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20210406062303.811835-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-08dasd: use bdev_disk_changed instead of blk_drop_partitionsChristoph Hellwig
Use the more general interface - the behavior is the same except that now a change uevent is sent, which is the right thing to do when the device becomes unusable. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Stefan Haberland <sth@linux.ibm.com> Link: https://lore.kernel.org/r/20210406062303.811835-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-07blk-zoned: Remove the definition of blk_zone_start()Bart Van Assche
Commit e76239a3748c ("block: add a report_zones method") removed the last blk_zone_start() call. Hence also remove the definition of this function. Cc: Christoph Hellwig <hch@lst.de> Cc: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20210406200820.15180-1-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-07blk-mq: set default elevator as deadline in case of hctx shared tagsetMing Lei
Yanhui found that write performance is degraded a lot after applying hctx shared tagset on one test machine with megaraid_sas. And turns out it is caused by none scheduler which becomes default elevator caused by hctx shared tagset patchset. Given more scsi HBAs will apply hctx shared tagset, and the similar performance exists for them too. So keep previous behavior by still using default mq-deadline for queues which apply hctx shared tagset, just like before. Fixes: 32bc15afed04 ("blk-mq: Facilitate a shared sbitmap per tagset") Reported-by: Yanhui Ma <yama@redhat.com> Cc: John Garry <john.garry@huawei.com> Cc: Hannes Reinecke <hare@suse.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: John Garry <john.garry@huawei.com> Link: https://lore.kernel.org/r/20210406031933.767228-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-06block: stop calling blk_queue_bounce for passthrough requestsChristoph Hellwig
Instead of overloading the passthrough fast path with the deprecated block layer bounce buffering let the users that combine an old undermaintained driver with a highmem system pay the price by always falling back to copies in that case. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20210331073001.46776-9-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-06block: refactor the bounce buffering codeChristoph Hellwig
Get rid of all the PFN arithmetics and just use an enum for the two remaining options, and use PageHighMem for the actual bounce decision. Add a fast path to entirely avoid the call for the common case of a queue not using the legacy bouncing code. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20210331073001.46776-8-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-06block: remove BLK_BOUNCE_ISA supportChristoph Hellwig
Remove the BLK_BOUNCE_ISA support now that all users are gone. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20210331073001.46776-7-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-06blk-mq: Always use blk_mq_is_sbitmap_sharedNikolay Borisov
Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Link: https://lore.kernel.org/r/20210311081713.2763171-1-nborisov@suse.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-06block: add sysfs entry for virt boundary maskMax Gurtovoy
This entry will expose the bio vector alignment mask for a specific block device. Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20210405132012.12504-1-mgurtovoy@nvidia.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-25block, bfq: merge bursts of newly-created queuesPaolo Valente
Many throughput-sensitive workloads are made of several parallel I/O flows, with all flows generated by the same application, or more generically by the same task (e.g., system boot). The most counterproductive action with these workloads is plugging I/O dispatch when one of the bfq_queues associated with these flows remains temporarily empty. To avoid this plugging, BFQ has been using a burst-handling mechanism for years now. This mechanism has proven effective for throughput, and not detrimental for service guarantees. This commit pushes this mechanism a little bit further, basing on the following two facts. First, all the I/O flows of a the same application or task contribute to the execution/completion of that common application or task. So the performance figures that matter are total throughput of the flows and task-wide I/O latency. In particular, these flows do not need to be protected from each other, in terms of individual bandwidth or latency. Second, the above fact holds regardless of the number of flows. Putting these two facts together, this commits merges stably the bfq_queues associated with these I/O flows, i.e., with the processes that generate these IO/ flows, regardless of how many the involved processes are. To decide whether a set of bfq_queues is actually associated with the I/O flows of a common application or task, and to merge these queues stably, this commit operates as follows: given a bfq_queue, say Q2, currently being created, and the last bfq_queue, say Q1, created before Q2, Q2 is merged stably with Q1 if - very little time has elapsed since when Q1 was created - Q2 has the same ioprio as Q1 - Q2 belongs to the same group as Q1 Merging bfq_queues also reduces scheduling overhead. A fio test with ten random readers on /dev/nullb shows a throughput boost of 40%, with a quadcore. Since BFQ's execution time amounts to ~50% of the total per-request processing time, the above throughput boost implies that BFQ's overhead is reduced by more than 50%. Tested-by: Jan Kara <jack@suse.cz> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Link: https://lore.kernel.org/r/20210304174627.161-7-paolo.valente@linaro.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-25block, bfq: keep shared queues out of the waker mechanismPaolo Valente
Shared queues are likely to receive I/O at a high rate. This may deceptively let them be considered as wakers of other queues. But a false waker will unjustly steal bandwidth to its supposedly woken queue. So considering also shared queues in the waking mechanism may cause more control troubles than throughput benefits. This commit keeps shared queues out of the waker-detection mechanism. Tested-by: Jan Kara <jack@suse.cz> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Link: https://lore.kernel.org/r/20210304174627.161-6-paolo.valente@linaro.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-25block, bfq: fix weight-raising resume with !low_latencyPaolo Valente
When the io_latency heuristic is off, bfq_queues must not start to be weight-raised. Unfortunately, by mistake, this may happen when the state of a previously weight-raised bfq_queue is resumed after a queue split. This commit fixes this error. Tested-by: Jan Kara <jack@suse.cz> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Link: https://lore.kernel.org/r/20210304174627.161-5-paolo.valente@linaro.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-25block, bfq: make shared queues inherit wakersPaolo Valente
Consider a bfq_queue bfqq that is about to be merged with another bfq_queue new_bfqq. The processes associated with bfqq are cooperators of the processes associated with new_bfqq. So, if bfqq has a waker, then it is reasonable (and beneficial for throughput) to assume that all these processes will be happy to let bfqq's waker freely inject I/O when they have no I/O. So this commit makes new_bfqq inherit bfqq's waker. Tested-by: Jan Kara <jack@suse.cz> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Link: https://lore.kernel.org/r/20210304174627.161-4-paolo.valente@linaro.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-25block, bfq: put reqs of waker and woken in dispatch listPaolo Valente
Consider a new I/O request that arrives for a bfq_queue bfqq. If, when this happens, the only active bfq_queues are bfqq and either its waker bfq_queue or one of its woken bfq_queues, then there is no point in queueing this new I/O request in bfqq for service. In fact, the in-service queue and bfqq agree on serving this new I/O request as soon as possible. So this commit puts this new I/O request directly into the dispatch list. Tested-by: Jan Kara <jack@suse.cz> Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Link: https://lore.kernel.org/r/20210304174627.161-3-paolo.valente@linaro.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-25block, bfq: always inject I/O of queues blocked by wakersPaolo Valente
Suppose that I/O dispatch is plugged, to wait for new I/O for the in-service bfq-queue, say bfqq. Suppose then that there is a further bfq_queue woken by bfqq, and that this woken queue has pending I/O. A woken queue does not steal bandwidth from bfqq, because it remains soon without I/O if bfqq is not served. So there is virtually no risk of loss of bandwidth for bfqq if this woken queue has I/O dispatched while bfqq is waiting for new I/O. In contrast, this extra I/O injection boosts throughput. This commit performs this extra injection. Tested-by: Jan Kara <jack@suse.cz> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Link: https://lore.kernel.org/r/20210304174627.161-2-paolo.valente@linaro.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-25blk-mq: Sentence reconstruct for better readabilityBhaskar Chowdhury
Sentence reconstruction for better readability. Signed-off-by: Bhaskar Chowdhury <unixbhaskar@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-11block: Discard page cache of zone reset target rangeShin'ichiro Kawasaki
When zone reset ioctl and data read race for a same zone on zoned block devices, the data read leaves stale page cache even though the zone reset ioctl zero clears all the zone data on the device. To avoid non-zero data read from the stale page cache after zone reset, discard page cache of reset target zones in blkdev_zone_mgmt_ioctl(). Introduce the helper function blkdev_truncate_zone_range() to discard the page cache. Ensure the page cache discarded by calling the helper function before and after zone reset in same manner as fallocate does. This patch can be applied back to the stable kernel version v5.10.y. Rework is needed for older stable kernels. Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Fixes: 3ed05a987e0f ("blk-zoned: implement ioctls") Cc: <stable@vger.kernel.org> # 5.10+ Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20210311072546.678999-1-shinichiro.kawasaki@wdc.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-11block: Suppress uevent for hidden device when removedDaniel Wagner
register_disk() suppress uevents for devices with the GENHD_FL_HIDDEN but enables uevents at the end again in order to announce disk after possible partitions are created. When the device is removed the uevents are still on and user land sees 'remove' messages for devices which were never 'add'ed to the system. KERNEL[95481.571887] remove /devices/virtual/nvme-fabrics/ctl/nvme5/nvme0c5n1 (block) Let's suppress the uevents for GENHD_FL_HIDDEN by not enabling the uevents at all. Signed-off-by: Daniel Wagner <dwagner@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin Wilck <mwilck@suse.com> Link: https://lore.kernel.org/r/20210311151917.136091-1-dwagner@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-11block: rename BIO_MAX_PAGES to BIO_MAX_VECSChristoph Hellwig
Ever since the addition of multipage bio_vecs BIO_MAX_PAGES has been horribly confusingly misnamed. Rename it to BIO_MAX_VECS to stop confusing users of the bio API. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20210311110137.1132391-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-10block: Fix REQ_OP_ZONE_RESET_ALL handlingDamien Le Moal
Similarly to a single zone reset operation (REQ_OP_ZONE_RESET), execute REQ_OP_ZONE_RESET_ALL operations with REQ_SYNC set. Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-05blk-cgroup: Fix the recursive blkg rwstatXunlei Pang
The current blkio.throttle.io_service_bytes_recursive doesn't work correctly. As an example, for the following blkcg hierarchy: (Made 1GB READ in test1, 512MB READ in test2) test / \ test1 test2 $ head -n 1 test/test1/blkio.throttle.io_service_bytes_recursive 8:0 Read 1073684480 $ head -n 1 test/test2/blkio.throttle.io_service_bytes_recursive 8:0 Read 537448448 $ head -n 1 test/blkio.throttle.io_service_bytes_recursive 8:0 Read 537448448 Clearly, above data of "test" reflects "test2" not "test1"+"test2". Do the correct summary in blkg_rwstat_recursive_sum(). Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-02block/bfq: update comments and default value in docs for fifo_expireJoseph Qi
Correct the comments since bfq_fifo_expire[0] is for async request, while bfq_fifo_expire[1] is for sync request. Also update docs, according the source code, the default fifo_expire_async is 250ms, and fifo_expire_sync is 125ms. Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Acked-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-01block: Drop leftover references to RQF_SORTEDJean Delvare
Commit a1ce35fa49852db60fc6e268038530be533c5b15 ("block: remove dead elevator code") removed all users of RQF_SORTED. However it is still defined, and there is one reference left to it (which in effect is dead code). Clear it all up. Signed-off-by: Jean Delvare <jdelvare@suse.de> Cc: Jens Axboe <axboe@kernel.dk> Cc: Ming Lei <ming.lei@redhat.com> Cc: Omar Sandoval <osandov@fb.com> Cc: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-28block: revert "block: fix bd_size_lock use"Damien Le Moal
With the removal of the skd driver, using IRQ safe locking of a bdev bd_size_lock spinlock to protect the bdev inode size is not necessary anymore as there is no other known driver using this lock under an IRQ disabled context (e.g. calling set_capacity() with IRQ disabled). Revert commit 0fe37724f8e7 ("block: fix bd_size_lock use") which introduced the IRQ safe change. Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-28Merge tag 'block-5.12-2021-02-27' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull more block updates from Jens Axboe: "A few stragglers (and one due to me missing it originally), and fixes for changes in this merge window mostly. In particular: - blktrace cleanups (Chaitanya, Greg) - Kill dead blk_pm_* functions (Bart) - Fixes for the bio alloc changes (Christoph) - Fix for the partition changes (Christoph, Ming) - Fix for turning off iopoll with polled IO inflight (Jeffle) - nbd disconnect fix (Josef) - loop fsync error fix (Mauricio) - kyber update depth fix (Yang) - max_sectors alignment fix (Mikulas) - Add bio_max_segs helper (Matthew)" * tag 'block-5.12-2021-02-27' of git://git.kernel.dk/linux-block: (21 commits) block: Add bio_max_segs blktrace: fix documentation for blk_fill_rw() block: memory allocations in bounce_clone_bio must not fail block: remove the gfp_mask argument to bounce_clone_bio block: fix bounce_clone_bio for passthrough bios block-crypto-fallback: use a bio_set for splitting bios block: fix logging on capacity change blk-settings: align max_sectors on "logical_block_size" boundary block: reopen the device in blkdev_reread_part block: don't skip empty device in in disk_uevent blktrace: remove debugfs file dentries from struct blk_trace nbd: handle device refs for DESTROY_ON_DISCONNECT properly kyber: introduce kyber_depth_updated() loop: fix I/O error on fsync() in detached loop devices block: fix potential IO hang when turning off io_poll block: get rid of the trace rq insert wrapper blktrace: fix blk_rq_merge documentation blktrace: fix blk_rq_issue documentation blktrace: add blk_fill_rwbs documentation comment block: remove superfluous param in blk_fill_rwbs() ...
2021-02-26block: Add bio_max_segsMatthew Wilcox (Oracle)
It's often inconvenient to use BIO_MAX_PAGES due to min() requiring the sign to be the same. Introduce bio_max_segs() and change BIO_MAX_PAGES to be unsigned to make it easier for the users. Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-24block: memory allocations in bounce_clone_bio must not failChristoph Hellwig
The caller can't cope with a failure from bounce_clone_bio, so use __GFP_NOFAIL for the passthrough case. bio_alloc_bioset already won't fail due to the use of mempools. And yes, we need to get rid of this bock layer bouncing code entirely sooner or later.. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-24block: remove the gfp_mask argument to bounce_clone_bioChristoph Hellwig
The only caller always passes GFP_NOIO. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-24block: fix bounce_clone_bio for passthrough biosChristoph Hellwig
Now that bio_alloc_bioset does not fall back to kmalloc for a NULL bio_set, handle that case explicitly and simplify the calling conventions. Based on an earlier patch from Chaitanya Kulkarni. Fixes: 3175199ab0ac ("block: split bio_kmalloc from bio_alloc_bioset") Reported-by: Chaitanya Kulkarni <Chaitanya.Kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-24block-crypto-fallback: use a bio_set for splitting biosChristoph Hellwig
bio_split with a NULL bs argumen used to fall back to kmalloc the bio, which does not guarantee forward progress and could to deadlocks. Now that the overloading of the NULL bs argument to bio_alloc_bioset has been removed it crashes instead. Fix all that by using a special crafted bioset. Fixes: 3175199ab0ac ("block: split bio_kmalloc from bio_alloc_bioset") Reported-by: John Stultz <john.stultz@linaro.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: John Stultz <john.stultz@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-23block: fix logging on capacity changeMing Lei
Local variable of 'capacity' stores the previous disk capacity, and 'size' variable records the latest disk capacity, so swap them for fixing logging on capacity change. Cc: Christoph Hellwig <hch@lst.de> Fixes: a782483cc1f8 ("block: remove the nr_sects field in struct hd_struct") Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-23blk-settings: align max_sectors on "logical_block_size" boundaryMikulas Patocka
We get I/O errors when we run md-raid1 on the top of dm-integrity on the top of ramdisk. device-mapper: integrity: Bio not aligned on 8 sectors: 0xff00, 0xff device-mapper: integrity: Bio not aligned on 8 sectors: 0xff00, 0xff device-mapper: integrity: Bio not aligned on 8 sectors: 0xffff, 0x1 device-mapper: integrity: Bio not aligned on 8 sectors: 0xffff, 0x1 device-mapper: integrity: Bio not aligned on 8 sectors: 0x8048, 0xff device-mapper: integrity: Bio not aligned on 8 sectors: 0x8147, 0xff device-mapper: integrity: Bio not aligned on 8 sectors: 0x8246, 0xff device-mapper: integrity: Bio not aligned on 8 sectors: 0x8345, 0xbb The ramdisk device has logical_block_size 512 and max_sectors 255. The dm-integrity device uses logical_block_size 4096 and it doesn't affect the "max_sectors" value - thus, it inherits 255 from the ramdisk. So, we have a device with max_sectors not aligned on logical_block_size. The md-raid device sees that the underlying leg has max_sectors 255 and it will split the bios on 255-sector boundary, making the bios unaligned on logical_block_size. In order to fix the bug, we round down max_sectors to logical_block_size. Cc: stable@vger.kernel.org Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-23block: reopen the device in blkdev_reread_partChristoph Hellwig
Historically the BLKRRPART ioctls called into the now defunct ->revalidate method, which caused the sd driver to check if any media is present. When the ->revalidate method was removed this revalidation was lost, leading to lots of I/O errors when using the eject command. Fix this by reopening the device to rescan the partitions, and thus calling the revalidation logic in the sd driver. Fixes: 471bd0af544b ("sd: use bdev_check_media_change") Reported--by: Tom Seewald <tseewald@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: Tom Seewald <tseewald@gmail.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-23block: don't skip empty device in in disk_ueventChristoph Hellwig
Restore the previous behavior by using the correct flag for the whole device ("part0"). Fixes: 99dfc43ecbf6 ("block: use ->bi_bdev for bio based I/O accounting") Reported-by: John Stultz <john.stultz@linaro.org> Tested-by: John Stultz <john.stultz@linaro.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-22kyber: introduce kyber_depth_updated()Yang Yang
Hang occurs when user changes the scheduler queue depth, by writing to the 'nr_requests' sysfs file of that device. The details of the environment that we found the problem are as follows: an eMMC block device total driver tags: 16 default queue_depth: 32 kqd->async_depth initialized in kyber_init_sched() with queue_depth=32 Then we change queue_depth to 256, by writing to the 'nr_requests' sysfs file. But kqd->async_depth don't be updated after queue_depth changes. Now the value of async depth is too small for queue_depth=256, this may cause hang. This patch introduces kyber_depth_updated(), so that kyber can update async depth when queue depth changes. Signed-off-by: Yang Yang <yang.yang@vivo.com> Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-22Merge tag 'for-5.12/block-ipi-2021-02-21' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block IPI updates from Jens Axboe: "Avoid IRQ locking for the block IPI handling (Sebastian Andrzej Siewior)" * tag 'for-5.12/block-ipi-2021-02-21' of git://git.kernel.dk/linux-block: blk-mq: Use llist_head for blk_cpu_done blk-mq: Always complete remote completions requests in softirq smp: Process pending softirqs in flush_smp_call_function_from_idle()
2021-02-22Merge tag 'for-5.12/dm-changes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper updates from Mike Snitzer: - Fix DM integrity's HMAC support to provide enhanced security of internal_hash and journal_mac capabilities. - Various DM writecache fixes to address performance, fix table output to match what was provided at table creation, fix writing beyond end of device when shrinking underlying data device, and a couple other small cleanups. - Add DM crypt support for using trusted keys. - Fix deadlock when swapping to DM crypt device by throttling number of in-flight REQ_SWAP bios. Implemented in DM core so that other bio-based targets can opt-in by setting ti->limit_swap_bios. - Fix various inverted logic bugs in the .iterate_devices callout functions that are used to assess if specific feature or capability is supported across all devices being combined/stacked by DM. - Fix DM era target bugs that exposed users to lost writes or memory leaks. - Add DM core support for passing through inline crypto support of underlying devices. Includes block/keyslot-manager changes that enable extending this support to DM. - Various small fixes and cleanups (spelling fixes, front padding calculation cleanup, cleanup conditional zoned support in targets, etc). * tag 'for-5.12/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (31 commits) dm: fix deadlock when swapping to encrypted device dm: simplify target code conditional on CONFIG_BLK_DEV_ZONED dm: set DM_TARGET_PASSES_CRYPTO feature for some targets dm: support key eviction from keyslot managers of underlying devices dm: add support for passing through inline crypto support block/keyslot-manager: Introduce functions for device mapper support block/keyslot-manager: Introduce passthrough keyslot manager dm era: only resize metadata in preresume dm era: Use correct value size in equality function of writeset tree dm era: Fix bitset memory leaks dm era: Verify the data block size hasn't changed dm era: Reinitialize bitset cache before digesting a new writeset dm era: Update in-core bitset after committing the metadata dm era: Recover committed writeset after crash dm writecache: use bdev_nr_sectors() instead of open-coded equivalent dm writecache: fix writing beyond end of underlying device when shrinking dm table: remove needless request_queue NULL pointer checks dm table: fix zoned iterate_devices based device capability checks dm table: fix DAX iterate_devices based device capability checks dm table: fix iterate_devices based device capability checks ...
2021-02-22Merge tag 'mmc-v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmcLinus Torvalds
Pull MMC updates from Ulf Hansson: "MMC core: - Add support for eMMC inline encryption - Add a helper function to parse DT properties for clock phases - Some improvements and cleanups for the mmc_test module MMC host: - android-goldfish: Remove driver - cqhci: Add support for eMMC inline encryption - dw_mmc-zx: Remove driver - meson-gx: Extend support for scatter-gather to allow SD_IO_RW_EXTENDED - mmci: Add support for probing bus voltage level translator - mtk-sd: Address race condition for request timeouts - sdhci_am654: Add Support for the variant on TI's AM64 SoC - sdhci-esdhc-imx: Prevent kernel panic at ->remove() - sdhci-iproc: Add ACPI bindings for the RPi to enable SD and WiFi on RPi4 - sdhci-msm: Add Inline Crypto Engine support - sdhci-msm: Use actual_clock to improve timeout calculations - sdhci-of-aspeed: Add Andrew Jeffery as maintainer - sdhci-of-aspeed: Extend clock support for the AST2600 variant - sdhci-pci-gli: Increase idle period for low power state for GL9763E - sdhci-pci-o2micro: Make tuning for SDR104 HW more robust - sdhci-sirf: Remove driver - sdhci-xenon: Add support for the AP807 variant - sunxi-mmc: Add support for the A100 variant - sunxi-mmc: Ensure host is suspended during system sleep - tmio: Add detection of data timeout errors - tmio/renesas_sdhi: Extend support for retuning - renesas_sdhi_internal_dmac: Add support for the ->pre|post_req() ops" * tag 'mmc-v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc: (86 commits) mmc: sdhci-esdhc-imx: fix kernel panic when remove module mmc: host: Retire MMC_GOLDFISH mmc: cb710: Use new tasklet API mmc: sdhci-pci-o2micro: Bug fix for SDR104 HW tuning failure mmc: mmc_test: use erase_arg for mmc_erase command mmc: wbsd: Use new tasklet API mmc: via-sdmmc: Use new tasklet API mmc: uniphier-sd: Use new tasklet API mmc: tifm_sd: Use new tasklet API mmc: s3cmci: Use new tasklet API mmc: omap: Use new tasklet API mmc: dw_mmc: Use new tasklet API mmc: au1xmmc: Use new tasklet API mmc: atmel-mci: Use new tasklet API mmc: cavium: Replace spin_lock_irqsave with spin_lock in hard IRQ mmc: queue: Remove unused define mmc: core: Drop redundant bouncesz from struct mmc_card mmc: core: Drop redundant member in struct mmc host mmc: core: Use host instead of card argument to mmc_spi_send_csd() mmc: core: Exclude unnecessary header file ...
2021-02-22block: fix potential IO hang when turning off io_pollJeffle Xu
QUEUE_FLAG_POLL flag will be cleared when turning off 'io_poll', while at that moment there may be IOs stuck in hw queue uncompleted. The following polling routine won't help reap these IOs, since blk_poll() will return immediately because of cleared QUEUE_FLAG_POLL flag. Thus these IOs will hang until they finnaly time out. The hang out can be observed by 'fio --engine=io_uring iodepth=1', while turning off 'io_poll' at the same time. To fix this, freeze and flush the request queue first when turning off 'io_poll'. Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-22block: get rid of the trace rq insert wrapperChaitanya Kulkarni
Get rid of the wrapper for trace_block_rq_insert() and call the function directly. Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-22block: Remove unused blk_pm_*() function definitionsBart Van Assche
Commit a1ce35fa4985 ("block: remove dead elevator code") removed the last callers of blk_pm_requeue_request(), blk_pm_add_request() and blk_pm_put_request(). Hence remove the definitions of these functions. Removing these functions removes all users of the struct request nr_pending member. Hence also remove 'nr_pending'. Note: 'nr_pending' is no longer used since commit 7cedffec8e75 ("block: Make blk_get_request() block for non-PM requests while suspended"). Cc: Alan Stern <stern@rowland.harvard.edu> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-21Merge tag 'for-5.12/block-2021-02-17' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull core block updates from Jens Axboe: "Another nice round of removing more code than what is added, mostly due to Christoph's relentless pursuit of tech debt removal/cleanups. This pull request contains: - Two series of BFQ improvements (Paolo, Jan, Jia) - Block iov_iter improvements (Pavel) - bsg error path fix (Pan) - blk-mq scheduler improvements (Jan) - -EBUSY discard fix (Jan) - bvec allocation improvements (Ming, Christoph) - bio allocation and init improvements (Christoph) - Store bdev pointer in bio instead of gendisk + partno (Christoph) - Block trace point cleanups (Christoph) - hard read-only vs read-only split (Christoph) - Block based swap cleanups (Christoph) - Zoned write granularity support (Damien) - Various fixes/tweaks (Chunguang, Guoqing, Lei, Lukas, Huhai)" * tag 'for-5.12/block-2021-02-17' of git://git.kernel.dk/linux-block: (104 commits) mm: simplify swapdev_block sd_zbc: clear zone resources for non-zoned case block: introduce blk_queue_clear_zone_settings() zonefs: use zone write granularity as block size block: introduce zone_write_granularity limit block: use blk_queue_set_zoned in add_partition() nullb: use blk_queue_set_zoned() to setup zoned devices nvme: cleanup zone information initialization block: document zone_append_max_bytes attribute block: use bi_max_vecs to find the bvec pool md/raid10: remove dead code in reshape_request block: mark the bio as cloned in bio_iov_bvec_set block: set BIO_NO_PAGE_REF in bio_iov_bvec_set block: remove a layer of indentation in bio_iov_iter_get_pages block: turn the nr_iovecs argument to bio_alloc* into an unsigned short block: remove the 1 and 4 vec bvec_slabs entries block: streamline bvec_alloc block: factor out a bvec_alloc_gfp helper block: move struct biovec_slab to bio.c block: reuse BIO_INLINE_VECS for integrity bvecs ...
2021-02-12blk-mq: Use llist_head for blk_cpu_doneSebastian Andrzej Siewior
With llist_head it is possible to avoid the locking (the irq-off region) when items are added. This makes it possible to add items on a remote CPU without additional locking. llist_add() returns true if the list was previously empty. This can be used to invoke the SMP function call / raise sofirq only if the first item was added (otherwise it is already pending). This simplifies the code a little and reduces the IRQ-off regions. blk_mq_raise_softirq() needs a preempt-disable section to ensure the request is enqueued on the same CPU as the softirq is raised. Some callers (USB-storage) invoke this path in preemptible context. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-12blk-mq: Always complete remote completions requests in softirqSebastian Andrzej Siewior
Controllers with multiple queues have their IRQ-handelers pinned to a CPU. The core shouldn't need to complete the request on a remote CPU. Remove this case and always raise the softirq to complete the request. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-11dm: support key eviction from keyslot managers of underlying devicesSatya Tangirala
Now that device mapper supports inline encryption, add the ability to evict keys from all underlying devices. When an upper layer requests a key eviction, we simply iterate through all underlying devices and evict that key from each device. Co-developed-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Satya Tangirala <satyat@google.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-02-11block/keyslot-manager: Introduce functions for device mapper supportSatya Tangirala
Introduce blk_ksm_update_capabilities() to update the capabilities of a keyslot manager (ksm) in-place. The pointer to a ksm in a device's request queue may not be easily replaced, because upper layers like the filesystem might access it (e.g. for programming keys/checking capabilities) at the same time the device wants to replace that request queue's ksm (and free the old ksm's memory). This function allows the device to update the capabilities of the ksm in its request queue directly. Devices can safely update the ksm this way without any synchronization with upper layers *only* if the updated (new) ksm continues to support all the crypto capabilities that the old ksm did (see description below for blk_ksm_is_superset() for why this is so). Also introduce blk_ksm_is_superset() which checks whether one ksm's capabilities are a (not necessarily strict) superset of another ksm's. The blk-crypto framework requires that crypto capabilities that were advertised when a bio was created continue to be supported by the device until that bio is ended - in practice this probably means that a device's advertised crypto capabilities can *never* "shrink" (since there's no synchronization between bio creation and when a device may want to change its advertised capabilities) - so a previously advertised crypto capability must always continue to be supported. This function can be used to check that a new ksm is a valid replacement for an old ksm. Signed-off-by: Satya Tangirala <satyat@google.com> Reviewed-by: Eric Biggers <ebiggers@google.com> Acked-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-02-11block/keyslot-manager: Introduce passthrough keyslot managerSatya Tangirala
The device mapper may map over devices that have inline encryption capabilities, and to make use of those capabilities, the DM device must itself advertise those inline encryption capabilities. One way to do this would be to have the DM device set up a keyslot manager with a "sufficiently large" number of keyslots, but that would use a lot of memory. Also, the DM device itself has no "keyslots", and it doesn't make much sense to talk about "programming a key into a DM device's keyslot manager", so all that extra memory used to represent those keyslots is just wasted. All a DM device really needs to be able to do is advertise the crypto capabilities of the underlying devices in a coherent manner and expose a way to evict keys from the underlying devices. There are also devices with inline encryption hardware that do not have a limited number of keyslots. One can send a raw encryption key along with a bio to these devices (as opposed to typical inline encryption hardware that require users to first program a raw encryption key into a keyslot, and send the index of that keyslot along with the bio). These devices also only need the same things from the keyslot manager that DM devices need - a way to advertise crypto capabilities and potentially a way to expose a function to evict keys from hardware. So we introduce a "passthrough" keyslot manager that provides a way to represent a keyslot manager that doesn't have just a limited number of keyslots, and for which do not require keys to be programmed into keyslots. DM devices can set up a passthrough keyslot manager in their request queues, and advertise appropriate crypto capabilities based on those of the underlying devices. Blk-crypto does not attempt to program keys into any keyslots in the passthrough keyslot manager. Instead, if/when the bio is resubmitted to the underlying device, blk-crypto will try to program the key into the underlying device's keyslot manager. Signed-off-by: Satya Tangirala <satyat@google.com> Reviewed-by: Eric Biggers <ebiggers@google.com> Acked-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Mike Snitzer <snitzer@redhat.com>