summaryrefslogtreecommitdiff
path: root/block
AgeCommit message (Collapse)Author
2017-11-03blk-mq: fix nr_requests wrong value when modify it from sysfsweiping zhang
if blk-mq use "none" io scheduler, nr_request get a wrong value when input a number > tag_set->queue_depth. blk_mq_tag_update_depth will get the smaller one min(nr, set->queue_depth), and then q->nr_request get a wrong value. Reproduce: echo none > /sys/block/nvme0n1/queue/scheduler echo 1000000 > /sys/block/nvme0n1/queue/nr_requests cat /sys/block/nvme0n1/queue/nr_requests 1000000 Signed-off-by: weiping zhang <zhangweiping@didichuxing.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-03block: add a poll_fn callback to struct request_queueChristoph Hellwig
That we we can also poll non blk-mq queues. Mostly needed for the NVMe multipath code, but could also be useful elsewhere. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-03block: introduce GENHD_FL_HIDDENChristoph Hellwig
With this flag a driver can create a gendisk that can be used for I/O submission inside the kernel, but which is not registered as user facing block device. This will be useful for the NVMe multipath implementation. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-03block: don't look at the struct device dev_t in disk_devtChristoph Hellwig
The hidden gendisks introduced in the next patch need to keep the dev field in their struct device empty so that udev won't try to create block device nodes for them. To support that rewrite disk_devt to look at the major and first_minor fields in the gendisk itself instead of looking into the struct device. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-03block: add a blk_steal_bios helperChristoph Hellwig
This helpers allows to bounce steal the uncompleted bios from a request so that they can be reissued on another path. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-03block: provide a direct_make_request helperChristoph Hellwig
This helper allows reinserting a bio into a new queue without much overhead, but requires all queue limits to be the same for the upper and lower queues, and it does not provide any recursion preventions. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Javier González <javier@cnexlabs.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-03Merge branch 'nvme-4.15' of git://git.infradead.org/nvme into for-4.15/blockJens Axboe
Pull NVMe changes from Christoph: "Below are the currently queue nvme updates for Linux 4.15. There are a few more things that could make it for this merge window, but I'd like to get things into linux-next, especially for the unlikely case that Linus decided to cut -rc8. Highlights: - support for SGLs in the PCIe driver (Chaitanya Kulkarni) - disable I/O schedulers for the admin queue (Israel Rukshin) - various Fibre Channel fixes and enhancements (James Smart) - various refactoring for better code sharing between transports (Sagi Grimberg and me) as well as lots of little bits from various contributors."
2017-11-02License cleanup: add SPDX GPL-2.0 license identifier to files with no licenseGreg Kroah-Hartman
Many source files in the tree are missing licensing information, which makes it harder for compliance tools to determine the correct license. By default all files without license information are under the default license of the kernel, which is GPL version 2. Update the files which contain no license information with the 'GPL-2.0' SPDX license identifier. The SPDX identifier is a legally binding shorthand, which can be used instead of the full boiler plate text. This patch is based on work done by Thomas Gleixner and Kate Stewart and Philippe Ombredanne. How this work was done: Patches were generated and checked against linux-4.14-rc6 for a subset of the use cases: - file had no licensing information it it. - file was a */uapi/* one with no licensing information in it, - file was a */uapi/* one with existing licensing information, Further patches will be generated in subsequent months to fix up cases where non-standard license headers were used, and references to license had to be inferred by heuristics based on keywords. The analysis to determine which SPDX License Identifier to be applied to a file was done in a spreadsheet of side by side results from of the output of two independent scanners (ScanCode & Windriver) producing SPDX tag:value files created by Philippe Ombredanne. Philippe prepared the base worksheet, and did an initial spot review of a few 1000 files. The 4.13 kernel was the starting point of the analysis with 60,537 files assessed. Kate Stewart did a file by file comparison of the scanner results in the spreadsheet to determine which SPDX license identifier(s) to be applied to the file. She confirmed any determination that was not immediately clear with lawyers working with the Linux Foundation. Criteria used to select files for SPDX license identifier tagging was: - Files considered eligible had to be source code files. - Make and config files were included as candidates if they contained >5 lines of source - File already had some variant of a license header in it (even if <5 lines). All documentation files were explicitly excluded. The following heuristics were used to determine which SPDX license identifiers to apply. - when both scanners couldn't find any license traces, file was considered to have no license information in it, and the top level COPYING file license applied. For non */uapi/* files that summary was: SPDX license identifier # files ---------------------------------------------------|------- GPL-2.0 11139 and resulted in the first patch in this series. If that file was a */uapi/* path one, it was "GPL-2.0 WITH Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was: SPDX license identifier # files ---------------------------------------------------|------- GPL-2.0 WITH Linux-syscall-note 930 and resulted in the second patch in this series. - if a file had some form of licensing information in it, and was one of the */uapi/* ones, it was denoted with the Linux-syscall-note if any GPL family license was found in the file or had no licensing in it (per prior point). Results summary: SPDX license identifier # files ---------------------------------------------------|------ GPL-2.0 WITH Linux-syscall-note 270 GPL-2.0+ WITH Linux-syscall-note 169 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17 LGPL-2.1+ WITH Linux-syscall-note 15 GPL-1.0+ WITH Linux-syscall-note 14 ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5 LGPL-2.0+ WITH Linux-syscall-note 4 LGPL-2.1 WITH Linux-syscall-note 3 ((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3 ((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1 and that resulted in the third patch in this series. - when the two scanners agreed on the detected license(s), that became the concluded license(s). - when there was disagreement between the two scanners (one detected a license but the other didn't, or they both detected different licenses) a manual inspection of the file occurred. - In most cases a manual inspection of the information in the file resulted in a clear resolution of the license that should apply (and which scanner probably needed to revisit its heuristics). - When it was not immediately clear, the license identifier was confirmed with lawyers working with the Linux Foundation. - If there was any question as to the appropriate license identifier, the file was flagged for further research and to be revisited later in time. In total, over 70 hours of logged manual review was done on the spreadsheet to determine the SPDX license identifiers to apply to the source files by Kate, Philippe, Thomas and, in some cases, confirmation by lawyers working with the Linux Foundation. Kate also obtained a third independent scan of the 4.13 code base from FOSSology, and compared selected files where the other two scanners disagreed against that SPDX file, to see if there was new insights. The Windriver scanner is based on an older version of FOSSology in part, so they are related. Thomas did random spot checks in about 500 files from the spreadsheets for the uapi headers and agreed with SPDX license identifier in the files he inspected. For the non-uapi files Thomas did random spot checks in about 15000 files. In initial set of patches against 4.14-rc6, 3 files were found to have copy/paste license identifier errors, and have been fixed to reflect the correct identifier. Additionally Philippe spent 10 hours this week doing a detailed manual inspection and review of the 12,461 patched files from the initial patch version early this week with: - a full scancode scan run, collecting the matched texts, detected license ids and scores - reviewing anything where there was a license detected (about 500+ files) to ensure that the applied SPDX license was correct - reviewing anything where there was no detection but the patch license was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied SPDX license was correct This produced a worksheet with 20 files needing minor correction. This worksheet was then exported into 3 different .csv files for the different types of files to be modified. These .csv files were then reviewed by Greg. Thomas wrote a script to parse the csv files and add the proper SPDX tag to the file, in the format that the file expected. This script was further refined by Greg based on the output to detect more types of files automatically and to distinguish between header and source .c files (which need different comment types.) Finally Greg ran the script using the .csv files to generate the patches. Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org> Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-01blk-mq: don't restart queue when .get_budget returns BLK_STS_RESOURCEMing Lei
SCSI restarts its queue in scsi_end_request() automatically, so we don't need to handle this case in blk-mq. Especailly any request won't be dequeued in this case, we needn't to worry about IO hang caused by restart vs. dispatch. Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-01blk-mq: don't handle TAG_SHARED in restartMing Lei
Now restart is used in the following cases, and TAG_SHARED is for SCSI only. 1) .get_budget() returns BLK_STS_RESOURCE - if resource in target/host level isn't satisfied, this SCSI device will be added in shost->starved_list, and the whole queue will be rerun (via SCSI's built-in RESTART) in scsi_end_request() after any request initiated from this host/targe is completed. Forget to mention, host level resource can't be an issue for blk-mq at all. - the same is true if resource in the queue level isn't satisfied. - if there isn't outstanding request on this queue, then SCSI's RESTART can't work(blk-mq's can't work too), and the queue will be run after SCSI_QUEUE_DELAY, and finally all starved sdevs will be handled by SCSI's RESTART when this request is finished 2) scsi_dispatch_cmd() returns BLK_STS_RESOURCE - if there isn't onprogressing request on this queue, the queue will be run after SCSI_QUEUE_DELAY - otherwise, SCSI's RESTART covers the rerun. 3) blk_mq_get_driver_tag() failed - BLK_MQ_S_TAG_WAITING covers the cross-queue RESTART for driver allocation. In one word, SCSI's built-in RESTART is enough to cover the queue rerun, and we don't need to pay special attention to TAG_SHARED wrt. restart. In my test on scsi_debug(8 luns), this patch improves IOPS by 20% ~ 30% when running I/O on these 8 luns concurrently. Aslo Roman Pen reported the current RESTART is very expensive especialy when there are lots of LUNs attached in one host, such as in his test, RESTART causes half of IOPS be cut. Fixes: https://marc.info/?l=linux-kernel&m=150832216727524&w=2 Fixes: 6d8c6c0f97ad ("blk-mq: Restart a single queue if tag sets are shared") Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-01blk-mq-sched: improve dispatching from sw queueMing Lei
SCSI devices use host-wide tagset, and the shared driver tag space is often quite big. However, there is also a queue depth for each lun( .cmd_per_lun), which is often small, for example, on both lpfc and qla2xxx, .cmd_per_lun is just 3. So lots of requests may stay in sw queue, and we always flush all belonging to same hw queue and dispatch them all to driver. Unfortunately it is easy to cause queue busy because of the small .cmd_per_lun. Once these requests are flushed out, they have to stay in hctx->dispatch, and no bio merge can happen on these requests, and sequential IO performance is harmed. This patch introduces blk_mq_dequeue_from_ctx for dequeuing a request from a sw queue, so that we can dispatch them in scheduler's way. We can then avoid dequeueing too many requests from sw queue, since we don't flush ->dispatch completely. This patch improves dispatching from sw queue by using the .get_budget and .put_budget callbacks. Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-01blk-mq: introduce .get_budget and .put_budget in blk_mq_opsMing Lei
For SCSI devices, there is often a per-request-queue depth, which needs to be respected before queuing one request. Currently blk-mq always dequeues the request first, then calls .queue_rq() to dispatch the request to lld. One obvious issue with this approach is that I/O merging may not be successful, because when the per-request-queue depth can't be respected, .queue_rq() has to return BLK_STS_RESOURCE, and then this request has to stay in hctx->dispatch list. This means it never gets a chance to be merged with other IO. This patch introduces .get_budget and .put_budget callback in blk_mq_ops, then we can try to get reserved budget first before dequeuing request. If the budget for queueing I/O can't be satisfied, we don't need to dequeue request at all. Hence the request can be left in the IO scheduler queue, for more merging opportunities. Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-01block: kyber: check if there are requests in ctx in kyber_has_work()Ming Lei
There may be request in sw queue, and not fetched to domain queue yet, so check it in kyber_has_work(). Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-01blk-mq-sched: move actual dispatching into one helperMing Lei
So that it becomes easy to support to dispatch from sw queue in the following patch. No functional change. Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com> Reviewed-by: Omar Sandoval <osandov@fb.com> Suggested-by: Christoph Hellwig <hch@lst.de> # for simplifying dispatch logic Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-01blk-mq-sched: dispatch from scheduler IFF progress is made in ->dispatchMing Lei
When the hw queue is busy, we shouldn't take requests from the scheduler queue any more, otherwise it is difficult to do IO merge. This patch fixes the awful IO performance on some SCSI devices(lpfc, qla2xxx, ...) when mq-deadline/kyber is used by not taking requests if hw queue is busy. Reviewed-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-30block: Fix a race between blk_cleanup_queue() and timeout handlingBart Van Assche
Make sure that if the timeout timer fires after a queue has been marked "dying" that the affected requests are finished. Reported-by: chenxiang (M) <chenxiang66@hisilicon.com> Fixes: commit 287922eb0b18 ("block: defer timeouts to a workqueue") Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Tested-by: chenxiang (M) <chenxiang66@hisilicon.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Keith Busch <keith.busch@intel.com> Cc: Hannes Reinecke <hare@suse.com> Cc: Ming Lei <ming.lei@redhat.com> Cc: Johannes Thumshirn <jthumshirn@suse.de> Cc: <stable@vger.kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-26block, locking/lockdep: Assign a lock_class per gendisk used for ↵Byungchul Park
wait_for_completion() Darrick posted the following warning and Dave Chinner analyzed it: > ====================================================== > WARNING: possible circular locking dependency detected > 4.14.0-rc1-fixes #1 Tainted: G W > ------------------------------------------------------ > loop0/31693 is trying to acquire lock: > (&(&ip->i_mmaplock)->mr_lock){++++}, at: [<ffffffffa00f1b0c>] xfs_ilock+0x23c/0x330 [xfs] > > but now in release context of a crosslock acquired at the following: > ((complete)&ret.event){+.+.}, at: [<ffffffff81326c1f>] submit_bio_wait+0x7f/0xb0 > > which lock already depends on the new lock. > > the existing dependency chain (in reverse order) is: > > -> #2 ((complete)&ret.event){+.+.}: > lock_acquire+0xab/0x200 > wait_for_completion_io+0x4e/0x1a0 > submit_bio_wait+0x7f/0xb0 > blkdev_issue_zeroout+0x71/0xa0 > xfs_bmapi_convert_unwritten+0x11f/0x1d0 [xfs] > xfs_bmapi_write+0x374/0x11f0 [xfs] > xfs_iomap_write_direct+0x2ac/0x430 [xfs] > xfs_file_iomap_begin+0x20d/0xd50 [xfs] > iomap_apply+0x43/0xe0 > dax_iomap_rw+0x89/0xf0 > xfs_file_dax_write+0xcc/0x220 [xfs] > xfs_file_write_iter+0xf0/0x130 [xfs] > __vfs_write+0xd9/0x150 > vfs_write+0xc8/0x1c0 > SyS_write+0x45/0xa0 > entry_SYSCALL_64_fastpath+0x1f/0xbe > > -> #1 (&xfs_nondir_ilock_class){++++}: > lock_acquire+0xab/0x200 > down_write_nested+0x4a/0xb0 > xfs_ilock+0x263/0x330 [xfs] > xfs_setattr_size+0x152/0x370 [xfs] > xfs_vn_setattr+0x6b/0x90 [xfs] > notify_change+0x27d/0x3f0 > do_truncate+0x5b/0x90 > path_openat+0x237/0xa90 > do_filp_open+0x8a/0xf0 > do_sys_open+0x11c/0x1f0 > entry_SYSCALL_64_fastpath+0x1f/0xbe > > -> #0 (&(&ip->i_mmaplock)->mr_lock){++++}: > up_write+0x1c/0x40 > xfs_iunlock+0x1d0/0x310 [xfs] > xfs_file_fallocate+0x8a/0x310 [xfs] > loop_queue_work+0xb7/0x8d0 > kthread_worker_fn+0xb9/0x1f0 > > Chain exists of: > &(&ip->i_mmaplock)->mr_lock --> &xfs_nondir_ilock_class --> (complete)&ret.event > > Possible unsafe locking scenario by crosslock: > > CPU0 CPU1 > ---- ---- > lock(&xfs_nondir_ilock_class); > lock((complete)&ret.event); > lock(&(&ip->i_mmaplock)->mr_lock); > unlock((complete)&ret.event); > > *** DEADLOCK *** The warning is a false positive, caused by the fact that all wait_for_completion()s in submit_bio_wait() are waiting with the same lock class. However, some bios have nothing to do with others, for example in the case of loop devices, there's no direct connection between the bios of an upper device and the bios of a lower device(=loop device). The safest way to assign different lock classes to different devices is to do it for each gendisk. In other words, this patch assigns a lockdep_map per gendisk and uses it when initializing completion in submit_bio_wait(). Analyzed-by: Dave Chinner <david@fromorbit.com> Reported-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Byungchul Park <byungchul.park@lge.com> Reviewed-by: Jens Axboe <axboe@kernel.dk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: amir73il@gmail.com Cc: axboe@kernel.dk Cc: david@fromorbit.com Cc: hch@infradead.org Cc: idryomov@gmail.com Cc: johan@kernel.org Cc: johannes.berg@intel.com Cc: kernel-team@lge.com Cc: linux-block@vger.kernel.org Cc: linux-fsdevel@vger.kernel.org Cc: linux-mm@kvack.org Cc: linux-xfs@vger.kernel.org Cc: oleg@redhat.com Cc: tj@kernel.org Link: http://lkml.kernel.org/r/1508921765-15396-10-git-send-email-byungchul.park@lge.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-25mq-deadline: add 'deadline' as a name aliasJens Axboe
The scheduler framework now supports looking up the appropriate scheduler with the {name,mq} tupple. We can register mq-deadline with the alias of 'deadline', so that switching to 'deadline' will do the right thing based on the type of driver attached to it. Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-25elevator: allow name aliasesJens Axboe
Since we now lookup elevator types with the appropriate multiqueue capability, allow schedulers to register with an alias alongside the real name. This is in preparation for allowing 'mq-deadline' to register an alias of 'deadline' as well. Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-25elevator: lookup mq vs non-mq elevatorsJens Axboe
If an IO scheduler is selected via elevator= and it doesn't match the driver in question wrt blk-mq support, then we fail to boot. The elevator= parameter is deprecated and only supported for non-mq devices. Augment the elevator lookup API so that we pass in if we're looking for an mq capable scheduler or not, so that we only ever return a valid type for the queue in question. Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=196695 Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-25block: cope with WRITE ZEROES failing in blkdev_issue_zeroout()Ilya Dryomov
sd_config_write_same() ignores ->max_ws_blocks == 0 and resets it to permit trying WRITE SAME on older SCSI devices, unless ->no_write_same is set. Because REQ_OP_WRITE_ZEROES is implemented in terms of WRITE SAME, blkdev_issue_zeroout() may fail with -EREMOTEIO: $ fallocate -zn -l 1k /dev/sdg fallocate: fallocate failed: Remote I/O error $ fallocate -zn -l 1k /dev/sdg # OK $ fallocate -zn -l 1k /dev/sdg # OK The following calls succeed because sd_done() sets ->no_write_same in response to a sense that would become BLK_STS_TARGET/-EREMOTEIO, causing __blkdev_issue_zeroout() to fall back to generating ZERO_PAGE bios. This means blkdev_issue_zeroout() must cope with WRITE ZEROES failing and fall back to manually zeroing, unless BLKDEV_ZERO_NOFALLBACK is specified. For BLKDEV_ZERO_NOFALLBACK case, return -EOPNOTSUPP if sd_done() has just set ->no_write_same thus indicating lack of offload support. Fixes: c20cfc27a473 ("block: stop using blkdev_issue_write_same for zeroing") Cc: Hannes Reinecke <hare@suse.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-25block: factor out __blkdev_issue_zero_pages()Ilya Dryomov
blkdev_issue_zeroout() will use this in !BLKDEV_ZERO_NOFALLBACK case. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-25block: move CAP_SYS_ADMIN check in blkdev_roset()Ilya Dryomov
Check for CAP_SYS_ADMIN before calling into the driver, similar to blkdev_flushbuf(). This is safer and can spare a check in the driver. (Currently BLKROSET is overridden by md and rbd, rbd is missing the check. md has the check, but it covers a lot more than BLKROSET.) Acked-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-25block: Use DECLARE_COMPLETION_ONSTACK() in submit_bio_wait()Christoph Hellwig
Simplify the code by getting rid of the submit_bio_ret structure. (This also helps address a lockdep false positive.) Signed-off-by: Christoph Hellwig <hch@lst.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: amir73il@gmail.com Cc: axboe@kernel.dk Cc: darrick.wong@oracle.com Cc: david@fromorbit.com Cc: hch@infradead.org Cc: idryomov@gmail.com Cc: johan@kernel.org Cc: johannes.berg@intel.com Cc: kernel-team@lge.com Cc: linux-block@vger.kernel.org Cc: linux-fsdevel@vger.kernel.org Cc: linux-mm@kvack.org Cc: linux-xfs@vger.kernel.org Cc: oleg@redhat.com Cc: tj@kernel.org Link: http://lkml.kernel.org/r/1508921765-15396-2-git-send-email-byungchul.park@lge.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-25locking/atomics: COCCINELLE/treewide: Convert trivial ACCESS_ONCE() patterns ↵Mark Rutland
to READ_ONCE()/WRITE_ONCE() Please do not apply this to mainline directly, instead please re-run the coccinelle script shown below and apply its output. For several reasons, it is desirable to use {READ,WRITE}_ONCE() in preference to ACCESS_ONCE(), and new code is expected to use one of the former. So far, there's been no reason to change most existing uses of ACCESS_ONCE(), as these aren't harmful, and changing them results in churn. However, for some features, the read/write distinction is critical to correct operation. To distinguish these cases, separate read/write accessors must be used. This patch migrates (most) remaining ACCESS_ONCE() instances to {READ,WRITE}_ONCE(), using the following coccinelle script: ---- // Convert trivial ACCESS_ONCE() uses to equivalent READ_ONCE() and // WRITE_ONCE() // $ make coccicheck COCCI=/home/mark/once.cocci SPFLAGS="--include-headers" MODE=patch virtual patch @ depends on patch @ expression E1, E2; @@ - ACCESS_ONCE(E1) = E2 + WRITE_ONCE(E1, E2) @ depends on patch @ expression E; @@ - ACCESS_ONCE(E) + READ_ONCE(E) ---- Signed-off-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: davem@davemloft.net Cc: linux-arch@vger.kernel.org Cc: mpe@ellerman.id.au Cc: shuah@kernel.org Cc: snitzer@redhat.com Cc: thor.thayer@linux.intel.com Cc: tj@kernel.org Cc: viro@zeniv.linux.org.uk Cc: will.deacon@arm.com Link: http://lkml.kernel.org/r/1508792849-3115-19-git-send-email-paulmck@linux.vnet.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-24block: Invalidate cache on discard v2Dmitry Monakhov
It is reasonable drop page cache on discard, otherwise that pages may be written by writeback second later, so thin provision devices will not be happy. This seems to be a security leak in case of secure discard case. Also add check for queue_discard flag on early stage. Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-18block: remove blk_mq_reinit_tagsetSagi Grimberg
No callers left. Reviewed-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-10-18block: introduce blk_mq_tagset_iterSagi Grimberg
Iterator helper to apply a function on all the tags in a given tagset. export it as it will be used outside the block layer later on. Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com> Reviewed-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-10-17kyber: fix hang on domain token wait queueOmar Sandoval
When we're getting a domain token, if we fail to get a token on our first attempt, we put the current hardware queue on a wait queue and then try again just in case a token was freed after our initial attempt but before we got on the wait queue. If this second attempt succeeds, we currently leave the hardware queue on the wait queue. Usually this is okay; we'll just run the hardware queue one extra time when another token is freed. However, if the hardware queue doesn't have any other requests waiting, then when it it gets the extra wakeup, it won't have anything to free and therefore won't wake up any other hardware queues. If tokens are limited, then we won't make forward progress and the device will hang. Reported-by: Bin Zha <zhabin.zb@alibaba-inc.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-16block: fix Sphinx kernel-doc warningRandy Dunlap
Sphinx treats symbols that end with '_' as a kind of special documentation indicator, so fix that by adding an ending '*' to it. ../block/bio.c:404: ERROR: Unknown target name: "gfp". Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-11bio_alloc_map_data(): do bmd->iter setup right thereAl Viro
just need to copy it iter instead of iter->nr_segs Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-10-11bio_copy_user_iov(): saner bio size calculationAl Viro
it's a bounce buffer; we don't *care* how badly is the real source/destination fragmented, all that matters is the total size. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-10-11bio_map_user_iov(): get rid of copying iov_iterAl Viro
we do want *iter advanced Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-10-11bio_copy_from_iter(): get rid of copying iov_iterAl Viro
we want the one passed to it advanced, anyway Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-10-11move more stuff down into bio_copy_user_iov()Al Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-10-11blk_rq_map_user_iov(): move iov_iter_advance() downAl Viro
... into bio_{map,copy}_user_iov() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-10-11bio_map_user_iov(): get rid of the iov_for_each()Al Viro
Use iov_iter_npages() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-10-11bio_map_user_iov(): move alignment check into the main loopAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-10-11don't rely upon subsequent bio_add_pc_page() calls failingAl Viro
... they might actually succeed in some cases (when we are at the queue-imposed segments limit, the next page is not mergable with the last one we'd got in, but the first page covered by the next iovec *is* mergable). Make sure that once it's failed, we are done with that bio. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-10-11... and with iov_iter_get_pages_alloc() it becomes even simplerAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-10-11bio_map_user_iov(): switch to iov_iter_get_pages()/iov_iter_advance()Al Viro
... and to hell with iov_for_each() nonsense Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-10-10bio_copy_user_iov(): don't ignore ->iov_offsetAl Viro
Since "block: support large requests in blk_rq_map_user_iov" we started to call it with partially drained iter; that works fine on the write side, but reads create a copy of iter for completion time. And that needs to take the possibility of ->iov_iter != 0 into account... Cc: stable@vger.kernel.org #v4.5+ Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-10-10more bio_map_user_iov() leak fixesAl Viro
we need to take care of failure exit as well - pages already in bio should be dropped by analogue of bio_unmap_pages(), since their refcounts had been bumped only once per reference in bio. Cc: stable@vger.kernel.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-10-10fix unbalanced page refcounting in bio_map_user_iovVitaly Mayatskikh
bio_map_user_iov and bio_unmap_user do unbalanced pages refcounting if IO vector has small consecutive buffers belonging to the same page. bio_add_pc_page merges them into one, but the page reference is never dropped. Cc: stable@vger.kernel.org Signed-off-by: Vitaly Mayatskikh <v.mayatskih@gmail.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-10-10block: set request_list for requestShaohua Li
Legacy queue sets request's request_list, mq doesn't. This makes mq does the same thing, so we can find cgroup of a request. Note, we really only use blkg field of request_list, it's pointless to allocate mempool for request_list in mq case. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-10blk-stat: delete useless codeShaohua Li
Fix two issues: - the per-cpu stat flush is unnecessary, nobody uses per-cpu stat except sum it to global stat. We can do the calculation there. The flush just wastes cpu time. - some fields are signed int/s64. I don't see the point. Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-10blk-throttle: fix null pointer dereference while throttling writeback IOsJiufei Xue
A null pointer dereference can occur when blkcg is removed manually with writeback IOs inflight. This is caused by the following case: Writeback kworker submit the bio and set bio->bi_cg_private to tg in blk_throtl_assoc_bio. Then we remove the block cgroup manually, the blkg and tg would be freed if there is no request inflight. When the submitted bio come back, blk_throtl_bio_endio() fetch the tg which was already freed. Fix this by increasing the refcount of blkg in funcion blk_throtl_assoc_bio() so that the blkg will not be freed until the bio_endio called. Reviewed-by: Shaohua Li <shli@fb.com> Signed-off-by: Jiufei Xue <jiufei.xjf@alibaba-inc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-10blkcg: check pol->cpd_free_fn before free cpdweiping zhang
check pol->cpd_free_fn() instead of pol->cpd_alloc_fn() when free cpd. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: weiping zhang <zhangweiping@didichuxing.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-09block, bfq: fix unbalanced decrements of burst sizePaolo Valente
The commit "block, bfq: decrease burst size when queues in burst exit" introduced the decrement of burst_size on the removal of a bfq_queue from the burst list. Unfortunately, this decrement can happen to be performed even when burst size is already equal to 0, because of unbalanced decrements. A description follows of the cause of these unbalanced decrements, namely a wrong assumption, and of the way how this wrong assumption leads to unbalanced decrements. The wrong assumption is that a bfq_queue can exit only if the process associated with the bfq_queue has exited. This is false, because a bfq_queue, say Q, may exit also as a consequence of a merge with another bfq_queue. In this case, Q exits because the I/O of its associated process has been redirected to another bfq_queue. The decrement unbalance occurs because Q may then be re-created after a split, and added back to the current burst list, *without* incrementing burst_size. burst_size is not incremented because Q is not a new bfq_queue added to the burst list, but a bfq_queue only temporarily removed from the list, and, before the commit "bfq-sq, bfq-mq: decrease burst size when queues in burst exit", burst_size was not decremented when Q was removed. This commit addresses this issue by just checking whether the exiting bfq_queue is a merged bfq_queue, and, in that case, not decrementing burst_size. Unfortunately, this still leaves room for unbalanced decrements, in the following rarer case: on a split, the bfq_queue happens to be inserted into a different burst list than that it was removed from when merged. If this happens, the number of elements in the new burst list becomes higher than burst_size (by one). When the bfq_queue then exits, it is of course not in a merged state any longer, thus burst_size is decremented, which results in an unbalanced decrement. To handle this sporadic, unlucky case in a simple way, this commit also checks that burst_size is larger than 0 before decrementing it. Finally, this commit removes an useless, extra check: the check that the bfq_queue is sync, performed before checking whether the bfq_queue is in the burst list. This extra check is redundant, because only sync bfq_queues can be inserted into the burst list. Fixes: 7cb04004fa37 ("block, bfq: decrease burst size when queues in burst exit") Reported-by: Philip Müller <philm@manjaro.org> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Angelo Ruocco <angeloruocco90@gmail.com> Tested-by: Philip Müller <philm@manjaro.org> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Tested-by: Lee Tibbert <lee.tibbert@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-09block,bfq: Disable writeback throttlingLuca Miccio
Similarly to CFQ, BFQ has its write-throttling heuristics, and it is better not to combine them with further write-throttling heuristics of a different nature. So this commit disables write-back throttling for a device if BFQ is used as I/O scheduler for that device. Signed-off-by: Luca Miccio <lucmiccio@gmail.com> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Tested-by: Lee Tibbert <lee.tibbert@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>