summaryrefslogtreecommitdiff
path: root/block
AgeCommit message (Collapse)Author
2019-02-09block: kill QUEUE_FLAG_FLUSH_NQJens Axboe
We have various helpers for setting/clearing this flag, and also a helper to check if the queue supports queueable flushes or not. But nobody uses them anymore, kill it with fire. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-09Merge tag 'for-linus-20190209' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block fixes from Jens Axboe: - NVMe pull request from Christoph, fixing namespace locking when dealing with the effects log, and a rapid add/remove issue (Keith) - blktrace tweak, ensuring requests with -1 sectors are shown (Jan) - link power management quirk for a Smasung SSD (Hans) - m68k nfblock dynamic major number fix (Chengguang) - series fixing blk-iolatency inflight counter issue (Liu) - ensure that we clear ->private when setting up the aio kiocb (Mike) - __find_get_block_slow() rate limit print (Tetsuo) * tag 'for-linus-20190209' of git://git.kernel.dk/linux-block: blk-mq: remove duplicated definition of blk_mq_freeze_queue Blk-iolatency: warn on negative inflight IO counter blk-iolatency: fix IO hang due to negative inflight counter blktrace: Show requests without sector fs: ratelimit __find_get_block_slow() failure message. m68k: set proper major_num when specifying module param major_num libata: Add NOLPM quirk for SAMSUNG MZ7TE512HMHP-000L1 SSD nvme-pci: fix rapid add remove sequence nvme: lock NS list changes while handling command effects aio: initialize kiocb private in case any filesystems expect it.
2019-02-08block: avoid setting nr_requests to current valueAleksei Zakharov
There's no reason to freeze queue and set nr_requests value if current value is the same. Signed-off-by: Aleksei Zakharov <zakharov.a.g@yandex.ru> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-08blk-mq: remove duplicated definition of blk_mq_freeze_queueLiu Bo
As the prototype has been defined in "include/linux/blk-mq.h", the one in "block/blk-mq.h" can be removed then. Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-08Blk-iolatency: warn on negative inflight IO counterLiu Bo
This is to catch any unexpected negative value of inflight IO counter. Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-08blk-iolatency: fix IO hang due to negative inflight counterLiu Bo
Our test reported the following stack, and vmcore showed that ->inflight counter is -1. [ffffc9003fcc38d0] __schedule at ffffffff8173d95d [ffffc9003fcc3958] schedule at ffffffff8173de26 [ffffc9003fcc3970] io_schedule at ffffffff810bb6b6 [ffffc9003fcc3988] blkcg_iolatency_throttle at ffffffff813911cb [ffffc9003fcc3a20] rq_qos_throttle at ffffffff813847f3 [ffffc9003fcc3a48] blk_mq_make_request at ffffffff8137468a [ffffc9003fcc3b08] generic_make_request at ffffffff81368b49 [ffffc9003fcc3b68] submit_bio at ffffffff81368d7d [ffffc9003fcc3bb8] ext4_io_submit at ffffffffa031be00 [ext4] [ffffc9003fcc3c00] ext4_writepages at ffffffffa03163de [ext4] [ffffc9003fcc3d68] do_writepages at ffffffff811c49ae [ffffc9003fcc3d78] __filemap_fdatawrite_range at ffffffff811b6188 [ffffc9003fcc3e30] filemap_write_and_wait_range at ffffffff811b6301 [ffffc9003fcc3e60] ext4_sync_file at ffffffffa030cee8 [ext4] [ffffc9003fcc3ea8] vfs_fsync_range at ffffffff8128594b [ffffc9003fcc3ee8] do_fsync at ffffffff81285abd [ffffc9003fcc3f18] sys_fsync at ffffffff81285d50 [ffffc9003fcc3f28] do_syscall_64 at ffffffff81003c04 [ffffc9003fcc3f50] entry_SYSCALL_64_after_swapgs at ffffffff81742b8e The ->inflight counter may be negative (-1) if 1) blk-iolatency was disabled when the IO was issued, 2) blk-iolatency was enabled before this IO reached its endio, 3) the ->inflight counter is decreased from 0 to -1 in endio() In fact the hang can be easily reproduced by the below script, H=/sys/fs/cgroup/unified/ P=/sys/fs/cgroup/unified/test echo "+io" > $H/cgroup.subtree_control mkdir -p $P echo $$ > $P/cgroup.procs xfs_io -f -d -c "pwrite 0 4k" /dev/sdg echo "`cat /sys/block/sdg/dev` target=1000000" > $P/io.latency xfs_io -f -d -c "pwrite 0 4k" /dev/sdg This fixes the problem by freezing the queue so that while enabling/disabling iolatency, there is no inflight rq running. Note that quiesce_queue is not needed as this only updating iolatency configuration about which dispatching request_queue doesn't care. Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-08Merge tag 'driver-core-5.0-rc6' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core fixes from Greg KH: "Here are some driver core fixes for 5.0-rc6. Well, not so much "driver core" as "debugfs". There's a lot of outstanding debugfs cleanup patches coming in through different subsystem trees, and in that process the debugfs core was found that it really should return errors when something bad happens, to prevent random files from showing up in the root of debugfs afterward. So debugfs was fixed up to handle this properly, and then two fixes for the relay and blk-mq code was needed as it was making invalid assumptions about debugfs return values. There's also a cacheinfo fix in here that resolves a tiny issue. All of these have been in linux-next for over a week with no reported problems" * tag 'driver-core-5.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: blk-mq: protect debugfs_create_files() from failures relay: check return of create_buf_file() properly debugfs: debugfs_lookup() should return NULL if not found debugfs: return error values, not NULL debugfs: fix debugfs_rename parameter checking cacheinfo: Keep the old value if of_property_read_u32 fails
2019-02-05scsi: block: remove bidi supportChristoph Hellwig
Unused now, and another field in struct request bites the dust. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2019-02-05scsi: block: remove req->specialChristoph Hellwig
No users left. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2019-02-05scsi: bsg-lib: handle bidi requests without block layer helpChristoph Hellwig
We can just stash away the second request in struct bsg_job instead of using the block layer req->next_rq field, allowing for the eventual removal of the latter. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2019-02-05scsi: bsg: refactor bsg_ioctlChristoph Hellwig
Move all actual functionality into helpers, just leaving the dispatch in this function. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Benjamin Block <bblock@linux.ibm.com> Tested-by: Benjamin Block <bblock@linux.ibm.com> Tested-by: Avri Altman <avri.altman@wdc.com> Acked-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2019-02-01blk-mq: save default hctx into ctx->hctxs for not-supported typeJianchao Wang
Currently, we check whether the hctx type is supported every time in hot path. Actually, this is not necessary, we could save the default hctx into ctx->hctxs if the type is not supported when map swqueues and use it directly with ctx->hctxs[type]. We also needn't check whether the poll is enabled or not, because the caller would clear the REQ_HIPRI in that case. Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-01blk-mq: save queue mapping result into ctx directlyJianchao Wang
Currently, the queue mapping result is saved in a two-dimensional array. In the hot path, to get a hctx, we need do following: q->queue_hw_ctx[q->tag_set->map[type].mq_map[cpu]] This isn't very efficient. We could save the queue mapping result into ctx directly with different hctx type, like, ctx->hctxs[type] Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-01-31block, bfq: fix in-service-queue check for queue mergingPaolo Valente
When a new I/O request arrives for a bfq_queue, say Q, bfq checks whether that request is close to (a) the head request of some other queue waiting to be served, or (b) the last request dispatched for the in-service queue (in case Q itself is not the in-service queue) If a queue, say Q2, is found for which the above condition holds, then bfq merges Q and Q2, to hopefully get a more sequential I/O in the resulting merged queue, and thus a possibly higher throughput. Case (b) is checked by comparing the new request for Q with the last request dispatched, assuming that the latter necessarily belonged to the in-service queue. Unfortunately, this assumption is no longer always correct, since commit d0edc2473be9 ("block, bfq: inject other-queue I/O into seeky idle queues on NCQ flash"). When the assumption does not hold, queues that must not be merged may be merged, causing unexpected loss of control on per-queue service guarantees. This commit solves this problem by adding an extra field, which stores the actual last request dispatched for the in-service queue, and by using this new field to correctly check case (b). Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-01-31block, bfq: do not overcharge writes in asymmetric scenariosPaolo Valente
Writes tend to starve reads. bfq counters this problem by overcharging writes with an inflated service w.r.t. the actual service (number of sector written) they receive. Yet his overcharging is useless, and actually causes unfairness in the opposite direction, when bfq happens to be enforcing strong I/O control. bfq does this enforcing when the scenario is asymmetric, i.e., when some bfq_queue or group of bfq_queues is to be granted a different bandwidth than some other bfq_queue or group of bfq_queues. So, in such a scenario, this commit disables write overcharging. Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-01-31block, bfq: port commit "cfq-iosched: improve hw_tag detection"Paolo Valente
The original commit is commit 1a1238a7dd48 ("cfq-iosched: improve hw_tag detection") and has the following commit message: If active queue hasn't enough requests and idle window opens, cfq will not dispatch sufficient requests to hardware. In such situation, current code will zero hw_tag. But this is because cfq doesn't dispatch enough requests instead of hardware queue doesn't work. Don't zero hw_tag in such case. Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-01-31block, bfq: reduce threshold for detecting command queueingPaolo Valente
bfq simple heuristic from cfq for detecting whether the drive performs command queueing: check whether the average number of in-flight requests is above a given threshold. Unfortunately this heuristic does fail to detect queueing (on drives with queueing) if processes doing I/O are few and issue I/O with a low depth. To reduce false negatives, this commit lowers the threshold. Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-01-31block, bfq: fix queue removal from weights treePaolo Valente
bfq maintains an ordered list, through a red-black tree, of unique weights of active bfq_queues. This list is used to detect whether there are active queues with differentiated weights. The weight of a queue is removed from the list when both the following two conditions become true: (1) the bfq_queue is flagged as inactive (2) the has no in-flight request any longer; Unfortunately, in the rare cases where condition (2) becomes true before condition (1), the removal fails, because the function to remove the weight of the queue (bfq_weights_tree_remove) is rightly invoked in the path that deactivates the bfq_queue, but mistakenly invoked *before* the function that actually performs the deactivation (bfq_deactivate_bfqq). This commits moves the invocation of bfq_weights_tree_remove for condition (1) to after bfq_deactivate_bfqq. As a consequence of this move, it is necessary to add a further reference to the queue when the weight of a queue is added, because the queue might otherwise be freed before bfq_weights_tree_remove is invoked. This commit adds this reference and makes all related modifications. Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-01-31block, bfq: fix sequential rq detection in rate estimationPaolo Valente
In bfq_update_peak_rate, to check whether an I/O request rq is sequential, only the seek distance of rq w.r.t. the last request dispatched is controlled. This is not sufficient for non-rotational storage, where the size of rq is at least as relevant. This commit adds the missing control. Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-01-31block, bfq: unconditionally plug I/O in asymmetric scenariosPaolo Valente
bfq detects the creation of multiple bfq_queues shortly after each other, namely a burst of queue creations in the terminology used in the code. If the burst is large, then no queue in the burst is granted - either I/O-dispatch plugging when the queue remains temporarily idle while in service; - or weight raising, because it causes even longer plugging. In fact, such a plugging tends to lower throughput, while these bursts are typically due to applications or services that spawn multiple processes, to reach a common goal as soon as possible. Examples are a "git grep" or the booting of a system. Unfortunately, disabling plugging may cause a loss of service guarantees in asymmetric scenarios, i.e., if queue weights are differentiated or if more than one group is active. This commit addresses this issue by no longer disabling I/O-dispatch plugging for queues in large bursts. Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-01-31block, bfq: do not plug I/O of in-service queue when harmfulPaolo Valente
If the in-service bfq_queue is sync and remains temporarily idle, then I/O dispatching (from other queues) may be plugged. It may be dome for two reasons: either to boost throughput, or to preserve the bandwidth share of the in-service queue. In the first case, if the I/O of the in-service queue, when it finally arrives, consists only of one small I/O request, then it makes sense to plug even the I/O of the in-service queue. In fact, serving such a small request immediately is likely to lower throughput instead of boosting it, whereas waiting a little bit is likely to let that request grow, thanks to request merging, and become more profitable in terms of throughput (this is likely to happen exactly because the I/O of the queue has been detected to boost throughput). On the opposite end, if I/O dispatching is being plugged only to preserve the bandwidth of the in-service queue, then it would be better not to plug also the I/O of the in-service queue, because such a plugging is likely to cause only loss of bandwidth for the queue. Unfortunately, no distinction is made between the two cases, and the I/O of the in-service queue is always plugged in case just a small I/O request arrives. This commit draws this missing distinction and does not perform harmful plugging. Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-01-31block, bfq: split function bfq_better_to_idlePaolo Valente
This is a preparatory commit for commits that need to check only one of the two main reasons for idling. This change should also improve the quality of the code a little bit, by splitting a function that contains very long, non-trivial and little related comments. Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-01-31block, bfq: consider also ioprio classes in symmetry detectionPaolo Valente
In asymmetric scenarios, i.e., when some bfq_queue or bfq_group needs to be guaranteed a different bandwidth than other bfq_queues or bfq_groups, these service guaranteed can be provided only by plugging I/O dispatch, completely or partially, when the queue in service remains temporarily empty. A case where asymmetry is particularly strong is when some active bfq_queues belong to a higher-priority class than some other active bfq_queues. Unfortunately, this important case is not considered at all in the code for detecting asymmetric scenarios. This commit adds the missing logic. Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-01-31block, bfq: remove case of redirected bic from insert_requestPaolo Valente
Before commit 18e5a57d7987 ("block, bfq: postpone rq preparation to insert or merge"), the destination queue for a request was chosen by a different hook than the one that then inserted the request. So, between the execution of the two hooks, the bic of the process generating the request could happen to be redirected to a different bfq_queue. As a consequence, the destination bfq_queue stored in the request could be wrong. Such an event does not need to ba handled any longer. Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-01-31block, bfq: make sure queue budgets are not below service receivedPaolo Valente
With some unlucky sequences of events, the function bfq_updated_next_req updates the current budget of a bfq_queue to a lower value than the service received by the queue using such a budget. Unfortunately, if this happens, then the return value of the function bfq_bfqq_budget_left becomes inconsistent. This commit solves this problem by lower-bounding the budget computed in bfq_updated_next_req to the service currently charged to the queue. Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-01-31block, bfq: avoid selecting a queue w/o budgetPaolo Valente
To boost throughput on devices with internal queueing and in scenarios where device idling is not strictly needed, bfq immediately starts serving a new bfq_queue if the in-service bfq_queue remains without pending I/O, even if new I/O may arrive soon for the latter queue. Then, if such I/O actually arrives soon, bfq preempts the new in-service bfq_queue so as to give the previous queue a chance to go on being served (in case the previous queue should actually be the one to be served, according to its timestamps). However, the in-service bfq_queue, say Q, may also be without further budget when it remains also pending I/O. Since bfq changes budgets dynamically to fit the needs of bfq_queues, this happens more often than one may expect. If this happens, then there is no point in trying to go on serving Q when new I/O arrives for it soon: Q would be expired immediately after being selected for service. This would only cause useless overhead. This commit avoids such a useless selection. Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-01-31block, bfq: do not consider interactive queues in srt filteringPaolo Valente
The speed at which a bfq_queue receives I/O is one of the parameters by which bfq decides whether the queue is soft real-time (i.e., whether the queue contains the I/O of a soft real-time application). In particular, when a bfq_queue remains without outstanding I/O requests, bfq computes the minimum time instant, named soft_rt_next_start, at which the next request of the queue may arrive for the queue to be deemed as soft real time. Unfortunately this filtering may cause problems with a queue in interactive weight raising. In fact, such a queue may be conveying the I/O needed to load a soft real-time application. The latter will actually exhibit a soft real-time I/O pattern after it finally starts doing its job. But, if soft_rt_next_start is updated for an interactive bfq_queue, and the queue has received a lot of service before remaining with no outstanding request (likely to happen on a fast device), then soft_rt_next_start is assigned such a high value that, for a very long time, the queue is prevented from being possibly considered as soft real time. This commit removes the updating of soft_rt_next_start for bfq_queues in interactive weight raising. Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-01-31blk-mq: protect debugfs_create_files() from failuresGreg Kroah-Hartman
If debugfs were to return a non-NULL error for a debugfs call, using that pointer later in debugfs_create_files() would crash. Fix that by properly checking the pointer before referencing it. Reported-by: Michal Hocko <mhocko@kernel.org> Reported-and-tested-by: syzbot+b382ba6a802a3d242790@syzkaller.appspotmail.com Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-01-30blk-mq: fix a hung issue when fsyncJianchao Wang
Florian reported a io hung issue when fsync(). It should be triggered by following race condition. data + post flush a flush blk_flush_complete_seq case REQ_FSEQ_DATA blk_flush_queue_rq issued to driver blk_mq_dispatch_rq_list try to issue a flush req failed due to NON-NCQ command .queue_rq return BLK_STS_DEV_RESOURCE request completion req->end_io // doesn't check RESTART mq_flush_data_end_io case REQ_FSEQ_POSTFLUSH blk_kick_flush do nothing because previous flush has not been completed blk_mq_run_hw_queue insert rq to hctx->dispatch due to RESTART is still set, do nothing To fix this, replace the blk_mq_run_hw_queue in mq_flush_data_end_io with blk_mq_sched_restart to check and clear the RESTART flag. Fixes: bd166ef1 (blk-mq-sched: add framework for MQ capable IO schedulers) Reported-by: Florian Stecker <m19@florianstecker.de> Tested-by: Florian Stecker <m19@florianstecker.de> Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-01-30block: pass no-op callback to INIT_WORK().Tetsuo Handa
syzbot is hitting flush_work() warning caused by commit 4d43d395fed12463 ("workqueue: Try to catch flush_work() without INIT_WORK().") [1]. Although that commit did not expect INIT_WORK(NULL) case, calling flush_work() without setting a valid callback should be avoided anyway. Fix this problem by setting a no-op callback instead of NULL. [1] https://syzkaller.appspot.com/bug?id=e390366bc48bc82a7c668326e0663be3b91cbd29 Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reported-and-tested-by: syzbot <syzbot+ba2a929dcf8e704c180e@syzkaller.appspotmail.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-01-27Revert "block: cover another queue enter recursion via BIO_QUEUE_ENTERED"Jens Axboe
We can't touch a bio after ->make_request_fn(), for all we know it could already have been completed by the time this function returns. This reverts commit 698cef173983b086977e633e46476e0f925ca01e. Reported-by: syzbot+4df6ca820108fd248943@syzkaller.appspotmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-01-24blk-wbt: Declare local functions staticBart Van Assche
This patch avoids that sparse reports the following warnings: CHECK block/blk-wbt.c block/blk-wbt.c:600:6: warning: symbol 'wbt_issue' was not declared. Should it be static? block/blk-wbt.c:620:6: warning: symbol 'wbt_requeue' was not declared. Should it be static? CC block/blk-wbt.o block/blk-wbt.c:600:6: warning: no previous prototype for wbt_issue [-Wmissing-prototypes] void wbt_issue(struct rq_qos *rqos, struct request *rq) ^~~~~~~~~ block/blk-wbt.c:620:6: warning: no previous prototype for wbt_requeue [-Wmissing-prototypes] void wbt_requeue(struct rq_qos *rqos, struct request *rq) ^~~~~~~~~~~ Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-01-24blk-mq: fix the cmd_flag_name arrayJianchao Wang
Swap REQ_NOWAIT and REQ_NOUNMAP and add REQ_HIPRI. Acked-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-01-22block: cover another queue enter recursion via BIO_QUEUE_ENTEREDMing Lei
Except for blk_queue_split(), bio_split() is used for splitting bio too, then the remained bio is often resubmit to queue via generic_make_request(). So the same queue enter recursion exits in this case too. Unfortunatley commit cd4a4ae4683dc2 doesn't help this case. This patch covers the above case by setting BIO_QUEUE_ENTERED before calling q->make_request_fn. In theory the per-bio flag is used to simulate one stack variable, it is just fine to clear it after q->make_request_fn is returned. Especially the same bio can't be submitted from another context. Fixes: cd4a4ae4683dc2 ("block: don't use blocking queue entered for recursive bio submits") Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: NeilBrown <neilb@suse.com> Reviewed-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-01-17block: Cleanup license noticeThomas Gleixner
Remove the imprecise and sloppy: "This files is licensed under the GPL." license notice in the top level comment. 1) The file already contains a SPDX license identifier which clearly states that the license of the file is GPL V2 only 2) The notice resolves to GPL v1 or later for scanners which is just contrary to the intent of SPDX identifiers to provide clear and non ambiguous license information. Aside of that the value add of this notice is below zero, Cc: Damien Le Moal <damien.lemoal@wdc.com> Cc: Matias Bjorling <mb@lightnvm.io> Cc: Christoph Hellwig <hch@lst.de> Cc: Jens Axboe <axboe@kernel.dk> Cc: linux-block@vger.kernel.org Fixes: 6a5ac9846508 ("block: Make struct request_queue smaller for CONFIG_BLK_DEV_ZONED=n") Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-01-16block: don't lose track of REQ_INTEGRITY flagMing Lei
We need to pass bio->bi_opf after bio intergrity preparing, otherwise the flag of REQ_INTEGRITY may not be set on the allocated request, then breaks block integrity. Fixes: f9afca4d367b ("blk-mq: pass in request/bio flags to queue mapping") Cc: Hannes Reinecke <hare@suse.com> Cc: Keith Busch <keith.busch@intel.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-01-14block, bfq: fix comments on __bfq_deactivate_entityPaolo Valente
Comments on function __bfq_deactivate_entity contains two imprecise or wrong statements: 1) The function performs the deactivation of the entity. 2) The function must be invoked only if the entity is on a service tree. This commits replaces both statements with the correct ones: 1) The functions updates sched_data and service trees for the entity, so as to represent entity as inactive (which is only part of the steps needed for the deactivation of the entity). 2) The function must be invoked on every entity being deactivated. Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-01-09block: fix kerneldoc comment for blk_attempt_plug_merge()Jonathan Corbet
Commit 5f0ed774ed29 ("block: sum requests in the plug structure") removed the request_count parameter from block_attempt_plug_merge(), but did not remove the associated kerneldoc comment, introducing this warning to the docs build: ./block/blk-core.c:685: warning: Excess function parameter 'request_count' description in 'blk_attempt_plug_merge' Remove the obsolete description and make things a little quieter. Signed-off-by: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-01-08block: clarify documentation for blk_{start|finish}_plugJeff Moyer
There was some confusion about what these functions did. Make it clear that this is a hint for upper layers to pass to the block layer, and that it does not guarantee that I/O will not be submitted between a start and finish plug. Reported-by: "Darrick J. Wong" <darrick.wong@oracle.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-01-02Merge tag 'for-4.21/block-20190102' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull more block updates from Jens Axboe: - Dead code removal for loop/sunvdc (Chengguang) - Mark BIDI support for bsg as deprecated, logging a single dmesg warning if anyone is actually using it (Christoph) - blkcg cleanup, killing a dead function and making the tryget_closest variant easier to read (Dennis) - Floppy fixes, one fixing a regression in swim3 (Finn) - lightnvm use-after-free fix (Gustavo) - gdrom leak fix (Wenwen) - a set of drbd updates (Lars, Luc, Nathan, Roland) * tag 'for-4.21/block-20190102' of git://git.kernel.dk/linux-block: (28 commits) block/swim3: Fix regression on PowerBook G3 block/swim3: Fix -EBUSY error when re-opening device after unmount block/swim3: Remove dead return statement block/amiflop: Don't log error message on invalid ioctl gdrom: fix a memory leak bug lightnvm: pblk: fix use-after-free bug block: sunvdc: remove redundant code block: loop: remove redundant code bsg: deprecate BIDI support in bsg blkcg: remove unused __blkg_release_rcu() blkcg: clean up blkg_tryget_closest() drbd: Change drbd_request_detach_interruptible's return type to int drbd: Avoid Clang warning about pointless switch statment drbd: introduce P_ZEROES (REQ_OP_WRITE_ZEROES on the "wire") drbd: skip spurious timeout (ping-timeo) when failing promote drbd: don't retry connection if peers do not agree on "authentication" settings drbd: fix print_st_err()'s prototype to match the definition drbd: avoid spurious self-outdating with concurrent disconnect / down drbd: do not block when adjusting "disk-options" while IO is frozen drbd: fix comment typos ...
2018-12-29Merge tag 'kconfig-v4.21' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild Pull Kconfig updates from Masahiro Yamada: - support -y option for merge_config.sh to avoid downgrading =y to =m - remove S_OTHER symbol type, and touch include/config/*.h files correctly - fix file name and line number in lexer warnings - fix memory leak when EOF is encountered in quotation - resolve all shift/reduce conflicts of the parser - warn no new line at end of file - make 'source' statement more strict to take only string literal - rewrite the lexer and remove the keyword lookup table - convert to SPDX License Identifier - compile C files independently instead of including them from zconf.y - fix various warnings of gconfig - misc cleanups * tag 'kconfig-v4.21' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (39 commits) kconfig: surround dbg_sym_flags with #ifdef DEBUG to fix gconf warning kconfig: split images.c out of qconf.cc/gconf.c to fix gconf warnings kconfig: add static qualifiers to fix gconf warnings kconfig: split the lexer out of zconf.y kconfig: split some C files out of zconf.y kconfig: convert to SPDX License Identifier kconfig: remove keyword lookup table entirely kconfig: update current_pos in the second lexer kconfig: switch to ASSIGN_VAL state in the second lexer kconfig: stop associating kconf_id with yylval kconfig: refactor end token rules kconfig: stop supporting '.' and '/' in unquoted words treewide: surround Kconfig file paths with double quotes microblaze: surround string default in Kconfig with double quotes kconfig: use T_WORD instead of T_VARIABLE for variables kconfig: use specific tokens instead of T_ASSIGN for assignments kconfig: refactor scanning and parsing "option" properties kconfig: use distinct tokens for type and default properties kconfig: remove redundant token defines kconfig: rename depends_list to comment_option_list ...
2018-12-28Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsiLinus Torvalds
Pull SCSI updates from James Bottomley: "This is mostly update of the usual drivers: smarpqi, lpfc, qedi, megaraid_sas, libsas, zfcp, mpt3sas, hisi_sas. Additionally, we have a pile of annotation, unused variable and minor updates. The big API change is the updates for Christoph's DMA rework which include removing the DISABLE_CLUSTERING flag. And finally there are a couple of target tree updates" * tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (259 commits) scsi: isci: request: mark expected switch fall-through scsi: isci: remote_node_context: mark expected switch fall-throughs scsi: isci: remote_device: Mark expected switch fall-throughs scsi: isci: phy: Mark expected switch fall-through scsi: iscsi: Capture iscsi debug messages using tracepoints scsi: myrb: Mark expected switch fall-throughs scsi: megaraid: fix out-of-bound array accesses scsi: mpt3sas: mpt3sas_scsih: Mark expected switch fall-through scsi: fcoe: remove set but not used variable 'port' scsi: smartpqi: call pqi_free_interrupts() in pqi_shutdown() scsi: smartpqi: fix build warnings scsi: smartpqi: update driver version scsi: smartpqi: add ofa support scsi: smartpqi: increase fw status register read timeout scsi: smartpqi: bump driver version scsi: smartpqi: add smp_utils support scsi: smartpqi: correct lun reset issues scsi: smartpqi: correct volume status scsi: smartpqi: do not offline disks for transient did no connect conditions scsi: smartpqi: allow for larger raid maps ...
2018-12-28Merge tag 'for-4.21/block-20181221' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block updates from Jens Axboe: "This is the main pull request for block/storage for 4.21. Larger than usual, it was a busy round with lots of goodies queued up. Most notable is the removal of the old IO stack, which has been a long time coming. No new features for a while, everything coming in this week has all been fixes for things that were previously merged. This contains: - Use atomic counters instead of semaphores for mtip32xx (Arnd) - Cleanup of the mtip32xx request setup (Christoph) - Fix for circular locking dependency in loop (Jan, Tetsuo) - bcache (Coly, Guoju, Shenghui) * Optimizations for writeback caching * Various fixes and improvements - nvme (Chaitanya, Christoph, Sagi, Jay, me, Keith) * host and target support for NVMe over TCP * Error log page support * Support for separate read/write/poll queues * Much improved polling * discard OOM fallback * Tracepoint improvements - lightnvm (Hans, Hua, Igor, Matias, Javier) * Igor added packed metadata to pblk. Now drives without metadata per LBA can be used as well. * Fix from Geert on uninitialized value on chunk metadata reads. * Fixes from Hans and Javier to pblk recovery and write path. * Fix from Hua Su to fix a race condition in the pblk recovery code. * Scan optimization added to pblk recovery from Zhoujie. * Small geometry cleanup from me. - Conversion of the last few drivers that used the legacy path to blk-mq (me) - Removal of legacy IO path in SCSI (me, Christoph) - Removal of legacy IO stack and schedulers (me) - Support for much better polling, now without interrupts at all. blk-mq adds support for multiple queue maps, which enables us to have a map per type. This in turn enables nvme to have separate completion queues for polling, which can then be interrupt-less. Also means we're ready for async polled IO, which is hopefully coming in the next release. - Killing of (now) unused block exports (Christoph) - Unification of the blk-rq-qos and blk-wbt wait handling (Josef) - Support for zoned testing with null_blk (Masato) - sx8 conversion to per-host tag sets (Christoph) - IO priority improvements (Damien) - mq-deadline zoned fix (Damien) - Ref count blkcg series (Dennis) - Lots of blk-mq improvements and speedups (me) - sbitmap scalability improvements (me) - Make core inflight IO accounting per-cpu (Mikulas) - Export timeout setting in sysfs (Weiping) - Cleanup the direct issue path (Jianchao) - Export blk-wbt internals in block debugfs for easier debugging (Ming) - Lots of other fixes and improvements" * tag 'for-4.21/block-20181221' of git://git.kernel.dk/linux-block: (364 commits) kyber: use sbitmap add_wait_queue/list_del wait helpers sbitmap: add helpers for add/del wait queue handling block: save irq state in blkg_lookup_create() dm: don't reuse bio for flushes nvme-pci: trace SQ status on completions nvme-rdma: implement polling queue map nvme-fabrics: allow user to pass in nr_poll_queues nvme-fabrics: allow nvmf_connect_io_queue to poll nvme-core: optionally poll sync commands block: make request_to_qc_t public nvme-tcp: fix spelling mistake "attepmpt" -> "attempt" nvme-tcp: fix endianess annotations nvmet-tcp: fix endianess annotations nvme-pci: refactor nvme_poll_irqdisable to make sparse happy nvme-pci: only set nr_maps to 2 if poll queues are supported nvmet: use a macro for default error location nvmet: fix comparison of a u16 with -1 blk-mq: enable IO poll if .nr_queues of type poll > 0 blk-mq: change blk_mq_queue_busy() to blk_mq_queue_inflight() blk-mq: skip zero-queue maps in blk_mq_map_swqueue ...
2018-12-21bsg: deprecate BIDI support in bsgChristoph Hellwig
Besides the OSD command set that never got traction, the only SCSI command using bidirectional buffers is XDWRITEREAD in the 10 and 32 byte variants, which is extremely esoteric and has been removed from the spec again as of SBC4r15. It probably doesn't make sense to keep the support code around just for that, so start deprecating the support. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-21blkcg: remove unused __blkg_release_rcu()Dennis Zhou
An earlier commit 7fcf2b033b84 ("blkcg: change blkg reference counting to use percpu_ref") moved around the release call from blkg_put() to be a part of the percpu_ref cleanup. Remove the additional unused code which should have been removed earlier. Signed-off-by: Dennis Zhou <dennis@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-21blkcg: clean up blkg_tryget_closest()Dennis Zhou
The implementation of blkg_tryget_closest() wasn't super obvious and became a point of suspicion when debugging [1]. So let's clean it up so it's obviously not the problem. Also add missing RCU read locking to bio_clone_blkg_association(), which got exposed by adding the RCU read lock held check in blkg_tryget_closest(). [1] https://lore.kernel.org/linux-block/a7e97e4b-0dd8-3a54-23b7-a0f27b17fde8@kernel.dk/ Signed-off-by: Dennis Zhou <dennis@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-22treewide: surround Kconfig file paths with double quotesMasahiro Yamada
The Kconfig lexer supports special characters such as '.' and '/' in the parameter context. In my understanding, the reason is just to support bare file paths in the source statement. I do not see a good reason to complicate Kconfig for the room of ambiguity. The majority of code already surrounds file paths with double quotes, and it makes sense since file paths are constant string literals. Make it treewide consistent now. Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com> Acked-by: Wolfram Sang <wsa@the-dreams.de> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Acked-by: Ingo Molnar <mingo@kernel.org>
2018-12-20kyber: use sbitmap add_wait_queue/list_del wait helpersJens Axboe
sbq_wake_ptr() checks sbq->ws_active to know if it needs to loop the wait indexes or not. This requires the use of the sbitmap waitqueue wrappers, but kyber doesn't use those for its domain token waitqueue handling. Convert kyber to use the helpers. This fixes a hang with waiting for domain tokens. Fixes: 5d2ee7122c73 ("sbitmap: optimize wakeup check") Tested-by: Ming Lei <ming.lei@redhat.com> Reported-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-19block: save irq state in blkg_lookup_create()Ming Lei
blkg_lookup_create() may be called from pool_map() in which irq state is saved, so we have to do that in blkg_lookup_create(). Otherwise, the following lockdep warning can be triggered: [ 104.258537] ================================ [ 104.259129] WARNING: inconsistent lock state [ 104.259725] 4.20.0-rc6+ #545 Not tainted [ 104.260268] -------------------------------- [ 104.260865] inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage. [ 104.261727] swapper/49/0 [HC0[0]:SC1[1]:HE0:SE0] takes: [ 104.262444] 00000000db365b5d (&(&pool->lock)->rlock#3){+.?.}, at: thin_endio+0xcf/0x2a3 [dm_thin_pool] [ 104.263747] {SOFTIRQ-ON-W} state was registered at: [ 104.264417] _raw_spin_unlock_irq+0x29/0x4c [ 104.265014] blkg_lookup_create+0xdc/0xe6 [ 104.265609] bio_associate_blkg_from_css+0xd3/0x13f [ 104.266312] bio_associate_blkg+0x15a/0x1bb [ 104.266913] pool_map+0xe8/0x103 [dm_thin_pool] [ 104.267572] __map_bio+0x98/0x29c [dm_mod] [ 104.268162] __split_and_process_non_flush+0x29e/0x306 [dm_mod] [ 104.269003] __split_and_process_bio+0x16a/0x25b [dm_mod] [ 104.269971] __dm_make_request.isra.14+0xdc/0x124 [dm_mod] [ 104.270973] generic_make_request+0x3f5/0x68b [ 104.271676] process_prepared_mapping+0x166/0x1ef [dm_thin_pool] [ 104.272531] schedule_zero+0x239/0x273 [dm_thin_pool] [ 104.273245] process_cell+0x60c/0x6f1 [dm_thin_pool] [ 104.273967] do_worker+0x60c/0xca8 [dm_thin_pool] [ 104.274635] process_one_work+0x4eb/0x834 [ 104.275203] worker_thread+0x318/0x484 [ 104.275740] kthread+0x1d1/0x1e1 [ 104.276203] ret_from_fork+0x3a/0x50 [ 104.276714] irq event stamp: 170003 [ 104.277201] hardirqs last enabled at (170002): [<ffffffff81bcc33e>] _raw_spin_unlock_irqrestore+0x44/0x6b [ 104.278535] hardirqs last disabled at (170003): [<ffffffff81bcc1ad>] _raw_spin_lock_irqsave+0x20/0x55 [ 104.280273] softirqs last enabled at (169978): [<ffffffff810d13d4>] irq_enter+0x4c/0x73 [ 104.281617] softirqs last disabled at (169979): [<ffffffff810d1479>] irq_exit+0x7e/0x11d [ 104.282744] [ 104.282744] other info that might help us debug this: [ 104.283640] Possible unsafe locking scenario: [ 104.283640] [ 104.284452] CPU0 [ 104.284803] ---- [ 104.285150] lock(&(&pool->lock)->rlock#3); [ 104.285762] <Interrupt> [ 104.286130] lock(&(&pool->lock)->rlock#3); [ 104.286750] [ 104.286750] *** DEADLOCK *** [ 104.286750] [ 104.287564] no locks held by swapper/49/0. [ 104.288129] [ 104.288129] stack backtrace: [ 104.288738] CPU: 49 PID: 0 Comm: swapper/49 Not tainted 4.20.0-rc6+ #545 [ 104.289700] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-2.fc27 04/01/2014 [ 104.290858] Call Trace: [ 104.291204] <IRQ> [ 104.291502] dump_stack+0x9a/0xe6 [ 104.291968] mark_lock+0x56c/0x7a6 [ 104.292442] ? check_usage_backwards+0x209/0x209 [ 104.293086] __lock_acquire+0x400/0x15bf [ 104.293662] ? check_chain_key+0x150/0x1aa [ 104.294236] lock_acquire+0x1a6/0x1e3 [ 104.294768] ? thin_endio+0xcf/0x2a3 [dm_thin_pool] [ 104.295444] ? _raw_spin_unlock_irqrestore+0x44/0x6b [ 104.296143] ? process_prepared_discard_fail+0x36/0x36 [dm_thin_pool] [ 104.297031] _raw_spin_lock_irqsave+0x46/0x55 [ 104.297659] ? thin_endio+0xcf/0x2a3 [dm_thin_pool] [ 104.298335] thin_endio+0xcf/0x2a3 [dm_thin_pool] [ 104.298997] ? process_prepared_discard_fail+0x36/0x36 [dm_thin_pool] [ 104.299886] ? check_flags+0x20a/0x20a [ 104.300408] ? lock_acquire+0x1a6/0x1e3 [ 104.300954] ? process_prepared_discard_fail+0x36/0x36 [dm_thin_pool] [ 104.301865] clone_endio+0x1bb/0x22d [dm_mod] [ 104.302491] ? disable_write_zeroes+0x20/0x20 [dm_mod] [ 104.303200] ? bio_disassociate_blkg+0xc6/0x15f [ 104.303836] ? bio_endio+0x2b2/0x2da [ 104.304349] clone_endio+0x1f3/0x22d [dm_mod] [ 104.304978] ? disable_write_zeroes+0x20/0x20 [dm_mod] [ 104.305709] ? bio_disassociate_blkg+0xc6/0x15f [ 104.306333] ? bio_endio+0x2b2/0x2da [ 104.306853] clone_endio+0x1f3/0x22d [dm_mod] [ 104.307476] ? disable_write_zeroes+0x20/0x20 [dm_mod] [ 104.308185] ? bio_disassociate_blkg+0xc6/0x15f [ 104.308817] ? bio_endio+0x2b2/0x2da [ 104.309319] blk_update_request+0x2de/0x4cc [ 104.309927] blk_mq_end_request+0x2a/0x183 [ 104.310498] blk_done_softirq+0x16a/0x1a6 [ 104.311051] ? blk_softirq_cpu_dead+0xe2/0xe2 [ 104.311653] ? __lock_is_held+0x2a/0x87 [ 104.312186] __do_softirq+0x250/0x4e8 [ 104.312705] irq_exit+0x7e/0x11d [ 104.313157] call_function_single_interrupt+0xf/0x20 [ 104.313860] </IRQ> [ 104.314163] RIP: 0010:native_safe_halt+0x2/0x3 [ 104.314792] Code: 63 02 df f0 83 44 24 fc 00 48 89 df e8 cc 3f 7a ff 48 8b 03 a8 08 74 0b 65 81 25 9d 31 45 7e ff ff ff 7f 5b 5d 41 5c c3 fb f4 <c3> f4 c3 0f 1f 44 00 00 41 56 41 55 41 54 55 53 e8 a2 0d 5c ff e8 [ 104.317339] RSP: 0018:ffff888106c9fdc0 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff04 [ 104.318390] RAX: 1ffff11020d92100 RBX: 0000000000000000 RCX: ffffffff81159ac7 [ 104.319366] RDX: 1ffffffff05d5e69 RSI: 0000000000000007 RDI: ffff888106c90d1c [ 104.320339] RBP: 0000000000000000 R08: dffffc0000000000 R09: 0000000000000001 [ 104.321313] R10: ffffed1025d57ba0 R11: ffffed1025d57b9f R12: 1ffff11020d93fbf [ 104.322328] R13: 0000000000000031 R14: ffff888106c90040 R15: 0000000000000000 [ 104.323307] ? lockdep_hardirqs_on+0x26b/0x278 [ 104.323927] default_idle+0xd9/0x1a8 [ 104.324427] do_idle+0x162/0x2b2 [ 104.324891] ? arch_cpu_idle_exit+0x28/0x28 [ 104.325467] ? mark_held_locks+0x28/0x7f [ 104.326031] ? _raw_spin_unlock_irqrestore+0x44/0x6b [ 104.326719] cpu_startup_entry+0x1d/0x1f [ 104.327261] start_secondary+0x2cb/0x308 [ 104.327806] ? set_cpu_sibling_map+0x8a3/0x8a3 [ 104.328421] secondary_startup_64+0xa4/0xb0 Fixes: b978962ad4f7f9 ("blkcg: update blkg_lookup_create() to do locking") Cc: Mike Snitzer <snitzer@redhat.com> Cc: Dennis Zhou <dennis@kernel.org> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-18scsi: block: remove the cluster flagChristoph Hellwig
Now that the the SCSI layer replaced the use of the cluster flag with segment size limits and the DMA boundary we can remove the cluster flag from the block layer. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>