summaryrefslogtreecommitdiff
path: root/block
AgeCommit message (Collapse)Author
2015-06-02writeback: make backing_dev_info host cgroup-specific bdi_writebacksTejun Heo
For the planned cgroup writeback support, on each bdi (backing_dev_info), each memcg will be served by a separate wb (bdi_writeback). This patch updates bdi so that a bdi can host multiple wbs (bdi_writebacks). On the default hierarchy, blkcg implicitly enables memcg. This allows using memcg's page ownership for attributing writeback IOs, and every memcg - blkcg combination can be served by its own wb by assigning a dedicated wb to each memcg. This means that there may be multiple wb's of a bdi mapped to the same blkcg. As congested state is per blkcg - bdi combination, those wb's should share the same congested state. This is achieved by tracking congested state via bdi_writeback_congested structs which are keyed by blkcg. bdi->wb remains unchanged and will keep serving the root cgroup. cgwb's (cgroup wb's) for non-root cgroups are created on-demand or looked up while dirtying an inode according to the memcg of the page being dirtied or current task. Each cgwb is indexed on bdi->cgwb_tree by its memcg id. Once an inode is associated with its wb, it can be retrieved using inode_to_wb(). Currently, none of the filesystems has FS_CGROUP_WRITEBACK and all pages will keep being associated with bdi->wb. v3: inode_attach_wb() in account_page_dirtied() moved inside mapping_cap_account_dirty() block where it's known to be !NULL. Also, an unnecessary NULL check before kfree() removed. Both detected by the kbuild bot. v2: Updated so that wb association is per inode and wb is per memcg rather than blkcg. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: kbuild test robot <fengguang.wu@intel.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-06-02writeback: add {CONFIG|BDI_CAP|FS}_CGROUP_WRITEBACKTejun Heo
cgroup writeback requires support from both bdi and filesystem sides. Add BDI_CAP_CGROUP_WRITEBACK and FS_CGROUP_WRITEBACK to indicate support and enable BDI_CAP_CGROUP_WRITEBACK on block based bdi's by default. Also, define CONFIG_CGROUP_WRITEBACK which is enabled if both MEMCG and BLK_CGROUP are enabled. inode_cgwb_enabled() which determines whether a given inode's both bdi and fs support cgroup writeback is added. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-06-02writeback: separate out include/linux/backing-dev-defs.hTejun Heo
With the planned cgroup writeback support, backing-dev related declarations will be more widely used across block and cgroup; unfortunately, including backing-dev.h from include/linux/blkdev.h makes cyclic include dependency quite likely. This patch separates out backing-dev-defs.h which only has the essential definitions and updates blkdev.h to include it. c files which need access to more backing-dev details now include backing-dev.h directly. This takes backing-dev.h off the common include dependency chain making it a lot easier to use it across block and cgroup. v2: fs/fat build failure fixed. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-06-02writeback: move backing_dev_info->state into bdi_writebackTejun Heo
Currently, a bdi (backing_dev_info) embeds single wb (bdi_writeback) and the role of the separation is unclear. For cgroup support for writeback IOs, a bdi will be updated to host multiple wb's where each wb serves writeback IOs of a different cgroup on the bdi. To achieve that, a wb should carry all states necessary for servicing writeback IOs for a cgroup independently. This patch moves bdi->state into wb. * enum bdi_state is renamed to wb_state and the prefix of all enums is changed from BDI_ to WB_. * Explicit zeroing of bdi->state is removed without adding zeoring of wb->state as the whole data structure is zeroed on init anyway. * As there's still only one bdi_writeback per backing_dev_info, all uses of bdi->state are mechanically replaced with bdi->wb.state introducing no behavior changes. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Jens Axboe <axboe@kernel.dk> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: drbd-dev@lists.linbit.com Cc: Neil Brown <neilb@suse.de> Cc: Alasdair Kergon <agk@redhat.com> Cc: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-06-02blkcg: implement bio_associate_blkcg()Tejun Heo
Currently, a bio can only be associated with the io_context and blkcg of %current using bio_associate_current(). This is too restrictive for cgroup writeback support. Implement bio_associate_blkcg() which associates a bio with the specified blkcg. bio_associate_blkcg() leaves the io_context unassociated. bio_associate_current() is updated so that it considers a bio as already associated if it has a blkcg_css, instead of an io_context, associated with it. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-06-02cgroup, block: implement task_get_css() and use it in bio_associate_current()Tejun Heo
bio_associate_current() currently open codes task_css() and css_tryget_online() to find and pin $current's blkcg css. Abstract it into task_get_css() which is implemented from cgroup side. As a task is always associated with an online css for every subsystem except while the css_set update is propagating, task_get_css() retries till css_tryget_online() succeeds. This is a cleanup and shouldn't lead to noticeable behavior changes. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Li Zefan <lizefan@huawei.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-06-02blkcg: add blkcg_root_cssTejun Heo
Add global constant blkcg_root_css which points to &blkcg_root.css. This will be used by cgroup writeback support. If blkcg is disabled, it's defined as ERR_PTR(-EINVAL). v2: The declarations moved to include/linux/blk-cgroup.h as suggested by Vivek. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-06-02blkcg: always create the blkcg_gq for the root blkcgTejun Heo
Currently, blkcg does a minor optimization where the root blkcg is created when the first blkcg policy is activated on a queue and destroyed on the deactivation of the last. On systems where blkcg is configured but not used, this saves one blkcg_gq struct per queue. On systems where blkcg is actually used, there's no difference. The only case where this can lead to any meaninful, albeit still minute, save in memory consumption is when all blkcg policies are deactivated after being widely used in the system, which is a hihgly unlikely scenario. The conditional existence of root blkcg_gq has already created several bugs in blkcg and became an issue once again for the new per-cgroup wb_congested mechanism for cgroup writeback support leading to a NULL dereference when no blkcg policy is active. This is really not worth bothering with. This patch makes blkcg always allocate and link the root blkcg_gq and release it only on queue destruction. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-06-02blkcg: move block/blk-cgroup.h to include/linux/blk-cgroup.hTejun Heo
cgroup aware writeback support will require exposing some of blkcg details. In preprataion, move block/blk-cgroup.h to include/linux/blk-cgroup.h. This patch is pure file move. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-06-02Merge branch 'linus' into sched/core, to resolve conflictIngo Molnar
Conflicts: arch/sparc/include/asm/topology_64.h Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-06-01blk-mq: Shared tag enhancementsKeith Busch
Storage controllers may expose multiple block devices that share hardware resources managed by blk-mq. This patch enhances the shared tags so a low-level driver can access the shared resources not tied to the unshared h/w contexts. This way the LLD can dynamically add and delete disks and request queues without having to track all the request_queue hctx's to iterate outstanding tags. Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-05-29block: only honor SG gap prevention for merges that contain dataJens Axboe
We can safely merge anything that wont generate an SG list entry, so if the bio is data-less (discard), don't look at potential SG gaps. Signed-off-by: Jens Axboe <axboe@fb.com>
2015-05-29Merge remote-tracking branch 'jens/for-4.2/core' into dm-4.2Mike Snitzer
2015-05-28block: discard bdi_unregister() in favour of bdi_destroy()NeilBrown
bdi_unregister() now contains very little functionality. It contains a "WARN_ON" if bdi->dev is NULL. This warning is of no real consequence as bdi->dev isn't needed by anything else in the function, and it triggers if blk_cleanup_queue() -> bdi_destroy() is called before bdi_unregister, which happens since Commit: 6cd18e711dd8 ("block: destroy bdi before blockdev is unregistered.") So this isn't wanted. It also calls bdi_set_min_ratio(). This needs to be called after writes through the bdi have all been flushed, and before the bdi is destroyed. Calling it early is better than calling it late as it frees up a global resource. Calling it immediately after bdi_wb_shutdown() in bdi_destroy() perfectly fits these requirements. So bdi_unregister() can be discarded with the important content moved to bdi_destroy(), as can the writeback_bdi_unregister event which is already not used. Reported-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org (v4.0) Fixes: c4db59d31e39 ("fs: don't reassign dirty inodes to default_backing_dev_info") Fixes: 6cd18e711dd8 ("block: destroy bdi before blockdev is unregistered.") Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Dan Williams <dan.j.williams@intel.com> Tested-by: Nicholas Moulin <nicholas.w.moulin@linux.intel.com> Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-05-27sched/topology: Rename topology_thread_cpumask() to topology_sibling_cpumask()Bartosz Golaszewski
Rename topology_thread_cpumask() to topology_sibling_cpumask() for more consistency with scheduler code. Signed-off-by: Bartosz Golaszewski <bgolaszewski@baylibre.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Russell King <rmk+kernel@arm.linux.org.uk> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Benoit Cousson <bcousson@baylibre.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Guenter Roeck <linux@roeck-us.net> Cc: Jean Delvare <jdelvare@suse.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Drokin <oleg.drokin@intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rafael J. Wysocki <rjw@rjwysocki.net> Cc: Russell King <linux@arm.linux.org.uk> Cc: Viresh Kumar <viresh.kumar@linaro.org> Link: http://lkml.kernel.org/r/1432645896-12588-2-git-send-email-bgolaszewski@baylibre.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-22block, dm: don't copy bios for request clonesChristoph Hellwig
Currently dm-multipath has to clone the bios for every request sent to the lower devices, which wastes cpu cycles and ties down memory. This patch instead adds a new REQ_CLONE flag that instructs req_bio_endio to not complete bios attached to a request, which we set on clone requests similar to bios in a flush sequence. With this change I/O errors on a path failure only get propagated to dm-multipath, which can then either resubmit the I/O or complete the bios on the original request. I've done some basic testing of this on a Linux target with ALUA support, and it survives path failures during I/O nicely. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-05-22block: remove management of bi_remaining when restoring original bi_end_ioMike Snitzer
Commit c4cf5261 ("bio: skip atomic inc/dec of ->bi_remaining for non-chains") regressed all existing callers that followed this pattern: 1) saving a bio's original bi_end_io 2) wiring up an intermediate bi_end_io 3) restoring the original bi_end_io from intermediate bi_end_io 4) calling bio_endio() to execute the restored original bi_end_io The regression was due to BIO_CHAIN only ever getting set if bio_inc_remaining() is called. For the above pattern it isn't set until step 3 above (step 2 would've needed to establish BIO_CHAIN). As such the first bio_endio(), in step 2 above, never decremented __bi_remaining before calling the intermediate bi_end_io -- leaving __bi_remaining with the value 1 instead of 0. When bio_inc_remaining() occurred during step 3 it brought it to a value of 2. When the second bio_endio() was called, in step 4 above, it should've called the original bi_end_io but it didn't because there was an extra reference that wasn't dropped (due to atomic operations being optimized away since BIO_CHAIN wasn't set upfront). Fix this issue by removing the __bi_remaining management complexity for all callers that use the above pattern -- bio_chain() is the only interface that _needs_ to be concerned with __bi_remaining. For the above pattern callers just expect the bi_end_io they set to get called! Remove bio_endio_nodec() and also remove all bio_inc_remaining() calls that aren't associated with the bio_chain() interface. Also, the bio_inc_remaining() interface has been moved local to bio.c. Fixes: c4cf5261 ("bio: skip atomic inc/dec of ->bi_remaining for non-chains") Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-05-20block: replace trylock with mutex_lock in blkdev_reread_part()Ming Lei
The only possible problem of using mutex_lock() instead of trylock is about deadlock. If there aren't any locks held before calling blkdev_reread_part(), deadlock can't be caused by this conversion. If there are locks held before calling blkdev_reread_part(), and if these locks arn't required in open, close handler and I/O path, deadlock shouldn't be caused too. Both user space's ioctl(BLKRRPART) and md_setup_drive() from init/do_mounts_md.c belongs to the 1st case, so the conversion is safe for the two cases. For loop, the previous patches in this pathset has fixed the ABBA lock dependency, so the conversion is OK. For nbd, tx_lock is held when calling the function: - both open and release won't hold the lock - when blkdev_reread_part() is run, I/O thread has been stopped already, so tx_lock won't be acquired in I/O path at that time. - so the conversion won't cause deadlock for nbd For dasd, both dasd_open(), dasd_release() and request function don't acquire any mutex/semphone, so the conversion should be safe. Reviewed-by: Christoph Hellwig <hch@lst.de> Tested-by: Jarod Wilson <jarod@redhat.com> Acked-by: Jarod Wilson <jarod@redhat.com> Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-05-20block: export blkdev_reread_part() and __blkdev_reread_part()Jarod Wilson
This patch exports blkdev_reread_part() for block drivers, also introduce __blkdev_reread_part(). For some drivers, such as loop, reread of partitions can be run from the release path, and bd_mutex may already be held prior to calling ioctl_by_bdev(bdev, BLKRRPART, 0), so introduce __blkdev_reread_part for use in such cases. CC: Christoph Hellwig <hch@lst.de> CC: Jens Axboe <axboe@kernel.dk> CC: Tejun Heo <tj@kernel.org> CC: Alexander Viro <viro@zeniv.linux.org.uk> CC: Markus Pargmann <mpa@pengutronix.de> CC: Stefan Weinhuber <wein@de.ibm.com> CC: Stefan Haberland <stefan.haberland@de.ibm.com> CC: Sebastian Ott <sebott@linux.vnet.ibm.com> CC: Fabian Frederick <fabf@skynet.be> CC: Ming Lei <ming.lei@canonical.com> CC: David Herrmann <dh.herrmann@gmail.com> CC: Andrew Morton <akpm@linux-foundation.org> CC: Peter Zijlstra <peterz@infradead.org> CC: nbd-general@lists.sourceforge.net CC: linux-s390@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jarod Wilson <jarod@redhat.com> Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-05-19block: remove unused BIO_RW_BLOCK and BIO_EOF flagsChristoph Hellwig
Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-05-19block: remove BIO_EOPNOTSUPPChristoph Hellwig
Since the big barrier rewrite/removal in 2007 we never fail FLUSH or FUA requests, which means we can remove the magic BIO_EOPNOTSUPP flag to help propagating those to the buffer_head layer. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-05-19block: use an atomic_t for mq_freeze_depthChristoph Hellwig
lockdep gets unhappy about the not disabling irqs when using the queue_lock around it. Instead of trying to fix that up just switch to an atomic_t and get rid of the lock. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-05-12block: remove export for blk_queue_bioMike Snitzer
With commit ff36ab345 ("dm: remove request-based logic from make_request_fn wrapper") DM no longer calls blk_queue_bio() directly, so remove its export. Doing so required a forward declaration in blk-core.c. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-05-08blk-mq: make plug work for mutiple disks and queuesShaohua Li
Last patch makes plug work for multiple queue case. However it only works for single disk case, because it assumes only one request in the plug list. If a task is accessing multiple disks, eg MD/DM, the assumption is wrong. Let blk_attempt_plug_merge() record request from the same queue. V2: use NULL parameter in !mq case. Fix a bug. Add comments in blk_attempt_plug_merge to make it less (hopefully) confusion. Cc: Jens Axboe <axboe@fb.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-05-08blk-mq: do limited block plug for multiple queue caseShaohua Li
plug is still helpful for workload with IO merge, but it can be harmful otherwise especially with multiple hardware queues, as there is (supposed) no lock contention in this case and plug can introduce latency. For multiple queues, we do limited plug, eg plug only if there is request merge. If a request doesn't have merge with following request, the requet will be dispatched immediately. V2: check blk_queue_nomerges() as suggested by Jeff. Cc: Jens Axboe <axboe@fb.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-05-08blk-mq: avoid re-initialize request which is failed in direct dispatchShaohua Li
If we directly issue a request and it fails, we use blk_mq_merge_queue_io(). But we already assigned bio to a request in blk_mq_bio_to_request. blk_mq_merge_queue_io shouldn't run blk_mq_bio_to_request again. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-05-08blk-mq: fix plugging in blk_sq_make_requestJeff Moyer
The following appears in blk_sq_make_request: /* * If we have multiple hardware queues, just go directly to * one of those for sync IO. */ We clearly don't have multiple hardware queues, here! This comment was introduced with this commit 07068d5b8e (blk-mq: split make request handler for multi and single queue): We want slightly different behavior from them: - On single queue devices, we currently use the per-process plug for deferred IO and for merging. - On multi queue devices, we don't use the per-process plug, but we want to go straight to hardware for SYNC IO. The old code had this: use_plug = !is_flush_fua && ((q->nr_hw_queues == 1) || !is_sync); and that was converted to: use_plug = !is_flush_fua && !is_sync; which is not equivalent. For the single queue case, that second half of the && expression is always true. So, what I think was actually inteded follows (and this more closely matches what is done in blk_queue_bio). V2: delete the 'likely', which should not be a big deal Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-05-08blk: clean up plugShaohua Li
Current code looks like inner plug gets flushed with a blk_finish_plug(). Actually it's a nop. All requests/callbacks are added to current->plug, while only outmost plug is assigned to current->plug. So inner plug always has empty request/callback list, which makes blk_flush_plug_list() a nop. This tries to make the code more clear. Signed-off-by: Shaohua Li <shli@fb.com> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-05-05block: move PM request support to IDEChristoph Hellwig
This removes the request types and hacks from the block code and into the old IDE driver. There is a small amunt of code duplication due to this, but it's not too bad. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-05-05bio: skip atomic inc/dec of ->bi_cnt for most use casesJens Axboe
Struct bio has a reference count that controls when it can be freed. Most uses cases is allocating the bio, which then returns with a single reference to it, doing IO, and then dropping that single reference. We can remove this atomic_dec_and_test() in the completion path, if nobody else is holding a reference to the bio. If someone does call bio_get() on the bio, then we flag the bio as now having valid count and that we must properly honor the reference count when it's being put. Tested-by: Robert Elliott <elliott@hp.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-05-05bio: skip atomic inc/dec of ->bi_remaining for non-chainsJens Axboe
Struct bio has an atomic ref count for chained bio's, and we use this to know when to end IO on the bio. However, most bio's are not chained, so we don't need to always introduce this atomic operation as part of ending IO. Add a helper to elevate the bi_remaining count, and flag the bio as now actually needing the decrement at end_io time. Rename the field to __bi_remaining to catch any current users of this doing the incrementing manually. For high IOPS workloads, this reduces the overhead of bio_endio() substantially. Tested-by: Robert Elliott <elliott@hp.com> Acked-by: Kent Overstreet <kent.overstreet@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-05-04blk-mq: don't lose requests if a stopped queue restartsShaohua Li
Normally if driver is busy to dispatch a request the logic is like below: block layer: driver: __blk_mq_run_hw_queue a. blk_mq_stop_hw_queue b. rq add to ctx->dispatch later: 1. blk_mq_start_hw_queue 2. __blk_mq_run_hw_queue But it's possible step 1-2 runs between a and b. And since rq isn't in ctx->dispatch yet, step 2 will not run rq. The rq might get lost if there are no subsequent requests kick in. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-04-27block: destroy bdi before blockdev is unregistered.NeilBrown
Because of the peculiar way that md devices are created (automatically when the device node is opened), a new device can be created and registered immediately after the blk_unregister_region(disk_devt(disk), disk->minors); call in del_gendisk(). Therefore it is important that all visible artifacts of the previous device are removed before this call. In particular, the 'bdi'. Since: commit c4db59d31e39ea067c32163ac961e9c80198fd37 Author: Christoph Hellwig <hch@lst.de> fs: don't reassign dirty inodes to default_backing_dev_info moved the device_unregister(bdi->dev); call from bdi_unregister() to bdi_destroy() it has been quite easy to lose a race and have a new (e.g.) "md127" be created after the blk_unregister_region() call and before bdi_destroy() is ultimately called by the final 'put_disk', which must come after del_gendisk(). The new device finds that the bdi name is already registered in sysfs and complains > [ 9627.630029] WARNING: CPU: 18 PID: 3330 at fs/sysfs/dir.c:31 sysfs_warn_dup+0x5a/0x70() > [ 9627.630032] sysfs: cannot create duplicate filename '/devices/virtual/bdi/9:127' We can fix this by moving the bdi_destroy() call out of blk_release_queue() (which can happen very late when a refcount reaches zero) and into blk_cleanup_queue() - which happens exactly when the md device driver calls it. Then it is only necessary for md to call blk_cleanup_queue() before del_gendisk(). As loop.c devices are also created on demand by opening the device node, we make the same change there. Fixes: c4db59d31e39ea067c32163ac961e9c80198fd37 Reported-by: Azat Khuzhin <a3at.mail@gmail.com> Cc: Christoph Hellwig <hch@lst.de> Cc: stable@vger.kernel.org (v4.0) Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-04-27block:bounce: fix call inc_|dec_zone_page_state on different pages confuse ↵Wang YanQing
value of NR_BOUNCE Commit d2c5e30c9a1420902262aa923794d2ae4e0bc391 ("[PATCH] zoned vm counters: conversion of nr_bounce to per zone counter") convert statistic of nr_bounce to per zone and one global value in vm_stat, but it call inc_|dec_zone_page_state on different pages, then different zones, and cause us to get unexpected value of NR_BOUNCE. Below is the result on my machine: Mar 2 09:26:08 udknight kernel: [144766.778265] Mem-Info: Mar 2 09:26:08 udknight kernel: [144766.778266] DMA per-cpu: Mar 2 09:26:08 udknight kernel: [144766.778268] CPU 0: hi: 0, btch: 1 usd: 0 Mar 2 09:26:08 udknight kernel: [144766.778269] CPU 1: hi: 0, btch: 1 usd: 0 Mar 2 09:26:08 udknight kernel: [144766.778270] Normal per-cpu: Mar 2 09:26:08 udknight kernel: [144766.778271] CPU 0: hi: 186, btch: 31 usd: 0 Mar 2 09:26:08 udknight kernel: [144766.778273] CPU 1: hi: 186, btch: 31 usd: 0 Mar 2 09:26:08 udknight kernel: [144766.778274] HighMem per-cpu: Mar 2 09:26:08 udknight kernel: [144766.778275] CPU 0: hi: 186, btch: 31 usd: 0 Mar 2 09:26:08 udknight kernel: [144766.778276] CPU 1: hi: 186, btch: 31 usd: 0 Mar 2 09:26:08 udknight kernel: [144766.778279] active_anon:46926 inactive_anon:287406 isolated_anon:0 Mar 2 09:26:08 udknight kernel: [144766.778279] active_file:105085 inactive_file:139432 isolated_file:0 Mar 2 09:26:08 udknight kernel: [144766.778279] unevictable:653 dirty:0 writeback:0 unstable:0 Mar 2 09:26:08 udknight kernel: [144766.778279] free:178957 slab_reclaimable:6419 slab_unreclaimable:9966 Mar 2 09:26:08 udknight kernel: [144766.778279] mapped:4426 shmem:305277 pagetables:784 bounce:0 Mar 2 09:26:08 udknight kernel: [144766.778279] free_cma:0 Mar 2 09:26:08 udknight kernel: [144766.778286] DMA free:3324kB min:68kB low:84kB high:100kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15976kB managed:15900kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes Mar 2 09:26:08 udknight kernel: [144766.778287] lowmem_reserve[]: 0 822 3754 3754 Mar 2 09:26:08 udknight kernel: [144766.778293] Normal free:26828kB min:3632kB low:4540kB high:5448kB active_anon:4872kB inactive_anon:68kB active_file:1796kB inactive_file:1796kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:892920kB managed:842560kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:4144kB slab_reclaimable:25676kB slab_unreclaimable:39864kB kernel_stack:1944kB pagetables:3136kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:2412612 all_unreclaimable? yes Mar 2 09:26:08 udknight kernel: [144766.778294] lowmem_reserve[]: 0 0 23451 23451 Mar 2 09:26:08 udknight kernel: [144766.778299] HighMem free:685676kB min:512kB low:3748kB high:6984kB active_anon:182832kB inactive_anon:1149556kB active_file:418544kB inactive_file:555932kB unevictable:2612kB isolated(anon):0kB isolated(file):0kB present:3001732kB managed:3001732kB mlocked:0kB dirty:0kB writeback:0kB mapped:17704kB shmem:1216964kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:75771152kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no Mar 2 09:26:08 udknight kernel: [144766.778300] lowmem_reserve[]: 0 0 0 0 You can see bounce:75771152kB for HighMem, but bounce:0 for lowmem and global. This patch fix it. Signed-off-by: Wang YanQing <udknight@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-04-23elevator: fix double release of elevator moduleChao Yu
Our issue is descripted in below call path: ->elevator_init ->elevator_init_fn ->{cfq,deadline,noop}_init_queue ->elevator_alloc ->kzalloc_node fail to call kzalloc_node and then put module in elevator_alloc; fail to call elevator_init_fn and then put module again in elevator_init. Remove elevator_put invoking in error path of elevator_alloc to avoid double release issue. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-04-23blk-mq: fix CPU hotplug handlingMing Lei
hctx->tags has to be set as NULL in case that it is to be unmapped no matter if set->tags[hctx->queue_num] is NULL or not in blk_mq_map_swqueue() because shared tags can be freed already from another request queue. The same situation has to be considered during handling CPU online too. Unmapped hw queue can be remapped after CPU topo is changed, so we need to allocate tags for the hw queue in blk_mq_map_swqueue(). Then tags allocation for hw queue can be removed in hctx cpu online notifier, and it is reasonable to do that after mapping is updated. Cc: <stable@vger.kernel.org> Reported-by: Dongsu Park <dongsu.park@profitbricks.com> Tested-by: Dongsu Park <dongsu.park@profitbricks.com> Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-04-23blk-mq: fix race between timeout and CPU hotplugMing Lei
Firstly during CPU hotplug, even queue is freezed, timeout handler still may come and access hctx->tags, which may cause use after free, so this patch deactivates timeout handler inside CPU hotplug notifier. Secondly, tags can be shared by more than one queues, so we have to check if the hctx has been unmapped, otherwise still use-after-free on tags can be triggered. Cc: <stable@vger.kernel.org> Reported-by: Dongsu Park <dongsu.park@profitbricks.com> Tested-by: Dongsu Park <dongsu.park@profitbricks.com> Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-04-17blk-mq: fix iteration of busy bitmapJens Axboe
Commit 889fa31f00b2 was a bit too eager in reducing the loop count, so we ended up missing queues in some configurations. Ensure that our division rounds up, so that's not the case. Reported-by: Guenter Roeck <linux@roeck-us.net> Fixes: 889fa31f00b2 ("blk-mq: reduce unnecessary software queue looping") Signed-off-by: Jens Axboe <axboe@fb.com>
2015-04-16Merge branch 'for-4.1/core' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block layer core bits from Jens Axboe: "This is the core pull request for 4.1. Not a lot of stuff in here for this round, mostly little fixes or optimizations. This pull request contains: - An optimization that speeds up queue runs on blk-mq, especially for the case where there's a large difference between nr_cpu_ids and the actual mapped software queues on a hardware queue. From Chong Yuan. - Honor node local allocations for requests on legacy devices. From David Rientjes. - Cleanup of blk_mq_rq_to_pdu() from me. - exit_aio() fixup from me, greatly speeding up exiting multiple IO contexts off exit_group(). For my particular test case, fio exit took ~6 seconds. A typical case of both exposing RCU grace periods to user space, and serializing exit of them. - Make blk_mq_queue_enter() honor the gfp mask passed in, so we only wait if __GFP_WAIT is set. From Keith Busch. - blk-mq exports and two added helpers from Mike Snitzer, which will be used by the dm-mq code. - Cleanups of blk-mq queue init from Wei Fang and Xiaoguang Wang" * 'for-4.1/core' of git://git.kernel.dk/linux-block: blk-mq: reduce unnecessary software queue looping aio: fix serial draining in exit_aio() blk-mq: cleanup blk_mq_rq_to_pdu() blk-mq: put blk_queue_rq_timeout together in blk_mq_init_queue() block: remove redundant check about 'set->nr_hw_queues' in blk_mq_alloc_tag_set() block: allocate request memory local to request queue blk-mq: don't wait in blk_mq_queue_enter() if __GFP_WAIT isn't set blk-mq: export blk_mq_run_hw_queues blk-mq: add blk_mq_init_allocated_queue and export blk_mq_register_disk
2015-04-15blk-mq: reduce unnecessary software queue loopingChong Yuan
In flush_busy_ctxs() and blk_mq_hctx_has_pending(), regardless of how many ctxs assigned to one hctx, they will all loop hctx->ctx_map.map_size times. Here hctx->ctx_map.map_size is a const ALIGN(nr_cpu_ids, 8) / 8. Especially, flush_busy_ctxs() is in hot code path. And it's unnecessary. Change ->map_size to contain the actually mapped software queues, so we only loop for as many iterations as we have to. And remove cpumask setting and nr_ctx count in blk_mq_init_cpu_queues() since they are all re-done in blk_mq_map_swqueue(). blk_mq_map_swqueue(). Signed-off-by: Chong Yuan <chong.yuan@memblaze.com> Reviewed-by: Wenbo Wang <wenbo.wang@memblaze.com> Updated by me for formatting and commenting. Signed-off-by: Jens Axboe <axboe@fb.com>
2015-04-14Merge branch 'for-linus-1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs update from Al Viro: "Part one: - struct filename-related cleanups - saner iov_iter_init() replacements (and switching the syscalls to use of those) - ntfs switch to ->write_iter() (Anton) - aio cleanups and splitting iocb into common and async parts (Christoph) - assorted fixes (me, bfields, Andrew Elble) There's a lot more, including the completion of switchover to ->{read,write}_iter(), d_inode/d_backing_inode annotations, f_flags race fixes, etc, but that goes after #for-davem merge. David has pulled it, and once it's in I'll send the next vfs pull request" * 'for-linus-1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (35 commits) sg_start_req(): use import_iovec() sg_start_req(): make sure that there's not too many elements in iovec blk_rq_map_user(): use import_single_range() sg_io(): use import_iovec() process_vm_access: switch to {compat_,}import_iovec() switch keyctl_instantiate_key_common() to iov_iter switch {compat_,}do_readv_writev() to {compat_,}import_iovec() aio_setup_vectored_rw(): switch to {compat_,}import_iovec() vmsplice_to_user(): switch to import_iovec() kill aio_setup_single_vector() aio: simplify arguments of aio_setup_..._rw() aio: lift iov_iter_init() into aio_setup_..._rw() lift iov_iter into {compat_,}do_readv_writev() NFS: fix BUG() crash in notify_change() with patch to chown_common() dcache: return -ESTALE not -EBUSY on distributed fs race NTFS: Version 2.1.32 - Update file write from aio_write to write_iter. VFS: Add iov_iter_fault_in_multipages_readable() drop bogus check in file_open_root() switch security_inode_getattr() to struct path * constify tomoyo_realpath_from_path() ...
2015-04-11blk_rq_map_user(): use import_single_range()Al Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-11sg_io(): use import_iovec()Al Viro
... and don't skip access_ok() validation. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-11blk-mq: initialize 'struct request' and associated data to zeroLinus Torvalds
Jan Engelhardt reports a strange oops with an invalid ->sense_buffer pointer in scsi_init_cmd_errh() with the blk-mq code. The sense_buffer pointer should have been initialized by the call to scsi_init_request() from blk_mq_init_rq_map(), but there seems to be some non-repeatable memory corruptor. This patch makes sure we initialize the whole struct request allocation (and the associated 'struct scsi_cmnd' for the SCSI case) to zero, by using __GFP_ZERO in the allocation. The old code initialized a couple of individual fields, leaving the rest undefined (although many of them are then initialized in later phases, like blk_mq_rq_ctx_init() etc. It's not entirely clear why this matters, but it's the rigth thing to do regardless, and with 4.0 imminent this is the defensive "let's just make sure everything is initialized properly" patch. Tested-by: Jan Engelhardt <jengelh@inai.de> Acked-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-03-31block: fix blk_stack_limits() regression due to lcm() changeMike Snitzer
Linux 3.19 commit 69c953c ("lib/lcm.c: lcm(n,0)=lcm(0,n) is 0, not n") caused blk_stack_limits() to not properly stack queue_limits for stacked devices (e.g. DM). Fix this regression by establishing lcm_not_zero() and switching blk_stack_limits() over to using it. DM uses blk_set_stacking_limits() to establish the initial top-level queue_limits that are then built up based on underlying devices' limits using blk_stack_limits(). In the case of optimal_io_size (io_opt) blk_set_stacking_limits() establishes a default value of 0. With commit 69c953c, lcm(0, n) is no longer n, which compromises proper stacking of the underlying devices' io_opt. Test: $ modprobe scsi_debug dev_size_mb=10 num_tgts=1 opt_blks=1536 $ cat /sys/block/sde/queue/optimal_io_size 786432 $ dmsetup create node --table "0 100 linear /dev/sde 0" Before this fix: $ cat /sys/block/dm-5/queue/optimal_io_size 0 After this fix: $ cat /sys/block/dm-5/queue/optimal_io_size 786432 Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org # 3.19+ Acked-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-03-30blk-mq: put blk_queue_rq_timeout together in blk_mq_init_queue()Wei Fang
Don't assign ->rq_timeout twice. Signed-off-by: Wei Fang <fangwei1@huawei.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-03-30block: remove redundant check about 'set->nr_hw_queues' in ↵Xiaoguang Wang
blk_mq_alloc_tag_set() At the beginning of blk_mq_alloc_tag_set(), we have already checked whether 'set->nr_hw_queues' is zero, so here remove this redundant check. Signed-off-by: Xiaoguang Wang <wangxg.fnst@cn.fujitsu.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-03-24block: allocate request memory local to request queueDavid Rientjes
blk_init_rl() allocates a mempool using mempool_create_node() with node local memory. This only allocates the mempool and element list locally to the requeue queue node. What we really want to do is allocate the request itself local to the queue. To do this, we need our own alloc and free functions that will allocate from request_cachep and pass the request queue node in to prefer node local memory. Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-03-20Fix bug in blk_rq_merge_okWenbo Wang
Use the right array index to reference the last element of rq->biotail->bi_io_vec[] Signed-off-by: Wenbo Wang <wenbo.wang@memblaze.com> Reviewed-by: Chong Yuan <chong.yuan@memblaze.com> Fixes: 66cb45aa41315 ("block: add support for limiting gaps in SG lists") Cc: stable@kernel.org Signed-off-by: Jens Axboe <axboe@fb.com>
2015-03-18blkmq: Fix NULL pointer deref when all reserved tags inSam Bradshaw
When allocating from the reserved tags pool, bt_get() is called with a NULL hctx. If all tags are in use, the hw queue is kicked to push out any pending IO, potentially freeing tags, and tag allocation is retried. The problem is that blk_mq_run_hw_queue() doesn't check for a NULL hctx. So we avoid it with a simple NULL hctx test. Tested by hammering mtip32xx with concurrent smartctl/hdparm. Signed-off-by: Sam Bradshaw <sbradshaw@micron.com> Signed-off-by: Selvan Mani <smani@micron.com> Fixes: b32232073e80 ("blk-mq: fix hang in bt_get()") Cc: stable@kernel.org Added appropriate comment. Signed-off-by: Jens Axboe <axboe@fb.com>