summaryrefslogtreecommitdiff
path: root/drivers/md/dm-thin.c
AgeCommit message (Collapse)Author
2017-07-03Merge branch 'for-4.13/block' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull core block/IO updates from Jens Axboe: "This is the main pull request for the block layer for 4.13. Not a huge round in terms of features, but there's a lot of churn related to some core cleanups. Note this depends on the UUID tree pull request, that Christoph already sent out. This pull request contains: - A series from Christoph, unifying the error/stats codes in the block layer. We now use blk_status_t everywhere, instead of using different schemes for different places. - Also from Christoph, some cleanups around request allocation and IO scheduler interactions in blk-mq. - And yet another series from Christoph, cleaning up how we handle and do bounce buffering in the block layer. - A blk-mq debugfs series from Bart, further improving on the support we have for exporting internal information to aid debugging IO hangs or stalls. - Also from Bart, a series that cleans up the request initialization differences across types of devices. - A series from Goldwyn Rodrigues, allowing the block layer to return failure if we will block and the user asked for non-blocking. - Patch from Hannes for supporting setting loop devices block size to that of the underlying device. - Two series of patches from Javier, fixing various issues with lightnvm, particular around pblk. - A series from me, adding support for write hints. This comes with NVMe support as well, so applications can help guide data placement on flash to improve performance, latencies, and write amplification. - A series from Ming, improving and hardening blk-mq support for stopping/starting and quiescing hardware queues. - Two pull requests for NVMe updates. Nothing major on the feature side, but lots of cleanups and bug fixes. From the usual crew. - A series from Neil Brown, greatly improving the bio rescue set support. Most notably, this kills the bio rescue work queues, if we don't really need them. - Lots of other little bug fixes that are all over the place" * 'for-4.13/block' of git://git.kernel.dk/linux-block: (217 commits) lightnvm: pblk: set line bitmap check under debug lightnvm: pblk: verify that cache read is still valid lightnvm: pblk: add initialization check lightnvm: pblk: remove target using async. I/Os lightnvm: pblk: use vmalloc for GC data buffer lightnvm: pblk: use right metadata buffer for recovery lightnvm: pblk: schedule if data is not ready lightnvm: pblk: remove unused return variable lightnvm: pblk: fix double-free on pblk init lightnvm: pblk: fix bad le64 assignations nvme: Makefile: remove dead build rule blk-mq: map all HWQ also in hyperthreaded system nvmet-rdma: register ib_client to not deadlock in device removal nvme_fc: fix error recovery on link down. nvmet_fc: fix crashes on bad opcodes nvme_fc: Fix crash when nvme controller connection fails. nvme_fc: replace ioabort msleep loop with completion nvme_fc: fix double calls to nvme_cleanup_cmd() nvme-fabrics: verify that a controller returns the correct NQN nvme: simplify nvme_dev_attrs_are_visible ...
2017-06-27dm thin: do not queue freed thin mapping for next stage processingVallish Vaidyeshwara
process_prepared_discard_passdown_pt1() should cleanup dm_thin_new_mapping in cases of error. dm_pool_inc_data_range() can fail trying to get a block reference: metadata operation 'dm_pool_inc_data_range' failed: error = -61 When dm_pool_inc_data_range() fails, dm thin aborts current metadata transaction and marks pool as PM_READ_ONLY. Memory for thin mapping is released as well. However, current thin mapping will be queued onto next stage as part of queue_passdown_pt2() or passdown_endio(). This dangling thin mapping memory when processed and accessed in next stage will lead to device mapper crashing. Code flow without fix: -> process_prepared_discard_passdown_pt1(m) -> dm_thin_remove_range() -> discard passdown --> passdown_endio(m) queues m onto next stage -> dm_pool_inc_data_range() fails, frees memory m but does not remove it from next stage queue -> process_prepared_discard_passdown_pt2(m) -> processes freed memory m and crashes One such stack: Call Trace: [<ffffffffa037a46f>] dm_cell_release_no_holder+0x2f/0x70 [dm_bio_prison] [<ffffffffa039b6dc>] cell_defer_no_holder+0x3c/0x80 [dm_thin_pool] [<ffffffffa039b88b>] process_prepared_discard_passdown_pt2+0x4b/0x90 [dm_thin_pool] [<ffffffffa0399611>] process_prepared+0x81/0xa0 [dm_thin_pool] [<ffffffffa039e735>] do_worker+0xc5/0x820 [dm_thin_pool] [<ffffffff8152bf54>] ? __schedule+0x244/0x680 [<ffffffff81087e72>] ? pwq_activate_delayed_work+0x42/0xb0 [<ffffffff81089f53>] process_one_work+0x153/0x3f0 [<ffffffff8108a71b>] worker_thread+0x12b/0x4b0 [<ffffffff8108a5f0>] ? rescuer_thread+0x350/0x350 [<ffffffff8108fd6a>] kthread+0xca/0xe0 [<ffffffff8108fca0>] ? kthread_park+0x60/0x60 [<ffffffff81530b45>] ret_from_fork+0x25/0x30 The fix is to first take the block ref count for discarded block and then do a passdown discard of this block. If block ref count fails, then bail out aborting current metadata transaction, mark pool as PM_READ_ONLY and also free current thin mapping memory (existing error handling code) without queueing this thin mapping onto next stage of processing. If block ref count succeeds, then passdown discard of this block. Discard callback of passdown_endio() will queue this thin mapping onto next stage of processing. Code flow with fix: -> process_prepared_discard_passdown_pt1(m) -> dm_thin_remove_range() -> dm_pool_inc_data_range() --> if fails, free memory m and bail out -> discard passdown --> passdown_endio(m) queues m onto next stage Cc: stable <stable@vger.kernel.org> # v4.9+ Reviewed-by: Eduardo Valentin <eduval@amazon.com> Reviewed-by: Cristian Gafton <gafton@amazon.com> Reviewed-by: Anchal Agarwal <anchalag@amazon.com> Signed-off-by: Vallish Vaidyeshwara <vallish@amazon.com> Reviewed-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-06-09block: switch bios to blk_status_tChristoph Hellwig
Replace bi_error with a new bi_status to allow for a clear conversion. Note that device mapper overloaded bi_error with a private value, which we'll have to keep arround at least for now and thus propagate to a proper blk_status_t value. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-06-09dm: change ->end_io calling conventionChristoph Hellwig
Turn the error paramter into a pointer so that target drivers can change the value, and make sure only DM_ENDIO_* values are returned from the methods. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-05-03Merge tag 'for-4.12/dm-changes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper updates from Mike Snitzer: - A major update for DM cache that reduces the latency for deciding whether blocks should migrate to/from the cache. The bio-prison-v2 interface supports this improvement by enabling direct dispatch of work to workqueues rather than having to delay the actual work dispatch to the DM cache core. So the dm-cache policies are much more nimble by being able to drive IO as they see fit. One immediate benefit from the improved latency is a cache that should be much more adaptive to changing workloads. - Add a new DM integrity target that emulates a block device that has additional per-sector tags that can be used for storing integrity information. - Add a new authenticated encryption feature to the DM crypt target that builds on the capabilities provided by the DM integrity target. - Add MD interface for switching the raid4/5/6 journal mode and update the DM raid target to use it to enable aid4/5/6 journal write-back support. - Switch the DM verity target over to using the asynchronous hash crypto API (this helps work better with architectures that have access to off-CPU algorithm providers, which should reduce CPU utilization). - Various request-based DM and DM multipath fixes and improvements from Bart and Christoph. - A DM thinp target fix for a bio structure leak that occurs for each discard IFF discard passdown is enabled. - A fix for a possible deadlock in DM bufio and a fix to re-check the new buffer allocation watermark in the face of competing admin changes to the 'max_cache_size_bytes' tunable. - A couple DM core cleanups. * tag 'for-4.12/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (50 commits) dm bufio: check new buffer allocation watermark every 30 seconds dm bufio: avoid a possible ABBA deadlock dm mpath: make it easier to detect unintended I/O request flushes dm mpath: cleanup QUEUE_IF_NO_PATH bit manipulation by introducing assign_bit() dm mpath: micro-optimize the hot path relative to MPATHF_QUEUE_IF_NO_PATH dm: introduce enum dm_queue_mode to cleanup related code dm mpath: verify __pg_init_all_paths locking assumptions at runtime dm: verify suspend_locking assumptions at runtime dm block manager: remove an unused argument from dm_block_manager_create() dm rq: check blk_mq_register_dev() return value in dm_mq_init_request_queue() dm mpath: delay requeuing while path initialization is in progress dm mpath: avoid that path removal can trigger an infinite loop dm mpath: split and rename activate_path() to prepare for its expanded use dm ioctl: prevent stack leak in dm ioctl call dm integrity: use previously calculated log2 of sectors_per_block dm integrity: use hex2bin instead of open-coded variant dm crypt: replace custom implementation of hex2bin() dm crypt: remove obsolete references to per-CPU state dm verity: switch to using asynchronous hash crypto API dm crypt: use WQ_HIGHPRI for the IO and crypt workqueues ...
2017-04-24dm thin: fix a memory leak when passing discard bio downDennis Yang
dm-thin does not free the discard_parent bio after all chained sub bios finished. The following kmemleak report could be observed after pool with discard_passdown option processes discard bios in linux v4.11-rc7. To fix this, we drop the discard_parent bio reference when its endio (passdown_endio) called. unreferenced object 0xffff8803d6b29700 (size 256): comm "kworker/u8:0", pid 30349, jiffies 4379504020 (age 143002.776s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 01 00 00 00 00 00 00 f0 00 00 00 00 00 00 00 00 ................ backtrace: [<ffffffff81a5efd9>] kmemleak_alloc+0x49/0xa0 [<ffffffff8114ec34>] kmem_cache_alloc+0xb4/0x100 [<ffffffff8110eec0>] mempool_alloc_slab+0x10/0x20 [<ffffffff8110efa5>] mempool_alloc+0x55/0x150 [<ffffffff81374939>] bio_alloc_bioset+0xb9/0x260 [<ffffffffa018fd20>] process_prepared_discard_passdown_pt1+0x40/0x1c0 [dm_thin_pool] [<ffffffffa018b409>] break_up_discard_bio+0x1a9/0x200 [dm_thin_pool] [<ffffffffa018b484>] process_discard_cell_passdown+0x24/0x40 [dm_thin_pool] [<ffffffffa018b24d>] process_discard_bio+0xdd/0xf0 [dm_thin_pool] [<ffffffffa018ecf6>] do_worker+0xa76/0xd50 [dm_thin_pool] [<ffffffff81086239>] process_one_work+0x139/0x370 [<ffffffff810867b1>] worker_thread+0x61/0x450 [<ffffffff8108b316>] kthread+0xd6/0xf0 [<ffffffff81a6cd1f>] ret_from_fork+0x3f/0x70 [<ffffffffffffffff>] 0xffffffffffffffff Cc: stable@vger.kernel.org Signed-off-by: Dennis Yang <dennisyang@qnap.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-04-08block: remove the discard_zeroes_data flagChristoph Hellwig
Now that we use the proper REQ_OP_WRITE_ZEROES operation everywhere we can kill this hack. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-07dm bio prison v2: new interface for the bio prisonJoe Thornber
The deferred set is gone and all methods have _v2 appended to the end of their names to allow for continued use of the original bio prison in DM thin-provisioning. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-02-02block: Use pointer to backing_dev_info from request_queueJan Kara
We will want to have struct backing_dev_info allocated separately from struct request_queue. As the first step add pointer to backing_dev_info to request_queue and convert all users touching it. No functional changes in this patch. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-01-27block: add a op_is_flush helperChristoph Hellwig
This centralizes the checks for bios that needs to be go into the flush state machine. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-08-07block: rename bio bi_rw to bi_opfJens Axboe
Since commit 63a4cc24867d, bio->bi_rw contains flags in the lower portion and the op code in the higher portions. This means that old code that relies on manually setting bi_rw is most likely going to be broken. Instead of letting that brokeness linger, rename the member, to force old and out-of-tree code to break at compile time instead of at runtime. No intended functional changes in this commit. Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-20dm thin: fix a race condition between discarding and provisioning a blockJoe Thornber
The discard passdown was being issued after the block was unmapped, which meant the block could be reprovisioned whilst the passdown discard was still in flight. We can only identify unshared blocks (safe to do a passdown a discard to) once they're unmapped and their ref count hits zero. Block ref counts are now used to guard against concurrent allocation of these blocks that are being discarded. So now we unmap the block, issue passdown discards, and the immediately increment ref counts for regions that have been discarded via passed down (this is safe because allocation occurs within the same thread). We then decrement ref counts once the passdown discard IO is complete -- signaling these blocks may now be allocated. This fixes the potential for corruption that was reported here: https://www.redhat.com/archives/dm-devel/2016-June/msg00311.html Reported-by: Dennis Yang <dennisyang@qnap.com> Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-06-07block, drivers, fs: rename REQ_FLUSH to REQ_PREFLUSHMike Christie
To avoid confusion between REQ_OP_FLUSH, which is handled by request_fn drivers, and upper layers requesting the block layer perform a flush sequence along with possibly a WRITE, this patch renames REQ_FLUSH to REQ_PREFLUSH. Signed-off-by: Mike Christie <mchristi@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07dm: use bio op accessorsMike Christie
Separate the op from the rq_flag_bits and have dm set/get the bio using bio_set_op_attrs/bio_op. Signed-off-by: Mike Christie <mchristi@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07block discard: use bio set op accessorMike Christie
This converts the block issue discard helper and users to use the bio_set_op_attrs accessor and only pass in the operation flags like REQ_SEQURE. Signed-off-by: Mike Christie <mchristi@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07block/fs/drivers: remove rw argument from submit_bioMike Christie
This has callers of submit_bio/submit_bio_wait set the bio->bi_rw instead of passing it in. This makes that use the same as generic_make_request and how we set the other bio fields. Signed-off-by: Mike Christie <mchristi@redhat.com> Fixed up fs/ext4/crypto.c Signed-off-by: Jens Axboe <axboe@fb.com>
2016-05-13dm thin: unroll issue_discard() to create longer discard bio chainsJoe Thornber
There is little benefit to doing this but it does structure DM thinp's code to more cleanly use the __blkdev_issue_discard() interface -- particularly in passdown_double_checking_shared_status(). Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-05-13dm thin: use __blkdev_issue_discard for async discard supportMike Snitzer
With commit 38f25255330 ("block: add __blkdev_issue_discard") DM thinp no longer needs to carry its own async discard method. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-05-13dm thin: remove __bio_inc_remaining() and switch to using bio_inc_remaining()Mike Snitzer
DM thinp's use of bio_inc_remaining() is critical to ensure the original parent discard bio isn't completed before sub-discards have. DM thinp needs this due to the extra quiescing that occurs, via multiple DM thinp mappings, while processing large discards. As such DM thinp must build the async discard bio chain after some delay -- so bio_inc_remaining() is used to enable DM thinp to take a reference on the original parent discard bio for each mapping. This allows the immediate use of bio_endio() on that discard bio; but with the understanding that the actual completion won't occur until each of the sub-discards' per-mapping references are dropped. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com>
2016-05-05dm thin: Remove return statement from void functionAmitoj Kaur Chawla
Return statement at the end of a void function is useless. The Coccinelle semantic patch used to make this change is as follows: //<smpl> @@ identifier f; expression e; @@ void f(...) { <... - return e; ...> } //</smpl> Signed-off-by: Amitoj Kaur Chawla <amitoj1606@gmail.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-03-11dm thin: consistently return -ENOSPC if pool has run out of data spaceMike Snitzer
Commit 0a927c2f02 ("dm thin: return -ENOSPC when erroring retry list due to out of data space") was a step in the right direction but didn't go far enough. Add a new 'out_of_data_space' flag to 'struct pool' and set it if/when the pool runs of of data space. This fixes cell_error() and error_retry_list() to not blindly return -EIO. We cannot rely on the 'error_if_no_space' feature flag since it is transient (in that it can be reset once space is added, plus it only controls whether errors are issued, it doesn't reflect whether the pool is actually out of space). Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-02-22dm: rename target's per_bio_data_size to per_io_data_sizeMike Snitzer
Request-based DM will also make use of per_bio_data_size. Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-01-06dm thin: bump thin and thin-pool target versionsMike Snitzer
Commit 3d5f6733 ("dm thin metadata: speed up discard of partially mapped volumes"), or some other dm-thinp change during the Linux 4.5 development window, really should've bumped these target versions. Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-12-17dm thin: fix race condition when destroying thin pool workqueueNikolay Borisov
When a thin pool is being destroyed delayed work items are cancelled using cancel_delayed_work(), which doesn't guarantee that on return the delayed item isn't running. This can cause the work item to requeue itself on an already destroyed workqueue. Fix this by using cancel_delayed_work_sync() which guarantees that on return the work item is not running anymore. Fixes: 905e51b39a555 ("dm thin: commit outstanding data every second") Fixes: 85ad643b7e7e5 ("dm thin: add timeout to stop out-of-data-space mode holding IO forever") Signed-off-by: Nikolay Borisov <kernel@kyup.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org
2015-11-23dm thin: fix regression in advertised discard limitsMike Snitzer
When establishing a thin device's discard limits we cannot rely on the underlying thin-pool device's discard capabilities (which are inherited from the thin-pool's underlying data device) given that DM thin devices must provide discard support even when the thin-pool's underlying data device doesn't support discards. Users were exposed to this thin device discard limits regression if their thin-pool's underlying data device does _not_ support discards. This regression caused all upper-layers that called the blkdev_issue_discard() interface to not be able to issue discards to thin devices (because discard_granularity was 0). This regression wasn't caught earlier because the device-mapper-test-suite's extensive 'thin-provisioning' discard tests are only ever performed against thin-pool's with data devices that support discards. Fix is to have thin_io_hints() test the pool's 'discard_enabled' feature rather than inferring whether or not a thin device's discard support should be enabled by looking at the thin-pool's discard_granularity. Fixes: 216076705 ("dm thin: disable discard support for thin devices if pool's is disabled") Reported-by: Mike Gerber <mike@sprachgewalt.de> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org # 4.1+
2015-11-16dm thin: restore requested 'error_if_no_space' setting on OODS to WRITE ↵Mike Snitzer
transition A thin-pool that is in out-of-data-space (OODS) mode may transition back to write mode -- without the admin adding more space to the thin-pool -- if/when blocks are released (either by deleting thin devices or discarding provisioned blocks). But as part of the thin-pool's earlier transition to out-of-data-space mode the thin-pool may have set the 'error_if_no_space' flag to true if the no_space_timeout expires without more space having been made available. That implementation detail, of changing the pool's error_if_no_space setting, needs to be reset back to the default that the user specified when the thin-pool's table was loaded. Otherwise we'll drop the user requested behaviour on the floor when this out-of-data-space to write mode transition occurs. Reported-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com> Fixes: 2c43fd26e4 ("dm thin: fix missing out-of-data-space to write mode transition if blocks are released") Cc: stable@vger.kernel.org
2015-10-13dm thin: fix missing pool reference count decrement in pool_ctr error pathMike Snitzer
Fixes: ac8c3f3df ("dm thin: generate event when metadata threshold passed") Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org # 3.10+
2015-09-13dm thin: disable discard support for thin devices if pool's is disabledMike Snitzer
If the pool is configured with 'ignore_discard' its discard support is disabled. The pool's thin devices should also have queue_limits that reflect discards are disabled. Fixes: 34fbcf62 ("dm thin: range discard support") Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org # 4.1+
2015-09-02Merge tag 'dm-4.3-changes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper update from Mike Snitzer: - a couple small cleanups in dm-cache, dm-verity, persistent-data's dm-btree, and DM core. - a 4.1-stable fix for dm-cache that fixes the leaking of deferred bio prison cells - a 4.2-stable fix that adds feature reporting for the dm-stats features added in 4.2 - improve DM-snapshot to not invalidate the on-disk snapshot if snapshot device write overflow occurs; but a write overflow triggered through the origin device will still invalidate the snapshot. - optimize DM-thinp's async discard submission a bit now that late bio splitting has been included in block core. - switch DM-cache's SMQ policy lock from using a mutex to a spinlock; improves performance on very low latency devices (eg. NVMe SSD). - document DM RAID 4/5/6's discard support [ I did not pull the slab changes, which weren't appropriate for this tree, and weren't obviously the right thing to do anyway. At the very least they need some discussion and explanation before getting merged. Because not pulling the actual tagged commit but doing a partial pull instead, this merge commit thus also obviously is missing the git signature from the original tag ] * tag 'dm-4.3-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm cache: fix use after freeing migrations dm cache: small cleanups related to deferred prison cell cleanup dm cache: fix leaking of deferred bio prison cells dm raid: document RAID 4/5/6 discard support dm stats: report precise_timestamps and histogram in @stats_list output dm thin: optimize async discard submission dm snapshot: don't invalidate on-disk image on snapshot write overflow dm: remove unlikely() before IS_ERR() dm: do not override error code returned from dm_get_device() dm: test return value for DM_MAPIO_SUBMITTED dm verity: remove unused mempool dm cache: move wake_waker() from free_migrations() to where it is needed dm btree remove: remove unused function get_nr_entries() dm btree: remove unused "dm_block_t root" parameter in btree_split_sibling() dm cache policy smq: change the mutex to a spinlock
2015-09-02Merge branch 'for-4.3/core' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull core block updates from Jens Axboe: "This first core part of the block IO changes contains: - Cleanup of the bio IO error signaling from Christoph. We used to rely on the uptodate bit and passing around of an error, now we store the error in the bio itself. - Improvement of the above from myself, by shrinking the bio size down again to fit in two cachelines on x86-64. - Revert of the max_hw_sectors cap removal from a revision again, from Jeff Moyer. This caused performance regressions in various tests. Reinstate the limit, bump it to a more reasonable size instead. - Make /sys/block/<dev>/queue/discard_max_bytes writeable, by me. Most devices have huge trim limits, which can cause nasty latencies when deleting files. Enable the admin to configure the size down. We will look into having a more sane default instead of UINT_MAX sectors. - Improvement of the SGP gaps logic from Keith Busch. - Enable the block core to handle arbitrarily sized bios, which enables a nice simplification of bio_add_page() (which is an IO hot path). From Kent. - Improvements to the partition io stats accounting, making it faster. From Ming Lei. - Also from Ming Lei, a basic fixup for overflow of the sysfs pending file in blk-mq, as well as a fix for a blk-mq timeout race condition. - Ming Lin has been carrying Kents above mentioned patches forward for a while, and testing them. Ming also did a few fixes around that. - Sasha Levin found and fixed a use-after-free problem introduced by the bio->bi_error changes from Christoph. - Small blk cgroup cleanup from Viresh Kumar" * 'for-4.3/core' of git://git.kernel.dk/linux-block: (26 commits) blk: Fix bio_io_vec index when checking bvec gaps block: Replace SG_GAPS with new queue limits mask block: bump BLK_DEF_MAX_SECTORS to 2560 Revert "block: remove artifical max_hw_sectors cap" blk-mq: fix race between timeout and freeing request blk-mq: fix buffer overflow when reading sysfs file of 'pending' Documentation: update notes in biovecs about arbitrarily sized bios block: remove bio_get_nr_vecs() fs: use helper bio_add_page() instead of open coding on bi_io_vec block: kill merge_bvec_fn() completely md/raid5: get rid of bio_fits_rdev() md/raid5: split bio for chunk_aligned_read block: remove split code in blkdev_issue_{discard,write_same} btrfs: remove bio splitting and merge_bvec_fn() calls bcache: remove driver private bio splitting code block: simplify bio_add_page() block: make generic_make_request handle arbitrarily sized bios blk-cgroup: Drop unlikely before IS_ERR(_OR_NULL) block: don't access bio->bi_error after bio_put() block: shrink struct bio down to 2 cache lines again ...
2015-08-18dm thin: optimize async discard submissionMike Snitzer
__blkdev_issue_discard_async() doesn't need to worry about further splitting because the upper layer blkdev_issue_discard() will have already handled splitting bios such that the bi_size isn't overflowed. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com>
2015-08-13block: kill merge_bvec_fn() completelyKent Overstreet
As generic_make_request() is now able to handle arbitrarily sized bios, it's no longer necessary for each individual block driver to define its own ->merge_bvec_fn() callback. Remove every invocation completely. Cc: Jens Axboe <axboe@kernel.dk> Cc: Lars Ellenberg <drbd-dev@lists.linbit.com> Cc: drbd-user@lists.linbit.com Cc: Jiri Kosina <jkosina@suse.cz> Cc: Yehuda Sadeh <yehuda@inktank.com> Cc: Sage Weil <sage@inktank.com> Cc: Alex Elder <elder@kernel.org> Cc: ceph-devel@vger.kernel.org Cc: Alasdair Kergon <agk@redhat.com> Cc: Mike Snitzer <snitzer@redhat.com> Cc: dm-devel@redhat.com Cc: Neil Brown <neilb@suse.de> Cc: linux-raid@vger.kernel.org Cc: Christoph Hellwig <hch@infradead.org> Cc: "Martin K. Petersen" <martin.petersen@oracle.com> Acked-by: NeilBrown <neilb@suse.de> (for the 'md' bits) Acked-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> [dpark: also remove ->merge_bvec_fn() in dm-thin as well as dm-era-target, and resolve merge conflicts] Signed-off-by: Dongsu Park <dpark@posteo.net> Signed-off-by: Ming Lin <ming.l@ssi.samsung.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-07-29block: add a bi_error field to struct bioChristoph Hellwig
Currently we have two different ways to signal an I/O error on a BIO: (1) by clearing the BIO_UPTODATE flag (2) by returning a Linux errno value to the bi_end_io callback The first one has the drawback of only communicating a single possible error (-EIO), and the second one has the drawback of not beeing persistent when bios are queued up, and are not passed along from child to parent bio in the ever more popular chaining scenario. Having both mechanisms available has the additional drawback of utterly confusing driver authors and introducing bugs where various I/O submitters only deal with one of them, and the others have to add boilerplate code to deal with both kinds of error returns. So add a new bi_error field to store an errno value directly in struct bio and remove the existing mechanisms to clean all this up. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-07-26dm thin: return -ENOSPC when erroring retry list due to out of data spaceMike Snitzer
Otherwise -EIO would be returned when -ENOSPC should be used consistently. Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-07-16dm thin: display 'needs_check' in status if it is setMike Snitzer
There is currently no way to see that the needs_check flag has been set in the metadata. Display 'needs_check' in the thin-pool status if it is set in the thinp metadata. Also, update thinp documentation. Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-07-16dm thin: stay in out-of-data-space mode once no_space_timeout expiresMike Snitzer
This fixes an issue where running out of data space would cause the thin-pool's metadata to become read-only. There was no reason to make metadata read-only -- calling set_pool_mode() with PM_READ_ONLY was a misguided way to error all queued and future write IOs. We can accomplish the same by degrading from PM_OUT_OF_DATA_SPACE to PM_OUT_OF_DATA_SPACE with error_if_no_space enabled. Otherwise, the use of PM_READ_ONLY could cause a race where commit() was started before the PM_READ_ONLY transition but dm_pool_commit_metadata() would go on to fail because the block manager had transitioned to read-only. The return of -EPERM from dm_pool_commit_metadata(), due to attempting to commit while in read-only mode, caused the thin-pool to set 'needs_check' because a metadata_operation_failed(). This needless cascade of failures makes life for users more difficult than needed. Reported-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-07-05dm thin: allocate the cell_sort_array dynamicallyJoe Thornber
Given the pool's cell_sort_array holds 8192 pointers it triggers an order 5 allocation via kmalloc. This order 5 allocation is prone to failure as system memory gets more fragmented over time. Fix this by allocating the cell_sort_array using vmalloc. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org
2015-06-11dm thin: fail messages with EOPNOTSUPP when pool cannot handle messagesMike Snitzer
Use EOPNOTSUPP, rather than EINVAL, error code when user attempts to send the pool a message. Otherwise usespace is led to believe the message failed due to invalid argument. Reported-by: Zdenek Kabelac <zkabelac@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-06-11dm thin: range discard supportJoe Thornber
Previously REQ_DISCARD bios have been split into block sized chunks before submission to the thin target. There are a couple of issues with this: - If the block size is small, a large discard request can get broken up into a great many bios which is both slow and causes a lot of memory pressure. - The thin pool block size and the discard granularity for the underlying data device need to be compatible if we want to passdown the discard. This patch relaxes the block size granularity for thin devices. It makes use of the recent range locking added to the bio_prison to quiesce a whole range of thin blocks before unmapping them. Once a thin range has been unmapped the discard can then be passed down to the data device for those sub ranges where the data blocks are no longer used (ie. they weren't shared in the first place). This patch also doesn't make any apologies about open-coding portions of block core as a means to supporting async discard completions in the near-term -- if/when late bio splitting lands it'll all get cleaned up. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-05-29dm thin: cleanup schedule_zero() to read more logicallyMike Snitzer
The overwrite has only ever about optimizing away the need to zero a block if the entire block was being overwritten. As such it is only relevant when zeroing is enabled. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Joe Thornber <ejt@redhat.com>
2015-05-29dm thin: cleanup overwrite's endio restore to be centralizedMike Snitzer
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-05-22block: remove management of bi_remaining when restoring original bi_end_ioMike Snitzer
Commit c4cf5261 ("bio: skip atomic inc/dec of ->bi_remaining for non-chains") regressed all existing callers that followed this pattern: 1) saving a bio's original bi_end_io 2) wiring up an intermediate bi_end_io 3) restoring the original bi_end_io from intermediate bi_end_io 4) calling bio_endio() to execute the restored original bi_end_io The regression was due to BIO_CHAIN only ever getting set if bio_inc_remaining() is called. For the above pattern it isn't set until step 3 above (step 2 would've needed to establish BIO_CHAIN). As such the first bio_endio(), in step 2 above, never decremented __bi_remaining before calling the intermediate bi_end_io -- leaving __bi_remaining with the value 1 instead of 0. When bio_inc_remaining() occurred during step 3 it brought it to a value of 2. When the second bio_endio() was called, in step 4 above, it should've called the original bi_end_io but it didn't because there was an extra reference that wasn't dropped (due to atomic operations being optimized away since BIO_CHAIN wasn't set upfront). Fix this issue by removing the __bi_remaining management complexity for all callers that use the above pattern -- bio_chain() is the only interface that _needs_ to be concerned with __bi_remaining. For the above pattern callers just expect the bi_end_io they set to get called! Remove bio_endio_nodec() and also remove all bio_inc_remaining() calls that aren't associated with the bio_chain() interface. Also, the bio_inc_remaining() interface has been moved local to bio.c. Fixes: c4cf5261 ("bio: skip atomic inc/dec of ->bi_remaining for non-chains") Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-05-05bio: skip atomic inc/dec of ->bi_remaining for non-chainsJens Axboe
Struct bio has an atomic ref count for chained bio's, and we use this to know when to end IO on the bio. However, most bio's are not chained, so we don't need to always introduce this atomic operation as part of ending IO. Add a helper to elevate the bi_remaining count, and flag the bio as now actually needing the decrement at end_io time. Rename the field to __bi_remaining to catch any current users of this doing the incrementing manually. For high IOPS workloads, this reduces the overhead of bio_endio() substantially. Tested-by: Robert Elliott <elliott@hp.com> Acked-by: Kent Overstreet <kent.overstreet@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-02-27dm thin: fix to consistently zero-fill reads to unprovisioned blocksJoe Thornber
It was always intended that a read to an unprovisioned block will return zeroes regardless of whether the pool is in read-only or read-write mode. thin_bio_map() was inconsistent with its handling of such reads when the pool is in read-only mode, it now properly zero-fills the bios it returns in response to unprovisioned block reads. Eliminate thin_bio_map()'s special read-only mode handling of -ENODATA and just allow the IO to be deferred to the worker which will result in pool->process_bio() handling the IO (which already properly zero-fills reads to unprovisioned blocks). Reported-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org
2015-02-12Merge tag 'dm-3.20-changes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper changes from Mike Snitzer: - The most significant change this cycle is request-based DM now supports stacking ontop of blk-mq devices. This blk-mq support changes the model request-based DM uses for cloning a request to relying on calling blk_get_request() directly from the underlying blk-mq device. An early consumer of this code is Intel's emerging NVMe hardware; thanks to Keith Busch for working on, and pushing for, these changes. - A few other small fixes and cleanups across other DM targets. * tag 'dm-3.20-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm: inherit QUEUE_FLAG_SG_GAPS flags from underlying queues dm snapshot: remove unnecessary NULL checks before vfree() calls dm mpath: simplify failure path of dm_multipath_init() dm thin metadata: remove unused dm_pool_get_data_block_size() dm ioctl: fix stale comment above dm_get_inactive_table() dm crypt: update url in CONFIG_DM_CRYPT help text dm bufio: fix time comparison to use time_after_eq() dm: use time_in_range() and time_after() dm raid: fix a couple integer overflows dm table: train hybrid target type detection to select blk-mq if appropriate dm: allocate requests in target when stacking on blk-mq devices dm: prepare for allocating blk-mq clone requests in target dm: submit stacked requests in irq enabled context dm: split request structure out from dm_rq_target_io structure dm: remove exports for request-based interfaces without external callers
2015-02-09dm: use time_in_range() and time_after()Manuel Schölling
To be future-proof and for better readability the time comparisons are modified to use time_in_range() and time_after() instead of plain, error-prone math. Signed-off-by: Manuel Schölling <manuel.schoelling@gmx.de> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-01-28dm thin: don't allow messages to be sent to a pool target in READ_ONLY or ↵Joe Thornber
FAIL mode You can't modify the metadata in these modes. It's better to fail these messages immediately than let the block-manager deny write locks on metadata blocks. Otherwise these failed metadata changes will trigger 'needs_check' to get set in the metadata superblock -- requiring repair using the thin_check utility. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org
2014-12-17dm thin: fix crash by initializing thin device's refcount and completion earlierMarc Dionne
Commit 80e96c5484be ("dm thin: do not allow thin device activation while pool is suspended") delayed the initialization of a new thin device's refcount and completion until after this new thin was added to the pool's active_thins list and the pool lock is released. This opens a race with a worker thread that walks the list and calls thin_get/put, noticing that the refcount goes to 0 and calling complete, freezing up the system and giving the oops below: kernel: BUG: unable to handle kernel NULL pointer dereference at (null) kernel: IP: [<ffffffff810d360b>] __wake_up_common+0x2b/0x90 kernel: Call Trace: kernel: [<ffffffff810d3683>] __wake_up_locked+0x13/0x20 kernel: [<ffffffff810d3dc7>] complete+0x37/0x50 kernel: [<ffffffffa0595c50>] thin_put+0x20/0x30 [dm_thin_pool] kernel: [<ffffffffa059aab7>] do_worker+0x667/0x870 [dm_thin_pool] kernel: [<ffffffff816a8a4c>] ? __schedule+0x3ac/0x9a0 kernel: [<ffffffff810b1aef>] process_one_work+0x14f/0x400 kernel: [<ffffffff810b206b>] worker_thread+0x6b/0x490 kernel: [<ffffffff810b2000>] ? rescuer_thread+0x260/0x260 kernel: [<ffffffff810b6a7b>] kthread+0xdb/0x100 kernel: [<ffffffff810b69a0>] ? kthread_create_on_node+0x170/0x170 kernel: [<ffffffff816ad7ec>] ret_from_fork+0x7c/0xb0 kernel: [<ffffffff810b69a0>] ? kthread_create_on_node+0x170/0x170 Set the thin device's initial refcount and initialize the completion before adding it to the pool's active_thins list in thin_ctr(). Signed-off-by: Marc Dionne <marc.dionne@your-file-system.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-12-17dm thin: fix missing out-of-data-space to write mode transition if blocks ↵Joe Thornber
are released Discard bios and thin device deletion have the potential to release data blocks. If the thin-pool is in out-of-data-space mode, and blocks were released, transition the thin-pool back to full write mode. The correct time to do this is just after the thin-pool metadata commit. It cannot be done before the commit because the space maps will not allow immediate reuse of the data blocks in case there's a rollback following power failure. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org
2014-12-17dm thin: fix inability to discard blocks when in out-of-data-space modeJoe Thornber
When the pool was in PM_OUT_OF_SPACE mode its process_prepared_discard function pointer was incorrectly being set to process_prepared_discard_passdown rather than process_prepared_discard. This incorrect function pointer meant the discard was being passed down, but not effecting the mapping. As such any discard that was issued, in an attempt to reclaim blocks, would not successfully free data space. Reported-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org