path: root/drivers/md
Age | Commit message | Author
2019-04-25  dm mpath: fix missing call of path selector type->end_io  (Yufen Yu)
After commit 396eaf21ee17 ("blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback"), map_request() will requeue the tio when the issued clone request returns BLK_STS_RESOURCE or BLK_STS_DEV_RESOURCE. Thus, if the device driver stays in an error state, a tio may be requeued multiple times until the return value is no longer DM_MAPIO_REQUEUE. That means type->start_io may be called multiple times, while type->end_io is only called once, when the IO completes. In fact, even without commit 396eaf21ee17, a setup_clone() failure can also cause a tio requeue and an associated missed call to type->end_io. The service-time path selector selects a path based on in_flight_size, which is increased by st_start_io() and decreased by st_end_io(). Missed calls to st_end_io() lead to an in_flight_size counting error and cause the selector to make the wrong choice. The queue-length path selector is affected in the same way. To fix the problem, call type->end_io in ->release_clone_rq before the tio is requeued. map_info is passed to ->release_clone_rq() for the map_request() error paths that result in a requeue. Fixes: 396eaf21ee17 ("blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback") Cc: stable@vger.kernel.org Signed-off-by: Yufen Yu <yuyufen@huawei.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
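To illustrate why a missed end_io matters, here is a minimal, self-contained sketch (not the actual dm-mpath or service-time code; structure and names are illustrative only) of the in-flight accounting that such a selector relies on:

    #include <stdio.h>

    /* Illustrative per-path state, loosely modeled on the service-time selector. */
    struct path_info {
        const char *name;
        long in_flight_size;    /* grows in start_io(), shrinks in end_io() */
    };

    static void start_io(struct path_info *p, long bytes)
    {
        p->in_flight_size += bytes;
    }

    static void end_io(struct path_info *p, long bytes)
    {
        p->in_flight_size -= bytes;
    }

    int main(void)
    {
        struct path_info p = { "path0", 0 };

        /* A request that is requeued twice: start_io() runs three times ... */
        start_io(&p, 4096);
        start_io(&p, 4096);
        start_io(&p, 4096);
        /* ... but end_io() runs only once, when the IO finally completes. */
        end_io(&p, 4096);

        /* in_flight_size is now 8192 instead of 0, so this path looks
         * permanently busier than it really is and the selector avoids it. */
        printf("in_flight_size = %ld\n", p.in_flight_size);
        return 0;
    }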
2019-04-25  crypto: shash - remove shash_desc::flags  (Eric Biggers)
The flags field in 'struct shash_desc' never actually does anything. The only ostensibly supported flag is CRYPTO_TFM_REQ_MAY_SLEEP. However, no shash algorithm ever sleeps, making this flag a no-op. With this being the case, inevitably some users who can't sleep wrongly pass MAY_SLEEP. These would all need to be fixed if any shash algorithm actually started sleeping. For example, the shash_ahash_*() functions, which wrap a shash algorithm with the ahash API, pass through MAY_SLEEP from the ahash API to the shash API. However, the shash functions are called under kmap_atomic(), so actually they're assumed to never sleep. Even if it turns out that some users do need preemption points while hashing large buffers, we could easily provide a helper function crypto_shash_update_large() which divides the data into smaller chunks and calls crypto_shash_update() and cond_resched() for each chunk. It's not necessary to have a flag in 'struct shash_desc', nor is it necessary to make individual shash algorithms aware of this at all. Therefore, remove shash_desc::flags, and document that the crypto_shash_*() functions can be called from any context. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
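As a rough idea of the crypto_shash_update_large() helper suggested above, a sketch along these lines would add preemption points without any per-algorithm flag. The helper name and chunk size come from the message's suggestion and are not an existing API:

    #include <crypto/hash.h>
    #include <linux/kernel.h>
    #include <linux/sched.h>

    #define SHASH_LARGE_CHUNK   PAGE_SIZE   /* illustrative chunk size */

    /* Hypothetical helper, as proposed above: hash a large buffer in chunks
     * so the caller gets regular preemption points between chunks. */
    static int crypto_shash_update_large(struct shash_desc *desc,
                                         const u8 *data, unsigned int len)
    {
        while (len) {
            unsigned int n = min_t(unsigned int, len, SHASH_LARGE_CHUNK);
            int err = crypto_shash_update(desc, data, n);

            if (err)
                return err;
            data += n;
            len -= n;
            cond_resched();
        }
        return 0;
    }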
2019-04-24  bcache: avoid potential memleak of list of journal_replay(s) in the CACHE_SYNC branch of run_cache_set  (Shenghui Wang)
In the CACHE_SYNC branch of run_cache_set(), LIST_HEAD(journal) is used to collect journal_replay(s) and is filled by bch_journal_read(). If all goes well, bch_journal_replay() will release the list of journal_replay(s) at the end of the branch. If something goes wrong, code flow will jump to the label "err:" and leave the list unreleased. This patch releases the list of journal_replay(s) when an error is detected. v1 -> v2: * Move the release code to the location after label 'err:' to simplify the change. Signed-off-by: Shenghui Wang <shhuiw@foxmail.com> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
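The cleanup described above is essentially the standard pattern for draining a list on an error path; a hedged sketch (the entry type and list field follow the bcache code as described, but this is not the literal patch):

    /* After the "err:" label: free whatever bch_journal_read() collected,
     * whether or not replay got far enough to consume it. */
    struct journal_replay *i, *tmp;

    list_for_each_entry_safe(i, tmp, &journal, list) {
        list_del(&i->list);
        kfree(i);
    }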
2019-04-24  bcache: fix use-after-free on keylist in out_nocoalesce branch of btree_gc_coalesce  (Shenghui Wang)
Elements of the keylist should be accessed before the list is freed. Move the bch_keylist_free() call to after the while loop to avoid accessing freed contents. Signed-off-by: Shenghui Wang <shhuiw@foxmail.com> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-24  bcache: fix failure in journal replay  (Tang Junhui)
Journal replay failed with messages:
Sep 10 19:10:43 ceph kernel: bcache: error on bb379a64-e44e-4812-b91d-a5599871a3b1: bcache: journal entries 2057493-2057567 missing! (replaying 2057493-2076601), disabling caching
The reason is that in journal_reclaim(), when discard is enabled, we send the discard command and reclaim those journal buckets whose seq is older than last_seq_now. But if the machine is restarted before we write a journal with last_seq_now, the journal with last_seq_now is never written to a journal bucket, and the last_seq_wrote in the newest journal is older than the last_seq_now we expect, so when doing replay, journals from last_seq_wrote to last_seq_now appear to be missing. It is hard to write a journal immediately after journal_reclaim(), and it is harmless if those missing journals were caused by discarding, since those journal entries have already been written to btree nodes. So, if the missing seqs start from the beginning journal, we treat it as normal, and only print a message to show the missing journals and point out that they may be caused by discarding. Patch v2 adds a judgement condition to ignore the missing journals only when discard is enabled, as Coly suggested. (Coly Li: rebase the patch with other changes in bch_journal_replay()) Signed-off-by: Tang Junhui <tang.junhui.linux@gmail.com> Tested-by: Dennis Schridde <devurandom@gmx.net> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-24  bcache: improve bcache_reboot()  (Coly Li)
This patch tries to release the mutex bch_register_lock earlier, to give a chance to stop the cache set and bcache devices earlier. This patch also extends the timeout for stopping all bcache devices from 2 seconds to 10 seconds, because stopping the writeback rate update worker may be delayed by up to 5 seconds, so 2 seconds is not enough. After this patch is applied, problems with stopping bcache devices during system reboot or shutdown are very hard to observe any more. Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-24  bcache: add comments for closure_fn to be called in closure_queue()  (Coly Li)
Add code comments to explain which callback function might be called for closure_queue(). This is an effort to make the code more understandable for readers. Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-24  bcache: Add comments for blkdev_put() in registration code path  (Coly Li)
Add comments to explain why blkdev_put() won't be called in two locations in register_bcache(). Add comments to explain why blkdev_put() must be called in register_cache() when cache_alloc() fails. Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-24  bcache: add error check for calling register_bdev()  (Coly Li)
This patch adds a return value to register_bdev(), so that if a failure happens inside register_bdev(), its caller register_bcache() can detect and handle the failure properly. Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-24  bcache: return error immediately in bch_journal_replay()  (Coly Li)
When a failure happens inside bch_journal_replay(), calling cache_set_err_on() and handling the failure in an asynchronous way is not a good idea, because after bch_journal_replay() returns, the registering code continues to execute the following steps while the unregistering code triggered by cache_set_err_on() runs at the same time. First, it is unnecessary to handle the failure and unregister the cache set in an async way; second, there is a potential race condition between the register and unregister code running on the same cache set. So in this patch, if a failure happens in bch_journal_replay(), we don't call cache_set_err_on(); we just print the same error message to the kernel message buffer and return -EIO to the caller immediately. The caller can then detect such a failure and handle it in a synchronized way. Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-24  bcache: add comments for kobj release callback routine  (Coly Li)
Bcache has several routines that release resources in an implicit way; they are called when the associated kobj is released. This patch adds code comments to note when and which release callback will be called:
- When dc->disk.kobj is released: void bch_cached_dev_release(struct kobject *kobj)
- When d->kobj is released: void bch_flash_dev_release(struct kobject *kobj)
- When c->kobj is released: void bch_cache_set_release(struct kobject *kobj)
- When ca->kobj is released: void bch_cache_release(struct kobject *kobj)
Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-24  bcache: add failure check to run_cache_set() for journal replay  (Coly Li)
Currently run_cache_set() has no return value; if there is a failure in bch_journal_replay(), the caller of run_cache_set() has no idea about such a failure and just continues to execute the following code after run_cache_set(). The internal failure is triggered inside bch_journal_replay() and handled in an async way. This behavior is inefficient: while the failure is being handled inside bch_journal_replay(), the cache register code is still running to start the cache set. Registering and unregistering code running at the same time may introduce some rare race condition and makes the code harder to understand. This patch adds a return value to run_cache_set(), which returns -EIO if bch_journal_replay() fails. The caller of run_cache_set() can then detect such a failure and stop the registering code flow immediately inside register_cache_set(). This makes the failure handling for bch_journal_replay() synchronous, easier to understand and debug, and avoids the potential race condition of register and unregister running at the same time. Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-24  bcache: never set KEY_PTRS of journal key to 0 in journal_reclaim()  (Coly Li)
In journal_reclaim(), ja->cur_idx of each cache will be updated to reclaim available journal buckets. The variable 'int n' is used to count how many caches are successfully reclaimed, and n is then set as the KEY_PTRS value of c->journal.key by SET_KEY_PTRS(). Later in journal_write_unlocked(), a for_each_cache() loop will write the jset data onto each cache. The problem is, if all journal buckets on each cache are full, the following code in journal_reclaim():

529 for_each_cache(ca, c, iter) {
530         struct journal_device *ja = &ca->journal;
531         unsigned int next = (ja->cur_idx + 1) % ca->sb.njournal_buckets;
532
533         /* No space available on this device */
534         if (next == ja->discard_idx)
535                 continue;
536
537         ja->cur_idx = next;
538         k->ptr[n++] = MAKE_PTR(0,
539                 bucket_to_sector(c, ca->sb.d[ja->cur_idx]),
540                 ca->sb.nr_this_dev);
541 }
542
543 bkey_init(k);
544 SET_KEY_PTRS(k, n);

If there is no available bucket to reclaim, the if() condition at line 534 is always true and n remains 0. Then at line 544, SET_KEY_PTRS() sets the KEY_PTRS field of c->journal.key to 0. Setting the KEY_PTRS field of c->journal.key to 0 is wrong, because in journal_write_unlocked() the journal data is written in the following loop:

649 for (i = 0; i < KEY_PTRS(k); i++) {
650-671         submit journal data to cache device
672 }

If the KEY_PTRS field is set to 0 in journal_reclaim(), the journal data won't be written to the cache device here. If the system crashed or rebooted before the bkeys of the lost journal entries were written into btree nodes, data corruption will be reported during bcache reload after rebooting the system. In practice there is only one cache in a cache set, so there is no need to set the KEY_PTRS field in journal_reclaim() at all. But in order to keep the for_each_cache() logic consistent for now, this patch fixes the above problem by not setting KEY_PTRS of the journal key to 0 if there is no bucket available to reclaim. Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
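One plausible shape of the fix described above is to only touch the journal key when at least one bucket was actually reclaimed; a hedged sketch, not necessarily the exact patch:

    /* Only reinitialize the journal key and record the new pointers if at
     * least one journal bucket was reclaimed; otherwise leave the previous
     * key (and its KEY_PTRS) untouched instead of writing a 0-pointer key. */
    if (n) {
        bkey_init(k);
        SET_KEY_PTRS(k, n);
    }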
2019-04-24  bcache: move definition of 'int ret' out of macro read_bucket()  (Coly Li)
'int ret' is defined as a local variable inside the macro read_bucket(). Since this macro is called multiple times, and the following patches will use an 'int ret' variable in bch_journal_read(), this patch moves the definition of 'int ret' from the macro read_bucket() to the scope of the function bch_journal_read(). Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-24  bcache: fix a race between cache register and cacheset unregister  (Liang Chen)
There is a race between cache device register and cache set unregister. For an already registered cache device, register_bcache will call bch_is_open to iterate through all cachesets and check every cache there. The race occurs if cache_set_free executes at the same time and clears the caches right before ca is dereferenced in bch_is_open_cache. To close the race, let's make sure the clean up work is protected by the bch_register_lock as well. This issue can be reproduced as follows, while true; do echo /dev/XXX> /sys/fs/bcache/register ; done& while true; do echo 1> /sys/block/XXX/bcache/set/unregister ; done & and results in the following oops, [ +0.000053] BUG: unable to handle kernel NULL pointer dereference at 0000000000000998 [ +0.000457] #PF error: [normal kernel read fault] [ +0.000464] PGD 800000003ca9d067 P4D 800000003ca9d067 PUD 3ca9c067 PMD 0 [ +0.000388] Oops: 0000 [#1] SMP PTI [ +0.000269] CPU: 1 PID: 3266 Comm: bash Not tainted 5.0.0+ #6 [ +0.000346] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.fc28 04/01/2014 [ +0.000472] RIP: 0010:register_bcache+0x1829/0x1990 [bcache] [ +0.000344] Code: b0 48 83 e8 50 48 81 fa e0 e1 10 c0 0f 84 a9 00 00 00 48 89 c6 48 89 ca 0f b7 ba 54 04 00 00 4c 8b 82 60 0c 00 00 85 ff 74 2f <49> 3b a8 98 09 00 00 74 4e 44 8d 47 ff 31 ff 49 c1 e0 03 eb 0d [ +0.000839] RSP: 0018:ffff92ee804cbd88 EFLAGS: 00010202 [ +0.000328] RAX: ffffffffc010e190 RBX: ffff918b5c6b5000 RCX: ffff918b7d8e0000 [ +0.000399] RDX: ffff918b7d8e0000 RSI: ffffffffc010e190 RDI: 0000000000000001 [ +0.000398] RBP: ffff918b7d318340 R08: 0000000000000000 R09: ffffffffb9bd2d7a [ +0.000385] R10: ffff918b7eb253c0 R11: ffffb95980f51200 R12: ffffffffc010e1a0 [ +0.000411] R13: fffffffffffffff2 R14: 000000000000000b R15: ffff918b7e232620 [ +0.000384] FS: 00007f955bec2740(0000) GS:ffff918b7eb00000(0000) knlGS:0000000000000000 [ +0.000420] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ +0.000801] CR2: 0000000000000998 CR3: 000000003cad6000 CR4: 00000000001406e0 [ +0.000837] Call Trace: [ +0.000682] ? _cond_resched+0x10/0x20 [ +0.000691] ? __kmalloc+0x131/0x1b0 [ +0.000710] kernfs_fop_write+0xfa/0x170 [ +0.000733] __vfs_write+0x2e/0x190 [ +0.000688] ? inode_security+0x10/0x30 [ +0.000698] ? selinux_file_permission+0xd2/0x120 [ +0.000752] ? security_file_permission+0x2b/0x100 [ +0.000753] vfs_write+0xa8/0x1a0 [ +0.000676] ksys_write+0x4d/0xb0 [ +0.000699] do_syscall_64+0x3a/0xf0 [ +0.000692] entry_SYSCALL_64_after_hwframe+0x44/0xa9 Signed-off-by: Liang Chen <liangchen.linux@gmail.com> Cc: stable@vger.kernel.org Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-24  bcache: Clean up bch_get_congested()  (George Spelvin)
There are a few nits in this function. They could in theory all be separate patches, but that's probably taking small commits too far. 1) I added a brief comment saying what it does. 2) I like to declare pointer parameters "const" where possible for documentation reasons. 3) It uses bitmap_weight(&rand, BITS_PER_LONG) to compute the Hamming weight of a 32-bit random number (giving a random integer with mean 16 and variance 8). Passing by reference in a 64-bit variable is silly; just use hweight32(). 4) Its helper function fract_exp_two is unnecessarily tangled. Gcc can optimize the multiply by (1 << x) to a shift, but it can be written in a much more straightforward way at the cost of one more bit of internal precision. Some analysis reveals that this bit is always available. This shrinks the object code for fract_exp_two(x, 6) from 23 bytes: 0000000000000000 <foo1>: 0: 89 f9 mov %edi,%ecx 2: c1 e9 06 shr $0x6,%ecx 5: b8 01 00 00 00 mov $0x1,%eax a: d3 e0 shl %cl,%eax c: 83 e7 3f and $0x3f,%edi f: d3 e7 shl %cl,%edi 11: c1 ef 06 shr $0x6,%edi 14: 01 f8 add %edi,%eax 16: c3 retq To 19: 0000000000000017 <foo2>: 17: 89 f8 mov %edi,%eax 19: 83 e0 3f and $0x3f,%eax 1c: 83 c0 40 add $0x40,%eax 1f: 89 f9 mov %edi,%ecx 21: c1 e9 06 shr $0x6,%ecx 24: d3 e0 shl %cl,%eax 26: c1 e8 06 shr $0x6,%eax 29: c3 retq (Verified with 0 <= frac_bits <= 8, 0 <= x < 16<<frac_bits; both versions produce the same output.) 5) And finally, the call to bch_get_congested() in check_should_bypass() is separated from the use of the value by multiple tests which could moot the need to compute it. Move the computation down to where it's needed. This also saves a local register to hold the computed value. Signed-off-by: George Spelvin <lkml@sdf.org> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
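For reference, the "much more straightforward way" described in point 4 can be written roughly as follows. This is a standalone sketch; the original bcache helper differs in naming and context, and the check mirrors the range the commit message says was verified:

    #include <assert.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Original shape: compute 2^(x >> fb) plus a linear interpolation on the
     * fractional bits, shifting each term separately. */
    static uint32_t fract_exp_two_old(unsigned int x, unsigned int fb)
    {
        unsigned int s = x >> fb;

        return (1U << s) + (((x & ((1U << fb) - 1)) << s) >> fb);
    }

    /* Straightforward rewrite: add the implicit leading 1 to the fractional
     * part first, then shift once; uses one extra bit of internal precision. */
    static uint32_t fract_exp_two_new(unsigned int x, unsigned int fb)
    {
        return (((x & ((1U << fb) - 1)) + (1U << fb)) << (x >> fb)) >> fb;
    }

    int main(void)
    {
        /* Same check as the commit message: 0 <= fb <= 8, 0 <= x < 16 << fb. */
        for (unsigned int fb = 0; fb <= 8; fb++)
            for (unsigned int x = 0; x < (16U << fb); x++)
                assert(fract_exp_two_old(x, fb) == fract_exp_two_new(x, fb));
        printf("both versions agree\n");
        return 0;
    }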
2019-04-24  bcache: use kmemdup_nul for CACHED_LABEL buffer  (Geliang Tang)
This patch uses kmemdup_nul to create a NUL-terminated string from dc->sb.label. This is better than open coding it. With this, we can move the env[2] initialization into the env[] array to make the code more elegant. Signed-off-by: Geliang Tang <geliangtang@gmail.com> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-24  bcache: avoid clang -Wuninitialized warning  (Arnd Bergmann)
clang has identified a code path in which it thinks a variable may be unused: drivers/md/bcache/alloc.c:333:4: error: variable 'bucket' is used uninitialized whenever 'if' condition is false [-Werror,-Wsometimes-uninitialized] fifo_pop(&ca->free_inc, bucket); ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ drivers/md/bcache/util.h:219:27: note: expanded from macro 'fifo_pop' #define fifo_pop(fifo, i) fifo_pop_front(fifo, (i)) ^~~~~~~~~~~~~~~~~~~~~~~~~ drivers/md/bcache/util.h:189:6: note: expanded from macro 'fifo_pop_front' if (_r) { \ ^~ drivers/md/bcache/alloc.c:343:46: note: uninitialized use occurs here allocator_wait(ca, bch_allocator_push(ca, bucket)); ^~~~~~ drivers/md/bcache/alloc.c:287:7: note: expanded from macro 'allocator_wait' if (cond) \ ^~~~ drivers/md/bcache/alloc.c:333:4: note: remove the 'if' if its condition is always true fifo_pop(&ca->free_inc, bucket); ^ drivers/md/bcache/util.h:219:27: note: expanded from macro 'fifo_pop' #define fifo_pop(fifo, i) fifo_pop_front(fifo, (i)) ^ drivers/md/bcache/util.h:189:2: note: expanded from macro 'fifo_pop_front' if (_r) { \ ^ drivers/md/bcache/alloc.c:331:15: note: initialize the variable 'bucket' to silence this warning long bucket; ^ This cannot happen in practice because we only enter the loop if there is at least one element in the list. Slightly rearranging the code makes this clearer to both the reader and the compiler, which avoids the warning. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Nathan Chancellor <natechancellor@gmail.com> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
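The rearrangement mentioned above is of this general shape: make the pop result drive the loop so the compiler can see that 'bucket' is always assigned before use. This is an illustrative sketch based on the code quoted in the warning, not necessarily the literal patch:

    /* Before: emptiness is checked separately from the pop, so clang cannot
     * prove that fifo_pop() assigned 'bucket' on every path. */
    while (!fifo_empty(&ca->free_inc)) {
        long bucket;

        fifo_pop(&ca->free_inc, bucket);
        allocator_wait(ca, bch_allocator_push(ca, bucket));
    }

    /* After (one possible rearrangement): let the pop itself terminate the
     * loop, so 'bucket' is visibly initialized before it is used. */
    while (1) {
        long bucket;

        if (!fifo_pop(&ca->free_inc, bucket))
            break;
        allocator_wait(ca, bch_allocator_push(ca, bucket));
    }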
2019-04-24  bcache: fix inaccurate result of unused buckets  (Guoju Fang)
To get the number of unused buckets in sysfs_priority_stats, the code counts the buckets whose GC_SECTORS_USED is zero. That count is correct and should not be overwritten by the count of buckets whose prio is zero. Signed-off-by: Guoju Fang <fangguoju@gmail.com> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-24  bcache: fix crashes stopping bcache device before read miss done  (Guoju Fang)
The bio from the upper layer is considered completed when bio_complete() returns. In most scenarios bio_complete() is called in search_free(), but when a read miss happens, bio_complete() is called when reading from the backing device completes, while the struct search is still in use until cache insertion finishes. If someone stops the bcache device just then, the device may be closed and released, but after cache insertion finishes the struct search will access a freed struct cached_dev. This patch takes a reference on the bcache device before bio_complete() when a read miss happens, and puts it once the search is no longer used. Signed-off-by: Guoju Fang <fangguoju@gmail.com> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-18  dm thin metadata: do not write metadata if no changes occurred  (Mike Snitzer)
Otherwise, just activating a thin-pool and thin device and then deactivating them will cause the thin-pool metadata to be changed (e.g. superblock written) -- even though no metadata was actually changed. Add an 'in_service' flag to struct dm_pool_metadata and set it in pmd_write_lock(), because all on-disk metadata changes must take a write lock of pmd->root_lock. Once 'in_service' is set it is never cleared. __commit_transaction() will return 0 if 'in_service' is not set. dm_pool_commit_metadata() is updated to use __pmd_write_lock() so that it isn't the sole reason for putting a thin-pool in service. Also fix dm_pool_commit_metadata() to open the next transaction if the return from __commit_transaction() is 0. Not seeing why the early return ever made sense for a return of 0, given that dm-io's async_io(), as used by bufio, always returns 0. Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-04-18  dm thin metadata: add wrappers for managing write locking of metadata  (Mike Snitzer)
No functional change, but this prepares for hooking additional functionality off of pmd_write_lock() (as provided in the next commit). Suggested-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-04-18  dm thin metadata: check __commit_transaction()'s return  (Mike Snitzer)
Fix __reserve_metadata_snap() to return early if __commit_transaction() fails. Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-04-18  dm space map common: zero entire ll_disk  (Mike Snitzer)
Otherwise, memory that is allocated (and potentially not previously zeroed) will get written to disk as part of the space maps. Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-04-18  dm writecache: add unlikely for returned value of rb_next/prev  (Huaisheng Ye)
In the functions writecache_discard() and writecache_find_entry() there is a high probability that the rb_node pointer won't be NULL. Add unlikely() to the NULL checks of the node pointer. Signed-off-by: Huaisheng Ye <yehs1@lenovo.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-04-18  dm writecache: remove needless dereferences in __writecache_writeback_pmem()  (Huaisheng Ye)
The bio is already available, so there is no need to access it through the wb pointer. Signed-off-by: Huaisheng Ye <yehs1@lenovo.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-04-18  dm snapshot: Use fine-grained locking scheme  (Nikos Tsironis)
Substitute the global locking scheme with a fine grained one, employing the read-write semaphore and the scalable exception tables with per-bucket locks introduced by the previous two commits. Summarizing, we now use a read-write semaphore to protect the mostly read fields of the snapshot structure, e.g., valid, active, etc., and per-bucket bit spinlocks to protect accesses to the complete and pending exception tables. Finally, we use an extra spinlock (pe_allocation_lock) to serialize the allocation of new exceptions by the exception store. This allocation is really fast, so the extra spinlock doesn't hurt the performance. This scheme allows dm-snapshot to scale better, resulting in increased IOPS and reduced latency. Following are some benchmark results using the null_blk device:

modprobe null_blk gb=1024 bs=512 submit_queues=8 hw_queue_depth=4096 \
    queue_mode=2 irqmode=1 completion_nsec=1 nr_devices=1

* Benchmark fio_origin_randwrite_throughput_N, from the device mapper test suite [1] (direct IO, random 4K writes to origin device, IO engine libaio):

+--------------+-------------+------------+
| # of workers | IOPS Before | IOPS After |
+--------------+-------------+------------+
|      1       |    57708    |    66421   |
|      2       |    63415    |    77589   |
|      4       |    67276    |    98839   |
|      8       |    60564    |   109258   |
+--------------+-------------+------------+

* Benchmark fio_origin_randwrite_latency_N, from the device mapper test suite [1] (direct IO, random 4K writes to origin device, IO engine psync):

+--------------+-----------------------+----------------------+
| # of workers | Latency (usec) Before | Latency (usec) After |
+--------------+-----------------------+----------------------+
|      1       |         16.25         |         13.27        |
|      2       |         31.65         |         25.08        |
|      4       |         55.28         |         41.08        |
|      8       |        121.47         |         74.44        |
+--------------+-----------------------+----------------------+

* Benchmark fio_snapshot_randwrite_throughput_N, from the device mapper test suite [1] (direct IO, random 4K writes to snapshot device, IO engine libaio):

+--------------+-------------+------------+
| # of workers | IOPS Before | IOPS After |
+--------------+-------------+------------+
|      1       |    72593    |    84938   |
|      2       |    97379    |   134973   |
|      4       |    90610    |   143077   |
|      8       |    90537    |   180085   |
+--------------+-------------+------------+

* Benchmark fio_snapshot_randwrite_latency_N, from the device mapper test suite [1] (direct IO, random 4K writes to snapshot device, IO engine psync):

+--------------+-----------------------+----------------------+
| # of workers | Latency (usec) Before | Latency (usec) After |
+--------------+-----------------------+----------------------+
|      1       |         12.53         |         10.6         |
|      2       |         19.78         |         14.89        |
|      4       |         40.37         |         23.47        |
|      8       |         89.32         |         48.48        |
+--------------+-----------------------+----------------------+

[1] https://github.com/jthornber/device-mapper-test-suite

Co-developed-by: Ilias Tsitsimpis <iliastsi@arrikto.com> Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com> Acked-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-04-18  dm snapshot: Make exception tables scalable  (Nikos Tsironis)
Use list_bl to implement the exception hash tables' buckets. This change permits concurrent access, to distinct buckets, by multiple threads. Also, implement helper functions to lock and unlock the exception tables based on the chunk number of the exception at hand. We retain the global locking, by means of down_write(), which is replaced by the next commit. Still, we must acquire the per-bucket spinlocks when accessing the hash tables, since list_bl does not allow modification on unlocked lists. Co-developed-by: Ilias Tsitsimpis <iliastsi@arrikto.com> Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com> Acked-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
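The per-bucket locking idea follows the standard list_bl pattern, where bit 0 of each bucket head doubles as a bit spinlock. A hedged sketch of what such lock/unlock helpers might look like; the structure, helper names, and hash function here are illustrative, not the actual dm-snapshot code:

    #include <linux/list_bl.h>

    /* Illustrative exception table: an array of lockable bucket heads. */
    struct example_exception_table {
        struct hlist_bl_head *buckets;
        unsigned int hash_shift;
        unsigned int hash_mask;
    };

    static struct hlist_bl_head *
    example_exception_bucket(struct example_exception_table *t, unsigned long chunk)
    {
        return &t->buckets[(chunk >> t->hash_shift) & t->hash_mask];
    }

    /* Bit spinlock on bit 0 of the bucket head pointer. */
    static void example_bucket_lock(struct example_exception_table *t, unsigned long chunk)
    {
        hlist_bl_lock(example_exception_bucket(t, chunk));
    }

    static void example_bucket_unlock(struct example_exception_table *t, unsigned long chunk)
    {
        hlist_bl_unlock(example_exception_bucket(t, chunk));
    }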
2019-04-18  dm snapshot: Replace mutex with rw semaphore  (Nikos Tsironis)
dm-snapshot uses a single mutex to serialize every access to the snapshot state. This includes all accesses to the complete and pending exception tables, which occur at every origin write, every snapshot read/write and every exception completion. The lock statistics indicate that this mutex is a bottleneck (average wait time ~480 usecs for 8 processes doing random 4K writes to the origin device), preventing dm-snapshot from scaling as the number of threads doing IO increases. The major contention points are __origin_write()/snapshot_map() and pending_complete(), i.e., the submission and completion of pending exceptions. Replace this mutex with a rw semaphore. We essentially revert commit ae1093be5a0ef9 ("dm snapshot: use mutex instead of rw_semaphore") and together with the next two patches we substitute the single mutex with a fine-grained locking scheme, where we use a read-write semaphore to protect the mostly read fields of the snapshot structure, e.g., valid, active, etc., and per-bucket bit spinlocks to protect accesses to the complete and pending exception tables. Co-developed-by: Ilias Tsitsimpis <iliastsi@arrikto.com> Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com> Acked-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-04-18  dm snapshot: Don't sleep holding the snapshot lock  (Nikos Tsironis)
When completing a pending exception, pending_complete() waits for all conflicting reads to drain, before inserting the final, completed exception. Conflicting reads are snapshot reads redirected to the origin, because the relevant chunk is not remapped to the COW device the moment we receive the read. The completed exception must be inserted into the exception table after all conflicting reads drain to ensure snapshot reads don't return corrupted data. This is required because inserting the completed exception into the exception table signals that the relevant chunk is remapped and both origin writes and snapshot merging will now overwrite the chunk in origin. This wait is done holding the snapshot lock to ensure that pending_complete() doesn't starve if new snapshot reads keep coming for this chunk. In preparation for the next commit, where we use a spinlock instead of a mutex to protect the exception tables, we remove the need for holding the lock while waiting for conflicting reads to drain. We achieve this in two steps: 1. pending_complete() inserts the completed exception before waiting for conflicting reads to drain and removes the pending exception after all conflicting reads drain. This ensures that new snapshot reads will be redirected to the COW device, instead of the origin, and thus pending_complete() will not starve. Moreover, we use the existence of both a completed and a pending exception to signify that the COW is done but there are conflicting reads in flight. 2. In __origin_write() we check first if there is a pending exception and then if there is a completed exception. If there is a pending exception any submitted BIO is delayed on the pe->origin_bios list and DM_MAPIO_SUBMITTED is returned. This ensures that neither writes to the origin nor snapshot merging can overwrite the origin chunk, until all conflicting reads drain, and thus snapshot reads will not return corrupted data. Summarizing, we now have the following possible combinations of pending and completed exceptions for a chunk, along with their meaning: A. No exceptions exist: The chunk has not been remapped yet. B. Only a pending exception exists: The chunk is currently being copied to the COW device. C. Both a pending and a completed exception exist: COW for this chunk has completed but there are snapshot reads in flight which had been redirected to the origin before the chunk was remapped. D. Only the completed exception exists: COW has been completed and there are no conflicting reads in flight. Co-developed-by: Ilias Tsitsimpis <iliastsi@arrikto.com> Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com> Acked-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-04-18  dm cache metadata: Fix loading discard bitset  (Nikos Tsironis)
Add missing dm_bitset_cursor_next() to properly advance the bitset cursor. Otherwise, the discarded state of all blocks is set according to the discarded state of the first block. Fixes: ae4a46a1f6 ("dm cache metadata: use bitset cursor api to load discard bitset") Cc: stable@vger.kernel.org Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-04-18  dm zoned: Fix zone report handling  (Damien Le Moal)
The function blkdev_report_zones() returns success even if no zone information is reported (empty report). Empty zone reports can only happen if the report start sector passed exceeds the device capacity. The conditions for this to happen are either a bug in the caller code, or, a change in the device that forced the low level driver to change the device capacity to a value that is lower than the report start sector. This situation includes a failed disk revalidation resulting in the disk capacity being changed to 0. If this change happens while dm-zoned is in its initialization phase executing dmz_init_zones(), this function may enter an infinite loop and hang the system. To avoid this, add a check to disallow empty zone reports and bail out early. Also fix the function dmz_update_zone() to make sure that the report for the requested zone was correctly obtained. Fixes: 3b1a94c88b79 ("dm zoned: drive-managed zoned block device target") Cc: stable@vger.kernel.org Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Shaun Tancheff <shaun@tancheff.com> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-04-18  dm zoned: Silence a static checker warning  (Dan Carpenter)
My static checker complains about this line from dmz_get_zoned_device() aligned_capacity = dev->capacity & ~(blk_queue_zone_sectors(q) - 1); The problem is that "aligned_capacity" and "dev->capacity" are sector_t type (which is a u64 under most configs) but blk_queue_zone_sectors(q) returns a u32 so the higher 32 bits in aligned_capacity are cleared to zero. This patch adds a cast to address the issue. Fixes: 114e025968b5 ("dm zoned: ignore last smaller runt zone") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
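The fix this describes is essentially a one-line cast so the mask is computed in 64 bits before being applied to the sector_t value; roughly:

    /* blk_queue_zone_sectors() returns a u32; widen it to sector_t before
     * negating, so the upper 32 bits of the mask are not cleared. */
    aligned_capacity = dev->capacity &
                       ~((sector_t)blk_queue_zone_sectors(q) - 1);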
2019-04-18  dm crypt: fix endianness annotations around org_sector_of_dmreq  (Christoph Hellwig)
The sector used here is a little endian value, so use the right type for it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-04-16  md/raid: raid5 preserve the writeback action after the parity check  (Nigel Croxon)
The problem is that any 'uptodate' vs 'disks' check is not precise in this path. Put a WARN_ON(!test_bit(R5_UPTODATE, &dev->flags)) on the device that might try to kick off writes and then skip the action. Better to prevent the raid driver from taking unexpected action *and* keep the system alive than to kill the machine with BUG_ON. Note: fixed warning reported by kbuild test robot <lkp@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Nigel Croxon <ncroxon@redhat.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2019-04-16Revert "Don't jump to compute_result state from check_result state"Song Liu
This reverts commit 4f4fd7c5798bbdd5a03a60f6269cf1177fbd11ef. Cc: Dan Williams <dan.j.williams@intel.com> Cc: Nigel Croxon <ncroxon@redhat.com> Cc: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2019-04-16  md: return -ENODEV if rdev has no mddev assigned  (Pawel Baldysiak)
Mdadm expects that setting drive as faulty will fail with -EBUSY only if this operation will cause RAID to be failed. If this happens, it will try to stop the array. Currently -EBUSY might also be returned if rdev is in the middle of the removal process - for example there is a race with mdmon that already requested the drive to be failed/removed. If rdev does not contain mddev, return -ENODEV instead, so the caller can distinguish between those two cases and behave accordingly. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Pawel Baldysiak <pawel.baldysiak@intel.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2019-04-15  Merge tag 'v5.1-rc5' into for-5.2/block  (Jens Axboe)
Pull in v5.1-rc5 to resolve two conflicts. One is in BFQ, in just a comment, and is trivial. The other one is a conflict due to a later fix in the bio multi-page work, and needs a bit more care. * tag 'v5.1-rc5': (476 commits) Linux 5.1-rc5 fs: prevent page refcount overflow in pipe_buf_get mm: prevent get_user_pages() from overflowing page refcount mm: add 'try_get_page()' helper function mm: make page ref count overflow check tighter and more explicit clk: imx: Fix PLL_1416X not rounding rates clk: mediatek: fix clk-gate flag setting arm64: futex: Fix FUTEX_WAKE_OP atomic ops with non-zero result value iommu/amd: Set exclusion range correctly clang-format: Update with the latest for_each macro list perf/core: Fix perf_event_disable_inatomic() race block: fix the return errno for direct IO Revert "SUNRPC: Micro-optimise when the task is known not to be sleeping" NFSv4.1 fix incorrect return value in copy_file_range xprtrdma: Fix helper that drains the transport NFS: Fix handling of reply page vector NFS: Forbid setting AF_INET6 to "struct sockaddr_in"->sin_family. dma-debug: only skip one stackframe entry platform/x86: pmc_atom: Drop __initconst on dmi table nvmet: fix discover log page when offsets are used ... Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-10  md: add __acquires/__releases annotations to handle_active_stripes  (Christoph Hellwig)
This tells sparse that we release and reacquire the device_lock and avoids a warning. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <songliubraving@fb.com>
2019-04-10  md: add __acquires/__releases annotations to (un)lock_two_stripes  (Christoph Hellwig)
This tells sparse that we acquire/release the two stripe locks and avoids a warning. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <songliubraving@fb.com>
2019-04-10  md: mark md_cluster_mod static  (Christoph Hellwig)
Sparse complains that it has no external declaration, and it turns out that it is never even used outside of md.c. So just mark it static and drop the export. Acked-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <songliubraving@fb.com>
2019-04-10  md: use correct type in super_1_sync  (Christoph Hellwig)
If we want to convert from a little endian format we need to cast to a little endian type, otherwise sparse will be unhappy. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <songliubraving@fb.com>
2019-04-10  md: use correct type in super_1_load  (Christoph Hellwig)
If we want to convert from a little endian format we need to cast to a little endian type, otherwise sparse will be unhappy. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <songliubraving@fb.com>
2019-04-10  md: use correct types in md_bitmap_print_sb  (Christoph Hellwig)
If we want to convert from a little endian format we need to cast to a little endian type, otherwise sparse will be unhappy. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <songliubraving@fb.com>
2019-04-10  md: add a missing endianness conversion in check_sb_changes  (Christoph Hellwig)
The on-disk value is little endian and we need to convert it to native endian before storing the value in the in-core structure. Fixes: 7564beda19b36 ("md-cluster/raid10: support add disk under grow mode") Cc: <stable@vger.kernel.org> # 4.20+ Acked-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <songliubraving@fb.com>
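The general pattern behind this fix (and the super_1_* / md_bitmap_print_sb changes above) is that every on-disk __le field must go through the matching le*_to_cpu()/cpu_to_le*() helper. A generic illustration with made-up structure names, not the literal patch:

    #include <linux/types.h>
    #include <asm/byteorder.h>

    /* Illustrative on-disk layout: fields are fixed little endian. */
    struct example_ondisk_sb {
        __le32 raid_disks;
        __le64 size;
    };

    /* Illustrative in-core mirror: fields are native endian. */
    struct example_incore_sb {
        u32 raid_disks;
        u64 size;
    };

    static void example_load(struct example_incore_sb *in,
                             const struct example_ondisk_sb *disk)
    {
        /* Every on-disk field goes through the matching helper ... */
        in->raid_disks = le32_to_cpu(disk->raid_disks);
        in->size = le64_to_cpu(disk->size);
        /* ... a plain assignment would be wrong on big-endian hosts and is
         * exactly the kind of thing sparse warns about. */
    }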
2019-04-10  md: add mddev->pers to avoid potential NULL pointer dereference  (Yufen Yu)
When doing re-add, we need to ensure rdev->mddev->pers is not NULL, which can avoid a potential NULL pointer dereference in the following add_bound_rdev(). Fixes: a6da4ef85cef ("md: re-add a failed disk") Cc: Xiao Ni <xni@redhat.com> Cc: NeilBrown <neilb@suse.com> Cc: <stable@vger.kernel.org> # 4.4+ Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Yufen Yu <yuyufen@huawei.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2019-04-06  block: remove CONFIG_LBDAF  (Christoph Hellwig)
Currently support for 64-bit sector_t and blkcnt_t is optional on 32-bit architectures. These types are required to support block device and/or file sizes larger than 2 TiB, and have generally defaulted to on for a long time. Enabling the option only increases the i386 tinyconfig size by 145 bytes, and many data structures already always use 64-bit values for their in-core and on-disk data structures anyway, so there should not be a large change in dynamic memory usage either. Dropping this option removes a somewhat weird non-default config that has caused various bugs or compiler warnings when actually used. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-05  dm integrity: fix deadlock with overlapping I/O  (Mikulas Patocka)
dm-integrity will deadlock if overlapping I/O is issued to it; the bug was introduced by commit 724376a04d1a ("dm integrity: implement fair range locks"). Users rarely use overlapping I/O, so this bug went undetected until now. Fix this bug by correcting the (likely cut-n-paste) typos in ranges_overlap() and also removing a flawed ranges_overlap() check in remove_range_unlocked(). That condition could leave unprocessed bios hanging on wait_list forever. Cc: stable@vger.kernel.org # v4.19+ Fixes: 724376a04d1a ("dm integrity: implement fair range locks") Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
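For reference, a correct overlap test for half-open sector ranges (the kind of check the message says had cut-n-paste typos) looks like this; a standalone sketch with illustrative names, not the actual dm-integrity code:

    #include <stdbool.h>
    #include <stdint.h>

    struct example_range {
        uint64_t logical_sector;    /* start of the range */
        uint64_t n_sectors;         /* length of the range */
    };

    /* Two half-open ranges [start, start + len) overlap iff each one starts
     * before the other one ends. */
    static bool example_ranges_overlap(const struct example_range *a,
                                       const struct example_range *b)
    {
        return a->logical_sector < b->logical_sector + b->n_sectors &&
               b->logical_sector < a->logical_sector + a->n_sectors;
    }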
2019-04-04  dm: disable DISCARD if the underlying storage no longer supports it  (Mike Snitzer)
Some storage devices report support for discard commands like WRITE_SAME_16 with unmap, but reject discard commands actually sent to the device. This is a clear storage firmware bug, but it doesn't change the fact that should a program cause discards to be sent to a multipath device layered on this buggy storage, all paths can end up failed at the same time from the discards, causing possible I/O loss. The first discard to a path will fail with Illegal Request, Invalid field in cdb, e.g.:

kernel: sd 8:0:8:19: [sdfn] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
kernel: sd 8:0:8:19: [sdfn] tag#0 Sense Key : Illegal Request [current]
kernel: sd 8:0:8:19: [sdfn] tag#0 Add. Sense: Invalid field in cdb
kernel: sd 8:0:8:19: [sdfn] tag#0 CDB: Write same(16) 93 08 00 00 00 00 00 a0 08 00 00 00 80 00 00 00
kernel: blk_update_request: critical target error, dev sdfn, sector 10487808

The SCSI layer converts this to the BLK_STS_TARGET error number, the sd device disables its support for discard on this path, and because of the BLK_STS_TARGET error multipath fails the discard without failing any path or retrying down a different path. But subsequent discards can cause path failures. Any discards sent to the path which already failed a discard end up failing with EIO from blk_cloned_rq_check_limits with an "over max size limit" error, since the discard limit was set to 0 by the sd driver for the path. As the error is EIO, this now fails the path and multipath tries to send the discard down the next path. This cycle continues as discards are sent until all paths fail. Fix this by training DM core to disable DISCARD if the underlying storage already did so. Also, fix branching in dm_done() and clone_endio() to reflect the mutually exclusive nature of the IO operations in question. Cc: stable@vger.kernel.org Reported-by: David Jeffery <djeffery@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
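A hedged sketch of the "train DM core to disable DISCARD" idea: once a discard clone comes back with BLK_STS_TARGET, stop advertising discard on the DM device's queue so later discards are never issued at all. The helper name is illustrative and the details may differ from the actual change:

    /* Illustrative helper (dm.c context): stop advertising discard on the
     * mapped device once the underlying storage has rejected it. */
    static void example_disable_discard(struct mapped_device *md)
    {
        struct queue_limits *limits = dm_get_queue_limits(md);

        limits->max_discard_sectors = 0;
        blk_queue_flag_clear(QUEUE_FLAG_DISCARD, md->queue);
    }

The completion paths (dm_done() / clone_endio()) would then call such a helper when a REQ_OP_DISCARD clone completes with BLK_STS_TARGET.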
2019-04-01  dm table: propagate BDI_CAP_STABLE_WRITES to fix sporadic checksum errors  (Ilya Dryomov)
Some devices don't use blk_integrity but still want stable pages because they do their own checksumming. Examples include rbd and iSCSI when data digests are negotiated. Stacking DM (and thus LVM) on top of these devices results in sporadic checksum errors. Set BDI_CAP_STABLE_WRITES if any underlying device has it set. Cc: stable@vger.kernel.org Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
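A hedged sketch of how such propagation can be expressed with DM's iterate_devices hook; the function names prefixed with example_ are illustrative, and the real change may differ in detail:

    /* Returns nonzero if this underlying device wants stable pages. */
    static int example_device_requires_stable_pages(struct dm_target *ti,
                                                    struct dm_dev *dev,
                                                    sector_t start, sector_t len,
                                                    void *data)
    {
        struct request_queue *q = bdev_get_queue(dev->bdev);

        return q && bdi_cap_stable_pages_required(q->backing_dev_info);
    }

    /* During table setup: if any underlying device sets the capability,
     * the caller would set BDI_CAP_STABLE_WRITES on the DM device too. */
    static bool example_table_requires_stable_pages(struct dm_table *t)
    {
        unsigned int i;

        for (i = 0; i < dm_table_get_num_targets(t); i++) {
            struct dm_target *ti = dm_table_get_target(t, i);

            if (ti->type->iterate_devices &&
                ti->type->iterate_devices(ti, example_device_requires_stable_pages, NULL))
                return true;
        }
        return false;
    }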