path: root/drivers/md/bcache/super.c
Age  Commit message  Author
2020-02-13  bcache: Revert "bcache: shrink btree node cache after bch_btree_check()"  Coly Li
This reverts commit 1df3877ff6a4810054237c3259d900ded4468969.
In my testing, even when all the cached btree nodes are freed, creating the gc and allocator kernel threads may still fail. It turns out that kthread_run() may fail if there is a pending signal for the current task, and the pending signal is sent by the OOM killer, which is triggered by memory consumption in bch_btree_check().
Therefore explicitly shrinking the bcache btree node cache here does not help. Now that the shrinker callback is improved and pending signals are ignored before creating kernel threads, the operation is no longer necessary.
This patch reverts commit 1df3877ff6a4 ("bcache: shrink btree node cache after bch_btree_check()") because we have a better improvement now.
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-01  bcache: check return value of prio_read()  Coly Li
Now if prio_read() fails while starting a cache set, we can print an error message in run_cache_set() and handle the failure properly.
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
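The pattern described here can be sketched in userspace C as follows. This is only an illustration: `prio_read_stub()` and `run_cache_set_stub()` are hypothetical stand-ins for the kernel functions, and the error text is made up for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for prio_read(): 0 on success, negative errno on failure. */
static int prio_read_stub(int simulate_io_error)
{
	return simulate_io_error ? -EIO : 0;
}

/*
 * Hypothetical stand-in for run_cache_set(): returns NULL on success, or an
 * error string when the read fails, instead of ignoring the return value.
 */
static const char *run_cache_set_stub(int simulate_io_error)
{
	const char *err = NULL;

	if (prio_read_stub(simulate_io_error))
		err = "cannot read priorities";	/* illustrative message text */

	return err;
}
```

The caller can then print the returned string and take its normal failure path.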
2020-01-23  bcache: use read_cache_page_gfp to read the superblock  Christoph Hellwig
Avoid a pointless dependency on buffer heads in bcache by simply open coding reading a single page. Also add a SB_OFFSET define for the byte offset of the superblock instead of using magic numbers. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-23  bcache: store a pointer to the on-disk sb in the cache and cached_dev structures  Christoph Hellwig
This allows the superblock bio, including the offset in the page, to be built properly using the normal bio helpers. This fixes writing the superblock for page sizes larger than 4k, where the sb write bio needs an offset in the bio_vec.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
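A small sketch of why the page size matters, under stated assumptions: the superblock is taken to start at sector 8 with 512-byte sectors (a byte offset of 4096). Both values and all names below are assumptions for illustration, not quoted from the patch.

```c
#include <assert.h>
#include <stddef.h>

#define SB_SECTOR_ASSUMED	8u	/* assumed superblock start sector */
#define SECTOR_SHIFT_ASSUMED	9u	/* assumed 512-byte sectors */
#define SB_OFFSET_ASSUMED	((size_t)SB_SECTOR_ASSUMED << SECTOR_SHIFT_ASSUMED)

/* Offset of the superblock inside the page that contains it. */
static size_t sb_offset_in_page(size_t page_size)
{
	/* page_size must be a non-zero power of two */
	assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
	return SB_OFFSET_ASSUMED & (page_size - 1);
}

/* Index of the page that holds the superblock. */
static size_t sb_page_index(size_t page_size)
{
	assert(page_size != 0);
	return SB_OFFSET_ASSUMED / page_size;
}
```

With 4k pages the superblock sits at offset 0 of page 1, so a bio without an in-page offset happens to work. With 64k pages it sits at offset 4096 of page 0, which is the case where the bio_vec needs an offset.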
2020-01-23  bcache: return a pointer to the on-disk sb from read_super  Christoph Hellwig
Returning the properly typed actual data structure instead of the containing struct page will save the callers some work going forward.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-23  bcache: transfer the sb_page reference to register_{bdev,cache}  Christoph Hellwig
Avoid an extra reference count roundtrip by transferring the sb_page ownership to the lower level register helpers. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-23  bcache: fix use-after-free in register_bcache()  Coly Li
The patch "bcache: rework error unwinding in register_bcache" introduces a use-after-free regression in register_bcache(). Here is the current code,
    2510 out_free_path:
    2511         kfree(path);
    2512 out_module_put:
    2513         module_put(THIS_MODULE);
    2514 out:
    2515         pr_info("error %s: %s", path, err);
    2516         return ret;
If an error happens and the above code path is executed, path is released at line 2511 but referenced at line 2515. KASAN then reports a use-after-free error.
This patch changes line 2515 in the following way to fix the problem,
    2515         pr_info("error %s: %s", path?path:"", err);
Signed-off-by: Coly Li <colyli@suse.de>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
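A userspace sketch of the same idea, assuming the freed pointer is also cleared so that the NULL check in the log line has an effect (the message above only quotes the changed pr_info() line). `register_stub()` and the error text are hypothetical; malloc()/free()/snprintf() stand in for the kernel helpers.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical analogue of the error tail of register_bcache(). */
static int register_stub(const char *input, char *buf, size_t len)
{
	const char *err = "registration failed";	/* illustrative text */
	char *path = NULL;	/* initialized, so the log line may test it */

	assert(input && buf && len > 0);

	path = malloc(strlen(input) + 1);
	if (!path) {
		err = "out of memory";
		goto out;
	}
	strcpy(path, input);

	/* Pretend a later registration step failed and unwind. */
	free(path);
	path = NULL;	/* assumption: clear the pointer so it cannot dangle */
out:
	/* The log line never touches freed memory. */
	snprintf(buf, len, "error %s: %s", path ? path : "", err);
	return -1;
}
```

The log line loses the path name in this case, which is the trade-off the one-line fix accepts.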
2020-01-23  bcache: properly initialize 'path' and 'err' in register_bcache()  Coly Li
Patch "bcache: rework error unwinding in register_bcache" from Christoph Hellwig leaves the local variables 'path' and 'err' in an undefined initial state. If the code in register_bcache() jumps to label 'out:' or 'out_module_put:' by goto, these two variables might be referenced with undefined values by the following lines,
    out_module_put:
            module_put(THIS_MODULE);
    out:
            pr_info("error %s: %s", path, err);
            return ret;
Therefore this patch initializes these two local variables properly in register_bcache() to avoid the issue.
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-23  bcache: rework error unwinding in register_bcache  Christoph Hellwig
Split the successful and error return path, and use one goto label for each resource to unwind. This also fixes some small errors like leaking the module reference count in the reboot case (which seems entirely harmless) or printing the wrong warning messages for early failures. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
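The unwinding style described here, one goto label per acquired resource and a separate success return, can be sketched in userspace C. All names (`dev_stub`, `res_get()`, `res_put()`, `register_stub()`) are hypothetical, and the three resources only loosely mirror what a registration path might hold.

```c
#include <assert.h>
#include <stdlib.h>

static int live_allocs;	/* resources currently held, for illustration only */

static void *res_get(int fail)
{
	void *p = fail ? NULL : malloc(1);

	if (p)
		live_allocs++;
	return p;
}

static void res_put(void *p)
{
	free(p);
	live_allocs--;
}

struct dev_stub {
	void *path;
	void *sb;
	void *bdev;
};

/* fail_at: 0 = succeed, 1..3 = fail while acquiring that resource. */
static int register_stub(struct dev_stub *d, int fail_at)
{
	d->path = res_get(fail_at == 1);
	if (!d->path)
		goto out;
	d->sb = res_get(fail_at == 2);
	if (!d->sb)
		goto out_free_path;
	d->bdev = res_get(fail_at == 3);
	if (!d->bdev)
		goto out_put_sb;

	return 0;	/* success path is separate: the caller owns everything */

out_put_sb:
	res_put(d->sb);
out_free_path:
	res_put(d->path);
out:
	return -1;
}
```

Each failure jumps to the label that releases exactly what has been acquired so far, so no early failure can release something it never took.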
2020-01-23  bcache: use a separate data structure for the on-disk super block  Christoph Hellwig
Split out an on-disk version struct cache_sb with the proper endianness annotations. This fixes a fair chunk of sparse warnings, but there are some left due to the way the checksum is defined. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-23  bcache: cached_dev_free needs to put the sb page  Liang Chen
Same as cache device, the buffer page needs to be put while freeing cached_dev. Otherwise a page would be leaked every time a cached_dev is stopped. Signed-off-by: Liang Chen <liangchen.linux@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-13  bcache: add idle_max_writeback_rate sysfs interface  Coly Li
For writeback mode, if there is no regular I/O request for a while, the writeback rate is set to the maximum value (1TB/s for now). This is good for most storage workloads, but some people don't want the maximum writeback rate during I/O idle time.
This patch adds a sysfs interface file idle_max_writeback_rate that permits people to disable the maximum writeback rate. The minimum writeback rate can then be advised by writeback_rate_minimum in the bcache device's sysfs interface.
Reported-by: Christian Balzer <chibi@gol.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-13  bcache: fix deadlock in bcache_allocator  Andrea Righi
bcache_allocator can call the following: bch_allocator_thread() -> bch_prio_write() -> bch_bucket_alloc() -> wait on &ca->set->bucket_wait But the wake up event on bucket_wait is supposed to come from bch_allocator_thread() itself => deadlock: [ 1158.490744] INFO: task bcache_allocato:15861 blocked for more than 10 seconds. [ 1158.495929] Not tainted 5.3.0-050300rc3-generic #201908042232 [ 1158.500653] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 1158.504413] bcache_allocato D 0 15861 2 0x80004000 [ 1158.504419] Call Trace: [ 1158.504429] __schedule+0x2a8/0x670 [ 1158.504432] schedule+0x2d/0x90 [ 1158.504448] bch_bucket_alloc+0xe5/0x370 [bcache] [ 1158.504453] ? wait_woken+0x80/0x80 [ 1158.504466] bch_prio_write+0x1dc/0x390 [bcache] [ 1158.504476] bch_allocator_thread+0x233/0x490 [bcache] [ 1158.504491] kthread+0x121/0x140 [ 1158.504503] ? invalidate_buckets+0x890/0x890 [bcache] [ 1158.504506] ? kthread_park+0xb0/0xb0 [ 1158.504510] ret_from_fork+0x35/0x40 Fix by making the call to bch_prio_write() non-blocking, so that bch_allocator_thread() never waits on itself. Moreover, make sure to wake up the garbage collector thread when bch_prio_write() is failing to allocate buckets. BugLink: https://bugs.launchpad.net/bugs/1784665 BugLink: https://bugs.launchpad.net/bugs/1796292 Signed-off-by: Andrea Righi <andrea.righi@canonical.com> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-13  bcache: add more accurate error messages in read_super()  Coly Li
Previous code only returns "Not a bcache superblock" for both a bad super block offset and a bad magic number. This patch adds more accurate error messages,
- for a super block with unmatched offset: "Not a bcache superblock (bad offset)"
- for a super block with unmatched magic number: "Not a bcache superblock (bad magic)"
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
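The two distinct checks can be sketched as below. The struct layout, the expected sector value, and the magic bytes are placeholders chosen for the example (they are not the real bcache on-disk values); only the two message strings come from the text above.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SB_SECTOR_ASSUMED 8u	/* assumed expected superblock sector */

/* Placeholder bytes, NOT the real bcache magic. */
static const uint8_t magic_assumed[16] = { 0xde, 0xad, 0xbe, 0xef };

struct sb_stub {
	uint64_t offset;
	uint8_t magic[16];
};

/* Returns NULL if the superblock looks valid, else a specific error string. */
static const char *check_sb(const struct sb_stub *sb)
{
	if (sb->offset != SB_SECTOR_ASSUMED)
		return "Not a bcache superblock (bad offset)";

	if (memcmp(sb->magic, magic_assumed, sizeof(magic_assumed)))
		return "Not a bcache superblock (bad magic)";

	return NULL;
}
```

Splitting the checks lets the log tell a misplaced superblock apart from a device that simply is not bcache.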
2019-11-13  bcache: fix static checker warning in bcache_device_free()  Coly Li
Commit cafe56359144 ("bcache: A block layer cache") leads to the following static checker warning:
    ./drivers/md/bcache/super.c:770 bcache_device_free()
    warn: variable dereferenced before check 'd->disk' (see line 766)
drivers/md/bcache/super.c
    762  static void bcache_device_free(struct bcache_device *d)
    763  {
    764          lockdep_assert_held(&bch_register_lock);
    765
    766          pr_info("%s stopped", d->disk->disk_name);
                                       ^^^^^^^^^
    Unchecked dereference.
    767
    768          if (d->c)
    769                  bcache_device_detach(d);
    770          if (d->disk && d->disk->flags & GENHD_FL_UP)
                     ^^^^^^^
    Check too late.
    771                  del_gendisk(d->disk);
    772          if (d->disk && d->disk->queue)
    773                  blk_cleanup_queue(d->disk->queue);
    774          if (d->disk) {
    775                  ida_simple_remove(&bcache_device_idx,
    776                                    first_minor_to_idx(d->disk->first_minor));
    777                  put_disk(d->disk);
    778          }
    779
It is not 100% certain that the gendisk struct of a bcache device will always be there, so the warning makes sense when there is a problem in the block core.
This patch removes the static checker warning by checking d->disk to avoid NULL pointer dereferences.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
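A minimal sketch of checking the pointer before the first dereference. The stub types and `format_stopped_msg()` are hypothetical, and the wording of the fallback message is an assumption for illustration rather than a quote from the patch.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

struct gendisk_stub {
	char disk_name[32];
};

struct bcache_device_stub {
	struct gendisk_stub *disk;	/* may be NULL if setup failed early */
};

/* Build the "stopped" message without dereferencing a NULL disk. */
static void format_stopped_msg(const struct bcache_device_stub *d,
			       char *buf, size_t len)
{
	const struct gendisk_stub *disk = d->disk;

	if (disk)
		snprintf(buf, len, "%s stopped", disk->disk_name);
	else
		snprintf(buf, len, "bcache device (NULL gendisk) stopped");
}
```

Loading the pointer once into a local and testing it first keeps every later use behind the same check.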
2019-11-13  bcache: fix a lost wake-up problem caused by mca_cannibalize_lock  Guoju Fang
This patch fixes a lost wake-up problem caused by the race between mca_cannibalize_lock and bch_cannibalize_unlock.
Consider two processes, A and B. Process A is executing mca_cannibalize_lock, while process B has taken c->btree_cache_alloc_lock and is executing bch_cannibalize_unlock. The problem happens after process A executes cmpxchg and before it executes prepare_to_wait. In this timeslice process B executes wake_up; only afterwards does process A execute prepare_to_wait and set its state to TASK_INTERRUPTIBLE. Process A then goes to sleep and no one will wake it up. This problem may cause the bcache device to hang.
Signed-off-by: Guoju Fang <fangguoju@gmail.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-07-22  bcache: fix possible memory leak in bch_cached_dev_run()  Wei Yongjun
Memory is allocated in bch_cached_dev_run() and should be freed before leaving in the error handling cases; otherwise it will cause a memory leak.
Fixes: 0b13efecf5f2 ("bcache: add return value check to bch_cached_dev_run()")
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-28  bcache: shrink btree node cache after bch_btree_check()  Coly Li
When a cache set starts, bch_btree_check() checks all bkeys on the cache device by calculating their checksums. This operation consumes a huge amount of system memory if a lot of data is cached. bcache uses its own mca cache to maintain all its read-in btree nodes, and only releases the cache space when the system memory management code starts to shrink caches. So until the memory management code calls the mca cache shrinker callback, the bcache mca cache competes for memory with user space applications, which may have a negative effect on the performance of user space workloads (e.g. databases, or the I/O service of a distributed storage node).
This patch calls the bcache mca shrinker routine to proactively release mca cache memory, to decrease system memory pressure and avoid a negative effect on overall system I/O performance.
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-28  bcache: fix potential deadlock in cached_def_free()  Coly Li
When enable lockdep and reboot system with a writeback mode bcache device, the following potential deadlock warning is reported by lockdep engine. [ 101.536569][ T401] kworker/2:2/401 is trying to acquire lock: [ 101.538575][ T401] 00000000bbf6e6c7 ((wq_completion)bcache_writeback_wq){+.+.}, at: flush_workqueue+0x87/0x4c0 [ 101.542054][ T401] [ 101.542054][ T401] but task is already holding lock: [ 101.544587][ T401] 00000000f5f305b3 ((work_completion)(&cl->work)#2){+.+.}, at: process_one_work+0x21e/0x640 [ 101.548386][ T401] [ 101.548386][ T401] which lock already depends on the new lock. [ 101.548386][ T401] [ 101.551874][ T401] [ 101.551874][ T401] the existing dependency chain (in reverse order) is: [ 101.555000][ T401] [ 101.555000][ T401] -> #1 ((work_completion)(&cl->work)#2){+.+.}: [ 101.557860][ T401] process_one_work+0x277/0x640 [ 101.559661][ T401] worker_thread+0x39/0x3f0 [ 101.561340][ T401] kthread+0x125/0x140 [ 101.562963][ T401] ret_from_fork+0x3a/0x50 [ 101.564718][ T401] [ 101.564718][ T401] -> #0 ((wq_completion)bcache_writeback_wq){+.+.}: [ 101.567701][ T401] lock_acquire+0xb4/0x1c0 [ 101.569651][ T401] flush_workqueue+0xae/0x4c0 [ 101.571494][ T401] drain_workqueue+0xa9/0x180 [ 101.573234][ T401] destroy_workqueue+0x17/0x250 [ 101.575109][ T401] cached_dev_free+0x44/0x120 [bcache] [ 101.577304][ T401] process_one_work+0x2a4/0x640 [ 101.579357][ T401] worker_thread+0x39/0x3f0 [ 101.581055][ T401] kthread+0x125/0x140 [ 101.582709][ T401] ret_from_fork+0x3a/0x50 [ 101.584592][ T401] [ 101.584592][ T401] other info that might help us debug this: [ 101.584592][ T401] [ 101.588355][ T401] Possible unsafe locking scenario: [ 101.588355][ T401] [ 101.590974][ T401] CPU0 CPU1 [ 101.592889][ T401] ---- ---- [ 101.594743][ T401] lock((work_completion)(&cl->work)#2); [ 101.596785][ T401] lock((wq_completion)bcache_writeback_wq); [ 101.600072][ T401] lock((work_completion)(&cl->work)#2); [ 101.602971][ T401] lock((wq_completion)bcache_writeback_wq); [ 
101.605255][ T401] [ 101.605255][ T401] *** DEADLOCK *** [ 101.605255][ T401] [ 101.608310][ T401] 2 locks held by kworker/2:2/401: [ 101.610208][ T401] #0: 00000000cf2c7d17 ((wq_completion)events){+.+.}, at: process_one_work+0x21e/0x640 [ 101.613709][ T401] #1: 00000000f5f305b3 ((work_completion)(&cl->work)#2){+.+.}, at: process_one_work+0x21e/0x640 [ 101.617480][ T401] [ 101.617480][ T401] stack backtrace: [ 101.619539][ T401] CPU: 2 PID: 401 Comm: kworker/2:2 Tainted: G W 5.2.0-rc4-lp151.20-default+ #1 [ 101.623225][ T401] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018 [ 101.627210][ T401] Workqueue: events cached_dev_free [bcache] [ 101.629239][ T401] Call Trace: [ 101.630360][ T401] dump_stack+0x85/0xcb [ 101.631777][ T401] print_circular_bug+0x19a/0x1f0 [ 101.633485][ T401] __lock_acquire+0x16cd/0x1850 [ 101.635184][ T401] ? __lock_acquire+0x6a8/0x1850 [ 101.636863][ T401] ? lock_acquire+0xb4/0x1c0 [ 101.638421][ T401] ? find_held_lock+0x34/0xa0 [ 101.640015][ T401] lock_acquire+0xb4/0x1c0 [ 101.641513][ T401] ? flush_workqueue+0x87/0x4c0 [ 101.643248][ T401] flush_workqueue+0xae/0x4c0 [ 101.644832][ T401] ? flush_workqueue+0x87/0x4c0 [ 101.646476][ T401] ? drain_workqueue+0xa9/0x180 [ 101.648303][ T401] drain_workqueue+0xa9/0x180 [ 101.649867][ T401] destroy_workqueue+0x17/0x250 [ 101.651503][ T401] cached_dev_free+0x44/0x120 [bcache] [ 101.653328][ T401] process_one_work+0x2a4/0x640 [ 101.655029][ T401] worker_thread+0x39/0x3f0 [ 101.656693][ T401] ? process_one_work+0x640/0x640 [ 101.658501][ T401] kthread+0x125/0x140 [ 101.660012][ T401] ? kthread_create_worker_on_cpu+0x70/0x70 [ 101.661985][ T401] ret_from_fork+0x3a/0x50 [ 101.691318][ T401] bcache: bcache_device_free() bcache0 stopped Here is how the above potential deadlock may happen in reboot/shutdown code path, 1) bcache_reboot() is called firstly in the reboot/shutdown code path, then in bcache_reboot(), bcache_device_stop() is called. 
2) bcache_device_stop() sets BCACHE_DEV_CLOSING on d->flags, then calls closure_queue(&d->cl) to invoke cached_dev_flush(). In turn cached_dev_flush() calls cached_dev_free() via closure_at().
3) In cached_dev_free(), after stopping the writeback kthread dc->writeback_thread, the workqueue dc->writeback_write_wq is stopped by destroy_workqueue().
4) Inside destroy_workqueue(), drain_workqueue() is called. Inside drain_workqueue(), flush_workqueue() is called. Then wq->lockdep_map is acquired by lock_map_acquire() in flush_workqueue(). After the lock is acquired, the rest of flush_workqueue() just waits for the workqueue to complete.
5) Now look back at the writeback thread routine bch_writeback_thread(): in the main while-loop, write_dirty() is called via continue_at() in read_dirty_submit(), which is called via continue_at() in read_dirty(), a function called at while-loop level. Inside write_dirty() it may be re-queued on workqueue dc->writeback_write_wq via continue_at(). This means that when the writeback kthread is stopped in cached_dev_free(), there might still be one kworker queued on dc->writeback_write_wq to execute write_dirty() again.
6) Now this kworker is scheduled on dc->writeback_write_wq to run by process_one_work() (which is called by worker_thread()). Before calling the kwork routine, wq->lockdep_map is acquired.
7) But wq->lockdep_map was already acquired in step 4), so an A-A lock (lockdep terminology) scenario happens.
Indeed, on a multi-core system the above deadlock is very rare, just as the code comment in process_one_work() says,
    * AFAICT there is no possible deadlock scenario between the
    * flush_work() and complete() primitives (except for single-threaded
    * workqueues), so hiding them isn't a problem.
But it is still good to fix such a lockdep warning, even if no one runs bcache on a single-core system. The fix is simple.
This patch solves the above potential deadlock by,
- Not destroying workqueue dc->writeback_write_wq in cached_dev_free().
- Flushing and destroying dc->writeback_write_wq in the writeback kthread routine bch_writeback_thread(), after quitting the thread's main while-loop and before cached_dev_put() is called.
With this fix, dc->writeback_write_wq is stopped and destroyed before the writeback kthread stops, so the chance of A-A locking on wq->lockdep_map disappears and such an A-A deadlock won't happen any more.
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-28  bcache: acquire bch_register_lock later in cached_dev_free()  Coly Li
When enable lockdep engine, a lockdep warning can be observed when reboot or shutdown system, [ 3142.764557][ T1] bcache: bcache_reboot() Stopping all devices: [ 3142.776265][ T2649] [ 3142.777159][ T2649] ====================================================== [ 3142.780039][ T2649] WARNING: possible circular locking dependency detected [ 3142.782869][ T2649] 5.2.0-rc4-lp151.20-default+ #1 Tainted: G W [ 3142.785684][ T2649] ------------------------------------------------------ [ 3142.788479][ T2649] kworker/3:67/2649 is trying to acquire lock: [ 3142.790738][ T2649] 00000000aaf02291 ((wq_completion)bcache_writeback_wq){+.+.}, at: flush_workqueue+0x87/0x4c0 [ 3142.794678][ T2649] [ 3142.794678][ T2649] but task is already holding lock: [ 3142.797402][ T2649] 000000004fcf89c5 (&bch_register_lock){+.+.}, at: cached_dev_free+0x17/0x120 [bcache] [ 3142.801462][ T2649] [ 3142.801462][ T2649] which lock already depends on the new lock. [ 3142.801462][ T2649] [ 3142.805277][ T2649] [ 3142.805277][ T2649] the existing dependency chain (in reverse order) is: [ 3142.808902][ T2649] [ 3142.808902][ T2649] -> #2 (&bch_register_lock){+.+.}: [ 3142.812396][ T2649] __mutex_lock+0x7a/0x9d0 [ 3142.814184][ T2649] cached_dev_free+0x17/0x120 [bcache] [ 3142.816415][ T2649] process_one_work+0x2a4/0x640 [ 3142.818413][ T2649] worker_thread+0x39/0x3f0 [ 3142.820276][ T2649] kthread+0x125/0x140 [ 3142.822061][ T2649] ret_from_fork+0x3a/0x50 [ 3142.823965][ T2649] [ 3142.823965][ T2649] -> #1 ((work_completion)(&cl->work)#2){+.+.}: [ 3142.827244][ T2649] process_one_work+0x277/0x640 [ 3142.829160][ T2649] worker_thread+0x39/0x3f0 [ 3142.830958][ T2649] kthread+0x125/0x140 [ 3142.832674][ T2649] ret_from_fork+0x3a/0x50 [ 3142.834915][ T2649] [ 3142.834915][ T2649] -> #0 ((wq_completion)bcache_writeback_wq){+.+.}: [ 3142.838121][ T2649] lock_acquire+0xb4/0x1c0 [ 3142.840025][ T2649] flush_workqueue+0xae/0x4c0 [ 3142.842035][ T2649] drain_workqueue+0xa9/0x180 [ 3142.844042][ T2649] 
destroy_workqueue+0x17/0x250 [ 3142.846142][ T2649] cached_dev_free+0x52/0x120 [bcache] [ 3142.848530][ T2649] process_one_work+0x2a4/0x640 [ 3142.850663][ T2649] worker_thread+0x39/0x3f0 [ 3142.852464][ T2649] kthread+0x125/0x140 [ 3142.854106][ T2649] ret_from_fork+0x3a/0x50 [ 3142.855880][ T2649] [ 3142.855880][ T2649] other info that might help us debug this: [ 3142.855880][ T2649] [ 3142.859663][ T2649] Chain exists of: [ 3142.859663][ T2649] (wq_completion)bcache_writeback_wq --> (work_completion)(&cl->work)#2 --> &bch_register_lock [ 3142.859663][ T2649] [ 3142.865424][ T2649] Possible unsafe locking scenario: [ 3142.865424][ T2649] [ 3142.868022][ T2649] CPU0 CPU1 [ 3142.869885][ T2649] ---- ---- [ 3142.871751][ T2649] lock(&bch_register_lock); [ 3142.873379][ T2649] lock((work_completion)(&cl->work)#2); [ 3142.876399][ T2649] lock(&bch_register_lock); [ 3142.879727][ T2649] lock((wq_completion)bcache_writeback_wq); [ 3142.882064][ T2649] [ 3142.882064][ T2649] *** DEADLOCK *** [ 3142.882064][ T2649] [ 3142.885060][ T2649] 3 locks held by kworker/3:67/2649: [ 3142.887245][ T2649] #0: 00000000e774cdd0 ((wq_completion)events){+.+.}, at: process_one_work+0x21e/0x640 [ 3142.890815][ T2649] #1: 00000000f7df89da ((work_completion)(&cl->work)#2){+.+.}, at: process_one_work+0x21e/0x640 [ 3142.894884][ T2649] #2: 000000004fcf89c5 (&bch_register_lock){+.+.}, at: cached_dev_free+0x17/0x120 [bcache] [ 3142.898797][ T2649] [ 3142.898797][ T2649] stack backtrace: [ 3142.900961][ T2649] CPU: 3 PID: 2649 Comm: kworker/3:67 Tainted: G W 5.2.0-rc4-lp151.20-default+ #1 [ 3142.904789][ T2649] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018 [ 3142.909168][ T2649] Workqueue: events cached_dev_free [bcache] [ 3142.911422][ T2649] Call Trace: [ 3142.912656][ T2649] dump_stack+0x85/0xcb [ 3142.914181][ T2649] print_circular_bug+0x19a/0x1f0 [ 3142.916193][ T2649] __lock_acquire+0x16cd/0x1850 [ 3142.917936][ T2649] ? 
__lock_acquire+0x6a8/0x1850 [ 3142.919704][ T2649] ? lock_acquire+0xb4/0x1c0 [ 3142.921335][ T2649] ? find_held_lock+0x34/0xa0 [ 3142.923052][ T2649] lock_acquire+0xb4/0x1c0 [ 3142.924635][ T2649] ? flush_workqueue+0x87/0x4c0 [ 3142.926375][ T2649] flush_workqueue+0xae/0x4c0 [ 3142.928047][ T2649] ? flush_workqueue+0x87/0x4c0 [ 3142.929824][ T2649] ? drain_workqueue+0xa9/0x180 [ 3142.931686][ T2649] drain_workqueue+0xa9/0x180 [ 3142.933534][ T2649] destroy_workqueue+0x17/0x250 [ 3142.935787][ T2649] cached_dev_free+0x52/0x120 [bcache] [ 3142.937795][ T2649] process_one_work+0x2a4/0x640 [ 3142.939803][ T2649] worker_thread+0x39/0x3f0 [ 3142.941487][ T2649] ? process_one_work+0x640/0x640 [ 3142.943389][ T2649] kthread+0x125/0x140 [ 3142.944894][ T2649] ? kthread_create_worker_on_cpu+0x70/0x70 [ 3142.947744][ T2649] ret_from_fork+0x3a/0x50 [ 3142.970358][ T2649] bcache: bcache_device_free() bcache0 stopped Here is how the deadlock happens. 1) bcache_reboot() calls bcache_device_stop(), then inside bcache_device_stop() BCACHE_DEV_CLOSING bit is set on d->flags. Then closure_queue(&d->cl) is called to invoke cached_dev_flush(). 2) In cached_dev_flush(), cached_dev_free() is called by continu_at(). 3) In cached_dev_free(), when stopping the writeback kthread of the cached device by kthread_stop(), dc->writeback_thread will be waken up to quite the kthread while-loop, then cached_dev_put() is called in bch_writeback_thread(). 4) Calling cached_dev_put() in writeback kthread may drop dc->count to 0, then dc->detach kworker is scheduled, which is initialized as cached_dev_detach_finish(). 5) Inside cached_dev_detach_finish(), the last line of code is to call closure_put(&dc->disk.cl), which drops the last reference counter of closrure dc->disk.cl, then the callback cached_dev_flush() gets called. Now cached_dev_flush() is called for second time in the code path, the first time is in step 2). 
bch_register_lock is then acquired again, and an A-A lock (lockdep terminology) happens.
The root cause of the above A-A lock is that in cached_dev_free(), mutex bch_register_lock is held before stopping the writeback kthread and other kworkers. Fortunately we now have the variable 'bcache_is_reboot', which prevents device registration or unregistration during reboot/shutdown, so it is unnecessary to hold bch_register_lock so early.
This is how this patch fixes the reboot/shutdown time A-A lock issue: after moving mutex_lock(&bch_register_lock) to a later location, just before atomic_read(&dc->running) in cached_dev_free(), the A-A lock problem is solved without any reboot time registration race.
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-28  bcache: acquire bch_register_lock later in cached_dev_detach_finish()  Coly Li
Now that there is the variable bcache_is_reboot to prevent device registration or unregistration during reboot, it is unnecessary to hold the mutex bch_register_lock before stopping the writeback_rate_update kworker and the writeback kthread. And if the kworker or kthread being stopped holds bch_register_lock inside its routine (we used to have such a problem in the writeback thread; thanks to Junhui Wang for fixing it), it is very easy to introduce a deadlock during the reboot/shutdown procedure.
Therefore in this patch, the acquisition of bch_register_lock is moved to just before the call to calc_cached_dev_sectors(), which is later than the original location in cached_dev_detach_finish().
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-28  bcache: avoid a deadlock in bcache_reboot()  Coly Li
It is quite frequent to observe a deadlock in bcache_reboot() that hangs the system reboot process. The reason is that in bcache_reboot(), the mutex bch_register_lock is held when calling bch_cache_set_stop() and bcache_device_stop(). But in the process of stopping the cache set and bcache device, bch_register_lock is acquired again. If this mutex is held here, a deadlock happens inside the stopping process. The consequence of the deadlock is that the whole system reboot hangs.
The fix is to avoid holding bch_register_lock for the following loops in bcache_reboot(),
    list_for_each_entry_safe(c, tc, &bch_cache_sets, list)
            bch_cache_set_stop(c);
    list_for_each_entry_safe(dc, tdc, &uncached_devices, list)
            bcache_device_stop(&dc->disk);
A module-scope variable 'bcache_is_reboot' is added; it is set to true in bcache_reboot(). In register_bcache(), if bcache_is_reboot is found to be true, the registration is rejected by returning -EBUSY immediately.
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
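The flag-based gate can be sketched in a few lines of userspace C. The names are hypothetical stand-ins, and the sketch is single-threaded: it leaves out whatever memory ordering the kernel code needs between the notifier and concurrent registrations.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical analogue of the module-scope 'bcache_is_reboot' flag. */
static bool is_reboot_stub;

/* Analogue of the reboot notifier: set the flag before stopping devices. */
static void reboot_notify_stub(void)
{
	is_reboot_stub = true;
	/* ... stop cache sets and devices without holding the register lock ... */
}

/* Analogue of the registration entry point: refuse new work during reboot. */
static int register_stub(void)
{
	if (is_reboot_stub)
		return -EBUSY;

	/* ... normal registration would happen here ... */
	return 0;
}
```

Because registration is refused once the flag is set, the stop loops no longer need the mutex to keep new devices from appearing underneath them.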
2019-06-28  bcache: stop writeback kthread and kworker when bch_cached_dev_run() failed  Coly Li
In bch_cached_dev_attach(), after bch_cached_dev_writeback_start() is called, the writeback kthread and the writeback rate update kworker of the cached device are created. If the following bch_cached_dev_run() fails, bch_cached_dev_attach() returns -ENOMEM without stopping the writeback related kthread and kworker.
This patch stops the writeback kthread and the writeback rate update kworker before returning -ENOMEM if bch_cached_dev_run() returns an error.
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-28  bcache: add pendings_cleanup to stop pending bcache device  Coly Li
If a bcache device is in dirty state and its cache set is not registered, this bcache device will not appear in /dev/bcache<N>, and there is no way to stop it or remove the bcache kernel module.
This is as-designed behavior, but sometimes people have to reboot the whole system to release or stop the pending backing device.
This sysfs interface removes such pending bcache devices when anything is written into the sysfs file manually.
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-28  bcache: remove "XXX:" comment line from run_cache_set()  Coly Li
In previous bcache patches for Linux v5.2, the failure code path of run_cache_set() is tested and fixed. So now the following comment line can be removed from run_cache_set(), /* XXX: test this, it's broken */ Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-28  bcache: improve error message in bch_cached_dev_run()  Coly Li
This patch adds more error messages in bch_cached_dev_run() to indicate the exact reason why an error value is returned. Please note that the "is running already" message is printed with pr_info(): although -EBUSY is returned in this case, the bcache device can continue to attach to the cache device and run, so it should not be an error level message in the kernel log.
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-28  bcache: add more error message in bch_cached_dev_attach()  Coly Li
This patch adds more error messages for attaching a cached device; this is helpful for debugging failures during bcache device start up.
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-28  bcache: more detailed error message to bcache_device_link()  Coly Li
This patch adds a more accurate error message for each specific sysfs_create_link() call, to help debug failures during bcache device start up.
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-28  bcache: add return value check to bch_cached_dev_run()  Coly Li
This patch adds a return value check to bch_cached_dev_run(); now if an error happens inside bch_cached_dev_run(), it can be caught.
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-28  bcache: add io error counting in write_bdev_super_endio()  Coly Li
When the backing device super block is written by bch_write_bdev_super(), the bio completion callback write_bdev_super_endio() simply ignores the I/O status. Such a write request should also contribute to the backing device health status if the request fails.
This patch checks bio->bi_status in write_bdev_super_endio(); if there is an error, bch_count_backing_io_errors() is called to count an I/O error in dc->io_errors.
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
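The completion-callback change can be sketched as follows. The stub structs and `write_super_endio_stub()` are hypothetical simplifications; the real callback works on a struct bio and does more than bump a counter.

```c
#include <assert.h>
#include <stddef.h>

struct cached_dev_stub {
	unsigned int io_errors;	/* health counter for the backing device */
};

struct bio_stub {
	int status;			/* 0 on success, non-zero on I/O error */
	struct cached_dev_stub *dc;	/* device the write belongs to */
};

/* Completion callback: account failed superblock writes instead of ignoring them. */
static void write_super_endio_stub(struct bio_stub *bio)
{
	assert(bio != NULL && bio->dc != NULL);

	if (bio->status)
		bio->dc->io_errors++;

	/* ... the real callback would then release its reference ... */
}
```

Counting here means a device that fails only on superblock writes still shows up in the same error statistics as ordinary I/O failures.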
2019-06-28  bcache: avoid flushing btree node in cache_set_flush() if io disabled  Coly Li
When cache_set_flush() is called because too many I/O errors were detected on the cache device and the cache set is retiring, it doesn't make sense to flush cached btree nodes from c->btree_cache inside the function, because CACHE_SET_IO_DISABLE is already set on c->flags and all I/Os to the cache device will be rejected.
This patch checks in cache_set_flush() whether CACHE_SET_IO_DISABLE is set. If it is, flushing the cached btree nodes is skipped to save time and make cache set retiring faster.
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
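A sketch of the flag test, with made-up bookkeeping in place of real btree node writes. The bit number and every name below are assumptions for illustration only.

```c
#include <assert.h>
#include <stddef.h>

#define IO_DISABLE_BIT_ASSUMED 3	/* hypothetical bit number in flags */

struct cache_set_stub {
	unsigned long flags;
	int dirty_nodes;	/* cached btree nodes waiting to be written */
	int nodes_written;	/* nodes actually flushed */
};

/* Teardown step: only write dirty nodes back if I/O is still allowed. */
static void cache_set_flush_stub(struct cache_set_stub *c)
{
	assert(c != NULL);

	if (!(c->flags & (1UL << IO_DISABLE_BIT_ASSUMED))) {
		c->nodes_written += c->dirty_nodes;
		c->dirty_nodes = 0;
	}

	/* ... the rest of the teardown proceeds either way ... */
}
```

When the bit is set every write would be rejected anyway, so skipping the loop only removes wasted work from the retire path.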
2019-06-28  Revert "bcache: set CACHE_SET_IO_DISABLE in bch_cached_dev_error()"  Coly Li
This reverts commit 6147305c73e4511ca1a975b766b97a779d442567.
Although this patch helps the failed bcache device to stop faster when too many I/O errors are detected on the corresponding cached device, setting the CACHE_SET_IO_DISABLE bit on cache set c->flags was not a good idea. This operation disables all I/Os on the cache set, which means other attached bcache devices won't work either.
Without this patch, the failed bcache device can also be stopped eventually once internal I/O (e.g. writeback) is accomplished. Therefore here I revert it.
Fixes: 6147305c73e4 ("bcache: set CACHE_SET_IO_DISABLE in bch_cached_dev_error()")
Reported-by: Yong Li <mr.liyong@qq.com>
Signed-off-by: Coly Li <colyli@suse.de>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-28bcache: check c->gc_thread by IS_ERR_OR_NULL in cache_set_flush()Coly Li
When system memory is under heavy pressure, bch_gc_thread_start() from run_cache_set() may fail due to out of memory. In such a condition, c->gc_thread is assigned -ENOMEM, not a NULL pointer. Then in the following failure code path bch_cache_set_error(), when cache_set_flush() gets called, the code piece to stop c->gc_thread is broken, if (c->gc_thread) kthread_stop(c->gc_thread); And KASAN catches such a NULL pointer dereference problem, with the warning information: [ 561.207881] ================================================================== [ 561.207900] BUG: KASAN: null-ptr-deref in kthread_stop+0x3b/0x440 [ 561.207904] Write of size 4 at addr 000000000000001c by task kworker/15:1/313 [ 561.207913] CPU: 15 PID: 313 Comm: kworker/15:1 Tainted: G W 5.0.0-vanilla+ #3 [ 561.207916] Hardware name: Lenovo ThinkSystem SR650 -[7X05CTO1WW]-/-[7X05CTO1WW]-, BIOS -[IVE136T-2.10]- 03/22/2019 [ 561.207935] Workqueue: events cache_set_flush [bcache] [ 561.207940] Call Trace: [ 561.207948] dump_stack+0x9a/0xeb [ 561.207955] ? kthread_stop+0x3b/0x440 [ 561.207960] ? kthread_stop+0x3b/0x440 [ 561.207965] kasan_report+0x176/0x192 [ 561.207973] ? kthread_stop+0x3b/0x440 [ 561.207981] kthread_stop+0x3b/0x440 [ 561.207995] cache_set_flush+0xd4/0x6d0 [bcache] [ 561.208008] process_one_work+0x856/0x1620 [ 561.208015] ? find_held_lock+0x39/0x1d0 [ 561.208028] ? drain_workqueue+0x380/0x380 [ 561.208048] worker_thread+0x87/0xb80 [ 561.208058] ? __kthread_parkme+0xb6/0x180 [ 561.208067] ? process_one_work+0x1620/0x1620 [ 561.208072] kthread+0x326/0x3e0 [ 561.208079] ? 
kthread_create_worker_on_cpu+0xc0/0xc0 [ 561.208090] ret_from_fork+0x3a/0x50 [ 561.208110] ================================================================== [ 561.208113] Disabling lock debugging due to kernel taint [ 561.208115] irq event stamp: 11800231 [ 561.208126] hardirqs last enabled at (11800231): [<ffffffff83008538>] do_syscall_64+0x18/0x410 [ 561.208127] BUG: unable to handle kernel NULL pointer dereference at 000000000000001c [ 561.208129] #PF error: [WRITE] [ 561.312253] hardirqs last disabled at (11800230): [<ffffffff830052ff>] trace_hardirqs_off_thunk+0x1a/0x1c [ 561.312259] softirqs last enabled at (11799832): [<ffffffff850005c7>] __do_softirq+0x5c7/0x8c3 [ 561.405975] PGD 0 P4D 0 [ 561.442494] softirqs last disabled at (11799821): [<ffffffff831add2c>] irq_exit+0x1ac/0x1e0 [ 561.791359] Oops: 0002 [#1] SMP KASAN NOPTI [ 561.791362] CPU: 15 PID: 313 Comm: kworker/15:1 Tainted: G B W 5.0.0-vanilla+ #3 [ 561.791363] Hardware name: Lenovo ThinkSystem SR650 -[7X05CTO1WW]-/-[7X05CTO1WW]-, BIOS -[IVE136T-2.10]- 03/22/2019 [ 561.791371] Workqueue: events cache_set_flush [bcache] [ 561.791374] RIP: 0010:kthread_stop+0x3b/0x440 [ 561.791376] Code: 00 00 65 8b 05 26 d5 e0 7c 89 c0 48 0f a3 05 ec aa df 02 0f 82 dc 02 00 00 4c 8d 63 20 be 04 00 00 00 4c 89 e7 e8 65 c5 53 00 <f0> ff 43 20 48 8d 7b 24 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 [ 561.791377] RSP: 0018:ffff88872fc8fd10 EFLAGS: 00010286 [ 561.838895] bcache: bch_count_io_errors() nvme0n1: IO error on writing btree. [ 561.838916] bcache: bch_count_io_errors() nvme0n1: IO error on writing btree. [ 561.838934] bcache: bch_count_io_errors() nvme0n1: IO error on writing btree. [ 561.838948] bcache: bch_count_io_errors() nvme0n1: IO error on writing btree. [ 561.838966] bcache: bch_count_io_errors() nvme0n1: IO error on writing btree. [ 561.838979] bcache: bch_count_io_errors() nvme0n1: IO error on writing btree. [ 561.838996] bcache: bch_count_io_errors() nvme0n1: IO error on writing btree. 
[ 563.067028] RAX: 0000000000000000 RBX: fffffffffffffffc RCX: ffffffff832dd314 [ 563.067030] RDX: 0000000000000000 RSI: 0000000000000004 RDI: 0000000000000297 [ 563.067032] RBP: ffff88872fc8fe88 R08: fffffbfff0b8213d R09: fffffbfff0b8213d [ 563.067034] R10: 0000000000000001 R11: fffffbfff0b8213c R12: 000000000000001c [ 563.408618] R13: ffff88dc61cc0f68 R14: ffff888102b94900 R15: ffff88dc61cc0f68 [ 563.408620] FS: 0000000000000000(0000) GS:ffff888f7dc00000(0000) knlGS:0000000000000000 [ 563.408622] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 563.408623] CR2: 000000000000001c CR3: 0000000f48a1a004 CR4: 00000000007606e0 [ 563.408625] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 563.408627] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 563.904795] bcache: bch_count_io_errors() nvme0n1: IO error on writing btree. [ 563.915796] PKRU: 55555554 [ 563.915797] Call Trace: [ 563.915807] cache_set_flush+0xd4/0x6d0 [bcache] [ 563.915812] process_one_work+0x856/0x1620 [ 564.001226] bcache: bch_count_io_errors() nvme0n1: IO error on writing btree. [ 564.033563] ? find_held_lock+0x39/0x1d0 [ 564.033567] ? drain_workqueue+0x380/0x380 [ 564.033574] worker_thread+0x87/0xb80 [ 564.062823] bcache: bch_count_io_errors() nvme0n1: IO error on writing btree. [ 564.118042] ? __kthread_parkme+0xb6/0x180 [ 564.118046] ? process_one_work+0x1620/0x1620 [ 564.118048] kthread+0x326/0x3e0 [ 564.118050] ? kthread_create_worker_on_cpu+0xc0/0xc0 [ 564.167066] bcache: bch_count_io_errors() nvme0n1: IO error on writing btree. 
[ 564.252441] ret_from_fork+0x3a/0x50 [ 564.252447] Modules linked in: msr rpcrdma sunrpc rdma_ucm ib_iser ib_umad rdma_cm ib_ipoib i40iw configfs iw_cm ib_cm libiscsi scsi_transport_iscsi mlx4_ib ib_uverbs mlx4_en ib_core nls_iso8859_1 nls_cp437 vfat fat intel_rapl skx_edac x86_pkg_temp_thermal coretemp iTCO_wdt iTCO_vendor_support crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel ses raid0 aesni_intel cdc_ether enclosure usbnet ipmi_ssif joydev aes_x86_64 i40e scsi_transport_sas mii bcache md_mod crypto_simd mei_me ioatdma crc64 ptp cryptd pcspkr i2c_i801 mlx4_core glue_helper pps_core mei lpc_ich dca wmi ipmi_si ipmi_devintf nd_pmem dax_pmem nd_btt ipmi_msghandler device_dax pcc_cpufreq button hid_generic usbhid mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect xhci_pci sysimgblt fb_sys_fops xhci_hcd ttm megaraid_sas drm usbcore nfit libnvdimm sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua efivarfs [ 564.299390] bcache: bch_count_io_errors() nvme0n1: IO error on writing btree. [ 564.348360] CR2: 000000000000001c [ 564.348362] ---[ end trace b7f0e5cc7b2103b0 ]--- Therefore, it is not enough to only check whether c->gc_thread is NULL, we should use IS_ERR_OR_NULL() to check both NULL pointer and error value. This patch changes the above buggy code piece in this way, if (!IS_ERR_OR_NULL(c->gc_thread)) kthread_stop(c->gc_thread); Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
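The difference between a NULL-only check and IS_ERR_OR_NULL() can be illustrated outside the kernel. The following is a minimal user-space sketch, assuming simplified stand-ins (err_ptr(), is_err(), is_err_or_null()) for the kernel's err.h helpers; thread_can_be_stopped() is a hypothetical name used only for illustration, and 12 is assumed to be the value of ENOMEM as on Linux.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified user-space stand-ins for the kernel's err.h helpers. */
#define MAX_ERRNO 4095

/* Encode a negative errno value in a pointer, like the kernel's ERR_PTR(). */
static inline void *err_ptr(long error)
{
	return (void *)(intptr_t)error;
}

/* True if the pointer value lies in the top MAX_ERRNO addresses. */
static inline bool is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* True for NULL and for error-encoded pointers. */
static inline bool is_err_or_null(const void *ptr)
{
	return ptr == NULL || is_err(ptr);
}

/*
 * Hypothetical helper: a thread pointer may only be handed to a stop
 * routine when it is neither NULL nor an encoded error.
 */
static inline bool thread_can_be_stopped(const void *thread)
{
	return !is_err_or_null(thread);
}
```

An error-encoded pointer such as err_ptr(-12) is not NULL, so a plain NULL test lets it through, while thread_can_be_stopped() rejects it.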
2019-04-30bcache: remove redundant LIST_HEAD(journal) from run_cache_set()Coly Li
Commit 95f18c9d1310 ("bcache: avoid potential memleak of list of journal_replay(s) in the CACHE_SYNC branch of run_cache_set") forgot to remove the original definition of LIST_HEAD(journal), which makes the change not take effect. This patch removes the redundant variable LIST_HEAD(journal) from run_cache_set(), to make Shenghui's fix work. Fixes: 95f18c9d1310 ("bcache: avoid potential memleak of list of journal_replay(s) in the CACHE_SYNC branch of run_cache_set") Reported-by: Juha Aatrokoski <juha.aatrokoski@aalto.fi> Cc: Shenghui Wang <shhuiw@foxmail.com> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-24bcache: avoid potential memleak of list of journal_replay(s) in the CACHE_SYNC branch of run_cache_setShenghui Wang
In the CACHE_SYNC branch of run_cache_set(), LIST_HEAD(journal) is used to collect journal_replay(s) and is filled by bch_journal_read(). If all goes well, bch_journal_replay() will release the list of journal_replay(s) at the end of the branch. If something goes wrong, code flow will jump to the label "err:" and leave the list unreleased. This patch releases the list of journal_replay(s) when an error is detected. v1 -> v2: * Move the release code to the location after label 'err:' to simplify the change. Signed-off-by: Shenghui Wang <shhuiw@foxmail.com> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
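The cleanup pattern described in this commit can be sketched in user space. This is a minimal illustration, not the bcache code: struct replay_entry, push_entry(), free_entries(), run_sketch() and the live_entries counter are all hypothetical names, and a plain singly linked list stands in for the kernel's list_head.

```c
#include <stdlib.h>

/* Hypothetical stand-in for a journal_replay list entry. */
struct replay_entry {
	struct replay_entry *next;
};

/* Counts entries currently allocated, so leaks are observable. */
static int live_entries;

/* Prepend one entry; on allocation failure the list is left unchanged. */
static struct replay_entry *push_entry(struct replay_entry *head)
{
	struct replay_entry *e = malloc(sizeof(*e));

	if (!e)
		return head;
	e->next = head;
	live_entries++;
	return e;
}

/* Release every entry on the list. */
static void free_entries(struct replay_entry *head)
{
	while (head) {
		struct replay_entry *next = head->next;

		free(head);
		live_entries--;
		head = next;
	}
}

/* Fill a list, then either consume it or bail out through the err label. */
static int run_sketch(int fail)
{
	struct replay_entry *journal = NULL;
	int i;

	for (i = 0; i < 3; i++)
		journal = push_entry(journal);

	if (fail)
		goto err;

	/* Success path: the consumer releases the list when it is done. */
	free_entries(journal);
	return 0;
err:
	/* Error path: without this call the collected entries would leak. */
	free_entries(journal);
	return -1;
}
```

Placing the release right after the err label means every early exit that jumps there frees the list, which is the simplification mentioned in the v1 -> v2 note.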
2019-04-24bcache: improve bcache_reboot()Coly Li
This patch tries to release the mutex bch_register_lock early, to give the cache set and bcache devices a chance to stop early. This patch also extends the timeout for stopping all bcache devices from 2 seconds to 10 seconds, because stopping the writeback rate update worker may take up to 5 seconds, so 2 seconds is not enough. After this patch is applied, failing to stop all bcache devices in time during system reboot or shutdown is very hard to observe any more. Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-24bcache: add comments for closure_fn to be called in closure_queue()Coly Li
Add code comments to explain which callback function might be called for each closure_queue(). This is an effort to make the code more understandable for readers. Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-24bcache: Add comments for blkdev_put() in registration code pathColy Li
Add comments to explain why blkdev_put() won't be called in two locations in register_bcache(). Add comments to explain why blkdev_put() must be called in register_cache() when cache_alloc() fails. Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-24bcache: add error check for calling register_bdev()Coly Li
This patch adds return value to register_bdev(). Then if failure happens inside register_bdev(), its caller register_bcache() may detect and handle the failure more properly. Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-24bcache: add comments for kobj release callback routineColy Li
Bcache has several routines that release resources implicitly; they are called when the associated kobj is released. This patch adds code comments to note when and which release callback will be called, - When dc->disk.kobj is released: void bch_cached_dev_release(struct kobject *kobj) - When d->kobj is released: void bch_flash_dev_release(struct kobject *kobj) - When c->kobj is released: void bch_cache_set_release(struct kobject *kobj) - When ca->kobj is released: void bch_cache_release(struct kobject *kobj) Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-24bcache: add failure check to run_cache_set() for journal replayColy Li
Currently run_cache_set() has no return value. If there is a failure in bch_journal_replay(), the caller of run_cache_set() has no idea about it and just continues to execute the code following run_cache_set(). The internal failure is triggered inside bch_journal_replay() and handled asynchronously. This behavior is inefficient: while the failure is being handled inside bch_journal_replay(), the cache register code is still running to start the cache set. Registering and unregistering code running at the same time may introduce rare race conditions, and makes the code harder to understand. This patch adds a return value to run_cache_set(), and returns -EIO if bch_journal_replay() fails. Then the caller of run_cache_set() can detect such a failure and stop the registering code flow immediately inside register_cache_set(). This patch makes the failure handling for bch_journal_replay() synchronous, which is easier to understand and debug, and avoids a potential race condition between register and unregister running at the same time. Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
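The synchronous error propagation described here can be sketched as follows. This is a simplified user-space model under stated assumptions, not the bcache implementation: replay_fn, replay_ok(), replay_fail(), run_cache_set_sketch() and register_cache_set_sketch() are hypothetical names, and the replay step is reduced to a function returning 0 or a negative value.

```c
#include <errno.h>

/* Hypothetical stand-in for the journal replay step: 0 on success. */
typedef int (*replay_fn)(void);

static int replay_ok(void)
{
	return 0;
}

static int replay_fail(void)
{
	return -1;
}

/* Report a replay failure to the caller instead of handling it elsewhere. */
static int run_cache_set_sketch(replay_fn replay)
{
	if (replay() != 0)
		return -EIO;
	return 0;
}

/* The caller stops the registration flow as soon as an error is returned. */
static int register_cache_set_sketch(replay_fn replay, int *started)
{
	int ret = run_cache_set_sketch(replay);

	if (ret) {
		*started = 0;
		return ret;
	}
	*started = 1;
	return 0;
}
```

With a return value in place, the caller never continues past a failed replay, so no registration code runs concurrently with the failure handling.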
2019-04-24bcache: fix a race between cache register and cacheset unregisterLiang Chen
There is a race between cache device register and cache set unregister. For an already registered cache device, register_bcache will call bch_is_open to iterate through all cachesets and check every cache there. The race occurs if cache_set_free executes at the same time and clears the caches right before ca is dereferenced in bch_is_open_cache. To close the race, let's make sure the clean up work is protected by the bch_register_lock as well. This issue can be reproduced as follows, while true; do echo /dev/XXX> /sys/fs/bcache/register ; done& while true; do echo 1> /sys/block/XXX/bcache/set/unregister ; done & and results in the following oops, [ +0.000053] BUG: unable to handle kernel NULL pointer dereference at 0000000000000998 [ +0.000457] #PF error: [normal kernel read fault] [ +0.000464] PGD 800000003ca9d067 P4D 800000003ca9d067 PUD 3ca9c067 PMD 0 [ +0.000388] Oops: 0000 [#1] SMP PTI [ +0.000269] CPU: 1 PID: 3266 Comm: bash Not tainted 5.0.0+ #6 [ +0.000346] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.fc28 04/01/2014 [ +0.000472] RIP: 0010:register_bcache+0x1829/0x1990 [bcache] [ +0.000344] Code: b0 48 83 e8 50 48 81 fa e0 e1 10 c0 0f 84 a9 00 00 00 48 89 c6 48 89 ca 0f b7 ba 54 04 00 00 4c 8b 82 60 0c 00 00 85 ff 74 2f <49> 3b a8 98 09 00 00 74 4e 44 8d 47 ff 31 ff 49 c1 e0 03 eb 0d [ +0.000839] RSP: 0018:ffff92ee804cbd88 EFLAGS: 00010202 [ +0.000328] RAX: ffffffffc010e190 RBX: ffff918b5c6b5000 RCX: ffff918b7d8e0000 [ +0.000399] RDX: ffff918b7d8e0000 RSI: ffffffffc010e190 RDI: 0000000000000001 [ +0.000398] RBP: ffff918b7d318340 R08: 0000000000000000 R09: ffffffffb9bd2d7a [ +0.000385] R10: ffff918b7eb253c0 R11: ffffb95980f51200 R12: ffffffffc010e1a0 [ +0.000411] R13: fffffffffffffff2 R14: 000000000000000b R15: ffff918b7e232620 [ +0.000384] FS: 00007f955bec2740(0000) GS:ffff918b7eb00000(0000) knlGS:0000000000000000 [ +0.000420] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ +0.000801] CR2: 0000000000000998 CR3: 000000003cad6000 
CR4: 00000000001406e0 [ +0.000837] Call Trace: [ +0.000682] ? _cond_resched+0x10/0x20 [ +0.000691] ? __kmalloc+0x131/0x1b0 [ +0.000710] kernfs_fop_write+0xfa/0x170 [ +0.000733] __vfs_write+0x2e/0x190 [ +0.000688] ? inode_security+0x10/0x30 [ +0.000698] ? selinux_file_permission+0xd2/0x120 [ +0.000752] ? security_file_permission+0x2b/0x100 [ +0.000753] vfs_write+0xa8/0x1a0 [ +0.000676] ksys_write+0x4d/0xb0 [ +0.000699] do_syscall_64+0x3a/0xf0 [ +0.000692] entry_SYSCALL_64_after_hwframe+0x44/0xa9 Signed-off-by: Liang Chen <liangchen.linux@gmail.com> Cc: stable@vger.kernel.org Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-24bcache: use kmemdup_nul for CACHED_LABEL bufferGeliang Tang
This patch uses kmemdup_nul to create a NUL-terminated string from dc->sb.label. This is better than open coding it. With this, we can move env[2] initialization into env[] array to make code more elegant. Signed-off-by: Geliang Tang <geliangtang@gmail.com> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
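For readers unfamiliar with kmemdup_nul(), the idea is to copy a fixed-length buffer that may lack a terminator into a freshly allocated, NUL-terminated string. Below is a minimal user-space sketch of that idea; memdup_nul() is a hypothetical name built on malloc() and memcpy(), not the kernel function itself.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Copy len bytes from src into a new buffer and append a NUL byte.
 * Returns NULL if the allocation fails; the caller frees the result.
 */
static char *memdup_nul(const char *src, size_t len)
{
	char *p = malloc(len + 1);

	if (!p)
		return NULL;
	memcpy(p, src, len);
	p[len] = '\0';
	return p;
}
```

Given a four-byte label buffer holding the bytes d, i, s, k with no terminator, memdup_nul(label, 4) yields a string that is safe to pass to functions expecting NUL termination.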
2019-02-09bcache: fix indentation issue, remove tabs on a hunk of codeColin Ian King
There is a hunk of code that is indented one level too deep, fix this by removing the extra tabs. Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-13bcache: make cutoff_writeback and cutoff_writeback_sync tunableColy Li
Currently the cutoff writeback and cutoff writeback sync thresholds are defined by CUTOFF_WRITEBACK (40) and CUTOFF_WRITEBACK_SYNC (70) as static values. Most of the time they work fine, but when people want to do research on bcache writeback mode performance tuning, there is no way to modify the soft and hard cutoff writeback values. This patch introduces two module parameters bch_cutoff_writeback_sync and bch_cutoff_writeback which permit people to tune the values when loading bcache.ko. If they are not specified at module loading, the current values CUTOFF_WRITEBACK_SYNC and CUTOFF_WRITEBACK will be used as defaults and nothing changes. When people want to tune these two values, - cutoff_writeback can be set in range [1, 70] - cutoff_writeback_sync can be set in range [1, 90] - cutoff_writeback always <= cutoff_writeback_sync The default values are strongly recommended for most users and most workloads. Anyway, if people want to take their own risk to do research on new writeback cutoff tuning for their own workload, now they can do it. Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
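The three constraints listed above can be expressed as a small validation routine. This is a sketch under assumptions, not the patch's code: resolve_cutoffs(), struct cutoffs and the *_MAX macro names are hypothetical, 0 is assumed to mean "not specified", and out-of-range input is simply rejected here, whereas the real module may handle it differently (for example by adjusting the value).

```c
#include <stdbool.h>

#define CUTOFF_WRITEBACK          40 /* default from the commit message */
#define CUTOFF_WRITEBACK_SYNC     70 /* default from the commit message */
#define CUTOFF_WRITEBACK_MAX      70 /* hypothetical name for the upper bound */
#define CUTOFF_WRITEBACK_SYNC_MAX 90 /* hypothetical name for the upper bound */

struct cutoffs {
	unsigned int writeback;
	unsigned int writeback_sync;
};

/*
 * Resolve the effective thresholds. A value of 0 selects the default
 * (an assumption of this sketch). Returns false, leaving *out untouched,
 * if a value is out of range or writeback exceeds writeback_sync.
 */
static bool resolve_cutoffs(unsigned int wb, unsigned int wb_sync,
			    struct cutoffs *out)
{
	if (wb == 0)
		wb = CUTOFF_WRITEBACK;
	if (wb_sync == 0)
		wb_sync = CUTOFF_WRITEBACK_SYNC;

	if (wb > CUTOFF_WRITEBACK_MAX || wb_sync > CUTOFF_WRITEBACK_SYNC_MAX)
		return false;
	if (wb > wb_sync)
		return false;

	out->writeback = wb;
	out->writeback_sync = wb_sync;
	return true;
}
```

Checking the ordering constraint after the defaults are applied means a lone cutoff_writeback_sync below 40 is also rejected in this sketch, since the default cutoff_writeback would then exceed it.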
2018-12-13bcache: add MODULE_DESCRIPTION informationColy Li
This patch moves MODULE_AUTHOR and MODULE_LICENSE to the end of super.c, and adds MODULE_DESCRIPTION("Bcache: a Linux block layer cache"). This is preparation for adding module parameters. Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-13bcache: do not check if debug dentry is ERR or NULL explicitly on removeShenghui Wang
debugfs_remove and debugfs_remove_recursive will check if the dentry pointer is NULL or ERR, and will do nothing in that case. Remove the check in cache_set_free and bch_debug_init. Signed-off-by: Shenghui Wang <shhuiw@foxmail.com> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-10-08bcache: panic fix for making cache deviceDongbo Cao
When the nbuckets of a cache device is smaller than 1024, making the cache device will trigger a BUG_ON in the kernel. Add a condition to avoid this. Reported-by: nitroxis <n@nxs.re> Signed-off-by: Dongbo Cao <cdbdyx@163.com> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-10-08bcache: split combined if-condition code into separate onesDongbo Cao
Split the combined '||' statements in if() check, to make the code easier for debug. Signed-off-by: Dongbo Cao <cdbdyx@163.com> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-10-08bcache: remove useless parameter of bch_debug_init()Dongbo Cao
Parameter "struct kobject *kobj" in bch_debug_init() is useless, remove it in this patch. Signed-off-by: Dongbo Cao <cdbdyx@163.com> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>