Age | Commit message (Collapse) | Author |
|
Also, rename device_synchronous to device_dax_synchronous.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
If a device doesn't support DAX its 'dax_dev' is NULL. Fix
device_synchronous() to first check if dax_dev is NULL before
dereferencing it.
Fixes: 2e9ee0955d3c ("dm: enable synchronous dax")
Reported-by: jencce.kernel@gmail.com
Signed-off-by: Pankaj Gupta <pagupta@redhat.com>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
Pull block fixes from Jens Axboe:
- Several io_uring fixes/improvements:
- Blocking fix for O_DIRECT (me)
- Latter page slowness for registered buffers (me)
- Fix poll hang under certain conditions (me)
- Defer sequence check fix for wrapped rings (Zhengyuan)
- Mismatch in async inc/dec accounting (Zhengyuan)
- Memory ordering issue that could cause stall (Zhengyuan)
- Track sequential defer in bytes, not pages (Zhengyuan)
- NVMe pull request from Christoph
- Set of hang fixes for wbt (Josef)
- Redundant error message kill for libahci (Ding)
- Remove unused blk_mq_sched_started_request() and related ops (Marcos)
- drbd dynamic alloc shash descriptor to reduce stack use (Arnd)
- blkcg ->pd_stat() non-debug print (Tejun)
- bcache memory leak fix (Wei)
- Comment fix (Akinobu)
- BFQ perf regression fix (Paolo)
* tag 'for-linus-20190726' of git://git.kernel.dk/linux-block: (24 commits)
io_uring: ensure ->list is initialized for poll commands
Revert "nvme-pci: don't create a read hctx mapping without read queues"
nvme: fix multipath crash when ANA is deactivated
nvme: fix memory leak caused by incorrect subsystem free
nvme: ignore subnqn for ADATA SX6000LNP
drbd: dynamically allocate shash descriptor
block: blk-mq: Remove blk_mq_sched_started_request and started_request
bcache: fix possible memory leak in bch_cached_dev_run()
io_uring: track io length in async_list based on bytes
io_uring: don't use iov_iter_advance() for fixed buffers
block: properly handle IOCB_NOWAIT for async O_DIRECT IO
blk-mq: allow REQ_NOWAIT to return an error inline
io_uring: add a memory barrier before atomic_read
rq-qos: use a mb for got_token
rq-qos: set ourself TASK_UNINTERRUPTIBLE after we schedule
rq-qos: don't reset has_sleepers on spurious wakeups
rq-qos: fix missed wake-ups in rq_qos_throttle
wait: add wq_has_single_sleeper helper
block, bfq: check also in-flight I/O in dispatch plugging
block: fix sysfs module parameters directory path in comment
...
|
|
memory malloced in bch_cached_dev_run() and should be freed before
leaving from the error handling cases, otherwise it will cause
memory leak.
Fixes: 0b13efecf5f2 ("bcache: add return value check to bch_cached_dev_run()")
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull more device mapper updates from Mike Snitzer:
- Fix zone state management race in DM zoned target by eliminating the
unnecessary DMZ_ACTIVE state.
- A couple fixes for issues the DM snapshot target's optional discard
support added during first week of the 5.3 merge.
- Increase default size of outstanding IO that is allowed for a each
dm-kcopyd client and introduce tunable to allow user adjust.
- Update DM core to use printk ratelimiting functions rather than
duplicate them and in doing so fix an issue where DMDEBUG_LIMIT()
rate limited KERN_DEBUG messages had excessive "callbacks suppressed"
messages.
* tag 'for-5.3/dm-changes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm: use printk ratelimiting functions
dm kcopyd: Increase default sub-job size to 512KB
dm snapshot: fix oversights in optional discard support
dm zoned: fix zone state management race
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm updates from Dan Williams:
"Primarily just the virtio_pmem driver:
- virtio_pmem
The new virtio_pmem facility introduces a paravirtualized
persistent memory device that allows a guest VM to use DAX
mechanisms to access a host-file with host-page-cache. It arranges
for MAP_SYNC to be disabled and instead triggers a host fsync()
when a 'write-cache flush' command is sent to the virtual disk
device.
- Miscellaneous small fixups"
* tag 'libnvdimm-for-5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
virtio_pmem: fix sparse warning
xfs: disable map_sync for async flush
ext4: disable map_sync for async flush
dax: check synchronous mapping is supported
dm: enable synchronous dax
libnvdimm: add dax_dev sync flag
virtio-pmem: Add virtio pmem driver
libnvdimm: nd_region flush callback support
libnvdimm, namespace: Drop uuid_t implementation detail
|
|
Currently, kcopyd has a sub-job size of 64KB and a maximum number of 8
sub-jobs. As a result, for any kcopyd job, we have a maximum of 512KB of
I/O in flight.
This upper limit to the amount of in-flight I/O under-utilizes fast
devices and results in decreased throughput, e.g., when writing to a
snapshotted thin LV with I/O size less than the pool's block size (so
COW is performed using kcopyd).
Increase kcopyd's default sub-job size to 512KB, so we have a maximum of
4MB of I/O in flight for each kcopyd job. This results in an up to 96%
improvement of bandwidth when writing to a snapshotted thin LV, with I/O
sizes less than the pool's block size.
Also, add dm_mod.kcopyd_subjob_size_kb module parameter to allow users
to fine tune the sub-job size of kcopyd. The default value of this
parameter is 512KB and the maximum allowed value is 1024KB.
We evaluate the performance impact of the change by running the
snap_breaking_throughput benchmark, from the device mapper test suite
[1].
The benchmark:
1. Creates a 1G thin LV
2. Provisions the thin LV
3. Takes a snapshot of the thin LV
4. Writes to the thin LV with:
dd if=/dev/zero of=/dev/vg/thin_lv oflag=direct bs=<I/O size>
Running this benchmark with various thin pool block sizes and dd I/O
sizes (all combinations triggering the use of kcopyd) we get the
following results:
+-----------------+-------------+------------------+-----------------+
| Pool block size | dd I/O size | BW before (MB/s) | BW after (MB/s) |
+-----------------+-------------+------------------+-----------------+
| 1 MB | 256 KB | 242 | 280 |
| 1 MB | 512 KB | 238 | 295 |
| | | | |
| 2 MB | 256 KB | 238 | 354 |
| 2 MB | 512 KB | 241 | 380 |
| 2 MB | 1 MB | 245 | 394 |
| | | | |
| 4 MB | 256 KB | 248 | 412 |
| 4 MB | 512 KB | 234 | 432 |
| 4 MB | 1 MB | 251 | 474 |
| 4 MB | 2 MB | 257 | 504 |
| | | | |
| 8 MB | 256 KB | 239 | 420 |
| 8 MB | 512 KB | 256 | 431 |
| 8 MB | 1 MB | 264 | 467 |
| 8 MB | 2 MB | 264 | 502 |
| 8 MB | 4 MB | 281 | 537 |
+-----------------+-------------+------------------+-----------------+
[1] https://github.com/jthornber/device-mapper-test-suite
Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
__find_snapshots_sharing_cow() should always be used with _origins_lock
held so fix snapshot_io_hints() accordingly. Also, once a snapshot is
being merged discards must not be allowed -- otherwise incorrect or
duplicate work will be performed.
Fixes: 2e6023850e177d ("dm snapshot: add optional discard support features")
Reported-by: Nikos Tsironis <ntsironis@arrikto.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
dm-zoned uses the zone flag DMZ_ACTIVE to indicate that a zone of the
backend device is being actively read or written and so cannot be
reclaimed. This flag is set as long as the zone atomic reference
counter is not 0. When this atomic is decremented and reaches 0 (e.g.
on BIO completion), the active flag is cleared and set again whenever
the zone is reused and BIO issued with the atomic counter incremented.
These 2 operations (atomic inc/dec and flag set/clear) are however not
always executed atomically under the target metadata mutex lock and
this causes the warning:
WARN_ON(!test_bit(DMZ_ACTIVE, &zone->flags));
in dmz_deactivate_zone() to be displayed. This problem is regularly
triggered with xfstests generic/209, generic/300, generic/451 and
xfs/077 with XFS being used as the file system on the dm-zoned target
device. Similarly, xfstests ext4/303, ext4/304, generic/209 and
generic/300 trigger the warning with ext4 use.
This problem can be easily fixed by simply removing the DMZ_ACTIVE flag
and managing the "ACTIVE" state by directly looking at the reference
counter value. To do so, the functions dmz_activate_zone() and
dmz_deactivate_zone() are changed to inline functions respectively
calling atomic_inc() and atomic_dec(), while the dmz_is_active() macro
is changed to an inline function calling atomic_read().
Fixes: 3b1a94c88b79 ("dm zoned: drive-managed zoned block device target")
Cc: stable@vger.kernel.org
Reported-by: Masato Suzuki <masato.suzuki@wdc.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media
Pull rst conversion of docs from Mauro Carvalho Chehab:
"As agreed with Jon, I'm sending this big series directly to you, c/c
him, as this series required a special care, in order to avoid
conflicts with other trees"
* tag 'docs/v5.3-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media: (77 commits)
docs: kbuild: fix build with pdf and fix some minor issues
docs: block: fix pdf output
docs: arm: fix a breakage with pdf output
docs: don't use nested tables
docs: gpio: add sysfs interface to the admin-guide
docs: locking: add it to the main index
docs: add some directories to the main documentation index
docs: add SPDX tags to new index files
docs: add a memory-devices subdir to driver-api
docs: phy: place documentation under driver-api
docs: serial: move it to the driver-api
docs: driver-api: add remaining converted dirs to it
docs: driver-api: add xilinx driver API documentation
docs: driver-api: add a series of orphaned documents
docs: admin-guide: add a series of orphaned documents
docs: cgroup-v1: add it to the admin-guide book
docs: aoe: add it to the driver-api book
docs: add some documentation dirs to the driver-api book
docs: driver-model: move it to the driver-api book
docs: lp855x-driver.rst: add it to the driver-api book
...
|
|
Pull more block updates from Jens Axboe:
"A later pull request with some followup items. I had some vacation
coming up to the merge window, so certain things items were delayed a
bit. This pull request also contains fixes that came in within the
last few days of the merge window, which I didn't want to push right
before sending you a pull request.
This contains:
- NVMe pull request, mostly fixes, but also a few minor items on the
feature side that were timing constrained (Christoph et al)
- Report zones fixes (Damien)
- Removal of dead code (Damien)
- Turn on cgroup psi memstall (Josef)
- block cgroup MAINTAINERS entry (Konstantin)
- Flush init fix (Josef)
- blk-throttle low iops timing fix (Konstantin)
- nbd resize fixes (Mike)
- nbd 0 blocksize crash fix (Xiubo)
- block integrity error leak fix (Wenwen)
- blk-cgroup writeback and priority inheritance fixes (Tejun)"
* tag 'for-linus-20190715' of git://git.kernel.dk/linux-block: (42 commits)
MAINTAINERS: add entry for block io cgroup
null_blk: fixup ->report_zones() for !CONFIG_BLK_DEV_ZONED
block: Limit zone array allocation size
sd_zbc: Fix report zones buffer allocation
block: Kill gfp_t argument of blkdev_report_zones()
block: Allow mapping of vmalloc-ed buffers
block/bio-integrity: fix a memory leak bug
nvme: fix NULL deref for fabrics options
nbd: add netlink reconfigure resize support
nbd: fix crash when the blksize is zero
block: Disable write plugging for zoned block devices
block: Fix elevator name declaration
block: Remove unused definitions
nvme: fix regression upon hot device removal and insertion
blk-throttle: fix zero wait time for iops throttled group
block: Fix potential overflow in blk_report_zones()
blkcg: implement REQ_CGROUP_PUNT
blkcg, writeback: Implement wbc_blkcg_css()
blkcg, writeback: Add wbc->no_cgroup_owner
blkcg, writeback: Rename wbc_account_io() to wbc_account_cgroup_owner()
...
|
|
The DM support describes lots of aspects related to mapped
disk partitions from the userspace PoV.
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu
Pull percpu updates from Dennis Zhou:
"This includes changes to let percpu_ref release the backing percpu
memory earlier after it has been switched to atomic in cases where the
percpu ref is not revived.
This will help recycle percpu memory earlier in cases where the
refcounts are pinned for prolonged periods of time"
* 'for-5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu:
percpu_ref: release percpu memory early without PERCPU_REF_ALLOW_REINIT
md: initialize percpu refcounters using PERCU_REF_ALLOW_REINIT
io_uring: initialize percpu refcounters using PERCU_REF_ALLOW_REINIT
percpu_ref: introduce PERCPU_REF_ALLOW_REINIT flag
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper updates from Mike Snitzer:
- Add encrypted byte-offset initialization vector (eboiv) to DM crypt.
- Add optional discard features to DM snapshot which allow freeing
space from a DM device whose free space was exhausted.
- Various small improvements to use struct_size() and kzalloc().
- Fix to check if DM thin metadata is in fail_io mode before attempting
to update the superblock to set the needs_check flag. Otherwise the
DM thin-pool can hang.
- Fix DM bufio shrinker's potential for ABBA recursion deadlock with DM
thin provisioning on loop usecase.
* tag 'for-5.3/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm bufio: fix deadlock with loop device
dm snapshot: add optional discard support features
dm crypt: implement eboiv - encrypted byte-offset initialization vector
dm crypt: remove obsolete comment about plumb IV
dm crypt: wipe private IV struct after key invalid flag is set
dm integrity: use kzalloc() instead of kmalloc() + memset()
dm: update stale comment in end_clone_bio()
dm log writes: fix incorrect comment about the logged sequence example
dm log writes: use struct_size() to calculate size of pending_block
dm crypt: use struct_size() when allocating encryption context
dm integrity: always set version on superblock update
dm thin metadata: check if in fail_io mode when setting needs_check
|
|
When thin-volume is built on loop device, if available memory is low,
the following deadlock can be triggered:
One process P1 allocates memory with GFP_FS flag, direct alloc fails,
memory reclaim invokes memory shrinker in dm_bufio, dm_bufio_shrink_scan()
runs, mutex dm_bufio_client->lock is acquired, then P1 waits for dm_buffer
IO to complete in __try_evict_buffer().
But this IO may never complete if issued to an underlying loop device
that forwards it using direct-IO, which allocates memory using
GFP_KERNEL (see: do_blockdev_direct_IO()). If allocation fails, memory
reclaim will invoke memory shrinker in dm_bufio, dm_bufio_shrink_scan()
will be invoked, and since the mutex is already held by P1 the loop
thread will hang, and IO will never complete. Resulting in ABBA
deadlock.
Cc: stable@vger.kernel.org
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
discard_zeroes_cow - a discard issued to the snapshot device that maps
to entire chunks to will zero the corresponding exception(s) in the
snapshot's exception store.
discard_passdown_origin - a discard to the snapshot device is passed down
to the snapshot-origin's underlying device. This doesn't cause copy-out
to the snapshot exception store because the snapshot-origin target is
bypassed.
The discard_passdown_origin feature depends on the discard_zeroes_cow
feature being enabled.
When these 2 features are enabled they allow a temporarily read-only
device that has completely exhausted its free space to recover space.
To do so dm-snapshot provides temporary buffer to accommodate writes
that the temporarily read-only device cannot handle yet. Once the upper
layer frees space (e.g. fstrim to XFS) the discards issued to the
dm-snapshot target will be issued to underlying read-only device whose
free space was exhausted. In addition those discards will also cause
zeroes to be written to the snapshot exception store if corresponding
exceptions exist. If the underlying origin device provides
deduplication for zero blocks then if/when the snapshot is merged backed
to the origin those blocks will become unused. Once the origin has
gained adequate space, merging the snapshot back to the thinly
provisioned device will permit continued use of that device without the
temporary space provided by the snapshot.
Requested-by: John Dorminy <jdorminy@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
Only GFP_KERNEL and GFP_NOIO are used with blkdev_report_zones(). In
preparation of using vmalloc() for large report buffer and zone array
allocations used by this function, remove its "gfp_t gfp_mask" argument
and rely on the caller context to use memalloc_noio_save/restore() where
necessary (block layer zone revalidation and dm-zoned I/O error path).
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs"
This reverts merge 0f75ef6a9cff49ff612f7ce0578bced9d0b38325 (and thus
effectively commits
7a1ade847596 ("keys: Provide KEYCTL_GRANT_PERMISSION")
2e12256b9a76 ("keys: Replace uid/gid/perm permissions checking with an ACL")
that the merge brought in).
It turns out that it breaks booting with an encrypted volume, and Eric
biggers reports that it also breaks the fscrypt tests [1] and loading of
in-kernel X.509 certificates [2].
The root cause of all the breakage is likely the same, but David Howells
is off email so rather than try to work it out it's getting reverted in
order to not impact the rest of the merge window.
[1] https://lore.kernel.org/lkml/20190710011559.GA7973@sol.localdomain/
[2] https://lore.kernel.org/lkml/20190710013225.GB7973@sol.localdomain/
Link: https://lore.kernel.org/lkml/CAHk-=wjxoeMJfeBahnWH=9zShKp2bsVy527vo3_y8HfOdhwAAw@mail.gmail.com/
Reported-by: Eric Biggers <ebiggers@kernel.org>
Cc: David Howells <dhowells@redhat.com>
Cc: James Morris <jmorris@namei.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Pull Documentation updates from Jonathan Corbet:
"It's been a relatively busy cycle for docs:
- A fair pile of RST conversions, many from Mauro. These create more
than the usual number of simple but annoying merge conflicts with
other trees, unfortunately. He has a lot more of these waiting on
the wings that, I think, will go to you directly later on.
- A new document on how to use merges and rebases in kernel repos,
and one on Spectre vulnerabilities.
- Various improvements to the build system, including automatic
markup of function() references because some people, for reasons I
will never understand, were of the opinion that
:c:func:``function()`` is unattractive and not fun to type.
- We now recommend using sphinx 1.7, but still support back to 1.4.
- Lots of smaller improvements, warning fixes, typo fixes, etc"
* tag 'docs-5.3' of git://git.lwn.net/linux: (129 commits)
docs: automarkup.py: ignore exceptions when seeking for xrefs
docs: Move binderfs to admin-guide
Disable Sphinx SmartyPants in HTML output
doc: RCU callback locks need only _bh, not necessarily _irq
docs: format kernel-parameters -- as code
Doc : doc-guide : Fix a typo
platform: x86: get rid of a non-existent document
Add the RCU docs to the core-api manual
Documentation: RCU: Add TOC tree hooks
Documentation: RCU: Rename txt files to rst
Documentation: RCU: Convert RCU UP systems to reST
Documentation: RCU: Convert RCU linked list to reST
Documentation: RCU: Convert RCU basic concepts to reST
docs: filesystems: Remove uneeded .rst extension on toctables
scripts/sphinx-pre-install: fix out-of-tree build
docs: zh_CN: submitting-drivers.rst: Remove a duplicated Documentation/
Documentation: PGP: update for newer HW devices
Documentation: Add section about CPU vulnerabilities for Spectre
Documentation: platform: Delete x86-laptop-drivers.txt
docs: Note that :c:func: should no longer be used
...
|
|
This IV is used in some BitLocker devices with CBC encryption mode.
IV is encrypted little-endian byte-offset (with the same key and cipher
as the volume).
Signed-off-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
The URL is no longer valid and the comment is obsolete anyway
(the plumb IV was never used).
Signed-off-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
If a private IV wipe function fails, the code does not set the key
invalid flag. To fix this, move code to after the flag is set to
prevent the device from resuming in an inconsistent state.
Also, this allows using of a randomized key in private wipe function
(to be used in a following commit).
Signed-off-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
Signed-off-by: Fuqian Huang <huangfq.daxian@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
Since commit a1ce35fa49852db60fc6e268 ("block: remove dead elevator
code") blk_end_request() has been replaced with blk_mq_end_request().
So update comment to reference blk_mq_end_request() accordingly.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
dm-log-writes records the sequence of completion, not submission, thus
for the following sequence (W=write, C=complete):
Wa,Wb,Wc,Cc,Ca,FLUSH,FUAd,Cb,CFLUSH,CFUAd
The logged results in log device should be:
c,a,b,flush,fua
Fix the comment to give a better example.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
Use struct_size() to avoid open-coded equivalent that is prone to a type
mistake.
Signed-off-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
Use struct_size() to avoid open-coded equivalent that is prone to a type
mistake.
Signed-off-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
The new integrity bitmap mode uses the dirty flag. The dirty flag
should not be set in older superblock versions.
The current code sets it unconditionally, even if the superblock
was already formatted without bitmap in older system.
Fix this by moving the version check to one common place and check
version on every superblock write.
Signed-off-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
Pull block updates from Jens Axboe:
"This is the main block updates for 5.3. Nothing earth shattering or
major in here, just fixes, additions, and improvements all over the
map. This contains:
- Series of documentation fixes (Bart)
- Optimization of the blk-mq ctx get/put (Bart)
- null_blk removal race condition fix (Bob)
- req/bio_op() cleanups (Chaitanya)
- Series cleaning up the segment accounting, and request/bio mapping
(Christoph)
- Series cleaning up the page getting/putting for bios (Christoph)
- block cgroup cleanups and moving it to where it is used (Christoph)
- block cgroup fixes (Tejun)
- Series of fixes and improvements to bcache, most notably a write
deadlock fix (Coly)
- blk-iolatency STS_AGAIN and accounting fixes (Dennis)
- Series of improvements and fixes to BFQ (Douglas, Paolo)
- debugfs_create() return value check removal for drbd (Greg)
- Use struct_size(), where appropriate (Gustavo)
- Two lighnvm fixes (Heiner, Geert)
- MD fixes, including a read balance and corruption fix (Guoqing,
Marcos, Xiao, Yufen)
- block opal shadow mbr additions (Jonas, Revanth)
- sbitmap compare-and-exhange improvemnts (Pavel)
- Fix for potential bio->bi_size overflow (Ming)
- NVMe pull requests:
- improved PCIe suspent support (Keith Busch)
- error injection support for the admin queue (Akinobu Mita)
- Fibre Channel discovery improvements (James Smart)
- tracing improvements including nvmetc tracing support (Minwoo Im)
- misc fixes and cleanups (Anton Eidelman, Minwoo Im, Chaitanya
Kulkarni)"
- Various little fixes and improvements to drivers and core"
* tag 'for-5.3/block-20190708' of git://git.kernel.dk/linux-block: (153 commits)
blk-iolatency: fix STS_AGAIN handling
block: nr_phys_segments needs to be zero for REQ_OP_WRITE_ZEROES
blk-mq: simplify blk_mq_make_request()
blk-mq: remove blk_mq_put_ctx()
sbitmap: Replace cmpxchg with xchg
block: fix .bi_size overflow
block: sed-opal: check size of shadow mbr
block: sed-opal: ioctl for writing to shadow mbr
block: sed-opal: add ioctl for done-mark of shadow mbr
block: never take page references for ITER_BVEC
direct-io: use bio_release_pages in dio_bio_complete
block_dev: use bio_release_pages in bio_unmap_user
block_dev: use bio_release_pages in blkdev_bio_end_io
iomap: use bio_release_pages in iomap_dio_bio_end_io
block: use bio_release_pages in bio_map_user_iov
block: use bio_release_pages in bio_unmap_user
block: optionally mark pages dirty in bio_release_pages
block: move the BIO_NO_PAGE_REF check into bio_release_pages
block: skd_main.c: Remove call to memset after dma_alloc_coherent
block: mtip32xx: Remove call to memset after dma_alloc_coherent
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
Pull keyring ACL support from David Howells:
"This changes the permissions model used by keys and keyrings to be
based on an internal ACL by the following means:
- Replace the permissions mask internally with an ACL that contains a
list of ACEs, each with a specific subject with a permissions mask.
Potted default ACLs are available for new keys and keyrings.
ACE subjects can be macroised to indicate the UID and GID specified
on the key (which remain). Future commits will be able to add
additional subject types, such as specific UIDs or domain
tags/namespaces.
Also split a number of permissions to give finer control. Examples
include splitting the revocation permit from the change-attributes
permit, thereby allowing someone to be granted permission to revoke
a key without allowing them to change the owner; also the ability
to join a keyring is split from the ability to link to it, thereby
stopping a process accessing a keyring by joining it and thus
acquiring use of possessor permits.
- Provide a keyctl to allow the granting or denial of one or more
permits to a specific subject. Direct access to the ACL is not
granted, and the ACL cannot be viewed"
* tag 'keys-acl-20190703' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
keys: Provide KEYCTL_GRANT_PERMISSION
keys: Replace uid/gid/perm permissions checking with an ACL
|
|
This patch sets dax device 'DAXDEV_SYNC' flag if all the target
devices of device mapper support synchrononous DAX. If device
mapper consists of both synchronous and asynchronous dax devices,
we don't set 'DAXDEV_SYNC' flag.
'dm_table_supports_dax' is refactored to pass 'iterate_devices_fn'
as argument so that the callers can pass the appropriate functions.
Suggested-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Pankaj Gupta <pagupta@redhat.com>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
|
|
This patch adds 'DAXDEV_SYNC' flag which is set
for nd_region doing synchronous flush. This later
is used to disable MAP_SYNC functionality for
ext4 & xfs filesystem for devices don't support
synchronous flush.
Signed-off-by: Pankaj Gupta <pagupta@redhat.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
|
|
Check if in fail_io mode at start of dm_pool_metadata_set_needs_check().
Otherwise dm_pool_metadata_set_needs_check()'s superblock_lock() can
crash in dm_bm_write_lock() while accessing the block manager object
that was previously destroyed as part of a failed
dm_pool_abort_metadata() that ultimately set fail_io to begin with.
Also, update DMERR() message to more accurately describe
superblock_lock() failure.
Cc: stable@vger.kernel.org
Reported-by: Zdenek Kabelac <zkabelac@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
Merge 5.2-rc6 into for-5.3/block, so we get the same page merge leak
fix. Otherwise we end up having conflicts with future patches between
for-5.3/block and master that touch this area. In particular, it makes
the bio_full() fix hard to backport to stable.
* tag 'v5.2-rc6': (482 commits)
Linux 5.2-rc6
Revert "iommu/vt-d: Fix lock inversion between iommu->lock and device_domain_lock"
Bluetooth: Fix regression with minimum encryption key size alignment
tcp: refine memory limit test in tcp_fragment()
x86/vdso: Prevent segfaults due to hoisted vclock reads
SUNRPC: Fix a credential refcount leak
Revert "SUNRPC: Declare RPC timers as TIMER_DEFERRABLE"
net :sunrpc :clnt :Fix xps refcount imbalance on the error path
NFS4: Only set creation opendata if O_CREAT
ARM: 8867/1: vdso: pass --be8 to linker if necessary
KVM: nVMX: reorganize initial steps of vmx_set_nested_state
KVM: PPC: Book3S HV: Invalidate ERAT when flushing guest TLB entries
habanalabs: use u64_to_user_ptr() for reading user pointers
nfsd: replace Jeff by Chuck as nfsd co-maintainer
inet: clear num_timeout reqsk_alloc()
PCI/P2PDMA: Ignore root complex whitelist when an IOMMU is present
net: mvpp2: debugfs: Add pmap to fs dump
ipv6: Default fib6_type to RTN_UNICAST when not set
net: hns3: Fix inconsistent indenting
net/af_iucv: always register net_device notifier
...
|
|
Now we have counters for how many times jouranl is reclaimed, how many
times cached dirty btree nodes are flushed, but we don't know how many
jouranl buckets are really reclaimed.
This patch adds reclaimed_journal_buckets into struct cache_set, this
is an increasing only counter, to tell how many journal buckets are
reclaimed since cache set runs. From all these three counters (reclaim,
reclaimed_journal_buckets, flush_write), we can have idea how well
current journal space reclaim code works.
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
This patch improves performance for btree_flush_write() in following
ways,
- Use another spinlock journal.flush_write_lock to replace the very
hot journal.lock. We don't have to use journal.lock here, selecting
candidate btree nodes takes a lot of time, hold journal.lock here will
block other jouranling threads and drop the overall I/O performance.
- Only select flushing btree node from c->btree_cache list. When the
machine has a large system memory, mca cache may have a huge number of
cached btree nodes. Iterating all the cached nodes will take a lot
of CPU time, and most of the nodes on c->btree_cache_freeable and
c->btree_cache_freed lists are cleared and have need to flush. So only
travel mca list c->btree_cache to select flushing btree node should be
enough for most of the cases.
- Don't iterate whole c->btree_cache list, only reversely select first
BTREE_FLUSH_NR btree nodes to flush. Iterate all btree nodes from
c->btree_cache and select the oldest journal pin btree nodes consumes
huge number of CPU cycles if the list is huge (push and pop a node
into/out of a heap is expensive). The last several dirty btree nodes
on the tail of c->btree_cache list are earlest allocated and cached
btree nodes, they are relative to the oldest journal pin btree nodes.
Therefore only flushing BTREE_FLUSH_NR btree nodes from tail of
c->btree_cache probably includes the oldest journal pin btree nodes.
In my testing, the above change decreases 50%+ CPU consumption when
journal space is full. Some times IOPS drops to 0 for 5-8 seconds,
comparing blocking I/O for 120+ seconds in previous code, this is much
better. Maybe there is room to improve in future, but at this momment
the fix looks fine and performs well in my testing.
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
There is a race between mca_reap(), btree_node_free() and journal code
btree_flush_write(), which results very rare and strange deadlock or
panic and are very hard to reproduce.
Let me explain how the race happens. In btree_flush_write() one btree
node with oldest journal pin is selected, then it is flushed to cache
device, the select-and-flush is a two steps operation. Between these two
steps, there are something may happen inside the race window,
- The selected btree node was reaped by mca_reap() and allocated to
other requesters for other btree node.
- The slected btree node was selected, flushed and released by mca
shrink callback bch_mca_scan().
When btree_flush_write() tries to flush the selected btree node, firstly
b->write_lock is held by mutex_lock(). If the race happens and the
memory of selected btree node is allocated to other btree node, if that
btree node's write_lock is held already, a deadlock very probably
happens here. A worse case is the memory of the selected btree node is
released, then all references to this btree node (e.g. b->write_lock)
will trigger NULL pointer deference panic.
This race was introduced in commit cafe56359144 ("bcache: A block layer
cache"), and enlarged by commit c4dc2497d50d ("bcache: fix high CPU
occupancy during journal"), which selected 128 btree nodes and flushed
them one-by-one in a quite long time period.
Such race is not easy to reproduce before. On a Lenovo SR650 server with
48 Xeon cores, and configure 1 NVMe SSD as cache device, a MD raid0
device assembled by 3 NVMe SSDs as backing device, this race can be
observed around every 10,000 times btree_flush_write() gets called. Both
deadlock and kernel panic all happened as aftermath of the race.
The idea of the fix is to add a btree flag BTREE_NODE_journal_flush. It
is set when selecting btree nodes, and cleared after btree nodes
flushed. Then when mca_reap() selects a btree node with this bit set,
this btree node will be skipped. Since mca_reap() only reaps btree node
without BTREE_NODE_journal_flush flag, such race is avoided.
Once corner case should be noticed, that is btree_node_free(). It might
be called in some error handling code path. For example the following
code piece from btree_split(),
2149 err_free2:
2150 bkey_put(b->c, &n2->key);
2151 btree_node_free(n2);
2152 rw_unlock(true, n2);
2153 err_free1:
2154 bkey_put(b->c, &n1->key);
2155 btree_node_free(n1);
2156 rw_unlock(true, n1);
At line 2151 and 2155, the btree node n2 and n1 are released without
mac_reap(), so BTREE_NODE_journal_flush also needs to be checked here.
If btree_node_free() is called directly in such error handling path,
and the selected btree node has BTREE_NODE_journal_flush bit set, just
delay for 1 us and retry again. In this case this btree node won't
be skipped, just retry until the BTREE_NODE_journal_flush bit cleared,
and free the btree node memory.
Fixes: cafe56359144 ("bcache: A block layer cache")
Signed-off-by: Coly Li <colyli@suse.de>
Reported-and-tested-by: kbuild test robot <lkp@intel.com>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
In struct cache_set, retry_flush_write is added for commit c4dc2497d50d
("bcache: fix high CPU occupancy during journal") which is reverted in
previous patch.
Now it is useless anymore, and this patch removes it from bcache code.
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
When accessing or modifying BTREE_NODE_dirty bit, it is not always
necessary to acquire b->write_lock. In bch_btree_cache_free() and
mca_reap() acquiring b->write_lock is necessary, and this patch adds
comments to explain why mutex_lock(&b->write_lock) is necessary for
checking or clearing BTREE_NODE_dirty bit there.
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
In bch_btree_cache_free() and btree_node_free(), BTREE_NODE_dirty is
always set no matter btree node is dirty or not. The code looks like
this,
if (btree_node_dirty(b))
btree_complete_write(b, btree_current_write(b));
clear_bit(BTREE_NODE_dirty, &b->flags);
Indeed if btree_node_dirty(b) returns false, it means BTREE_NODE_dirty
bit is cleared, then it is unnecessary to clear the bit again.
This patch only clears BTREE_NODE_dirty when btree_node_dirty(b) is
true (the bit is set), to save a few CPU cycles.
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
This reverts commit c4dc2497d50d9c6fb16aa0d07b6a14f3b2adb1e0.
This patch enlarges a race between normal btree flush code path and
flush_btree_write(), which causes deadlock when journal space is
exhausted. Reverts this patch makes the race window from 128 btree
nodes to only 1 btree nodes.
Fixes: c4dc2497d50d ("bcache: fix high CPU occupancy during journal")
Signed-off-by: Coly Li <colyli@suse.de>
Cc: stable@vger.kernel.org
Cc: Tang Junhui <tang.junhui.linux@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
This reverts commit 6268dc2c4703aabfb0b35681be709acf4c2826c6.
This patch depends on commit c4dc2497d50d ("bcache: fix high CPU
occupancy during journal") which is reverted in previous patch. So
revert this one too.
Fixes: 6268dc2c4703 ("bcache: free heap cache_set->flush_btree in bch_journal_free")
Signed-off-by: Coly Li <colyli@suse.de>
Cc: stable@vger.kernel.org
Cc: Shenghui Wang <shhuiw@foxmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
When cache set starts, bch_btree_check() will check all bkeys on cache
device by calculating the checksum. This operation will consume a huge
number of system memory if there are a lot of data cached. Since bcache
uses its own mca cache to maintain all its read-in btree nodes, and only
releases the cache space when system memory manage code starts to shrink
caches. Then before memory manager code to call the mca cache shrinker
callback, bcache mca cache will compete memory resource with user space
application, which may have nagive effect to performance of user space
workloads (e.g. data base, or I/O service of distributed storage node).
This patch tries to call bcache mca shrinker routine to proactively
release mca cache memory, to decrease the memory pressure of system and
avoid negative effort of the overall system I/O performance.
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
In journal_read_bucket() when setting ja->seq[bucket_index], there might
be potential case that a later non-maximum overwrites a better sequence
number to ja->seq[bucket_index]. This patch adds a check to make sure
that ja->seq[bucket_index] will be only set a new value if it is bigger
then current value.
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
This patch adds more code comments in journal_read_bucket(), this is an
effort to make the code to be more understandable.
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
When enable lockdep and reboot system with a writeback mode bcache
device, the following potential deadlock warning is reported by lockdep
engine.
[ 101.536569][ T401] kworker/2:2/401 is trying to acquire lock:
[ 101.538575][ T401] 00000000bbf6e6c7 ((wq_completion)bcache_writeback_wq){+.+.}, at: flush_workqueue+0x87/0x4c0
[ 101.542054][ T401]
[ 101.542054][ T401] but task is already holding lock:
[ 101.544587][ T401] 00000000f5f305b3 ((work_completion)(&cl->work)#2){+.+.}, at: process_one_work+0x21e/0x640
[ 101.548386][ T401]
[ 101.548386][ T401] which lock already depends on the new lock.
[ 101.548386][ T401]
[ 101.551874][ T401]
[ 101.551874][ T401] the existing dependency chain (in reverse order) is:
[ 101.555000][ T401]
[ 101.555000][ T401] -> #1 ((work_completion)(&cl->work)#2){+.+.}:
[ 101.557860][ T401] process_one_work+0x277/0x640
[ 101.559661][ T401] worker_thread+0x39/0x3f0
[ 101.561340][ T401] kthread+0x125/0x140
[ 101.562963][ T401] ret_from_fork+0x3a/0x50
[ 101.564718][ T401]
[ 101.564718][ T401] -> #0 ((wq_completion)bcache_writeback_wq){+.+.}:
[ 101.567701][ T401] lock_acquire+0xb4/0x1c0
[ 101.569651][ T401] flush_workqueue+0xae/0x4c0
[ 101.571494][ T401] drain_workqueue+0xa9/0x180
[ 101.573234][ T401] destroy_workqueue+0x17/0x250
[ 101.575109][ T401] cached_dev_free+0x44/0x120 [bcache]
[ 101.577304][ T401] process_one_work+0x2a4/0x640
[ 101.579357][ T401] worker_thread+0x39/0x3f0
[ 101.581055][ T401] kthread+0x125/0x140
[ 101.582709][ T401] ret_from_fork+0x3a/0x50
[ 101.584592][ T401]
[ 101.584592][ T401] other info that might help us debug this:
[ 101.584592][ T401]
[ 101.588355][ T401] Possible unsafe locking scenario:
[ 101.588355][ T401]
[ 101.590974][ T401] CPU0 CPU1
[ 101.592889][ T401] ---- ----
[ 101.594743][ T401] lock((work_completion)(&cl->work)#2);
[ 101.596785][ T401] lock((wq_completion)bcache_writeback_wq);
[ 101.600072][ T401] lock((work_completion)(&cl->work)#2);
[ 101.602971][ T401] lock((wq_completion)bcache_writeback_wq);
[ 101.605255][ T401]
[ 101.605255][ T401] *** DEADLOCK ***
[ 101.605255][ T401]
[ 101.608310][ T401] 2 locks held by kworker/2:2/401:
[ 101.610208][ T401] #0: 00000000cf2c7d17 ((wq_completion)events){+.+.}, at: process_one_work+0x21e/0x640
[ 101.613709][ T401] #1: 00000000f5f305b3 ((work_completion)(&cl->work)#2){+.+.}, at: process_one_work+0x21e/0x640
[ 101.617480][ T401]
[ 101.617480][ T401] stack backtrace:
[ 101.619539][ T401] CPU: 2 PID: 401 Comm: kworker/2:2 Tainted: G W 5.2.0-rc4-lp151.20-default+ #1
[ 101.623225][ T401] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
[ 101.627210][ T401] Workqueue: events cached_dev_free [bcache]
[ 101.629239][ T401] Call Trace:
[ 101.630360][ T401] dump_stack+0x85/0xcb
[ 101.631777][ T401] print_circular_bug+0x19a/0x1f0
[ 101.633485][ T401] __lock_acquire+0x16cd/0x1850
[ 101.635184][ T401] ? __lock_acquire+0x6a8/0x1850
[ 101.636863][ T401] ? lock_acquire+0xb4/0x1c0
[ 101.638421][ T401] ? find_held_lock+0x34/0xa0
[ 101.640015][ T401] lock_acquire+0xb4/0x1c0
[ 101.641513][ T401] ? flush_workqueue+0x87/0x4c0
[ 101.643248][ T401] flush_workqueue+0xae/0x4c0
[ 101.644832][ T401] ? flush_workqueue+0x87/0x4c0
[ 101.646476][ T401] ? drain_workqueue+0xa9/0x180
[ 101.648303][ T401] drain_workqueue+0xa9/0x180
[ 101.649867][ T401] destroy_workqueue+0x17/0x250
[ 101.651503][ T401] cached_dev_free+0x44/0x120 [bcache]
[ 101.653328][ T401] process_one_work+0x2a4/0x640
[ 101.655029][ T401] worker_thread+0x39/0x3f0
[ 101.656693][ T401] ? process_one_work+0x640/0x640
[ 101.658501][ T401] kthread+0x125/0x140
[ 101.660012][ T401] ? kthread_create_worker_on_cpu+0x70/0x70
[ 101.661985][ T401] ret_from_fork+0x3a/0x50
[ 101.691318][ T401] bcache: bcache_device_free() bcache0 stopped
Here is how the above potential deadlock may happen in reboot/shutdown
code path,
1) bcache_reboot() is called firstly in the reboot/shutdown code path,
then in bcache_reboot(), bcache_device_stop() is called.
2) bcache_device_stop() sets BCACHE_DEV_CLOSING on d->falgs, then call
closure_queue(&d->cl) to invoke cached_dev_flush(). And in turn
cached_dev_flush() calls cached_dev_free() via closure_at()
3) In cached_dev_free(), after stopped writebach kthread
dc->writeback_thread, the kwork dc->writeback_write_wq is stopping by
destroy_workqueue().
4) Inside destroy_workqueue(), drain_workqueue() is called. Inside
drain_workqueue(), flush_workqueue() is called. Then wq->lockdep_map
is acquired by lock_map_acquire() in flush_workqueue(). After the
lock acquired the rest part of flush_workqueue() just wait for the
workqueue to complete.
5) Now we look back at writeback thread routine bch_writeback_thread(),
in the main while-loop, write_dirty() is called via continue_at() in
read_dirty_submit(), which is called via continue_at() in while-loop
level called function read_dirty(). Inside write_dirty() it may be
re-called on workqueeu dc->writeback_write_wq via continue_at().
It means when the writeback kthread is stopped in cached_dev_free()
there might be still one kworker queued on dc->writeback_write_wq
to execute write_dirty() again.
6) Now this kworker is scheduled on dc->writeback_write_wq to run by
process_one_work() (which is called by worker_thread()). Before
calling the kwork routine, wq->lockdep_map is acquired.
7) But wq->lockdep_map is acquired already in step 4), so a A-A lock
(lockdep terminology) scenario happens.
Indeed on multiple cores syatem, the above deadlock is very rare to
happen, just as the code comments in process_one_work() says,
2263 * AFAICT there is no possible deadlock scenario between the
2264 * flush_work() and complete() primitives (except for
single-threaded
2265 * workqueues), so hiding them isn't a problem.
But it is still good to fix such lockdep warning, even no one running
bcache on single core system.
The fix is simple. This patch solves the above potential deadlock by,
- Do not destroy workqueue dc->writeback_write_wq in cached_dev_free().
- Flush and destroy dc->writeback_write_wq in writebach kthread routine
bch_writeback_thread(), where after quit the thread main while-loop
and before cached_dev_put() is called.
By this fix, dc->writeback_write_wq will be stopped and destroy before
the writeback kthread stopped, so the chance for a A-A locking on
wq->lockdep_map is disappeared, such A-A deadlock won't happen
any more.
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
When enable lockdep engine, a lockdep warning can be observed when
reboot or shutdown system,
[ 3142.764557][ T1] bcache: bcache_reboot() Stopping all devices:
[ 3142.776265][ T2649]
[ 3142.777159][ T2649] ======================================================
[ 3142.780039][ T2649] WARNING: possible circular locking dependency detected
[ 3142.782869][ T2649] 5.2.0-rc4-lp151.20-default+ #1 Tainted: G W
[ 3142.785684][ T2649] ------------------------------------------------------
[ 3142.788479][ T2649] kworker/3:67/2649 is trying to acquire lock:
[ 3142.790738][ T2649] 00000000aaf02291 ((wq_completion)bcache_writeback_wq){+.+.}, at: flush_workqueue+0x87/0x4c0
[ 3142.794678][ T2649]
[ 3142.794678][ T2649] but task is already holding lock:
[ 3142.797402][ T2649] 000000004fcf89c5 (&bch_register_lock){+.+.}, at: cached_dev_free+0x17/0x120 [bcache]
[ 3142.801462][ T2649]
[ 3142.801462][ T2649] which lock already depends on the new lock.
[ 3142.801462][ T2649]
[ 3142.805277][ T2649]
[ 3142.805277][ T2649] the existing dependency chain (in reverse order) is:
[ 3142.808902][ T2649]
[ 3142.808902][ T2649] -> #2 (&bch_register_lock){+.+.}:
[ 3142.812396][ T2649] __mutex_lock+0x7a/0x9d0
[ 3142.814184][ T2649] cached_dev_free+0x17/0x120 [bcache]
[ 3142.816415][ T2649] process_one_work+0x2a4/0x640
[ 3142.818413][ T2649] worker_thread+0x39/0x3f0
[ 3142.820276][ T2649] kthread+0x125/0x140
[ 3142.822061][ T2649] ret_from_fork+0x3a/0x50
[ 3142.823965][ T2649]
[ 3142.823965][ T2649] -> #1 ((work_completion)(&cl->work)#2){+.+.}:
[ 3142.827244][ T2649] process_one_work+0x277/0x640
[ 3142.829160][ T2649] worker_thread+0x39/0x3f0
[ 3142.830958][ T2649] kthread+0x125/0x140
[ 3142.832674][ T2649] ret_from_fork+0x3a/0x50
[ 3142.834915][ T2649]
[ 3142.834915][ T2649] -> #0 ((wq_completion)bcache_writeback_wq){+.+.}:
[ 3142.838121][ T2649] lock_acquire+0xb4/0x1c0
[ 3142.840025][ T2649] flush_workqueue+0xae/0x4c0
[ 3142.842035][ T2649] drain_workqueue+0xa9/0x180
[ 3142.844042][ T2649] destroy_workqueue+0x17/0x250
[ 3142.846142][ T2649] cached_dev_free+0x52/0x120 [bcache]
[ 3142.848530][ T2649] process_one_work+0x2a4/0x640
[ 3142.850663][ T2649] worker_thread+0x39/0x3f0
[ 3142.852464][ T2649] kthread+0x125/0x140
[ 3142.854106][ T2649] ret_from_fork+0x3a/0x50
[ 3142.855880][ T2649]
[ 3142.855880][ T2649] other info that might help us debug this:
[ 3142.855880][ T2649]
[ 3142.859663][ T2649] Chain exists of:
[ 3142.859663][ T2649] (wq_completion)bcache_writeback_wq --> (work_completion)(&cl->work)#2 --> &bch_register_lock
[ 3142.859663][ T2649]
[ 3142.865424][ T2649] Possible unsafe locking scenario:
[ 3142.865424][ T2649]
[ 3142.868022][ T2649] CPU0 CPU1
[ 3142.869885][ T2649] ---- ----
[ 3142.871751][ T2649] lock(&bch_register_lock);
[ 3142.873379][ T2649] lock((work_completion)(&cl->work)#2);
[ 3142.876399][ T2649] lock(&bch_register_lock);
[ 3142.879727][ T2649] lock((wq_completion)bcache_writeback_wq);
[ 3142.882064][ T2649]
[ 3142.882064][ T2649] *** DEADLOCK ***
[ 3142.882064][ T2649]
[ 3142.885060][ T2649] 3 locks held by kworker/3:67/2649:
[ 3142.887245][ T2649] #0: 00000000e774cdd0 ((wq_completion)events){+.+.}, at: process_one_work+0x21e/0x640
[ 3142.890815][ T2649] #1: 00000000f7df89da ((work_completion)(&cl->work)#2){+.+.}, at: process_one_work+0x21e/0x640
[ 3142.894884][ T2649] #2: 000000004fcf89c5 (&bch_register_lock){+.+.}, at: cached_dev_free+0x17/0x120 [bcache]
[ 3142.898797][ T2649]
[ 3142.898797][ T2649] stack backtrace:
[ 3142.900961][ T2649] CPU: 3 PID: 2649 Comm: kworker/3:67 Tainted: G W 5.2.0-rc4-lp151.20-default+ #1
[ 3142.904789][ T2649] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
[ 3142.909168][ T2649] Workqueue: events cached_dev_free [bcache]
[ 3142.911422][ T2649] Call Trace:
[ 3142.912656][ T2649] dump_stack+0x85/0xcb
[ 3142.914181][ T2649] print_circular_bug+0x19a/0x1f0
[ 3142.916193][ T2649] __lock_acquire+0x16cd/0x1850
[ 3142.917936][ T2649] ? __lock_acquire+0x6a8/0x1850
[ 3142.919704][ T2649] ? lock_acquire+0xb4/0x1c0
[ 3142.921335][ T2649] ? find_held_lock+0x34/0xa0
[ 3142.923052][ T2649] lock_acquire+0xb4/0x1c0
[ 3142.924635][ T2649] ? flush_workqueue+0x87/0x4c0
[ 3142.926375][ T2649] flush_workqueue+0xae/0x4c0
[ 3142.928047][ T2649] ? flush_workqueue+0x87/0x4c0
[ 3142.929824][ T2649] ? drain_workqueue+0xa9/0x180
[ 3142.931686][ T2649] drain_workqueue+0xa9/0x180
[ 3142.933534][ T2649] destroy_workqueue+0x17/0x250
[ 3142.935787][ T2649] cached_dev_free+0x52/0x120 [bcache]
[ 3142.937795][ T2649] process_one_work+0x2a4/0x640
[ 3142.939803][ T2649] worker_thread+0x39/0x3f0
[ 3142.941487][ T2649] ? process_one_work+0x640/0x640
[ 3142.943389][ T2649] kthread+0x125/0x140
[ 3142.944894][ T2649] ? kthread_create_worker_on_cpu+0x70/0x70
[ 3142.947744][ T2649] ret_from_fork+0x3a/0x50
[ 3142.970358][ T2649] bcache: bcache_device_free() bcache0 stopped
Here is how the deadlock happens.
1) bcache_reboot() calls bcache_device_stop(), then inside
bcache_device_stop() BCACHE_DEV_CLOSING bit is set on d->flags.
Then closure_queue(&d->cl) is called to invoke cached_dev_flush().
2) In cached_dev_flush(), cached_dev_free() is called by continu_at().
3) In cached_dev_free(), when stopping the writeback kthread of the
cached device by kthread_stop(), dc->writeback_thread will be waken
up to quite the kthread while-loop, then cached_dev_put() is called
in bch_writeback_thread().
4) Calling cached_dev_put() in writeback kthread may drop dc->count to
0, then dc->detach kworker is scheduled, which is initialized as
cached_dev_detach_finish().
5) Inside cached_dev_detach_finish(), the last line of code is to call
closure_put(&dc->disk.cl), which drops the last reference counter of
closrure dc->disk.cl, then the callback cached_dev_flush() gets
called.
Now cached_dev_flush() is called for second time in the code path, the
first time is in step 2). And again bch_register_lock will be acquired
again, and a A-A lock (lockdep terminology) is happening.
The root cause of the above A-A lock is in cached_dev_free(), mutex
bch_register_lock is held before stopping writeback kthread and other
kworkers. Fortunately now we have variable 'bcache_is_reboot', which may
prevent device registration or unregistration during reboot/shutdown
time, so it is unncessary to hold bch_register_lock such early now.
This is how this patch fixes the reboot/shutdown time A-A lock issue:
After moving mutex_lock(&bch_register_lock) to a later location where
before atomic_read(&dc->running) in cached_dev_free(), such A-A lock
problem can be solved without any reboot time registration race.
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Now there is variable bcache_is_reboot to prevent device register or
unregister during reboot, it is unncessary to still hold mutex lock
bch_register_lock before stopping writeback_rate_update kworker and
writeback kthread. And if the stopping kworker or kthread holding
bch_register_lock inside their routine (we used to have such problem
in writeback thread, thanks to Junhui Wang fixed it), it is very easy
to introduce deadlock during reboot/shutdown procedure.
Therefore in this patch, the location to acquire bch_register_lock is
moved to the location before calling calc_cached_dev_sectors(). Which
is later then original location in cached_dev_detach_finish().
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
It is quite frequently to observe deadlock in bcache_reboot() happens
and hang the system reboot process. The reason is, in bcache_reboot()
when calling bch_cache_set_stop() and bcache_device_stop() the mutex
bch_register_lock is held. But in the process to stop cache set and
bcache device, bch_register_lock will be acquired again. If this mutex
is held here, deadlock will happen inside the stopping process. The
aftermath of the deadlock is, whole system reboot gets hung.
The fix is to avoid holding bch_register_lock for the following loops
in bcache_reboot(),
list_for_each_entry_safe(c, tc, &bch_cache_sets, list)
bch_cache_set_stop(c);
list_for_each_entry_safe(dc, tdc, &uncached_devices, list)
bcache_device_stop(&dc->disk);
A module range variable 'bcache_is_reboot' is added, it sets to true
in bcache_reboot(). In register_bcache(), if bcache_is_reboot is checked
to be true, reject the registration by returning -EBUSY immediately.
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
In bch_cached_dev_attach() after bch_cached_dev_writeback_start()
called, the wrireback kthread and writeback rate update kworker of the
cached device are created, if the following bch_cached_dev_run()
failed, bch_cached_dev_attach() will return with -ENOMEM without
stopping the writeback related kthread and kworker.
This patch stops writeback kthread and writeback rate update kworker
before returning -ENOMEM if bch_cached_dev_run() returns error.
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|