Age | Commit message (Collapse) | Author |
|
Refer to the correct function (->submit_bio instead of ->queue_bio).
Also, add details about why using blk_queue_split() isn't needed for
dm_wq_work()'s call to dm_process_bio().
Fixes: c62b37d96b6eb ("block: move ->make_request_fn to struct block_device_operations")
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
dm_queue_split() is removed because __split_and_process_bio() _must_
handle splitting bios to ensure proper bio submission and completion
ordering as a bio is split.
Otherwise, multiple recursive calls to ->submit_bio will cause multiple
split bios to be allocated from the same ->bio_split mempool at the same
time. This would result in deadlock in low memory conditions because no
progress could be made (only one bio is available in ->bio_split
mempool).
This fix has been verified to still fix the loss of performance, due
to excess splitting, that commit 120c9257f5f1 provided.
Fixes: 120c9257f5f1 ("Revert "dm: always call blk_queue_split() in dm_process_bio()"")
Cc: stable@vger.kernel.org # 5.0+, requires custom backport due to 5.9 changes
Reported-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
DM was calling generic_fsdax_supported() to determine whether a device
referenced in the DM table supports DAX. However this is a helper for "leaf" device drivers so that
they don't have to duplicate common generic checks. High level code
should call dax_supported() helper which that calls into appropriate
helper for the particular device. This problem manifested itself as
kernel messages:
dm-3: error: dax access failed (-95)
when lvm2-testsuite run in cases where a DM device was stacked on top of
another DM device.
Fixes: 7bf7eac8d648 ("dax: Arrange for dax_supported check to span multiple devices")
Cc: <stable@vger.kernel.org>
Tested-by: Adrian Huang <ahuang12@lenovo.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Reported-by: kernel test robot <lkp@intel.com>
Link: https://lore.kernel.org/r/160061715195.13131.5503173247632041975.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
|
|
A recent fix to the dm_dax_supported() flow uncovered a latent bug. When
dm_get_live_table() fails it is still required to drop the
srcu_read_lock(). Without this change the lvm2 test-suite triggers this
warning:
# lvm2-testsuite --only pvmove-abort-all.sh
WARNING: lock held when returning to user space!
5.9.0-rc5+ #251 Tainted: G OE
------------------------------------------------
lvm/1318 is leaving the kernel with locks still held!
1 lock held by lvm/1318:
#0: ffff9372abb5a340 (&md->io_barrier){....}-{0:0}, at: dm_get_live_table+0x5/0xb0 [dm_mod]
...and later on this hang signature:
INFO: task lvm:1344 blocked for more than 122 seconds.
Tainted: G OE 5.9.0-rc5+ #251
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:lvm state:D stack: 0 pid: 1344 ppid: 1 flags:0x00004000
Call Trace:
__schedule+0x45f/0xa80
? finish_task_switch+0x249/0x2c0
? wait_for_completion+0x86/0x110
schedule+0x5f/0xd0
schedule_timeout+0x212/0x2a0
? __schedule+0x467/0xa80
? wait_for_completion+0x86/0x110
wait_for_completion+0xb0/0x110
__synchronize_srcu+0xd1/0x160
? __bpf_trace_rcu_utilization+0x10/0x10
__dm_suspend+0x6d/0x210 [dm_mod]
dm_suspend+0xf6/0x140 [dm_mod]
Fixes: 7bf7eac8d648 ("dax: Arrange for dax_supported check to span multiple devices")
Cc: <stable@vger.kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Reported-by: Adrian Huang <ahuang12@lenovo.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Tested-by: Adrian Huang <ahuang12@lenovo.com>
Link: https://lore.kernel.org/r/160045867590.25663.7548541079217827340.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
|
|
The following error ocurred when testing disk online/offline:
[ 301.798344] device-mapper: thin: 253:5: aborting current metadata transaction
[ 301.848441] device-mapper: thin: 253:5: failed to abort metadata transaction
[ 301.849206] Aborting journal on device dm-26-8.
[ 301.850489] EXT4-fs error (device dm-26) in __ext4_new_inode:943: Journal has aborted
[ 301.851095] EXT4-fs (dm-26): Delayed block allocation failed for inode 398742 at logical offset 181 with max blocks 19 with error 30
[ 301.854476] BUG: KASAN: use-after-free in dm_bm_set_read_only+0x3a/0x40 [dm_persistent_data]
Reason is:
metadata_operation_failed
abort_transaction
dm_pool_abort_metadata
__create_persistent_data_objects
r = __open_or_format_metadata
if (r) --> If failed will free pmd->bm but pmd->bm not set NULL
dm_block_manager_destroy(pmd->bm);
set_pool_mode
dm_pool_metadata_read_only(pool->pmd);
dm_bm_set_read_only(pmd->bm); --> use-after-free
Add checks to see if pmd->bm is NULL in dm_bm_set_read_only and
dm_bm_set_read_write functions. If bm is NULL it means creating the
bm failed and so dm_bm_is_read_only must return true.
Signed-off-by: Ye Bin <yebin10@huawei.com>
Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
Maybe __create_persistent_data_objects() caller will use PTR_ERR as a
pointer, it will lead to some strange things.
Signed-off-by: Ye Bin <yebin10@huawei.com>
Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
Maybe __create_persistent_data_objects() caller will use PTR_ERR as a
pointer, it will lead to some strange things.
Signed-off-by: Ye Bin <yebin10@huawei.com>
Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
The dm-integrity target did not report errors in bitmap mode just after
creation. The reason is that the function integrity_recalc didn't clean up
ic->recalc_bitmap as it proceeded with recalculation.
Fix this by updating the bitmap accordingly -- the double shift serves
to rounddown.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Fixes: 468dfca38b1a ("dm integrity: add a bitmap mode")
Cc: stable@vger.kernel.org # v5.2+
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
Use the DECLARE_CRYPTO_WAIT() macro to properly initialize the crypto
wait structures declared on stack before their use with
crypto_wait_req().
Fixes: 39d13a1ac41d ("dm crypt: reuse eboiv skcipher for IV generation")
Fixes: bbb1658461ac ("dm crypt: Implement Elephant diffuser for Bitlocker compatibility")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
Commit 935fcc56abc3 ("dm mpath: only flush workqueue when needed")
changed flush_multipath_work() to avoid needless workqueue
flushing (of a multipath global workqueue). But that change didn't
realize the surrounding flush_multipath_work() code should also only
run if 'pg_init_in_progress' is set.
Fix this by only doing all of flush_multipath_work()'s PG init related
work if 'pg_init_in_progress' is set.
Otherwise multipath_wait_for_pg_init_completion() will run
unconditionally but the preceeding flush_workqueue(kmpath_handlerd)
may not. This could lead to deadlock (though only if kmpath_handlerd
never runs a corresponding work to decrement 'pg_init_in_progress').
It could also be, though highly unlikely, that the kmpath_handlerd
work that does PG init completes before 'pg_init_in_progress' is set,
and then an intervening DM table reload's multipath_postsuspend()
triggers flush_multipath_work().
Fixes: 935fcc56abc3 ("dm mpath: only flush workqueue when needed")
Cc: stable@vger.kernel.org
Reported-by: Ben Marzinski <bmarzins@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
The function dax_direct_access doesn't take partitions into account,
it always maps pages from the beginning of the device. Therefore,
persistent_memory_claim() must get the partition offset using
get_start_sect() and add it to the page offsets passed to
dax_direct_access().
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Fixes: 48debafe4f2f ("dm: add writecache target")
Cc: stable@vger.kernel.org # 4.18+
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
Pull block fixes from Jens Axboe:
- nbd timeout fix (Hou)
- device size fix for loop LOOP_CONFIGURE (Martijn)
- MD pull from Song with raid5 stripe size fix (Yufen)
* tag 'block-5.9-2020-08-28' of git://git.kernel.dk/linux-block:
md/raid5: make sure stripe_size as power of two
loop: Set correct device size when using LOOP_CONFIGURE
nbd: restore default timeout when setting it to zero
|
|
Commit 3b5408b98e4d ("md/raid5: support config stripe_size by sysfs
entry") make stripe_size as a configurable value. It just requires
stripe_size as multiple of 4KB.
In fact, we should make sure stripe_size as power of two. Otherwise,
stripe_shift which is the result of ilog2 can not represent the real
stripe_size. Then, stripe_hash() and stripe_hash_locks_hash() may
get unexpected value.
Fixes: 3b5408b98e4d ("md/raid5: support config stripe_size by sysfs entry")
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
|
|
Replace the existing /* fall through */ comments and its variants with
the new pseudo-keyword macro fallthrough[1]. Also, remove unnecessary
fall-through markings when it is the case.
[1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
|
|
Pull block fixes from Jens Axboe:
"A few fixes on the block side of things:
- Discard granularity fix (Coly)
- rnbd cleanups (Guoqing)
- md error handling fix (Dan)
- md sysfs fix (Junxiao)
- Fix flush request accounting, which caused an IO slowdown for some
configurations (Ming)
- Properly propagate loop flag for partition scanning (Lennart)"
* tag 'block-5.9-2020-08-14' of git://git.kernel.dk/linux-block:
block: fix double account of flush request's driver tag
loop: unset GENHD_FL_NO_PART_SCAN on LOOP_CONFIGURE
rnbd: no need to set bi_end_io in rnbd_bio_map_kern
rnbd: remove rnbd_dev_submit_io
md-cluster: Fix potential error pointer dereference in resize_bitmaps()
block: check queue's limits.discard_granularity in __blkdev_issue_discard()
md: get sysfs entry after redundancy attr group create
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking updates from Thomas Gleixner:
"A set of locking fixes and updates:
- Untangle the header spaghetti which causes build failures in
various situations caused by the lockdep additions to seqcount to
validate that the write side critical sections are non-preemptible.
- The seqcount associated lock debug addons which were blocked by the
above fallout.
seqcount writers contrary to seqlock writers must be externally
serialized, which usually happens via locking - except for strict
per CPU seqcounts. As the lock is not part of the seqcount, lockdep
cannot validate that the lock is held.
This new debug mechanism adds the concept of associated locks.
sequence count has now lock type variants and corresponding
initializers which take a pointer to the associated lock used for
writer serialization. If lockdep is enabled the pointer is stored
and write_seqcount_begin() has a lockdep assertion to validate that
the lock is held.
Aside of the type and the initializer no other code changes are
required at the seqcount usage sites. The rest of the seqcount API
is unchanged and determines the type at compile time with the help
of _Generic which is possible now that the minimal GCC version has
been moved up.
Adding this lockdep coverage unearthed a handful of seqcount bugs
which have been addressed already independent of this.
While generally useful this comes with a Trojan Horse twist: On RT
kernels the write side critical section can become preemtible if
the writers are serialized by an associated lock, which leads to
the well known reader preempts writer livelock. RT prevents this by
storing the associated lock pointer independent of lockdep in the
seqcount and changing the reader side to block on the lock when a
reader detects that a writer is in the write side critical section.
- Conversion of seqcount usage sites to associated types and
initializers"
* tag 'locking-urgent-2020-08-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (25 commits)
locking/seqlock, headers: Untangle the spaghetti monster
locking, arch/ia64: Reduce <asm/smp.h> header dependencies by moving XTP bits into the new <asm/xtp.h> header
x86/headers: Remove APIC headers from <asm/smp.h>
seqcount: More consistent seqprop names
seqcount: Compress SEQCNT_LOCKNAME_ZERO()
seqlock: Fold seqcount_LOCKNAME_init() definition
seqlock: Fold seqcount_LOCKNAME_t definition
seqlock: s/__SEQ_LOCKDEP/__SEQ_LOCK/g
hrtimer: Use sequence counter with associated raw spinlock
kvm/eventfd: Use sequence counter with associated spinlock
userfaultfd: Use sequence counter with associated spinlock
NFSv4: Use sequence counter with associated spinlock
iocost: Use sequence counter with associated spinlock
raid5: Use sequence counter with associated spinlock
vfs: Use sequence counter with associated spinlock
timekeeping: Use sequence counter with associated raw spinlock
xfrm: policy: Use sequence counters with associated lock
netfilter: nft_set_rbtree: Use sequence counter with associated rwlock
netfilter: conntrack: Use sequence counter with associated spinlock
sched: tasks: Use sequence counter with associated spinlock
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper updates from Mike Snitzer:
- DM multipath locking fixes around m->flags tests and improvements to
bio-based code so that it follows patterns established by
request-based code.
- Request-based DM core improvement to eliminate unnecessary call to
blk_mq_queue_stopped().
- Add "panic_on_corruption" error handling mode to DM verity target.
- DM bufio fix to to perform buffer cleanup from a workqueue rather
than wait for IO in reclaim context from shrinker.
- DM crypt improvement to optionally avoid async processing via
workqueues for reads and/or writes -- via "no_read_workqueue" and
"no_write_workqueue" features. This more direct IO processing
improves latency and throughput with faster storage. Avoiding
workqueue IO submission for writes (DM_CRYPT_NO_WRITE_WORKQUEUE) is a
requirement for adding zoned block device support to DM crypt.
- Add zoned block device support to DM crypt. Makes use of
DM_CRYPT_NO_WRITE_WORKQUEUE and a new optional feature
(DM_CRYPT_WRITE_INLINE) that allows write completion to wait for
encryption to complete. This allows write ordering to be preserved,
which is needed for zoned block devices.
- Fix DM ebs target's check for REQ_OP_FLUSH.
- Fix DM core's report zones support to not report more zones than were
requested.
- A few small compiler warning fixes.
- DM dust improvements to return output directly to the user rather
than require they scrape the system log for output.
* tag 'for-5.9/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm: don't call report zones for more than the user requested
dm ebs: Fix incorrect checking for REQ_OP_FLUSH
dm init: Set file local variable static
dm ioctl: Fix compilation warning
dm raid: Remove empty if statement
dm verity: Fix compilation warning
dm crypt: Enable zoned block device support
dm crypt: add flags to optionally bypass kcryptd workqueues
dm bufio: do buffer cleanup from a workqueue
dm rq: don't call blk_mq_queue_stopped() in dm_stop_queue()
dm dust: add interface to list all badblocks
dm dust: report some message results directly back to user
dm verity: add "panic_on_corruption" error handling mode
dm mpath: use double checked locking in fast path
dm mpath: rename current_pgpath to pgpath in multipath_prepare_ioctl
dm mpath: rework __map_bio()
dm mpath: factor out multipath_queue_bio
dm mpath: push locking down to must_push_back_rq()
dm mpath: take m->lock spinlock when testing QUEUE_IF_NO_PATH
dm mpath: changes from initial m->flags locking audit
|
|
Merge misc updates from Andrew Morton:
- a few MM hotfixes
- kthread, tools, scripts, ntfs and ocfs2
- some of MM
Subsystems affected by this patch series: kthread, tools, scripts, ntfs,
ocfs2 and mm (hofixes, pagealloc, slab-generic, slab, slub, kcsan,
debug, pagecache, gup, swap, shmem, memcg, pagemap, mremap, mincore,
sparsemem, vmalloc, kasan, pagealloc, hugetlb and vmscan).
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (162 commits)
mm: vmscan: consistent update to pgrefill
mm/vmscan.c: fix typo
khugepaged: khugepaged_test_exit() check mmget_still_valid()
khugepaged: retract_page_tables() remember to test exit
khugepaged: collapse_pte_mapped_thp() protect the pmd lock
khugepaged: collapse_pte_mapped_thp() flush the right range
mm/hugetlb: fix calculation of adjust_range_if_pmd_sharing_possible
mm: thp: replace HTTP links with HTTPS ones
mm/page_alloc: fix memalloc_nocma_{save/restore} APIs
mm/page_alloc.c: skip setting nodemask when we are in interrupt
mm/page_alloc: fallbacks at most has 3 elements
mm/page_alloc: silence a KASAN false positive
mm/page_alloc.c: remove unnecessary end_bitidx for [set|get]_pfnblock_flags_mask()
mm/page_alloc.c: simplify pageblock bitmap access
mm/page_alloc.c: extract the common part in pfn_to_bitidx()
mm/page_alloc.c: replace the definition of NR_MIGRATETYPE_BITS with PB_migratetype_bits
mm/shuffle: remove dynamic reconfiguration
mm/memory_hotplug: document why shuffle_zone() is relevant
mm/page_alloc: remove nr_free_pagecache_pages()
mm: remove vm_total_pages
...
|
|
As said by Linus:
A symmetric naming is only helpful if it implies symmetries in use.
Otherwise it's actively misleading.
In "kzalloc()", the z is meaningful and an important part of what the
caller wants.
In "kzfree()", the z is actively detrimental, because maybe in the
future we really _might_ want to use that "memfill(0xdeadbeef)" or
something. The "zero" part of the interface isn't even _relevant_.
The main reason that kzfree() exists is to clear sensitive information
that should not be leaked to other future users of the same memory
objects.
Rename kzfree() to kfree_sensitive() to follow the example of the recently
added kvfree_sensitive() and make the intention of the API more explicit.
In addition, memzero_explicit() is used to clear the memory to make sure
that it won't get optimized away by the compiler.
The renaming is done by using the command sequence:
git grep -w --name-only kzfree |\
xargs sed -i 's/kzfree/kfree_sensitive/'
followed by some editing of the kfree_sensitive() kerneldoc and adding
a kzfree backward compatibility macro in slab.h.
[akpm@linux-foundation.org: fs/crypto/inline_crypt.c needs linux/slab.h]
[akpm@linux-foundation.org: fix fs/crypto/inline_crypt.c some more]
Suggested-by: Joe Perches <joe@perches.com>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: David Howells <dhowells@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
Cc: James Morris <jmorris@namei.org>
Cc: "Serge E. Hallyn" <serge@hallyn.com>
Cc: Joe Perches <joe@perches.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: "Jason A . Donenfeld" <Jason@zx2c4.com>
Link: http://lkml.kernel.org/r/20200616154311.12314-3-longman@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc updates from Michael Ellerman:
- Add support for (optionally) using queued spinlocks & rwlocks.
- Support for a new faster system call ABI using the scv instruction on
Power9 or later.
- Drop support for the PROT_SAO mmap/mprotect flag as it will be
unsupported on Power10 and future processors, leaving us with no way
to implement the functionality it requests. This risks breaking
userspace, though we believe it is unused in practice.
- A bug fix for, and then the removal of, our custom stack expansion
checking. We now allow stack expansion up to the rlimit, like other
architectures.
- Remove the remnants of our (previously disabled) topology update
code, which tried to react to NUMA layout changes on virtualised
systems, but was prone to crashes and other problems.
- Add PMU support for Power10 CPUs.
- A change to our signal trampoline so that we don't unbalance the link
stack (branch return predictor) in the signal delivery path.
- Lots of other cleanups, refactorings, smaller features and so on as
usual.
Thanks to: Abhishek Goel, Alastair D'Silva, Alexander A. Klimov, Alexey
Kardashevskiy, Alistair Popple, Andrew Donnellan, Aneesh Kumar K.V, Anju
T Sudhakar, Anton Blanchard, Arnd Bergmann, Athira Rajeev, Balamuruhan
S, Bharata B Rao, Bill Wendling, Bin Meng, Cédric Le Goater, Chris
Packham, Christophe Leroy, Christoph Hellwig, Daniel Axtens, Dan
Williams, David Lamparter, Desnes A. Nunes do Rosario, Erhard F., Finn
Thain, Frederic Barrat, Ganesh Goudar, Gautham R. Shenoy, Geoff Levand,
Greg Kurz, Gustavo A. R. Silva, Hari Bathini, Harish, Imre Kaloz, Joel
Stanley, Joe Perches, John Crispin, Jordan Niethe, Kajol Jain, Kamalesh
Babulal, Kees Cook, Laurent Dufour, Leonardo Bras, Li RongQing, Madhavan
Srinivasan, Mahesh Salgaonkar, Mark Cave-Ayland, Michal Suchanek, Milton
Miller, Mimi Zohar, Murilo Opsfelder Araujo, Nathan Chancellor, Nathan
Lynch, Naveen N. Rao, Nayna Jain, Nicholas Piggin, Oliver O'Halloran,
Palmer Dabbelt, Pedro Miraglia Franco de Carvalho, Philippe Bergheaud,
Pingfan Liu, Pratik Rajesh Sampat, Qian Cai, Qinglang Miao, Randy
Dunlap, Ravi Bangoria, Sachin Sant, Sam Bobroff, Sandipan Das, Santosh
Sivaraj, Satheesh Rajendran, Shirisha Ganta, Sourabh Jain, Srikar
Dronamraju, Stan Johnson, Stephen Rothwell, Thadeu Lima de Souza
Cascardo, Thiago Jung Bauermann, Tom Lane, Vaibhav Jain, Vladis Dronov,
Wei Yongjun, Wen Xiong, YueHaibing.
* tag 'powerpc-5.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (337 commits)
selftests/powerpc: Fix pkey syscall redefinitions
powerpc: Fix circular dependency between percpu.h and mmu.h
powerpc/powernv/sriov: Fix use of uninitialised variable
selftests/powerpc: Skip vmx/vsx/tar/etc tests on older CPUs
powerpc/40x: Fix assembler warning about r0
powerpc/papr_scm: Add support for fetching nvdimm 'fuel-gauge' metric
powerpc/papr_scm: Fetch nvdimm performance stats from PHYP
cpuidle: pseries: Fixup exit latency for CEDE(0)
cpuidle: pseries: Add function to parse extended CEDE records
cpuidle: pseries: Set the latency-hint before entering CEDE
selftests/powerpc: Fix online CPU selection
powerpc/perf: Consolidate perf_callchain_user_[64|32]()
powerpc/pseries/hotplug-cpu: Remove double free in error path
powerpc/pseries/mobility: Add pr_debug() for device tree changes
powerpc/pseries/mobility: Set pr_fmt()
powerpc/cacheinfo: Warn if cache object chain becomes unordered
powerpc/cacheinfo: Improve diagnostics about malformed cache lists
powerpc/cacheinfo: Use name@unit instead of full DT path in debug messages
powerpc/cacheinfo: Set pr_fmt()
powerpc: fix function annotations to avoid section mismatch warnings with gcc-10
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull init and set_fs() cleanups from Al Viro:
"Christoph's 'getting rid of ksys_...() uses under KERNEL_DS' series"
* 'hch.init_path' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (50 commits)
init: add an init_dup helper
init: add an init_utimes helper
init: add an init_stat helper
init: add an init_mknod helper
init: add an init_mkdir helper
init: add an init_symlink helper
init: add an init_link helper
init: add an init_eaccess helper
init: add an init_chmod helper
init: add an init_chown helper
init: add an init_chroot helper
init: add an init_chdir helper
init: add an init_rmdir helper
init: add an init_unlink helper
init: add an init_umount helper
init: add an init_mount helper
init: mark create_dev as __init
init: mark console_on_rootfs as __init
init: initialize ramdisk_execute_command at compile time
devtmpfs: refactor devtmpfsd()
...
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/song/md into block-5.9
Pull MD fixes from Song.
* 'md-next' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md:
md-cluster: Fix potential error pointer dereference in resize_bitmaps()
md: get sysfs entry after redundancy attr group create
|
|
Pick up the full seqlock series PeterZ is working on.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
The error handling calls md_bitmap_free(bitmap) which checks for NULL
but will Oops if we pass an error pointer. Let's set "bitmap" to NULL
on this error path.
Fixes: afd756286083 ("md-cluster/raid10: resize all the bitmaps before start reshape")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
|
|
"sync_completed" and "degraded" belongs to redundancy attr group,
it was not exist yet when md device was created.
Reported-by: kernel test robot <rong.a.chen@intel.com>
Fixes: e1a86dbbbd6a ("md: fix deadlock causing by sysfs_notify")
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
|
|
Pull block stacking updates from Jens Axboe:
"The stacking related fixes depended on both the core block and drivers
branches, so here's a topic branch with that change.
Outside of that, a late fix from Johannes for zone revalidation"
* tag 'for-5.9/block-merge-20200804' of git://git.kernel.dk/linux-block:
block: don't do revalidate zones on invalid devices
block: remove blk_queue_stack_limits
block: remove bdev_stack_limits
block: inherit the zoned characteristics in blk_stack_limits
|
|
Pull block driver updates from Jens Axboe:
- NVMe:
- ZNS support (Aravind, Keith, Matias, Niklas)
- Misc cleanups, optimizations, fixes (Baolin, Chaitanya, David,
Dongli, Max, Sagi)
- null_blk zone capacity support (Aravind)
- MD:
- raid5/6 fixes (ChangSyun)
- Warning fixes (Damien)
- raid5 stripe fixes (Guoqing, Song, Yufen)
- sysfs deadlock fix (Junxiao)
- raid10 deadlock fix (Vitaly)
- struct_size conversions (Gustavo)
- Set of bcache updates/fixes (Coly)
* tag 'for-5.9/drivers-20200803' of git://git.kernel.dk/linux-block: (117 commits)
md/raid5: Allow degraded raid6 to do rmw
md/raid5: Fix Force reconstruct-write io stuck in degraded raid5
raid5: don't duplicate code for different paths in handle_stripe
raid5-cache: hold spinlock instead of mutex in r5c_journal_mode_show
md: print errno in super_written
md/raid5: remove the redundant setting of STRIPE_HANDLE
md: register new md sysfs file 'uuid' read-only
md: fix max sectors calculation for super 1.0
nvme-loop: remove extra variable in create ctrl
nvme-loop: set ctrl state connecting after init
nvme-multipath: do not fall back to __nvme_find_path() for non-optimized paths
nvme-multipath: fix logic for non-optimized paths
nvme-rdma: fix controller reset hang during traffic
nvme-tcp: fix controller reset hang during traffic
nvmet: introduce the passthru Kconfig option
nvmet: introduce the passthru configfs interface
nvmet: Add passthru enable/disable helpers
nvmet: add passthru code to process commands
nvme: export nvme_find_get_ns() and nvme_put_ns()
nvme: introduce nvme_ctrl_get_by_path()
...
|
|
Pull documentation updates from Jonathan Corbet:
"It's been a busy cycle for documentation - hopefully the busiest for a
while to come. Changes include:
- Some new Chinese translations
- Progress on the battle against double words words and non-HTTPS
URLs
- Some block-mq documentation
- More RST conversions from Mauro. At this point, that task is
essentially complete, so we shouldn't see this kind of churn again
for a while. Unless we decide to switch to asciidoc or
something...:)
- Lots of typo fixes, warning fixes, and more"
* tag 'docs-5.9' of git://git.lwn.net/linux: (195 commits)
scripts/kernel-doc: optionally treat warnings as errors
docs: ia64: correct typo
mailmap: add entry for <alobakin@marvell.com>
doc/zh_CN: add cpu-load Chinese version
Documentation/admin-guide: tainted-kernels: fix spelling mistake
MAINTAINERS: adjust kprobes.rst entry to new location
devices.txt: document rfkill allocation
PCI: correct flag name
docs: filesystems: vfs: correct flag name
docs: filesystems: vfs: correct sync_mode flag names
docs: path-lookup: markup fixes for emphasis
docs: path-lookup: more markup fixes
docs: path-lookup: fix HTML entity mojibake
CREDITS: Replace HTTP links with HTTPS ones
docs: process: Add an example for creating a fixes tag
doc/zh_CN: add Chinese translation prefer section
doc/zh_CN: add clearing-warn-once Chinese version
doc/zh_CN: add admin-guide index
doc:it_IT: process: coding-style.rst: Correct __maybe_unused compiler label
futex: MAINTAINERS: Re-add selftests directory
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull uninitialized_var() macro removal from Kees Cook:
"This is long overdue, and has hidden too many bugs over the years. The
series has several "by hand" fixes, and then a trivial treewide
replacement.
- Clean up non-trivial uses of uninitialized_var()
- Update documentation and checkpatch for uninitialized_var() removal
- Treewide removal of uninitialized_var()"
* tag 'uninit-macro-v5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
compiler: Remove uninitialized_var() macro
treewide: Remove uninitialized_var() usage
checkpatch: Remove awareness of uninitialized_var() macro
mm/debug_vm_pgtable: Remove uninitialized_var() usage
f2fs: Eliminate usage of uninitialized_var() macro
media: sur40: Remove uninitialized_var() usage
KVM: PPC: Book3S PR: Remove uninitialized_var() usage
clk: spear: Remove uninitialized_var() usage
clk: st: Remove uninitialized_var() usage
spi: davinci: Remove uninitialized_var() usage
ide: Remove uninitialized_var() usage
rtlwifi: rtl8192cu: Remove uninitialized_var() usage
b43: Remove uninitialized_var() usage
drbd: Remove uninitialized_var() usage
x86/mm/numa: Remove uninitialized_var() usage
docs: deprecated.rst: Add uninitialized_var()
|
|
Don't call report zones for more zones than the user actually requested,
otherwise this can lead to out-of-bounds accesses in the callback
functions.
Such a situation can happen if the target's ->report_zones() callback
function returns 0 because we've reached the end of the target and then
restart the report zones on the second target.
We're again calling into ->report_zones() and ultimately into the user
supplied callback function but when we're not subtracting the number of
zones already processed this may lead to out-of-bounds accesses in the
user callbacks.
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Fixes: d41003513e61 ("block: rework zone reporting")
Cc: stable@vger.kernel.org # v5.5+
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
REQ_OP_FLUSH was being treated as a flag, but the operation
part of bio->bi_opf must be treated as a whole. Change to
accessing the operation part via bio_op(bio) and checking
for equality.
Signed-off-by: John Dorminy <jdorminy@redhat.com>
Acked-by: Heinz Mauelshagen <heinzm@redhat.com>
Fixes: d3c7b35c20d60 ("dm: add emulated block size target")
Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
Declare dm_allowed_targets as static to avoid the warning:
drivers/md/dm-init.c:39:12: warning: symbol 'dm_allowed_targets' was
not declared. Should it be static?
when compiling with C=1.
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
In retrieve_status(), when copying the target type name in the
target_type string field of struct dm_target_spec, copy at most
DM_MAX_TYPE_NAME - 1 character to avoid the compilation warning:
warning: ‘__builtin_strncpy’ specified bound 16 equals destination
size [-Wstringop-truncation]
when compiling with W-1.
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
In super_init_validation(), remove a body-less if statement testing only
variables to avoid a compilation warning when compiling with W=1.
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
For the case !CONFIG_DM_VERITY_VERIFY_ROOTHASH_SIG, declare the
functions verity_verify_root_hash(), verity_verify_is_sig_opt_arg(),
verity_verify_sig_parse_opt_args() and verity_verify_sig_opts_cleanup()
as inline to avoid a "no previous prototype for xxx" compilation
warning when compiling with W=1.
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
Pull core block updates from Jens Axboe:
"Good amount of cleanups and tech debt removals in here, and as a
result, the diffstat shows a nice net reduction in code.
- Softirq completion cleanups (Christoph)
- Stop using ->queuedata (Christoph)
- Cleanup bd claiming (Christoph)
- Use check_events, moving away from the legacy media change
(Christoph)
- Use inode i_blkbits consistently (Christoph)
- Remove old unused writeback congestion bits (Christoph)
- Cleanup/unify submission path (Christoph)
- Use bio_uninit consistently, instead of bio_disassociate_blkg
(Christoph)
- sbitmap cleared bits handling (John)
- Request merging blktrace event addition (Jan)
- sysfs add/remove race fixes (Luis)
- blk-mq tag fixes/optimizations (Ming)
- Duplicate words in comments (Randy)
- Flush deferral cleanup (Yufen)
- IO context locking/retry fixes (John)
- struct_size() usage (Gustavo)
- blk-iocost fixes (Chengming)
- blk-cgroup IO stats fixes (Boris)
- Various little fixes"
* tag 'for-5.9/block-20200802' of git://git.kernel.dk/linux-block: (135 commits)
block: blk-timeout: delete duplicated word
block: blk-mq-sched: delete duplicated word
block: blk-mq: delete duplicated word
block: genhd: delete duplicated words
block: elevator: delete duplicated word and fix typos
block: bio: delete duplicated words
block: bfq-iosched: fix duplicated word
iocost_monitor: start from the oldest usage index
iocost: Fix check condition of iocg abs_vdebt
block: Remove callback typedefs for blk_mq_ops
block: Use non _rcu version of list functions for tag_set_list
blk-cgroup: show global disk stats in root cgroup io.stat
blk-cgroup: make iostat functions visible to stat printing
block: improve discard bio alignment in __blkdev_issue_discard()
block: change REQ_OP_ZONE_RESET and REQ_OP_ZONE_RESET_ALL to be odd numbers
block: defer flush request no matter whether we have elevator
block: make blk_timeout_init() static
block: remove retry loop in ioc_release_fn()
block: remove unnecessary ioc nested locking
block: integrate bd_start_claiming into __blkdev_get
...
|
|
Degraded raid6 always do reconstruct-write now. With raid6 xor supported,
we can do rmw in degraded raid6. This patch can reduce many read IOs to
improve performance.
If the failed disk is P, Q or the disk we want to write to, we may need to
do reconstruct-write in max degraded raid6. In this situation we can not
read enough data from handle_stripe_dirtying() so we have to set force_rcw
in handle_stripe_fill() to read all data.
Reviewed-by: Alex Wu <alexwu@synology.com>
Reviewed-by: BingJing Chang <bingjingc@synology.com>
Reviewed-by: Danny Shih <dannyshih@synology.com>
Signed-off-by: ChangSyun Peng <allenpeng@synology.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
|
|
In degraded raid5, we need to read parity to do reconstruct-write when
data disks fail. However, we can not read parity from
handle_stripe_dirtying() in force reconstruct-write mode.
Reproducible Steps:
1. Create degraded raid5
mdadm -C /dev/md2 --assume-clean -l5 -n3 /dev/sda2 /dev/sdb2 missing
2. Set rmw_level to 0
echo 0 > /sys/block/md2/md/rmw_level
3. IO to raid5
Now some io may be stuck in raid5. We can use handle_stripe_fill() to read
the parity in this situation.
Cc: <stable@vger.kernel.org> # v4.4+
Reviewed-by: Alex Wu <alexwu@synology.com>
Reviewed-by: BingJing Chang <bingjingc@synology.com>
Reviewed-by: Danny Shih <dannyshih@synology.com>
Signed-off-by: ChangSyun Peng <allenpeng@synology.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
|
|
As we can see, R5_LOCKED is set and s.locked is increased whether
R5_ReWrite is set or not, so move it to common path.
Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
|
|
Replace mddev_lock with spin_lock to align with other show methods in
raid5_attrs.
Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
|
|
It is better to print errno instead of bi_status.
Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
|
|
The flag is already set before compare rcw with rmw, so it is
not necessary to do it again.
Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
|
|
Report the UUID of the MD array in the following format:
xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
This is useful if you don't want to wait for udev to identify array.
And it is also easy for script to monitor it with the format.
Signed-off-by: Sebastian Parschauer <s.parschauer@gmx.de>
[Guoqing: mention the change in md.rst]
Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
|
|
To grow size of super 1.0 raid array, it is necessary to check the device
max usable size.
Now it uses rdev->sectors for max usable size. If one disk is 500G and the
raid device only uses the 100GB of this disk. rdev->sectors can't tell the
real max usable size. The max usable size should be
dev_size-(superblock_size+bitmap_size+badblock_size).
Also, remove unnecessary sb_start update in super_1_rdev_size_change().
Signed-off-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
|
|
Add a simple helper to stat with a kernel space file name and switch
the early init code over to it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
A sequence counter write side critical section must be protected by some
form of locking to serialize writers. A plain seqcount_t does not
contain the information of which lock must be held when entering a write
side critical section.
Use the new seqcount_spinlock_t data type, which allows to associate a
spinlock with the sequence counter. This enables lockdep to verify that
the spinlock used for writer serialization is held when the write side
critical section is entered.
If lockdep is disabled this lock association is compiled out and has
neither storage size nor runtime overhead.
Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Song Liu <song@kernel.org>
Link: https://lkml.kernel.org/r/20200720155530.1173732-20-a.darwish@linutronix.de
|
|
This patch is a fix to patch "bcache: fix bio_{start,end}_io_acct with
proper device". The previous patch uses a hack to temporarily set
bi_disk to bcache device, which is mistaken too.
As Christoph suggests, this patch uses disk_{start,end}_io_acct() to
count I/O for bcache device in the correct way.
Fixes: 85750aeb748f ("bcache: use bio_{start,end}_io_acct")
Signed-off-by: Coly Li <colyli@suse.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Commit 85750aeb748f ("bcache: use bio_{start,end}_io_acct") moves the
io account code to the location after bio_set_dev(bio, dc->bdev) in
cached_dev_make_request(). Then the account is performed incorrectly on
backing device, indeed the I/O should be counted to bcache device like
/dev/bcache0.
With the mistaken I/O account, iostat does not display I/O counts for
bcache device and all the numbers go to backing device. In writeback
mode, the hard drive may have 340K+ IOPS which is impossible and wrong
for spinning disk.
This patch introduces bch_bio_start_io_acct() and bch_bio_end_io_acct(),
which switches bio->bi_disk to bcache device before calling
bio_start_io_acct() or bio_end_io_acct(). Now the I/Os are counted to
bcache device, and bcache device, cache device and backing device have
their correct I/O count information back.
Fixes: 85750aeb748f ("bcache: use bio_{start,end}_io_acct")
Signed-off-by: Coly Li <colyli@suse.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Bcache uses struct bbio to do I/Os for meta data pages like uuids,
disk_buckets, prio_buckets, and btree nodes.
Example writing a btree node onto cache device, the process is,
- Allocate a struct bbio from mempool c->bio_meta.
- Inside struct bbio embedded a struct bio, initialize bi_inline_vecs
for this embedded bio.
- Call bch_bio_map() to map each meta data page to each bv from the
inlined bi_io_vec table.
- Call bch_submit_bbio() to submit the bio into underlying block layer.
- When the I/O completed, only release the struct bbio, don't touch the
reference counter of the meta data pages.
The struct bbio is defined as,
738 struct bbio {
739 unsigned int submit_time_us;
[snipped]
748 struct bio bio;
749 };
Because struct bio is embedded at the end of struct bbio, therefore the
actual size of struct bbio is sizeof(struct bio) + size of the embedded
bio->bi_inline_vecs.
Now all the meta data bucket size are limited to meta_bucket_pages(), if
the bucket size is large than meta_bucket_pages()*PAGE_SECTORS, rested
space in the bucket is unused. Therefore the most used space in meta
bucket is (1<<MAX_ORDER) pages, or (1<<CONFIG_FORCE_MAX_ZONEORDER) if it
is configured.
Therefore for large bucket size, it is unnecessary to calculate the
allocation size of mempool c->bio_meta as,
mempool_init_kmalloc_pool(&c->bio_meta, 2,
sizeof(struct bbio) +
sizeof(struct bio_vec) * bucket_pages(c))
It is too large, neither the Linux buddy allocator cannot allocate so
much continuous pages, nor the extra allocated pages are wasted.
This patch replace bucket_pages() to meta_bucket_pages() in two places,
- In bch_cache_set_alloc(), when initialize mempool c->bio_meta, uses
sizeof(struct bbio) + sizeof(struct bio_vec) * bucket_pages(c) to set
the allocating object size.
- In bch_bbio_alloc(), when calling bio_init() to set inline bvec talbe
bi_inline_bvecs, uses meta_bucket_pages() to indicate number of the
inline bio vencs number.
Now the maximum size of embedded bio inside struct bbio exactly matches
the limit of meta_bucket_pages(), no extra page wasted.
Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Mempool c->fill_iter is used to allocate memory for struct btree_iter in
bch_btree_node_read_done() to iterate all keys of a read-in btree node.
The allocation size is defined in bch_cache_set_alloc() by,
mempool_init_kmalloc_pool(&c->fill_iter, 1, iter_size))
where iter_size is defined by a calculation,
(sb->bucket_size / sb->block_size + 1) * sizeof(struct btree_iter_set)
For 16bit width bucket_size the calculation is OK, but now the bucket
size is extended to 32bit, the bucket size can be 2GB. By the above
calculation, iter_size can be 2048 pages (order 11 is still accepted by
buddy allocator).
But the actual size holds the bkeys in meta data bucket is limited to
meta_bucket_pages() already, which is 16MB. By the above calculation,
if replace sb->bucket_size by meta_bucket_pages() * PAGE_SECTORS, the
result is 16 pages. This is the size large enough for the mempool
allocation to struct btree_iter.
Therefore in worst case every time mempool c->fill_iter allocates, at
most 4080 pages are wasted and won't be used. Therefore this patch uses
meta_bucket_pages() * PAGE_SECTORS to calculate the iter size in
bch_cache_set_alloc(), to avoid extra memory allocation from mempool
c->fill_iter.
Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|