Age | Commit message (Collapse) | Author |
|
The documentation comments for ext4_read_block_bitmap_nowait and
ext4_read_inode_bitmap describe them as returning NULL on error, but
they return an ERR_PTR on error; update the documentation to match.
The documentation comment for ext4_wait_block_bitmap describes it as
returning 1 on error, but it returns -errno on error; update the
documentation to match.
Signed-off-by: Josh Triplett <josh@joshtriplett.org>
Reviewed-by: Ritesh Harani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/60a3f4996f4932c45515aaa6b75ca42f2a78ec9b.1585512514.git.josh@joshtriplett.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Since commit a8ac900b8163 ("ext4: use non-movable memory for the
superblock") buffers for ext4 superblock were allocated using
the sb_bread_unmovable() helper which allocated buffer heads
out of non-movable memory blocks. It was necessarily to not block
page migrations and do not cause cma allocation failures.
However commit 85c8f176a611 ("ext4: preload block group descriptors")
broke this by introducing pre-reading of the ext4 superblock.
The problem is that __breadahead() is using __getblk() underneath,
which allocates buffer heads out of movable memory.
It resulted in page migration failures I've seen on a machine
with an ext4 partition and a preallocated cma area.
Fix this by introducing sb_breadahead_unmovable() and
__breadahead_gfp() helpers which use non-movable memory for buffer
head allocations and use them for the ext4 superblock readahead.
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Fixes: 85c8f176a611 ("ext4: preload block group descriptors")
Signed-off-by: Roman Gushchin <guro@fb.com>
Link: https://lore.kernel.org/r/20200229001411.128010-1-guro@fb.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Run generic/388 with journal data mode sometimes may trigger the warning
in ext4_invalidatepage. Actually, we should use the matching invalidatepage
in ext4_writepage.
Signed-off-by: yangerkun <yangerkun@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200226041002.13914-1-yangerkun@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Found a read performance issue when linux kernel page size is 64KB.
If linux kernel page size is 64KB and mount options cache=strict &
vers=2.1+, it does not support cifs_readpages(). Instead, it is using
cifs_readpage() and cifs_read() with maximum read IO size 16KB, which is
much slower than read IO size 1MB when negotiated SMB 2.1+. Since modern
SMB server supported SMB 2.1+ and Max Read Size can reach more than 64KB
(for example 1MB ~ 8MB), this patch check max_read instead of maxBuf to
determine whether server support readpages() and improve read performance
for page size 64KB & cache=strict & vers=2.1+, and for SMB1 it is more
cleaner to initialize server->max_read to server->maxBuf.
The client is a linux box with linux kernel 4.2.8,
page size 64KB (CONFIG_ARM64_64K_PAGES=y),
cpu arm 1.7GHz, and use mount.cifs as smb client.
The server is another linux box with linux kernel 4.2.8,
share a file '10G.img' with size 10GB,
and use samba-4.7.12 as smb server.
The client mount a share from the server with different
cache options: cache=strict and cache=none,
mount -tcifs //<server_ip>/Public /cache_strict -overs=3.0,cache=strict,username=<xxx>,password=<yyy>
mount -tcifs //<server_ip>/Public /cache_none -overs=3.0,cache=none,username=<xxx>,password=<yyy>
The client download a 10GbE file from the server across 1GbE network,
dd if=/cache_strict/10G.img of=/dev/null bs=1M count=10240
dd if=/cache_none/10G.img of=/dev/null bs=1M count=10240
Found that cache=strict (without patch) is slower read throughput and
smaller read IO size than cache=none.
cache=strict (without patch): read throughput 40MB/s, read IO size is 16KB
cache=strict (with patch): read throughput 113MB/s, read IO size is 1MB
cache=none: read throughput 109MB/s, read IO size is 1MB
Looks like if page size is 64KB,
cifs_set_ops() would use cifs_addr_ops_smallbuf instead of cifs_addr_ops,
/* check if server can support readpages */
if (cifs_sb_master_tcon(cifs_sb)->ses->server->maxBuf <
PAGE_SIZE + MAX_CIFS_HDR_SIZE)
inode->i_data.a_ops = &cifs_addr_ops_smallbuf;
else
inode->i_data.a_ops = &cifs_addr_ops;
maxBuf is came from 2 places, SMB2_negotiate() and CIFSSMBNegotiate(),
(SMB2_MAX_BUFFER_SIZE is 64KB)
SMB2_negotiate():
/* set it to the maximum buffer size value we can send with 1 credit */
server->maxBuf = min_t(unsigned int, le32_to_cpu(rsp->MaxTransactSize),
SMB2_MAX_BUFFER_SIZE);
CIFSSMBNegotiate():
server->maxBuf = le32_to_cpu(pSMBr->MaxBufferSize);
Page size 64KB and cache=strict lead to read_pages() use cifs_readpage()
instead of cifs_readpages(), and then cifs_read() using maximum read IO
size 16KB, which is much slower than maximum read IO size 1MB.
(CIFSMaxBufSize is 16KB by default)
/* FIXME: set up handlers for larger reads and/or convert to async */
rsize = min_t(unsigned int, cifs_sb->rsize, CIFSMaxBufSize);
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Jones Syue <jonessyue@qnap.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
We already dump these keys for SMB3, lets also dump it for SMB2
sessions so that we can use the session key in wireshark to check and validate
that the signatures are correct.
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
|
|
When checking for draining with __req_need_defer(), it tries to match
how many requests were sent before a current one with number of already
completed. Dropped SQEs are included in req->sequence, and they won't
ever appear in CQ. To compensate for that, __req_need_defer() substracts
ctx->cached_sq_dropped.
However, what it should really use is number of SQEs dropped __before__
the current one. In other words, any submitted request shouldn't
shouldn't affect dequeueing from the drain queue of previously submitted
ones.
Instead of saving proper ctx->cached_sq_dropped in each request,
substract from req->sequence it at initialisation, so it includes number
of properly submitted requests.
note: it also changes behaviour of timeouts, but
1. it's already diverge from the description because of using SQ
2. the description is ambiguous regarding dropped SQEs
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
req->timeout.count and req->io->timeout.seq_offset store the same value,
which is sqe->off. Kill the second one
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
io_timeout() can be executed asynchronously by a worker and without
holding ctx->uring_lock
1. using ctx->cached_sq_head there is racy there
2. it should count events from a moment of timeout's submission, but
not execution
Use req->sequence.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"We have a few regressions and one fix for stable:
- revert fsync optimization
- fix lost i_size update
- fix a space accounting leak
- build fix, add back definition of a deprecated ioctl flag
- fix search condition for old roots in relocation"
* tag 'for-5.7-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: re-instantiate the removed BTRFS_SUBVOL_CREATE_ASYNC definition
btrfs: fix reclaim counter leak of space_info objects
btrfs: make full fsyncs always operate on the entire file again
btrfs: fix lost i_size update after cloning inline extent
btrfs: check commit root generation in should_ignore_root
|
|
syzbot reports this crash:
BUG: unable to handle page fault for address: ffffffffffffffe8
PGD f96e17067 P4D f96e17067 PUD f96e19067 PMD 0
Oops: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN PTI
CPU: 55 PID: 211750 Comm: trinity-c127 Tainted: G B L 5.7.0-rc1-next-20200413 #4
Hardware name: HP ProLiant DL380 Gen9/ProLiant DL380 Gen9, BIOS P89 04/12/2017
RIP: 0010:__wake_up_common+0x98/0x290
el/sched/wait.c:87
Code: 40 4d 8d 78 e8 49 8d 7f 18 49 39 fd 0f 84 80 00 00 00 e8 6b bd 2b 00 49 8b 5f 18 45 31 e4 48 83 eb 18 4c 89 ff e8 08 bc 2b 00 <45> 8b 37 41 f6 c6 04 75 71 49 8d 7f 10 e8 46 bd 2b 00 49 8b 47 10
RSP: 0018:ffffc9000adbfaf0 EFLAGS: 00010046
RAX: 0000000000000000 RBX: ffffffffffffffe8 RCX: ffffffffaa9636b8
RDX: 0000000000000003 RSI: dffffc0000000000 RDI: ffffffffffffffe8
RBP: ffffc9000adbfb40 R08: fffffbfff582c5fd R09: fffffbfff582c5fd
R10: ffffffffac162fe3 R11: fffffbfff582c5fc R12: 0000000000000000
R13: ffff888ef82b0960 R14: ffffc9000adbfb80 R15: ffffffffffffffe8
FS: 00007fdcba4c4740(0000) GS:ffff889033780000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffffffffffffe8 CR3: 0000000f776a0004 CR4: 00000000001606e0
Call Trace:
__wake_up_common_lock+0xea/0x150
ommon_lock at kernel/sched/wait.c:124
? __wake_up_common+0x290/0x290
? lockdep_hardirqs_on+0x16/0x2c0
__wake_up+0x13/0x20
io_cqring_ev_posted+0x75/0xe0
v_posted at fs/io_uring.c:1160
io_ring_ctx_wait_and_kill+0x1c0/0x2f0
l at fs/io_uring.c:7305
io_uring_create+0xa8d/0x13b0
? io_req_defer_prep+0x990/0x990
? __kasan_check_write+0x14/0x20
io_uring_setup+0xb8/0x130
? io_uring_create+0x13b0/0x13b0
? check_flags.part.28+0x220/0x220
? lockdep_hardirqs_on+0x16/0x2c0
__x64_sys_io_uring_setup+0x31/0x40
do_syscall_64+0xcc/0xaf0
? syscall_return_slowpath+0x580/0x580
? lockdep_hardirqs_off+0x1f/0x140
? entry_SYSCALL_64_after_hwframe+0x3e/0xb3
? trace_hardirqs_off_caller+0x3a/0x150
? trace_hardirqs_off_thunk+0x1a/0x1c
entry_SYSCALL_64_after_hwframe+0x49/0xb3
RIP: 0033:0x7fdcb9dd76ed
Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 6b 57 2c 00 f7 d8 64 89 01 48
RSP: 002b:00007ffe7fd4e4f8 EFLAGS: 00000246 ORIG_RAX: 00000000000001a9
RAX: ffffffffffffffda RBX: 00000000000001a9 RCX: 00007fdcb9dd76ed
RDX: fffffffffffffffc RSI: 0000000000000000 RDI: 0000000000005d54
RBP: 00000000000001a9 R08: 0000000e31d3caa7 R09: 0082400004004000
R10: ffffffffffffffff R11: 0000000000000246 R12: 0000000000000002
R13: 00007fdcb842e058 R14: 00007fdcba4c46c0 R15: 00007fdcb842e000
Modules linked in: bridge stp llc nfnetlink cn brd vfat fat ext4 crc16 mbcache jbd2 loop kvm_intel kvm irqbypass intel_cstate intel_uncore dax_pmem intel_rapl_perf dax_pmem_core ip_tables x_tables xfs sd_mod tg3 firmware_class libphy hpsa scsi_transport_sas dm_mirror dm_region_hash dm_log dm_mod [last unloaded: binfmt_misc]
CR2: ffffffffffffffe8
---[ end trace f9502383d57e0e22 ]---
RIP: 0010:__wake_up_common+0x98/0x290
Code: 40 4d 8d 78 e8 49 8d 7f 18 49 39 fd 0f 84 80 00 00 00 e8 6b bd 2b 00 49 8b 5f 18 45 31 e4 48 83 eb 18 4c 89 ff e8 08 bc 2b 00 <45> 8b 37 41 f6 c6 04 75 71 49 8d 7f 10 e8 46 bd 2b 00 49 8b 47 10
RSP: 0018:ffffc9000adbfaf0 EFLAGS: 00010046
RAX: 0000000000000000 RBX: ffffffffffffffe8 RCX: ffffffffaa9636b8
RDX: 0000000000000003 RSI: dffffc0000000000 RDI: ffffffffffffffe8
RBP: ffffc9000adbfb40 R08: fffffbfff582c5fd R09: fffffbfff582c5fd
R10: ffffffffac162fe3 R11: fffffbfff582c5fc R12: 0000000000000000
R13: ffff888ef82b0960 R14: ffffc9000adbfb80 R15: ffffffffffffffe8
FS: 00007fdcba4c4740(0000) GS:ffff889033780000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffffffffffffe8 CR3: 0000000f776a0004 CR4: 00000000001606e0
Kernel panic - not syncing: Fatal exception
Kernel Offset: 0x29800000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
---[ end Kernel panic - not syncing: Fatal exception ]—
which is due to error injection (or allocation failure) preventing the
rings from being setup. On shutdown, we attempt to remove any pending
requests, and for poll request, we call io_cqring_ev_posted() when we've
killed poll requests. However, since the rings aren't setup, we won't
find any poll requests. Make the calling of io_cqring_ev_posted()
dependent on actually having completed requests. This fixes this setup
corner case, and removes spurious calls if we remove poll requests and
don't find any.
Reported-by: Qian Cai <cai@lca.pw>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
We need to drop the inode spinlock while calling nfs4_select_rw_stateid(),
since nfs4_copy_delegation_stateid() could take the delegation lock.
Note that it is safe to do this, since all other calls to
pnfs_update_layout() for that inode will find themselves blocked by
the lock we hold on NFS_LAYOUT_FIRST_LAYOUTGET.
Fixes: fc51b1cf391d ("NFS: Beware when dereferencing the delegation cred")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
|
|
The new async dirops callback routines can pass ERR_PTR values to
ceph_mdsc_free_path, which could cause an oops. Make ceph_mdsc_free_path
ignore ERR_PTR values. Also, ensure that the pr_warn messages look sane
even if ceph_mdsc_build_path fails.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
|
If the request has been marked as canceled, don't try and issue it.
Instead just fill a canceled event and finish the request.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
We added this for just the regular poll requests in commit a6ba632d2c24
("io_uring: retry poll if we got woken with non-matching mask"), we
should do the same for the poll handler used pollable async requests.
Move the re-wait check and arm into a helper, and call it from
io_async_task_func() as well.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
In the reflink extent remap function, it turns out that uirec (the block
mapping corresponding only to the part of the passed-in mapping that got
unmapped) was not fully initialized. Specifically, br_state was not
being copied from the passed-in struct to the uirec. This could lead to
unpredictable results such as the reflinked mapping being marked
unwritten in the destination file.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
|
|
The filesystem freeze sequence in XFS waits on any background
eofblocks or cowblocks scans to complete before the filesystem is
quiesced. At this point, the freezer has already stopped the
transaction subsystem, however, which means a truncate or cowblock
cancellation in progress is likely blocked in transaction
allocation. This results in a deadlock between freeze and the
associated scanner.
Fix this problem by holding superblock write protection across calls
into the block reapers. Since protection for background scans is
acquired from the workqueue task context, trylock to avoid a similar
deadlock between freeze and blocking on the write lock.
Fixes: d6b636ebb1c9f ("xfs: halt auto-reclamation activities while rebuilding rmap")
Reported-by: Paul Furtado <paulfurtado91@gmail.com>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Chandan Rajendra <chandanrlinux@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Allison Collins <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
New struct nfsd4_blocked_lock allocated in find_or_allocate_block()
does not initialized nbl_list and nbl_lru.
If conflock allocation fails rollback can call list_del_init()
access uninitialized fields and corrupt memory.
v2: just initialize nbl_list and nbl_lru right after nbl allocation.
Fixes: 76d348fadff5 ("nfsd: have nfsd4_lock use blocking locks for v4.1+ lock")
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
If a dentry's version is somewhere between invalid_before and the current
directory version, we should be setting it forward to the current version,
not backwards to the invalid_before version. Note that we're only doing
this at all because dentry::d_fsdata isn't large enough on a 32-bit system.
Fix this by using a separate variable for invalid_before so that we don't
accidentally clobber the current dir version.
Fixes: a4ff7401fbfa ("afs: Keep track of invalid-before version for dentry coherency")
Signed-off-by: David Howells <dhowells@redhat.com>
|
|
AFS directories are retained locally as a structured file, with lookup
being effected by a local search of the file contents. When a modification
(such as mkdir) happens, the dir file content is modified locally rather
than redownloading the directory.
The directory contents are accessed in a number of ways, with a number of
different locks schemes:
(1) Download of contents - dvnode->validate_lock/write in afs_read_dir().
(2) Lookup and readdir - dvnode->validate_lock/read in afs_dir_iterate(),
downgrading from (1) if necessary.
(3) d_revalidate of child dentry - dvnode->validate_lock/read in
afs_do_lookup_one() downgrading from (1) if necessary.
(4) Edit of dir after modification - page locks on individual dir pages.
Unfortunately, because (4) uses different locking scheme to (1) - (3),
nothing protects against the page being scanned whilst the edit is
underway. Even download is not safe as it doesn't lock the pages - relying
instead on the validate_lock to serialise as a whole (the theory being that
directory contents are treated as a block and always downloaded as a
block).
Fix this by write-locking dvnode->validate_lock around the edits. Care
must be taken in the rename case as there may be two different dirs - but
they need not be locked at the same time. In any case, once the lock is
taken, the directory version must be rechecked, and the edit skipped if a
later version has been downloaded by revalidation (there can't have been
any local changes because the VFS holds the inode lock, but there can have
been remote changes).
Fixes: 63a4681ff39c ("afs: Locally edit directory data for mkdir/create/unlink/...")
Signed-off-by: David Howells <dhowells@redhat.com>
|
|
Fix the length of the dump of a bad YFSFetchStatus record. The function
was copied from the AFS version, but the YFS variant contains bigger fields
and extra information, so expand the dump to match.
Signed-off-by: David Howells <dhowells@redhat.com>
|
|
The afs_deliver_fs_rename() and yfs_deliver_fs_rename() functions both only
decode the second file status returned unless the parent directories are
different - unfortunately, this means that the xdr pointer isn't advanced
and the volsync record will be read incorrectly in such an instance.
Fix this by always decoding the second status into the second
status/callback block which wasn't being used if the dirs were the same.
The afs_update_dentry_version() calls that update the directory data
version numbers on the dentries can then unconditionally use the second
status record as this will always reflect the state of the destination dir
(the two records will be identical if the destination dir is the same as
the source dir)
Fixes: 260a980317da ("[AFS]: Add "directory write" support.")
Fixes: 30062bd13e36 ("afs: Implement YFS support in the fs client")
Signed-off-by: David Howells <dhowells@redhat.com>
|
|
If we're decoding an AFSFetchStatus record and we see that the version is 1
and the abort code is set and we're expecting inline errors, then we store
the abort code and ignore the remaining status record (which is correct),
but we don't set the flag to say we got a valid abort code.
This can affect operation of YFS.RemoveFile2 when removing a file and the
operation of {,Y}FS.InlineBulkStatus when prospectively constructing or
updating of a set of inodes during a lookup.
Fix this to indicate the reception of a valid abort code.
Fixes: a38a75581e6e ("afs: Fix unlink to handle YFS.RemoveFile2 better")
Signed-off-by: David Howells <dhowells@redhat.com>
|
|
If we receive a status record that has VNOVNODE set in the abort field,
xdr_decode_AFSFetchStatus() and xdr_decode_YFSFetchStatus() don't advance
the XDR pointer, thereby corrupting anything subsequent decodes from the
same block of data.
This has the potential to affect AFS.InlineBulkStatus and
YFS.InlineBulkStatus operation, but probably doesn't since the status
records are extracted as individual blocks of data and the buffer pointer
is reset between blocks.
It does affect YFS.RemoveFile2 operation, corrupting the volsync record -
though that is not currently used.
Other operations abort the entire operation rather than returning an error
inline, in which case there is no decoding to be done.
Fix this by unconditionally advancing the xdr pointer.
Fixes: 684b0f68cf1c ("afs: Fix AFSFetchStatus decoder to provide OpenAFS compatibility")
Signed-off-by: David Howells <dhowells@redhat.com>
|
|
The splice file punt check uses file->f_mode to check for O_NONBLOCK,
but it should be checking file->f_flags. This leads to punting even
for files that have O_NONBLOCK set, which isn't necessary. This equates
to checking for FMODE_PATH, which will never be set on the fd in
question.
Fixes: 7d67af2c0134 ("io_uring: add splice(2) support")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Pull cifs fixes from Steve French:
"Ten cifs/smb fixes:
- five RDMA (smbdirect) related fixes
- add experimental support for swap over SMB3 mounts
- also a fix which improves performance of signed connections"
* tag '5.7-rc-smb3-fixes-part2' of git://git.samba.org/sfrench/cifs-2.6:
smb3: enable swap on SMB3 mounts
smb3: change noisy error message to FYI
smb3: smbdirect support can be configured by default
cifs: smbd: Do not schedule work to send immediate packet on every receive
cifs: smbd: Properly process errors on ib_post_send
cifs: Allocate crypto structures on the fly for calculating signatures of incoming packets
cifs: smbd: Update receive credits before sending and deal with credits roll back on failure before sending
cifs: smbd: Check send queue size before posting a send
cifs: smbd: Merge code to track pending packets
cifs: ignore cached share root handle closing errors
|
|
Pull NFS client bugfix from Trond Myklebust:
"Fix an RCU read lock leakage in pnfs_alloc_ds_commits_list()"
* tag 'nfs-for-5.7-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
pNFS: Fix RCU lock leakage
|
|
When running liburing test case 'accept', I got below warning:
RED: Invalid credentials
RED: At include/linux/cred.h:285
RED: Specified credentials: 00000000d02474a0
RED: ->magic=4b, put_addr=000000005b4f46e9
RED: ->usage=-1699227648, subscr=-25693
RED: ->*uid = { 256,-25693,-25693,65534 }
RED: ->*gid = { 0,-1925859360,-1789740800,-1827028688 }
RED: ->security is 00000000258c136e
eneral protection fault, probably for non-canonical address 0xdead4ead00000000: 0000 [#1] SMP PTI
PU: 21 PID: 2037 Comm: accept Not tainted 5.6.0+ #318
ardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS rel-1.11.1-0-g0551a4be2c-prebuilt.qemu-project.org 04/01/2014
IP: 0010:dump_invalid_creds+0x16f/0x184
ode: 48 8b 83 88 00 00 00 48 3d ff 0f 00 00 76 29 48 89 c2 81 e2 00 ff ff ff 48
81 fa 00 6b 6b 6b 74 17 5b 48 c7 c7 4b b1 10 8e 5d <8b> 50 04 41 5c 8b 30 41 5d
e9 67 e3 04 00 5b 5d 41 5c 41 5d c3 0f
SP: 0018:ffffacc1039dfb38 EFLAGS: 00010087
AX: dead4ead00000000 RBX: ffff9ba39319c100 RCX: 0000000000000007
DX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff8e10b14b
BP: ffffffff8e108476 R08: 0000000000000000 R09: 0000000000000001
10: 0000000000000000 R11: ffffacc1039df9e5 R12: 000000009552b900
13: 000000009319c130 R14: ffff9ba39319c100 R15: 0000000000000246
S: 00007f96b2bfc4c0(0000) GS:ffff9ba39f340000(0000) knlGS:0000000000000000
S: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
R2: 0000000000401870 CR3: 00000007db7a4000 CR4: 00000000000006e0
all Trace:
__invalid_creds+0x48/0x4a
__io_req_aux_free+0x2e8/0x3b0
? io_poll_remove_one+0x2a/0x1d0
__io_free_req+0x18/0x200
io_free_req+0x31/0x350
io_poll_remove_one+0x17f/0x1d0
io_poll_cancel.isra.80+0x6c/0x80
io_async_find_and_cancel+0x111/0x120
io_issue_sqe+0x181/0x10e0
? __lock_acquire+0x552/0xae0
? lock_acquire+0x8e/0x310
? fs_reclaim_acquire.part.97+0x5/0x30
__io_queue_sqe.part.100+0xc4/0x580
? io_submit_sqes+0x751/0xbd0
? rcu_read_lock_sched_held+0x32/0x40
io_submit_sqes+0x9ba/0xbd0
? __x64_sys_io_uring_enter+0x2b2/0x460
? __x64_sys_io_uring_enter+0xaf/0x460
? find_held_lock+0x2d/0x90
? __x64_sys_io_uring_enter+0x111/0x460
__x64_sys_io_uring_enter+0x2d7/0x460
do_syscall_64+0x5a/0x230
entry_SYSCALL_64_after_hwframe+0x49/0xb3
After looking into codes, it turns out that this issue is because we didn't
restore the req->work, which is changed in io_arm_poll_handler(), req->work
is a union with below struct:
struct {
struct callback_head task_work;
struct hlist_node hash_node;
struct async_poll *apoll;
};
If we forget to restore, members in struct io_wq_work would be invalid,
restore the req->work to fix this issue.
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Get rid of not needed 'need_restore' variable.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Requests initialisation is scattered across several functions, namely
io_init_req(), io_submit_sqes(), io_submit_sqe(). Put it
in io_init_req() for better data locality and code clarity.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
It's a good idea to not read sqe->flags twice, as it's prone to security
bugs. Instead of passing it around, embeed them in req->flags. It's
already so except for IOSQE_IO_LINK.
1. rename former REQ_F_LINK -> REQ_F_LINK_HEAD
2. introduce and copy REQ_F_LINK, which mimics IO_IOSQE_LINK
And leave req_set_fail_links() using new REQ_F_LINK, because it's more
sensible.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Having only one place for cleaning up a request after a link assembly/
submission failure will play handy in the future. At least it allows
to remove duplicated cleanup sequence.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
As a preparation for extracting request init bits, remove self-coded mm
tracking from io_submit_sqes(), but rely on current->mm. It's more
convenient, than passing this piece of state in other functions.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
If io_submit_sqes() can't grab an mm, it fails and exits right away.
There is no need to track the fact of the failure. Remove @mm_fault.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Another brown paper bag moment. pnfs_alloc_ds_commits_list() is leaking
the RCU lock.
Fixes: a9901899b649 ("pNFS: Add infrastructure for cleaning up per-layout commit structures")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
|
|
Merge yet more updates from Andrew Morton:
- Almost all of the rest of MM (memcg, slab-generic, slab, pagealloc,
gup, hugetlb, pagemap, memremap)
- Various other things (hfs, ocfs2, kmod, misc, seqfile)
* akpm: (34 commits)
ipc/util.c: sysvipc_find_ipc() should increase position index
kernel/gcov/fs.c: gcov_seq_next() should increase position index
fs/seq_file.c: seq_read(): add info message about buggy .next functions
drivers/dma/tegra20-apb-dma.c: fix platform_get_irq.cocci warnings
change email address for Pali Rohár
selftests: kmod: test disabling module autoloading
selftests: kmod: fix handling test numbers above 9
docs: admin-guide: document the kernel.modprobe sysctl
fs/filesystems.c: downgrade user-reachable WARN_ONCE() to pr_warn_once()
kmod: make request_module() return an error when autoloading is disabled
mm/memremap: set caching mode for PCI P2PDMA memory to WC
mm/memory_hotplug: add pgprot_t to mhp_params
powerpc/mm: thread pgprot_t through create_section_mapping()
x86/mm: introduce __set_memory_prot()
x86/mm: thread pgprot_t through init_memory_mapping()
mm/memory_hotplug: rename mhp_restrictions to mhp_params
mm/memory_hotplug: drop the flags field from struct mhp_restrictions
mm/special: create generic fallbacks for pte_special() and pte_mkspecial()
mm/vma: introduce VM_ACCESS_FLAGS
mm/vma: define a default value for VM_DATA_DEFAULT_FLAGS
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux
Pull orangefs updates from Mike Marshall:
"A fix and two cleanups.
Fix:
- Christoph Hellwig noticed that some logic I added to
orangefs_file_read_iter introduced a race condition, so he sent a
reversion patch. I had to modify his patch since reverting at this
point broke Orangefs.
Cleanups:
- Christoph Hellwig noticed that we were doing some unnecessary work
in orangefs_flush, so he sent in a patch that removed the un-needed
code.
- Al Viro told me he had trouble building Orangefs. Orangefs should
be easy to build, even for Al :-).
I looked back at the test server build notes in orangefs.txt, just
in case that's where the trouble really is, and found a couple of
typos and made a couple of clarifications"
* tag 'for-linus-5.7-ofs1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux:
orangefs: clarify build steps for test server in orangefs.txt
orangefs: don't mess with I_DIRTY_TIMES in orangefs_flush
orangefs: get rid of knob code...
|
|
Patch series "seq_file .next functions should increase position index".
In Aug 2018 NeilBrown noticed commit 1f4aace60b0e ("fs/seq_file.c:
simplify seq_file iteration code and interface")
"Some ->next functions do not increment *pos when they return NULL...
Note that such ->next functions are buggy and should be fixed. A simple
demonstration is dd if=/proc/swaps bs=1000 skip=1 Choose any block size
larger than the size of /proc/swaps. This will always show the whole
last line of /proc/swaps"
Described problem is still actual. If you make lseek into middle of
last output line following read will output end of last line and whole
last line once again.
$ dd if=/proc/swaps bs=1 # usual output
Filename Type Size Used Priority
/dev/dm-0 partition 4194812 97536 -2
104+0 records in
104+0 records out
104 bytes copied
$ dd if=/proc/swaps bs=40 skip=1 # last line was generated twice
dd: /proc/swaps: cannot skip to specified offset
v/dm-0 partition 4194812 97536 -2
/dev/dm-0 partition 4194812 97536 -2
3+1 records in
3+1 records out
131 bytes copied
There are lot of other affected files, I've found 30+ including
/proc/net/ip_tables_matches and /proc/sysvipc/*
I've sent patches into maillists of affected subsystems already, this
patch-set fixes the problem in files related to pstore, tracing, gcov,
sysvipc and other subsystems processed via linux-kernel@ mailing list
directly
https://bugzilla.kernel.org/show_bug.cgi?id=206283
This patch (of 4):
Add debug code to seq_read() to detect missed or out-of-tree incorrect
.next seq_file functions.
[akpm@linux-foundation.org: s/pr_info/pr_info_ratelimited/, per Qian Cai]
https://bugzilla.kernel.org/show_bug.cgi?id=206283
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: NeilBrown <neilb@suse.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Peter Oberparleiter <oberpar@linux.ibm.com>
Cc: Waiman Long <longman@redhat.com>
Link: http://lkml.kernel.org/r/244674e5-760c-86bd-d08a-047042881748@virtuozzo.com
Link: http://lkml.kernel.org/r/7c24087c-e280-e580-5b0c-0cdaeb14cd18@virtuozzo.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
For security reasons I stopped using gmail account and kernel address is
now up-to-date alias to my personal address.
People periodically send me emails to address which they found in source
code of drivers, so this change reflects state where people can contact
me.
[ Added .mailmap entry as per Joe Perches - Linus ]
Signed-off-by: Pali Rohár <pali@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Joe Perches <joe@perches.com>
Link: http://lkml.kernel.org/r/20200307104237.8199-1-pali@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
After request_module(), nothing is stopping the module from being
unloaded until someone takes a reference to it via try_get_module().
The WARN_ONCE() in get_fs_type() is thus user-reachable, via userspace
running 'rmmod' concurrently.
Since WARN_ONCE() is for kernel bugs only, not for user-reachable
situations, downgrade this warning to pr_warn_once().
Keep it printed once only, since the intent of this warning is to detect
a bug in modprobe at boot time. Printing the warning more than once
wouldn't really provide any useful extra information.
Fixes: 41124db869b7 ("fs: warn in case userspace lied about modprobe return")
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jessica Yu <jeyu@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jeff Vander Stoep <jeffv@google.com>
Cc: Jessica Yu <jeyu@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: NeilBrown <neilb@suse.com>
Cc: <stable@vger.kernel.org> [4.13+]
Link: http://lkml.kernel.org/r/20200312202552.241885-3-ebiggers@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Linux fallocate(2) with FALLOC_FL_PUNCH_HOLE mode set, its offset can
exceed the inode size. Ocfs2 now doesn't allow that offset beyond inode
size. This restriction is not necessary and violates fallocate(2)
semantics.
If fallocate(2) offset is beyond inode size, just return success and do
nothing further.
Otherwise, ocfs2 will crash the kernel.
kernel BUG at fs/ocfs2//alloc.c:7264!
ocfs2_truncate_inline+0x20f/0x360 [ocfs2]
ocfs2_remove_inode_range+0x23c/0xcb0 [ocfs2]
__ocfs2_change_file_space+0x4a5/0x650 [ocfs2]
ocfs2_fallocate+0x83/0xa0 [ocfs2]
vfs_fallocate+0x148/0x230
SyS_fallocate+0x48/0x80
do_syscall_64+0x79/0x170
Signed-off-by: Changwei Ge <chge@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/20200407082754.17565-1-chge@linux.alibaba.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
When removing files containing extended attributes, the hfsplus driver may
remove the wrong entries from the attributes b-tree, causing major
filesystem damage and in some cases even kernel crashes.
To remove a file, all its extended attributes have to be removed as well.
The driver does this by looking up all keys in the attributes b-tree with
the cnid of the file. Each of these entries then gets deleted using the
key used for searching, which doesn't contain the attribute's name when it
should. Since the key doesn't contain the name, the deletion routine will
not find the correct entry and instead remove the one in front of it. If
parent nodes have to be modified, these become corrupt as well. This
causes invalid links and unsorted entries that not even macOS's fsck_hfs
is able to fix.
To fix this, modify the search key before an entry is deleted from the
attributes b-tree by copying the found entry's key into the search key,
therefore ensuring that the correct entry gets removed from the tree.
Signed-off-by: Simon Gander <simon@tuxera.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Anton Altaparmakov <anton@tuxera.com>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/20200327155541.1521-1-simon@tuxera.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull proc fix from Eric Biederman:
"A brown paper bag slipped through my proc changes, and syzcaller
caught it when the code ended up in your tree.
I have opted to fix it the simplest cleanest way I know how, so there
is no reasonable chance for the bug to repeat"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
proc: Use a dedicated lock in struct pid
|
|
Add experimental support for allowing a swap file to be on an SMB3
mount. There are use cases where swapping over a secure network
filesystem is preferable. In some cases there are no local
block devices large enough, and network block devices can be
hard to setup and secure. And in some cases there are no
local block devices at all (e.g. with the recent addition of
remote boot over SMB3 mounts).
There are various enhancements that can be added later e.g.:
- doing a mandatory byte range lock over the swapfile (until
the Linux VFS is modified to notify the file system that an open
is for a swapfile, when the file can be opened "DENY_ALL" to prevent
others from opening it).
- pinning more buffers in the underlying transport to minimize memory
allocations in the TCP stack under the fs
- documenting how to create ACLs (on the server) to secure the
swapfile (or adding additional tools to cifs-utils to make it easier)
Signed-off-by: Steve French <stfrench@microsoft.com>
Acked-by: Pavel Shilovsky <pshilov@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
|
|
Pull io_uring fixes from Jens Axboe:
"Here's a set of fixes that either weren't quite ready for the first,
or came about from some intensive testing on memcached with 350K+
sockets.
Summary:
- Fixes for races or deadlocks around poll handling
- Don't double account fixed files against RLIMIT_NOFILE
- IORING_OP_OPENAT LFS fix
- Poll retry handling (Bijan)
- Missing finish_wait() for SQPOLL (Hillf)
- Cleanup/split of io_kiocb alloc vs ctx references (Pavel)
- Fixed file unregistration and init fixes (Xiaoguang)
- Various little fixes (Xiaoguang, Pavel, Colin)"
* tag 'io_uring-5.7-2020-04-09' of git://git.kernel.dk/linux-block:
io_uring: punt final io_ring_ctx wait-and-free to workqueue
io_uring: fix fs cleanup on cqe overflow
io_uring: don't read user-shared sqe flags twice
io_uring: remove req init from io_get_req()
io_uring: alloc req only after getting sqe
io_uring: simplify io_get_sqring
io_uring: do not always copy iovec in io_req_map_rw()
io_uring: ensure openat sets O_LARGEFILE if needed
io_uring: initialize fixed_file_data lock
io_uring: remove redundant variable pointer nxt and io_wq_assign_next call
io_uring: fix ctx refcounting in io_submit_sqes()
io_uring: process requests completed with -EAGAIN on poll list
io_uring: remove bogus RLIMIT_NOFILE check in file registration
io_uring: use io-wq manager as backup task if task is exiting
io_uring: grab task reference for poll requests
io_uring: retry poll if we got woken with non-matching mask
io_uring: add missing finish_wait() in io_sq_thread()
io_uring: refactor file register/unregister/update handling
|
|
Pull more xfs updates from Darrick Wong:
"As promised last week, this batch changes how xfs interacts with
memory reclaim; how the log batches and throttles log items; how hard
writes near ENOSPC will try to squeeze more space out of the
filesystem; and hopefully fix the last of the umount hangs after a
catastrophic failure.
Summary:
- Validate the realtime geometry in the superblock when mounting
- Refactor a bunch of tricky flag handling in the log code
- Flush the CIL more judiciously so that we don't wait until there
are millions of log items consuming a lot of memory.
- Throttle transaction commits to prevent the xfs frontend from
flooding the CIL with too many log items.
- Account metadata buffers correctly for memory reclaim.
- Mark slabs properly for memory reclaim. These should help reclaim
run more effectively when XFS is using a lot of memory.
- Don't write a garbage log record at unmount time if we're trying to
trigger summary counter recalculation at next mount.
- Don't block the AIL on locked dquot/inode buffers; instead trigger
its backoff mechanism to give the lock holder a chance to finish
up.
- Ratelimit writeback flushing when buffered writes encounter ENOSPC.
- Other minor cleanups.
- Make reflink a synchronous operation when the fs is mounted with
wsync or sync, which means that now we force the log to disk to
record the changes"
* tag 'xfs-5.7-merge-12' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (26 commits)
xfs: reflink should force the log out if mounted with wsync
xfs: factor out a new xfs_log_force_inode helper
xfs: fix inode number overflow in ifree cluster helper
xfs: remove redundant variable assignment in xfs_symlink()
xfs: ratelimit inode flush on buffered write ENOSPC
xfs: return locked status of inode buffer on xfsaild push
xfs: trylock underlying buffer on dquot flush
xfs: remove unnecessary ternary from xfs_create
xfs: don't write a corrupt unmount record to force summary counter recalc
xfs: factor inode lookup from xfs_ifree_cluster
xfs: tail updates only need to occur when LSN changes
xfs: factor common AIL item deletion code
xfs: correctly acount for reclaimable slabs
xfs: Improve metadata buffer reclaim accountability
xfs: don't allow log IO to be throttled
xfs: Throttle commits on delayed background CIL push
xfs: Lower CIL flush limit for large logs
xfs: remove some stale comments from the log code
xfs: refactor unmount record writing
xfs: merge xlog_commit_record with xlog_write_done
...
|
|
We can't reliably wait in io_ring_ctx_wait_and_kill(), since the
task_works list isn't ordered (in fact it's LIFO ordered). We could
either fix this with a separate task_works list for io_uring work, or
just punt the wait-and-free to async context. This ensures that
task_work that comes in while we're shutting down is processed
correctly. If we don't go async, we could have work past the fput()
work for the ring that depends on work that won't be executed until
after we're done with the wait-and-free. But as this operation is
blocking, it'll never get a chance to run.
This was reproduced with hundreds of thousands of sockets running
memcached, haven't been able to reproduce this synthetically.
Reported-by: Dan Melnic <dmm@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The noisy posix error message in readdir was supposed
to be an FYI (not enabled by default)
CIFS VFS: XXX dev 66306, reparse 0, mode 755
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull more powerpc updates from Michael Ellerman:
"The bulk of this is the series to make CONFIG_COMPAT user-selectable,
it's been around for a long time but was blocked behind the
syscall-in-C series.
Plus there's also a few fixes and other minor things.
Summary:
- A fix for a crash in machine check handling on pseries (ie. guests)
- A small series to make it possible to disable CONFIG_COMPAT, and
turn it off by default for ppc64le where it's not used.
- A few other miscellaneous fixes and small improvements.
Thanks to: Alexey Kardashevskiy, Anju T Sudhakar, Arnd Bergmann,
Christophe Leroy, Dan Carpenter, Ganesh Goudar, Geert Uytterhoeven,
Geoff Levand, Mahesh Salgaonkar, Markus Elfring, Michal Suchanek,
Nicholas Piggin, Stephen Boyd, Wen Xiong"
* tag 'powerpc-5.7-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
selftests/powerpc: Always build the tm-poison test 64-bit
powerpc: Improve ppc_save_regs()
Revert "powerpc/64: irq_work avoid interrupt when called with hardware irqs enabled"
powerpc/time: Replace <linux/clk-provider.h> by <linux/of_clk.h>
powerpc/pseries/ddw: Extend upper limit for huge DMA window for persistent memory
powerpc/perf: split callchain.c by bitness
powerpc/64: Make COMPAT user-selectable disabled on littleendian by default.
powerpc/64: make buildable without CONFIG_COMPAT
powerpc/perf: consolidate valid_user_sp -> invalid_user_sp
powerpc/perf: consolidate read_user_stack_32
powerpc: move common register copy functions from signal_32.c to signal.c
powerpc: Add back __ARCH_WANT_SYS_LLSEEK macro
powerpc/ps3: Set CONFIG_UEVENT_HELPER=y in ps3_defconfig
powerpc/ps3: Remove an unneeded NULL check
powerpc/ps3: Remove duplicate error message
powerpc/powernv: Re-enable imc trace-mode in kernel
powerpc/perf: Implement a global lock to avoid races between trace, core and thread imc events.
powerpc/pseries: Fix MCE handling on pseries
selftests/eeh: Skip ahci adapters
powerpc/64s: Fix doorbell wakeup msgclr optimisation
|
|
syzbot wrote:
> ========================================================
> WARNING: possible irq lock inversion dependency detected
> 5.6.0-syzkaller #0 Not tainted
> --------------------------------------------------------
> swapper/1/0 just changed the state of lock:
> ffffffff898090d8 (tasklist_lock){.+.?}-{2:2}, at: send_sigurg+0x9f/0x320 fs/fcntl.c:840
> but this lock took another, SOFTIRQ-unsafe lock in the past:
> (&pid->wait_pidfd){+.+.}-{2:2}
>
>
> and interrupts could create inverse lock ordering between them.
>
>
> other info that might help us debug this:
> Possible interrupt unsafe locking scenario:
>
> CPU0 CPU1
> ---- ----
> lock(&pid->wait_pidfd);
> local_irq_disable();
> lock(tasklist_lock);
> lock(&pid->wait_pidfd);
> <Interrupt>
> lock(tasklist_lock);
>
> *** DEADLOCK ***
>
> 4 locks held by swapper/1/0:
The problem is that because wait_pidfd.lock is taken under the tasklist
lock. It must always be taken with irqs disabled as tasklist_lock can be
taken from interrupt context and if wait_pidfd.lock was already taken this
would create a lock order inversion.
Oleg suggested just disabling irqs where I have added extra calls to
wait_pidfd.lock. That should be safe and I think the code will eventually
do that. It was rightly pointed out by Christian that sharing the
wait_pidfd.lock was a premature optimization.
It is also true that my pre-merge window testing was insufficient. So
remove the premature optimization and give struct pid a dedicated lock of
it's own for struct pid things. I have verified that lockdep sees all 3
paths where we take the new pid->lock and lockdep does not complain.
It is my current day dream that one day pid->lock can be used to guard the
task lists as well and then the tasklist_lock won't need to be held to
deliver signals. That will require taking pid->lock with irqs disabled.
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Link: https://lore.kernel.org/lkml/00000000000011d66805a25cd73f@google.com/
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Reported-by: syzbot+343f75cdeea091340956@syzkaller.appspotmail.com
Reported-by: syzbot+832aabf700bc3ec920b9@syzkaller.appspotmail.com
Reported-by: syzbot+f675f964019f884dbd0f@syzkaller.appspotmail.com
Reported-by: syzbot+a9fb1457d720a55d6dc5@syzkaller.appspotmail.com
Fixes: 7bc3e6e55acf ("proc: Use a list of inodes to flush from proc")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
|
|
If completion queue overflow occurs, __io_cqring_fill_event() will
update req->cflags, which is in a union with req->work and happens to
be aliased to req->work.fs. Following io_free_req() ->
io_req_work_drop_env() may get a bunch of different problems (miscount
fs->users, segfault, etc) on cleaning @fs.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Pull ceph updates from Ilya Dryomov:
"The main items are:
- support for asynchronous create and unlink (Jeff Layton).
Creates and unlinks are satisfied locally, without waiting for a
reply from the MDS, provided the client has been granted
appropriate caps (new in v15.y.z ("Octopus") release). This can be
a big help for metadata heavy workloads such as tar and rsync.
Opt-in with the new nowsync mount option.
- multiple blk-mq queues for rbd (Hannes Reinecke and myself).
When the driver was converted to blk-mq, we settled on a single
blk-mq queue because of a global lock in libceph and some other
technical debt. These have since been addressed, so allocate a
queue per CPU to enhance parallelism.
- don't hold onto caps that aren't actually needed (Zheng Yan).
This has been our long-standing behavior, but it causes issues with
some active/standby applications (synchronous I/O, stalls if the
standby goes down, etc).
- .snap directory timestamps consistent with ceph-fuse (Luis
Henriques)"
* tag 'ceph-for-5.7-rc1' of git://github.com/ceph/ceph-client: (49 commits)
ceph: fix snapshot directory timestamps
ceph: wait for async creating inode before requesting new max size
ceph: don't skip updating wanted caps when cap is stale
ceph: request new max size only when there is auth cap
ceph: cleanup return error of try_get_cap_refs()
ceph: return ceph_mdsc_do_request() errors from __get_parent()
ceph: check all mds' caps after page writeback
ceph: update i_requested_max_size only when sending cap msg to auth mds
ceph: simplify calling of ceph_get_fmode()
ceph: remove delay check logic from ceph_check_caps()
ceph: consider inode's last read/write when calculating wanted caps
ceph: always renew caps if mds_wanted is insufficient
ceph: update dentry lease for async create
ceph: attempt to do async create when possible
ceph: cache layout in parent dir on first sync create
ceph: add new MDS req field to hold delegated inode number
ceph: decode interval_sets for delegated inos
ceph: make ceph_fill_inode non-static
ceph: perform asynchronous unlink if we have sufficient caps
ceph: don't take refs to want mask unless we have all bits
...
|