Age | Commit message (Collapse) | Author |
|
Extract common code from if/else branches. That is cleaner and optimised
even better.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
This flag is no longer used, remove it.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The smart syzbot has found a reproducer for the following issue:
==================================================================
BUG: KASAN: use-after-free in instrument_atomic_write include/linux/instrumented.h:71 [inline]
BUG: KASAN: use-after-free in atomic_inc include/asm-generic/atomic-instrumented.h:240 [inline]
BUG: KASAN: use-after-free in io_wqe_inc_running fs/io-wq.c:301 [inline]
BUG: KASAN: use-after-free in io_wq_worker_running+0xde/0x110 fs/io-wq.c:613
Write of size 4 at addr ffff8882183db08c by task io_wqe_worker-0/7771
CPU: 0 PID: 7771 Comm: io_wqe_worker-0 Not tainted 5.9.0-rc4-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x198/0x1fd lib/dump_stack.c:118
print_address_description.constprop.0.cold+0xae/0x497 mm/kasan/report.c:383
__kasan_report mm/kasan/report.c:513 [inline]
kasan_report.cold+0x1f/0x37 mm/kasan/report.c:530
check_memory_region_inline mm/kasan/generic.c:186 [inline]
check_memory_region+0x13d/0x180 mm/kasan/generic.c:192
instrument_atomic_write include/linux/instrumented.h:71 [inline]
atomic_inc include/asm-generic/atomic-instrumented.h:240 [inline]
io_wqe_inc_running fs/io-wq.c:301 [inline]
io_wq_worker_running+0xde/0x110 fs/io-wq.c:613
schedule_timeout+0x148/0x250 kernel/time/timer.c:1879
io_wqe_worker+0x517/0x10e0 fs/io-wq.c:580
kthread+0x3b5/0x4a0 kernel/kthread.c:292
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:294
Allocated by task 7768:
kasan_save_stack+0x1b/0x40 mm/kasan/common.c:48
kasan_set_track mm/kasan/common.c:56 [inline]
__kasan_kmalloc.constprop.0+0xbf/0xd0 mm/kasan/common.c:461
kmem_cache_alloc_node_trace+0x17b/0x3f0 mm/slab.c:3594
kmalloc_node include/linux/slab.h:572 [inline]
kzalloc_node include/linux/slab.h:677 [inline]
io_wq_create+0x57b/0xa10 fs/io-wq.c:1064
io_init_wq_offload fs/io_uring.c:7432 [inline]
io_sq_offload_start fs/io_uring.c:7504 [inline]
io_uring_create fs/io_uring.c:8625 [inline]
io_uring_setup+0x1836/0x28e0 fs/io_uring.c:8694
do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Freed by task 21:
kasan_save_stack+0x1b/0x40 mm/kasan/common.c:48
kasan_set_track+0x1c/0x30 mm/kasan/common.c:56
kasan_set_free_info+0x1b/0x30 mm/kasan/generic.c:355
__kasan_slab_free+0xd8/0x120 mm/kasan/common.c:422
__cache_free mm/slab.c:3418 [inline]
kfree+0x10e/0x2b0 mm/slab.c:3756
__io_wq_destroy fs/io-wq.c:1138 [inline]
io_wq_destroy+0x2af/0x460 fs/io-wq.c:1146
io_finish_async fs/io_uring.c:6836 [inline]
io_ring_ctx_free fs/io_uring.c:7870 [inline]
io_ring_exit_work+0x1e4/0x6d0 fs/io_uring.c:7954
process_one_work+0x94c/0x1670 kernel/workqueue.c:2269
worker_thread+0x64c/0x1120 kernel/workqueue.c:2415
kthread+0x3b5/0x4a0 kernel/kthread.c:292
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:294
The buggy address belongs to the object at ffff8882183db000
which belongs to the cache kmalloc-1k of size 1024
The buggy address is located 140 bytes inside of
1024-byte region [ffff8882183db000, ffff8882183db400)
The buggy address belongs to the page:
page:000000009bada22b refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x2183db
flags: 0x57ffe0000000200(slab)
raw: 057ffe0000000200 ffffea0008604c48 ffffea00086a8648 ffff8880aa040700
raw: 0000000000000000 ffff8882183db000 0000000100000002 0000000000000000
page dumped because: kasan: bad access detected
Memory state around the buggy address:
ffff8882183daf80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
ffff8882183db000: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>ffff8882183db080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^
ffff8882183db100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff8882183db180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
==================================================================
which is down to the comment below,
/* all workers gone, wq exit can proceed */
if (!nr_workers && refcount_dec_and_test(&wqe->wq->refs))
complete(&wqe->wq->done);
because there might be multiple cases of wqe in a wq and we would wait
for every worker in every wqe to go home before releasing wq's resources
on destroying.
To that end, rework wq's refcount by making it independent of the tracking
of workers because after all they are two different things, and keeping
it balanced when workers come and go. Note the manager kthread, like
other workers, now holds a grab to wq during its lifetime.
Finally to help destroy wq, check IO_WQ_BIT_EXIT upon creating worker
and do nothing for exiting wq.
Cc: stable@vger.kernel.org # v5.5+
Reported-by: syzbot+45fa0a195b941764e0f0@syzkaller.appspotmail.com
Reported-by: syzbot+9af99580130003da82b1@syzkaller.appspotmail.com
Cc: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Hillf Danton <hdanton@sina.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
In most cases we'll specify IORING_SETUP_SQPOLL and run multiple
io_uring instances in a host. Since all sqthreads are named
"io_uring-sq", it's hard to distinguish the relations between
application process and its io_uring sqthread.
With this patch, application can get its corresponding sqthread pid
and cpu through show_fdinfo.
Steps:
1. Get io_uring fd first.
$ ls -l /proc/<pid>/fd | grep -w io_uring
2. Then get io_uring instance related info, including corresponding
sqthread pid and cpu.
$ cat /proc/<pid>/fdinfo/<io_uring_fd>
pos: 0
flags: 02000002
mnt_id: 13
SqThread: 6929
SqThreadCpu: 2
UserFiles: 1
0: testfile
UserBufs: 0
PollList:
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
[axboe: fixed for new shared SQPOLL]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
We do this for CQ ring wait, in case task_work completions come in. We
should do the same in io_uring_register(), to avoid spurious -EINTR
if the ring quiescing ends up having to process task_work to complete
the operation
Reported-by: Dan Melnic <dmm@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
There are a few operations that are offloaded to the worker threads. In
this case, we lose process context and end up in kthread context. This
results in ios to be not accounted to the issuing cgroup and
consequently end up as issued by root. Just like others, adopt the
personality of the blkcg too when issuing via the workqueues.
For the SQPOLL thread, it will live and attach in the inited cgroup's
context.
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
io_uring does account any registered buffer as pinned/locked memory, and
checks limit and fails if the given user doesn't have a big enough limit
to register the ranges specified. However, if huge pages are used, we
are potentially under-accounting the memory in terms of what gets pinned
on the vm side.
This patch rectifies that, by ensuring that we account the full size of
a compound page, regardless of how much of it is being registered. Huge
pages are not accounted mulitple times - if multiple sections of a huge
page is registered, then the page is only accounted once.
Reported-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Fixes coccicheck warning:
fs/io_uring.c:4242:13-14: Unneeded semicolon
Signed-off-by: Zheng Bin <zhengbin13@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
In the spirit of fairness, cap the max number of SQ entries we'll submit
for SQPOLL if we have multiple rings. If we don't do that, we could be
submitting tons of entries for one ring, while others are waiting to get
service.
The value of 8 is somewhat arbitrarily chosen as something that allows
a fair bit of batching, without using an excessive time per ring.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
There's really no point in having this union, it just means that we're
always allocating enough room to cater to any command. But that's
pointless, as the ->io field is request type private anyway.
This gets rid of the io_async_ctx structure, and fills in the required
size in the io_op_defs[] instead.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Testing ctx->user_bufs for NULL in io_import_fixed() is not neccessary,
because in that case ctx->nr_user_bufs would be zero, and the following
check would fail.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
When io_req_map_rw() is called from io_rw_prep_async(), it memcpy()
iorw->iter into itself. Even though it doesn't lead to an error, such a
memcpy()'s aliasing rules violation is considered to be a bad practise.
Inline io_req_map_rw() into io_rw_prep_async(). We don't really need any
remapping there, so it's much simpler than the generic implementation.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Set rw->free_iovec to @iovec, that gives an identical result and stresses
that @iovec param rw->free_iovec play the same role.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Don't touch iter->iov and iov in between __io_import_iovec() and
io_req_map_rw(), the former function aleady sets it correctly, because it
creates one more case with NULL'ed iov to consider in io_req_map_rw().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
When using SQPOLL, applications can run into the issue of running out of
SQ ring entries because the thread hasn't consumed them yet. The only
option for dealing with that is checking later, or busy checking for the
condition.
Provide IORING_ENTER_SQ_WAIT if applications want to wait on this
condition.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
These structures are never written, move them appropriately.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
We support using IORING_SETUP_ATTACH_WQ to share async backends between
rings created by the same process, this now also allows the same to
happen with SQPOLL. The setup procedure remains the same, the caller
sets io_uring_params->wq_fd to the 'parent' context, and then the newly
created ring will attach to that async backend.
This means that multiple rings can share the same SQPOLL thread, saving
resources.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Remove the SQPOLL thread from the ctx, and use the io_sq_data as the
data structure we pass in. io_sq_data has a list of ctx's that we can
then iterate over and handle.
As of now we're ready to handle multiple ctx's, though we're still just
handling a single one after this patch.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Move all the necessary state out of io_ring_ctx, and into a new
structure, io_sq_data. The latter now deals with any state or
variables associated with the SQPOLL thread itself.
In preparation for supporting more than one io_ring_ctx per SQPOLL
thread.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
This is done in preparation for handling more than one ctx, but it also
cleans up the code a bit since io_sq_thread() was a bit too unwieldy to
get a get overview on.
__io_sq_thread() is now the main handler, and it returns an enum sq_ret
that tells io_sq_thread() what it ended up doing. The parent then makes
a decision on idle, spinning, or work handling based on that.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
We need to decouple the clearing on wakeup from the the inline schedule,
as that is going to be required for handling multiple rings in one
thread.
Wrap our wakeup handler so we can clear it when we get the wakeup, by
definition that is when we no longer need the flag set.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
This is in preparation to sharing the poller thread between rings. For
that we need per-ring wait_queue_entry storage, and we can't easily put
that on the stack if one thread is managing multiple rings.
We'll also be sharing the wait_queue_head across rings for the purposes
of wakeups, provide the usual private ring wait_queue_head for now but
make it a pointer so we can easily override it when sharing.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
We're not handling signals by default in kernel threads, and we never
use TWA_SIGNAL for the SQPOLL thread internally. Hence we can never
have a signal pending, and we don't need to check for it (nor flush it).
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
During a context switch the scheduler invokes wq_worker_sleeping() with
disabled preemption. Disabling preemption is needed because it protects
access to `worker->sleeping'. As an optimisation it avoids invoking
schedule() within the schedule path as part of possible wake up (thus
preempt_enable_no_resched() afterwards).
The io-wq has been added to the mix in the same section with disabled
preemption. This breaks on PREEMPT_RT because io_wq_worker_sleeping()
acquires a spinlock_t. Also within the schedule() the spinlock_t must be
acquired after tsk_is_pi_blocked() otherwise it will block on the
sleeping lock again while scheduling out.
While playing with `io_uring-bench' I didn't notice a significant
latency spike after converting io_wqe::lock to a raw_spinlock_t. The
latency was more or less the same.
In order to keep the spinlock_t it would have to be moved after the
tsk_is_pi_blocked() check which would introduce a branch instruction
into the hot path.
The lock is used to maintain the `work_list' and wakes one task up at
most.
Should io_wqe_cancel_pending_work() cause latency spikes, while
searching for a specific item, then it would need to drop the lock
during iterations.
revert_creds() is also invoked under the lock. According to debug
cred::non_rcu is 0. Otherwise it should be moved outside of the locked
section because put_cred_rcu()->free_uid() acquires a sleeping lock.
Convert io_wqe::lock to a raw_spinlock_t.c
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
This patch adds a new IORING_SETUP_R_DISABLED flag to start the
rings disabled, allowing the user to register restrictions,
buffers, files, before to start processing SQEs.
When IORING_SETUP_R_DISABLED is set, SQE are not processed and
SQPOLL kthread is not started.
The restrictions registration are allowed only when the rings
are disable to prevent concurrency issue while processing SQEs.
The rings can be enabled using IORING_REGISTER_ENABLE_RINGS
opcode with io_uring_register(2).
Suggested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The new io_uring_register(2) IOURING_REGISTER_RESTRICTIONS opcode
permanently installs a feature allowlist on an io_ring_ctx.
The io_ring_ctx can then be passed to untrusted code with the
knowledge that only operations present in the allowlist can be
executed.
The allowlist approach ensures that new features added to io_uring
do not accidentally become available when an existing application
is launched on a newer kernel version.
Currently is it possible to restrict sqe opcodes, sqe flags, and
register opcodes.
IOURING_REGISTER_RESTRICTIONS can only be made once. Afterwards
it is not possible to change restrictions anymore.
This prevents untrusted code from removing restrictions.
Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
If we don't get and assign the namespace for the async work, then certain
paths just don't work properly (like /dev/stdin, /proc/mounts, etc).
Anything that references the current namespace of the given task should
be assigned for async work on behalf of that task.
Cc: stable@vger.kernel.org # v5.5+
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Grab actual references to the files_struct. To avoid circular references
issues due to this, we add a per-task note that keeps track of what
io_uring contexts a task has used. When the tasks execs or exits its
assigned files, we cancel requests based on this tracking.
With that, we can grab proper references to the files table, and no
longer need to rely on stashing away ring_fd and ring_file to check
if the ring_fd may have been closed.
Cc: stable@vger.kernel.org # v5.5+
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
This allows us to selectively flush out pending overflows, depending on
the task and/or files_struct being passed in.
No intended functional changes in this patch.
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Return whether we found and canceled requests or not. This is in
preparation for using this information, no functional changes in this
patch.
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Sometimes we assign a weak reference to it, sometimes we grab a
reference to it. Clean this up and make it unconditional, and drop the
flag related to tracking this state.
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
We can grab a reference to the task instead of stashing away the task
files_struct. This is doable without creating a circular reference
between the ring fd and the task itself.
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
No functional changes in this patch, prep patch for grabbing references
to the files_struct.
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
We currently cancel these when the ring exits, and we cancel all of
them. This is in preparation for killing only the ones associated
with a given task.
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
* io_uring-5.9:
io_uring: fix async buffered reads when readahead is disabled
io_uring: fix potential ABBA deadlock in ->show_fdinfo()
io_uring: always delete double poll wait entry on match
|
|
The async buffered reads feature is not working when readahead is
turned off. There are two things to concern:
- when doing retry in io_read, not only the IOCB_WAITQ flag but also
the IOCB_NOWAIT flag is still set, which makes it goes to would_block
phase in generic_file_buffered_read() and then return -EAGAIN. After
that, the io-wq thread work is queued, and later doing the async
reads in the old way.
- even if we remove IOCB_NOWAIT when doing retry, the feature is still
not running properly, since in generic_file_buffered_read() it goes to
lock_page_killable() after calling mapping->a_ops->readpage() to do
IO, and thus causing process to sleep.
Fixes: 1a0a7853b901 ("mm: support async buffered reads in generic_file_buffered_read()")
Fixes: 3b2a4439e0ae ("io_uring: get rid of kiocb_wait_page_queue_init()")
Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Pull NFS client bugfixes from Trond Myklebust:
"Highlights include:
- NFSv4.2: copy_file_range needs to invalidate caches on success
- NFSv4.2: Fix security label length not being reset
- pNFS/flexfiles: Ensure we initialise the mirror bsizes correctly
on read
- pNFS/flexfiles: Fix signed/unsigned type issues with mirror
indices"
* tag 'nfs-for-5.9-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
pNFS/flexfiles: Be consistent about mirror index types
pNFS/flexfiles: Ensure we initialise the mirror bsizes correctly on read
NFSv4.2: fix client's attribute cache management for copy_file_range
nfs: Fix security label length not being reset
|
|
syzbot reports a potential lock deadlock between the normal IO path and
->show_fdinfo():
======================================================
WARNING: possible circular locking dependency detected
5.9.0-rc6-syzkaller #0 Not tainted
------------------------------------------------------
syz-executor.2/19710 is trying to acquire lock:
ffff888098ddc450 (sb_writers#4){.+.+}-{0:0}, at: io_write+0x6b5/0xb30 fs/io_uring.c:3296
but task is already holding lock:
ffff8880a11b8428 (&ctx->uring_lock){+.+.}-{3:3}, at: __do_sys_io_uring_enter+0xe9a/0x1bd0 fs/io_uring.c:8348
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #2 (&ctx->uring_lock){+.+.}-{3:3}:
__mutex_lock_common kernel/locking/mutex.c:956 [inline]
__mutex_lock+0x134/0x10e0 kernel/locking/mutex.c:1103
__io_uring_show_fdinfo fs/io_uring.c:8417 [inline]
io_uring_show_fdinfo+0x194/0xc70 fs/io_uring.c:8460
seq_show+0x4a8/0x700 fs/proc/fd.c:65
seq_read+0x432/0x1070 fs/seq_file.c:208
do_loop_readv_writev fs/read_write.c:734 [inline]
do_loop_readv_writev fs/read_write.c:721 [inline]
do_iter_read+0x48e/0x6e0 fs/read_write.c:955
vfs_readv+0xe5/0x150 fs/read_write.c:1073
kernel_readv fs/splice.c:355 [inline]
default_file_splice_read.constprop.0+0x4e6/0x9e0 fs/splice.c:412
do_splice_to+0x137/0x170 fs/splice.c:871
splice_direct_to_actor+0x307/0x980 fs/splice.c:950
do_splice_direct+0x1b3/0x280 fs/splice.c:1059
do_sendfile+0x55f/0xd40 fs/read_write.c:1540
__do_sys_sendfile64 fs/read_write.c:1601 [inline]
__se_sys_sendfile64 fs/read_write.c:1587 [inline]
__x64_sys_sendfile64+0x1cc/0x210 fs/read_write.c:1587
do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #1 (&p->lock){+.+.}-{3:3}:
__mutex_lock_common kernel/locking/mutex.c:956 [inline]
__mutex_lock+0x134/0x10e0 kernel/locking/mutex.c:1103
seq_read+0x61/0x1070 fs/seq_file.c:155
pde_read fs/proc/inode.c:306 [inline]
proc_reg_read+0x221/0x300 fs/proc/inode.c:318
do_loop_readv_writev fs/read_write.c:734 [inline]
do_loop_readv_writev fs/read_write.c:721 [inline]
do_iter_read+0x48e/0x6e0 fs/read_write.c:955
vfs_readv+0xe5/0x150 fs/read_write.c:1073
kernel_readv fs/splice.c:355 [inline]
default_file_splice_read.constprop.0+0x4e6/0x9e0 fs/splice.c:412
do_splice_to+0x137/0x170 fs/splice.c:871
splice_direct_to_actor+0x307/0x980 fs/splice.c:950
do_splice_direct+0x1b3/0x280 fs/splice.c:1059
do_sendfile+0x55f/0xd40 fs/read_write.c:1540
__do_sys_sendfile64 fs/read_write.c:1601 [inline]
__se_sys_sendfile64 fs/read_write.c:1587 [inline]
__x64_sys_sendfile64+0x1cc/0x210 fs/read_write.c:1587
do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #0 (sb_writers#4){.+.+}-{0:0}:
check_prev_add kernel/locking/lockdep.c:2496 [inline]
check_prevs_add kernel/locking/lockdep.c:2601 [inline]
validate_chain kernel/locking/lockdep.c:3218 [inline]
__lock_acquire+0x2a96/0x5780 kernel/locking/lockdep.c:4441
lock_acquire+0x1f3/0xaf0 kernel/locking/lockdep.c:5029
percpu_down_read include/linux/percpu-rwsem.h:51 [inline]
__sb_start_write+0x228/0x450 fs/super.c:1672
io_write+0x6b5/0xb30 fs/io_uring.c:3296
io_issue_sqe+0x18f/0x5c50 fs/io_uring.c:5719
__io_queue_sqe+0x280/0x1160 fs/io_uring.c:6175
io_queue_sqe+0x692/0xfa0 fs/io_uring.c:6254
io_submit_sqe fs/io_uring.c:6324 [inline]
io_submit_sqes+0x1761/0x2400 fs/io_uring.c:6521
__do_sys_io_uring_enter+0xeac/0x1bd0 fs/io_uring.c:8349
do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x44/0xa9
other info that might help us debug this:
Chain exists of:
sb_writers#4 --> &p->lock --> &ctx->uring_lock
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&ctx->uring_lock);
lock(&p->lock);
lock(&ctx->uring_lock);
lock(sb_writers#4);
*** DEADLOCK ***
1 lock held by syz-executor.2/19710:
#0: ffff8880a11b8428 (&ctx->uring_lock){+.+.}-{3:3}, at: __do_sys_io_uring_enter+0xe9a/0x1bd0 fs/io_uring.c:8348
stack backtrace:
CPU: 0 PID: 19710 Comm: syz-executor.2 Not tainted 5.9.0-rc6-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x198/0x1fd lib/dump_stack.c:118
check_noncircular+0x324/0x3e0 kernel/locking/lockdep.c:1827
check_prev_add kernel/locking/lockdep.c:2496 [inline]
check_prevs_add kernel/locking/lockdep.c:2601 [inline]
validate_chain kernel/locking/lockdep.c:3218 [inline]
__lock_acquire+0x2a96/0x5780 kernel/locking/lockdep.c:4441
lock_acquire+0x1f3/0xaf0 kernel/locking/lockdep.c:5029
percpu_down_read include/linux/percpu-rwsem.h:51 [inline]
__sb_start_write+0x228/0x450 fs/super.c:1672
io_write+0x6b5/0xb30 fs/io_uring.c:3296
io_issue_sqe+0x18f/0x5c50 fs/io_uring.c:5719
__io_queue_sqe+0x280/0x1160 fs/io_uring.c:6175
io_queue_sqe+0x692/0xfa0 fs/io_uring.c:6254
io_submit_sqe fs/io_uring.c:6324 [inline]
io_submit_sqes+0x1761/0x2400 fs/io_uring.c:6521
__do_sys_io_uring_enter+0xeac/0x1bd0 fs/io_uring.c:8349
do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x45e179
Code: 3d b2 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 0b b2 fb ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007f1194e74c78 EFLAGS: 00000246 ORIG_RAX: 00000000000001aa
RAX: ffffffffffffffda RBX: 00000000000082c0 RCX: 000000000045e179
RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000004
RBP: 000000000118cf98 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 000000000118cf4c
R13: 00007ffd1aa5756f R14: 00007f1194e759c0 R15: 000000000118cf4c
Fix this by just not diving into details if we fail to trylock the
io_uring mutex. We know the ctx isn't going away during this operation,
but we cannot safely iterate buffers/files/personalities if we don't
hold the io_uring mutex.
Reported-by: syzbot+2f8fa4e860edc3066aba@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
syzbot reports a crash with tty polling, which is using the double poll
handling:
general protection fault, probably for non-canonical address 0xdffffc0000000009: 0000 [#1] PREEMPT SMP KASAN
KASAN: null-ptr-deref in range [0x0000000000000048-0x000000000000004f]
CPU: 0 PID: 6874 Comm: syz-executor749 Not tainted 5.9.0-rc6-next-20200924-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:io_poll_get_single fs/io_uring.c:4778 [inline]
RIP: 0010:io_poll_double_wake+0x51/0x510 fs/io_uring.c:4845
Code: fc ff df 48 c1 ea 03 80 3c 02 00 0f 85 9e 03 00 00 48 b8 00 00 00 00 00 fc ff df 49 8b 5d 08 48 8d 7b 48 48 89 fa 48 c1 ea 03 <0f> b6 04 02 84 c0 74 06 0f 8e 63 03 00 00 0f b6 6b 48 bf 06 00 00
RSP: 0018:ffffc90001c1fb70 EFLAGS: 00010006
RAX: dffffc0000000000 RBX: 0000000000000000 RCX: 0000000000000004
RDX: 0000000000000009 RSI: ffffffff81d9b3ad RDI: 0000000000000048
RBP: dffffc0000000000 R08: ffff8880a3cac798 R09: ffffc90001c1fc60
R10: fffff52000383f73 R11: 0000000000000000 R12: 0000000000000004
R13: ffff8880a3cac798 R14: ffff8880a3cac7a0 R15: 0000000000000004
FS: 0000000001f98880(0000) GS:ffff8880ae400000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f18886916c0 CR3: 0000000094c5a000 CR4: 00000000001506f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
__wake_up_common+0x147/0x650 kernel/sched/wait.c:93
__wake_up_common_lock+0xd0/0x130 kernel/sched/wait.c:123
tty_ldisc_hangup+0x1cf/0x680 drivers/tty/tty_ldisc.c:735
__tty_hangup.part.0+0x403/0x870 drivers/tty/tty_io.c:625
__tty_hangup drivers/tty/tty_io.c:575 [inline]
tty_vhangup+0x1d/0x30 drivers/tty/tty_io.c:698
pty_close+0x3f5/0x550 drivers/tty/pty.c:79
tty_release+0x455/0xf60 drivers/tty/tty_io.c:1679
__fput+0x285/0x920 fs/file_table.c:281
task_work_run+0xdd/0x190 kernel/task_work.c:141
tracehook_notify_resume include/linux/tracehook.h:188 [inline]
exit_to_user_mode_loop kernel/entry/common.c:165 [inline]
exit_to_user_mode_prepare+0x1e2/0x1f0 kernel/entry/common.c:192
syscall_exit_to_user_mode+0x7a/0x2c0 kernel/entry/common.c:267
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x401210
which is due to a failure in removing the double poll wait entry if we
hit a wakeup match. This can cause multiple invocations of the wakeup,
which isn't safe.
Cc: stable@vger.kernel.org # v5.8
Reported-by: syzbot+81b3883093f772addf6d@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Pull io_uring fixes from Jens Axboe:
"Two fixes for regressions in this cycle, and one that goes to 5.8
stable:
- fix leak of getname() retrieved filename
- remove plug->nowait assignment, fixing a regression with btrfs
- fix for async buffered retry"
* tag 'io_uring-5.9-2020-09-25' of git://git.kernel.dk/linux-block:
io_uring: ensure async buffered read-retry is setup properly
io_uring: don't unconditionally set plug->nowait = true
io_uring: ensure open/openat2 name is cleaned on cancelation
|
|
A previous commit for fixing up short reads botched the async retry
path, so we ended up going to worker threads more often than we should.
Fix this up, so retries work the way they originally were intended to.
Fixes: 227c0c9673d8 ("io_uring: internally retry short reads")
Reported-by: Hao_Xu <haoxu@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
This causes all the bios to be submitted with REQ_NOWAIT, which can be
problematic on either btrfs or on file systems that otherwise use a mix
of block devices where only some of them support it.
For now, just remove the setting of plug->nowait = true.
Reported-by: Dan Melnic <dmm@fb.com>
Reported-by: Brian Foster <bfoster@redhat.com>
Fixes: b63534c41e20 ("io_uring: re-issue block requests that failed because of resources")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
If we cancel these requests, we'll leak the memory associated with the
filename. Add them to the table of ops that need cleaning, if
REQ_F_NEED_CLEANUP is set.
Cc: stable@vger.kernel.org
Fixes: e62753e4e292 ("io_uring: call statx directly")
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"syzkaller started to hit us with reports, here's a fix for one type
(stack overflow when printing checksums on read error).
The other patch is a fix for sysfs object, we have a test for that and
it leads to a crash."
* tag 'for-5.9-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix put of uninitialized kobject after seed device delete
btrfs: fix overflow when copying corrupt csums for a message
|
|
Pull vfs fixes from Al Viro:
"No common topic, just assorted fixes"
* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fuse: fix the ->direct_IO() treatment of iov_iter
fs: fix cast in fsparam_u32hex() macro
vboxsf: Fix the check for the old binary mount-arguments struct
|
|
Pull io_uring fixes from Jens Axboe:
"A few fixes - most of them regression fixes from this cycle, but also
a few stable heading fixes, and a build fix for the included demo tool
since some systems now actually have gettid() available"
* tag 'io_uring-5.9-2020-09-22' of git://git.kernel.dk/linux-block:
io_uring: fix openat/openat2 unified prep handling
io_uring: mark statx/files_update/epoll_ctl as non-SQPOLL
tools/io_uring: fix compile breakage
io_uring: don't use retry based buffered reads for non-async bdev
io_uring: don't re-setup vecs/iter in io_resumit_prep() is already there
io_uring: don't run task work on an exiting task
io_uring: drop 'ctx' ref on task work cancelation
io_uring: grab any needed state during defer prep
|
|
The following test case leads to NULL kobject free error:
mount seed /mnt
add sprout to /mnt
umount /mnt
mount sprout to /mnt
delete seed
kobject: '(null)' (00000000dd2b87e4): is not initialized, yet kobject_put() is being called.
WARNING: CPU: 1 PID: 15784 at lib/kobject.c:736 kobject_put+0x80/0x350
RIP: 0010:kobject_put+0x80/0x350
::
Call Trace:
btrfs_sysfs_remove_devices_dir+0x6e/0x160 [btrfs]
btrfs_rm_device.cold+0xa8/0x298 [btrfs]
btrfs_ioctl+0x206c/0x22a0 [btrfs]
ksys_ioctl+0xe2/0x140
__x64_sys_ioctl+0x1e/0x29
do_syscall_64+0x96/0x150
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f4047c6288b
::
This is because, at the end of the seed device-delete, we try to remove
the seed's devid sysfs entry. But for the seed devices under the sprout
fs, we don't initialize the devid kobject yet. So add a kobject state
check, which takes care of the bug.
Fixes: 668e48af7a94 ("btrfs: sysfs, add devid/dev_state kobject and device attributes")
CC: stable@vger.kernel.org # 5.6+
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
A previous commit unified how we handle prep for these two functions,
but this means that we check the allowed context (SQPOLL, specifically)
later than we should. Move the ring type checking into the two parent
functions, instead of doing it after we've done some setup work.
Fixes: ec65fea5a8d7 ("io_uring: deduplicate io_openat{,2}_prep()")
Reported-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
These will naturally fail when attempted through SQPOLL, but either
with -EFAULT or -EBADF. Make it explicit that these are not workable
through SQPOLL and return -EINVAL, just like other ops that need to
use ->files.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Some block devices, like dm, bubble back -EAGAIN through the completion
handler. We check for this in io_read(), but don't honor it for when
we have copied the iov. Return -EAGAIN for this case before retrying,
to force punt to io-wq.
Fixes: bcf5a06304d6 ("io_uring: support true async buffered reads, if file provides it")
Reported-by: Zorro Lang <zlang@redhat.com>
Tested-by: Zorro Lang <zlang@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|