Age | Commit message (Collapse) | Author |
|
There are no bugs here that I've spotted, it's just easier to use the
normal API and there are no performance advantages to using the more
verbose advanced API.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The xas_store() wasn't paired with an xas_nomem() loop, so if it couldn't
allocate memory using GFP_NOWAIT, it would leak the reference to the file
descriptor. Also the node pointed to by the xas could be freed between
the call to xas_load() under the rcu_read_lock() and the acquisition of
the xa_lock.
It's easier to just use the normal xa_load/xa_store interface here.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
[axboe: fix missing assign after alloc, cur_uring -> tctx rename]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
We have to drop the lock during each iteration, so there's no advantage
to using the advanced API. Convert this to a standard xa_for_each() loop.
Reported-by: syzbot+27c12725d8ff0bfe1a13@syzkaller.appspotmail.com
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Colin reports that there's unreachable code, since we only ever break
if ret == 0. This is correct, and is due to a reversed logic condition
in when to break or not.
Break out of the loop if we don't process any task work, in that case
we do want to return -EINTR.
Fixes: af9c1a44f8de ("io_uring: process task work in io_uring_register()")
Reported-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The kernel test robot reports this lockdep issue:
[child1:659] mbind (274) returned ENOSYS, marking as inactive.
[child1:659] mq_timedsend (279) returned ENOSYS, marking as inactive.
[main] 10175 iterations. [F:7781 S:2344 HI:2397]
[ 24.610601]
[ 24.610743] ================================
[ 24.611083] WARNING: inconsistent lock state
[ 24.611437] 5.9.0-rc7-00017-g0f2122045b9462 #5 Not tainted
[ 24.611861] --------------------------------
[ 24.612193] inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
[ 24.612660] ksoftirqd/0/7 [HC0[0]:SC1[3]:HE0:SE0] takes:
[ 24.613086] f00ed998 (&xa->xa_lock#4){+.?.}-{2:2}, at: xa_destroy+0x43/0xc1
[ 24.613642] {SOFTIRQ-ON-W} state was registered at:
[ 24.614024] lock_acquire+0x20c/0x29b
[ 24.614341] _raw_spin_lock+0x21/0x30
[ 24.614636] io_uring_add_task_file+0xe8/0x13a
[ 24.614987] io_uring_create+0x535/0x6bd
[ 24.615297] io_uring_setup+0x11d/0x136
[ 24.615606] __ia32_sys_io_uring_setup+0xd/0xf
[ 24.615977] do_int80_syscall_32+0x53/0x6c
[ 24.616306] restore_all_switch_stack+0x0/0xb1
[ 24.616677] irq event stamp: 939881
[ 24.616968] hardirqs last enabled at (939880): [<8105592d>] __local_bh_enable_ip+0x13c/0x145
[ 24.617642] hardirqs last disabled at (939881): [<81b6ace3>] _raw_spin_lock_irqsave+0x1b/0x4e
[ 24.618321] softirqs last enabled at (939738): [<81b6c7c8>] __do_softirq+0x3f0/0x45a
[ 24.618924] softirqs last disabled at (939743): [<81055741>] run_ksoftirqd+0x35/0x61
[ 24.619521]
[ 24.619521] other info that might help us debug this:
[ 24.620028] Possible unsafe locking scenario:
[ 24.620028]
[ 24.620492] CPU0
[ 24.620685] ----
[ 24.620894] lock(&xa->xa_lock#4);
[ 24.621168] <Interrupt>
[ 24.621381] lock(&xa->xa_lock#4);
[ 24.621695]
[ 24.621695] *** DEADLOCK ***
[ 24.621695]
[ 24.622154] 1 lock held by ksoftirqd/0/7:
[ 24.622468] #0: 823bfb94 (rcu_callback){....}-{0:0}, at: rcu_process_callbacks+0xc0/0x155
[ 24.623106]
[ 24.623106] stack backtrace:
[ 24.623454] CPU: 0 PID: 7 Comm: ksoftirqd/0 Not tainted 5.9.0-rc7-00017-g0f2122045b9462 #5
[ 24.624090] Call Trace:
[ 24.624284] ? show_stack+0x40/0x46
[ 24.624551] dump_stack+0x1b/0x1d
[ 24.624809] print_usage_bug+0x17a/0x185
[ 24.625142] mark_lock+0x11d/0x1db
[ 24.625474] ? print_shortest_lock_dependencies+0x121/0x121
[ 24.625905] __lock_acquire+0x41e/0x7bf
[ 24.626206] lock_acquire+0x20c/0x29b
[ 24.626517] ? xa_destroy+0x43/0xc1
[ 24.626810] ? lock_acquire+0x20c/0x29b
[ 24.627110] _raw_spin_lock_irqsave+0x3e/0x4e
[ 24.627450] ? xa_destroy+0x43/0xc1
[ 24.627725] xa_destroy+0x43/0xc1
[ 24.627989] __io_uring_free+0x57/0x71
[ 24.628286] ? get_pid+0x22/0x22
[ 24.628544] __put_task_struct+0xf2/0x163
[ 24.628865] put_task_struct+0x1f/0x2a
[ 24.629161] delayed_put_task_struct+0xe2/0xe9
[ 24.629509] rcu_process_callbacks+0x128/0x155
[ 24.629860] __do_softirq+0x1a3/0x45a
[ 24.630151] run_ksoftirqd+0x35/0x61
[ 24.630443] smpboot_thread_fn+0x304/0x31a
[ 24.630763] kthread+0x124/0x139
[ 24.631016] ? sort_range+0x18/0x18
[ 24.631290] ? kthread_create_worker_on_cpu+0x17/0x17
[ 24.631682] ret_from_fork+0x1c/0x28
which is complaining about xa_destroy() grabbing the xa lock in an
IRQ disabling fashion, whereas the io_uring uses cases aren't interrupt
safe. This is really an xarray issue, since it should not assume the
lock type. But for our use case, since we know the xarray is empty at
this point, there's no need to actually call xa_destroy(). So just get
rid of it.
Fixes: 0f2122045b94 ("io_uring: don't rely on weak ->files references")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Identical to how we handle the ctx reference counts, increase by the
batch we're expecting to submit, and handle any slow path residual,
if any. The request alloc-and-issue path is very hot, and this makes
a noticeable difference by avoiding an two atomic incs for each
individual request.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Rejecting non-native endian BTF overlapped with the addition
of support for it.
The rest were more simple overlapping changes, except the
renesas ravb binding update, which had to follow a file
move as well as a YAML conversion.
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Use in compat_syscall to import either native or the compat iovecs, and
remove the now superflous compat_import_iovec.
This removes the need for special compat logic in most callers, and
the remaining ones can still be simplified by using __import_iovec
with a bool compat parameter.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
Pull io_uring fixes from Jens Axboe:
- fix for async buffered reads if read-ahead is fully disabled (Hao)
- double poll match fix
- ->show_fdinfo() potential ABBA deadlock complaint fix
* tag 'io_uring-5.9-2020-10-02' of git://git.kernel.dk/linux-block:
io_uring: fix async buffered reads when readahead is disabled
io_uring: fix potential ABBA deadlock in ->show_fdinfo()
io_uring: always delete double poll wait entry on match
|
|
We always use &req->task_work anyway, no point in passing it in.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
All request preparations are done only during submission, reflect it in
the code by moving io_req_prep() much earlier into io_queue_sqe().
That's much cleaner, because it doen't expose bits to async code which
it won't ever use. Also it makes the interface harder to misuse, and
there are potential places for bugs.
For instance, __io_queue() doesn't clear @sqe before proceeding to a
next linked request, that could have been disastrous, but hopefully
there are linked requests IFF sqe==NULL, so not actually a bug.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
io_issue_sqe() does two things at once, trying to prepare request and
issuing them. Split it in two and deduplicate with io_defer_prep().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
All io_*_prep() functions including io_{read,write}_prep() are called
only during submission where @force_nonblock is always true. Don't keep
propagating it and instead remove the @force_nonblock argument
from prep() altogether.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Move setting IOCB_NOWAIT from io_prep_rw() into io_read()/io_write(), so
it's set/cleared in a single place. Also remove @force_nonblock
parameter from io_prep_rw().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
REQ_F_NEED_CLEANUP is set only by io_*_prep() and they're guaranteed to
be called only once, so there is no one who may have set the flag
before. Kill REQ_F_NEED_CLEANUP check in these *prep() handlers.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Put brackets around bitwise ops in a complex expression
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Extract common code from if/else branches. That is cleaner and optimised
even better.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
In most cases we'll specify IORING_SETUP_SQPOLL and run multiple
io_uring instances in a host. Since all sqthreads are named
"io_uring-sq", it's hard to distinguish the relations between
application process and its io_uring sqthread.
With this patch, application can get its corresponding sqthread pid
and cpu through show_fdinfo.
Steps:
1. Get io_uring fd first.
$ ls -l /proc/<pid>/fd | grep -w io_uring
2. Then get io_uring instance related info, including corresponding
sqthread pid and cpu.
$ cat /proc/<pid>/fdinfo/<io_uring_fd>
pos: 0
flags: 02000002
mnt_id: 13
SqThread: 6929
SqThreadCpu: 2
UserFiles: 1
0: testfile
UserBufs: 0
PollList:
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
[axboe: fixed for new shared SQPOLL]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
We do this for CQ ring wait, in case task_work completions come in. We
should do the same in io_uring_register(), to avoid spurious -EINTR
if the ring quiescing ends up having to process task_work to complete
the operation
Reported-by: Dan Melnic <dmm@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
There are a few operations that are offloaded to the worker threads. In
this case, we lose process context and end up in kthread context. This
results in ios to be not accounted to the issuing cgroup and
consequently end up as issued by root. Just like others, adopt the
personality of the blkcg too when issuing via the workqueues.
For the SQPOLL thread, it will live and attach in the inited cgroup's
context.
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
io_uring does account any registered buffer as pinned/locked memory, and
checks limit and fails if the given user doesn't have a big enough limit
to register the ranges specified. However, if huge pages are used, we
are potentially under-accounting the memory in terms of what gets pinned
on the vm side.
This patch rectifies that, by ensuring that we account the full size of
a compound page, regardless of how much of it is being registered. Huge
pages are not accounted mulitple times - if multiple sections of a huge
page is registered, then the page is only accounted once.
Reported-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Fixes coccicheck warning:
fs/io_uring.c:4242:13-14: Unneeded semicolon
Signed-off-by: Zheng Bin <zhengbin13@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
In the spirit of fairness, cap the max number of SQ entries we'll submit
for SQPOLL if we have multiple rings. If we don't do that, we could be
submitting tons of entries for one ring, while others are waiting to get
service.
The value of 8 is somewhat arbitrarily chosen as something that allows
a fair bit of batching, without using an excessive time per ring.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
There's really no point in having this union, it just means that we're
always allocating enough room to cater to any command. But that's
pointless, as the ->io field is request type private anyway.
This gets rid of the io_async_ctx structure, and fills in the required
size in the io_op_defs[] instead.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Testing ctx->user_bufs for NULL in io_import_fixed() is not neccessary,
because in that case ctx->nr_user_bufs would be zero, and the following
check would fail.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
When io_req_map_rw() is called from io_rw_prep_async(), it memcpy()
iorw->iter into itself. Even though it doesn't lead to an error, such a
memcpy()'s aliasing rules violation is considered to be a bad practise.
Inline io_req_map_rw() into io_rw_prep_async(). We don't really need any
remapping there, so it's much simpler than the generic implementation.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Set rw->free_iovec to @iovec, that gives an identical result and stresses
that @iovec param rw->free_iovec play the same role.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Don't touch iter->iov and iov in between __io_import_iovec() and
io_req_map_rw(), the former function aleady sets it correctly, because it
creates one more case with NULL'ed iov to consider in io_req_map_rw().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
When using SQPOLL, applications can run into the issue of running out of
SQ ring entries because the thread hasn't consumed them yet. The only
option for dealing with that is checking later, or busy checking for the
condition.
Provide IORING_ENTER_SQ_WAIT if applications want to wait on this
condition.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
These structures are never written, move them appropriately.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
We support using IORING_SETUP_ATTACH_WQ to share async backends between
rings created by the same process, this now also allows the same to
happen with SQPOLL. The setup procedure remains the same, the caller
sets io_uring_params->wq_fd to the 'parent' context, and then the newly
created ring will attach to that async backend.
This means that multiple rings can share the same SQPOLL thread, saving
resources.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Remove the SQPOLL thread from the ctx, and use the io_sq_data as the
data structure we pass in. io_sq_data has a list of ctx's that we can
then iterate over and handle.
As of now we're ready to handle multiple ctx's, though we're still just
handling a single one after this patch.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Move all the necessary state out of io_ring_ctx, and into a new
structure, io_sq_data. The latter now deals with any state or
variables associated with the SQPOLL thread itself.
In preparation for supporting more than one io_ring_ctx per SQPOLL
thread.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
This is done in preparation for handling more than one ctx, but it also
cleans up the code a bit since io_sq_thread() was a bit too unwieldy to
get a get overview on.
__io_sq_thread() is now the main handler, and it returns an enum sq_ret
that tells io_sq_thread() what it ended up doing. The parent then makes
a decision on idle, spinning, or work handling based on that.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
We need to decouple the clearing on wakeup from the the inline schedule,
as that is going to be required for handling multiple rings in one
thread.
Wrap our wakeup handler so we can clear it when we get the wakeup, by
definition that is when we no longer need the flag set.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
This is in preparation to sharing the poller thread between rings. For
that we need per-ring wait_queue_entry storage, and we can't easily put
that on the stack if one thread is managing multiple rings.
We'll also be sharing the wait_queue_head across rings for the purposes
of wakeups, provide the usual private ring wait_queue_head for now but
make it a pointer so we can easily override it when sharing.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
We're not handling signals by default in kernel threads, and we never
use TWA_SIGNAL for the SQPOLL thread internally. Hence we can never
have a signal pending, and we don't need to check for it (nor flush it).
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
This patch adds a new IORING_SETUP_R_DISABLED flag to start the
rings disabled, allowing the user to register restrictions,
buffers, files, before to start processing SQEs.
When IORING_SETUP_R_DISABLED is set, SQE are not processed and
SQPOLL kthread is not started.
The restrictions registration are allowed only when the rings
are disable to prevent concurrency issue while processing SQEs.
The rings can be enabled using IORING_REGISTER_ENABLE_RINGS
opcode with io_uring_register(2).
Suggested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The new io_uring_register(2) IOURING_REGISTER_RESTRICTIONS opcode
permanently installs a feature allowlist on an io_ring_ctx.
The io_ring_ctx can then be passed to untrusted code with the
knowledge that only operations present in the allowlist can be
executed.
The allowlist approach ensures that new features added to io_uring
do not accidentally become available when an existing application
is launched on a newer kernel version.
Currently is it possible to restrict sqe opcodes, sqe flags, and
register opcodes.
IOURING_REGISTER_RESTRICTIONS can only be made once. Afterwards
it is not possible to change restrictions anymore.
This prevents untrusted code from removing restrictions.
Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
If we don't get and assign the namespace for the async work, then certain
paths just don't work properly (like /dev/stdin, /proc/mounts, etc).
Anything that references the current namespace of the given task should
be assigned for async work on behalf of that task.
Cc: stable@vger.kernel.org # v5.5+
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Grab actual references to the files_struct. To avoid circular references
issues due to this, we add a per-task note that keeps track of what
io_uring contexts a task has used. When the tasks execs or exits its
assigned files, we cancel requests based on this tracking.
With that, we can grab proper references to the files table, and no
longer need to rely on stashing away ring_fd and ring_file to check
if the ring_fd may have been closed.
Cc: stable@vger.kernel.org # v5.5+
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
This allows us to selectively flush out pending overflows, depending on
the task and/or files_struct being passed in.
No intended functional changes in this patch.
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Return whether we found and canceled requests or not. This is in
preparation for using this information, no functional changes in this
patch.
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Sometimes we assign a weak reference to it, sometimes we grab a
reference to it. Clean this up and make it unconditional, and drop the
flag related to tracking this state.
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
We can grab a reference to the task instead of stashing away the task
files_struct. This is doable without creating a circular reference
between the ring fd and the task itself.
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
No functional changes in this patch, prep patch for grabbing references
to the files_struct.
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
We currently cancel these when the ring exits, and we cancel all of
them. This is in preparation for killing only the ones associated
with a given task.
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
* io_uring-5.9:
io_uring: fix async buffered reads when readahead is disabled
io_uring: fix potential ABBA deadlock in ->show_fdinfo()
io_uring: always delete double poll wait entry on match
|
|
The async buffered reads feature is not working when readahead is
turned off. There are two things to concern:
- when doing retry in io_read, not only the IOCB_WAITQ flag but also
the IOCB_NOWAIT flag is still set, which makes it goes to would_block
phase in generic_file_buffered_read() and then return -EAGAIN. After
that, the io-wq thread work is queued, and later doing the async
reads in the old way.
- even if we remove IOCB_NOWAIT when doing retry, the feature is still
not running properly, since in generic_file_buffered_read() it goes to
lock_page_killable() after calling mapping->a_ops->readpage() to do
IO, and thus causing process to sleep.
Fixes: 1a0a7853b901 ("mm: support async buffered reads in generic_file_buffered_read()")
Fixes: 3b2a4439e0ae ("io_uring: get rid of kiocb_wait_page_queue_init()")
Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
syzbot reports a potential lock deadlock between the normal IO path and
->show_fdinfo():
======================================================
WARNING: possible circular locking dependency detected
5.9.0-rc6-syzkaller #0 Not tainted
------------------------------------------------------
syz-executor.2/19710 is trying to acquire lock:
ffff888098ddc450 (sb_writers#4){.+.+}-{0:0}, at: io_write+0x6b5/0xb30 fs/io_uring.c:3296
but task is already holding lock:
ffff8880a11b8428 (&ctx->uring_lock){+.+.}-{3:3}, at: __do_sys_io_uring_enter+0xe9a/0x1bd0 fs/io_uring.c:8348
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #2 (&ctx->uring_lock){+.+.}-{3:3}:
__mutex_lock_common kernel/locking/mutex.c:956 [inline]
__mutex_lock+0x134/0x10e0 kernel/locking/mutex.c:1103
__io_uring_show_fdinfo fs/io_uring.c:8417 [inline]
io_uring_show_fdinfo+0x194/0xc70 fs/io_uring.c:8460
seq_show+0x4a8/0x700 fs/proc/fd.c:65
seq_read+0x432/0x1070 fs/seq_file.c:208
do_loop_readv_writev fs/read_write.c:734 [inline]
do_loop_readv_writev fs/read_write.c:721 [inline]
do_iter_read+0x48e/0x6e0 fs/read_write.c:955
vfs_readv+0xe5/0x150 fs/read_write.c:1073
kernel_readv fs/splice.c:355 [inline]
default_file_splice_read.constprop.0+0x4e6/0x9e0 fs/splice.c:412
do_splice_to+0x137/0x170 fs/splice.c:871
splice_direct_to_actor+0x307/0x980 fs/splice.c:950
do_splice_direct+0x1b3/0x280 fs/splice.c:1059
do_sendfile+0x55f/0xd40 fs/read_write.c:1540
__do_sys_sendfile64 fs/read_write.c:1601 [inline]
__se_sys_sendfile64 fs/read_write.c:1587 [inline]
__x64_sys_sendfile64+0x1cc/0x210 fs/read_write.c:1587
do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #1 (&p->lock){+.+.}-{3:3}:
__mutex_lock_common kernel/locking/mutex.c:956 [inline]
__mutex_lock+0x134/0x10e0 kernel/locking/mutex.c:1103
seq_read+0x61/0x1070 fs/seq_file.c:155
pde_read fs/proc/inode.c:306 [inline]
proc_reg_read+0x221/0x300 fs/proc/inode.c:318
do_loop_readv_writev fs/read_write.c:734 [inline]
do_loop_readv_writev fs/read_write.c:721 [inline]
do_iter_read+0x48e/0x6e0 fs/read_write.c:955
vfs_readv+0xe5/0x150 fs/read_write.c:1073
kernel_readv fs/splice.c:355 [inline]
default_file_splice_read.constprop.0+0x4e6/0x9e0 fs/splice.c:412
do_splice_to+0x137/0x170 fs/splice.c:871
splice_direct_to_actor+0x307/0x980 fs/splice.c:950
do_splice_direct+0x1b3/0x280 fs/splice.c:1059
do_sendfile+0x55f/0xd40 fs/read_write.c:1540
__do_sys_sendfile64 fs/read_write.c:1601 [inline]
__se_sys_sendfile64 fs/read_write.c:1587 [inline]
__x64_sys_sendfile64+0x1cc/0x210 fs/read_write.c:1587
do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #0 (sb_writers#4){.+.+}-{0:0}:
check_prev_add kernel/locking/lockdep.c:2496 [inline]
check_prevs_add kernel/locking/lockdep.c:2601 [inline]
validate_chain kernel/locking/lockdep.c:3218 [inline]
__lock_acquire+0x2a96/0x5780 kernel/locking/lockdep.c:4441
lock_acquire+0x1f3/0xaf0 kernel/locking/lockdep.c:5029
percpu_down_read include/linux/percpu-rwsem.h:51 [inline]
__sb_start_write+0x228/0x450 fs/super.c:1672
io_write+0x6b5/0xb30 fs/io_uring.c:3296
io_issue_sqe+0x18f/0x5c50 fs/io_uring.c:5719
__io_queue_sqe+0x280/0x1160 fs/io_uring.c:6175
io_queue_sqe+0x692/0xfa0 fs/io_uring.c:6254
io_submit_sqe fs/io_uring.c:6324 [inline]
io_submit_sqes+0x1761/0x2400 fs/io_uring.c:6521
__do_sys_io_uring_enter+0xeac/0x1bd0 fs/io_uring.c:8349
do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x44/0xa9
other info that might help us debug this:
Chain exists of:
sb_writers#4 --> &p->lock --> &ctx->uring_lock
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&ctx->uring_lock);
lock(&p->lock);
lock(&ctx->uring_lock);
lock(sb_writers#4);
*** DEADLOCK ***
1 lock held by syz-executor.2/19710:
#0: ffff8880a11b8428 (&ctx->uring_lock){+.+.}-{3:3}, at: __do_sys_io_uring_enter+0xe9a/0x1bd0 fs/io_uring.c:8348
stack backtrace:
CPU: 0 PID: 19710 Comm: syz-executor.2 Not tainted 5.9.0-rc6-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x198/0x1fd lib/dump_stack.c:118
check_noncircular+0x324/0x3e0 kernel/locking/lockdep.c:1827
check_prev_add kernel/locking/lockdep.c:2496 [inline]
check_prevs_add kernel/locking/lockdep.c:2601 [inline]
validate_chain kernel/locking/lockdep.c:3218 [inline]
__lock_acquire+0x2a96/0x5780 kernel/locking/lockdep.c:4441
lock_acquire+0x1f3/0xaf0 kernel/locking/lockdep.c:5029
percpu_down_read include/linux/percpu-rwsem.h:51 [inline]
__sb_start_write+0x228/0x450 fs/super.c:1672
io_write+0x6b5/0xb30 fs/io_uring.c:3296
io_issue_sqe+0x18f/0x5c50 fs/io_uring.c:5719
__io_queue_sqe+0x280/0x1160 fs/io_uring.c:6175
io_queue_sqe+0x692/0xfa0 fs/io_uring.c:6254
io_submit_sqe fs/io_uring.c:6324 [inline]
io_submit_sqes+0x1761/0x2400 fs/io_uring.c:6521
__do_sys_io_uring_enter+0xeac/0x1bd0 fs/io_uring.c:8349
do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x45e179
Code: 3d b2 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 0b b2 fb ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007f1194e74c78 EFLAGS: 00000246 ORIG_RAX: 00000000000001aa
RAX: ffffffffffffffda RBX: 00000000000082c0 RCX: 000000000045e179
RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000004
RBP: 000000000118cf98 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 000000000118cf4c
R13: 00007ffd1aa5756f R14: 00007f1194e759c0 R15: 000000000118cf4c
Fix this by just not diving into details if we fail to trylock the
io_uring mutex. We know the ctx isn't going away during this operation,
but we cannot safely iterate buffers/files/personalities if we don't
hold the io_uring mutex.
Reported-by: syzbot+2f8fa4e860edc3066aba@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|