summaryrefslogtreecommitdiff
path: root/fs/io-wq.c
AgeCommit message (Collapse)Author
2020-02-12io-wq: don't call kXalloc_node() with non-online nodeJens Axboe
Glauber reports a crash on init on a box he has: RIP: 0010:__alloc_pages_nodemask+0x132/0x340 Code: 18 01 75 04 41 80 ce 80 89 e8 48 8b 54 24 08 8b 74 24 1c c1 e8 0c 48 8b 3c 24 83 e0 01 88 44 24 20 48 85 d2 0f 85 74 01 00 00 <3b> 77 08 0f 82 6b 01 00 00 48 89 7c 24 10 89 ea 48 8b 07 b9 00 02 RSP: 0018:ffffb8be4d0b7c28 EFLAGS: 00010246 RAX: 0000000000000000 RBX: 0000000000000000 RCX: 000000000000e8e8 RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000002080 RBP: 0000000000012cc0 R08: 0000000000000000 R09: 0000000000000002 R10: 0000000000000dc0 R11: ffff995c60400100 R12: 0000000000000000 R13: 0000000000012cc0 R14: 0000000000000001 R15: ffff995c60db00f0 FS: 00007f4d115ca900(0000) GS:ffff995c60d80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000002088 CR3: 00000017cca66002 CR4: 00000000007606e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: alloc_slab_page+0x46/0x320 new_slab+0x9d/0x4e0 ___slab_alloc+0x507/0x6a0 ? io_wq_create+0xb4/0x2a0 __slab_alloc+0x1c/0x30 kmem_cache_alloc_node_trace+0xa6/0x260 io_wq_create+0xb4/0x2a0 io_uring_setup+0x97f/0xaa0 ? io_remove_personalities+0x30/0x30 ? io_poll_trigger_evfd+0x30/0x30 do_syscall_64+0x5b/0x1c0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f4d116cb1ed which is due to the 'wqe' and 'worker' allocation being node affine. But it isn't valid to call the node affine allocation if the node isn't online. Setup structures for even offline nodes, as usual, but skip them in terms of thread setup to not waste resources. If the node isn't online, just alloc memory with NUMA_NO_NODE. Reported-by: Glauber Costa <glauber@scylladb.com> Tested-by: Glauber Costa <glauber@scylladb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-09io-wq: add io_wq_cancel_pid() to cancel based on a specific pidJens Axboe
Add a helper that allows the caller to cancel work based on what mm it belongs to. This allows io_uring to cancel work from a given task or thread when it exits. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-09io-wq: make io_wqe_cancel_work() take a match handlerJens Axboe
We want to use the cancel functionality for canceling based on not just the work itself. Instead of matching on the work address manually, allow a match handler to tell us if we found the right work item or not. No functional changes in this patch. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-08io-wq: add support for inheriting ->fsJens Axboe
Some work items need this for relative path lookup, make it available like the other inherited credentials/mm/etc. Cc: stable@vger.kernel.org # 5.3+ Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-29io_uring: fix linked command file table usageJens Axboe
We're not consistent in how the file table is grabbed and assigned if we have a command linked that requires the use of it. Add ->file_table to the io_op_defs[] array, and use that to determine when to grab the table instead of having the handlers set it if they need to defer. This also means we can kill the IO_WQ_WORK_NEEDS_FILES flag. We always initialize work->files, so io-wq can just check for that. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-28io-wq: allow grabbing existing io-wqPavel Begunkov
Export a helper to attach to an existing io-wq, rather than setting up a new one. This is doable now that we have reference counted io_wq's. Reported-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-28io_uring/io-wq: don't use static creds/mm assignmentsJens Axboe
We currently setup the io_wq with a static set of mm and creds. Even for a single-use io-wq per io_uring, this is suboptimal as we have may have multiple enters of the ring. For sharing the io-wq backend, it doesn't work at all. Switch to passing in the creds and mm when the work item is setup. This means that async work is no longer deferred to the io_uring mm and creds, it is done with the current mm and creds. Flag this behavior with IORING_FEAT_CUR_PERSONALITY, so applications know they can rely on the current personality (mm and creds) being the same for direct issue and async issue. Reviewed-by: Stefan Metzmacher <metze@samba.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-27io-wq: make the io_wq ref countedJens Axboe
In preparation for sharing an io-wq across different users, add a reference count that manages destruction of it. Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-20io-wq: support concurrent non-blocking workJens Axboe
io-wq assumes that work will complete fast (and not block), so it doesn't create a new worker when work is enqueued, if we already have at least one worker running. This is done on the assumption that if work is running, then it will complete fast. Add an option to force io-wq to fork a new worker for work queued. This is signaled by setting IO_WQ_WORK_CONCURRENT on the work item. For that case, io-wq will create a new worker, even though workers are already running. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-20io-wq: add support for uncancellable workJens Axboe
Not all work can be cancelled, some of it we may need to guarantee that it runs to completion. Allow the caller to set IO_WQ_WORK_NO_CANCEL on work that must not be cancelled. Note that the caller work function must also check for IO_WQ_WORK_NO_CANCEL on work that is marked IO_WQ_WORK_CANCEL. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-14io-wq: cancel work if we fail getting a mm referenceJens Axboe
If we require mm and user context, mark the request for cancellation if we fail to acquire the desired mm. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-24io-wq: add cond_resched() to worker threadHillf Danton
Reschedule the current IO worker to cut the risk that it is becoming a cpu hog. Signed-off-by: Hillf Danton <hdanton@sina.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-23io-wq: remove unused busy list from io_sqeHillf Danton
Commit e61df66c69b1 ("io-wq: ensure free/busy list browsing see all items") added a list for io workers in addition to the free and busy lists, not only making worker walk cleaner, but leaving the busy list unused. Let's remove it. Signed-off-by: Hillf Danton <hdanton@sina.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-15io_uring: fix stale comment and a few typosBrian Gianforcaro
- Fix a few typos found while reading the code. - Fix stale io_get_sqring comment referencing s->sqe, the 's' parameter was renamed to 'req', but the comment still holds. Signed-off-by: Brian Gianforcaro <b.gianfo@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-10io-wq: briefly spin for new work after finishing workJens Axboe
To avoid going to sleep only to get woken shortly thereafter, spin briefly for new work upon completion of work. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-10io-wq: remove worker->wait waitqueueJens Axboe
We only have one cases of using the waitqueue to wake the worker, the rest are using wake_up_process(). Since we can save some cycles not fiddling with the waitqueue io_wqe_worker(), switch the work activation to task wakeup and get rid of the now unused wait_queue_head_t in struct io_worker. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-02io_uring: use current task creds instead of allocating a new oneJens Axboe
syzbot reports: kasan: CONFIG_KASAN_INLINE enabled kasan: GPF could be caused by NULL-ptr deref or user memory access general protection fault: 0000 [#1] PREEMPT SMP KASAN CPU: 0 PID: 9217 Comm: io_uring-sq Not tainted 5.4.0-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:creds_are_invalid kernel/cred.c:792 [inline] RIP: 0010:__validate_creds include/linux/cred.h:187 [inline] RIP: 0010:override_creds+0x9f/0x170 kernel/cred.c:550 Code: ac 25 00 81 fb 64 65 73 43 0f 85 a3 37 00 00 e8 17 ab 25 00 49 8d 7c 24 10 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <0f> b6 04 02 84 c0 74 08 3c 03 0f 8e 96 00 00 00 41 8b 5c 24 10 bf RSP: 0018:ffff88809c45fda0 EFLAGS: 00010202 RAX: dffffc0000000000 RBX: 0000000043736564 RCX: ffffffff814f3318 RDX: 0000000000000002 RSI: ffffffff814f3329 RDI: 0000000000000010 RBP: ffff88809c45fdb8 R08: ffff8880a3aac240 R09: ffffed1014755849 R10: ffffed1014755848 R11: ffff8880a3aac247 R12: 0000000000000000 R13: ffff888098ab1600 R14: 0000000000000000 R15: 0000000000000000 FS: 0000000000000000(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007ffd51c40664 CR3: 0000000092641000 CR4: 00000000001406f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: io_sq_thread+0x1c7/0xa20 fs/io_uring.c:3274 kthread+0x361/0x430 kernel/kthread.c:255 ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:352 Modules linked in: ---[ end trace f2e1a4307fbe2245 ]--- RIP: 0010:creds_are_invalid kernel/cred.c:792 [inline] RIP: 0010:__validate_creds include/linux/cred.h:187 [inline] RIP: 0010:override_creds+0x9f/0x170 kernel/cred.c:550 Code: ac 25 00 81 fb 64 65 73 43 0f 85 a3 37 00 00 e8 17 ab 25 00 49 8d 7c 24 10 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <0f> b6 04 02 84 c0 74 08 3c 03 0f 8e 96 00 00 00 41 8b 5c 24 10 bf RSP: 0018:ffff88809c45fda0 EFLAGS: 00010202 RAX: dffffc0000000000 RBX: 0000000043736564 RCX: ffffffff814f3318 RDX: 0000000000000002 RSI: ffffffff814f3329 RDI: 0000000000000010 RBP: ffff88809c45fdb8 R08: ffff8880a3aac240 R09: ffffed1014755849 R10: ffffed1014755848 R11: ffff8880a3aac247 R12: 0000000000000000 R13: ffff888098ab1600 R14: 0000000000000000 R15: 0000000000000000 FS: 0000000000000000(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007ffd51c40664 CR3: 0000000092641000 CR4: 00000000001406f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 which is caused by slab fault injection triggering a failure in prepare_creds(). We don't actually need to create a copy of the creds as we're not modifying it, we just need a reference on the current task creds. This avoids the failure case as well, and propagates the const throughout the stack. Fixes: 181e448d8709 ("io_uring: async workers should inherit the user creds") Reported-by: syzbot+5320383e16029ba057ff@syzkaller.appspotmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-26io-wq: shrink io_wq_work a bitJens Axboe
Currently we're using 40 bytes for the io_wq_work structure, and 16 of those is the doubly link list node. We don't need doubly linked lists, we always add to tail to keep things ordered, and any other use case is list traversal with deletion. For the deletion case, we can easily support any node deletion by keeping track of the previous entry. This shrinks io_wq_work to 32 bytes, and subsequently io_kiock from io_uring to 216 to 208 bytes. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-26io-wq: fix handling of NUMA node IDsJann Horn
There are several things that can go wrong in the current code on NUMA systems, especially if not all nodes are online all the time: - If the identifiers of the online nodes do not form a single contiguous block starting at zero, wq->wqes will be too small, and OOB memory accesses will occur e.g. in the loop in io_wq_create(). - If a node comes online between the call to num_online_nodes() and the for_each_node() loop in io_wq_create(), an OOB write will occur. - If a node comes online between io_wq_create() and io_wq_enqueue(), a lookup is performed for an element that doesn't exist, and an OOB read will probably occur. Fix it by: - using nr_node_ids instead of num_online_nodes() for the allocation size; nr_node_ids is calculated by setup_nr_node_ids() to be bigger than the highest node ID that could possibly come online at some point, even if those nodes' identifiers are not a contiguous block - creating workers for all possible CPUs, not just all online ones This is basically what the normal workqueue code also does, as far as I can tell. Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-26io_uring: use kzalloc instead of kcalloc for single-element allocationsJann Horn
These allocations are single-element allocations, so don't use the array allocation wrapper for them. Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-25io_uring: async workers should inherit the user credsJens Axboe
If we don't inherit the original task creds, then we can confuse users like fuse that pass creds in the request header. See link below on identical aio issue. Link: https://lore.kernel.org/linux-fsdevel/26f0d78e-99ca-2f1b-78b9-433088053a61@scylladb.com/T/#u Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-25io-wq: have io_wq_create() take a 'data' argumentJens Axboe
We currently pass in 4 arguments outside of the bounded size. In preparation for adding one more argument, let's bundle them up in a struct to make it more readable. No functional changes in this patch. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-25io_uring: close lookup gap for dependent next workJens Axboe
When we find new work to process within the work handler, we queue the linked timeout before we have issued the new work. This can be problematic for very short timeouts, as we have a window where the new work isn't visible. Allow the work handler to store a callback function for this in the work item, and flag it with IO_WQ_WORK_CB if the caller has done so. If that is set, then io-wq will call the callback when it has setup the new work item. Reported-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-25io-wq: remove extra space charactersDan Carpenter
These lines are indented an extra space character. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-25io-wq: wait for io_wq_create() to setup necessary workersJens Axboe
We currently have a race where if setup is really slow, we can be calling io_wq_destroy() before we're done setting up. This will cause the caller to get stuck waiting for the manager to set things up, but the manager already exited. Fix this by doing a sync setup of the manager. This also fixes the case where if we failed creating workers, we'd also get stuck. In practice this race window was really small, as we already wait for the manager to start. Hence someone would have to call io_wq_destroy() after the task has started, but before it started the first loop. The reported test case forked tons of these, which is why it became an issue. Reported-by: syzbot+0f1cc17f85154f400465@syzkaller.appspotmail.com Fixes: 771b53d033e8 ("io-wq: small threadpool implementation for io_uring") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-14io-wq: remove now redundant struct io_wq_nulls_listJens Axboe
Since we don't iterate these lists anymore after commit: e61df66c69b1 ("io-wq: ensure free/busy list browsing see all items") we don't need to retain the nulls value we use for them. That means it's pretty pointless to wrap the hlist_nulls_head in a structure, so get rid of it. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-13io-wq: ensure free/busy list browsing see all itemsJens Axboe
We have two lists for workers in io-wq, a busy and a free list. For certain operations we want to browse all workers, and we currently do that by browsing the two separate lists. But since these lists are RCU protected, we can potentially miss workers if they move between the two lists while we're browsing them. Add a third list, all_list, that simply holds all workers. A worker is added to that list when it starts, and removed when it exits. This makes the worker iteration cleaner, too. Reported-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-13io-wq: ensure we have a stable view of ->cur_work for cancellationsJens Axboe
worker->cur_work is currently protected by the lock of the wqe that the worker belongs to. When we send a signal to a worker, we need a stable view of ->cur_work, so we need to hold that lock. But this doesn't work so well, since we have the opposite order potentially on queueing work. If POLL_ADD is used with a signalfd, then io_poll_wake() is called with the signal lock, and that sometimes needs to insert work items. Add a specific worker lock that protects the current work item. Then we can guarantee that the task we're sending a signal is currently processing the exact work we think it is. Reported-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-13io_wq: add get/put_work handlers to io_wq_create()Jens Axboe
For cancellation, we need to ensure that the work item stays valid for as long as ->cur_work is valid. Right now we can't safely dereference the work item even under the wqe->lock, because while the ->cur_work pointer will remain valid, the work could be completing and be freed in parallel. Only invoke ->get/put_work() on items we know that the caller queued themselves. Add IO_WQ_WORK_INTERNAL for io-wq to use, which is needed when we're queueing a flush item, for instance. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-07io-wq: add support for bounded vs unbunded workJens Axboe
io_uring supports request types that basically have two different lifetimes: 1) Bounded completion time. These are requests like disk reads or writes, which we know will finish in a finite amount of time. 2) Unbounded completion time. These are generally networked IO, where we have no idea how long they will take to complete. Another example is POLL commands. This patch provides support for io-wq to handle these differently, so we don't starve bounded requests by tying up workers for too long. By default all work is bounded, unless otherwise specified in the work item. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-09io-wq: io_wqe_run_queue() doesn't need to use list_empty_careful()Jens Axboe
We hold the wqe lock at this point (which is also annotated), so there's no need to use the careful variant of list_empty(). Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-05io-wq: use proper nesting IRQ disabling spinlocks for cancelJens Axboe
We don't know what context we'll be called in for cancel, it could very well be with IRQs disabled already. Use the IRQ saving variants of the locking primitives. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-02io-wq: use kfree_rcu() to simplify the codeYueHaibing
The callback function of call_rcu() just calls kfree(), so we can use kfree_rcu() instead of call_rcu() + callback function. Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-01io_uring: support for generic async request cancelJens Axboe
This adds support for IORING_OP_ASYNC_CANCEL, which will attempt to cancel requests that have been punted to async context and are now in-flight. This works for regular read/write requests to files, as long as they haven't been started yet. For socket based IO (or things like accept4(2)), we can cancel work that is already running as well. To cancel a request, the sqe must have ->addr set to the user_data of the request it wishes to cancel. If the request is cancelled successfully, the original request is completed with -ECANCELED and the cancel request is completed with a result of 0. If the request was already running, the original may or may not complete in error. The cancel request will complete with -EALREADY for that case. And finally, if the request to cancel wasn't found, the cancel request is completed with -ENOENT. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-29io_uring: io_uring: add support for async work inheriting filesJens Axboe
This is in preparation for adding opcodes that need to add new files in a process file table, system calls like open(2) or accept4(2). If an opcode needs this, it must set IO_WQ_WORK_NEEDS_FILES in the work item. If work that needs to get punted to async context have this set, the async worker will assume the original task file table before executing the work. Note that opcodes that need access to the current files of an application cannot be done through IORING_SETUP_SQPOLL. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-29io-wq: small threadpool implementation for io_uringJens Axboe
This adds support for io-wq, a smaller and specialized thread pool implementation. This is meant to replace workqueues for io_uring. Among the reasons for this addition are: - We can assign memory context smarter and more persistently if we manage the life time of threads. - We can drop various work-arounds we have in io_uring, like the async_list. - We can implement hashed work insertion, to manage concurrency of buffered writes without needing a) an extra workqueue, or b) needlessly making the concurrency of said workqueue very low which hurts performance of multiple buffered file writers. - We can implement cancel through signals, for cancelling interruptible work like read/write (or send/recv) to/from sockets. - We need the above cancel for being able to assign and use file tables from a process. - We can implement a more thorough cancel operation in general. - We need it to move towards a syslet/threadlet model for even faster async execution. For that we need to take ownership of the used threads. This list is just off the top of my head. Performance should be the same, or better, at least that's what I've seen in my testing. io-wq supports basic NUMA functionality, setting up a pool per node. io-wq hooks up to the scheduler schedule in/out just like workqueue and uses that to drive the need for more/less workers. Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>