Age | Commit message (Collapse) | Author |
|
KASAN reports a BUG when download file in jffs2 filesystem.It is
because when dstlen == 1, cpage_out will write array out of bounds.
Actually, data will not be compressed in jffs2_zlib_compress() if
data's length less than 4.
[ 393.799778] BUG: KASAN: slab-out-of-bounds in jffs2_rtime_compress+0x214/0x2f0 at addr ffff800062e3b281
[ 393.809166] Write of size 1 by task tftp/2918
[ 393.813526] CPU: 3 PID: 2918 Comm: tftp Tainted: G B 4.9.115-rt93-EMBSYS-CGEL-6.1.R6-dirty #1
[ 393.823173] Hardware name: LS1043A RDB Board (DT)
[ 393.827870] Call trace:
[ 393.830322] [<ffff20000808c700>] dump_backtrace+0x0/0x2f0
[ 393.835721] [<ffff20000808ca04>] show_stack+0x14/0x20
[ 393.840774] [<ffff2000086ef700>] dump_stack+0x90/0xb0
[ 393.845829] [<ffff20000827b19c>] kasan_object_err+0x24/0x80
[ 393.851402] [<ffff20000827b404>] kasan_report_error+0x1b4/0x4d8
[ 393.857323] [<ffff20000827bae8>] kasan_report+0x38/0x40
[ 393.862548] [<ffff200008279d44>] __asan_store1+0x4c/0x58
[ 393.867859] [<ffff2000084ce2ec>] jffs2_rtime_compress+0x214/0x2f0
[ 393.873955] [<ffff2000084bb3b0>] jffs2_selected_compress+0x178/0x2a0
[ 393.880308] [<ffff2000084bb530>] jffs2_compress+0x58/0x478
[ 393.885796] [<ffff2000084c5b34>] jffs2_write_inode_range+0x13c/0x450
[ 393.892150] [<ffff2000084be0b8>] jffs2_write_end+0x2a8/0x4a0
[ 393.897811] [<ffff2000081f3008>] generic_perform_write+0x1c0/0x280
[ 393.903990] [<ffff2000081f5074>] __generic_file_write_iter+0x1c4/0x228
[ 393.910517] [<ffff2000081f5210>] generic_file_write_iter+0x138/0x288
[ 393.916870] [<ffff20000829ec1c>] __vfs_write+0x1b4/0x238
[ 393.922181] [<ffff20000829ff00>] vfs_write+0xd0/0x238
[ 393.927232] [<ffff2000082a1ba8>] SyS_write+0xa0/0x110
[ 393.932283] [<ffff20000808429c>] __sys_trace_return+0x0/0x4
[ 393.937851] Object at ffff800062e3b280, in cache kmalloc-64 size: 64
[ 393.944197] Allocated:
[ 393.946552] PID = 2918
[ 393.948913] save_stack_trace_tsk+0x0/0x220
[ 393.953096] save_stack_trace+0x18/0x20
[ 393.956932] kasan_kmalloc+0xd8/0x188
[ 393.960594] __kmalloc+0x144/0x238
[ 393.963994] jffs2_selected_compress+0x48/0x2a0
[ 393.968524] jffs2_compress+0x58/0x478
[ 393.972273] jffs2_write_inode_range+0x13c/0x450
[ 393.976889] jffs2_write_end+0x2a8/0x4a0
[ 393.980810] generic_perform_write+0x1c0/0x280
[ 393.985251] __generic_file_write_iter+0x1c4/0x228
[ 393.990040] generic_file_write_iter+0x138/0x288
[ 393.994655] __vfs_write+0x1b4/0x238
[ 393.998228] vfs_write+0xd0/0x238
[ 394.001543] SyS_write+0xa0/0x110
[ 394.004856] __sys_trace_return+0x0/0x4
[ 394.008684] Freed:
[ 394.010691] PID = 2918
[ 394.013051] save_stack_trace_tsk+0x0/0x220
[ 394.017233] save_stack_trace+0x18/0x20
[ 394.021069] kasan_slab_free+0x88/0x188
[ 394.024902] kfree+0x6c/0x1d8
[ 394.027868] jffs2_sum_write_sumnode+0x2c4/0x880
[ 394.032486] jffs2_do_reserve_space+0x198/0x598
[ 394.037016] jffs2_reserve_space+0x3f8/0x4d8
[ 394.041286] jffs2_write_inode_range+0xf0/0x450
[ 394.045816] jffs2_write_end+0x2a8/0x4a0
[ 394.049737] generic_perform_write+0x1c0/0x280
[ 394.054179] __generic_file_write_iter+0x1c4/0x228
[ 394.058968] generic_file_write_iter+0x138/0x288
[ 394.063583] __vfs_write+0x1b4/0x238
[ 394.067157] vfs_write+0xd0/0x238
[ 394.070470] SyS_write+0xa0/0x110
[ 394.073783] __sys_trace_return+0x0/0x4
[ 394.077612] Memory state around the buggy address:
[ 394.082404] ffff800062e3b180: 00 00 00 00 00 00 00 00 fc fc fc fc fc fc fc fc
[ 394.089623] ffff800062e3b200: 00 00 00 00 00 00 00 00 fc fc fc fc fc fc fc fc
[ 394.096842] >ffff800062e3b280: 01 fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 394.104056] ^
[ 394.107283] ffff800062e3b300: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
[ 394.114502] ffff800062e3b380: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
[ 394.121718] ==================================================================
Signed-off-by: Yang Yang <yang.yang29@zte.com.cn>
Signed-off-by: Richard Weinberger <richard@nod.at>
|
|
An inode is allowed to have ubifs_xattr_max_cnt() xattrs, so we must
complain only when an inode has more xattrs, having exactly
ubifs_xattr_max_cnt() xattrs is fine.
With this the maximum number of xattrs can be created without hitting
the "has too many xattrs" warning when removing it.
Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
Signed-off-by: Richard Weinberger <richard@nod.at>
|
|
An earlier commit moved out some functions to not be inlined by gcc, but
after some other rework to remove one of those, clang started inlining
the other one and ran into the same problem as gcc did before:
fs/ubifs/replay.c:1174:5: error: stack frame size of 1152 bytes in function 'ubifs_replay_journal' [-Werror,-Wframe-larger-than=]
Mark the function as noinline_for_stack to ensure it doesn't happen
again.
Fixes: f80df3851246 ("ubifs: use crypto_shash_tfm_digest()")
Fixes: eb66eff6636d ("ubifs: replay: Fix high stack usage")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
|
|
When crypto_shash_digestsize() fails, c->hmac_tfm
has not been freed before returning, which leads
to memleak.
Fixes: 49525e5eecca5 ("ubifs: Add helper functions for authentication support")
Signed-off-by: Dinghao Liu <dinghao.liu@zju.edu.cn>
Reviewed-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
|
|
clang static analysis reports this problem
fs/jffs2/summary.c:794:31: warning: Use of memory after it is freed
c->summary->sum_list_head = temp->u.next;
^~~~~~~~~~~~
In jffs2_sum_write_data(), in a loop summary data is handles a node at
a time. When it has written out the node it is removed the summary list,
and the node is deleted. In the corner case when a
JFFS2_FEATURE_RWCOMPAT_COPY is seen, a call is made to
jffs2_sum_disable_collecting(). jffs2_sum_disable_collecting() deletes
the whole list which conflicts with the loop's deleting the list by parts.
To preserve the old behavior of stopping the write midway, bail out of
the loop after disabling summary collection.
Fixes: 6171586a7ae5 ("[JFFS2] Correct handling of JFFS2_FEATURE_RWCOMPAT_COPY nodes.")
Signed-off-by: Tom Rix <trix@redhat.com>
Reviewed-by: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
|
|
This collects all of them together and makes it possible to
e.g. exclude it from slub debugging.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
|
|
Pull io_uring fix from Jens Axboe:
"Revert of a patch from this release that caused a regression"
* tag 'io_uring-5.11-2021-02-12' of git://git.kernel.dk/linux-block:
Revert "io_uring: don't take fs for recvmsg/sendmsg"
|
|
Invalid req->flags are tolerated by free/put well, avoid this dancing
needlessly presetting it to zero, and then not even resetting but
modifying it, i.e. "|=".
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Indirectly io_req_find_next() is called for every request, optimise the
check by testing flags as it was long before -- __io_req_find_next()
tolerates false-positives well (i.e. link==NULL), and those should be
really rare.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
io_sq_thread_acquire_mm_files() can find a PF_EXITING task only when
it's called from task_work context. Don't check it in all other cases,
that are when we're in io_uring_enter().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm64 updates for Linux 5.12
- Make the nVHE EL2 object relocatable, resulting in much more
maintainable code
- Handle concurrent translation faults hitting the same page
in a more elegant way
- Support for the standard TRNG hypervisor call
- A bunch of small PMU/Debug fixes
- Allow the disabling of symbol export from assembly code
- Simplification of the early init hypercall handling
|
|
User reported that btrfs-progs misc-tests/028-superblock-recover fails:
[TEST/misc] 028-superblock-recover
unexpected success: mounted fs with corrupted superblock
test failed for case 028-superblock-recover
The test case expects that a broken image with bad superblock will be
rejected to be mounted. However, the test image just passed csum check
of superblock and was successfully mounted.
Commit 55fc29bed8dd ("btrfs: use cached value of fs_info::csum_size
everywhere") replaces all calls to btrfs_super_csum_size by
fs_info::csum_size. The calls include the place where fs_info->csum_size
is not initialized. So btrfs_check_super_csum() passes because memcmp()
with len 0 always returns 0.
Fix it by caching csum size in btrfs_fs_info::csum_size once we know the
csum type in superblock is valid in open_ctree().
Link: https://github.com/kdave/btrfs-progs/issues/250
Fixes: 55fc29bed8dd ("btrfs: use cached value of fs_info::csum_size everywhere")
Signed-off-by: Su Yue <l@damenly.su>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Remove io_consume_sqe() and inline it back into io_get_sqe(). It
requires req dealloc on error, but in exchange we get cleaner
io_submit_sqes() and better locality for cached_sq_head.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Do a little trick in io_ring_ctx_free() briefly taking uring_lock, that
will wait for everyone currently holding it, so we can skip pinning ctx
with ctx->refs for __io_req_task_submit(), which is executed and loses
its refs/reqs while holding the lock.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Don't hand code io_req_task_queue() inside of io_async_buf_func(), just
call it.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
There are two reasons for this. First is to optimise
io_sq_thread_acquire_mm_files() for non-SQPOLL case, which currently do
too many checks and function calls in the hot path, e.g. in
io_init_req().
The second is to not grab mm/files when there are not needed. As
__io_queue_sqe() issues only one request now, we can reuse
io_sq_thread_acquire_mm_files() instead of unconditional acquire
mm/files.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
__io_queue_sqe() tries to issue as much requests of a link as it can,
and uses io_put_req_find_next() to extract a next one, targeting inline
completed requests. As now __io_queue_sqe() is always used together with
struct io_comp_state, it leaves next propagation only a small window and
only for async reqs, that doesn't justify its existence.
Remove it, make __io_queue_sqe() to issue only a head request. It
simplifies the code and will allow other optimisations.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Completion and submission states are now coupled together, it's weird to
get one from argument and another from ctx, do it consistently for
io_req_free_batch(). It's also faster as we already have @state cached
in registers.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
__io_complete_rw() casts request to kiocb for it to be immediately
container_of()'ed by io_complete_rw_common(). And the last function's name
doesn't do a great job of illuminating its purposes, so just inline it in
its only user.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
We pass return code into io_rw_reissue() only to be able to check if it's
-EAGAIN. That's not the cleanest approach and may prevent inlining of the
non-EAGAIN fast path, so do it at call sites.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Don't stash -EAGAIN'ed iopoll requests into a list to reissue it later,
do it eagerly. It removes overhead on keeping and checking that list,
and allows in case of failure for these requests to be completed through
normal iopoll completion path.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
io_req_free_batch_finish() is final and does not permit struct req_batch
to be reused without re-init. To be more consistent don't clear ->task
there.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
We recently added the submit side req cache, but it was placed at the
end of the struct. Move it near the other submission state for better
memory placement, and reshuffle a few other members at the same time.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The left shift of int 32 bit integer constant 1 is evaluated using 32 bit
arithmetic and then assigned to a signed 64 bit integer. In the case where
l2nb is 32 or more this can lead to an overflow. Avoid this by shifting
the value 1LL instead.
Addresses-Coverity: ("Uninitentional integer overflow")
Fixes: b40c2e665cd5 ("fs/jfs: TRIM support for JFS Filesystem")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
|
|
While debugging another issue today, Steve and I noticed that if a
subdir for a file share is already mounted on the client, any new
mount of any other subdir (or the file share root) of the same share
results in sharing the cifs superblock, which e.g. can result in
incorrect device name.
While setting prefix path for the root of a cifs_sb,
CIFS_MOUNT_USE_PREFIX_PATH flag should also be set.
Without it, prepath is not even considered in some places,
and output of "mount" and various /proc/<>/*mount* related
options can be missing part of the device name.
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
so we no longer need to handle or parse the UNC= and prefixpath=
options that mount.cifs are generating.
This also fixes a bug in the mount command option where the devname
would be truncated into just //server/share because we were looking
at the truncated UNC value and not the full path.
I.e. in the mount command output the devive //server/share/path
would show up as just //server/share
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reviewed-by: Shyam Prasad N <nspmangalore@gmail.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
The assert in xfs_btree_del_cursor() checks that the bmapbt block
allocation field has been handled correctly before the cursor is
freed. This field is used for accurate calculation of indirect block
reservation requirements (for delayed allocations), for example.
generic/019 reproduces a scenario where this assert fails because
the filesystem has shutdown while in the middle of a bmbt record
insertion. This occurs after a bmbt block has been allocated via the
cursor but before the higher level bmap function (i.e.
xfs_bmap_add_extent_hole_real()) completes and resets the field.
Update the assert to accommodate the transient state if the
filesystem has shutdown. While here, clean up the indentation and
comments in the function.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
We use the assigned slot in io_sqe_file_register(), and a previous
patch moved the assignment to after we have called it. This isn't
super pretty, and will get cleaned up in the future. For now, fix
the regression by restoring the previous assignment/clear of the
file_slot.
Fixes: ea64ec02b31d ("io_uring: deduplicate file table slot calculation")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Currently, although set_bit() & test_bit() pairs are used as a fast-
path for initialized configurations. However, these atomic ops are
actually relaxed forms. Instead, load-acquire & store-release form is
needed to make sure uninitialized fields won't be observed in advance
here (yet no such corresponding bitops so use full barriers instead.)
Link: https://lore.kernel.org/r/20210209130618.15838-1-hsiangkao@aol.com
Fixes: 62dc45979f3f ("staging: erofs: fix race of initializing xattrs of a inode at the same time")
Fixes: 152a333a5895 ("staging: erofs: add compacted compression indexes support")
Cc: <stable@vger.kernel.org> # 5.3+
Reported-by: Huang Jianan <huangjianan@oppo.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
|
|
syzbot generated a crafted bitszbits which can be shifted
out-of-bounds[1]. So directly print unsupported blkszbits
instead of blksize.
[1] https://lore.kernel.org/r/000000000000c72ddd05b9444d2f@google.com
Link: https://lore.kernel.org/r/20210120013016.14071-1-hsiangkao@aol.com
Reported-by: syzbot+c68f467cd7c45860e8d4@syzkaller.appspotmail.com
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
|
|
fs/xfs/xfs_log.c:1062:9-10: WARNING: return of 0/1 in function 'xfs_log_need_covered' with return type bool
Return statements in functions returning bool should use
true/false instead of 1/0.
Generated by: scripts/coccinelle/misc/boolreturn.cocci
Fixes: 37444fc4cc39 ("xfs: lift writable fs check up into log worker task")
CC: Brian Foster <bfoster@redhat.com>
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: kernel test robot <lkp@intel.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
XFS triggers an iomap warning in the write fault path due to a
!PageUptodate() page if a write fault happens to occur on a page
that recently failed writeback. The iomap writeback error handling
code can clear the Uptodate flag if no portion of the page is
submitted for I/O. This is reproduced by fstest generic/019, which
combines various forms of I/O with simulated disk failures that
inevitably lead to filesystem shutdown (which then unconditionally
fails page writeback).
This is a regression introduced by commit f150b4234397 ("xfs: split
the iomap ops for buffered vs direct writes") due to the removal of
a shutdown check and explicit error return in the ->iomap_begin()
path used by the write fault path. The explicit error return
historically translated to a SIGBUS, but now carries on with iomap
processing where it complains about the unexpected state. Restore
the shutdown check to xfs_buffered_write_iomap_begin() to restore
historical behavior.
Fixes: f150b4234397 ("xfs: split the iomap ops for buffered vs direct writes")
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
The variable ret is being initialized with a value that is never read
and it is being updated later with a new value. The initialization is
redundant and can be removed.
Addresses-Coverity: ("Unused value")
Fixes: b63534c41e20 ("io_uring: re-issue block requests that failed because of resources")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
We park SQPOLL task before going into io_uring_cancel_files(), so the
task won't run task_works including those that might be important for
the cancellation passes. In this case it's io_poll_remove_one(), which
frees requests via io_put_req_deferred().
Unpark it for while waiting, it's ok as we disable submissions
beforehand, so no new requests will be generated.
INFO: task syz-executor893:8493 blocked for more than 143 seconds.
Call Trace:
context_switch kernel/sched/core.c:4327 [inline]
__schedule+0x90c/0x21a0 kernel/sched/core.c:5078
schedule+0xcf/0x270 kernel/sched/core.c:5157
io_uring_cancel_files fs/io_uring.c:8912 [inline]
io_uring_cancel_task_requests+0xe70/0x11a0 fs/io_uring.c:8979
__io_uring_files_cancel+0x110/0x1b0 fs/io_uring.c:9067
io_uring_files_cancel include/linux/io_uring.h:51 [inline]
do_exit+0x2fe/0x2ae0 kernel/exit.c:780
do_group_exit+0x125/0x310 kernel/exit.c:922
__do_sys_exit_group kernel/exit.c:933 [inline]
__se_sys_exit_group kernel/exit.c:931 [inline]
__x64_sys_exit_group+0x3a/0x50 kernel/exit.c:931
do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Cc: stable@vger.kernel.org # 5.5+
Reported-by: syzbot+695b03d82fa8e4901b06@syzkaller.appspotmail.com
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
This reverts commit 10cad2c40dcb04bb46b2bf399e00ca5ea93d36b0.
Petr reports that with this commit in place, io_uring fails the chroot
test (CVE-202-29373). We do need to retain ->fs for send/recvmsg, so
revert this commit.
Reported-by: Petr Vorel <pvorel@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Since 5.10, splice() or sendfile() to NILFS2 return EINVAL. This was
caused by commit 36e2c7421f02 ("fs: don't allow splice read/write
without explicit ops").
This patch initializes the splice_write field in file_operations, like
most file systems do, to restore the functionality.
Link: https://lkml.kernel.org/r/1612784101-14353-1-git-send-email-konishi.ryusuke@gmail.com
Signed-off-by: Joachim Henke <joachim.henke@t-systems.com>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Cc: <stable@vger.kernel.org> [5.10+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Zoned block devices have different granularity constraints for write
operations into sequential zones. E.g. ZBC and ZAC devices require that
writes be aligned to the device physical block size while NVMe ZNS
devices allow logical block size aligned write operations. To correctly
handle such difference, use the device zone write granularity limit to
set the block size of a zonefs volume, thus allowing the smallest
possible write unit for all zoned device types.
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@edc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Instead of imposing rlimit memlock limits for the rings themselves,
ensure that we account them properly under memcg with __GFP_ACCOUNT.
We retain rlimit memlock for registered buffers, this is just for the
ring arrays themselves.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
This puts io_uring under the memory cgroups accounting and limits for
requests.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
This is the last class of requests that cannot utilize the req alloc
cache. Add a per-ctx req cache that is protected by the completion_lock,
and refill our submit side cache when it gets over our batch count.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Abaci reported follow issue:
[ 30.615891] ======================================================
[ 30.616648] WARNING: possible circular locking dependency detected
[ 30.617423] 5.11.0-rc3-next-20210115 #1 Not tainted
[ 30.618035] ------------------------------------------------------
[ 30.618914] a.out/1128 is trying to acquire lock:
[ 30.619520] ffff88810b063868 (&ep->mtx){+.+.}-{3:3}, at: __ep_eventpoll_poll+0x9f/0x220
[ 30.620505]
[ 30.620505] but task is already holding lock:
[ 30.621218] ffff88810e952be8 (&ctx->uring_lock){+.+.}-{3:3}, at: __x64_sys_io_uring_enter+0x3f0/0x5b0
[ 30.622349]
[ 30.622349] which lock already depends on the new lock.
[ 30.622349]
[ 30.623289]
[ 30.623289] the existing dependency chain (in reverse order) is:
[ 30.624243]
[ 30.624243] -> #1 (&ctx->uring_lock){+.+.}-{3:3}:
[ 30.625263] lock_acquire+0x2c7/0x390
[ 30.625868] __mutex_lock+0xae/0x9f0
[ 30.626451] io_cqring_overflow_flush.part.95+0x6d/0x70
[ 30.627278] io_uring_poll+0xcb/0xd0
[ 30.627890] ep_item_poll.isra.14+0x4e/0x90
[ 30.628531] do_epoll_ctl+0xb7e/0x1120
[ 30.629122] __x64_sys_epoll_ctl+0x70/0xb0
[ 30.629770] do_syscall_64+0x2d/0x40
[ 30.630332] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 30.631187]
[ 30.631187] -> #0 (&ep->mtx){+.+.}-{3:3}:
[ 30.631985] check_prevs_add+0x226/0xb00
[ 30.632584] __lock_acquire+0x1237/0x13a0
[ 30.633207] lock_acquire+0x2c7/0x390
[ 30.633740] __mutex_lock+0xae/0x9f0
[ 30.634258] __ep_eventpoll_poll+0x9f/0x220
[ 30.634879] __io_arm_poll_handler+0xbf/0x220
[ 30.635462] io_issue_sqe+0xa6b/0x13e0
[ 30.635982] __io_queue_sqe+0x10b/0x550
[ 30.636648] io_queue_sqe+0x235/0x470
[ 30.637281] io_submit_sqes+0xcce/0xf10
[ 30.637839] __x64_sys_io_uring_enter+0x3fb/0x5b0
[ 30.638465] do_syscall_64+0x2d/0x40
[ 30.638999] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 30.639643]
[ 30.639643] other info that might help us debug this:
[ 30.639643]
[ 30.640618] Possible unsafe locking scenario:
[ 30.640618]
[ 30.641402] CPU0 CPU1
[ 30.641938] ---- ----
[ 30.642664] lock(&ctx->uring_lock);
[ 30.643425] lock(&ep->mtx);
[ 30.644498] lock(&ctx->uring_lock);
[ 30.645668] lock(&ep->mtx);
[ 30.646321]
[ 30.646321] *** DEADLOCK ***
[ 30.646321]
[ 30.647642] 1 lock held by a.out/1128:
[ 30.648424] #0: ffff88810e952be8 (&ctx->uring_lock){+.+.}-{3:3}, at: __x64_sys_io_uring_enter+0x3f0/0x5b0
[ 30.649954]
[ 30.649954] stack backtrace:
[ 30.650592] CPU: 1 PID: 1128 Comm: a.out Not tainted 5.11.0-rc3-next-20210115 #1
[ 30.651554] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[ 30.652290] Call Trace:
[ 30.652688] dump_stack+0xac/0xe3
[ 30.653164] check_noncircular+0x11e/0x130
[ 30.653747] ? check_prevs_add+0x226/0xb00
[ 30.654303] check_prevs_add+0x226/0xb00
[ 30.654845] ? add_lock_to_list.constprop.49+0xac/0x1d0
[ 30.655564] __lock_acquire+0x1237/0x13a0
[ 30.656262] lock_acquire+0x2c7/0x390
[ 30.656788] ? __ep_eventpoll_poll+0x9f/0x220
[ 30.657379] ? __io_queue_proc.isra.88+0x180/0x180
[ 30.658014] __mutex_lock+0xae/0x9f0
[ 30.658524] ? __ep_eventpoll_poll+0x9f/0x220
[ 30.659112] ? mark_held_locks+0x5a/0x80
[ 30.659648] ? __ep_eventpoll_poll+0x9f/0x220
[ 30.660229] ? _raw_spin_unlock_irqrestore+0x2d/0x40
[ 30.660885] ? trace_hardirqs_on+0x46/0x110
[ 30.661471] ? __io_queue_proc.isra.88+0x180/0x180
[ 30.662102] ? __ep_eventpoll_poll+0x9f/0x220
[ 30.662696] __ep_eventpoll_poll+0x9f/0x220
[ 30.663273] ? __ep_eventpoll_poll+0x220/0x220
[ 30.663875] __io_arm_poll_handler+0xbf/0x220
[ 30.664463] io_issue_sqe+0xa6b/0x13e0
[ 30.664984] ? __lock_acquire+0x782/0x13a0
[ 30.665544] ? __io_queue_proc.isra.88+0x180/0x180
[ 30.666170] ? __io_queue_sqe+0x10b/0x550
[ 30.666725] __io_queue_sqe+0x10b/0x550
[ 30.667252] ? __fget_files+0x131/0x260
[ 30.667791] ? io_req_prep+0xd8/0x1090
[ 30.668316] ? io_queue_sqe+0x235/0x470
[ 30.668868] io_queue_sqe+0x235/0x470
[ 30.669398] io_submit_sqes+0xcce/0xf10
[ 30.669931] ? xa_load+0xe4/0x1c0
[ 30.670425] __x64_sys_io_uring_enter+0x3fb/0x5b0
[ 30.671051] ? lockdep_hardirqs_on_prepare+0xde/0x180
[ 30.671719] ? syscall_enter_from_user_mode+0x2b/0x80
[ 30.672380] do_syscall_64+0x2d/0x40
[ 30.672901] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 30.673503] RIP: 0033:0x7fd89c813239
[ 30.673962] Code: 01 00 48 81 c4 80 00 00 00 e9 f1 fe ff ff 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 3d 01 f0 ff ff 73 01 c3 48 8b 0d 27 ec 2c 00 f7 d8 64 89 01 48
[ 30.675920] RSP: 002b:00007ffc65a7c628 EFLAGS: 00000217 ORIG_RAX: 00000000000001aa
[ 30.676791] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fd89c813239
[ 30.677594] RDX: 0000000000000000 RSI: 0000000000000014 RDI: 0000000000000003
[ 30.678678] RBP: 00007ffc65a7c720 R08: 0000000000000000 R09: 0000000003000000
[ 30.679492] R10: 0000000000000000 R11: 0000000000000217 R12: 0000000000400ff0
[ 30.680282] R13: 00007ffc65a7c840 R14: 0000000000000000 R15: 0000000000000000
This might happen if we do epoll_wait on a uring fd while reading/writing
the former epoll fd in a sqe in the former uring instance.
So let's don't flush cqring overflow list, just do a simple check.
Reported-by: Abaci <abaci@linux.alibaba.com>
Fixes: 6c503150ae33 ("io_uring: patch up IOPOLL overflow_flush sync")
Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Awhile there are requests in the allocation cache -- use them, only if
those ended go for the stashed memory in comp.free_list. As list
manipulation are generally heavy and are not good for caches, flush them
all or as much as can in one go.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
[axboe: return success/failure from io_flush_cached_reqs()]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
__io_queue_sqe() is always called with a non-NULL comp_state, which is
taken directly from context. Don't pass it around but infer from ctx.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
task_work is run without utilizing the req alloc cache, so any deferred
items don't get to take advantage of either the alloc or free side of it.
With task_work now being wrapped by io_uring, we can use the ctx
completion state to both use the req cache and the completion flush
batching.
With this, the only request type that cannot take advantage of the req
cache is IRQ driven IO for regular files / block devices. Anything else,
including IOPOLL polled IO to those same tyes, will take advantage of it.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
task_work is a LIFO list, due to how it's implemented as a lockless
list. For long chains of task_work, this can be problematic as the
first entry added is the last one processed. Similarly, we'd waste
a lot of CPU cycles reversing this list.
Wrap the task_work so we have a single task_work entry per task per
ctx, and use that to run it in the right order.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Now that we have the submit_state in the ring itself, we can have io_kiocb
allocations that are persistent across invocations. This reduces the time
spent doing slab allocations and frees.
[sil: rebased]
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Make io_req_free_batch(), which is used for inline executed requests and
IOPOLL, to return requests back into the allocation cache, so avoid
most of kmalloc()/kfree() for those cases.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Don't free batch-allocated requests across syscalls.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Currently batch free handles request memory freeing and ctx ref putting
together. Separate them and use different counters, that will be needed
for reusing reqs memory.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Remove fallback_req for now, it gets in the way of other changes.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|