summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2021-03-15btrfs: fix qgroup data rsv leak caused by falloc failureQu Wenruo
[BUG] When running fsstress with only falloc workload, and a very low qgroup limit set, we can get qgroup data rsv leak at unmount time. BTRFS warning (device dm-0): qgroup 0/5 has unreleased space, type 0 rsv 20480 BTRFS error (device dm-0): qgroup reserved space leaked The minimal reproducer looks like: #!/bin/bash dev=/dev/test/test mnt="/mnt/btrfs" fsstress=~/xfstests-dev/ltp/fsstress runtime=8 workload() { umount $dev &> /dev/null umount $mnt &> /dev/null mkfs.btrfs -f $dev > /dev/null mount $dev $mnt btrfs quota en $mnt btrfs quota rescan -w $mnt btrfs qgroup limit 16m 0/5 $mnt $fsstress -w -z -f creat=10 -f fallocate=10 -p 2 -n 100 \ -d $mnt -v > /tmp/fsstress umount $mnt if dmesg | grep leak ; then echo "!!! FAILED !!!" exit 1 fi } for (( i=0; i < $runtime; i++)); do echo "=== $i/$runtime===" workload done Normally it would fail before round 4. [CAUSE] In function insert_prealloc_file_extent(), we first call btrfs_qgroup_release_data() to know how many bytes are reserved for qgroup data rsv. Then use that @qgroup_released number to continue our work. But after we call btrfs_qgroup_release_data(), we should either queue @qgroup_released to delayed ref or free them manually in error path. Unfortunately, we lack the error handling to free the released bytes, leaking qgroup data rsv. All the error handling function outside won't help at all, as we have released the range, meaning in inode io tree, the EXTENT_QGROUP_RESERVED bit is already cleared, thus all btrfs_qgroup_free_data() call won't free any data rsv. [FIX] Add free_qgroup tag to manually free the released qgroup data rsv. Reported-by: Nikolay Borisov <nborisov@suse.com> Reported-by: David Sterba <dsterba@suse.cz> Fixes: 9729f10a608f ("btrfs: inode: move qgroup reserved space release to the callers of insert_reserved_file_extent()") CC: stable@vger.kernel.org # 5.10+ Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-03-15btrfs: track qgroup released data in own variable in insert_prealloc_file_extentQu Wenruo
There is a piece of weird code in insert_prealloc_file_extent(), which looks like: ret = btrfs_qgroup_release_data(inode, file_offset, len); if (ret < 0) return ERR_PTR(ret); if (trans) { ret = insert_reserved_file_extent(trans, inode, file_offset, &stack_fi, true, ret); ... } extent_info.is_new_extent = true; extent_info.qgroup_reserved = ret; ... Note how the variable @ret is abused here, and if anyone is adding code just after btrfs_qgroup_release_data() call, it's super easy to overwrite the @ret and cause tons of qgroup related bugs. Fix such abuse by introducing new variable @qgroup_released, so that we won't reuse the existing variable @ret. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-03-15btrfs: fix wrong offset to zero out range beyond i_sizeQu Wenruo
[BUG] The test generic/091 fails , with the following output: fsx -N 10000 -o 128000 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -W mapped writes DISABLED Seed set to 1 main: filesystem does not support fallocate mode FALLOC_FL_COLLAPSE_RANGE, disabling! main: filesystem does not support fallocate mode FALLOC_FL_INSERT_RANGE, disabling! skipping zero size read truncating to largest ever: 0xe400 copying to largest ever: 0x1f400 cloning to largest ever: 0x70000 cloning to largest ever: 0x77000 fallocating to largest ever: 0x7a120 Mapped Read: non-zero data past EOF (0x3a7ff) page offset 0x800 is 0xf2e1 <<< ... [CAUSE] In commit c28ea613fafa ("btrfs: subpage: fix the false data csum mismatch error") end_bio_extent_readpage() changes to only zero the range inside the bvec for incoming subpage support. But that commit is using incorrect offset to calculate the start. For subpage, we can have a case that the whole bvec is beyond isize, thus we need to calculate the correct offset. But the offending commit is using @end (bvec end), other than @start (bvec start) to calculate the start offset. This means, we only zero the last byte of the bvec, not from the isize. This stupid bug makes the range beyond isize is not properly zeroed, and failed above test. [FIX] Use correct @start to calculate the range start. Reported-by: kernel test robot <oliver.sang@intel.com> Fixes: c28ea613fafa ("btrfs: subpage: fix the false data csum mismatch error") Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-03-15xfs: also reject BULKSTAT_SINGLE in a mount user namespaceChristoph Hellwig
BULKSTAT_SINGLE exposed the ondisk uids/gids just like bulkstat, and can be called on any inode, including ones not visible in the current mount. Fixes: f736d93d76d3 ("xfs: support idmapped mounts") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-03-15xfs: force log and push AIL to clear pinned inodes when aborting mountDarrick J. Wong
If we allocate quota inodes in the process of mounting a filesystem but then decide to abort the mount, it's possible that the quota inodes are sitting around pinned by the log. Now that inode reclaim relies on the AIL to flush inodes, we have to force the log and push the AIL in between releasing the quota inodes and kicking off reclaim to tear down all the incore inodes. Do this by extracting the bits we need from the unmount path and reusing them. As an added bonus, failed writes during a failed mount will not retry forever now. This was originally found during a fuzz test of metadata directories (xfs/1546), but the actual symptom was that reclaim hung up on the quota inodes. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-03-15io_uring: fix sqpoll cancellation via task_workPavel Begunkov
Running sqpoll cancellations via task_work_run() is a bad idea because it depends on other task works to be run, but those may be locked in currently running task_work_run() because of how it's (splicing the list in batches). Enqueue and run them through a separate callback head, namely struct io_sq_data::park_task_work. As a nice bonus we now precisely control where it's run, that's much safer than guessing where it can happen as it was before. Reported-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-15io_uring: add generic callback_head helpersPavel Begunkov
We already have helpers to run/add callback_head but taking ctx and working with ctx->exit_task_work. Extract generic versions of them implemented in terms of struct callback_head, it will be used later. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-15io_uring: fix concurrent parkingPavel Begunkov
If io_sq_thread_park() of one task got rescheduled right after set_bit(), before it gets back to mutex_lock() there can happen park()/unpark() by another task with SQPOLL locking again and continuing running never seeing that first set_bit(SHOULD_PARK), so won't even try to put the mutex down for parking. It will get parked eventually when SQPOLL drops the lock for reschedule, but may be problematic and will get in the way of further fixes. Account number of tasks waiting for parking with a new atomic variable park_pending and adjust SHOULD_PARK accordingly. It doesn't entirely replaces SHOULD_PARK bit with this atomic var because it's convenient to have it as a bit in the state and will help to do optimisations later. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-15io_uring: halt SQO submission on ctx exitPavel Begunkov
io_sq_thread_finish() is called in io_ring_ctx_free(), so SQPOLL task is potentially running submitting new requests. It's not a disaster because of using a "try" variant of percpu_ref_get, but is far from nice. Remove ctx from the sqd ctx list earlier, before cancellation loop, so SQPOLL can't find it and so won't submit new requests. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-15io_uring: replace sqd rw_semaphore with mutexPavel Begunkov
The only user of read-locking of sqd->rw_lock is sq_thread itself, which is by definition alone, so we don't really need rw_semaphore, but mutex will do. Replace it with a mutex, and kill read-to-write upgrading and extra task_work handling in io_sq_thread(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-15io_uring: fix complete_post use ctx after freePavel Begunkov
If io_req_complete_post() put not a final ref, we can't rely on the request's ctx ref, and so ctx may potentially be freed while complete_post() is in io_cqring_ev_posted()/etc. In that case get an additional ctx reference, and put it in the end, so protecting following io_cqring_ev_posted(). And also prolong ctx lifetime until spin_unlock happens, as we do with mutexes, so added percpu_ref_get() doesn't race with ctx free. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-15io_uring: fix ->flags races by linked timeoutsPavel Begunkov
It's racy to modify req->flags from a not owning context, e.g. linked timeout calling req_set_fail_links() for the master request might race with that request setting/clearing flags while being executed concurrently. Just remove req_set_fail_links(prev) from io_link_timeout_fn(), io_async_find_and_cancel() and functions down the line take care of setting the fail bit. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-14cifs: Fix preauth hash corruptionVincent Whitchurch
smb311_update_preauth_hash() uses the shash in server->secmech without appropriate locking, and this can lead to sessions corrupting each other's preauth hashes. The following script can easily trigger the problem: #!/bin/sh -e NMOUNTS=10 for i in $(seq $NMOUNTS); mkdir -p /tmp/mnt$i umount /tmp/mnt$i 2>/dev/null || : done while :; do for i in $(seq $NMOUNTS); do mount -t cifs //192.168.0.1/test /tmp/mnt$i -o ... & done wait for i in $(seq $NMOUNTS); do umount /tmp/mnt$i done done Usually within seconds this leads to one or more of the mounts failing with the following errors, and a "Bad SMB2 signature for message" is seen in the server logs: CIFS: VFS: \\192.168.0.1 failed to connect to IPC (rc=-13) CIFS: VFS: cifs_mount failed w/return code = -13 Fix it by holding the server mutex just like in the other places where the shashes are used. Fixes: 8bd68c6e47abff34e4 ("CIFS: implement v3.11 preauth integrity") Signed-off-by: Vincent Whitchurch <vincent.whitchurch@axis.com> CC: <stable@vger.kernel.org> Reviewed-by: Aurelien Aptel <aaptel@suse.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2021-03-14cifs: update new ACE pointer after populate_new_aces.Shyam Prasad N
After the fix for retaining externally set ACEs with cifsacl and modefromsid,idsfromsid, there was an issue in populating the inherited ACEs after setting the ACEs introduced by these two modes. Fixed this by updating the ACE pointer again after the call to populate_new_aces. Signed-off-by: Shyam Prasad N <sprasad@microsoft.com> Reviewed-by: Rohith Surabattula <rohiths@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2021-03-14Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge misc fixes from Andrew Morton: "28 patches. Subsystems affected by this series: mm (memblock, pagealloc, hugetlb, highmem, kfence, oom-kill, madvise, kasan, userfaultfd, memcg, and zram), core-kernel, kconfig, fork, binfmt, MAINTAINERS, kbuild, and ia64" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (28 commits) zram: fix broken page writeback zram: fix return value on writeback_store mm/memcg: set memcg when splitting page mm/memcg: rename mem_cgroup_split_huge_fixup to split_page_memcg and add nr_pages argument ia64: fix ptrace(PTRACE_SYSCALL_INFO_EXIT) sign ia64: fix ia64_syscall_get_set_arguments() for break-based syscalls mm/userfaultfd: fix memory corruption due to writeprotect kasan: fix KASAN_STACK dependency for HW_TAGS kasan, mm: fix crash with HW_TAGS and DEBUG_PAGEALLOC mm/madvise: replace ptrace attach requirement for process_madvise include/linux/sched/mm.h: use rcu_dereference in in_vfork() kfence: fix reports if constant function prefixes exist kfence, slab: fix cache_alloc_debugcheck_after() for bulk allocations kfence: fix printk format for ptrdiff_t linux/compiler-clang.h: define HAVE_BUILTIN_BSWAP* MAINTAINERS: exclude uapi directories in API/ABI section binfmt_misc: fix possible deadlock in bm_register_write mm/highmem.c: fix zero_user_segments() with start > end hugetlb: do early cow when page pinned on src mm mm: use is_cow_mapping() across tree where proper ...
2021-03-14io_uring: convert io_buffer_idr to XArrayJens Axboe
Like we did for the personality idr, convert the IO buffer idr to use XArray. This avoids a use-after-free on removal of entries, since idr doesn't like doing so from inside an iterator, and it nicely reduces the amount of code we need to support this feature. Fixes: 5a2e745d4d43 ("io_uring: buffer registration infrastructure") Cc: stable@vger.kernel.org Cc: Matthew Wilcox <willy@infradead.org> Cc: yangerkun <yangerkun@huawei.com> Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-13Merge tag 'erofs-for-5.12-rc3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs Pull erofs fix from Gao Xiang: "Fix an urgent regression introduced by commit baa2c7c97153 ("block: set .bi_max_vecs as actual allocated vector number"), which could cause unexpected hung since linux 5.12-rc1. Resolve it by avoiding using bio->bi_max_vecs completely" * tag 'erofs-for-5.12-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs: erofs: fix bio->bi_max_vecs behavior change
2021-03-13binfmt_misc: fix possible deadlock in bm_register_writeLior Ribak
There is a deadlock in bm_register_write: First, in the begining of the function, a lock is taken on the binfmt_misc root inode with inode_lock(d_inode(root)). Then, if the user used the MISC_FMT_OPEN_FILE flag, the function will call open_exec on the user-provided interpreter. open_exec will call a path lookup, and if the path lookup process includes the root of binfmt_misc, it will try to take a shared lock on its inode again, but it is already locked, and the code will get stuck in a deadlock To reproduce the bug: $ echo ":iiiii:E::ii::/proc/sys/fs/binfmt_misc/bla:F" > /proc/sys/fs/binfmt_misc/register backtrace of where the lock occurs (#5): 0 schedule () at ./arch/x86/include/asm/current.h:15 1 0xffffffff81b51237 in rwsem_down_read_slowpath (sem=0xffff888003b202e0, count=<optimized out>, state=state@entry=2) at kernel/locking/rwsem.c:992 2 0xffffffff81b5150a in __down_read_common (state=2, sem=<optimized out>) at kernel/locking/rwsem.c:1213 3 __down_read (sem=<optimized out>) at kernel/locking/rwsem.c:1222 4 down_read (sem=<optimized out>) at kernel/locking/rwsem.c:1355 5 0xffffffff811ee22a in inode_lock_shared (inode=<optimized out>) at ./include/linux/fs.h:783 6 open_last_lookups (op=0xffffc9000022fe34, file=0xffff888004098600, nd=0xffffc9000022fd10) at fs/namei.c:3177 7 path_openat (nd=nd@entry=0xffffc9000022fd10, op=op@entry=0xffffc9000022fe34, flags=flags@entry=65) at fs/namei.c:3366 8 0xffffffff811efe1c in do_filp_open (dfd=<optimized out>, pathname=pathname@entry=0xffff8880031b9000, op=op@entry=0xffffc9000022fe34) at fs/namei.c:3396 9 0xffffffff811e493f in do_open_execat (fd=fd@entry=-100, name=name@entry=0xffff8880031b9000, flags=<optimized out>, flags@entry=0) at fs/exec.c:913 10 0xffffffff811e4a92 in open_exec (name=<optimized out>) at fs/exec.c:948 11 0xffffffff8124aa84 in bm_register_write (file=<optimized out>, buffer=<optimized out>, count=19, ppos=<optimized out>) at fs/binfmt_misc.c:682 12 0xffffffff811decd2 in vfs_write (file=file@entry=0xffff888004098500, buf=buf@entry=0xa758d0 ":iiiii:E::ii::i:CF ", count=count@entry=19, pos=pos@entry=0xffffc9000022ff10) at fs/read_write.c:603 13 0xffffffff811defda in ksys_write (fd=<optimized out>, buf=0xa758d0 ":iiiii:E::ii::i:CF ", count=19) at fs/read_write.c:658 14 0xffffffff81b49813 in do_syscall_64 (nr=<optimized out>, regs=0xffffc9000022ff58) at arch/x86/entry/common.c:46 15 0xffffffff81c0007c in entry_SYSCALL_64 () at arch/x86/entry/entry_64.S:120 To solve the issue, the open_exec call is moved to before the write lock is taken by bm_register_write Link: https://lkml.kernel.org/r/20210228224414.95962-1-liorribak@gmail.com Fixes: 948b701a607f1 ("binfmt_misc: add persistent opened binary handler for containers") Signed-off-by: Lior Ribak <liorribak@gmail.com> Acked-by: Helge Deller <deller@gmx.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-03-13mm: use is_cow_mapping() across tree where properPeter Xu
After is_cow_mapping() is exported in mm.h, replace some manual checks elsewhere throughout the tree but start to use the new helper. Link: https://lkml.kernel.org/r/20210217233547.93892-5-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Jason Gunthorpe <jgg@ziepe.ca> Cc: VMware Graphics <linux-graphics-maintainer@vmware.com> Cc: Roland Scheidegger <sroland@vmware.com> Cc: David Airlie <airlied@linux.ie> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: David Gibson <david@gibson.dropbear.id.au> Cc: Gal Pressman <galpress@amazon.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Kirill Shutemov <kirill@shutemov.name> Cc: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Wei Zhang <wzam@amazon.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-03-12io_uring: allow IO worker threads to be frozenJens Axboe
With the freezer using the proper signaling to notify us of when it's time to freeze a thread, we can re-enable normal freezer usage for the IO threads. Ensure that SQPOLL, io-wq, and the io-wq manager call try_to_freeze() appropriately, and remove the default setting of PF_NOFREEZE from create_io_thread(). Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-12Merge tag 'nfs-for-5.12-2' of git://git.linux-nfs.org/projects/anna/linux-nfsLinus Torvalds
Pull NFS client bugfixes from Anna Schumaker: "These are mostly fixes for issues discovered at the recent NFS bakeathon: - Fix PNFS_FLEXFILE_LAYOUT kconfig so it is possible to build into the kernel - Correct size calculationn for create reply length - Set memalloc_nofs_save() for sync tasks to prevent deadlocks - Don't revalidate directory permissions on lookup failure - Don't clear inode cache when lookup fails - Change functions to use nfs_set_cache_invalid() for proper delegation handling - Fix return value of _nfs4_get_security_label() - Return an error when attempting to remove system.nfs4_acl" * tag 'nfs-for-5.12-2' of git://git.linux-nfs.org/projects/anna/linux-nfs: nfs: we don't support removing system.nfs4_acl NFSv4.2: fix return value of _nfs4_get_security_label() NFS: Fix open coded versions of nfs_set_cache_invalid() in NFSv4 NFS: Fix open coded versions of nfs_set_cache_invalid() NFS: Clean up function nfs_mark_dir_for_revalidate() NFS: Don't gratuitously clear the inode cache when lookup failed NFS: Don't revalidate the directory permissions on a lookup failure SUNRPC: Set memalloc_nofs_save() for sync tasks NFS: Correct size calculation for create reply length nfs: fix PNFS_FLEXFILE_LAYOUT Kconfig default
2021-03-12Merge tag 'block-5.12-2021-03-12-v2' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block fixes from Jens Axboe: "Mostly just random fixes all over the map. The only odd-one-out change is finally getting the rename of BIO_MAX_PAGES to BIO_MAX_VECS done. This should've been done with the multipage bvec change, but it's been left. Do it now to avoid hassles around changes piling up for the next merge window. Summary: - NVMe pull request: - one more quirk (Dmitry Monakhov) - fix max_zone_append_sectors initialization (Chaitanya Kulkarni) - nvme-fc reset/create race fix (James Smart) - fix status code on aborts/resets (Hannes Reinecke) - fix the CSS check for ZNS namespaces (Chaitanya Kulkarni) - fix a use after free in a debug printk in nvme-rdma (Lv Yunlong) - Follow-up NVMe error fix for NULL 'id' (Christoph) - Fixup for the bd_size_lock being IRQ safe, now that the offending driver has been dropped (Damien). - rsxx probe failure error return (Jia-Ju) - umem probe failure error return (Wei) - s390/dasd unbind fixes (Stefan) - blk-cgroup stats summing fix (Xunlei) - zone reset handling fix (Damien) - Rename BIO_MAX_PAGES to BIO_MAX_VECS (Christoph) - Suppress uevent trigger for hidden devices (Daniel) - Fix handling of discard on busy device (Jan) - Fix stale cache issue with zone reset (Shin'ichiro)" * tag 'block-5.12-2021-03-12-v2' of git://git.kernel.dk/linux-block: nvme: fix the nsid value to print in nvme_validate_or_alloc_ns block: Discard page cache of zone reset target range block: Suppress uevent for hidden device when removed block: rename BIO_MAX_PAGES to BIO_MAX_VECS nvme-pci: add the DISABLE_WRITE_ZEROES quirk for a Samsung PM1725a nvme-rdma: Fix a use after free in nvmet_rdma_write_data_done nvme-core: check ctrl css before setting up zns nvme-fc: fix racing controller reset and create association nvme-fc: return NVME_SC_HOST_ABORTED_CMD when a command has been aborted nvme-fc: set NVME_REQ_CANCELLED in nvme_fc_terminate_exchange() nvme: add NVME_REQ_CANCELLED flag in nvme_cancel_request() nvme: simplify error logic in nvme_validate_ns() nvme: set max_zone_append_sectors nvme_revalidate_zones block: rsxx: fix error return code of rsxx_pci_probe() block: Fix REQ_OP_ZONE_RESET_ALL handling umem: fix error return code in mm_pci_probe() blk-cgroup: Fix the recursive blkg rwstat s390/dasd: fix hanging IO request during DASD driver unbind s390/dasd: fix hanging DASD driver unbind block: Try to handle busy underlying device on discard
2021-03-12Merge tag 'io_uring-5.12-2021-03-12' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull io_uring fixes from Jens Axboe: "Not quite as small this week as I had hoped, but at least this should be the end of it. All the little known issues have been ironed out - most of it little stuff, but cancelations being the bigger part. Only minor tweaks and/or regular fixes expected beyond this point. - Fix the creds tracking for async (io-wq and SQPOLL) - Various SQPOLL fixes related to parking, sharing, forking, IOPOLL, completions, and life times. Much simpler now. - Make IO threads unfreezable by default, on account of a bug report that had them spinning on resume. Honestly not quite sure why thawing leaves us with a perpetual signal pending (causing the spin), but for now make them unfreezable like there were in 5.11 and prior. - Move personality_idr to xarray, solving a use-after-free related to removing an entry from the iterator callback. Buffer idr needs the same treatment. - Re-org around and task vs context tracking, enabling the fixing of cancelations, and then cancelation fixes on top. - Various little bits of cleanups and hardening, and removal of now dead parts" * tag 'io_uring-5.12-2021-03-12' of git://git.kernel.dk/linux-block: (34 commits) io_uring: fix OP_ASYNC_CANCEL across tasks io_uring: cancel sqpoll via task_work io_uring: prevent racy sqd->thread checks io_uring: remove useless ->startup completion io_uring: cancel deferred requests in try_cancel io_uring: perform IOPOLL reaping if canceler is thread itself io_uring: force creation of separate context for ATTACH_WQ and non-threads io_uring: remove indirect ctx into sqo injection io_uring: fix invalid ctx->sq_thread_idle kernel: make IO threads unfreezable by default io_uring: always wait for sqd exited when stopping SQPOLL thread io_uring: remove unneeded variable 'ret' io_uring: move all io_kiocb init early in io_init_req() io-wq: fix ref leak for req in case of exit cancelations io_uring: fix complete_post races for linked req io_uring: add io_disarm_next() helper io_uring: fix io_sq_offload_create error handling io-wq: remove unused 'user' member of io_wq io_uring: Convert personality_idr to XArray io_uring: clean R_DISABLED startup mess ...
2021-03-12Merge tag 'configfs-for-5.12' of git://git.infradead.org/users/hch/configfsLinus Torvalds
Pull configfs fix from Christoph Hellwig: - fix a use-after-free in __configfs_open_file (Daiyue Zhang) * tag 'configfs-for-5.12' of git://git.infradead.org/users/hch/configfs: configfs: fix a use-after-free in __configfs_open_file
2021-03-12Merge tag 'gfs2-v5.12-rc2-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2 Pull gfs2 fixes from Andreas Gruenbacher: "Various gfs2 fixes" * tag 'gfs2-v5.12-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2: gfs2: bypass log flush if the journal is not live gfs2: bypass signal_our_withdraw if no journal gfs2: fix use-after-free in trans_drain gfs2: make function gfs2_make_fs_ro() to void type
2021-03-12io_uring: fix OP_ASYNC_CANCEL across tasksPavel Begunkov
IORING_OP_ASYNC_CANCEL tries io-wq cancellation only for current task. If it fails go over tctx_list and try it out for every single tctx. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-12io_uring: cancel sqpoll via task_workPavel Begunkov
1) The first problem is io_uring_cancel_sqpoll() -> io_uring_cancel_task_requests() basically doing park(); park(); and so hanging. 2) Another one is more subtle, when the master task is doing cancellations, but SQPOLL task submits in-between the end of the cancellation but before finish() requests taking a ref to the ctx, and so eternally locking it up. 3) Yet another is a dying SQPOLL task doing io_uring_cancel_sqpoll() and same io_uring_cancel_sqpoll() from the owner task, they race for tctx->wait events. And there probably more of them. Instead do SQPOLL cancellations from within SQPOLL task context via task_work, see io_sqpoll_cancel_sync(). With that we don't need temporal park()/unpark() during cancellation, which is ugly, subtle and anyway doesn't allow to do io_run_task_work() properly. io_uring_cancel_sqpoll() is called only from SQPOLL task context and under sqd locking, so all parking is removed from there. And so, io_sq_thread_[un]park() and io_sq_thread_stop() are not used now by SQPOLL task, and that spare us from some headache. Also remove ctx->sqd_list early to avoid 2). And kill tctx->sqpoll, which is not used anymore. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-12io_uring: prevent racy sqd->thread checksPavel Begunkov
SQPOLL thread to which we're trying to attach may be going away, it's not nice but a more serious problem is if io_sq_offload_create() sees sqd->thread==NULL, and tries to init it with a new thread. There are tons of ways it can be exploited or fail. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-12gfs2: bypass log flush if the journal is not liveBob Peterson
Patch fe3e397668775 ("gfs2: Rework the log space allocation logic") changed gfs2_log_flush to reserve a set of journal blocks in case no transaction is active. However, gfs2_log_flush also gets called in cases where we don't have an active journal, for example, for spectator mounts. In that case, trying to reserve blocks would sleep forever, but we want gfs2_log_flush to be a no-op instead. Fixes: fe3e397668775 ("gfs2: Rework the log space allocation logic") Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2021-03-12io_uring: remove useless ->startup completionPavel Begunkov
We always do complete(&sqd->startup) almost right after sqd->thread creation, either in the success path or in io_sq_thread_finish(). It's specifically created not started for us to be able to set some stuff like sqd->thread and io_uring_alloc_task_context() before following right after wake_up_new_task(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-12io_uring: cancel deferred requests in try_cancelPavel Begunkov
As io_uring_cancel_files() and others let SQO to run between io_uring_try_cancel_requests(), SQO may generate new deferred requests, so it's safer to try to cancel them in it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-12gfs2: bypass signal_our_withdraw if no journalBob Peterson
Before this patch, function signal_our_withdraw referenced the journal inode immediately. But corrupt file systems may have some invalid journals, in which case our attempt to read it in will withdraw and the resulting signal_our_withdraw would dereference the NULL value. This patch adds a check to signal_our_withdraw so that if the journal has not yet been initialized, it simply returns and does the old-style withdraw. Thanks, Andy Price, for his analysis. Reported-by: syzbot+50a8a9cf8127f2c6f5df@syzkaller.appspotmail.com Fixes: 601ef0d52e96 ("gfs2: Force withdraw to replay journals and wait for it to finish") Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2021-03-11nfs: we don't support removing system.nfs4_aclJ. Bruce Fields
The NFSv4 protocol doesn't have any notion of reomoving an attribute, so removexattr(path,"system.nfs4_acl") doesn't make sense. There's no documented return value. Arguably it could be EOPNOTSUPP but I'm a little worried an application might take that to mean that we don't support ACLs or xattrs. How about EINVAL? Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2021-03-11io_uring: perform IOPOLL reaping if canceler is thread itselfJens Axboe
We bypass IOPOLL completion polling (and reaping) for the SQPOLL thread, but if it's the thread itself invoking cancelations, then we still need to perform it or no one will. Fixes: 9936c7c2bc76 ("io_uring: deduplicate core cancellations sequence") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-11io_uring: force creation of separate context for ATTACH_WQ and non-threadsJens Axboe
Earlier kernels had SQPOLL threads that could share across anything, as we grabbed the context we needed on a per-ring basis. This is no longer the case, so only allow attaching directly if we're in the same thread group. That is the common use case. For non-group tasks, just setup a new context and thread as we would've done if sharing wasn't set. This isn't 100% ideal in terms of CPU utilization for the forked and share case, but hopefully that isn't much of a concern. If it is, there are plans in motion for how to improve that. Most importantly, we want to avoid app side regressions where sharing worked before and now doesn't. With this patch, functionality is equivalent to previous kernels that supported IORING_SETUP_ATTACH_WQ with SQPOLL. Reported-by: Stefan Metzmacher <metze@samba.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-11NFSD: fix error handling in NFSv4.0 callbacksOlga Kornievskaia
When the server tries to do a callback and a client fails it due to authentication problems, we need the server to set callback down flag in RENEW so that client can recover. Suggested-by: Bruce Fields <bfields@redhat.com> Signed-off-by: Olga Kornievskaia <kolga@netapp.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Benjamin Coddington <bcodding@redhat.com> Link: https://lore.kernel.org/linux-nfs/FB84E90A-1A03-48B3-8BF7-D9D10AC2C9FE@oracle.com/T/#t
2021-03-11ext4: fix error handling in ext4_end_enable_verity()Eric Biggers
ext4 didn't properly clean up if verity failed to be enabled on a file: - It left verity metadata (pages past EOF) in the page cache, which would be exposed to userspace if the file was later extended. - It didn't truncate the verity metadata at all (either from cache or from disk) if an error occurred while setting the verity bit. Fix these bugs by adding a call to truncate_inode_pages() and ensuring that we truncate the verity metadata (both from cache and from disk) in all error paths. Also rework the code to cleanly separate the success path from the error paths, which makes it much easier to understand. Reported-by: Yunlei He <heyunlei@hihonor.com> Fixes: c93d8f885809 ("ext4: add basic fs-verity support") Cc: stable@vger.kernel.org # v5.4+ Signed-off-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20210302200420.137977-2-ebiggers@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-03-11block: rename BIO_MAX_PAGES to BIO_MAX_VECSChristoph Hellwig
Ever since the addition of multipage bio_vecs BIO_MAX_PAGES has been horribly confusingly misnamed. Rename it to BIO_MAX_VECS to stop confusing users of the bio API. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20210311110137.1132391-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-11configfs: fix a use-after-free in __configfs_open_fileDaiyue Zhang
Commit b0841eefd969 ("configfs: provide exclusion between IO and removals") uses ->frag_dead to mark the fragment state, thus no bothering with extra refcount on config_item when opening a file. The configfs_get_config_item was removed in __configfs_open_file, but not with config_item_put. So the refcount on config_item will lost its balance, causing use-after-free issues in some occasions like this: Test: 1. Mount configfs on /config with read-only items: drwxrwx--- 289 root root 0 2021-04-01 11:55 /config drwxr-xr-x 2 root root 0 2021-04-01 11:54 /config/a --w--w--w- 1 root root 4096 2021-04-01 11:53 /config/a/1.txt ...... 2. Then run: for file in /config do echo $file grep -R 'key' $file done 3. __configfs_open_file will be called in parallel, the first one got called will do: if (file->f_mode & FMODE_READ) { if (!(inode->i_mode & S_IRUGO)) goto out_put_module; config_item_put(buffer->item); kref_put() package_details_release() kfree() the other one will run into use-after-free issues like this: BUG: KASAN: use-after-free in __configfs_open_file+0x1bc/0x3b0 Read of size 8 at addr fffffff155f02480 by task grep/13096 CPU: 0 PID: 13096 Comm: grep VIP: 00 Tainted: G W 4.14.116-kasan #1 TGID: 13096 Comm: grep Call trace: dump_stack+0x118/0x160 kasan_report+0x22c/0x294 __asan_load8+0x80/0x88 __configfs_open_file+0x1bc/0x3b0 configfs_open_file+0x28/0x34 do_dentry_open+0x2cc/0x5c0 vfs_open+0x80/0xe0 path_openat+0xd8c/0x2988 do_filp_open+0x1c4/0x2fc do_sys_open+0x23c/0x404 SyS_openat+0x38/0x48 Allocated by task 2138: kasan_kmalloc+0xe0/0x1ac kmem_cache_alloc_trace+0x334/0x394 packages_make_item+0x4c/0x180 configfs_mkdir+0x358/0x740 vfs_mkdir2+0x1bc/0x2e8 SyS_mkdirat+0x154/0x23c el0_svc_naked+0x34/0x38 Freed by task 13096: kasan_slab_free+0xb8/0x194 kfree+0x13c/0x910 package_details_release+0x524/0x56c kref_put+0xc4/0x104 config_item_put+0x24/0x34 __configfs_open_file+0x35c/0x3b0 configfs_open_file+0x28/0x34 do_dentry_open+0x2cc/0x5c0 vfs_open+0x80/0xe0 path_openat+0xd8c/0x2988 do_filp_open+0x1c4/0x2fc do_sys_open+0x23c/0x404 SyS_openat+0x38/0x48 el0_svc_naked+0x34/0x38 To fix this issue, remove the config_item_put in __configfs_open_file to balance the refcount of config_item. Fixes: b0841eefd969 ("configfs: provide exclusion between IO and removals") Signed-off-by: Daiyue Zhang <zhangdaiyue1@huawei.com> Signed-off-by: Yi Chen <chenyi77@huawei.com> Signed-off-by: Ge Qiu <qiuge@huawei.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-03-10Merge tag 's390-5.12-3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux Pull s390 fixes from Heiko Carstens: - fix various user space visible copy_to_user() instances which return the number of bytes left to copy instead of -EFAULT - make TMPFS_INODE64 available again for s390 and alpha, now that both architectures have been switched to 64-bit ino_t (see commit 96c0a6a72d18: "s390,alpha: switch to 64-bit ino_t") - make sure to release a shared hypervisor resource within the zcore device driver also on restart and power down; also remove unneeded surrounding debugfs_create return value checks - for the new hardware counter set device driver rename the uapi header file to be a bit more generic; also remove 60 second read limit which is not really necessary and without the limit the interface can be easier tested - some small cleanups, the largest being to convert all long long in our time and idle code to longs - update defconfigs * tag 's390-5.12-3' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: s390: remove IBM_PARTITION and CONFIGFS_FS from zfcpdump defconfig s390: update defconfigs s390,alpha: make TMPFS_INODE64 available again s390/cio: return -EFAULT if copy_to_user() fails s390/tty3270: avoid comma separated statements s390/cpumf: remove unneeded semicolon s390/crypto: return -EFAULT if copy_to_user() fails s390/cio: return -EFAULT if copy_to_user() fails s390/cpumf: rename header file to hwctrset.h s390/zcore: release dump save area on restart or power down s390/zcore: no need to check return value of debugfs_create functions s390/cpumf: remove 60 seconds read limit s390/topology: remove always false if check s390/time,idle: get rid of unsigned long long
2021-03-10Merge tag 'for-linus-2021-03-10' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux Pull detached mounts fix from Christian Brauner: "Creating a series of detached mounts, attaching them to the filesystem, and unmounting them can be used to trigger an integer overflow in ns->mounts causing the kernel to block any new mounts in count_mounts() and returning ENOSPC because it falsely assumes that the maximum number of mounts in the mount namespace has been reached, i.e. it thinks it can't fit the new mounts into the mount namespace anymore. Without this fix heavy use of the new mount API with move_mount() will cause the host to become unuseable and thus blocks some xfstest patches I want to resend. Depending on the number of mounts in your system, this can be reproduced on any kernel that supportes open_tree() and move_mount(). A reproducer has been sent for inclusion with xfstests. It takes care to do this in another mount namespace, not in the host's mount namespace so there shouldn't be any risk in running it but if one did run it on the host it would require a reboot in order to be able to mount again. See https://lore.kernel.org/fstests/20210309121041.753359-1-christian.brauner@ubuntu.com The root cause of this is that detached mounts aren't handled correctly when source and target mount are identical and reside on a shared mount causing a broken mount tree where the detached source itself is propagated which propagation prevents for regular bind-mounts and new mounts. This ultimately leads to a miscalculation of the number of mounts in the mount namespace. Detached mounts created via 'open_tree(fd, path, OPEN_TREE_CLONE)' are essentially like an unattached bind-mount. They can then later on be attached to the filesystem via move_mount() which calls into attach_recursive_mount(). Part of attaching it to the filesystem is making sure that mounts get correctly propagated in case the destination mountpoint is MS_SHARED, i.e. is a shared mountpoint. This is done by calling into propagate_mnt() which walks the list of peers calling propagate_one() on each mount in this list making sure it receives the propagation event. The propagate_one() function thereby skips both new mounts and bind mounts to not propagate them "into themselves". Both are identified by checking whether the mount is already attached to any mount namespace in mnt->mnt_ns. The is what the IS_MNT_NEW() helper is responsible for. However, detached mounts have an anonymous mount namespace attached to them stashed in mnt->mnt_ns which means that IS_MNT_NEW() doesn't realize they need to be skipped causing the mount to propagate "into itself" breaking the mount table and causing a disconnect between the number of mounts recorded as being beneath or reachable from the target mountpoint and the number of mounts actually recorded/counted in ns->mounts ultimately causing an overflow which in turn prevents any new mounts via the ENOSPC issue. So teach propagation to handle detached mounts by making it aware of them. I've been tracking this issue down for the last couple of days and then verifying that the fix is correct by unmounting everything in my current mount table leaving only /proc and /sys mounted and running the reproducer above overnight verifying the number of mounts counted in ns->mounts. With this fix the counts are correct and the ENOSPC issue can't be reproduced. This change will only have an effect on mounts created with the new mount API since detached mounts cannot be created with the old mount API so regressions are extremely unlikely. Here's an illustration: #### mount(): ubuntu@f1-vm:~$ sudo mount --bind /mnt/ /mnt/ ubuntu@f1-vm:~$ findmnt | grep -i mnt ├─/mnt /dev/sda2[/mnt] ext4 rw,relatime #### open_tree(OPEN_TREE_CLONE) + move_mount() with bug: ubuntu@f1-vm:~$ sudo ./mount-new /mnt/ /mnt/ ubuntu@f1-vm:~$ findmnt | grep -i mnt ├─/mnt /dev/sda2[/mnt] ext4 rw,relatime │ └─/mnt /dev/sda2[/mnt] ext4 rw,relatime #### open_tree(OPEN_TREE_CLONE) + move_mount() with the fix: ubuntu@f1-vm:~$ sudo ./mount-new /mnt /mnt ubuntu@f1-vm:~$ findmnt | grep -i mnt └─/mnt /dev/sda2[/mnt] ext4 rw,relatime" * tag 'for-linus-2021-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux: mount: fix mounting of detached mounts onto targets that reside on shared mounts
2021-03-10io_uring: remove indirect ctx into sqo injectionPavel Begunkov
We use ->ctx_new_list to notify sqo about new ctx pending, then sqo should stop and splice it to its sqd->ctx_list, paired with ->sq_thread_comp. The last one is broken because nobody reinitialises it, and trying to fix it would only add more complexity and bugs. And the first isn't really needed as is done under park(), that protects from races well. Add ctx into sqd->ctx_list directly (under park()), it's much simpler and allows to kill both, ctx_new_list and sq_thread_comp. note: apparently there is no real problem at the moment, because sq_thread_comp is used only by io_sq_thread_finish() followed by parking, where list_del(&ctx->sqd_list) removes it well regardless whether it's in the new or the active list. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-10io_uring: fix invalid ctx->sq_thread_idlePavel Begunkov
We have to set ctx->sq_thread_idle before adding a ring to an SQ task, otherwise sqd races for seeing zero and accounting it as such. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-10kernel: make IO threads unfreezable by defaultJens Axboe
The io-wq threads were already marked as no-freeze, but the manager was not. On resume, we perpetually have signal_pending() being true, and hence the manager will loop and spin 100% of the time. Just mark the tasks created by create_io_thread() as PF_NOFREEZE by default, and remove any knowledge of it in io-wq and io_uring. Reported-by: Kevin Locke <kevin@kevinlocke.name> Tested-by: Kevin Locke <kevin@kevinlocke.name> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-10io_uring: always wait for sqd exited when stopping SQPOLL threadJens Axboe
We have a tiny race where io_put_sq_data() calls io_sq_thead_stop() and finds the thread gone, but the thread has indeed not fully exited or called complete() yet. Close it up by always having io_sq_thread_stop() wait on completion of the exit event. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-10io_uring: remove unneeded variable 'ret'Yang Li
Fix the following coccicheck warning: ./fs/io_uring.c:8984:5-8: Unneeded variable: "ret". Return "0" on line 8998 Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Link: https://lore.kernel.org/r/1615271441-33649-1-git-send-email-yang.lee@linux.alibaba.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-10io_uring: move all io_kiocb init early in io_init_req()Jens Axboe
If we hit an error path in the function, make sure that the io_kiocb is fully initialized at that point so that freeing the request always sees a valid state. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-10io-wq: fix ref leak for req in case of exit cancelationsyangerkun
do_work such as io_wq_submit_work that cancel the work may leave a ref of req as 1 if we have links. Fix it by call io_run_cancel. Fixes: 4fb6ac326204 ("io-wq: improve manager/worker handling over exec") Signed-off-by: yangerkun <yangerkun@huawei.com> Link: https://lore.kernel.org/r/20210309030410.3294078-1-yangerkun@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-10io_uring: fix complete_post races for linked reqPavel Begunkov
Calling io_queue_next() after spin_unlock in io_req_complete_post() races with the other side extracting and reusing this request. Hand coded parts of io_req_find_next() considering that io_disarm_next() and io_req_task_queue() have (and safe) to be called with completion_lock held. It already does io_commit_cqring() and io_cqring_ev_posted(), so just reuse it for post io_disarm_next(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/5672a62f3150ee7c55849f40c0037655c4f2840f.1615250156.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-10io_uring: add io_disarm_next() helperPavel Begunkov
A preparation patch placing all preparations before extracting a next request into a separate helper io_disarm_next(). Also, don't spuriously do ev_posted in a rare case where REQ_F_FAIL_LINK is set but there are no requests linked (i.e. after cancelling a linked timeout or setting IOSQE_IO_LINK on a last request of a submission batch). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/44ecff68d6b47e1c4e6b891bdde1ddc08cfc3590.1615250156.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>