summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2021-03-11configfs: fix a use-after-free in __configfs_open_fileDaiyue Zhang
Commit b0841eefd969 ("configfs: provide exclusion between IO and removals") uses ->frag_dead to mark the fragment state, thus no bothering with extra refcount on config_item when opening a file. The configfs_get_config_item was removed in __configfs_open_file, but not with config_item_put. So the refcount on config_item will lost its balance, causing use-after-free issues in some occasions like this: Test: 1. Mount configfs on /config with read-only items: drwxrwx--- 289 root root 0 2021-04-01 11:55 /config drwxr-xr-x 2 root root 0 2021-04-01 11:54 /config/a --w--w--w- 1 root root 4096 2021-04-01 11:53 /config/a/1.txt ...... 2. Then run: for file in /config do echo $file grep -R 'key' $file done 3. __configfs_open_file will be called in parallel, the first one got called will do: if (file->f_mode & FMODE_READ) { if (!(inode->i_mode & S_IRUGO)) goto out_put_module; config_item_put(buffer->item); kref_put() package_details_release() kfree() the other one will run into use-after-free issues like this: BUG: KASAN: use-after-free in __configfs_open_file+0x1bc/0x3b0 Read of size 8 at addr fffffff155f02480 by task grep/13096 CPU: 0 PID: 13096 Comm: grep VIP: 00 Tainted: G W 4.14.116-kasan #1 TGID: 13096 Comm: grep Call trace: dump_stack+0x118/0x160 kasan_report+0x22c/0x294 __asan_load8+0x80/0x88 __configfs_open_file+0x1bc/0x3b0 configfs_open_file+0x28/0x34 do_dentry_open+0x2cc/0x5c0 vfs_open+0x80/0xe0 path_openat+0xd8c/0x2988 do_filp_open+0x1c4/0x2fc do_sys_open+0x23c/0x404 SyS_openat+0x38/0x48 Allocated by task 2138: kasan_kmalloc+0xe0/0x1ac kmem_cache_alloc_trace+0x334/0x394 packages_make_item+0x4c/0x180 configfs_mkdir+0x358/0x740 vfs_mkdir2+0x1bc/0x2e8 SyS_mkdirat+0x154/0x23c el0_svc_naked+0x34/0x38 Freed by task 13096: kasan_slab_free+0xb8/0x194 kfree+0x13c/0x910 package_details_release+0x524/0x56c kref_put+0xc4/0x104 config_item_put+0x24/0x34 __configfs_open_file+0x35c/0x3b0 configfs_open_file+0x28/0x34 do_dentry_open+0x2cc/0x5c0 vfs_open+0x80/0xe0 path_openat+0xd8c/0x2988 do_filp_open+0x1c4/0x2fc do_sys_open+0x23c/0x404 SyS_openat+0x38/0x48 el0_svc_naked+0x34/0x38 To fix this issue, remove the config_item_put in __configfs_open_file to balance the refcount of config_item. Fixes: b0841eefd969 ("configfs: provide exclusion between IO and removals") Signed-off-by: Daiyue Zhang <zhangdaiyue1@huawei.com> Signed-off-by: Yi Chen <chenyi77@huawei.com> Signed-off-by: Ge Qiu <qiuge@huawei.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-03-10Merge tag 's390-5.12-3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux Pull s390 fixes from Heiko Carstens: - fix various user space visible copy_to_user() instances which return the number of bytes left to copy instead of -EFAULT - make TMPFS_INODE64 available again for s390 and alpha, now that both architectures have been switched to 64-bit ino_t (see commit 96c0a6a72d18: "s390,alpha: switch to 64-bit ino_t") - make sure to release a shared hypervisor resource within the zcore device driver also on restart and power down; also remove unneeded surrounding debugfs_create return value checks - for the new hardware counter set device driver rename the uapi header file to be a bit more generic; also remove 60 second read limit which is not really necessary and without the limit the interface can be easier tested - some small cleanups, the largest being to convert all long long in our time and idle code to longs - update defconfigs * tag 's390-5.12-3' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: s390: remove IBM_PARTITION and CONFIGFS_FS from zfcpdump defconfig s390: update defconfigs s390,alpha: make TMPFS_INODE64 available again s390/cio: return -EFAULT if copy_to_user() fails s390/tty3270: avoid comma separated statements s390/cpumf: remove unneeded semicolon s390/crypto: return -EFAULT if copy_to_user() fails s390/cio: return -EFAULT if copy_to_user() fails s390/cpumf: rename header file to hwctrset.h s390/zcore: release dump save area on restart or power down s390/zcore: no need to check return value of debugfs_create functions s390/cpumf: remove 60 seconds read limit s390/topology: remove always false if check s390/time,idle: get rid of unsigned long long
2021-03-10Merge tag 'for-linus-2021-03-10' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux Pull detached mounts fix from Christian Brauner: "Creating a series of detached mounts, attaching them to the filesystem, and unmounting them can be used to trigger an integer overflow in ns->mounts causing the kernel to block any new mounts in count_mounts() and returning ENOSPC because it falsely assumes that the maximum number of mounts in the mount namespace has been reached, i.e. it thinks it can't fit the new mounts into the mount namespace anymore. Without this fix heavy use of the new mount API with move_mount() will cause the host to become unuseable and thus blocks some xfstest patches I want to resend. Depending on the number of mounts in your system, this can be reproduced on any kernel that supportes open_tree() and move_mount(). A reproducer has been sent for inclusion with xfstests. It takes care to do this in another mount namespace, not in the host's mount namespace so there shouldn't be any risk in running it but if one did run it on the host it would require a reboot in order to be able to mount again. See https://lore.kernel.org/fstests/20210309121041.753359-1-christian.brauner@ubuntu.com The root cause of this is that detached mounts aren't handled correctly when source and target mount are identical and reside on a shared mount causing a broken mount tree where the detached source itself is propagated which propagation prevents for regular bind-mounts and new mounts. This ultimately leads to a miscalculation of the number of mounts in the mount namespace. Detached mounts created via 'open_tree(fd, path, OPEN_TREE_CLONE)' are essentially like an unattached bind-mount. They can then later on be attached to the filesystem via move_mount() which calls into attach_recursive_mount(). Part of attaching it to the filesystem is making sure that mounts get correctly propagated in case the destination mountpoint is MS_SHARED, i.e. is a shared mountpoint. This is done by calling into propagate_mnt() which walks the list of peers calling propagate_one() on each mount in this list making sure it receives the propagation event. The propagate_one() function thereby skips both new mounts and bind mounts to not propagate them "into themselves". Both are identified by checking whether the mount is already attached to any mount namespace in mnt->mnt_ns. The is what the IS_MNT_NEW() helper is responsible for. However, detached mounts have an anonymous mount namespace attached to them stashed in mnt->mnt_ns which means that IS_MNT_NEW() doesn't realize they need to be skipped causing the mount to propagate "into itself" breaking the mount table and causing a disconnect between the number of mounts recorded as being beneath or reachable from the target mountpoint and the number of mounts actually recorded/counted in ns->mounts ultimately causing an overflow which in turn prevents any new mounts via the ENOSPC issue. So teach propagation to handle detached mounts by making it aware of them. I've been tracking this issue down for the last couple of days and then verifying that the fix is correct by unmounting everything in my current mount table leaving only /proc and /sys mounted and running the reproducer above overnight verifying the number of mounts counted in ns->mounts. With this fix the counts are correct and the ENOSPC issue can't be reproduced. This change will only have an effect on mounts created with the new mount API since detached mounts cannot be created with the old mount API so regressions are extremely unlikely. Here's an illustration: #### mount(): ubuntu@f1-vm:~$ sudo mount --bind /mnt/ /mnt/ ubuntu@f1-vm:~$ findmnt | grep -i mnt ├─/mnt /dev/sda2[/mnt] ext4 rw,relatime #### open_tree(OPEN_TREE_CLONE) + move_mount() with bug: ubuntu@f1-vm:~$ sudo ./mount-new /mnt/ /mnt/ ubuntu@f1-vm:~$ findmnt | grep -i mnt ├─/mnt /dev/sda2[/mnt] ext4 rw,relatime │ └─/mnt /dev/sda2[/mnt] ext4 rw,relatime #### open_tree(OPEN_TREE_CLONE) + move_mount() with the fix: ubuntu@f1-vm:~$ sudo ./mount-new /mnt /mnt ubuntu@f1-vm:~$ findmnt | grep -i mnt └─/mnt /dev/sda2[/mnt] ext4 rw,relatime" * tag 'for-linus-2021-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux: mount: fix mounting of detached mounts onto targets that reside on shared mounts
2021-03-10io_uring: remove indirect ctx into sqo injectionPavel Begunkov
We use ->ctx_new_list to notify sqo about new ctx pending, then sqo should stop and splice it to its sqd->ctx_list, paired with ->sq_thread_comp. The last one is broken because nobody reinitialises it, and trying to fix it would only add more complexity and bugs. And the first isn't really needed as is done under park(), that protects from races well. Add ctx into sqd->ctx_list directly (under park()), it's much simpler and allows to kill both, ctx_new_list and sq_thread_comp. note: apparently there is no real problem at the moment, because sq_thread_comp is used only by io_sq_thread_finish() followed by parking, where list_del(&ctx->sqd_list) removes it well regardless whether it's in the new or the active list. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-10io_uring: fix invalid ctx->sq_thread_idlePavel Begunkov
We have to set ctx->sq_thread_idle before adding a ring to an SQ task, otherwise sqd races for seeing zero and accounting it as such. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-10kernel: make IO threads unfreezable by defaultJens Axboe
The io-wq threads were already marked as no-freeze, but the manager was not. On resume, we perpetually have signal_pending() being true, and hence the manager will loop and spin 100% of the time. Just mark the tasks created by create_io_thread() as PF_NOFREEZE by default, and remove any knowledge of it in io-wq and io_uring. Reported-by: Kevin Locke <kevin@kevinlocke.name> Tested-by: Kevin Locke <kevin@kevinlocke.name> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-10io_uring: always wait for sqd exited when stopping SQPOLL threadJens Axboe
We have a tiny race where io_put_sq_data() calls io_sq_thead_stop() and finds the thread gone, but the thread has indeed not fully exited or called complete() yet. Close it up by always having io_sq_thread_stop() wait on completion of the exit event. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-10io_uring: remove unneeded variable 'ret'Yang Li
Fix the following coccicheck warning: ./fs/io_uring.c:8984:5-8: Unneeded variable: "ret". Return "0" on line 8998 Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Link: https://lore.kernel.org/r/1615271441-33649-1-git-send-email-yang.lee@linux.alibaba.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-10io_uring: move all io_kiocb init early in io_init_req()Jens Axboe
If we hit an error path in the function, make sure that the io_kiocb is fully initialized at that point so that freeing the request always sees a valid state. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-10io-wq: fix ref leak for req in case of exit cancelationsyangerkun
do_work such as io_wq_submit_work that cancel the work may leave a ref of req as 1 if we have links. Fix it by call io_run_cancel. Fixes: 4fb6ac326204 ("io-wq: improve manager/worker handling over exec") Signed-off-by: yangerkun <yangerkun@huawei.com> Link: https://lore.kernel.org/r/20210309030410.3294078-1-yangerkun@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-10io_uring: fix complete_post races for linked reqPavel Begunkov
Calling io_queue_next() after spin_unlock in io_req_complete_post() races with the other side extracting and reusing this request. Hand coded parts of io_req_find_next() considering that io_disarm_next() and io_req_task_queue() have (and safe) to be called with completion_lock held. It already does io_commit_cqring() and io_cqring_ev_posted(), so just reuse it for post io_disarm_next(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/5672a62f3150ee7c55849f40c0037655c4f2840f.1615250156.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-10io_uring: add io_disarm_next() helperPavel Begunkov
A preparation patch placing all preparations before extracting a next request into a separate helper io_disarm_next(). Also, don't spuriously do ev_posted in a rare case where REQ_F_FAIL_LINK is set but there are no requests linked (i.e. after cancelling a linked timeout or setting IOSQE_IO_LINK on a last request of a submission batch). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/44ecff68d6b47e1c4e6b891bdde1ddc08cfc3590.1615250156.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-10io_uring: fix io_sq_offload_create error handlingPavel Begunkov
Don't set IO_SQ_THREAD_SHOULD_STOP when io_sq_offload_create() has failed on io_uring_alloc_task_context() but leave everything to io_sq_thread_finish(), because currently io_sq_thread_finish() hangs on trying to park it. That's great it stalls there, because otherwise the following io_sq_thread_stop() would be skipped on IO_SQ_THREAD_SHOULD_STOP check and the sqo would race for sqd with freeing ctx. A simple error injection gives something like this. [ 245.463955] INFO: task sqpoll-test-hang:523 blocked for more than 122 seconds. [ 245.463983] Call Trace: [ 245.463990] __schedule+0x36b/0x950 [ 245.464005] schedule+0x68/0xe0 [ 245.464013] schedule_timeout+0x209/0x2a0 [ 245.464032] wait_for_completion+0x8b/0xf0 [ 245.464043] io_sq_thread_finish+0x44/0x1a0 [ 245.464049] io_uring_setup+0x9ea/0xc80 [ 245.464058] __x64_sys_io_uring_setup+0x16/0x20 [ 245.464064] do_syscall_64+0x38/0x50 [ 245.464073] entry_SYSCALL_64_after_hwframe+0x44/0xae Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-10io-wq: remove unused 'user' member of io_wqJens Axboe
Previous patches killed the last user of this, now it's just a dead member in the struct. Get rid of it. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-10io_uring: Convert personality_idr to XArrayMatthew Wilcox (Oracle)
You can't call idr_remove() from within a idr_for_each() callback, but you can call xa_erase() from an xa_for_each() loop, so switch the entire personality_idr from the IDR to the XArray. This manifests as a use-after-free as idr_for_each() attempts to walk the rest of the node after removing the last entry from it. Fixes: 071698e13ac6 ("io_uring: allow registering credentials") Cc: stable@vger.kernel.org # 5.6+ Reported-by: yangerkun <yangerkun@huawei.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> [Pavel: rebased (creds load was moved into io_init_req())] Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/7ccff36e1375f2b0ebf73d957f037b43becc0dde.1615212806.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-10io_uring: clean R_DISABLED startup messPavel Begunkov
There are enough of problems with IORING_SETUP_R_DISABLED, including the burden of checking and kicking off the SQO task all over the codebase -- for exit/cancel/etc. Rework it, always start the thread but don't do submit unless the flag is gone, that's much easier. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-10io_uring: fix unrelated ctx reqs cancellationPavel Begunkov
io-wq now is per-task, so cancellations now should match against request's ctx. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-10io_uring: SQPOLL parking fixesJens Axboe
We keep running into weird dependency issues between the sqd lock and the parking state. Disentangle the SQPOLL thread from the last bits of the kthread parking inheritance, and just replace the parking state, and two associated locks, with a single rw mutex. The SQPOLL thread keeps the mutex for read all the time, except if someone has marked us needing to park. Then we drop/re-acquire and try again. This greatly simplifies the parking state machine (by just getting rid of it), and makes it a lot more obvious how it works - if you need to modify the ctx list, then you simply park the thread which will grab the lock for writing. Fold in fix from Hillf Danton on not setting STOP on a fatal signal. Fixes: e54945ae947f ("io_uring: SQPOLL stop error handling fixes") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-09NFSv4.2: fix return value of _nfs4_get_security_label()Ondrej Mosnacek
An xattr 'get' handler is expected to return the length of the value on success, yet _nfs4_get_security_label() (and consequently also nfs4_xattr_get_nfs4_label(), which is used as an xattr handler) returns just 0 on success. Fix this by returning label.len instead, which contains the length of the result. Fixes: aa9c2669626c ("NFS: Client implementation of Labeled-NFS") Signed-off-by: Ondrej Mosnacek <omosnace@redhat.com> Reviewed-by: James Morris <jamorris@linux.microsoft.com> Reviewed-by: Paul Moore <paul@paul-moore.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2021-03-09NFSD: fix dest to src mount in inter-server COPYOlga Kornievskaia
A cleanup of the inter SSC copy needs to call fput() of the source file handle to make sure that file structure is freed as well as drop the reference on the superblock to unmount the source server. Fixes: 36e1e5ba90fb ("NFSD: Fix use-after-free warning when doing inter-server copy") Signed-off-by: Olga Kornievskaia <kolga@netapp.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Dai Ngo <dai.ngo@oracle.com>
2021-03-09xfs: fix quota accounting when a mount is idmappedDarrick J. Wong
Nowadays, we indirectly use the idmap-aware helper functions in the VFS to set the initial uid and gid of a file being created. Unfortunately, we didn't convert the quota code, which means we attach the wrong dquots to files created on an idmapped mount. Fixes: f736d93d76d3 ("xfs: support idmapped mounts") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-03-09iomap: Fix negative assignment to unsigned sis->pages in iomap_swapfile_activateRitesh Harjani
In case if isi.nr_pages is 0, we are making sis->pages (which is unsigned int) a huge value in iomap_swapfile_activate() by assigning -1. This could cause a kernel crash in kernel v4.18 (with below signature). Or could lead to unknown issues on latest kernel if the fake big swap gets used. Fix this issue by returning -EINVAL in case of nr_pages is 0, since it is anyway a invalid swapfile. Looks like this issue will be hit when we have pagesize < blocksize type of configuration. I was able to hit the issue in case of a tiny swap file with below test script. https://raw.githubusercontent.com/riteshharjani/LinuxStudy/master/scripts/swap-issue.sh kernel crash analysis on v4.18 ============================== On v4.18 kernel, it causes a kernel panic, since sis->pages becomes a huge value and isi.nr_extents is 0. When 0 is returned it is considered as a swapfile over NFS and SWP_FILE is set (sis->flags |= SWP_FILE). Then when swapoff was getting called it was calling a_ops->swap_deactivate() if (sis->flags & SWP_FILE) is true. Since a_ops->swap_deactivate() is NULL in case of XFS, it causes below panic. Panic signature on v4.18 kernel: ======================================= root@qemu:/home/qemu# [ 8291.723351] XFS (loop2): Unmounting Filesystem [ 8292.123104] XFS (loop2): Mounting V5 Filesystem [ 8292.132451] XFS (loop2): Ending clean mount [ 8292.263362] Adding 4294967232k swap on /mnt1/test/swapfile. Priority:-2 extents:1 across:274877906880k [ 8292.277834] Unable to handle kernel paging request for instruction fetch [ 8292.278677] Faulting instruction address: 0x00000000 cpu 0x19: Vector: 400 (Instruction Access) at [c0000009dd5b7ad0] pc: 0000000000000000 lr: c0000000003eb9dc: destroy_swap_extents+0xfc/0x120 sp: c0000009dd5b7d50 msr: 8000000040009033 current = 0xc0000009b6710080 paca = 0xc00000003ffcb280 irqmask: 0x03 irq_happened: 0x01 pid = 5604, comm = swapoff Linux version 4.18.0 (riteshh@xxxxxxx) (gcc version 8.4.0 (Ubuntu 8.4.0-1ubuntu1~18.04)) #57 SMP Wed Mar 3 01:33:04 CST 2021 enter ? for help [link register ] c0000000003eb9dc destroy_swap_extents+0xfc/0x120 [c0000009dd5b7d50] c0000000025a7058 proc_poll_event+0x0/0x4 (unreliable) [c0000009dd5b7da0] c0000000003f0498 sys_swapoff+0x3f8/0x910 [c0000009dd5b7e30] c00000000000bbe4 system_call+0x5c/0x70 Exception: c01 (System Call) at 00007ffff7d208d8 Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> [djwong: rework the comment to provide more details] Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2021-03-09Revert "nfsd4: a client's own opens needn't prevent delegations"J. Bruce Fields
This reverts commit 94415b06eb8aed13481646026dc995f04a3a534a. That commit claimed to allow a client to get a read delegation when it was the only writer. Actually it allowed a client to get a read delegation when *any* client has a write open! The main problem is that it's depending on nfs4_clnt_odstate structures that are actually only maintained for pnfs exports. This causes clients to miss writes performed by other clients, even when there have been intervening closes and opens, violating close-to-open cache consistency. We can do this a different way, but first we should just revert this. I've added pynfs 4.1 test DELEG19 to test for this, as I should have done originally! Cc: stable@vger.kernel.org Reported-by: Timo Rothenpieler <timo@rothenpieler.org> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2021-03-09Revert "nfsd4: remove check_conflicting_opens warning"J. Bruce Fields
This reverts commit 50747dd5e47b "nfsd4: remove check_conflicting_opens warning", as a prerequisite for reverting 94415b06eb8a, which has a serious bug. Cc: stable@vger.kernel.org Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2021-03-08cifs: do not send close in compound create+close requestsPaulo Alcantara
In case of interrupted syscalls, prevent sending CLOSE commands for compound CREATE+CLOSE requests by introducing an CIFS_CP_CREATE_CLOSE_OP flag to indicate lower layers that it should not send a CLOSE command to the MIDs corresponding the compound CREATE+CLOSE request. A simple reproducer: #!/bin/bash mount //server/share /mnt -o username=foo,password=*** tc qdisc add dev eth0 root netem delay 450ms stat -f /mnt &>/dev/null & pid=$! sleep 0.01 kill $pid tc qdisc del dev eth0 root umount /mnt Before patch: ... 6 0.256893470 192.168.122.2 → 192.168.122.15 SMB2 402 Create Request File: ;GetInfo Request FS_INFO/FileFsFullSizeInformation;Close Request 7 0.257144491 192.168.122.15 → 192.168.122.2 SMB2 498 Create Response File: ;GetInfo Response;Close Response 9 0.260798209 192.168.122.2 → 192.168.122.15 SMB2 146 Close Request File: 10 0.260841089 192.168.122.15 → 192.168.122.2 SMB2 130 Close Response, Error: STATUS_FILE_CLOSED Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com> Reviewed-by: Aurelien Aptel <aaptel@suse.com> CC: <stable@vger.kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2021-03-08cifs: return proper error code in statfs(2)Paulo Alcantara
In cifs_statfs(), if server->ops->queryfs is not NULL, then we should use its return value rather than always returning 0. Instead, use rc variable as it is properly set to 0 in case there is no server->ops->queryfs. Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Reviewed-by: Aurelien Aptel <aaptel@suse.com> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com> CC: <stable@vger.kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2021-03-08cifs: change noisy error message to FYIPaulo Alcantara
A customer has reported that their dmesg were being flooded by CIFS: VFS: \\server Cancelling wait for mid xxx cmd: a CIFS: VFS: \\server Cancelling wait for mid yyy cmd: b CIFS: VFS: \\server Cancelling wait for mid zzz cmd: c because some processes that were performing statfs(2) on the share had been interrupted due to their automount setup when certain users logged in and out. Change it to FYI as they should be mostly informative rather than error messages. Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Reviewed-by: Aurelien Aptel <aaptel@suse.com> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2021-03-08cifs: print MIDs in decimal notationPaulo Alcantara
The MIDs are mostly printed as decimal, so let's make it consistent. Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Reviewed-by: Aurelien Aptel <aaptel@suse.com> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2021-03-08NFS: Fix open coded versions of nfs_set_cache_invalid() in NFSv4Trond Myklebust
nfs_set_cache_invalid() has code to handle delegations, and other optimisations, so let's use it when appropriate. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2021-03-08NFS: Fix open coded versions of nfs_set_cache_invalid()Trond Myklebust
nfs_set_cache_invalid() has code to handle delegations, and other optimisations, so let's use it when appropriate. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2021-03-08NFS: Clean up function nfs_mark_dir_for_revalidate()Trond Myklebust
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2021-03-08NFS: Don't gratuitously clear the inode cache when lookup failedTrond Myklebust
The fact that the lookup revalidation failed, does not mean that the inode contents have changed. Fixes: 5ceb9d7fdaaf ("NFS: Refactor nfs_lookup_revalidate()") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2021-03-08NFS: Don't revalidate the directory permissions on a lookup failureTrond Myklebust
There should be no reason to expect the directory permissions to change just because the directory contents changed or a negative lookup timed out. So let's avoid doing a full call to nfs_mark_for_revalidate() in that case. Furthermore, if this is a negative dentry, and we haven't actually done a new lookup, then we have no reason yet to believe the directory has changed at all. So let's remove the gratuitous directory inode invalidation altogether when called from nfs_lookup_revalidate_negative(). Reported-by: Geert Jansen <gerardu@amazon.com> Fixes: 5ceb9d7fdaaf ("NFS: Refactor nfs_lookup_revalidate()") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2021-03-08NFS: Correct size calculation for create reply lengthFrank Sorenson
CREATE requests return a post_op_fh3, rather than nfs_fh3. The post_op_fh3 includes an extra word to indicate 'handle_follows'. Without that additional word, create fails when full 64-byte filehandles are in use. Add NFS3_post_op_fh_sz, and correct the size calculation for NFS3_createres_sz. Signed-off-by: Frank Sorenson <sorenson@redhat.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2021-03-08nfs: fix PNFS_FLEXFILE_LAYOUT Kconfig defaultTimo Rothenpieler
This follows what was done in 8c2fabc6542d9d0f8b16bd1045c2eda59bdcde13. With the default being m, it's impossible to build the module into the kernel. Signed-off-by: Timo Rothenpieler <timo@rothenpieler.org> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2021-03-08mount: fix mounting of detached mounts onto targets that reside on shared mountsChristian Brauner
Creating a series of detached mounts, attaching them to the filesystem, and unmounting them can be used to trigger an integer overflow in ns->mounts causing the kernel to block any new mounts in count_mounts() and returning ENOSPC because it falsely assumes that the maximum number of mounts in the mount namespace has been reached, i.e. it thinks it can't fit the new mounts into the mount namespace anymore. Depending on the number of mounts in your system, this can be reproduced on any kernel that supportes open_tree() and move_mount() by compiling and running the following program: /* SPDX-License-Identifier: LGPL-2.1+ */ #define _GNU_SOURCE #include <errno.h> #include <fcntl.h> #include <getopt.h> #include <limits.h> #include <stdbool.h> #include <stdio.h> #include <stdlib.h> #include <string.h> #include <sys/mount.h> #include <sys/stat.h> #include <sys/syscall.h> #include <sys/types.h> #include <unistd.h> /* open_tree() */ #ifndef OPEN_TREE_CLONE #define OPEN_TREE_CLONE 1 #endif #ifndef OPEN_TREE_CLOEXEC #define OPEN_TREE_CLOEXEC O_CLOEXEC #endif #ifndef __NR_open_tree #if defined __alpha__ #define __NR_open_tree 538 #elif defined _MIPS_SIM #if _MIPS_SIM == _MIPS_SIM_ABI32 /* o32 */ #define __NR_open_tree 4428 #endif #if _MIPS_SIM == _MIPS_SIM_NABI32 /* n32 */ #define __NR_open_tree 6428 #endif #if _MIPS_SIM == _MIPS_SIM_ABI64 /* n64 */ #define __NR_open_tree 5428 #endif #elif defined __ia64__ #define __NR_open_tree (428 + 1024) #else #define __NR_open_tree 428 #endif #endif /* move_mount() */ #ifndef MOVE_MOUNT_F_EMPTY_PATH #define MOVE_MOUNT_F_EMPTY_PATH 0x00000004 /* Empty from path permitted */ #endif #ifndef __NR_move_mount #if defined __alpha__ #define __NR_move_mount 539 #elif defined _MIPS_SIM #if _MIPS_SIM == _MIPS_SIM_ABI32 /* o32 */ #define __NR_move_mount 4429 #endif #if _MIPS_SIM == _MIPS_SIM_NABI32 /* n32 */ #define __NR_move_mount 6429 #endif #if _MIPS_SIM == _MIPS_SIM_ABI64 /* n64 */ #define __NR_move_mount 5429 #endif #elif defined __ia64__ #define __NR_move_mount (428 + 1024) #else #define __NR_move_mount 429 #endif #endif static inline int sys_open_tree(int dfd, const char *filename, unsigned int flags) { return syscall(__NR_open_tree, dfd, filename, flags); } static inline int sys_move_mount(int from_dfd, const char *from_pathname, int to_dfd, const char *to_pathname, unsigned int flags) { return syscall(__NR_move_mount, from_dfd, from_pathname, to_dfd, to_pathname, flags); } static bool is_shared_mountpoint(const char *path) { bool shared = false; FILE *f = NULL; char *line = NULL; int i; size_t len = 0; f = fopen("/proc/self/mountinfo", "re"); if (!f) return 0; while (getline(&line, &len, f) > 0) { char *slider1, *slider2; for (slider1 = line, i = 0; slider1 && i < 4; i++) slider1 = strchr(slider1 + 1, ' '); if (!slider1) continue; slider2 = strchr(slider1 + 1, ' '); if (!slider2) continue; *slider2 = '\0'; if (strcmp(slider1 + 1, path) == 0) { /* This is the path. Is it shared? */ slider1 = strchr(slider2 + 1, ' '); if (slider1 && strstr(slider1, "shared:")) { shared = true; break; } } } fclose(f); free(line); return shared; } static void usage(void) { const char *text = "mount-new [--recursive] <base-dir>\n"; fprintf(stderr, "%s", text); _exit(EXIT_SUCCESS); } #define exit_usage(format, ...) \ ({ \ fprintf(stderr, format "\n", ##__VA_ARGS__); \ usage(); \ }) #define exit_log(format, ...) \ ({ \ fprintf(stderr, format "\n", ##__VA_ARGS__); \ exit(EXIT_FAILURE); \ }) static const struct option longopts[] = { {"help", no_argument, 0, 'a'}, { NULL, no_argument, 0, 0 }, }; int main(int argc, char *argv[]) { int exit_code = EXIT_SUCCESS, index = 0; int dfd, fd_tree, new_argc, ret; char *base_dir; char *const *new_argv; char target[PATH_MAX]; while ((ret = getopt_long_only(argc, argv, "", longopts, &index)) != -1) { switch (ret) { case 'a': /* fallthrough */ default: usage(); } } new_argv = &argv[optind]; new_argc = argc - optind; if (new_argc < 1) exit_usage("Missing base directory\n"); base_dir = new_argv[0]; if (*base_dir != '/') exit_log("Please specify an absolute path"); /* Ensure that target is a shared mountpoint. */ if (!is_shared_mountpoint(base_dir)) exit_log("Please ensure that \"%s\" is a shared mountpoint", base_dir); dfd = open(base_dir, O_RDONLY | O_DIRECTORY | O_CLOEXEC); if (dfd < 0) exit_log("%m - Failed to open base directory \"%s\"", base_dir); ret = mkdirat(dfd, "detached-move-mount", 0755); if (ret < 0) exit_log("%m - Failed to create required temporary directories"); ret = snprintf(target, sizeof(target), "%s/detached-move-mount", base_dir); if (ret < 0 || (size_t)ret >= sizeof(target)) exit_log("%m - Failed to assemble target path"); /* * Having a mount table with 10000 mounts is already quite excessive * and shoult account even for weird test systems. */ for (size_t i = 0; i < 10000; i++) { fd_tree = sys_open_tree(dfd, "detached-move-mount", OPEN_TREE_CLONE | OPEN_TREE_CLOEXEC | AT_EMPTY_PATH); if (fd_tree < 0) { fprintf(stderr, "%m - Failed to open %d(detached-move-mount)", dfd); exit_code = EXIT_FAILURE; break; } ret = sys_move_mount(fd_tree, "", dfd, "detached-move-mount", MOVE_MOUNT_F_EMPTY_PATH); if (ret < 0) { if (errno == ENOSPC) fprintf(stderr, "%m - Buggy mount counting"); else fprintf(stderr, "%m - Failed to attach mount to %d(detached-move-mount)", dfd); exit_code = EXIT_FAILURE; break; } close(fd_tree); ret = umount2(target, MNT_DETACH); if (ret < 0) { fprintf(stderr, "%m - Failed to unmount %s", target); exit_code = EXIT_FAILURE; break; } } (void)unlinkat(dfd, "detached-move-mount", AT_REMOVEDIR); close(dfd); exit(exit_code); } and wait for the kernel to refuse any new mounts by returning ENOSPC. How many iterations are needed depends on the number of mounts in your system. Assuming you have something like 50 mounts on a standard system it should be almost instantaneous. The root cause of this is that detached mounts aren't handled correctly when source and target mount are identical and reside on a shared mount causing a broken mount tree where the detached source itself is propagated which propagation prevents for regular bind-mounts and new mounts. This ultimately leads to a miscalculation of the number of mounts in the mount namespace. Detached mounts created via open_tree(fd, path, OPEN_TREE_CLONE) are essentially like an unattached new mount, or an unattached bind-mount. They can then later on be attached to the filesystem via move_mount() which calls into attach_recursive_mount(). Part of attaching it to the filesystem is making sure that mounts get correctly propagated in case the destination mountpoint is MS_SHARED, i.e. is a shared mountpoint. This is done by calling into propagate_mnt() which walks the list of peers calling propagate_one() on each mount in this list making sure it receives the propagation event. The propagate_one() functions thereby skips both new mounts and bind mounts to not propagate them "into themselves". Both are identified by checking whether the mount is already attached to any mount namespace in mnt->mnt_ns. The is what the IS_MNT_NEW() helper is responsible for. However, detached mounts have an anonymous mount namespace attached to them stashed in mnt->mnt_ns which means that IS_MNT_NEW() doesn't realize they need to be skipped causing the mount to propagate "into itself" breaking the mount table and causing a disconnect between the number of mounts recorded as being beneath or reachable from the target mountpoint and the number of mounts actually recorded/counted in ns->mounts ultimately causing an overflow which in turn prevents any new mounts via the ENOSPC issue. So teach propagation to handle detached mounts by making it aware of them. I've been tracking this issue down for the last couple of days and then verifying that the fix is correct by unmounting everything in my current mount table leaving only /proc and /sys mounted and running the reproducer above overnight verifying the number of mounts counted in ns->mounts. With this fix the counts are correct and the ENOSPC issue can't be reproduced. This change will only have an effect on mounts created with the new mount API since detached mounts cannot be created with the old mount API so regressions are extremely unlikely. Link: https://lore.kernel.org/r/20210306101010.243666-1-christian.brauner@ubuntu.com Fixes: 2db154b3ea8e ("vfs: syscall: Add move_mount(2) to move mounts around") Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Cc: <stable@vger.kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-03-08s390,alpha: make TMPFS_INODE64 available againHeiko Carstens
Both s390 and alpha have been switched to 64-bit ino_t with commit 96c0a6a72d18 ("s390,alpha: switch to 64-bit ino_t"). Therefore enable TMPFS_INODE64 for both architectures again. Cc: Richard Henderson <rth@twiddle.net> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Link: https://lore.kernel.org/linux-mm/YCV7QiyoweJwvN+m@osiris/ Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2021-03-08erofs: fix bio->bi_max_vecs behavior changeGao Xiang
Martin reported an issue that directory read could be hung on the latest -rc kernel with some certain image. The root cause is that commit baa2c7c97153 ("block: set .bi_max_vecs as actual allocated vector number") changes .bi_max_vecs behavior. bio->bi_max_vecs is set as actual allocated vector number rather than the requested number now. Let's avoid using .bi_max_vecs completely instead. Link: https://lore.kernel.org/r/20210306040438.8084-1-hsiangkao@aol.com Reported-by: Martin DEVERA <devik@eaxlabs.cz> Reviewed-by: Chao Yu <yuchao0@huawei.com> [ Gao Xiang: note that <= 5.11 kernels are not impacted. ] Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
2021-03-07io_uring: kill io_sq_thread_fork() and return -EOWNERDEAD if the sq_thread ↵Stefan Metzmacher
is gone This brings the behavior back in line with what 5.11 and earlier did, and this is no longer needed with the improved handling of creds not needing to do unshare(). Signed-off-by: Stefan Metzmacher <metze@samba.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-07io_uring: run __io_sq_thread() with the initial creds from io_uring_setup()Stefan Metzmacher
With IORING_SETUP_ATTACH_WQ we should let __io_sq_thread() use the initial creds from each ctx. Signed-off-by: Stefan Metzmacher <metze@samba.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-07io-wq: warn on creating manager while exitingPavel Begunkov
Add a simple warning making sure that nobody tries to create a new manager while we're under IO_WQ_BIT_EXIT. That can potentially happen due to racy work submission after final put. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-07io_uring: cancel reqs of all iowq's on ring exitPavel Begunkov
io_ring_exit_work() have to cancel all requests, including those staying in io-wq, however it tries only cancellation of current tctx, which is NULL. If we've got task==NULL, use the ctx-to-tctx map to go over all tctx/io-wq and try cancellations on them. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-07io_uring: warn when ring exit takes too longPavel Begunkov
We use system_unbound_wq to run io_ring_exit_work(), so it's hard to monitor whether removal hang or not. Add WARN_ONCE to catch hangs. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-07io_uring: index io_uring->xa by ctx not filePavel Begunkov
We don't use task file notes anymore, and no need left in indexing task->io_uring->xa by file, and replace it with ctx. It's better design-wise, especially since we keep a dangling file, and so have to keep an eye on not dereferencing it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-07io_uring: don't take task ring-file notesPavel Begunkov
With ->flush() gone we're now leaving all uring file notes until the task dies/execs, so the ctx will not be freed until all tasks that have ever submit a request die. It was nicer with flush but not much, we could have locked as described ctx in many cases. Now we guarantee that ctx outlives all tctx in a sense that io_ring_exit_work() waits for all tctxs to drop their corresponding enties in ->xa, and ctx won't go away until then. Hence, additional io_uring file reference (a.k.a. task file notes) are not needed anymore. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-07io_uring: do ctx initiated file note removalPavel Begunkov
Another preparation patch. When full quiesce is done on ctx exit, use task_work infra to remove corresponding to the ctx io_uring->xa entries. For that we use the back tctx map. Also use ->in_idle to prevent removing it while we traversing ->xa on cancellation, just ignore it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-07io_uring: introduce ctx to tctx back mapPavel Begunkov
For each pair tcxt-ctx create an object and chain it into ctx, so we have a way to traverse all tctx that are using current ctx. Preparation patch, will be used later. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-07io_uring: make del_task_file more forgivingPavel Begunkov
Rework io_uring_del_task_file(), so it accepts an index to delete, and it's not necessarily have to be in the ->xa. Infer file from xa_erase() to maintain a single origin of truth. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-07gfs2: fix use-after-free in trans_drainBob Peterson
This patch adds code to function trans_drain to remove drained bd elements from the ail lists, if queued, before freeing the bd. If we don't remove the bd from the ail, function ail_drain will try to reference the bd after it has been freed by trans_drain. Thanks to Andy Price for his analysis of the problem. Reported-by: Andy Price <anprice@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2021-03-07gfs2: make function gfs2_make_fs_ro() to void typeYang Li
It fixes the following warning detected by coccinelle: ./fs/gfs2/super.c:592:5-10: Unneeded variable: "error". Return "0" on line 628 Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>