summaryrefslogtreecommitdiff
path: root/fs/aio.c
AgeCommit message (Collapse)Author
2017-03-03Merge branch 'WIP.sched-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull sched.h split-up from Ingo Molnar: "The point of these changes is to significantly reduce the <linux/sched.h> header footprint, to speed up the kernel build and to have a cleaner header structure. After these changes the new <linux/sched.h>'s typical preprocessed size goes down from a previous ~0.68 MB (~22K lines) to ~0.45 MB (~15K lines), which is around 40% faster to build on typical configs. Not much changed from the last version (-v2) posted three weeks ago: I eliminated quirks, backmerged fixes plus I rebased it to an upstream SHA1 from yesterday that includes most changes queued up in -next plus all sched.h changes that were pending from Andrew. I've re-tested the series both on x86 and on cross-arch defconfigs, and did a bisectability test at a number of random points. I tried to test as many build configurations as possible, but some build breakage is probably still left - but it should be mostly limited to architectures that have no cross-compiler binaries available on kernel.org, and non-default configurations" * 'WIP.sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (146 commits) sched/headers: Clean up <linux/sched.h> sched/headers: Remove #ifdefs from <linux/sched.h> sched/headers: Remove the <linux/topology.h> include from <linux/sched.h> sched/headers, hrtimer: Remove the <linux/wait.h> include from <linux/hrtimer.h> sched/headers, x86/apic: Remove the <linux/pm.h> header inclusion from <asm/apic.h> sched/headers, timers: Remove the <linux/sysctl.h> include from <linux/timer.h> sched/headers: Remove <linux/magic.h> from <linux/sched/task_stack.h> sched/headers: Remove <linux/sched.h> from <linux/sched/init.h> sched/core: Remove unused prefetch_stack() sched/headers: Remove <linux/rculist.h> from <linux/sched.h> sched/headers: Remove the 'init_pid_ns' prototype from <linux/sched.h> sched/headers: Remove <linux/signal.h> from <linux/sched.h> sched/headers: Remove <linux/rwsem.h> from <linux/sched.h> sched/headers: Remove the runqueue_is_locked() prototype sched/headers: Remove <linux/sched.h> from <linux/sched/hotplug.h> sched/headers: Remove <linux/sched.h> from <linux/sched/debug.h> sched/headers: Remove <linux/sched.h> from <linux/sched/nohz.h> sched/headers: Remove <linux/sched.h> from <linux/sched/stat.h> sched/headers: Remove the <linux/gfp.h> include from <linux/sched.h> sched/headers: Remove <linux/rtmutex.h> from <linux/sched.h> ...
2017-03-02Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs pile two from Al Viro: - orangefs fix - series of fs/namei.c cleanups from me - VFS stuff coming from overlayfs tree * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: orangefs: Use RCU for destroy_inode vfs: use helper for calling f_op->fsync() mm: use helper for calling f_op->mmap() vfs: use helpers for calling f_op->{read,write}_iter() vfs: pass type instead of fn to do_{loop,iter}_readv_writev() vfs: extract common parts of {compat_,}do_readv_writev() vfs: wrap write f_ops with file_{start,end}_write() vfs: deny copy_file_range() for non regular files vfs: deny fallocate() on directory vfs: create vfs helper vfs_tmpfile() namei.c: split unlazy_walk() namei.c: fold the check for DCACHE_OP_REVALIDATE into d_revalidate() lookup_fast(): clean up the logics around the fallback to non-rcu mode namei: fold unlazy_link() into its sole caller
2017-03-02Merge remote-tracking branch 'ovl/for-viro' into for-linusAl Viro
Overlayfs-related series from Miklos and Amir
2017-03-02sched/headers: Prepare to move signal wakeup & sigpending methods from ↵Ingo Molnar
<linux/sched.h> into <linux/sched/signal.h> Fix up affected files that include this signal functionality via sched.h. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-24userfaultfd: non-cooperative: add event for memory unmapsMike Rapoport
When a non-cooperative userfaultfd monitor copies pages in the background, it may encounter regions that were already unmapped. Addition of UFFD_EVENT_UNMAP allows the uffd monitor to track precisely changes in the virtual memory layout. Since there might be different uffd contexts for the affected VMAs, we first should create a temporary representation for the unmap event for each uffd context and then notify them one by one to the appropriate userfault file descriptors. The event notification occurs after the mmap_sem has been released. [arnd@arndb.de: fix nommu build] Link: http://lkml.kernel.org/r/20170203165141.3665284-1-arnd@arndb.de [mhocko@suse.com: fix nommu build] Link: http://lkml.kernel.org/r/20170202091503.GA22823@dhcp22.suse.cz Link: http://lkml.kernel.org/r/1485542673-24387-3-git-send-email-rppt@linux.vnet.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Signed-off-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Pavel Emelyanov <xemul@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-20vfs: use helpers for calling f_op->{read,write}_iter()Miklos Szeredi
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-01-14aio: fix lock dep warningShaohua Li
lockdep reports a warnning. file_start_write/file_end_write only acquire/release the lock for regular files. So checking the files in aio side too. [ 453.532141] ------------[ cut here ]------------ [ 453.533011] WARNING: CPU: 1 PID: 1298 at ../kernel/locking/lockdep.c:3514 lock_release+0x434/0x670 [ 453.533011] DEBUG_LOCKS_WARN_ON(depth <= 0) [ 453.533011] Modules linked in: [ 453.533011] CPU: 1 PID: 1298 Comm: fio Not tainted 4.9.0+ #964 [ 453.533011] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.0-1.fc24 04/01/2014 [ 453.533011] ffff8803a24b7a70 ffffffff8196cffb ffff8803a24b7ae8 0000000000000000 [ 453.533011] ffff8803a24b7ab8 ffffffff81091ee1 ffff8803a5dba700 00000dba00000008 [ 453.533011] ffffed0074496f59 ffff8803a5dbaf54 ffff8803ae0f8488 fffffffffffffdef [ 453.533011] Call Trace: [ 453.533011] [<ffffffff8196cffb>] dump_stack+0x67/0x9c [ 453.533011] [<ffffffff81091ee1>] __warn+0x111/0x130 [ 453.533011] [<ffffffff81091f97>] warn_slowpath_fmt+0x97/0xb0 [ 453.533011] [<ffffffff81091f00>] ? __warn+0x130/0x130 [ 453.533011] [<ffffffff8191b789>] ? blk_finish_plug+0x29/0x60 [ 453.533011] [<ffffffff811205d4>] lock_release+0x434/0x670 [ 453.533011] [<ffffffff8198af94>] ? import_single_range+0xd4/0x110 [ 453.533011] [<ffffffff81322195>] ? rw_verify_area+0x65/0x140 [ 453.533011] [<ffffffff813aa696>] ? aio_write+0x1f6/0x280 [ 453.533011] [<ffffffff813aa6c9>] aio_write+0x229/0x280 [ 453.533011] [<ffffffff813aa4a0>] ? aio_complete+0x640/0x640 [ 453.533011] [<ffffffff8111df20>] ? debug_check_no_locks_freed+0x1a0/0x1a0 [ 453.533011] [<ffffffff8114793a>] ? debug_lockdep_rcu_enabled.part.2+0x1a/0x30 [ 453.533011] [<ffffffff81147985>] ? debug_lockdep_rcu_enabled+0x35/0x40 [ 453.533011] [<ffffffff812a92be>] ? __might_fault+0x7e/0xf0 [ 453.533011] [<ffffffff813ac9bc>] do_io_submit+0x94c/0xb10 [ 453.533011] [<ffffffff813ac2ae>] ? do_io_submit+0x23e/0xb10 [ 453.533011] [<ffffffff813ac070>] ? SyS_io_destroy+0x270/0x270 [ 453.533011] [<ffffffff8111d7b3>] ? mark_held_locks+0x23/0xc0 [ 453.533011] [<ffffffff8100201a>] ? trace_hardirqs_on_thunk+0x1a/0x1c [ 453.533011] [<ffffffff813acb90>] SyS_io_submit+0x10/0x20 [ 453.533011] [<ffffffff824f96aa>] entry_SYSCALL_64_fastpath+0x18/0xad [ 453.533011] [<ffffffff81119190>] ? trace_hardirqs_off_caller+0xc0/0x110 [ 453.533011] ---[ end trace b2fbe664d1cc0082 ]--- Cc: Dmitry Monakhov <dmonakhov@openvz.org> Cc: Jan Kara <jack@suse.cz> Cc: Christoph Hellwig <hch@lst.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-12-25ktime: Get rid of the unionThomas Gleixner
ktime is a union because the initial implementation stored the time in scalar nanoseconds on 64 bit machine and in a endianess optimized timespec variant for 32bit machines. The Y2038 cleanup removed the timespec variant and switched everything to scalar nanoseconds. The union remained, but become completely pointless. Get rid of the union and just keep ktime_t as simple typedef of type s64. The conversion was done with coccinelle and some manual mopping up. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org>
2016-12-24Replace <asm/uaccess.h> with <linux/uaccess.h> globallyLinus Torvalds
This was entirely automated, using the script by Al: PATT='^[[:blank:]]*#[[:blank:]]*include[[:blank:]]*<asm/uaccess.h>' sed -i -e "s!$PATT!#include <linux/uaccess.h>!" \ $(git grep -l "$PATT"|grep -v ^include/linux/uaccess.h) to do the replacement at the end of the merge window. Requested-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-22move aio compat to fs/aio.cAl Viro
... and fix the minor buglet in compat io_submit() - native one kills ioctx as cleanup when put_user() fails. Get rid of bogus compat_... in !CONFIG_AIO case, while we are at it - they should simply fail with ENOSYS, same as for native counterparts. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-12-04don't open-code file_inode()Al Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-10-30aio: fix freeze protection of aio writesJan Kara
Currently we dropped freeze protection of aio writes just after IO was submitted. Thus aio write could be in flight while the filesystem was frozen and that could result in unexpected situation like aio completion wanting to convert extent type on frozen filesystem. Testcase from Dmitry triggering this is like: for ((i=0;i<60;i++));do fsfreeze -f /mnt ;sleep 1;fsfreeze -u /mnt;done & fio --bs=4k --ioengine=libaio --iodepth=128 --size=1g --direct=1 \ --runtime=60 --filename=/mnt/file --name=rand-write --rw=randwrite Fix the problem by dropping freeze protection only once IO is completed in aio_complete(). Reported-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Jan Kara <jack@suse.cz> [hch: forward ported on top of various VFS and aio changes] Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-10-30fs: remove aio_run_iocbChristoph Hellwig
Pass the ABI iocb structure to aio_setup_rw and let it handle the non-vectored I/O case as well. With that and a new helper for the AIO return value handling we can now define new aio_read and aio_write helpers that implement reads and writes in a self-contained way without duplicating too much code. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-10-30fs: remove the never implemented aio_fsync file operationChristoph Hellwig
Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-10-30aio: hold an extra file reference over AIO read/write operationsChristoph Hellwig
Otherwise we might dereference an already freed file and/or inode when aio_complete is called before we return from the read_iter or write_iter method. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-09-27fs/aio.c: eliminate redundant loads in put_aio_ring_fileRasmus Villemoes
Using a local variable we can prevent gcc from reloading aio_ring_file->f_inode->i_mapping twice, eliminating 2x2 dependent loads. Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-09-15aio: mark AIO pseudo-fs noexecJann Horn
This ensures that do_mmap() won't implicitly make AIO memory mappings executable if the READ_IMPLIES_EXEC personality flag is set. Such behavior is problematic because the security_mmap_file LSM hook doesn't catch this case, potentially permitting an attacker to bypass a W^X policy enforced by SELinux. I have tested the patch on my machine. To test the behavior, compile and run this: #define _GNU_SOURCE #include <unistd.h> #include <sys/personality.h> #include <linux/aio_abi.h> #include <err.h> #include <stdlib.h> #include <stdio.h> #include <sys/syscall.h> int main(void) { personality(READ_IMPLIES_EXEC); aio_context_t ctx = 0; if (syscall(__NR_io_setup, 1, &ctx)) err(1, "io_setup"); char cmd[1000]; sprintf(cmd, "cat /proc/%d/maps | grep -F '/[aio]'", (int)getpid()); system(cmd); return 0; } In the output, "rw-s" is good, "rwxs" is bad. Signed-off-by: Jann Horn <jann@thejh.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-23aio: make aio_setup_ring killableMichal Hocko
aio_setup_ring waits for mmap_sem in writable mode. If the waiting task gets killed by the oom killer it would block oom_reaper from asynchronous address space reclaim and reduce the chances of timely OOM resolving. Wait for the lock in the killable mode and return with EINTR if the task got killed while waiting. This will also expedite the return to the userspace and do_exit. Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Jeff Moyer <jmoyer@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Benamin LaHaise <bcrl@kvack.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-03aio: remove a pointless assignmentAl Viro
the value is never used after that point Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-09-04mm: move ->mremap() from file_operations to vm_operations_structOleg Nesterov
vma->vm_ops->mremap() looks more natural and clean in move_vma(), and this way ->mremap() can have more users. Say, vdso. While at it, s/aio_ring_remap/aio_ring_mremap/. Note: this is the minimal change before ->mremap() finds another user in file_operations; this method should have more arguments, and it can be used to kill arch_remap(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Pavel Emelyanov <xemul@parallels.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: Hugh Dickins <hughd@google.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-16Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull third hunk of vfs changes from Al Viro: "This contains the ->direct_IO() changes from Omar + saner generic_write_checks() + dealing with fcntl()/{read,write}() races (mirroring O_APPEND/O_DIRECT into iocb->ki_flags and instead of repeatedly looking at ->f_flags, which can be changed by fcntl(2), check ->ki_flags - which cannot) + infrastructure bits for dhowells' d_inode annotations + Christophs switch of /dev/loop to vfs_iter_write()" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (30 commits) block: loop: switch to VFS ITER_BVEC configfs: Fix inconsistent use of file_inode() vs file->f_path.dentry->d_inode VFS: Make pathwalk use d_is_reg() rather than S_ISREG() VFS: Fix up debugfs to use d_is_dir() in place of S_ISDIR() VFS: Combine inode checks with d_is_negative() and d_is_positive() in pathwalk NFS: Don't use d_inode as a variable name VFS: Impose ordering on accesses of d_inode and d_flags VFS: Add owner-filesystem positive/negative dentry checks nfs: generic_write_checks() shouldn't be done on swapout... ocfs2: use __generic_file_write_iter() mirror O_APPEND and O_DIRECT into iocb->ki_flags switch generic_write_checks() to iocb and iter ocfs2: move generic_write_checks() before the alignment checks ocfs2_file_write_iter: stop messing with ppos udf_file_write_iter: reorder and simplify fuse: ->direct_IO() doesn't need generic_write_checks() ext4_file_write_iter: move generic_write_checks() up xfs_file_aio_write_checks: switch to iocb/iov_iter generic_write_checks(): drop isblk argument blkdev_write_iter: expand generic_file_checks() call in there ...
2015-04-16Merge branch 'for-4.1/core' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block layer core bits from Jens Axboe: "This is the core pull request for 4.1. Not a lot of stuff in here for this round, mostly little fixes or optimizations. This pull request contains: - An optimization that speeds up queue runs on blk-mq, especially for the case where there's a large difference between nr_cpu_ids and the actual mapped software queues on a hardware queue. From Chong Yuan. - Honor node local allocations for requests on legacy devices. From David Rientjes. - Cleanup of blk_mq_rq_to_pdu() from me. - exit_aio() fixup from me, greatly speeding up exiting multiple IO contexts off exit_group(). For my particular test case, fio exit took ~6 seconds. A typical case of both exposing RCU grace periods to user space, and serializing exit of them. - Make blk_mq_queue_enter() honor the gfp mask passed in, so we only wait if __GFP_WAIT is set. From Keith Busch. - blk-mq exports and two added helpers from Mike Snitzer, which will be used by the dm-mq code. - Cleanups of blk-mq queue init from Wei Fang and Xiaoguang Wang" * 'for-4.1/core' of git://git.kernel.dk/linux-block: blk-mq: reduce unnecessary software queue looping aio: fix serial draining in exit_aio() blk-mq: cleanup blk_mq_rq_to_pdu() blk-mq: put blk_queue_rq_timeout together in blk_mq_init_queue() block: remove redundant check about 'set->nr_hw_queues' in blk_mq_alloc_tag_set() block: allocate request memory local to request queue blk-mq: don't wait in blk_mq_queue_enter() if __GFP_WAIT isn't set blk-mq: export blk_mq_run_hw_queues blk-mq: add blk_mq_init_allocated_queue and export blk_mq_register_disk
2015-04-15Merge branch 'for-linus-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull second vfs update from Al Viro: "Now that net-next went in... Here's the next big chunk - killing ->aio_read() and ->aio_write(). There'll be one more pile today (direct_IO changes and generic_write_checks() cleanups/fixes), but I'd prefer to keep that one separate" * 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (37 commits) ->aio_read and ->aio_write removed pcm: another weird API abuse infinibad: weird APIs switched to ->write_iter() kill do_sync_read/do_sync_write fuse: use iov_iter_get_pages() for non-splice path fuse: switch to ->read_iter/->write_iter switch drivers/char/mem.c to ->read_iter/->write_iter make new_sync_{read,write}() static coredump: accept any write method switch /dev/loop to vfs_iter_write() serial2002: switch to __vfs_read/__vfs_write ashmem: use __vfs_read() export __vfs_read() autofs: switch to __vfs_write() new helper: __vfs_write() switch hugetlbfs to ->read_iter() coda: switch to ->read_iter/->write_iter ncpfs: switch to ->read_iter/->write_iter net/9p: remove (now-)unused helpers p9_client_attach(): set fid->uid correctly ...
2015-04-15aio: fix serial draining in exit_aio()Jens Axboe
exit_aio() currently serializes killing io contexts. Each context killing ends up having to do percpu_ref_kill(), which in turns has to wait for an RCU grace period. This can take a long time, depending on the number of contexts. And there's no point in doing them serially, when we could be waiting for all of them in one fell swoop. This patches makes my fio thread offload test case exit 0.2s instead of almost 6s. Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-04-14Merge branch 'for-linus-1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs update from Al Viro: "Part one: - struct filename-related cleanups - saner iov_iter_init() replacements (and switching the syscalls to use of those) - ntfs switch to ->write_iter() (Anton) - aio cleanups and splitting iocb into common and async parts (Christoph) - assorted fixes (me, bfields, Andrew Elble) There's a lot more, including the completion of switchover to ->{read,write}_iter(), d_inode/d_backing_inode annotations, f_flags race fixes, etc, but that goes after #for-davem merge. David has pulled it, and once it's in I'll send the next vfs pull request" * 'for-linus-1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (35 commits) sg_start_req(): use import_iovec() sg_start_req(): make sure that there's not too many elements in iovec blk_rq_map_user(): use import_single_range() sg_io(): use import_iovec() process_vm_access: switch to {compat_,}import_iovec() switch keyctl_instantiate_key_common() to iov_iter switch {compat_,}do_readv_writev() to {compat_,}import_iovec() aio_setup_vectored_rw(): switch to {compat_,}import_iovec() vmsplice_to_user(): switch to import_iovec() kill aio_setup_single_vector() aio: simplify arguments of aio_setup_..._rw() aio: lift iov_iter_init() into aio_setup_..._rw() lift iov_iter into {compat_,}do_readv_writev() NFS: fix BUG() crash in notify_change() with patch to chown_common() dcache: return -ESTALE not -EBUSY on distributed fs race NTFS: Version 2.1.32 - Update file write from aio_write to write_iter. VFS: Add iov_iter_fault_in_multipages_readable() drop bogus check in file_open_root() switch security_inode_getattr() to struct path * constify tomoyo_realpath_from_path() ...
2015-04-11mirror O_APPEND and O_DIRECT into iocb->ki_flagsAl Viro
... avoiding write_iter/fcntl races. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-11Merge branch 'for-linus' into for-nextAl Viro
2015-04-11->aio_read and ->aio_write removedAl Viro
no remaining users Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-11aio_run_iocb(): kill dead checkAl Viro
We check if ->ki_pos is positive. However, by that point we have already done rw_verify_area(), which would have rejected such unless the file had been one of /dev/mem, /dev/kmem and /proc/kcore. All of which do not have vectored rw methods, so we would've bailed out even earlier. This check had been introduced before rw_verify_area() had been added there - in fact, it was a subset of checks done on sync paths by rw_verify_area() (back then the /dev/mem exception didn't exist at all). The rest of checks (mandatory locking, etc.) hadn't been added until later. Unfortunately, by the time the call of rw_verify_area() got added, the /dev/mem exception had already appeared, so it wasn't obvious that the older explicit check downstream had become dead code. It *is* a dead code, though, since the few files for which the exception applies do not have ->aio_{read,write}() or ->{read,write}_iter() and for them we won't reach that check anyway. What's more, even if we ever introduce vectored methods for /dev/mem and friends, they'll have to cope with negative positions anyway, since readv(2) and writev(2) are using the same checks as read(2) and write(2) - i.e. rw_verify_area(). Let's bury it. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-11ioctx_alloc(): remove pointless checkAl Viro
Way, way back kiocb used to be picked from arrays, so ioctx_alloc() checked for multiplication overflow when calculating the size of such array. By the time fs/aio.c went into the tree (in 2002) they were already allocated one-by-one by kmem_cache_alloc(), so that check had already become pointless. Let's bury it... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-11aio_setup_vectored_rw(): switch to {compat_,}import_iovec()Al Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-11kill aio_setup_single_vector()Al Viro
identical to import_single_range() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-11aio: simplify arguments of aio_setup_..._rw()Al Viro
We don't need req in either of those. We don't need nr_segs in caller. We don't really need len in caller either - iov_iter_count(&iter) will do. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-11aio: lift iov_iter_init() into aio_setup_..._rw()Al Viro
the only non-trivial detail is that we do it before rw_verify_area(), so we'd better cap the length ourselves in aio_setup_single_rw() case (for vectored case rw_copy_check_uvector() will do that for us). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-11Merge branch 'iocb' into for-nextAl Viro
2015-04-06ioctx_alloc(): fix vma (and file) leak on failureAl Viro
If we fail past the aio_setup_ring(), we need to destroy the mapping. We don't need to care about anybody having found ctx, or added requests to it, since the last failure exit is exactly the failure to make ctx visible to lookups. Reproducer (based on one by Joe Mario <jmario@redhat.com>): void count(char *p) { char s[80]; printf("%s: ", p); fflush(stdout); sprintf(s, "/bin/cat /proc/%d/maps|/bin/fgrep -c '/[aio] (deleted)'", getpid()); system(s); } int main() { io_context_t *ctx; int created, limit, i, destroyed; FILE *f; count("before"); if ((f = fopen("/proc/sys/fs/aio-max-nr", "r")) == NULL) perror("opening aio-max-nr"); else if (fscanf(f, "%d", &limit) != 1) fprintf(stderr, "can't parse aio-max-nr\n"); else if ((ctx = calloc(limit, sizeof(io_context_t))) == NULL) perror("allocating aio_context_t array"); else { for (i = 0, created = 0; i < limit; i++) { if (io_setup(1000, ctx + created) == 0) created++; } for (i = 0, destroyed = 0; i < created; i++) if (io_destroy(ctx[i]) == 0) destroyed++; printf("created %d, failed %d, destroyed %d\n", created, limit - created, destroyed); count("after"); } } Found-by: Joe Mario <jmario@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-06fix mremap() vs. ioctx_kill() raceAl Viro
teach ->mremap() method to return an error and have it fail for aio mappings in process of being killed Note that in case of ->mremap() failure we need to undo move_page_tables() we'd already done; we could call ->mremap() first, but then the failure of move_page_tables() would require undoing whatever _successful_ ->mremap() has done, which would be a lot more headache in general. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-03-13fs: split generic and aio kiocbChristoph Hellwig
Most callers in the kernel want to perform synchronous file I/O, but still have to bloat the stack with a full struct kiocb. Split out the parts needed in filesystem code from those in the aio code, and only allocate those needed to pass down argument on the stack. The aio code embedds the generic iocb in the one it allocates and can easily get back to it by using container_of. Also add a ->ki_complete method to struct kiocb, this is used to call into the aio code and thus removes the dependency on aio for filesystems impementing asynchronous operations. It will also allow other callers to substitute their own completion callback. We also add a new ->ki_flags field to work around the nasty layering violation recently introduced in commit 5e33f6 ("usb: gadget: ffs: add eventfd notification about ffs events"). Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-03-13fs: don't allow to complete sync iocbs through aio_completeChristoph Hellwig
The AIO interface is fairly complex because it tries to allow filesystems to always work async and then wakeup a synchronous caller through aio_complete. It turns out that basically no one was doing this to avoid the complexity and context switches, and we've already fixed up the remaining users and can now get rid of this case. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-03-12fs: remove ki_nbytesChristoph Hellwig
There is no need to pass the total request length in the kiocb, as we already get passed in through the iov_iter argument. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-02-20fs/aio.c: Remove duplicate function name in pr_debug messagesKinglong Mee
Have defined pr_fmt as below in fs/aio.c, so remove duplicate function name in pr_debug message. #define pr_fmt(fmt) "%s: " fmt, __func__ Signed-off-by: Kinglong Mee <kinglongmee@gmail.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-02-12Merge branch 'for-3.20/bdi' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull backing device changes from Jens Axboe: "This contains a cleanup of how the backing device is handled, in preparation for a rework of the life time rules. In this part, the most important change is to split the unrelated nommu mmap flags from it, but also removing a backing_dev_info pointer from the address_space (and inode), and a cleanup of other various minor bits. Christoph did all the work here, I just fixed an oops with pages that have a swap backing. Arnd fixed a missing export, and Oleg killed the lustre backing_dev_info from staging. Last patch was from Al, unexporting parts that are now no longer needed outside" * 'for-3.20/bdi' of git://git.kernel.dk/linux-block: Make super_blocks and sb_lock static mtd: export new mtd_mmap_capabilities fs: make inode_to_bdi() handle NULL inode staging/lustre/llite: get rid of backing_dev_info fs: remove default_backing_dev_info fs: don't reassign dirty inodes to default_backing_dev_info nfs: don't call bdi_unregister ceph: remove call to bdi_unregister fs: remove mapping->backing_dev_info fs: export inode_to_bdi and use it in favor of mapping->backing_dev_info nilfs2: set up s_bdi like the generic mount_bdev code block_dev: get bdev inode bdi directly from the block device block_dev: only write bdev inode on close fs: introduce f_op->mmap_capabilities for nommu mmap support fs: kill BDI_CAP_SWAP_BACKED fs: deduplicate noop_backing_dev_info
2015-02-03aio: annotate aio_read_event_ring for sleep patternsDave Chinner
Under CONFIG_DEBUG_ATOMIC_SLEEP=y, aio_read_event_ring() will throw warnings like the following due to being called from wait_event context: WARNING: CPU: 0 PID: 16006 at kernel/sched/core.c:7300 __might_sleep+0x7f/0x90() do not call blocking ops when !TASK_RUNNING; state=1 set at [<ffffffff810d85a3>] prepare_to_wait_event+0x63/0x110 Modules linked in: CPU: 0 PID: 16006 Comm: aio-dio-fcntl-r Not tainted 3.19.0-rc6-dgc+ #705 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 ffffffff821c0372 ffff88003c117cd8 ffffffff81daf2bd 000000000000d8d8 ffff88003c117d28 ffff88003c117d18 ffffffff8109beda ffff88003c117cf8 ffffffff821c115e 0000000000000061 0000000000000000 00007ffffe4aa300 Call Trace: [<ffffffff81daf2bd>] dump_stack+0x4c/0x65 [<ffffffff8109beda>] warn_slowpath_common+0x8a/0xc0 [<ffffffff8109bf56>] warn_slowpath_fmt+0x46/0x50 [<ffffffff810d85a3>] ? prepare_to_wait_event+0x63/0x110 [<ffffffff810d85a3>] ? prepare_to_wait_event+0x63/0x110 [<ffffffff810bdfcf>] __might_sleep+0x7f/0x90 [<ffffffff81db8344>] mutex_lock+0x24/0x45 [<ffffffff81216b7c>] aio_read_events+0x4c/0x290 [<ffffffff81216fac>] read_events+0x1ec/0x220 [<ffffffff810d8650>] ? prepare_to_wait_event+0x110/0x110 [<ffffffff810fdb10>] ? hrtimer_get_res+0x50/0x50 [<ffffffff8121899d>] SyS_io_getevents+0x4d/0xb0 [<ffffffff81dba5a9>] system_call_fastpath+0x12/0x17 ---[ end trace bde69eaf655a4fea ]--- There is not actually a bug here, so annotate the code to tell the debug logic that everything is just fine and not to fire a false positive. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
2015-01-20fs: remove mapping->backing_dev_infoChristoph Hellwig
Now that we never use the backing_dev_info pointer in struct address_space we can simply remove it and save 4 to 8 bytes in every inode. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Reviewed-by: Tejun Heo <tj@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-01-20fs: introduce f_op->mmap_capabilities for nommu mmap supportChristoph Hellwig
Since "BDI: Provide backing device capability information [try #3]" the backing_dev_info structure also provides flags for the kind of mmap operation available in a nommu environment, which is entirely unrelated to it's original purpose. Introduce a new nommu-only file operation to provide this information to the nommu mmap code instead. Splitting this from the backing_dev_info structure allows to remove lots of backing_dev_info instance that aren't otherwise needed, and entirely gets rid of the concept of providing a backing_dev_info for a character device. It also removes the need for the mtd_inodefs filesystem. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Tejun Heo <tj@kernel.org> Acked-by: Brian Norris <computersforpeace@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2014-12-13aio: Skip timer for io_getevents if timeout=0Fam Zheng
In this case, it is basically a polling. Let's not involve timer at all because that would hurt performance for application event loops. In an arbitrary test I've done, io_getevents syscall elapsed time reduces from 50000+ nanoseconds to a few hundereds. Signed-off-by: Fam Zheng <famz@redhat.com> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
2014-12-13aio: Make it possible to remap aio ringPavel Emelyanov
There are actually two issues this patch addresses. Let me start with the one I tried to solve in the beginning. So, in the checkpoint-restore project (criu) we try to dump tasks' state and restore one back exactly as it was. One of the tasks' state bits is rings set up with io_setup() call. There's (almost) no problems in dumping them, there's a problem restoring them -- if I dump a task with aio ring originally mapped at address A, I want to restore one back at exactly the same address A. Unfortunately, the io_setup() does not allow for that -- it mmaps the ring at whatever place mm finds appropriate (it calls do_mmap_pgoff() with zero address and without the MAP_FIXED flag). To make restore possible I'm going to mremap() the freshly created ring into the address A (under which it was seen before dump). The problem is that the ring's virtual address is passed back to the user-space as the context ID and this ID is then used as search key by all the other io_foo() calls. Reworking this ID to be just some integer doesn't seem to work, as this value is already used by libaio as a pointer using which this library accesses memory for aio meta-data. So, to make restore work we need to make sure that a) ring is mapped at desired virtual address b) kioctx->user_id matches this value Having said that, the patch makes mremap() on aio region update the kioctx's user_id and mmap_base values. Here appears the 2nd issue I mentioned in the beginning of this mail. If (regardless of the C/R dances I do) someone creates an io context with io_setup(), then mremap()-s the ring and then destroys the context, the kill_ioctx() routine will call munmap() on wrong (old) address. This will result in a) aio ring remaining in memory and b) some other vma get unexpectedly unmapped. What do you think? Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Acked-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
2014-11-25Merge git://git.kvack.org/~bcrl/aio-fixesLinus Torvalds
Pull aio fix from Ben LaHaise: "Dirty page accounting fix for aio" * git://git.kvack.org/~bcrl/aio-fixes: aio: fix uncorrent dirty pages accouting when truncating AIO ring buffer
2014-11-06aio: fix uncorrent dirty pages accouting when truncating AIO ring bufferGu Zheng
https://bugzilla.kernel.org/show_bug.cgi?id=86831 Markus reported that when shutting down mysqld (with AIO support, on a ext3 formatted Harddrive) leads to a negative number of dirty pages (underrun to the counter). The negative number results in a drastic reduction of the write performance because the page cache is not used, because the kernel thinks it is still 2 ^ 32 dirty pages open. Add a warn trace in __dec_zone_state will catch this easily: static inline void __dec_zone_state(struct zone *zone, enum zone_stat_item item) { atomic_long_dec(&zone->vm_stat[item]); + WARN_ON_ONCE(item == NR_FILE_DIRTY && atomic_long_read(&zone->vm_stat[item]) < 0); atomic_long_dec(&vm_stat[item]); } [ 21.341632] ------------[ cut here ]------------ [ 21.346294] WARNING: CPU: 0 PID: 309 at include/linux/vmstat.h:242 cancel_dirty_page+0x164/0x224() [ 21.355296] Modules linked in: wutbox_cp sata_mv [ 21.359968] CPU: 0 PID: 309 Comm: kworker/0:1 Not tainted 3.14.21-WuT #80 [ 21.366793] Workqueue: events free_ioctx [ 21.370760] [<c0016a64>] (unwind_backtrace) from [<c0012f88>] (show_stack+0x20/0x24) [ 21.378562] [<c0012f88>] (show_stack) from [<c03f8ccc>] (dump_stack+0x24/0x28) [ 21.385840] [<c03f8ccc>] (dump_stack) from [<c0023ae4>] (warn_slowpath_common+0x84/0x9c) [ 21.393976] [<c0023ae4>] (warn_slowpath_common) from [<c0023bb8>] (warn_slowpath_null+0x2c/0x34) [ 21.402800] [<c0023bb8>] (warn_slowpath_null) from [<c00c0688>] (cancel_dirty_page+0x164/0x224) [ 21.411524] [<c00c0688>] (cancel_dirty_page) from [<c00c080c>] (truncate_inode_page+0x8c/0x158) [ 21.420272] [<c00c080c>] (truncate_inode_page) from [<c00c0a94>] (truncate_inode_pages_range+0x11c/0x53c) [ 21.429890] [<c00c0a94>] (truncate_inode_pages_range) from [<c00c0f6c>] (truncate_pagecache+0x88/0xac) [ 21.439252] [<c00c0f6c>] (truncate_pagecache) from [<c00c0fec>] (truncate_setsize+0x5c/0x74) [ 21.447731] [<c00c0fec>] (truncate_setsize) from [<c013b3a8>] (put_aio_ring_file.isra.14+0x34/0x90) [ 21.456826] [<c013b3a8>] (put_aio_ring_file.isra.14) from [<c013b424>] (aio_free_ring+0x20/0xcc) [ 21.465660] [<c013b424>] (aio_free_ring) from [<c013b4f4>] (free_ioctx+0x24/0x44) [ 21.473190] [<c013b4f4>] (free_ioctx) from [<c003d8d8>] (process_one_work+0x134/0x47c) [ 21.481132] [<c003d8d8>] (process_one_work) from [<c003e988>] (worker_thread+0x130/0x414) [ 21.489350] [<c003e988>] (worker_thread) from [<c00448ac>] (kthread+0xd4/0xec) [ 21.496621] [<c00448ac>] (kthread) from [<c000ec18>] (ret_from_fork+0x14/0x20) [ 21.503884] ---[ end trace 79c4bf42c038c9a1 ]--- The cause is that we set the aio ring file pages as *DIRTY* via SetPageDirty (bypasses the VFS dirty pages increment) when init, and aio fs uses *default_backing_dev_info* as the backing dev, which does not disable the dirty pages accounting capability. So truncating aio ring file will contribute to accounting dirty pages (VFS dirty pages decrement), then error occurs. The original goal is keeping these pages in memory (can not be reclaimed or swapped) in life-time via marking it dirty. But thinking more, we have already pinned pages via elevating the page's refcount, which can already achieve the goal, so the SetPageDirty seems unnecessary. In order to fix the issue, using the __set_page_dirty_no_writeback instead of the nop .set_page_dirty, and dropped the SetPageDirty (don't manually set the dirty flags, don't disable set_page_dirty(), rely on default behaviour). With the above change, the dirty pages accounting can work well. But as we known, aio fs is an anonymous one, which should never cause any real write-back, we can ignore the dirty pages (write back) accounting by disabling the dirty pages (write back) accounting capability. So we introduce an aio private backing dev info (disabled the ACCT_DIRTY/WRITEBACK/ACCT_WB capabilities) to replace the default one. Reported-by: Markus Königshaus <m.koenigshaus@wut.de> Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com> Cc: stable <stable@vger.kernel.org> Acked-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
2014-09-24percpu_ref: add PERCPU_REF_INIT_* flagsTejun Heo
With the recent addition of percpu_ref_reinit(), percpu_ref now can be used as a persistent switch which can be turned on and off repeatedly where turning off maps to killing the ref and waiting for it to drain; however, there currently isn't a way to initialize a percpu_ref in its off (killed and drained) state, which can be inconvenient for certain persistent switch use cases. Similarly, percpu_ref_switch_to_atomic/percpu() allow dynamic selection of operation mode; however, currently a newly initialized percpu_ref is always in percpu mode making it impossible to avoid the latency overhead of switching to atomic mode. This patch adds @flags to percpu_ref_init() and implements the following flags. * PERCPU_REF_INIT_ATOMIC : start ref in atomic mode * PERCPU_REF_INIT_DEAD : start ref killed and drained These flags should be able to serve the above two use cases. v2: target_core_tpg.c conversion was missing. Fixed. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Christoph Hellwig <hch@infradead.org> Cc: Johannes Weiner <hannes@cmpxchg.org>