Age | Commit message (Collapse) | Author |
|
Fix up the test in nfs_page_group_covers_page(). The simplest implementation
is to check that we have a set of intersecting or contiguous subrequests
that connect page offset 0 to nfs_page_length(req->wb_page).
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
nfs_page_group_lock() is now always called with the 'nonblock'
parameter set to 'false'.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
At this point, we only expect ever to potentially see PG_REMOVE and
PG_TEARDOWN being set on the subrequests.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
Since nfs_page_group_destroy() does not take any locks on the requests
to be freed, we need to ensure that we don't inadvertently free the
request in nfs_destroy_unlinked_subrequests() while the last reference
is being released elsewhere.
Do this by:
1) Taking a reference to the request unless it is already being freed
2) Checking (under the page group lock) if PG_TEARDOWN is already set before
freeing an unreferenced request in nfs_destroy_unlinked_subrequests()
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
When locking the entire group in order to remove subrequests,
the locks are always taken in order, and with the page group
lock being taken after the page head is locked. The intention
is that:
1) The lock on the group head guarantees that requests may not
be removed from the group (although new entries could be appended
if we're not holding the group lock).
2) It is safe to drop and retake the page group lock while iterating
through the list, in particular when waiting for a subrequest lock.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
We should no longer need the inode->i_lock, now that we've
straightened out the request locking. The locking schema is now:
1) Lock page head request
2) Lock the page group
3) Lock the subrequests one by one
Note that there is a subtle race with nfs_inode_remove_request() due
to the fact that the latter does not lock the page head, when removing
it from the struct page. Only the last subrequest is locked, hence
we need to re-check that the PagePrivate(page) is still set after
we've locked all the subrequests.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
nfs_try_to_update_request() should be able to cope now.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
Simplify the code, and avoid some flushes to disk.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
Both nfs_destroy_unlinked_subrequests() and nfs_lock_and_join_requests()
manipulate the inode flags adjusting the NFS_I(inode)->nrequests.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
We don't want nfs_lock_and_join_requests() to start fiddling with
the request before the call to nfs_page_group_sync_on_bit().
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
Request offsets and sizes are not guaranteed to be stable unless you
are holding the request locked.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
All other callers of nfs_page_group_lock() appear to already hold the
page lock on the head page, so doing it in the opposite order here
is inefficient, although not deadlock prone since we roll back all
locks on contention.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
Yes, this is a situation that should never happen (hence the WARN_ON)
but we should still ensure that we free up the locks and references to
the faulty pages.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
This fixes a race with nfs_page_group_sync_on_bit() whereby the
call to wake_up_bit() in nfs_page_group_unlock() could occur after
the page header had been freed.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
Micro-optimisation to move the lockless check into the for(;;) loop.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
Add a lockless check for whether or not the page might be carrying
an existing writeback before we grab the inode->i_lock.
Reported-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
We don't expect the page header lock to ever be held across I/O, so
it should always be safe to wait for it, even if we're doing nonblocking
writebacks.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
Pull NFS client fixes from Anna Schumaker:
"A few more NFS client bugfixes from me for rc5.
Dros has a stable fix for flexfiles to prevent leaking the
nfs4_ff_ds_version arrays when freeing a layout, Trond fixed a
potential recovery loop situation with the TEST_STATEID operation, and
Christoph fixed up the pNFS blocklayout Kconfig options to prevent
unsafe use with kernels that don't have large block device support.
Summary:
Stable fix:
- fix leaking nfs4_ff_ds_version array
Other fixes:
- improve TEST_STATEID OLD_STATEID handling to prevent recovery loop
- require 64-bit sector_t for pNFS blocklayout to prevent 32-bit
compile errors"
* tag 'nfs-for-4.13-5' of git://git.linux-nfs.org/projects/anna/linux-nfs:
pnfs/blocklayout: require 64-bit sector_t
NFSv4: Ignore NFS4ERR_OLD_STATEID in nfs41_check_open_stateid()
nfs/flexfiles: fix leak of nfs4_ff_ds_version arrays
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
Pull fuse fixes from Miklos Szeredi:
"Fix a few bugs in fuse"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
fuse: set mapping error in writepage_locked when it fails
fuse: Dont call set_page_dirty_lock() for ITER_BVEC pages for async_dio
fuse: initialize the flock flag in fuse_file on allocation
|
|
The blocklayout code does not compile cleanly for a 32-bit sector_t,
and also has no reliable checks for devices sizes, which makes it
unsafe to use with a kernel that doesn't support large block devices.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Arnd Bergmann <arnd@arndb.de>
Fixes: 5c83746a0cf2 ("pnfs/blocklayout: in-kernel GETDEVICEINFO XDR parsing")
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
|
This ensures that we see errors on fsync when writeback fails.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
When the process exit races with outstanding mcopy_atomic, it would be
better to return ESRCH error. When such race occurs the process and
it's mm are going away and returning "no such process" to the uffd
monitor seems better fit than ENOSPC.
Link: http://lkml.kernel.org/r/1502111545-32305-1-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Suggested-by: Michal Hocko <mhocko@suse.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Pavel Emelyanov <xemul@virtuozzo.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Nadav reported KSM can corrupt the user data by the TLB batching
race[1]. That means data user written can be lost.
Quote from Nadav Amit:
"For this race we need 4 CPUs:
CPU0: Caches a writable and dirty PTE entry, and uses the stale value
for write later.
CPU1: Runs madvise_free on the range that includes the PTE. It would
clear the dirty-bit. It batches TLB flushes.
CPU2: Writes 4 to /proc/PID/clear_refs , clearing the PTEs soft-dirty.
We care about the fact that it clears the PTE write-bit, and of
course, batches TLB flushes.
CPU3: Runs KSM. Our purpose is to pass the following test in
write_protect_page():
if (pte_write(*pvmw.pte) || pte_dirty(*pvmw.pte) ||
(pte_protnone(*pvmw.pte) && pte_savedwrite(*pvmw.pte)))
Since it will avoid TLB flush. And we want to do it while the PTE is
stale. Later, and before replacing the page, we would be able to
change the page.
Note that all the operations the CPU1-3 perform canhappen in parallel
since they only acquire mmap_sem for read.
We start with two identical pages. Everything below regards the same
page/PTE.
CPU0 CPU1 CPU2 CPU3
---- ---- ---- ----
Write the same
value on page
[cache PTE as
dirty in TLB]
MADV_FREE
pte_mkclean()
4 > clear_refs
pte_wrprotect()
write_protect_page()
[ success, no flush ]
pages_indentical()
[ ok ]
Write to page
different value
[Ok, using stale
PTE]
replace_page()
Later, CPU1, CPU2 and CPU3 would flush the TLB, but that is too late.
CPU0 already wrote on the page, but KSM ignored this write, and it got
lost"
In above scenario, MADV_FREE is fixed by changing TLB batching API
including [set|clear]_tlb_flush_pending. Remained thing is soft-dirty
part.
This patch changes soft-dirty uses TLB batching API instead of
flush_tlb_mm and KSM checks pending TLB flush by using
mm_tlb_flush_pending so that it will flush TLB to avoid data lost if
there are other parallel threads pending TLB flush.
[1] http://lkml.kernel.org/r/BD3A0EBE-ECF4-41D4-87FA-C755EA9AB6BD@gmail.com
Link: http://lkml.kernel.org/r/20170802000818.4760-8-namit@vmware.com
Signed-off-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Nadav Amit <namit@vmware.com>
Reported-by: Nadav Amit <namit@vmware.com>
Tested-by: Nadav Amit <namit@vmware.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
As Tetsuo points out:
"Commit 385386cff4c6 ("mm: vmstat: move slab statistics from zone to
node counters") broke "Slab:" field of /proc/meminfo . It shows nearly
0kB"
In addition to /proc/meminfo, this problem also affects the slab
counters OOM/allocation failure info dumps, can cause early -ENOMEM from
overcommit protection, and miscalculate image size requirements during
suspend-to-disk.
This is because the patch in question switched the slab counters from
the zone level to the node level, but forgot to update the global
accessor functions to read the aggregate node data instead of the
aggregate zone data.
Use global_node_page_state() to access the global slab counters.
Fixes: 385386cff4c6 ("mm: vmstat: move slab statistics from zone to node counters")
Link: http://lkml.kernel.org/r/20170801134256.5400-1-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Stefan Agner <stefan@agner.ch>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
If the call to TEST_STATEID returns NFS4ERR_OLD_STATEID, then it just
means we raced with other calls to OPEN.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
|
The client was freeing the nfs4_ff_layout_ds, but not the contained
nfs4_ff_ds_version array.
Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Cc: stable@vger.kernel.org # v4.0+
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
|
Pull xfs fixes from Darrick Wong:
"I have a couple more bug fixes for you today:
- fix memory leak when issuing discard
- fix propagation of the dax inode flag"
* tag 'xfs-4.13-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: Fix per-inode DAX flag inheritance
xfs: Fix leak of discard bio
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 fixes from Ted Ts'o:
"A large number of ext4 bug fixes and cleanups for v4.13"
* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: fix copy paste error in ext4_swap_extents()
ext4: fix overflow caused by missing cast in ext4_resize_fs()
ext4, project: expand inode extra size if possible
ext4: cleanup ext4_expand_extra_isize_ea()
ext4: restructure ext4_expand_extra_isize
ext4: fix forgetten xattr lock protection in ext4_expand_extra_isize
ext4: make xattr inode reads faster
ext4: inplace xattr block update fails to deduplicate blocks
ext4: remove unused mode parameter
ext4: fix warning about stack corruption
ext4: fix dir_nlink behaviour
ext4: silence array overflow warning
ext4: fix SEEK_HOLE/SEEK_DATA for blocksize < pagesize
ext4: release discard bio after sending discard commands
ext4: convert swap_inode_data() over to use swap() on most of the fields
ext4: error should be cleared if ea_inode isn't added to the cache
ext4: Don't clear SGID when inheriting ACLs
ext4: preserve i_mode if __ext4_set_acl() fails
ext4: remove unused metadata accounting variables
ext4: correct comment references to ext4_ext_direct_IO()
|
|
This bug was found by a static code checker tool for copy paste
problems.
Signed-off-by: Maninder Singh <maninder1.s@samsung.com>
Signed-off-by: Vaneet Narang <v.narang@samsung.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
On a 32-bit platform, the value of n_blcoks_count may be wrong during
the file system is resized to size larger than 2^32 blocks. This may
caused the superblock being corrupted with zero blocks count.
Fixes: 1c6bd7173d66
Signed-off-by: Jerry Lee <jerrylee@qnap.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org # 3.7+
|
|
When upgrading from old format, try to set project id
to old file first time, it will return EOVERFLOW, but if
that file is dirtied(touch etc), changing project id will
be allowed, this might be confusing for users, we could
try to expand @i_extra_isize here too.
Reported-by: Zhang Yi <yi.zhang@huawei.com>
Signed-off-by: Miao Xie <miaoxie@huawei.com>
Signed-off-by: Wang Shilong <wshilong@ddn.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Clean up some goto statement, make ext4_expand_extra_isize_ea() clearer.
Signed-off-by: Miao Xie <miaoxie@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Wang Shilong <wshilong@ddn.com>
|
|
Current ext4_expand_extra_isize just tries to expand extra isize, if
someone is holding xattr lock or some check fails, it will give up.
So rename its name to ext4_try_to_expand_extra_isize.
Besides that, we clean up unnecessary check and move some relative checks
into it.
Signed-off-by: Miao Xie <miaoxie@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Wang Shilong <wshilong@ddn.com>
|
|
We should avoid the contention between the i_extra_isize update and
the inline data insertion, so move the xattr trylock in front of
i_extra_isize update.
Signed-off-by: Miao Xie <miaoxie@huawei.com>
Reviewed-by: Wang Shilong <wshilong@ddn.com>
|
|
ext4_xattr_inode_read() currently reads each block sequentially while
waiting for io operation to complete before moving on to the next
block. This prevents request merging in block layer.
Add a ext4_bread_batch() function that starts reads for all blocks
then optionally waits for them to complete. A similar logic is used
in ext4_find_entry(), so update that code to use the new function.
Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
When an xattr block has a single reference, block is updated inplace
and it is reinserted to the cache. Later, a cache lookup is performed
to see whether an existing block has the same contents. This cache
lookup will most of the time return the just inserted entry so
deduplication is not achieved.
Running the following test script will produce two xattr blocks which
can be observed in "File ACL: " line of debugfs output:
mke2fs -b 1024 -I 128 -F -O extent /dev/sdb 1G
mount /dev/sdb /mnt/sdb
touch /mnt/sdb/{x,y}
setfattr -n user.1 -v aaa /mnt/sdb/x
setfattr -n user.2 -v bbb /mnt/sdb/x
setfattr -n user.1 -v aaa /mnt/sdb/y
setfattr -n user.2 -v bbb /mnt/sdb/y
debugfs -R 'stat x' /dev/sdb | cat
debugfs -R 'stat y' /dev/sdb | cat
This patch defers the reinsertion to the cache so that we can locate
other blocks with the same contents.
Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
|
|
ext4_alloc_file_blocks() does not use its mode parameter. Remove it.
Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
After commit 62d1034f53e3 ("fortify: use WARN instead of BUG for now"),
we get a warning about possible stack overflow from a memcpy that
was not strictly bounded to the size of the local variable:
inlined from 'ext4_mb_seq_groups_show' at fs/ext4/mballoc.c:2322:2:
include/linux/string.h:309:9: error: '__builtin_memcpy': writing between 161 and 1116 bytes into a region of size 160 overflows the destination [-Werror=stringop-overflow=]
We actually had a bug here that would have been found by the warning,
but it was already fixed last year in commit 30a9d7afe70e ("ext4: fix
stack memory corruption with 64k block size").
This replaces the fixed-length structure on the stack with a variable-length
structure, using the correct upper bound that tells the compiler that
everything is really fine here. I also change the loop count to check
for the same upper bound for consistency, but the existing code is
already correct here.
Note that while clang won't allow certain kinds of variable-length arrays
in structures, this particular instance is fine, as the array is at the
end of the structure, and the size is strictly bounded.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
The dir_nlink feature has been enabled by default for new ext4
filesystems since e2fsprogs-1.41 in 2008, and was automatically
enabled by the kernel for older ext4 filesystems since the
dir_nlink feature was added with ext4 in kernel 2.6.28+ when
the subdirectory count exceeded EXT4_LINK_MAX-1.
Automatically adding the file system features such as dir_nlink is
generally frowned upon, since it could cause the file system to not be
mountable on older kernel, thus preventing the administrator from
rolling back to an older kernel if necessary.
In this case, the administrator might also want to disable the feature
because glibc's fts_read() function does not correctly optimize
directory traversal for directories that use st_nlinks field of 1 to
indicate that the number of links in the directory are not tracked by
the file system, and could fail to traverse the full directory
hierarchy. Fortunately, in the past ten years very few users have
complained about incomplete file system traversal by glibc's
fts_read().
This commit also changes ext4_inc_count() to allow i_nlinks to reach
the full EXT4_LINK_MAX links on the parent directory (including "."
and "..") before changing i_links_count to be 1.
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=196405
Signed-off-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
I get a static checker warning:
fs/ext4/ext4.h:3091 ext4_set_de_type()
error: buffer overflow 'ext4_type_by_mode' 15 <= 15
It seems unlikely that we would hit this read overflow in real life, but
it's also simple enough to make the array 16 bytes instead of 15.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
ext4_find_unwritten_pgoff() does not properly handle a situation when
starting index is in the middle of a page and blocksize < pagesize. The
following command shows the bug on filesystem with 1k blocksize:
xfs_io -f -c "falloc 0 4k" \
-c "pwrite 1k 1k" \
-c "pwrite 3k 1k" \
-c "seek -a -r 0" foo
In this example, neither lseek(fd, 1024, SEEK_HOLE) nor lseek(fd, 2048,
SEEK_DATA) will return the correct result.
Fix the problem by neglecting buffers in a page before starting offset.
Reported-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jan Kara <jack@suse.cz>
CC: stable@vger.kernel.org # 3.8+
|
|
We've changed the discard command handling into parallel manner.
But, in this change, I forgot decreasing the usage count of the bio
which was used to send discard request. I'm sorry about that.
Fixes: a015434480dc ("ext4: send parallel discards on commit completions")
Signed-off-by: Daeho Jeong <daeho.jeong@samsung.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
|
|
According to the commit that implemented per-inode DAX flag:
commit 58f88ca2df72 ("xfs: introduce per-inode DAX enablement")
the flag is supposed to act as "inherit flag".
Currently this only works in the situations where parent directory
already has a flag in di_flags set, otherwise inheritance does not
work. This is because setting the XFS_DIFLAG2_DAX flag is done in a
wrong branch designated for di_flags, not di_flags2.
Fix this by moving the code to branch designated for setting di_flags2,
which does test for flags in di_flags2.
Fixes: 58f88ca2df72 ("xfs: introduce per-inode DAX enablement")
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
The bio describing discard operation is allocated by
__blkdev_issue_discard() which returns us a reference to it. That
reference is never released and thus we leak this bio. Drop the bio
reference once it completes in xlog_discard_endio().
CC: stable@vger.kernel.org
Fixes: 4560e78f40cb55bd2ea8f1ef4001c5baa88531c7
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
Merge misc fixes from Andrew Morton:
"15 fixes"
[ This does not merge the "fortify: use WARN instead of BUG for now"
patch, which needs a bit of extra work to build cleanly with all
configurations. Arnd is on it. - Linus ]
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
ocfs2: don't clear SGID when inheriting ACLs
mm: allow page_cache_get_speculative in interrupt context
userfaultfd: non-cooperative: flush event_wqh at release time
ipc: add missing container_of()s for randstruct
cpuset: fix a deadlock due to incomplete patching of cpusets_enabled()
userfaultfd_zeropage: return -ENOSPC in case mm has gone
mm: take memory hotplug lock within numa_zonelist_order_handler()
mm/page_io.c: fix oops during block io poll in swapin path
zram: do not free pool->size_class
kthread: fix documentation build warning
kasan: avoid -Wmaybe-uninitialized warning
userfaultfd: non-cooperative: notify about unmap of destination during mremap
mm, mprotect: flush TLB if potentially racing with a parallel reclaim leaving stale TLB entries
pid: kill pidhash_size in pidhash_init()
mm/hugetlb.c: __get_user_pages ignores certain follow_hugetlb_page errors
|
|
Commit 8fba54aebbdf ("fuse: direct-io: don't dirty ITER_BVEC pages") fixes
the ITER_BVEC page deadlock for direct io in fuse by checking in
fuse_direct_io(), whether the page is a bvec page or not, before locking
it. However, this check is missed when the "async_dio" mount option is
enabled. In this case, set_page_dirty_lock() is called from the req->end
callback in request_end(), when the fuse thread is returning from userspace
to respond to the read request. This will cause the same deadlock because
the bvec condition is not checked in this path.
Here is the stack of the deadlocked thread, while returning from userspace:
[13706.656686] INFO: task glusterfs:3006 blocked for more than 120 seconds.
[13706.657808] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
this message.
[13706.658788] glusterfs D ffffffff816c80f0 0 3006 1
0x00000080
[13706.658797] ffff8800d6713a58 0000000000000086 ffff8800d9ad7000
ffff8800d9ad5400
[13706.658799] ffff88011ffd5cc0 ffff8800d6710008 ffff88011fd176c0
7fffffffffffffff
[13706.658801] 0000000000000002 ffffffff816c80f0 ffff8800d6713a78
ffffffff816c790e
[13706.658803] Call Trace:
[13706.658809] [<ffffffff816c80f0>] ? bit_wait_io_timeout+0x80/0x80
[13706.658811] [<ffffffff816c790e>] schedule+0x3e/0x90
[13706.658813] [<ffffffff816ca7e5>] schedule_timeout+0x1b5/0x210
[13706.658816] [<ffffffff81073ffb>] ? gup_pud_range+0x1db/0x1f0
[13706.658817] [<ffffffff810668fe>] ? kvm_clock_read+0x1e/0x20
[13706.658819] [<ffffffff81066909>] ? kvm_clock_get_cycles+0x9/0x10
[13706.658822] [<ffffffff810f5792>] ? ktime_get+0x52/0xc0
[13706.658824] [<ffffffff816c6f04>] io_schedule_timeout+0xa4/0x110
[13706.658826] [<ffffffff816c8126>] bit_wait_io+0x36/0x50
[13706.658828] [<ffffffff816c7d06>] __wait_on_bit_lock+0x76/0xb0
[13706.658831] [<ffffffffa0545636>] ? lock_request+0x46/0x70 [fuse]
[13706.658834] [<ffffffff8118800a>] __lock_page+0xaa/0xb0
[13706.658836] [<ffffffff810c8500>] ? wake_atomic_t_function+0x40/0x40
[13706.658838] [<ffffffff81194d08>] set_page_dirty_lock+0x58/0x60
[13706.658841] [<ffffffffa054d968>] fuse_release_user_pages+0x58/0x70 [fuse]
[13706.658844] [<ffffffffa0551430>] ? fuse_aio_complete+0x190/0x190 [fuse]
[13706.658847] [<ffffffffa0551459>] fuse_aio_complete_req+0x29/0x90 [fuse]
[13706.658849] [<ffffffffa05471e9>] request_end+0xd9/0x190 [fuse]
[13706.658852] [<ffffffffa0549126>] fuse_dev_do_write+0x336/0x490 [fuse]
[13706.658854] [<ffffffffa054963e>] fuse_dev_write+0x6e/0xa0 [fuse]
[13706.658857] [<ffffffff812a9ef3>] ? security_file_permission+0x23/0x90
[13706.658859] [<ffffffff81205300>] do_iter_readv_writev+0x60/0x90
[13706.658862] [<ffffffffa05495d0>] ? fuse_dev_splice_write+0x350/0x350
[fuse]
[13706.658863] [<ffffffff812062a1>] do_readv_writev+0x171/0x1f0
[13706.658866] [<ffffffff810b3d00>] ? try_to_wake_up+0x210/0x210
[13706.658868] [<ffffffff81206361>] vfs_writev+0x41/0x50
[13706.658870] [<ffffffff81206496>] SyS_writev+0x56/0xf0
[13706.658872] [<ffffffff810257a1>] ? syscall_trace_leave+0xf1/0x160
[13706.658874] [<ffffffff816cbb2e>] system_call_fastpath+0x12/0x71
Fix this by making should_dirty a fuse_io_priv parameter that can be
checked in fuse_aio_complete_req().
Reported-by: Tiger Yang <tiger.yang@oracle.com>
Signed-off-by: Ashish Samant <ashish.samant@oracle.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Pull NFS client fixes from Anna Schumaker:
"Two fixes from Trond this time, now that he's back from his vacation.
The first is a stable fix for the EXCHANGE_ID issue on the mailing
list, and the other fixes a double-free situation that he found at the
same time.
Stable fix:
- Fix EXCHANGE_ID corrupt verifier issue
Other fix:
- Fix double frees in nfs4_test_session_trunk()"
* tag 'nfs-for-4.13-4' of git://git.linux-nfs.org/projects/anna/linux-nfs:
NFSv4: Fix double frees in nfs4_test_session_trunk()
NFSv4: Fix EXCHANGE_ID corrupt verifier issue
|
|
When new directory 'DIR1' is created in a directory 'DIR0' with SGID bit
set, DIR1 is expected to have SGID bit set (and owning group equal to
the owning group of 'DIR0'). However when 'DIR0' also has some default
ACLs that 'DIR1' inherits, setting these ACLs will result in SGID bit on
'DIR1' to get cleared if user is not member of the owning group.
Fix the problem by moving posix_acl_update_mode() out of ocfs2_set_acl()
into ocfs2_iop_set_acl(). That way the function will not be called when
inheriting ACLs which is what we want as it prevents SGID bit clearing
and the mode has been properly set by posix_acl_create() anyway. Also
posix_acl_chmod() that is calling ocfs2_set_acl() takes care of updating
mode itself.
Fixes: 073931017b4 ("posix_acl: Clear SGID bit when setting file permissions")
Link: http://lkml.kernel.org/r/20170801141252.19675-3-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
There may still be threads waiting on event_wqh at the time the
userfault file descriptor is closed. Flush the events wait-queue to
prevent waiting threads from hanging.
Link: http://lkml.kernel.org/r/1501398127-30419-1-git-send-email-rppt@linux.vnet.ibm.com
Fixes: 9cd75c3cd4c3d ("userfaultfd: non-cooperative: add ability to report
non-PF events from uffd descriptor")
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Pavel Emelyanov <xemul@virtuozzo.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|