summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2014-09-04aio: block exit_aio() until all context requests are completedGu Zheng
It seems that exit_aio() also needs to wait for all iocbs to complete (like io_destroy), but we missed the wait step in current implemention, so fix it in the same way as we did in io_destroy. Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org> Cc: stable@vger.kernel.org
2014-09-04udf: saner calling conventions for udf_new_inode()Al Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Jan Kara <jack@suse.cz>
2014-09-04udf: fix the udf_iget() vs. udf_new_inode() racesAl Viro
Currently udf_iget() (triggered by NFS) can race with udf_new_inode() leading to two inode structures with the same inode number: nfsd: iget_locked() creates inode nfsd: try to read from disk, block on that. udf_new_inode(): allocate inode with that inumber udf_new_inode(): insert it into icache, set it up and dirty udf_write_inode(): write inode into buffer cache nfsd: get CPU again, look into buffer cache, see nice and sane on-disk inode, set the in-core inode from it Fix the problem by putting inode into icache in locked state (I_NEW set) and unlocking it only after it's fully set up. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Jan Kara <jack@suse.cz>
2014-09-04udf: merge the pieces inserting a new non-directory object into directoryAl Viro
boilerplate code in udf_{create,mknod,symlink} taken to new helper symlink case converted to unique id calculated by udf_new_inode() - no point finding a new one. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Jan Kara <jack@suse.cz>
2014-09-04udf: Set i_generation fieldJan Kara
Currently UDF doesn't initialize i_generation in any way and thus NFS can easily get reallocated inodes from stale file handles. Luckily UDF already has a unique object identifier associated with each inode - i_unique. Use that for initialization of i_generation. Signed-off-by: Jan Kara <jack@suse.cz>
2014-09-04udf: Properly detect stale inodesJan Kara
NFS can easily ask for inodes that are already deleted. Currently UDF happily returns such inodes which is a bug. Return -ESTALE if udf_read_inode() is asked to read deleted inode. Signed-off-by: Jan Kara <jack@suse.cz>
2014-09-04udf: Make udf_read_inode() and udf_iget() return errorJan Kara
Currently __udf_read_inode() wasn't returning anything and we found out whether we succeeded reading inode by checking whether inode is bad or not. udf_iget() returned NULL on failure and inode pointer otherwise. Make these two functions properly propagate errors up the call stack and use the return value in callers. Signed-off-by: Jan Kara <jack@suse.cz>
2014-09-04udf: Avoid infinite loop when processing indirect ICBsJan Kara
We did not implement any bound on number of indirect ICBs we follow when loading inode. Thus corrupted medium could cause kernel to go into an infinite loop, possibly causing a stack overflow. Fix the possible stack overflow by removing recursion from __udf_read_inode() and limit number of indirect ICBs we follow to avoid infinite loops. Signed-off-by: Jan Kara <jack@suse.cz>
2014-09-04udf: Fold udf_fill_inode() into __udf_read_inode()Jan Kara
There's no good reason to separate these since udf_fill_inode() is called only from __udf_read_inode() and both do part of the same thing. Signed-off-by: Jan Kara <jack@suse.cz>
2014-09-04udf: Avoid dir link count to go negativeJan Kara
If we are writing back inode of unlinked directory, its link count ends up being (u16)-1. Although the inode is deleted, udf_iget() can load the inode when NFS uses stale file handle and get confused. Signed-off-by: Jan Kara <jack@suse.cz>
2014-09-03Merge tag 'for-f2fs-3.17-rc4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs Pull f2fs bug fixes from Jaegeuk Kim: "This series includes patches to: - fix recovery routines - fix bugs related to inline_data/xattr - fix when casting the dentry names - handle EIO or ENOMEM correctly - fix memory leak - fix lock coverage" * tag 'for-f2fs-3.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (28 commits) f2fs: reposition unlock_new_inode to prevent accessing invalid inode f2fs: fix wrong casting for dentry name f2fs: simplify by using a literal f2fs: truncate stale block for inline_data f2fs: use macro for code readability f2fs: introduce need_do_checkpoint for readability f2fs: fix incorrect calculation with total/free inode num f2fs: remove rename and use rename2 f2fs: skip if inline_data was converted already f2fs: remove rewrite_node_page f2fs: avoid double lock in truncate_blocks f2fs: prevent checkpoint during roll-forward f2fs: add WARN_ON in f2fs_bug_on f2fs: handle EIO not to break fs consistency f2fs: check s_dirty under cp_mutex f2fs: unlock_page when node page is redirtied out f2fs: introduce f2fs_cp_error for readability f2fs: give a chance to mount again when encountering errors f2fs: trigger release_dirty_inode in f2fs_put_super f2fs: don't skip checkpoint if there is no dirty node pages ...
2014-09-03ext4: avoid trying to kfree an ERR_PTR pointerTheodore Ts'o
Thanks to Dan Carpenter for extending smatch to find bugs like this. (This was found using a development version of smatch.) Fixes: 36de928641ee48b2078d3fe9514242aaa2f92013 Reported-by: Dan Carpenter <dan.carpenter@oracle.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
2014-09-02Btrfs: fix crash while doing a ranged fsyncFilipe Manana
While doing a ranged fsync, that is, one whose range doesn't cover the whole possible file range (0 to LLONG_MAX), we can crash under certain circumstances with a trace like the following: [41074.641913] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC (...) [41074.642692] CPU: 0 PID: 24580 Comm: fsx Not tainted 3.16.0-fdm-btrfs-next-45+ #1 (...) [41074.643886] RIP: 0010:[<ffffffffa01ecc99>] [<ffffffffa01ecc99>] btrfs_ordered_update_i_size+0x279/0x2b0 [btrfs] (...) [41074.644919] Stack: (...) [41074.644919] Call Trace: [41074.644919] [<ffffffffa01db531>] btrfs_truncate_inode_items+0x3f1/0xa10 [btrfs] [41074.644919] [<ffffffffa01eb54f>] ? btrfs_get_logged_extents+0x4f/0x80 [btrfs] [41074.644919] [<ffffffffa02137a9>] btrfs_log_inode+0x2f9/0x970 [btrfs] [41074.644919] [<ffffffff81090875>] ? sched_clock_local+0x25/0xa0 [41074.644919] [<ffffffff8164a55e>] ? mutex_unlock+0xe/0x10 [41074.644919] [<ffffffff810af51d>] ? trace_hardirqs_on+0xd/0x10 [41074.644919] [<ffffffffa0214b4f>] btrfs_log_inode_parent+0x1ef/0x560 [btrfs] [41074.644919] [<ffffffff811d0c55>] ? dget_parent+0x5/0x180 [41074.644919] [<ffffffffa0215d11>] btrfs_log_dentry_safe+0x51/0x80 [btrfs] [41074.644919] [<ffffffffa01e2d1a>] btrfs_sync_file+0x1ba/0x3e0 [btrfs] [41074.644919] [<ffffffff811eda6b>] vfs_fsync_range+0x1b/0x30 (...) The necessary conditions that lead to such crash are: * an incremental fsync (when the inode doesn't have the BTRFS_INODE_NEEDS_FULL_SYNC flag set) happened for our file and it logged a file extent item ending at offset X; * the file got the flag BTRFS_INODE_NEEDS_FULL_SYNC set in its inode, due to a file truncate operation that reduces the file to a size smaller than X; * a ranged fsync call happens (via an msync for example), with a range that doesn't cover the whole file and the end of this range, lets call it Y, is smaller than X; * btrfs_log_inode, sees the flag BTRFS_INODE_NEEDS_FULL_SYNC set and calls btrfs_truncate_inode_items() to remove all items from the log tree that are associated with our file; * btrfs_truncate_inode_items() removes all of the inode's items, and the lowest file extent item it removed is the one ending at offset X, where X > 0 and X > Y - before returning, it calls btrfs_ordered_update_i_size() with an offset parameter set to X; * btrfs_ordered_update_i_size() sees that X is greater then the current ordered size (btrfs_inode's disk_i_size) and then it assumes there can't be any ongoing ordered operation with a range covering the offset X, calling a BUG_ON() if such ordered operation exists. This assumption is made because the disk_i_size is only increased after the corresponding file extent item is added to the btree (btrfs_finish_ordered_io); * But because our fsync covers only a limited range, such an ordered extent might exist, and our fsync callback (btrfs_sync_file) doesn't wait for such ordered extent to finish when calling btrfs_wait_ordered_range(); And then by the time btrfs_ordered_update_i_size() is called, via: btrfs_sync_file() -> btrfs_log_dentry_safe() -> btrfs_log_inode_parent() -> btrfs_log_inode() -> btrfs_truncate_inode_items() -> btrfs_ordered_update_i_size() We hit the BUG_ON(), which could never happen if the fsync range covered the whole possible file range (0 to LLONG_MAX), as we would wait for all ordered extents to finish before calling btrfs_truncate_inode_items(). So just don't call btrfs_ordered_update_i_size() if we're removing the inode's items from a log tree, which isn't supposed to change the in memory inode's disk_i_size. Issue found while running xfstests/generic/127 (happens very rarely for me), more specifically via the fsx calls that use memory mapped IO (and issue msync calls). Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-09-02Btrfs: fix corruption after write/fsync failure + fsync + log recoveryFilipe Manana
While writing to a file, in inode.c:cow_file_range() (and same applies to submit_compressed_extents()), after reserving an extent for the file data, we create a new extent map for the written range and insert it into the extent map cache. After that, we create an ordered operation, but if it fails (due to a transient/temporary-ENOMEM), we return without dropping that extent map, which points to a reserved extent that is freed when we return. A subsequent incremental fsync (when the btrfs inode doesn't have the flag BTRFS_INODE_NEEDS_FULL_SYNC) considers this extent map valid and logs a file extent item based on that extent map, which points to a disk extent that doesn't contain valid data - it was freed by us earlier, at this point it might contain any random/garbage data. Therefore, if we reach an error condition when cowing a file range after we added the new extent map to the cache, drop it from the cache before returning. Some sequence of steps that lead to this: $ mkfs.btrfs -f /dev/sdd $ mount -o commit=9999 /dev/sdd /mnt $ cd /mnt $ xfs_io -f -c "pwrite -S 0x01 -b 4096 0 4096" -c "fsync" foo $ xfs_io -c "pwrite -S 0x02 -b 4096 4096 4096" $ sync $ od -t x1 foo 0000000 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 * 0010000 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 * 0020000 $ xfs_io -c "pwrite -S 0xa1 -b 4096 0 4096" foo # Now this write + fsync fail with -ENOMEM, which was returned by # btrfs_add_ordered_extent() in inode.c:cow_file_range(). $ xfs_io -c "pwrite -S 0xff -b 4096 4096 4096" foo $ xfs_io -c "fsync" foo fsync: Cannot allocate memory # Now do a new write + fsync, which will succeed. Our previous # -ENOMEM was a transient/temporary error. $ xfs_io -c "pwrite -S 0xee -b 4096 16384 4096" foo $ xfs_io -c "fsync" foo # Our file content (in page cache) is now: $ od -t x1 foo 0000000 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 * 0010000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff * 0020000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 * 0040000 ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee * 0050000 # Now reboot the machine, and mount the fs, so that fsync log replay # takes place. # The file content is now weird, in particular the first 8Kb, which # do not match our data before nor after the sync command above. $ od -t x1 foo 0000000 ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee * 0010000 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 * 0020000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 * 0040000 ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee * 0050000 # In fact these first 4Kb are a duplicate of the last 4kb block. # The last write got an extent map/file extent item that points to # the same disk extent that we got in the write+fsync that failed # with the -ENOMEM error. btrfs-debug-tree and btrfsck allow us to # verify that: $ btrfs-debug-tree /dev/sdd (...) item 6 key (257 EXTENT_DATA 0) itemoff 15819 itemsize 53 extent data disk byte 12582912 nr 8192 extent data offset 0 nr 8192 ram 8192 item 7 key (257 EXTENT_DATA 8192) itemoff 15766 itemsize 53 extent data disk byte 0 nr 0 extent data offset 0 nr 8192 ram 8192 item 8 key (257 EXTENT_DATA 16384) itemoff 15713 itemsize 53 extent data disk byte 12582912 nr 4096 extent data offset 0 nr 4096 ram 4096 $ umount /dev/sdd $ btrfsck /dev/sdd Checking filesystem on /dev/sdd UUID: db5e60e1-050d-41e6-8c7f-3d742dea5d8f checking extents extent item 12582912 has multiple extent items ref mismatch on [12582912 4096] extent item 1, found 2 Backref bytes do not match extent backref, bytenr=12582912, ref bytes=4096, backref bytes=8192 backpointer mismatch on [12582912 4096] Errors found in extent allocation tree or chunk allocation checking free space cache checking fs roots root 5 inode 257 errors 1000, some csum missing found 131074 bytes used err is 1 total csum bytes: 4 total tree bytes: 131072 total fs tree bytes: 32768 total extent tree bytes: 16384 btree space waste bytes: 123404 file data blocks allocated: 274432 referenced 274432 Btrfs v3.14.1-96-gcc7fd5a-dirty Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-09-02aio: add missing smp_rmb() in read_events_ringJeff Moyer
We ran into a case on ppc64 running mariadb where io_getevents would return zeroed out I/O events. After adding instrumentation, it became clear that there was some missing synchronization between reading the tail pointer and the events themselves. This small patch fixes the problem in testing. Thanks to Zach for helping to look into this, and suggesting the fix. Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org> Cc: stable@vger.kernel.org
2014-09-02f2fs: reposition unlock_new_inode to prevent accessing invalid inodeChao Yu
As the race condition on the inode cache, following scenario can appear: [Thread a] [Thread b] ->f2fs_mkdir ->f2fs_add_link ->__f2fs_add_link ->init_inode_metadata failed here ->gc_thread_func ->f2fs_gc ->do_garbage_collect ->gc_data_segment ->f2fs_iget ->iget_locked ->wait_on_inode ->unlock_new_inode ->move_data_page ->make_bad_inode ->iput When we fail in create/symlink/mkdir/mknod/tmpfile, the new allocated inode should be set as bad to avoid being accessed by other thread. But in above scenario, it allows f2fs to access the invalid inode before this inode was set as bad. This patch fix the potential problem, and this issue was found by code review. change log from v1: o Add condition judgment in gc_data_segment() suggested by Changman Lee. o use iget_failed to simplify code. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-09-02xfs: trim eofblocks before collapse rangeBrian Foster
xfs_collapse_file_space() currently writes back the entire file undergoing collapse range to settle things down for the extent shift algorithm. While this prevents changes to the extent list during the collapse operation, the writeback itself is not enough to prevent unnecessary collapse failures. The current shift algorithm uses the extent index to iterate the in-core extent list. If a post-eof delalloc extent persists after the writeback (e.g., a prior zero range op where the end of the range aligns with eof can separate the post-eof blocks such that they are not written back and converted), xfs_bmap_shift_extents() becomes confused over the encoded br_startblock value and fails the collapse. As with the full writeback, this is a temporary fix until the algorithm is improved to cope with a volatile extent list and avoid attempts to shift post-eof extents. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-09-02xfs: xfs_file_collapse_range is delalloc challengedDave Chinner
If we have delalloc extents on a file before we run a collapse range opertaion, we sync the range that we are going to collapse to convert delalloc extents in that region to real extents to simplify the shift operation. However, the shift operation then assumes that the extent list is not going to change as it iterates over the extent list moving things about. Unfortunately, this isn't true because we can't hold the ILOCK over all the operations. We can prevent new IO from modifying the extent list by holding the IOLOCK, but that doesn't prevent writeback from running.... And when writeback runs, it can convert delalloc extents is the range of the file prior to the region being collapsed, and this changes the indexes of all the extents in the file. That causes the collapse range operation to Go Bad. The right fix is to rewrite the extent shift operation not to be dependent on the extent list not changing across the entire operation, but this is a fairly significant piece of work to do. Hence, as a short-term workaround for the problem, sync the entire file before starting a collapse operation to remove all delalloc ranges from the file and so avoid the problem of concurrent writeback changing the extent list. Diagnosed-and-Reported-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-09-02xfs: don't log inode unless extent shift makes extent modificationsBrian Foster
The file collapse mechanism uses xfs_bmap_shift_extents() to collapse all subsequent extents down into the specified, previously punched out, region. This function performs some validation, such as whether a sufficient hole exists in the target region of the collapse, then shifts the remaining exents downward. The exit path of the function currently logs the inode unconditionally. While we must log the inode (and abort) if an error occurs and the transaction is dirty, the initial validation paths can generate errors before the transaction has been dirtied. This creates an unnecessary filesystem shutdown scenario, as the caller will cancel a transaction that has been marked dirty. Modify xfs_bmap_shift_extents() to OR the logflags bits as modifications are made to the inode bmap. Only log the inode in the exit path if logflags has been set. This ensures we only have to cancel a dirty transaction if modifications have been made and prevents an unnecessary filesystem shutdown otherwise. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-09-02xfs: use ranged writeback and invalidation for direct IODave Chinner
Now we are not doing silly things with dirtying buffers beyond EOF and using invalidation correctly, we can finally reduce the ranges of writeback and invalidation used by direct IO to match that of the IO being issued. Bring the writeback and invalidation ranges back to match the generic direct IO code - this will greatly reduce the perturbation of cached data when direct IO and buffered IO are mixed, but still provide the same buffered vs direct IO coherency behaviour we currently have. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-09-02xfs: don't zero partial page cache pages during O_DIRECT writesDave Chinner
Similar to direct IO reads, direct IO writes are using truncate_pagecache_range to invalidate the page cache. This is incorrect due to the sub-block zeroing in the page cache that truncate_pagecache_range() triggers. This patch fixes things by using invalidate_inode_pages2_range instead. It preserves the page cache invalidation, but won't zero any pages. cc: stable@vger.kernel.org Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-09-02xfs: don't zero partial page cache pages during O_DIRECT writesChris Mason
xfs is using truncate_pagecache_range to invalidate the page cache during DIO reads. This is different from the other filesystems who only invalidate pages during DIO writes. truncate_pagecache_range is meant to be used when we are freeing the underlying data structs from disk, so it will zero any partial ranges in the page. This means a DIO read can zero out part of the page cache page, and it is possible the page will stay in cache. buffered reads will find an up to date page with zeros instead of the data actually on disk. This patch fixes things by using invalidate_inode_pages2_range instead. It preserves the page cache invalidation, but won't zero any pages. [dchinner: catch error and warn if it fails. Comment.] cc: stable@vger.kernel.org Signed-off-by: Chris Mason <clm@fb.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-09-02xfs: don't dirty buffers beyond EOFDave Chinner
generic/263 is failing fsx at this point with a page spanning EOF that cannot be invalidated. The operations are: 1190 mapwrite 0x52c00 thru 0x5e569 (0xb96a bytes) 1191 mapread 0x5c000 thru 0x5d636 (0x1637 bytes) 1192 write 0x5b600 thru 0x771ff (0x1bc00 bytes) where 1190 extents EOF from 0x54000 to 0x5e569. When the direct IO write attempts to invalidate the cached page over this range, it fails with -EBUSY and so any attempt to do page invalidation fails. The real question is this: Why can't that page be invalidated after it has been written to disk and cleaned? Well, there's data on the first two buffers in the page (1k block size, 4k page), but the third buffer on the page (i.e. beyond EOF) is failing drop_buffers because it's bh->b_state == 0x3, which is BH_Uptodate | BH_Dirty. IOWs, there's dirty buffers beyond EOF. Say what? OK, set_buffer_dirty() is called on all buffers from __set_page_buffers_dirty(), regardless of whether the buffer is beyond EOF or not, which means that when we get to ->writepage, we have buffers marked dirty beyond EOF that we need to clean. So, we need to implement our own .set_page_dirty method that doesn't dirty buffers beyond EOF. This is messy because the buffer code is not meant to be shared and it has interesting locking issues on the buffer dirty bits. So just copy and paste it and then modify it to suit what we need. Note: the solutions the other filesystems and generic block code use of marking the buffers clean in ->writepage does not work for XFS. It still leaves dirty buffers beyond EOF and invalidations still fail. Hence rather than play whack-a-mole, this patch simply prevents those buffers from being dirtied in the first place. cc: <stable@kernel.org> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-08-30Merge tag 'locks-v3.17-3' of git://git.samba.org/jlayton/linuxLinus Torvalds
Pull file locking bugfx from Jeff Layton: "Just a bugfix for a bug that crept in to v3.15. It's in a rather rare error path, and I'm not aware of anyone having hit it, but it's worth fixing for v3.17" * tag 'locks-v3.17-3' of git://git.samba.org/jlayton/linux: locks: pass correct "before" pointer to locks_unlink_lock in generic_add_lease
2014-08-30fix EBUSY on umount() from MNT_SHRINKABLEAl Viro
We need the parents of victims alive until namespace_unlock() gets to dput() of the (ex-)mountpoints. However, that screws up the "is it busy" checks in case when we have shrinkable mounts that need to be killed. Solution: go ahead and decrement refcounts of parents right in umount_tree(), increment them again just before dropping rwsem in namespace_unlock() (and let the loop in the end of namespace_unlock() finally drop those references for good, as we do now). Parents can't get freed until we drop rwsem - at least one reference is kept until then, both in case when parent is among the victims and when it is not. So they'll still be around when we get to namespace_unlock(). Cc: stable@vger.kernel.org # 3.12+ Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-08-30get rid of propagate_umount() mistakenly treating slaves as busy.Al Viro
The check in __propagate_umount() ("has somebody explicitly mounted something on that slave?") is done *before* taking the already doomed victims out of the child lists. Cc: stable@vger.kernel.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-08-29Merge branch 'akpm' (fixes from Andrew Morton)Linus Torvalds
Merge patches from Andrew Morton: "22 fixes" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (22 commits) kexec: purgatory: add clean-up for purgatory directory Documentation/kdump/kdump.txt: add ARM description flush_icache_range: export symbol to fix build errors tools: selftests: fix build issue with make kselftests target ocfs2: quorum: add a log for node not fenced ocfs2: o2net: set tcp user timeout to max value ocfs2: o2net: don't shutdown connection when idle timeout ocfs2: do not write error flag to user structure we cannot copy from/to x86/purgatory: use approprate -m64/-32 build flag for arch/x86/purgatory drivers/rtc/rtc-s5m.c: re-add support for devices without irq specified xattr: fix check for simultaneous glibc header inclusion kexec: remove CONFIG_KEXEC dependency on crypto kexec: create a new config option CONFIG_KEXEC_FILE for new syscall x86,mm: fix pte_special versus pte_numa hugetlb_cgroup: use lockdep_assert_held rather than spin_is_locked mm/zpool: use prefixed module loading zram: fix incorrect stat with failed_reads lib: turn CONFIG_STACKTRACE into an actual option. mm: actually clear pmd_numa before invalidating memblock, memhotplug: fix wrong type in memblock_find_in_range_node(). ...
2014-08-29ocfs2: quorum: add a log for node not fencedJunxiao Bi
For debug use, we can see from the log whether the fence decision is made and why it is not fenced. Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Reviewed-by: Srinivas Eeda <srinivas.eeda@oracle.com> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Joseph Qi <joseph.qi@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-29ocfs2: o2net: set tcp user timeout to max valueJunxiao Bi
When tcp retransmit timeout(15mins), the connection will be closed. Pending messages may be lost during this time. So we set tcp user timeout to override the retransmit timeout to the max value. This is OK for ocfs2 since we have disk heartbeat, if peer crash, the disk heartbeat will timeout and it will be evicted, if disk heartbeat not timeout and connection idle for a long time, then this means the cluster enters split-brain state, since fence can't happen, we'd better keep the connection and wait network recover. Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Reviewed-by: Srinivas Eeda <srinivas.eeda@oracle.com> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Joseph Qi <joseph.qi@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-29ocfs2: o2net: don't shutdown connection when idle timeoutJunxiao Bi
This patch series is to fix a possible message lost bug in ocfs2 when network go bad. This bug will cause ocfs2 hung forever even network become good again. The messages may lost in this case. After the tcp connection is established between two nodes, an idle timer will be set to check its state periodically, if no messages are received during this time, idle timer will timeout, it will shutdown the connection and try to reconnect, so pending messages in tcp queues will be lost. This messages may be from dlm. Dlm may get hung in this case. This may cause the whole ocfs2 cluster hung. This is very possible to happen when network state goes bad. Do the reconnect is useless, it will fail if network state is still bad. Just waiting there for network recovering may be a good idea, it will not lost messages and some node will be fenced until cluster goes into split-brain state, for this case, Tcp user timeout is used to override the tcp retransmit timeout. It will timeout after 25 days, user should have notice this through the provided log and fix the network, if they don't, ocfs2 will fall back to original reconnect way. This patch (of 3): Some messages in the tcp queue maybe lost if we shutdown the connection and reconnect when idle timeout. If packets lost and reconnect success, then the ocfs2 cluster maybe hung. To fix this, we can leave the connection there and do the fence decision when idle timeout, if network recover before fence dicision is made, the connection survive without lost any messages. This bug can be saw when network state go bad. It may cause ocfs2 hung forever if some packets lost. With this fix, ocfs2 will recover from hung if network becomes good again. Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Reviewed-by: Srinivas Eeda <srinivas.eeda@oracle.com> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Joseph Qi <joseph.qi@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-29ocfs2: do not write error flag to user structure we cannot copy from/toBen Hutchings
If we failed to copy from the structure, writing back the flags leaks 31 bits of kernel memory (the rest of the ir_flags field). In any case, if we cannot copy from/to the structure, why should we expect putting just the flags to work? Also make sure ocfs2_info_handle_freeinode() returns the right error code if the copy_to_user() fails. Fixes: ddee5cdb70e6 ('Ocfs2: Add new OCFS2_IOC_INFO ioctl for ocfs2 v8.') Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Cc: Joel Becker <jlbec@evilplan.org> Acked-by: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-29Merge tag 'nfs-for-3.17-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds
Pull NFS client fixes from Trond Myklebust: "Highlights: - NFSv3 stable fix for another POSIX ACL regression - NFSv4 stable fix for a regression with OPEN_DOWNGRADE - NFSv4 stable fix for bad close() behaviour when holding a delegation" * tag 'nfs-for-3.17-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: NFSv3: Fix another acl regression NFSv4: Don't clear the open state when we just did an OPEN_DOWNGRADE NFSv4: Fix problems with close in the presence of a delegation
2014-08-29Merge tag 'ext4_for_linus_stable' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 bugfixes from Ted Ts'o: "Ext4 bug fixes for 3.17, to provide better handling of memory allocation failures, and to fix some journaling bugs involving journal checksums and FALLOC_FL_ZERO_RANGE" * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: fix same-dir rename when inline data directory overflows jbd2: fix descriptor block size handling errors with journal_csum jbd2: fix infinite loop when recovering corrupt journal blocks ext4: update i_disksize coherently with block allocation on error path ext4: fix transaction issues for ext4_fallocate and ext_zero_range ext4: fix incorect journal credits reservation in ext4_zero_range ext4: move i_size,i_disksize update routines to helper function ext4: fix BUG_ON in mb_free_blocks() ext4: propagate errors up to ext4_find_entry()'s callers
2014-08-29f2fs: fix wrong casting for dentry nameJaegeuk Kim
The dentry name type is unsigned char *. If we don't match this type, some character codes can be changed by signed bit. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-08-28ext4: fix same-dir rename when inline data directory overflowsDarrick J. Wong
When performing a same-directory rename, it's possible that adding or setting the new directory entry will cause the directory to overflow the inline data area, which causes the directory to be converted to an extent-based directory. Under this circumstance it is necessary to re-read the directory when deleting the old dirent because the "old directory" context still points to i_block in the inode table, which is now an extent tree root! The delete fails with an FS error, and the subsequent fsck complains about incorrect link counts and hardlinked directories. Test case (originally found with flat_dir_test in the metadata_csum test program): # mkfs.ext4 -O inline_data /dev/sda # mount /dev/sda /mnt # mkdir /mnt/x # touch /mnt/x/changelog.gz /mnt/x/copyright /mnt/x/README.Debian # sync # for i in /mnt/x/*; do mv $i $i.longer; done # ls -la /mnt/x/ total 0 -rw-r--r-- 1 root root 0 Aug 25 12:03 changelog.gz.longer -rw-r--r-- 1 root root 0 Aug 25 12:03 copyright -rw-r--r-- 1 root root 0 Aug 25 12:03 copyright.longer -rw-r--r-- 1 root root 0 Aug 25 12:03 README.Debian.longer (Hey! Why are there four files now??) Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
2014-08-28jbd2: fix descriptor block size handling errors with journal_csumDarrick J. Wong
It turns out that there are some serious problems with the on-disk format of journal checksum v2. The foremost is that the function to calculate descriptor tag size returns sizes that are too big. This causes alignment issues on some architectures and is compounded by the fact that some parts of jbd2 use the structure size (incorrectly) to determine the presence of a 64bit journal instead of checking the feature flags. Therefore, introduce journal checksum v3, which enlarges the descriptor block tag format to allow for full 32-bit checksums of journal blocks, fix the journal tag function to return the correct sizes, and fix the jbd2 recovery code to use feature flags to determine 64bitness. Add a few function helpers so we don't have to open-code quite so many pieces. Switching to a 16-byte block size was found to increase journal size overhead by a maximum of 0.1%, to convert a 32-bit journal with no checksumming to a 32-bit journal with checksum v3 enabled. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reported-by: TR Reardon <thomas_reardon@hotmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
2014-08-28jbd2: fix infinite loop when recovering corrupt journal blocksDarrick J. Wong
When recovering the journal, don't fall into an infinite loop if we encounter a corrupt journal block. Instead, just skip the block and return an error, which fails the mount and thus forces the user to run a full filesystem fsck. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
2014-08-28ext4: update i_disksize coherently with block allocation on error pathDmitry Monakhov
In case of delalloc block i_disksize may be less than i_size. So we have to update i_disksize each time we allocated and submitted some blocks beyond i_disksize. We weren't doing this on the error paths, so fix this. testcase: xfstest generic/019 Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
2014-08-28f2fs: simplify by using a literalDan Carpenter
We can make the code a bit simpler because we know that "!retry" is zero. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-08-27ext4: fix transaction issues for ext4_fallocate and ext_zero_rangeDmitry Monakhov
After commit f282ac19d86f we use different transactions for preallocation and i_disksize update which result in complain from fsck after power-failure. spotted by generic/019. IMHO this is regression because fs becomes inconsistent, even more 'e2fsck -p' will no longer works (which drives admins go crazy) Same transaction requirement applies ctime,mtime updates testcase: xfstest generic/019 Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
2014-08-27ext4: fix incorect journal credits reservation in ext4_zero_rangeDmitry Monakhov
Currently we reserve only 4 blocks but in worst case scenario ext4_zero_partial_blocks() may want to zeroout and convert two non adjacent blocks. Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
2014-08-27Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "The biggest of these comes from Liu Bo, who tracked down a hang we've been hitting since moving to kernel workqueues (it's a btrfs bug, not in the generic code). His patch needs backporting to 3.16 and 3.15 stable, which I'll send once this is in. Otherwise these are assorted fixes. Most were integrated last week during KS, but I wanted to give everyone the chance to test the result, so I waited for rc2 to come out before sending" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (24 commits) Btrfs: fix task hang under heavy compressed write Btrfs: fix filemap_flush call in btrfs_file_release Btrfs: fix crash on endio of reading corrupted block btrfs: fix leak in qgroup_subtree_accounting() error path btrfs: Use right extent length when inserting overlap extent map. Btrfs: clone, don't create invalid hole extent map Btrfs: don't monopolize a core when evicting inode Btrfs: fix hole detection during file fsync Btrfs: ensure tmpfile inode is always persisted with link count of 0 Btrfs: race free update of commit root for ro snapshots Btrfs: fix regression of btrfs device replace Btrfs: don't consider the missing device when allocating new chunks Btrfs: Fix wrong device size when we are resizing the device Btrfs: don't write any data into a readonly device when scrub Btrfs: Fix the problem that the replace destroys the seed filesystem btrfs: Return right extent when fiemap gives unaligned offset and len. Btrfs: fix wrong extent mapping for DirectIO Btrfs: fix wrong write range for filemap_fdatawrite_range() Btrfs: fix wrong missing device counter decrease Btrfs: fix unzeroed members in fs_devices when creating a fs from seed fs ...
2014-08-27Btrfs: fix autodefrag with compressionChris Mason
The autodefrag code skips defrag when two extents are adjacent. But one big advantage for autodefrag is cutting down on the number of small extents, even when they are adjacent. This commit changes it to defrag all small extents. Signed-off-by: Chris Mason <clm@fb.com>
2014-08-26NFSv3: Fix another acl regressionTrond Myklebust
When creating a new object on the NFS server, we should not be sending posix setacl requests unless the preceding posix_acl_create returned a non-trivial acl. Doing so, causes Solaris servers in particular to return an EINVAL. Fixes: 013cdf1088d72 (nfs: use generic posix ACL infrastructure,,,) Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1132786 Cc: stable@vger.kernel.org # 3.14+ Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-08-26NFSv4: Don't clear the open state when we just did an OPEN_DOWNGRADETrond Myklebust
If we did an OPEN_DOWNGRADE, then the right thing to do on success, is to apply the new open mode to the struct nfs4_state. Instead, we were unconditionally clearing the state, making it appear to our state machinery as if we had just performed a CLOSE. Fixes: 226056c5c312b (NFSv4: Use correct locking when updating nfs4_state...) Cc: stable@vger.kernel.org # 3.15+ Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-08-26NFSv4: Fix problems with close in the presence of a delegationTrond Myklebust
In the presence of delegations, we can no longer assume that the state->n_rdwr, state->n_rdonly, state->n_wronly reflect the open stateid share mode, and so we need to calculate the initial value for calldata->arg.fmode using the state->flags. Reported-by: James Drews <drews@engr.wisc.edu> Fixes: 88069f77e1ac5 (NFSv41: Fix a potential state leakage when...) Cc: stable@vger.kernel.org # 2.6.33+ Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-08-25Merge tag 'nfs-for-3.17-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds
Pull NFS client fixes from Trond Myklebust: "Highlights: - more fixes for read/write codepath regressions * sleeping while holding the inode lock * stricter enforcement of page contiguity when coalescing requests * fix up error handling in the page coalescing code - don't busy wait on SIGKILL in the file locking code" * tag 'nfs-for-3.17-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: nfs: Don't busy-wait on SIGKILL in __nfs_iocounter_wait nfs: can_coalesce_requests must enforce contiguity nfs: disallow duplicate pages in pgio page vectors nfs: don't sleep with inode lock in lock_and_join_requests nfs: fix error handling in lock_and_join_requests nfs: use blocking page_group_lock in add_request nfs: fix nonblocking calls to nfs_page_group_lock nfs: change nfs_page_group_lock argument
2014-08-25Clarify Kconfig help text for CIFS and SMB2/SMB3Steve French
Clarify descriptions of SMB2 and SMB3 support in Kconfig Signed-off-by: Steve French <smfrench@gmail.com> Reviewed-by: Pavel Shilovsky <pshilovsky@samba.org>
2014-08-25f2fs: truncate stale block for inline_dataJaegeuk Kim
This verifies to truncate any allocated blocks, offset[0], by inline_data. Not figured out, but for making sure. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-08-25CIFS: Fix wrong filename length for SMB2Pavel Shilovsky
The existing code uses the old MAX_NAME constant. This causes XFS test generic/013 to fail. Fix it by replacing MAX_NAME with PATH_MAX that SMB1 uses. Also remove an unused MAX_NAME constant definition. Cc: <stable@vger.kernel.org> # v3.7+ Signed-off-by: Pavel Shilovsky <pshilovsky@samba.org> Signed-off-by: Steve French <smfrench@gmail.com>