summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2014-10-30ext4: bail early when clearing inode journal flag failsJan Kara
When clearing inode journal flag, we call jbd2_journal_flush() to force all the journalled data to their final locations. Currently we ignore when this fails and continue clearing inode journal flag. This isn't a big problem because when jbd2_journal_flush() fails, journal is likely aborted anyway. But it can still lead to somewhat confusing results so rather bail out early. Coverity-id: 989044 Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-10-30ext4: bail out from make_indexed_dir() on first errorJan Kara
When ext4_handle_dirty_dx_node() or ext4_handle_dirty_dirent_node() fail, there's really something wrong with the fs and there's no point in continuing further. Just return error from make_indexed_dir() in that case. Also initialize frames array so that if we return early due to error, dx_release() doesn't try to dereference uninitialized memory (which could happen also due to error in do_split()). Coverity-id: 741300 Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
2014-10-30jbd2: use a better hash function for the revoke tableTheodore Ts'o
The old hash function didn't work well for 64-bit block numbers, and used undefined (negative) shift right behavior. Use the generic 64-bit hash function instead. Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reported-by: Andrey Ryabinin <a.ryabinin@samsung.com>
2014-10-30ext4: prevent bugon on race between write/fcntlDmitry Monakhov
O_DIRECT flags can be toggeled via fcntl(F_SETFL). But this value checked twice inside ext4_file_write_iter() and __generic_file_write() which result in BUG_ON inside ext4_direct_IO. Let's initialize iocb->private unconditionally. TESTCASE: xfstest:generic/036 https://patchwork.ozlabs.org/patch/402445/ #TYPICAL STACK TRACE: kernel BUG at fs/ext4/inode.c:2960! invalid opcode: 0000 [#1] SMP Modules linked in: brd iTCO_wdt lpc_ich mfd_core igb ptp dm_mirror dm_region_hash dm_log dm_mod CPU: 6 PID: 5505 Comm: aio-dio-fcntl-r Not tainted 3.17.0-rc2-00176-gff5c017 #161 Hardware name: Intel Corporation W2600CR/W2600CR, BIOS SE5C600.86B.99.99.x028.061320111235 06/13/2011 task: ffff88080e95a7c0 ti: ffff88080f908000 task.ti: ffff88080f908000 RIP: 0010:[<ffffffff811fabf2>] [<ffffffff811fabf2>] ext4_direct_IO+0x162/0x3d0 RSP: 0018:ffff88080f90bb58 EFLAGS: 00010246 RAX: 0000000000000400 RBX: ffff88080fdb2a28 RCX: 00000000a802c818 RDX: 0000040000080000 RSI: ffff88080d8aeb80 RDI: 0000000000000001 RBP: ffff88080f90bbc8 R08: 0000000000000000 R09: 0000000000001581 R10: 0000000000000000 R11: 0000000000000000 R12: ffff88080d8aeb80 R13: ffff88080f90bbf8 R14: ffff88080fdb28c8 R15: ffff88080fdb2a28 FS: 00007f23b2055700(0000) GS:ffff880818400000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f23b2045000 CR3: 000000080cedf000 CR4: 00000000000407e0 Stack: ffff88080f90bb98 0000000000000000 7ffffffffffffffe ffff88080fdb2c30 0000000000000200 0000000000000200 0000000000000001 0000000000000200 ffff88080f90bbc8 ffff88080fdb2c30 ffff88080f90be08 0000000000000200 Call Trace: [<ffffffff8112ca9d>] generic_file_direct_write+0xed/0x180 [<ffffffff8112f2b2>] __generic_file_write_iter+0x222/0x370 [<ffffffff811f495b>] ext4_file_write_iter+0x34b/0x400 [<ffffffff811bd709>] ? aio_run_iocb+0x239/0x410 [<ffffffff811bd709>] ? aio_run_iocb+0x239/0x410 [<ffffffff810990e5>] ? local_clock+0x25/0x30 [<ffffffff810abd94>] ? __lock_acquire+0x274/0x700 [<ffffffff811f4610>] ? ext4_unwritten_wait+0xb0/0xb0 [<ffffffff811bd756>] aio_run_iocb+0x286/0x410 [<ffffffff810990e5>] ? local_clock+0x25/0x30 [<ffffffff810ac359>] ? lock_release_holdtime+0x29/0x190 [<ffffffff811bc05b>] ? lookup_ioctx+0x4b/0xf0 [<ffffffff811bde3b>] do_io_submit+0x55b/0x740 [<ffffffff811bdcaa>] ? do_io_submit+0x3ca/0x740 [<ffffffff811be030>] SyS_io_submit+0x10/0x20 [<ffffffff815ce192>] system_call_fastpath+0x16/0x1b Code: 01 48 8b 80 f0 01 00 00 48 8b 18 49 8b 45 10 0f 85 f1 01 00 00 48 03 45 c8 48 3b 43 48 0f 8f e3 01 00 00 49 83 7c 24 18 00 75 04 <0f> 0b eb fe f0 ff 83 ec 01 00 00 49 8b 44 24 18 8b 00 85 c0 89 RIP [<ffffffff811fabf2>] ext4_direct_IO+0x162/0x3d0 RSP <ffff88080f90bb58> Reported-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Cc: stable@vger.kernel.org
2014-10-30ext4: remove extent status procfs files if journal load failsDarrick J. Wong
If we can't load the journal, remove the procfs files for the extent status information file to avoid leaking resources. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
2014-10-30ext4: disallow changing journal_csum option during remountDarrick J. Wong
ext4 does not permit changing the metadata or journal checksum feature flag while mounted. Until we decide to support that, don't allow a remount to change the journal_csum flag (right now we silently fail to change anything). Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-10-30ext4: enable journal checksum when metadata checksum feature enabledDarrick J. Wong
If metadata checksumming is turned on for the FS, we need to tell the journal to use checksumming too. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
2014-10-30ext4: fix oops when loading block bitmap failedJan Kara
When we fail to load block bitmap in __ext4_new_inode() we will dereference NULL pointer in ext4_journal_get_write_access(). So check for error from ext4_read_block_bitmap(). Coverity-id: 989065 Cc: stable@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-10-30ext4: fix overflow when updating superblock backups after resizeJan Kara
When there are no meta block groups update_backups() will compute the backup block in 32-bit arithmetics thus possibly overflowing the block number and corrupting the filesystem. OTOH filesystems without meta block groups larger than 16 TB should be rare. Fix the problem by doing the counting in 64-bit arithmetics. Coverity-id: 741252 CC: stable@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Lukas Czerner <lczerner@redhat.com>
2014-10-29Merge branch 'akpm' (incoming from Andrew Morton)Linus Torvalds
Merge misc fixes from Andrew Morton: "21 fixes" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (21 commits) mm/balloon_compaction: fix deflation when compaction is disabled sh: fix sh770x SCIF memory regions zram: avoid NULL pointer access in concurrent situation mm/slab_common: don't check for duplicate cache names ocfs2: fix d_splice_alias() return code checking mm: rmap: split out page_remove_file_rmap() mm: memcontrol: fix missed end-writeback page accounting mm: page-writeback: inline account_page_dirtied() into single caller lib/bitmap.c: fix undefined shift in __bitmap_shift_{left|right}() drivers/rtc/rtc-bq32k.c: fix register value memory-hotplug: clear pgdat which is allocated by bootmem in try_offline_node() drivers/rtc/rtc-s3c.c: fix initialization failure without rtc source clock kernel/kmod: fix use-after-free of the sub_info structure drivers/rtc/rtc-pm8xxx.c: rework to support pm8941 rtc mm, thp: fix collapsing of hugepages on madvise drivers: of: add return value to of_reserved_mem_device_init() mm: free compound page with correct order gcov: add ARM64 to GCOV_PROFILE_ALL fsnotify: next_i is freed during fsnotify_unmount_inodes. mm/compaction.c: avoid premature range skip in isolate_migratepages_range ...
2014-10-30xfs: rework zero range to prevent invalid i_size updatesBrian Foster
The zero range operation is analogous to fallocate with the exception of converting the range to zeroes. E.g., it attempts to allocate zeroed blocks over the range specified by the caller. The XFS implementation kills all delalloc blocks currently over the aligned range, converts the range to allocated zero blocks (unwritten extents) and handles the partial pages at the ends of the range by sending writes through the pagecache. The current implementation suffers from several problems associated with inode size. If the aligned range covers an extending I/O, said I/O is discarded and an inode size update from a previous write never makes it to disk. Further, if an unaligned zero range extends beyond eof, the page write induced for the partial end page can itself increase the inode size, even if the zero range request is not supposed to update i_size (via KEEP_SIZE, similar to an fallocate beyond EOF). The latter behavior not only incorrectly increases the inode size, but can lead to stray delalloc blocks on the inode. Typically, post-eof preallocation blocks are either truncated on release or inode eviction or explicitly written to by xfs_zero_eof() on natural file size extension. If the inode size increases due to zero range, however, associated blocks leak into the address space having never been converted or mapped to pagecache pages. A direct I/O to such an uncovered range cannot convert the extent via writeback and will BUG(). For example: $ xfs_io -fc "pwrite 0 128k" -c "fzero -k 1m 54321" <file> ... $ xfs_io -d -c "pread 128k 128k" <file> <BUG> If the entire delalloc extent happens to not have page coverage whatsoever (e.g., delalloc conversion couldn't find a large enough free space extent), even a full file writeback won't convert what's left of the extent and we'll assert on inode eviction. Rework xfs_zero_file_space() to avoid buffered I/O for partial pages. Use the existing hole punch and prealloc mechanisms as primitives for zero range. This implementation is not efficient nor ideal as we writeback dirty data over the range and remove existing extents rather than convert to unwrittern. The former writeback, however, is currently the only mechanism available to ensure consistency between pagecache and extent state. Even a pagecache truncate/delalloc punch prior to hole punch has lead to inconsistencies due to racing with writeback. This provides a consistent, correct implementation of zero range that survives fsstress/fsx testing without assert failures. The implementation can be optimized from this point forward once the fundamental issue of pagecache and delalloc extent state consistency is addressed. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-10-30xfs: Check error during inode btree iteration in xfs_bulkstat()Jan Kara
xfs_bulkstat() doesn't check error return from xfs_btree_increment(). In case of specific fs corruption that could result in xfs_bulkstat() entering an infinite loop because we would be looping over the same chunk over and over again. Fix the problem by checking the return value and terminating the loop properly. Coverity-id: 1231338 cc: <stable@vger.kernel.org> Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Jie Liu <jeff.u.liu@gmail.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-10-29ocfs2: fix d_splice_alias() return code checkingRichard Weinberger
d_splice_alias() can return a valid dentry, NULL or an ERR_PTR. Currently the code checks not for ERR_PTR and will cuase an oops in ocfs2_dentry_attach_lock(). Fix this by using IS_ERR_OR_NULL(). Signed-off-by: Richard Weinberger <richard@nod.at> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-29fsnotify: next_i is freed during fsnotify_unmount_inodes.Jerry Hoemann
During file system stress testing on 3.10 and 3.12 based kernels, the umount command occasionally hung in fsnotify_unmount_inodes in the section of code: spin_lock(&inode->i_lock); if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) { spin_unlock(&inode->i_lock); continue; } As this section of code holds the global inode_sb_list_lock, eventually the system hangs trying to acquire the lock. Multiple crash dumps showed: The inode->i_state == 0x60 and i_count == 0 and i_sb_list would point back at itself. As this is not the value of list upon entry to the function, the kernel never exits the loop. To help narrow down problem, the call to list_del_init in inode_sb_list_del was changed to list_del. This poisons the pointers in the i_sb_list and causes a kernel to panic if it transverse a freed inode. Subsequent stress testing paniced in fsnotify_unmount_inodes at the bottom of the list_for_each_entry_safe loop showing next_i had become free. We believe the root cause of the problem is that next_i is being freed during the window of time that the list_for_each_entry_safe loop temporarily releases inode_sb_list_lock to call fsnotify and fsnotify_inode_delete. The code in fsnotify_unmount_inodes attempts to prevent the freeing of inode and next_i by calling __iget. However, the code doesn't do the __iget call on next_i if i_count == 0 or if i_state & (I_FREEING | I_WILL_FREE) The patch addresses this issue by advancing next_i in the above two cases until we either find a next_i which we can __iget or we reach the end of the list. This makes the handling of next_i more closely match the handling of the variable "inode." The time to reproduce the hang is highly variable (from hours to days.) We ran the stress test on a 3.10 kernel with the proposed patch for a week without failure. During list_for_each_entry_safe, next_i is becoming free causing the loop to never terminate. Advance next_i in those cases where __iget is not done. Signed-off-by: Jerry Hoemann <jerry.hoemann@hp.com> Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Cc: Ken Helias <kenhelias@firemail.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-29Merge branch 'for-linus' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block layer fixes from Jens Axboe: "A small collection of fixes for the current kernel. This contains: - Two error handling fixes from Jan Kara. One for null_blk on failure to add a device, and the other for the block/scsi_ioctl SCSI_IOCTL_SEND_COMMAND fixing up the error jump point. - A commit added in the merge window for the bio integrity bits unfortunately disabled merging for all requests if CONFIG_BLK_DEV_INTEGRITY wasn't set. Reverse the logic, so that integrity checking wont disallow merges when not enabled. - A fix from Ming Lei for merging and generating too many segments. This caused a BUG in virtio_blk. - Two error handling printk() fixups from Robert Elliott, improving the information given when we rate limit. - Error handling fixup on elevator_init() failure from Sudip Mukherjee. - A fix from Tony Battersby, fixing up a memory leak in the scatterlist handling with scsi-mq" * 'for-linus' of git://git.kernel.dk/linux-block: block: Fix merge logic when CONFIG_BLK_DEV_INTEGRITY is not defined lib/scatterlist: fix memory leak with scsi-mq block: fix wrong error return in elevator_init() scsi: Fix error handling in SCSI_IOCTL_SEND_COMMAND null_blk: Cleanup error recovery in null_add_dev() blk-merge: recaculate segment if it isn't less than max segments fs: clarify rate limit suppressed buffer I/O errors fs: merge I/O error prints into one line
2014-10-28isofs_cmp(): we'll never see a dentry for . or ..Al Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-10-28overlayfs: fix lockdep misannotationMiklos Szeredi
In an overlay directory that shadows an empty lower directory, say /mnt/a/empty102, do: touch /mnt/a/empty102/x unlink /mnt/a/empty102/x rmdir /mnt/a/empty102 It's actually harmless, but needs another level of nesting between I_MUTEX_CHILD and I_MUTEX_NORMAL. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Tested-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-10-28ovl: fix check for cursorMiklos Szeredi
ovl_cache_entry.name is now an array not a pointer, so it makes no sense test for it being NULL. Detected by coverity. From: Miklos Szeredi <mszeredi@suse.cz> Fixes: 68bf8611076a ("overlayfs: make ovl_cache_entry->name an array instead of +pointer") Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-10-28overlayfs: barriers for opening upper-layer directoryAl Viro
make sure that a) all stores done by opening struct file don't leak past storing the reference in od->upperfile b) the lockless side has read dependency barrier Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-10-29xfs: bulkstat doesn't release AGI buffer on errorDave Chinner
The recent refactoring of the bulkstat code left a small landmine in the code. If a inobt read fails, then the tree walk is aborted and returns without releasing the AGI buffer or freeing the cursor. This can lead to a subsequent bulkstat call hanging trying to grab the AGI buffer again. cc: <stable@vger.kernel.org> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-10-28Btrfs: fix race that makes btrfs_lookup_extent_info miss skinny extent itemsFilipe Manana
We have a race that can lead us to miss skinny extent items in the function btrfs_lookup_extent_info() when the skinny metadata feature is enabled. So basically the sequence of steps is: 1) We search in the extent tree for the skinny extent, which returns > 0 (not found); 2) We check the previous item in the returned leaf for a non-skinny extent, and we don't find it; 3) Because we didn't find the non-skinny extent in step 2), we release our path to search the extent tree again, but this time for a non-skinny extent key; 4) Right after we released our path in step 3), a skinny extent was inserted in the extent tree (delayed refs were run) - our second extent tree search will miss it, because it's not looking for a skinny extent; 5) After the second search returned (with ret > 0), we look for any delayed ref for our extent's bytenr (and we do it while holding a read lock on the leaf), but we won't find any, as such delayed ref had just run and completed after we released out path in step 3) before doing the second search. Fix this by removing completely the path release and re-search logic. This is safe, because if we seach for a metadata item and we don't find it, we have the guarantee that the returned leaf is the one where the item would be inserted, and so path->slots[0] > 0 and path->slots[0] - 1 must be the slot where the non-skinny extent item is if it exists. The only case where path->slots[0] is zero is when there are no smaller keys in the tree (i.e. no left siblings for our leaf), in which case the re-search logic isn't needed as well. This race has been present since the introduction of skinny metadata (change 3173a18f70554fe7880bb2d85c7da566e364eb3c). Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-10-28Merge branch 'for-3.18' of git://linux-nfs.org/~bfields/linuxLinus Torvalds
Pull two nfsd fixes from Bruce Fields: "One regression from the 3.16 xdr rewrite, one an older bug exposed by a separate bug in the client's new SEEK code" * 'for-3.18' of git://linux-nfs.org/~bfields/linux: nfsd4: fix crash on unknown operation number nfsd4: fix response size estimation for OP_SEQUENCE
2014-10-27Btrfs: properly clean up btrfs_end_io_wq_cacheJosef Bacik
In one of Dave's cleanup commits he forgot to call btrfs_end_io_wq_exit on unload, which makes us unable to unload and then re-load the btrfs module. This fixes the problem. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Reviewed-by: David Sterba <dsterba@suse.cz> Reviewed-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-10-27Btrfs: fix invalid leaf slot access in btrfs_lookup_extent()Filipe Manana
If we couldn't find our extent item, we accessed the current slot (path->slots[0]) to check if it corresponds to an equivalent skinny metadata item. However this slot could be beyond our last item in the leaf (i.e. path->slots[0] >= btrfs_header_nritems(leaf)), in which case we shouldn't process it. Since btrfs_lookup_extent() is only used to find extent items for data extents, fix this by removing completely the logic that looks up for an equivalent skinny metadata item, since it can not exist. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-10-27btrfs: use macro accessors in superblock validation checksDavid Sterba
The initial patch c926093ec516f5d316 (btrfs: add more superblock checks) did not properly use the macro accessors that wrap endianness and the code would not work correctly on big endian machines. Reported-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
2014-10-26Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs updates from Al Viro: "overlayfs merge + leak fix for d_splice_alias() failure exits" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: overlayfs: embed middle into overlay_readdir_data overlayfs: embed root into overlay_readdir_data overlayfs: make ovl_cache_entry->name an array instead of pointer overlayfs: don't hold ->i_mutex over opening the real directory fix inode leaks on d_splice_alias() failure exits fs: limit filesystem stacking depth overlay: overlay filesystem documentation overlayfs: implement show_options overlayfs: add statfs support overlay filesystem shmem: support RENAME_WHITEOUT ext4: support RENAME_WHITEOUT vfs: add RENAME_WHITEOUT vfs: add whiteout support vfs: export check_sticky() vfs: introduce clone_private_mount() vfs: export __inode_permission() to modules vfs: export do_splice_direct() to modules vfs: add i_op->dentry_open()
2014-10-24overlayfs: embed middle into overlay_readdir_dataAl Viro
same story... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-10-24overlayfs: embed root into overlay_readdir_dataAl Viro
no sense having it a pointer - all instances have it pointing to local variable in the same stack frame Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-10-24overlayfs: make ovl_cache_entry->name an array instead of pointerAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-10-24overlayfs: don't hold ->i_mutex over opening the real directoryAl Viro
just use it to serialize the assignment Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-10-23Merge branch 'overlayfs.v25' of ↵Al Viro
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs into for-linus
2014-10-23fix inode leaks on d_splice_alias() failure exitsAl Viro
d_splice_alias() callers expect it to either stash the inode reference into a new alias, or drop the inode reference. That makes it possible to just return d_splice_alias() result from ->lookup() instance, without any extra housekeeping required. Unfortunately, that should include the failure exits. If d_splice_alias() returns an error, it leaves the dentry it has been given negative and thus it *must* drop the inode reference. Easily fixed, but it goes way back and will need backporting. Cc: stable@vger.kernel.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-10-24fs: limit filesystem stacking depthMiklos Szeredi
Add a simple read-only counter to super_block that indicates how deep this is in the stack of filesystems. Previously ecryptfs was the only stackable filesystem and it explicitly disallowed multiple layers of itself. Overlayfs, however, can be stacked recursively and also may be stacked on top of ecryptfs or vice versa. To limit the kernel stack usage we must limit the depth of the filesystem stack. Initially the limit is set to 2. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-10-24overlayfs: implement show_optionsErez Zadok
This is useful because of the stacking nature of overlayfs. Users like to find out (via /proc/mounts) which lower/upper directory were used at mount time. AV: even failing ovl_parse_opt() could've done some kstrdup() AV: failure of ovl_alloc_entry() should end up with ENOMEM, not EINVAL Signed-off-by: Erez Zadok <ezk@cs.sunysb.edu> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-10-24overlayfs: add statfs supportAndy Whitcroft
Add support for statfs to the overlayfs filesystem. As the upper layer is the target of all write operations assume that the space in that filesystem is the space in the overlayfs. There will be some inaccuracy as overwriting a file will copy it up and consume space we were not expecting, but it is better than nothing. Use the upper layer dentry and mount from the overlayfs root inode, passing the statfs call to that filesystem. Signed-off-by: Andy Whitcroft <apw@canonical.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-10-24overlay filesystemMiklos Szeredi
Overlayfs allows one, usually read-write, directory tree to be overlaid onto another, read-only directory tree. All modifications go to the upper, writable layer. This type of mechanism is most often used for live CDs but there's a wide variety of other uses. The implementation differs from other "union filesystem" implementations in that after a file is opened all operations go directly to the underlying, lower or upper, filesystems. This simplifies the implementation and allows native performance in these cases. The dentry tree is duplicated from the underlying filesystems, this enables fast cached lookups without adding special support into the VFS. This uses slightly more memory than union mounts, but dentries are relatively small. Currently inodes are duplicated as well, but it is a possible optimization to share inodes for non-directories. Opening non directories results in the open forwarded to the underlying filesystem. This makes the behavior very similar to union mounts (with the same limitations vs. fchmod/fchown on O_RDONLY file descriptors). Usage: mount -t overlayfs overlayfs -olowerdir=/lower,upperdir=/upper/upper,workdir=/upper/work /overlay The following cotributions have been folded into this patch: Neil Brown <neilb@suse.de>: - minimal remount support - use correct seek function for directories - initialise is_real before use - rename ovl_fill_cache to ovl_dir_read Felix Fietkau <nbd@openwrt.org>: - fix a deadlock in ovl_dir_read_merged - fix a deadlock in ovl_remove_whiteouts Erez Zadok <ezk@fsl.cs.sunysb.edu> - fix cleanup after WARN_ON Sedat Dilek <sedat.dilek@googlemail.com> - fix up permission to confirm to new API Robin Dong <hao.bigrat@gmail.com> - fix possible leak in ovl_new_inode - create new inode in ovl_link Andy Whitcroft <apw@canonical.com> - switch to __inode_permission() - copy up i_uid/i_gid from the underlying inode AV: - ovl_copy_up_locked() - dput(ERR_PTR(...)) on two failure exits - ovl_clear_empty() - one failure exit forgetting to do unlock_rename(), lack of check for udir being the parent of upper, dropping and regaining the lock on udir (which would require _another_ check for parent being right). - bogus d_drop() in copyup and rename [fix from your mail] - copyup/remove and copyup/rename races [fix from your mail] - ovl_dir_fsync() leaving ERR_PTR() in ->realfile - ovl_entry_free() is pointless - it's just a kfree_rcu() - fold ovl_do_lookup() into ovl_lookup() - manually assigning ->d_op is wrong. Just use ->s_d_op. [patches picked from Miklos]: * copyup/remove and copyup/rename races * bogus d_drop() in copyup and rename Also thanks to the following people for testing and reporting bugs: Jordi Pujol <jordipujolp@gmail.com> Andy Whitcroft <apw@canonical.com> Michal Suchanek <hramrach@centrum.cz> Felix Fietkau <nbd@openwrt.org> Erez Zadok <ezk@fsl.cs.sunysb.edu> Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-10-24ext4: support RENAME_WHITEOUTMiklos Szeredi
Add whiteout support to ext4_rename(). A whiteout inode (chrdev/0,0) is created before the rename takes place. The whiteout inode is added to the old entry instead of deleting it. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-10-24vfs: add RENAME_WHITEOUTMiklos Szeredi
This adds a new RENAME_WHITEOUT flag. This flag makes rename() create a whiteout of source. The whiteout creation is atomic relative to the rename. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-10-24vfs: add whiteout supportMiklos Szeredi
Whiteout isn't actually a new file type, but is represented as a char device (Linus's idea) with 0/0 device number. This has several advantages compared to introducing a new whiteout file type: - no userspace API changes (e.g. trivial to make backups of upper layer filesystem, without losing whiteouts) - no fs image format changes (you can boot an old kernel/fsck without whiteout support and things won't break) - implementation is trivial Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-10-24vfs: export check_sticky()Miklos Szeredi
It's already duplicated in btrfs and about to be used in overlayfs too. Move the sticky bit check to an inline helper and call the out-of-line helper only in the unlikly case of the sticky bit being set. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-10-24vfs: introduce clone_private_mount()Miklos Szeredi
Overlayfs needs a private clone of the mount, so create a function for this and export to modules. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-10-24vfs: export __inode_permission() to modulesMiklos Szeredi
We need to be able to check inode permissions (but not filesystem implied permissions) for stackable filesystems. Expose this interface for overlayfs. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-10-24vfs: export do_splice_direct() to modulesMiklos Szeredi
Export do_splice_direct() to modules. Needed by overlay filesystem. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-10-24vfs: add i_op->dentry_open()Miklos Szeredi
Add a new inode operation i_op->dentry_open(). This is for stacked filesystems that want to return a struct file from a different filesystem. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-10-23nfsd4: fix crash on unknown operation numberJ. Bruce Fields
Unknown operation numbers are caught in nfsd4_decode_compound() which sets op->opnum to OP_ILLEGAL and op->status to nfserr_op_illegal. The error causes the main loop in nfsd4_proc_compound() to skip most processing. But nfsd4_proc_compound also peeks ahead at the next operation in one case and doesn't take similar precautions there. Cc: stable@vger.kernel.org Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-10-22fs, jbd: use a more generic hash functionSasha Levin
While the hash function used by the revoke hashtable is good somewhere else, it's not really good here. The default hash shift (8) means that one third of the hashing function gets lost (and is undefined anyways (8 - 12 = negative shift)): "(block << (hash_shift - 12))) & (table->hash_size - 1)" Instead, just use the kernel's generic hash function that gets used everywhere else. Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Jan Kara <jack@suse.cz>
2014-10-22quota: Properly return errors from dquot_writeback_dquots()Jan Kara
Due to a switched left and right side of an assignment, dquot_writeback_dquots() never returned error. This could result in errors during quota writeback to not be reported to userspace properly. Fix it. CC: stable@vger.kernel.org Coverity-id: 1226884 Signed-off-by: Jan Kara <jack@suse.cz>
2014-10-22ext3: Don't check quota format when there are no quota filesJan Kara
The check whether quota format is set even though there are no quota files with journalled quota is pointless and it actually makes it impossible to turn off journalled quotas (as there's no way to unset journalled quota format). Just remove the check. CC: stable@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz>
2014-10-21fs: clarify rate limit suppressed buffer I/O errorsRobert Elliott
When quiet_error applies rate limiting to buffer_io_error calls, what the they apply to is unclear because the name is so generic, particularly if the messages are interleaved with others: [ 1936.063572] quiet_error: 664293 callbacks suppressed [ 1936.065297] Buffer I/O error on dev sdr, logical block 257429952, lost async page write [ 1936.067814] Buffer I/O error on dev sdr, logical block 257429953, lost async page write Also, the function uses printk_ratelimit(), although printk.h includes a comment advising "Please don't use... Instead use printk_ratelimited()." Change buffer_io_error to check the BH_Quiet bit itself, drop the printk_ratelimit call, and print using printk_ratelimited. This makes the messages look like: [ 387.208839] buffer_io_error: 676394 callbacks suppressed [ 387.210693] Buffer I/O error on dev sdr, logical block 211291776, lost async page write [ 387.213432] Buffer I/O error on dev sdr, logical block 211291777, lost async page write Signed-off-by: Robert Elliott <elliott@hp.com> Reviewed-by: Webb Scales <webbnh@hp.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2014-10-21fs: merge I/O error prints into one lineRobert Elliott
buffer.c uses two printk calls to print these messages: [67353.422338] Buffer I/O error on device sdr, logical block 212868488 [67353.422338] lost page write due to I/O error on sdr In a busy system, they may be interleaved with other prints, losing the context for the second message. Merge them into one line with one printk call so the prints are atomic. Also, differentiate between async page writes, sync page writes, and async page reads. Also, shorten "device" to "dev" to match the block layer prints: [67353.467906] blk_update_request: critical target error, dev sdr, sector 1707107328 Also, use %llu rather than %Lu. Resulting prints look like: [ 1356.437006] blk_update_request: critical target error, dev sdr, sector 1719693992 [ 1361.383522] quiet_error: 659876 callbacks suppressed [ 1361.385816] Buffer I/O error on dev sdr, logical block 256902912, lost async page write [ 1361.385819] Buffer I/O error on dev sdr, logical block 256903644, lost async page write Signed-off-by: Robert Elliott <elliott@hp.com> Reviewed-by: Webb Scales <webbnh@hp.com> Signed-off-by: Jens Axboe <axboe@fb.com>