summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2010-01-19eCryptfs: Remove mmap from directory operationsTyler Hicks
Adrian reported that mkfontscale didn't work inside of eCryptfs mounts. Strace revealed the following: open("./", O_RDONLY|O_NONBLOCK|O_LARGEFILE|O_DIRECTORY|O_CLOEXEC) = 3 fcntl64(3, F_GETFD) = 0x1 (flags FD_CLOEXEC) open("./fonts.scale", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 4 getdents(3, /* 80 entries */, 32768) = 2304 open("./.", O_RDONLY) = 5 fcntl64(5, F_SETFD, FD_CLOEXEC) = 0 fstat64(5, {st_mode=S_IFDIR|0755, st_size=16384, ...}) = 0 mmap2(NULL, 16384, PROT_READ, MAP_PRIVATE, 5, 0) = 0xb7fcf000 close(5) = 0 --- SIGBUS (Bus error) @ 0 (0) --- +++ killed by SIGBUS +++ The mmap2() on a directory was successful, resulting in a SIGBUS signal later. This patch removes mmap() from the list of possible ecryptfs_dir_fops so that mmap() isn't possible on eCryptfs directory files. https://bugs.launchpad.net/ecryptfs/+bug/400443 Reported-by: Adrian C. <anrxc@sysphere.org> Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
2010-01-19eCryptfs: Add getattr functionTyler Hicks
The i_blocks field of an eCryptfs inode cannot be trusted, but generic_fillattr() uses it to instantiate the blocks field of a stat() syscall when a filesystem doesn't implement its own getattr(). Users have noticed that the output of du is incorrect on newly created files. This patch creates ecryptfs_getattr() which calls into the lower filesystem's getattr() so that eCryptfs can use its kstat.blocks value after calling generic_fillattr(). It is important to note that the block count includes the eCryptfs metadata stored in the beginning of the lower file plus any padding used to fill an extent before encryption. https://bugs.launchpad.net/ecryptfs/+bug/390833 Reported-by: Dominic Sacré <dominic.sacre@gmx.de> Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
2010-01-19eCryptfs: Use notify_change for truncating lower inodesTyler Hicks
When truncating inodes in the lower filesystem, eCryptfs directly invoked vmtruncate(). As Christoph Hellwig pointed out, vmtruncate() is a filesystem helper function, but filesystems may need to do more than just a call to vmtruncate(). This patch moves the lower inode truncation out of ecryptfs_truncate() and renames the function to truncate_upper(). truncate_upper() updates an iattr for the lower inode to indicate if the lower inode needs to be truncated upon return. ecryptfs_setattr() then calls notify_change(), using the updated iattr for the lower inode, to complete the truncation. For eCryptfs functions needing to truncate, ecryptfs_truncate() is reintroduced as a simple way to truncate the upper inode to a specified size and then truncate the lower inode accordingly. https://bugs.launchpad.net/bugs/451368 Reported-by: Christoph Hellwig <hch@lst.de> Acked-by: Dustin Kirkland <kirkland@canonical.com> Cc: ecryptfs-devel@lists.launchpad.net Cc: linux-fsdevel@vger.kernel.org Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
2010-01-18Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfsLinus Torvalds
* 'for-linus' of git://oss.sgi.com/xfs/xfs: xfs: xfs_swap_extents needs to handle dynamic fork offsets xfs: fix missing error check in xfs_rtfree_range xfs: fix stale inode flush avoidance xfs: Remove inode iolock held check during allocation xfs: reclaim all inodes by background tree walks xfs: Avoid inodes in reclaim when flushing from inode cache xfs: reclaim inodes under a write lock
2010-01-17Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: do_add_mount() should sanitize mnt_flags CIFS shouldn't make mountpoints shrinkable mnt_flags fixes in do_remount() attach_recursive_mnt() needs to hold vfsmount_lock over set_mnt_shared() may_umount() needs namespace_sem Fix configfs leak Fix the -ESTALE handling in do_filp_open() ecryptfs: Fix refcnt leak on ecryptfs_follow_link() error path Fix ACC_MODE() for real Unrot uml mconsole a bit hppfs: handle ->put_link() Kill 9p readlink() fix autofs/afs/etc. magic mountpoint breakage
2010-01-16nommu: fix shared mmap after truncate shrinkage problemsDavid Howells
Fix a problem in NOMMU mmap with ramfs whereby a shared mmap can happen over the end of a truncation. The problem is that ramfs_nommu_check_mappings() checks that the reduced file size against the VMA tree, but not the vm_region tree. The following sequence of events can cause the problem: fd = open("/tmp/x", O_RDWR|O_TRUNC|O_CREAT, 0600); ftruncate(fd, 32 * 1024); a = mmap(NULL, 32 * 1024, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0); b = mmap(NULL, 16 * 1024, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0); munmap(a, 32 * 1024); ftruncate(fd, 16 * 1024); c = mmap(NULL, 32 * 1024, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0); Mapping 'a' creates a vm_region covering 32KB of the file. Mapping 'b' sees that the vm_region from 'a' is covering the region it wants and so shares it, pinning it in memory. Mapping 'a' then goes away and the file is truncated to the end of VMA 'b'. However, the region allocated by 'a' is still in effect, and has _not_ been reduced. Mapping 'c' is then created, and because there's a vm_region covering the desired region, get_unmapped_area() is _not_ called to repeat the check, and the mapping is granted, even though the pages from the latter half of the mapping have been discarded. However: d = mmap(NULL, 16 * 1024, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0); Mapping 'd' should work, and should end up sharing the region allocated by 'a'. To deal with this, we shrink the vm_region struct during the truncation, lest do_mmap_pgoff() take it as licence to share the full region automatically without calling the get_unmapped_area() file op again. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Cc: Greg Ungerer <gerg@snapgear.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-01-16nommu: fix race between ramfs truncation and shared mmapDavid Howells
Fix the race between the truncation of a ramfs file and an attempt to make a shared mmap of region of that file. The problem is that do_mmap_pgoff() calls f_op->get_unmapped_area() to verify that the file region is made of contiguous pages and to find its base address - but there isn't any locking to guarantee this region until vma_prio_tree_insert() is called by add_vma_to_mm(). Note that moving the functionality into f_op->mmap() doesn't help as that is also called before vma_prio_tree_insert(). Instead make ramfs_nommu_check_mappings() grab nommu_region_sem whilst it does its checks. This means that this function will wait whilst mmaps take place. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Cc: Greg Ungerer <gerg@snapgear.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-01-16do_add_mount() should sanitize mnt_flagsAl Viro
MNT_WRITE_HOLD shouldn't leak into new vfsmount and neither should MNT_SHARED (the latter will be set properly, along with the rest of shared-subtree data structures) Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-01-16CIFS shouldn't make mountpoints shrinkableAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-01-16mnt_flags fixes in do_remount()Al Viro
* need vfsmount_lock over modifying it * need to preserve MNT_SHARED/MNT_UNBINDABLE Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-01-16attach_recursive_mnt() needs to hold vfsmount_lock over set_mnt_shared()Al Viro
race in mnt_flags update Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-01-16may_umount() needs namespace_semAl Viro
otherwise it races with clone_mnt() changing mnt_share/mnt_slaves Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-01-15inotify: only warn once for inotify problemsEric Paris
inotify will WARN() if it finds that the idr and the fsnotify internals somehow got out of sync. It was only supposed to do this once but due to this stupid bug it would warn every single time a problem was detected. Signed-off-by: Eric Paris <eparis@redhat.com> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-01-15inotify: do not reuse watch descriptorsEric Paris
Since commit 7e790dd5fc937bc8d2400c30a05e32a9e9eef276 ("inotify: fix error paths in inotify_update_watch") inotify changed the manor in which it gave watch descriptors back to userspace. Previous to this commit inotify acted like the following: inotify_add_watch(X, Y, Z) = 1 inotify_rm_watch(X, 1); inotify_add_watch(X, Y, Z) = 2 but after this patch inotify would return watch descriptors like so: inotify_add_watch(X, Y, Z) = 1 inotify_rm_watch(X, 1); inotify_add_watch(X, Y, Z) = 1 which I saw as equivalent to opening an fd where open(file) = 1; close(1); open(file) = 1; seemed perfectly reasonable. The issue is that quite a bit of userspace apparently relies on the behavior in which watch descriptors will not be quickly reused. KDE relies on it, I know some selinux packages rely on it, and I have heard complaints from other random sources such as debian bug 558981. Although the man page implies what we do is ok, we broke userspace so this patch almost reverts us to the old behavior. It is still slightly racey and I have patches that would fix that, but they are rather large and this will fix it for all real world cases. The race is as follows: - task1 creates a watch and blocks in idr_new_watch() before it updates the hint. - task2 creates a watch and updates the hint. - task1 updates the hint with it's older wd - task removes the watch created by task2 - task adds a new watch and will reuse the wd originally given to task2 it requires moving some locking around the hint (last_wd) but this should solve it for the real world and be -stable safe. As a side effect this patch papers over a bug in the lib/idr code which is causing a large number WARN's to pop on people's system and many reports in kerneloops.org. I'm working on the root cause of that idr bug seperately but this should make inotify immune to that issue. Signed-off-by: Eric Paris <eparis@redhat.com> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-01-15xfs: xfs_swap_extents needs to handle dynamic fork offsetsDave Chinner
When swapping extents, we can corrupt inodes by swapping data forks that are in incompatible formats. This is caused by the two indoes having different fork offsets due to the presence of an attribute fork on an attr2 filesystem. xfs_fsr tries to be smart about setting the fork offset, but the trick it plays only works on attr1 (old fixed format attribute fork) filesystems. Changing the way xfs_fsr sets up the attribute fork will prevent this situation from ever occurring, so in the kernel code we can get by with a preventative fix - check that the data fork in the defragmented inode is in a format valid for the inode it is being swapped into. This will lead to files that will silently and potentially repeatedly fail defragmentation, so issue a warning to the log when this particular failure occurs to let us know that xfs_fsr needs updating/fixing. To help identify how to improve xfs_fsr to avoid this issue, add trace points for the inodes being swapped so that we can determine why the swap was rejected and to confirm that the code is making the right decisions and modifications when swapping forks. A further complication is even when the swap is allowed to proceed when the fork offset is different between the two inodes then value for the maximum number of extents the data fork can hold can be wrong. Make sure these are also set correctly after the swap occurs. Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>
2010-01-15xfs: fix missing error check in xfs_rtfree_rangeDave Chinner
When xfs_rtfind_forw() returns an error, the block is returned uninitialised. xfs_rtfree_range() is not checking the error return, so could be using an uninitialised block number for modifying bitmap summary info. The problem was found by gcc when compiling the *userspace* libxfs code - it is an copy of the kernel code with the exact same bug. gcc gives an uninitialised variable warning on the userspace code but not on the kernel code. You gotta love the consistency (Mmmm, slightly chewy today!). Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Alex Elder <aelder@sgi.com>
2010-01-15xfs: fix stale inode flush avoidanceDave Chinner
When reclaiming stale inodes, we need to guarantee that inodes are unpinned before returning with a "clean" status. If we don't we can reclaim inodes that are pinned, leading to use after free in the transaction subsystem as transactions complete. Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>
2010-01-15xfs: Remove inode iolock held check during allocationDave Chinner
lockdep complains about a the lock not being initialised as we do an ASSERT based check that the lock is not held before we initialise it to catch inodes freed with the lock held. lockdep does this check for us in the lock initialisation code, so remove the ASSERT to stop the lockdep warning. Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>
2010-01-15xfs: reclaim all inodes by background tree walksDave Chinner
We cannot do direct inode reclaim without taking the flush lock to ensure that we do not reclaim an inode under IO. We check the inode is clean before doing direct reclaim, but this is not good enough because the inode flush code marks the inode clean once it has copied the in-core dirty state to the backing buffer. It is the flush lock that determines whether the inode is still under IO, even though it is marked clean, and the inode is still required at IO completion so we can't reclaim it even though it is clean in core. Hence the requirement that we need to take the flush lock even on clean inodes because this guarantees that the inode writeback IO has completed and it is safe to reclaim the inode. With delayed write inode flushing, we coul dend up waiting a long time on the flush lock even for a clean inode. The background reclaim already handles this efficiently, so avoid all the problems by killing the direct reclaim path altogether. Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>
2010-01-15xfs: Avoid inodes in reclaim when flushing from inode cacheDave Chinner
The reclaim code will handle flushing of dirty inodes before reclaim occurs, so avoid them when determining whether an inode is a candidate for flushing to disk when walking the radix trees. This is based on a test patch from Christoph Hellwig. Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>
2010-01-15xfs: reclaim inodes under a write lockDave Chinner
Make the inode tree reclaim walk exclusive to avoid races with concurrent sync walkers and lookups. This is a version of a patch posted by Christoph Hellwig that avoids all the code duplication. Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>
2010-01-14Fix configfs leakAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-01-14Fix the -ESTALE handling in do_filp_open()Al Viro
Instead of playing sick games with path saving, cleanups, just retry the entire thing once with LOOKUP_REVAL added. Post-.34 we'll convert all -ESTALE handling in there to that style, rather than playing with many retry loops deep in the call chain. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-01-14ecryptfs: Fix refcnt leak on ecryptfs_follow_link() error pathOGAWA Hirofumi
If ->follow_link handler return the error, it should decrement nd->path refcnt. But, ecryptfs_follow_link() doesn't decrement. This patch fix it by using usual nd_set_link() style error handling, instead of playing with nd->path. Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-01-14Fix ACC_MODE() for realAl Viro
commit 5300990c0370e804e49d9a59d928c5d53fb73487 had stepped on a rather nasty mess: definitions of ACC_MODE used to be different. Fixed the resulting breakage, converting them to variant that takes O_... value; all callers have that and it actually simplifies life (see tomoyo part of changes). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-01-14hppfs: handle ->put_link()Al Viro
current code works only because nothing in procfs has non-trivial ->put_link(). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-01-14Kill 9p readlink()Al Viro
For symlinks generic_readlink() will work just fine and for directories we don't want ->readlink() at all. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-01-14fix autofs/afs/etc. magic mountpoint breakageAl Viro
We end up trying to kfree() nd.last.name on open("/mnt/tmp", O_CREAT) if /mnt/tmp is an autofs direct mount. The reason is that nd.last_type is bogus here; we want LAST_BIND for everything of that kind and we get LAST_NORM left over from finding parent directory. So make sure that it *is* set properly; set to LAST_BIND before doing ->follow_link() - for normal symlinks it will be changed by __vfs_follow_link() and everything else needs it set that way. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-01-13Merge branch 'fasync-helper'Linus Torvalds
* fasync-helper: fasync: split 'fasync_helper()' into separate add/remove functions
2010-01-12lib: Introduce generic list_sort functionDave Chinner
There are two copies of list_sort() in the tree already, one in the DRM code, another in ubifs. Now XFS needs this as well. Create a generic list_sort() function from the ubifs version and convert existing users to it so we don't end up with yet another copy in the tree. Signed-off-by: Dave Chinner <david@fromorbit.com> Acked-by: Dave Airlie <airlied@redhat.com> Acked-by: Artem Bityutskiy <dedekind@infradead.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-01-11Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfsLinus Torvalds
* 'for-linus' of git://oss.sgi.com/xfs/xfs: xfs: Ensure we force all busy extents in range to disk xfs: Don't flush stale inodes xfs: fix timestamp handling in xfs_setattr xfs: use DECLARE_EVENT_CLASS
2010-01-11Merge git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-fixesLinus Torvalds
* git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-fixes: GFS2: Use MAX_LFS_FILESIZE for meta inode size GFS2: Fix gfs2_xattr_acl_chmod() GFS2: Fix locking bug in rename GFS2: Ensure uptodate inode size when using O_APPEND
2010-01-11Merge branch 'for_linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6 * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6: quota: Fix dquot_transfer for filesystems different from ext4
2010-01-11smaps: fix wrong rss countMinchan Kim
A long time ago we regarded zero page as file_rss and vm_normal_page doesn't return NULL. But now, we reinstated ZERO_PAGE and vm_normal_page's implementation can return NULL in case of zero page. Also we don't count it with file_rss any more. Then, RSS and PSS can't be matched. For consistency, Let's ignore zero page in smaps_pte_range. Signed-off-by: Minchan Kim <minchan.kim@gmail.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Acked-by: Matt Mackall <mpm@selenic.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-01-11proc: partially revert "procfs: provide stack information for threads"KOSAKI Motohiro
Commit d899bf7b (procfs: provide stack information for threads) introduced to show stack information in /proc/{pid}/status. But it cause large performance regression. Unfortunately /proc/{pid}/status is used ps command too and ps is one of most important component. Because both to take mmap_sem and page table walk are heavily operation. If many process run, the ps performance is, [before d899bf7b] % perf stat ps >/dev/null Performance counter stats for 'ps': 4090.435806 task-clock-msecs # 0.032 CPUs 229 context-switches # 0.000 M/sec 0 CPU-migrations # 0.000 M/sec 234 page-faults # 0.000 M/sec 8587565207 cycles # 2099.425 M/sec 9866662403 instructions # 1.149 IPC 3789415411 cache-references # 926.409 M/sec 30419509 cache-misses # 7.437 M/sec 128.859521955 seconds time elapsed [after d899bf7b] % perf stat ps > /dev/null Performance counter stats for 'ps': 4305.081146 task-clock-msecs # 0.028 CPUs 480 context-switches # 0.000 M/sec 2 CPU-migrations # 0.000 M/sec 237 page-faults # 0.000 M/sec 9021211334 cycles # 2095.480 M/sec 10605887536 instructions # 1.176 IPC 3612650999 cache-references # 839.160 M/sec 23917502 cache-misses # 5.556 M/sec 152.277819582 seconds time elapsed Thus, this patch revert it. Fortunately /proc/{pid}/task/{tid}/smaps provide almost same information. we can use it. Commit d899bf7b introduced two features: 1) Add the annotattion of [thread stack: xxxx] mark to /proc/{pid}/task/{tid}/maps. 2) Add StackUsage field to /proc/{pid}/status. I only revert (2), because I haven't seen (1) cause regression. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Stefani Seibold <stefani@seibold.net> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Randy Dunlap <randy.dunlap@oracle.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andi Kleen <andi@firstfloor.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-01-11quota: Fix dquot_transfer for filesystems different from ext4Jan Kara
Commit fd8fbfc1 modified the way we find amount of reserved space belonging to an inode. The amount of reserved space is checked from dquot_transfer and thus inode_reserved_space gets called even for filesystems that don't provide get_reserved_space callback which results in a BUG. Fix the problem by checking get_reserved_space callback and return 0 if the filesystem does not provide it. CC: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Jan Kara <jack@suse.cz>
2010-01-11GFS2: Use MAX_LFS_FILESIZE for meta inode sizeSteven Whitehouse
Using ~0ULL was cauing sign issues in filemap_fdatawrite_range, so use MAX_LFS_FILESIZE instead. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2010-01-10xfs: Ensure we force all busy extents in range to diskDave Chinner
When we search for and find a busy extent during allocation we force the log out to ensure the extent free transaction is on disk before the allocation transaction. The current implementation has a subtle bug in it--it does not handle multiple overlapping ranges. That is, if we free lots of little extents into a single contiguous extent, then allocate the contiguous extent, the busy search code stops searching at the first extent it finds that overlaps the allocated range. It then uses the commit LSN of the transaction to force the log out to. Unfortunately, the other busy ranges might have more recent commit LSNs than the first busy extent that is found, and this results in xfs_alloc_search_busy() returning before all the extent free transactions are on disk for the range being allocated. This can lead to potential metadata corruption or stale data exposure after a crash because log replay won't replay all the extent free transactions that cover the allocation range. Modified-by: Alex Elder <aelder@sgi.com> (Dropped the "found" argument from the xfs_alloc_busysearch trace event.) Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>
2010-01-10xfs: Don't flush stale inodesDave Chinner
Because inodes remain in cache much longer than inode buffers do under memory pressure, we can get the situation where we have stale, dirty inodes being reclaimed but the backing storage has been freed. Hence we should never, ever flush XFS_ISTALE inodes to disk as there is no guarantee that the backing buffer is in cache and still marked stale when the flush occurs. Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Alex Elder <aelder@sgi.com>
2010-01-10xfs: fix timestamp handling in xfs_setattrChristoph Hellwig
We currently have some rather odd code in xfs_setattr for updating the a/c/mtime timestamps: - first we do a non-transaction update if all three are updated together - second we implicitly update the ctime for various changes instead of relying on the ATTR_CTIME flag - third we set the timestamps to the current time instead of the arguments in the iattr structure in many cases. This patch makes sure we update it in a consistent way: - always transactional - ctime is only updated if ATTR_CTIME is set or we do a size update, which is a special case - always to the times passed in from the caller instead of the current time The only non-size caller of xfs_setattr that doesn't come from the VFS is updated to set ATTR_CTIME and pass in a valid ctime value. Reported-by: Eric Blake <ebb9@byu.net> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>
2010-01-10xfs: use DECLARE_EVENT_CLASSChristoph Hellwig
Using DECLARE_EVENT_CLASS allows us to to use trace event code instead of duplicating it in the binary. This was not available before 2.6.33 so it had to be done as a separate step once the prerequisite was merged. This only requires changes to xfs_trace.h and the results are rather impressive: hch@brick:~/work/linux-2.6/obj-kvm$ size fs/xfs/xfs.o* text data bss dec hex filename 607732 41884 3616 653232 9f7b0 fs/xfs/xfs.o 1026732 41884 3808 1072424 105d28 fs/xfs/xfs.o.old Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>
2010-01-08Merge branch 'reiserfs/kill-bkl' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing * 'reiserfs/kill-bkl' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing: reiserfs: Relax reiserfs_xattr_set_handle() while acquiring xattr locks reiserfs: Fix unreachable statement reiserfs: Don't call reiserfs_get_acl() with the reiserfs lock reiserfs: Relax lock on xattr removing reiserfs: Relax the lock before truncating pages reiserfs: Fix recursive lock on lchown reiserfs: Fix mistake in down_write() conversion
2010-01-08Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfsLinus Torvalds
* 'for-linus' of git://oss.sgi.com/xfs/xfs: xfs: kill some warnings on i386 builds
2010-01-08Merge branch 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6Linus Torvalds
* 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6: nfs: fix oops in nfs_rename() sunrpc: fix build-time warning sunrpc: on successful gss error pipe write, don't return error SUNRPC: Fix the return value in gss_import_sec_context() SUNRPC: Fix up an error return value in gss_import_sec_context_kerberos()
2010-01-08xfs: kill some warnings on i386 buildsDave Chinner
Randy Dunlap Reported printk() format-related warnings reported on i386 builds in his environment. Dave Chinner provided this patch to eliminate them. Signed-off by: Dave Chinner <david@fromorbit.com> Acked-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Alex Elder <aelder@sgi.com>
2010-01-08GFS2: Fix gfs2_xattr_acl_chmod()Steven Whitehouse
The ref counting for the bh returned by gfs2_ea_find() was wrong. This patch ensures that we always drop the ref count to that bh correctly. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2010-01-08GFS2: Fix locking bug in renameSteven Whitehouse
The rename code was taking a resource group lock in cases where it wasn't actually needed, this caused problems if the rename was resulting in an inode being unlinked. The patch ensures that we only take the rgrp lock early if it is really needed. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2010-01-08GFS2: Ensure uptodate inode size when using O_APPENDSteven Whitehouse
The VFS reads the inode size during generic_file_aio_write() but with no locking around it. In order to get the expected result from O_APPEND opens, this patch updated the inode size before calling generic_file_aio_write() There is of course still a race here, in that there is nothing to prevent another node coming in and extending the file in the mean time. On the other hand, when used with file locking this will ensure that the expected results are obtained. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2010-01-07reiserfs: Relax reiserfs_xattr_set_handle() while acquiring xattr locksFrederic Weisbecker
Fix remaining xattr locks acquired in reiserfs_xattr_set_handle() while we are holding the reiserfs lock to avoid lock inversions. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Christian Kujau <lists@nerdbynature.de> Cc: Alexander Beregalov <a.beregalov@gmail.com> Cc: Chris Mason <chris.mason@oracle.com> Cc: Ingo Molnar <mingo@elte.hu>
2010-01-07reiserfs: Fix unreachable statementJiri Slaby
Stanse found an unreachable statement in reiserfs_ioctl. There is a if followed by error assignment and `break' with no braces. Add the braces so that we don't break every time, but only in error case, so that REISERFS_IOC_SETVERSION actually works when it returns no error. Signed-off-by: Jiri Slaby <jslaby@suse.cz> Cc: Reiserfs <reiserfs-devel@vger.kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>