summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2010-08-18Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: fs: brlock vfsmount_lock fs: scale files_lock lglock: introduce special lglock and brlock spin locks tty: fix fu_list abuse fs: cleanup files_lock locking fs: remove extra lookup in __lookup_hash fs: fs_struct rwlock to spinlock apparmor: use task path helpers fs: dentry allocation consolidation fs: fix do_lookup false negative mbcache: Limit the maximum number of cache entries hostfs ->follow_link() braino hostfs: dumb (and usually harmless) tpyo - strncpy instead of strlcpy remove SWRITE* I/O types kill BH_Ordered flag vfs: update ctime when changing the file's permission by setfacl cramfs: only unlock new inodes fix reiserfs_evict_inode end_writeback second call
2010-08-18fs: brlock vfsmount_lockNick Piggin
fs: brlock vfsmount_lock Use a brlock for the vfsmount lock. It must be taken for write whenever modifying the mount hash or associated fields, and may be taken for read when performing mount hash lookups. A new lock is added for the mnt-id allocator, so it doesn't need to take the heavy vfsmount write-lock. The number of atomics should remain the same for fastpath rlock cases, though code would be slightly slower due to per-cpu access. Scalability is not not be much improved in common cases yet, due to other locks (ie. dcache_lock) getting in the way. However path lookups crossing mountpoints should be one case where scalability is improved (currently requiring the global lock). The slowpath is slower due to use of brlock. On a 64 core, 64 socket, 32 node Altix system (high latency to remote nodes), a simple umount microbenchmark (mount --bind mnt mnt2 ; umount mnt2 loop 1000 times), before this patch it took 6.8s, afterwards took 7.1s, about 5% slower. Cc: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Nick Piggin <npiggin@kernel.dk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-08-18fs: scale files_lockNick Piggin
fs: scale files_lock Improve scalability of files_lock by adding per-cpu, per-sb files lists, protected with an lglock. The lglock provides fast access to the per-cpu lists to add and remove files. It also provides a snapshot of all the per-cpu lists (although this is very slow). One difficulty with this approach is that a file can be removed from the list by another CPU. We must track which per-cpu list the file is on with a new variale in the file struct (packed into a hole on 64-bit archs). Scalability could suffer if files are frequently removed from different cpu's list. However loads with frequent removal of files imply short interval between adding and removing the files, and the scheduler attempts to avoid moving processes too far away. Also, even in the case of cross-CPU removal, the hardware has much more opportunity to parallelise cacheline transfers with N cachelines than with 1. A worst-case test of 1 CPU allocating files subsequently being freed by N CPUs degenerates to contending on a single lock, which is no worse than before. When more than one CPU are allocating files, even if they are always freed by different CPUs, there will be more parallelism than the single-lock case. Testing results: On a 2 socket, 8 core opteron, I measure the number of times the lock is taken to remove the file, the number of times it is removed by the same CPU that added it, and the number of times it is removed by the same node that added it. Booting: locks= 25049 cpu-hits= 23174 (92.5%) node-hits= 23945 (95.6%) kbuild -j16 locks=2281913 cpu-hits=2208126 (96.8%) node-hits=2252674 (98.7%) dbench 64 locks=4306582 cpu-hits=4287247 (99.6%) node-hits=4299527 (99.8%) So a file is removed from the same CPU it was added by over 90% of the time. It remains within the same node 95% of the time. Tim Chen ran some numbers for a 64 thread Nehalem system performing a compile. throughput 2.6.34-rc2 24.5 +patch 24.9 us sys idle IO wait (in %) 2.6.34-rc2 51.25 28.25 17.25 3.25 +patch 53.75 18.5 19 8.75 So significantly less CPU time spent in kernel code, higher idle time and slightly higher throughput. Single threaded performance difference was within the noise of microbenchmarks. That is not to say penalty does not exist, the code is larger and more memory accesses required so it will be slightly slower. Cc: linux-kernel@vger.kernel.org Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Andi Kleen <ak@linux.intel.com> Signed-off-by: Nick Piggin <npiggin@kernel.dk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-08-18tty: fix fu_list abuseNick Piggin
tty: fix fu_list abuse tty code abuses fu_list, which causes a bug in remount,ro handling. If a tty device node is opened on a filesystem, then the last link to the inode removed, the filesystem will be allowed to be remounted readonly. This is because fs_may_remount_ro does not find the 0 link tty inode on the file sb list (because the tty code incorrectly removed it to use for its own purpose). This can result in a filesystem with errors after it is marked "clean". Taking idea from Christoph's initial patch, allocate a tty private struct at file->private_data and put our required list fields in there, linking file and tty. This makes tty nodes behave the same way as other device nodes and avoid meddling with the vfs, and avoids this bug. The error handling is not trivial in the tty code, so for this bugfix, I take the simple approach of using __GFP_NOFAIL and don't worry about memory errors. This is not a problem because our allocator doesn't fail small allocs as a rule anyway. So proper error handling is left as an exercise for tty hackers. [ Arguably filesystem's device inode would ideally be divorced from the driver's pseudo inode when it is opened, but in practice it's not clear whether that will ever be worth implementing. ] Cc: linux-kernel@vger.kernel.org Cc: Christoph Hellwig <hch@infradead.org> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: Greg Kroah-Hartman <gregkh@suse.de> Signed-off-by: Nick Piggin <npiggin@kernel.dk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-08-18fs: cleanup files_lock lockingNick Piggin
fs: cleanup files_lock locking Lock tty_files with a new spinlock, tty_files_lock; provide helpers to manipulate the per-sb files list; unexport the files_lock spinlock. Cc: linux-kernel@vger.kernel.org Cc: Christoph Hellwig <hch@infradead.org> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Acked-by: Andi Kleen <ak@linux.intel.com> Acked-by: Greg Kroah-Hartman <gregkh@suse.de> Signed-off-by: Nick Piggin <npiggin@kernel.dk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-08-18fs: remove extra lookup in __lookup_hashNick Piggin
fs: remove extra lookup in __lookup_hash Optimize lookup for create operations, where no dentry should often be common-case. In cases where it is not, such as unlink, the added overhead is much smaller than the removed. Also, move comments about __d_lookup racyness to the __d_lookup call site. d_lookup is intuitive; __d_lookup is what needs commenting. So in that same vein, add kerneldoc comments to __d_lookup and clean up some of the comments: - We are interested in how the RCU lookup works here, particularly with renames. Make that explicit, and point to the document where it is explained in more detail. - RCU is pretty standard now, and macros make implementations pretty mindless. If we want to know about RCU barrier details, we look in RCU code. - Delete some boring legacy comments because we don't care much about how the code used to work, more about the interesting parts of how it works now. So comments about lazy LRU may be interesting, but would better be done in the LRU or refcount management code. Signed-off-by: Nick Piggin <npiggin@kernel.dk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-08-18fs: fs_struct rwlock to spinlockNick Piggin
fs: fs_struct rwlock to spinlock struct fs_struct.lock is an rwlock with the read-side used to protect root and pwd members while taking references to them. Taking a reference to a path typically requires just 2 atomic ops, so the critical section is very small. Parallel read-side operations would have cacheline contention on the lock, the dentry, and the vfsmount cachelines, so the rwlock is unlikely to ever give a real parallelism increase. Replace it with a spinlock to avoid one or two atomic operations in typical path lookup fastpath. Signed-off-by: Nick Piggin <npiggin@kernel.dk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-08-18fs: dentry allocation consolidationNick Piggin
fs: dentry allocation consolidation There are 2 duplicate copies of code in dentry allocation in path lookup. Consolidate them into a single function. Signed-off-by: Nick Piggin <npiggin@kernel.dk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-08-18fs: fix do_lookup false negativeNick Piggin
fs: fix do_lookup false negative In do_lookup, if we initially find no dentry, we take the directory i_mutex and re-check the lookup. If we find a dentry there, then we revalidate it if needed. However if that revalidate asks for the dentry to be invalidated, we return -ENOENT from do_lookup. What should happen instead is an attempt to allocate and lookup a new dentry. This is probably not noticed because it is rare. It is only reached if a concurrent create races in first (in which case, the dentry probably won't be invalidated anyway), or if the racy __d_lookup has failed due to a false-negative (which is very rare). Fix this by removing code and have it use the normal reval path. Signed-off-by: Nick Piggin <npiggin@kernel.dk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-08-18mbcache: Limit the maximum number of cache entriesAndreas Gruenbacher
Limit the maximum number of mb_cache entries depending on the number of hash buckets: if the only limit to the number of cache entries is the available memory the hash chains can grow very long, taking a long time to search. At least partially solves https://bugzilla.lustre.org/show_bug.cgi?id=22771. Signed-off-by: Andreas Gruenbacher <agruen@suse.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-08-18hostfs ->follow_link() brainoAl Viro
we want the assignment to err done inside the if () to be visible after it, so (re)declaring err inside if () body is wrong. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-08-18hostfs: dumb (and usually harmless) tpyo - strncpy instead of strlcpyAl Viro
... not harmless in this case - we have a string in the end of buffer already. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-08-18remove SWRITE* I/O typesChristoph Hellwig
These flags aren't real I/O types, but tell ll_rw_block to always lock the buffer instead of giving up on a failed trylock. Instead add a new write_dirty_buffer helper that implements this semantic and use it from the existing SWRITE* callers. Note that the ll_rw_block code had a bug where it didn't promote WRITE_SYNC_PLUG properly, which this patch fixes. In the ufs code clean up the helper that used to call ll_rw_block to mirror sync_dirty_buffer, which is the function it implements for compound buffers. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-08-18kill BH_Ordered flagChristoph Hellwig
Instead of abusing a buffer_head flag just add a variant of sync_dirty_buffer which allows passing the exact type of write flag required. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-08-18vfs: update ctime when changing the file's permission by setfaclJan Kara
generic_acl_set didn't update the ctime of the file when its permission was changed. Steps to reproduce: # touch aaa # stat -c %Z aaa 1275289822 # setfacl -m 'u::x,g::x,o::x' aaa # stat -c %Z aaa 1275289822 <- unchanged But, according to the spec of the ctime, vfs must update it. Port of ext3 patch by Miao Xie <miaox@cn.fujitsu.com>. CC: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-08-18cramfs: only unlock new inodesAlexander Shishkin
Commit 77b8a75f5bb introduced a warning at fs/inode.c:692 unlock_new_inode(), caused by unlock_new_inode() being called on existing inodes as well. This patch changes setup_inode() to only call unlock_new_inode() for I_NEW inodes. Signed-off-by: Alexander Shishkin <virtuoso@slind.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-08-18fix reiserfs_evict_inode end_writeback second callSergey Senozhatsky
reiserfs_evict_inode calls end_writeback two times hitting kernel BUG at fs/inode.c:298 becase inode->i_state is I_CLEAR already. Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-08-17Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2: nilfs2: fix false warning saying one of two super blocks is broken nilfs2: fix list corruption after ifile creation failure
2010-08-17Make do_execve() take a const filename pointerDavid Howells
Make do_execve() take a const filename pointer so that kernel_execve() compiles correctly on ARM: arch/arm/kernel/sys_arm.c:88: warning: passing argument 1 of 'do_execve' discards qualifiers from pointer target type This also requires the argv and envp arguments to be consted twice, once for the pointer array and once for the strings the array points to. This is because do_execve() passes a pointer to the filename (now const) to copy_strings_kernel(). A simpler alternative would be to cast the filename pointer in do_execve() when it's passed to copy_strings_kernel(). do_execve() may not change any of the strings it is passed as part of the argv or envp lists as they are some of them in .rodata, so marking these strings as const should be fine. Further kernel_execve() and sys_execve() need to be changed to match. This has been test built on x86_64, frv, arm and mips. Signed-off-by: David Howells <dhowells@redhat.com> Tested-by: Ralf Baechle <ralf@linux-mips.org> Acked-by: Russell King <rmk+kernel@arm.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-16nilfs2: fix false warning saying one of two super blocks is brokenRyusuke Konishi
After applying commit b2ac86e1, the following message got appeared after unclean shutdown: > NILFS warning: broken superblock. using spare superblock. This turns out to be a false message due to the change which updates two super blocks alternately. The secondary super block now can be selected if it's newer than the primary one. This kills the false warning by suppressing it if another super block is not actually broken. Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2010-08-16nilfs2: fix list corruption after ifile creation failureRyusuke Konishi
If nilfs_attach_checkpoint() gets a memory allocation failure during creation of ifile, it will return without removing nilfs_sb_info struct from ns_supers list. When a concurrently mounted snapshot is unmounted or another new snapshot is mounted after that, this causes kernel oops as below: > BUG: unable to handle kernel NULL pointer dereference at (null) > IP: [<f83662ff>] nilfs_find_sbinfo+0x74/0xa4 [nilfs2] > *pde = 00000000 > Oops: 0000 [#1] SMP <snip> > Call Trace: > [<f835dc29>] ? nilfs_get_sb+0x165/0x532 [nilfs2] > [<c1173c87>] ? ida_get_new_above+0x16d/0x187 > [<c109a7f8>] ? alloc_vfsmnt+0x7e/0x10a > [<c1070790>] ? kstrdup+0x2c/0x40 > [<c1089041>] ? vfs_kern_mount+0x96/0x14e > [<c108913d>] ? do_kern_mount+0x32/0xbd > [<c109b331>] ? do_mount+0x642/0x6a1 > [<c101a415>] ? do_page_fault+0x0/0x2d1 > [<c1099c00>] ? copy_mount_options+0x80/0xe2 > [<c10705d8>] ? strndup_user+0x48/0x67 > [<c109b3f1>] ? sys_mount+0x61/0x90 > [<c10027cc>] ? sysenter_do_call+0x12/0x22 This fixes the problem. Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Tested-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Cc: stable@kernel.org
2010-08-15mm: fix up some user-visible effects of the stack guard pageLinus Torvalds
This commit makes the stack guard page somewhat less visible to user space. It does this by: - not showing the guard page in /proc/<pid>/maps It looks like lvm-tools will actually read /proc/self/maps to figure out where all its mappings are, and effectively do a specialized "mlockall()" in user space. By not showing the guard page as part of the mapping (by just adding PAGE_SIZE to the start for grows-up pages), lvm-tools ends up not being aware of it. - by also teaching the _real_ mlock() functionality not to try to lock the guard page. That would just expand the mapping down to create a new guard page, so there really is no point in trying to lock it in place. It would perhaps be nice to show the guard page specially in /proc/<pid>/maps (or at least mark grow-down segments some way), but let's not open ourselves up to more breakage by user space from programs that depends on the exact deails of the 'maps' file. Special thanks to Henrique de Moraes Holschuh for diving into lvm-tools source code to see what was going on with the whole new warning. Reported-and-tested-by: François Valenduc <francois.valenduc@tvcablenet.be Reported-by: Henrique de Moraes Holschuh <hmh@hmh.eng.br> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-14fs/dcache: fix function param name in kernel-docRandy Dunlap
Fix parameter name in kernel-doc notation (causes a warning). Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-13Merge branch 'for_linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: clean up compiler warning in start_this_handle()
2010-08-13Merge branch 'bkl/ioctl' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing * 'bkl/ioctl' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing: bkl: Remove locked .ioctl file operation v4l: Remove reference to bkl ioctl in compat ioctl handling logfs: kill BKL
2010-08-13Mark arguments to certain syscalls as being constDavid Howells
Mark arguments to certain system calls as being const where they should be but aren't. The list includes: (*) The filename arguments of various stat syscalls, execve(), various utimes syscalls and some mount syscalls. (*) The filename arguments of some syscall helpers relating to the above. (*) The buffer argument of various write syscalls. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-14bkl: Remove locked .ioctl file operationArnd Bergmann
The last user is gone, so we can safely remove this Signed-off-by: Arnd Bergmann <arnd@arndb.de> Cc: John Kacur <jkacur@redhat.com> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2010-08-14logfs: kill BKLArnd Bergmann
logfs does not need the BKL, so use ->unlocked_ioctl instead of ->ioctl in file operations. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Joern Engel <joern@logfs.org> [ fixed trivial conflict ] Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2010-08-13Merge branch 'for-linus' of git://git390.marist.edu/pub/scm/linux-2.6Linus Torvalds
* 'for-linus' of git://git390.marist.edu/pub/scm/linux-2.6: [S390] partitions: fix build error in ibm partition detection code [S390] appldata: fix dev_get_stats 64 bit conversion [S390] wire up prlimit64 and fanotify* syscalls [S390] zcrypt: fix Kconfig dependencies [S390] sys_personality: follow u_long to unsigned int conversion [S390] dasd: fix format string types
2010-08-13Merge branch 'fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2 * 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2: O2net: Disallow o2net accept connection request from itself. ocfs2/dlm: remove potential deadlock -V3 ocfs2/dlm: avoid incorrect bit set in refmap on recovery master Fix the nested PR lock calling issue in ACL ocfs2: Count more refcount records in file system fragmentation. ocfs2 fix o2dlm dlm run purgelist (rev 3) ocfs2/dlm: fix a dead lock ocfs2: do not overwrite error codes in ocfs2_init_acl
2010-08-13Merge git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6Linus Torvalds
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6: [NFS] Set CONFIG_KEYS when CONFIG_NFS_USE_KERNEL_DNS is set AFS: Implement an autocell mount capability [ver #2] DNS: If the DNS server returns an error, allow that to be cached [ver #2] NFS: Use kernel DNS resolver [ver #2] cifs: update README to include details about 'fsc' option
2010-08-13[S390] partitions: fix build error in ibm partition detection codeHeiko Carstens
9c867fbe "partitions: fix sometimes unreadable partition strings" coverted one line within the ibm partition code incorrectly. Fix this to get rid of a build error. fs/partitions/ibm.c: In function 'ibm_partition': [...] fs/partitions/ibm.c:185: error: too many arguments to function 'strlcat' Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2010-08-12Revert "fsnotify: store struct file not struct path"Linus Torvalds
This reverts commit 3bcf3860a4ff9bbc522820b4b765e65e4deceb3e (and the accompanying commit c1e5c954020e "vfs/fsnotify: fsnotify_close can delay the final work in fput" that was a horribly ugly hack to make it work at all). The 'struct file' approach not only causes that disgusting hack, it somehow breaks pulseaudio, probably due to some other subtlety with f_count handling. Fix up various conflicts due to later fsnotify work. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-12[NFS] Set CONFIG_KEYS when CONFIG_NFS_USE_KERNEL_DNS is setSteve French
Previous patch relied on DNS_RESOLVER setting CONFIG_KEYS but needs to be selected in NFS config when using the new DNS resolver Signed-off-by: Bryan Schumaker <bjschuma@netapp.com> CC: David Howells <dhowells@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-08-12Merge branch 'params' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus * 'params' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus: (22 commits) param: don't deref arg in __same_type() checks param: update drivers/acpi/debug.c to new scheme param: use module_param in drivers/message/fusion/mptbase.c ide: use module_param_named rather than module_param_call param: update drivers/char/ipmi/ipmi_watchdog.c to new scheme param: lock if_sdio's lbs_helper_name and lbs_fw_name against sysfs changes. param: lock myri10ge_fw_name against sysfs changes. param: simple locking for sysfs-writable charp parameters param: remove unnecessary writable charp param: add kerneldoc to moduleparam.h param: locking for kernel parameters param: make param sections const. param: use free hook for charp (fix leak of charp parameters) param: add a free hook to kernel_param_ops. param: silence .init.text references from param ops Add param ops struct for hvc_iucv driver. nfs: update for module_param_named API change AppArmor: update for module_param_named API change param: use ops in struct kernel_param, rather than get and set fns directly param: move the EXPORT_SYMBOL to after the definitions. ...
2010-08-12Add a dummy printk function for the maintenance of unused printksDavid Howells
Add a dummy printk function for the maintenance of unused printks through gcc format checking, and also so that side-effect checking is maintained too. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-12mm: fix writeback_in_progress()Jan Kara
Commit 83ba7b071f3 ("writeback: simplify the write back thread queue") broke writeback_in_progress() as in that commit we started to remove work items from the list at the moment we start working on them and not at the moment they are finished. Thus if the flusher thread was doing some work but there was no other work queued, writeback_in_progress() returned false. This could in particular cause unnecessary queueing of background writeback from balance_dirty_pages() or writeout work from writeback_sb_if_idle(). This patch fixes the problem by introducing a bit in the bdi state which indicates that the flusher thread is processing some work and uses this bit for writeback_in_progress() test. NOTE: Both callsites of writeback_in_progress() (namely, writeback_inodes_sb_if_idle() and balance_dirty_pages()) would actually need a different information than what writeback_in_progress() provides. They would need to know whether *the kind of writeback they are going to submit* is already queued. But this information isn't that simple to provide so let's fix writeback_in_progress() for the time being. Signed-off-by: Jan Kara <jack@suse.cz> Cc: Christoph Hellwig <hch@lst.de> Cc: Wu Fengguang <fengguang.wu@intel.com> Acked-by: Jens Axboe <jaxboe@fusionio.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-12writeback: merge for_kupdate and !for_kupdate casesWu Fengguang
Unify the logic for kupdate and non-kupdate cases. There won't be starvation because the inodes requeued into b_more_io will later be spliced _after_ the remaining inodes in b_io, hence won't stand in the way of other inodes in the next run. It avoids unnecessary redirty_tail() calls, hence the update of i_dirtied_when. The timestamp update is undesirable because it could later delay the inode's periodic writeback, or may exclude the inode from the data integrity sync operation (which checks timestamp to avoid extra work and livelock). === How the redirty_tail() comes about: It was a long story.. This redirty_tail() was introduced with wbc.more_io. The initial patch for more_io actually does not have the redirty_tail(), and when it's merged, several 100% iowait bug reports arised: reiserfs: http://lkml.org/lkml/2007/10/23/93 jfs: commit 29a424f28390752a4ca2349633aaacc6be494db5 JFS: clear PAGECACHE_TAG_DIRTY for no-write pages ext2: http://www.spinics.net/linux/lists/linux-ext4/msg04762.html They are all old bugs hidden in various filesystems that become "visible" with the more_io patch. At the time, the ext2 bug is thought to be "trivial", so not fixed. Instead the following updated more_io patch with redirty_tail() is merged: http://www.spinics.net/linux/lists/linux-ext4/msg04507.html This will in general prevent 100% on ext2 and possibly other unknown FS bugs. Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Martin Bligh <mbligh@google.com> Cc: Michael Rubin <mrubin@google.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-12writeback: fix queue_io() orderingWu Fengguang
This was not a bug, since b_io is empty for kupdate writeback. The next patch will do requeue_io() for non-kupdate writeback, so let's fix it. Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Martin Bligh <mbligh@google.com> Cc: Michael Rubin <mrubin@google.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-12writeback: don't redirty tail an inode with dirty pagesWu Fengguang
Avoid delaying writeback for an expire inode with lots of dirty pages, but no active dirtier at the moment. Previously we only do that for the kupdate case. Any filesystem that does delayed allocation or unwritten extent conversion after IO completion will cause this - for example, XFS. Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Acked-by: Jan Kara <jack@suse.cz> Cc: Dave Chinner <david@fromorbit.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-12writeback: avoid unnecessary calculation of bdi dirty thresholdsWu Fengguang
Split get_dirty_limits() into global_dirty_limits()+bdi_dirty_limit(), so that the latter can be avoided when under global dirty background threshold (which is the normal state for most systems). Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-11AFS: Implement an autocell mount capability [ver #2]wanglei
Implement the ability for the root directory of a mounted AFS filesystem to accept lookups of arbitrary directory names, to interpet the names as the names of cells, to look the cell names up in the DNS for AFSDB records and to mount the root.cell volume of the nominated cell on the pseudo-directory created by lookup. This facility is requested by passing: -o autocell to the mountpoint for which this is desired, usually the /afs mount. To use this facility, a DNS upcall program is required for AFSDB records. This can be obtained from: http://people.redhat.com/~dhowells/afs/dns.afsdb.c It should be compiled with -lresolv and -lkeyutils and installed as, say: /usr/sbin/dns.afsdb Then the following line needs to be added to /sbin/request-key.conf: create dns_resolver afsdb:* * /usr/sbin/dns.afsdb %k This can be tested by mounting AFS, say: insmod dns_resolver.ko insmod af-rxrpc.ko insmod kafs.ko rootcell=grand.central.org mount -t afs "#grand.central.org:root.cell." /afs -o autocell and doing: ls /afs/grand.central.org/ which should show: archive/ cvs/ doc/ local/ project/ service/ software/ user/ www/ if it works. Signed-off-by: Wang Lei <wang840925@gmail.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-08-11DNS: If the DNS server returns an error, allow that to be cached [ver #2]Wang Lei
If the DNS server returns an error, allow that to be cached in the DNS resolver key in lieu of a value. Userspace passes the desired error number as an option in the payload: "#dnserror=<number>" Userspace must map h_errno from the name resolution routines to an appropriate Linux error before passing it up. Something like the following mapping is recommended: [HOST_NOT_FOUND] = ENODATA, [TRY_AGAIN] = EAGAIN, [NO_RECOVERY] = ECONNREFUSED, [NO_DATA] = ENODATA, in lieu of Linux errors specifically for representing name service errors. The filesystem must map these errors appropropriately before passing them to userspace. AFS is made to map ENODATA and EAGAIN to EDESTADDRREQ for the return to userspace; ECONNREFUSED is allowed to stand as is. The error can be seen in /proc/keys as a negative number after the description of the key. Compare, for example, the following key entries: 2f97238c I--Q-- 1 53s 3f010000 0 0 dns_resol afsdb:grand.centrall.org: -61 338bfbbe I--Q-- 1 59m 3f010000 0 0 dns_resol afsdb:grand.central.org: 37 If the error option is supplied in the payload, the main part of the payload is discarded. The key should have an expiry time set by userspace. Signed-off-by: Wang Lei <wang840925@gmail.com> Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-08-11NFS: Use kernel DNS resolver [ver #2]Bryan Schumaker
Use the kernel DNS resolver to translate hostnames to IP addresses. Create a new config option to choose between the legacy DNS resolver and the new resolver. Signed-off-by: Bryan Schumaker <bjschuma@netapp.com> Acked-by: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-08-11cifs: update README to include details about 'fsc' optionSuresh Jayaraman
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de> Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-08-11Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: isofs: Fix lseek() to position beyond 4 GB vfs: remove unused MNT_STRICTATIME vfs: show unreachable paths in getcwd and proc vfs: only add " (deleted)" where necessary vfs: add prepend_path() helper vfs: __d_path: dont prepend the name of the root dentry ia64: perfmon: add d_dname method vfs: add helpers to get root and pwd cachefiles: use path_get instead of lone dget fs/sysv/super.c: add support for non-PDP11 v7 filesystems V7: Adjust sanity checks for some volumes Add v7 alias v9fs: fixup for inode_setattr being removed Manual merge to take Al's version of the fs/sysv/super.c file: it merged cleanly, but Al had removed an unnecessary header include, so his side was better.
2010-08-11Merge git://git.kernel.org/pub/scm/linux/kernel/git/pkl/squashfs-linusLinus Torvalds
* git://git.kernel.org/pub/scm/linux/kernel/git/pkl/squashfs-linus: Squashfs: fix checkpatch.pl warnings Squashfs: fix filename typo Squashfs: update Kconfig and documentation for LZO Squashfs: fix block size use in LZO decompressor Squashfs: Add LZO compression support squashfs: fix filename in header comment Squashfs: Make XATTR config name consistent with other file systems squashfs: fix compiler inline warning
2010-08-11Merge branch 'for-linus' of git://git.open-osd.org/linux-open-osdLinus Torvalds
* 'for-linus' of git://git.open-osd.org/linux-open-osd: exofs: Fix groups code when num_devices is not divisible by group_width exofs: Remove useless optimization exofs: exofs_file_fsync and exofs_file_flush correctness exofs: Remove superfluous dependency on buffer_head and writeback
2010-08-11Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (39 commits) ceph: generalize mon requests, add pool op support ceph: only queue async writeback on cap revocation if there is dirty data ceph: do not ignore osd_idle_ttl mount option ceph: constify dentry_operations ceph: whitespace cleanup ceph: add flock/fcntl lock support ceph: define on-wire types, constants for file locking support ceph: add CEPH_FEATURE_FLOCK to the supported feature bits ceph: support v2 reconnect encoding ceph: support v2 client_caps encoding ceph: move AES iv definition to shared header ceph: fix decoding of pool snap info ceph: make ->sync_fs not wait if wait==0 ceph: warn on missing snap realm ceph: print useful error message when crush rule not found ceph: use %pU to print uuid (fsid) ceph: sync header defs with server code ceph: clean up header guards ceph: strip misleading/obsolete version, feature info ceph: specify supported features in super.h ...
2010-08-11fs/sysv/super.c: add support for non-PDP11 v7 filesystemsLubomir Rintel
This adds byte order autodetection (of PDP-11 and LE filesystems). No attempt is made to detect big-endian filesystems -- were there any? Tested with PDP-11 v7 filesystems and PC-IX maintenance floppy. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Lubomir Rintel <lkundrak@v3.sk> Cc: Christoph Hellwig <hch@lst.de> Cc: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>