summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2011-11-05ceph/super.c: quiet sparse noiseH Hartley Sweeten
Quiet the sparse noise: warning: symbol 'create_fs_client' was not declared. Should it be static? warning: symbol 'destroy_fs_client' was not declared. Should it be static? Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com> Cc: Sage Weil <sage@newdream.net> ceph-devel@vger.kernel.org Signed-off-by: Sage Weil <sage@newdream.net>
2011-11-05ceph/mds_client.c: quiet sparse noiseH Hartley Sweeten
Quiet the following sparse noise: warning: symbol 'get_nonsnap_parent' was not declared. Should it be static? warning: symbol 'done_closing_sessions' was not declared. Should it be static? Local functions don't need external visability. Make them static. Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com> Cc: Sage Weil <sage@newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>
2011-11-05ceph: use new D_COMPLETE dentry flagSage Weil
We used to use a flag on the directory inode to track whether the dcache contents for a directory were a complete cached copy. Switch to a dentry flag CEPH_D_COMPLETE that is safely updated by ->d_prune(). Signed-off-by: Sage Weil <sage@newdream.net>
2011-11-03ceph: clear parent D_COMPLETE flag when on dentry pruneSage Weil
When the VFS prunes a dentry from the cache, clear the D_COMPLETE flag on the parent dentry. Do this for the live and snapshotted namespaces. Do not bother for the .snap dir contents, since we do not cache that. Signed-off-by: Sage Weil <sage@newdream.net>
2011-11-02Merge branch 'for-3.2' of git://linux-nfs.org/~bfields/linuxLinus Torvalds
* 'for-3.2' of git://linux-nfs.org/~bfields/linux: nfsd4: typo logical vs bitwise negate in nfsd4_decode_share_access
2011-11-02Merge branch 'akpm' (Andrew's incoming - part two)Linus Torvalds
Says Andrew: "60 patches. That's good enough for -rc1 I guess. I have quite a lot of detritus to be rechecked, work through maintainers, etc. - most of the remains of MM - rtc - various misc - cgroups - memcg - cpusets - procfs - ipc - rapidio - sysctl - pps - w1 - drivers/misc - aio" * akpm: (60 commits) memcg: replace ss->id_lock with a rwlock aio: allocate kiocbs in batches drivers/misc/vmw_balloon.c: fix typo in code comment drivers/misc/vmw_balloon.c: determine page allocation flag can_sleep outside loop w1: disable irqs in critical section drivers/w1/w1_int.c: multiple masters used same init_name drivers/power/ds2780_battery.c: fix deadlock upon insertion and removal drivers/power/ds2780_battery.c: add a nolock function to w1 interface drivers/power/ds2780_battery.c: create central point for calling w1 interface w1: ds2760 and ds2780, use ida for id and ida_simple_get() to get it pps gpio client: add missing dependency pps: new client driver using GPIO pps: default echo function include/linux/dma-mapping.h: add dma_zalloc_coherent() sysctl: make CONFIG_SYSCTL_SYSCALL default to n sysctl: add support for poll() RapidIO: documentation update drivers/net/rionet.c: fix ethernet address macros for LE platforms RapidIO: fix potential null deref in rio_setup_device() RapidIO: add mport driver for Tsi721 bridge ...
2011-11-02aio: allocate kiocbs in batchesJeff Moyer
In testing aio on a fast storage device, I found that the context lock takes up a fair amount of cpu time in the I/O submission path. The reason is that we take it for every I/O submitted (see __aio_get_req). Since we know how many I/Os are passed to io_submit, we can preallocate the kiocbs in batches, reducing the number of times we take and release the lock. In my testing, I was able to reduce the amount of time spent in _raw_spin_lock_irq by .56% (average of 3 runs). The command I used to test this was: aio-stress -O -o 2 -o 3 -r 8 -d 128 -b 32 -i 32 -s 16384 <dev> I also tested the patch with various numbers of events passed to io_submit, and I ran the xfstests aio group of tests to ensure I didn't break anything. Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Cc: Daniel Ehrenberg <dehrenberg@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-11-02sysctl: add support for poll()Lucas De Marchi
Adding support for poll() in sysctl fs allows userspace to receive notifications of changes in sysctl entries. This adds a infrastructure to allow files in sysctl fs to be pollable and implements it for hostname and domainname. [akpm@linux-foundation.org: s/declare/define/ for definitions] Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi> Cc: Greg KH <gregkh@suse.de> Cc: Kay Sievers <kay.sievers@vrfy.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-11-02proc: fix races against execve() of /proc/PID/fd**Vasiliy Kulikov
fd* files are restricted to the task's owner, and other users may not get direct access to them. But one may open any of these files and run any setuid program, keeping opened file descriptors. As there are permission checks on open(), but not on readdir() and read(), operations on the kept file descriptors will not be checked. It makes it possible to violate procfs permission model. Reading fdinfo/* may disclosure current fds' position and flags, reading directory contents of fdinfo/ and fd/ may disclosure the number of opened files by the target task. This information is not sensible per se, but it can reveal some private information (like length of a password stored in a file) under certain conditions. Used existing (un)lock_trace functions to check for ptrace_may_access(), but instead of using EPERM return code from it use EACCES to be consistent with existing proc_pid_follow_link()/proc_pid_readlink() return code. If they differ, attacker can guess what fds exist by analyzing stat() return code. Patched handlers: stat() for fd/*, stat() and read() for fdindo/*, readdir() and lookup() for fd/ and fdinfo/. Signed-off-by: Vasiliy Kulikov <segoon@openwall.com> Cc: Cyrill Gorcunov <gorcunov@gmail.com> Cc: <stable@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-11-02procfs: report EISDIR when reading sysctl dirs in procPavel Emelyanov
On reading sysctl dirs we should return -EISDIR instead of -EINVAL. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-11-02hfs: fix hfs_find_init() sb->ext_tree NULL ptr oopsPhillip Lougher
Clement Lecigne reports a filesystem which causes a kernel oops in hfs_find_init() trying to dereference sb->ext_tree which is NULL. This proves to be because the filesystem has a corrupted MDB extent record, where the extents file does not fit into the first three extents in the file record (the first blocks). In hfs_get_block() when looking up the blocks for the extent file (HFS_EXT_CNID), it fails the first blocks special case, and falls through to the extent code (which ultimately calls hfs_find_init()) which is in the process of being initialised. Hfs avoids this scenario by always having the extents b-tree fitting into the first blocks (the extents B-tree can't have overflow extents). The fix is to check at mount time that the B-tree fits into first blocks, i.e. fail if HFS_I(inode)->alloc_blocks >= HFS_I(inode)->first_blocks Note, the existing commit 47f365eb57573 ("hfs: fix oops on mount with corrupted btree extent records") becomes subsumed into this as a special case, but only for the extents B-tree (HFS_EXT_CNID), it is perfectly acceptable for the catalog B-Tree file to grow beyond three extents, with the remaining extent descriptors in the extents overfow. This fixes CVE-2011-2203 Reported-by: Clement LECIGNE <clement.lecigne@netasq.com> Signed-off-by: Phillip Lougher <plougher@redhat.com> Cc: Jeff Mahoney <jeffm@suse.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-11-02isofs: add readpages supportNamjae Jeon
Use mpage_readpages() instead of multiple calls to isofs_readpage() to reduce the CPU utilization and make performance higher. Signed-off-by: Namjae Jeon <linkinjeon@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-11-02ramfs: remove module leftoversRichard Weinberger
Since ramfs is hard-selected to "y", the module leftovers make no sense. Signed-off-by: Richard Weinberger <richard@nod.at> Reviewed-by: WANG Cong <xiyou.wangcong@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-11-02binfmt_elf: fix PIE execution with randomization disabledJiri Kosina
The case of address space randomization being disabled in runtime through randomize_va_space sysctl is not treated properly in load_elf_binary(), resulting in SIGKILL coming at exec() time for certain PIE-linked binaries in case the randomization has been disabled at runtime prior to calling exec(). Handle the randomize_va_space == 0 case the same way as if we were not supporting .text randomization at all. Based on original patch by H.J. Lu and Josh Boyer. Signed-off-by: Jiri Kosina <jkosina@suse.cz> Cc: Ingo Molnar <mingo@elte.hu> Cc: Russell King <rmk@arm.linux.org.uk> Cc: H.J. Lu <hongjiu.lu@intel.com> Cc: <stable@kernel.org> Tested-by: Josh Boyer <jwboyer@redhat.com> Acked-by: Nicolas Pitre <nicolas.pitre@linaro.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-11-02Merge branch 'for-next' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/hch/vfs-queue * 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/hch/vfs-queue: vfs: add d_prune dentry operation vfs: protect i_nlink filesystems: add set_nlink() filesystems: add missing nlink wrappers logfs: remove unnecessary nlink setting ocfs2: remove unnecessary nlink setting jfs: remove unnecessary nlink setting hypfs: remove unnecessary nlink setting vfs: ignore error on forced remount readlinkat: ensure we return ENOENT for the empty pathname for normal lookups vfs: fix dentry leak in simple_fill_super()
2011-11-02Merge branch 'for_linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (97 commits) jbd2: Unify log messages in jbd2 code jbd/jbd2: validate sb->s_first in journal_get_superblock() ext4: let ext4_ext_rm_leaf work with EXT_DEBUG defined ext4: fix a syntax error in ext4_ext_insert_extent when debugging enabled ext4: fix a typo in struct ext4_allocation_context ext4: Don't normalize an falloc request if it can fit in 1 extent. ext4: remove comments about extent mount option in ext4_new_inode() ext4: let ext4_discard_partial_buffers handle unaligned range correctly ext4: return ENOMEM if find_or_create_pages fails ext4: move vars to local scope in ext4_discard_partial_page_buffers_no_lock() ext4: Create helper function for EXT4_IO_END_UNWRITTEN and i_aiodio_unwritten ext4: optimize locking for end_io extent conversion ext4: remove unnecessary call to waitqueue_active() ext4: Use correct locking for ext4_end_io_nolock() ext4: fix race in xattr block allocation path ext4: trace punch_hole correctly in ext4_ext_map_blocks ext4: clean up AGGRESSIVE_TEST code ext4: move variables to their scope ext4: fix quota accounting during migration ext4: migrate cleanup ...
2011-11-02Merge branch 'for_linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: udf: Cleanup metadata flags handling udf: Skip mirror metadata FE loading when metadata FE is ok ext3: Allow quota file use root reservation udf: Remove web reference from UDF MAINTAINERS entry quota: Drop path reference on error exit from quotactl udf: Neaten udf_debug uses udf: Neaten logging output, use vsprintf extension %pV udf: Convert printks to pr_<level> udf: Rename udf_warning to udf_warn udf: Rename udf_error to udf_err udf: Promote some debugging messages to udf_error ext3: Remove the obsolete broken EXT3_IOC32_WAIT_FOR_READONLY. udf: Add readpages support for udf. ext3/balloc.c: local functions should be static ext2: fix the outdated comment in ext2_nfs_get_inode() ext3: remove deprecated oldalloc fs/ext3/balloc.c: delete useless initialization fs/ext2/balloc.c: delete useless initialization ext3: fix message in ext3_remount for rw-remount case ext3: Remove i_mutex from ext3_sync_file() Fix up trivial (printf format cleanup) conflicts in fs/udf/udfdecl.h
2011-11-02Merge branch 'for-linus' of git://github.com/richardweinberger/linuxLinus Torvalds
* 'for-linus' of git://github.com/richardweinberger/linux: (90 commits) um: fix ubd cow size um: Fix kmalloc argument order in um/vdso/vma.c um: switch to use of drivers/Kconfig UserModeLinux-HOWTO.txt: fix a typo UserModeLinux-HOWTO.txt: remove ^H characters um: we need sys/user.h only on i386 um: merge delay_{32,64}.c um: distribute exports to where exported stuff is defined um: kill system-um.h um: generic ftrace.h will do... um: segment.h is x86-only and needed only there um: asm/pda.h is not needed anymore um: hw_irq.h can go generic as well um: switch to generic-y um: clean Kconfig up a bit um: a couple of missing dependencies... um: kill useless argument of free_chan() and free_one_chan() um: unify ptrace_user.h um: unify KSTK_... um: fix gcov build breakage ...
2011-11-02um: kill useless include of user.hAl Viro
everything in USER_OBJ gets it via -include user.h Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Richard Weinberger <richard@nod.at>
2011-11-02vfs: add d_prune dentry operationSage Weil
This adds a d_prune dentry operation that is called by the VFS prior to pruning (i.e. unhashing and killing) a hashed dentry from the dcache. Wrap dentry_lru_del() and use the new _prune() helper in the cases where we are about to unhash and kill the dentry. This will be used by Ceph to maintain a flag indicating whether the complete contents of a directory are contained in the dcache, allowing it to satisfy lookups and readdir without addition server communication. Renumber a few DCACHE_* #defines to group DCACHE_OP_PRUNE with the other DCACHE_OP_ bits. Signed-off-by: Sage Weil <sage@newdream.net> Signed-off-by: Christoph Hellwig <hch@lst.de>
2011-11-02vfs: protect i_nlinkMiklos Szeredi
Prevent direct modification of i_nlink by making it const and adding a non-const __i_nlink alias. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Tested-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2011-11-02filesystems: add set_nlink()Miklos Szeredi
Replace remaining direct i_nlink updates with a new set_nlink() updater function. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Tested-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2011-11-02filesystems: add missing nlink wrappersMiklos Szeredi
Replace direct i_nlink updates with the respective updater function (inc_nlink, drop_nlink, clear_nlink, inode_dec_link_count). Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2011-11-02logfs: remove unnecessary nlink settingMiklos Szeredi
alloc_inode() initializes i_nlink to 1. Remove unnecessary re-initialization. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> CC: Joern Engel <joern@logfs.org> CC: Prasad Joshi <prasadjoshi.linux@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2011-11-02ocfs2: remove unnecessary nlink settingMiklos Szeredi
alloc_inode() initializes i_nlink to 1. Remove unnecessary re-initialization. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> CC: Joel Becker <jlbec@evilplan.org> CC: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2011-11-02jfs: remove unnecessary nlink settingMiklos Szeredi
alloc_inode() initializes i_nlink to 1. Remove unnecessary re-initialization. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Acked-by: Dave Kleikamp <dave.kleikamp@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2011-11-02vfs: ignore error on forced remountMiklos Szeredi
On emergency remount we want to force MS_RDONLY on the super block even if ->remount_fs() failed for some reason. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Tested-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2011-11-02readlinkat: ensure we return ENOENT for the empty pathname for normal lookupsAndy Whitcroft
Since the commit below which added O_PATH support to the *at() calls, the error return for readlink/readlinkat for the empty pathname has switched from ENOENT to EINVAL: commit 65cfc6722361570bfe255698d9cd4dccaf47570d Author: Al Viro <viro@zeniv.linux.org.uk> Date: Sun Mar 13 15:56:26 2011 -0400 readlinkat(), fchownat() and fstatat() with empty relative pathnames This is both unexpected for userspace and makes readlink/readlinkat inconsistant with all other interfaces; and inconsistant with our stated return for these pathnames. As the readlinkat call does not have a flags parameter we cannot use the AT_EMPTY_PATH approach used in the other calls. Therefore expose whether the original path is infact entry via a new user_path_at_empty() path lookup function. Use this to determine whether to default to EINVAL or ENOENT for failures. Addresses http://bugs.launchpad.net/bugs/817187 [akpm@linux-foundation.org: remove unused getname_flags()] Signed-off-by: Andy Whitcroft <apw@canonical.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: <stable@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
2011-11-02vfs: fix dentry leak in simple_fill_super()Konstantin Khlebnikov
put dentry if inode allocation failed, d_genocide() cannot release it Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
2011-11-01jbd2: Unify log messages in jbd2 codeEryu Guan
Some jbd2 code prints out kernel messages with "JBD2: " prefix, at the same time other jbd2 code prints with "JBD: " prefix. Unify the prefix to "JBD2: ". Signed-off-by: Eryu Guan <guaneryu@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-11-01jbd/jbd2: validate sb->s_first in journal_get_superblock()Eryu Guan
I hit a J_ASSERT(blocknr != 0) failure in cleanup_journal_tail() when mounting a fsfuzzed ext3 image. It turns out that the corrupted ext3 image has s_first = 0 in journal superblock, and the 0 is passed to journal->j_head in journal_reset(), then to blocknr in cleanup_journal_tail(), in the end the J_ASSERT failed. So validate s_first after reading journal superblock from disk in journal_get_superblock() to ensure s_first is valid. The following script could reproduce it: fstype=ext3 blocksize=1024 img=$fstype.img offset=0 found=0 magic="c0 3b 39 98" dd if=/dev/zero of=$img bs=1M count=8 mkfs -t $fstype -b $blocksize -F $img filesize=`stat -c %s $img` while [ $offset -lt $filesize ] do if od -j $offset -N 4 -t x1 $img | grep -i "$magic";then echo "Found journal: $offset" found=1 break fi offset=`echo "$offset+$blocksize" | bc` done if [ $found -ne 1 ];then echo "Magic \"$magic\" not found" exit 1 fi dd if=/dev/zero of=$img seek=$(($offset+23)) conv=notrunc bs=1 count=1 mkdir -p ./mnt mount -o loop $img ./mnt Cc: Jan Kara <jack@suse.cz> Signed-off-by: Eryu Guan <guaneryu@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-11-01ext4: let ext4_ext_rm_leaf work with EXT_DEBUG definedYongqiang Yang
The variable 'block' is removed by commit 750c9c47, so use the replacement ex_ee_block instead. Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-11-01ext4: fix a syntax error in ext4_ext_insert_extent when debugging enabledYongqiang Yang
This patch fixes a syntax error which omits a comma. Besides this, logical block number is unsigend 32 bits, so printk should use %u instead %d. Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-11-01nfsd4: typo logical vs bitwise negate in nfsd4_decode_share_accessBenny Halevy
Signed-off-by: Benny Halevy <bhalevy@tonian.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2011-11-01Merge branch 'pstore' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux * 'pstore' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux: pstore: make pstore write function return normal success/fail value pstore: change mutex locking to spin_locks pstore: defer inserting OOPS entries into pstore
2011-11-01sysfs: Make sysfs_rename safe with sysfs_dirents in rbtrees.Eric W. Biederman
In sysfs_rename we need to remove the optimization of not calling sysfs_unlink_sibling and sysfs_link_sibling if the renamed parent directory is not changing. This optimization is no longer valid now that sysfs dirents are stored in an rbtree sorted by name. Move the assignment of s_ns before the call of sysfs_link_sibling. With no sysfs_dirent fields changing after the call of sysfs_link_sibling this allows sysfs_link_sibling to take any of the directory entries into account when it builds the rbtrees, and s_ns looks like a prime canidate to be used in the rbtree in the future. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Cc: Jiri Slaby <jirislaby@gmail.com> Cc: Greg KH <gregkh@suse.de> Cc: David Miller <davem@davemloft.net> Cc: Mikulas Patocka <mpatocka@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-10-31epoll: fix spurious lockdep warningsNelson Elhage
epoll can acquire recursively acquire ep->mtx on multiple "struct eventpoll"s at once in the case where one epoll fd is monitoring another epoll fd. This is perfectly OK, since we're careful about the lock ordering, but it causes spurious lockdep warnings. Annotate the recursion using mutex_lock_nested, and add a comment explaining the nesting rules for good measure. Recent versions of systemd are triggering this, and it can also be demonstrated with the following trivial test program: --------------------8<-------------------- int main(void) { int e1, e2; struct epoll_event evt = { .events = EPOLLIN }; e1 = epoll_create1(0); e2 = epoll_create1(0); epoll_ctl(e1, EPOLL_CTL_ADD, e2, &evt); return 0; } --------------------8<-------------------- Reported-by: Paul Bolle <pebolle@tiscali.nl> Tested-by: Paul Bolle <pebolle@tiscali.nl> Signed-off-by: Nelson Elhage <nelhage@nelhage.com> Acked-by: Jason Baron <jbaron@redhat.com> Cc: Dave Jones <davej@redhat.com> Cc: Davide Libenzi <davidel@xmailserver.org> Cc: <stable@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-10-31fat: follow rename pack_hex_byte() to hex_byte_pack()Andy Shevchenko
There is no functional change. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-10-31treewide: use __printf not __attribute__((format(printf,...)))Joe Perches
Standardize the style for compiler based printf format verification. Standardized the location of __printf too. Done via script and a little typing. $ grep -rPl --include=*.[ch] -w "__attribute__" * | \ grep -vP "^(tools|scripts|include/linux/compiler-gcc.h)" | \ xargs perl -n -i -e 'local $/; while (<>) { s/\b__attribute__\s*\(\s*\(\s*format\s*\(\s*printf\s*,\s*(.+)\s*,\s*(.+)\s*\)\s*\)\s*\)/__printf($1, $2)/g ; print; }' [akpm@linux-foundation.org: revert arch bits] Signed-off-by: Joe Perches <joe@perches.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-10-31fs/pipe.c: add ->statfs callback for pipefsPavel Emelyanov
Currently a statfs on a pipe's /proc/<pid>/fd/ link returns -ENOSYS. Wire pipfs up so that the statfs succeeds. This is required by checkpoint-restart in the userspace to make it possible to distinguish pipes from fifos. When we dump information about task's open files we use the /proc/pid/fd directoy's symlinks and the fact that opening any of them gives us exactly the same dentry->inode pair as the original process has. Now if a task we're dumping has opened pipe and fifo we need to detect this and act accordingly. Knowing that an fd with type S_ISFIFO resides on a pipefs is the most precise way. Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Reviewed-by: Tejun Heo <tj@kernel.org> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-10-31fs/buffer.c: add device information for error output in __find_get_block_slow()Tao Ma
On the ext4 mailing list[1], we got some report about errors in __find_get_block_slow(), but the information is very limited. If the device information is given, we can know the name of the sick volume. Futhermore, we can get the corresponding status of that block(group, inode block etc) by analyzing the disk layout. [1] http://marc.info/?l=linux-ext4&m=131379831421147&w=2 Signed-off-by: Tao Ma <boyu.mt@taobao.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-10-31vmscan: fix shrinker callback bug in fs/super.cMikulas Patocka
The callback must not return -1 when nr_to_scan is zero. Fix the bug in fs/super.c and add this requirement to the callback specification. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-10-31lib/string.c: introduce memchr_inv()Akinobu Mita
memchr_inv() is mainly used to check whether the whole buffer is filled with just a specified byte. The function name and prototype are stolen from logfs and the implementation is from SLUB. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Acked-by: Christoph Lameter <cl@linux-foundation.org> Acked-by: Pekka Enberg <penberg@kernel.org> Cc: Matt Mackall <mpm@selenic.com> Acked-by: Joern Engel <joern@logfs.org> Cc: Marcin Slusarz <marcin.slusarz@gmail.com> Cc: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-10-31ext4: warn if direct reclaim tries to writeback pagesMel Gorman
Direct reclaim should never writeback pages. Warn if an attempt is made. Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Dave Chinner <david@fromorbit.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Johannes Weiner <jweiner@redhat.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Alex Elder <aelder@sgi.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Chris Mason <chris.mason@oracle.com> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-10-31xfs: warn if direct reclaim tries to writeback pagesMel Gorman
Direct reclaim should never writeback pages. For now, handle the situation and warn about it. Ultimately, this will be a BUG_ON. Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Dave Chinner <david@fromorbit.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Johannes Weiner <jweiner@redhat.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Alex Elder <aelder@sgi.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Chris Mason <chris.mason@oracle.com> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-10-31mm: distinguish between mlocked and pinned pagesChristoph Lameter
Some kernel components pin user space memory (infiniband and perf) (by increasing the page count) and account that memory as "mlocked". The difference between mlocking and pinning is: A. mlocked pages are marked with PG_mlocked and are exempt from swapping. Page migration may move them around though. They are kept on a special LRU list. B. Pinned pages cannot be moved because something needs to directly access physical memory. They may not be on any LRU list. I recently saw an mlockalled process where mm->locked_vm became bigger than the virtual size of the process (!) because some memory was accounted for twice: Once when the page was mlocked and once when the Infiniband layer increased the refcount because it needt to pin the RDMA memory. This patch introduces a separate counter for pinned pages and accounts them seperately. Signed-off-by: Christoph Lameter <cl@linux.com> Cc: Mike Marciniszyn <infinipath@qlogic.com> Cc: Roland Dreier <roland@kernel.org> Cc: Sean Hefty <sean.hefty@intel.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-10-31tmpfs: add "tmpfs" to the Kconfig prompt to make it obvious.Robert P. J. Day
Add the leading word "tmpfs" to the Kconfig string to make it blindingly obvious that this selection refers to tmpfs. Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca> Acked-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-10-31oom: remove oom_disable_countDavid Rientjes
This removes mm->oom_disable_count entirely since it's unnecessary and currently buggy. The counter was intended to be per-process but it's currently decremented in the exit path for each thread that exits, causing it to underflow. The count was originally intended to prevent oom killing threads that share memory with threads that cannot be killed since it doesn't lead to future memory freeing. The counter could be fixed to represent all threads sharing the same mm, but it's better to remove the count since: - it is possible that the OOM_DISABLE thread sharing memory with the victim is waiting on that thread to exit and will actually cause future memory freeing, and - there is no guarantee that a thread is disabled from oom killing just because another thread sharing its mm is oom disabled. Signed-off-by: David Rientjes <rientjes@google.com> Reported-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Cc: Ying Han <yinghan@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-10-31Cross Memory AttachChristopher Yeoh
The basic idea behind cross memory attach is to allow MPI programs doing intra-node communication to do a single copy of the message rather than a double copy of the message via shared memory. The following patch attempts to achieve this by allowing a destination process, given an address and size from a source process, to copy memory directly from the source process into its own address space via a system call. There is also a symmetrical ability to copy from the current process's address space into a destination process's address space. - Use of /proc/pid/mem has been considered, but there are issues with using it: - Does not allow for specifying iovecs for both src and dest, assuming preadv or pwritev was implemented either the area read from or written to would need to be contiguous. - Currently mem_read allows only processes who are currently ptrace'ing the target and are still able to ptrace the target to read from the target. This check could possibly be moved to the open call, but its not clear exactly what race this restriction is stopping (reason appears to have been lost) - Having to send the fd of /proc/self/mem via SCM_RIGHTS on unix domain socket is a bit ugly from a userspace point of view, especially when you may have hundreds if not (eventually) thousands of processes that all need to do this with each other - Doesn't allow for some future use of the interface we would like to consider adding in the future (see below) - Interestingly reading from /proc/pid/mem currently actually involves two copies! (But this could be fixed pretty easily) As mentioned previously use of vmsplice instead was considered, but has problems. Since you need the reader and writer working co-operatively if the pipe is not drained then you block. Which requires some wrapping to do non blocking on the send side or polling on the receive. In all to all communication it requires ordering otherwise you can deadlock. And in the example of many MPI tasks writing to one MPI task vmsplice serialises the copying. There are some cases of MPI collectives where even a single copy interface does not get us the performance gain we could. For example in an MPI_Reduce rather than copy the data from the source we would like to instead use it directly in a mathops (say the reduce is doing a sum) as this would save us doing a copy. We don't need to keep a copy of the data from the source. I haven't implemented this, but I think this interface could in the future do all this through the use of the flags - eg could specify the math operation and type and the kernel rather than just copying the data would apply the specified operation between the source and destination and store it in the destination. Although we don't have a "second user" of the interface (though I've had some nibbles from people who may be interested in using it for intra process messaging which is not MPI). This interface is something which hardware vendors are already doing for their custom drivers to implement fast local communication. And so in addition to this being useful for OpenMPI it would mean the driver maintainers don't have to fix things up when the mm changes. There was some discussion about how much faster a true zero copy would go. Here's a link back to the email with some testing I did on that: http://marc.info/?l=linux-mm&m=130105930902915&w=2 There is a basic man page for the proposed interface here: http://ozlabs.org/~cyeoh/cma/process_vm_readv.txt This has been implemented for x86 and powerpc, other architecture should mainly (I think) just need to add syscall numbers for the process_vm_readv and process_vm_writev. There are 32 bit compatibility versions for 64-bit kernels. For arch maintainers there are some simple tests to be able to quickly verify that the syscalls are working correctly here: http://ozlabs.org/~cyeoh/cma/cma-test-20110718.tgz Signed-off-by: Chris Yeoh <yeohc@au1.ibm.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: David Howells <dhowells@redhat.com> Cc: James Morris <jmorris@namei.org> Cc: <linux-man@vger.kernel.org> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-10-31/proc/self/numa_maps: restore "huge" tag for hugetlb vmasAndrew Morton
The display of the "huge" tag was accidentally removed in 29ea2f698 ("mm: use walk_page_range() instead of custom page table walking code"). Reported-by: Stephen Hemminger <shemminger@vyatta.com> Tested-by: Stephen Hemminger <shemminger@vyatta.com> Reviewed-by: Stephen Wilson <wilsons@start.ca> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Hugh Dickins <hughd@google.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: <stable@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>