summaryrefslogtreecommitdiff
path: root/fs/namei.c
AgeCommit message (Collapse)Author
2016-12-24Replace <asm/uaccess.h> with <linux/uaccess.h> globallyLinus Torvalds
This was entirely automated, using the script by Al: PATT='^[[:blank:]]*#[[:blank:]]*include[[:blank:]]*<asm/uaccess.h>' sed -i -e "s!$PATT!#include <linux/uaccess.h>!" \ $(git grep -l "$PATT"|grep -v ^include/linux/uaccess.h) to do the replacement at the end of the merge window. Requested-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-17Merge uncontroversial parts of branch 'readlink' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs Pull partial readlink cleanups from Miklos Szeredi. This is the uncontroversial part of the readlink cleanup patch-set that simplifies the default readlink handling. Miklos and Al are still discussing the rest of the series. * git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: vfs: make generic_readlink() static vfs: remove ".readlink = generic_readlink" assignments vfs: default to generic_readlink() vfs: replace calling i_op->readlink with vfs_readlink() proc/self: use generic_readlink ecryptfs: use vfs_get_link() bad_inode: add missing i_op initializers
2016-12-17Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull more vfs updates from Al Viro: "In this pile: - autofs-namespace series - dedupe stuff - more struct path constification" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (40 commits) ocfs2: implement the VFS clone_range, copy_range, and dedupe_range features ocfs2: charge quota for reflinked blocks ocfs2: fix bad pointer cast ocfs2: always unlock when completing dio writes ocfs2: don't eat io errors during _dio_end_io_write ocfs2: budget for extent tree splits when adding refcount flag ocfs2: prohibit refcounted swapfiles ocfs2: add newlines to some error messages ocfs2: convert inode refcount test to a helper simple_write_end(): don't zero in short copy into uptodate exofs: don't mess with simple_write_{begin,end} 9p: saner ->write_end() on failing copy into non-uptodate page fix gfs2_stuffed_write_end() on short copies fix ceph_write_end() nfs_write_end(): fix handling of short copies vfs: refactor clone/dedupe_file_range common functions fs: try to clone files first in vfs_copy_file_range vfs: misc struct path constification namespace.c: constify struct path passed to a bunch of primitives quota: constify struct path in quota_on ...
2016-12-16Merge branch 'work.autofs' into for-linusAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-12-16Merge branch 'overlayfs-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs Pull overlayfs updates from Miklos Szeredi: "This update contains: - try to clone on copy-up - allow renaming a directory - split source into managable chunks - misc cleanups and fixes It does not contain the read-only fd data inconsistency fix, which Al didn't like. I'll leave that to the next year..." * 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: (36 commits) ovl: fix reStructuredText syntax errors in documentation ovl: fix return value of ovl_fill_super ovl: clean up kstat usage ovl: fold ovl_copy_up_truncate() into ovl_copy_up() ovl: create directories inside merged parent opaque ovl: opaque cleanup ovl: show redirect_dir mount option ovl: allow setting max size of redirect ovl: allow redirect_dir to default to "on" ovl: check for emptiness of redirect dir ovl: redirect on rename-dir ovl: lookup redirects ovl: consolidate lookup for underlying layers ovl: fix nested overlayfs mount ovl: check namelen ovl: split super.c ovl: use d_is_dir() ovl: simplify lookup ovl: check lower existence of rename target ovl: rename: simplify handling of lower/merged directory ...
2016-12-16Revert "vfs: rename: check backing inode being equal"Miklos Szeredi
This reverts commit 9409e22acdfc9153f88d9b1ed2bd2a5b34d2d3ca. Since commit 51f7e52dc943 ("ovl: share inode for hard link") there's no need to call d_real_inode() to check two overlay inodes for equality. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-12-09vfs: make generic_readlink() staticMiklos Szeredi
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-12-09vfs: remove ".readlink = generic_readlink" assignmentsMiklos Szeredi
If .readlink == NULL implies generic_readlink(). Generated by: to_del="\.readlink.*=.*generic_readlink" for i in `git grep -l $to_del`; do sed -i "/$to_del"/d $i; done Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-12-09vfs: default to generic_readlink()Miklos Szeredi
If i_op->readlink is NULL, but i_op->get_link is set then vfs_readlink() defaults to calling generic_readlink(). The IOP_DEFAULT_READLINK flag indicates that the above conditions are met and the default action can be taken. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-12-09vfs: replace calling i_op->readlink with vfs_readlink()Miklos Szeredi
Also check d_is_symlink() in callers instead of inode->i_op->readlink because following patches will allow NULL ->readlink for symlinks. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-12-05namei: fold should_follow_link() with the step into not-followed linkAl Viro
All callers are followed by the same boilerplate - "if it has returned 0, update nd->path/inode/seq - we are not following a symlink here". Pull it into the function itself, renaming it into step_into(). Rename WALK_GET to WALK_FOLLOW, while we are at it - more descriptive name. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-12-05namei: pass both WALK_GET and WALK_MORE to should_follow_link()Al Viro
... and pull put_link() logics into it. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-12-05namei: invert WALK_PUT logicsAl Viro
... turning the condition for put_link() in walk_component() into "WALK_MORE not passed and depth is non-zero". Again, makes for simpler arguments. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-12-05namei: shift interpretation of LOOKUP_FOLLOW inside should_follow_link()Al Viro
Simplifies the arguments both for it and for walk_component() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-12-05namei: saner calling conventions for mountpoint_last()Al Viro
leave the result in nd->path, have caller do follow_mount() and copy it to the final destination. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-12-05namei.c: get rid of user_path_parent()Al Viro
direct use of filename_parentat() is just as readable Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-12-05vfs: misc struct path constificationAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-12-02vfs: change d_manage() to take a struct pathIan Kent
For the autofs module to be able to reliably check if a dentry is a mountpoint in a multiple namespace environment the ->d_manage() dentry operation will need to take a path argument instead of a dentry. Link: http://lkml.kernel.org/r/20161011053352.27645.83962.stgit@pluto.themaw.net Signed-off-by: Ian Kent <raven@themaw.net> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Omar Sandoval <osandov@osandov.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-10-14Merge branch 'overlayfs-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs Pull overlayfs updates from Miklos Szeredi: "This update contains fixes to the "use mounter's permission to access underlying layers" area, and miscellaneous other fixes and cleanups. No new features this time" * 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: ovl: use vfs_get_link() vfs: add vfs_get_link() helper ovl: use generic_readlink ovl: explain error values when removing acl from workdir ovl: Fix info leak in ovl_lookup_temp() ovl: during copy up, switch to mounter's creds early ovl: lookup: do getxattr with mounter's permission ovl: copy_up_xattr(): use strnlen
2016-10-14vfs: add vfs_get_link() helperMiklos Szeredi
This helper is for filesystems that want to read the symlink and are better off with the get_link() interface (returning a char *) rather than the readlink() interface (copy into a userspace buffer). Also call the LSM hook for readlink (not get_link) since this is for symlink reading not following. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-10-10Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull more vfs updates from Al Viro: ">rename2() work from Miklos + current_time() from Deepa" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: fs: Replace current_fs_time() with current_time() fs: Replace CURRENT_TIME_SEC with current_time() for inode timestamps fs: Replace CURRENT_TIME with current_time() for inode timestamps fs: proc: Delete inode time initializations in proc_alloc_inode() vfs: Add current_time() api vfs: add note about i_op->rename changes to porting fs: rename "rename2" i_op to "rename" vfs: remove unused i_op->rename fs: make remaining filesystems use .rename2 libfs: support RENAME_NOREPLACE in simple_rename() fs: support RENAME_NOREPLACE for local filesystems ncpfs: fix unused variable warning
2016-09-27fs: rename "rename2" i_op to "rename"Miklos Szeredi
Generated patch: sed -i "s/\.rename2\t/\.rename\t\t/" `git grep -wl rename2` sed -i "s/\brename2\b/rename/g" `git grep -wl rename2` Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-09-27vfs: remove unused i_op->renameMiklos Szeredi
No in-tree uses remain. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-09-16vfs: update ovl inode before relatime checkMiklos Szeredi
On overlayfs relatime_need_update() needs inode times to be correct on overlay inode. But i_mtime and i_ctime are updated by filesystem code on underlying inode only, so they will be out-of-date on the overlay inode. This patch copies the times from the underlying inode if needed. This can't be done if called from RCU lookup (link following) but link m/ctime are not updated by fs, so this is all right. This patch doesn't change functionality for anything but overlayfs. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-08-07fs: return EPERM on immutable inodeEryu Guan
In most cases, EPERM is returned on immutable inode, and there're only a few places returning EACCES. I noticed this when running LTP on overlayfs, setxattr03 failed due to unexpected EACCES on immutable inode. So converting all EACCES to EPERM on immutable inode. Acked-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Eryu Guan <guaneryu@gmail.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-29Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull userns vfs updates from Eric Biederman: "This tree contains some very long awaited work on generalizing the user namespace support for mounting filesystems to include filesystems with a backing store. The real world target is fuse but the goal is to update the vfs to allow any filesystem to be supported. This patchset is based on a lot of code review and testing to approach that goal. While looking at what is needed to support the fuse filesystem it became clear that there were things like xattrs for security modules that needed special treatment. That the resolution of those concerns would not be fuse specific. That sorting out these general issues made most sense at the generic level, where the right people could be drawn into the conversation, and the issues could be solved for everyone. At a high level what this patchset does a couple of simple things: - Add a user namespace owner (s_user_ns) to struct super_block. - Teach the vfs to handle filesystem uids and gids not mapping into to kuids and kgids and being reported as INVALID_UID and INVALID_GID in vfs data structures. By assigning a user namespace owner filesystems that are mounted with only user namespace privilege can be detected. This allows security modules and the like to know which mounts may not be trusted. This also allows the set of uids and gids that are communicated to the filesystem to be capped at the set of kuids and kgids that are in the owning user namespace of the filesystem. One of the crazier corner casees this handles is the case of inodes whose i_uid or i_gid are not mapped into the vfs. Most of the code simply doesn't care but it is easy to confuse the inode writeback path so no operation that could cause an inode write-back is permitted for such inodes (aka only reads are allowed). This set of changes starts out by cleaning up the code paths involved in user namespace permirted mounts. Then when things are clean enough adds code that cleanly sets s_user_ns. Then additional restrictions are added that are possible now that the filesystem superblock contains owner information. These changes should not affect anyone in practice, but there are some parts of these restrictions that are changes in behavior. - Andy's restriction on suid executables that does not honor the suid bit when the path is from another mount namespace (think /proc/[pid]/fd/) or when the filesystem was mounted by a less privileged user. - The replacement of the user namespace implicit setting of MNT_NODEV with implicitly setting SB_I_NODEV on the filesystem superblock instead. Using SB_I_NODEV is a stronger form that happens to make this state user invisible. The user visibility can be managed but it caused problems when it was introduced from applications reasonably expecting mount flags to be what they were set to. There is a little bit of work remaining before it is safe to support mounting filesystems with backing store in user namespaces, beyond what is in this set of changes. - Verifying the mounter has permission to read/write the block device during mount. - Teaching the integrity modules IMA and EVM to handle filesystems mounted with only user namespace root and to reduce trust in their security xattrs accordingly. - Capturing the mounters credentials and using that for permission checks in d_automount and the like. (Given that overlayfs already does this, and we need the work in d_automount it make sense to generalize this case). Furthermore there are a few changes that are on the wishlist: - Get all filesystems supporting posix acls using the generic posix acls so that posix_acl_fix_xattr_from_user and posix_acl_fix_xattr_to_user may be removed. [Maintainability] - Reducing the permission checks in places such as remount to allow the superblock owner to perform them. - Allowing the superblock owner to chown files with unmapped uids and gids to something that is mapped so the files may be treated normally. I am not considering even obvious relaxations of permission checks until it is clear there are no more corner cases that need to be locked down and handled generically. Many thanks to Seth Forshee who kept this code alive, and putting up with me rewriting substantial portions of what he did to handle more corner cases, and for his diligent testing and reviewing of my changes" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (30 commits) fs: Call d_automount with the filesystems creds fs: Update i_[ug]id_(read|write) to translate relative to s_user_ns evm: Translate user/group ids relative to s_user_ns when computing HMAC dquot: For now explicitly don't support filesystems outside of init_user_ns quota: Handle quota data stored in s_user_ns in quota_setxquota quota: Ensure qids map to the filesystem vfs: Don't create inodes with a uid or gid unknown to the vfs vfs: Don't modify inodes with a uid or gid unknown to the vfs cred: Reject inodes with invalid ids in set_create_file_as() fs: Check for invalid i_uid in may_follow_link() vfs: Verify acls are valid within superblock's s_user_ns. userns: Handle -1 in k[ug]id_has_mapping when !CONFIG_USER_NS fs: Refuse uid/gid changes which don't map into s_user_ns selinux: Add support for unprivileged mounts from user namespaces Smack: Handle labels consistently in untrusted mounts Smack: Add support for unprivileged mounts from user namespaces fs: Treat foreign mounts as nosuid fs: Limit file caps to the user namespace of the super block userns: Remove the now unnecessary FS_USERNS_DEV_MOUNT flag userns: Remove implicit MNT_NODEV fragility. ...
2016-07-29Revert "vfs: add lookup_hash() helper"Linus Torvalds
This reverts commit 3c9fe8cdff1b889a059a30d22f130372f2b3885f. As Miklos points out in commit c1b2cc1a765a, the "lookup_hash()" helper is now unused, and in fact, with the hash salting changes, since the hash of a dentry name now depends on the directory dentry it is in, the helper function isn't even really likely to be useful. So rather than keep it around in case somebody else might end up finding a use for it, let's just remove the helper and not trick people into thinking it might be a useful thing. For example, I had obviously completely missed how the helper didn't follow the normal dentry hashing patterns, and how the hash salting patch broke overlayfs. Things would quietly build and look sane, but not work. Suggested-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28Merge branch 'work.misc' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs updates from Al Viro: "Assorted cleanups and fixes. Probably the most interesting part long-term is ->d_init() - that will have a bunch of followups in (at least) ceph and lustre, but we'll need to sort the barrier-related rules before it can get used for really non-trivial stuff. Another fun thing is the merge of ->d_iput() callers (dentry_iput() and dentry_unlink_inode()) and a bunch of ->d_compare() ones (all except the one in __d_lookup_lru())" * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (26 commits) fs/dcache.c: avoid soft-lockup in dput() vfs: new d_init method vfs: Update lookup_dcache() comment bdev: get rid of ->bd_inodes Remove last traces of ->sync_page new helper: d_same_name() dentry_cmp(): use lockless_dereference() instead of smp_read_barrier_depends() vfs: clean up documentation vfs: document ->d_real() vfs: merge .d_select_inode() into .d_real() unify dentry_iput() and dentry_unlink_inode() binfmt_misc: ->s_root is not going anywhere drop redundant ->owner initializations ufs: get rid of redundant checks orangefs: constify inode_operations missed comment updates from ->direct_IO() prototype change file_inode(f)->i_mapping is f->f_mapping trim fsnotify hooks a bit 9p: new helper - v9fs_parent_fid() debugfs: ->d_parent is never NULL or negative ...
2016-07-24vfs: Update lookup_dcache() commentOleg Drokin
commit 6c51e513a3aa ("lookup_dcache(): lift d_alloc() into callers") removed the need_lookup argument from lookup_dcache(), but the comment was forgotten. Also it no longer allocates a new dentry if nothing was found. Signed-off-by: Oleg Drokin <green@linuxhacker.ru> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-07-23fs: Call d_automount with the filesystems credsEric W. Biederman
Seth Forshee reported a mount regression in nfs autmounts with "fs: Add user namespace member to struct super_block". It turns out that the assumption that current->cred is something reasonable during mount while necessary to improve support of unprivileged mounts is wrong in the automount path. To fix the existing filesystems override current->cred with the init_cred before calling d_automount and restore current->cred after d_automount completes. To support unprivileged mounts would require a more nuanced cred selection, so fail on unprivileged mounts for the time being. As none of the filesystems that currently set FS_USERNS_MOUNT implement d_automount this check is only good for preventing future problems. Fixes: 6e4eab577a0c ("fs: Add user namespace member to struct super_block") Tested-by: Seth Forshee <seth.forshee@canonical.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2016-07-05vfs: Don't create inodes with a uid or gid unknown to the vfsEric W. Biederman
It is expected that filesystems can not represent uids and gids from outside of their user namespace. Keep things simple by not even trying to create filesystem nodes with non-sense uids and gids. Acked-by: Seth Forshee <seth.forshee@canonical.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2016-07-05vfs: Don't modify inodes with a uid or gid unknown to the vfsEric W. Biederman
When a filesystem outside of init_user_ns is mounted it could have uids and gids stored in it that do not map to init_user_ns. The plan is to allow those filesystems to set i_uid to INVALID_UID and i_gid to INVALID_GID for unmapped uids and gids and then to handle that strange case in the vfs to ensure there is consistent robust handling of the weirdness. Upon a careful review of the vfs and filesystems about the only case where there is any possibility of confusion or trouble is when the inode is written back to disk. In that case filesystems typically read the inode->i_uid and inode->i_gid and write them to disk even when just an inode timestamp is being updated. Which leads to a rule that is very simple to implement and understand inodes whose i_uid or i_gid is not valid may not be written. In dealing with access times this means treat those inodes as if the inode flag S_NOATIME was set. Reads of the inodes appear safe and useful, but any write or modification is disallowed. The only inode write that is allowed is a chown that sets the uid and gid on the inode to valid values. After such a chown the inode is normal and may be treated as such. Denying all writes to inodes with uids or gids unknown to the vfs also prevents several oddball cases where corruption would have occurred because the vfs does not have complete information. One problem case that is prevented is attempting to use the gid of a directory for new inodes where the directories sgid bit is set but the directories gid is not mapped. Another problem case avoided is attempting to update the evm hash after setxattr, removexattr, and setattr. As the evm hash includeds the inode->i_uid or inode->i_gid not knowning the uid or gid prevents a correct evm hash from being computed. evm hash verification also fails when i_uid or i_gid is unknown but that is essentially harmless as it does not cause filesystem corruption. Acked-by: Seth Forshee <seth.forshee@canonical.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2016-06-30fs: Check for invalid i_uid in may_follow_link()Seth Forshee
Filesystem uids which don't map into a user namespace may result in inode->i_uid being INVALID_UID. A symlink and its parent could have different owners in the filesystem can both get mapped to INVALID_UID, which may result in following a symlink when this would not have otherwise been permitted when protected symlinks are enabled. Signed-off-by: Seth Forshee <seth.forshee@canonical.com> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2016-06-30vfs: merge .d_select_inode() into .d_real()Miklos Szeredi
The two methods essentially do the same: find the real dentry/inode belonging to an overlay dentry. The difference is in the usage: vfs_open() uses ->d_select_inode() and expects the function to perform copy-up if necessary based on the open flags argument. file_dentry() uses ->d_real() passing in the overlay dentry as well as the underlying inode. vfs_rename() uses ->d_select_inode() but passes zero flags. ->d_real() with a zero inode would have worked just as well here. This patch merges the functionality of ->d_select_inode() into ->d_real() by adding an 'open_flags' argument to the latter. [Al Viro] Make the signature of d_real() match that of ->d_real() again. And constify the inode argument, while we are at it. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-06-23vfs: Generalize filesystem nodev handling.Eric W. Biederman
Introduce a function may_open_dev that tests MNT_NODEV and a new superblock flab SB_I_NODEV. Use this new function in all of the places where MNT_NODEV was previously tested. Add the new SB_I_NODEV s_iflag to proc, sysfs, and mqueuefs as those filesystems should never support device nodes, and a simple superblock flags makes that very hard to get wrong. With SB_I_NODEV set if any device nodes somehow manage to show up on on a filesystem those device nodes will be unopenable. Acked-by: Seth Forshee <seth.forshee@canonical.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2016-06-10vfs: make the string hashes salt the hashLinus Torvalds
We always mixed in the parent pointer into the dentry name hash, but we did it late at lookup time. It turns out that we can simplify that lookup-time action by salting the hash with the parent pointer early instead of late. A few other users of our string hashes also wanted to mix in their own pointers into the hash, and those are updated to use the same mechanism. Hash users that don't have any particular initial salt can just use the NULL pointer as a no-salt. Cc: Vegard Nossum <vegard.nossum@oracle.com> Cc: George Spelvin <linux@sciencehorizons.net> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-06-07Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs fixes from Al Viro: "Fixes for crap of assorted ages: EOPENSTALE one is 4.2+, autofs one is 4.6, d_walk - 3.2+. The atomic_open() and coredump ones are regressions from this window" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: coredump: fix dumping through pipes fix a regression in atomic_open() fix d_walk()/non-delayed __d_free() race autofs braino fix for do_last() fix EOPENSTALE bug in do_last()
2016-06-07fix a regression in atomic_open()Al Viro
open("/foo/no_such_file", O_RDONLY | O_CREAT) on should fail with EACCES when /foo is not writable; failing with ENOENT is obviously wrong. That got broken by a braino introduced when moving the creat_error logics from atomic_open() to lookup_open(). Easy to fix, fortunately. Spotted-by: "Yan, Zheng" <ukernel@gmail.com> Tested-by: "Yan, Zheng" <ukernel@gmail.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-06-05devpts: Make each mount of devpts an independent filesystem.Eric W. Biederman
The /dev/ptmx device node is changed to lookup the directory entry "pts" in the same directory as the /dev/ptmx device node was opened in. If there is a "pts" entry and that entry is a devpts filesystem /dev/ptmx uses that filesystem. Otherwise the open of /dev/ptmx fails. The DEVPTS_MULTIPLE_INSTANCES configuration option is removed, so that userspace can now safely depend on each mount of devpts creating a new instance of the filesystem. Each mount of devpts is now a separate and equal filesystem. Reserved ttys are now available to all instances of devpts where the mounter is in the initial mount namespace. A new vfs helper path_pts is introduced that finds a directory entry named "pts" in the directory of the passed in path, and changes the passed in path to point to it. The helper path_pts uses a function path_parent_directory that was factored out of follow_dotdot. In the implementation of devpts: - devpts_mnt is killed as it is no longer meaningful if all mounts of devpts are equal. - pts_sb_from_inode is replaced by just inode->i_sb as all cached inodes in the tty layer are now from the devpts filesystem. - devpts_add_ref is rolled into the new function devpts_ptmx. And the unnecessary inode hold is removed. - devpts_del_ref is renamed devpts_release and reduced to just a deacrivate_super. - The newinstance mount option continues to be accepted but is now ignored. In devpts_fs.h definitions for when !CONFIG_UNIX98_PTYS are removed as they are never used. Documentation/filesystems/devices.txt is updated to describe the current situation. This has been verified to work properly on openwrt-15.05, centos5, centos6, centos7, debian-6.0.2, debian-7.9, debian-8.2, ubuntu-14.04.3, ubuntu-15.10, fedora23, magia-5, mint-17.3, opensuse-42.1, slackware-14.1, gentoo-20151225 (13.0?), archlinux-2015-12-01. With the caveat that on centos6 and on slackware-14.1 that there wind up being two instances of the devpts filesystem mounted on /dev/pts, the lower copy does not end up getting used. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Greg KH <greg@kroah.com> Cc: Peter Hurley <peter@hurleysoftware.com> Cc: Peter Anvin <hpa@zytor.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Serge Hallyn <serge.hallyn@ubuntu.com> Cc: Willy Tarreau <w@1wt.eu> Cc: Aurelien Jarno <aurelien@aurel32.net> Cc: One Thousand Gnomes <gnomes@lxorguk.ukuu.org.uk> Cc: Jann Horn <jann@thejh.net> Cc: Jiri Slaby <jslaby@suse.com> Cc: Florian Weimer <fw@deneb.enyo.de> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-06-05autofs braino fix for do_last()Al Viro
It's an analogue of commit 7500c38a (fix the braino in "namei: massage lookup_slow() to be usable by lookup_one_len_unlocked()"). The same problem (->lookup()-returned unhashed negative dentry just might be an autofs one with ->d_manage() that would wait until the daemon makes it positive) applies in do_last() - we need to do follow_managed() first. Fortunately, remaining callers of follow_managed() are OK - only autofs has that weirdness (negative dentry that does not mean an instant -ENOENT)) and autofs never has its negative dentries hashed, so we can't pick one from a dcache lookup. ->d_manage() is a bloody mess ;-/ Cc: stable@vger.kernel.org # v4.6 Spotted-by: Ian Kent <raven@themaw.net> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-06-04fix EOPENSTALE bug in do_last()Al Viro
EOPENSTALE occuring at the last component of a trailing symlink ends up with do_last() retrying its lookup. After the symlink body has been discarded. The thing is, all this retry_lookup logics in there is not needed at all - the upper layers will do the right thing if we simply return that -EOPENSTALE as we would with any other error. Trying to microoptimize in do_last() is a lot of headache for no good reason. Cc: stable@vger.kernel.org # v4.2+ Tested-by: Oleg Drokin <green@linuxhacker.ru> Reviewed-and-Tested-by: Jeff Layton <jlayton@poochiereds.net> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-29hash_string: Fix zero-length case for !DCACHE_WORD_ACCESSGeorge Spelvin
The self-test was updated to cover zero-length strings; the function needs to be updated, too. Reported-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: George Spelvin <linux@sciencehorizons.net> Fixes: fcfd2fbf22d2 ("fs/namei.c: Add hashlen_string() function") Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-28Rename other copy of hash_string to hashlen_stringGeorge Spelvin
The original name was simply hash_string(), but that conflicted with a function with that name in drivers/base/power/trace.c, and I decided that calling it "hashlen_" was better anyway. But you have to do it in two places. [ This caused build errors for architectures that don't define CONFIG_DCACHE_WORD_ACCESS - Linus ] Signed-off-by: George Spelvin <linux@sciencehorizons.net> Reported-by: Guenter Roeck <linux@roeck-us.net> Fixes: fcfd2fbf22d2 ("fs/namei.c: Add hashlen_string() function") Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-28Merge branch 'hash' of git://ftp.sciencehorizons.net/linuxLinus Torvalds
Pull string hash improvements from George Spelvin: "This series does several related things: - Makes the dcache hash (fs/namei.c) useful for general kernel use. (Thanks to Bruce for noticing the zero-length corner case) - Converts the string hashes in <linux/sunrpc/svcauth.h> to use the above. - Avoids 64-bit multiplies in hash_64() on 32-bit platforms. Two 32-bit multiplies will do well enough. - Rids the world of the bad hash multipliers in hash_32. This finishes the job started in commit 689de1d6ca95 ("Minimal fix-up of bad hashing behavior of hash_64()") The vast majority of Linux architectures have hardware support for 32x32-bit multiply and so derive no benefit from "simplified" multipliers. The few processors that do not (68000, h8/300 and some models of Microblaze) have arch-specific implementations added. Those patches are last in the series. - Overhauls the dcache hash mixing. The patch in commit 0fed3ac866ea ("namei: Improve hash mixing if CONFIG_DCACHE_WORD_ACCESS") was an off-the-cuff suggestion. Replaced with a much more careful design that's simultaneously faster and better. (My own invention, as there was noting suitable in the literature I could find. Comments welcome!) - Modify the hash_name() loop to skip the initial HASH_MIX(). This would let us salt the hash if we ever wanted to. - Sort out partial_name_hash(). The hash function is declared as using a long state, even though it's truncated to 32 bits at the end and the extra internal state contributes nothing to the result. And some callers do odd things: - fs/hfs/string.c only allocates 32 bits of state - fs/hfsplus/unicode.c uses it to hash 16-bit unicode symbols not bytes - Modify bytemask_from_count to handle inputs of 1..sizeof(long) rather than 0..sizeof(long)-1. This would simplify users other than full_name_hash" Special thanks to Bruce Fields for testing and finding bugs in v1. (I learned some humbling lessons about "obviously correct" code.) On the arch-specific front, the m68k assembly has been tested in a standalone test harness, I've been in contact with the Microblaze maintainers who mostly don't care, as the hardware multiplier is never omitted in real-world applications, and I haven't heard anything from the H8/300 world" * 'hash' of git://ftp.sciencehorizons.net/linux: h8300: Add <asm/hash.h> microblaze: Add <asm/hash.h> m68k: Add <asm/hash.h> <linux/hash.h>: Add support for architecture-specific functions fs/namei.c: Improve dcache hash function Eliminate bad hash multipliers from hash_32() and hash_64() Change hash_64() return value to 32 bits <linux/sunrpc/svcauth.h>: Define hash_str() in terms of hashlen_string() fs/namei.c: Add hashlen_string() function Pull out string hash to <linux/stringhash.h>
2016-05-28<linux/hash.h>: Add support for architecture-specific functionsGeorge Spelvin
This is just the infrastructure; there are no users yet. This is modelled on CONFIG_ARCH_RANDOM; a CONFIG_ symbol declares the existence of <asm/hash.h>. That file may define its own versions of various functions, and define HAVE_* symbols (no CONFIG_ prefix!) to suppress the generic ones. Included is a self-test (in lib/test_hash.c) that verifies the basics. It is NOT in general required that the arch-specific functions compute the same thing as the generic, but if a HAVE_* symbol is defined with the value 1, then equality is tested. Signed-off-by: George Spelvin <linux@sciencehorizons.net> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Greg Ungerer <gerg@linux-m68k.org> Cc: Andreas Schwab <schwab@linux-m68k.org> Cc: Philippe De Muyter <phdm@macq.eu> Cc: linux-m68k@lists.linux-m68k.org Cc: Alistair Francis <alistai@xilinx.com> Cc: Michal Simek <michal.simek@xilinx.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: uclinux-h8-devel@lists.sourceforge.jp
2016-05-28fs/namei.c: Improve dcache hash functionGeorge Spelvin
Patch 0fed3ac866 improved the hash mixing, but the function is slower than necessary; there's a 7-instruction dependency chain (10 on x86) each loop iteration. Word-at-a-time access is a very tight loop (which is good, because link_path_walk() is one of the hottest code paths in the entire kernel), and the hash mixing function must not have a longer latency to avoid slowing it down. There do not appear to be any published fast hash functions that: 1) Operate on the input a word at a time, and 2) Don't need to know the length of the input beforehand, and 3) Have a single iterated mixing function, not needing conditional branches or unrolling to distinguish different loop iterations. One of the algorithms which comes closest is Yann Collet's xxHash, but that's two dependent multiplies per word, which is too much. The key insights in this design are: 1) Barring expensive ops like multiplies, to diffuse one input bit across 64 bits of hash state takes at least log2(64) = 6 sequentially dependent instructions. That is more cycles than we'd like. 2) An operation like "hash ^= hash << 13" requires a second temporary register anyway, and on a 2-operand machine like x86, it's three instructions. 3) A better use of a second register is to hold a two-word hash state. With careful design, no temporaries are needed at all, so it doesn't increase register pressure. And this gets rid of register copying on 2-operand machines, so the code is smaller and faster. 4) Using two words of state weakens the requirement for one-round mixing; we now have two rounds of mixing before cancellation is possible. 5) A two-word hash state also allows operations on both halves to be done in parallel, so on a superscalar processor we get more mixing in fewer cycles. I ended up using a mixing function inspired by the ChaCha and Speck round functions. It is 6 simple instructions and 3 cycles per iteration (assuming multiply by 9 can be done by an "lea" instruction): x ^= *input++; y ^= x; x = ROL(x, K1); x += y; y = ROL(y, K2); y *= 9; Not only is this reversible, two consecutive rounds are reversible: if you are given the initial and final states, but not the intermediate state, it is possible to compute both input words. This means that at least 3 words of input are required to create a collision. (It also has the property, used by hash_name() to avoid a branch, that it hashes all-zero to all-zero.) The rotate constants K1 and K2 were found by experiment. The search took a sample of random initial states (I used 1023) and considered the effect of flipping each of the 64 input bits on each of the 128 output bits two rounds later. Each of the 8192 pairs can be considered a biased coin, and adding up the Shannon entropy of all of them produces a score. The best-scoring shifts also did well in other tests (flipping bits in y, trying 3 or 4 rounds of mixing, flipping all 64*63/2 pairs of input bits), so the choice was made with the additional constraint that the sum of the shifts is odd and not too close to the word size. The final state is then folded into a 32-bit hash value by a less carefully optimized multiply-based scheme. This also has to be fast, as pathname components tend to be short (the most common case is one iteration!), but there's some room for latency, as there is a fair bit of intervening logic before the hash value is used for anything. (Performance verified with "bonnie++ -s 0 -n 1536:-2" on tmpfs. I need a better benchmark; the numbers seem to show a slight dip in performance between 4.6.0 and this patch, but they're too noisy to quote.) Special thanks to Bruce fields for diligent testing which uncovered a nasty fencepost error in an earlier version of this patch. [checkpatch.pl formatting complaints noted and respectfully disagreed with.] Signed-off-by: George Spelvin <linux@sciencehorizons.net> Tested-by: J. Bruce Fields <bfields@redhat.com>
2016-05-28fs/namei.c: Add hashlen_string() functionGeorge Spelvin
We'd like to make more use of the highly-optimized dcache hash functions throughout the kernel, rather than have every subsystem create its own, and a function that hashes basic null-terminated strings is required for that. (The name is to emphasize that it returns both hash and length.) It's actually useful in the dcache itself, specifically d_alloc_name(). Other uses in the next patch. full_name_hash() is also tweaked to make it more generally useful: 1) Take a "char *" rather than "unsigned char *" argument, to be consistent with hash_name(). 2) Handle zero-length inputs. If we want more callers, we don't want to make them worry about corner cases. Signed-off-by: George Spelvin <linux@sciencehorizons.net>
2016-05-26Merge tag 'xfs-for-linus-4.7-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs Pull xfs updates from Dave Chinner: "A pretty average collection of fixes, cleanups and improvements in this request. Summary: - fixes for mount line parsing, sparse warnings, read-only compat feature remount behaviour - allow fast path symlink lookups for inline symlinks. - attribute listing cleanups - writeback goes direct to bios rather than indirecting through bufferheads - transaction allocation cleanup - optimised kmem_realloc - added configurable error handling for metadata write errors, changed default error handling behaviour from "retry forever" to "retry until unmount then fail" - fixed several inode cluster writeback lookup vs reclaim race conditions - fixed inode cluster writeback checking wrong inode after lookup - fixed bugs where struct xfs_inode freeing wasn't actually RCU safe - cleaned up inode reclaim tagging" * tag 'xfs-for-linus-4.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (39 commits) xfs: fix warning in xfs_finish_page_writeback for non-debug builds xfs: move reclaim tagging functions xfs: simplify inode reclaim tagging interfaces xfs: rename variables in xfs_iflush_cluster for clarity xfs: xfs_iflush_cluster has range issues xfs: mark reclaimed inodes invalid earlier xfs: xfs_inode_free() isn't RCU safe xfs: optimise xfs_iext_destroy xfs: skip stale inodes in xfs_iflush_cluster xfs: fix inode validity check in xfs_iflush_cluster xfs: xfs_iflush_cluster fails to abort on error xfs: remove xfs_fs_evict_inode() xfs: add "fail at unmount" error handling configuration xfs: add configuration handlers for specific errors xfs: add configuration of error failure speed xfs: introduce table-based init for error behaviors xfs: add configurable error support to metadata buffers xfs: introduce metadata IO error class xfs: configurable error behavior via sysfs xfs: buffer ->bi_end_io function requires irq-safe lock ...
2016-05-19Merge branch 'next' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security Pull security subsystem updates from James Morris: "Highlights: - A new LSM, "LoadPin", from Kees Cook is added, which allows forcing of modules and firmware to be loaded from a specific device (this is from ChromeOS, where the device as a whole is verified cryptographically via dm-verity). This is disabled by default but can be configured to be enabled by default (don't do this if you don't know what you're doing). - Keys: allow authentication data to be stored in an asymmetric key. Lots of general fixes and updates. - SELinux: add restrictions for loading of kernel modules via finit_module(). Distinguish non-init user namespace capability checks. Apply execstack check on thread stacks" * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (48 commits) LSM: LoadPin: provide enablement CONFIG Yama: use atomic allocations when reporting seccomp: Fix comment typo ima: add support for creating files using the mknodat syscall ima: fix ima_inode_post_setattr vfs: forbid write access when reading a file into memory fs: fix over-zealous use of "const" selinux: apply execstack check on thread stacks selinux: distinguish non-init user namespace capability checks LSM: LoadPin for kernel file loading restrictions fs: define a string representation of the kernel_read_file_id enumeration Yama: consolidate error reporting string_helpers: add kstrdup_quotable_file string_helpers: add kstrdup_quotable_cmdline string_helpers: add kstrdup_quotable selinux: check ss_initialized before revalidating an inode label selinux: delay inode label lookup as long as possible selinux: don't revalidate an inode's label when explicitly setting it selinux: Change bool variable name to index. KEYS: Add KEYCTL_DH_COMPUTE command ...
2016-05-17Merge branch 'work.const-path' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull 'struct path' constification update from Al Viro: "'struct path' is passed by reference to a bunch of Linux security methods; in theory, there's nothing to stop them from modifying the damn thing and LSM community being what it is, sooner or later some enterprising soul is going to decide that it's a good idea. Let's remove the temptation and constify all of those..." * 'work.const-path' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: constify ima_d_path() constify security_sb_pivotroot() constify security_path_chroot() constify security_path_{link,rename} apparmor: remove useless checks for NULL ->mnt constify security_path_{mkdir,mknod,symlink} constify security_path_{unlink,rmdir} apparmor: constify common_perm_...() apparmor: constify aa_path_link() apparmor: new helper - common_path_perm() constify chmod_common/security_path_chmod constify security_sb_mount() constify chown_common/security_path_chown tomoyo: constify assorted struct path * apparmor_path_truncate(): path->mnt is never NULL constify vfs_truncate() constify security_path_truncate() [apparmor] constify struct path * in a bunch of helpers