summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2015-09-04ocfs2: take inode lock in ocfs2_iop_set/get_acl()Tariq Saeed
This bug in mainline code is pointed out by Mark Fasheh. When ocfs2_iop_set_acl() and ocfs2_iop_get_acl() are entered from VFS layer, inode lock is not held. This seems to be regression from older kernels. The patch is to fix that. Orabug: 20189959 Signed-off-by: Tariq Saeed <tariq.x.saeed@oracle.com> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04ocfs2: fix BUG_ON() in ocfs2_ci_checkpointed()Tariq Saeed
PID: 614 TASK: ffff882a739da580 CPU: 3 COMMAND: "ocfs2dc" #0 [ffff882ecc3759b0] machine_kexec at ffffffff8103b35d #1 [ffff882ecc375a20] crash_kexec at ffffffff810b95b5 #2 [ffff882ecc375af0] oops_end at ffffffff815091d8 #3 [ffff882ecc375b20] die at ffffffff8101868b #4 [ffff882ecc375b50] do_trap at ffffffff81508bb0 #5 [ffff882ecc375ba0] do_invalid_op at ffffffff810165e5 #6 [ffff882ecc375c40] invalid_op at ffffffff815116fb [exception RIP: ocfs2_ci_checkpointed+208] RIP: ffffffffa0a7e940 RSP: ffff882ecc375cf0 RFLAGS: 00010002 RAX: 0000000000000001 RBX: 000000000000654b RCX: ffff8812dc83f1f8 RDX: 00000000000017d9 RSI: ffff8812dc83f1f8 RDI: ffffffffa0b2c318 RBP: ffff882ecc375d20 R8: ffff882ef6ecfa60 R9: ffff88301f272200 R10: 0000000000000000 R11: 0000000000000000 R12: ffffffffffffffff R13: ffff8812dc83f4f0 R14: 0000000000000000 R15: ffff8812dc83f1f8 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 #7 [ffff882ecc375d28] ocfs2_check_meta_downconvert at ffffffffa0a7edbd [ocfs2] #8 [ffff882ecc375d38] ocfs2_unblock_lock at ffffffffa0a84af8 [ocfs2] #9 [ffff882ecc375dc8] ocfs2_process_blocked_lock at ffffffffa0a85285 [ocfs2] #10 [ffff882ecc375e18] ocfs2_downconvert_thread_do_work at ffffffffa0a85445 [ocfs2] #11 [ffff882ecc375e68] ocfs2_downconvert_thread at ffffffffa0a854de [ocfs2] #12 [ffff882ecc375ee8] kthread at ffffffff81090da7 #13 [ffff882ecc375f48] kernel_thread_helper at ffffffff81511884 assert is tripped because the tran is not checkpointed and the lock level is PR. Some time ago, chmod command had been executed. As result, the following call chain left the inode cluster lock in PR state, latter on causing the assert. system_call_fastpath -> my_chmod -> sys_chmod -> sys_fchmodat -> notify_change -> ocfs2_setattr -> posix_acl_chmod -> ocfs2_iop_set_acl -> ocfs2_set_acl -> ocfs2_acl_set_mode Here is how. 1119 int ocfs2_setattr(struct dentry *dentry, struct iattr *attr) 1120 { 1247 ocfs2_inode_unlock(inode, 1); <<< WRONG thing to do. .. 1258 if (!status && attr->ia_valid & ATTR_MODE) { 1259 status = posix_acl_chmod(inode, inode->i_mode); 519 posix_acl_chmod(struct inode *inode, umode_t mode) 520 { .. 539 ret = inode->i_op->set_acl(inode, acl, ACL_TYPE_ACCESS); 287 int ocfs2_iop_set_acl(struct inode *inode, struct posix_acl *acl, ... 288 { 289 return ocfs2_set_acl(NULL, inode, NULL, type, acl, NULL, NULL); 224 int ocfs2_set_acl(handle_t *handle, 225 struct inode *inode, ... 231 { .. 252 ret = ocfs2_acl_set_mode(inode, di_bh, 253 handle, mode); 168 static int ocfs2_acl_set_mode(struct inode *inode, struct buffer_head ... 170 { 183 if (handle == NULL) { >>> BUG: inode lock not held in ex at this point <<< 184 handle = ocfs2_start_trans(OCFS2_SB(inode->i_sb), 185 OCFS2_INODE_UPDATE_CREDITS); ocfs2_setattr.#1247 we unlock and at #1259 call posix_acl_chmod. When we reach ocfs2_acl_set_mode.#181 and do trans, the inode cluster lock is not held in EX mode (it should be). How this could have happended? We are the lock master, were holding lock EX and have released it in ocfs2_setattr.#1247. Note that there are no holders of this lock at this point. Another node needs the lock in PR, and we downconvert from EX to PR. So the inode lock is PR when do the trans in ocfs2_acl_set_mode.#184. The trans stays in core (not flushed to disc). Now another node want the lock in EX, downconvert thread gets kicked (the one that tripped assert abovt), finds an unflushed trans but the lock is not EX (it is PR). If the lock was at EX, it would have flushed the trans ocfs2_ci_checkpointed -> ocfs2_start_checkpoint before downconverting (to NULL) for the request. ocfs2_setattr must not drop inode lock ex in this code path. If it does, takes it again before the trans, say in ocfs2_set_acl, another cluster node can get in between, execute another setattr, overwriting the one in progress on this node, resulting in a mode acl size combo that is a mix of the two. Orabug: 20189959 Signed-off-by: Tariq Saeed <tariq.x.saeed@oracle.com> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Joseph Qi <joseph.qi@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04ocfs2: optimize error handling in dlm_request_joinNorton.Zhu
Currently error handling in dlm_request_join is a little obscure, so optimize it to promote readability. If packet.code is invalid, reset it to JOIN_DISALLOW to keep it meaningful. It only influences the log printing. Signed-off-by: Norton.Zhu <norton.zhu@huawei.com> Cc: Srinivas Eeda <srinivas.eeda@oracle.com> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04ocfs2: fix a tiny case that inode can not removedYiwen Jiang
When running dirop_fileop_racer we found a case that inode can not removed. Two nodes, say Node A and Node B, mount the same ocfs2 volume. Create two dirs /race/1/ and /race/2/ in the filesystem. Node A Node B rm -r /race/2/ mv /race/1/ /race/2/ call ocfs2_unlink(), get the EX mode of /race/2/ wait for B unlock /race/2/ decrease i_nlink of /race/2/ to 0, and add inode of /race/2/ into orphan dir, unlock /race/2/ got EX mode of /race/2/. because /race/1/ is dir, so inc i_nlink of /race/2/ and update into disk, unlock /race/2/ because i_nlink of /race/2/ is not zero, this inode will always remain in orphan dir This patch fixes this case by test whether i_nlink of new dir is zero. Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Joseph Qi <joseph.qi@huawei.com> Cc: Xue jiufei <xuejiufei@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04ocfs2: add ip_alloc_sem in direct IO to protect allocation changesWeiWei Wang
In ocfs2, ip_alloc_sem is used to protect allocation changes on the node. In direct IO, we add ip_alloc_sem to protect date consistent between direct-io and ocfs2_truncate_file race (buffer io use ip_alloc_sem already). Although inode->i_mutex lock is used to avoid concurrency of above situation, i think ip_alloc_sem is still needed because protect allocation changes is significant. Other filesystem like ext4 also uses rw_semaphore to protect data consistent between get_block-vs-truncate race by other means, So ip_alloc_sem in ocfs2 direct io is needed. Signed-off-by: Weiwei Wang <wangww631@huawei.com> Signed-off-by: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04ocfs2: clear the rest of the buffers on errorGoldwyn Rodrigues
In case a validation fails, clear the rest of the buffers and return the error to the calling function. This also facilitates bubbling up the error originating from ocfs2_error to calling functions. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04ocfs2: acknowledge return value of ocfs2_error()Goldwyn Rodrigues
Caveat: This may return -EROFS for a read case, which seems wrong. This is happening even without this patch series though. Should we convert EROFS to EIO? Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04ocfs2: add errors=continueGoldwyn Rodrigues
OCFS2 is often used in high-availaibility systems. However, ocfs2 converts the filesystem to read-only at the drop of the hat. This may not be necessary, since turning the filesystem read-only would affect other running processes as well, decreasing availability. This attempt is to add errors=continue, which would return the EIO to the calling process and terminate furhter processing so that the filesystem is not corrupted further. However, the filesystem is not converted to read-only. As a future plan, I intend to create a small utility or extend fsck.ocfs2 to fix small errors such as in the inode. The input to the utility such as the inode can come from the kernel logs so we don't have to schedule a downtime for fixing small-enough errors. The patch changes the ocfs2_error to return an error. The error returned depends on the mount option set. If none is set, the default is to turn the filesystem read-only. Perhaps errors=continue is not the best option name. Historically it is used for making an attempt to progress in the current process itself. Should we call it errors=eio? or errors=killproc? Suggestions/Comments welcome. Sources are available at: https://github.com/goldwynr/linux/tree/error-cont Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04ocfs2: flush inode data to disk and free inode when i_count becomes zeroXue jiufei
Disk inode deletion may be heavily delayed when one node unlink a file after the same dentry is freed on another node(say N1) because of memory shrink but inode is left in memory. This inode can only be freed while N1 doing the orphan scan work. However, N1 may skip orphan scan for several times because other nodes may do the work earlier. In our tests, it may take 1 hour on 4 nodes cluster and it hurts the user experience. So we think the inode should be freed after the data flushed to disk when i_count becomes zero to avoid such circumstances. Signed-off-by: Joyce.xue <xuejiufei@huawei.com> Cc: Joel Becker <jlbec@evilplan.org> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04ocfs2: trusted xattr missing CAP_SYS_ADMIN checkSanidhya Kashyap
The trusted extended attributes are only visible to the process which hvae CAP_SYS_ADMIN capability but the check is missing in ocfs2 xattr_handler trusted list. The check is important because this will be used for implementing mechanisms in the userspace for which other ordinary processes should not have access to. Signed-off-by: Sanidhya Kashyap <sanidhya.gatech@gmail.com> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Taesoo kim <taesoo@gatech.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04ocfs2: set filesytem read-only when ocfs2_delete_entry failed.jiangyiwen
In ocfs2_rename, it will lead to an inode with two entried(old and new) if ocfs2_delete_entry(old) failed. Thus, filesystem will be inconsistent. The case is described below: ocfs2_rename -> ocfs2_start_trans -> ocfs2_add_entry(new) -> ocfs2_delete_entry(old) -> __ocfs2_journal_access *failed* because of -ENOMEM -> ocfs2_commit_trans So filesystem should be set to read-only at the moment. Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com> Cc: Joseph Qi <joseph.qi@huawei.com> Cc: Joel Becker <jlbec@evilplan.org> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04ocfs2/dlm: use list_for_each_entry instead of list_for_eachJoseph Qi
Use list_for_each_entry instead of list_for_each to simplify code. Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04ocfs2: remove unneeded code in dlm_register_domain_handlersJoseph Qi
The last goto statement is unneeded, so remove it. Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04ocfs2: fix BUG when o2hb_register_callback failsJoseph Qi
In dlm_register_domain_handlers, if o2hb_register_callback fails, it will call dlm_unregister_domain_handlers to unregister. This will trigger the BUG_ON in o2hb_unregister_callback because hc_magic is 0. So we should call o2hb_setup_callback to initialize hc first. Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04ocfs2: remove unneeded code in ocfs2_dlm_initJoseph Qi
status is already initialized and it will only be 0 or negatives in the code flow. So remove the unneeded assignment after the lable 'local'. Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04ocfs2: adjust code to match locking/unlocking orderJoseph Qi
Unlocking order in ocfs2_unlink and ocfs2_rename mismatches the corresponding locking order, although it won't cause issues, adjust the code so that it looks more reasonable. Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04ocfs2: clean up unused local variables in ocfs2_file_write_iterJoseph Qi
Since commit 86b9c6f3f891 ("ocfs2: remove filesize checks for sync I/O journal commit") removes filesize checks for sync I/O journal commit, variables old_size and old_clusters are not actually used any more. So clean them up. Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04ocfs2: do not log twice error messagesChristophe JAILLET
'o2hb_map_slot_data' and 'o2hb_populate_slot_data' are called from only one place, in 'o2hb_region_dev_write'. Return value is checked and 'mlog_errno' is called to log a message if it is not 0. So there is no need to call 'mlog_errno' directly within these functions. This would result on logging the message twice. Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04ocfs2: do not BUG if buffer not uptodate in __ocfs2_journal_accessJoseph Qi
When storage network is unstable, it may trigger the BUG in __ocfs2_journal_access because of buffer not uptodate. We can retry the write in this case or return error instead of BUG. Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Reported-by: Zhangguanghui <zhang.guanghui@h3c.com> Tested-by: Zhangguanghui <zhang.guanghui@h3c.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04ocfs2: fix several issues of append dioJoseph Qi
1) Take rw EX lock in case of append dio. 2) Explicitly treat the error code -EIOCBQUEUED as normal. 3) Set di_bh to NULL after brelse if it may be used again later. Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Cc: Yiwen Jiang <jiangyiwen@huawei.com> Cc: Weiwei Wang <wangww631@huawei.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04ocfs2: fix race between dio and recover orphanJoseph Qi
During direct io the inode will be added to orphan first and then deleted from orphan. There is a race window that the orphan entry will be deleted twice and thus trigger the BUG when validating OCFS2_DIO_ORPHANED_FL in ocfs2_del_inode_from_orphan. ocfs2_direct_IO_write ... ocfs2_add_inode_to_orphan >>>>>>>> race window. 1) another node may rm the file and then down, this node take care of orphan recovery and clear flag OCFS2_DIO_ORPHANED_FL. 2) since rw lock is unlocked, it may race with another orphan recovery and append dio. ocfs2_del_inode_from_orphan So take inode mutex lock when recovering orphans and make rw unlock at the end of aio write in case of append dio. Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Reported-by: Yiwen Jiang <jiangyiwen@huawei.com> Cc: Weiwei Wang <wangww631@huawei.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04ntfs: delete unnecessary checks before calling iput()SF Markus Elfring
iput() tests whether its argument is NULL and then returns immediately. Thus the test around the call is not needed. This issue was detected by using the Coccinelle software. Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Cc: Julia Lawall <julia.lawall@lip6.fr> Reviewed-by: Anton Altaparmakov <anton@tuxera.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04fsnotify: get rid of fsnotify_destroy_mark_locked()Jan Kara
fsnotify_destroy_mark_locked() is subtle to use because it temporarily releases group->mark_mutex. To avoid future problems with this function, split it into two. fsnotify_detach_mark() is the part that needs group->mark_mutex and fsnotify_free_mark() is the part that must be called outside of group->mark_mutex. This way it's much clearer what's going on and we also avoid some pointless acquisitions of group->mark_mutex. Signed-off-by: Jan Kara <jack@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04fsnotify: remove mark->free_listJan Kara
Free list is used when all marks on given inode / mount should be destroyed when inode / mount is going away. However we can free all of the marks without using a special list with some care. Signed-off-by: Jan Kara <jack@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04fsnotify: fix check in inotify fdinfo printingJan Kara
A check in inotify_fdinfo() checking whether mark is valid was always true due to a bug. Luckily we can never get to invalidated marks since we hold mark_mutex and invalidated marks get removed from the group list when they are invalidated under that mutex. Anyway fix the check to make code more future proof. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04fs/notify: optimize inotify/fsnotify code for unwatched filesDave Hansen
I have a _tiny_ microbenchmark that sits in a loop and writes single bytes to a file. Writing one byte to a tmpfs file is around 2x slower than reading one byte from a file, which is a _bit_ more than I expecte. This is a dumb benchmark, but I think it's hard to deny that write() is a hot path and we should avoid unnecessary overhead there. I did a 'perf record' of 30-second samples of read and write. The top item in a diffprofile is srcu_read_lock() from fsnotify(). There are active inotify fd's from systemd, but nothing is actually listening to the file or its part of the filesystem. I *think* we can avoid taking the srcu_read_lock() for the common case where there are no actual marks on the file. This means that there will both be nothing to notify for *and* implies that there is no need for clearing the ignore mask. This patch gave a 13.1% speedup in writes/second on my test, which is an improvement from the 10.8% that I saw with the last version. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Jan Kara <jack@suse.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Eric Paris <eparis@redhat.com> Cc: John McCutchan <john@johnmccutchan.com> Cc: Robert Love <rlove@rlove.org> Cc: Andi Kleen <ak@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04capabilities: ambient capabilitiesAndy Lutomirski
Credit where credit is due: this idea comes from Christoph Lameter with a lot of valuable input from Serge Hallyn. This patch is heavily based on Christoph's patch. ===== The status quo ===== On Linux, there are a number of capabilities defined by the kernel. To perform various privileged tasks, processes can wield capabilities that they hold. Each task has four capability masks: effective (pE), permitted (pP), inheritable (pI), and a bounding set (X). When the kernel checks for a capability, it checks pE. The other capability masks serve to modify what capabilities can be in pE. Any task can remove capabilities from pE, pP, or pI at any time. If a task has a capability in pP, it can add that capability to pE and/or pI. If a task has CAP_SETPCAP, then it can add any capability to pI, and it can remove capabilities from X. Tasks are not the only things that can have capabilities; files can also have capabilities. A file can have no capabilty information at all [1]. If a file has capability information, then it has a permitted mask (fP) and an inheritable mask (fI) as well as a single effective bit (fE) [2]. File capabilities modify the capabilities of tasks that execve(2) them. A task that successfully calls execve has its capabilities modified for the file ultimately being excecuted (i.e. the binary itself if that binary is ELF or for the interpreter if the binary is a script.) [3] In the capability evolution rules, for each mask Z, pZ represents the old value and pZ' represents the new value. The rules are: pP' = (X & fP) | (pI & fI) pI' = pI pE' = (fE ? pP' : 0) X is unchanged For setuid binaries, fP, fI, and fE are modified by a moderately complicated set of rules that emulate POSIX behavior. Similarly, if euid == 0 or ruid == 0, then fP, fI, and fE are modified differently (primary, fP and fI usually end up being the full set). For nonroot users executing binaries with neither setuid nor file caps, fI and fP are empty and fE is false. As an extra complication, if you execute a process as nonroot and fE is set, then the "secure exec" rules are in effect: AT_SECURE gets set, LD_PRELOAD doesn't work, etc. This is rather messy. We've learned that making any changes is dangerous, though: if a new kernel version allows an unprivileged program to change its security state in a way that persists cross execution of a setuid program or a program with file caps, this persistent state is surprisingly likely to allow setuid or file-capped programs to be exploited for privilege escalation. ===== The problem ===== Capability inheritance is basically useless. If you aren't root and you execute an ordinary binary, fI is zero, so your capabilities have no effect whatsoever on pP'. This means that you can't usefully execute a helper process or a shell command with elevated capabilities if you aren't root. On current kernels, you can sort of work around this by setting fI to the full set for most or all non-setuid executable files. This causes pP' = pI for nonroot, and inheritance works. No one does this because it's a PITA and it isn't even supported on most filesystems. If you try this, you'll discover that every nonroot program ends up with secure exec rules, breaking many things. This is a problem that has bitten many people who have tried to use capabilities for anything useful. ===== The proposed change ===== This patch adds a fifth capability mask called the ambient mask (pA). pA does what most people expect pI to do. pA obeys the invariant that no bit can ever be set in pA if it is not set in both pP and pI. Dropping a bit from pP or pI drops that bit from pA. This ensures that existing programs that try to drop capabilities still do so, with a complication. Because capability inheritance is so broken, setting KEEPCAPS, using setresuid to switch to nonroot uids, and then calling execve effectively drops capabilities. Therefore, setresuid from root to nonroot conditionally clears pA unless SECBIT_NO_SETUID_FIXUP is set. Processes that don't like this can re-add bits to pA afterwards. The capability evolution rules are changed: pA' = (file caps or setuid or setgid ? 0 : pA) pP' = (X & fP) | (pI & fI) | pA' pI' = pI pE' = (fE ? pP' : pA') X is unchanged If you are nonroot but you have a capability, you can add it to pA. If you do so, your children get that capability in pA, pP, and pE. For example, you can set pA = CAP_NET_BIND_SERVICE, and your children can automatically bind low-numbered ports. Hallelujah! Unprivileged users can create user namespaces, map themselves to a nonzero uid, and create both privileged (relative to their namespace) and unprivileged process trees. This is currently more or less impossible. Hallelujah! You cannot use pA to try to subvert a setuid, setgid, or file-capped program: if you execute any such program, pA gets cleared and the resulting evolution rules are unchanged by this patch. Users with nonzero pA are unlikely to unintentionally leak that capability. If they run programs that try to drop privileges, dropping privileges will still work. It's worth noting that the degree of paranoia in this patch could possibly be reduced without causing serious problems. Specifically, if we allowed pA to persist across executing non-pA-aware setuid binaries and across setresuid, then, naively, the only capabilities that could leak as a result would be the capabilities in pA, and any attacker *already* has those capabilities. This would make me nervous, though -- setuid binaries that tried to privilege-separate might fail to do so, and putting CAP_DAC_READ_SEARCH or CAP_DAC_OVERRIDE into pA could have unexpected side effects. (Whether these unexpected side effects would be exploitable is an open question.) I've therefore taken the more paranoid route. We can revisit this later. An alternative would be to require PR_SET_NO_NEW_PRIVS before setting ambient capabilities. I think that this would be annoying and would make granting otherwise unprivileged users minor ambient capabilities (CAP_NET_BIND_SERVICE or CAP_NET_RAW for example) much less useful than it is with this patch. ===== Footnotes ===== [1] Files that are missing the "security.capability" xattr or that have unrecognized values for that xattr end up with has_cap set to false. The code that does that appears to be complicated for no good reason. [2] The libcap capability mask parsers and formatters are dangerously misleading and the documentation is flat-out wrong. fE is *not* a mask; it's a single bit. This has probably confused every single person who has tried to use file capabilities. [3] Linux very confusingly processes both the script and the interpreter if applicable, for reasons that elude me. The results from thinking about a script's file capabilities and/or setuid bits are mostly discarded. Preliminary userspace code is here, but it needs updating: https://git.kernel.org/cgit/linux/kernel/git/luto/util-linux-playground.git/commit/?h=cap_ambient&id=7f5afbd175d2 Here is a test program that can be used to verify the functionality (from Christoph): /* * Test program for the ambient capabilities. This program spawns a shell * that allows running processes with a defined set of capabilities. * * (C) 2015 Christoph Lameter <cl@linux.com> * Released under: GPL v3 or later. * * * Compile using: * * gcc -o ambient_test ambient_test.o -lcap-ng * * This program must have the following capabilities to run properly: * Permissions for CAP_NET_RAW, CAP_NET_ADMIN, CAP_SYS_NICE * * A command to equip the binary with the right caps is: * * setcap cap_net_raw,cap_net_admin,cap_sys_nice+p ambient_test * * * To get a shell with additional caps that can be inherited by other processes: * * ./ambient_test /bin/bash * * * Verifying that it works: * * From the bash spawed by ambient_test run * * cat /proc/$$/status * * and have a look at the capabilities. */ #include <stdlib.h> #include <stdio.h> #include <errno.h> #include <cap-ng.h> #include <sys/prctl.h> #include <linux/capability.h> /* * Definitions from the kernel header files. These are going to be removed * when the /usr/include files have these defined. */ #define PR_CAP_AMBIENT 47 #define PR_CAP_AMBIENT_IS_SET 1 #define PR_CAP_AMBIENT_RAISE 2 #define PR_CAP_AMBIENT_LOWER 3 #define PR_CAP_AMBIENT_CLEAR_ALL 4 static void set_ambient_cap(int cap) { int rc; capng_get_caps_process(); rc = capng_update(CAPNG_ADD, CAPNG_INHERITABLE, cap); if (rc) { printf("Cannot add inheritable cap\n"); exit(2); } capng_apply(CAPNG_SELECT_CAPS); /* Note the two 0s at the end. Kernel checks for these */ if (prctl(PR_CAP_AMBIENT, PR_CAP_AMBIENT_RAISE, cap, 0, 0)) { perror("Cannot set cap"); exit(1); } } int main(int argc, char **argv) { int rc; set_ambient_cap(CAP_NET_RAW); set_ambient_cap(CAP_NET_ADMIN); set_ambient_cap(CAP_SYS_NICE); printf("Ambient_test forking shell\n"); if (execv(argv[1], argv + 1)) perror("Cannot exec"); return 0; } Signed-off-by: Christoph Lameter <cl@linux.com> # Original author Signed-off-by: Andy Lutomirski <luto@kernel.org> Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com> Acked-by: Kees Cook <keescook@chromium.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Aaron Jones <aaronmdjones@gmail.com> Cc: Ted Ts'o <tytso@mit.edu> Cc: Andrew G. Morgan <morgan@kernel.org> Cc: Mimi Zohar <zohar@linux.vnet.ibm.com> Cc: Austin S Hemmelgarn <ahferroin7@gmail.com> Cc: Markku Savela <msa@moth.iki.fi> Cc: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: James Morris <james.l.morris@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04ocfs2: direct write will call ocfs2_rw_unlock() twice when doing aio+dioRyan Ding
ocfs2_file_write_iter() is usng the wrong return value ('written'). This will cause ocfs2_rw_unlock() be called both in write_iter & end_io, triggering a BUG_ON. This issue was introduced by commit 7da839c47589 ("ocfs2: use __generic_file_write_iter()"). Orabug: 21612107 Fixes: 7da839c47589 ("ocfs2: use __generic_file_write_iter()") Signed-off-by: Ryan Ding <ryan.ding@oracle.com> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-03Merge tag 'for-f2fs-4.3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs Pull f2fs updates from Jaegeuk Kim: "The major work includes fixing and enhancing the existing extent_cache feature, which has been well settling down so far and now it becomes a default mount option accordingly. Also, this version newly registers a f2fs memory shrinker to reclaim several objects consumed by a couple of data structures in order to avoid memory pressures. Another new feature is to add ioctl(F2FS_GARBAGE_COLLECT) which triggers a cleaning job explicitly by users. Most of the other patches are to fix bugs occurred in the corner cases across the whole code area" * tag 'for-f2fs-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (85 commits) f2fs: upset segment_info repair f2fs: avoid accessing NULL pointer in f2fs_drop_largest_extent f2fs: update extent tree in batches f2fs: fix to release inode correctly f2fs: handle f2fs_truncate error correctly f2fs: avoid unneeded initializing when converting inline dentry f2fs: atomically set inode->i_flags f2fs: fix wrong pointer access during try_to_free_nids f2fs: use __GFP_NOFAIL to avoid infinite loop f2fs: lookup neighbor extent nodes for merging later f2fs: split __insert_extent_tree_ret for readability f2fs: kill dead code in __insert_extent_tree f2fs: adjust showing of extent cache stat f2fs: add largest/cached stat in extent cache f2fs: fix incorrect mapping for bmap f2fs: add annotation for space utilization of regular/inline dentry f2fs: fix to update cached_en of extent tree properly f2fs: fix typo f2fs: check the node block address of newly allocated nid f2fs: go out for insert_inode_locked failure ...
2015-09-03Merge tag 'dlm-4.3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm Pull dlm updates from David Teigland: "This set mainly includes a change to the way the dlm uses the SCTP API in the kernel, removing the direct dependency on the sctp module. Other odd SCTP-related fixes are also included. The other notable fix is for a long standing regression in the behavior of lock value blocks for user space locks" * tag 'dlm-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm: dlm: print error from kernel_sendpage dlm: fix lvb copy for user locks dlm: sctp_accept_from_sock() can be static dlm: fix reconnecting but not sending data dlm: replace BUG_ON with a less severe handling dlm: use sctp 1-to-1 API dlm: fix not reconnecting on connecting error handling dlm: fix race while closing connections dlm: fix connection stealing if using SCTP
2015-09-03Merge tag 'ext4_for_linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "Pretty much all bug fixes and clean ups for 4.3, after a lot of features and other churn going into 4.2" * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: Revert "ext4: remove block_device_ejected" ext4: ratelimit the file system mounted message ext4: silence a format string false positive ext4: simplify some code in read_mmp_block() ext4: don't manipulate recovery flag when freezing no-journal fs jbd2: limit number of reserved credits ext4 crypto: remove duplicate header file ext4: update c/mtime on truncate up jbd2: avoid infinite loop when destroying aborted journal ext4, jbd2: add REQ_FUA flag when recording an error in the superblock ext4 crypto: fix spelling typo in comment ext4 crypto: exit cleanly if ext4_derive_key_aes() fails ext4: reject journal options for ext2 mounts ext4: implement cgroup writeback support ext4: replace ext4_io_submit->io_op with ->io_wbc ext4 crypto: check for too-short encrypted file names ext4 crypto: use a jbd2 transaction when adding a crypto policy jbd2: speedup jbd2_journal_dirty_metadata()
2015-09-03Merge branch 'for_linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull ext3 removal, quota & udf fixes from Jan Kara: "The biggest change in the pull is the removal of ext3 filesystem driver (~28k lines removed). Ext4 driver is a full featured replacement these days and both RH and SUSE use it for several years without issues. Also there are some workarounds in VM & block layer mainly for ext3 which we could eventually get rid of. Other larger change is addition of proper error handling for dquot_initialize(). The rest is small fixes and cleanups" [ I wasn't convinced about the ext3 removal and worried about things falling through the cracks for legacy users, but ext4 maintainers piped up and were all unanimously in favor of removal, and maintaining all legacy ext3 support inside ext4. - Linus ] * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: udf: Don't modify filesystem for read-only mounts quota: remove an unneeded condition ext4: memory leak on error in ext4_symlink() mm/Kconfig: NEED_BOUNCE_POOL: clean-up condition ext4: Improve ext4 Kconfig test block: Remove forced page bouncing under IO fs: Remove ext3 filesystem driver doc: Update doc about journalling layer jfs: Handle error from dquot_initialize() reiserfs: Handle error from dquot_initialize() ocfs2: Handle error from dquot_initialize() ext4: Handle error from dquot_initialize() ext2: Handle error from dquot_initalize() quota: Propagate error from ->acquire_dquot()
2015-09-03Merge branch 'hpfs' (patches from Mikulas)Linus Torvalds
Merge hpfs upddate from Mikulas Patocka. * emailed patches from Mikulas Patocka <mikulas@twibright.com>: hpfs: update ctime and mtime on directory modification hpfs: support hotfixes
2015-09-03hpfs: update ctime and mtime on directory modificationMikulas Patocka
Update ctime and mtime when a directory is modified. (though OS/2 doesn't update them anyway) Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@kernel.org # v3.3+ Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-03hpfs: support hotfixesMikulas Patocka
When the OS/2 driver hits a disk write error, it writes the sector to another location and adds the sector mapping to the hotfix map. This patch makes the hpfs driver understand the hotfix map and remap accesses accoring to it. Signed-off-by: Mikulas Patocka <mikulas@twibright.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-02Merge branch 'for-4.3/core' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull core block updates from Jens Axboe: "This first core part of the block IO changes contains: - Cleanup of the bio IO error signaling from Christoph. We used to rely on the uptodate bit and passing around of an error, now we store the error in the bio itself. - Improvement of the above from myself, by shrinking the bio size down again to fit in two cachelines on x86-64. - Revert of the max_hw_sectors cap removal from a revision again, from Jeff Moyer. This caused performance regressions in various tests. Reinstate the limit, bump it to a more reasonable size instead. - Make /sys/block/<dev>/queue/discard_max_bytes writeable, by me. Most devices have huge trim limits, which can cause nasty latencies when deleting files. Enable the admin to configure the size down. We will look into having a more sane default instead of UINT_MAX sectors. - Improvement of the SGP gaps logic from Keith Busch. - Enable the block core to handle arbitrarily sized bios, which enables a nice simplification of bio_add_page() (which is an IO hot path). From Kent. - Improvements to the partition io stats accounting, making it faster. From Ming Lei. - Also from Ming Lei, a basic fixup for overflow of the sysfs pending file in blk-mq, as well as a fix for a blk-mq timeout race condition. - Ming Lin has been carrying Kents above mentioned patches forward for a while, and testing them. Ming also did a few fixes around that. - Sasha Levin found and fixed a use-after-free problem introduced by the bio->bi_error changes from Christoph. - Small blk cgroup cleanup from Viresh Kumar" * 'for-4.3/core' of git://git.kernel.dk/linux-block: (26 commits) blk: Fix bio_io_vec index when checking bvec gaps block: Replace SG_GAPS with new queue limits mask block: bump BLK_DEF_MAX_SECTORS to 2560 Revert "block: remove artifical max_hw_sectors cap" blk-mq: fix race between timeout and freeing request blk-mq: fix buffer overflow when reading sysfs file of 'pending' Documentation: update notes in biovecs about arbitrarily sized bios block: remove bio_get_nr_vecs() fs: use helper bio_add_page() instead of open coding on bi_io_vec block: kill merge_bvec_fn() completely md/raid5: get rid of bio_fits_rdev() md/raid5: split bio for chunk_aligned_read block: remove split code in blkdev_issue_{discard,write_same} btrfs: remove bio splitting and merge_bvec_fn() calls bcache: remove driver private bio splitting code block: simplify bio_add_page() block: make generic_make_request handle arbitrarily sized bios blk-cgroup: Drop unlikely before IS_ERR(_OR_NULL) block: don't access bio->bi_error after bio_put() block: shrink struct bio down to 2 cache lines again ...
2015-09-01Merge branch 'for-next' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial Pull trivial tree updates from Jiri Kosina: "The usual stuff from trivial tree for 4.3 (kerneldoc updates, printk() fixes, Documentation and MAINTAINERS updates)" * 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (28 commits) MAINTAINERS: update my e-mail address mod_devicetable: add space before */ scsi: a100u2w: trivial typo in printk i2c: Fix typo in i2c-bfin-twi.c treewide: fix typos in comment blocks Doc: fix trivial typo in SubmittingPatches proportions: Spelling s/consitent/consistent/ dm: Spelling s/consitent/consistent/ aic7xxx: Fix typo in error message pcmcia: Fix typo in locking documentation scsi/arcmsr: Fix typos in error log drm/nouveau/gr: Fix typo in nv10.c [SCSI] Fix printk typos in drivers/scsi staging: comedi: Grammar s/Enable support a/Enable support for a/ Btrfs: Spelling s/consitent/consistent/ README: GTK+ is a acronym ASoC: omap: Fix typo in config option description mm: tlb.c: Fix error message ntfs: super.c: Fix error log fix typo in Documentation/SubmittingPatches ...
2015-09-01Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull user namespace updates from Eric Biederman: "This finishes up the changes to ensure proc and sysfs do not start implementing executable files, as the there are application today that are only secure because such files do not exist. It akso fixes a long standing misfeature of /proc/<pid>/mountinfo that did not show the proper source for files bind mounted from /proc/<pid>/ns/*. It also straightens out the handling of clone flags related to user namespaces, fixing an unnecessary failure of unshare(CLONE_NEWUSER) when files such as /proc/<pid>/environ are read while <pid> is calling unshare. This winds up fixing a minor bug in unshare flag handling that dates back to the first version of unshare in the kernel. Finally, this fixes a minor regression caused by the introduction of sysfs_create_mount_point, which broke someone's in house application, by restoring the size of /sys/fs/cgroup to 0 bytes. Apparently that application uses the directory size to determine if a tmpfs is mounted on /sys/fs/cgroup. The bind mount escape fixes are present in Al Viros for-next branch. and I expect them to come from there. The bind mount escape is the last of the user namespace related security bugs that I am aware of" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: fs: Set the size of empty dirs to 0. userns,pidns: Force thread group sharing, not signal handler sharing. unshare: Unsharing a thread does not require unsharing a vm nsfs: Add a show_path method to fix mountinfo mnt: fs_fully_visible enforce noexec and nosuid if !SB_I_NOEXEC vfs: Commit to never having exectuables on proc and sysfs.
2015-09-01f2fs: upset segment_info repairYunlei He
upset segment_info like this: 276000|161 0|0 4|70 3|0 3|0 0|0 0|91 4|0 4|232 4|39 276104|0 4|0 4|1 4|0 4|0 4|280 4|0 4|42 4|262 4|38 276204|179 4|89 4|39 4|24 4|0 4|96 4|3 4|428 4|0 4|118 276304|112 4|97 4|0 4|0 4|0 4|68 4|0 4|0 4|86 4|138 276404|0 4|0 0|166 5|39 4|101 0|111 Signed-off-by: Yunlei He <heyunlei@huawei.com> Reviewed-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-31Merge tag 'char-misc-4.3-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc Pull char/misc driver patches from Greg KH: "Here's the "big" char/misc driver update for 4.3-rc1. Not much really interesting here, just a number of little changes all over the place, and some nice consolidation of the nvmem drivers to a common framework. As usual, the mei drivers stand out as the largest "churn" to handle new devices and features in their hardware. All have been in linux-next for a while with no issues" * tag 'char-misc-4.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (136 commits) auxdisplay: ks0108: initialize local parport variable extcon: palmas: Fix build break due to devm_gpiod_get_optional API change extcon: palmas: Support GPIO based USB ID detection extcon: Fix signedness bugs about break error handling extcon: Drop owner assignment from i2c_driver extcon: arizona: Simplify pdata symantics for micd_dbtime extcon: arizona: Declare 3-pole jack if we detect open circuit on mic extcon: Add exception handling to prevent the NULL pointer access extcon: arizona: Ensure variables are set for headphone detection extcon: arizona: Use gpiod inteface to handle micd_pol_gpio gpio extcon: arizona: Add basic microphone detection DT/ACPI bindings extcon: arizona: Update to use the new device properties API extcon: palmas: Remove the mutually_exclusive array extcon: Remove optional print_state() function pointer of struct extcon_dev extcon: Remove duplicate header file in extcon.h extcon: max77843: Clear IRQ bits state before request IRQ toshiba laptop: replace ioremap_cache with ioremap misc: eeprom: max6875: clean up max6875_read() misc: eeprom: clean up eeprom_read() misc: eeprom: 93xx46: clean up eeprom_93xx46_bin_read/write ...
2015-08-28f2fs: avoid accessing NULL pointer in f2fs_drop_largest_extentChao Yu
If extent cache is disable, we will encounter oops when triggering direct IO as below: BUG: unable to handle kernel NULL pointer dereference at 0000000c IP: [<f0b9c61e>] f2fs_drop_largest_extent+0xe/0x30 [f2fs] *pdpt = 000000002bb9a001 *pde = 0000000000000000 Oops: 0000 [#1] SMP Modules linked in: f2fs(O) fuse bnep rfcomm bluetooth nfsd dm_crypt nfs_acl auth_rpcgss oid_registry nfs binfmt_misc fscache lockd sunrpc grace snd_intel8x0 snd_ac97_codec ac97_bus snd_pcm snd_seq_midi snd_rawmidi snd_seq_midi_event snd_seq snd_timer snd_seq_device snd soundcore joydev psmouse hid_generic i2c_piix4 serio_raw ppdev mac_hid parport_pc lp parport ext4 jbd2 mbcache usbhid hid e1000 CPU: 3 PID: 3608 Comm: dd Tainted: G O 4.2.0-rc4 #12 Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006 task: ef161600 ti: ebd5e000 task.ti: ebd5e000 EIP: 0060:[<f0b9c61e>] EFLAGS: 00010202 CPU: 3 EIP is at f2fs_drop_largest_extent+0xe/0x30 [f2fs] EAX: 00000000 EBX: ddebc000 ECX: 00000000 EDX: 00000000 ESI: ebd5fdf8 EDI: 00000000 EBP: ebd5fd58 ESP: ebd5fd58 DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 CR0: 80050033 CR2: 0000000c CR3: 2c24ee40 CR4: 000006f0 Stack: ebd5fda4 f0b8c005 00000000 00000001 00000000 f0b8c430 c816cd68 ddebc000 ddebc088 00001000 00000555 00000555 ffffffff c160bb00 00055501 00000000 00000000 00000100 00000000 ebd5fe20 f0b8c430 00000046 ef161600 00001000 Call Trace: [<f0b8c005>] __allocate_data_block+0x1a5/0x260 [f2fs] [<f0b8c430>] ? f2fs_direct_IO+0x370/0x440 [f2fs] [<c160bb00>] ? down_read+0x30/0x50 [<f0b8c430>] f2fs_direct_IO+0x370/0x440 [f2fs] [<c113e115>] generic_file_direct_write+0xa5/0x260 [<c10b53f8>] ? current_fs_time+0x18/0x50 [<c113e38b>] __generic_file_write_iter+0xbb/0x210 [<c113e50f>] ? generic_file_write_iter+0x2f/0x320 [<c113e63c>] generic_file_write_iter+0x15c/0x320 [<f0b77f29>] f2fs_file_write_iter+0x39/0x80 [f2fs] [<c11984d9>] __vfs_write+0xa9/0xe0 [<c1199227>] vfs_write+0x97/0x180 [<c119955b>] SyS_write+0x5b/0xd0 [<c160dcd0>] sysenter_do_call+0x12/0x12 Code: 10 8b 50 1c 89 53 14 eb ca 8d 74 26 00 85 f6 74 86 eb a6 0f 0b 90 8d b4 26 00 00 00 00 55 89 e5 3e 8d 74 26 00 8b 80 d4 02 00 00 <8b> 48 0c 39 d1 77 0e 03 48 14 39 ca 73 07 c7 40 14 00 00 00 00 EIP: [<f0b9c61e>] f2fs_drop_largest_extent+0xe/0x30 [f2fs] SS:ESP 0068:ebd5fd58 CR2: 000000000000000c ---[ end trace a38c07026a1afffd ]--- This is because when extent cache is disable, extent_tree pointer in struct f2fs_inode_info should be NULL, but in f2fs_drop_largest_extent we access this NULL pointer directly without checking state of extent cache, then, the oops occurs. Let's fix it by checking state of extent cache before accessing. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-27dlm: print error from kernel_sendpageBob Peterson
Print a dlm-specific error when a socket error occurs when sending a dlm message. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2015-08-26f2fs: update extent tree in batchesChao Yu
This patch introduce a new helper f2fs_update_extent_tree_range which can do extent mapping update at a specified range. The main idea is: 1) punch all mapping info in extent node(s) which are at a specified range; 2) try to merge new extent mapping with adjacent node, or failing that, insert the mapping into extent tree as a new node. In order to see the benefit, I add a function for stating time stamping count as below: uint64_t rdtsc(void) { uint32_t lo, hi; __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi)); return (uint64_t)hi << 32 | lo; } My test environment is: ubuntu, intel i7-3770, 16G memory, 256g micron ssd. truncation path: update extent cache from truncate_data_blocks_range non-truncataion path: update extent cache from other paths total: all update paths a) Removing 128MB file which has one extent node mapping whole range of file: 1. dd if=/dev/zero of=/mnt/f2fs/128M bs=1M count=128 2. sync 3. rm /mnt/f2fs/128M Before: total count average truncation: 7651022 32768 233.49 Patched: total count average truncation: 3321 33 100.64 b) fsstress: fsstress -d /mnt/f2fs -l 5 -n 100 -p 20 Test times: 5 times. Before: total count average truncation: 5812480.6 20911.6 277.95 non-truncation: 7783845.6 13440.8 579.12 total: 13596326.2 34352.4 395.79 Patched: total count average truncation: 1281283.0 3041.6 421.25 non-truncation: 7355844.4 13662.8 538.38 total: 8637127.4 16704.4 517.06 1) For the updates in truncation path: - we can see updating in batches leads total tsc and update count reducing explicitly; - besides, for a single batched updating, punching multiple extent nodes in a loop, result in executing more operations, so our average tsc increase intensively. 2) For the updates in non-truncation path: - there is a little improvement, that is because for the scenario that we just need to update in the head or tail of extent node, new interface optimize to update info in extent node directly, rather than removing original extent node for updating and then inserting that updated one into cache as new node. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-25writeback: sync_inodes_sb() must write out I_DIRTY_TIME inodes and always ↵Tejun Heo
call wait_sb_inodes() e79729123f63 ("writeback: don't issue wb_writeback_work if clean") updated writeback path to avoid kicking writeback work items if there are no inodes to be written out; unfortunately, the avoidance logic was too aggressive and broke sync_inodes_sb(). * sync_inodes_sb() must write out I_DIRTY_TIME inodes but I_DIRTY_TIME inodes dont't contribute to bdi/wb_has_dirty_io() tests and were being skipped over. * inodes are taken off wb->b_dirty/io/more_io lists after writeback starts on them. sync_inodes_sb() skipping wait_sb_inodes() when bdi_has_dirty_io() breaks it by making it return while writebacks are in-flight. This patch fixes the breakages by * Removing bdi_has_dirty_io() shortcut from bdi_split_work_to_wbs(). The callers are already testing the condition. * Removing bdi_has_dirty_io() shortcut from sync_inodes_sb() so that it always calls into bdi_split_work_to_wbs() and wait_sb_inodes(). * Making bdi_split_work_to_wbs() consider the b_dirty_time list for WB_SYNC_ALL writebacks. Kudos to Eryu, Dave and Jan for tracking down the issue. Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: e79729123f63 ("writeback: don't issue wb_writeback_work if clean") Link: http://lkml.kernel.org/g/20150812101204.GE17933@dhcp-13-216.nay.redhat.com Reported-and-bisected-by: Eryu Guan <eguan@redhat.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Jan Kara <jack@suse.com> Cc: Ted Ts'o <tytso@google.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-08-25dlm: fix lvb copy for user locksDavid Teigland
For a userland lock request, the previous and current lock modes are used to decide when the lvb should be copied back to the user. The wrong previous value was used, so that it always matched the current value. This caused the lvb to be copied back to the user in the wrong cases. Signed-off-by: David Teigland <teigland@redhat.com>
2015-08-24f2fs: fix to release inode correctlyChao Yu
In following call stack, if unfortunately we lose all chances to truncate inode page in remove_inode_page, eventually we will add the nid allocated previously into free nid cache, this nid is with NID_NEW status and with NEW_ADDR in its blkaddr pointer: - f2fs_create - f2fs_add_link - __f2fs_add_link - init_inode_metadata - new_inode_page - new_node_page - set_node_addr(, NEW_ADDR) - f2fs_init_acl failed - remove_inode_page failed - handle_failed_inode - remove_inode_page failed - iput - f2fs_evict_inode - remove_inode_page failed - alloc_nid_failed cache a nid with valid blkaddr: NEW_ADDR This may not only cause resource leak of previous inode, but also may cause incorrect use of the previous blkaddr which is located in NO.nid node entry when this nid is reused by others. This patch tries to add this inode to orphan list if we fail to truncate inode, so that we can obtain a second chance to release it in orphan recovery flow. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-24f2fs: handle f2fs_truncate error correctlyChao Yu
This patch fixes to return error number of f2fs_truncate, so that we can handle the error correctly in callers. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-24f2fs: avoid unneeded initializing when converting inline dentryChao Yu
When converting inline dentry, we will zero out target dentry page before duplicating data of inline dentry into target page, it become overhead since inline dentry size is not small. So this patch tries to remove unneeded initializing in the space of target dentry page. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-24f2fs: atomically set inode->i_flagsZhang Zhen
According to commit 5f16f3225b06 ("ext4: atomically set inode->i_flags in ext4_set_inode_flags()"). Signed-off-by: Zhang Zhen <zhenzhang.zhang@huawei.com> Reviewed-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-24f2fs: fix wrong pointer access during try_to_free_nidsJaegeuk Kim
If we release the lock in list_for_each_entry_safe, we can lose the tmp pointer by alloc_nid. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>