summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2016-09-15fscrypto: make filename crypto functions return 0 on successEric Biggers
Several filename crypto functions: fname_decrypt(), fscrypt_fname_disk_to_usr(), and fscrypt_fname_usr_to_disk(), returned the output length on success or -errno on failure. However, the output length was redundant with the value written to 'oname->len'. It is also potentially error-prone to make callers have to check for '< 0' instead of '!= 0'. Therefore, make these functions return 0 instead of a length, and make the callers who cared about the return value being a length use 'oname->len' instead. For consistency also make other callers check for a nonzero result rather than a negative result. This change also fixes the inconsistency of fname_encrypt() actually already returning 0 on success, not a length like the other filename crypto functions and as documented in its function comment. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Acked-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-09-15fscrypto: rename completion callbacks to reflect usageEric Biggers
fscrypt_complete() was used only for data pages, not for all encryption/decryption. Rename it to page_crypt_complete(). dir_crypt_complete() was used for filename encryption/decryption for both directory entries and symbolic links. Rename it to fname_crypt_complete(). Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-09-15fscrypto: remove unnecessary includesEric Biggers
This patch removes some #includes that are clearly not needed, such as a reference to ecryptfs, which is unrelated to the new filesystem encryption code. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-09-15fscrypto: improved validation when loading inode encryption metadataEric Biggers
- Validate fscrypt_context.format and fscrypt_context.flags. If unrecognized values are set, then the kernel may not know how to interpret the encrypted file, so it should fail the operation. - Validate that AES_256_XTS is used for contents and that AES_256_CTS is used for filenames. It was previously possible for the kernel to accept these reversed, though it would have taken manual editing of the block device. This was not intended. - Fail cleanly rather than BUG()-ing if a file has an unexpected type. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-09-15ext4: fix memory leak when symlink decryption failsEric Biggers
This bug was introduced in v4.8-rc1. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
2016-09-15jbd2: move more common code into journal_init_common()Geliang Tang
There are some repetitive code in jbd2_journal_init_dev() and jbd2_journal_init_inode(). So this patch moves the common code into journal_init_common() helper to simplify the code. And fix the coding style warnings reported by checkpatch.pl by the way. Signed-off-by: Geliang Tang <geliangtang@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
2016-09-15ext4: remove unused definition for MAX_32_NUMFabian Frederick
MAX_32_NUM isn't used in ext4 Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-09-15ext4: create EXT4_MAX_BLOCKS() macroFabian Frederick
Create a macro to calculate length + offset -> maximum blocks This adds more readability. Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-09-15ext4: remove unneeded test in ext4_alloc_file_blocks()Fabian Frederick
ext4_alloc_file_blocks() is called from ext4_zero_range() and ext4_fallocate() both already testing EXT4_INODE_EXTENTS We can call ext_depth(inode) unconditionnally. [ Added BUG_ON check to make sure ext4_alloc_file_blocks() won't get called for a indirect-mapped inode in the future. -- tytso ] Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-09-15ext4: fix memory leak in ext4_insert_range()Fabian Frederick
Running xfstests generic/013 with kmemleak gives the following: unreferenced object 0xffff8801d3d27de0 (size 96): comm "fsstress", pid 4941, jiffies 4294860168 (age 53.485s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 00 00 01 00 00 00 00 00 ................ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<ffffffff818eaaf3>] kmemleak_alloc+0x23/0x40 [<ffffffff81179805>] __kmalloc+0xf5/0x1d0 [<ffffffff8122ef5c>] ext4_find_extent+0x1ec/0x2f0 [<ffffffff8123530c>] ext4_insert_range+0x34c/0x4a0 [<ffffffff81235942>] ext4_fallocate+0x4e2/0x8b0 [<ffffffff81181334>] vfs_fallocate+0x134/0x210 [<ffffffff8118203f>] SyS_fallocate+0x3f/0x60 [<ffffffff818efa9b>] entry_SYSCALL_64_fastpath+0x13/0x8f [<ffffffffffffffff>] 0xffffffffffffffff Problem seems mitigated by dropping refs and freeing path when there's no path[depth].p_ext Cc: stable@vger.kernel.org Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-09-15ext4: bugfix for mmaped pages in mpage_release_unused_pages()wangguang
Pages clear buffers after ext4 delayed block allocation failed, However, it does not clean its pte_dirty flag. if the pages unmap ,in cording to the pte_dirty , unmap_page_range may try to call __set_page_dirty, which may lead to the bugon at mpage_prepare_extent_to_map:head = page_buffers(page);. This patch just call clear_page_dirty_for_io to clean pte_dirty at mpage_release_unused_pages for pages mmaped. Steps to reproduce the bug: (1) mmap a file in ext4 addr = (char *)mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0); memset(addr, 'i', 4096); (2) return EIO at ext4_writepages->mpage_map_and_submit_extent->mpage_map_one_extent which causes this log message to be print: ext4_msg(sb, KERN_CRIT, "Delayed block allocation failed for " "inode %lu at logical offset %llu with" " max blocks %u with error %d", inode->i_ino, (unsigned long long)map->m_lblk, (unsigned)map->m_len, -err); (3)Unmap the addr cause warning at __set_page_dirty:WARN_ON_ONCE(warn && !PageUptodate(page)); (4) wait for a minute,then bugon happen. Cc: stable@vger.kernel.org Signed-off-by: wangguang <wangguang03@zte.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-09-05ext4: improve ext4lazyinit scalabilityDmitry Monakhov
ext4lazyinit is a global thread. This thread performs itable initalization under li_list_mtx mutex. It basically does the following: ext4_lazyinit_thread ->mutex_lock(&eli->li_list_mtx); ->ext4_run_li_request(elr) ->ext4_init_inode_table-> Do a lot of IO if the list is large And when new mount/umount arrive they have to block on ->li_list_mtx because lazy_thread holds it during full walk procedure. ext4_fill_super ->ext4_register_li_request ->mutex_lock(&ext4_li_info->li_list_mtx); ->list_add(&elr->lr_request, &ext4_li_info >li_request_list); In my case mount takes 40minutes on server with 36 * 4Tb HDD. Common user may face this in case of very slow dev ( /dev/mmcblkXXX) Even more. If one of filesystems was frozen lazyinit_thread will simply block on sb_start_write() so other mount/umount will be stuck forever. This patch changes logic like follows: - grab ->s_umount read sem before processing new li_request. After that it is safe to drop li_list_mtx because all callers of li_remove_request are holding ->s_umount for write. - li_thread skips frozen SB's Locking order: Mh KOrder is asserted by umount path like follows: s_umount ->li_list_mtx so the only way to to grab ->s_mount inside li_thread is via down_read_trylock xfstests:ext4/023 #PSBM-49658 Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-09-05ext4: cleanup ext4_sync_parent()Jan Kara
A condition !hlist_empty(&inode->i_dentry) is always true for open file. Just remove it. Also ext4_sync_parent() could use some explanation why races with rmdir() are not an issue - add a comment explaining that. Reported-by: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-09-05ext4: remove old feature helpersKaho Ng
Use the ext4_{has,set,clear}_feature_* helpers to replace the old feature helpers. Signed-off-by: Kaho Ng <ngkaho1234@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
2016-09-05ext4: enable quota enforcement based on mount optionsJan Kara
When quota information is stored in quota files, we enable only quota accounting on mount and enforcement is enabled only in response to Q_QUOTAON quotactl. To make ext4 behavior consistent with XFS, we add a possibility to enable quota enforcement on mount by specifying corresponding quota mount option (usrquota, grpquota, prjquota). Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-09-05ext4: reinforce check of i_dtime when clearing high fields of uid and gidDaeho Jeong
Now, ext4_do_update_inode() clears high 16-bit fields of uid/gid of deleted and evicted inode to fix up interoperability with old kernels. However, it checks only i_dtime of an inode to determine whether the inode was deleted and evicted, and this is very risky, because i_dtime can be used for the pointer maintaining orphan inode list, too. We need to further check whether the i_dtime is being used for the orphan inode list even if the i_dtime is not NULL. We found that high 16-bit fields of uid/gid of inode are unintentionally and permanently cleared when the inode truncation is just triggered, but not finished, and the inode metadata, whose high uid/gid bits are cleared, is written on disk, and the sudden power-off follows that in order. Cc: stable@vger.kernel.org Signed-off-by: Daeho Jeong <daeho.jeong@samsung.com> Signed-off-by: Hobin Woo <hobin.woo@samsung.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-08-31mbcache: fix to detect failure of register_shrinkerChao Yu
register_shrinker in mb_cache_create may fail due to no memory. This patch fixes to do the check of return value of register_shrinker and handle the error case, otherwise mb_cache_create may return with no error, but losing the inner shrinker. Signed-off-by: Chao Yu <yuchao0@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-08-29ext4: enforce online defrag restriction for encrypted filesEric Whitney
Online defragging of encrypted files is not currently implemented. However, the move extent ioctl can still return successfully when called. For example, this occurs when xfstest ext4/020 is run on an encrypted file system, resulting in a corrupted test file and a corresponding test failure. Until the proper functionality is implemented, fail the move extent ioctl if either the original or donor file is encrypted. Cc: stable@vger.kernel.org Signed-off-by: Eric Whitney <enwlinux@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-08-29ext4: factor out loop for freeing inode xattr spaceJan Kara
Move loop to make enough space in the inode from ext4_expand_extra_isize_ea() into a separate function to make that function smaller and better readable and also to avoid delaration of variables inside a loop block. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-08-29ext4: remove (almost) unused variables from ext4_expand_extra_isize_ea()Jan Kara
'start' variable is completely unused in ext4_expand_extra_isize_ea(). Variable 'first' is used only once in one place. So just remove them. Variables 'entry' and 'last' are only really used later in the function inside a loop. Move their declarations there. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-08-29ext4: factor out xattr movingJan Kara
Factor out function for moving xattrs from inode into external xattr block from ext4_expand_extra_isize_ea(). That function is already quite long and factoring out this rather standalone functionality helps readability. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-08-29ext4: replace bogus assertion in ext4_xattr_shift_entries()Jan Kara
We were checking whether computed offsets do not exceed end of block in ext4_xattr_shift_entries(). However this does not make sense since we always only decrease offsets. So replace that assertion with a check whether we really decrease xattrs value offsets. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-08-29ext4: remove checks for e_value_blockJan Kara
Currently we don't support xattrs with e_value_block set. We don't allow them to pass initial xattr check so there's no point for checking for this later. Since these tests were untested, bugs were creeping in and not all places which should have checked were checking e_value_block anyway. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-08-29ext4: Check that external xattr value block is zeroJan Kara
Currently we don't support xattrs with values stored out of line. Check for that in ext4_xattr_check_names() to make sure we never work with such xattrs since not all the code counts with that resulting is possible weird corruption issues. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-08-29ext4: fixup free space calculations when expanding inodesJan Kara
Conditions checking whether there is enough free space in an xattr block and when xattr is large enough to make enough space in the inode forgot to account for the fact that inode need not be completely filled up with xattrs. Thus we could move unnecessarily many xattrs out of inode or even falsely claim there is not enough space to expand the inode. We also forgot to update the amount of free space in xattr block when moving more xattrs and thus could decide to move too big xattr resulting in unexpected failure. Fix these problems by properly updating free space in the inode and xattr block as we move xattrs. To simplify the math, avoid shifting xattrs after removing each one xattr and instead just shift xattrs only once there is enough free space in the inode. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-08-29Merge tag 'ext4_for_linus_stable' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 fixes from Ted Ts'o: "Fix bugs that could cause kernel deadlocks or file system corruption while moving xattrs to expand the extended inode. Also add some sanity checks to the block group descriptors to make sure we don't end up overwriting the superblock" * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: avoid deadlock when expanding inode size ext4: properly align shifted xattrs when expanding inodes ext4: fix xattr shifting when expanding inodes part 2 ext4: fix xattr shifting when expanding inodes ext4: validate that metadata blocks do not overlap superblock ext4: reserve xattr index for the Hurd
2016-08-26Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge fixes from Andrew Morton: "11 fixes" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: mm: silently skip readahead for DAX inodes dax: fix device-dax region base fs/seq_file: fix out-of-bounds read mm: memcontrol: avoid unused function warning mm: clarify COMPACTION Kconfig text treewide: replace config_enabled() with IS_ENABLED() (2nd round) printk: fix parsing of "brl=" option soft_dirty: fix soft_dirty during THP split sysctl: handle error writing UINT_MAX to u32 fields get_maintainer: quiet noisy implicit -f vcs_file_exists checking byteswap: don't use __builtin_bswap*() with sparse
2016-08-26Merge branch 'for-linus-4.8' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "We've queued up a few different fixes in here. These range from enospc corners to fsync and quota fixes, and a few targeted at error handling for corrupt metadata/fuzzing" * 'for-linus-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: fix lockdep warning on deadlock against an inode's log mutex Btrfs: detect corruption when non-root leaf has zero item Btrfs: check btree node's nritems btrfs: don't create or leak aliased root while cleaning up orphans Btrfs: fix em leak in find_first_block_group btrfs: do not background blkdev_put() Btrfs: clarify do_chunk_alloc()'s return value btrfs: fix fsfreeze hang caused by delayed iputs deal btrfs: update btrfs_space_info's bytes_may_use timely btrfs: divide btrfs_update_reserved_bytes() into two functions btrfs: use correct offset for reloc_inode in prealloc_file_extent_cluster() btrfs: qgroup: Fix qgroup incorrectness caused by log replay btrfs: relocation: Fix leaking qgroups numbers on data extents btrfs: qgroup: Refactor btrfs_qgroup_insert_dirty_extent() btrfs: waiting on qgroup rescan should not always be interruptible btrfs: properly track when rescan worker is running btrfs: flush_space: treat return value of do_chunk_alloc properly Btrfs: add ASSERT for block group's memory leak btrfs: backref: Fix soft lockup in __merge_refs function Btrfs: fix memory leak of reloc_root
2016-08-26Merge tag 'dlm-4.8-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm Pull dlm fix from David Teigland: "This fixes a bug introduced by recent debugfs cleanup" * tag 'dlm-4.8-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm: dlm: fix malfunction of dlm_tool caused by debugfs changes
2016-08-26Merge branch 'for-linus' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block fixes from Jens Axboe: "Here's a set of block fixes for the current 4.8-rc release. This contains: - a fix for a secure erase regression, from Adrian. - a fix for an mmc use-after-free bug regression, also from Adrian. - potential zero pointer deference in bdev freezing, from Andrey. - a race fix for blk_set_queue_dying() from Bart. - a set of xen blkfront fixes from Bob Liu. - three small fixes for bcache, from Eric and Kent. - a fix for a potential invalid NVMe state transition, from Gabriel. - blk-mq CPU offline fix, preventing us from issuing and completing a request on the wrong queue. From me. - revert two previous floppy changes, since they caused a user visibile regression. A better fix is in the works. - ensure that we don't send down bios that have more than 256 elements in them. Fixes a crash with bcache, for example. From Ming. - a fix for deferencing an error pointer with cgroup writeback. Fixes a regression. From Vegard" * 'for-linus' of git://git.kernel.dk/linux-block: mmc: fix use-after-free of struct request Revert "floppy: refactor open() flags handling" Revert "floppy: fix open(O_ACCMODE) for ioctl-only open" fs/block_dev: fix potential NULL ptr deref in freeze_bdev() blk-mq: improve warning for running a queue on the wrong CPU blk-mq: don't overwrite rq->mq_ctx block: make sure a big bio is split into at most 256 bvecs nvme: Fix nvme_get/set_features() with a NULL result pointer bdev: fix NULL pointer dereference xen-blkfront: free resources if xlvbd_alloc_gendisk fails xen-blkfront: introduce blkif_set_queue_limits() xen-blkfront: fix places not updated after introducing 64KB page granularity bcache: pr_err: more meaningful error message when nr_stripes is invalid bcache: RESERVE_PRIO is too small by one when prio_buckets() is a power of two. bcache: register_bcache(): call blkdev_put() when cache_alloc() fails block: Fix race triggered by blk_set_queue_dying() block: Fix secure erase nvme: Prevent controller state invalid transition
2016-08-26fs/seq_file: fix out-of-bounds readVegard Nossum
seq_read() is a nasty piece of work, not to mention buggy. It has (I think) an old bug which allows unprivileged userspace to read beyond the end of m->buf. I was getting these: BUG: KASAN: slab-out-of-bounds in seq_read+0xcd2/0x1480 at addr ffff880116889880 Read of size 2713 by task trinity-c2/1329 CPU: 2 PID: 1329 Comm: trinity-c2 Not tainted 4.8.0-rc1+ #96 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.3-0-ge2fc41e-prebuilt.qemu-project.org 04/01/2014 Call Trace: kasan_object_err+0x1c/0x80 kasan_report_error+0x2cb/0x7e0 kasan_report+0x4e/0x80 check_memory_region+0x13e/0x1a0 kasan_check_read+0x11/0x20 seq_read+0xcd2/0x1480 proc_reg_read+0x10b/0x260 do_loop_readv_writev.part.5+0x140/0x2c0 do_readv_writev+0x589/0x860 vfs_readv+0x7b/0xd0 do_readv+0xd8/0x2c0 SyS_readv+0xb/0x10 do_syscall_64+0x1b3/0x4b0 entry_SYSCALL64_slow_path+0x25/0x25 Object at ffff880116889100, in cache kmalloc-4096 size: 4096 Allocated: PID = 1329 save_stack_trace+0x26/0x80 save_stack+0x46/0xd0 kasan_kmalloc+0xad/0xe0 __kmalloc+0x1aa/0x4a0 seq_buf_alloc+0x35/0x40 seq_read+0x7d8/0x1480 proc_reg_read+0x10b/0x260 do_loop_readv_writev.part.5+0x140/0x2c0 do_readv_writev+0x589/0x860 vfs_readv+0x7b/0xd0 do_readv+0xd8/0x2c0 SyS_readv+0xb/0x10 do_syscall_64+0x1b3/0x4b0 return_from_SYSCALL_64+0x0/0x6a Freed: PID = 0 (stack is not available) Memory state around the buggy address: ffff88011688a000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ffff88011688a080: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >ffff88011688a100: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ^ ffff88011688a180: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ffff88011688a200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ================================================================== Disabling lock debugging due to kernel taint This seems to be the same thing that Dave Jones was seeing here: https://lkml.org/lkml/2016/8/12/334 There are multiple issues here: 1) If we enter the function with a non-empty buffer, there is an attempt to flush it. But it was not clearing m->from after doing so, which means that if we try to do this flush twice in a row without any call to traverse() in between, we are going to be reading from the wrong place -- the splat above, fixed by this patch. 2) If there's a short write to userspace because of page faults, the buffer may already contain multiple lines (i.e. pos has advanced by more than 1), but we don't save the progress that was made so the next call will output what we've already returned previously. Since that is a much less serious issue (and I have a headache after staring at seq_read() for the past 8 hours), I'll leave that for now. Link: http://lkml.kernel.org/r/1471447270-32093-1-git-send-email-vegard.nossum@oracle.com Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com> Reported-by: Dave Jones <davej@codemonkey.org.uk> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-08-26dlm: fix malfunction of dlm_tool caused by debugfs changesEric Ren
With the current kernel, `dlm_tool lockdebug` fails as below: "dlm_tool lockdebug ED0BD86DCE724393918A1AE8FDBF1EE3 can't open /sys/kernel/debug/dlm/ED0BD86DCE724393918A1AE8FDBF1EE3: Operation not permitted" This is because table_open() depends on file->f_op to tell which seq_file ops should be passed down. But, the original file ops in file->f_op is replaced by "debugfs_full_proxy_file_operations" with commit 49d200deaa68 ("debugfs: prevent access to removed files' private data"). Currently, I can think up 2 solutions: 1st, replace debugfs_create_file() with debugfs_create_file_unsafe(); 2nd, make different table_open#() accordingly. The 1st one is neat, but I don't thoroughly understand its risk. Maybe someone has a better one. Signed-off-by: Eric Ren <zren@suse.com> Signed-off-by: David Teigland <teigland@redhat.com>
2016-08-25fs/block_dev: fix potential NULL ptr deref in freeze_bdev()Andrey Ryabinin
Calling freeze_bdev() twice on the same block device without mounted filesystem get_super() will return NULL, which will lead to NULL-ptr dereference later in drop_super(). Check get_super() result to fix that. Note, that this is a purely theoretical issue. We have only 3 freeze_bdev() callers. 2 of them are in filesystem code and used on a device with mounted fs. The third one in lock_fs() has protection in upper-layer code against freezing block device the second time without thawing it first. Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-08-25Btrfs: fix lockdep warning on deadlock against an inode's log mutexFilipe Manana
Commit 44f714dae50a ("Btrfs: improve performance on fsync against new inode after rename/unlink"), which landed in 4.8-rc2, introduced a possibility for a deadlock due to double locking of an inode's log mutex by the same task, which lockdep reports with: [23045.433975] ============================================= [23045.434748] [ INFO: possible recursive locking detected ] [23045.435426] 4.7.0-rc6-btrfs-next-34+ #1 Not tainted [23045.436044] --------------------------------------------- [23045.436044] xfs_io/3688 is trying to acquire lock: [23045.436044] (&ei->log_mutex){+.+...}, at: [<ffffffffa038552d>] btrfs_log_inode+0x13a/0xc95 [btrfs] [23045.436044] but task is already holding lock: [23045.436044] (&ei->log_mutex){+.+...}, at: [<ffffffffa038552d>] btrfs_log_inode+0x13a/0xc95 [btrfs] [23045.436044] other info that might help us debug this: [23045.436044] Possible unsafe locking scenario: [23045.436044] CPU0 [23045.436044] ---- [23045.436044] lock(&ei->log_mutex); [23045.436044] lock(&ei->log_mutex); [23045.436044] *** DEADLOCK *** [23045.436044] May be due to missing lock nesting notation [23045.436044] 3 locks held by xfs_io/3688: [23045.436044] #0: (&sb->s_type->i_mutex_key#15){+.+...}, at: [<ffffffffa035f2ae>] btrfs_sync_file+0x14e/0x425 [btrfs] [23045.436044] #1: (sb_internal#2){.+.+.+}, at: [<ffffffff8118446b>] __sb_start_write+0x5f/0xb0 [23045.436044] #2: (&ei->log_mutex){+.+...}, at: [<ffffffffa038552d>] btrfs_log_inode+0x13a/0xc95 [btrfs] [23045.436044] stack backtrace: [23045.436044] CPU: 4 PID: 3688 Comm: xfs_io Not tainted 4.7.0-rc6-btrfs-next-34+ #1 [23045.436044] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014 [23045.436044] 0000000000000000 ffff88022f5f7860 ffffffff8127074d ffffffff82a54b70 [23045.436044] ffffffff82a54b70 ffff88022f5f7920 ffffffff81092897 ffff880228015d68 [23045.436044] 0000000000000000 ffffffff82a54b70 ffffffff829c3f00 ffff880228015d68 [23045.436044] Call Trace: [23045.436044] [<ffffffff8127074d>] dump_stack+0x67/0x90 [23045.436044] [<ffffffff81092897>] __lock_acquire+0xcbb/0xe4e [23045.436044] [<ffffffff8109155f>] ? mark_lock+0x24/0x201 [23045.436044] [<ffffffff8109179a>] ? mark_held_locks+0x5e/0x74 [23045.436044] [<ffffffff81092de0>] lock_acquire+0x12f/0x1c3 [23045.436044] [<ffffffff81092de0>] ? lock_acquire+0x12f/0x1c3 [23045.436044] [<ffffffffa038552d>] ? btrfs_log_inode+0x13a/0xc95 [btrfs] [23045.436044] [<ffffffffa038552d>] ? btrfs_log_inode+0x13a/0xc95 [btrfs] [23045.436044] [<ffffffff814a51a4>] mutex_lock_nested+0x77/0x3a7 [23045.436044] [<ffffffffa038552d>] ? btrfs_log_inode+0x13a/0xc95 [btrfs] [23045.436044] [<ffffffffa039705e>] ? btrfs_release_delayed_node+0xb/0xd [btrfs] [23045.436044] [<ffffffffa038552d>] btrfs_log_inode+0x13a/0xc95 [btrfs] [23045.436044] [<ffffffffa038552d>] ? btrfs_log_inode+0x13a/0xc95 [btrfs] [23045.436044] [<ffffffff810a0ed1>] ? vprintk_emit+0x453/0x465 [23045.436044] [<ffffffffa0385a61>] btrfs_log_inode+0x66e/0xc95 [btrfs] [23045.436044] [<ffffffffa03c084d>] log_new_dir_dentries+0x26c/0x359 [btrfs] [23045.436044] [<ffffffffa03865aa>] btrfs_log_inode_parent+0x4a6/0x628 [btrfs] [23045.436044] [<ffffffffa0387552>] btrfs_log_dentry_safe+0x5a/0x75 [btrfs] [23045.436044] [<ffffffffa035f464>] btrfs_sync_file+0x304/0x425 [btrfs] [23045.436044] [<ffffffff811acaf4>] vfs_fsync_range+0x8c/0x9e [23045.436044] [<ffffffff811acb22>] vfs_fsync+0x1c/0x1e [23045.436044] [<ffffffff811acc79>] do_fsync+0x31/0x4a [23045.436044] [<ffffffff811ace99>] SyS_fsync+0x10/0x14 [23045.436044] [<ffffffff814a88e5>] entry_SYSCALL_64_fastpath+0x18/0xa8 [23045.436044] [<ffffffff8108f039>] ? trace_hardirqs_off_caller+0x3f/0xaa An example reproducer for this is: $ mkfs.btrfs -f /dev/sdb $ mount /dev/sdb /mnt $ mkdir /mnt/dir $ touch /mnt/dir/foo $ sync $ mv /mnt/dir/foo /mnt/dir/bar $ touch /mnt/dir/foo $ xfs_io -c "fsync" /mnt/dir/bar This is because while logging the inode of file bar we end up logging its parent directory (since its inode has an unlink_trans field matching the current transaction id due to the rename operation), which in turn logs the inodes for all its new dentries, so that the new inode for the new file named foo gets logged which in turn triggered another logging attempt for the inode we are fsync'ing, since that inode had an old name that corresponds to the name of the new inode. So fix this by ensuring that when logging the inode for a new dentry that has a name matching an old name of some other inode, we don't log again the original inode that we are fsync'ing. Fixes: 44f714dae50a ("Btrfs: improve performance on fsync against new inode after rename/unlink") Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2016-08-25Btrfs: detect corruption when non-root leaf has zero itemLiu Bo
Right now we treat leaf which has zero item as a valid one because we could have an empty tree, that is, a root that is also a leaf without any item, however, in the same case but when the leaf is not a root, we can end up with hitting the BUG_ON(1) in btrfs_extend_item() called by setup_inline_extent_backref(). This makes us check the situation as a corruption if leaf is not its own root. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2016-08-25Btrfs: check btree node's nritemsLiu Bo
When btree node (level = 1) has nritems which equals to zero, we can end up with panic due to insert_ptr()'s BUG_ON(slot > nritems); where slot is 1 and nritems is 0, as copy_for_split() calls insert_ptr(.., path->slots[1] + 1, ...); A invalid value results in the whole mess, this adds the check for btree's node nritems so that we stop reading block when when something is wrong. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2016-08-25btrfs: don't create or leak aliased root while cleaning up orphansJeff Mahoney
commit 909c3a22da3 (Btrfs: fix loading of orphan roots leading to BUG_ON) avoids the BUG_ON but can add an aliased root to the dead_roots list or leak the root. Since we've already been loading roots into the radix tree, we should use it before looking the root up on disk. Cc: <stable@vger.kernel.org> # 4.5 Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2016-08-25Btrfs: fix em leak in find_first_block_groupJosef Bacik
We need to call free_extent_map() on the em we look up. Signed-off-by: Josef Bacik <jbacik@fb.com> Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2016-08-25btrfs: do not background blkdev_put()Anand Jain
At the end of unmount/dev-delete, if the device exclusive open is not actually closed, then there might be a race with another program in the userland who is trying to open the device in exclusive mode and it may fail for eg: unmount /btrfs; fsck /dev/x btrfs dev del /dev/x /btrfs; fsck /dev/x so here background blkdev_put() is not a choice Signed-off-by: Anand Jain <Anand.Jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2016-08-25Btrfs: clarify do_chunk_alloc()'s return valueLiu Bo
Function start_transaction() can return ERR_PTR(1) when flush is BTRFS_RESERVE_FLUSH_LIMIT, so the call graph is start_transaction (return ERR_PTR(1)) -> btrfs_block_rsv_add (return 1) -> reserve_metadata_bytes (return 1) -> flush_space (return 1) -> do_chunk_alloc (return 1) With BTRFS_RESERVE_FLUSH_LIMIT, if flush_space is already on the flush_state of ALLOC_CHUNK and it successfully allocates a new chunk, then instead of trying to reserve space again, reserve_metadata_bytes returns 1 immediately. Eventually the callers who call start_transaction() usually just do the IS_ERR() check which ERR_PTR(1) can pass, then it'll get a panic when dereferencing a pointer which is ERR_PTR(1). The following patch fixes the above problem. "btrfs: flush_space: treat return value of do_chunk_alloc properly" https://patchwork.kernel.org/patch/7778651/ This add comments to clarify do_chunk_alloc()'s return value. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2016-08-25btrfs: fix fsfreeze hang caused by delayed iputs dealWang Xiaoguang
When running fstests generic/068, sometimes we got below deadlock: xfs_io D ffff8800331dbb20 0 6697 6693 0x00000080 ffff8800331dbb20 ffff88007acfc140 ffff880034d895c0 ffff8800331dc000 ffff880032d243e8 fffffffeffffffff ffff880032d24400 0000000000000001 ffff8800331dbb38 ffffffff816a9045 ffff880034d895c0 ffff8800331dbba8 Call Trace: [<ffffffff816a9045>] schedule+0x35/0x80 [<ffffffff816abab2>] rwsem_down_read_failed+0xf2/0x140 [<ffffffff8118f5e1>] ? __filemap_fdatawrite_range+0xd1/0x100 [<ffffffff8134f978>] call_rwsem_down_read_failed+0x18/0x30 [<ffffffffa06631fc>] ? btrfs_alloc_block_rsv+0x2c/0xb0 [btrfs] [<ffffffff810d32b5>] percpu_down_read+0x35/0x50 [<ffffffff81217dfc>] __sb_start_write+0x2c/0x40 [<ffffffffa067f5d5>] start_transaction+0x2a5/0x4d0 [btrfs] [<ffffffffa067f857>] btrfs_join_transaction+0x17/0x20 [btrfs] [<ffffffffa068ba34>] btrfs_evict_inode+0x3c4/0x5d0 [btrfs] [<ffffffff81230a1a>] evict+0xba/0x1a0 [<ffffffff812316b6>] iput+0x196/0x200 [<ffffffffa06851d0>] btrfs_run_delayed_iputs+0x70/0xc0 [btrfs] [<ffffffffa067f1d8>] btrfs_commit_transaction+0x928/0xa80 [btrfs] [<ffffffffa0646df0>] btrfs_freeze+0x30/0x40 [btrfs] [<ffffffff81218040>] freeze_super+0xf0/0x190 [<ffffffff81229275>] do_vfs_ioctl+0x4a5/0x5c0 [<ffffffff81003176>] ? do_audit_syscall_entry+0x66/0x70 [<ffffffff810038cf>] ? syscall_trace_enter_phase1+0x11f/0x140 [<ffffffff81229409>] SyS_ioctl+0x79/0x90 [<ffffffff81003c12>] do_syscall_64+0x62/0x110 [<ffffffff816acbe1>] entry_SYSCALL64_slow_path+0x25/0x25 >From this warning, freeze_super() already holds SB_FREEZE_FS, but btrfs_freeze() will call btrfs_commit_transaction() again, if btrfs_commit_transaction() finds that it has delayed iputs to handle, it'll start_transaction(), which will try to get SB_FREEZE_FS lock again, then deadlock occurs. The root cause is that in btrfs, sync_filesystem(sb) does not make sure all metadata is updated. There still maybe some codes adding delayed iputs, see below sample race window: CPU1 | CPU2 |-> freeze_super() | |-> sync_filesystem(sb); | | |-> cleaner_kthread() | | |-> btrfs_delete_unused_bgs() | | |-> btrfs_remove_chunk() | | |-> btrfs_remove_block_group() | | |-> btrfs_add_delayed_iput() | | |-> sb->s_writers.frozen = SB_FREEZE_FS; | |-> sb_wait_write(sb, SB_FREEZE_FS); | | acquire SB_FREEZE_FS lock. | | | |-> btrfs_freeze() | |-> btrfs_commit_transaction() | |-> btrfs_run_delayed_iputs() | | will handle delayed iputs, | | that means start_transaction() | | will be called, which will try | | to get SB_FREEZE_FS lock. | To fix this issue, introduce a "int fs_frozen" to record internally whether fs has been frozen. If fs has been frozen, we can not handle delayed iputs. Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add comment to btrfs_freeze ] Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2016-08-25btrfs: update btrfs_space_info's bytes_may_use timelyWang Xiaoguang
This patch can fix some false ENOSPC errors, below test script can reproduce one false ENOSPC error: #!/bin/bash dd if=/dev/zero of=fs.img bs=$((1024*1024)) count=128 dev=$(losetup --show -f fs.img) mkfs.btrfs -f -M $dev mkdir /tmp/mntpoint mount $dev /tmp/mntpoint cd /tmp/mntpoint xfs_io -f -c "falloc 0 $((64*1024*1024))" testfile Above script will fail for ENOSPC reason, but indeed fs still has free space to satisfy this request. Please see call graph: btrfs_fallocate() |-> btrfs_alloc_data_chunk_ondemand() | bytes_may_use += 64M |-> btrfs_prealloc_file_range() |-> btrfs_reserve_extent() |-> btrfs_add_reserved_bytes() | alloc_type is RESERVE_ALLOC_NO_ACCOUNT, so it does not | change bytes_may_use, and bytes_reserved += 64M. Now | bytes_may_use + bytes_reserved == 128M, which is greater | than btrfs_space_info's total_bytes, false enospc occurs. | Note, the bytes_may_use decrease operation will be done in | end of btrfs_fallocate(), which is too late. Here is another simple case for buffered write: CPU 1 | CPU 2 | |-> cow_file_range() |-> __btrfs_buffered_write() |-> btrfs_reserve_extent() | | | | | | | | | ..... | |-> btrfs_check_data_free_space() | | | | |-> extent_clear_unlock_delalloc() | In CPU 1, btrfs_reserve_extent()->find_free_extent()-> btrfs_add_reserved_bytes() do not decrease bytes_may_use, the decrease operation will be delayed to be done in extent_clear_unlock_delalloc(). Assume in this case, btrfs_reserve_extent() reserved 128MB data, CPU2's btrfs_check_data_free_space() tries to reserve 100MB data space. If 100MB > data_sinfo->total_bytes - data_sinfo->bytes_used - data_sinfo->bytes_reserved - data_sinfo->bytes_pinned - data_sinfo->bytes_readonly - data_sinfo->bytes_may_use btrfs_check_data_free_space() will try to allcate new data chunk or call btrfs_start_delalloc_roots(), or commit current transaction in order to reserve some free space, obviously a lot of work. But indeed it's not necessary as long as decreasing bytes_may_use timely, we still have free space, decreasing 128M from bytes_may_use. To fix this issue, this patch chooses to update bytes_may_use for both data and metadata in btrfs_add_reserved_bytes(). For compress path, real extent length may not be equal to file content length, so introduce a ram_bytes argument for btrfs_reserve_extent(), find_free_extent() and btrfs_add_reserved_bytes(), it's becasue bytes_may_use is increased by file content length. Then compress path can update bytes_may_use correctly. Also now we can discard RESERVE_ALLOC_NO_ACCOUNT, RESERVE_ALLOC and RESERVE_FREE. As we know, usually EXTENT_DO_ACCOUNTING is used for error path. In run_delalloc_nocow(), for inode marked as NODATACOW or extent marked as PREALLOC, we also need to update bytes_may_use, but can not pass EXTENT_DO_ACCOUNTING, because it also clears metadata reservation, so here we introduce EXTENT_CLEAR_DATA_RESV flag to indicate btrfs_clear_bit_hook() to update btrfs_space_info's bytes_may_use. Meanwhile __btrfs_prealloc_file_range() will call btrfs_free_reserved_data_space() internally for both sucessful and failed path, btrfs_prealloc_file_range()'s callers does not need to call btrfs_free_reserved_data_space() any more. Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2016-08-25btrfs: divide btrfs_update_reserved_bytes() into two functionsWang Xiaoguang
This patch divides btrfs_update_reserved_bytes() into btrfs_add_reserved_bytes() and btrfs_free_reserved_bytes(), and next patch will extend btrfs_add_reserved_bytes()to fix some false ENOSPC error, please see later patch for detailed info. Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2016-08-25btrfs: use correct offset for reloc_inode in prealloc_file_extent_cluster()Wang Xiaoguang
In prealloc_file_extent_cluster(), btrfs_check_data_free_space() uses wrong file offset for reloc_inode, it uses cluster->start and cluster->end, which indeed are extent's bytenr. The correct value should be cluster->[start|end] minus block group's start bytenr. start bytenr cluster->start | | extent | extent | ...| extent | |----------------------------------------------------------------| | block group reloc_inode | Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2016-08-25btrfs: qgroup: Fix qgroup incorrectness caused by log replayQu Wenruo
When doing log replay at mount time(after power loss), qgroup will leak numbers of replayed data extents. The cause is almost the same of balance. So fix it by manually informing qgroup for owner changed extents. The bug can be detected by btrfs/119 test case. Cc: Mark Fasheh <mfasheh@suse.de> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-and-Tested-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2016-08-25btrfs: relocation: Fix leaking qgroups numbers on data extentsQu Wenruo
This patch fixes a REGRESSION introduced in 4.2, caused by the big quota rework. When balancing data extents, qgroup will leak all its numbers for relocated data extents. The relocation is done in the following steps for data extents: 1) Create data reloc tree and inode 2) Copy all data extents to data reloc tree And commit transaction 3) Create tree reloc tree(special snapshot) for any related subvolumes 4) Replace file extent in tree reloc tree with new extents in data reloc tree And commit transaction 5) Merge tree reloc tree with original fs, by swapping tree blocks For 1)~4), since tree reloc tree and data reloc tree doesn't count to qgroup, everything is OK. But for 5), the swapping of tree blocks will only info qgroup to track metadata extents. If metadata extents contain file extents, qgroup number for file extents will get lost, leading to corrupted qgroup accounting. The fix is, before commit transaction of step 5), manually info qgroup to track all file extents in data reloc tree. Since at commit transaction time, the tree swapping is done, and qgroup will account these data extents correctly. Cc: Mark Fasheh <mfasheh@suse.de> Reported-by: Mark Fasheh <mfasheh@suse.de> Reported-by: Filipe Manana <fdmanana@gmail.com> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Tested-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2016-08-25btrfs: qgroup: Refactor btrfs_qgroup_insert_dirty_extent()Qu Wenruo
Refactor btrfs_qgroup_insert_dirty_extent() function, to two functions: 1. btrfs_qgroup_insert_dirty_extent_nolock() Almost the same with original code. For delayed_ref usage, which has delayed refs locked. Change the return value type to int, since caller never needs the pointer, but only needs to know if they need to free the allocated memory. 2. btrfs_qgroup_insert_dirty_extent() The more encapsulated version. Will do the delayed_refs lock, memory allocation, quota enabled check and other things. The original design is to keep exported functions to minimal, but since more btrfs hacks exposed, like replacing path in balance, we need to record dirty extents manually, so we have to add such functions. Also, add comment for both functions, to info developers how to keep qgroup correct when doing hacks. Cc: Mark Fasheh <mfasheh@suse.de> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-and-Tested-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2016-08-25btrfs: waiting on qgroup rescan should not always be interruptibleJeff Mahoney
We wait on qgroup rescan completion in three places: file system shutdown, the quota disable ioctl, and the rescan wait ioctl. If the user sends a signal while we're waiting, we continue happily along. This is expected behavior for the rescan wait ioctl. It's racy in the shutdown path but mostly works due to other unrelated synchronization points. In the quota disable path, it Oopses the kernel pretty much immediately. Cc: <stable@vger.kernel.org> # v4.4+ Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2016-08-25btrfs: properly track when rescan worker is runningJeff Mahoney
The qgroup_flags field is overloaded such that it reflects the on-disk status of qgroups and the runtime state. The BTRFS_QGROUP_STATUS_FLAG_RESCAN flag is used to indicate that a rescan operation is in progress, but if the file system is unmounted while a rescan is running, the rescan operation is paused. If the file system is then mounted read-only, the flag will still be present but the rescan operation will not have been resumed. When we go to umount, btrfs_qgroup_wait_for_completion will see the flag and interpret it to mean that the rescan worker is still running and will wait for a completion that will never come. This patch uses a separate flag to indicate when the worker is running. The locking and state surrounding the qgroup rescan worker needs a lot of attention beyond this patch but this is enough to avoid a hung umount. Cc: <stable@vger.kernel.org> # v4.4+ Signed-off-by; Jeff Mahoney <jeffm@suse.com> Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2016-08-25btrfs: flush_space: treat return value of do_chunk_alloc properlyAlex Lyakas
do_chunk_alloc returns 1 when it succeeds to allocate a new chunk. But flush_space will not convert this to 0, and will also return 1. As a result, reserve_metadata_bytes will think that flush_space failed, and may potentially return this value "1" to the caller (depends how reserve_metadata_bytes was called). The caller will also treat this as an error. For example, btrfs_block_rsv_refill does: int ret = -ENOSPC; ... ret = reserve_metadata_bytes(root, block_rsv, num_bytes, flush); if (!ret) { block_rsv_add_bytes(block_rsv, num_bytes, 0); return 0; } return ret; So it will return -ENOSPC. Signed-off-by: Alex Lyakas <alex@zadarastorage.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>