summaryrefslogtreecommitdiff
path: root/fs/btrfs
AgeCommit message (Collapse)Author
2020-01-20btrfs: drop create parameter to btrfs_get_extent()Omar Sandoval
We only pass this as 1 from __extent_writepage_io(). The parameter basically means "pretend I didn't pass in a page". This is silly since we can simply not pass in the page. Get rid of the parameter from btrfs_get_extent(), and since it's used as a get_extent_t callback, remove it from get_extent_t and btree_get_extent(), neither of which need it. While we're here, let's document btrfs_get_extent(). Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-01-20btrfs: remove redundant i_size check in __extent_writepage_io()Omar Sandoval
In __extent_writepage_io(), we check whether i_size <= page_offset(page). Note that if i_size < page_offset(page), then i_size >> PAGE_SHIFT < page->index. If i_size == page_offset(page), then i_size >> PAGE_SHIFT == page->index && offset_in_page(i_size) == 0. __extent_writepage() already has a check for these cases that returns without calling __extent_writepage_io(): end_index = i_size >> PAGE_SHIFT pg_offset = offset_in_page(i_size); if (page->index > end_index || (page->index == end_index && !pg_offset)) { page->mapping->a_ops->invalidatepage(page, 0, PAGE_SIZE); unlock_page(page); return 0; } Get rid of the one in __extent_writepage_io(), which was obsoleted in 211c17f51f46 ("Fix corners in writepage and btrfs_truncate_page"). Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-01-20btrfs: remove trivial goto label in __extent_writepage()Omar Sandoval
Since 40f765805f08 ("Btrfs: split up __extent_writepage to lower stack usage"), done_unlocked is simply a return 0. Get rid of it. Mid-statement block returns don seem to make the code less readable here. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-01-20btrfs: remove unnecessary pg_offset assignments in __extent_writepage()Omar Sandoval
We're initializing pg_offset to 0, setting it immediately, then reassigning it to 0 again after. The former became unnecessary in 211c17f51f46 ("Fix corners in writepage and btrfs_truncate_page"). The latter is a leftover that should've been removed in 40f765805f08 ("Btrfs: split up __extent_writepage to lower stack usage"). Remove both. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-01-20btrfs: make btrfs_ordered_extent naming consistent with btrfs_file_extent_itemOmar Sandoval
ordered->start, ordered->len, and ordered->disk_len correspond to fi->disk_bytenr, fi->num_bytes, and fi->disk_num_bytes, respectively. It's confusing to translate between the two naming schemes. Since a btrfs_ordered_extent is basically a pending btrfs_file_extent_item, let's make the former use the naming from the latter. Note that I didn't touch the names in tracepoints just in case there are scripts depending on the current naming. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-01-20btrfs: remove dead snapshot-aware defrag codeOmar Sandoval
Snapshot-aware defrag has been disabled since commit 8101c8dbf624 ("Btrfs: disable snapshot aware defrag for now") almost 6 years ago. Let's remove the dead code. If someone is up to the task of bringing it back, they can dig it up from git. This is logically a revert of commit 38c227d87c49 ("Btrfs: snapshot-aware defrag") except that now we have to clear the EXTENT_DEFRAG bit to avoid need_force_cow() returning true forever. The reasons to disable were caused by runtime problems (like long stalls or memory consumption) on heavily referenced extents (eg. thousands of snapshots). There were attempts to fix that but never finished. Current defrag breaks the extent references and some users prefer that behaviour over the one implemented by snapshot aware (ie. keeping links for defragmentation). To enable both usecases we'd need to extend defrag ioctl but let's do that properly from scratch. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> [ enhance ] Signed-off-by: David Sterba <dsterba@suse.com>
2020-01-20btrfs: get rid of at_offset parameter to btrfs_lookup_bio_sums()Omar Sandoval
We can encode this in the offset parameter: -1 means use the page offsets, anything else is a valid offset. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-01-20btrfs: get rid of trivial __btrfs_lookup_bio_sums() wrappersOmar Sandoval
Currently, we have two wrappers for __btrfs_lookup_bio_sums(): btrfs_lookup_bio_sums_dio(), which is used for direct I/O, and btrfs_lookup_bio_sums(), which is used everywhere else. The only difference is that the _dio variant looks up csums starting at the given offset instead of using the page index, which isn't actually direct I/O-specific. Let's clean up the signature and return value of __btrfs_lookup_bio_sums(), rename it to btrfs_lookup_bio_sums(), and get rid of the trivial helpers. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-01-20btrfs: reset device back to allocation state when removingJohannes Thumshirn
When closing a device, btrfs_close_one_device() first allocates a new device, copies the device to close's name, replaces it in the dev_list with the copy and then finally frees it. This involves two memory allocation, which can potentially fail. As this code path is tricky to unwind, the allocation failures where handled by BUG_ON()s. But this copying isn't strictly needed, all that is needed is resetting the device in question to it's state it had after the allocation. Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-01-20btrfs: decrement number of open devices after closing the device not beforeJohannes Thumshirn
In btrfs_close_one_device we're decrementing the number of open devices before we're calling btrfs_close_bdev(). As there is no intermediate exit between these points in this function it is technically OK to do so, but it makes the code a bit harder to understand. Move both operations closer together and move the decrement step after btrfs_close_bdev(). Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-01-20btrfs: use simple_dir_inode_operations for placeholder subvolume directoryOmar Sandoval
When you snapshot a subvolume containing a subvolume, you get a placeholder directory where the subvolume would be. These directories have their own btrfs_dir_ro_inode_operations. Al pointed out [1] that these directories can use simple_lookup() instead of btrfs_lookup(), as they are always empty. Furthermore, they can use the default generic_permission() instead of btrfs_permission(); the additional checks in the latter don't matter because we can't write to the directory anyways. Finally, they can use the default generic_update_time() instead of btrfs_update_time(), as the inode doesn't exist on disk and doesn't need any special handling. All together, this means that we can get rid of btrfs_dir_ro_inode_operations and use simple_dir_inode_operations instead. 1: https://lore.kernel.org/linux-btrfs/20190929052934.GY26530@ZenIV.linux.org.uk/ Cc: Al Viro <viro@zeniv.linux.org.uk> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add comment ] Signed-off-by: David Sterba <dsterba@suse.com>
2020-01-20btrfs: remove impossible WARN_ON in btrfs_destroy_dev_replace_tgtdev()Johannes Thumshirn
We have a user report, that cppcheck is complaining about a possible NULL-pointer dereference in btrfs_destroy_dev_replace_tgtdev(). We're first dereferencing the 'tgtdev' variable and the later check for the validity of the pointer with a WARN_ON(!tgtdev); But all callers of btrfs_destroy_dev_replace_tgtdev() either explicitly check if 'tgtdev' is non-NULL or directly allocate 'tgtdev', so the WARN_ON() is impossible to hit. Just remove it to silence the checker's complains. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=205003 Signed-off-by: Johannes Thumshirn <jth@kernel.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-01-20btrfs: remove superfluous BUG_ON() in integrity checksJohannes Thumshirn
btrfsic_process_superblock() BUG_ON()s if 'state' is NULL. But this can never happen as the only caller from btrfsic_process_superblock() is btrfsic_mount() which allocates 'state' some lines above calling btrfsic_process_superblock() and checks for the allocation to succeed. Let's just remove the impossible to hit BUG_ON(). Signed-off-by: Johannes Thumshirn <jth@kernel.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-01-20btrfs: fix possible NULL-pointer dereference in integrity checksJohannes Thumshirn
A user reports a possible NULL-pointer dereference in btrfsic_process_superblock(). We are assigning state->fs_info to a local fs_info variable and afterwards checking for the presence of state. While we would BUG_ON() a NULL state anyways, we can also just remove the local fs_info copy, as fs_info is only used once as the first argument for btrfs_num_copies(). There we can just pass in state->fs_info as well. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=205003 Signed-off-by: Johannes Thumshirn <jth@kernel.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-01-20btrfs: kill min_allocable_bytes in inc_block_group_roJosef Bacik
This is a relic from a time before we had a proper reservation mechanism and you could end up with really full chunks at chunk allocation time. This doesn't make sense anymore, so just kill it. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-01-20btrfs: don't pass system_chunk into can_overcommitJosef Bacik
We have the space_info, we can just check its flags to see if it's the system chunk space info. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-01-20btrfs: Opencode ordered_data_tree_panicNikolay Borisov
It's a simple wrapper over btrfs_panic and is called only once. Just open code it. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-01-20btrfs: relocation: Output current relocation stage at ↵Qu Wenruo
btrfs_relocate_block_group() There are two relocation stages but both print the same message. Add the description of the stage. This can help debugging or provides informative message to users. BTRFS info (device dm-5): balance: start -d -m -s BTRFS info (device dm-5): relocating block group 30408704 flags metadata|dup BTRFS info (device dm-5): found 2 extents, stage: move data extents BTRFS info (device dm-5): relocating block group 22020096 flags system|dup BTRFS info (device dm-5): found 1 extents, stage: move data extents BTRFS info (device dm-5): relocating block group 13631488 flags data BTRFS info (device dm-5): found 1 extents, stage: move data extents BTRFS info (device dm-5): found 1 extents, stage: update data pointers BTRFS info (device dm-5): balance: ended with status: 0 Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-01-20btrfs: remove unused condition check in btrfs_page_mkwrite()Yunfeng Ye
The condition '!ret2' is always true. commit 717beb96d969 ("Btrfs: fix regression in btrfs_page_mkwrite() from vm_fault_t conversion") left behind the check after moving this code out of the goto, so remove the unused condition check. Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Yunfeng Ye <yeyunfeng@huawei.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-01-20btrfs: Remove redundant WARN_ON in walk_down_log_treeNikolay Borisov
level <0 and level >= BTRFS_MAX_LEVEL are already performed upon extent buffer read by tree checker in btrfs_check_node. go. As far as 'level <= 0' we are guaranteed that level is '> 0' because the value of level _before_ reading 'next' is larger than 1 (otherwise we wouldn't have executed that code at all) this in turn guarantees that 'level' after btrfs_read_buffer is 'level - 1' since we verify this invariant in: btrfs_read_buffer btree_read_extent_buffer_pages btrfs_verify_level_key This guarantees that level can never be '<= 0' so the warn on is never triggered. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-01-20btrfs: Remove WARN_ON in walk_log_treeNikolay Borisov
The log_root passed to walk_log_tree is guaranteed to have its root_key.objectid always be BTRFS_TREE_LOG_OBJECTID. This is by merit that all log roots of an ordinary root are allocated in alloc_log_tree which hard-codes objectid to be BTRFS_TREE_LOG_OBJECTID. In case walk_log_tree is called for a log tree found by btrfs_read_fs_root in btrfs_recover_log_trees, that function already ensures found_key.objectid is BTRFS_TREE_LOG_OBJECTID. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-01-20btrfs: Rename __btrfs_free_reserved_extent to btrfs_pin_reserved_extentNikolay Borisov
__btrfs_free_reserved_extent now performs the actions of btrfs_free_and_pin_reserved_extent. But this name is a bit of a misnomer, since the extent is not really freed but just pinned. Reflect this in the new name. No semantics changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-01-20btrfs: Open code __btrfs_free_reserved_extent in btrfs_free_reserved_extentNikolay Borisov
__btrfs_free_reserved_extent performs 2 entirely different operations depending on whether its 'pin' argument is true or false. This patch lifts the 2nd case (pin is false) into it's sole caller btrfs_free_reserved_extent. No semantics changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-01-20btrfs: Don't discard unwritten extentsNikolay Borisov
All callers of btrfs_free_reserved_extent (respectively __btrfs_free_reserved_extent with in set to 0) pass in extents which have only been reserved but not yet written to. Namely, * in cow_file_range that function is called only if create_io_em fails or btrfs_add_ordered_extent fail, both of which happen _before_ any IO is submitted to the newly reserved range * in submit_compressed_extents the code flow is similar - out_free_reserve can be called only before btrfs_submit_compressed_write which is where any writes to the range could occur * btrfs_new_extent_direct also calls btrfs_free_reserved_extent only if extent_map fails, before any IO is issued * __btrfs_prealloc_file_range also calls btrfs_free_reserved_extent in case insertion of the metadata fails * btrfs_alloc_tree_block again can only be called in case in-memory operations fail, before any IO is submitted * btrfs_finish_ordered_io - this is the only caller where discarding the extent could have a material effect, since it can be called for an extent which was partially written. With this change the submission of discards is optimised since discards are now not being created for extents which are known to not have been touched on disk. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-01-20btrfs: qgroup: return ENOTCONN instead of EINVAL when quotas are not enabledMarcos Paulo de Souza
[PROBLEM] qgroup create/remove code is currently returning EINVAL when the user tries to create a qgroup on a subvolume without quota enabled. EINVAL is already being used for too many error scenarios so that is hard to depict what is the problem. [FIX] Currently scrub and balance code return -ENOTCONN when the user tries to cancel/pause and no scrub or balance is currently running for the desired subvolume. Do the same here by returning -ENOTCONN when a user tries to create/delete/assing/list a qgroup on a subvolume without quota enabled. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-01-20btrfs: qgroup: remove one-time use variables for quota_root checksMarcos Paulo de Souza
Remove some variables that are set only to be checked later, and never used. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-01-20btrfs: sysfs, merge btrfs_sysfs_add devices_kobj and fsidAnand Jain
Merge btrfs_sysfs_add_fsid() and btrfs_sysfs_add_devices_kobj() functions as these two are small and they are called one after the other. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-01-20btrfs: sysfs, rename btrfs_sysfs_add_device()Anand Jain
btrfs_sysfs_add_device() creates the directory /sys/fs/btrfs/UUID/devices but its function name is misleading. Rename it to btrfs_sysfs_add_devices_kobj() instead. No functional changes. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-01-20btrfs: sysfs, btrfs_sysfs_add_fsid() drop unused argument parentAnand Jain
Commit 24bd69cb ("Btrfs: sysfs: add support to add parent for fsid") added parent argument in preparation to show the seed fsid under the sprout fsid as in the patch [1] in the mailing list. [1] Btrfs: sysfs: support seed devices in the sysfs layout But later this idea was superseded by another idea to rename the fsid as in the commit f93c39970b1d ("btrfs: factor out sysfs code for updating sprout fsid"). So we don't need parent argument anymore. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-01-20btrfs: sysfs, rename devices kobject holder to devices_kobjAnand Jain
The struct member btrfs_device::device_dir_kobj holds the kobj of the sysfs directory /sys/fs/btrfs/UUID/devices, so rename it from device_dir_kobj to devices_kobj. No functional changes. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-01-20btrfs: fill ncopies for all raid table entriesDavid Sterba
Make the number of copies explicit even for entries that use the default 0 value for consistency. Signed-off-by: David Sterba <dsterba@suse.com>
2020-01-20btrfs: use raid_attr table in calc_stripe_length for nparityDavid Sterba
The table is already used for ncopies, replace open coding of stripes with the raid_attr value. Signed-off-by: David Sterba <dsterba@suse.com>
2020-01-20Btrfs: fix missing hole after hole punching and fsync when using NO_HOLESFilipe Manana
When using the NO_HOLES feature, if we punch a hole into a file and then fsync it, there are cases where a subsequent fsync will miss the fact that a hole was punched, resulting in the holes not existing after replaying the log tree. Essentially these cases all imply that, tree-log.c:copy_items(), is not invoked for the leafs that delimit holes, because nothing changed those leafs in the current transaction. And it's precisely copy_items() where we currenly detect and log holes, which works as long as the holes are between file extent items in the input leaf or between the beginning of input leaf and the previous leaf or between the last item in the leaf and the next leaf. First example where we miss a hole: *) The extent items of the inode span multiple leafs; *) The punched hole covers a range that affects only the extent items of the first leaf; *) The fsync operation is done in full mode (BTRFS_INODE_NEEDS_FULL_SYNC is set in the inode's runtime flags). That results in the hole not existing after replaying the log tree. For example, if the fs/subvolume tree has the following layout for a particular inode: Leaf N, generation 10: [ ... INODE_ITEM INODE_REF EXTENT_ITEM (0 64K) EXTENT_ITEM (64K 128K) ] Leaf N + 1, generation 10: [ EXTENT_ITEM (128K 64K) ... ] If at transaction 11 we punch a hole coverting the range [0, 128K[, we end up dropping the two extent items from leaf N, but we don't touch the other leaf, so we end up in the following state: Leaf N, generation 11: [ ... INODE_ITEM INODE_REF ] Leaf N + 1, generation 10: [ EXTENT_ITEM (128K 64K) ... ] A full fsync after punching the hole will only process leaf N because it was modified in the current transaction, but not leaf N + 1, since it was not modified in the current transaction (generation 10 and not 11). As a result the fsync will not log any holes, because it didn't process any leaf with extent items. Second example where we will miss a hole: *) An inode as its items spanning 5 (or more) leafs; *) A hole is punched and it covers only the extents items of the 3rd leaf. This resulsts in deleting the entire leaf and not touching any of the other leafs. So the only leaf that is modified in the current transaction, when punching the hole, is the first leaf, which contains the inode item. During the full fsync, the only leaf that is passed to copy_items() is that first leaf, and that's not enough for the hole detection code in copy_items() to determine there's a hole between the last file extent item in the 2nd leaf and the first file extent item in the 3rd leaf (which was the 4th leaf before punching the hole). Fix this by scanning all leafs and punch holes as necessary when doing a full fsync (less common than a non-full fsync) when the NO_HOLES feature is enabled. The lack of explicit file extent items to mark holes makes it necessary to scan existing extents to determine if holes exist. A test case for fstests follows soon. Fixes: 16e7549f045d33 ("Btrfs: incompatible format change to remove hole extents") CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-01-17Merge tag 'for-5.5-rc6-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "A few more fixes that have been in the works during last twp weeks. All have a user visible effect and are stable material: - scrub: properly update progress after calling cancel ioctl, calling 'resume' would start from the beginning otherwise - fix subvolume reference removal, after moving out of the original path the reference is not recognized and will lead to transaction abort - fix reloc root lifetime checks, could lead to crashes when there's subvolume cleaning running in parallel - fix memory leak when quotas get disabled in the middle of extent accounting - fix transaction abort in case of balance being started on degraded mount on eg. RAID1" * tag 'for-5.5-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: check rw_devices, not num_devices for balance Btrfs: always copy scrub arguments back to user space btrfs: relocation: fix reloc_root lifespan and access btrfs: fix memory leak in qgroup accounting btrfs: do not delete mismatched root refs btrfs: fix invalid removal of root ref btrfs: rework arguments of btrfs_unlink_subvol
2020-01-17btrfs: check rw_devices, not num_devices for balanceJosef Bacik
The fstest btrfs/154 reports [ 8675.381709] BTRFS: Transaction aborted (error -28) [ 8675.383302] WARNING: CPU: 1 PID: 31900 at fs/btrfs/block-group.c:2038 btrfs_create_pending_block_groups+0x1e0/0x1f0 [btrfs] [ 8675.390925] CPU: 1 PID: 31900 Comm: btrfs Not tainted 5.5.0-rc6-default+ #935 [ 8675.392780] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014 [ 8675.395452] RIP: 0010:btrfs_create_pending_block_groups+0x1e0/0x1f0 [btrfs] [ 8675.402672] RSP: 0018:ffffb2090888fb00 EFLAGS: 00010286 [ 8675.404413] RAX: 0000000000000000 RBX: ffff92026dfa91c8 RCX: 0000000000000001 [ 8675.406609] RDX: 0000000000000000 RSI: ffffffff8e100899 RDI: ffffffff8e100971 [ 8675.408775] RBP: ffff920247c61660 R08: 0000000000000000 R09: 0000000000000000 [ 8675.410978] R10: 0000000000000000 R11: 0000000000000000 R12: 00000000ffffffe4 [ 8675.412647] R13: ffff92026db74000 R14: ffff920247c616b8 R15: ffff92026dfbc000 [ 8675.413994] FS: 00007fd5e57248c0(0000) GS:ffff92027d800000(0000) knlGS:0000000000000000 [ 8675.416146] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 8675.417833] CR2: 0000564aa51682d8 CR3: 000000006dcbc004 CR4: 0000000000160ee0 [ 8675.419801] Call Trace: [ 8675.420742] btrfs_start_dirty_block_groups+0x355/0x480 [btrfs] [ 8675.422600] btrfs_commit_transaction+0xc8/0xaf0 [btrfs] [ 8675.424335] reset_balance_state+0x14a/0x190 [btrfs] [ 8675.425824] btrfs_balance.cold+0xe7/0x154 [btrfs] [ 8675.427313] ? kmem_cache_alloc_trace+0x235/0x2c0 [ 8675.428663] btrfs_ioctl_balance+0x298/0x350 [btrfs] [ 8675.430285] btrfs_ioctl+0x466/0x2550 [btrfs] [ 8675.431788] ? mem_cgroup_charge_statistics+0x51/0xf0 [ 8675.433487] ? mem_cgroup_commit_charge+0x56/0x400 [ 8675.435122] ? do_raw_spin_unlock+0x4b/0xc0 [ 8675.436618] ? _raw_spin_unlock+0x1f/0x30 [ 8675.438093] ? __handle_mm_fault+0x499/0x740 [ 8675.439619] ? do_vfs_ioctl+0x56e/0x770 [ 8675.441034] do_vfs_ioctl+0x56e/0x770 [ 8675.442411] ksys_ioctl+0x3a/0x70 [ 8675.443718] ? trace_hardirqs_off_thunk+0x1a/0x1c [ 8675.445333] __x64_sys_ioctl+0x16/0x20 [ 8675.446705] do_syscall_64+0x50/0x210 [ 8675.448059] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 8675.479187] BTRFS: error (device vdb) in btrfs_create_pending_block_groups:2038: errno=-28 No space left We now use btrfs_can_overcommit() to see if we can flip a block group read only. Before this would fail because we weren't taking into account the usable un-allocated space for allocating chunks. With my patches we were allowed to do the balance, which is technically correct. The test is trying to start balance on degraded mount. So now we're trying to allocate a chunk and cannot because we want to allocate a RAID1 chunk, but there's only 1 device that's available for usage. This results in an ENOSPC. But we shouldn't even be making it this far, we don't have enough devices to restripe. The problem is we're using btrfs_num_devices(), that also includes missing devices. That's not actually what we want, we need to use rw_devices. The chunk_mutex is not needed here, rw_devices changes only in device add, remove or replace, all are excluded by EXCL_OP mechanism. Fixes: e4d8ec0f65b9 ("Btrfs: implement online profile changing") CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add stacktrace, update changelog, drop chunk_mutex ] Signed-off-by: David Sterba <dsterba@suse.com>
2020-01-17Btrfs: always copy scrub arguments back to user spaceFilipe Manana
If scrub returns an error we are not copying back the scrub arguments structure to user space. This prevents user space to know how much progress scrub has done if an error happened - this includes -ECANCELED which is returned when users ask for scrub to stop. A particular use case, which is used in btrfs-progs, is to resume scrub after it is canceled, in that case it relies on checking the progress from the scrub arguments structure and then use that progress in a call to resume scrub. So fix this by always copying the scrub arguments structure to user space, overwriting the value returned to user space with -EFAULT only if copying the structure failed to let user space know that either that copying did not happen, and therefore the structure is stale, or it happened partially and the structure is probably not valid and corrupt due to the partial copy. Reported-by: Graham Cobb <g.btrfs@cobb.uk.net> Link: https://lore.kernel.org/linux-btrfs/d0a97688-78be-08de-ca7d-bcb4c7fb397e@cobb.uk.net/ Fixes: 06fe39ab15a6a4 ("Btrfs: do not overwrite scrub error with fault error in scrub ioctl") CC: stable@vger.kernel.org # 5.1+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Tested-by: Graham Cobb <g.btrfs@cobb.uk.net> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-01-13btrfs: relocation: fix reloc_root lifespan and accessQu Wenruo
[BUG] There are several different KASAN reports for balance + snapshot workloads. Involved call paths include: should_ignore_root+0x54/0xb0 [btrfs] build_backref_tree+0x11af/0x2280 [btrfs] relocate_tree_blocks+0x391/0xb80 [btrfs] relocate_block_group+0x3e5/0xa00 [btrfs] btrfs_relocate_block_group+0x240/0x4d0 [btrfs] btrfs_relocate_chunk+0x53/0xf0 [btrfs] btrfs_balance+0xc91/0x1840 [btrfs] btrfs_ioctl_balance+0x416/0x4e0 [btrfs] btrfs_ioctl+0x8af/0x3e60 [btrfs] do_vfs_ioctl+0x831/0xb10 create_reloc_root+0x9f/0x460 [btrfs] btrfs_reloc_post_snapshot+0xff/0x6c0 [btrfs] create_pending_snapshot+0xa9b/0x15f0 [btrfs] create_pending_snapshots+0x111/0x140 [btrfs] btrfs_commit_transaction+0x7a6/0x1360 [btrfs] btrfs_mksubvol+0x915/0x960 [btrfs] btrfs_ioctl_snap_create_transid+0x1d5/0x1e0 [btrfs] btrfs_ioctl_snap_create_v2+0x1d3/0x270 [btrfs] btrfs_ioctl+0x241b/0x3e60 [btrfs] do_vfs_ioctl+0x831/0xb10 btrfs_reloc_pre_snapshot+0x85/0xc0 [btrfs] create_pending_snapshot+0x209/0x15f0 [btrfs] create_pending_snapshots+0x111/0x140 [btrfs] btrfs_commit_transaction+0x7a6/0x1360 [btrfs] btrfs_mksubvol+0x915/0x960 [btrfs] btrfs_ioctl_snap_create_transid+0x1d5/0x1e0 [btrfs] btrfs_ioctl_snap_create_v2+0x1d3/0x270 [btrfs] btrfs_ioctl+0x241b/0x3e60 [btrfs] do_vfs_ioctl+0x831/0xb10 [CAUSE] All these call sites are only relying on root->reloc_root, which can undergo btrfs_drop_snapshot(), and since we don't have real refcount based protection to reloc roots, we can reach already dropped reloc root, triggering KASAN. [FIX] To avoid such access to unstable root->reloc_root, we should check BTRFS_ROOT_DEAD_RELOC_TREE bit first. This patch introduces wrappers that provide the correct way to check the bit with memory barriers protection. Most callers don't distinguish merged reloc tree and no reloc tree. The only exception is should_ignore_root(), as merged reloc tree can be ignored, while no reloc tree shouldn't. [CRITICAL SECTION ANALYSIS] Although test_bit()/set_bit()/clear_bit() doesn't imply a barrier, the DEAD_RELOC_TREE bit has extra help from transaction as a higher level barrier, the lifespan of root::reloc_root and DEAD_RELOC_TREE bit are: NULL: reloc_root is NULL PTR: reloc_root is not NULL 0: DEAD_RELOC_ROOT bit not set DEAD: DEAD_RELOC_ROOT bit set (NULL, 0) Initial state __ | /\ Section A btrfs_init_reloc_root() \/ | __ (PTR, 0) reloc_root initialized /\ | | btrfs_update_reloc_root() | Section B | | (PTR, DEAD) reloc_root has been merged \/ | __ === btrfs_commit_transaction() ==================== | /\ clean_dirty_subvols() | | | Section C (NULL, DEAD) reloc_root cleanup starts \/ | __ btrfs_drop_snapshot() /\ | | Section D (NULL, 0) Back to initial state \/ Every have_reloc_root() or test_bit(DEAD_RELOC_ROOT) caller holds transaction handle, so none of such caller can cross transaction boundary. In Section A, every caller just found no DEAD bit, and grab reloc_root. In the cross section A-B, caller may get no DEAD bit, but since reloc_root is still completely valid thus accessing reloc_root is completely safe. No test_bit() caller can cross the boundary of Section B and Section C. In Section C, every caller found the DEAD bit, so no one will access reloc_root. In the cross section C-D, either caller gets the DEAD bit set, avoiding access reloc_root no matter if it's safe or not. Or caller get the DEAD bit cleared, then access reloc_root, which is already NULL, nothing will be wrong. The memory write barriers are between the reloc_root updates and bit set/clear, the pairing read side is before test_bit. Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org> Fixes: d2311e698578 ("btrfs: relocation: Delay reloc tree deletion after merge_reloc_roots") CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ barriers ] Signed-off-by: David Sterba <dsterba@suse.com>
2020-01-08btrfs: fix memory leak in qgroup accountingJohannes Thumshirn
When running xfstests on the current btrfs I get the following splat from kmemleak: unreferenced object 0xffff88821b2404e0 (size 32): comm "kworker/u4:7", pid 26663, jiffies 4295283698 (age 8.776s) hex dump (first 32 bytes): 01 00 00 00 00 00 00 00 10 ff fd 26 82 88 ff ff ...........&.... 10 ff fd 26 82 88 ff ff 20 ff fd 26 82 88 ff ff ...&.... ..&.... backtrace: [<00000000f94fd43f>] ulist_alloc+0x25/0x60 [btrfs] [<00000000fd023d99>] btrfs_find_all_roots_safe+0x41/0x100 [btrfs] [<000000008f17bd32>] btrfs_find_all_roots+0x52/0x70 [btrfs] [<00000000b7660afb>] btrfs_qgroup_rescan_worker+0x343/0x680 [btrfs] [<0000000058e66778>] btrfs_work_helper+0xac/0x1e0 [btrfs] [<00000000f0188930>] process_one_work+0x1cf/0x350 [<00000000af5f2f8e>] worker_thread+0x28/0x3c0 [<00000000b55a1add>] kthread+0x109/0x120 [<00000000f88cbd17>] ret_from_fork+0x35/0x40 This corresponds to: (gdb) l *(btrfs_find_all_roots_safe+0x41) 0x8d7e1 is in btrfs_find_all_roots_safe (fs/btrfs/backref.c:1413). 1408 1409 tmp = ulist_alloc(GFP_NOFS); 1410 if (!tmp) 1411 return -ENOMEM; 1412 *roots = ulist_alloc(GFP_NOFS); 1413 if (!*roots) { 1414 ulist_free(tmp); 1415 return -ENOMEM; 1416 } 1417 Following the lifetime of the allocated 'roots' ulist, it gets freed again in btrfs_qgroup_account_extent(). But this does not happen if the function is called with the 'BTRFS_FS_QUOTA_ENABLED' flag cleared, then btrfs_qgroup_account_extent() does a short leave and directly returns. Instead of directly returning we should jump to the 'out_free' in order to free all resources as expected. CC: stable@vger.kernel.org # 4.14+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> [ add comment ] Signed-off-by: David Sterba <dsterba@suse.com>
2020-01-08btrfs: do not delete mismatched root refsJosef Bacik
btrfs_del_root_ref() will simply WARN_ON() if the ref doesn't match in any way, and then continue to delete the reference. This shouldn't happen, we have these values because there's more to the reference than the original root and the sub root. If any of these checks fail, return -ENOENT. CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-01-08btrfs: fix invalid removal of root refJosef Bacik
If we have the following sequence of events btrfs sub create A btrfs sub create A/B btrfs sub snap A C mkdir C/foo mv A/B C/foo rm -rf * We will end up with a transaction abort. The reason for this is because we create a root ref for B pointing to A. When we create a snapshot of C we still have B in our tree, but because the root ref points to A and not C we will make it appear to be empty. The problem happens when we move B into C. This removes the root ref for B pointing to A and adds a ref of B pointing to C. When we rmdir C we'll see that we have a ref to our root and remove the root ref, despite not actually matching our reference name. Now btrfs_del_root_ref() allowing this to work is a bug as well, however we know that this inode does not actually point to a root ref in the first place, so we shouldn't be calling btrfs_del_root_ref() in the first place and instead simply look up our dir index for this item and do the rest of the removal. CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-01-08btrfs: rework arguments of btrfs_unlink_subvolJosef Bacik
btrfs_unlink_subvol takes the name of the dentry and the root objectid based on what kind of inode this is, either a real subvolume link or a empty one that we inherited as a snapshot. We need to fix how we unlink in the case for BTRFS_EMPTY_SUBVOL_DIR_OBJECTID in the future, so rework btrfs_unlink_subvol to just take the dentry and handle getting the right objectid given the type of inode this is. There is no functional change here, simply pushing the work into btrfs_unlink_subvol() proper. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-01-03Merge tag 'for-5.5-rc4-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "A few fixes for btrfs: - blkcg accounting problem with compression that could stall writes - setting up blkcg bio for compression crashes due to NULL bdev pointer - fix possible infinite loop in writeback for nocow files (here possible means almost impossible, 13 things that need to happen to trigger it)" * tag 'for-5.5-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: Btrfs: fix infinite loop during nocow writeback due to race btrfs: fix compressed write bio blkcg attribution btrfs: punt all bios created in btrfs_submit_compressed_write()
2019-12-30Btrfs: fix infinite loop during nocow writeback due to raceFilipe Manana
When starting writeback for a range that covers part of a preallocated extent, due to a race with writeback for another range that also covers another part of the same preallocated extent, we can end up in an infinite loop. Consider the following example where for inode 280 we have two dirty ranges: range A, from 294912 to 303103, 8192 bytes range B, from 348160 to 438271, 90112 bytes and we have the following file extent item layout for our inode: leaf 38895616 gen 24544 total ptrs 29 free space 13820 owner 5 (...) item 27 key (280 108 200704) itemoff 14598 itemsize 53 extent data disk bytenr 0 nr 0 type 1 (regular) extent data offset 0 nr 94208 ram 94208 item 28 key (280 108 294912) itemoff 14545 itemsize 53 extent data disk bytenr 10433052672 nr 81920 type 2 (prealloc) extent data offset 0 nr 81920 ram 81920 Then the following happens: 1) Writeback starts for range B (from 348160 to 438271), execution of run_delalloc_nocow() starts; 2) The first iteration of run_delalloc_nocow()'s whil loop leaves us at the extent item at slot 28, pointing to the prealloc extent item covering the range from 294912 to 376831. This extent covers part of our range; 3) An ordered extent is created against that extent, covering the file range from 348160 to 376831 (28672 bytes); 4) We adjust 'cur_offset' to 376832 and move on to the next iteration of the while loop; 5) The call to btrfs_lookup_file_extent() leaves us at the same leaf, pointing to slot 29, 1 slot after the last item (the extent item we processed in the previous iteration); 6) Because we are a slot beyond the last item, we call btrfs_next_leaf(), which releases the search path before doing a another search for the last key of the leaf (280 108 294912); 7) Right after btrfs_next_leaf() released the path, and before it did another search for the last key of the leaf, writeback for the range A (from 294912 to 303103) completes (it was previously started at some point); 8) Upon completion of the ordered extent for range A, the prealloc extent we previously found got split into two extent items, one covering the range from 294912 to 303103 (8192 bytes), with a type of regular extent (and no longer prealloc) and another covering the range from 303104 to 376831 (73728 bytes), with a type of prealloc and an offset of 8192 bytes. So our leaf now has the following layout: leaf 38895616 gen 24544 total ptrs 31 free space 13664 owner 5 (...) item 27 key (280 108 200704) itemoff 14598 itemsize 53 extent data disk bytenr 0 nr 0 type 1 extent data offset 0 nr 8192 ram 94208 item 28 key (280 108 208896) itemoff 14545 itemsize 53 extent data disk bytenr 10433142784 nr 86016 type 1 extent data offset 0 nr 86016 ram 86016 item 29 key (280 108 294912) itemoff 14492 itemsize 53 extent data disk bytenr 10433052672 nr 81920 type 1 extent data offset 0 nr 8192 ram 81920 item 30 key (280 108 303104) itemoff 14439 itemsize 53 extent data disk bytenr 10433052672 nr 81920 type 2 extent data offset 8192 nr 73728 ram 81920 9) After btrfs_next_leaf() returns, we have our path pointing to that same leaf and at slot 30, since it has a key we didn't have before and it's the first key greater then the key that was previously the last key of the leaf (key (280 108 294912)); 10) The extent item at slot 30 covers the range from 303104 to 376831 which is in our target range, so we process it, despite having already created an ordered extent against this extent for the file range from 348160 to 376831. This is because we skip to the next extent item only if its end is less than or equals to the start of our delalloc range, and not less than or equals to the current offset ('cur_offset'); 11) As a result we compute 'num_bytes' as: num_bytes = min(end + 1, extent_end) - cur_offset; = min(438271 + 1, 376832) - 376832 = 0 12) We then call create_io_em() for a 0 bytes range starting at offset 376832; 13) Then create_io_em() enters an infinite loop because its calls to btrfs_drop_extent_cache() do nothing due to the 0 length range passed to it. So no existing extent maps that cover the offset 376832 get removed, and therefore calls to add_extent_mapping() return -EEXIST, resulting in an infinite loop. This loop from create_io_em() is the following: do { btrfs_drop_extent_cache(BTRFS_I(inode), em->start, em->start + em->len - 1, 0); write_lock(&em_tree->lock); ret = add_extent_mapping(em_tree, em, 1); write_unlock(&em_tree->lock); /* * The caller has taken lock_extent(), who could race with us * to add em? */ } while (ret == -EEXIST); Also, each call to btrfs_drop_extent_cache() triggers a warning because the start offset passed to it (376832) is smaller then the end offset (376832 - 1) passed to it by -1, due to the 0 length: [258532.052621] ------------[ cut here ]------------ [258532.052643] WARNING: CPU: 0 PID: 9987 at fs/btrfs/file.c:602 btrfs_drop_extent_cache+0x3f4/0x590 [btrfs] (...) [258532.052672] CPU: 0 PID: 9987 Comm: fsx Tainted: G W 5.4.0-rc7-btrfs-next-64 #1 [258532.052673] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014 [258532.052691] RIP: 0010:btrfs_drop_extent_cache+0x3f4/0x590 [btrfs] (...) [258532.052695] RSP: 0018:ffffb4be0153f860 EFLAGS: 00010287 [258532.052700] RAX: ffff975b445ee360 RBX: ffff975b44eb3e08 RCX: 0000000000000000 [258532.052700] RDX: 0000000000038fff RSI: 0000000000039000 RDI: ffff975b445ee308 [258532.052700] RBP: 0000000000038fff R08: 0000000000000000 R09: 0000000000000001 [258532.052701] R10: ffff975b513c5c10 R11: 00000000e3c0cfa9 R12: 0000000000039000 [258532.052703] R13: ffff975b445ee360 R14: 00000000ffffffef R15: ffff975b445ee308 [258532.052705] FS: 00007f86a821de80(0000) GS:ffff975b76a00000(0000) knlGS:0000000000000000 [258532.052707] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [258532.052708] CR2: 00007fdacf0f3ab4 CR3: 00000001f9d26002 CR4: 00000000003606f0 [258532.052712] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [258532.052717] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [258532.052717] Call Trace: [258532.052718] ? preempt_schedule_common+0x32/0x70 [258532.052722] ? ___preempt_schedule+0x16/0x20 [258532.052741] create_io_em+0xff/0x180 [btrfs] [258532.052767] run_delalloc_nocow+0x942/0xb10 [btrfs] [258532.052791] btrfs_run_delalloc_range+0x30b/0x520 [btrfs] [258532.052812] ? find_lock_delalloc_range+0x221/0x250 [btrfs] [258532.052834] writepage_delalloc+0xe4/0x140 [btrfs] [258532.052855] __extent_writepage+0x110/0x4e0 [btrfs] [258532.052876] extent_write_cache_pages+0x21c/0x480 [btrfs] [258532.052906] extent_writepages+0x52/0xb0 [btrfs] [258532.052911] do_writepages+0x23/0x80 [258532.052915] __filemap_fdatawrite_range+0xd2/0x110 [258532.052938] btrfs_fdatawrite_range+0x1b/0x50 [btrfs] [258532.052954] start_ordered_ops+0x57/0xa0 [btrfs] [258532.052973] ? btrfs_sync_file+0x225/0x490 [btrfs] [258532.052988] btrfs_sync_file+0x225/0x490 [btrfs] [258532.052997] __x64_sys_msync+0x199/0x200 [258532.053004] do_syscall_64+0x5c/0x250 [258532.053007] entry_SYSCALL_64_after_hwframe+0x49/0xbe [258532.053010] RIP: 0033:0x7f86a7dfd760 (...) [258532.053014] RSP: 002b:00007ffd99af0368 EFLAGS: 00000246 ORIG_RAX: 000000000000001a [258532.053016] RAX: ffffffffffffffda RBX: 0000000000000ec9 RCX: 00007f86a7dfd760 [258532.053017] RDX: 0000000000000004 RSI: 000000000000836c RDI: 00007f86a8221000 [258532.053019] RBP: 0000000000021ec9 R08: 0000000000000003 R09: 00007f86a812037c [258532.053020] R10: 0000000000000001 R11: 0000000000000246 R12: 00000000000074a3 [258532.053021] R13: 00007f86a8221000 R14: 000000000000836c R15: 0000000000000001 [258532.053032] irq event stamp: 1653450494 [258532.053035] hardirqs last enabled at (1653450493): [<ffffffff9dec69f9>] _raw_spin_unlock_irq+0x29/0x50 [258532.053037] hardirqs last disabled at (1653450494): [<ffffffff9d4048ea>] trace_hardirqs_off_thunk+0x1a/0x20 [258532.053039] softirqs last enabled at (1653449852): [<ffffffff9e200466>] __do_softirq+0x466/0x6bd [258532.053042] softirqs last disabled at (1653449845): [<ffffffff9d4c8a0c>] irq_exit+0xec/0x120 [258532.053043] ---[ end trace 8476fce13d9ce20a ]--- Which results in flooding dmesg/syslog since btrfs_drop_extent_cache() uses WARN_ON() and not WARN_ON_ONCE(). So fix this issue by changing run_delalloc_nocow()'s loop to move to the next extent item when the current extent item ends at at offset less than or equals to the current offset instead of the start offset. Fixes: 80ff385665b7fc ("Btrfs: update nodatacow code v2") CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-12-30btrfs: fix compressed write bio blkcg attributionDennis Zhou
Bio attribution is handled at bio_set_dev() as once we have a device, we have a corresponding request_queue and then can derive the current css. In special cases, we want to attribute to bio to someone else. This can be done by calling bio_associate_blkg_from_css() or kthread_associate_blkcg() depending on the scenario. Btrfs does this for compressed writeback as they are handled by kworkers, so the latter can be done here. Commit 1a41802701ec ("btrfs: drop bio_set_dev where not needed") removes early bio_set_dev() calls prior to submit_stripe_bio(). This breaks the above assumption that we'll have a request_queue when we are doing association. To fix this, switch to using kthread_associate_blkcg(). Without this, we crash in btrfs/024: [ 3052.093088] BUG: kernel NULL pointer dereference, address: 0000000000000510 [ 3052.107013] #PF: supervisor read access in kernel mode [ 3052.107014] #PF: error_code(0x0000) - not-present page [ 3052.107015] PGD 0 P4D 0 [ 3052.107021] Oops: 0000 [#1] SMP [ 3052.138904] CPU: 42 PID: 201270 Comm: kworker/u161:0 Kdump: loaded Not tainted 5.5.0-rc1-00062-g4852d8ac90a9 #712 [ 3052.138905] Hardware name: Quanta Tioga Pass Single Side 01-0032211004/Tioga Pass Single Side, BIOS F08_3A18 12/20/2018 [ 3052.138912] Workqueue: btrfs-delalloc btrfs_work_helper [ 3052.191375] RIP: 0010:bio_associate_blkg_from_css+0x1e/0x3c0 [ 3052.191379] RSP: 0018:ffffc900210cfc90 EFLAGS: 00010282 [ 3052.191380] RAX: 0000000000000000 RBX: ffff88bfe5573c00 RCX: 0000000000000000 [ 3052.191382] RDX: ffff889db48ec2f0 RSI: ffff88bfe5573c00 RDI: ffff889db48ec2f0 [ 3052.191386] RBP: 0000000000000800 R08: 0000000000203bb0 R09: ffff889db16b2400 [ 3052.293364] R10: 0000000000000000 R11: ffff88a07fffde80 R12: ffff889db48ec2f0 [ 3052.293365] R13: 0000000000001000 R14: ffff889de82bc000 R15: ffff889e2b7bdcc8 [ 3052.293367] FS: 0000000000000000(0000) GS:ffff889ffba00000(0000) knlGS:0000000000000000 [ 3052.293368] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 3052.293369] CR2: 0000000000000510 CR3: 0000000002611001 CR4: 00000000007606e0 [ 3052.293370] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 3052.293371] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 3052.293372] PKRU: 55555554 [ 3052.293376] Call Trace: [ 3052.402552] btrfs_submit_compressed_write+0x137/0x390 [ 3052.402558] submit_compressed_extents+0x40f/0x4c0 [ 3052.422401] btrfs_work_helper+0x246/0x5a0 [ 3052.422408] process_one_work+0x200/0x570 [ 3052.438601] ? process_one_work+0x180/0x570 [ 3052.438605] worker_thread+0x4c/0x3e0 [ 3052.438614] kthread+0x103/0x140 [ 3052.460735] ? process_one_work+0x570/0x570 [ 3052.460737] ? kthread_mod_delayed_work+0xc0/0xc0 [ 3052.460744] ret_from_fork+0x24/0x30 Fixes: 1a41802701ec ("btrfs: drop bio_set_dev where not needed") Reported-by: Chris Murphy <chris@colorremedies.com> Signed-off-by: Dennis Zhou <dennis@kernel.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-12-30btrfs: punt all bios created in btrfs_submit_compressed_write()Dennis Zhou
Compressed writes happen in the background via kworkers. However, this causes bios to be attributed to root bypassing any cgroup limits from the actual writer. We tag the first bio with REQ_CGROUP_PUNT, which will punt the bio to an appropriate cgroup specific workqueue and attribute the IO properly. However, if btrfs_submit_compressed_write() creates a new bio, we don't tag it the same way. Add the appropriate tagging for subsequent bios. Fixes: ec39f7696ccfa ("Btrfs: use REQ_CGROUP_PUNT for worker thread submitted bios") Reviewed-by: Chris Mason <clm@fb.com> Signed-off-by: Dennis Zhou <dennis@kernel.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-12-17Merge tag 'for-5.5-rc2-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "A mix of regression fixes and regular fixes for stable trees: - fix swapped error messages for qgroup enable/rescan - fixes for NO_HOLES feature with clone range - fix deadlock between iget/srcu lock/synchronize srcu while freeing an inode - fix double lock on subvolume cross-rename - tree log fixes * fix missing data checksums after replaying a log tree * also teach tree-checker about this problem * skip log replay on orphaned roots - fix maximum devices constraints for RAID1C -3 and -4 - send: don't print warning on read-only mount regarding orphan cleanup - error handling fixes" * tag 'for-5.5-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: send: remove WARN_ON for readonly mount btrfs: do not leak reloc root if we fail to read the fs root btrfs: skip log replay on orphaned roots btrfs: handle ENOENT in btrfs_uuid_tree_iterate btrfs: abort transaction after failed inode updates in create_subvol Btrfs: fix hole extent items with a zero size after range cloning Btrfs: fix removal logic of the tree mod log that leads to use-after-free issues Btrfs: make tree checker detect checksum items with overlapping ranges Btrfs: fix missing data checksums after replaying a log tree btrfs: return error pointer from alloc_test_extent_buffer btrfs: fix devs_max constraints for raid1c3 and raid1c4 btrfs: tree-checker: Fix error format string for size_t btrfs: don't double lock the subvol_sem for rename exchange btrfs: handle error in btrfs_cache_block_group btrfs: do not call synchronize_srcu() in inode_tree_del Btrfs: fix cloning range with a hole when using the NO_HOLES feature btrfs: Fix error messages in qgroup_rescan_init
2019-12-13btrfs: send: remove WARN_ON for readonly mountAnand Jain
We log warning if root::orphan_cleanup_state is not set to ORPHAN_CLEANUP_DONE in btrfs_ioctl_send(). However if the filesystem is mounted as readonly we skip the orphan item cleanup during the lookup and root::orphan_cleanup_state remains at the init state 0 instead of ORPHAN_CLEANUP_DONE (2). So during send in btrfs_ioctl_send() we hit the warning as below. WARN_ON(send_root->orphan_cleanup_state != ORPHAN_CLEANUP_DONE); WARNING: CPU: 0 PID: 2616 at /Volumes/ws/btrfs-devel/fs/btrfs/send.c:7090 btrfs_ioctl_send+0xb2f/0x18c0 [btrfs] :: RIP: 0010:btrfs_ioctl_send+0xb2f/0x18c0 [btrfs] :: Call Trace: :: _btrfs_ioctl_send+0x7b/0x110 [btrfs] btrfs_ioctl+0x150a/0x2b00 [btrfs] :: do_vfs_ioctl+0xa9/0x620 ? __fget+0xac/0xe0 ksys_ioctl+0x60/0x90 __x64_sys_ioctl+0x16/0x20 do_syscall_64+0x49/0x130 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Reproducer: mkfs.btrfs -fq /dev/sdb mount /dev/sdb /btrfs btrfs subvolume create /btrfs/sv1 btrfs subvolume snapshot -r /btrfs/sv1 /btrfs/ss1 umount /btrfs mount -o ro /dev/sdb /btrfs btrfs send /btrfs/ss1 -f /tmp/f The warning exists because having orphan inodes could confuse send and cause it to fail or produce incorrect streams. The two cases that would cause such send failures, which are already fixed are: 1) Inodes that were unlinked - these are orphanized and remain with a link count of 0. These caused send operations to fail because it expected to always find at least one path for an inode. However this is no longer a problem since send is now able to deal with such inodes since commit 46b2f4590aab ("Btrfs: fix send failure when root has deleted files still open") and treats them as having been completely removed (the state after an orphan cleanup is performed). 2) Inodes that were in the process of being truncated. These resulted in send not knowing about the truncation and potentially issue write operations full of zeroes for the range from the new file size to the old file size. This is no longer a problem because we no longer create orphan items for truncation since commit f7e9e8fc792f ("Btrfs: stop creating orphan items for truncate"). As such before these commits, the WARN_ON here provided a clue in case something went wrong. Instead of being a warning against the root::orphan_cleanup_state value, it could have been more accurate by checking if there were actually any orphan items, and then issue a warning only if any exists, but that would be more expensive to check. Since orphanized inodes no longer cause problems for send, just remove the warning. Reported-by: Christoph Anton Mitterer <calestyo@scientia.net> Link: https://lore.kernel.org/linux-btrfs/21cb5e8d059f6e1496a903fa7bfc0a297e2f5370.camel@scientia.net/ CC: stable@vger.kernel.org # 4.19+ Suggested-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-12-13btrfs: do not leak reloc root if we fail to read the fs rootJosef Bacik
If we fail to read the fs root corresponding with a reloc root we'll just break out and free the reloc roots. But we remove our current reloc_root from this list higher up, which means we'll leak this reloc_root. Fix this by adding ourselves back to the reloc_roots list so we are properly cleaned up. CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-12-13btrfs: skip log replay on orphaned rootsJosef Bacik
My fsstress modifications coupled with generic/475 uncovered a failure to mount and replay the log if we hit a orphaned root. We do not want to replay the log for an orphan root, but it's completely legitimate to have an orphaned root with a log attached. Fix this by simply skipping replaying the log. We still need to pin it's root node so that we do not overwrite it while replaying other logs, as we re-read the log root at every stage of the replay. CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-12-13btrfs: handle ENOENT in btrfs_uuid_tree_iterateJosef Bacik
If we get an -ENOENT back from btrfs_uuid_iter_rem when iterating the uuid tree we'll just continue and do btrfs_next_item(). However we've done a btrfs_release_path() at this point and no longer have a valid path. So increment the key and go back and do a normal search. CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>