summaryrefslogtreecommitdiff
path: root/fs/btrfs/scrub.c
AgeCommit message (Collapse)Author
2018-01-22btrfs: rename btrfs_device::scrub_device to scrub_ctxAnand Jain
btrfs_device::scrub_device is not a device which is being scrubbed, but it holds the scrub context, so rename to reflect the same. No functional changes here. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22Btrfs: fix scrub to repair raid6 corruptionLiu Bo
The raid6 corruption is that, suppose that all disks can be read without problems and if the content that was read out doesn't match its checksum, currently for raid6 btrfs at most retries twice, - the 1st retry is to rebuild with all other stripes, it'll eventually be a raid5 xor rebuild, - if the 1st fails, the 2nd retry will deliberately fail parity p so that it will do raid6 style rebuild, however, the chances are that another non-parity stripe content also has something corrupted, so that the above retries are not able to return correct content. We've fixed normal reads to rebuild raid6 correctly with more retries in Patch "Btrfs: make raid6 rebuild retry more"[1], this is to fix scrub to do the exactly same rebuild process. [1]: https://patchwork.kernel.org/patch/10091755/ Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22btrfs: sink unlock_extent parameter gfp_flagsDavid Sterba
All callers pass either GFP_NOFS or GFP_KERNEL now, so we can sink the parameter to the function, though we lose some of the slightly better semantics of GFP_KERNEL in some places, it's worth cleaning up the callchains. Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22Btrfs: use struct completion in scrub_submit_raid56_bio_waitLiu Bo
This changes to use struct completion directly and removes 'struct scrub_bio_ret' along with the code using it. This struct is used to get the return value from bio, but the caller can access bio to get the return value directly and is holding a reference on it so it won't go away underneath us and can be removed safely. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22btrfs: cleanup device states define BTRFS_DEV_STATE_REPLACE_TGTAnand Jain
Currently device state is being managed by each individual int variable such as struct btrfs_device::is_tgtdev_for_dev_replace. Instead of that declare btrfs_device::dev_state BTRFS_DEV_STATE_MISSING and use the bit operations. Signed-off-by: Anand Jain <anand.jain@oracle.com> [ whitespace adjustments ] Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22btrfs: cleanup device states define BTRFS_DEV_STATE_MISSINGAnand Jain
Currently device state is being managed by each individual int variable such as struct btrfs_device::missing. Instead of that declare btrfs_device::dev_state BTRFS_DEV_STATE_MISSING and use the bit operations. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by : Nikolay Borisov <nborisov@suse.com> [ whitespace adjustments ] Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22btrfs: cleanup device states define BTRFS_DEV_STATE_IN_FS_METADATAAnand Jain
Currently device state is being managed by each individual int variable such as struct btrfs_device::in_fs_metadata. Instead of that declare device state BTRFS_DEV_STATE_IN_FS_METADATA and use the bit operations. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> [ whitespace adjustments ] Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22btrfs: cleanup device states define BTRFS_DEV_STATE_WRITEABLEAnand Jain
Currently device state is being managed by each individual int variable such as struct btrfs_device::writeable. Instead of that declare device state BTRFS_DEV_STATE_WRITEABLE and use the bit operations. Signed-off-by: Anand Jain <anand.jain@oracle.com> [ whitespace adjustments ] Signed-off-by: David Sterba <dsterba@suse.com>
2017-11-01btrfs: add a flag to iterate_inodes_from_logical to find all extent refs for ↵Zygo Blaxell
uncompressed extents The LOGICAL_INO ioctl provides a backward mapping from extent bytenr and offset (encoded as a single logical address) to a list of extent refs. LOGICAL_INO complements TREE_SEARCH, which provides the forward mapping (extent ref -> extent bytenr and offset, or logical address). These are useful capabilities for programs that manipulate extents and extent references from userspace (e.g. dedup and defrag utilities). When the extents are uncompressed (and not encrypted and not other), check_extent_in_eb performs filtering of the extent refs to remove any extent refs which do not contain the same extent offset as the 'logical' parameter's extent offset. This prevents LOGICAL_INO from returning references to more than a single block. To find the set of extent references to an uncompressed extent from [a, b), userspace has to run a loop like this pseudocode: for (i = a; i < b; ++i) extent_ref_set += LOGICAL_INO(i); At each iteration of the loop (up to 32768 iterations for a 128M extent), data we are interested in is collected in the kernel, then deleted by the filter in check_extent_in_eb. When the extents are compressed (or encrypted or other), the 'logical' parameter must be an extent bytenr (the 'a' parameter in the loop). No filtering by extent offset is done (or possible?) so the result is the complete set of extent refs for the entire extent. This removes the need for the loop, since we get all the extent refs in one call. Add an 'ignore_offset' argument to iterate_inodes_from_logical, [...several levels of function call graph...], and check_extent_in_eb, so that we can disable the extent offset filtering for uncompressed extents. This flag can be set by an improved version of the LOGICAL_INO ioctl to get either behavior as desired. There is no functional change in this patch. The new flag is always false. Signed-off-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org> Reviewed-by: David Sterba <dsterba@suse.com> [ minor coding style fixes ] Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30btrfs: scrub: get rid of sector_tDavid Sterba
The use of sector_t is not necessry, it's just for a warning. Switch to u64 and rename the variable and use byte units instead of 512b, ie. dropping the >> 9 shifts. The messages are adjusted as well. Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-09-09Merge branch 'for-4.14' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs updates from David Sterba: "The changes range through all types: cleanups, core chagnes, sanity checks, fixes, other user visible changes, detailed list below: - deprecated: user transaction ioctl - mount option ssd does not change allocation alignments - degraded read-write mount is allowed if all the raid profile constraints are met, now based on more accurate check - defrag: do not reset compression afterwards; the NOCOMPRESS flag can be now overriden by defrag - prep work for better extent reference tracking (related to the qgroup slowness with balance) - prep work for compression heuristics - memory allocation reductions (may help latencies on a loaded system) - better accounting for io waiting states - error handling improvements (removed BUGs) - added more sanity checks for shared refs - fix readdir vs pagefault deadlock under some circumstances - fix for 'no-hole' mode, certain combination of compressed and inline extents - send: fix emission of invalid clone operations - fixup file mode if setting acls fail - more fixes from fuzzing - oher cleanups" * 'for-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (104 commits) btrfs: submit superblock io with REQ_META and REQ_PRIO btrfs: remove unnecessary memory barrier in btrfs_direct_IO btrfs: remove superfluous chunk_tree argument from btrfs_alloc_dev_extent btrfs: Remove chunk_objectid parameter of btrfs_alloc_dev_extent btrfs: pass fs_info to btrfs_del_root instead of tree_root Btrfs: add one more sanity check for shared ref type Btrfs: remove BUG_ON in __add_tree_block Btrfs: remove BUG() in add_data_reference Btrfs: remove BUG() in print_extent_item Btrfs: remove BUG() in btrfs_extent_inline_ref_size Btrfs: convert to use btrfs_get_extent_inline_ref_type Btrfs: add a helper to retrive extent inline ref type btrfs: scrub: simplify scrub worker initialization btrfs: scrub: clean up division in scrub_find_csum btrfs: scrub: clean up division in __scrub_mark_bitmap btrfs: scrub: use bool for flush_all_writes btrfs: preserve i_mode if __btrfs_set_acl() fails btrfs: Remove extraneous chunk_objectid variable btrfs: Remove chunk_objectid argument from btrfs_make_block_group btrfs: Remove extra parentheses from condition in copy_items() ...
2017-08-23block: replace bi_bdev with a gendisk pointer and partitions indexChristoph Hellwig
This way we don't need a block_device structure to submit I/O. The block_device has different life time rules from the gendisk and request_queue and is usually only available when the block device node is open. Other callers need to explicitly create one (e.g. the lightnvm passthrough code, or the new nvme multipathing code). For the actual I/O path all that we need is the gendisk, which exists once per block device. But given that the block layer also does partition remapping we additionally need a partition index, which is used for said remapping in generic_make_request. Note that all the block drivers generally want request_queue or sometimes the gendisk, so this removes a layer of indirection all over the stack. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-21btrfs: scrub: simplify scrub worker initializationDavid Sterba
Minor simplification, merge calls to one. Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-21btrfs: scrub: clean up division in scrub_find_csumDavid Sterba
Use proper helpers for 64bit division. Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-21btrfs: scrub: clean up division in __scrub_mark_bitmapDavid Sterba
Use proper helpers for 64bit division and then cast to narrower type. Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-21btrfs: scrub: use bool for flush_all_writesDavid Sterba
flush_all_writes is an atomic but does not use the semantics at all, it's just on/off indicator, we can use bool. Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-18btrfs: use appropriate define for the fsidAnand Jain
Though BTRFS_FSID_SIZE and BTRFS_UUID_SIZE are of the same size, we should use the matching constant for the fsid buffer. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16btrfs: account that we're waiting for IO in scrub_submit_raid56_bio_waitDavid Sterba
Correctly account for IO when waiting for a submitted bio in scrub. This only for the accounting purposes and should not change other behaviour. Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16btrfs: drop newlines from strings when using btrfs_* helpersDavid Sterba
The helpers append "\n" so we can keep the actual strings shorter. The extra newline will print an empty line. Some messages have been slightly modified to be more consistent with the rest (lowercase first letter). Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-07-05Merge branch 'for-4.13-part1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs updates from David Sterba: "The core updates improve error handling (mostly related to bios), with the usual incremental work on the GFP_NOFS (mis)use removal, refactoring or cleanups. Except the two top patches, all have been in for-next for an extensive amount of time. User visible changes: - statx support - quota override tunable - improved compression thresholds - obsoleted mount option alloc_start Core updates: - bio-related updates: - faster bio cloning - no allocation failures - preallocated flush bios - more kvzalloc use, memalloc_nofs protections, GFP_NOFS updates - prep work for btree_inode removal - dir-item validation - qgoup fixes and updates - cleanups: - removed unused struct members, unused code, refactoring - argument refactoring (fs_info/root, caller -> callee sink) - SEARCH_TREE ioctl docs" * 'for-4.13-part1' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (115 commits) btrfs: Remove false alert when fiemap range is smaller than on-disk extent btrfs: Don't clear SGID when inheriting ACLs btrfs: fix integer overflow in calc_reclaim_items_nr btrfs: scrub: fix target device intialization while setting up scrub context btrfs: qgroup: Fix qgroup reserved space underflow by only freeing reserved ranges btrfs: qgroup: Introduce extent changeset for qgroup reserve functions btrfs: qgroup: Fix qgroup reserved space underflow caused by buffered write and quotas being enabled btrfs: qgroup: Return actually freed bytes for qgroup release or free data btrfs: qgroup: Cleanup btrfs_qgroup_prepare_account_extents function btrfs: qgroup: Add quick exit for non-fs extents Btrfs: rework delayed ref total_bytes_pinned accounting Btrfs: return old and new total ref mods when adding delayed refs Btrfs: always account pinned bytes when dropping a tree block ref Btrfs: update total_bytes_pinned when pinning down extents Btrfs: make BUG_ON() in add_pinned_bytes() an ASSERT() Btrfs: make add_pinned_bytes() take an s64 num_bytes instead of u64 btrfs: fix validation of XATTR_ITEM dir items btrfs: Verify dir_item in iterate_object_props btrfs: Check name_len before in btrfs_del_root_ref btrfs: Check name_len before reading btrfs_get_name ...
2017-06-29btrfs: fix integer overflow in calc_reclaim_items_nrChris Mason
Dave Jones hit a WARN_ON(nr < 0) in btrfs_wait_ordered_roots() with v4.12-rc6. This was because commit 70e7af244 made it possible for calc_reclaim_items_nr() to return a negative number. It's not really a bug in that commit, it just didn't go far enough down the stack to find all the possible 64->32 bit overflows. This switches calc_reclaim_items_nr() to return a u64 and changes everyone that uses the results of that math to u64 as well. Reported-by: Dave Jones <davej@codemonkey.org.uk> Fixes: 70e7af2 ("Btrfs: fix delalloc accounting leak caused by u32 overflow") Signed-off-by: Chris Mason <clm@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-29btrfs: scrub: fix target device intialization while setting up scrub contextDavid Sterba
The commit "btrfs: scrub: inline helper scrub_setup_wr_ctx" inlined a helper but wrongly sets up the target device. Incidentally there's a local variable with the same name as a parameter in the previous function, so this got caught during runtime as crash in test btrfs/027. Reported-by: Chris Mason <clm@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19btrfs: sink gfp parameter to btrfs_io_bio_allocDavid Sterba
We can hardcode GFP_NOFS to btrfs_io_bio_alloc, although it means we change it back from GFP_KERNEL in scrub. I'd rather save a few stack bytes from not passing the gfp flags in the remaining, more imporatant, contexts and the bio allocating API now looks more consistent. Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19btrfs: btrfs_io_bio_alloc never fails, skip error handlingDavid Sterba
Update direct callers of btrfs_io_bio_alloc that do error handling, that we can now remove. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19btrfs: scrub: add memalloc_nofs protection around init_ipathDavid Sterba
init_ipath is called from a safe ioctl context and from scrub when printing an error. The protection is added for three reasons: * init_data_container calls vmalloc and this does not work as expected in the GFP_NOFS context, so this silently does GFP_KERNEL and might deadlock in some cases * keep the context constraint of GFP_NOFS, used by scrub * we want to use GFP_KERNEL unconditionally inside init_ipath or its callees Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19btrfs: scrub: embed scrub_wr_ctx into scrub contextDavid Sterba
The structure scrub_wr_ctx is not used anywhere just the scrub context, we can move the members there. The tgtdev is renamed so it's more clear that it belongs to the "wr" part. Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19btrfs: scrub: use fs_info::sectorsize and drop it from scrub contextDavid Sterba
As we now have the node/block sizes in fs_info, we can use them and can drop the local copies. Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19btrfs: scrub: simplify cleanup of wr_ctx in scrub_free_ctxDavid Sterba
We don't need to take the mutex and zero out wr_cur_bio, as this is called after the scrub finished. Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19btrfs: scrub: inline helper scrub_free_wr_ctxDavid Sterba
The helper scrub_free_wr_ctx is used only once and fits into scrub_free_ctx as it continues sctx shutdown, no need to keep it separate. Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19btrfs: scrub: inline helper scrub_setup_wr_ctxDavid Sterba
The helper scrub_setup_wr_ctx is used only once and fits into scrub_setup_ctx as it continues intialization, no need to keep it separate. Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19btrfs: remove inode argument from repair_io_failureJosef Bacik
Once we remove the btree_inode we won't have an inode to pass anymore, just pass the fs_info directly and the inum since we use that to print out the repair message. Signed-off-by: Josef Bacik <jbacik@fb.com> Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-09block: switch bios to blk_status_tChristoph Hellwig
Replace bi_error with a new bi_status to allow for a clear conversion. Note that device mapper overloaded bi_error with a private value, which we'll have to keep arround at least for now and thus propagate to a proper blk_status_t value. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-18btrfs: scrub: Fix RAID56 recovery race conditionQu Wenruo
When scrubbing a RAID5 which has recoverable data corruption (only one data stripe is corrupted), sometimes scrub will report more csum errors than expected. Sometimes even unrecoverable error will be reported. The problem can be easily reproduced by the following steps: 1) Create a btrfs with RAID5 data profile with 3 devs 2) Mount it with nospace_cache or space_cache=v2 To avoid extra data space usage. 3) Create a 128K file and sync the fs, unmount it Now the 128K file lies at the beginning of the data chunk 4) Locate the physical bytenr of data chunk on dev3 Dev3 is the 1st data stripe. 5) Corrupt the first 64K of the data chunk stripe on dev3 6) Mount the fs and scrub it The correct csum error number should be 16 (assuming using x86_64). Larger csum error number can be reported in a 1/3 chance. And unrecoverable error can also be reported in a 1/10 chance. The root cause of the problem is RAID5/6 recover code has race condition, due to the fact that full scrub is initiated per device. While for other mirror based profiles, each mirror is independent with each other, so race won't cause any big problem. For example: Corrupted | Correct | Correct | | Scrub dev3 (D1) | Scrub dev2 (D2) | Scrub dev1(P) | ------------------------------------------------------------------------ Read out D1 |Read out D2 |Read full stripe | Check csum |Check csum |Check parity | Csum mismatch |Csum match, continue |Parity mismatch | handle_errored_block | |handle_errored_block | Read out full stripe | | Read out full stripe| D1 csum error(err++) | | D1 csum error(err++)| Recover D1 | | Recover D1 | So D1's csum error is accounted twice, just because handle_errored_block() doesn't have enough protection, and race can happen. On even worse case, for example D1's recovery code is re-writing D1/D2/P, and P's recovery code is just reading out full stripe, then we can cause unrecoverable error. This patch will use previously introduced lock_full_stripe() and unlock_full_stripe() to protect the whole scrub_handle_errored_block() function for RAID56 recovery. So no extra csum error nor unrecoverable error. Reported-by: Goffredo Baroncelli <kreijack@libero.it> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18btrfs: scrub: Introduce full stripe lock for RAID56Qu Wenruo
Unlike mirror based profiles, RAID5/6 recovery needs to read out the whole full stripe. And if we don't do proper protection, it can easily cause race condition. Introduce 2 new functions: lock_full_stripe() and unlock_full_stripe() for RAID5/6. Which store a rb_tree of mutexes for full stripes, so scrub callers can use them to lock a full stripe to avoid race. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> [ minor comment adjustments ] Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18Btrfs: switch to div64_u64 if with a u64 divisorLiu Bo
This is fixing code pieces where we use div_u64 when passing a u64 divisor. Cc: David Sterba <dsterba@suse.cz> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18Btrfs: update scrub_parity to use u64 stripe_lenLiu Bo
Commit 3d8da6781760 ("Btrfs: fix divide error upon chunk's stripe_len") changed stripe_len in struct map_lookup to u64, but didn't update stripe_len in struct scrub_parity. This updates the type and switches to div64_u64_rem to match u64 divisor. Cc: David Sterba <dsterba@suse.cz> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18btrfs: use clear_page where appropriateDavid Sterba
There's a helper to clear whole page, with a arch-specific optimized code. The replaced cases do not seem to be in performace critical code, but we still might get some percent gain. Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18btrfs: Prevent scrub recheck from racing with dev replaceQu Wenruo
scrub_setup_recheck_block() calls btrfs_map_sblock() and then accesses bbio without protection of bio_counter. This can lead to use-after-free if racing with dev replace cancel. Fix it by increasing bio_counter before calling btrfs_map_sblock() and decreasing the bio_counter when corresponding recover is finished. Cc: Liu Bo <bo.li.liu@oracle.com> Reported-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18btrfs: Wait for in-flight bios before freeing target device for raid56Qu Wenruo
When raid56 dev-replace is cancelled by running scrub, we will free target device without waiting for in-flight bios, causing the following NULL pointer deference or general protection failure. BUG: unable to handle kernel NULL pointer dereference at 00000000000005e0 IP: generic_make_request_checks+0x4d/0x610 CPU: 1 PID: 11676 Comm: kworker/u4:14 Tainted: G O 4.11.0-rc2 #72 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-20170228_101828-anatol 04/01/2014 Workqueue: btrfs-endio-raid56 btrfs_endio_raid56_helper [btrfs] task: ffff88002875b4c0 task.stack: ffffc90001334000 RIP: 0010:generic_make_request_checks+0x4d/0x610 Call Trace: ? generic_make_request+0xc7/0x360 generic_make_request+0x24/0x360 ? generic_make_request+0xc7/0x360 submit_bio+0x64/0x120 ? page_in_rbio+0x4d/0x80 [btrfs] ? rbio_orig_end_io+0x80/0x80 [btrfs] finish_rmw+0x3f4/0x540 [btrfs] validate_rbio_for_rmw+0x36/0x40 [btrfs] raid_rmw_end_io+0x7a/0x90 [btrfs] bio_endio+0x56/0x60 end_workqueue_fn+0x3c/0x40 [btrfs] btrfs_scrubparity_helper+0xef/0x620 [btrfs] btrfs_endio_raid56_helper+0xe/0x10 [btrfs] process_one_work+0x2af/0x720 ? process_one_work+0x22b/0x720 worker_thread+0x4b/0x4f0 kthread+0x10f/0x150 ? process_one_work+0x720/0x720 ? kthread_create_on_node+0x40/0x40 ret_from_fork+0x2e/0x40 RIP: generic_make_request_checks+0x4d/0x610 RSP: ffffc90001337bb8 In btrfs_dev_replace_finishing(), we will call btrfs_rm_dev_replace_blocked() to wait bios before destroying the target device when scrub is finished normally. However when dev-replace is aborted, either due to error or cancelled by scrub, we didn't wait for bios, this can lead to use-after-free if there are bios holding the target device. Furthermore, for raid56 scrub, at least 2 places are calling btrfs_map_sblock() without protection of bio_counter, leading to the problem. This patch fixes the problem: 1) Wait for bio_counter before freeing target device when canceling replace 2) When calling btrfs_map_sblock() for raid56, use bio_counter to protect the call. Cc: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18btrfs: scrub: Don't append on-disk pages for raid56 scrubQu Wenruo
In the following situation, scrub will calculate wrong parity to overwrite the correct one: RAID5 full stripe: Before | Dev 1 | Dev 2 | Dev 3 | | Data stripe 1 | Data stripe 2 | Parity Stripe | --------------------------------------------------- 0 | 0x0000 (Bad) | 0xcdcd | 0x0000 | --------------------------------------------------- 4K | 0xcdcd | 0xcdcd | 0x0000 | ... | 0xcdcd | 0xcdcd | 0x0000 | --------------------------------------------------- 64K After scrubbing dev3 only: | Dev 1 | Dev 2 | Dev 3 | | Data stripe 1 | Data stripe 2 | Parity Stripe | --------------------------------------------------- 0 | 0xcdcd (Good) | 0xcdcd | 0xcdcd (Bad) | --------------------------------------------------- 4K | 0xcdcd | 0xcdcd | 0x0000 | ... | 0xcdcd | 0xcdcd | 0x0000 | --------------------------------------------------- 64K The reason is that after raid56 read rebuild rbio->stripe_pages are all correctly recovered (0xcd for data stripes). However when we check and repair parity in scrub_parity_check_and_repair(), we will append pages in sparity->spages list to rbio->bio_pages[], which contains old on-disk data. And when we submit parity data to disk, we calculate parity using rbio->bio_pages[] first, if rbio->bio_pages[] not found, then fallback to rbio->stripe_pages[]. The patch fix it by not appending pages from sparity->spages. So finish_parity_scrub() will use rbio->stripe_pages[] which is correct. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18btrfs: drop redundant parameters from btrfs_map_sblockDavid Sterba
All callers pass 0 for mirror_num and 1 for need_raid_map. Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18Btrfs: set scrub page's io_error if failing to submit ioLiu Bo
Scrub repairs data by the unit called scrub_block, which may contain several pages. Scrub always tries to look up a good copy of a whole block, but if there's no such copy, it tries to do repair page by page. If we don't set page's io_error when checking this bad copy, in the last step, we may skip this page when repairing bad copy from good copy. Cc: David Sterba <dsterba@suse.cz> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18btrfs: convert scrub_ctx.refs from atomic_t to refcount_tElena Reshetova
refcount_t type and corresponding API should be used instead of atomic_t when the variable is used as a reference counter. This allows to avoid accidental refcounter overflows that might lead to use-after-free situations. Signed-off-by: Elena Reshetova <elena.reshetova@intel.com> Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: David Windsor <dwindsor@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18btrfs: convert scrub_parity.refs from atomic_t to refcount_tElena Reshetova
refcount_t type and corresponding API should be used instead of atomic_t when the variable is used as a reference counter. This allows to avoid accidental refcounter overflows that might lead to use-after-free situations. Signed-off-by: Elena Reshetova <elena.reshetova@intel.com> Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: David Windsor <dwindsor@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18btrfs: convert scrub_block.refs from atomic_t to refcount_tElena Reshetova
refcount_t type and corresponding API should be used instead of atomic_t when the variable is used as a reference counter. This allows to avoid accidental refcounter overflows that might lead to use-after-free situations. Signed-off-by: Elena Reshetova <elena.reshetova@intel.com> Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: David Windsor <dwindsor@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18btrfs: convert scrub_recover.refs from atomic_t to refcount_tElena Reshetova
refcount_t type and corresponding API should be used instead of atomic_t when the variable is used as a reference counter. This allows to avoid accidental refcounter overflows that might lead to use-after-free situations. Signed-off-by: Elena Reshetova <elena.reshetova@intel.com> Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: David Windsor <dwindsor@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-28btrfs: Make check_extent_to_block take btrfs_inodeNikolay Borisov
Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-28btrfs: make repair_io_failure take btrfs_inodeNikolay Borisov
Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-28btrfs: Make btrfs_lookup_ordered_range take btrfs_inodeNikolay Borisov
Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17btrfs: convert btrfs_inc_block_group_ro to accept fs_infoJeff Mahoney
btrfs_inc_block_group_ro is either passed the extent root or the dev root, but it doesn't do anything with the dev tree. Let's convert to passing an fs_info and using the extent root. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>