summaryrefslogtreecommitdiff
path: root/fs/btrfs
AgeCommit message (Collapse)Author
2015-03-25Merge branch 'cleanups-for-4.1-v2' of ↵Chris Mason
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.1
2015-03-21Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "Most of these are fixing extent reservation accounting, or corners with tree writeback during commit. Josef's set does add a test, which isn't strictly a fix, but it'll keep us from making this same mistake again" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: fix outstanding_extents accounting in DIO Btrfs: add sanity test for outstanding_extents accounting Btrfs: just free dummy extent buffers Btrfs: account merges/splits properly Btrfs: prepare block group cache before writing Btrfs: fix ASSERT(list_empty(&cur_trans->dirty_bgs_list) Btrfs: account for the correct number of extents for delalloc reservations Btrfs: fix merge delalloc logic Btrfs: fix comp_oper to get right order Btrfs: catch transaction abortion after waiting for it btrfs: fix sizeof format specifier in btrfs_check_super_valid()
2015-03-17Btrfs: fix outstanding_extents accounting in DIOJosef Bacik
We are keeping track of how many extents we need to reserve properly based on the amount we want to write, but we were still incrementing outstanding_extents if we wrote less than what we requested. This isn't quite right since we will be limited to our max extent size. So instead lets do something horrible! Keep track of how many outstanding_extents we reserved, and decrement each time we allocate an extent. If we use our entire reserve make sure to jack up outstanding_extents on the inode so the accounting works out properly. Thanks, Reported-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2015-03-17Btrfs: add sanity test for outstanding_extents accountingJosef Bacik
I introduced a regression wrt outstanding_extents accounting. These are tricky areas that aren't easily covered by xfstests as we could change MAX_EXTENT_SIZE at any time. So add sanity tests to cover the various conditions that are tricky in order to make sure we don't introduce regressions in the future. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com>
2015-03-17Btrfs: just free dummy extent buffersJosef Bacik
If we fail during our sanity tests we could get NULL deref's because we unload the module before the dummy extent buffers are free'd via RCU. So check for this case and just free the things directly. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com>
2015-03-17Btrfs: account merges/splits properlyJosef Bacik
My fix Btrfs: fix merge delalloc logic only fixed half of the problems, it didn't fix the case where we have two large extents on either side and then join them together with a new small extent. We need to instead keep track of how many extents we have accounted for with each side of the new extent, and then see how many extents we need for the new large extent. If they match then we know we need to keep our reservation, otherwise we need to drop our reservation. This shows up with a case like this [BTRFS_MAX_EXTENT_SIZE+4K][4K HOLE][BTRFS_MAX_EXTENT_SIZE+4K] Previously the logic would have said that the number extents required for the new size (3) is larger than the number of extents required for the largest side (2) therefore we need to keep our reservation. But this isn't the case, since both sides require a reservation of 2 which leads to 4 for the whole range currently reserved, but we only need 3, so we need to drop one of the reservations. The same problem existed for splits, we'd think we only need 3 extents when creating the hole but in reality we need 4. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com>
2015-03-17Btrfs: prepare block group cache before writingJosef Bacik
Writing the block group cache will modify the extent tree quite a bit because it truncates the old space cache and pre-allocates new stuff. To try and cut down on the churn lets do the setup dance first, then later on hopefully we can avoid looping with newly dirtied roots. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com>
2015-03-13Btrfs: fix ASSERT(list_empty(&cur_trans->dirty_bgs_list)Josef Bacik
Dave could hit this assert consistently running btrfs/078. This is because when we update the block groups we could truncate the free space, which would try to delete the csums for that range and dirty the csum root. For this to happen we have to have already written out the csum root so it's kind of hard to hit this case. This patch fixes this by changing the logic to only write the dirty block groups if the dirty_cowonly_roots list is empty. This will get us the same effect as before since we add the extent root last, and will cover the case that we dirty some other root again but not the extent root. Thanks, Reported-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-03-13Btrfs: account for the correct number of extents for delalloc reservationsJosef Bacik
Direct IO can easily pass in an buffer that is greater than BTRFS_MAX_EXTENT_SIZE, so take this into account when reserving extents in the delalloc reservation code. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-03-13Btrfs: fix merge delalloc logicJosef Bacik
My patch to properly count outstanding extents wrt MAX_EXTENT_SIZE introduced a regression when re-dirtying already dirty areas. We have logic in split to make sure we are taking the largest space into account but didn't have it for merge, so it was sometimes making us think we were turning a tiny extent into a huge extent, when in reality we already had a huge extent and needed to use the other side in our logic. This fixes the regression that was reported by a user on list. Thanks, Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-03-13Btrfs: fix comp_oper to get right orderLiu Bo
Case (oper1->seq > oper2->seq) should differ with case (oper1->seq < oper2->seq). Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.cz> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-03-13Btrfs: catch transaction abortion after waiting for itLiu Bo
This problem is uncovered by a test case: http://patchwork.ozlabs.org/patch/244297. Fsync() can report success when it actually doesn't. When we have several threads running fsync() at the same tiem and in one fsync() we get a transaction abortion due to some problems(in the test case it's disk failures), and other fsync()s may return successfully which makes userspace programs think that data is now safely flushed into disk. It's because that after fsyncs() fail btrfs_sync_log() due to disk failures, they get to try btrfs_commit_transaction() where it finds that there is already a transaction being committed, and they'll just call wait_for_commit() and return. Note that we actually check "trans->aborted" in btrfs_end_transaction, but it's likely that the error message is still not yet throwed out and only after wait_for_commit() we're sure whether the transaction is committed successfully. This add the necessary check and it now passes the test. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-03-13btrfs: fix sizeof format specifier in btrfs_check_super_valid()Fabian Frederick
This patch fixes mips compilation warning: fs/btrfs/disk-io.c: In function 'btrfs_check_super_valid': fs/btrfs/disk-io.c:3927:21: warning: format '%lu' expects argument of type 'long unsigned int', but argument 3 has type 'unsigned int' [-Wformat] Signed-off-by: Fabian Frederick <fabf@skynet.be> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Chris Mason <clm@fb.com>
2015-03-06Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "Outside of misc fixes, Filipe has a few fsync corners and we're pulling in one more of Josef's fixes from production use here" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs:__add_inode_ref: out of bounds memory read when looking for extended ref. Btrfs: fix data loss in the fast fsync path Btrfs: remove extra run_delayed_refs in update_cowonly_root Btrfs: incremental send, don't rename a directory too soon btrfs: fix lost return value due to variable shadowing Btrfs: do not ignore errors from btrfs_lookup_xattr in do_setxattr Btrfs: fix off-by-one logic error in btrfs_realloc_node Btrfs: add missing inode update when punching hole Btrfs: abort the transaction if we fail to update the free space cache inode Btrfs: fix fsync race leading to ordered extent memory leaks
2015-03-05Btrfs:__add_inode_ref: out of bounds memory read when looking for extended ref.Quentin Casasnovas
Improper arithmetics when calculting the address of the extended ref could lead to an out of bounds memory read and kernel panic. Signed-off-by: Quentin Casasnovas <quentin.casasnovas@oracle.com> Reviewed-by: David Sterba <dsterba@suse.cz> cc: stable@vger.kernel.org # v3.7+ Signed-off-by: Chris Mason <clm@fb.com>
2015-03-05Btrfs: fix data loss in the fast fsync pathFilipe Manana
When using the fast file fsync code path we can miss the fact that new writes happened since the last file fsync and therefore return without waiting for the IO to finish and write the new extents to the fsync log. Here's an example scenario where the fsync will miss the fact that new file data exists that wasn't yet durably persisted: 1. fs_info->last_trans_committed == N - 1 and current transaction is transaction N (fs_info->generation == N); 2. do a buffered write; 3. fsync our inode, this clears our inode's full sync flag, starts an ordered extent and waits for it to complete - when it completes at btrfs_finish_ordered_io(), the inode's last_trans is set to the value N (via btrfs_update_inode_fallback -> btrfs_update_inode -> btrfs_set_inode_last_trans); 4. transaction N is committed, so fs_info->last_trans_committed is now set to the value N and fs_info->generation remains with the value N; 5. do another buffered write, when this happens btrfs_file_write_iter sets our inode's last_trans to the value N + 1 (that is fs_info->generation + 1 == N + 1); 6. transaction N + 1 is started and fs_info->generation now has the value N + 1; 7. transaction N + 1 is committed, so fs_info->last_trans_committed is set to the value N + 1; 8. fsync our inode - because it doesn't have the full sync flag set, we only start the ordered extent, we don't wait for it to complete (only in a later phase) therefore its last_trans field has the value N + 1 set previously by btrfs_file_write_iter(), and so we have: inode->last_trans <= fs_info->last_trans_committed (N + 1) (N + 1) Which made us not log the last buffered write and exit the fsync handler immediately, returning success (0) to user space and resulting in data loss after a crash. This can actually be triggered deterministically and the following excerpt from a testcase I made for xfstests triggers the issue. It moves a dummy file across directories and then fsyncs the old parent directory - this is just to trigger a transaction commit, so moving files around isn't directly related to the issue but it was chosen because running 'sync' for example does more than just committing the current transaction, as it flushes/waits for all file data to be persisted. The issue can also happen at random periods, since the transaction kthread periodicaly commits the current transaction (about every 30 seconds by default). The body of the test is: _scratch_mkfs >> $seqres.full 2>&1 _init_flakey _mount_flakey # Create our main test file 'foo', the one we check for data loss. # By doing an fsync against our file, it makes btrfs clear the 'needs_full_sync' # bit from its flags (btrfs inode specific flags). $XFS_IO_PROG -f -c "pwrite -S 0xaa 0 8K" \ -c "fsync" $SCRATCH_MNT/foo | _filter_xfs_io # Now create one other file and 2 directories. We will move this second file # from one directory to the other later because it forces btrfs to commit its # currently open transaction if we fsync the old parent directory. This is # necessary to trigger the data loss bug that affected btrfs. mkdir $SCRATCH_MNT/testdir_1 touch $SCRATCH_MNT/testdir_1/bar mkdir $SCRATCH_MNT/testdir_2 # Make sure everything is durably persisted. sync # Write more 8Kb of data to our file. $XFS_IO_PROG -c "pwrite -S 0xbb 8K 8K" $SCRATCH_MNT/foo | _filter_xfs_io # Move our 'bar' file into a new directory. mv $SCRATCH_MNT/testdir_1/bar $SCRATCH_MNT/testdir_2/bar # Fsync our first directory. Because it had a file moved into some other # directory, this made btrfs commit the currently open transaction. This is # a condition necessary to trigger the data loss bug. $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/testdir_1 # Now fsync our main test file. If the fsync succeeds, we expect the 8Kb of # data we wrote previously to be persisted and available if a crash happens. # This did not happen with btrfs, because of the transaction commit that # happened when we fsynced the parent directory. $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo # Simulate a crash/power loss. _load_flakey_table $FLAKEY_DROP_WRITES _unmount_flakey _load_flakey_table $FLAKEY_ALLOW_WRITES _mount_flakey # Now check that all data we wrote before are available. echo "File content after log replay:" od -t x1 $SCRATCH_MNT/foo status=0 exit The expected golden output for the test, which is what we get with this fix applied (or when running against ext3/4 and xfs), is: wrote 8192/8192 bytes at offset 0 XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) wrote 8192/8192 bytes at offset 8192 XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) File content after log replay: 0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa * 0020000 bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb * 0040000 Without this fix applied, the output shows the test file does not have the second 8Kb extent that we successfully fsynced: wrote 8192/8192 bytes at offset 0 XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) wrote 8192/8192 bytes at offset 8192 XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) File content after log replay: 0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa * 0020000 So fix this by skipping the fsync only if we're doing a full sync and if the inode's last_trans is <= fs_info->last_trans_committed, or if the inode is already in the log. Also remove setting the inode's last_trans in btrfs_file_write_iter since it's useless/unreliable. Also because btrfs_file_write_iter no longer sets inode->last_trans to fs_info->generation + 1, don't set last_trans to 0 if we bail out and don't bail out if last_trans is 0, otherwise something as simple as the following example wouldn't log the second write on the last fsync: 1. write to file 2. fsync file 3. fsync file |--> btrfs_inode_in_log() returns true and it set last_trans to 0 4. write to file |--> btrfs_file_write_iter() no longers sets last_trans, so it remained with a value of 0 5. fsync |--> inode->last_trans == 0, so it bails out without logging the second write A test case for xfstests will be sent soon. CC: <stable@vger.kernel.org> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-03-05Btrfs: remove extra run_delayed_refs in update_cowonly_rootJosef Bacik
This got added with my dirty_bgs patch, it's not needed. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-03-03btrfs: remove shadowing variables in __btrfs_map_blockDavid Sterba
1) We can safely use the function's 'i'. Fixes warning fs/btrfs/volumes.c:5257:7: warning: declaration of 'i' shadows a previous local fs/btrfs/volumes.c:4951:6: warning: shadowed declaration is here 2) A local variable duplicates name of an argument, we can use the value directly. Fixes warning fs/btrfs/volumes.c:5433:8: warning: declaration of 'length' shadows a parameter fs/btrfs/volumes.c:4935:27: warning: shadowed declaration is here Signed-off-by: David Sterba <dsterba@suse.cz>
2015-03-03btrfs: switch helper macros to static inlines in sysfs.hDavid Sterba
The conversion macros use nested container_of that leads to a warning fs/btrfs/sysfs.c: In function 'btrfs_feature_visible': fs/btrfs/sysfs.c:183:8: warning: declaration of '__mptr' shadows a previous local fs/btrfs/sysfs.c:183:8: warning: shadowed declaration is here Use of functions will add proper type checking. Signed-off-by: David Sterba <dsterba@suse.cz>
2015-03-03btrfs: cleanup, use correct type in div_u64_remDavid Sterba
div_u64_rem expects u32 for divisior and reminder. Signed-off-by: David Sterba <dsterba@suse.cz>
2015-03-03btrfs: replace remaining do_div calls with div_u64 variantsDavid Sterba
Switch to div_u64_rem that does type checking and has more obvious semantics than do_div. Signed-off-by: David Sterba <dsterba@suse.cz>
2015-03-03btrfs: cleanup 64bit/32bit divs, provably bounded valuesDavid Sterba
The divisor is derived from nodesize or PAGE_SIZE, fits into 32bit type. Get rid of a few more do_div instances. Signed-off-by: David Sterba <dsterba@suse.cz>
2015-03-03btrfs: use explicit initializer for seq_elemDavid Sterba
Using {} as initializer for struct seq_elem does not properly initialize the list_head member, but it currently works because it gets set through btrfs_get_tree_mod_seq if 'seq' is 0. Signed-off-by: David Sterba <dsterba@suse.cz>
2015-03-03btrfs: remove shadowing variables in __btrfs_buffered_writeDavid Sterba
There are lockstart and lockend defined in the function and not used after their duplicate definition scope ends, it's safe to reuse them. Signed-off-by: David Sterba <dsterba@suse.cz>
2015-03-03btrfs: cleanup, use kmalloc_array/kcalloc array helpersDavid Sterba
Convert kmalloc(nr * size, ..) to kmalloc_array that does additional overflow checks, the zeroing variant is kcalloc. Signed-off-by: David Sterba <dsterba@suse.cz>
2015-03-03btrfs: cleanup 64bit/32bit divs, compile time constantsDavid Sterba
Switch to div_u64 if the divisor is a numeric constant or sum of sizeof()s. We can remove a few instances of do_div that has the hidden semtantics of changing the 1st argument. Small power-of-two divisors are converted to bitshifts, large values are kept intact for clarity. Signed-off-by: David Sterba <dsterba@suse.cz>
2015-03-03btrfs: use cond_resched_lock where possibleDavid Sterba
Clean the opencoded variant, cond_resched_lock also checks the lock for contention so it might help in some cases that were not covered by simple need_resched(). Signed-off-by: David Sterba <dsterba@suse.cz>
2015-03-03btrfs: need_resched not needed with cond_reschedDavid Sterba
Cleanup, no special reason to do if (need_resched()) cond_resched(); Signed-off-by: David Sterba <dsterba@suse.cz>
2015-03-02Btrfs: incremental send, don't rename a directory too soonFilipe Manana
There's one more case where we can't issue a rename operation for a directory as soon as we process it. We used to delay directory renames only if they have some ancestor directory with a higher inode number that got renamed too, but there's another case where we need to delay the rename too - when a directory A is renamed to the old name of a directory B but that directory B has its rename delayed because it has now (in the send root) an ancestor with a higher inode number that was renamed. If we don't delay the directory rename in this case, the receiving end of the send stream will attempt to rename A to the old name of B before B got renamed to its new name, which results in a "directory not empty" error. So fix this by delaying directory renames for this case too. Steps to reproduce: $ mkfs.btrfs -f /dev/sdb $ mount /dev/sdb /mnt $ mkdir /mnt/a $ mkdir /mnt/b $ mkdir /mnt/c $ touch /mnt/a/file $ btrfs subvolume snapshot -r /mnt /mnt/snap1 $ mv /mnt/c /mnt/x $ mv /mnt/a /mnt/x/y $ mv /mnt/b /mnt/a $ btrfs subvolume snapshot -r /mnt /mnt/snap2 $ btrfs send /mnt/snap1 -f /tmp/1.send $ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/2.send $ mkfs.btrfs -f /dev/sdc $ mount /dev/sdc /mnt2 $ btrfs receive /mnt2 -f /tmp/1.send $ btrfs receive /mnt2 -f /tmp/2.send ERROR: rename b -> a failed. Directory not empty A test case for xfstests follows soon. Reported-by: Ames Cornish <ames@cornishes.net> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-03-02btrfs: fix lost return value due to variable shadowingDavid Sterba
A block-local variable stores error code but btrfs_get_blocks_direct may not return it in the end as there's a ret defined in the function scope. CC: <stable@vger.kernel.org> # 3.6+ Fixes: d187663ef24c ("Btrfs: lock extents as we map them in DIO") Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
2015-03-02Btrfs: do not ignore errors from btrfs_lookup_xattr in do_setxattrFilipe Manana
The return value from btrfs_lookup_xattr() can be a pointer encoding an error, therefore deal with it. This fixes commit 5f5bc6b1e2d5 ("Btrfs: make xattr replace operations atomic"). Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-03-02Btrfs: fix off-by-one logic error in btrfs_realloc_nodeFilipe Manana
The end_slot variable actually matches the number of pointers in the node and not the last slot (which is 'nritems - 1'). Therefore in order to check that the current slot in the for loop doesn't match the last one, the correct logic is to check if 'i' is less than 'end_slot - 1' and not 'end_slot - 2'. Fix this and set end_slot to be 'nritems - 1', as it's less confusing since the variable name implies it's inclusive rather then exclusive. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-03-02Btrfs: add missing inode update when punching holeFilipe Manana
When punching a file hole if we endup only zeroing parts of a page, because the start offset isn't a multiple of the sector size or the start offset and length fall within the same page, we were not updating the inode item. This prevented an fsync from doing anything, if no other file changes happened in the current transaction, because the fields in btrfs_inode used to check if the inode needs to be fsync'ed weren't updated. This issue is easy to reproduce and the following excerpt from the xfstest case I made shows how to trigger it: _scratch_mkfs >> $seqres.full 2>&1 _init_flakey _mount_flakey # Create our test file. $XFS_IO_PROG -f -c "pwrite -S 0x22 -b 16K 0 16K" \ $SCRATCH_MNT/foo | _filter_xfs_io # Fsync the file, this makes btrfs update some btrfs inode specific fields # that are used to track if the inode needs to be written/updated to the fsync # log or not. After this fsync, the new values for those fields indicate that # a subsequent fsync does not need to touch the fsync log. $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo # Force a commit of the current transaction. After this point, any operation # that modifies the data or metadata of our file, should update those fields in # the btrfs inode with values that make the next fsync operation write to the # fsync log. sync # Punch a hole in our file. This small range affects only 1 page. # This made the btrfs hole punching implementation write only some zeroes in # one page, but it did not update the btrfs inode fields used to determine if # the next fsync needs to write to the fsync log. $XFS_IO_PROG -c "fpunch 8000 4K" $SCRATCH_MNT/foo # Another variation of the previously mentioned case. $XFS_IO_PROG -c "fpunch 15000 100" $SCRATCH_MNT/foo # Now fsync the file. This was a no-operation because the previous hole punch # operation didn't update the inode's fields mentioned before, so they remained # with the values they had after the first fsync - that is, they indicate that # it is not needed to write to fsync log. $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo echo "File content before:" od -t x1 $SCRATCH_MNT/foo # Simulate a crash/power loss. _load_flakey_table $FLAKEY_DROP_WRITES _unmount_flakey # Enable writes and mount the fs. This makes the fsync log replay code run. _load_flakey_table $FLAKEY_ALLOW_WRITES _mount_flakey # Because the last fsync didn't do anything, here the file content matched what # it was after the first fsync, before the holes were punched, and not what it # was after the holes were punched. echo "File content after:" od -t x1 $SCRATCH_MNT/foo This issue has been around since 2012, when the punch hole implementation was added, commit 2aaa66558172 ("Btrfs: add hole punching"). A test case for xfstests follows soon. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-03-02Btrfs: abort the transaction if we fail to update the free space cache inodeJosef Bacik
Our gluster boxes were hitting a problem where they'd run out of space when updating the block group cache and therefore wouldn't be able to update the free space inode. This is a problem because this is how we invalidate the cache and protect ourselves from errors further down the stack, so if this fails we have to abort the transaction so we make sure we don't end up with stale free space cache. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-03-02Btrfs: fix fsync race leading to ordered extent memory leaksFilipe Manana
We can have multiple fsync operations against the same file during the same transaction and they can collect the same ordered extents while they don't complete (still accessible from the inode's ordered tree). If this happens, those ordered extents will never get their reference counts decremented to 0, leading to memory leaks and inode leaks (an iput for an ordered extent's inode is scheduled only when the ordered extent's refcount drops to 0). The following sequence diagram explains this race: CPU 1 CPU 2 btrfs_sync_file() btrfs_sync_file() mutex_lock(inode->i_mutex) btrfs_log_inode() btrfs_get_logged_extents() --> collects ordered extent X --> increments ordered extent X's refcount btrfs_submit_logged_extents() mutex_unlock(inode->i_mutex) mutex_lock(inode->i_mutex) btrfs_sync_log() btrfs_wait_logged_extents() --> list_del_init(&ordered->log_list) btrfs_log_inode() btrfs_get_logged_extents() --> Adds ordered extent X to logged_list because at this point: list_empty(&ordered->log_list) && test_bit(BTRFS_ORDERED_LOGGED, &ordered->flags) == 0 --> Increments ordered extent X's refcount --> check if ordered extent's io is finished or not, start it if necessary and wait for it to finish --> sets bit BTRFS_ORDERED_LOGGED on ordered extent X's flags and adds it to trans->ordered btrfs_sync_log() finishes btrfs_submit_logged_extents() btrfs_log_inode() finishes mutex_unlock(inode->i_mutex) btrfs_sync_file() finishes btrfs_sync_log() btrfs_wait_logged_extents() --> Sees ordered extent X has the bit BTRFS_ORDERED_LOGGED set in its flags --> X's refcount is untouched btrfs_sync_log() finishes btrfs_sync_file() finishes btrfs_commit_transaction() --> called by transaction kthread for e.g. btrfs_wait_pending_ordered() --> waits for ordered extent X to complete --> decrements ordered extent X's refcount by 1 only, corresponding to the increment done by the fsync task ran by CPU 1 In the scenario of the above diagram, after the transaction commit, the ordered extent will remain with a refcount of 1 forever, leaking the ordered extent structure and preventing the i_count of its inode from ever decreasing to 0, since the delayed iput is scheduled only when the ordered extent's refcount drops to 0, preventing the inode from ever being evicted by the VFS. Fix this by using the flag BTRFS_ORDERED_LOGGED differently. Use it to mean that an ordered extent is already being processed by an fsync call, which will attach it to the current transaction, preventing it from being collected by subsequent fsync operations against the same inode. This race was introduced with the following change (added in 3.19 and backported to stable 3.18 and 3.17): Btrfs: make sure logged extents complete in the current transaction V3 commit 50d9aa99bd35c77200e0e3dd7a72274f8304701f I ran into this issue while running xfstests/generic/113 in a loop, which failed about 1 out of 10 runs with the following warning in dmesg: [ 2612.440038] WARNING: CPU: 4 PID: 22057 at fs/btrfs/disk-io.c:3558 free_fs_root+0x36/0x133 [btrfs]() [ 2612.442810] Modules linked in: btrfs crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop processor parport_pc parport psmouse therma l_sys i2c_piix4 serio_raw pcspkr evdev microcode button i2c_core ext4 crc16 jbd2 mbcache sd_mod sg sr_mod cdrom virtio_scsi ata_generic virtio_pci ata_piix virtio_ring libata virtio flo ppy e1000 scsi_mod [last unloaded: btrfs] [ 2612.452711] CPU: 4 PID: 22057 Comm: umount Tainted: G W 3.19.0-rc5-btrfs-next-4+ #1 [ 2612.454921] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014 [ 2612.457709] 0000000000000009 ffff8801342c3c78 ffffffff8142425e ffff88023ec8f2d8 [ 2612.459829] 0000000000000000 ffff8801342c3cb8 ffffffff81045308 ffff880046460000 [ 2612.461564] ffffffffa036da56 ffff88003d07b000 ffff880046460000 ffff880046460068 [ 2612.463163] Call Trace: [ 2612.463719] [<ffffffff8142425e>] dump_stack+0x4c/0x65 [ 2612.464789] [<ffffffff81045308>] warn_slowpath_common+0xa1/0xbb [ 2612.466026] [<ffffffffa036da56>] ? free_fs_root+0x36/0x133 [btrfs] [ 2612.467247] [<ffffffff810453c5>] warn_slowpath_null+0x1a/0x1c [ 2612.468416] [<ffffffffa036da56>] free_fs_root+0x36/0x133 [btrfs] [ 2612.469625] [<ffffffffa036f2a7>] btrfs_drop_and_free_fs_root+0x93/0x9b [btrfs] [ 2612.471251] [<ffffffffa036f353>] btrfs_free_fs_roots+0xa4/0xd6 [btrfs] [ 2612.472536] [<ffffffff8142612e>] ? wait_for_completion+0x24/0x26 [ 2612.473742] [<ffffffffa0370bbc>] close_ctree+0x1f3/0x33c [btrfs] [ 2612.475477] [<ffffffff81059d1d>] ? destroy_workqueue+0x148/0x1ba [ 2612.476695] [<ffffffffa034e3da>] btrfs_put_super+0x19/0x1b [btrfs] [ 2612.477911] [<ffffffff81153e53>] generic_shutdown_super+0x73/0xef [ 2612.479106] [<ffffffff811540e2>] kill_anon_super+0x13/0x1e [ 2612.480226] [<ffffffffa034e1e3>] btrfs_kill_super+0x17/0x23 [btrfs] [ 2612.481471] [<ffffffff81154307>] deactivate_locked_super+0x3b/0x50 [ 2612.482686] [<ffffffff811547a7>] deactivate_super+0x3f/0x43 [ 2612.483791] [<ffffffff8116b3ed>] cleanup_mnt+0x59/0x78 [ 2612.484842] [<ffffffff8116b44c>] __cleanup_mnt+0x12/0x14 [ 2612.485900] [<ffffffff8105d019>] task_work_run+0x8f/0xbc [ 2612.486960] [<ffffffff810028d8>] do_notify_resume+0x5a/0x6b [ 2612.488083] [<ffffffff81236e5b>] ? trace_hardirqs_on_thunk+0x3a/0x3f [ 2612.489333] [<ffffffff8142a17f>] int_signal+0x12/0x17 [ 2612.490353] ---[ end trace 54a960a6bdcb8d93 ]--- [ 2612.557253] VFS: Busy inodes after unmount of sdb. Self-destruct in 5 seconds. Have a nice day... Kmemleak confirmed the ordered extent leak (and btrfs inode specific structures such as delayed nodes): $ cat /sys/kernel/debug/kmemleak unreferenced object 0xffff880154290db0 (size 576): comm "btrfsck", pid 21980, jiffies 4295542503 (age 1273.412s) hex dump (first 32 bytes): 01 40 00 00 01 00 00 00 b0 1d f1 4e 01 88 ff ff .@.........N.... 00 00 00 00 00 00 00 00 c8 0d 29 54 01 88 ff ff ..........)T.... backtrace: [<ffffffff8141d74d>] kmemleak_update_trace+0x4c/0x6a [<ffffffff8122f2c0>] radix_tree_node_alloc+0x6d/0x83 [<ffffffff8122fb26>] __radix_tree_create+0x109/0x190 [<ffffffff8122fbdd>] radix_tree_insert+0x30/0xac [<ffffffffa03b9bde>] btrfs_get_or_create_delayed_node+0x130/0x187 [btrfs] [<ffffffffa03bb82d>] btrfs_delayed_delete_inode_ref+0x32/0xac [btrfs] [<ffffffffa0379dae>] __btrfs_unlink_inode+0xee/0x288 [btrfs] [<ffffffffa037c715>] btrfs_unlink_inode+0x1e/0x40 [btrfs] [<ffffffffa037c797>] btrfs_unlink+0x60/0x9b [btrfs] [<ffffffff8115d7f0>] vfs_unlink+0x9c/0xed [<ffffffff8115f5de>] do_unlinkat+0x12c/0x1fa [<ffffffff811601a7>] SyS_unlinkat+0x29/0x2b [<ffffffff81429e92>] system_call_fastpath+0x12/0x17 [<ffffffffffffffff>] 0xffffffffffffffff unreferenced object 0xffff88014ef11db0 (size 576): comm "rm", pid 22009, jiffies 4295542593 (age 1273.052s) hex dump (first 32 bytes): 02 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 ................ 00 00 00 00 00 00 00 00 c8 1d f1 4e 01 88 ff ff ...........N.... backtrace: [<ffffffff8141d74d>] kmemleak_update_trace+0x4c/0x6a [<ffffffff8122f2c0>] radix_tree_node_alloc+0x6d/0x83 [<ffffffff8122fb26>] __radix_tree_create+0x109/0x190 [<ffffffff8122fbdd>] radix_tree_insert+0x30/0xac [<ffffffffa03b9bde>] btrfs_get_or_create_delayed_node+0x130/0x187 [btrfs] [<ffffffffa03bb82d>] btrfs_delayed_delete_inode_ref+0x32/0xac [btrfs] [<ffffffffa0379dae>] __btrfs_unlink_inode+0xee/0x288 [btrfs] [<ffffffffa037c715>] btrfs_unlink_inode+0x1e/0x40 [btrfs] [<ffffffffa037c797>] btrfs_unlink+0x60/0x9b [btrfs] [<ffffffff8115d7f0>] vfs_unlink+0x9c/0xed [<ffffffff8115f5de>] do_unlinkat+0x12c/0x1fa [<ffffffff811601a7>] SyS_unlinkat+0x29/0x2b [<ffffffff81429e92>] system_call_fastpath+0x12/0x17 [<ffffffffffffffff>] 0xffffffffffffffff unreferenced object 0xffff8800336feda8 (size 584): comm "aio-stress", pid 22031, jiffies 4295543006 (age 1271.400s) hex dump (first 32 bytes): 00 40 3e 00 00 00 00 00 00 00 8f 42 00 00 00 00 .@>........B.... 00 00 01 00 00 00 00 00 00 00 01 00 00 00 00 00 ................ backtrace: [<ffffffff8114eb34>] create_object+0x172/0x29a [<ffffffff8141d790>] kmemleak_alloc+0x25/0x41 [<ffffffff81141ae6>] kmemleak_alloc_recursive.constprop.52+0x16/0x18 [<ffffffff81145288>] kmem_cache_alloc+0xf7/0x198 [<ffffffffa0389243>] __btrfs_add_ordered_extent+0x43/0x309 [btrfs] [<ffffffffa038968b>] btrfs_add_ordered_extent_dio+0x12/0x14 [btrfs] [<ffffffffa03810e2>] btrfs_get_blocks_direct+0x3ef/0x571 [btrfs] [<ffffffff81181349>] do_blockdev_direct_IO+0x62a/0xb47 [<ffffffff8118189a>] __blockdev_direct_IO+0x34/0x36 [<ffffffffa03776e5>] btrfs_direct_IO+0x16a/0x1e8 [btrfs] [<ffffffff81100373>] generic_file_direct_write+0xb8/0x12d [<ffffffffa038615c>] btrfs_file_write_iter+0x24b/0x42f [btrfs] [<ffffffff8118bb0d>] aio_run_iocb+0x2b7/0x32e [<ffffffff8118c99a>] do_io_submit+0x26e/0x2ff [<ffffffff8118ca3b>] SyS_io_submit+0x10/0x12 [<ffffffff81429e92>] system_call_fastpath+0x12/0x17 CC: <stable@vger.kernel.org> # 3.19, 3.18 and 3.17 Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-02-26Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fix from Chris Mason: "I'm still testing more fixes, but I wanted to get out the fix for the btrfs raid5/6 memory corruption I mentioned in my merge window pull" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: fix allocation size calculations in alloc_btrfs_bio
2015-02-22Merge branch 'for-linus-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull more vfs updates from Al Viro: "Assorted stuff from this cycle. The big ones here are multilayer overlayfs from Miklos and beginning of sorting ->d_inode accesses out from David" * 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (51 commits) autofs4 copy_dev_ioctl(): keep the value of ->size we'd used for allocation procfs: fix race between symlink removals and traversals debugfs: leave freeing a symlink body until inode eviction Documentation/filesystems/Locking: ->get_sb() is long gone trylock_super(): replacement for grab_super_passive() fanotify: Fix up scripted S_ISDIR/S_ISREG/S_ISLNK conversions Cachefiles: Fix up scripted S_ISDIR/S_ISREG/S_ISLNK conversions VFS: (Scripted) Convert S_ISLNK/DIR/REG(dentry->d_inode) to d_is_*(dentry) SELinux: Use d_is_positive() rather than testing dentry->d_inode Smack: Use d_is_positive() rather than testing dentry->d_inode TOMOYO: Use d_is_dir() rather than d_inode and S_ISDIR() Apparmor: Use d_is_positive/negative() rather than testing dentry->d_inode Apparmor: mediated_filesystem() should use dentry->d_sb not inode->i_sb VFS: Split DCACHE_FILE_TYPE into regular and special types VFS: Add a fallthrough flag for marking virtual dentries VFS: Add a whiteout dentry type VFS: Introduce inode-getting helpers for layered/unioned fs environments Infiniband: Fix potential NULL d_inode dereference posix_acl: fix reference leaks in posix_acl_create autofs4: Wrong format for printing dentry ...
2015-02-22VFS: (Scripted) Convert S_ISLNK/DIR/REG(dentry->d_inode) to d_is_*(dentry)David Howells
Convert the following where appropriate: (1) S_ISLNK(dentry->d_inode) to d_is_symlink(dentry). (2) S_ISREG(dentry->d_inode) to d_is_reg(dentry). (3) S_ISDIR(dentry->d_inode) to d_is_dir(dentry). This is actually more complicated than it appears as some calls should be converted to d_can_lookup() instead. The difference is whether the directory in question is a real dir with a ->lookup op or whether it's a fake dir with a ->d_automount op. In some circumstances, we can subsume checks for dentry->d_inode not being NULL into this, provided we the code isn't in a filesystem that expects d_inode to be NULL if the dirent really *is* negative (ie. if we're going to use d_inode() rather than d_backing_inode() to get the inode pointer). Note that the dentry type field may be set to something other than DCACHE_MISS_TYPE when d_inode is NULL in the case of unionmount, where the VFS manages the fall-through from a negative dentry to a lower layer. In such a case, the dentry type of the negative union dentry is set to the same as the type of the lower dentry. However, if you know d_inode is not NULL at the call site, then you can use the d_is_xxx() functions even in a filesystem. There is one further complication: a 0,0 chardev dentry may be labelled DCACHE_WHITEOUT_TYPE rather than DCACHE_SPECIAL_TYPE. Strictly, this was intended for special directory entry types that don't have attached inodes. The following perl+coccinelle script was used: use strict; my @callers; open($fd, 'git grep -l \'S_IS[A-Z].*->d_inode\' |') || die "Can't grep for S_ISDIR and co. callers"; @callers = <$fd>; close($fd); unless (@callers) { print "No matches\n"; exit(0); } my @cocci = ( '@@', 'expression E;', '@@', '', '- S_ISLNK(E->d_inode->i_mode)', '+ d_is_symlink(E)', '', '@@', 'expression E;', '@@', '', '- S_ISDIR(E->d_inode->i_mode)', '+ d_is_dir(E)', '', '@@', 'expression E;', '@@', '', '- S_ISREG(E->d_inode->i_mode)', '+ d_is_reg(E)' ); my $coccifile = "tmp.sp.cocci"; open($fd, ">$coccifile") || die $coccifile; print($fd "$_\n") || die $coccifile foreach (@cocci); close($fd); foreach my $file (@callers) { chomp $file; print "Processing ", $file, "\n"; system("spatch", "--sp-file", $coccifile, $file, "--in-place", "--no-show-diff") == 0 || die "spatch failed"; } [AV: overlayfs parts skipped] Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-02-20Btrfs: fix allocation size calculations in alloc_btrfs_bioChris Mason
Since commit 8e5cfb55d3f (Btrfs: Make raid_map array be inlined in btrfs_bio structure), the raid map array is allocated along with the btrfs bio in alloc_btrfs_bio. The calculation used to decide how much we need to allocate was using the wrong parameter passed into the allocation function. The passed in real_stripes will be zero if a target replace operation is not currently running. We want to use total_stripes instead. Signed-off-by: Chris Mason <clm@fb.com> Reported-by: David Sterba <dsterba@suse.cz> Tested-by: David Sterba <dsterba@suse.cz>
2015-02-19Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs updates from Chris Mason: "This pull is mostly cleanups and fixes: - The raid5/6 cleanups from Zhao Lei fixup some long standing warts in the code and add improvements on top of the scrubbing support from 3.19. - Josef has round one of our ENOSPC fixes coming from large btrfs clusters here at FB. - Dave Sterba continues a long series of cleanups (thanks Dave), and Filipe continues hammering on corner cases in fsync and others This all was held up a little trying to track down a use-after-free in btrfs raid5/6. It's not clear yet if this is just made easier to trigger with this pull or if its a new bug from the raid5/6 cleanups. Dave Sterba is the only one to trigger it so far, but he has a consistent way to reproduce, so we'll get it nailed shortly" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (68 commits) Btrfs: don't remove extents and xattrs when logging new names Btrfs: fix fsync data loss after adding hard link to inode Btrfs: fix BUG_ON in btrfs_orphan_add() when delete unused block group Btrfs: account for large extents with enospc Btrfs: don't set and clear delalloc for O_DIRECT writes Btrfs: only adjust outstanding_extents when we do a short write btrfs: Fix out-of-space bug Btrfs: scrub, fix sleep in atomic context Btrfs: fix scheduler warning when syncing log Btrfs: Remove unnecessary placeholder in btrfs_err_code btrfs: cleanup init for list in free-space-cache btrfs: delete chunk allocation attemp when setting block group ro btrfs: clear bio reference after submit_one_bio() Btrfs: fix scrub race leading to use-after-free Btrfs: add missing cleanup on sysfs init failure Btrfs: fix race between transaction commit and empty block group removal btrfs: add more checks to btrfs_read_sys_array btrfs: cleanup, rename a few variables in btrfs_read_sys_array btrfs: add checks for sys_chunk_array sizes btrfs: more superblock checks, lower bounds on devices and sectorsize/nodesize ...
2015-02-14Btrfs: don't remove extents and xattrs when logging new namesFilipe Manana
If we are recording in the tree log that an inode has new names (new hard links were added), we would drop items, belonging to the inode, that we shouldn't: 1) When the flag BTRFS_INODE_COPY_EVERYTHING is set in the inode's runtime flags, we ended up dropping all the extent and xattr items that were previously logged. This was done only in memory, since logging a new name doesn't imply syncing the log; 2) When the flag BTRFS_INODE_COPY_EVERYTHING is set in the inode's runtime flags, we ended up dropping all the xattr items that were previously logged. Like the case before, this was done only in memory because logging a new name doesn't imply syncing the log. This led to some surprises in scenarios such as the following: 1) write some extents to an inode; 2) fsync the inode; 3) truncate the inode or delete/modify some of its xattrs 4) add a new hard link for that inode 5) fsync some other file, to force the log tree to be durably persisted 6) power failure happens The next time the fs is mounted, the fsync log replay code is executed, and the resulting file doesn't have the content it had when the last fsync against it was performed, instead if has a content matching what it had when the last transaction commit happened. So change the behaviour such that when a new name is logged, only the inode item and reference items are processed. This is easy to reproduce with the test I just made for xfstests, whose main body is: _scratch_mkfs >> $seqres.full 2>&1 _init_flakey _mount_flakey # Create our test file with some data. $XFS_IO_PROG -f -c "pwrite -S 0xaa -b 8K 0 8K" \ $SCRATCH_MNT/foo | _filter_xfs_io # Make sure the file is durably persisted. sync # Append some data to our file, to increase its size. $XFS_IO_PROG -f -c "pwrite -S 0xcc -b 4K 8K 4K" \ $SCRATCH_MNT/foo | _filter_xfs_io # Fsync the file, so from this point on if a crash/power failure happens, our # new data is guaranteed to be there next time the fs is mounted. $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo # Now shrink our file to 5000 bytes. $XFS_IO_PROG -c "truncate 5000" $SCRATCH_MNT/foo # Now do an expanding truncate to a size larger than what we had when we last # fsync'ed our file. This is just to verify that after power failure and # replaying the fsync log, our file matches what it was when we last fsync'ed # it - 12Kb size, first 8Kb of data had a value of 0xaa and the last 4Kb of # data had a value of 0xcc. $XFS_IO_PROG -c "truncate 32K" $SCRATCH_MNT/foo # Add one hard link to our file. This made btrfs drop all of our file's # metadata from the fsync log, including the metadata relative to the # extent we just wrote and fsync'ed. This change was made only to the fsync # log in memory, so adding the hard link alone doesn't change the persisted # fsync log. This happened because the previous truncates set the runtime # flag BTRFS_INODE_NEEDS_FULL_SYNC in the btrfs inode structure. ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link # Now make sure the in memory fsync log is durably persisted. # Creating and fsync'ing another file will do it. # After this our persisted fsync log will no longer have metadata for our file # foo that points to the extent we wrote and fsync'ed before. touch $SCRATCH_MNT/bar $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/bar # As expected, before the crash/power failure, we should be able to see a file # with a size of 32Kb, with its first 5000 bytes having the value 0xaa and all # the remaining bytes with value 0x00. echo "File content before:" od -t x1 $SCRATCH_MNT/foo # Simulate a crash/power loss. _load_flakey_table $FLAKEY_DROP_WRITES _unmount_flakey _load_flakey_table $FLAKEY_ALLOW_WRITES _mount_flakey # After mounting the fs again, the fsync log was replayed. # The expected result is to see a file with a size of 12Kb, with its first 8Kb # of data having the value 0xaa and its last 4Kb of data having a value of 0xcc. # The btrfs bug used to leave the file as it used te be as of the last # transaction commit - that is, with a size of 8Kb with all bytes having a # value of 0xaa. echo "File content after:" od -t x1 $SCRATCH_MNT/foo The test case for xfstests follows soon. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-02-14Btrfs: fix fsync data loss after adding hard link to inodeFilipe Manana
We have a scenario where after the fsync log replay we can lose file data that had been previously fsync'ed if we added an hard link for our inode and after that we sync'ed the fsync log (for example by fsync'ing some other file or directory). This is because when adding an hard link we updated the inode item in the log tree with an i_size value of 0. At that point the new inode item was in memory only and a subsequent fsync log replay would not make us lose the file data. However if after adding the hard link we sync the log tree to disk, by fsync'ing some other file or directory for example, we ended up losing the file data after log replay, because the inode item in the persisted log tree had an an i_size of zero. This is easy to reproduce, and the following excerpt from my test for xfstests shows this: _scratch_mkfs >> $seqres.full 2>&1 _init_flakey _mount_flakey # Create one file with data and fsync it. # This made the btrfs fsync log persist the data and the inode metadata with # a correct inode->i_size (4096 bytes). $XFS_IO_PROG -f -c "pwrite -S 0xaa -b 4K 0 4K" -c "fsync" \ $SCRATCH_MNT/foo | _filter_xfs_io # Now add one hard link to our file. This made the btrfs code update the fsync # log, in memory only, with an inode metadata having a size of 0. ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link # Now force persistence of the fsync log to disk, for example, by fsyncing some # other file. touch $SCRATCH_MNT/bar $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/bar # Before a power loss or crash, we could read the 4Kb of data from our file as # expected. echo "File content before:" od -t x1 $SCRATCH_MNT/foo # Simulate a crash/power loss. _load_flakey_table $FLAKEY_DROP_WRITES _unmount_flakey _load_flakey_table $FLAKEY_ALLOW_WRITES _mount_flakey # After the fsync log replay, because the fsync log had a value of 0 for our # inode's i_size, we couldn't read anymore the 4Kb of data that we previously # wrote and fsync'ed. The size of the file became 0 after the fsync log replay. echo "File content after:" od -t x1 $SCRATCH_MNT/foo Another alternative test, that doesn't need to fsync an inode in the same transaction it was created, is: _scratch_mkfs >> $seqres.full 2>&1 _init_flakey _mount_flakey # Create our test file with some data. $XFS_IO_PROG -f -c "pwrite -S 0xaa -b 8K 0 8K" \ $SCRATCH_MNT/foo | _filter_xfs_io # Make sure the file is durably persisted. sync # Append some data to our file, to increase its size. $XFS_IO_PROG -f -c "pwrite -S 0xcc -b 4K 8K 4K" \ $SCRATCH_MNT/foo | _filter_xfs_io # Fsync the file, so from this point on if a crash/power failure happens, our # new data is guaranteed to be there next time the fs is mounted. $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo # Add one hard link to our file. This made btrfs write into the in memory fsync # log a special inode with generation 0 and an i_size of 0 too. Note that this # didn't update the inode in the fsync log on disk. ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link # Now make sure the in memory fsync log is durably persisted. # Creating and fsync'ing another file will do it. touch $SCRATCH_MNT/bar $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/bar # As expected, before the crash/power failure, we should be able to read the # 12Kb of file data. echo "File content before:" od -t x1 $SCRATCH_MNT/foo # Simulate a crash/power loss. _load_flakey_table $FLAKEY_DROP_WRITES _unmount_flakey _load_flakey_table $FLAKEY_ALLOW_WRITES _mount_flakey # After mounting the fs again, the fsync log was replayed. # The btrfs fsync log replay code didn't update the i_size of the persisted # inode because the inode item in the log had a special generation with a # value of 0 (and it couldn't know the correct i_size, since that inode item # had a 0 i_size too). This made the last 4Kb of file data inaccessible and # effectively lost. echo "File content after:" od -t x1 $SCRATCH_MNT/foo This isn't a new issue/regression. This problem has been around since the log tree code was added in 2008: Btrfs: Add a write ahead tree log to optimize synchronous operations (commit e02119d5a7b4396c5a872582fddc8bd6d305a70a) Test cases for xfstests follow soon. CC: <stable@vger.kernel.org> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-02-14Btrfs: fix BUG_ON in btrfs_orphan_add() when delete unused block groupForrest Liu
Removing large amount of block group in a transaction may encounters BUG_ON() in btrfs_orphan_add(). That is because btrfs_orphan_reserve_metadata() will grab metadata reservation from transaction handle, and btrfs_delete_unused_bgs() didn't reserve metadata for trnasaction handle when delete unused block group. The problem can be reproduce by following script mntpath=/btrfs loopdev=/dev/loop0 filepath=/home/forrest/image umount $mntpath losetup -d $loopdev truncate --size 1000g $filepath losetup $loopdev $filepath mkfs.btrfs -f $loopdev mount $loopdev $mntpath for j in `seq 1 1 1000`; do fallocate -l 1g $mntpath/$j done # wait cleaner thread remove unused block group sleep 300 The call trace that results from the BUG_ON() is: [ 613.093084] ------------[ cut here ]------------ [ 613.097928] kernel BUG at fs/btrfs/inode.c:3142! [ 613.105855] invalid opcode: 0000 [#1] SMP [ 613.112702] Modules linked in: coretemp(E) crc32_pclmul(E) ghash_clmulni_intel(E) aesni_intel(E) snd_ens1371(E) snd_ac97_codec(E) aes_x86_64(E) lrw(E) gf128mul(E) glue_helper(E) ppdev(E) ac97_bus(E) ablk_helper(E) gameport(E) cryptd(E) snd_rawmidi(E) snd_seq_device(E) snd_pcm(E) vmw_balloon(E) snd_timer(E) snd(E) soundcore(E) serio_raw(E) vmwgfx(E) ttm(E) drm_kms_helper(E) drm(E) vmw_vmci(E) parport_pc(E) shpchp(E) i2c_piix4(E) mac_hid(E) lp(E) parport(E) btrfs(E) xor(E) raid6_pq(E) hid_generic(E) usbhid(E) hid(E) psmouse(E) ahci(E) libahci(E) e1000(E) mptspi(E) mptscsih(E) mptbase(E) floppy(E) vmw_pvscsi(E) vmxnet3(E) [ 613.144196] CPU: 0 PID: 1480 Comm: btrfs-cleaner Tainted: G E 3.19.0-rc7-custom #2 [ 613.148501] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013 [ 613.152694] task: ffff880035cdb1a0 ti: ffff880039cf4000 task.ti: ffff880039cf4000 [ 613.154969] RIP: 0010:[<ffffffffa01441c2>] [<ffffffffa01441c2>] btrfs_orphan_add+0x1d2/0x1e0 [btrfs] [ 613.157780] RSP: 0018:ffff880039cf7c48 EFLAGS: 00010286 [ 613.159560] RAX: 00000000ffffffe4 RBX: ffff88003bd981a0 RCX: ffff88003c9e4000 [ 613.161904] RDX: 0000000000002244 RSI: 0000000000040000 RDI: ffff88003c9e4138 [ 613.164264] RBP: ffff880039cf7c88 R08: 000060ffc0000850 R09: 0000000000000000 [ 613.166507] R10: ffff88003bc4b7a0 R11: ffffea0000eb6740 R12: ffff88003c9c0000 [ 613.168681] R13: ffff88003c102160 R14: ffff88003c9c0458 R15: 0000000000000001 [ 613.170932] FS: 0000000000000000(0000) GS:ffff88003f600000(0000) knlGS:0000000000000000 [ 613.173316] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 613.175227] CR2: 00007f6343537000 CR3: 0000000036329000 CR4: 00000000000407f0 [ 613.177554] Stack: [ 613.178712] ffff880039cf7c88 ffffffffa0182a54 ffff88003c9e4b04 ffff88003c9c7800 [ 613.181297] ffff88003bc4b7a0 ffff88003bd981a0 ffff88003c8db200 ffff88003c2fcc60 [ 613.183782] ffff880039cf7d18 ffffffffa012da97 ffff88003bc4b7a4 ffff88003bc4b7a0 [ 613.186171] Call Trace: [ 613.187493] [<ffffffffa0182a54>] ? lookup_free_space_inode+0x44/0x100 [btrfs] [ 613.189801] [<ffffffffa012da97>] btrfs_remove_block_group+0x137/0x740 [btrfs] [ 613.192126] [<ffffffffa0166912>] btrfs_remove_chunk+0x672/0x780 [btrfs] [ 613.194267] [<ffffffffa012e2ff>] btrfs_delete_unused_bgs+0x25f/0x280 [btrfs] [ 613.196567] [<ffffffffa0135e4c>] cleaner_kthread+0x12c/0x190 [btrfs] [ 613.198687] [<ffffffffa0135d20>] ? check_leaf+0x350/0x350 [btrfs] [ 613.200758] [<ffffffff8108f232>] kthread+0xd2/0xf0 [ 613.202616] [<ffffffff8108f160>] ? kthread_create_on_node+0x180/0x180 [ 613.204738] [<ffffffff8175dabc>] ret_from_fork+0x7c/0xb0 [ 613.206652] [<ffffffff8108f160>] ? kthread_create_on_node+0x180/0x180 [ 613.208741] Code: ff ff 0f 1f 80 00 00 00 00 89 45 c8 3e 80 63 80 fd 48 89 df e8 d0 23 fe ff 8b 45 c8 e9 14 ff ff ff b8 f4 ff ff ff e9 12 ff ff ff <0f> 0b 66 66 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 55 48 [ 613.216562] RIP [<ffffffffa01441c2>] btrfs_orphan_add+0x1d2/0x1e0 [btrfs] [ 613.218828] RSP <ffff880039cf7c48> [ 613.220382] ---[ end trace 71073106deb8a457 ]--- This patch replace btrfs_join_transaction() with btrfs_start_transaction() in btrfs_delete_unused_bgs() to revent BUG_ON() in btrfs_orphan_add() Signed-off-by: Forrest Liu <forrestl@synology.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-02-14Btrfs: account for large extents with enospcJosef Bacik
On our gluster boxes we stream large tar balls of backups onto our fses. With 160gb of ram this means we get really large contiguous ranges of dirty data, but the way our ENOSPC stuff works is that as long as it's contiguous we only hold metadata reservation for one extent. The problem is we limit our extents to 128mb, so we'll end up with at least 800 extents so our enospc accounting is quite a bit lower than what we need. To keep track of this make sure we increase outstanding_extents for every multiple of the max extent size so we can be sure to have enough reserved metadata space. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-02-14Btrfs: don't set and clear delalloc for O_DIRECT writesJosef Bacik
We do this to get the space accounting, but this is just needless churn on the io_tree, so just drop setting/clearing delalloc and just drop the reserved data space when we have a successfull allocation. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-02-14Btrfs: only adjust outstanding_extents when we do a short writeJosef Bacik
We have this weird dance where we always inc outstanding_extents when we do a O_DIRECT write, even if we allocate the entire range. To get around this we also drop the metadata space if we successfully write. This is an unnecessary dance, we only need to jack up outstanding_extents if we don't satisfy the entire range request in get_blocks_direct, otherwise we are good using our original reservation. So drop the unconditional inc and the drop of the metadata space that we have for the unconditional inc. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-02-14btrfs: Fix out-of-space bugZhao Lei
Btrfs will report NO_SPACE when we create and remove files for several times, and we can't write to filesystem until mount it again. Steps to reproduce: 1: Create a single-dev btrfs fs with default option 2: Write a file into it to take up most fs space 3: Delete above file 4: Wait about 100s to let chunk removed 5: goto 2 Script is like following: #!/bin/bash # Recommend 1.2G space, too large disk will make test slow DEV="/dev/sda16" MNT="/mnt/tmp" dev_size="$(lsblk -bn -o SIZE "$DEV")" || exit 2 file_size_m=$((dev_size * 75 / 100 / 1024 / 1024)) echo "Loop write ${file_size_m}M file on $((dev_size / 1024 / 1024))M dev" for ((i = 0; i < 10; i++)); do umount "$MNT" 2>/dev/null; done echo "mkfs $DEV" mkfs.btrfs -f "$DEV" >/dev/null || exit 2 echo "mount $DEV $MNT" mount "$DEV" "$MNT" || exit 2 for ((loop_i = 0; loop_i < 20; loop_i++)); do echo echo "loop $loop_i" echo "dd file..." cmd=(dd if=/dev/zero of="$MNT"/file0 bs=1M count="$file_size_m") "${cmd[@]}" 2>/dev/null || { # NO_SPACE error triggered echo "dd failed: ${cmd[*]}" exit 1 } echo "rm file..." rm -f "$MNT"/file0 || exit 2 for ((i = 0; i < 10; i++)); do df "$MNT" | tail -1 sleep 10 done done Reason: It is triggered by commit: 47ab2a6c689913db23ccae38349714edf8365e0a which is used to remove empty block groups automatically, but the reason is not in that patch. Code before works well because btrfs don't need to create and delete chunks so many times with high complexity. Above bug is caused by many reason, any of them can trigger it. Reason1: When we remove some continuous chunks but leave other chunks after, these disk space should be used by chunk-recreating, but in current code, only first create will successed. Fixed by Forrest Liu <forrestl@synology.com> in: Btrfs: fix find_free_dev_extent() malfunction in case device tree has hole Reason2: contains_pending_extent() return wrong value in calculation. Fixed by Forrest Liu <forrestl@synology.com> in: Btrfs: fix find_free_dev_extent() malfunction in case device tree has hole Reason3: btrfs_check_data_free_space() try to commit transaction and retry allocating chunk when the first allocating failed, but space_info->full is set in first allocating, and prevent second allocating in retry. Fixed in this patch by clear space_info->full in commit transaction. Tested for severial times by above script. Changelog v3->v4: use light weight int instead of atomic_t to record have_remove_bgs in transaction, suggested by: Josef Bacik <jbacik@fb.com> Changelog v2->v3: v2 fixed the bug by adding more commit-transaction, but we only need to reclaim space when we are really have no space for new chunk, noticed by: Filipe David Manana <fdmanana@gmail.com> Actually, our code already have this type of commit-and-retry, we only need to make it working with removed-bgs. v3 fixed the bug with above way. Changelog v1->v2: v1 will introduce a new bug when delete and create chunk in same disk space in same transaction, noticed by: Filipe David Manana <fdmanana@gmail.com> V2 fix this bug by commit transaction after remove block grops. Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Suggested-by: Filipe David Manana <fdmanana@gmail.com> Suggested-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-02-14Btrfs: scrub, fix sleep in atomic contextFilipe Manana
My previous patch "Btrfs: fix scrub race leading to use-after-free" introduced the possibility to sleep in an atomic context, which happens when the scrub_lock mutex is held at the time scrub_pending_bio_dec() is called - this function can be called under an atomic context. Chris ran into this in a debug kernel which gave the following trace: [ 1928.950319] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:621 [ 1928.967334] in_atomic(): 1, irqs_disabled(): 0, pid: 149670, name: fsstress [ 1928.981324] INFO: lockdep is turned off. [ 1928.989244] CPU: 24 PID: 149670 Comm: fsstress Tainted: G W 3.19.0-rc7-mason+ #41 [ 1929.006418] Hardware name: ZTSYSTEMS Echo Ridge T4 /A9DRPF-10D, BIOS 1.07 05/10/2012 [ 1929.022207] ffffffff81a22cf8 ffff881076e03b78 ffffffff816b8dd9 ffff881076e03b78 [ 1929.037267] ffff880d8e828710 ffff881076e03ba8 ffffffff810856c4 ffff881076e03bc8 [ 1929.052315] 0000000000000000 000000000000026d ffffffff81a22cf8 ffff881076e03bd8 [ 1929.067381] Call Trace: [ 1929.072344] <IRQ> [<ffffffff816b8dd9>] dump_stack+0x4f/0x6e [ 1929.083968] [<ffffffff810856c4>] ___might_sleep+0x174/0x230 [ 1929.095352] [<ffffffff810857d2>] __might_sleep+0x52/0x90 [ 1929.106223] [<ffffffff816bb68f>] mutex_lock_nested+0x2f/0x3b0 [ 1929.117951] [<ffffffff810ab37d>] ? trace_hardirqs_on+0xd/0x10 [ 1929.129708] [<ffffffffa05dc838>] scrub_pending_bio_dec+0x38/0x70 [btrfs] [ 1929.143370] [<ffffffffa05dd0e0>] scrub_parity_bio_endio+0x50/0x70 [btrfs] [ 1929.157191] [<ffffffff812fa603>] bio_endio+0x53/0xa0 [ 1929.167382] [<ffffffffa05f96bc>] rbio_orig_end_io+0x7c/0xa0 [btrfs] [ 1929.180161] [<ffffffffa05f97ba>] raid_write_parity_end_io+0x5a/0x80 [btrfs] [ 1929.194318] [<ffffffff812fa603>] bio_endio+0x53/0xa0 [ 1929.204496] [<ffffffff8130401b>] blk_update_request+0x1eb/0x450 [ 1929.216569] [<ffffffff81096e58>] ? trigger_load_balance+0x78/0x500 [ 1929.229176] [<ffffffff8144c74d>] scsi_end_request+0x3d/0x1f0 [ 1929.240740] [<ffffffff8144ccac>] scsi_io_completion+0xac/0x5b0 [ 1929.252654] [<ffffffff81441c50>] scsi_finish_command+0xf0/0x150 [ 1929.264725] [<ffffffff8144d317>] scsi_softirq_done+0x147/0x170 [ 1929.276635] [<ffffffff8130ace6>] blk_done_softirq+0x86/0xa0 [ 1929.288014] [<ffffffff8105d92e>] __do_softirq+0xde/0x600 [ 1929.298885] [<ffffffff8105df6d>] irq_exit+0xbd/0xd0 (...) Fix this by using a reference count on the scrub context structure instead of locking the scrub_lock mutex. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-02-14Btrfs: fix scheduler warning when syncing logFilipe Manana
We try to lock a mutex while the current task state is not TASK_RUNNING, which results in the following warning when CONFIG_DEBUG_LOCK_ALLOC=y: [30736.772501] ------------[ cut here ]------------ [30736.774545] WARNING: CPU: 9 PID: 19972 at kernel/sched/core.c:7300 __might_sleep+0x8b/0xa8() [30736.783453] do not call blocking ops when !TASK_RUNNING; state=2 set at [<ffffffff8107499b>] prepare_to_wait+0x43/0x89 [30736.786261] Modules linked in: dm_flakey dm_mod crc32c_generic btrfs xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop parport_pc psmouse parport pcspkr microcode serio_raw evdev processor thermal_sys i2c_piix4 i2c_core button ext4 crc16 jbd2 mbcache sg sr_mod cdrom sd_mod ata_generic virtio_scsi floppy ata_piix libata virtio_pci virtio_ring e1000 virtio scsi_mod [30736.794323] CPU: 9 PID: 19972 Comm: fsstress Not tainted 3.19.0-rc7-btrfs-next-5+ #1 [30736.795821] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014 [30736.798788] 0000000000000009 ffff88042743fbd8 ffffffff814248ed ffff88043d32f2d8 [30736.800504] ffff88042743fc28 ffff88042743fc18 ffffffff81045338 0000000000000001 [30736.802131] ffffffff81064514 ffffffff817c52d1 000000000000026d 0000000000000000 [30736.803676] Call Trace: [30736.804256] [<ffffffff814248ed>] dump_stack+0x4c/0x65 [30736.805245] [<ffffffff81045338>] warn_slowpath_common+0xa1/0xbb [30736.806360] [<ffffffff81064514>] ? __might_sleep+0x8b/0xa8 [30736.807391] [<ffffffff81045398>] warn_slowpath_fmt+0x46/0x48 [30736.808511] [<ffffffff8107499b>] ? prepare_to_wait+0x43/0x89 [30736.809620] [<ffffffff8107499b>] ? prepare_to_wait+0x43/0x89 [30736.810691] [<ffffffff81064514>] __might_sleep+0x8b/0xa8 [30736.811703] [<ffffffff81426eaf>] mutex_lock_nested+0x2f/0x3a0 [30736.812889] [<ffffffff8107bfa1>] ? trace_hardirqs_on_caller+0x18f/0x1ab [30736.814138] [<ffffffff8107bfca>] ? trace_hardirqs_on+0xd/0xf [30736.819878] [<ffffffffa038cfff>] wait_for_writer.isra.12+0x91/0xaa [btrfs] [30736.821260] [<ffffffff810748bd>] ? signal_pending_state+0x31/0x31 [30736.822410] [<ffffffffa0391f0a>] btrfs_sync_log+0x160/0x947 [btrfs] [30736.823574] [<ffffffff8107bfa1>] ? trace_hardirqs_on_caller+0x18f/0x1ab [30736.824847] [<ffffffff8107bfca>] ? trace_hardirqs_on+0xd/0xf [30736.825972] [<ffffffffa036e555>] btrfs_sync_file+0x2b0/0x319 [btrfs] [30736.827684] [<ffffffff8117901a>] vfs_fsync_range+0x21/0x23 [30736.828932] [<ffffffff81179038>] vfs_fsync+0x1c/0x1e [30736.829917] [<ffffffff8117928b>] do_fsync+0x34/0x4e [30736.830862] [<ffffffff811794b3>] SyS_fsync+0x10/0x14 [30736.831819] [<ffffffff8142a512>] system_call_fastpath+0x12/0x17 [30736.832982] ---[ end trace c0b57df60d32ae5c ]--- Fix this my acquiring the mutex after calling finish_wait(), which sets the task's state to TASK_RUNNING. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-02-12Merge branch 'for-3.20/bdi' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull backing device changes from Jens Axboe: "This contains a cleanup of how the backing device is handled, in preparation for a rework of the life time rules. In this part, the most important change is to split the unrelated nommu mmap flags from it, but also removing a backing_dev_info pointer from the address_space (and inode), and a cleanup of other various minor bits. Christoph did all the work here, I just fixed an oops with pages that have a swap backing. Arnd fixed a missing export, and Oleg killed the lustre backing_dev_info from staging. Last patch was from Al, unexporting parts that are now no longer needed outside" * 'for-3.20/bdi' of git://git.kernel.dk/linux-block: Make super_blocks and sb_lock static mtd: export new mtd_mmap_capabilities fs: make inode_to_bdi() handle NULL inode staging/lustre/llite: get rid of backing_dev_info fs: remove default_backing_dev_info fs: don't reassign dirty inodes to default_backing_dev_info nfs: don't call bdi_unregister ceph: remove call to bdi_unregister fs: remove mapping->backing_dev_info fs: export inode_to_bdi and use it in favor of mapping->backing_dev_info nilfs2: set up s_bdi like the generic mount_bdev code block_dev: get bdev inode bdi directly from the block device block_dev: only write bdev inode on close fs: introduce f_op->mmap_capabilities for nommu mmap support fs: kill BDI_CAP_SWAP_BACKED fs: deduplicate noop_backing_dev_info