summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2014-03-10Btrfs: remove unneeded field / smaller extent_map structureFilipe Manana
We don't need to have an unsigned int field in the extent_map struct to tell us whether the extent map is in the inode's extent_map tree or not. We can use the rb_node struct field and the RB_CLEAR_NODE and RB_EMPTY_NODE macros to achieve the same task. This reduces sizeof(struct extent_map) from 152 bytes to 144 bytes (on a 64 bits system). Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10Btrfs: skip locking when searching commit rootWang Shilong
We won't change commit root, skip locking dance with commit root when walking backrefs, this can speed up btrfs send operations. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10Btrfs: wake up @scrub_pause_wait as much as we canWang Shilong
check if @scrubs_running=@scrubs_paused condition inside wait_event() is not an atomic operation which means we may inc/dec @scrub_running/ paused at any time. Let's wake up @scrub_pause_wait as much as we can to let commit transaction blocked less. An example below: Thread1 Thread2 |->scrub_blocked_if_needed() |->scrub_pending_trans_workers_inc |->increase @scrub_paused |->increase @scrub_running |->wake up scrub_pause_wait list |->scrub blocked |->increase @scrub_paused Thread3 is commiting transaction which is blocked at btrfs_scrub_pause(). So after Thread2 increase @scrub_paused, we meet the condition @scrub_paused=@scrub_running, but transaction will be still blocked until another calling to wake up @scrub_pause_wait. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10Btrfs: cancel scrub on transaction abortionWang Shilong
If we fail to commit transaction, we'd better cancel scrub operations. Suggested-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10Btrfs: device_replace: fix deadlock for nocow caseWang Shilong
commit cb7ab02156e4 cause a following deadlock found by xfstests,btrfs/011: Thread1 is commiting transaction which is blocked at btrfs_scrub_pause(). Thread2 is calling btrfs_file_aio_write() which has held inode's @i_mutex and commit transaction(blocked because Thread1 is committing transaction). Thread3 is copy_nocow_page worker which will also try to hold inode @i_mutex, so thread3 will wait Thread1 finished. Thread4 is waiting pending workers finished which will wait Thread3 finished. So the problem is like this: Thread1--->Thread4--->Thread3--->Thread2---->Thread1 Deadlock happens! we fix it by letting Thread1 go firstly, which means we won't block transaction commit while we are waiting pending workers finished. Reported-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10Btrfs: fix a possible deadlock between scrub and transaction committingWang Shilong
btrfs_scrub_continue() will be called when cleaning up transaction.However, this can only be called if btrfs_scrub_pause() is called before. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10btrfs: Use PTR_ERR_OR_ZEROSachin Kamat
PTR_RET is deprecated. Use PTR_ERR_OR_ZERO instead. While at it also include missing err.h header. Signed-off-by: Sachin Kamat <sachin.kamat@linaro.org> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10Btrfs: fix send issuing outdated paths for utimes, chown and chmodFilipe Manana
When doing an incremental send, if we had a directory pending a move/rename operation and none of its parents, except for the immediate parent, were pending a move/rename, after processing the directory's references, we would be issuing utimes, chown and chmod intructions against am outdated path - a path which matched the one in the parent root. This change also simplifies a bit the code that deals with building a path for a directory which has a move/rename operation delayed. Steps to reproduce: $ mkfs.btrfs -f /dev/sdb3 $ mount /dev/sdb3 /mnt/btrfs $ mkdir -p /mnt/btrfs/a/b/c/d/e $ mkdir /mnt/btrfs/a/b/c/f $ chmod 0777 /mnt/btrfs/a/b/c/d/e $ btrfs subvolume snapshot -r /mnt/btrfs /mnt/btrfs/snap1 $ btrfs send /mnt/btrfs/snap1 -f /tmp/base.send $ mv /mnt/btrfs/a/b/c/f /mnt/btrfs/a/b/f2 $ mv /mnt/btrfs/a/b/c/d/e /mnt/btrfs/a/b/f2/e2 $ mv /mnt/btrfs/a/b/c /mnt/btrfs/a/b/c2 $ mv /mnt/btrfs/a/b/c2/d /mnt/btrfs/a/b/c2/d2 $ chmod 0700 /mnt/btrfs/a/b/f2/e2 $ btrfs subvolume snapshot -r /mnt/btrfs /mnt/btrfs/snap2 $ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 -f /tmp/incremental.send $ umount /mnt/btrfs $ mkfs.btrfs -f /dev/sdb3 $ mount /dev/sdb3 /mnt/btrfs $ btrfs receive /mnt/btrfs -f /tmp/base.send $ btrfs receive /mnt/btrfs -f /tmp/incremental.send The second btrfs receive command failed with: ERROR: chmod a/b/c/d/e failed. No such file or directory A test case for xfstests follows. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10Btrfs: correctly determine if blocks are shared in btrfs_compare_treesFilipe Manana
Just comparing the pointers (logical disk addresses) of the btree nodes is not completely bullet proof, we have to check if their generation numbers match too. It is guaranteed that a COW operation will result in a block with a different logical disk address than the original block's address, but over time we can reuse that former logical disk address. For example, creating a 2Gb filesystem on a loop device, and having a script running in a loop always updating the access timestamp of a file, resulted in the same logical disk address being reused for the same fs btree block in about only 4 minutes. This could make us skip entire subtrees when doing an incremental send (which is currently the only user of btrfs_compare_trees). However the odds of getting 2 blocks at the same tree level, with the same logical disk address, equal first slot keys and different generations, should hopefully be very low. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10Btrfs: fix send attempting to rmdir non-empty directoriesFilipe Manana
The incremental send algorithm assumed that it was possible to issue a directory remove (rmdir) if the the inode number it was currently processing was greater than (or equal) to any inode that referenced the directory's inode. This wasn't a valid assumption because any such inode might be a child directory that is pending a move/rename operation, because it was moved into a directory that has a higher inode number and was moved/renamed too - in other words, the case the following commit addressed: 9f03740a956d7ac6a1b8f8c455da6fa5cae11c22 (Btrfs: fix infinite path build loops in incremental send) This made an incremental send issue an rmdir operation before the target directory was actually empty, which made btrfs receive fail. Therefore it needs to wait for all pending child directory inodes to be moved/renamed before sending an rmdir operation. Simple steps to reproduce this issue: $ mkfs.btrfs -f /dev/sdb3 $ mount /dev/sdb3 /mnt/btrfs $ mkdir -p /mnt/btrfs/a/b/c/x $ mkdir /mnt/btrfs/a/b/y $ btrfs subvolume snapshot -r /mnt/btrfs /mnt/btrfs/snap1 $ btrfs send /mnt/btrfs/snap1 -f /tmp/base.send $ mv /mnt/btrfs/a/b/y /mnt/btrfs/a/b/YY $ mv /mnt/btrfs/a/b/c/x /mnt/btrfs/a/b/YY $ rmdir /mnt/btrfs/a/b/c $ btrfs subvolume snapshot -r /mnt/btrfs /mnt/btrfs/snap2 $ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 -f /tmp/incremental.send $ umount /mnt/btrfs $ mkfs.btrfs -f /dev/sdb3 $ mount /dev/sdb3 /mnt/btrfs $ btrfs receive /mnt/btrfs -f /tmp/base.send $ btrfs receive /mnt/btrfs -f /tmp/incremental.send The second btrfs receive command failed with: ERROR: rmdir o259-6-0 failed. Directory not empty A test case for xfstests follows. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10Btrfs: send, don't send rmdir for same target multiple timesFilipe Manana
When doing an incremental send, if we delete a directory that has N > 1 hardlinks for the same file and that file has the highest inode number inside the directory contents, an incremental send would send N times an rmdir operation against the directory. This made the btrfs receive command fail on the second rmdir instruction, as the target directory didn't exist anymore. Steps to reproduce the issue: $ mkfs.btrfs -f /dev/sdb3 $ mount /dev/sdb3 /mnt/btrfs $ mkdir -p /mnt/btrfs/a/b/c $ echo 'ola mundo' > /mnt/btrfs/a/b/c/foo.txt $ ln /mnt/btrfs/a/b/c/foo.txt /mnt/btrfs/a/b/c/bar.txt $ btrfs subvolume snapshot -r /mnt/btrfs /mnt/btrfs/snap1 $ btrfs send /mnt/btrfs/snap1 -f /tmp/base.send $ rm -f /mnt/btrfs/a/b/c/foo.txt $ rm -f /mnt/btrfs/a/b/c/bar.txt $ rmdir /mnt/btrfs/a/b/c $ btrfs subvolume snapshot -r /mnt/btrfs /mnt/btrfs/snap2 $ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 -f /tmp/incremental.send $ umount /mnt/btrfs $ mkfs.btrfs -f /dev/sdb3 $ mount /dev/sdb3 /mnt/btrfs $ btrfs receive /mnt/btrfs -f /tmp/base.send $ btrfs receive /mnt/btrfs -f /tmp/incremental.send The second btrfs receive command failed with: ERROR: rmdir o259-6-0 failed. No such file or directory A test case for xfstests follows. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10Btrfs: incremental send, fix invalid path after dir renameFilipe Manana
This fixes yet one more case not caught by the commit titled: Btrfs: fix infinite path build loops in incremental send In this case, even before the initial full send, we have a directory which is a child of a directory with a higher inode number. Then we perform the initial send, and after we rename both the child and the parent, without moving them around. After doing these 2 renames, an incremental send sent a rename instruction for the child directory which contained an invalid "from" path (referenced the parent's old name, not the new one), which made the btrfs receive command fail. Steps to reproduce: $ mkfs.btrfs -f /dev/sdb3 $ mount /dev/sdb3 /mnt/btrfs $ mkdir -p /mnt/btrfs/a/b $ mkdir /mnt/btrfs/d $ mkdir /mnt/btrfs/a/b/c $ mv /mnt/btrfs/d /mnt/btrfs/a/b/c $ btrfs subvolume snapshot -r /mnt/btrfs /mnt/btrfs/snap1 $ btrfs send /mnt/btrfs/snap1 -f /tmp/base.send $ mv /mnt/btrfs/a/b/c /mnt/btrfs/a/b/x $ mv /mnt/btrfs/a/b/x/d /mnt/btrfs/a/b/x/y $ btrfs subvolume snapshot -r /mnt/btrfs /mnt/btrfs/snap2 $ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 -f /tmp/incremental.send $ umout /mnt/btrfs $ mkfs.btrfs -f /dev/sdb3 $ mount /dev/sdb3 /mnt/btrfs $ btrfs receive /mnt/btrfs -f /tmp/base.send $ btrfs receive /mnt/btrfs -f /tmp/incremental.send The second btrfs receive command failed with: "ERROR: rename a/b/c/d -> a/b/x/y failed. No such file or directory" A test case for xfstests follows. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10Btrfs: don't insert useless holes when punching beyond the inode's sizeFilipe Manana
If we punch beyond the size of an inode, we'll correctly remove any prealloc extents, but we'll also insert file extent items representing holes (disk bytenr == 0) that start with a key offset that lies beyond the inode's size and are not contiguous with the last file extent item. Example: $XFS_IO_PROG -f -c "truncate 118811" $SCRATCH_MNT/foo $XFS_IO_PROG -c "fpunch 582007 864596" $SCRATCH_MNT/foo $XFS_IO_PROG -c "pwrite -S 0x0d -b 39987 92267 39987" $SCRATCH_MNT/foo btrfs-debug-tree output: item 4 key (257 INODE_ITEM 0) itemoff 15885 itemsize 160 inode generation 6 transid 6 size 132254 block group 0 mode 100600 links 1 item 5 key (257 INODE_REF 256) itemoff 15872 itemsize 13 inode ref index 2 namelen 3 name: foo item 6 key (257 EXTENT_DATA 0) itemoff 15819 itemsize 53 extent data disk byte 0 nr 0 gen 6 extent data offset 0 nr 90112 ram 122880 extent compression 0 item 7 key (257 EXTENT_DATA 90112) itemoff 15766 itemsize 53 extent data disk byte 12845056 nr 4096 gen 6 extent data offset 0 nr 45056 ram 45056 extent compression 2 item 8 key (257 EXTENT_DATA 585728) itemoff 15713 itemsize 53 extent data disk byte 0 nr 0 gen 6 extent data offset 0 nr 860160 ram 860160 extent compression 0 The last extent item, which represents a hole, is useless as it lies beyond the inode's size. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10Btrfs: cleanup delayed-ref.c:find_ref_head()Filipe Manana
The argument last wasn't used, all callers supplied a NULL value for it. Also removed unnecessary intermediate storage of the result of key comparisons. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10Btrfs: remove unnecessary ref heads rb tree searchFilipe Manana
When we didn't find the exact ref head we were looking for, if return_bigger != 0 we set a new search key to match either the next node after the last one we found or the first one in the ref heads rb tree, and then did another full tree search. For both cases this ended up being pointless as we would end up returning an entry we already had before repeating the search. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10btrfs: wake up transaction thread upon remountJustin Maggard
Now that we can adjust the commit interval with a remount, we need to wake up the transaction thread or else he will continue to sleep until the previous transaction interval has elapsed before waking up. So, if we go from a large commit interval to something smaller, the transaction thread will not wake up until the large interval has expired. This also causes the cleaner thread to stay sleeping, since it gets woken up by the transaction thread. Fix it by simply waking up the transaction thread during a remount. Signed-off-by: Justin Maggard <jmaggard10@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10Btrfs: stop joining the log transaction if sync log failsMiao Xie
If the log sync fails, there is something wrong in the log tree, we should not continue to join the log transaction and log the metadata. What we should do is to do a full commit. This patch fixes this problem by setting ->last_trans_log_full_commit to the current transaction id, it will tell the tasks not to join the log transaction, and do a full commit. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10Btrfs: just wait or commit our own log sub-transactionMiao Xie
We might commit the log sub-transaction which didn't contain the metadata we logged. It was because we didn't record the log transid and just select the current log sub-transaction to commit, but the right one might be committed by the other task already. Actually, we needn't do anything and it is safe that we go back directly in this case. This patch improves the log sync by the above idea. We record the transid of the log sub-transaction in which we log the metadata, and the transid of the log sub-transaction we have committed. If the committed transid is >= the transid we record when logging the metadata, we just go back. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10Btrfs: fix skipped error handle when log sync failedMiao Xie
It is possible that many tasks sync the log tree at the same time, but only one task can do the sync work, the others will wait for it. But those wait tasks didn't get the result of the log sync, and returned 0 when they ended the wait. It caused those tasks skipped the error handle, and the serious problem was they told the users the file sync succeeded but in fact they failed. This patch fixes this problem by introducing a log context structure, we insert it into the a global list. When the sync fails, we will set the error number of every log context in the list, then the waiting tasks get the error number of the log context and handle the error if need. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10Btrfs: use signed integer instead of unsigned long integer for log transidMiao Xie
The log trans id is initialized to be 0 every time we create a log tree, and the log tree need be re-created after a new transaction is started, it means the log trans id is unlikely to be a huge number, so we can use signed integer instead of unsigned long integer to save a bit space. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10Btrfs: remove unnecessary memory barrier in btrfs_sync_log()Miao Xie
Mutex unlock implies certain memory barriers to make sure all the memory operation completes before the unlock, and the next mutex lock implies memory barriers to make sure the all the memory happens after the lock. So it is a full memory barrier(smp_mb), we needn't add memory barriers. Remove them. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10Btrfs: don't start the log transaction if the log tree init failsMiao Xie
The old code would start the log transaction even the log tree init failed, it was unnecessary. Fix it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10Btrfs: fix the skipped transaction commit during the file syncMiao Xie
We may abort the wait earlier if ->last_trans_log_full_commit was set to the current transaction id, at this case, we need commit the current transaction instead of the log sub-transaction. But the current code didn't tell the caller to do it (return 0, not -EAGAIN). Fix it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10Btrfs: use ACCESS_ONCE to prevent the optimize accesses to ↵Miao Xie
->last_trans_log_full_commit ->last_trans_log_full_commit may be changed by the other tasks without lock, so we need prevent the compiler from the optimize access just like tmp = fs_info->last_trans_log_full_commit if (tmp == ...) ... <do something> if (tmp == ...) ... In fact, we need get the new value of ->last_trans_log_full_commit during the second access. Fix it by ACCESS_ONCE(). Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10Btrfs: avoid warning bomb of btrfs_invalidate_inodesLiu Bo
So after transaction is aborted, we need to cleanup inode resources by calling btrfs_invalidate_inodes(), and btrfs_invalidate_inodes() hopes roots' refs to be zero in old times and sets a WARN_ON(), however, this is not always true within cleaning up transaction, so we get to detect transaction abortion and not warn at all. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10Btrfs: fix possible deadlock in btrfs_cleanup_transactionLiu Bo
[13654.480669] ====================================================== [13654.480905] [ INFO: possible circular locking dependency detected ] [13654.481003] 3.12.0+ #4 Tainted: G W O [13654.481060] ------------------------------------------------------- [13654.481060] btrfs-transacti/9347 is trying to acquire lock: [13654.481060] (&(&root->ordered_extent_lock)->rlock){+.+...}, at: [<ffffffffa02d30a1>] btrfs_cleanup_transaction+0x271/0x570 [btrfs] [13654.481060] but task is already holding lock: [13654.481060] (&(&fs_info->ordered_root_lock)->rlock){+.+...}, at: [<ffffffffa02d3015>] btrfs_cleanup_transaction+0x1e5/0x570 [btrfs] [13654.481060] which lock already depends on the new lock. [13654.481060] the existing dependency chain (in reverse order) is: [13654.481060] -> #1 (&(&fs_info->ordered_root_lock)->rlock){+.+...}: [13654.481060] [<ffffffff810c4103>] lock_acquire+0x93/0x130 [13654.481060] [<ffffffff81689991>] _raw_spin_lock+0x41/0x50 [13654.481060] [<ffffffffa02f011b>] __btrfs_add_ordered_extent+0x39b/0x450 [btrfs] [13654.481060] [<ffffffffa02f0202>] btrfs_add_ordered_extent+0x32/0x40 [btrfs] [13654.481060] [<ffffffffa02df6aa>] run_delalloc_nocow+0x78a/0x9d0 [btrfs] [13654.481060] [<ffffffffa02dfc0d>] run_delalloc_range+0x31d/0x390 [btrfs] [13654.481060] [<ffffffffa02f7c00>] __extent_writepage+0x310/0x780 [btrfs] [13654.481060] [<ffffffffa02f830a>] extent_write_cache_pages.isra.29.constprop.48+0x29a/0x410 [btrfs] [13654.481060] [<ffffffffa02f879d>] extent_writepages+0x4d/0x70 [btrfs] [13654.481060] [<ffffffffa02d9f68>] btrfs_writepages+0x28/0x30 [btrfs] [13654.481060] [<ffffffff8114be91>] do_writepages+0x21/0x50 [13654.481060] [<ffffffff81140d49>] __filemap_fdatawrite_range+0x59/0x60 [13654.481060] [<ffffffff81140e13>] filemap_fdatawrite_range+0x13/0x20 [13654.481060] [<ffffffffa02f1db9>] btrfs_wait_ordered_range+0x49/0x140 [btrfs] [13654.481060] [<ffffffffa0318fe2>] __btrfs_write_out_cache+0x682/0x8b0 [btrfs] [13654.481060] [<ffffffffa031952d>] btrfs_write_out_cache+0x8d/0xe0 [btrfs] [13654.481060] [<ffffffffa02c7083>] btrfs_write_dirty_block_groups+0x593/0x680 [btrfs] [13654.481060] [<ffffffffa0345307>] commit_cowonly_roots+0x14b/0x20d [btrfs] [13654.481060] [<ffffffffa02d7c1a>] btrfs_commit_transaction+0x43a/0x9d0 [btrfs] [13654.481060] [<ffffffffa030061a>] btrfs_create_uuid_tree+0x5a/0x100 [btrfs] [13654.481060] [<ffffffffa02d5a8a>] open_ctree+0x21da/0x2210 [btrfs] [13654.481060] [<ffffffffa02ab6fe>] btrfs_mount+0x68e/0x870 [btrfs] [13654.481060] [<ffffffff811b2409>] mount_fs+0x39/0x1b0 [13654.481060] [<ffffffff811cd653>] vfs_kern_mount+0x63/0xf0 [13654.481060] [<ffffffff811cfcce>] do_mount+0x23e/0xa90 [13654.481060] [<ffffffff811d05a3>] SyS_mount+0x83/0xc0 [13654.481060] [<ffffffff81692b52>] system_call_fastpath+0x16/0x1b [13654.481060] -> #0 (&(&root->ordered_extent_lock)->rlock){+.+...}: [13654.481060] [<ffffffff810c340a>] __lock_acquire+0x150a/0x1a70 [13654.481060] [<ffffffff810c4103>] lock_acquire+0x93/0x130 [13654.481060] [<ffffffff81689991>] _raw_spin_lock+0x41/0x50 [13654.481060] [<ffffffffa02d30a1>] btrfs_cleanup_transaction+0x271/0x570 [btrfs] [13654.481060] [<ffffffffa02d35ce>] transaction_kthread+0x22e/0x270 [btrfs] [13654.481060] [<ffffffff81079efa>] kthread+0xea/0xf0 [13654.481060] [<ffffffff81692aac>] ret_from_fork+0x7c/0xb0 [13654.481060] other info that might help us debug this: [13654.481060] Possible unsafe locking scenario: [13654.481060] CPU0 CPU1 [13654.481060] ---- ---- [13654.481060] lock(&(&fs_info->ordered_root_lock)->rlock); [13654.481060] lock(&(&root->ordered_extent_lock)->rlock); [13654.481060] lock(&(&fs_info->ordered_root_lock)->rlock); [13654.481060] lock(&(&root->ordered_extent_lock)->rlock); [13654.481060] *** DEADLOCK *** [...] ====================================================== btrfs_destroy_all_ordered_extents() gets &fs_info->ordered_root_lock __BEFORE__ acquiring &root->ordered_extent_lock, while btrfs_[add,remove]_ordered_extent() acquires &fs_info->ordered_root_lock __AFTER__ getting &root->ordered_extent_lock. This patch fixes the above problem. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10Btrfs: faster/more efficient insertion of file extent itemsFilipe David Borba Manana
This is an extension to my previous commit titled: "Btrfs: faster file extent item replace operations" (hash 1acae57b161ef1282f565ef907f72aeed0eb71d9) Instead of inserting the new file extent item if we deleted existing file extent items covering our target file range, also allow to insert the new file extent item if we didn't find any existing items to delete and replace_extent != 0, since in this case our caller would do another tree search to insert the new file extent item anyway, therefore just combine the two tree searches into a single one, saving cpu time, reducing lock contention and reducing btree node/leaf COW operations. This covers the case where applications keep doing tail append writes to files, which for example is the case of Apache CouchDB (its database and view index files are always open with O_APPEND). Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10btrfs: always choose work from prio_head firstStanislaw Gruszka
In case we do not refill, we can overwrite cur pointer from prio_head by one from not prioritized head, what looks as something that was not intended. This change make we always take works from prio_head first until it's not empty. Signed-off-by: Stanislaw Gruszka <stf_xl@wp.pl> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10Revert "Btrfs: remove transaction from btrfs send"Wang Shilong
This reverts commit 41ce9970a8a6a362ae8df145f7a03d789e9ef9d2. Previously i was thinking we can use readonly root's commit root safely while it is not true, readonly root may be cowed with the following cases. 1.snapshot send root will cow source root. 2.balance,device operations will also cow readonly send root to relocate. So i have two ideas to make us safe to use commit root. -->approach 1: make it protected by transaction and end transaction properly and we research next item from root node(see btrfs_search_slot_for_read()). -->approach 2: add another counter to local root structure to sync snapshot with send. and add a global counter to sync send with exclusive device operations. So with approach 2, send can use commit root safely, because we make sure send root can not be cowed during send. Unfortunately, it make codes *ugly* and more complex to maintain. To make snapshot and send exclusively, device operations and send operation exclusively with each other is a little confusing for common users. So why not drop into previous way. Cc: Josef Bacik <jbacik@fb.com> Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10Btrfs: skip readonly root for snapshot-aware defragmentWang Shilong
Btrfs send is assuming readonly root won't change, let's skip readonly root. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10Btrfs: switch to btrfs_previous_extent_item()Wang Shilong
Since we have introduced btrfs_previous_extent_item() to search previous extent item, just switch into it. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Reviewed-by: Filipe Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10Btrfs: skip submitting barrier for missing deviceHidetoshi Seto
I got an error on v3.13: BTRFS error (device sdf1) in write_all_supers:3378: errno=-5 IO failure (errors while submitting device barriers.) how to reproduce: > mkfs.btrfs -f -d raid1 /dev/sdf1 /dev/sdf2 > wipefs -a /dev/sdf2 > mount -o degraded /dev/sdf1 /mnt > btrfs balance start -f -sconvert=single -mconvert=single -dconvert=single /mnt The reason of the error is that barrier_all_devices() failed to submit barrier to the missing device. However it is clear that we cannot do anything on missing device, and also it is not necessary to care chunks on the missing device. This patch stops sending/waiting barrier if device is missing. Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Cc: <stable@vger.kernel.org> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10Btrfs: unlock extent and pages on error in cow_file_rangeJosef Bacik
When I converted the BUG_ON() for the free_space_cache_inode in cow_file_range I made it so we just return an error instead of unlocking all of our various stuff. This is a mistake and causes us to hang when we run into this. This patch fixes this problem. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10Btrfs: balance delayed inode updatesJosef Bacik
While trying to reproduce a delayed ref problem I noticed the box kept falling over using all 80gb of my ram with btrfs_inode's and btrfs_delayed_node's. Turns out this is because we only throttle delayed inode updates in btrfs_dirty_inode, which doesn't actually get called that often, especially when all you are doing is creating a bunch of files. So balance delayed inode updates everytime we create a new inode. With this patch we no longer use up all of our ram with delayed inode updates. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10btrfs: add simple debugfs interfaceDavid Sterba
Help during debugging to export various interesting infromation and tunables without the need of extra mount options or ioctls. Usage: * declare your variable in sysfs.h, and include where you need it * define the variable in sysfs.c and make it visible via debugfs_create_TYPE Depends on CONFIG_DEBUG_FS. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10btrfs: send: lower memory requirements in common caseDavid Sterba
The fs_path structure uses an inline buffer and falls back to a chain of allocations, but vmalloc is not necessary because PATH_MAX fits into PAGE_SIZE. The size of fs_path has been reduced to 256 bytes from PAGE_SIZE, usually 4k. Experimental measurements show that most paths on a single filesystem do not exceed 200 bytes, and these get stored into the inline buffer directly, which is now 230 bytes. Longer paths are kmalloced when needed. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10Btrfs: make some tree searches in send.c more efficientFilipe David Borba Manana
We have this pattern where we do search for a contiguous group of items in a tree and everytime we find an item, we process it, then we release our path, increment the offset of the search key, do another full tree search and repeat these steps until a tree search can't find more items we're interested in. Instead of doing these full tree searches after processing each item, just process the next item/slot in our leaf and don't release the path. Since all these trees are read only and we always use the commit root for a search and skip node/leaf locks, we're not affecting concurrency on the trees. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10Btrfs: use right extent item position in send when finding extent clonesFilipe David Borba Manana
This was a leftover from the commit: 74dd17fbe3d65829e75d84f00a9525b2ace93998 (Btrfs: fix btrfs send for inline items and compression) Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10btrfs: send: remove BUG_ON from name_cache_deleteDavid Sterba
If cleaning the name cache fails, we could try to proceed at the cost of some memory leak. This is not expected to happen often. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10btrfs: send: remove BUG from process_all_refsDavid Sterba
There are only 2 static callers, the BUG would normally be never reached, but let's be nice. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10btrfs: send: squeeze bitfilelds in fs_pathDavid Sterba
We know that buf_len is at most PATH_MAX, 4k, and can merge it with the reversed member. This saves 3 bytes in favor of inline_buf. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10btrfs: send: remove virtual_mem member from fs_pathDavid Sterba
We don't need to keep track of that, it's available via is_vmalloc_addr. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10btrfs: send: remove prepared member from fs_pathDavid Sterba
The member is used only to return value back from fs_path_prepare_for_add, we can do it locally and save 8 bytes for the inline_buf path. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10btrfs: send: replace check with an assert in gen_unique_nameDavid Sterba
The buffer passed to snprintf can hold the fully expanded format string, 64 = 3x largest ULL + 3x char + trailing null. I don't think that removing the check entirely is a good idea, hence the ASSERT. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10Btrfs: more send support for parent/child dir relationship inversionFilipe David Borba Manana
The commit titled "Btrfs: fix infinite path build loops in incremental send" didn't cover a particular case where the parent-child relationship inversion of directories doesn't imply a rename of the new parent directory. This was due to a simple logic mistake, a logical and instead of a logical or. Steps to reproduce: $ mkfs.btrfs -f /dev/sdb3 $ mount /dev/sdb3 /mnt/btrfs $ mkdir -p /mnt/btrfs/a/b/bar1/bar2/bar3/bar4 $ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap1 $ mv /mnt/btrfs/a/b/bar1/bar2/bar3/bar4 /mnt/btrfs/a/b/k44 $ mv /mnt/btrfs/a/b/bar1/bar2/bar3 /mnt/btrfs/a/b/k44 $ mv /mnt/btrfs/a/b/bar1/bar2 /mnt/btrfs/a/b/k44/bar3 $ mv /mnt/btrfs/a/b/bar1 /mnt/btrfs/a/b/k44/bar3/bar2/k11 $ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap2 $ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 > /tmp/incremental.send A patch to update the test btrfs/030 from xfstests, so that it covers this case, will be submitted soon. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10Btrfs: fix send dealing with file renames and directory movesFilipe David Borba Manana
This fixes a case that the commit titled: Btrfs: fix infinite path build loops in incremental send didn't cover. If the parent-child relationship between 2 directories is inverted, both get renamed, and the former parent has a file that got renamed too (but remains a child of that directory), the incremental send operation would use the file's old path after sending an unlink operation for that old path, causing receive to fail on future operations like changing owner, permissions or utimes of the corresponding inode. This is not a regression from the commit mentioned before, as without that commit we would fall into the issues that commit fixed, so it's just one case that wasn't covered before. Simple steps to reproduce this issue are: $ mkfs.btrfs -f /dev/sdb3 $ mount /dev/sdb3 /mnt/btrfs $ mkdir -p /mnt/btrfs/a/b/c/d $ touch /mnt/btrfs/a/b/c/d/file $ mkdir -p /mnt/btrfs/a/b/x $ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap1 $ mv /mnt/btrfs/a/b/x /mnt/btrfs/a/b/c/x2 $ mv /mnt/btrfs/a/b/c/d /mnt/btrfs/a/b/c/x2/d2 $ mv /mnt/btrfs/a/b/c/x2/d2/file /mnt/btrfs/a/b/c/x2/d2/file2 $ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap2 $ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 > /tmp/incremental.send A patch to update the test btrfs/030 from xfstests, so that it covers this case, will be submitted soon. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10Btrfs: only add roots if necessary in find_parent_nodes()Wang Shilong
find_all_leafs() dosen't need add all roots actually, add roots only if we need, this can avoid unnecessary ulist dance. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10btrfs: Fix 32/64-bit problem with BTRFS_SET_RECEIVED_SUBVOL ioctlHugo Mills
The structure for BTRFS_SET_RECEIVED_IOCTL packs differently on 32-bit and 64-bit systems. This means that it is impossible to use btrfs receive on a system with a 64-bit kernel and 32-bit userspace, because the structure size (and hence the ioctl number) is different. This patch adds a compatibility structure and ioctl to deal with the above case. Signed-off-by: Hugo Mills <hugo@carfax.org.uk> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10Btrfs: add missing error check in incremental sendFilipe David Borba Manana
Function wait_for_parent_move() returns negative value if an error happened, 0 if we don't need to wait for the parent's move, and 1 if the wait is needed. Before this change an error return value was being treated like the return value 1, which was not correct. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10Btrfs: fix use-after-free in the finishing procedure of the device replaceMiao Xie
During device replace test, we hit a null pointer deference (It was very easy to reproduce it by running xfstests' btrfs/011 on the devices with the virtio scsi driver). There were two bugs that caused this problem: - We might allocate new chunks on the replaced device after we updated the mapping tree. And we forgot to replace the source device in those mapping of the new chunks. - We might get the mapping information which including the source device before the mapping information update. And then submit the bio which was based on that mapping information after we freed the source device. For the first bug, we can fix it by doing mapping tree update and source device remove in the same context of the chunk mutex. The chunk mutex is used to protect the allocable device list, the above method can avoid the new chunk allocation, and after we remove the source device, all the new chunks will be allocated on the new device. So it can fix the first bug. For the second bug, we need make sure all flighting bios are finished and no new bios are produced during we are removing the source device. To fix this problem, we introduced a global @bio_counter, we not only inc/dec @bio_counter outsize of map_blocks, but also inc it before submitting bio and dec @bio_counter when ending bios. Since Raid56 is a little different and device replace dosen't support raid56 yet, it is not addressed in the patch and I add comments to make sure we will fix it in the future. Reported-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>