summaryrefslogtreecommitdiff
path: root/fs/btrfs/tree-log.c
AgeCommit message (Collapse)Author
2013-08-09Btrfs: release both paths before logging dir/changed extentsJosef Bacik
The ceph guys tripped over this bug where we were still holding onto the original path that we used to copy the inode with when logging. This is based on Chris's fix which was reported to fix the problem. We need to drop the paths in two cases anyway so just move the drop up so that we don't have duplicate code. Thanks, Cc: stable@vger.kernel.org Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-06-14Btrfs: exclude logged extents before replying when we are mixedJosef Bacik
With non-mixed block groups we replay the logs before we're allowed to do any writes, so we get away with not pinning/removing the data extents until right when we replay them. However with mixed block groups we allocate out of the same pool, so we could easily allocate a metadata block that was logged in our tree log. To deal with this we just need to notice that we have mixed block groups and do the normal excluding/removal dance during the pin stage of the log replay and that way we don't allocate metadata blocks from areas we have logged data extents. With this patch we now pass xfstests generic/311 with mixed block groups turned on. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14Btrfs: merge pending IO for tree log write backMiao Xie
Before applying this patch, we flushed the log tree of the fs/file tree firstly, and then flushed the log root tree. It is ineffective, especially on the hard disk. This patch improved this problem by wrapping the above two flushes by the same blk_plug. By test, the performance of the sync write went up ~60%(2.9MB/s -> 4.6MB/s) on my scsi disk whose disk buffer was enabled. Test step: # mkfs.btrfs -f -m single <disk> # mount <disk> <mnt> # dd if=/dev/zero of=<mnt>/file0 bs=32K count=1024 oflag=sync Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14Btrfs: kill replicate code in replay_one_bufferLiu Bo
EXTREF is treated same as REF, so we can make the code tidy. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14Btrfs: cleanup the similar code of the fs root readMiao Xie
There are several functions whose code is similar, such as btrfs_find_last_root() btrfs_read_fs_root_no_radix() Besides that, some functions are invoked twice, it is unnecessary, for example, we are sure that all roots which is found in btrfs_find_orphan_roots() have their orphan items, so it is unnecessary to check the orphan item again. So cleanup it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06btrfs: make static code static & remove dead codeEric Sandeen
Big patch, but all it does is add statics to functions which are in fact static, then remove the associated dead-code fallout. removed functions: btrfs_iref_to_path() __btrfs_lookup_delayed_deletion_item() __btrfs_search_delayed_insertion_item() __btrfs_search_delayed_deletion_item() find_eb_for_page() btrfs_find_block_group() range_straddles_pages() extent_range_uptodate() btrfs_file_extent_length() btrfs_scrub_cancel_devid() btrfs_start_transaction_lflush() btrfs_print_tree() is left because it is used for debugging. btrfs_start_transaction_lflush() and btrfs_reada_detach() are left for symmetry. ulist.c functions are left, another patch will take care of those. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06Btrfs: remove almost all of the BUG()'s from tree-log.cJosef Bacik
There were a whole bunch and I was doing it for other things. I haven't tested these error paths but at the very least this is better than panicing. I've only left 2 BUG_ON()'s since they are logic errors and I want to replace them with a ASSERT framework that we can compile out for production users. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06Btrfs: deal with free space cache errors while replaying logJosef Bacik
So everybody who got hit by my fsync bug will still continue to hit this BUG_ON() in the free space cache, which is pretty heavy handed. So I took a file system that had this bug and fixed up all the BUG_ON()'s and leaks that popped up when I tried to mount a broken file system like this. With this patch we just fail to mount instead of panicing. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06Btrfs: check return value of commit when recovering logJosef Bacik
We need to check the return value of the commit in case something goes wrong, otherwise we could end up going down the line and doing more stuff (like orphan cleanup) before we notice we should have errored out. We need to do this before we free up the log_tree_root since the caller will handle all of that. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06Btrfs: don't try and free ebs twice in log replayJosef Bacik
This work is done by btrfs_free_path() anyway so there's no need for this duplicate work. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06Btrfs: remove unused argument of btrfs_extend_item()Tsutomu Itoh
Argument 'trans' is not used in btrfs_extend_item(). Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06Btrfs: cleanup of function where fixup_low_keys() is calledTsutomu Itoh
If argument 'trans' is unnecessary in the function where fixup_low_keys() is called, 'trans' is deleted. Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06Btrfs: fix bad extent loggingJosef Bacik
A user sent me a btrfs-image of a file system that was panicing on mount during the log recovery. I had originally thought these problems were from a bug in the free space cache code, but that was just a symptom of the problem. The problem is if your application does something like this [prealloc][prealloc][prealloc] the internal extent maps will merge those all together into one extent map, even though on disk they are 3 separate extents. So if you go to write into one of these ranges the extent map will be right since we use the physical extent when doing the write, but when we log the extents they will use the wrong sizes for the remainder prealloc space. If this doesn't happen to trip up the free space cache (which it won't in a lot of cases) then you will get bogus entries in your extent tree which will screw stuff up later. The data and such will still work, but everything else is broken. This patch fixes this by not allowing extents that are on the modified list to be merged. This has the side effect that we are no longer adding everything to the modified list all the time, which means we now have to call btrfs_drop_extents every time we log an extent into the tree. So this allows me to drop all this speciality code I was using to get around calling btrfs_drop_extents. With this patch the testcase I've created no longer creates a bogus file system after replaying the log. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06Btrfs: log ram bytes properlyJosef Bacik
When logging changed extents I was logging ram_bytes as the current length, which isn't correct, it's supposed to be the ram bytes of the original extent. This is for compression where even if we split the extent we need to know the ram bytes so when we uncompress the extent we know how big it will be. This was still working out right with compression for some reason but I think we were getting lucky. It was definitely off for prealloc which is why I noticed it, btrfsck was complaining about it. With this patch btrfsck no longer complains after a log replay. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06btrfs: Cleanup some redundant codes in btrfs_log_inode()Zhi Yong Wu
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-04-13Btrfs: make sure nbytes are right after log replayJosef Bacik
While trying to track down a tree log replay bug I noticed that fsck was always complaining about nbytes not being right for our fsynced file. That is because the new fsync stuff doesn't wait for ordered extents to complete, so the inodes nbytes are not necessarily updated properly when we log it. So to fix this we need to set nbytes to whatever it is on the inode that is on disk, so when we replay the extents we can just add the bytes that are being added as we replay the extent. This makes it work for the case that we have the wrong nbytes or the case that we logged everything and nbytes is actually correct. With this I'm no longer getting nbytes errors out of btrfsck. Cc: stable@vger.kernel.org Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-03-04Btrfs: use set_nlink if our i_nlink is 0Josef Bacik
We need to inc the nlink of deleted entries when running replay so we can do the unlink on the fs_root and get everything cleaned up and then have the orphan cleanup do the right thing. The problem is inc_nlink complains about this, even thought it still does the right thing. So use set_nlink() if our i_nlink is 0 to keep users from seeing the warnings during log replay. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-03-01Btrfs: delete inline extents when we find them during loggingJosef Bacik
Apparently when we do inline extents we allow the data to overlap the last chunk of the btrfs_file_extent_item, which means that we can possibly have a btrfs_file_extent_item that isn't actually as large as a btrfs_file_extent_item. This messes with us when we try to overwrite the extent when logging new extents since we expect for it to be the right size. To fix this just delete the item and try to do the insert again which will give us the proper sized btrfs_file_extent_item. This fixes a panic where map_private_extent_buffer would blow up because we're trying to write past the end of the leaf. Thanks, Cc: stable@vger.kernel.org Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-28Btrfs: fix memory leak of log rootsLiu Bo
When we abort a transaction while fsyncing, we'll skip freeing log roots part of committing a transaction, which leads to memory leak. This adds a 'free log roots' in putting super when no more users hold references on log roots, so it's safe and clean. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-26btrfs: cleanup for open-coded alignmentQu Wenruo
Though most of the btrfs codes are using ALIGN macro for page alignment, there are still some codes using open-coded alignment like the following: ------ u64 mask = ((u64)root->stripesize - 1); u64 ret = (val + mask) & ~mask; ------ Or even hidden one: ------ num_bytes = (end - start + blocksize) & ~(blocksize - 1); ------ Sometimes these open-coded alignment is not so easy to understand for newbie like me. This commit changes the open-coded alignment to the ALIGN macro for a better readability. Also there is a previous patch from David Sterba with similar changes, but the patch is for 3.2 kernel and seems not merged. http://www.spinics.net/lists/linux-btrfs/msg12747.html Cc: David Sterba <dave@jikos.cz> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-20btrfs: remove cache only arguments from defrag pathEric Sandeen
The entry point at the defrag ioctl always sets "cache only" to 0; the codepaths haven't run for a long time as far as I can tell. Chris says they're dead code, so remove them. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-20Btrfs: kill unused argument of btrfs_pin_extent_for_log_replayLiu Bo
Argument 'trans' is not used any more. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-20Btrfs: wait on ordered extents at the last possible momentJosef Bacik
Since we don't actually copy the extent information from the source tree in the fast case we don't need to wait for ordered io to be completed in order to fsync, we just need to wait for the io to be completed. So when we're logging our file just attach all of the ordered extents to the log, and then when the log syncs just wait for IO_DONE on the ordered extents and then write the super. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-01-24Btrfs: use right range to find checksum for compressed extentsLiu Bo
For compressed extents, the range of checksum is covered by disk length, and the disk length is different with ram length, so we need to use disk length instead to get us the right checksum. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-01-24Btrfs: do not allow logged extents to be merged or removedJosef Bacik
We drop the extent map tree lock while we're logging extents, so somebody could come in and merge another extent into this one and screw up our logging, or they could even remove us from the list which would keep us from logging the extent or freeing our ref on it, so we need to make sure to not clear LOGGING until after the extent is logged, and then we can merge it to adjacent extents. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-12-16Btrfs: use tokens where we can in the tree logJosef Bacik
If we are syncing over and over the overhead of doing all those maps in fill_inode_item and log_changed_extents really starts to hurt, so use map tokens so we can avoid all the extra mapping. Since the token maps from our offset to the end of the page make sure to set the first thing in the item first so we really only do one map. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-16Btrfs: log changed inodes based on the extent map treeJosef Bacik
We don't really need to copy extents from the source tree since we have all of the information already available to us in the extent_map tree. So instead just write the extents straight to the log tree and don't bother to copy the extent items from the source tree. Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-16Btrfs: don't bother copying if we're only logging the inodeJosef Bacik
We don't copy inode items anwyay, we just copy them straight into the log from the in memory inode. So if we know we're only logging the inode, don't bother dropping anything, just try to insert it and either if it succeeds or we get EEXIST we can update the inode item in the log and carry on. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-16Btrfs: only log the inode item if we can get away with itJosef Bacik
Currently we copy all the file information into the log, inode item, the refs, xattrs etc. Except most of this doesn't change from fsync to fsync, just the inode item changes. So set a flag if an xattr changes or a link is added, and otherwise only log the inode item. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-12Btrfs: fix missing log when BTRFS_INODE_NEEDS_FULL_SYNC is setMiao Xie
If we set BTRFS_INODE_NEEDS_FULL_SYNC, we should log all the extent, but now we forget to take it into account, and set a wrong max key, if so, we will skip the file extent metadata when doing logging. Fix it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-12Btrfs: fix unprotected extent map operation when logging file extentsMiao Xie
We forget to protect the modified_extents list, fix it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-12Btrfs: fix wrong file extent lengthMiao Xie
There are two types of the file extent - inline extent and regular extent, When we log file extents, we didn't take inline extent into account, fix it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-12Btrfs: do not log extents when we only log new namesLiu Bo
When we log new names, we need to log just enough to recreate the inode during log replay, and there is no need to log extents along with it. This actually fixes a bug revealed by xfstests 241, where it shows that we're logging some extents that have not updated metadata, so we don't get proper EXTENT_DATA items to be copied to log tree. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-10-12btrfs: Fix compilation with user namespace support enabledEric W. Biederman
When compiling with user namespace support btrfs fails like: fs/btrfs/tree-log.c: In function ‘fill_inode_item’: fs/btrfs/tree-log.c:2955:2: error: incompatible type for argument 3 of ‘btrfs_set_inode_uid’ fs/btrfs/ctree.h:2026:1: note: expected ‘u32’ but argument is of type ‘kuid_t’ fs/btrfs/tree-log.c:2956:2: error: incompatible type for argument 3 of ‘btrfs_set_inode_gid’ fs/btrfs/ctree.h:2027:1: note: expected ‘u32’ but argument is of type ‘kgid_t’ Fix this by using i_uid_read and i_gid_read in Cc: Chris Mason <chris.mason@fusionio.com> Cc: Josef Bacik <jbacik@fusionio.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2012-10-09btrfs: init ref_index to zero in add_inode_refChris Mason
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-10-09Btrfs: make filesystem read-only when submitting barrier failsStefan Behrens
So far the return code of barrier_all_devices() is ignored, which means that errors are ignored. The result can be a corrupt filesystem which is not consistent. This commit adds code to evaluate the return code of barrier_all_devices(). The normal btrfs_error() mechanism is used to switch the filesystem into read-only mode when errors are detected. In order to decide whether barrier_all_devices() should return error or success, the number of disks that are allowed to fail the barrier submission is calculated. This calculation accounts for the worst RAID level of metadata, system and data. If single, dup or RAID0 is in use, a single disk error is already considered to be fatal. Otherwise a single disk error is tolerated. The calculation of the number of disks that are tolerated to fail the barrier operation is performed when the filesystem gets mounted, when a balance operation is started and finished, and when devices are added or removed. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
2012-10-09Btrfs: don't bother committing delayed inode updates when fsyncingJosef Bacik
We can just copy the in memory inode into the tree log directly, no sense in updating the fs tree so we can copy it into the tree log tree. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-10-09Btrfs: be smarter about dropping things from the tree logJosef Bacik
When we truncate existing items in the tree log we've been searching for each individual item and removing them. This is unnecessary churn and searching, just keep track of the slot we are on and how many items we need to delete and delete them all at once. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-10-09Btrfs: don't lookup csums for prealloc extentsJosef Bacik
The tree logging stuff was looking up csums to copy over for prealloc extents which is just work we don't need to be doing. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-10-09Btrfs: cache extent state when writing out dirty metadata pagesJosef Bacik
Everytime we write out dirty pages we search for an offset in the tree, convert the bits in the state, and then when we wait we search for the offset again and clear the bits. So for every dirty range in the io tree we are doing 4 rb searches, which is suboptimal. With this patch we are only doing 2 searches for every cycle (modulo weird things happening). Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-10-09btrfs: extended inode refsMark Fasheh
This patch adds basic support for extended inode refs. This includes support for link and unlink of the refs, which basically gets us support for rename as well. Inode creation does not need changing - extended refs are only added after the ref array is full. Signed-off-by: Mark Fasheh <mfasheh@suse.de>
2012-10-08btrfs: improved readablity for add_inode_refJan Schmidt
Moved part of the code into a sub function and replaced most of the gotos by ifs, hoping that it will be easier to read now. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net> Signed-off-by: Mark Fasheh <mfasheh@suse.de>
2012-10-08Btrfs: handle not finding the extent exactly when logging changed extentsJosef Bacik
I started hitting warnings when running xfstest 68 in a loop because there were EM's that were not lined up properly with the physical extents. This is ok, if we do something like punch a hole or write to a preallocated space or something like that we can have an EM that doesn't cover the entire physical extent. So fix the tree logging stuff to cope with this case so we don't just commit the transaction. With this patch I no longer see the warnings from the tree logging code. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-10-04Btrfs: do not hold the write_lock on the extent tree while loggingJosef Bacik
Dave Sterba pointed out a sleeping while atomic bug while doing fsync. This is because I'm an idiot and didn't realize that rwlock's were spin locks, so we've been holding this thing while doing allocations and such which is not good. This patch fixes this by dropping the write lock before we do anything heavy and re-acquire it when it is done. We also need to take a ref on the em's in case their corresponding pages are evicted and mark them as being logged so that releasepage does not remove them and doesn't remove them from our local list. Thanks, Reported-by: Dave Sterba <dave@jikos.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-10-01Btrfs: fix unprotected ->log_batchMiao Xie
We forget to protect ->log_batch when syncing a file, this patch fix this problem by atomic operation. And ->log_batch is used to check if there are parallel sync operations or not, so it is unnecessary to reset it to 0 after the sync operation of the current log tree complete. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2012-10-01Btrfs: add hole punchingJosef Bacik
This patch adds hole punching via fallocate. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-10-01Btrfs: remove unused hint byte argument for btrfs_drop_extentsJosef Bacik
I audited all users of btrfs_drop_extents and found that nobody actually uses the hint_byte argument. I'm sure it was used for something at some point but it's not used now, and the way the pinning works the disk bytenr would never be immediately useful anyway so lets just remove it. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-10-01Btrfs: check if an inode has no checksum when logging itLiu Bo
This is based on Josef's "Btrfs: turbo charge fsync". If an inode is a BTRFS_INODE_NODATASUM one, we don't need to look for csum items any more. Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
2012-10-01Btrfs: fix a bug in checking whether a inode is already in logLiu Bo
This is based on Josef's "Btrfs: turbo charge fsync". The current btrfs checks if an inode is in log by comparing root's last_log_commit to inode's last_sub_trans[2]. But the problem is that this root->last_log_commit is shared among inodes. Say we have N inodes to be logged, after the first inode, root's last_log_commit is updated and the N-1 remained files will be skipped. This fixes the bug by keeping a local copy of root's last_log_commit inside each inode and this local copy will be maintained itself. [1]: we regard each log transaction as a subset of btrfs's transaction, i.e. sub_trans Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
2012-10-01Btrfs: improve fsync by filtering extents that we wantLiu Bo
This is based on Josef's "Btrfs: turbo charge fsync". The above Josef's patch performs very good in random sync write test, because we won't have too much extents to merge. However, it does not performs good on the test: dd if=/dev/zero of=foobar bs=4k count=12500 oflag=sync The reason is when we do sequencial sync write, we need to merge the current extent just with the previous one, so that we can get accumulated extents to log: A(4k) --> AA(8k) --> AAA(12k) --> AAAA(16k) ... So we'll have to flush more and more checksum into log tree, which is the bottleneck according to my tests. But we can avoid this by telling fsync the real extents that are needed to be logged. With this, I did the above dd sync write test (size=50m), w/o (orig) w/ (josef's) w/ (this) SATA 104KB/s 109KB/s 121KB/s ramdisk 1.5MB/s 1.5MB/s 10.7MB/s (613%) Signed-off-by: Liu Bo <bo.li.liu@oracle.com>