summaryrefslogtreecommitdiff
path: root/fs/btrfs/file.c
AgeCommit message (Collapse)Author
2020-12-09btrfs: disable fallocate in ZONED modeNaohiro Aota
fallocate() is implemented by reserving actual extent instead of reservations. This can result in exposing the sequential write constraint of host-managed zoned block devices to the application, which would break the POSIX semantic for the fallocated file. To avoid this, report fallocate() as not supported when in ZONED mode for now. In the future, we may be able to implement "in-memory" fallocate() in ZONED mode by utilizing space_info->bytes_may_use or similar, so this returns EOPNOTSUPP. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08btrfs: make btrfs_cont_expand take btrfs_inodeNikolay Borisov
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08btrfs: make btrfs_truncate_block take btrfs_inodeNikolay Borisov
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08btrfs: make btrfs_insert_replace_extent take btrfs_inodeNikolay Borisov
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08btrfs: make find_first_non_hole take btrfs_inodeNikolay Borisov
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08btrfs: make btrfs_update_inode take btrfs_inodeNikolay Borisov
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08btrfs: make btrfs_inode_safe_disk_i_size_write take btrfs_inodeNikolay Borisov
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08btrfs: update the number of bytes used by an inode atomicallyFilipe Manana
There are several occasions where we do not update the inode's number of used bytes atomically, resulting in a concurrent stat(2) syscall to report a value of used blocks that does not correspond to a valid value, that is, a value that does not match neither what we had before the operation nor what we get after the operation completes. In extreme cases it can result in stat(2) reporting zero used blocks, which can cause problems for some userspace tools where they can consider a file with a non-zero size and zero used blocks as completely sparse and skip reading data, as reported/discussed a long time ago in some threads like the following: https://lists.gnu.org/archive/html/bug-tar/2016-07/msg00001.html The cases where this can happen are the following: -> Case 1 If we do a write (buffered or direct IO) against a file region for which there is already an allocated extent (or multiple extents), then we have a short time window where we can report a number of used blocks to stat(2) that does not take into account the file region being overwritten. This short time window happens when completing the ordered extent(s). This happens because when we drop the extents in the write range we decrement the inode's number of bytes and later on when we insert the new extent(s) we increment the number of bytes in the inode, resulting in a short time window where a stat(2) syscall can get an incorrect number of used blocks. If we do writes that overwrite an entire file, then we have a short time window where we report 0 used blocks to stat(2). Example reproducer: $ cat reproducer-1.sh #!/bin/bash MNT=/mnt/sdi DEV=/dev/sdi stat_loop() { trap "wait; exit" SIGTERM local filepath=$1 local expected=$2 local got while :; do got=$(stat -c %b $filepath) if [ $got -ne $expected ]; then echo -n "ERROR: unexpected used blocks" echo " (got: $got expected: $expected)" fi done } mkfs.btrfs -f $DEV > /dev/null # mkfs.xfs -f $DEV > /dev/null # mkfs.ext4 -F $DEV > /dev/null # mkfs.f2fs -f $DEV > /dev/null # mkfs.reiserfs -f $DEV > /dev/null mount $DEV $MNT xfs_io -f -s -c "pwrite -b 64K 0 64K" $MNT/foobar >/dev/null expected=$(stat -c %b $MNT/foobar) # Create a process to keep calling stat(2) on the file and see if the # reported number of blocks used (disk space used) changes, it should # not because we are not increasing the file size nor punching holes. stat_loop $MNT/foobar $expected & loop_pid=$! for ((i = 0; i < 50000; i++)); do xfs_io -s -c "pwrite -b 64K 0 64K" $MNT/foobar >/dev/null done kill $loop_pid &> /dev/null wait umount $DEV $ ./reproducer-1.sh ERROR: unexpected used blocks (got: 0 expected: 128) ERROR: unexpected used blocks (got: 0 expected: 128) (...) Note that since this is a short time window where the race can happen, the reproducer may not be able to always trigger the bug in one run, or it may trigger it multiple times. -> Case 2 If we do a buffered write against a file region that does not have any allocated extents, like a hole or beyond EOF, then during ordered extent completion we have a short time window where a concurrent stat(2) syscall can report a number of used blocks that does not correspond to the value before or after the write operation, a value that is actually larger than the value after the write completes. This happens because once we start a buffered write into an unallocated file range we increment the inode's 'new_delalloc_bytes', to make sure any stat(2) call gets a correct used blocks value before delalloc is flushed and completes. However at ordered extent completion, after we inserted the new extent, we increment the inode's number of bytes used with the size of the new extent, and only later, when clearing the range in the inode's iotree, we decrement the inode's 'new_delalloc_bytes' counter with the size of the extent. So this results in a short time window where a concurrent stat(2) syscall can report a number of used blocks that accounts for the new extent twice. Example reproducer: $ cat reproducer-2.sh #!/bin/bash MNT=/mnt/sdi DEV=/dev/sdi stat_loop() { trap "wait; exit" SIGTERM local filepath=$1 local expected=$2 local got while :; do got=$(stat -c %b $filepath) if [ $got -ne $expected ]; then echo -n "ERROR: unexpected used blocks" echo " (got: $got expected: $expected)" fi done } mkfs.btrfs -f $DEV > /dev/null # mkfs.xfs -f $DEV > /dev/null # mkfs.ext4 -F $DEV > /dev/null # mkfs.f2fs -f $DEV > /dev/null # mkfs.reiserfs -f $DEV > /dev/null mount $DEV $MNT touch $MNT/foobar write_size=$((64 * 1024)) for ((i = 0; i < 16384; i++)); do offset=$(($i * $write_size)) xfs_io -c "pwrite -S 0xab $offset $write_size" $MNT/foobar >/dev/null blocks_used=$(stat -c %b $MNT/foobar) # Fsync the file to trigger writeback and keep calling stat(2) on it # to see if the number of blocks used changes. stat_loop $MNT/foobar $blocks_used & loop_pid=$! xfs_io -c "fsync" $MNT/foobar kill $loop_pid &> /dev/null wait $loop_pid done umount $DEV $ ./reproducer-2.sh ERROR: unexpected used blocks (got: 265472 expected: 265344) ERROR: unexpected used blocks (got: 284032 expected: 283904) (...) Note that since this is a short time window where the race can happen, the reproducer may not be able to always trigger the bug in one run, or it may trigger it multiple times. -> Case 3 Another case where such problems happen is during other operations that replace extents in a file range with other extents. Those operations are extent cloning, deduplication and fallocate's zero range operation. The cause of the problem is similar to the first case. When we drop the extents from a range, we decrement the inode's number of bytes, and later on, after inserting the new extents we increment it. Since this is not done atomically, a concurrent stat(2) call can see and return a number of used blocks that is smaller than it should be, does not match the number of used blocks before or after the clone/deduplication/zero operation. Like for the first case, when doing a clone, deduplication or zero range operation against an entire file, we end up having a time window where we can report 0 used blocks to a stat(2) call. Example reproducer: $ cat reproducer-3.sh #!/bin/bash MNT=/mnt/sdi DEV=/dev/sdi mkfs.btrfs -f $DEV > /dev/null # mkfs.xfs -f -m reflink=1 $DEV > /dev/null mount $DEV $MNT extent_size=$((64 * 1024)) num_extents=16384 file_size=$(($extent_size * $num_extents)) # File foo has many small extents. xfs_io -f -s -c "pwrite -S 0xab -b $extent_size 0 $file_size" $MNT/foo \ > /dev/null # File bar has much less extents and has exactly the same data as foo. xfs_io -f -c "pwrite -S 0xab 0 $file_size" $MNT/bar > /dev/null expected=$(stat -c %b $MNT/foo) # Now deduplicate bar into foo. While the deduplication is in progres, # the number of used blocks/file size reported by stat should not change xfs_io -c "dedupe $MNT/bar 0 0 $file_size" $MNT/foo > /dev/null & dedupe_pid=$! while [ -n "$(ps -p $dedupe_pid -o pid=)" ]; do used=$(stat -c %b $MNT/foo) if [ $used -ne $expected ]; then echo "Unexpected blocks used: $used (expected: $expected)" fi done umount $DEV $ ./reproducer-3.sh Unexpected blocks used: 2076800 (expected: 2097152) Unexpected blocks used: 2097024 (expected: 2097152) Unexpected blocks used: 2079872 (expected: 2097152) (...) Note that since this is a short time window where the race can happen, the reproducer may not be able to always trigger the bug in one run, or it may trigger it multiple times. So fix this by: 1) Making btrfs_drop_extents() not decrement the VFS inode's number of bytes, and instead return the number of bytes; 2) Making any code that drops extents and adds new extents update the inode's number of bytes atomically, while holding the btrfs inode's spinlock, which is also used by the stat(2) callback to get the inode's number of bytes; 3) For ranges in the inode's iotree that are marked as 'delalloc new', corresponding to previously unallocated ranges, increment the inode's number of bytes when clearing the 'delalloc new' bit from the range, in the same critical section that decrements the inode's 'new_delalloc_bytes' counter, delimited by the btrfs inode's spinlock. An alternative would be to have btrfs_getattr() wait for any IO (ordered extents in progress) and locking the whole range (0 to (u64)-1) while it it computes the number of blocks used. But that would mean blocking stat(2), which is a very used syscall and expected to be fast, waiting for writes, clone/dedupe, fallocate, page reads, fiemap, etc. CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08btrfs: refactor btrfs_drop_extents() to make it easier to extendFilipe Manana
There are many arguments for __btrfs_drop_extents() and its wrapper btrfs_drop_extents(), which makes it hard to add more arguments to it and requires changing every caller. I have added a couple myself back in 2014 commit 1acae57b161e ("Btrfs: faster file extent item replace operations") and therefore know firsthand that it is a bit cumbersome to add additional arguments to these functions. Since I will need to add more arguments in a subsequent bug fix, this change is preparatory work and adds a data structure that holds all the arguments, for both input and output, that are passed to this function, with some comments in the structure's definition mentioning what each field is and how it relates to other fields. Callers of this function need only to zero out the content of the structure and setup only the fields they need. This also removes the need to have both __btrfs_drop_extents() and btrfs_drop_extents(), so now we have a single function named btrfs_drop_extents() that takes a pointer to this new data structure (struct btrfs_drop_extents_args). Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08btrfs: locking: remove all the blocking helpersJosef Bacik
Now that we're using a rw_semaphore we no longer need to indicate if a lock is blocking or not, nor do we need to flip the entire path from blocking to spinning. Remove these helpers and all the places they are called. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08btrfs: replace s_blocksize_bits with fs_info::sectorsize_bitsDavid Sterba
The value of super_block::s_blocksize_bits is the same as fs_info::sectorsize_bits, but we don't need to do the extra dereferences in many functions and storing the bits as u32 (in fs_info) generates shorter assembly. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08btrfs: remove dio iomap DSYNC workaroundGoldwyn Rodrigues
This effectively reverts 09745ff88d93 ("btrfs: dio iomap DSYNC workaround") now that the iomap API has been updated to allow iomap_dio_complete() not to be called under i_rwsem anymore. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08btrfs: call iomap_dio_complete() without inode_lockGoldwyn Rodrigues
If direct writes are called with O_DIRECT | O_DSYNC, it will result in a deadlock because iomap_dio_rw() is called under i_rwsem which calls: iomap_dio_complete() generic_write_sync() btrfs_sync_file() btrfs_sync_file() requires i_rwsem, so call __iomap_dio_rw() with the i_rwsem locked, and call iomap_dio_complete() after unlocking i_rwsem. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08btrfs: remove btrfs_inode::dio_semGoldwyn Rodrigues
The inode dio_sem can be eliminated because all DIO synchronization is now performed through inode->i_rwsem that provides the same guarantees. This reduces btrfs_inode size by 40 bytes. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08btrfs: use shared lock for direct writes within EOFGoldwyn Rodrigues
Direct writes within EOF are safe to be performed with inode shared lock to improve parallelization with other direct writes or reads because EOF is not changed and there is no race with truncate(). Direct reads are already performed under shared inode lock. This patch is precursor to removing btrfs_inode->dio_sem. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08btrfs: push inode locking and unlocking into buffered/direct writeGoldwyn Rodrigues
Push inode locking and unlocking closer to where we perform the I/O. For this we need to move the write checks inside the respective functions as well. pos is evaluated after generic_write_checks because O_APPEND can change iocb->ki_pos. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08btrfs: introduce btrfs_inode_lock()/unlock()Goldwyn Rodrigues
btrfs_inode_lock/unlock() are wrappers around inode locks, separating the type of lock and actual locking. - 0 - default, exclusive lock - BTRFS_ILOCK_SHARED - for shared locks, for possible parallel DIO - BTRFS_ILOCK_TRY - for the RWF_NOWAIT sequence The bits SHARED and TRY can be combined together. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08btrfs: introduce btrfs_write_check()Goldwyn Rodrigues
btrfs_write_check() checks write parameters in one place before beginning a write. This does away with inode_unlock() after every check. In the later patches, it will help push inode_lock/unlock() in buffered and direct write functions respectively. generic_write_checks needs to be called before as it could truncate iov_iter and its return used as count. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08btrfs: check FS error state bit early during writeGoldwyn Rodrigues
fs_info::fs_state is a filesystem bit check as opposed to inode and can be performed before we begin with write checks. This eliminates inode lock/unlock in case the error bit is set. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08btrfs: move pos increment and pagecache extension to btrfs_buffered_writeGoldwyn Rodrigues
While we do this, correct the call to pagecache_isize_extended: - pagecache_isize_extended needs to be called to the start of the write as opposed to i_size - we don't need to check range before the call, this is done in the function Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08btrfs: split btrfs_direct_IO to read and writeGoldwyn Rodrigues
The read and write DIO don't have anything in common except for the call to iomap_dio_rw. Extract the write call into a new function to get rid of conditional statements for direct write. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08btrfs: set EXTENT_NORESERVE bits side btrfs_dirty_pages()Goldwyn Rodrigues
Set the extent bits EXTENT_NORESERVE inside btrfs_dirty_pages() as opposed to calling set_extent_bits again later. Fold check for written length within the function. Note: EXTENT_NORESERVE is set before unlocking extents. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08btrfs: use round_down while calculating start position in btrfs_dirty_pages()Goldwyn Rodrigues
round_down looks prettier than the bit mask operations. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08btrfs: calculate num_pages, reserve_bytes once in btrfs_buffered_writeGoldwyn Rodrigues
write_bytes can change in btrfs_check_nocow_lock(). Calculate variables such as num_pages and reserve_bytes once we are sure of the value of write_bytes so there is no need to re-calculate. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-11-13btrfs: fix missing delalloc new bit for new delalloc rangesFilipe Manana
When doing a buffered write, through one of the write family syscalls, we look for ranges which currently don't have allocated extents and set the 'delalloc new' bit on them, so that we can report a correct number of used blocks to the stat(2) syscall until delalloc is flushed and ordered extents complete. However there are a few other places where we can do a buffered write against a range that is mapped to a hole (no extent allocated) and where we do not set the 'new delalloc' bit. Those places are: - Doing a memory mapped write against a hole; - Cloning an inline extent into a hole starting at file offset 0; - Calling btrfs_cont_expand() when the i_size of the file is not aligned to the sector size and is located in a hole. For example when cloning to a destination offset beyond EOF. So after such cases, until the corresponding delalloc range is flushed and the respective ordered extents complete, we can report an incorrect number of blocks used through the stat(2) syscall. In some cases we can end up reporting 0 used blocks to stat(2), which is a particular bad value to report as it may mislead tools to think a file is completely sparse when its i_size is not zero, making them skip reading any data, an undesired consequence for tools such as archivers and other backup tools, as reported a long time ago in the following thread (and other past threads): https://lists.gnu.org/archive/html/bug-tar/2016-07/msg00001.html Example reproducer: $ cat reproducer.sh #!/bin/bash MNT=/mnt/sdi DEV=/dev/sdi mkfs.btrfs -f $DEV > /dev/null # mkfs.xfs -f $DEV > /dev/null # mkfs.ext4 -F $DEV > /dev/null # mkfs.f2fs -f $DEV > /dev/null mount $DEV $MNT xfs_io -f -c "truncate 64K" \ -c "mmap -w 0 64K" \ -c "mwrite -S 0xab 0 64K" \ -c "munmap" \ $MNT/foo blocks_used=$(stat -c %b $MNT/foo) echo "blocks used: $blocks_used" if [ $blocks_used -eq 0 ]; then echo "ERROR: blocks used is 0" fi umount $DEV $ ./reproducer.sh blocks used: 0 ERROR: blocks used is 0 So move the logic that decides to set the 'delalloc bit' bit into the function btrfs_set_extent_delalloc(), since that is what we use for all those missing cases as well as for the cases that currently work well. This change is also preparatory work for an upcoming patch that fixes other problems related to tracking and reporting the number of bytes used by an inode. CC: stable@vger.kernel.org # 4.19+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-27btrfs: don't fallback to buffered read if we don't need toJohannes Thumshirn
Since we switched to the iomap infrastructure in b5ff9f1a96e8f ("btrfs: switch to iomap for direct IO") we're calling generic_file_buffered_read() directly and not via generic_file_read_iter() anymore. If the read could read everything there is no need to bother calling generic_file_buffered_read(), like it is handled in generic_file_read_iter(). If we call generic_file_buffered_read() in this case we can hit a situation where we do an invalid readahead and cause this UBSAN splat in fstest generic/091: run fstests generic/091 at 2020-10-21 10:52:32 ================================================================================ UBSAN: shift-out-of-bounds in ./include/linux/log2.h:57:13 shift exponent 64 is too large for 64-bit type 'long unsigned int' CPU: 0 PID: 656 Comm: fsx Not tainted 5.9.0-rc7+ #821 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4-rebuilt.opensuse.org 04/01/2014 Call Trace: __dump_stack lib/dump_stack.c:77 dump_stack+0x57/0x70 lib/dump_stack.c:118 ubsan_epilogue+0x5/0x40 lib/ubsan.c:148 __ubsan_handle_shift_out_of_bounds.cold+0x61/0xe9 lib/ubsan.c:395 __roundup_pow_of_two ./include/linux/log2.h:57 get_init_ra_size mm/readahead.c:318 ondemand_readahead.cold+0x16/0x2c mm/readahead.c:530 generic_file_buffered_read+0x3ac/0x840 mm/filemap.c:2199 call_read_iter ./include/linux/fs.h:1876 new_sync_read+0x102/0x180 fs/read_write.c:415 vfs_read+0x11c/0x1a0 fs/read_write.c:481 ksys_read+0x4f/0xc0 fs/read_write.c:615 do_syscall_64+0x33/0x40 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xa9 arch/x86/entry/entry_64.S:118 RIP: 0033:0x7fe87fee992e RSP: 002b:00007ffe01605278 EFLAGS: 00000246 ORIG_RAX: 0000000000000000 RAX: ffffffffffffffda RBX: 000000000004f000 RCX: 00007fe87fee992e RDX: 0000000000004000 RSI: 0000000001677000 RDI: 0000000000000003 RBP: 000000000004f000 R08: 0000000000004000 R09: 000000000004f000 R10: 0000000000053000 R11: 0000000000000246 R12: 0000000000004000 R13: 0000000000000000 R14: 000000000007a120 R15: 0000000000000000 ================================================================================ BTRFS info (device nullb0): has skinny extents BTRFS info (device nullb0): ZONED mode enabled, zone size 268435456 B BTRFS info (device nullb0): enabling ssd optimizations Fixes: f85781fb505e ("btrfs: switch to iomap for direct IO") Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07btrfs: rename BTRFS_INODE_ORDERED_DATA_CLOSE flagNikolay Borisov
Commit 8d875f95da43 ("btrfs: disable strict file flushes for renames and truncates") eliminated the notion of ordered operations and instead BTRFS_INODE_ORDERED_DATA_CLOSE only remained as a flag indicating that a file's content should be synced to disk in case a file is truncated and any writes happen to it concurrently. In fact this intendend behavior was broken until it was fixed in f6dc45c7a93a ("Btrfs: fix filemap_flush call in btrfs_file_release"). All things considered let's give the flag a more descriptive name. Also slightly reword comments. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07btrfs: remove inode argument from btrfs_start_ordered_extentNikolay Borisov
The passed in ordered_extent struct is always well-formed and contains the inode making the explicit argument redundant. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07btrfs: sink total_data parameter in setup_items_for_insertNikolay Borisov
That parameter can easily be derived based on the "data_size" and "nr" parameters exploit this fact to simply the function's signature. No functional changes. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07btrfs: eliminate total_size parameter from setup_items_for_insertNikolay Borisov
The value of this argument can be derived from the total_data as it's simply the value of the data size + size of btrfs_items being touched. Move the parameter calculation inside the function. This results in a simpler interface and also a minor size reduction: ./scripts/bloat-o-meter ctree.original fs/btrfs/ctree.o add/remove: 0/0 grow/shrink: 0/3 up/down: 0/-34 (-34) Function old new delta btrfs_duplicate_item 260 259 -1 setup_items_for_insert 1200 1190 -10 btrfs_insert_empty_items 177 154 -23 Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07btrfs: rename btrfs_insert_clone_extent() to a more generic nameFilipe Manana
Now that we use the same mechanism to replace all the extents in a file range with either a hole, an existing extent (when cloning) or a new extent (when using fallocate), the name of btrfs_insert_clone_extent() no longer reflects its genericity. So rename it to btrfs_insert_replace_extent(), since what it does is to either insert an existing extent or a new extent into a file range. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07btrfs: rename btrfs_punch_hole_range() to a more generic nameFilipe Manana
The function btrfs_punch_hole_range() is now used to replace all the file extents in a given file range with an extent described in the given struct btrfs_replace_extent_info argument. This extent can either be an existing extent that is being cloned or it can be a new extent (namely a prealloc extent). When that argument is NULL it only punches a hole (drops all the existing extents) in the file range. So rename the function to btrfs_replace_file_extents(). Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07btrfs: rename struct btrfs_clone_extent_info to a more generic nameFilipe Manana
Now that we can use btrfs_clone_extent_info to convey information for a new prealloc extent as well, and not just for existing extents that are being cloned, rename it to btrfs_replace_extent_info, which reflects the fact that this is now more generic and it is used to replace all existing extents in a file range with the extent described by the structure. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07btrfs: remove item_size member of struct btrfs_clone_extent_infoFilipe Manana
The value of item_size of struct btrfs_clone_extent_info is always set to the size of a non-inline file extent item, and in fact the infrastructure that uses this structure (btrfs_punch_hole_range()) does not work with inline file extents at all (and it is not supposed to). So just remove that field from the structure and use directly sizeof(struct btrfs_file_extent_item) instead. Also assert that the file extent type is not inline at btrfs_insert_clone_extent(). Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07btrfs: fix metadata reservation for fallocate that leads to transaction abortsFilipe Manana
When doing an fallocate(), specially a zero range operation, we assume that reserving 3 units of metadata space is enough, that at most we touch one leaf in subvolume/fs tree for removing existing file extent items and inserting a new file extent item. This assumption is generally true for most common use cases. However when we end up needing to remove file extent items from multiple leaves, we can end up failing with -ENOSPC and abort the current transaction, turning the filesystem to RO mode. When this happens a stack trace like the following is dumped in dmesg/syslog: [ 1500.620934] ------------[ cut here ]------------ [ 1500.620938] BTRFS: Transaction aborted (error -28) [ 1500.620973] WARNING: CPU: 2 PID: 30807 at fs/btrfs/inode.c:9724 __btrfs_prealloc_file_range+0x512/0x570 [btrfs] [ 1500.620974] Modules linked in: btrfs intel_rapl_msr intel_rapl_common kvm_intel (...) [ 1500.621010] CPU: 2 PID: 30807 Comm: xfs_io Tainted: G W 5.9.0-rc3-btrfs-next-67 #1 [ 1500.621012] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 [ 1500.621023] RIP: 0010:__btrfs_prealloc_file_range+0x512/0x570 [btrfs] [ 1500.621026] Code: 8b 40 50 f0 48 (...) [ 1500.621028] RSP: 0018:ffffb05fc8803ca0 EFLAGS: 00010286 [ 1500.621030] RAX: 0000000000000000 RBX: ffff9608af276488 RCX: 0000000000000000 [ 1500.621032] RDX: 0000000000000001 RSI: 0000000000000027 RDI: 00000000ffffffff [ 1500.621033] RBP: ffffb05fc8803d90 R08: 0000000000000001 R09: 0000000000000001 [ 1500.621035] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000003200000 [ 1500.621037] R13: 00000000ffffffe4 R14: ffff9608af275fe8 R15: ffff9608af275f60 [ 1500.621039] FS: 00007fb5b2368ec0(0000) GS:ffff9608b6600000(0000) knlGS:0000000000000000 [ 1500.621041] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 1500.621043] CR2: 00007fb5b2366fb8 CR3: 0000000202d38005 CR4: 00000000003706e0 [ 1500.621046] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 1500.621047] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 1500.621049] Call Trace: [ 1500.621076] btrfs_prealloc_file_range+0x10/0x20 [btrfs] [ 1500.621087] btrfs_fallocate+0xccd/0x1280 [btrfs] [ 1500.621108] vfs_fallocate+0x14d/0x290 [ 1500.621112] ksys_fallocate+0x3a/0x70 [ 1500.621117] __x64_sys_fallocate+0x1a/0x20 [ 1500.621120] do_syscall_64+0x33/0x80 [ 1500.621123] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 1500.621126] RIP: 0033:0x7fb5b248c477 [ 1500.621128] Code: 89 7c 24 08 (...) [ 1500.621130] RSP: 002b:00007ffc7bee9060 EFLAGS: 00000293 ORIG_RAX: 000000000000011d [ 1500.621132] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007fb5b248c477 [ 1500.621134] RDX: 0000000000000000 RSI: 0000000000000010 RDI: 0000000000000003 [ 1500.621136] RBP: 0000557718faafd0 R08: 0000000000000000 R09: 0000000000000000 [ 1500.621137] R10: 0000000003200000 R11: 0000000000000293 R12: 0000000000000010 [ 1500.621139] R13: 0000557718faafb0 R14: 0000557718faa480 R15: 0000000000000003 [ 1500.621151] irq event stamp: 1026217 [ 1500.621154] hardirqs last enabled at (1026223): [<ffffffffba965570>] console_unlock+0x500/0x5c0 [ 1500.621156] hardirqs last disabled at (1026228): [<ffffffffba9654c7>] console_unlock+0x457/0x5c0 [ 1500.621159] softirqs last enabled at (1022486): [<ffffffffbb6003dc>] __do_softirq+0x3dc/0x606 [ 1500.621161] softirqs last disabled at (1022477): [<ffffffffbb4010b2>] asm_call_on_stack+0x12/0x20 [ 1500.621162] ---[ end trace 2955b08408d8b9d4 ]--- [ 1500.621167] BTRFS: error (device sdj) in __btrfs_prealloc_file_range:9724: errno=-28 No space left When we use fallocate() internally, for reserving an extent for a space cache, inode cache or relocation, we can't hit this problem since either there aren't any file extent items to remove from the subvolume tree or there is at most one. When using plain fallocate() it's very unlikely, since that would require having many file extent items representing holes for the target range and crossing multiple leafs - we attempt to increase the range (merge) of such file extent items when punching holes, so at most we end up with 2 file extent items for holes at leaf boundaries. However when using the zero range operation of fallocate() for a large range (100+ MiB for example) that's fairly easy to trigger. The following example reproducer triggers the issue: $ cat reproducer.sh #!/bin/bash umount /dev/sdj &> /dev/null mkfs.btrfs -f -n 16384 -O ^no-holes /dev/sdj > /dev/null mount /dev/sdj /mnt/sdj # Create a 100M file with many file extent items. Punch a hole every 8K # just to speedup the file creation - we could do 4K sequential writes # followed by fsync (or O_SYNC) as well, but that takes a lot of time. file_size=$((100 * 1024 * 1024)) xfs_io -f -c "pwrite -S 0xab -b 10M 0 $file_size" /mnt/sdj/foobar for ((i = 0; i < $file_size; i += 8192)); do xfs_io -c "fpunch $i 4096" /mnt/sdj/foobar done # Force a transaction commit, so the zero range operation will be forced # to COW all metadata extents it need to touch. sync xfs_io -c "fzero 0 $file_size" /mnt/sdj/foobar umount /mnt/sdj $ ./reproducer.sh wrote 104857600/104857600 bytes at offset 0 100 MiB, 10 ops; 0.0669 sec (1.458 GiB/sec and 149.3117 ops/sec) fallocate: No space left on device $ dmesg <shows the same stack trace pasted before> To fix this use the existing infrastructure that hole punching and extent cloning use for replacing a file range with another extent. This deals with doing the removal of file extent items and inserting the new one using an incremental approach, reserving more space when needed and always ensuring we don't leave an implicit hole in the range in case we need to do multiple iterations and a crash happens between iterations. A test case for fstests will follow up soon. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07btrfs: make btrfs_zero_range_check_range_boundary take btrfs_inodeNikolay Borisov
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07btrfs: convert btrfs_inode_sectorsize to take btrfs_inodeNikolay Borisov
It's counterintuitive to have a function named btrfs_inode_xxx which takes a generic inode. Also move the function to btrfs_inode.h so that it has access to the definition of struct btrfs_inode. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07btrfs: make btrfs_lookup_first_ordered_extent take btrfs_inodeNikolay Borisov
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07btrfs: dio iomap DSYNC workaroundJosef Bacik
iomap dio will run generic_write_sync() for us if the iocb is DSYNC. This is problematic for us because of 2 reasons: 1. we hold the inode_lock() during this operation, and we take it in generic_write_sync() 2. we hold a read lock on the dio_sem but take the write lock in fsync Since we don't want to rip out this code right now, but reworking the locking is a bit much to do at this point, work around this problem with this masterpiece of a patch. First, we clear DSYNC on the iocb so that the iomap stuff doesn't know that it needs to handle the sync. We save this fact in current->journal_info, because we need to see do special things once we're in iomap_begin, and we have no way to pass private information into iomap_dio_rw(). Next we specify a separate iomap_dio_ops for sync, which implements an ->end_io() callback that gets called when the dio completes. This is important for AIO, because we really do need to run generic_write_sync() if we complete asynchronously. However if we're still in the submitting context when we enter ->end_io() we clear the flag so that the submitter knows they're the ones that needs to run generic_write_sync(). This is meant to be temporary. We need to work out how to eliminate the inode_lock() and the dio_sem in our fsync and use another mechanism to protect these operations. Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07btrfs: switch to iomap for direct IOGoldwyn Rodrigues
We're using direct io implementation based on buffer heads. This patch switches to the new iomap infrastructure. Switch from __blockdev_direct_IO() to iomap_dio_rw(). Rename btrfs_get_blocks_direct() to btrfs_dio_iomap_begin() and use it as iomap_begin() for iomap direct I/O functions. This function allocates and locks all the blocks required for the I/O. btrfs_submit_direct() is used as the submit_io() hook for direct I/O ops. Since we need direct I/O reads to go through iomap_dio_rw(), we change file_operations.read_iter() to a btrfs_file_read_iter() which calls btrfs_direct_IO() for direct reads and falls back to generic_file_buffered_read() for incomplete reads and buffered reads. We don't need address_space.direct_IO() anymore: set it to noop. Similarly, we don't need flags used in __blockdev_direct_IO(). iomap is capable of direct I/O reads from a hole, so we don't need to return -ENOENT. Btrfs direct I/O is now done under i_rwsem, shared in case of reads and exclusive in case of writes. This guards against simultaneous truncates. Use iomap->iomap_end() to check for failed or incomplete direct I/O: - for writes, call __endio_write_update_ordered() - for reads, unlock extents btrfs_dio_data is now hooked in iomap->private and not current->journal_info. It carries the reservation variable and the amount of data submitted, so we can calculate the amount of data to call __endio_write_update_ordered in case of an error. This patch removes last use of struct buffer_head from btrfs. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07btrfs: make fast fsyncs wait only for writebackFilipe Manana
Currently regardless of a full or a fast fsync we always wait for ordered extents to complete, and then start logging the inode after that. However for fast fsyncs we can just wait for the writeback to complete, we don't need to wait for the ordered extents to complete since we use the list of modified extents maps to figure out which extents we must log and we can get their checksums directly from the ordered extents that are still in flight, otherwise look them up from the checksums tree. Until commit b5e6c3e170b770 ("btrfs: always wait on ordered extents at fsync time"), for fast fsyncs, we used to start logging without even waiting for the writeback to complete first, we would wait for it to complete after logging, while holding a transaction open, which lead to performance issues when using cgroups and probably for other cases too, as wait for IO while holding a transaction handle should be avoided as much as possible. After that, for fast fsyncs, we started to wait for ordered extents to complete before starting to log, which adds some latency to fsyncs and we even got at least one report about a performance drop which bisected to that particular change: https://lore.kernel.org/linux-btrfs/20181109215148.GF23260@techsingularity.net/ This change makes fast fsyncs only wait for writeback to finish before starting to log the inode, instead of waiting for both the writeback to finish and for the ordered extents to complete. This brings back part of the logic we had that extracts checksums from in flight ordered extents, which are not yet in the checksums tree, and making sure transaction commits wait for the completion of ordered extents previously logged (by far most of the time they have already completed by the time a transaction commit starts, resulting in no wait at all), to avoid any data loss if an ordered extent completes after the transaction used to log an inode is committed, followed by a power failure. When there are no other tasks accessing the checksums and the subvolume btrees, the ordered extent completion is pretty fast, typically taking 100 to 200 microseconds only in my observations. However when there are other tasks accessing these btrees, ordered extent completion can take a lot more time due to lock contention on nodes and leaves of these btrees. I've seen cases over 2 milliseconds, which starts to be significant. In particular when we do have concurrent fsyncs against different files there is a lot of contention on the checksums btree, since we have many tasks writing the checksums into the btree and other tasks that already started the logging phase are doing lookups for checksums in the btree. This change also turns all ranged fsyncs into full ranged fsyncs, which is something we already did when not using the NO_HOLES features or when doing a full fsync. This is to guarantee we never miss checksums due to writeback having been triggered only for a part of an extent, and we end up logging the full extent but only checksums for the written range, which results in missing checksums after log replay. Allowing ranged fsyncs to operate again only in the original range, when using the NO_HOLES feature and doing a fast fsync is doable but requires some non trivial changes to the writeback path, which can always be worked on later if needed, but I don't think they are a very common use case. Several tests were performed using fio for different numbers of concurrent jobs, each writing and fsyncing its own file, for both sequential and random file writes. The tests were run on bare metal, no virtualization, on a box with 12 cores (Intel i7-8700), 64Gb of RAM and a NVMe device, with a kernel configuration that is the default of typical distributions (debian in this case), without debug options enabled (kasan, kmemleak, slub debug, debug of page allocations, lock debugging, etc). The following script that calls fio was used: $ cat test-fsync.sh #!/bin/bash DEV=/dev/nvme0n1 MNT=/mnt/btrfs MOUNT_OPTIONS="-o ssd -o space_cache=v2" MKFS_OPTIONS="-d single -m single" if [ $# -ne 5 ]; then echo "Use $0 NUM_JOBS FILE_SIZE FSYNC_FREQ BLOCK_SIZE [write|randwrite]" exit 1 fi NUM_JOBS=$1 FILE_SIZE=$2 FSYNC_FREQ=$3 BLOCK_SIZE=$4 WRITE_MODE=$5 if [ "$WRITE_MODE" != "write" ] && [ "$WRITE_MODE" != "randwrite" ]; then echo "Invalid WRITE_MODE, must be 'write' or 'randwrite'" exit 1 fi cat <<EOF > /tmp/fio-job.ini [writers] rw=$WRITE_MODE fsync=$FSYNC_FREQ fallocate=none group_reporting=1 direct=0 bs=$BLOCK_SIZE ioengine=sync size=$FILE_SIZE directory=$MNT numjobs=$NUM_JOBS EOF echo "performance" | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor echo echo "Using config:" echo cat /tmp/fio-job.ini echo umount $MNT &> /dev/null mkfs.btrfs -f $MKFS_OPTIONS $DEV mount $MOUNT_OPTIONS $DEV $MNT fio /tmp/fio-job.ini umount $MNT The results were the following: ************************* *** sequential writes *** ************************* ==== 1 job, 8GiB file, fsync frequency 1, block size 64KiB ==== Before patch: WRITE: bw=36.6MiB/s (38.4MB/s), 36.6MiB/s-36.6MiB/s (38.4MB/s-38.4MB/s), io=8192MiB (8590MB), run=223689-223689msec After patch: WRITE: bw=40.2MiB/s (42.1MB/s), 40.2MiB/s-40.2MiB/s (42.1MB/s-42.1MB/s), io=8192MiB (8590MB), run=203980-203980msec (+9.8%, -8.8% runtime) ==== 2 jobs, 4GiB files, fsync frequency 1, block size 64KiB ==== Before patch: WRITE: bw=35.8MiB/s (37.5MB/s), 35.8MiB/s-35.8MiB/s (37.5MB/s-37.5MB/s), io=8192MiB (8590MB), run=228950-228950msec After patch: WRITE: bw=43.5MiB/s (45.6MB/s), 43.5MiB/s-43.5MiB/s (45.6MB/s-45.6MB/s), io=8192MiB (8590MB), run=188272-188272msec (+21.5% throughput, -17.8% runtime) ==== 4 jobs, 2GiB files, fsync frequency 1, block size 64KiB ==== Before patch: WRITE: bw=50.1MiB/s (52.6MB/s), 50.1MiB/s-50.1MiB/s (52.6MB/s-52.6MB/s), io=8192MiB (8590MB), run=163446-163446msec After patch: WRITE: bw=64.5MiB/s (67.6MB/s), 64.5MiB/s-64.5MiB/s (67.6MB/s-67.6MB/s), io=8192MiB (8590MB), run=126987-126987msec (+28.7% throughput, -22.3% runtime) ==== 8 jobs, 1GiB files, fsync frequency 1, block size 64KiB ==== Before patch: WRITE: bw=64.0MiB/s (68.1MB/s), 64.0MiB/s-64.0MiB/s (68.1MB/s-68.1MB/s), io=8192MiB (8590MB), run=126075-126075msec After patch: WRITE: bw=86.8MiB/s (91.0MB/s), 86.8MiB/s-86.8MiB/s (91.0MB/s-91.0MB/s), io=8192MiB (8590MB), run=94358-94358msec (+35.6% throughput, -25.2% runtime) ==== 16 jobs, 512MiB files, fsync frequency 1, block size 64KiB ==== Before patch: WRITE: bw=79.8MiB/s (83.6MB/s), 79.8MiB/s-79.8MiB/s (83.6MB/s-83.6MB/s), io=8192MiB (8590MB), run=102694-102694msec After patch: WRITE: bw=107MiB/s (112MB/s), 107MiB/s-107MiB/s (112MB/s-112MB/s), io=8192MiB (8590MB), run=76446-76446msec (+34.1% throughput, -25.6% runtime) ==== 32 jobs, 512MiB files, fsync frequency 1, block size 64KiB ==== Before patch: WRITE: bw=93.2MiB/s (97.7MB/s), 93.2MiB/s-93.2MiB/s (97.7MB/s-97.7MB/s), io=16.0GiB (17.2GB), run=175836-175836msec After patch: WRITE: bw=111MiB/s (117MB/s), 111MiB/s-111MiB/s (117MB/s-117MB/s), io=16.0GiB (17.2GB), run=147001-147001msec (+19.1% throughput, -16.4% runtime) ==== 64 jobs, 512MiB files, fsync frequency 1, block size 64KiB ==== Before patch: WRITE: bw=108MiB/s (114MB/s), 108MiB/s-108MiB/s (114MB/s-114MB/s), io=32.0GiB (34.4GB), run=302656-302656msec After patch: WRITE: bw=133MiB/s (140MB/s), 133MiB/s-133MiB/s (140MB/s-140MB/s), io=32.0GiB (34.4GB), run=246003-246003msec (+23.1% throughput, -18.7% runtime) ************************ *** random writes *** ************************ ==== 1 job, 8GiB file, fsync frequency 16, block size 4KiB ==== Before patch: WRITE: bw=11.5MiB/s (12.0MB/s), 11.5MiB/s-11.5MiB/s (12.0MB/s-12.0MB/s), io=8192MiB (8590MB), run=714281-714281msec After patch: WRITE: bw=11.6MiB/s (12.2MB/s), 11.6MiB/s-11.6MiB/s (12.2MB/s-12.2MB/s), io=8192MiB (8590MB), run=705959-705959msec (+0.9% throughput, -1.7% runtime) ==== 2 jobs, 4GiB files, fsync frequency 16, block size 4KiB ==== Before patch: WRITE: bw=12.8MiB/s (13.5MB/s), 12.8MiB/s-12.8MiB/s (13.5MB/s-13.5MB/s), io=8192MiB (8590MB), run=638101-638101msec After patch: WRITE: bw=13.1MiB/s (13.7MB/s), 13.1MiB/s-13.1MiB/s (13.7MB/s-13.7MB/s), io=8192MiB (8590MB), run=625374-625374msec (+2.3% throughput, -2.0% runtime) ==== 4 jobs, 2GiB files, fsync frequency 16, block size 4KiB ==== Before patch: WRITE: bw=15.4MiB/s (16.2MB/s), 15.4MiB/s-15.4MiB/s (16.2MB/s-16.2MB/s), io=8192MiB (8590MB), run=531146-531146msec After patch: WRITE: bw=17.8MiB/s (18.7MB/s), 17.8MiB/s-17.8MiB/s (18.7MB/s-18.7MB/s), io=8192MiB (8590MB), run=460431-460431msec (+15.6% throughput, -13.3% runtime) ==== 8 jobs, 1GiB files, fsync frequency 16, block size 4KiB ==== Before patch: WRITE: bw=19.9MiB/s (20.8MB/s), 19.9MiB/s-19.9MiB/s (20.8MB/s-20.8MB/s), io=8192MiB (8590MB), run=412664-412664msec After patch: WRITE: bw=22.2MiB/s (23.3MB/s), 22.2MiB/s-22.2MiB/s (23.3MB/s-23.3MB/s), io=8192MiB (8590MB), run=368589-368589msec (+11.6% throughput, -10.7% runtime) ==== 16 jobs, 512MiB files, fsync frequency 16, block size 4KiB ==== Before patch: WRITE: bw=29.3MiB/s (30.7MB/s), 29.3MiB/s-29.3MiB/s (30.7MB/s-30.7MB/s), io=8192MiB (8590MB), run=279924-279924msec After patch: WRITE: bw=30.4MiB/s (31.9MB/s), 30.4MiB/s-30.4MiB/s (31.9MB/s-31.9MB/s), io=8192MiB (8590MB), run=269258-269258msec (+3.8% throughput, -3.8% runtime) ==== 32 jobs, 512MiB files, fsync frequency 16, block size 4KiB ==== Before patch: WRITE: bw=36.9MiB/s (38.7MB/s), 36.9MiB/s-36.9MiB/s (38.7MB/s-38.7MB/s), io=16.0GiB (17.2GB), run=443581-443581msec After patch: WRITE: bw=41.6MiB/s (43.6MB/s), 41.6MiB/s-41.6MiB/s (43.6MB/s-43.6MB/s), io=16.0GiB (17.2GB), run=394114-394114msec (+12.7% throughput, -11.2% runtime) ==== 64 jobs, 512MiB files, fsync frequency 16, block size 4KiB ==== Before patch: WRITE: bw=45.9MiB/s (48.1MB/s), 45.9MiB/s-45.9MiB/s (48.1MB/s-48.1MB/s), io=32.0GiB (34.4GB), run=714614-714614msec After patch: WRITE: bw=48.8MiB/s (51.1MB/s), 48.8MiB/s-48.8MiB/s (51.1MB/s-51.1MB/s), io=32.0GiB (34.4GB), run=672087-672087msec (+6.3% throughput, -6.0% runtime) Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07btrfs: cleanup calculation of lockend in lock_and_cleanup_extent_if_need()Qu Wenruo
We're just doing rounding up to sectorsize to calculate the lockend. There is no need to do the unnecessary length calculation, just direct round_up() is enough. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-08-24Merge tag 'for-5.9-rc2-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - fix swapfile activation on subvolumes with deleted snapshots - error value mixup when removing directory entries from tree log - fix lzo compression level reset after previous level setting - fix space cache memory leak after transaction abort - fix const function attribute - more error handling improvements * tag 'for-5.9-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: detect nocow for swap after snapshot delete btrfs: check the right error variable in btrfs_del_dir_entries_in_log btrfs: fix space cache memory leak after transaction abort btrfs: use the correct const function attribute for btrfs_get_num_csums btrfs: reset compression level for lzo on remount btrfs: handle errors from async submission
2020-08-21btrfs: detect nocow for swap after snapshot deleteBoris Burkov
can_nocow_extent and btrfs_cross_ref_exist both rely on a heuristic for detecting a must cow condition which is not exactly accurate, but saves unnecessary tree traversal. The incorrect assumption is that if the extent was created in a generation smaller than the last snapshot generation, it must be referenced by that snapshot. That is true, except the snapshot could have since been deleted, without affecting the last snapshot generation. The original patch claimed a performance win from this check, but it also leads to a bug where you are unable to use a swapfile if you ever snapshotted the subvolume it's in. Make the check slower and more strict for the swapon case, without modifying the general cow checks as a compromise. Turning swap on does not seem to be a particularly performance sensitive operation, so incurring a possibly unnecessary btrfs_search_slot seems worthwhile for the added usability. Note: Until the snapshot is competely cleaned after deletion, check_committed_refs will still cause the logic to think that cow is necessary, so the user must until 'btrfs subvolu sync' finished before activating the swapfile swapon. CC: stable@vger.kernel.org # 5.4+ Suggested-by: Omar Sandoval <osandov@osandov.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
2020-08-03Merge tag 'for-5.9/io_uring-20200802' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull io_uring updates from Jens Axboe: "Lots of cleanups in here, hardening the code and/or making it easier to read and fixing bugs, but a core feature/change too adding support for real async buffered reads. With the latter in place, we just need buffered write async support and we're done relying on kthreads for the fast path. In detail: - Cleanup how memory accounting is done on ring setup/free (Bijan) - sq array offset calculation fixup (Dmitry) - Consistently handle blocking off O_DIRECT submission path (me) - Support proper async buffered reads, instead of relying on kthread offload for that. This uses the page waitqueue to drive retries from task_work, like we handle poll based retry. (me) - IO completion optimizations (me) - Fix race with accounting and ring fd install (me) - Support EPOLLEXCLUSIVE (Jiufei) - Get rid of the io_kiocb unionizing, made possible by shrinking other bits (Pavel) - Completion side cleanups (Pavel) - Cleanup REQ_F_ flags handling, and kill off many of them (Pavel) - Request environment grabbing cleanups (Pavel) - File and socket read/write cleanups (Pavel) - Improve kiocb_set_rw_flags() (Pavel) - Tons of fixes and cleanups (Pavel) - IORING_SQ_NEED_WAKEUP clear fix (Xiaoguang)" * tag 'for-5.9/io_uring-20200802' of git://git.kernel.dk/linux-block: (127 commits) io_uring: flip if handling after io_setup_async_rw fs: optimise kiocb_set_rw_flags() io_uring: don't touch 'ctx' after installing file descriptor io_uring: get rid of atomic FAA for cq_timeouts io_uring: consolidate *_check_overflow accounting io_uring: fix stalled deferred requests io_uring: fix racy overflow count reporting io_uring: deduplicate __io_complete_rw() io_uring: de-unionise io_kiocb io-wq: update hash bits io_uring: fix missing io_queue_linked_timeout() io_uring: mark ->work uninitialised after cleanup io_uring: deduplicate io_grab_files() calls io_uring: don't do opcode prep twice io_uring: clear IORING_SQ_NEED_WAKEUP after executing task works io_uring: batch put_task_struct() tasks: add put_task_struct_many() io_uring: return locked and pinned page accounting io_uring: don't miscount pinned memory io_uring: don't open-code recv kbuf managment ...
2020-07-27btrfs: make btrfs_check_data_free_space take btrfs_inodeNikolay Borisov
Instead of calling BTRFS_I on the passed vfs_inode take btrfs_inode directly. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27btrfs: make btrfs_delalloc_release_space take btrfs_inodeNikolay Borisov
It needs btrfs_inode so take it as a parameter directly. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27btrfs: make btrfs_free_reserved_data_space take btrfs_inodeNikolay Borisov
It only uses btrfs_inode internally so take it as a parameter. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27btrfs: make btrfs_qgroup_reserve_data take btrfs_inodeNikolay Borisov
There's only a single use of vfs_inode in a tracepoint so let's take btrfs_inode directly. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27btrfs: make btrfs_dirty_pages take btrfs_inodeNikolay Borisov
There is a single use of the generic vfs_inode so let's take btrfs_inode as a parameter and remove couple of redundant BTRFS_I() calls. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>