The 'out' label just returns the 'ret' value and does not seem to be
required, so remove this label and return the appropriate value
immediately. This patch also does some minor cleanup; no logical change.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20210610112440.3438139-2-yi.zhang@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
ext4_ext_grow_indepth adds a new extent block that has already been
initialized with the expected content. We can mark this buffer as
verified to avoid a useless check in __read_extent_tree_block.
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20210609075545.1442160-1-yangerkun@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
This comment was for jbd2_fc_wait_bufs, not for jbd2_fc_release_bufs.
Remove this misleading comment.
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20210608141236.459441-1-yangerkun@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
The in-kernel ext4 resize code doesn't support filesystems with the
sparse_super2 feature. It fails with errors like this and doesn't finish
the resize:
EXT4-fs (loop0): resizing filesystem from 16640 to 7864320 blocks
EXT4-fs warning (device loop0): verify_reserved_gdb:760: reserved GDT 2 missing grp 1 (32770)
EXT4-fs warning (device loop0): ext4_resize_fs:2111: error (-22) occurred during file system resize
EXT4-fs (loop0): resized filesystem to 2097152
To reproduce:
mkfs.ext4 -b 4096 -I 256 -J size=32 -E resize=$((256*1024*1024)) -O sparse_super2 ext4.img 65M
truncate -s 30G ext4.img
mount ext4.img /mnt
python3 -c 'import fcntl, os, struct ; fd = os.open("/mnt", os.O_RDONLY | os.O_DIRECTORY) ; fcntl.ioctl(fd, 0x40086610, struct.pack("Q", 30 * 1024 * 1024 * 1024 // 4096), False) ; os.close(fd)'
dmesg | tail
e2fsck ext4.img
The userspace resize2fs tool has a check for this case: it checks if the
filesystem has sparse_super2 set and if the kernel provides
/sys/fs/ext4/features/sparse_super2. However, the former check requires
manually reading and parsing the filesystem superblock.
Detect this case in ext4_resize_begin and error out early with a clear
error message.
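The python3 one-liner in the reproducer above is dense, so here is a more readable sketch of the same call. It assumes the generic Linux ioctl number layout (some architectures encode the direction bits differently) and that EXT4_IOC_RESIZE_FS is defined as _IOW('f', 16, __u64); the helper names `_iow` and `resize` are made up for illustration.

```python
import fcntl
import os
import struct

# Generic Linux ioctl number layout (assumption: asm-generic encoding):
# bits 0-7 = nr, bits 8-15 = type, bits 16-29 = size, bits 30-31 = direction.
_IOC_WRITE = 1


def _iow(type_char: str, nr: int, size: int) -> int:
    """Build a write-direction ioctl number, like the C _IOW() macro."""
    return (_IOC_WRITE << 30) | (size << 16) | (ord(type_char) << 8) | nr


# Assumed definition: _IOW('f', 16, __u64).
EXT4_IOC_RESIZE_FS = _iow("f", 16, struct.calcsize("Q"))

BLOCK_SIZE = 4096                 # matches "mkfs.ext4 -b 4096"
TARGET_BYTES = 30 * 1024 ** 3     # matches "truncate -s 30G"
target_blocks = TARGET_BYTES // BLOCK_SIZE
arg = struct.pack("Q", target_blocks)  # the __u64 new block count


def resize(mountpoint: str) -> None:
    """Ask the kernel to resize the ext4 filesystem mounted at mountpoint."""
    fd = os.open(mountpoint, os.O_RDONLY | os.O_DIRECTORY)
    try:
        fcntl.ioctl(fd, EXT4_IOC_RESIZE_FS, arg, False)
    finally:
        os.close(fd)
```

The computed request number equals the 0x40086610 literal used in the reproducer, and the block count equals the 7864320 shown in the kernel log. Calling `resize("/mnt")` requires root and a mounted test image.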
Signed-off-by: Josh Triplett <josh@joshtriplett.org>
Link: https://lore.kernel.org/r/74b8ae78405270211943cd7393e65586c5faeed1.1623093259.git.josh@joshtriplett.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Two different places checked for attempts to resize a filesystem with
the bigalloc feature. Move the check into ext4_resize_begin, which both
places already call.
Signed-off-by: Josh Triplett <josh@joshtriplett.org>
Link: https://lore.kernel.org/r/bee03303d999225ecb3bfa5be8576b2f4c6edbe6.1623093259.git.josh@joshtriplett.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
ext4_xattr_ibody_inline_set() & ext4_xattr_ibody_set() have the exact
same definition. Hence remove ext4_xattr_ibody_inline_set() and all
its call references. Convert the callers of it to call
ext4_xattr_ibody_set() instead.
[ Modified to preserve ext4_xattr_ibody_set() and remove
ext4_xattr_ibody_inline_set() instead. -- TYT ]
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/fd566b799bbbbe9b668eb5eecde5b5e319e3694f.1622685482.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
While debugging the fstest ext4/027 failure, I found the comment below to
be wrong and confusing. Hence fix it while we are at it.
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lore.kernel.org/r/e79134132db7ea42f15747b5c669ee91cc1aacdf.1622432690.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Fix the comment for s_hash_unsigned to not be the opposite of what it
actually is.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20210527235557.2377525-1-ebiggers@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Signed-off-by: chenyichong <chenyichong@uniontech.com>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/20210526052930.11278-1-chenyichong@uniontech.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
The avefreec should be average free clusters instead
of average free blocks, otherwise Orlov's allocator
will not work properly when bigalloc is enabled.
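A toy model may help show why the unit matters. It is not the kernel's Orlov code: it only models the single comparison of a group's free cluster count against the average, and all numbers and names are made up.

```python
def average_free(free_clusters: int, ngroups: int,
                 cluster_ratio: int, in_blocks: bool) -> int:
    """Average free space per group, in blocks (buggy) or clusters (fixed)."""
    total = free_clusters * cluster_ratio if in_blocks else free_clusters
    return total // ngroups


# Hypothetical filesystem: free clusters per block group.
free_per_group = [900, 1100, 1000, 1000]
ngroups = len(free_per_group)
cluster_ratio = 16  # blocks per cluster, e.g. 64 KiB clusters of 4 KiB blocks

wrong = average_free(sum(free_per_group), ngroups, cluster_ratio, in_blocks=True)
right = average_free(sum(free_per_group), ngroups, cluster_ratio, in_blocks=False)

# Groups that pass a "free clusters >= average" style check.
ok_wrong = [g for g, f in enumerate(free_per_group) if f >= wrong]
ok_right = [g for g, f in enumerate(free_per_group) if f >= right]
```

With the average expressed in blocks, the threshold is cluster_ratio times too high and no group passes the check; with the average in clusters, the groups at or above the mean pass as intended.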
Cc: stable@kernel.org
Signed-off-by: Pan Dong <pandong.peter@bytedance.com>
Link: https://lore.kernel.org/r/20210525073656.31594-1-pandong.peter@bytedance.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
The cache_cnt parameter of tracepoint ext4_es_shrink_exit means the
remaining cache count after shrink, but currently it is the cache count
before shrink. Fix it by reading sbi->s_extent_cache_cnt again.
Fixes: 1ab6c4997e04 ("fs: convert fs shrinkers to new scan/count API")
Cc: stable@vger.kernel.org # 3.12+
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20210522103045.690103-3-yi.zhang@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
After converting fs shrinkers to the new scan/count API, we no longer
pass a zero nr_to_scan parameter to detect the number of objects to free,
so just remove this check.
Fixes: 1ab6c4997e04 ("fs: convert fs shrinkers to new scan/count API")
Cc: stable@vger.kernel.org # 3.12+
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20210522103045.690103-2-yi.zhang@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
In the ext4_dx_add_entry function, the at variable is assigned but is
reset just after the “again:” label. So delete the unnecessary assignment.
This does not change the logic.
Signed-off-by: Tian Tao <tiantao6@hisilicon.com>
Reviewed-by: Artem Blagodarenko <artem.blagodarenko@gmail.com>
Link: https://lore.kernel.org/r/1621493752-36890-1-git-send-email-tiantao6@hisilicon.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
ioctl EXT4_IOC_CHECKPOINT checkpoints and flushes the journal. This
includes forcing all the transactions to the log, checkpointing the
transactions, and flushing the log to disk. This ioctl takes u32 "flags"
as an argument. Three flags are supported. EXT4_IOC_CHECKPOINT_FLAG_DRY_RUN
can be used to verify input to the ioctl. It returns error if there is any
invalid input, otherwise it returns success without performing
any checkpointing. The other two flags, EXT4_IOC_CHECKPOINT_FLAG_DISCARD
and EXT4_IOC_CHECKPOINT_FLAG_ZEROOUT, can be used to issue requests to
discard or zeroout the journal log blocks, respectively. At this
point, EXT4_IOC_CHECKPOINT_FLAG_ZEROOUT is primarily added to enable
testing of this codepath on devices that don't support discard.
EXT4_IOC_CHECKPOINT_FLAG_DISCARD and EXT4_IOC_CHECKPOINT_FLAG_ZEROOUT
cannot both be set.
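The flag rules described above can be modelled in a few lines. This is only a sketch of the validation order implied by the text, not the kernel implementation; the numeric flag values and the `checkpoint` helper are assumptions for illustration.

```python
import errno

# Assumed flag values, for illustration only.
FLAG_DISCARD = 0x1
FLAG_ZEROOUT = 0x2
FLAG_DRY_RUN = 0x4
FLAG_VALID = FLAG_DISCARD | FLAG_ZEROOUT | FLAG_DRY_RUN


def checkpoint(flags: int, do_checkpoint=lambda erase: None) -> int:
    """Validate flags, then checkpoint unless this is a dry run."""
    if flags & ~FLAG_VALID:
        return -errno.EINVAL          # unknown flag bits
    if (flags & FLAG_DISCARD) and (flags & FLAG_ZEROOUT):
        return -errno.EINVAL          # mutually exclusive
    if flags & FLAG_DRY_RUN:
        return 0                      # input is valid; do nothing
    if flags & FLAG_DISCARD:
        erase = "discard"
    elif flags & FLAG_ZEROOUT:
        erase = "zeroout"
    else:
        erase = None                  # plain checkpoint and flush
    do_checkpoint(erase)
    return 0
```

A dry run therefore reports the same errors as a real call but never reaches the checkpoint step, which is what makes it usable for probing input validity.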
Systems that wish to achieve content deletion SLO can set up a daemon
that calls this ioctl at a regular interval such that it matches with the
SLO requirement. Thus, with this patch, the ext4_dir_entry2 wipeout
patch[1], and the Ext4 "-o discard" mount option set, Ext4 can now
guarantee that all file contents, file metadata, and filenames will not
be accessible through the filesystem and will have had discard or
zeroout requests issued for corresponding device blocks.
The __jbd2_journal_erase function could also be used to discard or
zero-fill the journal during journal load after recovery. This would
provide a potential solution to a journal replay bug reported earlier this
year[2]. After a successful journal recovery, e2fsck can call this ioctl to
discard the journal as well.
[1] https://lore.kernel.org/linux-ext4/YIHknqxngB1sUdie@mit.edu/
[2] https://lore.kernel.org/linux-ext4/YDZoaacIYStFQT8g@mit.edu/
Link: https://lore.kernel.org/r/20210518151327.130198-2-leah.rumancik@gmail.com
Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Add a flags argument to jbd2_journal_flush to enable discarding or
zero-filling the journal blocks while flushing the journal.
Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
Link: https://lore.kernel.org/r/20210518151327.130198-1-leah.rumancik@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
After commit c89128a00838 ("ext4: handle errors on
ext4_commit_super"), 'ret' may be set to 0 before calling
ext4_fill_flex_info(). If ext4_fill_flex_info() then fails, ext4_mount()
doesn't return an error code, which leaves 'root' NULL and causes a crash
in legacy_get_tree().
Fixes: c89128a00838 ("ext4: handle errors on ext4_commit_super")
Reported-by: Hulk Robot <hulkci@huawei.com>
Cc: <stable@vger.kernel.org> # v4.18+
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Link: https://lore.kernel.org/r/20210510111051.55650-1-yangyingliang@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
transaction handle
In ext4_orphan_cleanup(), if ext4_truncate() failed to get a transaction
handle, it didn't remove the inode from the in-core orphan list, which
may trigger the error dump below in ext4_destroy_inode() during the
final iput() and could lead to memory corruption on later orphan
list changes.
EXT4-fs (sda): Inode 6291467 (00000000b8247c67): orphan list check failed!
00000000b8247c67: 0001f30a 00000004 00000000 00000023 ............#...
00000000e24cde71: 00000006 014082a3 00000000 00000000 ......@.........
0000000072c6a5ee: 00000000 00000000 00000000 00000000 ................
...
This patch fixes this by cleaning up the in-core orphan list manually if
ext4_truncate() returns an error.
Cc: stable@kernel.org
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20210507071904.160808-1-yi.zhang@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Initialize eh_generation of struct ext4_extent_header to prevent leaking
info to userspace. Fixes KMSAN kernel-infoleak bug reported by syzbot at:
http://syzkaller.appspot.com/bug?id=78e9ad0e6952a3ca16e8234724b2fa92d041b9b8
Cc: stable@kernel.org
Reported-by: syzbot+2dcfeaf8cb49b05e8f1a@syzkaller.appspotmail.com
Fixes: a86c61812637 ("[PATCH] ext3: add extent map support")
Signed-off-by: Anirudh Rayabharam <mail@anirudhrb.com>
Link: https://lore.kernel.org/r/20210506185655.7118-1-mail@anirudhrb.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
static int kthread(void *_create) will return -ENOMEM
or -EINTR in case of an internal failure or if the
kthread_stop() call happens before the threadfn call.
To avoid fancy error checking and make the code
more straightforward, we moved all cleanup code out
of the kmmpd threadfn.
Also, struct mmpd_data is dropped entirely. Now struct super_block
is the threadfn data, and struct buffer_head is embedded into
struct ext4_sb_info.
Reported-by: syzbot+d9e482e303930fa4f6ff@syzkaller.appspotmail.com
Signed-off-by: Pavel Skripkin <paskripkin@gmail.com>
Link: https://lore.kernel.org/r/20210430185046.15742-1-paskripkin@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Variable error is set to zero but this value is never read as it's not
used later on, hence it is a redundant assignment and can be removed.
Cleans up the following clang-analyzer warning:
fs/ext4/ioctl.c:657:3: warning: Value stored to 'error' is never read
[clang-analyzer-deadcode.DeadStores].
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Link: https://lore.kernel.org/r/1619691409-83160-1-git-send-email-jiapeng.chong@linux.alibaba.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Now that set_buffer_uptodate() tests first and then sets, we don't have
to check buffer_uptodate() first. Remove the check to simplify the code.
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Link: https://lore.kernel.org/r/1619418587-5580-1-git-send-email-joseph.qi@linux.alibaba.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Code in iomap alloc may overflow the block number when converting it to
a byte offset. Luckily this is mostly harmless, as we will just use the
more expensive method of writing using unwritten extents even though we
are writing beyond i_size.
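The kind of overflow described here can be illustrated by emulating fixed-width shifts. This sketch assumes a 32-bit block number shifted by the block-size bits before being widened; the variable names and numbers are illustrative and not taken from the patch.

```python
def shl32(value: int, shift: int) -> int:
    """Left shift as a 32-bit unsigned integer would see it."""
    return (value << shift) & 0xFFFFFFFF


def shl64(value: int, shift: int) -> int:
    """Left shift after widening to a 64-bit unsigned integer."""
    return (value << shift) & 0xFFFFFFFFFFFFFFFF


blkbits = 12        # 4 KiB blocks
m_lblk = 1 << 20    # logical block 1048576, i.e. byte offset 4 GiB
i_size = 1 << 30    # hypothetical 1 GiB file

narrow = shl32(m_lblk, blkbits)  # high bits are lost
wide = shl64(m_lblk, blkbits)    # correct byte offset

beyond_eof_narrow = narrow >= i_size
beyond_eof_wide = wide >= i_size
```

With the truncated offset, the "writing beyond i_size" test gives the wrong answer, which matches the commit's note that the slower unwritten-extent path is taken even though the write is past EOF.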
Cc: stable@kernel.org
Fixes: 378f32bab371 ("ext4: introduce direct I/O write using iomap infrastructure")
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20210412102333.2676-4-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 fixes from Ted Ts'o:
"Miscellaneous ext4 bug fixes"
* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: Only advertise encrypted_casefold when encryption and unicode are enabled
ext4: fix no-key deletion for encrypt+casefold
ext4: fix memory leak in ext4_fill_super
ext4: fix fast commit alignment issues
ext4: fix bug on in ext4_es_cache_extent as ext4_split_extent_at failed
ext4: fix accessing uninit percpu counter variable with fast_commit
ext4: fix memory leak in ext4_mb_init_backend on error path.
|
|
Encrypted casefolding is only supported when encryption and
casefolding are both enabled in the config.
Fixes: 471fbbea7ff7 ("ext4: handle casefolding with encryption")
Cc: stable@vger.kernel.org # 5.13+
Signed-off-by: Daniel Rosenberg <drosen@google.com>
Link: https://lore.kernel.org/r/20210603094849.314342-1-drosen@google.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
commit 471fbbea7ff7 ("ext4: handle casefolding with encryption") is
missing a few checks for the encryption key which are needed to
support deleting encrypted casefolded files when the key is not
present.
This bug made it impossible to delete encrypted+casefolded directories
without the encryption key, due to errors like:
W : EXT4-fs warning (device vdc): __ext4fs_dirhash:270: inode #49202: comm Binder:378_4: Siphash requires key
Repro steps in kvm-xfstests test appliance:
mkfs.ext4 -F -E encoding=utf8 -O encrypt /dev/vdc
mount /vdc
mkdir /vdc/dir
chattr +F /vdc/dir
keyid=$(head -c 64 /dev/zero | xfs_io -c add_enckey /vdc | awk '{print $NF}')
xfs_io -c "set_encpolicy $keyid" /vdc/dir
for i in `seq 1 100`; do
mkdir /vdc/dir/$i
done
xfs_io -c "rm_enckey $keyid" /vdc
rm -rf /vdc/dir # fails with the bug
Fixes: 471fbbea7ff7 ("ext4: handle casefolding with encryption")
Signed-off-by: Daniel Rosenberg <drosen@google.com>
Link: https://lore.kernel.org/r/20210522004132.2142563-1-drosen@google.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Buffer head references must be released before calling kill_bdev();
otherwise the buffer head (and its page referenced by b_data) will not
be freed by kill_bdev, and subsequently that bh will be leaked.
If blocksizes differ, sb_set_blocksize() will kill current buffers and
page cache by using kill_bdev(). The superblock is then reread using the
correct blocksize. sb_set_blocksize() didn't fully free the superblock
page and buffer head; being busy, they were not freed and instead leaked.
This can easily be reproduced by calling an infinite loop of:
systemctl start <ext4_on_lvm>.mount, and
systemctl stop <ext4_on_lvm>.mount
... since systemd creates a cgroup for each slice which it mounts, and
the bh leak gets amplified by a dying memory cgroup that also never
gets freed, and memory consumption is much more easily noticed.
Fixes: ce40733ce93d ("ext4: Check for return value from sb_set_blocksize")
Fixes: ac27a0ec112a ("ext4: initial copy of files from ext3")
Link: https://lore.kernel.org/r/20210521075533.95732-1-amakhalov@vmware.com
Signed-off-by: Alexey Makhalov <amakhalov@vmware.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
|
|
Fast commit recovery data on disk may not be aligned. So, when the
recovery code reads it, this patch makes sure that fast commit info
found on-disk is first memcpy-ed into an aligned variable before
accessing it. As a consequence of it, we also remove some macros that
could result in unaligned accesses.
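The copy-then-read pattern can be sketched outside the kernel as a tag/length/value parser. Python has no alignment faults, so this only mirrors the shape of the fix: the header bytes are copied out of the raw buffer before being interpreted. The record layout (little-endian 16-bit tag and length) and the `iter_tlvs` helper are assumptions for illustration.

```python
import struct

TL_FMT = "<HH"                       # assumed header: tag, length
TL_SIZE = struct.calcsize(TL_FMT)    # 4 bytes, no padding with "<"


def iter_tlvs(buf: bytes):
    """Yield (tag, value) pairs from a packed, possibly unaligned buffer."""
    off = 0
    while off + TL_SIZE <= len(buf):
        # Copy the header out first, like memcpy into an aligned local.
        header = bytes(buf[off:off + TL_SIZE])
        tag, length = struct.unpack(TL_FMT, header)
        start = off + TL_SIZE
        value = bytes(buf[start:start + length])
        if len(value) < length:
            break                    # truncated record, stop parsing
        yield tag, value
        off = start + length


# The 3-byte payload pushes the second header to an odd offset (7).
sample = struct.pack(TL_FMT, 1, 3) + b"abc" + struct.pack(TL_FMT, 2, 1) + b"z"
records = list(iter_tlvs(sample))
```

In C, the equivalent step would be a memcpy of the header into a local struct rather than casting the buffer pointer, which is the approach the commit message describes.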
Cc: stable@kernel.org
Fixes: 8016e29f4362 ("ext4: fast commit recovery path")
Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20210519215920.2037527-1-harshads@google.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
We hit the following BUG_ON when running fsstress with IO fault injection:
[130747.323114] kernel BUG at fs/ext4/extents_status.c:762!
[130747.323117] Internal error: Oops - BUG: 0 [#1] SMP
......
[130747.334329] Call trace:
[130747.334553] ext4_es_cache_extent+0x150/0x168 [ext4]
[130747.334975] ext4_cache_extents+0x64/0xe8 [ext4]
[130747.335368] ext4_find_extent+0x300/0x330 [ext4]
[130747.335759] ext4_ext_map_blocks+0x74/0x1178 [ext4]
[130747.336179] ext4_map_blocks+0x2f4/0x5f0 [ext4]
[130747.336567] ext4_mpage_readpages+0x4a8/0x7a8 [ext4]
[130747.336995] ext4_readpage+0x54/0x100 [ext4]
[130747.337359] generic_file_buffered_read+0x410/0xae8
[130747.337767] generic_file_read_iter+0x114/0x190
[130747.338152] ext4_file_read_iter+0x5c/0x140 [ext4]
[130747.338556] __vfs_read+0x11c/0x188
[130747.338851] vfs_read+0x94/0x150
[130747.339110] ksys_read+0x74/0xf0
This patch's modification is according to Jan Kara's suggestion in:
https://patchwork.ozlabs.org/project/linux-ext4/patch/20210428085158.3728201-1-yebin10@huawei.com/
"I see. Now I understand your patch. Honestly, seeing how fragile is trying
to fix extent tree after split has failed in the middle, I would probably
go even further and make sure we fix the tree properly in case of ENOSPC
and EDQUOT (those are easily user triggerable). Anything else indicates a
HW problem or fs corruption so I'd rather leave the extent tree as is and
don't try to fix it (which also means we will not create overlapping
extents)."
Cc: stable@kernel.org
Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20210506141042.3298679-1-yebin10@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
When fallocate punches holes beyond the inode size, if the original isize
is in the middle of the last cluster, then the part from isize to the end
of the cluster will be zeroed with a buffered write. At that time isize is
not yet updated to match the new size, so if writeback kicks in, it will
invoke ocfs2_writepage()->block_write_full_page(), where the pages beyond
the inode size will be dropped. That will cause file corruption. Fix this
by zeroing out eof blocks when extending the inode size.
Running the following command with qemu-image 4.2.1 easily produces a
corrupted converted image file.
qemu-img convert -p -t none -T none -f qcow2 $qcow_image \
-O qcow2 -o compat=1.1 $qcow_image.conv
The usage of fallocate in qemu is like this: it first punches holes beyond
the inode size, then extends the inode size.
fallocate(11, FALLOC_FL_KEEP_SIZE|FALLOC_FL_PUNCH_HOLE, 2276196352, 65536) = 0
fallocate(11, 0, 2276196352, 65536) = 0
v1: https://www.spinics.net/lists/linux-fsdevel/msg193999.html
v2: https://lore.kernel.org/linux-fsdevel/20210525093034.GB4112@quack2.suse.cz/T/
Link: https://lkml.kernel.org/r/20210528210648.9124-1-junxiao.bi@oracle.com
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Pull io_uring fix from Jens Axboe:
"Just a single one-liner fix for an accounting regression in this
release"
* tag 'io_uring-5.13-2021-06-03' of git://git.kernel.dk/linux-block:
io_uring: fix misaccounting fix buf pinned pages
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"Error handling improvements, caught by error injection:
- handle errors during checksum deletion
- set error on mapping when ordered extent io cannot be finished
- inode link count fixup in tree-log
- missing return value checks for inode updates in tree-log
- abort transaction in rename exchange if adding second reference
fails
Fixes:
- fix fsync failure after writes to prealloc extents
- fix deadlock when cloning inline extents and low on available space
- fix compressed writes that cross stripe boundary"
* tag 'for-5.13-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
MAINTAINERS: add btrfs IRC link
btrfs: fix deadlock when cloning inline extents and low on available space
btrfs: fix fsync failure and transaction abort after writes to prealloc extents
btrfs: abort in rename_exchange if we fail to insert the second ref
btrfs: check error value from btrfs_update_inode in tree log
btrfs: fixup error handling in fixup_inode_link_counts
btrfs: mark ordered extent and inode with error if we fail to finish
btrfs: return errors from btrfs_del_csums in cleanup_ref_head
btrfs: fix error handling in btrfs_del_csums
btrfs: fix compressed writes that cross stripe boundary
|
|
When running generic/527 with fast_commit configuration, the following
issue is seen on Power. With fast_commit, during ext4_fc_replay()
(which can be called from ext4_fill_super()), if inode eviction
happens then it can access an uninitialized percpu counter variable.
This patch adds the check before accessing the counters in
ext4_free_inode() path.
[ 321.165371] run fstests generic/527 at 2021-04-29 08:38:43
[ 323.027786] EXT4-fs (dm-0): mounted filesystem with ordered data mode. Opts: block_validity. Quota mode: none.
[ 323.618772] BUG: Unable to handle kernel data access on read at 0x1fbd80000
[ 323.619767] Faulting instruction address: 0xc000000000bae78c
cpu 0x1: Vector: 300 (Data Access) at [c000000010706ef0]
pc: c000000000bae78c: percpu_counter_add_batch+0x3c/0x100
lr: c0000000006d0bb0: ext4_free_inode+0x780/0xb90
pid = 5593, comm = mount
ext4_free_inode+0x780/0xb90
ext4_evict_inode+0xa8c/0xc60
evict+0xfc/0x1e0
ext4_fc_replay+0xc50/0x20f0
do_one_pass+0xfe0/0x1350
jbd2_journal_recover+0x184/0x2e0
jbd2_journal_load+0x1c0/0x4a0
ext4_fill_super+0x2458/0x4200
mount_bdev+0x1dc/0x290
ext4_mount+0x28/0x40
legacy_get_tree+0x4c/0xa0
vfs_get_tree+0x4c/0x120
path_mount+0xcf8/0xd70
do_mount+0x80/0xd0
sys_mount+0x3fc/0x490
system_call_exception+0x384/0x3d0
system_call_common+0xec/0x278
Cc: stable@kernel.org
Fixes: 8016e29f4362 ("ext4: fast commit recovery path")
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/6cceb9a75c54bef8fa9696c1b08c8df5ff6169e2.1619692410.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
This reverts commit b7f55d928e75557295c1ac280c291b738905b6fb.
As explained by Linus in [*], write faults on a mmap region are reads
from a filesystem point of view, so taking the inode glock exclusively
on write faults is incorrect.
Instead, when a page is marked writable, the .page_mkwrite vm operation
will be called, which is where the exclusive lock taking needs to
happen. I got this wrong because of a broken test case that made me
believe .page_mkwrite isn't getting called when it actually is.
[*] https://lore.kernel.org/lkml/CAHk-=wj8EWr_D65i4oRSj2FTbrc6RdNydNNCGxeabRnwtoU=3Q@mail.gmail.com/
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2
Pull gfs2 fixes from Andreas Gruenbacher:
"Various gfs2 fixes"
* tag 'gfs2-v5.13-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
gfs2: Fix use-after-free in gfs2_glock_shrink_scan
gfs2: Fix mmap locking for write faults
gfs2: Clean up revokes on normal withdraws
gfs2: fix a deadlock on withdraw-during-mount
gfs2: fix scheduling while atomic bug in glocks
gfs2: Fix I_NEW check in gfs2_dinode_in
gfs2: Prevent direct-I/O write fallback errors from getting lost
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull fsnotify fixes from Jan Kara:
"A fix for permission checking with fanotify unprivileged groups.
Also there's a small update in MAINTAINERS file for fanotify"
* tag 'fsnotify_for_v5.13-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
fanotify: fix permission model of unprivileged group
MAINTAINERS: Add Matthew Bobrowski as a reviewer
|
|
The GLF_LRU flag is checked under lru_lock in gfs2_glock_remove_from_lru() to
remove the glock from the lru list in __gfs2_glock_put().
On the shrink scan path, the same flag is cleared under lru_lock but because
of cond_resched_lock(&lru_lock) in gfs2_dispose_glock_lru(), progress on the
put side can be made without deleting the glock from the lru list.
Keep GLF_LRU across the race window opened by cond_resched_lock(&lru_lock) to
ensure correct behavior on both sides - clear GLF_LRU after list_del under
lru_lock.
Reported-by: syzbot <syzbot+34ba7ddbf3021981a228@syzkaller.appspotmail.com>
Signed-off-by: Hillf Danton <hdanton@sina.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
Pull xfs fixes from Darrick Wong:
"This week's pile mitigates some decades-old problems in how extent
size hints interact with realtime volumes, fixes some failures in
online shrink, and fixes a problem where directory and symlink
shrinking on extremely fragmented filesystems could fail.
The most user-notable change here is to point users at our (new) IRC
channel on OFTC. Freedom isn't free, it costs folks like you and me;
and if you don't kowtow, they'll expel everyone and take over your
channel. (Ok, ok, that didn't fit the song lyrics...)
Summary:
- Fix a bug where unmapping operations end earlier than expected,
which can cause chaos on multi-block directory and symlink shrink
operations.
- Fix an erroneous assert that can trigger if we try to transition a
bmap structure from btree format to extents format with zero
extents. This was exposed by xfs/538"
* tag 'xfs-5.13-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: bunmapi has unnecessary AG lock ordering issues
xfs: btree format inode forks can have zero extents
xfs: add new IRC channel to MAINTAINERS
xfs: validate extsz hints against rt extent size when rtinherit is set
xfs: standardize extent size hint validation
xfs: check free AG space when making per-AG reservations
|
|
As Andres reports "... io_sqe_buffer_register() doesn't initialize imu.
io_buffer_account_pin() does imu->acct_pages++, before calling
io_account_mem(ctx, imu->acct_pages).", leading to eventual -ENOMEM.
Initialise the field.
Reported-by: Andres Freund <andres@anarazel.de>
Fixes: 41edf1a5ec967 ("io_uring: keep table of pointers to ubufs")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/438a6f46739ae5e05d9c75a0c8fa235320ff367c.1622285901.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core fixes from Greg KH:
"Here are three small driver core / debugfs fixes for 5.13-rc4:
- debugfs fix for incorrect "lockdown" mode for selinux accesses
- two device link changes, one bugfix and one cleanup
All of these have been in linux-next for over a week with no reported
problems"
* tag 'driver-core-5.13-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
drivers: base: Reduce device link removal code duplication
drivers: base: Fix device link removal
debugfs: fix security_locked_down() call for SELinux
|
|
Pull io_uring fixes from Jens Axboe:
"A few minor fixes:
- Fix an issue with hashed wait removal on exit (Zqiang, Pavel)
- Fix a recent data race introduced in this series (Marco)"
* tag 'io_uring-5.13-2021-05-28' of git://git.kernel.dk/linux-block:
io_uring: fix data race to avoid potential NULL-deref
io-wq: Fix UAF when wakeup wqe in hash waitqueue
io_uring/io-wq: close io-wq full-stop gap
|
|
Pull cifs fixes from Steve French:
"Three SMB3 fixes.
Two for stable, and the other fixes a problem pointed out with a
recently added ioctl"
* tag '5.13-rc4-smb3' of git://git.samba.org/sfrench/cifs-2.6:
cifs: change format of CIFS_FULL_KEY_DUMP ioctl
cifs: fix string declarations and assignments in tracepoints
cifs: set server->cipher_type to AES-128-CCM for SMB3.0
|
|
Pull NFS client bugfixes from Trond Myklebust:
"Stable fixes:
- Fix v4.0/v4.1 SEEK_DATA return -ENOTSUPP when set NFS_V4_2 config
- Fix Oops in xs_tcp_send_request() when transport is disconnected
- Fix a NULL pointer dereference in pnfs_mark_matching_lsegs_return()
Bugfixes:
- Fix instances where signal_pending() should be fatal_signal_pending()
- fix an incorrect limit in filelayout_decode_layout()
- Fixes for the SUNRPC backlogged RPC queue
- Don't corrupt the value of pg_bytes_written in nfs_do_recoalesce()
- Revert commit 586a0787ce35 ("Clean up rpcrdma_prepare_readch()")"
* tag 'nfs-for-5.13-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
nfs: Remove trailing semicolon in macros
xprtrdma: Revert 586a0787ce35
NFSv4: Fix v4.0/v4.1 SEEK_DATA return -ENOTSUPP when set NFS_V4_2 config
NFS: Clean up reset of the mirror accounting variables
NFS: Don't corrupt the value of pg_bytes_written in nfs_do_recoalesce()
NFS: Fix an Oopsable condition in __nfs_pageio_add_request()
SUNRPC: More fixes for backlog congestion
SUNRPC: Fix Oops in xs_tcp_send_request() when transport is disconnected
NFSv4: Fix a NULL pointer dereference in pnfs_mark_matching_lsegs_return()
SUNRPC in case of backlog, hand free slots directly to waiting task
pNFS/NFSv4: Remove redundant initialization of 'rd_size'
NFS: fix an incorrect limit in filelayout_decode_layout()
fs/nfs: Use fatal_signal_pending instead of signal_pending
|
|
There are a few cases where cloning an inline extent requires copying data
into a page of the destination inode. For these cases we are allocating
the required data and metadata space while holding a leaf locked. This can
result in a deadlock when we are low on available space because allocating
the space may flush delalloc and two deadlock scenarios can happen:
1) When starting writeback for an inode with a very small dirty range that
fits in an inline extent, we deadlock during the writeback when trying
to insert the inline extent, at cow_file_range_inline(), if the extent
is going to be located in the leaf for which we are already holding a
read lock;
2) After successfully starting writeback, for non-inline extent cases,
the async reclaim thread will hang waiting for an ordered extent to
complete if the ordered extent completion needs to modify the leaf
for which the clone task is holding a read lock (for adding or
replacing file extent items). So the cloning task will wait forever
on the async reclaim thread to make progress, which in turn is
waiting for the ordered extent completion which in turn is waiting
to acquire a write lock on the same leaf.
So fix this by making sure we release the path (and therefore the leaf)
every time we need to copy the inline extent's data into a page of the
destination inode, as by that time we do not need to have the leaf locked.
Fixes: 05a5a7621ce66c ("Btrfs: implement full reflink support for inline extents")
CC: stable@vger.kernel.org # 5.10+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
When doing a series of partial writes to different ranges of preallocated
extents with transaction commits and fsyncs in between, we can end up with
checksum items in a log tree. This causes an fsync to fail with -EIO and
abort the transaction, turning the filesystem to RO mode, when syncing the
log.
For this to happen, we need to have a full fsync of a file following one
or more fast fsyncs.
The following example reproduces the problem and explains how it happens:
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt
# Create our test file with 2 preallocated extents. Leave a 1M hole
# between them to ensure that we get two file extent items that will
# never be merged into a single one. The extents are contiguous on disk,
# which will later result in the checksums for their data to be merged
# into a single checksum item in the csums btree.
#
$ xfs_io -f \
-c "falloc 0 1M" \
-c "falloc 3M 3M" \
/mnt/foobar
# Now write to the second extent and leave only 1M of it as unwritten,
# which corresponds to the file range [4M, 5M[.
#
# Then fsync the file to flush delalloc and to clear full sync flag from
# the inode, so that a future fsync will use the fast code path.
#
# After the writeback triggered by the fsync we have 3 file extent items
# that point to the second extent we previously allocated:
#
# 1) One file extent item of type BTRFS_FILE_EXTENT_REG that covers the
# file range [3M, 4M[
#
# 2) One file extent item of type BTRFS_FILE_EXTENT_PREALLOC that covers
# the file range [4M, 5M[
#
# 3) One file extent item of type BTRFS_FILE_EXTENT_REG that covers the
# file range [5M, 6M[
#
# All these file extent items have a generation of 6, which is the ID of
# the transaction where they were created. The split of the original file
# extent item is done at btrfs_mark_extent_written() when ordered extents
# complete for the file ranges [3M, 4M[ and [5M, 6M[.
#
$ xfs_io -c "pwrite -S 0xab 3M 1M" \
-c "pwrite -S 0xef 5M 1M" \
-c "fsync" \
/mnt/foobar
# Commit the current transaction. This wipes out the log tree created by
# the previous fsync.
$ sync
# Now write to the unwritten range of the second extent we allocated,
# corresponding to the file range [4M, 5M[, and fsync the file, which
# triggers the fast fsync code path.
#
# The fast fsync code path sees that there is a new extent map covering
# the file range [4M, 5M[ and therefore it will log a checksum item
# covering the range [1M, 2M[ of the second extent we allocated.
#
# Also, after the fsync finishes we no longer have the 3 file extent
# items that pointed to 3 sections of the second extent we allocated.
# Instead we end up with a single file extent item pointing to the whole
# extent, with a type of BTRFS_FILE_EXTENT_REG and a generation of 7 (the
# current transaction ID). This is due to the file extent item merging we
# do when completing ordered extents into ranges that point to unwritten
# (preallocated) extents. This merging is done at
# btrfs_mark_extent_written().
#
$ xfs_io -c "pwrite -S 0xcd 4M 1M" \
-c "fsync" \
/mnt/foobar
# Now do a write to our file outside the range of the second extent
# that we allocated with fallocate() and truncate the file size from 6M
# down to 5M.
#
# The truncate operation sets the full sync runtime flag on the inode,
# forcing the next fsync to use the slow code path. It also changes the
# length of the second file extent item so that it represents the file
# range [3M, 5M[ and not the range [3M, 6M[ anymore.
#
# Finally fsync the file. Since this is a fsync that triggers the slow
# code path, it will remove all items associated to the inode from the
# log tree and then it will scan for file extent items in the
# fs/subvolume tree that have a generation matching the current
# transaction ID, which is 7. This means it will log 2 file extent
# items:
#
# 1) One for the first extent we allocated, covering the file range
# [0, 1M[
#
# 2) Another for the first 2M of the second extent we allocated,
# covering the file range [3M, 5M[
#
# When logging the first file extent item we log a single checksum item
# that has all the checksums for the entire extent.
#
# When logging the second file extent item, we also look up the
# checksums that are associated with the range [0, 2M[ of the second
# extent we allocated (file range [3M, 5M[), and then we log them with
# btrfs_csum_file_blocks(). However that results in ending up with a log
# that has two checksum items with ranges that overlap:
#
# 1) One for the range [1M, 2M[ of the second extent we allocated,
# corresponding to the file range [4M, 5M[, which we logged in the
# previous fsync that used the fast code path;
#
# 2) One for the ranges [0, 1M[ and [0, 2M[ of the first and second
# extents, respectively, corresponding to the file ranges [0, 1M[
# and [3M, 5M[. This one was added during this last fsync that uses
# the slow code path and overlaps with the previous one logged by
# the previous fast fsync.
#
# This happens because when logging the checksums for the second
# extent, we notice they start at an offset that matches the end of the
# checksums item that we logged for the first extent, and because both
# extents are contiguous on disk, btrfs_csum_file_blocks() decides to
# extend that existing checksums item and append the checksums for the
# second extent to this item. The end result is we end up with two
# checksum items in the log tree that have overlapping ranges, as
# listed before, causing the fsync to fail with -EIO and abort the
# transaction, turning the filesystem into RO mode.
#
$ xfs_io -c "pwrite -S 0xff 0 1M" \
-c "truncate 5M" \
-c "fsync" \
/mnt/foobar
fsync: Input/output error
After running the example, dmesg/syslog shows the tree checker complained
about the checksum items with overlapping ranges and we aborted the
transaction:
$ dmesg
(...)
[756289.557487] BTRFS critical (device sdc): corrupt leaf: root=18446744073709551610 block=30720000 slot=5, csum end range (16777216) goes beyond the start range (15728640) of the next csum item
[756289.560583] BTRFS info (device sdc): leaf 30720000 gen 7 total ptrs 7 free space 11677 owner 18446744073709551610
[756289.562435] BTRFS info (device sdc): refs 2 lock_owner 0 current 2303929
[756289.563654] item 0 key (257 1 0) itemoff 16123 itemsize 160
[756289.564649] inode generation 6 size 5242880 mode 100600
[756289.565636] item 1 key (257 12 256) itemoff 16107 itemsize 16
[756289.566694] item 2 key (257 108 0) itemoff 16054 itemsize 53
[756289.567725] extent data disk bytenr 13631488 nr 1048576
[756289.568697] extent data offset 0 nr 1048576 ram 1048576
[756289.569689] item 3 key (257 108 1048576) itemoff 16001 itemsize 53
[756289.570682] extent data disk bytenr 0 nr 0
[756289.571363] extent data offset 0 nr 2097152 ram 2097152
[756289.572213] item 4 key (257 108 3145728) itemoff 15948 itemsize 53
[756289.573246] extent data disk bytenr 14680064 nr 3145728
[756289.574121] extent data offset 0 nr 2097152 ram 3145728
[756289.574993] item 5 key (18446744073709551606 128 13631488) itemoff 12876 itemsize 3072
[756289.576113] item 6 key (18446744073709551606 128 15728640) itemoff 11852 itemsize 1024
[756289.577286] BTRFS error (device sdc): block=30720000 write time tree block corruption detected
[756289.578644] ------------[ cut here ]------------
[756289.579376] WARNING: CPU: 0 PID: 2303929 at fs/btrfs/disk-io.c:465 csum_one_extent_buffer+0xed/0x100 [btrfs]
[756289.580857] Modules linked in: btrfs dm_zero dm_dust loop dm_snapshot (...)
[756289.591534] CPU: 0 PID: 2303929 Comm: xfs_io Tainted: G W 5.12.0-rc8-btrfs-next-87 #1
[756289.592580] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[756289.594161] RIP: 0010:csum_one_extent_buffer+0xed/0x100 [btrfs]
[756289.595122] Code: 5d c3 e8 76 60 (...)
[756289.597509] RSP: 0018:ffffb51b416cb898 EFLAGS: 00010282
[756289.598142] RAX: 0000000000000000 RBX: fffff02b8a365bc0 RCX: 0000000000000000
[756289.598970] RDX: 0000000000000000 RSI: ffffffffa9112421 RDI: 00000000ffffffff
[756289.599798] RBP: ffffa06500880000 R08: 0000000000000000 R09: 0000000000000000
[756289.600619] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000
[756289.601456] R13: ffffa0652b1d8980 R14: ffffa06500880000 R15: 0000000000000000
[756289.602278] FS: 00007f08b23c9800(0000) GS:ffffa0682be00000(0000) knlGS:0000000000000000
[756289.603217] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[756289.603892] CR2: 00005652f32d0138 CR3: 000000025d616003 CR4: 0000000000370ef0
[756289.604725] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[756289.605563] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[756289.606400] Call Trace:
[756289.606704] btree_csum_one_bio+0x244/0x2b0 [btrfs]
[756289.607313] btrfs_submit_metadata_bio+0xb7/0x100 [btrfs]
[756289.608040] submit_one_bio+0x61/0x70 [btrfs]
[756289.608587] btree_write_cache_pages+0x587/0x610 [btrfs]
[756289.609258] ? free_debug_processing+0x1d5/0x240
[756289.609812] ? __module_address+0x28/0xf0
[756289.610298] ? lock_acquire+0x1a0/0x3e0
[756289.610754] ? lock_acquired+0x19f/0x430
[756289.611220] ? lock_acquire+0x1a0/0x3e0
[756289.611675] do_writepages+0x43/0xf0
[756289.612101] ? __filemap_fdatawrite_range+0xa4/0x100
[756289.612800] __filemap_fdatawrite_range+0xc5/0x100
[756289.613393] btrfs_write_marked_extents+0x68/0x160 [btrfs]
[756289.614085] btrfs_sync_log+0x21c/0xf20 [btrfs]
[756289.614661] ? finish_wait+0x90/0x90
[756289.615096] ? __mutex_unlock_slowpath+0x45/0x2a0
[756289.615661] ? btrfs_log_inode_parent+0x3c9/0xdc0 [btrfs]
[756289.616338] ? lock_acquire+0x1a0/0x3e0
[756289.616801] ? lock_acquired+0x19f/0x430
[756289.617284] ? lock_acquire+0x1a0/0x3e0
[756289.617750] ? lock_release+0x214/0x470
[756289.618221] ? lock_acquired+0x19f/0x430
[756289.618704] ? dput+0x20/0x4a0
[756289.619079] ? dput+0x20/0x4a0
[756289.619452] ? lockref_put_or_lock+0x9/0x30
[756289.619969] ? lock_release+0x214/0x470
[756289.620445] ? lock_release+0x214/0x470
[756289.620924] ? lock_release+0x214/0x470
[756289.621415] btrfs_sync_file+0x46a/0x5b0 [btrfs]
[756289.621982] do_fsync+0x38/0x70
[756289.622395] __x64_sys_fsync+0x10/0x20
[756289.622907] do_syscall_64+0x33/0x80
[756289.623438] entry_SYSCALL_64_after_hwframe+0x44/0xae
[756289.624063] RIP: 0033:0x7f08b27fbb7b
[756289.624588] Code: 0f 05 48 3d 00 (...)
[756289.626760] RSP: 002b:00007ffe2583f940 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
[756289.627639] RAX: ffffffffffffffda RBX: 00005652f32cd0f0 RCX: 00007f08b27fbb7b
[756289.628464] RDX: 00005652f32cbca0 RSI: 00005652f32cd110 RDI: 0000000000000003
[756289.629323] RBP: 00005652f32cd110 R08: 0000000000000000 R09: 00007f08b28c4be0
[756289.630172] R10: fffffffffffff39a R11: 0000000000000293 R12: 0000000000000001
[756289.631007] R13: 00005652f32cd0f0 R14: 0000000000000001 R15: 00005652f32cc480
[756289.631819] irq event stamp: 0
[756289.632188] hardirqs last enabled at (0): [<0000000000000000>] 0x0
[756289.632911] hardirqs last disabled at (0): [<ffffffffa7e97c29>] copy_process+0x879/0x1cc0
[756289.633893] softirqs last enabled at (0): [<ffffffffa7e97c29>] copy_process+0x879/0x1cc0
[756289.634871] softirqs last disabled at (0): [<0000000000000000>] 0x0
[756289.635606] ---[ end trace 0a039fdc16ff3fef ]---
[756289.636179] BTRFS: error (device sdc) in btrfs_sync_log:3136: errno=-5 IO failure
[756289.637082] BTRFS info (device sdc): forced readonly
Having checksum items covering ranges that overlap is dangerous as in some
cases it can lead to having extent ranges for which we miss checksums
after log replay or getting the wrong checksum item. There were some fixes
in the past for bugs that resulted in this problem, and were explained and
fixed by the following commits:
27b9a8122ff71a ("Btrfs: fix csum tree corruption, duplicate and outdated checksums")
b84b8390d6009c ("Btrfs: fix file read corruption after extent cloning and fsync")
40e046acbd2f36 ("Btrfs: fix missing data checksums after replaying a log tree")
e289f03ea79bbc ("btrfs: fix corrupt log due to concurrent fsync of inodes with shared extents")
Fix the issue by making btrfs_csum_file_blocks() take into account the
start offset of the next checksum item when it decides to extend an
existing checksum item, so that it never extends the checksum to end at a
range that goes beyond the start range of the next checksum item.
When we cannot access the next checksum item without releasing the path,
simply drop the optimization of extending the previous checksum item and
fall back to inserting a new checksum item - this happens rarely and the
optimization is not significant enough for a log tree in order to justify
the extra complexity, as it would only save a few bytes (the size of a
struct btrfs_item) of leaf space.
This behaviour is only needed when inserting into a log tree because
for the regular checksums tree we never have a case where we try to
insert a range of checksums that overlaps with a range that was previously
inserted.
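The numbers in the tree checker message can be checked with a small sketch. It assumes 4 KiB sectors and 4-byte checksums, which is consistent with the item sizes in the dmesg output but is not stated explicitly there. can_extend_csum_item() is a hypothetical helper that only illustrates the bound described above, it is not the code in btrfs_csum_file_blocks().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumptions for illustration: 4 KiB sectors, 4-byte checksums. */
#define SECTOR_SIZE 4096ULL
#define CSUM_SIZE   4ULL

/* Exclusive end of the disk byte range covered by a checksum item. */
static uint64_t csum_item_end(uint64_t start, uint64_t item_size)
{
	assert(item_size % CSUM_SIZE == 0);
	return start + (item_size / CSUM_SIZE) * SECTOR_SIZE;
}

/*
 * Hypothetical helper: an item currently ending at prev_end may grow by
 * extend_bytes of data only if it does not run past the next item's start.
 */
static bool can_extend_csum_item(uint64_t prev_end, uint64_t extend_bytes,
				 uint64_t next_start)
{
	return prev_end + extend_bytes <= next_start;
}
```

With the values from the log, csum_item_end(13631488, 3072) gives 16777216, which is beyond the start (15728640) of the next item, matching the tree checker complaint. Appending 2M of checksums to an item ending at 14680064 is rejected by can_extend_csum_item() when the next item starts at 15728640.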
A test case for fstests will follow soon.
Reported-by: Philipp Fent <fent@in.tum.de>
Link: https://lore.kernel.org/linux-btrfs/93c4600e-5263-5cba-adf0-6f47526e7561@in.tum.de/
CC: stable@vger.kernel.org # 5.4+
Tested-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Error injection stress uncovered a problem where we'd leave a dangling
inode ref if we failed during a rename_exchange. This happens because
we insert the inode ref for one side of the rename, and then for the
other side. If this second inode ref insert fails we'll leave the first
one dangling and leave a corrupt file system behind. Fix this by
aborting the transaction if we already did the insert for the first
inode ref.
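A minimal user-space sketch of the idea, assuming a transaction handle that only records whether it was aborted. All names here (struct model_trans, model_exchange_refs() and the insert callbacks) are hypothetical and do not exist in btrfs.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical transaction handle: only tracks whether it was aborted. */
struct model_trans {
	bool aborted;
	int abort_err;
};

static void model_abort_transaction(struct model_trans *trans, int err)
{
	assert(err < 0);
	trans->aborted = true;
	trans->abort_err = err;
}

/* Hypothetical insert callback: returns 0 or a negative errno value. */
typedef int (*model_insert_fn)(void);

static int model_insert_ok(void)
{
	return 0;
}

static int model_insert_fail(void)
{
	return -12;	/* the value of -ENOMEM on Linux */
}

static int model_exchange_refs(struct model_trans *trans,
			       model_insert_fn insert_first,
			       model_insert_fn insert_second)
{
	bool need_abort = false;
	int ret;

	ret = insert_first();
	if (ret)
		return ret;	/* nothing was modified, a plain error is enough */
	need_abort = true;	/* the first ref now exists */

	ret = insert_second();
	if (ret && need_abort)
		model_abort_transaction(trans, ret);
	return ret;
}
```

The flag is redundant in this linear sketch and is kept only to make the condition explicit: a failure of the first insert returns the error without aborting, a failure of the second insert aborts.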
CC: stable@vger.kernel.org # 4.9+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Error injection testing uncovered a case where we ended up with invalid
link counts on an inode. This happened because we failed to notice an
error when updating the inode while replaying the tree log, and
committed the transaction with an invalid file system.
Fix this by checking the return value of btrfs_update_inode. This
resolved the link count errors I was seeing, and we already properly
handle passing up the error values in these paths.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
This function has the following pattern:
    while (1) {
        ret = whatever();
        if (ret)
            goto out;
    }
    ret = 0;
out:
    return ret;
However, in several places in this while loop we simply break when there's
a problem, thus clearing the return value, and in one case we do a
return -EIO, and leak the memory for the path.
Fix this by re-arranging the loop to deal with ret == 1 coming from
btrfs_search_slot, and then simply delete the
ret = 0;
out:
bit so everybody can break if there is an error, which will allow for
proper error handling to occur.
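The effect of the trailing "ret = 0;" can be reproduced with a stand-alone sketch. The step callback and both run_steps_*() functions are hypothetical stand-ins, and the handling of ret == 1 from btrfs_search_slot is not modelled here.

```c
#include <assert.h>

/* Hypothetical stand-in for a step that returns 0 or a negative errno. */
typedef int (*step_fn)(int iteration);

/* Buggy shape: a "break" on error falls through to "ret = 0". */
static int run_steps_buggy(step_fn step, int count)
{
	int ret;
	int i = 0;

	while (1) {
		if (i >= count)
			break;
		ret = step(i++);
		if (ret)
			break;	/* the error is about to be overwritten */
	}
	ret = 0;
	return ret;
}

/* Fixed shape: ret is meaningful on every exit from the loop. */
static int run_steps_fixed(step_fn step, int count)
{
	int ret = 0;
	int i = 0;

	while (i < count) {
		ret = step(i++);
		if (ret)
			break;
	}
	return ret;
}

/* Example step: fails with -5 (the value of -EIO) on the third call. */
static int fail_on_third(int iteration)
{
	return iteration == 2 ? -5 : 0;
}
```

With fail_on_third() as the step and five iterations, run_steps_buggy() returns 0 while run_steps_fixed() returns -5.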
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
While doing error injection testing I saw that sometimes we'd get an
abort that wouldn't stop the current transaction commit from completing.
This abort was coming from finish ordered IO, but at this point in the
transaction commit we should have gotten an error and stopped.
It turns out the abort came from finish ordered io while trying to write
out the free space cache. It occurred to me that any failure inside of
finish_ordered_io isn't actually raised to the person doing the writing,
so we could have any number of failures in this path and think the
ordered extent completed successfully and the inode was fine.
Fix this by marking the ordered extent with BTRFS_ORDERED_IOERR, and
marking the mapping of the inode with mapping_set_error, so any callers
that simply call fdatawait will also get the error.
With this we're seeing the IO error on the free space inode when we fail
to do the finish_ordered_io.
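As a rough illustration of why the error has to be recorded rather than returned, here is a user-space model. The model_* names and MODEL_ORDERED_IOERR are hypothetical and only mimic the roles of BTRFS_ORDERED_IOERR, mapping_set_error and fdatawait. They are not the kernel definitions.

```c
#include <assert.h>

#define MODEL_ORDERED_IOERR (1u << 0)

struct model_mapping {
	int error;	/* first recorded error, 0 if none */
};

struct model_ordered_extent {
	unsigned int flags;
	struct model_mapping *mapping;
};

/*
 * Completion work has no caller to hand an error back to, so a failure
 * is recorded on the ordered extent and on the inode's mapping instead.
 */
static void model_finish_ordered_io(struct model_ordered_extent *oe, int ret)
{
	assert(ret <= 0);
	if (ret) {
		oe->flags |= MODEL_ORDERED_IOERR;
		if (!oe->mapping->error)
			oe->mapping->error = ret;
	}
}

/* In this model a waiter picks up the recorded error and clears it. */
static int model_fdatawait(struct model_mapping *mapping)
{
	int err = mapping->error;

	mapping->error = 0;
	return err;
}
```

A caller that only waits on the mapping then sees the failure on its next wait, and an error-free mapping afterwards.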
CC: stable@vger.kernel.org # 4.19+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We are unconditionally returning 0 in cleanup_ref_head, despite the fact
that btrfs_del_csums could fail. We need to return the error so the
transaction gets aborted properly. Fix this by returning ret from
btrfs_del_csums in cleanup_ref_head.
Reviewed-by: Qu Wenruo <wqu@suse.com>
CC: stable@vger.kernel.org # 4.19+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Error injection stress would sometimes fail with checksums on disk that
did not have a corresponding extent. This occurred because the pattern
in btrfs_del_csums was
    while (1) {
        ret = btrfs_search_slot();
        if (ret < 0)
            break;
    }
    ret = 0;
out:
    btrfs_free_path(path);
    return ret;
If we got an error from btrfs_search_slot we'd clear the error because
we were breaking instead of doing a goto out. Instead of using goto out, simply
handle the cases where we may leave a random value in ret, and get rid
of the
ret = 0;
out:
pattern and simply allow break to have the proper error reporting. With
this fix we properly abort the transaction and do not commit thinking we
successfully deleted the csum.
Reviewed-by: Qu Wenruo <wqu@suse.com>
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|