summaryrefslogtreecommitdiff
path: root/fs/ext4/ext4.h
AgeCommit message (Collapse)Author
2020-04-01ext4: save all error info in save_error_info() and drop ext4_set_errno()Theodore Ts'o
Using a separate function, ext4_set_errno() to set the errno is problematic because it doesn't do the right thing once s_last_error_errorcode is non-zero. It's also less racy to set all of the error information all at once. (Also, as a bonus, it shrinks code size slightly.) Link: https://lore.kernel.org/r/20200329020404.686965-1-tytso@mit.edu Fixes: 878520ac45f9 ("ext4: save the error code which triggered...") Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-03-05ext4: remove EXT4_EOFBLOCKS_FL and associated codeEric Whitney
The EXT4_EOFBLOCKS_FL inode flag is used to indicate whether a file contains unwritten blocks past i_size. It's set when ext4_fallocate is called with the KEEP_SIZE flag to extend a file with an unwritten extent. However, this flag hasn't been useful functionally since March, 2012, when a decision was made to remove it from ext4. All traces of EXT4_EOFBLOCKS_FL were removed from e2fsprogs version 1.42.2 by commit 010dc7b90d97 ("e2fsck: remove EXT4_EOFBLOCKS_FL flag handling") at that time. Now that enough time has passed to make e2fsprogs versions containing this modification common, this patch now removes the code associated with EXT4_EOFBLOCKS_FL from the kernel as well. This change has two implications. First, because pre-1.42.2 e2fsck versions only look for a problem if EXT4_EOFBLOCKS_FL is set, and because that bit will never be set by newer kernels containing this patch, old versions of e2fsck won't have a compatibility problem with files created by newer kernels. Second, newer kernels will not clear EXT4_EOFBLOCKS_FL inode flag bits belonging to a file written by an older kernel. If set, it will remain in that state until the file is deleted. Because e2fsck versions since 1.42.2 don't check the flag at all, no adverse effect is expected. However, pre-1.42.2 e2fsck versions that do check the flag may report that it is set when it ought not to be after a file has been truncated or had its unwritten blocks written. In this case, the old version of e2fsck will offer to clear the flag. No adverse effect would then occur whether the user chooses to clear the flag or not. Signed-off-by: Eric Whitney <enwlinux@gmail.com> Link: https://lore.kernel.org/r/20200211210216.24960-1-enwlinux@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-02-21ext4: fix race between writepages and enabling EXT4_EXTENTS_FLEric Biggers
If EXT4_EXTENTS_FL is set on an inode while ext4_writepages() is running on it, the following warning in ext4_add_complete_io() can be hit: WARNING: CPU: 1 PID: 0 at fs/ext4/page-io.c:234 ext4_put_io_end_defer+0xf0/0x120 Here's a minimal reproducer (not 100% reliable) (root isn't required): while true; do sync done & while true; do rm -f file touch file chattr -e file echo X >> file chattr +e file done The problem is that in ext4_writepages(), ext4_should_dioread_nolock() (which only returns true on extent-based files) is checked once to set the number of reserved journal credits, and also again later to select the flags for ext4_map_blocks() and copy the reserved journal handle to ext4_io_end::handle. But if EXT4_EXTENTS_FL is being concurrently set, the first check can see dioread_nolock disabled while the later one can see it enabled, causing the reserved handle to unexpectedly be NULL. Since changing EXT4_EXTENTS_FL is uncommon, and there may be other races related to doing so as well, fix this by synchronizing changing EXT4_EXTENTS_FL with ext4_writepages() via the existing s_writepages_rwsem (previously called s_journal_flag_rwsem). This was originally reported by syzbot without a reproducer at https://syzkaller.appspot.com/bug?extid=2202a584a00fffd19fbf, but now that dioread_nolock is the default I also started seeing this when running syzkaller locally. Link: https://lore.kernel.org/r/20200219183047.47417-3-ebiggers@kernel.org Reported-by: syzbot+2202a584a00fffd19fbf@syzkaller.appspotmail.com Fixes: 6b523df4fb5a ("ext4: use transaction reservation for extent conversion in ext4_end_io") Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz> Cc: stable@kernel.org
2020-02-21ext4: rename s_journal_flag_rwsem to s_writepages_rwsemEric Biggers
In preparation for making s_journal_flag_rwsem synchronize ext4_writepages() with changes to both the EXTENTS and JOURNAL_DATA flags (rather than just JOURNAL_DATA as it does currently), rename it to s_writepages_rwsem. Link: https://lore.kernel.org/r/20200219183047.47417-2-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz> Cc: stable@kernel.org
2020-02-21ext4: fix potential race between s_flex_groups online resizing and accessSuraj Jitindar Singh
During an online resize an array of s_flex_groups structures gets replaced so it can get enlarged. If there is a concurrent access to the array and this memory has been reused then this can lead to an invalid memory access. The s_flex_group array has been converted into an array of pointers rather than an array of structures. This is to ensure that the information contained in the structures cannot get out of sync during a resize due to an accessor updating the value in the old structure after it has been copied but before the array pointer is updated. Since the structures them- selves are no longer copied but only the pointers to them this case is mitigated. Link: https://bugzilla.kernel.org/show_bug.cgi?id=206443 Link: https://lore.kernel.org/r/20200221053458.730016-4-tytso@mit.edu Signed-off-by: Suraj Jitindar Singh <surajjs@amazon.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org
2020-02-21ext4: fix potential race between s_group_info online resizing and accessSuraj Jitindar Singh
During an online resize an array of pointers to s_group_info gets replaced so it can get enlarged. If there is a concurrent access to the array in ext4_get_group_info() and this memory has been reused then this can lead to an invalid memory access. Link: https://bugzilla.kernel.org/show_bug.cgi?id=206443 Link: https://lore.kernel.org/r/20200221053458.730016-3-tytso@mit.edu Signed-off-by: Suraj Jitindar Singh <surajjs@amazon.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Balbir Singh <sblbir@amazon.com> Cc: stable@kernel.org
2020-02-21ext4: fix potential race between online resizing and write operationsTheodore Ts'o
During an online resize an array of pointers to buffer heads gets replaced so it can get enlarged. If there is a racing block allocation or deallocation which uses the old array, and the old array has gotten reused this can lead to a GPF or some other random kernel memory getting modified. Link: https://bugzilla.kernel.org/show_bug.cgi?id=206443 Link: https://lore.kernel.org/r/20200221053458.730016-2-tytso@mit.edu Reported-by: Suraj Jitindar Singh <surajjs@amazon.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org
2020-02-19ext4: fix a data race in EXT4_I(inode)->i_disksizeQian Cai
EXT4_I(inode)->i_disksize could be accessed concurrently as noticed by KCSAN, BUG: KCSAN: data-race in ext4_write_end [ext4] / ext4_writepages [ext4] write to 0xffff91c6713b00f8 of 8 bytes by task 49268 on cpu 127: ext4_write_end+0x4e3/0x750 [ext4] ext4_update_i_disksize at fs/ext4/ext4.h:3032 (inlined by) ext4_update_inode_size at fs/ext4/ext4.h:3046 (inlined by) ext4_write_end at fs/ext4/inode.c:1287 generic_perform_write+0x208/0x2a0 ext4_buffered_write_iter+0x11f/0x210 [ext4] ext4_file_write_iter+0xce/0x9e0 [ext4] new_sync_write+0x29c/0x3b0 __vfs_write+0x92/0xa0 vfs_write+0x103/0x260 ksys_write+0x9d/0x130 __x64_sys_write+0x4c/0x60 do_syscall_64+0x91/0xb47 entry_SYSCALL_64_after_hwframe+0x49/0xbe read to 0xffff91c6713b00f8 of 8 bytes by task 24872 on cpu 37: ext4_writepages+0x10ac/0x1d00 [ext4] mpage_map_and_submit_extent at fs/ext4/inode.c:2468 (inlined by) ext4_writepages at fs/ext4/inode.c:2772 do_writepages+0x5e/0x130 __writeback_single_inode+0xeb/0xb20 writeback_sb_inodes+0x429/0x900 __writeback_inodes_wb+0xc4/0x150 wb_writeback+0x4bd/0x870 wb_workfn+0x6b4/0x960 process_one_work+0x54c/0xbe0 worker_thread+0x80/0x650 kthread+0x1e0/0x200 ret_from_fork+0x27/0x50 Reported by Kernel Concurrency Sanitizer on: CPU: 37 PID: 24872 Comm: kworker/u261:2 Tainted: G W O L 5.5.0-next-20200204+ #5 Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019 Workqueue: writeback wb_workfn (flush-7:0) Since only the read is operating as lockless (outside of the "i_data_sem"), load tearing could introduce a logic bug. Fix it by adding READ_ONCE() for the read and WRITE_ONCE() for the write. Signed-off-by: Qian Cai <cai@lca.pw> Link: https://lore.kernel.org/r/1581085751-31793-1-git-send-email-cai@lca.pw Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org
2020-02-13ext4: fix checksum errors with indexed dirsJan Kara
DIR_INDEX has been introduced as a compat ext4 feature. That means that even kernels / tools that don't understand the feature may modify the filesystem. This works because for kernels not understanding indexed dir format, internal htree nodes appear just as empty directory entries. Index dir aware kernels then check the htree structure is still consistent before using the data. This all worked reasonably well until metadata checksums were introduced. The problem is that these effectively made DIR_INDEX only ro-compatible because internal htree nodes store checksums in a different place than normal directory blocks. Thus any modification ignorant to DIR_INDEX (or just clearing EXT4_INDEX_FL from the inode) will effectively cause checksum mismatch and trigger kernel errors. So we have to be more careful when dealing with indexed directories on filesystems with checksumming enabled. 1) We just disallow loading any directory inodes with EXT4_INDEX_FL when DIR_INDEX is not enabled. This is harsh but it should be very rare (it means someone disabled DIR_INDEX on existing filesystem and didn't run e2fsck), e2fsck can fix the problem, and we don't want to answer the difficult question: "Should we rather corrupt the directory more or should we ignore that DIR_INDEX feature is not set?" 2) When we find out htree structure is corrupted (but the filesystem and the directory should in support htrees), we continue just ignoring htree information for reading but we refuse to add new entries to the directory to avoid corrupting it more. Link: https://lore.kernel.org/r/20200210144316.22081-1-jack@suse.cz Fixes: dbe89444042a ("ext4: Calculate and verify checksums for htree nodes") Reviewed-by: Andreas Dilger <adilger@dilger.ca> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org
2020-01-17ext4: drop ext4_kvmalloc()Theodore Ts'o
As Jan pointed out[1], as of commit 81378da64de ("jbd2: mark the transaction context with the scope GFP_NOFS context") we use memalloc_nofs_{save,restore}() while a jbd2 handle is active. So ext4_kvmalloc() so we can call allocate using GFP_NOFS is no longer necessary. [1] https://lore.kernel.org/r/20200109100007.GC27035@quack2.suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu> Link: https://lore.kernel.org/r/20200116155031.266620-1-tytso@mit.edu Reviewed-by: Jan Kara <jack@suse.cz>
2020-01-17ext4: make some functions static in extents.cEric Biggers
Make the following functions static since they're only used in extents.c: __ext4_ext_dirty() ext4_can_extents_be_merged() ext4_collapse_range() ext4_insert_range() Also remove the prototype for ext4_ext_writepage_trans_blocks(), as this function is not defined anywhere. Signed-off-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20191231180444.46586-5-ebiggers@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com> Reviewed-by: Jan Kara <jack@suse.cz>
2020-01-17ext4: remove ext4_{ind,ext}_calc_metadata_amount()Eric Biggers
Remove the ext4_ind_calc_metadata_amount() and ext4_ext_calc_metadata_amount() functions, which have been unused since commit 71d4f7d03214 ("ext4: remove metadata reservation checks"). Also remove the i_da_metadata_calc_last_lblock and i_da_metadata_calc_len fields from struct ext4_inode_info, as these were only used by these removed functions. Signed-off-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20191231180444.46586-2-ebiggers@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com> Reviewed-by: Jan Kara <jack@suse.cz>
2020-01-17ext4: Delete ext4_kvzvalloc()Naoto Kobayashi
Since we're not using ext4_kvzalloc(), delete this function. Signed-off-by: Naoto Kobayashi <naoto.kobayashi4c@gmail.com> Link: https://lore.kernel.org/r/20191227080523.31808-2-naoto.kobayashi4c@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-12-26ext4: Optimize ext4 DIO overwritesJan Kara
Currently we start transaction for mapping every extent for writing using direct IO. This is unnecessary when we know we are overwriting already allocated blocks and the overhead of starting a transaction can be significant especially for multithreaded workloads doing small writes. Use iomap operations that avoid starting a transaction for direct IO overwrites. This improves throughput of 4k random writes - fio jobfile: [global] rw=randrw norandommap=1 invalidate=0 bs=4k numjobs=16 time_based=1 ramp_time=30 runtime=120 group_reporting=1 ioengine=psync direct=1 size=16G filename=file1.0.0:file1.0.1:file1.0.2:file1.0.3:file1.0.4:file1.0.5:file1.0.6:file1.0.7:file1.0.8:file1.0.9:file1.0.10:file1.0.11:file1.0.12:file1.0.13:file1.0.14:file1.0.15:file1.0.16:file1.0.17:file1.0.18:file1.0.19:file1.0.20:file1.0.21:file1.0.22:file1.0.23:file1.0.24:file1.0.25:file1.0.26:file1.0.27:file1.0.28:file1.0.29:file1.0.30:file1.0.31 file_service_type=random nrfiles=32 from 3018MB/s to 4059MB/s in my test VM running test against simulated pmem device (note that before iomap conversion, this workload was able to achieve 3708MB/s because old direct IO path avoided transaction start for overwrites as well). For dax, the win is even larger improving throughput from 3042MB/s to 4311MB/s. Reported-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20191218174433.19380-1-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-12-26ext4: simulate various I/O and checksum errors when reading metadataTheodore Ts'o
This allows us to test various error handling code paths Link: https://lore.kernel.org/r/20191209012317.59398-1-tytso@mit.edu Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-12-26ext4: save the error code which triggered an ext4_error() in the superblockTheodore Ts'o
This allows the cause of an ext4_error() report to be categorized based on whether it was triggered due to an I/O error, or an memory allocation error, or other possible causes. Most errors are caused by a detected file system inconsistency, so the default code stored in the superblock will be EXT4_ERR_EFSCORRUPTED. Link: https://lore.kernel.org/r/20191204032335.7683-1-tytso@mit.edu Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-11-30Merge tag 'ext4_for_linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "This merge window saw the the following new featuers added to ext4: - Direct I/O via iomap (required the iomap-for-next branch from Darrick as a prereq). - Support for using dioread-nolock where the block size < page size. - Support for encryption for file systems where the block size < page size. - Rework of journal credits handling so a revoke-heavy workload will not cause the journal to run out of space. - Replace bit-spinlocks with spinlocks in jbd2 Also included were some bug fixes and cleanups, mostly to clean up corner cases from fuzzed file systems and error path handling" * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (59 commits) ext4: work around deleting a file with i_nlink == 0 safely ext4: add more paranoia checking in ext4_expand_extra_isize handling jbd2: make jbd2_handle_buffer_credits() handle reserved handles ext4: fix a bug in ext4_wait_for_tail_page_commit ext4: bio_alloc with __GFP_DIRECT_RECLAIM never fails ext4: code cleanup for get_next_id ext4: fix leak of quota reservations ext4: remove unused variable warning in parse_options() ext4: Enable encryption for subpage-sized blocks fs/buffer.c: support fscrypt in block_read_full_page() ext4: Add error handling for io_end_vec struct allocation jbd2: Fine tune estimate of necessary descriptor blocks jbd2: Provide trace event for handle restarts ext4: Reserve revoke credits for freed blocks jbd2: Make credit checking more strict jbd2: Rename h_buffer_credits to h_total_credits jbd2: Reserve space for revoke descriptor blocks jbd2: Drop jbd2_space_needed() jbd2: Account descriptor blocks into t_outstanding_credits jbd2: Factor out common parts of stopping and restarting a handle ...
2019-11-06ext4: add support for IV_INO_LBLK_64 encryption policiesEric Biggers
IV_INO_LBLK_64 encryption policies have special requirements from the filesystem beyond those of the existing encryption policies: - Inode numbers must never change, even if the filesystem is resized. - Inode numbers must be <= 32 bits. - File logical block numbers must be <= 32 bits. ext4 has 32-bit inode and file logical block numbers. However, resize2fs can re-number inodes when shrinking an ext4 filesystem. However, typically the people who would want to use this format don't care about filesystem shrinking. They'd be fine with a solution that just prevents the filesystem from being shrunk. Therefore, add a new feature flag EXT4_FEATURE_COMPAT_STABLE_INODES that will do exactly that. Then wire up the fscrypt_operations to expose this flag to fs/crypto/, so that it allows IV_INO_LBLK_64 policies when this flag is set. Acked-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Eric Biggers <ebiggers@google.com>
2019-11-05Merge branch 'mb/dio' into masterTheodore Ts'o
2019-11-05Merge branch 'jk/jbd2-revoke-overflow'Theodore Ts'o
2019-11-05ext4: Reserve revoke credits for freed blocksJan Kara
So far we have reserved only relatively high fixed amount of revoke credits for each transaction. We over-reserved by large amount for most cases but when freeing large directories or files with data journalling, the fixed amount is not enough. In fact the worst case estimate is inconveniently large (maximum extent size) for freeing of one extent. We fix this by doing proper estimate of the amount of blocks that need to be revoked when removing blocks from the inode due to truncate or hole punching and otherwise reserve just a small amount of revoke credits for each transaction to accommodate freeing of xattrs block or so. Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20191105164437.32602-23-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-11-05ext4: Provide function to handle transaction restartsJan Kara
Provide ext4_journal_ensure_credits_fn() function to ensure transaction has given amount of credits and call helper function to prepare for restarting a transaction. This allows to remove some boilerplate code from various places, add proper error handling for the case where transaction extension or restart fails, and reduces following changes needed for proper revoke record reservation tracking. Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20191105164437.32602-10-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-11-05ext4: introduce direct I/O write using iomap infrastructureMatthew Bobrowski
This patch introduces a new direct I/O write path which makes use of the iomap infrastructure. All direct I/O writes are now passed from the ->write_iter() callback through to the new direct I/O handler ext4_dio_write_iter(). This function is responsible for calling into the iomap infrastructure via iomap_dio_rw(). Code snippets from the existing direct I/O write code within ext4_file_write_iter() such as, checking whether the I/O request is unaligned asynchronous I/O, or whether the write will result in an overwrite have effectively been moved out and into the new direct I/O ->write_iter() handler. The block mapping flags that are eventually passed down to ext4_map_blocks() from the *_get_block_*() suite of routines have been taken out and introduced within ext4_iomap_alloc(). For inode extension cases, ext4_handle_inode_extension() is effectively the function responsible for performing such metadata updates. This is called after iomap_dio_rw() has returned so that we can safely determine whether we need to potentially truncate any allocated blocks that may have been prepared for this direct I/O write. We don't perform the inode extension, or truncate operations from the ->end_io() handler as we don't have the original I/O 'length' available there. The ->end_io() however is responsible fo converting allocated unwritten extents to written extents. In the instance of a short write, we fallback and complete the remainder of the I/O using buffered I/O via ext4_buffered_write_iter(). The existing buffer_head direct I/O implementation has been removed as it's now redundant. [ Fix up ext4_dio_write_iter() per Jan's comments at https://lore.kernel.org/r/20191105135932.GN22379@quack2.suse.cz -- TYT ] Signed-off-by: Matthew Bobrowski <mbobrowski@mbobrowski.org> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/e55db6f12ae6ff017f36774135e79f3e7b0333da.1572949325.git.mbobrowski@mbobrowski.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-11-05ext4: introduce new callback for IOMAP_REPORTMatthew Bobrowski
As part of the ext4_iomap_begin() cleanups that precede this patch, we also split up the IOMAP_REPORT branch into a completely separate ->iomap_begin() callback named ext4_iomap_begin_report(). Again, the raionale for this change is to reduce the overall clutter within ext4_iomap_begin(). Signed-off-by: Matthew Bobrowski <mbobrowski@mbobrowski.org> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/5c97a569e26ddb6696e3d3ac9fbde41317e029a0.1572949325.git.mbobrowski@mbobrowski.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-10-22ext4: Add support for blocksize < pagesize in dioread_nolockRitesh Harjani
This patch adds the support for blocksize < pagesize for dioread_nolock feature. Since in case of blocksize < pagesize, we can have multiple small buffers of page as unwritten extents, we need to maintain a vector of these unwritten extents which needs the conversion after the IO is complete. Thus, we maintain a list of tuple <offset, size> pair (io_end_vec) for this & traverse this list to do the unwritten to written conversion. Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/20191016073711.4141-5-riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-10-22ext4: Add API to bring in support for unwritten io_end_vec conversionRitesh Harjani
This patch just brings in the API for conversion of unwritten io_end_vec extents which will be required for blocksize < pagesize support for dioread_nolock feature. No functional changes in this patch. Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/20191016073711.4141-3-riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-09-21Merge tag 'ext4_for_linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "Added new ext4 debugging ioctls to allow userspace to get information about the state of the extent status cache. Dropped workaround for pre-1970 dates which were encoded incorrectly in pre-4.4 kernels. Since both the kernel correctly generates, and e2fsck detects and fixes this issue for the past four years, it'e time to drop the workaround. (Also, it's not like files with dates in the distant past were all that common in the first place.) A lot of miscellaneous bug fixes and cleanups, including some ext4 Documentation fixes. Also included are two minor bug fixes in fs/unicode" * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (21 commits) unicode: make array 'token' static const, makes object smaller unicode: Move static keyword to the front of declarations ext4: add missing bigalloc documentation. ext4: fix kernel oops caused by spurious casefold flag ext4: fix integer overflow when calculating commit interval ext4: use percpu_counters for extent_status cache hits/misses ext4: fix potential use after free after remounting with noblock_validity jbd2: add missing tracepoint for reserved handle ext4: fix punch hole for inline_data file systems ext4: rework reserved cluster accounting when invalidating pages ext4: documentation fixes ext4: treat buffers with write errors as containing valid data ext4: fix warning inside ext4_convert_unwritten_extents_endio ext4: set error return correctly when ext4_htree_store_dirent fails ext4: drop legacy pre-1970 encoding workaround ext4: add new ioctl EXT4_IOC_GET_ES_CACHE ext4: add a new ioctl EXT4_IOC_GETSTATE ext4: add a new ioctl EXT4_IOC_CLEAR_ES_CACHE jbd2: flush_descriptor(): Do not decrease buffer head's ref count ext4: remove unnecessary error check ...
2019-09-19Merge tag 'y2038-vfs' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/arnd/playground Pull y2038 vfs updates from Arnd Bergmann: "Add inode timestamp clamping. This series from Deepa Dinamani adds a per-superblock minimum/maximum timestamp limit for a file system, and clamps timestamps as they are written, to avoid random behavior from integer overflow as well as having different time stamps on disk vs in memory. At mount time, a warning is now printed for any file system that can represent current timestamps but not future timestamps more than 30 years into the future, similar to the arbitrary 30 year limit that was added to settimeofday(). This was picked as a compromise to warn users to migrate to other file systems (e.g. ext4 instead of ext3) when they need the file system to survive beyond 2038 (or similar limits in other file systems), but not get in the way of normal usage" * tag 'y2038-vfs' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/playground: ext4: Reduce ext4 timestamp warnings isofs: Initialize filesystem timestamp ranges pstore: fs superblock limits fs: omfs: Initialize filesystem timestamp ranges fs: hpfs: Initialize filesystem timestamp ranges fs: ceph: Initialize filesystem timestamp ranges fs: sysv: Initialize filesystem timestamp ranges fs: affs: Initialize filesystem timestamp ranges fs: fat: Initialize filesystem timestamp ranges fs: cifs: Initialize filesystem timestamp ranges fs: nfs: Initialize filesystem timestamp ranges ext4: Initialize timestamps limits 9p: Fill min and max timestamps in sb fs: Fill in max and min timestamps in superblock utimes: Clamp the timestamps before update mount: Add mount warning for impending timestamp expiry timestamp_truncate: Replace users of timespec64_trunc vfs: Add timestamp_truncate() api vfs: Add file timestamp range support
2019-09-04ext4: Reduce ext4 timestamp warningsDeepa Dinamani
When ext4 file systems were created intentionally with 128 byte inodes, the rate-limited warning of eventual possible timestamp overflow are still emitted rather frequently. Remove the warning for now. Discussion for whether any warning is needed, and where it should be emitted, can be found at https://lore.kernel.org/lkml/1567523922.5576.57.camel@lca.pw/. I can post a separate follow-up patch after the conclusion. Reported-by: Qian Cai <cai@lca.pw> Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-08-30ext4: Initialize timestamps limitsDeepa Dinamani
ext4 has different overflow limits for max filesystem timestamps based on the extra bytes available. The timestamp limits are calculated according to the encoding table in a4dad1ae24f85i(ext4: Fix handling of extended tv_sec): * extra msb of adjust for signed * epoch 32-bit 32-bit tv_sec to * bits time decoded 64-bit tv_sec 64-bit tv_sec valid time range * 0 0 1 -0x80000000..-0x00000001 0x000000000 1901-12-13..1969-12-31 * 0 0 0 0x000000000..0x07fffffff 0x000000000 1970-01-01..2038-01-19 * 0 1 1 0x080000000..0x0ffffffff 0x100000000 2038-01-19..2106-02-07 * 0 1 0 0x100000000..0x17fffffff 0x100000000 2106-02-07..2174-02-25 * 1 0 1 0x180000000..0x1ffffffff 0x200000000 2174-02-25..2242-03-16 * 1 0 0 0x200000000..0x27fffffff 0x200000000 2242-03-16..2310-04-04 * 1 1 1 0x280000000..0x2ffffffff 0x300000000 2310-04-04..2378-04-22 * 1 1 0 0x300000000..0x37fffffff 0x300000000 2378-04-22..2446-05-10 Note that the time limits are not correct for deletion times. Added a warn when an inode cannot be extended to incorporate an extended timestamp. Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Acked-by: Jeff Layton <jlayton@kernel.org> Cc: tytso@mit.edu Cc: adilger.kernel@dilger.ca Cc: linux-ext4@vger.kernel.org
2019-08-28ext4: fix potential use after free after remounting with noblock_validityzhangyi (F)
Remount process will release system zone which was allocated before if "noblock_validity" is specified. If we mount an ext4 file system to two mountpoints with default mount options, and then remount one of them with "noblock_validity", it may trigger a use after free problem when someone accessing the other one. # mount /dev/sda foo # mount /dev/sda bar User access mountpoint "foo" | Remount mountpoint "bar" | ext4_map_blocks() | ext4_remount() check_block_validity() | ext4_setup_system_zone() ext4_data_block_valid() | ext4_release_system_zone() | free system_blks rb nodes access system_blks rb nodes | trigger use after free | This problem can also be reproduced by one mountpint, At the same time, add_system_zone() can get called during remount as well so there can be racing ext4_data_block_valid() reading the rbtree at the same time. This patch add RCU to protect system zone from releasing or building when doing a remount which inverse current "noblock_validity" mount option. It assign the rbtree after the whole tree was complete and do actual freeing after rcu grace period, avoid any intermediate state. Reported-by: syzbot+1e470567330b7ad711d5@syzkaller.appspotmail.com Signed-off-by: zhangyi (F) <yi.zhang@huawei.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
2019-08-22ext4: rework reserved cluster accounting when invalidating pagesEric Whitney
The goal of this patch is to remove two references to the buffer delay bit in ext4_da_page_release_reservation() as part of a larger effort to remove all such references from ext4. These two references are principally used to reduce the reserved block/cluster count when pages are invalidated as a result of truncating, punching holes, or collapsing a block range in a file. The entire function is removed and replaced with code in ext4_es_remove_extent() that reduces the reserved count as a side effect of removing a block range from delayed and not unwritten extents in the extent status tree as is done when truncating, punching holes, or collapsing ranges. The code is written to minimize the number of searches descending from rb tree roots for scalability. Signed-off-by: Eric Whitney <enwlinux@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-08-22ext4: treat buffers with write errors as containing valid dataZhangXiaoxu
I got some errors when I repair an ext4 volume which stacked by an iscsi target: Entry 'test60' in / (2) has deleted/unused inode 73750. Clear? It can be reproduced when the network not good enough. When I debug this I found ext4 will read entry buffer from disk and the buffer is marked with write_io_error. If the buffer is marked with write_io_error, it means it already wroten to journal, and not checked out to disk. IOW, the journal is newer than the data in disk. If this journal record 'delete test60', it means the 'test60' still on the disk metadata. In this case, if we read the buffer from disk successfully and create file continue, the new journal record will overwrite the journal which record 'delete test60', then the entry corruptioned. So, use the buffer rather than read from disk if the buffer is marked with write_io_error. Signed-off-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-08-12ext4: add fs-verity read supportEric Biggers
Make ext4_mpage_readpages() verify data as it is read from fs-verity files, using the helper functions from fs/verity/. To support both encryption and verity simultaneously, this required refactoring the decryption workflow into a generic "post-read processing" workflow which can do decryption, verification, or both. The case where the ext4 block size is not equal to the PAGE_SIZE is not supported yet, since in that case ext4_mpage_readpages() sometimes falls back to block_read_full_page(), which does not support fs-verity yet. Co-developed-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Eric Biggers <ebiggers@google.com>
2019-08-12ext4: add basic fs-verity supportEric Biggers
Add most of fs-verity support to ext4. fs-verity is a filesystem feature that enables transparent integrity protection and authentication of read-only files. It uses a dm-verity like mechanism at the file level: a Merkle tree is used to verify any block in the file in log(filesize) time. It is implemented mainly by helper functions in fs/verity/. See Documentation/filesystems/fsverity.rst for the full documentation. This commit adds all of ext4 fs-verity support except for the actual data verification, including: - Adding a filesystem feature flag and an inode flag for fs-verity. - Implementing the fsverity_operations to support enabling verity on an inode and reading/writing the verity metadata. - Updating ->write_begin(), ->write_end(), and ->writepages() to support writing verity metadata pages. - Calling the fs-verity hooks for ->open(), ->setattr(), and ->ioctl(). ext4 stores the verity metadata (Merkle tree and fsverity_descriptor) past the end of the file, starting at the first 64K boundary beyond i_size. This approach works because (a) verity files are readonly, and (b) pages fully beyond i_size aren't visible to userspace but can be read/written internally by ext4 with only some relatively small changes to ext4. This approach avoids having to depend on the EA_INODE feature and on rearchitecturing ext4's xattr support to support paging multi-gigabyte xattrs into memory, and to support encrypting xattrs. Note that the verity metadata *must* be encrypted when the file is, since it contains hashes of the plaintext data. This patch incorporates work by Theodore Ts'o and Chandan Rajendra. Reviewed-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Eric Biggers <ebiggers@google.com>
2019-08-12ext4: drop legacy pre-1970 encoding workaroundTheodore Ts'o
Originally, support for expanded timestamps had a bug in that pre-1970 times were erroneously encoded as being in the the 24th century. This was fixed in commit a4dad1ae24f8 ("ext4: Fix handling of extended tv_sec") which landed in 4.4. Starting with 4.4, pre-1970 timestamps were correctly encoded, but for backwards compatibility those incorrectly encoded timestamps were mapped back to the pre-1970 dates. Given that backwards compatibility workaround has been around for 4 years, and given that running e2fsck from e2fsprogs 1.43.2 and later will offer to fix these timestamps (which has been released for 3 years), it's past time to drop the legacy workaround from the kernel. Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-08-11ext4: add new ioctl EXT4_IOC_GET_ES_CACHETheodore Ts'o
For debugging reasons, it's useful to know the contents of the extent cache. Since the extent cache contains much of what is in the fiemap ioctl, use an fiemap-style interface to return this information. Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-08-11ext4: add a new ioctl EXT4_IOC_GETSTATETheodore Ts'o
The new ioctl EXT4_IOC_GETSTATE returns some of the dynamic state of an ext4 inode for debugging purposes. Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-08-11ext4: add a new ioctl EXT4_IOC_CLEAR_ES_CACHETheodore Ts'o
The new ioctl EXT4_IOC_CLEAR_ES_CACHE will force an inode's extent status cache to be cleared out. This is intended for use for debugging. Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-06-21ext4: rename htree_inline_dir_to_tree() to ext4_inlinedir_to_tree()Theodore Ts'o
Clean up namespace pollution by the inline_data code. Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-06-21ext4: refactor initialize_dirent_tail()Theodore Ts'o
Move the calculation of the location of the dirent tail into initialize_dirent_tail(). Also prefix the function with ext4_ to fix kernel namepsace polution. Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-06-21ext4: rename "dirent_csum" functions to use "dirblock"Theodore Ts'o
Functions such as ext4_dirent_csum_verify() and ext4_dirent_csum_set() don't actually operate on a directory entry, but a directory block. And while they take a struct ext4_dir_entry *dirent as an argument, it had better be the first directory at the beginning of the direct block, or things will go very wrong. Rename the following functions so that things make more sense, and remove a lot of confusing casts along the way: ext4_dirent_csum_verify -> ext4_dirblock_csum_verify ext4_dirent_csum_set -> ext4_dirblock_csum_set ext4_dirent_csum -> ext4_dirblock_csum ext4_handle_dirty_dirent_node -> ext4_handle_dirty_dirblock Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-06-19ext4: optimize case-insensitive lookupsGabriel Krisman Bertazi
Temporarily cache a casefolded version of the file name under lookup in ext4_filename, to avoid repeatedly casefolding it. I got up to 30% speedup on lookups of large directories (>100k entries), depending on the length of the string under lookup. Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-06-10ext4: only set project inherit bit for directoryWang Shilong
It doesn't make any sense to have project inherit bits for regular files, even though this won't cause any problem, but it is better fix this. Signed-off-by: Wang Shilong <wshilong@ddn.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Andreas Dilger <adilger@dilger.ca>
2019-05-07Merge tag 'fscrypt_for_linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscryptLinus Torvalds
Pull fscrypt updates from Ted Ts'o: "Clean up fscrypt's dcache revalidation support, and other miscellaneous cleanups" * tag 'fscrypt_for_linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt: fscrypt: cache decrypted symlink target in ->i_link vfs: use READ_ONCE() to access ->i_link fscrypt: fix race where ->lookup() marks plaintext dentry as ciphertext fscrypt: only set dentry_operations on ciphertext dentries fs, fscrypt: clear DCACHE_ENCRYPTED_NAME when unaliasing directory fscrypt: fix race allowing rename() and link() of ciphertext dentries fscrypt: clean up and improve dentry revalidation fscrypt: use READ_ONCE() to access ->i_crypt_info fscrypt: remove WARN_ON_ONCE() when decryption fails fscrypt: drop inode argument from fscrypt_get_ctx()
2019-05-07Merge tag 'ext4_for_linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "Add as a feature case-insensitive directories (the casefold feature) using Unicode 12.1. Also, the usual largish number of cleanups and bug fixes" * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (25 commits) ext4: export /sys/fs/ext4/feature/casefold if Unicode support is present ext4: fix ext4_show_options for file systems w/o journal unicode: refactor the rule for regenerating utf8data.h docs: ext4.rst: document case-insensitive directories ext4: Support case-insensitive file name lookups ext4: include charset encoding information in the superblock MAINTAINERS: add Unicode subsystem entry unicode: update unicode database unicode version 12.1.0 unicode: introduce test module for normalized utf8 implementation unicode: implement higher level API for string handling unicode: reduce the size of utf8data[] unicode: introduce code for UTF-8 normalization unicode: introduce UTF-8 character database ext4: actually request zeroing of inode table after grow ext4: cond_resched in work-heavy group loops ext4: fix use-after-free race with debug_want_extra_isize ext4: avoid drop reference to iloc.bh twice ext4: ignore e_value_offs for xattrs with value-in-ea-inode ext4: protect journal inode's blocks using block_validity ext4: use BUG() instead of BUG_ON(1) ...
2019-04-25ext4: Support case-insensitive file name lookupsGabriel Krisman Bertazi
This patch implements the actual support for case-insensitive file name lookups in ext4, based on the feature bit and the encoding stored in the superblock. A filesystem that has the casefold feature set is able to configure directories with the +F (EXT4_CASEFOLD_FL) attribute, enabling lookups to succeed in that directory in a case-insensitive fashion, i.e: match a directory entry even if the name used by userspace is not a byte per byte match with the disk name, but is an equivalent case-insensitive version of the Unicode string. This operation is called a case-insensitive file name lookup. The feature is configured as an inode attribute applied to directories and inherited by its children. This attribute can only be enabled on empty directories for filesystems that support the encoding feature, thus preventing collision of file names that only differ by case. * dcache handling: For a +F directory, Ext4 only stores the first equivalent name dentry used in the dcache. This is done to prevent unintentional duplication of dentries in the dcache, while also allowing the VFS code to quickly find the right entry in the cache despite which equivalent string was used in a previous lookup, without having to resort to ->lookup(). d_hash() of casefolded directories is implemented as the hash of the casefolded string, such that we always have a well-known bucket for all the equivalencies of the same string. d_compare() uses the utf8_strncasecmp() infrastructure, which handles the comparison of equivalent, same case, names as well. For now, negative lookups are not inserted in the dcache, since they would need to be invalidated anyway, because we can't trust missing file dentries. This is bad for performance but requires some leveraging of the vfs layer to fix. We can live without that for now, and so does everyone else. * on-disk data: Despite using a specific version of the name as the internal representation within the dcache, the name stored and fetched from the disk is a byte-per-byte match with what the user requested, making this implementation 'name-preserving'. i.e. no actual information is lost when writing to storage. DX is supported by modifying the hashes used in +F directories to make them case/encoding-aware. The new disk hashes are calculated as the hash of the full casefolded string, instead of the string directly. This allows us to efficiently search for file names in the htree without requiring the user to provide an exact name. * Dealing with invalid sequences: By default, when a invalid UTF-8 sequence is identified, ext4 will treat it as an opaque byte sequence, ignoring the encoding and reverting to the old behavior for that unique file. This means that case-insensitive file name lookup will not work only for that file. An optional bit can be set in the superblock telling the filesystem code and userspace tools to enforce the encoding. When that optional bit is set, any attempt to create a file name using an invalid UTF-8 sequence will fail and return an error to userspace. * Normalization algorithm: The UTF-8 algorithms used to compare strings in ext4 is implemented lives in fs/unicode, and is based on a previous version developed by SGI. It implements the Canonical decomposition (NFD) algorithm described by the Unicode specification 12.1, or higher, combined with the elimination of ignorable code points (NFDi) and full case-folding (CF) as documented in fs/unicode/utf8_norm.c. NFD seems to be the best normalization method for EXT4 because: - It has a lower cost than NFC/NFKC (which requires decomposing to NFD as an intermediary step) - It doesn't eliminate important semantic meaning like compatibility decompositions. Although: - This implementation is not completely linguistic accurate, because different languages have conflicting rules, which would require the specialization of the filesystem to a given locale, which brings all sorts of problems for removable media and for users who use more than one language. Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.co.uk> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-04-25ext4: include charset encoding information in the superblockGabriel Krisman Bertazi
Support for encoding is considered an incompatible feature, since it has potential to create collisions of file names in existing filesystems. If the feature flag is not enabled, the entire filesystem will operate on opaque byte sequences, respecting the original behavior. The s_encoding field stores a magic number indicating the encoding format and version used globally by file and directory names in the filesystem. The s_encoding_flags defines policies for using the charset encoding, like how to handle invalid sequences. The magic number is mapped to the exact charset table, but the mapping is specific to ext4. Since we don't have any commitment to support old encodings, the only encoding I am supporting right now is utf8-12.1.0. The current implementation prevents the user from enabling encoding and per-directory encryption on the same filesystem at the same time. The incompatibility between these features lies in how we do efficient directory searches when we cannot be sure the encryption of the user provided fname will match the actual hash stored in the disk without decrypting every directory entry, because of normalization cases. My quickest solution is to simply block the concurrent use of these features for now, and enable it later, once we have a better solution. Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.co.uk> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-04-25crypto: shash - remove shash_desc::flagsEric Biggers
The flags field in 'struct shash_desc' never actually does anything. The only ostensibly supported flag is CRYPTO_TFM_REQ_MAY_SLEEP. However, no shash algorithm ever sleeps, making this flag a no-op. With this being the case, inevitably some users who can't sleep wrongly pass MAY_SLEEP. These would all need to be fixed if any shash algorithm actually started sleeping. For example, the shash_ahash_*() functions, which wrap a shash algorithm with the ahash API, pass through MAY_SLEEP from the ahash API to the shash API. However, the shash functions are called under kmap_atomic(), so actually they're assumed to never sleep. Even if it turns out that some users do need preemption points while hashing large buffers, we could easily provide a helper function crypto_shash_update_large() which divides the data into smaller chunks and calls crypto_shash_update() and cond_resched() for each chunk. It's not necessary to have a flag in 'struct shash_desc', nor is it necessary to make individual shash algorithms aware of this at all. Therefore, remove shash_desc::flags, and document that the crypto_shash_*() functions can be called from any context. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2019-04-17fscrypt: fix race where ->lookup() marks plaintext dentry as ciphertextEric Biggers
->lookup() in an encrypted directory begins as follows: 1. fscrypt_prepare_lookup(): a. Try to load the directory's encryption key. b. If the key is unavailable, mark the dentry as a ciphertext name via d_flags. 2. fscrypt_setup_filename(): a. Try to load the directory's encryption key. b. If the key is available, encrypt the name (treated as a plaintext name) to get the on-disk name. Otherwise decode the name (treated as a ciphertext name) to get the on-disk name. But if the key is concurrently added, it may be found at (2a) but not at (1a). In this case, the dentry will be wrongly marked as a ciphertext name even though it was actually treated as plaintext. This will cause the dentry to be wrongly invalidated on the next lookup, potentially causing problems. For example, if the racy ->lookup() was part of sys_mount(), then the new mount will be detached when anything tries to access it. This is despite the mountpoint having a plaintext path, which should remain valid now that the key was added. Of course, this is only possible if there's a userspace race. Still, the additional kernel-side race is confusing and unexpected. Close the kernel-side race by changing fscrypt_prepare_lookup() to also set the on-disk filename (step 2b), consistent with the d_flags update. Fixes: 28b4c263961c ("ext4 crypto: revalidate dentry after adding or removing the key") Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>