Age | Commit message (Collapse) | Author |
|
Pull xfs fixes from Darrick Wong:
- Fix a memory leak in the new in-core extent map
- Refactor the xfs_dev_t conversions for easier xfsprogs porting
* tag 'xfs-4.15-merge-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: abstract out dev_t conversions
xfs: fix memory leak in xfs_iext_free_last_leaf
|
|
And move them to xfs_linux.h so that xfsprogs can stub them out more
easily.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
found the issue by kmemleak.
unreferenced object 0xffff8800674611c0 (size 16):
xfs_iext_insert+0x82a/0xa90 [xfs]
xfs_bmap_add_extent_hole_delay+0x1e5/0x5b0 [xfs]
xfs_bmapi_reserve_delalloc+0x483/0x530 [xfs]
xfs_file_iomap_begin+0xac8/0xd40 [xfs]
iomap_apply+0xb8/0x1b0
iomap_file_buffered_write+0xac/0xe0
xfs_file_buffered_aio_write+0x198/0x420 [xfs]
xfs_file_write_iter+0x23f/0x2a0 [xfs]
__vfs_write+0x23e/0x340
vfs_write+0xe9/0x240
SyS_write+0xa1/0x120
do_syscall_64+0xda/0x260
Signed-off-by: Shu Wang <shuwang@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
Pull xfs fixes from Darrick Wong:
"A couple more patches to fix a locking bug and some inconsistent type
usage in some of the new code:
- Fix a forgotten rcu read unlock
- Fix some inconsistent integer type usage"
* tag 'xfs-4.15-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: fix type usage
xfs: fix forgotten rcu read unlock when skipping inode reclaim
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm and dax updates from Dan Williams:
"Save for a few late fixes, all of these commits have shipped in -next
releases since before the merge window opened, and 0day has given a
build success notification.
The ext4 touches came from Jan, and the xfs touches have Darrick's
reviewed-by. An xfstest for the MAP_SYNC feature has been through
a few round of reviews and is on track to be merged.
- Introduce MAP_SYNC and MAP_SHARED_VALIDATE, a mechanism to enable
'userspace flush' of persistent memory updates via filesystem-dax
mappings. It arranges for any filesystem metadata updates that may
be required to satisfy a write fault to also be flushed ("on disk")
before the kernel returns to userspace from the fault handler.
Effectively every write-fault that dirties metadata completes an
fsync() before returning from the fault handler. The new
MAP_SHARED_VALIDATE mapping type guarantees that the MAP_SYNC flag
is validated as supported by the filesystem's ->mmap() file
operation.
- Add support for the standard ACPI 6.2 label access methods that
replace the NVDIMM_FAMILY_INTEL (vendor specific) label methods.
This enables interoperability with environments that only implement
the standardized methods.
- Add support for the ACPI 6.2 NVDIMM media error injection methods.
- Add support for the NVDIMM_FAMILY_INTEL v1.6 DIMM commands for
latch last shutdown status, firmware update, SMART error injection,
and SMART alarm threshold control.
- Cleanup physical address information disclosures to be root-only.
- Fix revalidation of the DIMM "locked label area" status to support
dynamic unlock of the label area.
- Expand unit test infrastructure to mock the ACPI 6.2 Translate SPA
(system-physical-address) command and error injection commands.
Acknowledgements that came after the commits were pushed to -next:
- 957ac8c421ad ("dax: fix PMD faults on zero-length files"):
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
- a39e596baa07 ("xfs: support for synchronous DAX faults") and
7b565c9f965b ("xfs: Implement xfs_filemap_pfn_mkwrite() using __xfs_filemap_fault()")
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>"
* tag 'libnvdimm-for-4.15' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (49 commits)
acpi, nfit: add 'Enable Latch System Shutdown Status' command support
dax: fix general protection fault in dax_alloc_inode
dax: fix PMD faults on zero-length files
dax: stop requiring a live device for dax_flush()
brd: remove dax support
dax: quiet bdev_dax_supported()
fs, dax: unify IOMAP_F_DIRTY read vs write handling policy in the dax core
tools/testing/nvdimm: unit test clear-error commands
acpi, nfit: validate commands against the device type
tools/testing/nvdimm: stricter bounds checking for error injection commands
xfs: support for synchronous DAX faults
xfs: Implement xfs_filemap_pfn_mkwrite() using __xfs_filemap_fault()
ext4: Support for synchronous DAX faults
ext4: Simplify error handling in ext4_dax_huge_fault()
dax: Implement dax_finish_sync_fault()
dax, iomap: Add support for synchronous faults
mm: Define MAP_SYNC and VM_SYNC flags
dax: Allow tuning whether dax_insert_mapping_entry() dirties entry
dax: Allow dax_iomap_fault() to return pfn
dax: Fix comment describing dax_iomap_fault()
...
|
|
Be consistent about using uint32_t/uint8_t instead of u32/u8. This is
more so that we don't have to maintain /those/ types in xfsprogs.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
|
|
In commit f2e9ad21 ("xfs: check for race with xfs_reclaim_inode"), we
skip an inode if we're racing with freeing the inode via
xfs_reclaim_inode, but we forgot to release the rcu read lock when
dumping the inode, with the result that we exit to userspace with a lock
held. Don't do that; generic/320 with a 1k block size fails this
very occasionally.
================================================
WARNING: lock held when returning to user space!
4.14.0-rc6-djwong #4 Tainted: G W
------------------------------------------------
rm/30466 is leaving the kernel with locks still held!
1 lock held by rm/30466:
#0: (rcu_read_lock){....}, at: [<ffffffffa01364d3>] xfs_ifree_cluster.isra.17+0x2c3/0x6f0 [xfs]
------------[ cut here ]------------
WARNING: CPU: 1 PID: 30466 at kernel/rcu/tree_plugin.h:329 rcu_note_context_switch+0x71/0x700
Modules linked in: deadline_iosched dm_snapshot dm_bufio ext4 mbcache jbd2 dm_flakey xfs libcrc32c dax_pmem device_dax nd_pmem sch_fq_codel af_packet [last unloaded: scsi_debug]
CPU: 1 PID: 30466 Comm: rm Tainted: G W 4.14.0-rc6-djwong #4
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-1ubuntu1djwong0 04/01/2014
task: ffff880037680000 task.stack: ffffc90001064000
RIP: 0010:rcu_note_context_switch+0x71/0x700
RSP: 0000:ffffc90001067e50 EFLAGS: 00010002
RAX: 0000000000000001 RBX: ffff880037680000 RCX: ffff88003e73d200
RDX: 0000000000000002 RSI: ffffffff819e53e9 RDI: ffffffff819f4375
RBP: 0000000000000000 R08: 0000000000000000 R09: ffff880062c900d0
R10: 0000000000000000 R11: 0000000000000000 R12: ffff880037680000
R13: 0000000000000000 R14: ffffc90001067eb8 R15: ffff880037680690
FS: 00007fa3b8ce8700(0000) GS:ffff88003ec00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f69bf77c000 CR3: 000000002450a000 CR4: 00000000000006e0
Call Trace:
__schedule+0xb8/0xb10
schedule+0x40/0x90
exit_to_usermode_loop+0x6b/0xa0
prepare_exit_to_usermode+0x7a/0x90
retint_user+0x8/0x20
RIP: 0033:0x7fa3b87fda87
RSP: 002b:00007ffe41206568 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff02
RAX: 0000000000000000 RBX: 00000000010e88c0 RCX: 00007fa3b87fda87
RDX: 0000000000000000 RSI: 00000000010e89c8 RDI: 0000000000000005
RBP: 0000000000000000 R08: 0000000000000003 R09: 0000000000000000
R10: 000000000000015e R11: 0000000000000246 R12: 00000000010c8060
R13: 00007ffe41206690 R14: 0000000000000000 R15: 0000000000000000
---[ end trace e88f83bf0cfbd07d ]---
Fixes: f2e9ad212def50bcf4c098c6288779dd97fff0f0
Cc: Omar Sandoval <osandov@fb.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Omar Sandoval <osandov@fb.com>
|
|
Add sparse-checked slab_flags_t for struct kmem_cache::flags (SLAB_POISON,
etc).
SLAB is bloated temporarily by switching to "unsigned long", but only
temporarily.
Link: http://lkml.kernel.org/r/20171021100225.GA22428@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Pekka Enberg <penberg@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Pull xfs updates from Darrick Wong:
"xfs: great scads of new stuff for 4.15.
This merge cycle, we're making some substantive changes to XFS. The
in-core extent mappings have been refactored to use proper iterators
and a btree to handle heavily fragmented files without needing
high-order memory allocations; some important log recovery bug fixes;
and the first part of the online fsck functionality.
(The online fsck feature is disabled by default and more pieces of it
will be coming in future release cycles.)
This giant pile of patches has been run through a full xfstests run
over the weekend and through a quick xfstests run against this
morning's master, with no major failures reported.
New in this version:
- Refactor the incore extent map manipulations to use a cursor
instead of directly modifying extent data.
- Refactor the incore extent map cursor to use an in-memory btree
instead of a single high-order allocation. This eliminates a major
source of complaints about insufficient memory when opening a
heavily fragmented file into a system whose memory is also heavily
fragmented.
- Fix a longstanding bug where deleting a file with a complex
extended attribute btree incorrectly handled memory pointers, which
could lead to memory corruption.
- Improve metadata validation to eliminate crashing problems found
while fuzzing xfs.
- Move the error injection tag definitions into libxfs to be shared
with userspace components.
- Fix some log recovery bugs where we'd underflow log block position
vector and incorrectly fail log recovery.
- Drain the buffer lru after log recovery to force recovered buffers
back through the verifiers after mount. On a v4 filesystem the log
never attaches verifiers during log replay (v5 does), so we could
end up with buffers marked verified but without having ever been
verified.
- Fix various other bugs.
- Introduce the first part of a new online fsck tool. The new fsck
tool will be able to iterate every piece of metadata in the
filesystem to look for obvious errors and corruptions. In the next
release cycle the checking will be extended to cross-reference with
the other fs metadata, so this feature should only be used by the
developers in the mean time"
* tag 'xfs-4.15-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (131 commits)
xfs: on failed mount, force-reclaim inodes after unmounting quota controls
xfs: check the uniqueness of the AGFL entries
xfs: remove u_int* type usage
xfs: handle zero entries case in xfs_iext_rebalance_leaf
xfs: add comments documenting the rebalance algorithm
xfs: trivial indentation fixup for xfs_iext_remove_node
xfs: remove a superflous assignment in xfs_iext_remove_node
xfs: add some comments to xfs_iext_insert/xfs_iext_insert_node
xfs: fix number of records handling in xfs_iext_split_leaf
fs/xfs: Remove NULL check before kmem_cache_destroy
xfs: only check da node header padding on v5 filesystems
xfs: fix btree scrub deref check
xfs: fix uninitialized return values in scrub code
xfs: pass inode number to xfs_scrub_ino_set_{preen,warning}
xfs: refactor the directory data block bestfree checks
xfs: mark xlog_verify_dest_ptr STATIC
xfs: mark xlog_recover_check_summary STATIC
xfs: mark xfs_btree_check_lblock and xfs_btree_check_ptr static
xfs: remove unreachable error injection code in xfs_qm_dqget
xfs: remove unused debug counts for xfs_lock_inodes
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 updates from Ted Ts'o:
- Add support for online resizing of file systems with bigalloc
- Fix a two data corruption bugs involving DAX, as well as a corruption
bug after a crash during a racing fallocate and delayed allocation.
- Finally, a number of cleanups and optimizations.
* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: improve smp scalability for inode generation
ext4: add support for online resizing with bigalloc
ext4: mention noload when recovering on read-only device
Documentation: fix little inconsistencies
ext4: convert timers to use timer_setup()
jbd2: convert timers to use timer_setup()
ext4: remove duplicate extended attributes defs
ext4: add ext4_should_use_dax()
ext4: add sanity check for encryption + DAX
ext4: prevent data corruption with journaling + DAX
ext4: prevent data corruption with inline data + DAX
ext4: fix interaction between i_size, fallocate, and delalloc after a crash
ext4: retry allocations conservatively
ext4: Switch to iomap for SEEK_HOLE / SEEK_DATA
ext4: Add iomap support for inline data
iomap: Add IOMAP_F_DATA_INLINE flag
iomap: Switch from blkno to disk offset
|
|
While reviewing whether MAP_SYNC should strengthen its current guarantee
of syncing writes from the initiating process to also include
third-party readers observing dirty metadata, Dave pointed out that the
check of IOMAP_WRITE is misplaced.
The policy of what to with IOMAP_F_DIRTY should be separated from the
generic filesystem mechanism of reporting dirty metadata. Move this
policy to the fs-dax core to simplify the per-filesystem iomap handlers,
and further centralize code that implements the MAP_SYNC policy. This
otherwise should not change behavior, it just makes it easier to change
behavior in the future.
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Reported-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
|
|
When mounting fails, we must force-reclaim inodes (and disable delayed
reclaim) /after/ the realtime and quota control have let go of the
realtime and quota inodes. Without this, we corrupt the timer list and
cause other weird problems.
Found by xfs/376 fuzzing u3.bmbt[0].lastoff on an rmap filesystem to
force a bogus post-eof extent reclaim that causes the fs to go down.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
|
|
Make sure we don't list a block twice in the agfl by copying the
contents of the AGFL to an array, sorting it, and looking for
duplicates. We can easily check that the number of agfl entries we see
actually matches the flcount, so do that too.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
|
|
Use the uint* types instead of the u_int* types. This will (hopefully)
pair with an xfsprogs cleanup.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
|
|
And also rename fill to nr_entries to match the rest of the code.
Reported-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
Reported-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
Reported-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
Reported-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
Fix to check the correct value, and remove a duplicate handling of the
uneven record number split algorith,
Reported-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
kmem_cache_destroy already checks for null values.
Signed-off-by: Tim Hansen <devtimhansen@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
It turns out that we only started zeroing a new da btree node's block
header on v5 filesystems. Prior to that, we just wouldn't set anything
at all, which means that the pad field never got set and would retain
whatever happened to be in memory.
Therefore, we can only check the pad for zeroness on v5 filesystems.
shared/006 on a v4 filesystem exposes this scrub bug.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
|
|
The btree scrubber has some custom code to retrieve and check a btree
block via xfs_btree_lookup_get_block. This function will either return
an error code (verifiers failed) or a *pblock will be untouched (bad
pointer). Since we previously set *pblock to NULL, we need to check
*pblock, not pblock, to trigger the early bailout.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
|
|
Fix smatch complaints about uninitialized return codes.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
|
|
There are two ways to scrub an inode -- calling xfs_iget and checking
the raw inode core, or by loading the inode cluster buffer and checking
the on-disk contents directly. The second method is only useful if
_iget fails the verifiers; when this is the case, sc->ip is NULL and
calling the tracepoint will cause a system crash.
Therefore, pass the raw inode number directly into the _preen and
_warning functions.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
|
|
In a directory data block, the zeroth bestfree item must point to the
longest free space. Therefore, when we check the bestfree block's
records against the data blocks, we only need to compare with bf[0] and
don't need the loop.
The weird loop was most probably the result of an earlier refactoring
gone bad.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
|
|
Conflicts:
include/linux/compiler-clang.h
include/linux/compiler-gcc.h
include/linux/compiler-intel.h
include/uapi/linux/stddef.h
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
We already did it in the forward declaration, but not for the function
body itself.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
We already did it in the forward declaration, but not for the function
body itself.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
[darrick: fix broken initializer in xfs_scrub_xattr]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
Ever since we added the noinline tag there is no good reason to define
away the static for debug builds - we'll get just as good debug
information with our without it, so don't mess up sparse and other
checkers due to it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
Neither defines an on-disk format, so move them out of xfs_format.h.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
This removed an unaligned load per extent, as well as the manual poking
into the on-disk extent format.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
We only have two places that remove 2 extents at the same time, so unroll
the loop there.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
We only have two places that insert 2 extents at the same time, so unroll
the loop there.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
Replace the current linear list and the indirection array for the in-core
extent list with a b+tree to avoid the need for larger memory allocations
for the indirection array when lots of extents are present. The current
extent list implementations leads to heavy pressure on the memory
allocator when modifying files with a high extent count, and can lead
to high latencies because of that.
The replacement is a b+tree with a few quirks. The leaf nodes directly
store the extent record in two u64 values. The encoding is a little bit
different from the existing in-core extent records so that the start
offset and length which are required for lookups can be retreived with
simple mask operations. The inner nodes store a 64-bit key containing
the start offset in the first half of the node, and the pointers to the
next lower level in the second half. In either case we walk the node
from the beginninig to the end and do a linear search, as that is more
efficient for the low number of cache lines touched during a search
(2 for the inner nodes, 4 for the leaf nodes) than a binary search.
We store termination markers (zero length for the leaf nodes, an
otherwise impossible high bit for the inner nodes) to terminate the key
list / records instead of storing a count to use the available cache
lines as efficiently as possible.
One quirk of the algorithm is that while we normally split a node half and
half like usual btree implementations we just spill over entries added at
the very end of the list to a new node on its own. This means we get a
100% fill grade for the common cases of bulk insertion when reading an
inode into memory, and when only sequentially appending to a file. The
downside is a slightly higher chance of splits on the first random
insertions.
Both insert and removal manually recurse into the lower levels, but
the bulk deletion of the whole tree is still implemented as a recursive
function call, although one limited by the overall depth and with very
little stack usage in every iteration.
For the first few extents we dynamically grow the list from a single
extent to the next powers of two until we have a first full leaf block
and that building the actual tree.
The code started out based on the generic lib/btree.c code from Joern
Engel based on earlier work from Peter Zijlstra, but has since been
rewritten beyond recognition.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
To make life a little simpler make xfs_bmbt_set_all unaligned access
aware so that we can use it directly on the destination buffer.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
Supporting a small bit of data inside the inode fork blows up the fork size
a lot, removing the 32 bytes of inline data halves the effective size of
the inode fork (and it still has a lot of unused padding left), and the
performance of a single kmalloc doesn't show up compared to the size to read
an inode or create one.
It also simplifies the fork management code a lot.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
Instead of looking up extents to convert and calling xfs_bmapi_write on
each of them just let xfs_bmapi_write handle the full range. To make
this robust add a new XFS_BMAPI_CONVERT_ONLY that only converts ranges
and never allocates blocks.
[darrick: shorten the stringified CONVERT_ONLY trace flag]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
Match the iteration order for extent deletion in the truncate and
reflink I/O completion path.
This also happens to make implementing the new incore extent list
a lot easier.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
Add a new xfs_iext_cursor structure to hide the direct extent map
index manipulations. In addition to the existing lookup/get/insert/
remove and update routines new primitives to get the first and last
extent cursor, as well as moving up and down by one extent are
provided. Also new are convenience to increment/decrement the
cursor and retreive the new extent, as well as to peek into the
previous/next extent without updating the cursor and last but not
least a macro to iterate over all extents in a fork.
[darrick: rename for_each_iext to for_each_xfs_iext]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
This actually makes the function very slightly less efficient for now as we
detour through the expanded irect format between the in-core extent format
and the on-disk one instead of just endian swapping them. But with the
incore extent btree the in-core one will use a different format and the
representation will be entirely hidden.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
This actually makes the function very slightly less efficient for now as we
detour through the expanded irect format between the in-core extent format
and the on-disk one instead of just endian swapping them. But with the
incore extent btree the in-core one will use a different format and the
representation will be entirely hidden. It also happens to make the
function a whole more readable.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
This prepares for getting rid of the current in-memory extent format.
At the end of the series we will change the calling convention again
to pass the xfs_bmbt_irec structure once it is available everywhere.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
Stop poking before and after the index and just increment or decrement
it while doing our operations on it to prepare for a new extent list
implementation.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
Stop poking before and after the index and just increment or decrement
it while doing our operations on it to prepare for a new extent list
implementation.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|