Age | Commit message (Collapse) | Author |
|
In preparation of removing the historic icinode struct, move the projid
field into the containing xfs_inode structure.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
The xfs_icdinode structure just contains a random mix of inode field,
which are all read from the on-disk inode and mostly not looked at
before reading the inode or initializing a new inode cluster. The
only exceptions are the forkoff and blocks field, which are used
in sanity checks for freshly allocated inodes.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
The legacy DMAPI fields were never set by upstream Linux XFS, and have no
way to be read using the kernel APIs. So instead of bloating the in-core
inode for them just copy them from the on-disk inode into the log when
logging the inode. The only caveat is that we need to make sure to zero
the fields for newly read or deleted inodes, which is solved using a new
flag in the inode.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
The crtime only exists for v5 inodes, so only copy it for those.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Make sure di_flags2 is always initialized. We currently get this implicitly
by clearing the dinode core on allocating the in-core inode, but that is
about to go away.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Split looking up the dinode from xfs_imap_to_bp, which can be
significantly simplified as a result.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
A directory with one directory block which in turns consists of two or more fs
blocks is incorrectly flagged as corrupt by scrub since it assumes that
"Block" format directories have a data fork single extent spanning the file
offset range of [0, Dir block size - 1].
This commit fixes the bug by removing the incorrect check.
Signed-off-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
xfs/538 can cause the following call trace to be printed when executing on a
multi-block directory configuration,
WARNING: CPU: 1 PID: 2578 at fs/xfs/libxfs/xfs_bmap.c:717 xfs_bmap_extents_to_btree+0x520/0x5d0
Call Trace:
? xfs_buf_rele+0x4f/0x450
xfs_bmap_add_extent_hole_real+0x747/0x960
xfs_bmapi_allocate+0x39a/0x440
xfs_bmapi_write+0x507/0x9e0
xfs_da_grow_inode_int+0x1cd/0x330
? up+0x12/0x60
xfs_dir2_grow_inode+0x62/0x110
? xfs_trans_log_inode+0x234/0x2d0
xfs_dir2_sf_to_block+0x103/0x940
? xfs_dir2_sf_check+0x8c/0x210
? xfs_da_compname+0x19/0x30
? xfs_dir2_sf_lookup+0xd0/0x3d0
xfs_dir2_sf_addname+0x10d/0x910
xfs_dir_createname+0x1ad/0x210
xfs_create+0x404/0x620
xfs_generic_create+0x24c/0x320
path_openat+0xda6/0x1030
do_filp_open+0x88/0x130
? kmem_cache_alloc+0x50/0x210
? __cond_resched+0x16/0x40
? kmem_cache_alloc+0x50/0x210
do_sys_openat2+0x97/0x150
__x64_sys_creat+0x49/0x70
do_syscall_64+0x33/0x40
entry_SYSCALL_64_after_hwframe+0x44/0xae
This occurs because xfs_bmap_exact_minlen_extent_alloc() initializes
xfs_alloc_arg->total to xfs_bmalloca->minlen. In the context of
xfs_bmap_exact_minlen_extent_alloc(), xfs_bmalloca->minlen has a value of 1
and hence the space allocator could choose an AG which has less than
xfs_bmalloca->total number of free blocks available. As the transaction
proceeds, one of the future space allocation requests could fail due to
non-availability of free blocks in the AG that was originally chosen.
This commit fixes the bug by assigning xfs_alloc_arg->total to the value of
xfs_bmalloca->total.
Fixes: 301519674699 ("xfs: Introduce error injection to allocate only minlen size extents for files")
Signed-off-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
|
unwritten extent
With dax enabled filesystems, a direct write operation into an existing
unwritten extent results in xfs_iomap_write_direct() zero-ing and converting
the extent into a normal extent before the actual data is copied from the
userspace buffer.
The inode extent count can increase by 2 if the extent range being written to
maps to the middle of the existing unwritten extent range. Hence this commit
uses XFS_IEXT_WRITE_UNWRITTEN_CNT as the extent count delta when such a write
operation is being performed.
Fixes: 727e1acd297c ("xfs: Check for extent overflow when trivally adding a new extent")
Reported-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
|
Removal of kmem_zone_init wrappers accidentally changed a slab cache
name from "xfs_trans" to "xf_trans". Fix this so that userspace
consumers of /proc/slabinfo and /sys/kernel/slab can find it again.
Fixes: b1231760e443 ("xfs: Remove slab init wrappers")
Signed-off-by: Anthony Iliopoulos <ailiop@suse.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
per-AG resv failure after fixing up freespace is hard to test in an
effective way, so directly add an error injection path to observe
such error handling path works as expected.
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
As the first step of shrinking, this attempts to enable shrinking
unused space in the last allocation group by fixing up freespace
btree, agi, agf and adjusting super block and use a helper
xfs_ag_shrink_space() to fixup the last AG.
This can be all done in one transaction for now, so I think no
additional protection is needed.
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
This patch introduces a helper to shrink unused space in the last AG
by fixing up the freespace btree.
Also make sure that the per-AG reservation works under the new AG
size. If such per-AG reservation or extent allocation fails, roll
the transaction so the new transaction could cancel without any side
effects.
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Move out related logic for initializing new added AGs to a new helper
in preparation for shrinking. No logic changes.
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
sb_fdblocks will be updated lazily if lazysbcount is enabled,
therefore when shrinking the filesystem sb_fdblocks could be
larger than sb_dblocks and xfs_validate_sb_write() would fail.
Even for growfs case, it'd be better to update lazy sb counters
immediately to reflect the real sb counters.
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
s/strutures/structures/
Signed-off-by: Bhaskar Chowdhury <unixbhaskar@gmail.com>
Reviewed-by: Pavel Reichl <preichl@redhat.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
s/sytemcall/syscall/
Signed-off-by: Bhaskar Chowdhury <unixbhaskar@gmail.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
s/filesytem/filesystem/
s/instrumention/instrumentation/
Signed-off-by: Bhaskar Chowdhury <unixbhaskar@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
- 21.92% __xfs_trans_commit
- 21.62% xfs_log_commit_cil
- 11.69% xfs_trans_unreserve_and_mod_sb
- 11.58% __percpu_counter_compare
- 11.45% __percpu_counter_sum
- 10.29% _raw_spin_lock_irqsave
- 10.28% do_raw_spin_lock
__pv_queued_spin_lock_slowpath
We debated just getting rid of it last time this came up and
there was no real objection to removing it. Now it's the biggest
scalability limitation for debug kernels even on smallish machines,
so let's just get rid of it.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
On debug kernels, we call xfs_dir3_leaf_check_int() multiple times
on every directory modification. The robust hash ordering checks it
does on every entry in the leaf on every call results in a massive
CPU overhead which slows down debug kernels by a large amount.
We use xfs_dir3_leaf_check_int() for the verifiers as well, so we
can't just gut the function to reduce overhead. What we can do,
however, is reduce the work it does when it is called from the
debug interfaces, just leaving the high level checks in place and
leaving the robust validation to the verifiers. This means the debug
checks will catch gross errors, but subtle bugs might not be caught
until a verifier is run.
It is easy enough to restore the existing debug behaviour if the
developer needs it (just change a call parameter in the debug code),
but overwise the overhead makes testing large directory block sizes
on debug kernels very slow.
Profile at an unlink rate of ~80k file/s on a 64k block size
filesystem before the patch:
40.30% [kernel] [k] xfs_dir3_leaf_check_int
10.98% [kernel] [k] __xfs_dir3_data_check
8.10% [kernel] [k] xfs_verify_dir_ino
4.42% [kernel] [k] memcpy
2.22% [kernel] [k] xfs_dir2_data_get_ftype
1.52% [kernel] [k] do_raw_spin_lock
Profile after, at an unlink rate of ~125k files/s (+50% improvement)
has largely dropped the leaf verification debug overhead out of the
profile.
16.53% [kernel] [k] __xfs_dir3_data_check
12.53% [kernel] [k] xfs_verify_dir_ino
7.97% [kernel] [k] memcpy
3.36% [kernel] [k] xfs_dir2_data_get_ftype
2.86% [kernel] [k] __pv_queued_spin_lock_slowpath
Create shows a similar change in profile and a +25% improvement in
performance.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
We call xfs_dir_ino_validate() for every dir entry in a directory
when doing validity checking of the directory. It calls
xfs_verify_dir_ino() then emits a corruption report if bad or does
error injection if good. It is extremely costly:
43.27% [kernel] [k] xfs_dir3_leaf_check_int
10.28% [kernel] [k] __xfs_dir3_data_check
6.61% [kernel] [k] xfs_verify_dir_ino
4.16% [kernel] [k] xfs_errortag_test
4.00% [kernel] [k] memcpy
3.48% [kernel] [k] xfs_dir_ino_validate
7% of the cpu usage in this directory traversal workload is
xfs_dir_ino_validate() doing absolutely nothing.
We don't need error injection to simulate a bad inode numbers in the
directory structure because we can do that by fuzzing the structure
on disk.
And we don't need a corruption report, because the
__xfs_dir3_data_check() will emit one if the inode number is bad.
So just call xfs_verify_dir_ino() directly here, and get rid of all
this unnecessary overhead:
40.30% [kernel] [k] xfs_dir3_leaf_check_int
10.98% [kernel] [k] __xfs_dir3_data_check
8.10% [kernel] [k] xfs_verify_dir_ino
4.42% [kernel] [k] memcpy
2.22% [kernel] [k] xfs_dir2_data_get_ftype
1.52% [kernel] [k] do_raw_spin_lock
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
From a concurrent rm -rf workload:
41.04% [kernel] [k] xfs_dir3_leaf_check_int
9.85% [kernel] [k] __xfs_dir3_data_check
5.60% [kernel] [k] xfs_verify_ino
5.32% [kernel] [k] xfs_agino_range
4.21% [kernel] [k] memcpy
3.06% [kernel] [k] xfs_errortag_test
2.57% [kernel] [k] xfs_dir_ino_validate
1.66% [kernel] [k] xfs_dir2_data_get_ftype
1.17% [kernel] [k] do_raw_spin_lock
1.11% [kernel] [k] xfs_verify_dir_ino
0.84% [kernel] [k] __raw_callee_save___pv_queued_spin_unlock
0.83% [kernel] [k] xfs_buf_find
0.64% [kernel] [k] xfs_log_commit_cil
THere's an awful lot of overhead in just range checking inode
numbers in that, but each inode number check is not a lot of code.
The total is a bit over 14.5% of the CPU time is spent validating
inode numbers.
The problem is that they deeply nested global scope functions so the
overhead here is all in function call marshalling.
text data bss dec hex filename
2077 0 0 2077 81d fs/xfs/libxfs/xfs_types.o.orig
2197 0 0 2197 895 fs/xfs/libxfs/xfs_types.o
There's a small increase in binary size by inlining all the local
nested calls in the verifier functions, but the same workload now
profiles as:
40.69% [kernel] [k] xfs_dir3_leaf_check_int
10.52% [kernel] [k] __xfs_dir3_data_check
6.68% [kernel] [k] xfs_verify_dir_ino
4.22% [kernel] [k] xfs_errortag_test
4.15% [kernel] [k] memcpy
3.53% [kernel] [k] xfs_dir_ino_validate
1.87% [kernel] [k] xfs_dir2_data_get_ftype
1.37% [kernel] [k] do_raw_spin_lock
0.98% [kernel] [k] xfs_buf_find
0.94% [kernel] [k] __raw_callee_save___pv_queued_spin_unlock
0.73% [kernel] [k] xfs_log_commit_cil
Now we only spend just over 10% of the time validing inode numbers
for the same workload. Hence a few "inline" keyworks is good enough
to reduce the validation overhead by 30%...
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
We process the buf_log_item bitmap one set bit at a time with
xfs_next_bit() so we can detect if a region crosses a memcpy
discontinuity in the buffer data address. This has massive overhead
on large buffers (e.g. 64k directory blocks) because we do a lot of
unnecessary checks and xfs_buf_offset() calls.
For example, 16-way concurrent create workload on debug kernel
running CPU bound has this at the top of the profile at ~120k
create/s on 64kb directory block size:
20.66% [kernel] [k] xfs_dir3_leaf_check_int
7.10% [kernel] [k] memcpy
6.22% [kernel] [k] xfs_next_bit
3.55% [kernel] [k] xfs_buf_offset
3.53% [kernel] [k] xfs_buf_item_format
3.34% [kernel] [k] __pv_queued_spin_lock_slowpath
3.04% [kernel] [k] do_raw_spin_lock
2.84% [kernel] [k] xfs_buf_item_size_segment.isra.0
2.31% [kernel] [k] __raw_callee_save___pv_queued_spin_unlock
1.36% [kernel] [k] xfs_log_commit_cil
(debug checks hurt large blocks)
The only buffers with discontinuities in the data address are
unmapped buffers, and they are only used for inode cluster buffers
and only for logging unlinked pointers. IOWs, it is -rare- that we
even need to detect a discontinuity in the buffer item formatting
code.
Optimise all this by using xfs_contig_bits() to find the size of
the contiguous regions, then test for a discontiunity inside it. If
we find one, do the slow "bit at a time" method we do now. If we
don't, then just copy the entire contiguous range in one go.
Profile now looks like:
25.26% [kernel] [k] xfs_dir3_leaf_check_int
9.25% [kernel] [k] memcpy
5.01% [kernel] [k] __pv_queued_spin_lock_slowpath
2.84% [kernel] [k] do_raw_spin_lock
2.22% [kernel] [k] __raw_callee_save___pv_queued_spin_unlock
1.88% [kernel] [k] xfs_buf_find
1.53% [kernel] [k] memmove
1.47% [kernel] [k] xfs_log_commit_cil
....
0.34% [kernel] [k] xfs_buf_item_format
....
0.21% [kernel] [k] xfs_buf_offset
....
0.16% [kernel] [k] xfs_contig_bits
....
0.13% [kernel] [k] xfs_buf_item_size_segment.isra.0
So the bit scanning over for the dirty region tracking for the
buffer log items is basically gone. Debug overhead hurts even more
now...
Perf comparison
dir block creates unlink
size (kb) time rate time
Original 4 4m08s 220k 5m13s
Original 64 7m21s 115k 13m25s
Patched 4 3m59s 230k 5m03s
Patched 64 6m23s 143k 12m33s
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Otherwise it doesn't correctly calculate the number of vectors
in a logged buffer that has a contiguous map that gets split into
multiple regions because the range spans discontigous memory.
Probably never been hit in practice - we don't log contiguous ranges
on unmapped buffers (inode clusters).
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
When we modify btrees repeatedly, we regularly increase the size of
the logged region by a single chunk at a time (per transaction
commit). This results in the CIL formatting code having to
reallocate the log vector buffer every time the buffer dirty region
grows. Hence over a typical 4kB btree buffer, we might grow the log
vector 4096/128 = 32x over a short period where we repeatedly add
or remove records to/from the buffer over a series of running
transaction. This means we are doing 32 memory allocations and frees
over this time during a performance critical path in the journal.
The amount of space tracked in the CIL for the object is calculated
during the ->iop_format() call for the buffer log item, but the
buffer memory allocated for it is calculated by the ->iop_size()
call. The size callout determines the size of the buffer, the format
call determines the space used in the buffer.
Hence we can oversize the buffer space required in the size
calculation without impacting the amount of space used and accounted
to the CIL for the changes being logged. This allows us to reduce
the number of allocations by rounding up the buffer size to allow
for future growth. This can safe a substantial amount of CPU time in
this path:
- 46.52% 2.02% [kernel] [k] xfs_log_commit_cil
- 44.49% xfs_log_commit_cil
- 30.78% _raw_spin_lock
- 30.75% do_raw_spin_lock
30.27% __pv_queued_spin_lock_slowpath
(oh, ouch!)
....
- 1.05% kmem_alloc_large
- 1.02% kmem_alloc
0.94% __kmalloc
This overhead here us what this patch is aimed at. After:
- 0.76% kmem_alloc_large
- 0.75% kmem_alloc
0.70% __kmalloc
The size of 512 bytes is based on the bitmap chunk size being 128
bytes and that random directory entry updates almost never require
more than 3-4 128 byte regions to be logged in the directory block.
The other observation is for per-ag btrees. When we are inserting
into a new btree block, we'll pack it from the front. Hence the
first few records land in the first 128 bytes so we log only 128
bytes, the next 8-16 records land in the second region so now we log
256 bytes. And so on. If we are doing random updates, it will only
allocate every 4 random 128 byte regions that are dirtied instead of
every single one.
Any larger than 512 bytes and I noticed an increase in memory
footprint in my scalability workloads. Any less than this and I
didn't really see any significant benefit to CPU usage.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Gao Xiang <hsiangkao@redhat.com>
|
|
When we allocate a new inode, we often need to add an attribute to
the inode as part of the create. This can happen as a result of
needing to add default ACLs or security labels before the inode is
made visible to userspace.
This is highly inefficient right now. We do the create transaction
to allocate the inode, then we do an "add attr fork" transaction to
modify the just created empty inode to set the inode fork offset to
allow attributes to be stored, then we go and do the attribute
creation.
This means 3 transactions instead of 1 to allocate an inode, and
this greatly increases the load on the CIL commit code, resulting in
excessive contention on the CIL spin locks and performance
degradation:
18.99% [kernel] [k] __pv_queued_spin_lock_slowpath
3.57% [kernel] [k] do_raw_spin_lock
2.51% [kernel] [k] __raw_callee_save___pv_queued_spin_unlock
2.48% [kernel] [k] memcpy
2.34% [kernel] [k] xfs_log_commit_cil
The typical profile resulting from running fsmark on a selinux enabled
filesytem is adds this overhead to the create path:
- 15.30% xfs_init_security
- 15.23% security_inode_init_security
- 13.05% xfs_initxattrs
- 12.94% xfs_attr_set
- 6.75% xfs_bmap_add_attrfork
- 5.51% xfs_trans_commit
- 5.48% __xfs_trans_commit
- 5.35% xfs_log_commit_cil
- 3.86% _raw_spin_lock
- do_raw_spin_lock
__pv_queued_spin_lock_slowpath
- 0.70% xfs_trans_alloc
0.52% xfs_trans_reserve
- 5.41% xfs_attr_set_args
- 5.39% xfs_attr_set_shortform.constprop.0
- 4.46% xfs_trans_commit
- 4.46% __xfs_trans_commit
- 4.33% xfs_log_commit_cil
- 2.74% _raw_spin_lock
- do_raw_spin_lock
__pv_queued_spin_lock_slowpath
0.60% xfs_inode_item_format
0.90% xfs_attr_try_sf_addname
- 1.99% selinux_inode_init_security
- 1.02% security_sid_to_context_force
- 1.00% security_sid_to_context_core
- 0.92% sidtab_entry_to_string
- 0.90% sidtab_sid2str_get
0.59% sidtab_sid2str_put.part.0
- 0.82% selinux_determine_inode_label
- 0.77% security_transition_sid
0.70% security_compute_sid.part.0
And fsmark creation rate performance drops by ~25%. The key point to
note here is that half the additional overhead comes from adding the
attribute fork to the newly created inode. That's crazy, considering
we can do this same thing at inode create time with a couple of
lines of code and no extra overhead.
So, if we know we are going to add an attribute immediately after
creating the inode, let's just initialise the attribute fork inside
the create transaction and chop that whole chunk of code out of
the create fast path. This completely removes the performance
drop caused by enabling SELinux, and the profile looks like:
- 8.99% xfs_init_security
- 9.00% security_inode_init_security
- 6.43% xfs_initxattrs
- 6.37% xfs_attr_set
- 5.45% xfs_attr_set_args
- 5.42% xfs_attr_set_shortform.constprop.0
- 4.51% xfs_trans_commit
- 4.54% __xfs_trans_commit
- 4.59% xfs_log_commit_cil
- 2.67% _raw_spin_lock
- 3.28% do_raw_spin_lock
3.08% __pv_queued_spin_lock_slowpath
0.66% xfs_inode_item_format
- 0.90% xfs_attr_try_sf_addname
- 0.60% xfs_trans_alloc
- 2.35% selinux_inode_init_security
- 1.25% security_sid_to_context_force
- 1.21% security_sid_to_context_core
- 1.19% sidtab_entry_to_string
- 1.20% sidtab_sid2str_get
- 0.86% sidtab_sid2str_put.part.0
- 0.62% _raw_spin_lock_irqsave
- 0.77% do_raw_spin_lock
__pv_queued_spin_lock_slowpath
- 0.84% selinux_determine_inode_label
- 0.83% security_transition_sid
0.86% security_compute_sid.part.0
Which indicates the XFS overhead of creating the selinux xattr has
been halved. This doesn't fix the CIL lock contention problem, just
means it's not a limiting factor for this workload. Lock contention
in the security subsystems is going to be an issue soon, though...
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
[djwong: fix compilation error when CONFIG_SECURITY=n]
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Gao Xiang <hsiangkao@redhat.com>
|
|
Add the BUILD_BUG_ON to xfs_errortag_add() in order to make sure that
the length of xfs_errortag_random_default matches XFS_ERRTAG_MAX when
building.
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Skip the warnings about mount option being deprecated if we are
remounting and deprecated option state is not changing.
Bug: https://bugzilla.kernel.org/show_bug.cgi?id=211605
Fix-suggested-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Pavel Reichl <preichl@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Rename mp variable to parsisng_mp so it is easy to distinguish
between current mount point handle and handle for mount point
which mount options are being parsed.
Suggested-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Pavel Reichl <preichl@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Since we're about to start using the blockgc workqueue to dispose of
inactivated inodes, strip the "block" prefix from the name; now it's
merely the general garbage collection (gc) workqueue.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
|
Files containing metadata (quota records, rt bitmap and summary info)
are fully managed by the filesystem, which means that all resource
cleanup must be explicit, not automatic. This means that they should
never be subjected automatic to post-eof truncation, nor should they be
freed automatically even if the link count drops to zero.
In other words, xfs_inactive() should leave these files alone. Add the
necessary predicate functions to make this happen. This adds a second
layer of prevention for the kinds of fs corruption that was fixed by
commit f4c32e87de7d. If we ever decide to support removing metadata
files, we should make all those metadata updates explicit.
Rearrange the order of #includes to fix compiler errors, since
xfs_mount.h is supposed to be included before xfs_inode.h
Followup-to: f4c32e87de7d ("xfs: fix realtime bitmap/summary file truncation when growing rt volume")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
|
Use the AG btree height limits that we precomputed into the xfs_mount to
validate the AG headers instead of using XFS_BTREE_MAXLEVELS.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
|
Functions called by this function cannot fail, so get rid of the return
and error checking.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
|
Since xchk_ag_read_headers initializes fields in struct xchk_ag, we
might as well set the AG number and save the callers the trouble.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
|
If scrub observes cross-referencing errors while scanning a data
structure, mark the data structure sick. There's /something/
inconsistent, even if we can't really tell what it is.
Fixes: 4860a05d2475 ("xfs: scrub/repair should update filesystem metadata health")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
|
If a scrubber cannot complete its check and signals an incomplete check,
we must bail out immediately without updating health status, trying a
repair, etc. because our scan is incomplete and we therefore do not know
much more.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
|
When xchk_quota_item figures out that it needs to terminate the scrub
operation, it needs to return some error code to abort the loop, but
instead it returns zero and the loop keeps running. Fix this by making
it use ECANCELED, and fix the other loop bailout condition check at the
bottom too.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
|
If we can't read the AGF header, we never actually set a value for
freelen and usedlen. These two variables are used to make the worst
case estimate of btree size, so it's safe to set them to the AG size as
a fallback.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
|
A recent log refactoring patchset from Brian Foster relaxed fsfreeze
behavior with regards to the buffer cache -- now freeze only waits for
pending buffer IO to finish, and does not try to drain the buffer cache
LRU. As a result, fsfreeze should no longer stall indefinitely while
fsmap runs. Drop the sb_start_write calls around fsmap invocations.
While we're cleaning things, add a comment to the xfs_trans_alloc_empty
call explaining why we're running around with empty transactions.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 fixes from Ted Ts'o:
"Miscellaneous ext4 bug fixes for v5.12"
* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: initialize ret to suppress smatch warning
ext4: stop inode update before return
ext4: fix rename whiteout with fast commit
ext4: fix timer use-after-free on failed mount
ext4: fix potential error in ext4_do_update_inode
ext4: do not try to set xattr into ea_inode if value is empty
ext4: do not iput inode under running transaction in ext4_rename()
ext4: find old entry again if failed to rename whiteout
ext4: fix error handling in ext4_end_enable_verity()
ext4: fix bh ref count on error paths
fs/ext4: fix integer overflow in s_log_groups_per_flex
ext4: add reclaim checks to xattr code
ext4: shrink race window in ext4_should_retry_alloc()
|
|
Pull io_uring followup fixes from Jens Axboe:
- The SIGSTOP change from Eric, so we properly ignore that for
PF_IO_WORKER threads.
- Disallow sending signals to PF_IO_WORKER threads in general, we're
not interested in having them funnel back to the io_uring owning
task.
- Stable fix from Stefan, ensuring we properly break links for short
send/sendmsg recv/recvmsg if MSG_WAITALL is set.
- Catch and loop when needing to run task_work before a PF_IO_WORKER
threads goes to sleep.
* tag 'io_uring-5.12-2021-03-21' of git://git.kernel.dk/linux-block:
io_uring: call req_set_fail_links() on short send[msg]()/recv[msg]() with MSG_WAITALL
io-wq: ensure task is running before processing task_work
signal: don't allow STOP on PF_IO_WORKER threads
signal: don't allow sending any signals to PF_IO_WORKER threads
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Borislav Petkov:
"The freshest pile of shiny x86 fixes for 5.12:
- Add the arch-specific mapping between physical and logical CPUs to
fix devicetree-node lookups
- Restore the IRQ2 ignore logic
- Fix get_nr_restart_syscall() to return the correct restart syscall
number. Split in a 4-patches set to avoid kABI breakage when
backporting to dead kernels"
* tag 'x86_urgent_for_v5.12-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/apic/of: Fix CPU devicetree-node lookups
x86/ioapic: Ignore IRQ2 again
x86: Introduce restart_block->arch_data to remove TS_COMPAT_RESTART
x86: Introduce TS_COMPAT_RESTART to fix get_nr_restart_syscall()
x86: Move TS_COMPAT back to asm/thread_info.h
kernel, fs: Introduce and use set_restart_fn() and arch_set_restart_data()
|
|
MSG_WAITALL
Without that it's not safe to use them in a linked combination with
others.
Now combinations like IORING_OP_SENDMSG followed by IORING_OP_SPLICE
should be possible.
We already handle short reads and writes for the following opcodes:
- IORING_OP_READV
- IORING_OP_READ_FIXED
- IORING_OP_READ
- IORING_OP_WRITEV
- IORING_OP_WRITE_FIXED
- IORING_OP_WRITE
- IORING_OP_SPLICE
- IORING_OP_TEE
Now we have it for these as well:
- IORING_OP_SENDMSG
- IORING_OP_SEND
- IORING_OP_RECVMSG
- IORING_OP_RECV
For IORING_OP_RECVMSG we also check for the MSG_TRUNC and MSG_CTRUNC
flags in order to call req_set_fail_links().
There might be applications arround depending on the behavior
that even short send[msg]()/recv[msg]() retuns continue an
IOSQE_IO_LINK chain.
It's very unlikely that such applications pass in MSG_WAITALL,
which is only defined in 'man 2 recvmsg', but not in 'man 2 sendmsg'.
It's expected that the low level sock_sendmsg() call just ignores
MSG_WAITALL, as MSG_ZEROCOPY is also ignored without explicitly set
SO_ZEROCOPY.
We also expect the caller to know about the implicit truncation to
MAX_RW_COUNT, which we don't detect.
cc: netdev@vger.kernel.org
Link: https://lore.kernel.org/r/c4e1a4cc0d905314f4d5dc567e65a7b09621aab3.1615908477.git.metze@samba.org
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Mark the current task as running if we need to run task_work from the
io-wq threads as part of work handling. If that is the case, then return
as such so that the caller can appropriately loop back and reset if it
was part of a going-to-sleep flush.
Fixes: 3bfe6106693b ("io-wq: fork worker threads from original task")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
The inode update should be stopped before returing the error code.
Signed-off-by: Pan Bian <bianpan2016@163.com>
Link: https://lore.kernel.org/r/20210117085732.93788-1-bianpan2016@163.com
Fixes: 8016e29f4362 ("ext4: fast commit recovery path")
Cc: stable@kernel.org
Reviewed-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
This patch adds rename whiteout support in fast commits. Note that the
whiteout object that gets created is actually char device. Which
imples, the function ext4_inode_journal_mode(struct inode *inode)
would return "JOURNAL_DATA" for this inode. This has a consequence in
fast commit code that it will make creation of the whiteout object a
fast-commit ineligible behavior and thus will fall back to full
commits. With this patch, this can be observed by running fast commits
with rename whiteout and seeing the stats generated by ext4_fc_stats
tracepoint as follows:
ext4_fc_stats: dev 254:32 fc ineligible reasons:
XATTR:0, CROSS_RENAME:0, JOURNAL_FLAG_CHANGE:0, NO_MEM:0, SWAP_BOOT:0,
RESIZE:0, RENAME_DIR:0, FALLOC_RANGE:0, INODE_JOURNAL_DATA:16;
num_commits:6, ineligible: 6, numblks: 3
So in short, this patch guarantees that in case of rename whiteout, we
fall back to full commits.
Amir mentioned that instead of creating a new whiteout object for
every rename, we can create a static whiteout object with irrelevant
nlink. That will make fast commits to not fall back to full
commit. But until this happens, this patch will ensure correctness by
falling back to full commits.
Fixes: 8016e29f4362 ("ext4: fast commit recovery path")
Cc: stable@kernel.org
Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20210316221921.1124955-1-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
When filesystem mount fails because of corrupted filesystem we first
cancel the s_err_report timer reminding fs errors every day and only
then we flush s_error_work. However s_error_work may report another fs
error and re-arm timer thus resulting in timer use-after-free. Fix the
problem by first flushing the work and only after that canceling the
s_err_report timer.
Reported-by: syzbot+628472a2aac693ab0fcd@syzkaller.appspotmail.com
Fixes: 2d01ddc86606 ("ext4: save error info to sb through journal if available")
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20210315165906.2175-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
If set_large_file = 1 and errors occur in ext4_handle_dirty_metadata(),
the error code will be overridden, go to out_brelse to avoid this
situation.
Signed-off-by: Shijie Luo <luoshijie1@huawei.com>
Link: https://lore.kernel.org/r/20210312065051.36314-1-luoshijie1@huawei.com
Cc: stable@kernel.org
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|