summaryrefslogtreecommitdiff
path: root/fs/xfs/libxfs
AgeCommit message (Collapse)Author
2021-06-28Merge tag 'fallthrough-fixes-clang-5.14-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux Pull fallthrough fixes from Gustavo Silva: "Fix many fall-through warnings when building with Clang 12.0.0 and '-Wimplicit-fallthrough' so that we at some point will be able to enable that warning by default" * tag 'fallthrough-fixes-clang-5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux: (26 commits) rxrpc: Fix fall-through warnings for Clang drm/nouveau/clk: Fix fall-through warnings for Clang drm/nouveau/therm: Fix fall-through warnings for Clang drm/nouveau: Fix fall-through warnings for Clang xfs: Fix fall-through warnings for Clang xfrm: Fix fall-through warnings for Clang tipc: Fix fall-through warnings for Clang sctp: Fix fall-through warnings for Clang rds: Fix fall-through warnings for Clang net/packet: Fix fall-through warnings for Clang net: netrom: Fix fall-through warnings for Clang ide: Fix fall-through warnings for Clang hwmon: (max6621) Fix fall-through warnings for Clang hwmon: (corsair-cpro) Fix fall-through warnings for Clang firewire: core: Fix fall-through warnings for Clang braille_console: Fix fall-through warnings for Clang ipv4: Fix fall-through warnings for Clang qlcnic: Fix fall-through warnings for Clang bnxt_en: Fix fall-through warnings for Clang netxen_nic: Fix fall-through warnings for Clang ...
2021-05-27xfs: bunmapi has unnecessary AG lock ordering issuesDave Chinner
large directory block size operations are assert failing because xfs_bunmapi() is not completely removing fragmented directory blocks like so: XFS: Assertion failed: done, file: fs/xfs/libxfs/xfs_dir2.c, line: 677 .... Call Trace: xfs_dir2_shrink_inode+0x1a8/0x210 xfs_dir2_block_to_sf+0x2ae/0x410 xfs_dir2_block_removename+0x21a/0x280 xfs_dir_removename+0x195/0x1d0 xfs_rename+0xb79/0xc50 ? avc_has_perm+0x8d/0x1a0 ? avc_has_perm_noaudit+0x9a/0x120 xfs_vn_rename+0xdb/0x150 vfs_rename+0x719/0xb50 ? __lookup_hash+0x6a/0xa0 do_renameat2+0x413/0x5e0 __x64_sys_rename+0x45/0x50 do_syscall_64+0x3a/0x70 entry_SYSCALL_64_after_hwframe+0x44/0xae We are aborting the bunmapi() pass because of this specific chunk of code: /* * Make sure we don't touch multiple AGF headers out of order * in a single transaction, as that could cause AB-BA deadlocks. */ if (!wasdel && !isrt) { agno = XFS_FSB_TO_AGNO(mp, del.br_startblock); if (prev_agno != NULLAGNUMBER && prev_agno > agno) break; prev_agno = agno; } This is designed to prevent deadlocks in AGF locking when freeing multiple extents by ensuring that we only ever lock in increasing AG number order. Unfortunately, this also violates the "bunmapi will always succeed" semantic that some high level callers depend on, such as xfs_dir2_shrink_inode(), xfs_da_shrink_inode() and xfs_inactive_symlink_rmt(). This AG lock ordering was introduced back in 2017 to fix deadlocks triggered by generic/299 as reported here: https://lore.kernel.org/linux-xfs/800468eb-3ded-9166-20a4-047de8018582@gmail.com/ This codebase is old enough that it was before we were defering all AG based extent freeing from within xfs_bunmapi(). THat is, we never actually lock AGs in xfs_bunmapi() any more - every non-rt based extent free is added to the defer ops list, as is all BMBT block freeing. And RT extents are not RT based, so there's no lock ordering issues associated with them. Hence this AGF lock ordering code is both broken and dead. Let's just remove it so that the large directory block code works reliably again. Tested against xfs/538 and generic/299 which is the original test that exposed the deadlocks that this code fixed. Fixes: 5b094d6dac04 ("xfs: fix multi-AG deadlock in xfs_bunmapi") Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-05-27xfs: btree format inode forks can have zero extentsDave Chinner
xfs/538 is assert failing with this trace when testing with directory block sizes of 64kB: XFS: Assertion failed: !xfs_need_iread_extents(ifp), file: fs/xfs/libxfs/xfs_bmap.c, line: 608 .... Call Trace: xfs_bmap_btree_to_extents+0x2a9/0x470 ? kmem_cache_alloc+0xe7/0x220 __xfs_bunmapi+0x4ca/0xdf0 xfs_bunmapi+0x1a/0x30 xfs_dir2_shrink_inode+0x71/0x210 xfs_dir2_block_to_sf+0x2ae/0x410 xfs_dir2_block_removename+0x21a/0x280 xfs_dir_removename+0x195/0x1d0 xfs_remove+0x244/0x460 xfs_vn_unlink+0x53/0xa0 ? selinux_inode_unlink+0x13/0x20 vfs_unlink+0x117/0x220 do_unlinkat+0x1a2/0x2d0 __x64_sys_unlink+0x42/0x60 do_syscall_64+0x3a/0x70 entry_SYSCALL_64_after_hwframe+0x44/0xae This is a check to ensure that the extents have been read into memory before we are doing a ifork btree manipulation. This assert is bogus in the above case. We have a fragmented directory block that has more extents in it than can fit in extent format, so the inode data fork is in btree format. xfs_dir2_shrink_inode() asks to remove all remaining 16 filesystem blocks from the inode so it can convert to short form, and __xfs_bunmapi() removes all the extents. We now have a data fork in btree format but have zero extents in the fork. This incorrectly trips the xfs_need_iread_extents() assert because it assumes that an empty extent btree means the extent tree has not been read into memory yet. This is clearly not the case with xfs_bunmapi(), as it has an explicit call to xfs_iread_extents() in it to pull the extents into memory before it starts unmapping. Also, the assert directly after this bogus one is: ASSERT(ifp->if_format == XFS_DINODE_FMT_BTREE); Which covers the context in which it is legal to call xfs_bmap_btree_to_extents just fine. Hence we should just remove the bogus assert as it is clearly wrong and causes a regression. The returns the test behaviour to the pre-existing assert failure in xfs_dir2_shrink_inode() that indicates xfs_bunmapi() has failed to remove all the extents in the range it was asked to unmap. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-05-26xfs: Fix fall-through warnings for ClangGustavo A. R. Silva
In preparation to enable -Wimplicit-fallthrough for Clang, fix the following warnings by replacing /* fall through */ comments, and its variants, with the new pseudo-keyword macro fallthrough: fs/xfs/libxfs/xfs_alloc.c:3167:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough] fs/xfs/libxfs/xfs_da_btree.c:286:3: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough] fs/xfs/libxfs/xfs_ag_resv.c:346:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough] fs/xfs/libxfs/xfs_ag_resv.c:388:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough] fs/xfs/xfs_bmap_util.c:246:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough] fs/xfs/xfs_export.c:88:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough] fs/xfs/xfs_export.c:96:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough] fs/xfs/xfs_file.c:867:3: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough] fs/xfs/xfs_ioctl.c:562:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough] fs/xfs/xfs_ioctl.c:1548:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough] fs/xfs/xfs_iomap.c:1040:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough] fs/xfs/xfs_inode.c:852:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough] fs/xfs/xfs_log.c:2627:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough] fs/xfs/xfs_trans_buf.c:298:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough] fs/xfs/scrub/bmap.c:275:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough] fs/xfs/scrub/btree.c:48:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough] fs/xfs/scrub/common.c:85:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough] fs/xfs/scrub/common.c:138:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough] fs/xfs/scrub/common.c:698:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough] fs/xfs/scrub/dabtree.c:51:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough] fs/xfs/scrub/repair.c:951:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough] fs/xfs/scrub/agheader.c:89:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough] Notice that Clang doesn't recognize /* fall through */ comments as implicit fall-through markings, so in order to globally enable -Wimplicit-fallthrough for Clang, these comments need to be replaced with fallthrough; in the whole codebase. Link: https://github.com/KSPP/linux/issues/115 Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
2021-05-24xfs: validate extsz hints against rt extent size when rtinherit is setDarrick J. Wong
The RTINHERIT bit can be set on a directory so that newly created regular files will have the REALTIME bit set to store their data on the realtime volume. If an extent size hint (and EXTSZINHERIT) are set on the directory, the hint will also be copied into the new file. As pointed out in previous patches, for realtime files we require the extent size hint be an integer multiple of the realtime extent, but we don't perform the same validation on a directory with both RTINHERIT and EXTSZINHERIT set, even though the only use-case of that combination is to propagate extent size hints into new realtime files. This leads to inode corruption errors when the bad values are propagated. Because there may be existing filesystems with such a configuration, we cannot simply amend the inode verifier to trip on these directories and call it a day because that will cause previously "working" filesystems to start throwing errors abruptly. Note that it's valid to have directories with rtinherit set even if there is no realtime volume, in which case the problem does not manifest because rtinherit is ignored if there's no realtime device; and it's possible that someone set the flag, crashed, repaired the filesystem (which clears the hint on the realtime file) and continued. Therefore, mitigate this issue in several ways: First, if we try to write out an inode with both rtinherit/extszinherit set and an unaligned extent size hint, turn off the hint to correct the error. Second, if someone tries to misconfigure a directory via the fssetxattr ioctl, fail the ioctl. Third, reverify both extent size hint values when we propagate heritable inode attributes from parent to child, to prevent misconfigurations from spreading. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2021-05-24xfs: standardize extent size hint validationDarrick J. Wong
While chasing a bug involving invalid extent size hints being propagated into newly created realtime files, I noticed that the xfs_ioctl_setattr checks for the extent size hints weren't the same as the ones now encoded in libxfs and used for validation in repair and mkfs. Because the checks in libxfs are more stringent than the ones in the ioctl, it's possible for a live system to set inode flags that immediately result in corruption warnings. Specifically, it's possible to set an extent size hint on an rtinherit directory without checking if the hint is aligned to the realtime extent size, which makes no sense since that combination is used only to seed new realtime files. Replace the open-coded and inadequate checks with the libxfs verifier versions and update the code comments a bit. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2021-05-24xfs: check free AG space when making per-AG reservationsDarrick J. Wong
The new online shrink code exposed a gap in the per-AG reservation code, which is that we only return ENOSPC to callers if the entire fs doesn't have enough free blocks. Except for debugging mode, the reservation init code doesn't ever check that there's enough free space in that AG to cover the reservation. Not having enough space is not considered an immediate fatal error that requires filesystem offlining because (a) it's shouldn't be possible to wind up in that state through normal file operations and (b) even if one did, freeing data blocks would recover the situation. However, online shrink now needs to know if shrinking would not leave enough space so that it can abort the shrink operation. Hence we need to promote this assertion into an actual error return. Observed by running xfs/168 with a 1k block size, though in theory this could happen with any configuration. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2021-05-20xfs: restore old ioctl definitionsDarrick J. Wong
These ioctl definitions in xfs_fs.h are part of the userspace ABI and were mistakenly removed during the 5.13 merge window. Fixes: 9fefd5db08ce ("xfs: convert to fileattr") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2021-05-06Merge tag 'xfs-5.13-merge-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds
Pull more xfs updates from Darrick Wong: "Except for the timestamp struct renaming patches, everything else in here are bug fixes: - Rename the log timestamp struct. - Remove broken transaction counter debugging that wasn't working correctly on very old filesystems. - Various fixes to make pre-lazysbcount filesystems work properly again. - Fix a free space accounting problem where we neglected to consider free space btree blocks that track metadata reservation space when deciding whether or not to allow caller to reserve space for a metadata update. - Fix incorrect pagecache clearing behavior during FUNSHARE ops. - Don't allow log writes if the data device is readonly" * tag 'xfs-5.13-merge-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: don't allow log writes if the data device is readonly xfs: fix xfs_reflink_unshare usage of filemap_write_and_wait_range xfs: set aside allocation btree blocks from block reservation xfs: introduce in-core global counter of allocbt blocks xfs: unconditionally read all AGFs on mounts with perag reservation xfs: count free space btree blocks when scrubbing pre-lazysbcount fses xfs: update superblock counters correctly for !lazysbcount xfs: don't check agf_btreeblks on pre-lazysbcount filesystems xfs: remove obsolete AGF counter debugging xfs: rename struct xfs_legacy_ictimestamp xfs: rename xfs_ictimestamp_t
2021-04-29Merge tag 'xfs-5.13-merge-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds
Pull xfs updates from Darrick Wong: "The notable user-visible addition this cycle is ability to remove space from the last AG in a filesystem. This is the first of many changes needed for full-fledged support for shrinking a filesystem. Still needed are (a) the ability to reorganize files and metadata away from the end of the fs; (b) the ability to remove entire allocation groups; (c) shrink support for realtime volumes; and (d) thorough testing of (a-c). There are a number of performance improvements in this code drop: Dave streamlined various parts of the buffer logging code and reduced the cost of various debugging checks, and added the ability to pre-create the xattr structures while creating files. Brian eliminated transaction reservations that were being held across writeback (thus reducing livelock potential. Other random pieces: Pavel fixed the repetitve warnings about deprecated mount options, I fixed online fsck to behave itself when a readonly remount comes in during scrub, and refactored various other parts of that code, Christoph contributed a lot of refactoring this cycle. The xfs_icdinode structure has been absorbed into the (incore) xfs_inode structure, and the format and flags handling around xfs_inode_fork structures has been simplified. Chandan provided a number of fixes for extent count overflow related problems that have been shaken out by debugging knobs added during 5.12. Summary: - Various minor fixes in online scrub. - Prevent metadata files from being automatically inactivated. - Validate btree heights by the computed per-btree limits. - Don't warn about remounting with deprecated mount options. - Initialize attr forks at create time if we suspect we're going to need to store them. - Reduce memory reallocation workouts in the logging code. - Fix some theoretical math calculation errors in logged buffers that span multiple discontig memory ranges but contiguous ondisk regions. - Speedups in dirty buffer bitmap handling. - Make type verifier functions more inline-happy to reduce overhead. - Reduce debug overhead in directory checking code. - Many many typo fixes. - Begin to handle the permanent loss of the very end of a filesystem. - Fold struct xfs_icdinode into xfs_inode. - Deprecate the long defunct BMV_IF_NO_DMAPI_READ from the bmapx ioctl. - Remove a broken directory block format check from online scrub. - Fix a bug where we could produce an unnecessarily tall data fork btree when creating an attr fork. - Fix scrub and readonly remounts racing. - Fix a writeback ioend log deadlock problem by dropping the behavior where we could preallocate a setfilesize transaction. - Fix some bugs in the new extent count checking code. - Fix some bugs in the attr fork preallocation code. - Refactor if_flags out of the incore inode fork data structure" * tag 'xfs-5.13-merge-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (77 commits) xfs: remove xfs_quiesce_attr declaration xfs: remove XFS_IFEXTENTS xfs: remove XFS_IFINLINE xfs: remove XFS_IFBROOT xfs: only look at the fork format in xfs_idestroy_fork xfs: simplify xfs_attr_remove_args xfs: rename and simplify xfs_bmap_one_block xfs: move the XFS_IFEXTENTS check into xfs_iread_extents xfs: drop unnecessary setfilesize helper xfs: drop unused ioend private merge and setfilesize code xfs: open code ioend needs workqueue helper xfs: drop submit side trans alloc for append ioends xfs: fix return of uninitialized value in variable error xfs: get rid of the ip parameter to xchk_setup_* xfs: fix scrub and remount-ro protection when running scrub xfs: move the check for post-EOF mappings into xfs_can_free_eofblocks xfs: move the xfs_can_free_eofblocks call under the IOLOCK xfs: precalculate default inode attribute offset xfs: default attr fork size does not handle device inodes xfs: inode fork allocation depends on XFS_IFEXTENT flag ...
2021-04-29xfs: introduce in-core global counter of allocbt blocksBrian Foster
Introduce an in-core counter to track the sum of all allocbt blocks used by the filesystem. This value is currently tracked per-ag via the ->agf_btreeblks field in the AGF, which also happens to include rmapbt blocks. A global, in-core count of allocbt blocks is required to identify the subset of global ->m_fdblocks that consists of unavailable blocks currently used for allocation btrees. To support this calculation at block reservation time, construct a similar global counter for allocbt blocks, populate it on first read of each AGF and update it as allocbt blocks are used and released. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-04-29xfs: unconditionally read all AGFs on mounts with perag reservationBrian Foster
perag reservation is enabled at mount time on a per AG basis. The upcoming change to set aside allocbt blocks from block reservation requires a populated allocbt counter as soon as possible after mount to be fully effective against large perag reservations. Therefore as a preparation step, initialize the pagf on all mounts where at least one reservation is active. Note that this already occurs to some degree on most default format filesystems as reservation requirement calculations already depend on the AGF or AGI, depending on the reservation type. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-04-29xfs: update superblock counters correctly for !lazysbcountDave Chinner
Keep the mount superblock counters up to date for !lazysbcount filesystems so that when we log the superblock they do not need updating in any way because they are already correct. It's found by what Zorro reported: 1. mkfs.xfs -f -l lazy-count=0 -m crc=0 $dev 2. mount $dev $mnt 3. fsstress -d $mnt -p 100 -n 1000 (maybe need more or less io load) 4. umount $mnt 5. xfs_repair -n $dev and I've seen no problem with this patch. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reported-by: Zorro Lang <zlang@redhat.com> Reviewed-by: Gao Xiang <hsiangkao@redhat.com> Signed-off-by: Gao Xiang <hsiangkao@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Brian Foster <bfoster@redhat.com>
2021-04-29xfs: remove obsolete AGF counter debuggingDarrick J. Wong
In commit f8f2835a9cf3 we changed the behavior of XFS to use EFIs to remove blocks from an overfilled AGFL because there were complaints about transaction overruns that stemmed from trying to free multiple blocks in a single transaction. Unfortunately, that commit missed a subtlety in the debug-mode transaction accounting when a realtime volume is attached. If a realtime file undergoes a data fork mapping change such that realtime extents are allocated (or freed) in the same transaction that a data device block is also allocated (or freed), we can trip a debugging assertion. This can happen (for example) if a realtime extent is allocated and it is necessary to reshape the bmbt to hold the new mapping. When we go to allocate a bmbt block from an AG, the first thing the data device block allocator does is ensure that the freelist is the proper length. If the freelist is too long, it will trim the freelist to the proper length. In debug mode, trimming the freelist calls xfs_trans_agflist_delta() to record the decrement in the AG free list count. Prior to f8f28 we would put the free block back in the free space btrees in the same transaction, which calls xfs_trans_agblocks_delta() to record the increment in the AG free block count. Since AGFL blocks are included in the global free block count (fdblocks), there is no corresponding fdblocks update, so the AGFL free satisfies the following condition in xfs_trans_apply_sb_deltas: /* * Check that superblock mods match the mods made to AGF counters. */ ASSERT((tp->t_fdblocks_delta + tp->t_res_fdblocks_delta) == (tp->t_ag_freeblks_delta + tp->t_ag_flist_delta + tp->t_ag_btree_delta)); The comparison here used to be: (X + 0) == ((X+1) + -1 + 0), where X is the number blocks that were allocated. After commit f8f28 we defer the block freeing to the next chained transaction, which means that the calls to xfs_trans_agflist_delta and xfs_trans_agblocks_delta occur in separate transactions. The (first) transaction that shortens the free list trips on the comparison, which has now become: (X + 0) == ((X) + -1 + 0) because we haven't freed the AGFL block yet; we've only logged an intention to free it. When the second transaction (the deferred free) commits, it will evaluate the expression as: (0 + 0) == (1 + 0 + 0) and trip over that in turn. At this point, the astute reader may note that the two commits tagged by this patch have been in the kernel for a long time but haven't generated any bug reports. How is it that the author became aware of this bug? This originally surfaced as an intermittent failure when I was testing realtime rmap, but a different bug report by Zorro Lang reveals the same assertion occuring on !lazysbcount filesystems. The common factor to both reports (and why this problem wasn't previously reported) becomes apparent if we consider when xfs_trans_apply_sb_deltas is called by __xfs_trans_commit(): if (tp->t_flags & XFS_TRANS_SB_DIRTY) xfs_trans_apply_sb_deltas(tp); With a modern lazysbcount filesystem, transactions update only the percpu counters, so they don't need to set XFS_TRANS_SB_DIRTY, hence xfs_trans_apply_sb_deltas is rarely called. However, updates to the count of free realtime extents are not part of lazysbcount, so XFS_TRANS_SB_DIRTY will be set on transactions adding or removing data fork mappings to realtime files; similarly, XFS_TRANS_SB_DIRTY is always set on !lazysbcount filesystems. Dave mentioned in response to an earlier version of this patch: "IIUC, what you are saying is that this debug code is simply not exercised in normal testing and hasn't been for the past decade? And it still won't be exercised on anything other than realtime device testing? "...it was debugging code from 1994 that was largely turned into dead code when lazysbcounters were introduced in 2007. Hence I'm not sure it holds any value anymore." This debugging code isn't especially helpful - you can modify the flcount on one AG and the freeblks of another AG, and it won't trigger. Add the fact that nobody noticed for a decade, and let's just get rid of it (and start testing realtime :P). This bug was found by running generic/051 on either a V4 filesystem lacking lazysbcount; or a V5 filesystem with a realtime volume. Cc: bfoster@redhat.com, zlang@redhat.com Fixes: f8f2835a9cf3 ("xfs: defer agfl block frees when dfops is available") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Brian Foster <bfoster@redhat.com>
2021-04-22xfs: rename struct xfs_legacy_ictimestampChristoph Hellwig
Rename struct xfs_legacy_ictimestamp to struct xfs_log_legacy_timestamp as it is a type used for logging timestamps with no relationship to the in-core inode. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-04-22xfs: rename xfs_ictimestamp_tChristoph Hellwig
Rename xfs_ictimestamp_t to xfs_log_timestamp_t as it is a type used for logging timestamps with no relationship to the in-core inode. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-04-15xfs: remove XFS_IFEXTENTSChristoph Hellwig
The in-memory XFS_IFEXTENTS is now only used to check if an inode with extents still needs the extents to be read into memory before doing operations that need the extent map. Add a new xfs_need_iread_extents helper that returns true for btree format forks that do not have any entries in the in-memory extent btree, and use that instead of checking the XFS_IFEXTENTS flag. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-04-15xfs: remove XFS_IFINLINEChristoph Hellwig
Just check for an inline format fork instead of the using the equivalent in-memory XFS_IFINLINE flag. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-04-15xfs: remove XFS_IFBROOTChristoph Hellwig
Just check for a btree format fork instead of the using the equivalent in-memory XFS_IFBROOT flag. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-04-15xfs: only look at the fork format in xfs_idestroy_forkChristoph Hellwig
Stop using the XFS_IFEXTENTS flag, and instead switch on the fork format in xfs_idestroy_fork to decide how to cleanup. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-04-15xfs: simplify xfs_attr_remove_argsChristoph Hellwig
Directly return from the subfunctions and avoid the error variable. Also remove the not really needed dp local variable. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-04-15xfs: rename and simplify xfs_bmap_one_blockChristoph Hellwig
xfs_bmap_one_block is only called for the attribute fork. Move it to xfs_attr.c, drop the unused whichfork argument and code only executed for the data fork and rename the result to xfs_attr_is_leaf. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-04-15xfs: move the XFS_IFEXTENTS check into xfs_iread_extentsChristoph Hellwig
Move the XFS_IFEXTENTS check from the callers into xfs_iread_extents to simplify the code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-04-12xfs: convert to fileattrMiklos Szeredi
Use the fileattr API to let the VFS handle locking, permission checking and conversion. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Cc: Darrick J. Wong <djwong@kernel.org>
2021-04-09xfs: fix return of uninitialized value in variable errorColin Ian King
A previous commit removed a call to xfs_attr3_leaf_read that assigned an error return code to variable error. We now have a few early error return paths to label 'out' that return error if error is set; however error now is uninitialized so potentially garbage is being returned. Fix this by setting error to zero to restore the original behaviour where error was zero at the label 'restart'. Addresses-Coverity: ("Uninitialized scalar variable") Fixes: 07120f1abdff ("xfs: Add xfs_has_attr and subroutines") Signed-off-by: Colin Ian King <colin.king@canonical.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-04-07xfs: precalculate default inode attribute offsetDave Chinner
Default attr fork offset is based on inode size, so is a fixed geometry parameter of the inode. Move it to the xfs_ino_geometry structure and stop calculating it on every call to xfs_default_attroffset(). Signed-off-by: Dave Chinner <dchinner@redhat.com> Tested-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
2021-04-07xfs: default attr fork size does not handle device inodesDave Chinner
Device inodes have a non-default data fork size of 8 bytes as checked/enforced by xfs_repair. xfs_default_attroffset() doesn't handle this, so lets do a minor refactor so it does. Fixes: e6a688c33238 ("xfs: initialise attr fork on inode create") Signed-off-by: Dave Chinner <dchinner@redhat.com> Tested-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
2021-04-07xfs: Use struct xfs_bmdr_block instead of struct xfs_btree_block to ↵Chandan Babu R
calculate root node size The incore data fork of an inode stores the bmap btree root node as 'struct xfs_btree_block'. However, the ondisk version of the inode stores the bmap btree root node as a 'struct xfs_bmdr_block'. xfs_bmap_add_attrfork_btree() checks if the btree root node fits inside the data fork of the inode. However, it incorrectly uses 'struct xfs_btree_block' to compute the size of the bmap btree root node. Since size of 'struct xfs_btree_block' is larger than that of 'struct xfs_bmdr_block', xfs_bmap_add_attrfork_btree() could end up unnecessarily demoting the current root node as the child of newly allocated root node. This commit optimizes space usage by modifying xfs_bmap_add_attrfork_btree() to use 'struct xfs_bmdr_block' to check if the bmap btree root node fits inside the data fork of the inode. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chandan Babu R <chandanrlinux@gmail.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-04-07xfs: deprecate BMV_IF_NO_DMAPI_READ flagAnthony Iliopoulos
Use of the flag has had no effect since kernel commit 288699fecaff ("xfs: drop dmapi hooks"), which removed all dmapi related code, so deprecate it. Signed-off-by: Anthony Iliopoulos <ailiop@suse.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-04-07xfs: move the di_crtime field to struct xfs_inodeChristoph Hellwig
Move the crtime field from struct xfs_icdinode into stuct xfs_inode and remove the now entirely unused struct xfs_icdinode. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-04-07xfs: move the di_flags2 field to struct xfs_inodeChristoph Hellwig
In preparation of removing the historic icinode struct, move the flags2 field into the containing xfs_inode structure. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-04-07xfs: move the di_flags field to struct xfs_inodeChristoph Hellwig
In preparation of removing the historic icinode struct, move the flags field into the containing xfs_inode structure. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-04-07xfs: move the di_forkoff field to struct xfs_inodeChristoph Hellwig
In preparation of removing the historic icinode struct, move the forkoff field into the containing xfs_inode structure. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-04-07xfs: use a union for i_cowextsize and i_flushiterChristoph Hellwig
The i_cowextsize field is only used for v3 inodes, and the i_flushiter field is only used for v1/v2 inodes. Use a union to pack the inode a littler better after adding a few missing guards around their usage. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-04-07xfs: move the di_flushiter field to struct xfs_inodeChristoph Hellwig
In preparation of removing the historic icinode struct, move the flushiter field into the containing xfs_inode structure. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-04-07xfs: move the di_cowextsize field to struct xfs_inodeChristoph Hellwig
In preparation of removing the historic icinode struct, move the cowextsize field into the containing xfs_inode structure. Also switch to use the xfs_extlen_t instead of a uint32_t. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-04-07xfs: move the di_extsize field to struct xfs_inodeChristoph Hellwig
In preparation of removing the historic icinode struct, move the extsize field into the containing xfs_inode structure. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-04-07xfs: move the di_nblocks field to struct xfs_inodeChristoph Hellwig
In preparation of removing the historic icinode struct, move the nblocks field into the containing xfs_inode structure. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-04-07xfs: move the di_size field to struct xfs_inodeChristoph Hellwig
In preparation of removing the historic icinode struct, move the on-disk size field into the containing xfs_inode structure. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-04-07xfs: move the di_projid field to struct xfs_inodeChristoph Hellwig
In preparation of removing the historic icinode struct, move the projid field into the containing xfs_inode structure. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-04-07xfs: remove the di_dmevmask and di_dmstate fields from struct xfs_icdinodeChristoph Hellwig
The legacy DMAPI fields were never set by upstream Linux XFS, and have no way to be read using the kernel APIs. So instead of bloating the in-core inode for them just copy them from the on-disk inode into the log when logging the inode. The only caveat is that we need to make sure to zero the fields for newly read or deleted inodes, which is solved using a new flag in the inode. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-04-07xfs: remove the unused xfs_icdinode_has_bigtime helperChristoph Hellwig
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-04-07xfs: split xfs_imap_to_bpChristoph Hellwig
Split looking up the dinode from xfs_imap_to_bp, which can be significantly simplified as a result. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-04-07xfs: Initialize xfs_alloc_arg->total correctly when allocating minlen extentsChandan Babu R
xfs/538 can cause the following call trace to be printed when executing on a multi-block directory configuration, WARNING: CPU: 1 PID: 2578 at fs/xfs/libxfs/xfs_bmap.c:717 xfs_bmap_extents_to_btree+0x520/0x5d0 Call Trace: ? xfs_buf_rele+0x4f/0x450 xfs_bmap_add_extent_hole_real+0x747/0x960 xfs_bmapi_allocate+0x39a/0x440 xfs_bmapi_write+0x507/0x9e0 xfs_da_grow_inode_int+0x1cd/0x330 ? up+0x12/0x60 xfs_dir2_grow_inode+0x62/0x110 ? xfs_trans_log_inode+0x234/0x2d0 xfs_dir2_sf_to_block+0x103/0x940 ? xfs_dir2_sf_check+0x8c/0x210 ? xfs_da_compname+0x19/0x30 ? xfs_dir2_sf_lookup+0xd0/0x3d0 xfs_dir2_sf_addname+0x10d/0x910 xfs_dir_createname+0x1ad/0x210 xfs_create+0x404/0x620 xfs_generic_create+0x24c/0x320 path_openat+0xda6/0x1030 do_filp_open+0x88/0x130 ? kmem_cache_alloc+0x50/0x210 ? __cond_resched+0x16/0x40 ? kmem_cache_alloc+0x50/0x210 do_sys_openat2+0x97/0x150 __x64_sys_creat+0x49/0x70 do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xae This occurs because xfs_bmap_exact_minlen_extent_alloc() initializes xfs_alloc_arg->total to xfs_bmalloca->minlen. In the context of xfs_bmap_exact_minlen_extent_alloc(), xfs_bmalloca->minlen has a value of 1 and hence the space allocator could choose an AG which has less than xfs_bmalloca->total number of free blocks available. As the transaction proceeds, one of the future space allocation requests could fail due to non-availability of free blocks in the AG that was originally chosen. This commit fixes the bug by assigning xfs_alloc_arg->total to the value of xfs_bmalloca->total. Fixes: 301519674699 ("xfs: Introduce error injection to allocate only minlen size extents for files") Signed-off-by: Chandan Babu R <chandanrlinux@gmail.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2021-03-25xfs: add error injection for per-AG resv failureGao Xiang
per-AG resv failure after fixing up freespace is hard to test in an effective way, so directly add an error injection path to observe such error handling path works as expected. Signed-off-by: Gao Xiang <hsiangkao@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-03-25xfs: introduce xfs_ag_shrink_space()Gao Xiang
This patch introduces a helper to shrink unused space in the last AG by fixing up the freespace btree. Also make sure that the per-AG reservation works under the new AG size. If such per-AG reservation or extent allocation fails, roll the transaction so the new transaction could cancel without any side effects. Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@redhat.com> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-03-25xfs: reduce debug overhead of dir leaf/node checksDave Chinner
On debug kernels, we call xfs_dir3_leaf_check_int() multiple times on every directory modification. The robust hash ordering checks it does on every entry in the leaf on every call results in a massive CPU overhead which slows down debug kernels by a large amount. We use xfs_dir3_leaf_check_int() for the verifiers as well, so we can't just gut the function to reduce overhead. What we can do, however, is reduce the work it does when it is called from the debug interfaces, just leaving the high level checks in place and leaving the robust validation to the verifiers. This means the debug checks will catch gross errors, but subtle bugs might not be caught until a verifier is run. It is easy enough to restore the existing debug behaviour if the developer needs it (just change a call parameter in the debug code), but overwise the overhead makes testing large directory block sizes on debug kernels very slow. Profile at an unlink rate of ~80k file/s on a 64k block size filesystem before the patch: 40.30% [kernel] [k] xfs_dir3_leaf_check_int 10.98% [kernel] [k] __xfs_dir3_data_check 8.10% [kernel] [k] xfs_verify_dir_ino 4.42% [kernel] [k] memcpy 2.22% [kernel] [k] xfs_dir2_data_get_ftype 1.52% [kernel] [k] do_raw_spin_lock Profile after, at an unlink rate of ~125k files/s (+50% improvement) has largely dropped the leaf verification debug overhead out of the profile. 16.53% [kernel] [k] __xfs_dir3_data_check 12.53% [kernel] [k] xfs_verify_dir_ino 7.97% [kernel] [k] memcpy 3.36% [kernel] [k] xfs_dir2_data_get_ftype 2.86% [kernel] [k] __pv_queued_spin_lock_slowpath Create shows a similar change in profile and a +25% improvement in performance. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-03-25xfs: No need for inode number error injection in __xfs_dir3_data_checkDave Chinner
We call xfs_dir_ino_validate() for every dir entry in a directory when doing validity checking of the directory. It calls xfs_verify_dir_ino() then emits a corruption report if bad or does error injection if good. It is extremely costly: 43.27% [kernel] [k] xfs_dir3_leaf_check_int 10.28% [kernel] [k] __xfs_dir3_data_check 6.61% [kernel] [k] xfs_verify_dir_ino 4.16% [kernel] [k] xfs_errortag_test 4.00% [kernel] [k] memcpy 3.48% [kernel] [k] xfs_dir_ino_validate 7% of the cpu usage in this directory traversal workload is xfs_dir_ino_validate() doing absolutely nothing. We don't need error injection to simulate a bad inode numbers in the directory structure because we can do that by fuzzing the structure on disk. And we don't need a corruption report, because the __xfs_dir3_data_check() will emit one if the inode number is bad. So just call xfs_verify_dir_ino() directly here, and get rid of all this unnecessary overhead: 40.30% [kernel] [k] xfs_dir3_leaf_check_int 10.98% [kernel] [k] __xfs_dir3_data_check 8.10% [kernel] [k] xfs_verify_dir_ino 4.42% [kernel] [k] memcpy 2.22% [kernel] [k] xfs_dir2_data_get_ftype 1.52% [kernel] [k] do_raw_spin_lock Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-03-25xfs: type verification is expensiveDave Chinner
From a concurrent rm -rf workload: 41.04% [kernel] [k] xfs_dir3_leaf_check_int 9.85% [kernel] [k] __xfs_dir3_data_check 5.60% [kernel] [k] xfs_verify_ino 5.32% [kernel] [k] xfs_agino_range 4.21% [kernel] [k] memcpy 3.06% [kernel] [k] xfs_errortag_test 2.57% [kernel] [k] xfs_dir_ino_validate 1.66% [kernel] [k] xfs_dir2_data_get_ftype 1.17% [kernel] [k] do_raw_spin_lock 1.11% [kernel] [k] xfs_verify_dir_ino 0.84% [kernel] [k] __raw_callee_save___pv_queued_spin_unlock 0.83% [kernel] [k] xfs_buf_find 0.64% [kernel] [k] xfs_log_commit_cil THere's an awful lot of overhead in just range checking inode numbers in that, but each inode number check is not a lot of code. The total is a bit over 14.5% of the CPU time is spent validating inode numbers. The problem is that they deeply nested global scope functions so the overhead here is all in function call marshalling. text data bss dec hex filename 2077 0 0 2077 81d fs/xfs/libxfs/xfs_types.o.orig 2197 0 0 2197 895 fs/xfs/libxfs/xfs_types.o There's a small increase in binary size by inlining all the local nested calls in the verifier functions, but the same workload now profiles as: 40.69% [kernel] [k] xfs_dir3_leaf_check_int 10.52% [kernel] [k] __xfs_dir3_data_check 6.68% [kernel] [k] xfs_verify_dir_ino 4.22% [kernel] [k] xfs_errortag_test 4.15% [kernel] [k] memcpy 3.53% [kernel] [k] xfs_dir_ino_validate 1.87% [kernel] [k] xfs_dir2_data_get_ftype 1.37% [kernel] [k] do_raw_spin_lock 0.98% [kernel] [k] xfs_buf_find 0.94% [kernel] [k] __raw_callee_save___pv_queued_spin_unlock 0.73% [kernel] [k] xfs_log_commit_cil Now we only spend just over 10% of the time validing inode numbers for the same workload. Hence a few "inline" keyworks is good enough to reduce the validation overhead by 30%... Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-03-25xfs: initialise attr fork on inode createDave Chinner
When we allocate a new inode, we often need to add an attribute to the inode as part of the create. This can happen as a result of needing to add default ACLs or security labels before the inode is made visible to userspace. This is highly inefficient right now. We do the create transaction to allocate the inode, then we do an "add attr fork" transaction to modify the just created empty inode to set the inode fork offset to allow attributes to be stored, then we go and do the attribute creation. This means 3 transactions instead of 1 to allocate an inode, and this greatly increases the load on the CIL commit code, resulting in excessive contention on the CIL spin locks and performance degradation: 18.99% [kernel] [k] __pv_queued_spin_lock_slowpath 3.57% [kernel] [k] do_raw_spin_lock 2.51% [kernel] [k] __raw_callee_save___pv_queued_spin_unlock 2.48% [kernel] [k] memcpy 2.34% [kernel] [k] xfs_log_commit_cil The typical profile resulting from running fsmark on a selinux enabled filesytem is adds this overhead to the create path: - 15.30% xfs_init_security - 15.23% security_inode_init_security - 13.05% xfs_initxattrs - 12.94% xfs_attr_set - 6.75% xfs_bmap_add_attrfork - 5.51% xfs_trans_commit - 5.48% __xfs_trans_commit - 5.35% xfs_log_commit_cil - 3.86% _raw_spin_lock - do_raw_spin_lock __pv_queued_spin_lock_slowpath - 0.70% xfs_trans_alloc 0.52% xfs_trans_reserve - 5.41% xfs_attr_set_args - 5.39% xfs_attr_set_shortform.constprop.0 - 4.46% xfs_trans_commit - 4.46% __xfs_trans_commit - 4.33% xfs_log_commit_cil - 2.74% _raw_spin_lock - do_raw_spin_lock __pv_queued_spin_lock_slowpath 0.60% xfs_inode_item_format 0.90% xfs_attr_try_sf_addname - 1.99% selinux_inode_init_security - 1.02% security_sid_to_context_force - 1.00% security_sid_to_context_core - 0.92% sidtab_entry_to_string - 0.90% sidtab_sid2str_get 0.59% sidtab_sid2str_put.part.0 - 0.82% selinux_determine_inode_label - 0.77% security_transition_sid 0.70% security_compute_sid.part.0 And fsmark creation rate performance drops by ~25%. The key point to note here is that half the additional overhead comes from adding the attribute fork to the newly created inode. That's crazy, considering we can do this same thing at inode create time with a couple of lines of code and no extra overhead. So, if we know we are going to add an attribute immediately after creating the inode, let's just initialise the attribute fork inside the create transaction and chop that whole chunk of code out of the create fast path. This completely removes the performance drop caused by enabling SELinux, and the profile looks like: - 8.99% xfs_init_security - 9.00% security_inode_init_security - 6.43% xfs_initxattrs - 6.37% xfs_attr_set - 5.45% xfs_attr_set_args - 5.42% xfs_attr_set_shortform.constprop.0 - 4.51% xfs_trans_commit - 4.54% __xfs_trans_commit - 4.59% xfs_log_commit_cil - 2.67% _raw_spin_lock - 3.28% do_raw_spin_lock 3.08% __pv_queued_spin_lock_slowpath 0.66% xfs_inode_item_format - 0.90% xfs_attr_try_sf_addname - 0.60% xfs_trans_alloc - 2.35% selinux_inode_init_security - 1.25% security_sid_to_context_force - 1.21% security_sid_to_context_core - 1.19% sidtab_entry_to_string - 1.20% sidtab_sid2str_get - 0.86% sidtab_sid2str_put.part.0 - 0.62% _raw_spin_lock_irqsave - 0.77% do_raw_spin_lock __pv_queued_spin_lock_slowpath - 0.84% selinux_determine_inode_label - 0.83% security_transition_sid 0.86% security_compute_sid.part.0 Which indicates the XFS overhead of creating the selinux xattr has been halved. This doesn't fix the CIL lock contention problem, just means it's not a limiting factor for this workload. Lock contention in the security subsystems is going to be an issue soon, though... Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> [djwong: fix compilation error when CONFIG_SECURITY=n] Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Gao Xiang <hsiangkao@redhat.com>