path: root/fs
2017-06-27  xfs: replace log_badcrc_factor knob with error injection tag  (Brian Foster)

Now that error injection tags support dynamic frequency adjustment, replace the debug mode sysfs knob that controls log record CRC error injection with an error injection tag.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
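After this series, an injection site reduces to a single macro test whose firing frequency is runtime-tunable. A minimal sketch of what such a site looks like (the error action shown is illustrative only, not the patch's exact code):

    /* Fires at the tag's configured frequency (roughly 1 in N calls). */
    if (XFS_TEST_ERROR(false, mp, XFS_ERRTAG_LOG_BAD_CRC)) {
        /* illustrative only: force an error path to exercise recovery */
        error = -EIO;
        goto out_error;
    }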
2017-06-27  xfs: convert drop_writes to use the errortag mechanism  (Darrick J. Wong)

We now have enhanced error injection that can control the frequency with which errors happen, so convert drop_writes to use this.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
2017-06-27  xfs: remove unneeded parameter from XFS_TEST_ERROR  (Darrick J. Wong)

Since we moved the injected error frequency controls to the mountpoint, we can get rid of the last argument to XFS_TEST_ERROR.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
2017-06-27  xfs: expose errortag knobs via sysfs  (Darrick J. Wong)

Creates a /sys/fs/xfs/$dev/errortag/ directory to control the errortag values directly. This enables us to control the randomness values, rather than having to accept the defaults.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
2017-06-27  xfs: make errortag a per-mountpoint structure  (Darrick J. Wong)

Remove the xfs_etest structure in favor of a per-mountpoint structure. This will give us the flexibility to set as many error injection points as we want, and later enable us to set up sysfs knobs to set the trigger frequency as we wish.

This comes at a cost of higher memory use, but until we hit 1024 injection points (we're at 29) or a lot of mounts this shouldn't be a huge issue.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
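A minimal sketch of the per-mount allocation this describes (close to, but not necessarily identical to, the patch; error handling trimmed):

    /* One frequency slot per injection point, hung off the mount. */
    int
    xfs_errortag_init(
        struct xfs_mount    *mp)
    {
        mp->m_errortag = kmem_zalloc(sizeof(unsigned int) * XFS_ERRTAG_MAX,
                                     KM_MAYFAIL);
        if (!mp->m_errortag)
            return -ENOMEM;
        return 0;
    }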
2017-06-24  xfs: free uncommitted transactions during log recovery  (Brian Foster)

Log recovery allocates in-core transaction and member item data structures on-demand as it processes the on-disk log. Transactions are allocated on first encounter on-disk and stored in a hash table structure where they are easily accessible for subsequent lookups. Transaction items are also allocated on demand and are attached to the associated transactions. When a commit record is encountered in the log, the transaction is committed to the fs and the in-core structures are freed.

If a filesystem crashes or shuts down before all in-core log buffers are flushed to the log, however, not all transactions may have commit records in the log. As expected, the modifications in such an incomplete transaction are not replayed to the fs. The in-core data structures for the partial transaction are never freed, however, resulting in a memory leak.

Update xlog_do_recovery_pass() to first correctly initialize the hash table array so empty lists can be distinguished from populated lists on function exit. Update xlog_recover_free_trans() to always remove the transaction from the list prior to freeing the associated memory. Finally, walk the hash table of transaction lists as the last step before it goes out of scope and free any transactions that may remain on the lists. This prevents a memory leak of partial transactions in the log.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
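A sketch of the shape of the fix (structure and helper names per the xfs log recovery code; simplified):

    struct hlist_head	rhash[XLOG_RHASH_SIZE];
    int			i;

    /* Initialize every bucket so empty lists are distinguishable later. */
    for (i = 0; i < XLOG_RHASH_SIZE; i++)
        INIT_HLIST_HEAD(&rhash[i]);

    /* ... recovery pass runs; committed transactions are freed as seen ... */

    /* Last step: free any partial transactions still hashed. */
    for (i = 0; i < XLOG_RHASH_SIZE; i++) {
        struct xlog_recover	*trans;
        struct hlist_node	*tmp;

        hlist_for_each_entry_safe(trans, tmp, &rhash[i], r_list)
            xlog_recover_free_trans(trans);  /* unhashes, then frees */
    }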
2017-06-20  xfs: don't allow bmap on rt files  (Darrick J. Wong)

bmap returns a dumb LBA address but not the block device that goes with that LBA. Swapfiles don't care about this and will blindly assume that the data volume is the correct blockdev, which is totally bogus for files on the rt subvolume. This results in the swap code doing IOs to arbitrary locations on the data device(!) if the passed in mapping is a realtime file, so just turn off bmap for rt files.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
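The guard itself is tiny; a sketch (simplified from xfs_vn_bmap as it stood in this era; any pre-flush of the inode is elided):

    STATIC sector_t
    xfs_vn_bmap(
        struct address_space	*mapping,
        sector_t		block)
    {
        struct xfs_inode	*ip = XFS_I(mapping->host);

        /* bmap can't name the rt device, so report "no mapping". */
        if (XFS_IS_REALTIME_INODE(ip))
            return 0;
        return generic_block_bmap(mapping, block, xfs_get_blocks);
    }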
2017-06-20  xfs: allow reading of already-locked remote symbolic link  (Darrick J. Wong)

Expose the readlink variant that doesn't take the inode lock so that the scrubber can inspect symlink contents.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2017-06-20  xfs: pass along transaction context when reading xattr block buffers  (Darrick J. Wong)

Teach the extended attribute reading functions to pass along a transaction context if one was supplied. The extended attribute scrub code will use transactions to lock buffers and avoid deadlocking with itself in the case of loops; since it will already have the inode locked, also create xattr get/list helpers that don't take locks.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2017-06-20  xfs: pass along transaction context when reading directory block buffers  (Darrick J. Wong)

Teach the directory reading functions to pass along a transaction context if one was supplied. The directory scrub code will use transactions to lock buffers and avoid deadlocking with itself in the case of loops.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2017-06-20  xfs: return the hash value of a leaf1 directory block  (Darrick J. Wong)

Modify the existing dir leafn lasthash function to enable us to calculate the highest hash value of a leaf1 block. This will be used by the directory scrubbing code to check the sanity of hashes in leaf1 directory blocks.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2017-06-20  xfs: refactor the ifork block counting function  (Darrick J. Wong)

Refactor the inode fork block counting function to count extents for us at the same time. This will be used by the bmbt scrubber function.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2017-06-20  xfs: make _bmap_count_blocks consistent wrt delalloc extent behavior  (Darrick J. Wong)

There is an inconsistency in the way that _bmap_count_blocks deals with delalloc reservations -- if the specified fork is in extents format, *count is set to the total number of blocks referenced by the in-core fork, including delalloc extents. However, if the fork is in btree format, *count is set to the number of blocks referenced by the on-disk fork, which does /not/ include delalloc extents.

For the lone existing caller of _bmap_count_blocks this hasn't been an issue because the function is only used to count xattr fork blocks (where there aren't any delalloc reservations). However, when scrub comes along it will use this same function to check di_nblocks against both on-disk extent maps, so we need this behavior to be consistent.

Therefore, fix _bmap_count_leaves not to include delalloc extents and remove unnecessary parameters.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
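Roughly what the consistent extent-format counting looks like (a sketch against the 4.12-era in-core extent APIs; not the patch verbatim):

    /* Count only real blocks; skip delalloc reservations. */
    for (idx = 0; idx < xfs_iext_count(ifp); idx++) {
        struct xfs_bmbt_irec	got;

        xfs_bmbt_get_all(xfs_iext_get_ext(ifp, idx), &got);
        if (!isnullstartblock(got.br_startblock))
            *count += got.br_blockcount;
    }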
2017-06-19  xfs: separate function to check if inode shares extents  (Darrick J. Wong)

Separate the "clear reflink flag" function into one function that checks if the flag is needed, and a second function that checks and clears the flag. The inode scrub code will want to check the necessity of the flag without clearing it.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2017-06-19  xfs: reflink find shared should take a transaction  (Darrick J. Wong)

Adapt _reflink_find_shared to take an optional transaction pointer. The inode scrubber code will need to decide (within transaction context) if a file has shared blocks. To avoid buffer deadlocks, we must pass the tp through to this function's utility calls.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2017-06-19  xfs: check if an inode is cached and allocated  (Darrick J. Wong)

Check the inode cache for a particular inode number. If it's in the cache, check that it's not currently being reclaimed. If it's not being reclaimed, return zero if the inode is allocated. This function will be used by various scrubbers to decide if the cache is more up to date than the disk in terms of checking if an inode is allocated.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2017-06-19  xfs: export _inobt_btrec_to_irec and _ialloc_cluster_alignment for scrub  (Darrick J. Wong)

Create a function to extract an in-core inobt record from a generic btree_rec union so that scrub will be able to check inobt records and check inode block alignment.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2017-06-19  xfs: plumb in needed functions for range querying of various btrees  (Darrick J. Wong)

Plumb in the pieces (init_high_key, diff_two_keys) necessary to call query_range on the inode space and block mapping btrees and to extract raw btree records. This will eventually be used by the inobt and bmbt scrubbers.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2017-06-19  xfs: export various functions for the online scrubber  (Darrick J. Wong)

Export various internal functions so that the online scrubber can use them to check the state of metadata.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2017-06-19  xfs: always compile the btree inorder check functions  (Darrick J. Wong)

The btree record and key inorder check functions will be used by the btree scrubber code, so make sure they're always built.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2017-06-19  xfs: remove double-underscore integer types  (Darrick J. Wong)

This is a purely mechanical patch that removes the private __{u,}int{8,16,32,64}_t typedefs in favor of using the system {u,}int{8,16,32,64}_t typedefs. This is the sed script used to perform the transformation and fix the resulting whitespace and indentation errors:

    s/typedef\t__uint8_t/typedef __uint8_t\t/g
    s/typedef\t__uint/typedef __uint/g
    s/typedef\t__int\([0-9]*\)_t/typedef int\1_t\t/g
    s/__uint8_t\t/__uint8_t\t\t/g
    s/__uint/uint/g
    s/__int\([0-9]*\)_t\t/__int\1_t\t\t/g
    s/__int/int/g
    /^typedef.*int[0-9]*_t;$/d

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2017-06-19  xfs: optimize _btree_query_all  (Darrick J. Wong)

Don't bother wandering our way through the leaf nodes when the caller issues a query_all; just zoom down the left side of the tree and walk rightwards along level zero.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
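A sketch of the resulting walk (the descent helper is a hypothetical stand-in name; xfs_btree_get_rec/xfs_btree_increment are the real level-zero primitives):

    union xfs_btree_rec	*recp;
    int			stat = 1;
    int			error = 0;

    /* descend_to_leftmost_leaf() is hypothetical shorthand for zooming
     * down the left side of the tree to level zero. */
    error = descend_to_leftmost_leaf(cur, &stat);
    while (!error && stat) {
        error = xfs_btree_get_rec(cur, &recp, &stat);   /* record under cursor */
        if (error || !stat)
            break;
        error = fn(cur, recp, priv);                    /* caller's callback */
        if (error)
            break;
        error = xfs_btree_increment(cur, 0, &stat);     /* step right, level 0 */
    }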
2017-06-19  xfs: remove bli from AIL before release on transaction abort  (Brian Foster)

When a buffer is modified, logged and committed, it ultimately ends up sitting on the AIL with a dirty bli waiting for metadata writeback. If another transaction locks and invalidates the buffer (freeing an inode chunk, for example) in the meantime, the bli is flagged as stale, the dirty state is cleared and the bli remains in the AIL.

If a shutdown occurs before the transaction that has invalidated the buffer is committed, the transaction is ultimately aborted. The log items are flagged as such and ->iop_unlock() handles the aborted items. Because the bli is clean (due to the invalidation), ->iop_unlock() unconditionally releases it. The log item may still reside in the AIL, however, which means the I/O completion handler may still run and attempt to access it. This results in an assert failure due to the release of the bli while still present in the AIL, and a subsequent NULL dereference and panic in the buffer I/O completion handling. This can be reproduced by running generic/388 in repetition.

To avoid this problem, update xfs_buf_item_unlock() to first check whether the bli is aborted and, if so, remove it from the AIL before it is released. This ensures that the bli is no longer accessed during the shutdown sequence after it has been freed.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
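The ordering fix amounts to a few lines in the abort path; a sketch (simplified from xfs_buf_item_unlock() of this era):

    if (aborted) {
        /* Pull the item off the AIL first so buffer I/O completion
         * can never touch a freed bli. */
        if (lip->li_flags & XFS_LI_IN_AIL)
            xfs_trans_ail_remove(lip, SHUTDOWN_LOG_IO_ERROR);
        xfs_buf_item_relse(bp);
    }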
2017-06-19  xfs: release bli from transaction properly on fs shutdown  (Brian Foster)

If a filesystem shutdown occurs with a buffer log item in the CIL and a log force occurs, the ->iop_unpin() handler is generally expected to tear down the bli properly. This entails freeing the bli memory and releasing the associated hold on the buffer so it can be released and the filesystem unmounted.

If this sequence occurs while ->bli_refcount is elevated (i.e., another transaction is open and attempting to modify the buffer), however, ->iop_unpin() may not be responsible for releasing the bli. Instead, the transaction may release the final ->bli_refcount reference and thus xfs_trans_brelse() is responsible for tearing down the bli.

While xfs_trans_brelse() does drop the reference count, it only attempts to release the bli if it is clean (i.e., not in the CIL/AIL). If the filesystem is shutdown and the bli is sitting dirty in the CIL as noted above, this ends up skipping the last opportunity to release the bli. In turn, this leaves the hold on the buffer and causes an unmount hang. This can be reproduced by running generic/388 in repetition.

Update xfs_trans_brelse() to handle this shutdown corner case correctly. If the final bli reference is dropped and the filesystem is shutdown, remove the bli from the AIL (if necessary) and release the bli to drop the buffer hold and ensure an unmount does not hang.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-06-19  xfs: avoid harmless gcc-7 warnings  (Arnd Bergmann)

gcc-7 flags the use of integer math inside of a condition as a potential bug:

    fs/xfs/xfs_bmap_util.c: In function 'xfs_swap_extents_check_format':
    fs/xfs/xfs_bmap_util.c:1619:8: error: '<<' in boolean context, did you mean '<' ? [-Werror=int-in-bool-context]
    fs/xfs/xfs_bmap_util.c:1629:8: error: '<<' in boolean context, did you mean '<' ? [-Werror=int-in-bool-context]

There is already a helper function for testing the di_forkoff field for zero, so let's use that instead to shut up the warning.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-06-19  xfs: remove lsn relevant fields from xfs_trans structure and its users  (Shan Hai)

The t_lsn is not used anymore, and in the current code the t_commit_lsn is used only as temporary storage for the checkpoint sequence number. The start/commit LSNs are tracked as a transaction group tag in the xfs_cil_ctx instead of in a single transaction, so remove them from the xfs_trans structure and their users to match the design.

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-06-19  xfs: remove XFS_HSIZE  (Christoph Hellwig)

XFS_HSIZE is an extremely confusing way to calculate the size of handle_t. Given that handle_t has only ever had two sizes, and one of them isn't even covered by XFS_HSIZE to start with, just remove the macro and use a constant sizeof expression.

Note that XFS_HSIZE isn't used in xfsprogs, xfsdump or xfstests either.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-06-19  xfs: dump transaction usage details on log reservation overrun  (Brian Foster)

If a transaction log reservation overrun occurs, the ticket data associated with the reservation is dumped in xfs_log_commit_cil(). This occurs long after the transaction items and details have been removed from the transaction and effectively lost. This limited set of ticket data provides very little information to support debugging transaction overruns based on the typical report.

To improve transaction log reservation overrun reporting, create a helper to dump transaction details such as log items, log vector data, etc., as well as the underlying ticket data for the transaction. Move the overrun detection from xfs_log_commit_cil() to xlog_cil_insert_items() so it occurs prior to migration of the logged items to the CIL. Call the new helper such that it is able to dump this transaction data before it is lost.

Also, warn on overrun to provide callstack context for the offending transaction, and include a few additional messages from xlog_cil_insert_items() to display the reservation consumed locally for overhead such as log vector headers, split region headers and the context ticket. This provides a complete general breakdown of the reservation consumption of a transaction when/if it happens to overrun the reservation.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-06-19  xfs: refactor xlog_cil_insert_items() to facilitate transaction dump  (Brian Foster)

Transaction reservation overrun detection currently occurs too late to print useful information about the offending transaction. Ideally, the transaction data is printed before the associated log items are moved from the transaction to the CIL, which occurs in xlog_cil_insert_items(), such that details of the items logged by the transaction are available for analysis.

Refactor xlog_cil_insert_items() to facilitate moving tx overrun detection to this function. Update the function to track each bit of extra log reservation stolen from the transaction (i.e., such as for the CIL context ticket) and perform the log item migration as the last operation before the CIL lock is released. This creates a context where the transaction reservation consumption has been fully calculated when the log items are moved to the CIL. This patch makes no functional changes.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-06-19  xfs: separate shutdown from ticket reservation print helper  (Brian Foster)

xlog_print_tic_res() pre-dates delayed logging and the committed items list (CIL) and thus retains some factoring warts, such as hard coded function names in the output and the fact that it induces a shutdown.

In preparation for more detailed logging of regular transaction overrun situations, refactor xlog_print_tic_res() to be slightly more generic. Reword some of the warning messages and pull the shutdown into the callers.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-06-19  xfs: define fatal assert build time tunable  (Brian Foster)

While configurable at runtime, the DEBUG mode assert failure behavior is usually either desired or not for a particular situation. For example, developers using kernel modules may prefer for fatal asserts to remain disabled across module reloads, while QE engineers doing broad regression testing may prefer to have fatal asserts enabled on boot to facilitate data collection for bug reports.

To provide a compromise/convenience for developers, create a Kconfig option that sets the default value of the DEBUG mode 'bug_on_assert' sysfs tunable. The default behavior remains to trigger kernel BUGs on assert failures to preserve existing behavior across kernel configuration updates with DEBUG mode enabled.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-06-19  xfs: define bug_on_assert debug mode sysfs tunable  (Brian Foster)

In DEBUG mode, assert failures unconditionally trigger a kernel BUG. This is useful in diagnostic situations to panic a system and collect detailed state information at the time of a failure. This can also cause problems in cases where DEBUG mode code is desired but it is preferable not to trigger kernel BUGs on assert failure. For example, during development of new code or during certain xfstests tests that intentionally cause corruption and test the kernel for survival (but otherwise may expect to trigger assert failures).

To provide additional flexibility, create the <sysfs>/fs/xfs/debug/bug_on_assert tunable to configure assert failure behavior at runtime. This tunable is only available in DEBUG mode and is enabled by default to preserve existing default behavior. When disabled, assert failures in DEBUG mode result in kernel warnings.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
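The resulting assert behavior is easy to picture; a sketch of the failure hook (modeled on xfs's assfail(), lightly simplified):

    void
    assfail(char *expr, char *file, int line)
    {
        xfs_emerg(NULL, "Assertion failed: %s, file: %s, line: %d",
                  expr, file, line);
        if (xfs_globals.bug_on_assert)
            BUG();          /* default: preserve historical behavior */
        else
            WARN_ON(1);     /* tunable off: warn and keep running */
    }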
2017-06-19  xfs: try to avoid blowing out the transaction reservation when bunmapping a shared extent  (Darrick J. Wong)

In a pathological scenario where we are trying to bunmapi a single extent in which every other block is shared, it's possible that trying to unmap the entire large extent in a single transaction can generate so many EFIs that we overflow the transaction reservation.

Therefore, use a heuristic to guess at the number of blocks we can safely unmap from a reflink file's data fork in a single transaction. This should prevent problems such as the log head slamming into the tail and ASSERTs that trigger because we've exceeded the transaction reservation.

Note that since bunmapi can fail to unmap the entire range, we must also teach the deferred unmap code to roll into a new transaction whenever we get low on reservation.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
[hch: random edits, all bugs are my fault]
Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-06-19  xfs: refactor dir2 leaf readahead shadow buffer cleverness  (Darrick J. Wong)

Currently, the dir2 leaf block getdents function uses a complex state tracking mechanism to create a shadow copy of the block mappings and then uses the shadow copy to schedule readahead. Since the read and readahead functions are perfectly capable of reading the mappings themselves, we can tear all that out in favor of a simpler function that simply keeps pushing the readahead window further out.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2017-06-19  xfs: push buffer of flush locked dquot to avoid quotacheck deadlock  (Brian Foster)

Reclaim during quotacheck can lead to deadlocks on the dquot flush lock:

 - Quotacheck populates a local delwri queue with the physical dquot buffers.
 - Quotacheck performs the xfs_qm_dqusage_adjust() bulkstat and dirties all of the dquots.
 - Reclaim kicks in and attempts to flush a dquot whose buffer is already queued on the quotacheck queue. The flush succeeds but queueing to the reclaim delwri queue fails as the backing buffer is already queued. The flush unlock is now deferred to I/O completion of the buffer from the quotacheck queue.
 - The dqadjust bulkstat continues and dirties the recently flushed dquot once again.
 - Quotacheck proceeds to the xfs_qm_flush_one() walk, which requires the flush lock to update the backing buffers with the in-core recalculated values. It deadlocks on the redirtied dquot as the flush lock was already acquired by reclaim, but the buffer resides on the local delwri queue, which isn't submitted until the end of quotacheck.

This is reproduced by running quotacheck on a filesystem with a couple million inodes in low memory (512MB-1GB) situations. This is a regression as of commit 43ff2122e6 ("xfs: on-stack delayed write buffer lists"), which removed a trylock and buffer I/O submission from the quotacheck dquot flush sequence.

Quotacheck first resets and collects the physical dquot buffers in a delwri queue. Then, it traverses the filesystem inodes via bulkstat, updates the in-core dquots, flushes the corrected dquots to the backing buffers and finally submits the delwri queue for I/O. Since the backing buffers are queued across the entire quotacheck operation, dquot reclaim cannot possibly complete a dquot flush before quotacheck completes. Therefore, quotacheck must submit the buffer for I/O in order to cycle the flush lock and flush the dirty in-core dquot to the buffer.

Add a delwri queue buffer push mechanism to submit an individual buffer for I/O without losing the delwri queue status, and use it from quotacheck to avoid the deadlock. This restores quotacheck behavior to as before the regression was introduced.

Reported-by: Martin Svec <martin.svec@zoner.cz>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
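The new helper is used from the flush walk roughly like so (a sketch; the buffer lookup and retry logic are elided):

    /* Submit one queued buffer for I/O to cycle its dquot's flush lock,
     * without losing the buffer's place on the quotacheck delwri queue. */
    error = xfs_buf_delwri_pushbuf(bp, buffer_list);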
2017-06-19  mm: larger stack guard gap, between vmas  (Hugh Dickins)

Stack guard page is a useful feature to reduce a risk of stack smashing into a different mapping. We have been using a single page gap, which is sufficient to prevent having stack adjacent to a different mapping. But this seems to be insufficient in the light of the stack usage in userspace. E.g. glibc uses as large as 64kB alloca() in many commonly used functions. Others use constructs like gid_t buffer[NGROUPS_MAX], which is 256kB, or stack strings with MAX_ARG_STRLEN.

This will become especially dangerous for suid binaries and the default no limit for the stack size limit, because those applications can be tricked to consume a large portion of the stack and a single glibc call could jump over the guard page. These attacks are not theoretical, unfortunately.

Make those attacks less probable by increasing the stack guard gap to 1MB (on systems with 4k pages; but make it depend on the page size because systems with larger base pages might cap stack allocations in the PAGE_SIZE units), which should cover larger alloca() and VLA stack allocations. It is obviously not a full fix because the problem is somehow inherent, but it should reduce attack space a lot.

One could argue that the gap size should be configurable from userspace, but that can be done later when somebody finds that the new 1MB is wrong for some special case applications. For now, add a kernel command line option (stack_guard_gap) to specify the stack gap size (in page units).

Implementation wise, first delete all the old code for stack guard page: because although we could get away with accounting one extra page in a stack vma, accounting a larger gap can break userspace - case in point, a program run with "ulimit -S -v 20000" failed when the 1MB gap was counted for RLIMIT_AS; similar problems could come with RLIMIT_MLOCK and strict non-overcommit mode.

Instead of keeping the gap inside the stack vma, maintain the stack guard gap as a gap between vmas: using vm_start_gap() in place of vm_start (or vm_end_gap() in place of vm_end if VM_GROWSUP) in just those few places which need to respect the gap - mainly arch_get_unmapped_area(), and the vma tree's subtree_gap support for that.

Original-patch-by: Oleg Nesterov <oleg@redhat.com>
Original-patch-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Tested-by: Helge Deller <deller@gmx.de> # parisc
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
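The heart of the vma-gap approach is a helper that reports a stack vma's start as if the guard gap were part of it; a sketch along the lines of the patch's vm_start_gap():

    static inline unsigned long vm_start_gap(struct vm_area_struct *vma)
    {
        unsigned long vm_start = vma->vm_start;

        if (vma->vm_flags & VM_GROWSDOWN) {
            vm_start -= stack_guard_gap;
            if (vm_start > vma->vm_start)   /* guard against underflow */
                vm_start = 0;
        }
        return vm_start;
    }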
2017-06-18  Merge tag 'ceph-for-4.12-rc6' of git://github.com/ceph/ceph-client  (Linus Torvalds)

Pull ceph fixes from Ilya Dryomov:
 "A fix for an old ceph ->fh_to_* bug from Luis and two timestamp fixups from Zheng, prompted by the ongoing y2038 work"

* tag 'ceph-for-4.12-rc6' of git://github.com/ceph/ceph-client:
  ceph: unify inode i_ctime update
  ceph: use current_kernel_time() to get request time stamp
  ceph: check i_nlink while converting a file handle to dentry
2017-06-17  Merge tag 'xfs-4.12-fixes-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux  (Linus Torvalds)

Pull xfs fix from Darrick Wong:
 "One more bugfix for you for 4.12-rc6 to fix something that came up in an earlier rc:

  - Fix some bogus ASSERT failures on CONFIG_SMP=n and CONFIG_XFS_DEBUG=y"

* tag 'xfs-4.12-fixes-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: fix spurious spin_is_locked() assert failures on non-smp kernels
2017-06-17  Merge branch 'ufs-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs  (Linus Torvalds)

Pull ufs fixes from Al Viro:
 "Fix assorted ufs bugs: a couple of deadlocks, fs corruption in truncate(), oopsen on tail unpacking and truncate when racing with vmscan, mild fs corruption (free blocks stats summary buggered, *BSD fsck would complain and fix), several instances of broken logics around reserved blocks (starting with "check almost never triggers when it should" and then there are issues with sufficiently large UFS2)"

[ Note: ufs hasn't gotten any loving in a long time, because nobody really seems to use it. These ufs fixes are triggered by people actually caring now, not some sudden influx of new bugs.  - Linus ]

* 'ufs-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  ufs_truncate_blocks(): fix the case when size is in the last direct block
  ufs: more deadlock prevention on tail unpacking
  ufs: avoid grabbing ->truncate_mutex if possible
  ufs_get_locked_page(): make sure we have buffer_heads
  ufs: fix s_size/s_dsize users
  ufs: fix reserved blocks check
  ufs: make ufs_freespace() return signed
  ufs: fix logics in "ufs: make fsck -f happy"
2017-06-17  Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs  (Linus Torvalds)

Pull vfs fixes from Al Viro:
 "A couple of fixes; a leak in mntns_install() caught by Andrei (this cycle regression) + d_invalidate() softlockup fix - that had been reported by a bunch of people lately, but the problem is pretty old"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fs: don't forget to put old mntns in mntns_install
  Hang/soft lockup in d_invalidate with simultaneous calls
2017-06-17  userfaultfd: shmem: handle coredumping in handle_userfault()  (Andrea Arcangeli)

Anon and hugetlbfs handle FOLL_DUMP set by get_dump_page() internally to __get_user_pages(). shmem, in contrast, has no special FOLL_DUMP handling there, so handle_mm_fault() is invoked without mmap_sem and ends up calling handle_userfault(), which isn't expecting to be invoked without mmap_sem held.

This makes handle_userfault() fail immediately if invoked through shmem_vm_ops->fault during coredumping and solves the problem. The side effect is a BUG_ON with no lock held triggered by the coredumping process which exits.

Only 4.11 is affected; pre-4.11 anon memory holes are skipped in __get_user_pages by checking FOLL_DUMP explicitly against empty pagetables (mm/gup.c:no_page_table()). It's zero cost as we already had a check for current->flags to prevent futex to trigger userfaults during exit (PF_EXITING).

Link: http://lkml.kernel.org/r/20170615214838.27429-1-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: <stable@vger.kernel.org> [4.11+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
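The check itself mirrors the existing PF_EXITING test near the top of handle_userfault(); a sketch:

    /* Coredumping (like exit) must not wait for a userfault manager. */
    if (current->flags & (PF_EXITING | PF_DUMPCORE))
        goto out;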
2017-06-16  Merge tag 'configfs-for-4.12' of git://git.infradead.org/users/hch/configfs  (Linus Torvalds)

Pull configfs updates from Christoph Hellwig:
 "A fix from Nic for a race seen in production (including a stable tag). And while I'm sending you this I'm also sneaking in a trivial new helper from Bart so that we don't need inter-tree dependencies for the next merge window"

* tag 'configfs-for-4.12' of git://git.infradead.org/users/hch/configfs:
  configfs: Introduce config_item_get_unless_zero()
  configfs: Fix race between create_link and configfs_rmdir
2017-06-16  fs: pass on flags in compat_writev  (Christoph Hellwig)

Fixes: 793b80ef14af ("vfs: pass a flags argument to vfs_readv/vfs_writev")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-06-15  fs: don't forget to put old mntns in mntns_install  (Andrei Vagin)

Fixes: 4f757f3cbf54 ("make sure that mntns_install() doesn't end up with referral for root")
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrei Vagin <avagin@openvz.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-15  Hang/soft lockup in d_invalidate with simultaneous calls  (Al Viro)

It's not hard to trigger a bunch of d_invalidate() on the same dentry in parallel. They end up fighting each other - any dentry picked for removal by one will be skipped by the rest, and we'll go for the next iteration through the entire subtree, even if everything is being skipped. Moreover, we immediately go back to scanning the subtree.

The only thing we really need is to dissolve all mounts in the subtree, and as soon as we've nothing left to do, we can just unhash the dentry and bugger off.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-15  Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6  (Linus Torvalds)

Pull crypto fix from Herbert Xu:
 "This fixes a bug on sparc where we may dereference freed stack memory"

* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
  crypto: Work around deallocated stack frame reference gcc bug on sparc.
2017-06-15  ufs_truncate_blocks(): fix the case when size is in the last direct block  (Al Viro)

The logic for deciding whether we need to do anything with direct blocks is broken when the new size is within the last direct block. It's better to find the path to the last byte _not_ to be removed and use that instead of the path to the beginning of the first block to be freed...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-15  ufs: more deadlock prevention on tail unpacking  (Al Viro)

->s_lock is not needed for ufs_change_blocknr()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-15  ufs: avoid grabbing ->truncate_mutex if possible  (Al Viro)

Tail unpacking is done in the wrong place; the deadlocks galore are best dealt with by doing that in ->write_iter() (and switching to iomap, while we are at it), but that's rather painful to backport.

The trouble comes from grabbing pages that cover the beginning of the tail from inside of ufs_new_fragments(); ongoing pageout of any of those is going to deadlock on ->truncate_mutex, with the process that got around to extending the tail holding that and waiting for the page to get unlocked, while ->writepage() on that page is waiting on ->truncate_mutex.

The thing is, we don't need ->truncate_mutex when the fragment we are trying to map is within the tail - the damn thing is allocated (tail can't contain holes).

Let's do a plain lookup, and if the fragment is present, we can just pretend that we'd won the race in almost all cases. The only exception is a fragment between the end of tail and the end of the block containing tail.

Protect ->i_lastfrag with ->meta_lock - read_seqlock_excl() is sufficient.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
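A sketch of the ->i_lastfrag access pattern this implies (ufs_inode_info's meta_lock is a seqlock; names per fs/ufs):

    unsigned lastfrag;

    read_seqlock_excl(&UFS_I(inode)->meta_lock);
    lastfrag = UFS_I(inode)->i_lastfrag;
    read_sequnlock_excl(&UFS_I(inode)->meta_lock);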
2017-06-14  ufs_get_locked_page(): make sure we have buffer_heads  (Al Viro)

Callers rely upon that, but find_lock_page() racing with an attempt of page eviction by memory pressure might have left us with
  * try_to_free_buffers() successfully done,
  * __remove_mapping() failed, leaving the page in our mapping,
  * find_lock_page() returning an uptodate page with no buffer_heads attached.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>