Age | Commit message (Collapse) | Author |
|
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs
Pull overlayfs updates from Miklos Szeredi:
"In addition to bug fixes and cleanups there are two new features from
Amir:
- Consistent inode number support for the case when layers are not
all on the same filesystem (feature is dubbed "xino").
- Optimize overlayfs file handle decoding. This one touches the
exportfs interface to allow detecting the disconnected directory
case"
* 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
ovl: update documentation w.r.t "xino" feature
ovl: add support for "xino" mount and config options
ovl: consistent d_ino for non-samefs with xino
ovl: consistent i_ino for non-samefs with xino
ovl: constant st_ino for non-samefs with xino
ovl: allocate anon bdev per unique lower fs
ovl: factor out ovl_map_dev_ino() helper
ovl: cleanup ovl_update_time()
ovl: add WARN_ON() for non-dir redirect cases
ovl: cleanup setting OVL_INDEX
ovl: set d->is_dir and d->opaque for last path element
ovl: Do not check for redirect if this is last layer
ovl: lookup in inode cache first when decoding lower file handle
ovl: do not try to reconnect a disconnected origin dentry
ovl: disambiguate ovl_encode_fh()
ovl: set lower layer st_dev only if setting lower st_ino
ovl: fix lookup with middle layer opaque dir and absolute path redirects
ovl: Set d->last properly during lookup
ovl: set i_ino to the value of st_ino for NFS export
|
|
When looping btrfs/074 with many cpus (>= 8), it's possible to trigger
kernel warning due to first key verification:
[ 4239.523446] WARNING: CPU: 5 PID: 2381 at fs/btrfs/disk-io.c:460 btree_read_extent_buffer_pages+0x1ad/0x210
[ 4239.523830] Modules linked in:
[ 4239.524630] RIP: 0010:btree_read_extent_buffer_pages+0x1ad/0x210
[ 4239.527101] Call Trace:
[ 4239.527251] read_tree_block+0x42/0x70
[ 4239.527434] read_node_slot+0xd2/0x110
[ 4239.527632] push_leaf_right+0xad/0x1b0
[ 4239.527809] split_leaf+0x4ea/0x700
[ 4239.527988] ? leaf_space_used+0xbc/0xe0
[ 4239.528192] ? btrfs_set_lock_blocking_rw+0x99/0xb0
[ 4239.528416] btrfs_search_slot+0x8cc/0xa40
[ 4239.528605] btrfs_insert_empty_items+0x71/0xc0
[ 4239.528798] __btrfs_run_delayed_refs+0xa98/0x1680
[ 4239.529013] btrfs_run_delayed_refs+0x10b/0x1b0
[ 4239.529205] btrfs_commit_transaction+0x33/0xaf0
[ 4239.529445] ? start_transaction+0xa8/0x4f0
[ 4239.529630] btrfs_alloc_data_chunk_ondemand+0x1b0/0x4e0
[ 4239.529833] btrfs_check_data_free_space+0x54/0xa0
[ 4239.530045] btrfs_delalloc_reserve_space+0x25/0x70
[ 4239.531907] btrfs_direct_IO+0x233/0x3d0
[ 4239.532098] generic_file_direct_write+0xcb/0x170
[ 4239.532296] btrfs_file_write_iter+0x2bb/0x5f4
[ 4239.532491] aio_write+0xe2/0x180
[ 4239.532669] ? lock_acquire+0xac/0x1e0
[ 4239.532839] ? __might_fault+0x3e/0x90
[ 4239.533032] do_io_submit+0x594/0x860
[ 4239.533223] ? do_io_submit+0x594/0x860
[ 4239.533398] SyS_io_submit+0x10/0x20
[ 4239.533560] ? SyS_io_submit+0x10/0x20
[ 4239.533729] do_syscall_64+0x75/0x1d0
[ 4239.533979] entry_SYSCALL_64_after_hwframe+0x42/0xb7
[ 4239.534182] RIP: 0033:0x7f8519741697
The problem here is, at btree_read_extent_buffer_pages() we don't have
acquired read/write lock on that extent buffer, only basic info like
level/bytenr is reliable.
So race condition leads to such false alert.
However in current call site, it's impossible to acquire proper lock
without race window.
To fix the problem, we only verify first key for committed tree blocks
(whose generation is no larger than fs_info->last_trans_committed), so
the content of such tree blocks will not change and there is no need to
get read/write lock.
Reported-by: Nikolay Borisov <nborisov@suse.com>
Fixes: 581c1760415c ("btrfs: Validate child tree block's level and first key")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The ignore mask logic in send_to_group() does not match the logic
in fanotify_should_send_event(). In the latter, a vfsmount mark ignore
mask precedes an inode mark mask and in the former, it does not.
That difference may cause events to be sent to fanotify backend for no
reason. Fix the logic in send_to_group() to match that of
fanotify_should_send_event().
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
|
|
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
|
|
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
|
|
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
|
|
and get rid of some more calls to get_rfc1002_length()
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
|
|
More cleanup of use of hardcoded 4 byte RFC1001 field size
Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
|
|
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
|
|
and get rid of some get_rfc1002_length() in smb2
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
|
|
SMB3.11 crypto and hash contexts were not being checked strictly enough.
Add parsing and validity checking for the security contexts in the SMB3.11
negotiate response.
Signed-off-by: Steve French <smfrench@gmail.com>
CC: Stable <stable@vger.kernel.org>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
|
|
The length checking for SMB3.11 negotiate request includes
"negotiate contexts" which caused a buffer validation problem
and a confusing warning message on SMB3.11 mount e.g.:
SMB2 server sent bad RFC1001 len 236 not 170
Fix the length checking for SMB3.11 negotiate to account for
the new negotiate context so that we don't log a warning on
SMB3.11 mount by default but do log warnings if lengths returned
by the server are incorrect.
CC: Stable <stable@vger.kernel.org>
Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
|
|
Pull more xfs updates from Darrick Wong:
"Most of these are code cleanups, but there are a couple of notable
use-after-free bug fixes.
This series has been run through a full xfstests run over the week and
through a quick xfstests run against this morning's master, with no
major failures reported.
- clean up unnecessary function call parameters
- fix a use-after-free bug when aborting logging intents
- refactor filestreams state data to avoid use-after-free bug
- fix incorrect removal of cow extents when truncating extended
attributes.
- refactor open-coded __set_page_dirty in favor of using vfs
function.
- fix a deadlock when fstrim and fs shutdown race"
* tag 'xfs-4.17-merge-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
Force log to disk before reading the AGF during a fstrim
Export __set_page_dirty
xfs: only cancel cow blocks when truncating the data fork
xfs: non-scrub - remove unused function parameters
xfs: remove filestream item xfs_inode reference
xfs: fix intent use-after-free on abort
xfs: Remove "committed" argument of xfs_dir_ialloc
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2
Pull more gfs2 updates from Bob Peterson:
"We decided to request the latest three patches to be merged into this
merge window while it's still open.
- The first patch adds a new function to lockref:
lockref_put_not_zero
- The second patch fixes GFS2's glock dump code so it uses the new
lockref function. This fixes a problem whereby lock dumps could
miss glocks.
- I made a minor patch to update some comments and fix the lock
ordering text in our gfs2-glocks.txt Documentation file"
* tag 'gfs2-4.17.fixes2' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
GFS2: Minor improvements to comments and documentation
gfs2: Stop using rhashtable_walk_peek
lockref: Add lockref_put_not_zero
|
|
Pull NFS client updates from Anna Schumaker:
"Stable bugfixes:
- xprtrdma: Fix corner cases when handling device removal # v4.12+
- xprtrdma: Fix latency regression on NUMA NFS/RDMA clients # v4.15+
Features:
- New sunrpc tracepoint for RPC pings
- Finer grained NFSv4 attribute checking
- Don't unnecessarily return NFS v4 delegations
Other bugfixes and cleanups:
- Several other small NFSoRDMA cleanups
- Improvements to the sunrpc RTT measurements
- A few sunrpc tracepoint cleanups
- Various fixes for NFS v4 lock notifications
- Various sunrpc and NFS v4 XDR encoding cleanups
- Switch to the ida_simple API
- Fix NFSv4.1 exclusive create
- Forget acl cache after setattr operation
- Don't advance the nfs_entry readdir cookie if xdr decoding fails"
* tag 'nfs-for-4.17-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (47 commits)
NFS: advance nfs_entry cookie only after decoding completes successfully
NFSv3/acl: forget acl cache after setattr
NFSv4.1: Fix exclusive create
NFSv4: Declare the size up to date after it was set.
nfs: Use ida_simple API
NFSv4: Fix the nfs_inode_set_delegation() arguments
NFSv4: Clean up CB_GETATTR encoding
NFSv4: Don't ask for attributes when ACCESS is protected by a delegation
NFSv4: Add a helper to encode/decode struct timespec
NFSv4: Clean up encode_attrs
NFSv4; Clean up XDR encoding of type bitmap4
NFSv4: Allow GFP_NOIO sleeps in decode_attr_owner/decode_attr_group
SUNRPC: Add a helper for encoding opaque data inline
SUNRPC: Add helpers for decoding opaque and string types
NFSv4: Ignore change attribute invalidations if we hold a delegation
NFS: More fine grained attribute tracking
NFS: Don't force unnecessary cache invalidation in nfs_update_inode()
NFS: Don't redirty the attribute cache in nfs_wcc_update_inode()
NFS: Don't force a revalidation of all attributes if change is missing
NFS: Convert NFS_INO_INVALID flags to unsigned long
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs thaw updates from Al Viro:
"An ancient series that has fallen through the cracks in the previous
cycle"
* 'work.thaw' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
buffer.c: call thaw_super during emergency thaw
vfs: factor sb iteration out of do_emergency_remount
|
|
Pull AFS updates from Al Viro:
"The AFS series posted by dhowells depended upon lookup_one_len()
rework; now that prereq is in the mainline, that series had been
rebased on top of it and got some exposure and testing..."
* 'afs-dh' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
afs: Do better accretion of small writes on newly created content
afs: Add stats for data transfer operations
afs: Trace protocol errors
afs: Locally edit directory data for mkdir/create/unlink/...
afs: Adjust the directory XDR structures
afs: Split the directory content defs into a header
afs: Fix directory handling
afs: Split the dynroot stuff out and give it its own ops tables
afs: Keep track of invalid-before version for dentry coherency
afs: Rearrange status mapping
afs: Make it possible to get the data version in readpage
afs: Init inode before accessing cache
afs: Introduce a statistics proc file
afs: Dump bad status record
afs: Implement @cell substitution handling
afs: Implement @sys substitution handling
afs: Prospectively look up extra files when doing a single lookup
afs: Don't over-increment the cell usage count when pinning it
afs: Fix checker warnings
vfs: Remove the const from dir_context::actor
|
|
This patch simply fixes some comments and the gfs2-glocks.txt file:
Places where i_rwsem was called i_mutex, and adding i_rw_mutex.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
|
|
Function rhashtable_walk_peek is problematic because there is no
guarantee that the glock previously returned still exists; when that key
is deleted, rhashtable_walk_peek can end up returning a different key,
which will cause an inconsistent glock dump. Fix this by keeping track
of the current glock in the seq file iterator functions instead.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
|
|
During the "insert range" fallocate operation, extents starting at the
range offset are shifted "right" (to a higher file offset) by the range
length. But, as shown by syzbot, it's not validated that this doesn't
cause extents to be shifted beyond EXT_MAX_BLOCKS. In that case
->ee_block can wrap around, corrupting the extent tree.
Fix it by returning an error if the space between the end of the last
extent and EXT4_MAX_BLOCKS is smaller than the range being inserted.
This bug can be reproduced by running the following commands when the
current directory is on an ext4 filesystem with a 4k block size:
fallocate -l 8192 file
fallocate --keep-size -o 0xfffffffe000 -l 4096 -n file
fallocate --insert-range -l 8192 file
Then after unmounting the filesystem, e2fsck reports corruption.
Reported-by: syzbot+06c885be0edcdaeab40c@syzkaller.appspotmail.com
Fixes: 331573febb6a ("ext4: Add support FALLOC_FL_INSERT_RANGE for fallocate")
Cc: stable@vger.kernel.org # v4.2+
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Remove GPL boilerplate text (long, short, one-line) and keep the rest,
ie. personal, company or original source copyright statements. Add the
SPDX header.
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Remove GPL boilerplate text (long, short, one-line) and keep the rest,
ie. personal, company or original source copyright statements. Add the
SPDX header.
Unify the include protection macros to match the file names.
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Currently if we allocate extents beyond an inode's i_size (through the
fallocate system call) and then fsync the file, we log the extents but
after a power failure we replay them and then immediately drop them.
This behaviour happens since about 2009, commit c71bf099abdd ("Btrfs:
Avoid orphan inodes cleanup while replaying log"), because it marks
the inode as an orphan instead of dropping any extents beyond i_size
before replaying logged extents, so after the log replay, and while
the mount operation is still ongoing, we find the inode marked as an
orphan and then perform a truncation (drop extents beyond the inode's
i_size). Because the processing of orphan inodes is still done
right after replaying the log and before the mount operation finishes,
the intention of that commit does not make any sense (at least as
of today). However reverting that behaviour is not enough, because
we can not simply discard all extents beyond i_size and then replay
logged extents, because we risk dropping extents beyond i_size created
in past transactions, for example:
add prealloc extent beyond i_size
fsync - clears the flag BTRFS_INODE_NEEDS_FULL_SYNC from the inode
transaction commit
add another prealloc extent beyond i_size
fsync - triggers the fast fsync path
power failure
In that scenario, we would drop the first extent and then replay the
second one. To fix this just make sure that all prealloc extents
beyond i_size are logged, and if we find too many (which is far from
a common case), fallback to a full transaction commit (like we do when
logging regular extents in the fast fsync path).
Trivial reproducer:
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ xfs_io -f -c "pwrite -S 0xab 0 256K" /mnt/foo
$ sync
$ xfs_io -c "falloc -k 256K 1M" /mnt/foo
$ xfs_io -c "fsync" /mnt/foo
<power failure>
# mount to replay log
$ mount /dev/sdb /mnt
# at this point the file only has one extent, at offset 0, size 256K
A test case for fstests follows soon, covering multiple scenarios that
involve adding prealloc extents with previous shrinking truncates and
without such truncates.
Fixes: c71bf099abdd ("Btrfs: Avoid orphan inodes cleanup while replaying log")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Currently if some fatal errors occur, like all IO get -EIO, resources
would be cleaned up when
a) transaction is being committed or
b) BTRFS_FS_STATE_ERROR is set
However, in some rare cases, resources may be left alone after transaction
gets aborted and umount may run into some ASSERT(), e.g.
ASSERT(list_empty(&block_group->dirty_list));
For case a), in btrfs_commit_transaciton(), there're several places at the
beginning where we just call btrfs_end_transaction() without cleaning up
resources. For case b), it is possible that the trans handle doesn't have
any dirty stuff, then only trans hanlde is marked as aborted while
BTRFS_FS_STATE_ERROR is not set, so resources remain in memory.
This makes btrfs also check BTRFS_FS_STATE_TRANS_ABORTED to make sure that
all resources won't stay in memory after umount.
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
With mount option "xino=on", mounter declares that there are enough
free high bits in underlying fs to hold the layer fsid.
If overlayfs does encounter underlying inodes using the high xino
bits reserved for layer fsid, a warning will be emitted and the original
inode number will be used.
The mount option name "xino" goes after a similar meaning mount option
of aufs, but in overlayfs case, the mapping is stateless.
An example for a use case of "xino=on" is when upper/lower is on an xfs
filesystem. xfs uses 64bit inode numbers, but it currently never uses the
upper 8bit for inode numbers exposed via stat(2) and that is not likely to
change in the future without user opting-in for a new xfs feature. The
actual number of unused upper bit is much larger and determined by the xfs
filesystem geometry (64 - agno_log - agblklog - inopblog). That means
that for all practical purpose, there are enough unused bits in xfs
inode numbers for more than OVL_MAX_STACK unique fsid's.
Another use case of "xino=on" is when upper/lower is on tmpfs. tmpfs inode
numbers are allocated sequentially since boot, so they will practially
never use the high inode number bits.
For compatibility with applications that expect 32bit inodes, the feature
can be disabled with "xino=off". The option "xino=auto" automatically
detects underlying filesystem that use 32bit inodes and enables the
feature. The Kconfig option OVERLAY_FS_XINO_AUTO and module parameter of
the same name, determine if the default mode for overlayfs mount is
"xino=auto" or "xino=off".
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
When overlay layers are not all on the same fs, but all inode numbers
of underlying fs do not use the high 'xino' bits, overlay st_ino values
are constant and persistent.
In that case, relax non-samefs constraint for consistent d_ino and always
iterate non-merge dir using ovl_fill_real() actor so we can remap lower
inode numbers to unique lower fs range.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
When overlay layers are not all on the same fs, but all inode numbers
of underlying fs do not use the high 'xino' bits, overlay st_ino values
are constant and persistent.
In that case, set i_ino value to the same value as st_ino for nfsd
readdirplus validator.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
On 64bit systems, when overlay layers are not all on the same fs, but
all inode numbers of underlying fs are not using the high bits, use the
high bits to partition the overlay st_ino address space. The high bits
hold the fsid (upper fsid is 0). This way overlay inode numbers are unique
and all inodes use overlay st_dev. Inode numbers are also persistent
for a given layer configuration.
Currently, our only indication for available high ino bits is from a
filesystem that supports file handles and uses the default encode_fh()
operation, which encodes a 32bit inode number.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Instead of allocating an anonymous bdev per lower layer, allocate
one anonymous bdev per every unique lower fs that is different than
upper fs.
Every unique lower fs is assigned an fsid > 0 and the number of
unique lower fs are stored in ofs->numlowerfs.
The assigned fsid is stored in the lower layer struct and will be
used also for inode number multiplexing.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
A helper for ovl_getattr() to map the values of st_dev and st_ino
according to constant st_ino rules.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
No need to mess with an alias, the upperdentry can be retrieved directly
from the overlay inode.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Certain properties in ovl_lookup_data should be set only for the last
element of the path. IOW, if we are calling ovl_lookup_single() for an
absolute redirect, then d->is_dir and d->opaque do not make much sense
for intermediate path elements. Instead set them only if dentry being
lookup is last path element.
As of now we do not seem to be making use of d->opaque if it is set for
a path/dentry in lower. But just define the semantics so that future code
can make use of this assumption.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
If we are looking in last layer, then there should not be any need to
process redirect. redirect information is used only for lookup in next
lower layer and there is no more lower layer to look into. So no need
to process redirects.
IOW, ignore redirects on lowest layer.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
When decoding a lower file handle, we need to check if lower file was
copied up and indexed and if it has a whiteout index, we need to check
if this is an unlinked but open non-dir before returning -ESTALE.
To find out if this is an unlinked but open non-dir we need to lookup
an overlay inode in inode cache by lower inode and that requires decoding
the lower file handle before looking in inode cache.
Before this change, if the lower inode turned out to be a directory, we
may have paid an expensive cost to reconnect that lower directory for
nothing.
After this change, we start by decoding a disconnected lower dentry and
using the lower inode for looking up an overlay inode in inode cache.
If we find overlay inode and dentry in cache, we avoid the index lookup
overhead. If we don't find an overlay inode and dentry in cache, then we
only need to decode a connected lower dentry in case the lower dentry is
a non-indexed directory.
The xfstests group overlay/exportfs tests decoding overlayfs file
handles after drop_caches with different states of the file at encode
and decode time. Overall the tests in the group call ovl_lower_fh_to_d()
89 times to decode a lower file handle.
Before this change, the tests called ovl_get_index_fh() 75 times and
reconnect_one() 61 times.
After this change, the tests call ovl_get_index_fh() 70 times and
reconnect_one() 59 times. The 2 cases where reconnect_one() was avoided
are cases where a non-upper directory file handle was encoded, then the
directory removed and then file handle was decoded.
To demonstrate the affect on decoding file handles with hot inode/dentry
cache, the drop_caches call in the tests was disabled. Without
drop_caches, there are no reconnect_one() calls at all before or after
the change. Before the change, there are 75 calls to ovl_get_index_fh(),
exactly as the case with drop_caches. After the change, there are only
10 calls to ovl_get_index_fh().
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
On lookup of non directory, we try to decode the origin file handle
stored in upper inode. The origin file handle is supposed to be decoded
to a disconnected non-dir dentry, which is fine, because we only need
the lower inode of a copy up origin.
However, if the origin file handle somehow turns out to be a directory
we pay the expensive cost of reconnecting the directory dentry, only to
get a mismatch file type and drop the dentry.
Optimize this case by explicitly opting out of reconnecting the dentry.
Opting-out of reconnect is done by passing a NULL acceptable callback
to exportfs_decode_fh().
While the case described above is a strange corner case that does not
really need to be optimized, the API added for this optimization will
be used by a following patch to optimize a more common case of decoding
an overlayfs file handle.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Rename ovl_encode_fh() to ovl_encode_real_fh() to differentiate from the
exportfs function ovl_encode_inode_fh() and change the latter to
ovl_encode_fh() to match the exportfs method name.
Rename ovl_decode_fh() to ovl_decode_real_fh() for consistency.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
For broken hardlinks, we do not return lower st_ino, so we should
also not return lower pseudo st_dev.
Fixes: a0c5ad307ac0 ("ovl: relax same fs constraint for constant st_ino")
Cc: <stable@vger.kernel.org> #v4.15
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
As of now if we encounter an opaque dir while looking for a dentry, we set
d->last=true. This means that there is no need to look further in any of
the lower layers. This works fine as long as there are no redirets or
relative redircts. But what if there is an absolute redirect on the
children dentry of opaque directory. We still need to continue to look into
next lower layer. This patch fixes it.
Here is an example to demonstrate the issue. Say you have following setup.
upper: /redirect (redirect=/a/b/c)
lower1: /a/[b]/c ([b] is opaque) (c has absolute redirect=/a/b/d/)
lower0: /a/b/d/foo
Now "redirect" dir should merge with lower1:/a/b/c/ and lower0:/a/b/d.
Note, despite the fact lower1:/a/[b] is opaque, we need to continue to look
into lower0 because children c has an absolute redirect.
Following is a reproducer.
Watch me make foo disappear:
$ mkdir lower middle upper work work2 merged
$ mkdir lower/origin
$ touch lower/origin/foo
$ mount -t overlay none merged/ \
-olowerdir=lower,upperdir=middle,workdir=work2
$ mkdir merged/pure
$ mv merged/origin merged/pure/redirect
$ umount merged
$ mount -t overlay none merged/ \
-olowerdir=middle:lower,upperdir=upper,workdir=work
$ mv merged/pure/redirect merged/redirect
Now you see foo inside a twice redirected merged dir:
$ ls merged/redirect
foo
$ umount merged
$ mount -t overlay none merged/ \
-olowerdir=middle:lower,upperdir=upper,workdir=work
After mount cycle you don't see foo inside the same dir:
$ ls merged/redirect
During middle layer lookup, the opaqueness of middle/pure is left in
the lookup state and then middle/pure/redirect is wrongly treated as
opaque.
Fixes: 02b69b284cd7 ("ovl: lookup redirects")
Cc: <stable@vger.kernel.org> #v4.10
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
d->last signifies that this is the last layer we are looking into and there
is no more. And that means this allows for some optimzation opportunities
during lookup. For example, in ovl_lookup_single() we don't have to check
for opaque xattr of a directory is this is the last layer we are looking
into (d->last = true).
But knowing for sure whether we are looking into last layer can be very
tricky. If redirects are not enabled, then we can look at poe->numlower and
figure out if the lookup we are about to is last layer or not. But if
redircts are enabled then it is possible poe->numlower suggests that we are
looking in last layer, but there is an absolute redirect present in found
element and that redirects us to a layer in root and that means lookup will
continue in lower layers further.
For example, consider following.
/upperdir/pure (opaque=y)
/upperdir/pure/foo (opaque=y,redirect=/bar)
/lowerdir/bar
In this case pure is "pure upper". When we look for "foo", that time
poe->numlower=0. But that alone does not mean that we will not search for a
merge candidate in /lowerdir. Absolute redirect changes that.
IOW, d->last should not be set just based on poe->numlower if redirects are
enabled. That can lead to setting d->last while it should not have and that
means we will not check for opaque xattr while we should have.
So do this.
- If redirects are not enabled, then continue to rely on poe->numlower
information to determine if it is last layer or not.
- If redirects are enabled, then set d->last = true only if this is the
last layer in root ovl_entry (roe).
Suggested-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: 02b69b284cd7 ("ovl: lookup redirects")
Cc: <stable@vger.kernel.org> #v4.10
|
|
Eddie Horng reported that readdir of an overlayfs directory that
was exported via NFSv3 returns entries with d_type set to DT_UNKNOWN.
The reason is that while preparing the response for readdirplus, nfsd
checks inside encode_entryplus_baggage() that a child dentry's inode
number matches the value of d_ino returns by overlayfs readdir iterator.
Because the overlayfs inodes use arbitrary inode numbers that are not
correlated with the values of st_ino/d_ino, NFSv3 falls back to not
encoding d_type. Although this is an allowed behavior, we can fix it for
the case of all overlayfs layers on the same underlying filesystem.
When NFS export is enabled and d_ino is consistent with st_ino
(samefs), set the same value also to i_ino in ovl_fill_inode() for all
overlayfs inodes, nfsd readdirplus sanity checks will pass.
ovl_fill_inode() may be called from ovl_new_inode(), before real inode
was created with ino arg 0. In that case, i_ino will be updated to real
upper inode i_ino on ovl_inode_init() or ovl_inode_update().
Reported-by: Eddie Horng <eddiehorng.tw@gmail.com>
Tested-by: Eddie Horng <eddiehorng.tw@gmail.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Fixes: 8383f1748829 ("ovl: wire up NFS export operations")
Cc: <stable@vger.kernel.org> #v4.16
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Pull UBI and UBIFS updates from Richard Weinberger:
"Minor bug fixes and improvements"
* tag 'tags/upstream-4.17-rc1' of git://git.infradead.org/linux-ubifs:
ubi: Reject MLC NAND
ubifs: Remove useless parameter of lpt_heap_replace
ubifs: Constify struct ubifs_lprops in scan_for_leb_for_idx
ubifs: remove unnecessary assignment
ubi: Fix error for write access
ubi: fastmap: Don't flush fastmap work on detach
ubifs: Check ubifs_wbuf_sync() return code
|
|
* Since cifs_vfs_error was just using pr_debug_ratelimited like the rest
of cifs_dbg, move it there too
* Add a ONCE type flag to call the pr_xxx_once() debug function instead
of the ratelimited ones.
To convert existing printk_once() calls to this we can run:
perl -i -pE \
's/printk_once\s*\(([^" \n]+)(.*)/cifs_dbg(VFS|ONCE,$2/g' \
fs/cifs/*.c
Signed-off-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
On 32-bit (e.g. with m68k-linux-gnu-gcc-4.1):
fs/cifs/inode.c: In function ‘simple_hashstr’:
fs/cifs/inode.c:713: warning: integer constant is too large for ‘long’ type
Fixes: 7ea884c77e5c97f1 ("smb3: Fix root directory when server returns inode number of zero")
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
|
|
Adding an extra debug message to show if a tree connect failure during
reconnect (and made it a log once so it doesn't spam the logs).
Saw a case recently where tree connect repeatedly returned
access denied on reconnect and it wasn't as easy to spot as it
should have been.
Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
|
|
tcon->ses is being dereferenced before it is null checked, hence
there is a potential null pointer dereference.
Fix this by moving the pointer dereference after tcon->ses has
been properly null checked.
Addresses-Coverity-ID: 1467426 ("Dereference before null check")
Fixes: 93012bf98416 ("cifs: add server->vals->header_preamble_size")
Signed-off-by: Gustavo A. R. Silva <garsilva@embeddedor.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Merge more updates from Andrew Morton:
- almost all of the rest of MM
- kasan updates
- lots of procfs work
- misc things
- lib/ updates
- checkpatch
- rapidio
- ipc/shm updates
- the start of willy's XArray conversion
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (140 commits)
page cache: use xa_lock
xarray: add the xa_lock to the radix_tree_root
fscache: use appropriate radix tree accessors
export __set_page_dirty
unicore32: turn flush_dcache_mmap_lock into a no-op
arm64: turn flush_dcache_mmap_lock into a no-op
mac80211_hwsim: use DEFINE_IDA
radix tree: use GFP_ZONEMASK bits of gfp_t for flags
linux/const.h: refactor _BITUL and _BITULL a bit
linux/const.h: move UL() macro to include/linux/const.h
linux/const.h: prefix include guard of uapi/linux/const.h with _UAPI
xen, mm: allow deferred page initialization for xen pv domains
elf: enforce MAP_FIXED on overlaying elf segments
fs, elf: drop MAP_FIXED usage from elf_map
mm: introduce MAP_FIXED_NOREPLACE
MAINTAINERS: update bouncing aacraid@adaptec.com addresses
fs/dcache.c: add cond_resched() in shrink_dentry_list()
include/linux/kfifo.h: fix comment
ipc/shm.c: shm_split(): remove unneeded test for NULL shm_file_data.vm_ops
kernel/sysctl.c: add kdoc comments to do_proc_do{u}intvec_minmax_conv_param
...
|
|
Remove the address_space ->tree_lock and use the xa_lock newly added to
the radix_tree_root. Rename the address_space ->page_tree to ->i_pages,
since we don't really care that it's a tree.
[willy@infradead.org: fix nds32, fs/dax.c]
Link: http://lkml.kernel.org/r/20180406145415.GB20605@bombadil.infradead.orgLink: http://lkml.kernel.org/r/20180313132639.17387-9-willy@infradead.org
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|