Age | Commit message (Collapse) | Author |
|
Pull more vfs mount updates from Al Viro:
"Propagation of new syscalls to other architectures + cosmetic change
from Christian (fscontext didn't follow the convention for anon inode
names)"
* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
uapi: Wire up the mount API syscalls on non-x86 arches [ver #2]
uapi, x86: Fix the syscall numbering of the mount API syscalls [ver #2]
uapi, fsopen: use square brackets around "fscontext" [ver #2]
|
|
Pull io_uring fixes from Jens Axboe:
"A small set of fixes for io_uring.
This contains:
- smp_rmb() cleanup for io_cqring_events() (Jackie)
- io_cqring_wait() simplification (Jackie)
- removal of dead 'ev_flags' passing (me)
- SQ poll CPU affinity verification fix (me)
- SQ poll wait fix (Roman)
- SQE command prep cleanup and fix (Stefan)"
* tag 'for-linus-20190516' of git://git.kernel.dk/linux-block:
io_uring: use wait_event_interruptible for cq_wait conditional wait
io_uring: adjust smp_rmb inside io_cqring_events
io_uring: fix infinite wait in khread_park() on io_finish_async()
io_uring: remove 'ev_flags' argument
io_uring: fix failure to verify SQ_AFF cpu
io_uring: fix race condition reading SQE data
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
Pull AFS callback promise fixes from David Howells:
"This series fixes a bunch of problems in callback promise handling,
where a callback promise indicates a promise on the part of the server
to notify the client in the event of some sort of change to a file or
volume. In the event of a break, the client has to go and refetch the
client status from the server and discard any cached permission
information as the ACL might have changed.
The problem in the current code is that changes made by other clients
aren't always noticed, primarily because the file status information
and the callback information aren't updated in the same critical
section, even if these are carried in the same reply from an RPC
operation, and so the AFS_VNODE_CB_PROMISED flag is unreliable.
Arranging for them to be done in the same critical section during
reply decoding is tricky because of the FS.InlineBulkStatus op - which
has all the statuses in the reply arriving and then all the callbacks,
so they have to be buffered. It simplifies things a lot to move the
critical section out of the decode phase and do it after the RPC
function returns.
Also new inodes (either newly fetched or newly created) aren't
properly managed against a callback break happening before we get the
local inode up and running.
Fix this by:
- There's now a combined file status and callback record (struct
afs_status_cb) to carry both plus some flags.
- Each operation wrapper function allocates sufficient afs_status_cb
records for all the vnodes it is interested in and passes them into
RPC operations to be filled in from the reply.
- The FileStatus and CallBack record decoders no longer apply the
new/revised status and callback information to the inode/vnode at
the point of decoding and instead store the information into the
record from (2).
- afs_vnode_commit_status() then revises the file status, detects
deletion and notes callback information inside of a single critical
section. It also checks the callback break counters and cancels the
callback promise if they changed during the operation.
[*] Note that "callback break counters" are counters of server
events that cancel one or more callback promises that the client
thinks it has. The client counts the events and compares the
counters before and after an operation to see if the callback
promise it thinks it just got evaporated before it got recorded
under lock.
- Volume and server callback break counters are passed into
afs_iget() allowing callback breaks concurrent with inode set up to
be detected and the callback promise thence to be cancelled.
- AFS validation checks are now done under RCU conditions using a
read lock on cb_lock. This requires vnode->cb_interest to be made
RCU safe.
- If the checks in (6) fail, the callback breaker is then called
under write lock on the cb_lock - but only if the callback break
counter didn't change from the value read before the checks were
made.
- Results from FS.InlineBulkStatus that correspond to inodes we
currently have in memory are now used to update those inodes'
status and callback information rather than being discarded. This
requires those inodes to be looked up before the RPC op is made and
all their callback break values saved.
To aid in this, the following changes have also been made:
- Don't pass the vnode into the reply delivery functions or the
decoders. The vnode shouldn't be altered anywhere in those paths.
The only exception, for the moment, is for the call done hook for
file lock ops that wants access to both the vnode and the call -
this can be fixed at a later time.
- Get rid of the call->reply[] void* array and replace it with named
and typed members. This avoids confusion since different ops were
mapping different reply[] members to different things.
- Fix an order-1 kmalloc allocation in afs_do_lookup() and replace it
with kvcalloc().
- Always get the reply time. Since callback, lock and fileserver
record expiry times are calculated for several RPCs, make this
mandatory.
- Call afs_pages_written_back() from the operation wrapper rather
than from the delivery function.
- Don't store the version and type from a callback promise in a reply
as the information in them is of very limited use"
* tag 'afs-fixes-b-20190516' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
afs: Fix application of the results of a inline bulk status fetch
afs: Pass pre-fetch server and volume break counts into afs_iget5_set()
afs: Fix unlink to handle YFS.RemoveFile2 better
afs: Clear AFS_VNODE_CB_PROMISED if we detect callback expiry
afs: Make vnode->cb_interest RCU safe
afs: Split afs_validate() so first part can be used under LOOKUP_RCU
afs: Don't save callback version and type fields
afs: Fix application of status and callback to be under same lock
afs: Always get the reply time
afs: Fix order-1 allocation in afs_do_lookup()
afs: Get rid of afs_call::reply[]
afs: Don't pass the vnode pointer through into the inline bulk status op
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
Pull misc AFS fixes from David Howells:
"This fixes a set of miscellaneous issues in the afs filesystem,
including:
- leak of keys on file close.
- broken error handling in xattr functions.
- missing locking when updating VL server list.
- volume location server DNS lookup whereby preloaded cells may not
ever get a lookup and regular DNS lookups to maintain server lists
consume power unnecessarily.
- incorrect error propagation and handling in the fileserver
iteration code causes operations to sometimes apparently succeed.
- interruption of server record check/update side op during
fileserver iteration causes uninterruptible main operations to fail
unexpectedly.
- callback promise expiry time miscalculation.
- over invalidation of the callback promise on directories.
- double locking on callback break waking up file locking waiters.
- double increment of the vnode callback break counter.
Note that it makes some changes outside of the afs code, including:
- an extra parameter to dns_query() to allow the dns_resolver key
just accessed to be immediately invalidated. AFS is caching the
results itself, so the key can be discarded.
- an interruptible version of wait_var_event().
- an rxrpc function to allow the maximum lifespan to be set on a
call.
- a way for an rxrpc call to be marked as non-interruptible"
* tag 'afs-fixes-20190516' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
afs: Fix double inc of vnode->cb_break
afs: Fix lock-wait/callback-break double locking
afs: Don't invalidate callback if AFS_VNODE_DIR_VALID not set
afs: Fix calculation of callback expiry time
afs: Make dynamic root population wait uninterruptibly for proc_cells_lock
afs: Make some RPC operations non-interruptible
rxrpc: Allow the kernel to mark a call as being non-interruptible
afs: Fix error propagation from server record check/update
afs: Fix the maximum lifespan of VL and probe calls
rxrpc: Provide kernel interface to set max lifespan on a call
afs: Fix "kAFS: AFS vnode with undefined type 0"
afs: Fix cell DNS lookup
Add wait_var_event_interruptible()
dns_resolver: Allow used keys to be invalidated
afs: Fix afs_cell records to always have a VL server list record
afs: Fix missing lock when replacing VL server list
afs: Fix afs_xattr_get_yfs() to not try freeing an error value
afs: Fix incorrect error handling in afs_xattr_get_acl()
afs: Fix key leak in afs_release() and afs_evict_inode()
|
|
Pull ceph updates from Ilya Dryomov:
"On the filesystem side we have:
- a fix to enforce quotas set above the mount point (Luis Henriques)
- support for exporting snapshots through NFS (Zheng Yan)
- proper statx implementation (Jeff Layton). statx flags are mapped
to MDS caps, with AT_STATX_{DONT,FORCE}_SYNC taken into account.
- some follow-up dentry name handling fixes, in particular
elimination of our hand-rolled helper and the switch to __getname()
as suggested by Al (Jeff Layton)
- a set of MDS client cleanups in preparation for async MDS requests
in the future (Jeff Layton)
- a fix to sync the filesystem before remounting (Jeff Layton)
On the rbd side, work is on-going on object-map and fast-diff image
features"
* tag 'ceph-for-5.2-rc1' of git://github.com/ceph/ceph-client: (29 commits)
ceph: flush dirty inodes before proceeding with remount
ceph: fix unaligned access in ceph_send_cap_releases
libceph: make ceph_pr_addr take an struct ceph_entity_addr pointer
libceph: fix unaligned accesses in ceph_entity_addr handling
rbd: don't assert on writes to snapshots
rbd: client_mutex is never nested
ceph: print inode number in __caps_issued_mask debugging messages
ceph: just call get_session in __ceph_lookup_mds_session
ceph: simplify arguments and return semantics of try_get_cap_refs
ceph: fix comment over ceph_drop_caps_for_unlink
ceph: move wait for mds request into helper function
ceph: have ceph_mdsc_do_request call ceph_mdsc_submit_request
ceph: after an MDS request, do callback and completions
ceph: use pathlen values returned by set_request_path_attr
ceph: use __getname/__putname in ceph_mdsc_build_path
ceph: use ceph_mdsc_build_path instead of clone_dentry_name
ceph: fix potential use-after-free in ceph_mdsc_build_path
ceph: dump granular cap info in "caps" debugfs file
ceph: make iterate_session_caps a public symbol
ceph: fix NULL pointer deref when debugging is enabled
...
|
|
Fix afs_do_lookup() such that when it does an inline bulk status fetch op,
it will update inodes that are already extant (something that afs_iget()
doesn't do) and to cache permits for each inode created (thereby avoiding a
follow up FS.FetchStatus call to determine this).
Extant inodes need looking up in advance so that their cb_break counters
before and after the operation can be compared. To this end, the inode
pointers are cached so that they don't need looking up again after the op.
Fixes: 5cf9dd55a0ec ("afs: Prospectively look up extra files when doing a single lookup")
Signed-off-by: David Howells <dhowells@redhat.com>
|
|
Pass the server and volume break counts from before the status fetch
operation that queried the attributes of a file into afs_iget5_set() so
that the new vnode's break counters can be initialised appropriately.
This allows detection of a volume or server break that happened whilst we
were fetching the status or setting up the vnode.
Fixes: c435ee34551e ("afs: Overhaul the callback handling")
Signed-off-by: David Howells <dhowells@redhat.com>
|
|
Make use of the status update for the target file that the YFS.RemoveFile2
RPC op returns to correctly update the vnode as to whether the file was
actually deleted or just had nlink reduced.
Fixes: 30062bd13e36 ("afs: Implement YFS support in the fs client")
Signed-off-by: David Howells <dhowells@redhat.com>
|
|
Fix afs_validate() to clear AFS_VNODE_CB_PROMISED on a vnode if we detect
any condition that causes the callback promise to be broken implicitly,
including server break (cb_s_break), volume break (cb_v_break) or callback
expiry.
Fixes: ae3b7361dc0e ("afs: Fix validation/callback interaction")
Reported-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
|
|
Use RCU-based freeing for afs_cb_interest struct objects and use RCU on
vnode->cb_interest. Use that change to allow afs_check_validity() to use
read_seqbegin_or_lock() instead of read_seqlock_excl().
This also requires the caller of afs_check_validity() to hold the RCU read
lock across the call.
Signed-off-by: David Howells <dhowells@redhat.com>
|
|
Split afs_validate() so that the part that decides if the vnode is still
valid can be used under LOOKUP_RCU conditions from afs_d_revalidate().
Signed-off-by: David Howells <dhowells@redhat.com>
|
|
Don't save callback version and type fields as the version is about the
format of the callback information and the type is relative to the
particular RPC call.
Signed-off-by: David Howells <dhowells@redhat.com>
|
|
Pull configfs update from Christoph Hellwig:
- a fix for an error path use after free (YueHaibing)
* tag 'configfs-for-5.2' of git://git.infradead.org/users/hch/configfs:
configfs: fix possible use-after-free in configfs_register_group
|
|
Make the name of the anon inode fd "[fscontext]" instead of "fscontext".
This is minor but most core-kernel anon inode fds already carry square
brackets around their name:
[eventfd]
[eventpoll]
[fanotify]
[io_uring]
[pidfd]
[signalfd]
[timerfd]
[userfaultfd]
For the sake of consistency lets do the same for the fscontext anon inode
fd that comes with the new mount api.
Signed-off-by: Christian Brauner <christian@brauner.io>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
When __afs_break_callback() clears the CB_PROMISED flag, it increments
vnode->cb_break to trigger a future refetch of the status and callback -
however it also calls afs_clear_permits(), which also increments
vnode->cb_break.
Fix this by removing the increment from afs_clear_permits().
Whilst we're at it, fix the conditional call to afs_put_permits() as the
function checks to see if the argument is NULL, so the check is redundant.
Fixes: be080a6f43c4 ("afs: Overhaul permit caching");
Signed-off-by: David Howells <dhowells@redhat.com>
|
|
When applying the status and callback in the response of an operation,
apply them in the same critical section so that there's no race between
checking the callback state and checking status-dependent state (such as
the data version).
Fix this by:
(1) Allocating a joint {status,callback} record (afs_status_cb) before
calling the RPC function for each vnode for which the RPC reply
contains a status or a status plus a callback. A flag is set in the
record to indicate if a callback was actually received.
(2) These records are passed into the RPC functions to be filled in. The
afs_decode_status() and yfs_decode_status() functions are removed and
the cb_lock is no longer taken.
(3) xdr_decode_AFSFetchStatus() and xdr_decode_YFSFetchStatus() no longer
update the vnode.
(4) xdr_decode_AFSCallBack() and xdr_decode_YFSCallBack() no longer update
the vnode.
(5) vnodes, expected data-version numbers and callback break counters
(cb_break) no longer need to be passed to the reply delivery
functions.
Note that, for the moment, the file locking functions still need
access to both the call and the vnode at the same time.
(6) afs_vnode_commit_status() is now given the cb_break value and the
expected data_version and the task of applying the status and the
callback to the vnode are now done here.
This is done under a single taking of vnode->cb_lock.
(7) afs_pages_written_back() is now called by afs_store_data() rather than
by the reply delivery function.
afs_pages_written_back() has been moved to before the call point and
is now given the first and last page numbers rather than a pointer to
the call.
(8) The indicator from YFS.RemoveFile2 as to whether the target file
actually got removed (status.abort_code == VNOVNODE) rather than
merely dropping a link is now checked in afs_unlink rather than in
xdr_decode_YFSFetchStatus().
Supplementary fixes:
(*) afs_cache_permit() now gets the caller_access mask from the
afs_status_cb object rather than picking it out of the vnode's status
record. afs_fetch_status() returns caller_access through its argument
list for this purpose also.
(*) afs_inode_init_from_status() now uses a write lock on cb_lock rather
than a read lock and now sets the callback inside the same critical
section.
Fixes: c435ee34551e ("afs: Overhaul the callback handling")
Signed-off-by: David Howells <dhowells@redhat.com>
|
|
__afs_break_callback() holds vnode->lock around its call of
afs_lock_may_be_available() - which also takes that lock.
Fix this by not taking the lock in __afs_break_callback().
Also, there's no point checking the granted_locks and pending_locks queues;
it's sufficient to check lock_state, so move that check out of
afs_lock_may_be_available() into __afs_break_callback() to replace the
queue checks.
Fixes: e8d6c554126b ("AFS: implement file locking")
Signed-off-by: David Howells <dhowells@redhat.com>
|
|
Always ask for the reply time from AF_RXRPC as it's used to calculate the
callback expiry time and lock expiry times, so it's needed by most FS
operations.
Signed-off-by: David Howells <dhowells@redhat.com>
|
|
Don't invalidate the callback promise on a directory if the
AFS_VNODE_DIR_VALID flag is not set (which indicates that the directory
contents are invalid, due to edit failure, callback break, page reclaim).
The directory will be reloaded next time the directory is accessed, so
clearing the callback flag at this point may race with a reload of the
directory and cancel it's recorded callback promise.
Fixes: f3ddee8dc4e2 ("afs: Fix directory handling")
Signed-off-by: David Howells <dhowells@redhat.com>
|
|
afs_do_lookup() will do an order-1 allocation to allocate status records if
there are more than 39 vnodes to stat.
Fix this by allocating an array of {status,callback} records for each vnode
we want to examine using vmalloc() if larger than a page.
This not only gets rid of the order-1 allocation, but makes it easier to
grow beyond 50 records for YFS servers. It also allows us to move to
{status,callback} tuples for other calls too and makes it easier to lock
across the application of the status and the callback to the vnode.
Fixes: 5cf9dd55a0ec ("afs: Prospectively look up extra files when doing a single lookup")
Signed-off-by: David Howells <dhowells@redhat.com>
|
|
Fix the calculation of the expiry time of a callback promise, as obtained
from operations like FS.FetchStatus and FS.FetchData.
The time should be based on the timestamp of the first DATA packet in the
reply and the calculation needs to turn the ktime_t timestamp into a
time64_t.
Fixes: c435ee34551e ("afs: Overhaul the callback handling")
Signed-off-by: David Howells <dhowells@redhat.com>
|
|
Replace the afs_call::reply[] array with a bunch of typed members so that
the compiler can use type-checking on them. It's also easier for the eye
to see what's going on.
Signed-off-by: David Howells <dhowells@redhat.com>
|
|
Make dynamic root population wait uninterruptibly for proc_cells_lock.
Signed-off-by: David Howells <dhowells@redhat.com>
|
|
Don't pass the vnode pointer through into the inline bulk status op. We
want to process the status records outside of it anyway.
Signed-off-by: David Howells <dhowells@redhat.com>
|
|
Make certain RPC operations non-interruptible, including:
(*) Set attributes
(*) Store data
We don't want to get interrupted during a flush on close, flush on
unlock, writeback or an inode update, leaving us in a state where we
still need to do the writeback or update.
(*) Extend lock
(*) Release lock
We don't want to get lock extension interrupted as the file locks on
the server are time-limited. Interruption during lock release is less
of an issue since the lock is time-limited, but it's better to
complete the release to avoid a several-minute wait to recover it.
*Setting* the lock isn't a problem if it's interrupted since we can
just return to the user and tell them they were interrupted - at
which point they can elect to retry.
(*) Silly unlink
We want to remove silly unlink files if we can, rather than leaving
them for the salvager to clear up.
Note that whilst these calls are no longer interruptible, they do have
timeouts on them, so if the server stops responding the call will fail with
something like ETIME or ECONNRESET.
Without this, the following:
kAFS: Unexpected error from FS.StoreData -512
appears in dmesg when a pending store data gets interrupted and some
processes may just hang.
Additionally, make the code that checks/updates the server record ignore
failure due to interruption if the main call is uninterruptible and if the
server has an address list. The next op will check it again since the
expiration time on the old list has past.
Fixes: d2ddc776a458 ("afs: Overhaul volume and server record caching and fileserver rotation")
Reported-by: Jonathan Billings <jsbillings@jsbillings.org>
Reported-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
|
|
Allow kernel services using AF_RXRPC to indicate that a call should be
non-interruptible. This allows kafs to make things like lock-extension and
writeback data storage calls non-interruptible.
If this is set, signals will be ignored for operations on that call where
possible - such as waiting to get a call channel on an rxrpc connection.
It doesn't prevent UDP sendmsg from being interrupted, but that will be
handled by packet retransmission.
rxrpc_kernel_recv_data() isn't affected by this since that never waits,
preferring instead to return -EAGAIN and leave the waiting to the caller.
Userspace initiated calls can't be set to be uninterruptible at this time.
Signed-off-by: David Howells <dhowells@redhat.com>
|
|
afs_check/update_server_record() should be setting fc->error rather than
fc->ac.error as they're called from within the cursor iteration function.
afs_fs_cursor::error is where the error code of the attempt to call the
operation on multiple servers is integrated and is the final result,
whereas afs_addr_cursor::error is used to hold the error from individual
iterations of the call loop. (Note there's also an afs_vl_cursor which
also wraps afs_addr_cursor for accessing VL servers rather than file
servers).
Fix this by setting fc->error in the afs_check/update_server_record() so
that any error incurred whilst talking to the VL server correctly
propagates to the final result.
This results in:
kAFS: Unexpected error from FS.StoreData -512
being seen, even though the store-data op is non-interruptible. The error
is actually coming from the server record update getting interrupted.
Fixes: d2ddc776a458 ("afs: Overhaul volume and server record caching and fileserver rotation")
Signed-off-by: David Howells <dhowells@redhat.com>
|
|
If an older AFS server doesn't support an operation, it may accept the call
and then sit on it forever, happily responding to pings that make kafs
think that the call is still alive.
Fix this by setting the maximum lifespan of Volume Location service calls
in particular and probe calls in general so that they don't run on
endlessly if they're not supported.
Signed-off-by: David Howells <dhowells@redhat.com>
|
|
Under some circumstances afs_select_fileserver() can return without setting
an error in fc->error. The problem is in the no_more_servers segment where
the accumulated errors from attempts to contact various servers are
integrated into an afs_error-type variable 'e'. The resultant error code
is, however, then abandoned.
Fix this by getting the error out of e.error and putting it in 'error' so
that the next part will store it into fc->error.
Not doing this causes a report like the following:
kAFS: AFS vnode with undefined type 0
kAFS: A=0 m=0 s=0 v=0
kAFS: vnode 20000025:1:1
because the code following the server selection loop then sees what it
thinks is a successful invocation because fc.error is 0. However, it can't
apply the status record because it's all zeros.
The report is followed on the first instance with a trace looking something
like:
dump_stack+0x67/0x8e
afs_inode_init_from_status.isra.2+0x21b/0x487
afs_fetch_status+0x119/0x1df
afs_iget+0x130/0x295
afs_get_tree+0x31d/0x595
vfs_get_tree+0x1f/0xe8
fc_mount+0xe/0x36
afs_d_automount+0x328/0x3c3
follow_managed+0x109/0x20a
lookup_fast+0x3bf/0x3f8
do_last+0xc3/0x6a4
path_openat+0x1af/0x236
do_filp_open+0x51/0xae
? _raw_spin_unlock+0x24/0x2d
? __alloc_fd+0x1a5/0x1b7
do_sys_open+0x13b/0x1e8
do_syscall_64+0x7d/0x1b3
entry_SYSCALL_64_after_hwframe+0x49/0xbe
Fixes: 4584ae96ae30 ("afs: Fix missing net error handling")
Signed-off-by: David Howells <dhowells@redhat.com>
|
|
The previous patch has ensured that io_cqring_events contain
smp_rmb memory barriers, Now we can use wait_event_interruptible
to keep the code simple.
Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Whenever smp_rmb is required to use io_cqring_events,
keep smp_rmb inside the function io_cqring_events.
Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
This fixes couple of races which lead to infinite wait of park completion
with the following backtraces:
[20801.303319] Call Trace:
[20801.303321] ? __schedule+0x284/0x650
[20801.303323] schedule+0x33/0xc0
[20801.303324] schedule_timeout+0x1bc/0x210
[20801.303326] ? schedule+0x3d/0xc0
[20801.303327] ? schedule_timeout+0x1bc/0x210
[20801.303329] ? preempt_count_add+0x79/0xb0
[20801.303330] wait_for_completion+0xa5/0x120
[20801.303331] ? wake_up_q+0x70/0x70
[20801.303333] kthread_park+0x48/0x80
[20801.303335] io_finish_async+0x2c/0x70
[20801.303336] io_ring_ctx_wait_and_kill+0x95/0x180
[20801.303338] io_uring_release+0x1c/0x20
[20801.303339] __fput+0xad/0x210
[20801.303341] task_work_run+0x8f/0xb0
[20801.303342] exit_to_usermode_loop+0xa0/0xb0
[20801.303343] do_syscall_64+0xe0/0x100
[20801.303349] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[20801.303380] Call Trace:
[20801.303383] ? __schedule+0x284/0x650
[20801.303384] schedule+0x33/0xc0
[20801.303386] io_sq_thread+0x38a/0x410
[20801.303388] ? __switch_to_asm+0x40/0x70
[20801.303390] ? wait_woken+0x80/0x80
[20801.303392] ? _raw_spin_lock_irqsave+0x17/0x40
[20801.303394] ? io_submit_sqes+0x120/0x120
[20801.303395] kthread+0x112/0x130
[20801.303396] ? kthread_create_on_node+0x60/0x60
[20801.303398] ret_from_fork+0x35/0x40
o kthread_park() waits for park completion, so io_sq_thread() loop
should check kthread_should_park() along with khread_should_stop(),
otherwise if kthread_park() is called before prepare_to_wait()
the following schedule() never returns:
CPU#0 CPU#1
io_sq_thread_stop(): io_sq_thread():
while(!kthread_should_stop() && !ctx->sqo_stop) {
ctx->sqo_stop = 1;
kthread_park()
prepare_to_wait();
if (kthread_should_stop() {
}
schedule(); <<< nobody checks park flag,
<<< so schedule and never return
o if the flag ctx->sqo_stop is observed by the io_sq_thread() loop
it is quite possible, that kthread_should_park() check and the
following kthread_parkme() is never called, because kthread_park()
has not been yet called, but few moments later is is called and
waits there for park completion, which never happens, because
kthread has already exited:
CPU#0 CPU#1
io_sq_thread_stop(): io_sq_thread():
ctx->sqo_stop = 1;
while(!kthread_should_stop() && !ctx->sqo_stop) {
<<< observe sqo_stop and exit the loop
}
if (kthread_should_park())
kthread_parkme(); <<< never called, since was
<<< never parked
kthread_park() <<< waits forever for park completion
In the current patch we quit the loop by only kthread_should_park()
check (kthread_park() is synchronous, so kthread_should_stop() is
never observed), and we abandon ->sqo_stop flag, since it is racy.
At the end of the io_sq_thread() we unconditionally call parmke(),
since we've exited the loop by the park flag.
Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Currently, once configured, AFS cells are looked up in the DNS at regular
intervals - which is a waste of resources if those cells aren't being
used. It also leads to a problem where cells preloaded, but not
configured, before the network is brought up end up effectively statically
configured with no VL servers and are unable to get any.
Fix this by not doing the DNS lookup until the first time a cell is
touched. It is waited for if we don't have any cached records yet,
otherwise the DNS lookup to maintain the record is done in the background.
This has the downside that the first time you touch a cell, you now have to
wait for the upcall to do the required DNS lookups rather than them already
being cached.
Further, the record is not replaced if the old record has at least one
server in it and the new record doesn't have any.
Fixes: 0a5143f2f89c ("afs: Implement VL server rotation")
Signed-off-by: David Howells <dhowells@redhat.com>
|
|
Pull nfsd updates from Bruce Fields:
"This consists mostly of nfsd container work:
Scott Mayhew revived an old api that communicates with a userspace
daemon to manage some on-disk state that's used to track clients
across server reboots. We've been using a usermode_helper upcall for
that, but it's tough to run those with the right namespaces, so a
daemon is much friendlier to container use cases.
Trond fixed nfsd's handling of user credentials in user namespaces. He
also contributed patches that allow containers to support different
sets of NFS protocol versions.
The only remaining container bug I'm aware of is that the NFS reply
cache is shared between all containers. If anyone's aware of other
gaps in our container support, let me know.
The rest of this is miscellaneous bugfixes"
* tag 'nfsd-5.2' of git://linux-nfs.org/~bfields/linux: (23 commits)
nfsd: update callback done processing
locks: move checks from locks_free_lock() to locks_release_private()
nfsd: fh_drop_write in nfsd_unlink
nfsd: allow fh_want_write to be called twice
nfsd: knfsd must use the container user namespace
SUNRPC: rsi_parse() should use the current user namespace
SUNRPC: Fix the server AUTH_UNIX userspace mappings
lockd: Pass the user cred from knfsd when starting the lockd server
SUNRPC: Temporary sockets should inherit the cred from their parent
SUNRPC: Cache the process user cred in the RPC server listener
nfsd: Allow containers to set supported nfs versions
nfsd: Add custom rpcbind callbacks for knfsd
SUNRPC: Allow further customisation of RPC program registration
SUNRPC: Clean up generic dispatcher code
SUNRPC: Add a callback to initialise server requests
SUNRPC/nfs: Fix return value for nfs4_callback_compound()
nfsd: handle legacy client tracking records sent by nfsdcld
nfsd: re-order client tracking method selection
nfsd: keep a tally of RECLAIM_COMPLETE operations when using nfsdcld
nfsd: un-deprecate nfsdcld
...
|
|
We always pass in 0 for the cqe flags argument, since the support for
"this read hit page cache" hint was dropped.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Allow used DNS resolver keys to be invalidated after use if the caller is
doing its own caching of the results. This reduces the amount of resources
required.
Fix AFS to invalidate DNS results to kill off permanent failure records
that get lodged in the resolver keyring and prevent future lookups from
happening.
Fixes: 0a5143f2f89c ("afs: Implement VL server rotation")
Signed-off-by: David Howells <dhowells@redhat.com>
|
|
Fix it such that afs_cell records always have a VL server list record
attached, even if it's a dummy one, so that various checks can be removed.
Signed-off-by: David Howells <dhowells@redhat.com>
|
|
When afs_update_cell() replaces the cell->vl_servers list, it uses RCU
protocol so that proc is protected, but doesn't take ->vl_servers_lock to
protect afs_start_vl_iteration() (which does actually take a shared lock).
Fix this by making afs_update_cell() take an exclusive lock when replacing
->vl_servers.
Fixes: 0a5143f2f89c ("afs: Implement VL server rotation")
Signed-off-by: David Howells <dhowells@redhat.com>
|
|
afs_xattr_get_yfs() tries to free yacl, which may hold an error value (say
if yfs_fs_fetch_opaque_acl() failed and returned an error).
Fix this by allocating yacl up front (since it's a fixed-length struct,
unlike afs_acl) and passing it in to the RPC function. This also allows
the flags to be placed in the object rather than passing them through to
the RPC function.
Fixes: ae46578b963f ("afs: Get YFS ACLs and information through xattrs")
Signed-off-by: David Howells <dhowells@redhat.com>
|
|
Fix incorrect error handling in afs_xattr_get_acl() where there appears to
be a redundant assignment before return, but in fact the return should be a
goto to the error handling at the end of the function.
Fixes: 260f082bae6d ("afs: Get an AFS3 ACL as an xattr")
Addresses-Coverity: ("Unused Value")
Reported-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Joe Perches <joe@perches.com>
|
|
Fix afs_release() to go through the cleanup part of the function if
FMODE_WRITE is set rather than exiting through vfs_fsync() (which skips the
cleanup). The cleanup involves discarding the refs on the key used for
file ops and the writeback key record.
Also fix afs_evict_inode() to clean up any left over wb keys attached to
the inode/vnode when it is removed.
Fixes: 5a8132761609 ("afs: Do better accretion of small writes on newly created content")
Signed-off-by: David Howells <dhowells@redhat.com>
|
|
linux/dax.h is included more than once.
Link: http://lkml.kernel.org/r/5c867e95.1c69fb81.4f15a.e5e4@mx.google.com
Signed-off-by: Sabyasachi Gupta <sabyasachi.linux@gmail.com>
Acked-by: Souptick Joarder <jrdr.linux@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
linux/xattr.h is included more than once.
Link: http://lkml.kernel.org/r/5c86803d.1c69fb81.1a7c6.2b78@mx.google.com
Signed-off-by: Sabyasachi Gupta <sabyasachi.linux@gmail.com>
Acked-by: Souptick Joarder <jrdr.linux@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
linux/poll.h is included more than once.
Link: http://lkml.kernel.org/r/5c86820f.1c69fb81.149f0.0834@mx.google.com
Signed-off-by: Sabyasachi Gupta <sabyasachi.linux@gmail.com>
Acked-by: Souptick Joarder <jrdr.linux@gmail.com>
Cc: Jan Harkes <jaharkes@cs.cmu.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Fix sparse warning:
fs/eventfd.c:26:1: warning:
symbol 'eventfd_ida' was not declared. Should it be static?
Link: http://lkml.kernel.org/r/20190413142348.34716-1-yuehaibing@huawei.com
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Finding endpoints of an IPC channel is one of essential task to
understand how a user program works. Procfs and netlink socket provide
enough hints to find endpoints for IPC channels like pipes, unix
sockets, and pseudo terminals. However, there is no simple way to find
endpoints for an eventfd file from userland. An inode number doesn't
hint. Unlike pipe, all eventfd files share the same inode object.
To provide the way to find endpoints of an eventfd file, this patch adds
"eventfd-id" field to /proc/PID/fdinfo of eventfd as identifier.
Integers managed by an IDA are used as ids.
A tool like lsof can utilize the information to print endpoints.
Link: http://lkml.kernel.org/r/20190327181823.20222-1-yamato@redhat.com
Signed-off-by: Masatake YAMATO <yamato@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
->recursion_depth is changed only by current, therefore decrementing can
be done without taking any locks.
Link: http://lkml.kernel.org/r/20190417213150.GA26474@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
fsync() needs to make sure the data & meta-data of file are persistent
after the return of fsync(), even when a power-failure occurs later. In
the case of fat-fs, the FAT belongs to the meta-data of file, so we need
to issue a flush after the writeback of FAT instead before.
Also bail out early when any stage of fsync fails.
Link: http://lkml.kernel.org/r/20190409030158.136316-1-houtao1@huawei.com
Signed-off-by: Hou Tao <houtao1@huawei.com>
Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
csum_partial() gives different results for little-endian and big-endian
hosts. This causes images created on little-endian hosts and mounted on
big endian hosts to see csum mismatches. This causes an endianness bug.
Sparse gives a warning as csum_partial returns a restricted integer type
__wsum_t and xattr_hash expects __u32. This warning acts as a reminder
for this bug and should not be suppressed.
This comment aims to convey these endianness issues.
[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/20190423161831.GA15387@bharath12345-Inspiron-5559
Signed-off-by: Bharath Vedartham <linux.bhar@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Jann Horn <jannh@google.com>
Cc: Jeff Mahoney <jeffm@suse.com>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Commmit eab09532d400 ("binfmt_elf: use ELF_ET_DYN_BASE only for PIE"),
made changes in the rare case when the ELF loader was directly invoked
(e.g to set a non-inheritable LD_LIBRARY_PATH, testing new versions of
the loader), by moving into the mmap region to avoid both ET_EXEC and
PIE binaries. This had the effect of also moving the brk region into
mmap, which could lead to the stack and brk being arbitrarily close to
each other. An unlucky process wouldn't get its requested stack size
and stack allocations could end up scribbling on the heap.
This is illustrated here. In the case of using the loader directly, brk
(so helpfully identified as "[heap]") is allocated with the _loader_ not
the binary. For example, with ASLR entirely disabled, you can see this
more clearly:
$ /bin/cat /proc/self/maps
555555554000-55555555c000 r-xp 00000000 ... /bin/cat
55555575b000-55555575c000 r--p 00007000 ... /bin/cat
55555575c000-55555575d000 rw-p 00008000 ... /bin/cat
55555575d000-55555577e000 rw-p 00000000 ... [heap]
...
7ffff7ff7000-7ffff7ffa000 r--p 00000000 ... [vvar]
7ffff7ffa000-7ffff7ffc000 r-xp 00000000 ... [vdso]
7ffff7ffc000-7ffff7ffd000 r--p 00027000 ... /lib/x86_64-linux-gnu/ld-2.27.so
7ffff7ffd000-7ffff7ffe000 rw-p 00028000 ... /lib/x86_64-linux-gnu/ld-2.27.so
7ffff7ffe000-7ffff7fff000 rw-p 00000000 ...
7ffffffde000-7ffffffff000 rw-p 00000000 ... [stack]
$ /lib/x86_64-linux-gnu/ld-2.27.so /bin/cat /proc/self/maps
...
7ffff7bcc000-7ffff7bd4000 r-xp 00000000 ... /bin/cat
7ffff7bd4000-7ffff7dd3000 ---p 00008000 ... /bin/cat
7ffff7dd3000-7ffff7dd4000 r--p 00007000 ... /bin/cat
7ffff7dd4000-7ffff7dd5000 rw-p 00008000 ... /bin/cat
7ffff7dd5000-7ffff7dfc000 r-xp 00000000 ... /lib/x86_64-linux-gnu/ld-2.27.so
7ffff7fb2000-7ffff7fd6000 rw-p 00000000 ...
7ffff7ff7000-7ffff7ffa000 r--p 00000000 ... [vvar]
7ffff7ffa000-7ffff7ffc000 r-xp 00000000 ... [vdso]
7ffff7ffc000-7ffff7ffd000 r--p 00027000 ... /lib/x86_64-linux-gnu/ld-2.27.so
7ffff7ffd000-7ffff7ffe000 rw-p 00028000 ... /lib/x86_64-linux-gnu/ld-2.27.so
7ffff7ffe000-7ffff8020000 rw-p 00000000 ... [heap]
7ffffffde000-7ffffffff000 rw-p 00000000 ... [stack]
The solution is to move brk out of mmap and into ELF_ET_DYN_BASE since
nothing is there in the direct loader case (and ET_EXEC is still far
away at 0x400000). Anything that ran before should still work (i.e.
the ultimately-launched binary already had the brk very far from its
text, so this should be no different from a COMPAT_BRK standpoint). The
only risk I see here is that if someone started to suddenly depend on
the entire memory space lower than the mmap region being available when
launching binaries via a direct loader execs which seems highly
unlikely, I'd hope: this would mean a binary would _not_ work when
exec()ed normally.
(Note that this is only done under CONFIG_ARCH_HAS_ELF_RANDOMIZATION
when randomization is turned on.)
Link: http://lkml.kernel.org/r/20190422225727.GA21011@beast
Link: https://lkml.kernel.org/r/CAGXu5jJ5sj3emOT2QPxQkNQk0qbU6zEfu9=Omfhx_p0nCKPSjA@mail.gmail.com
Fixes: eab09532d400 ("binfmt_elf: use ELF_ET_DYN_BASE only for PIE")
Signed-off-by: Kees Cook <keescook@chromium.org>
Reported-by: Ali Saidi <alisaidi@amazon.com>
Cc: Ali Saidi <alisaidi@amazon.com>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Jann Horn <jannh@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|