Age | Commit message (Collapse) | Author |
|
This code is ancient, and goes back to when we only had a single page
for the pipe buffers. The exact history is hidden in the mists of time
(ie "before git", and in fact predates the BK repository too).
At that long-ago point in time, it actually helped to try to merge big
back-and-forth pipe reads and writes, and not limit pipe reads to the
single pipe buffer in length just because that was all we had at a time.
However, since then we've expanded the pipe buffers to multiple pages,
and this logic really doesn't seem to make sense. And a lot of it is
somewhat questionable (ie "hmm, the user asked for a non-blocking read,
but we see that there's a writer pending, so let's wait anyway to get
the extra data that the writer will have").
But more importantly, it makes the "go to sleep" logic much less
obvious, and considering the wakeup issues we've had, I want to make for
less of those kinds of things.
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This is the read side version of the previous commit: it simplifies the
logic to only wake up waiting writers when necessary, and makes sure to
use a synchronous wakeup. This time not so much for GNU make jobserver
reasons (that pipe never fills up), but simply to get the writer going
quickly again.
A bit less verbose commentary this time, if only because I assume that
the write side commentary isn't going to be ignored if you touch this
code.
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The pipe rework ends up having been extra painful, partly becaused of
actual bugs with ordering and caching of the pipe state, but also
because of subtle performance issues.
In particular, the pipe rework caused the kernel build to inexplicably
slow down.
The reason turns out to be that the GNU make jobserver (which limits the
parallelism of the build) uses a pipe to implement a "token" system: a
parallel submake will read a character from the pipe to get the job
token before starting a new job, and will write a character back to the
pipe when it is done. The overall job limit is thus easily controlled
by just writing the appropriate number of initial token characters into
the pipe.
But to work well, that really means that the old behavior of write
wakeups being synchronous (WF_SYNC) is very important - when the pipe
writer wakes up a reader, we want the reader to actually get scheduled
immediately. Otherwise you lose the parallelism of the build.
The pipe rework lost that synchronous wakeup on write, and we had
clearly all forgotten the reasons and rules for it.
This rewrites the pipe write wakeup logic to do the required Wsync
wakeups, but also clarifies the logic and avoids extraneous wakeups.
It also ends up addign a number of comments about what oit does and why,
so that we hopefully don't end up forgetting about this next time we
change this code.
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The kernel wait queues have a basic rule to them: you add yourself to
the wait-queue first, and then you check the things that you're going to
wait on. That avoids the races with the event you're waiting for.
The same goes for poll/select logic: the "poll_wait()" goes first, and
then you check the things you're polling for.
Of course, if you use locking, the ordering doesn't matter since the
lock will serialize with anything that changes the state you're looking
at. That's not the case here, though.
So move the poll_wait() first in pipe_poll(), before you start looking
at the pipe state.
Fixes: 8cefc107ca54 ("pipe: Use head and tail pointers for the ring, not cursor and length")
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The legacy client tracking infrastructure of nfsd makes use of MD5 to
derive a client's recovery directory name. As the nfsd module doesn't
declare any dependency on CRYPTO_MD5, though, it may fail to allocate
the hash if the kernel was compiled without it. As a result, generation
of client recovery directories will fail with the following error:
NFSD: unable to generate recoverydir name
The explicit dependency on CRYPTO_MD5 was removed as redundant back in
6aaa67b5f3b9 (NFSD: Remove redundant "select" clauses in fs/Kconfig
2008-02-11) as it was already implicitly selected via RPCSEC_GSS_KRB5.
This broke when RPCSEC_GSS_KRB5 was made optional for NFSv4 in commit
df486a25900f (NFS: Fix the selection of security flavours in Kconfig) at
a later point.
Fix the issue by adding back an explicit dependency on CRYPTO_MD5.
Fixes: df486a25900f (NFS: Fix the selection of security flavours in Kconfig)
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
|
|
Static checker revealed possible error path leading to possible
NULL pointer dereferencing.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Fixes: e0639dc5805a: ("NFSD introduce async copy feature")
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
|
|
Fix the iteration end check in fuse_dev_splice_write(). The iterator
position can only be compared with == or != since wrappage may be involved.
Fixes: 8cefc107ca54 ("pipe: Use head and tail pointers for the ring, not cursor and length")
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Similarly to commit 8f868d68d335 ("pipe: Fix missing mask update after
pipe_wait()") this fixes a case where the pipe rewrite ended up caching
the pipe state incorrectly over a pipe lock drop event.
It wasn't quite as obvious, because you needed to splice data from a
pipe to a file, which is a fairly unusual operation, but it's completely
wrong.
Make sure we load the pipe head/tail/size information only after we've
waited for there to be data in the pipe.
While in that file, also make one of the splice helper functions use the
canonical arghument order for pipe_empty(). That's syntactic - pipe
emptiness is just that head and tail are equal, and thus mixing up head
and tail doesn't really matter. It's still wrong, though.
Reported-by: David Sterba <dsterba@suse.cz>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
When using the special SID to store the mode bits in an ACE (See
http://technet.microsoft.com/en-us/library/hh509017(v=ws.10).aspx)
which is enabled with mount parm "modefromsid" we were not
passing in the mode via SMB3 create (although chmod was enabled).
SMB3 create allows a security descriptor context to be passed
in (which is more atomic and thus preferable to setting the mode
bits after create via a setinfo).
This patch enables setting the mode bits on create when using
modefromsid mount option. In addition it fixes an endian
error in the definition of the Control field flags in the SMB3
security descriptor. It also makes the ACE type of the special
SID better match the documentation (and behavior of servers
which use this to store mode bits in SMB3 ACLs).
Signed-off-by: Steve French <stfrench@microsoft.com>
Acked-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
|
|
Pull more block and io_uring updates from Jens Axboe:
"I wasn't expecting this to be so big, and if I was, I would have used
separate branches for this. Going forward I'll be doing separate
branches for the current tree, just like for the next kernel version
tree. In any case, this contains:
- Series from Christoph that fixes an inherent race condition with
zoned devices and revalidation.
- null_blk zone size fix (Damien)
- Fix for a regression in this merge window that caused busy spins by
sending empty disk uevents (Eric)
- Fix for a regression in this merge window for bfq stats (Hou)
- Fix for io_uring creds allocation failure handling (me)
- io_uring -ERESTARTSYS send/recvmsg fix (me)
- Series that fixes the need for applications to retain state across
async request punts for io_uring. This one is a bit larger than I
would have hoped, but I think it's important we get this fixed for
5.5.
- connect(2) improvement for io_uring, handling EINPROGRESS instead
of having applications needing to poll for it (me)
- Have io_uring use a hash for poll requests instead of an rbtree.
This turned out to work much better in practice, so I think we
should make the switch now. For some workloads, even with a fair
amount of cancellations, the insertion sort is just too expensive.
(me)
- Various little io_uring fixes (me, Jackie, Pavel, LimingWu)
- Fix for brd unaligned IO, and a warning for the future (Ming)
- Fix for a bio integrity data leak (Justin)
- bvec_iter_advance() improvement (Pavel)
- Xen blkback page unmap fix (SeongJae)
The major items in here are all well tested, and on the liburing side
we continue to add regression and feature test cases. We're up to 50
topic cases now, each with anywhere from 1 to more than 10 cases in
each"
* tag 'for-linus-20191205' of git://git.kernel.dk/linux-block: (33 commits)
block: fix memleak of bio integrity data
io_uring: fix a typo in a comment
bfq-iosched: Ensure bio->bi_blkg is valid before using it
io_uring: hook all linked requests via link_list
io_uring: fix error handling in io_queue_link_head
io_uring: use hash table for poll command lookups
io-wq: clear node->next on list deletion
io_uring: ensure deferred timeouts copy necessary data
io_uring: allow IO_SQE_* flags on IORING_OP_TIMEOUT
null_blk: remove unused variable warning on !CONFIG_BLK_DEV_ZONED
brd: warn on un-aligned buffer
brd: remove max_hw_sectors queue limit
xen/blkback: Avoid unmapping unmapped grant pages
io_uring: handle connect -EINPROGRESS like -EAGAIN
block: set the zone size in blk_revalidate_disk_zones atomically
block: don't handle bio based drivers in blk_revalidate_disk_zones
block: allocate the zone bitmaps lazily
block: replace seq_zones_bitmap with conv_zones_bitmap
block: simplify blkdev_nr_zones
block: remove the empty line at the end of blk-zoned.c
...
|
|
Pull vfs d_inode/d_flags memory ordering fixes from Al Viro:
"Fallout from tree-wide audit for ->d_inode/->d_flags barriers use.
Basically, the problem is that negative pinned dentries require
careful treatment - unless ->d_lock is locked or parent is held at
least shared, another thread can make them positive right under us.
Most of the uses turned out to be safe - the main surprises as far as
filesystems are concerned were
- race in dget_parent() fastpath, that might end up with the caller
observing the returned dentry _negative_, due to insufficient
barriers. It is positive in memory, but we could end up seeing the
wrong value of ->d_inode in CPU cache. Fixed.
- manual checks that result of lookup_one_len_unlocked() is positive
(and rejection of negatives). Again, insufficient barriers (we
might end up with inconsistent observed values of ->d_inode and
->d_flags). Fixed by switching to a new primitive that does the
checks itself and returns ERR_PTR(-ENOENT) instead of a negative
dentry. That way we get rid of boilerplate converting negatives
into ERR_PTR(-ENOENT) in the callers and have a single place to
deal with the barrier-related mess - inside fs/namei.c rather than
in every caller out there.
The guts of pathname resolution *do* need to be careful - the race
found by Ritesh is real, as well as several similar races.
Fortunately, it turns out that we can take care of that with fairly
local changes in there.
The tree-wide audit had not been fun, and I hate the idea of repeating
it. I think the right approach would be to annotate the places where
we are _not_ guaranteed ->d_inode/->d_flags stability and have sparse
catch regressions. But I'm still not sure what would be the least
invasive way of doing that and it's clearly the next cycle fodder"
* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fs/namei.c: fix missing barriers when checking positivity
fix dget_parent() fastpath race
new helper: lookup_positive_unlocked()
fs/namei.c: pull positivity check into follow_managed()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull autofs updates from Al Viro:
"autofs misuses checks for ->d_subdirs emptiness; the cursors are in
the same lists, resulting in false negatives. It's not needed anyway,
since autofs maintains counter in struct autofs_info, containing 0 for
removed ones, 1 for live symlinks and 1 + number of children for live
directories, which is precisely what we need for those checks.
This series switches to use of that counter and untangles the crap
around its uses (it needs not be atomic and there's a bunch of
completely pointless "defensive" checks).
This fell out of dcache_readdir work; the main point is to get rid of
->d_subdirs abuses in there. I've more followup cleanups, but I hadn't
run those by Ian yet, so they can go next cycle"
* 'next.autofs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
autofs: don't bother with atomics for ino->count
autofs_dir_rmdir(): check ino->count for deciding whether it's empty...
autofs: get rid of pointless checks around ->count handling
autofs_clear_leaf_automount_flags(): use ino->count instead of ->d_subdirs
|
|
Merge two fixes for the pipe rework from David Howells:
"Here are a couple of patches to fix bugs syzbot found in the pipe
changes:
- An assertion check will sometimes trip when polling a pipe because
the ring size and indices used are approximate and may be being
changed simultaneously.
An equivalent approximate calculation was done previously, but
without the assertion check, so I've just dropped the check. To
make it accurate, the pipe mutex would need to be taken or the spin
lock could be used - but usage of the spinlock would need to be
rolled out into splice, iov_iter and other places for that.
- The index mask and the max_usage values cannot be cached across
pipe_wait() as F_SETPIPE_SZ could have been called during the wait.
This can cause pipe_write() to break"
* pipe-rework:
pipe: Fix missing mask update after pipe_wait()
pipe: Remove assertion from pipe_poll()
|
|
Fix pipe_write() to not cache the ring index mask and max_usage as their
values are invalidated by calling pipe_wait() because the latter
function drops the pipe lock, thereby allowing F_SETPIPE_SZ change them.
Without this, pipe_write() may subsequently miscalculate the array
indices and pipe fullness, leading to an oops like the following:
BUG: KASAN: slab-out-of-bounds in pipe_write+0xc25/0xe10 fs/pipe.c:481
Write of size 8 at addr ffff8880771167a8 by task syz-executor.3/7987
...
CPU: 1 PID: 7987 Comm: syz-executor.3 Not tainted 5.4.0-rc2-syzkaller #0
...
Call Trace:
pipe_write+0xc25/0xe10 fs/pipe.c:481
call_write_iter include/linux/fs.h:1895 [inline]
new_sync_write+0x3fd/0x7e0 fs/read_write.c:483
__vfs_write+0x94/0x110 fs/read_write.c:496
vfs_write+0x18a/0x520 fs/read_write.c:558
ksys_write+0x105/0x220 fs/read_write.c:611
__do_sys_write fs/read_write.c:623 [inline]
__se_sys_write fs/read_write.c:620 [inline]
__x64_sys_write+0x6e/0xb0 fs/read_write.c:620
do_syscall_64+0xca/0x5d0 arch/x86/entry/common.c:290
entry_SYSCALL_64_after_hwframe+0x49/0xbe
This is not a problem for pipe_read() as the mask is recalculated on
each pass of the loop, after pipe_wait() has been called.
Fixes: 8cefc107ca54 ("pipe: Use head and tail pointers for the ring, not cursor and length")
Reported-by: syzbot+838eb0878ffd51f27c41@syzkaller.appspotmail.com
Signed-off-by: David Howells <dhowells@redhat.com>
Cc: Eric Biggers <ebiggers@kernel.org>
[ Changed it to use a temporary variable 'mask' to avoid long lines -Linus ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
An assertion check was added to pipe_poll() to make sure that the ring
occupancy isn't seen to overflow the ring size. However, since no locks
are held when the three values are read, it is possible for F_SETPIPE_SZ
to intervene and muck up the calculation, thereby causing the oops.
Fix this by simply removing the assertion and accepting that the
calculation might be approximate.
Note that the previous code also had a similar issue, though there was
no assertion check, since the occupancy counter and the ring size were
not read with a lock held, so it's possible that the poll check might
have malfunctioned then too.
Also wake up all the waiters so that they can reissue their checks if
there was a competing read or write.
Fixes: 8cefc107ca54 ("pipe: Use head and tail pointers for the ring, not cursor and length")
Reported-by: syzbot+d37abaade33a934f16f2@syzkaller.appspotmail.com
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2
Pull GFS2 updates from Andreas Gruenbacher:
"Bob's extensive filesystem withdrawal and recovery testing:
- don't write log headers after file system withdraw
- clean up iopen glock mess in gfs2_create_inode
- close timing window with GLF_INVALIDATE_IN_PROGRESS
- abort gfs2_freeze if io error is seen
- don't loop forever in gfs2_freeze if withdrawn
- fix infinite loop in gfs2_ail1_flush on io error
- introduce function gfs2_withdrawn
- fix glock reference problem in gfs2_trans_remove_revoke
Filesystems with a block size smaller than the page size:
- fix end-of-file handling in gfs2_page_mkwrite
- improve mmap write vs. punch_hole consistency
Other:
- remove active journal side effect from gfs2_write_log_header
- multi-block allocations in gfs2_page_mkwrite
Minor cleanups and coding style fixes:
- remove duplicate call from gfs2_create_inode
- make gfs2_log_shutdown static
- make gfs2_fs_parameters static
- some whitespace cleanups
- removed unnecessary semicolon"
* tag 'gfs2-for-5.5' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
gfs2: Don't write log headers after file system withdraw
gfs2: Remove duplicate call from gfs2_create_inode
gfs2: clean up iopen glock mess in gfs2_create_inode
gfs2: Close timing window with GLF_INVALIDATE_IN_PROGRESS
gfs2: Abort gfs2_freeze if io error is seen
gfs2: Don't loop forever in gfs2_freeze if withdrawn
gfs2: fix infinite loop in gfs2_ail1_flush on io error
gfs2: Introduce function gfs2_withdrawn
gfs2: fix glock reference problem in gfs2_trans_remove_revoke
gfs2: make gfs2_log_shutdown static
gfs2: Remove active journal side effect from gfs2_write_log_header
gfs2: Fix end-of-file handling in gfs2_page_mkwrite
gfs2: Multi-block allocations in gfs2_page_mkwrite
gfs2: Improve mmap write vs. punch_hole consistency
gfs2: make gfs2_fs_parameters static
gfs2: Some whitespace cleanups
gfs2: removed unnecessary semicolon
|
|
Pull ceph updates from Ilya Dryomov:
"The two highlights are a set of improvements to how rbd read-only
mappings are handled and a conversion to the new mount API (slightly
complicated by the fact that we had a common option parsing framework
that called out into rbd and the filesystem instead of them calling
into it).
Also included a few scattered fixes and a MAINTAINERS update for rbd,
adding Dongsheng as a reviewer"
* tag 'ceph-for-5.5-rc1' of git://github.com/ceph/ceph-client:
libceph, rbd, ceph: convert to use the new mount API
rbd: ask for a weaker incompat mask for read-only mappings
rbd: don't query snapshot features
rbd: remove snapshot existence validation code
rbd: don't establish watch for read-only mappings
rbd: don't acquire exclusive lock for read-only mappings
rbd: disallow read-write partitions on images mapped read-only
rbd: treat images mapped read-only seriously
rbd: introduce RBD_DEV_FLAG_READONLY
rbd: introduce rbd_is_snap()
ceph: don't leave ino field in ceph_mds_request_head uninitialized
ceph: tone down loglevel on ceph_mdsc_build_path warning
rbd: update MAINTAINERS info
ceph: fix geting random mds from mdsmap
rbd: fix spelling mistake "requeueing" -> "requeuing"
ceph: make several helper accessors take const pointers
libceph: drop unnecessary check from dispatch() in mon_client.c
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
Pull fuse update from Miklos Szeredi:
- Fix a regression introduced in the last release
- Fix a number of issues with validating data coming from userspace
- Some cleanups in virtiofs
* tag 'fuse-update-5.5' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
fuse: fix Kconfig indentation
fuse: fix leak of fuse_io_priv
virtiofs: Use completions while waiting for queue to be drained
virtiofs: Do not send forget request "struct list_head" element
virtiofs: Use a common function to send forget
virtiofs: Fix old-style declaration
fuse: verify nlink
fuse: verify write return
fuse: verify attributes
|
|
This patch fixes the following KASAN report. The @ioend has been
freed by dio_put(), but the iomap_finish_ioend() still trys to access
its data.
[20563.631624] BUG: KASAN: use-after-free in iomap_finish_ioend+0x58c/0x5c0
[20563.638319] Read of size 8 at addr fffffc0c54a36928 by task kworker/123:2/22184
[20563.647107] CPU: 123 PID: 22184 Comm: kworker/123:2 Not tainted 5.4.0+ #1
[20563.653887] Hardware name: HPE Apollo 70 /C01_APACHE_MB , BIOS L50_5.13_1.11 06/18/2019
[20563.664499] Workqueue: xfs-conv/sda5 xfs_end_io [xfs]
[20563.669547] Call trace:
[20563.671993] dump_backtrace+0x0/0x370
[20563.675648] show_stack+0x1c/0x28
[20563.678958] dump_stack+0x138/0x1b0
[20563.682455] print_address_description.isra.9+0x60/0x378
[20563.687759] __kasan_report+0x1a4/0x2a8
[20563.691587] kasan_report+0xc/0x18
[20563.694985] __asan_report_load8_noabort+0x18/0x20
[20563.699769] iomap_finish_ioend+0x58c/0x5c0
[20563.703944] iomap_finish_ioends+0x110/0x270
[20563.708396] xfs_end_ioend+0x168/0x598 [xfs]
[20563.712823] xfs_end_io+0x1e0/0x2d0 [xfs]
[20563.716834] process_one_work+0x7f0/0x1ac8
[20563.720922] worker_thread+0x334/0xae0
[20563.724664] kthread+0x2c4/0x348
[20563.727889] ret_from_fork+0x10/0x18
[20563.732941] Allocated by task 83403:
[20563.736512] save_stack+0x24/0xb0
[20563.739820] __kasan_kmalloc.isra.9+0xc4/0xe0
[20563.744169] kasan_slab_alloc+0x14/0x20
[20563.747998] slab_post_alloc_hook+0x50/0xa8
[20563.752173] kmem_cache_alloc+0x154/0x330
[20563.756185] mempool_alloc_slab+0x20/0x28
[20563.760186] mempool_alloc+0xf4/0x2a8
[20563.763845] bio_alloc_bioset+0x2d0/0x448
[20563.767849] iomap_writepage_map+0x4b8/0x1740
[20563.772198] iomap_do_writepage+0x200/0x8d0
[20563.776380] write_cache_pages+0x8a4/0xed8
[20563.780469] iomap_writepages+0x4c/0xb0
[20563.784463] xfs_vm_writepages+0xf8/0x148 [xfs]
[20563.788989] do_writepages+0xc8/0x218
[20563.792658] __writeback_single_inode+0x168/0x18f8
[20563.797441] writeback_sb_inodes+0x370/0xd30
[20563.801703] wb_writeback+0x2d4/0x1270
[20563.805446] wb_workfn+0x344/0x1178
[20563.808928] process_one_work+0x7f0/0x1ac8
[20563.813016] worker_thread+0x334/0xae0
[20563.816757] kthread+0x2c4/0x348
[20563.819979] ret_from_fork+0x10/0x18
[20563.825028] Freed by task 22184:
[20563.828251] save_stack+0x24/0xb0
[20563.831559] __kasan_slab_free+0x10c/0x180
[20563.835648] kasan_slab_free+0x10/0x18
[20563.839389] slab_free_freelist_hook+0xb4/0x1c0
[20563.843912] kmem_cache_free+0x8c/0x3e8
[20563.847745] mempool_free_slab+0x20/0x28
[20563.851660] mempool_free+0xd4/0x2f8
[20563.855231] bio_free+0x33c/0x518
[20563.858537] bio_put+0xb8/0x100
[20563.861672] iomap_finish_ioend+0x168/0x5c0
[20563.865847] iomap_finish_ioends+0x110/0x270
[20563.870328] xfs_end_ioend+0x168/0x598 [xfs]
[20563.874751] xfs_end_io+0x1e0/0x2d0 [xfs]
[20563.878755] process_one_work+0x7f0/0x1ac8
[20563.882844] worker_thread+0x334/0xae0
[20563.886584] kthread+0x2c4/0x348
[20563.889804] ret_from_fork+0x10/0x18
[20563.894855] The buggy address belongs to the object at fffffc0c54a36900
which belongs to the cache bio-1 of size 248
[20563.906844] The buggy address is located 40 bytes inside of
248-byte region [fffffc0c54a36900, fffffc0c54a369f8)
[20563.918485] The buggy address belongs to the page:
[20563.923269] page:ffffffff82f528c0 refcount:1 mapcount:0 mapping:fffffc8e4ba31900 index:0xfffffc0c54a33300
[20563.932832] raw: 17ffff8000000200 ffffffffa3060100 0000000700000007 fffffc8e4ba31900
[20563.940567] raw: fffffc0c54a33300 0000000080aa0042 00000001ffffffff 0000000000000000
[20563.948300] page dumped because: kasan: bad access detected
[20563.955345] Memory state around the buggy address:
[20563.960129] fffffc0c54a36800: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fc
[20563.967342] fffffc0c54a36880: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[20563.974554] >fffffc0c54a36900: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[20563.981766] ^
[20563.986288] fffffc0c54a36980: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fc
[20563.993501] fffffc0c54a36a00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[20564.000713] ==================================================================
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=205703
Signed-off-by: Zorro Lang <zlang@redhat.com>
Fixes: 9cd0ed63ca514 ("iomap: enhance writeback error message")
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
|
thatn -> than.
Signed-off-by: Liming Wu <19092205@suning.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Links are created by chaining requests through req->list with an
exception that head uses req->link_list. (e.g. link_list->list->list)
Because of that, io_req_link_next() needs complex splicing to advance.
Link them all through list_list. Also, it seems to be simpler and more
consistent IMHO.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
In case of an error io_submit_sqe() drops a request and continues
without it, even if the request was a part of a link. Not only it
doesn't cancel links, but also may execute wrong sequence of actions.
Stop consuming sqes, and let the user handle errors.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
ELF reads done by the kernel have very complicated error detection code
which better live in one place.
Link: http://lkml.kernel.org/r/20191005165215.GB26927@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Link: http://lkml.kernel.org/r/20191005165049.GA26927@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Take the case where we have:
t0
| (ew)
e0
| (et)
e1
| (lt)
s0
t0: thread 0
e0: epoll fd 0
e1: epoll fd 1
s0: socket fd 0
ew: epoll_wait
et: edge-trigger
lt: level-trigger
We remove unnecessary wakeups to prevent the nested epoll that working in edge-
triggered mode to waking up continuously.
Test code:
#include <unistd.h>
#include <sys/epoll.h>
#include <sys/socket.h>
int main(int argc, char *argv[])
{
int sfd[2];
int efd[2];
struct epoll_event e;
if (socketpair(AF_UNIX, SOCK_STREAM, 0, sfd) < 0)
goto out;
efd[0] = epoll_create(1);
if (efd[0] < 0)
goto out;
efd[1] = epoll_create(1);
if (efd[1] < 0)
goto out;
e.events = EPOLLIN;
if (epoll_ctl(efd[1], EPOLL_CTL_ADD, sfd[0], &e) < 0)
goto out;
e.events = EPOLLIN | EPOLLET;
if (epoll_ctl(efd[0], EPOLL_CTL_ADD, efd[1], &e) < 0)
goto out;
if (write(sfd[1], "w", 1) != 1)
goto out;
if (epoll_wait(efd[0], &e, 1, 0) != 1)
goto out;
if (epoll_wait(efd[0], &e, 1, 0) != 0)
goto out;
close(efd[0]);
close(efd[1]);
close(sfd[0]);
close(sfd[1]);
return 0;
out:
return -1;
}
More tests:
https://github.com/heiher/epoll-wakeup
Link: http://lkml.kernel.org/r/20191009060516.3577-1-r@hev.cc
Signed-off-by: hev <r@hev.cc>
Reviewed-by: Roman Penyaev <rpenyaev@suse.de>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Davide Libenzi <davidel@xmailserver.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Dominik Brodowski <linux@dominikbrodowski.net>
Cc: Eric Wong <e@80x24.org>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Sridhar Samudrala <sridhar.samudrala@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Currently, ep_poll_safewake() in the CONFIG_DEBUG_LOCK_ALLOC case uses
ep_call_nested() in order to pass the correct subclass argument to
spin_lock_irqsave_nested(). However, ep_call_nested() adds unnecessary
checks for epoll depth and loops that are already verified when doing
EPOLL_CTL_ADD. This mirrors a conversion that was done for
!CONFIG_DEBUG_LOCK_ALLOC in: commit 37b5e5212a44 ("epoll: remove
ep_call_nested() from ep_eventpoll_poll()")
Link: http://lkml.kernel.org/r/1567628549-11501-1-git-send-email-jbaron@akamai.com
Signed-off-by: Jason Baron <jbaron@akamai.com>
Reviewed-by: Roman Penyaev <rpenyaev@suse.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric Wong <normalperson@yhbt.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Adjust indentation from spaces to tab (+optional two spaces) as in
coding style with command like:
$ sed -e 's/^ / /' -i */Kconfig
[adobriyan@gmail.com: add two spaces where necessary]
Link: http://lkml.kernel.org/r/20191124133936.GA5655@avx2
Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org>
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
List iteration takes more code than anything else which means embedded
list_head should be the first element of the structure.
Space savings:
add/remove: 0/0 grow/shrink: 0/4 up/down: 0/-18 (-18)
Function old new delta
close_pdeo 228 227 -1
proc_reg_release 86 82 -4
proc_entry_rundown 143 139 -4
proc_reg_open 298 289 -9
Link: http://lkml.kernel.org/r/20191004234753.GB30246@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Pointer to next '/' encodes length of path element and next start
position. Subtraction and increment are redundant.
Link: http://lkml.kernel.org/r/20191004234521.GA30246@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Currently gluing PDE into global /proc tree is done under lock, but
changing ->nlink is not. Additionally struct proc_dir_entry::nlink is
not atomic so updates can be lost.
Link: http://lkml.kernel.org/r/20190925202436.GA17388@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
We recently changed this from a single list to an rbtree, but for some
real life workloads, the rbtree slows down the submission/insertion
case enough so that it's the top cycle consumer on the io_uring side.
In testing, using a hash table is a more well rounded compromise. It
is fast for insertion, and as long as it's sized appropriately, it
works well for the cancellation case as well. Running TAO with a lot
of network sockets, this removes io_poll_req_insert() from spending
2% of the CPU cycles.
Reported-by: Dan Melnic <dmm@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
If someone removes a node from a list, and then later adds it back to
a list, we can have invalid data in ->next. This can cause all sorts
of issues. One such use case is the IORING_OP_POLL_ADD command, which
will do just that if we race and get woken twice without any pending
events. This is a pretty rare case, but can happen under extreme loads.
Dan reports that he saw the following crash:
BUG: kernel NULL pointer dereference, address: 0000000000000000
PGD d283ce067 P4D d283ce067 PUD e5ca04067 PMD 0
Oops: 0002 [#1] SMP
CPU: 17 PID: 10726 Comm: tao:fast-fiber Kdump: loaded Not tainted 5.2.9-02851-gac7bc042d2d1 #116
Hardware name: Quanta Twin Lakes MP/Twin Lakes Passive MP, BIOS F09_3A17 05/03/2019
RIP: 0010:io_wqe_enqueue+0x3e/0xd0
Code: 34 24 74 55 8b 47 58 48 8d 6f 50 85 c0 74 50 48 89 df e8 35 7c 75 00 48 83 7b 08 00 48 8b 14 24 0f 84 84 00 00 00 48 8b 4b 10 <48> 89 11 48 89 53 10 83 63 20 fe 48 89 c6 48 89 df e8 0c 7a 75 00
RSP: 0000:ffffc90006858a08 EFLAGS: 00010082
RAX: 0000000000000002 RBX: ffff889037492fc0 RCX: 0000000000000000
RDX: ffff888e40cc11a8 RSI: ffff888e40cc11a8 RDI: ffff889037492fc0
RBP: ffff889037493010 R08: 00000000000000c3 R09: ffffc90006858ab8
R10: 0000000000000000 R11: 0000000000000000 R12: ffff888e40cc11a8
R13: 0000000000000000 R14: 00000000000000c3 R15: ffff888e40cc1100
FS: 00007fcddc9db700(0000) GS:ffff88903fa40000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 0000000e479f5003 CR4: 00000000007606e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
<IRQ>
io_poll_wake+0x12f/0x2a0
__wake_up_common+0x86/0x120
__wake_up_common_lock+0x7a/0xc0
sock_def_readable+0x3c/0x70
tcp_rcv_established+0x557/0x630
tcp_v6_do_rcv+0x118/0x3c0
tcp_v6_rcv+0x97e/0x9d0
ip6_protocol_deliver_rcu+0xe3/0x440
ip6_input+0x3d/0xc0
? ip6_protocol_deliver_rcu+0x440/0x440
ipv6_rcv+0x56/0xd0
? ip6_rcv_finish_core.isra.18+0x80/0x80
__netif_receive_skb_one_core+0x50/0x70
netif_receive_skb_internal+0x2f/0xa0
napi_gro_receive+0x125/0x150
mlx5e_handle_rx_cqe+0x1d9/0x5a0
? mlx5e_poll_tx_cq+0x305/0x560
mlx5e_poll_rx_cq+0x49f/0x9c5
mlx5e_napi_poll+0xee/0x640
? smp_reschedule_interrupt+0x16/0xd0
? reschedule_interrupt+0xf/0x20
net_rx_action+0x286/0x3d0
__do_softirq+0xca/0x297
irq_exit+0x96/0xa0
do_IRQ+0x54/0xe0
common_interrupt+0xf/0xf
</IRQ>
RIP: 0033:0x7fdc627a2e3a
Code: 31 c0 85 d2 0f 88 f6 00 00 00 55 48 89 e5 41 57 41 56 4c 63 f2 41 55 41 54 53 48 83 ec 18 48 85 ff 0f 84 c7 00 00 00 48 8b 07 <41> 89 d4 49 89 f5 48 89 fb 48 85 c0 0f 84 64 01 00 00 48 83 78 10
when running a networked workload with about 5000 sockets being polled
for. Fix this by clearing node->next when the node is being removed from
the list.
Fixes: 6206f0e180d4 ("io-wq: shrink io_wq_work a bit")
Reported-by: Dan Melnic <dmm@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
If we defer a timeout, we should ensure that we copy the timespec
when we have consumed the sqe. This is similar to commit f67676d160c6
for read/write requests. We already did this correctly for timeouts
deferred as links, but do it generally and use the infrastructure added
by commit 1a6b74fc8702 instead of having the timeout deferral use its
own.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
iface[0] was accessed regardless of the count value and without
locking.
* check count before accessing any ifaces
* make copy of iface list (it's a simple POD array) and use it without
locking.
Signed-off-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
|
|
With the addition of SMB session channels, we introduced new TCP
server pointers that have no sessions or tcons associated with them.
In this case, when we started looking for TCP connections, we might
end up picking session channel rather than the master connection,
hence failing to get either a session or a tcon.
In order to fix that, this patch introduces a new "is_channel" field
to TCP_Server_Info structure so we can skip session channels during
lookup of connections.
Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
There's really no reason why we forbid things like link/drain etc on
regular timeout commands. Enable the usual SQE flags on timeouts.
Reported-by: 李通洲 <carter.li@eoitek.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
bio completions can race when a page spans more than one file system
block. Add a spinlock to synchronize marking the page uptodate.
Fixes: 9dc55f1389f9 ("iomap: add support for sub-pagesize buffered I/O without buffer heads")
Reported-by: Jan Stancek <jstancek@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
Orangefs has no open, and orangefs checks file permissions
on each file access. Posix requires that file permissions
be checked on open and nowhere else. Orangefs-through-the-kernel
needs to seem posix compliant.
The VFS opens files, even if the filesystem provides no
method. We can see if a file was successfully opened for
read and or for write by looking at file->f_mode.
When writes are flowing from the page cache, file is no
longer available. We can trust the VFS to have checked
file->f_mode before writing to the page cache.
The mode of a file might change between when it is opened
and IO commences, or it might be created with an arbitrary mode.
We'll make sure we don't hit EACCES during the IO stage by
using UID 0. Some of the time we have access without changing
to UID 0 - how to check?
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
|
|
syzbot (via KASAN) reports a use-after-free in the error path of
xlog_alloc_log(). Specifically, the iclog freeing loop doesn't
handle the case of a fully initialized ->l_iclog linked list.
Instead, it assumes that the list is partially constructed and NULL
terminated.
This bug manifested because there was no possible error scenario
after iclog list setup when the original code was added. Subsequent
code and associated error conditions were added some time later,
while the original error handling code was never updated. Fix up the
error loop to terminate either on a NULL iclog or reaching the end
of the list.
Reported-by: syzbot+c732f8644185de340492@syzkaller.appspotmail.com
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
Since timestamps on files on most servers can be updated at
close, and since timestamps on our dentries default to one
second we can have stale timestamps in some common cases
(e.g. open, write, close, stat, wait one second, stat - will
show different mtime for the first and second stat).
The SMB2/SMB3 protocol allows querying timestamps at close
so add the code to request timestamp and attr information
(which is cheap for the server to provide) to be returned
when a file is closed (it is not needed for the many
paths that call SMB2_close that are from compounded
query infos and close nor is it needed for some of
the cases where a directory close immediately follows a
directory open.
Signed-off-by: Steve French <stfrench@microsoft.com>
Acked-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
|
|
Pull iomap cleanups from Darrick Wong:
"Aome more new iomap code for 5.5.
There's not much this time -- just removing some local variables that
don't need to exist in the iomap directio code"
* tag 'iomap-5.5-merge-13' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
iomap: remove unneeded variable in iomap_dio_rw()
iomap: Do not create fake iter in iomap_dio_bio_actor()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer updates from Ingo Molnar:
"The main changes in the timer code in this cycle were:
- Clockevent updates:
- timer-of framework cleanups. (Geert Uytterhoeven)
- Use timer-of for the renesas-ostm and the device name to prevent
name collision in case of multiple timers. (Geert Uytterhoeven)
- Check if there is an error after calling of_clk_get in asm9260
(Chuhong Yuan)
- ABI fix: Zero out high order bits of nanoseconds on compat
syscalls. This got broken a year ago, with apparently no side
effects so far.
Since the kernel would use random data otherwise I don't think we'd
have other options but to fix the bug, even if there was a side
effect to applications (Dmitry Safonov)
- Optimize ns_to_timespec64() on 32-bit systems: move away from
div_s64_rem() which can be slow, to div_u64_rem() which is faster
(Arnd Bergmann)
- Annotate KCSAN-reported false positive data races in
hrtimer_is_queued() users by moving timer->state handling over to
the READ_ONCE()/WRITE_ONCE() APIs. This documents these accesses
(Eric Dumazet)
- Misc cleanups and small fixes"
[ I undid the "ABI fix" and updated the comments instead. The reason
there were apparently no side effects is that the fix was a no-op.
The updated comment is to say _why_ it was a no-op. - Linus ]
* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
time: Zero the upper 32-bits in __kernel_timespec on 32-bit
time: Rename tsk->real_start_time to ->start_boottime
hrtimer: Remove the comment about not used HRTIMER_SOFTIRQ
time: Fix spelling mistake in comment
time: Optimize ns_to_timespec64()
hrtimer: Annotate lockless access to timer->state
clocksource/drivers/asm9260: Add a check for of_clk_get
clocksource/drivers/renesas-ostm: Use unique device name instead of ostm
clocksource/drivers/renesas-ostm: Convert to timer_of
clocksource/drivers/timer-of: Use unique device name instead of timer
clocksource/drivers/timer-of: Convert last full_name to %pOF
|
|
Right now we return it to userspace, which means the application has
to poll for the socket to be writeable. Let's just treat it like
-EAGAIN and have io_uring handle it internally, this makes it much
easier to use.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Since commit b18fdf71e01f ("io_uring: simplify io_req_link_next()"),
the io_wq_current_is_worker function is no longer needed, clean it
up.
Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Parameter ctx we have never used, clean it up.
Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
If this flag is set, applications can be certain that any data for
async offload has been consumed when the kernel has consumed the
SQE.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Just like commit f67676d160c6 for read/write requests, this one ensures
that the sockaddr data has been copied for IORING_OP_CONNECT if we need
to punt the request to async context.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Just like commit f67676d160c6 for read/write requests, this one ensures
that the msghdr data is fully copied if we need to punt a recvmsg or
sendmsg system call to async context.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Currently we don't copy the iovecs when we punt to async context. This
can be problematic for applications that store the iovec on the stack,
as they often assume that it's safe to let the iovec go out of scope
as soon as IO submission has been called. This isn't always safe, as we
will re-copy the iovec once we're in async context.
Make this 100% safe by copying the iovec just once. With this change,
applications may safely store the iovec on the stack for all cases.
Reported-by: 李通洲 <carter.li@eoitek.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Commit 5b094d6dac04 ("xfs: fix multi-AG deadlock in xfs_bunmapi") added
a check in __xfs_bunmapi() to stop early if we would touch multiple AGs
in the wrong order. However, this check isn't applicable for realtime
files. In most cases, it just makes us do unnecessary commits. However,
without the fix from the previous commit ("xfs: fix realtime file data
space leak"), if the last and second-to-last extents also happen to have
different "AG numbers", then the break actually causes __xfs_bunmapi()
to return without making any progress, which sends
xfs_itruncate_extents_flags() into an infinite loop.
Fixes: 5b094d6dac04 ("xfs: fix multi-AG deadlock in xfs_bunmapi")
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|