Age | Commit message (Collapse) | Author |
|
We managed to get confused about the shift direction at least once.
Let's switch to division/multiplcation instead. Add a number of pages
macro for this purpose. We still keep the order macro around too since
this is what alloc/free pages want.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Wei Wang <wei.w.wang@intel.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
|
|
free_page_order is a confusing name. It's not a page order
actually, it's the order of the block of memory we are hinting.
Rename to hint_block_order. Also, rename SIZE to BYTES
to make it clear it's the block size in bytes.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Wei Wang <wei.w.wang@intel.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
|
|
In case we have to migrate a ballon page to a newpage of another zone, the
managed page count of both zones is wrong. Paired with memory offlining
(which will adjust the managed page count), we can trigger kernel crashes
and all kinds of different symptoms.
One way to reproduce:
1. Start a QEMU guest with 4GB, no NUMA
2. Hotplug a 1GB DIMM and online the memory to ZONE_NORMAL
3. Inflate the balloon to 1GB
4. Unplug the DIMM (be quick, otherwise unmovable data ends up on it)
5. Observe /proc/zoneinfo
Node 0, zone Normal
pages free 16810
min 24848885473806
low 18471592959183339
high 36918337032892872
spanned 262144
present 262144
managed 18446744073709533486
6. Do anything that requires some memory (e.g., inflate the balloon some
more). The OOM goes crazy and the system crashes
[ 238.324946] Out of memory: Killed process 537 (login) total-vm:27584kB, anon-rss:860kB, file-rss:0kB, shmem-rss:00
[ 238.338585] systemd invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
[ 238.339420] CPU: 0 PID: 1 Comm: systemd Tainted: G D W 5.4.0-next-20191204+ #75
[ 238.340139] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu4
[ 238.341121] Call Trace:
[ 238.341337] dump_stack+0x8f/0xd0
[ 238.341630] dump_header+0x61/0x5ea
[ 238.341942] oom_kill_process.cold+0xb/0x10
[ 238.342299] out_of_memory+0x24d/0x5a0
[ 238.342625] __alloc_pages_slowpath+0xd12/0x1020
[ 238.343024] __alloc_pages_nodemask+0x391/0x410
[ 238.343407] pagecache_get_page+0xc3/0x3a0
[ 238.343757] filemap_fault+0x804/0xc30
[ 238.344083] ? ext4_filemap_fault+0x28/0x42
[ 238.344444] ext4_filemap_fault+0x30/0x42
[ 238.344789] __do_fault+0x37/0x1a0
[ 238.345087] __handle_mm_fault+0x104d/0x1ab0
[ 238.345450] handle_mm_fault+0x169/0x360
[ 238.345790] do_user_addr_fault+0x20d/0x490
[ 238.346154] do_page_fault+0x31/0x210
[ 238.346468] async_page_fault+0x43/0x50
[ 238.346797] RIP: 0033:0x7f47eba4197e
[ 238.347110] Code: Bad RIP value.
[ 238.347387] RSP: 002b:00007ffd7c0c1890 EFLAGS: 00010293
[ 238.347834] RAX: 0000000000000002 RBX: 000055d196a20a20 RCX: 00007f47eba4197e
[ 238.348437] RDX: 0000000000000033 RSI: 00007ffd7c0c18c0 RDI: 0000000000000004
[ 238.349047] RBP: 00007ffd7c0c1c20 R08: 0000000000000000 R09: 0000000000000033
[ 238.349660] R10: 00000000ffffffff R11: 0000000000000293 R12: 0000000000000001
[ 238.350261] R13: ffffffffffffffff R14: 0000000000000000 R15: 00007ffd7c0c18c0
[ 238.350878] Mem-Info:
[ 238.351085] active_anon:3121 inactive_anon:51 isolated_anon:0
[ 238.351085] active_file:12 inactive_file:7 isolated_file:0
[ 238.351085] unevictable:0 dirty:0 writeback:0 unstable:0
[ 238.351085] slab_reclaimable:5565 slab_unreclaimable:10170
[ 238.351085] mapped:3 shmem:111 pagetables:155 bounce:0
[ 238.351085] free:720717 free_pcp:2 free_cma:0
[ 238.353757] Node 0 active_anon:12484kB inactive_anon:204kB active_file:48kB inactive_file:28kB unevictable:0kB iss
[ 238.355979] Node 0 DMA free:11556kB min:36kB low:48kB high:60kB reserved_highatomic:0KB active_anon:152kB inactivB
[ 238.358345] lowmem_reserve[]: 0 2955 2884 2884 2884
[ 238.358761] Node 0 DMA32 free:2677864kB min:7004kB low:10028kB high:13052kB reserved_highatomic:0KB active_anon:0B
[ 238.361202] lowmem_reserve[]: 0 0 72057594037927865 72057594037927865 72057594037927865
[ 238.361888] Node 0 Normal free:193448kB min:99395541895224kB low:73886371836733356kB high:147673348131571488kB reB
[ 238.364765] lowmem_reserve[]: 0 0 0 0 0
[ 238.365101] Node 0 DMA: 7*4kB (U) 5*8kB (UE) 6*16kB (UME) 2*32kB (UM) 1*64kB (U) 2*128kB (UE) 3*256kB (UME) 2*512B
[ 238.366379] Node 0 DMA32: 0*4kB 1*8kB (U) 2*16kB (UM) 2*32kB (UM) 2*64kB (UM) 1*128kB (U) 1*256kB (U) 1*512kB (U)B
[ 238.367654] Node 0 Normal: 1985*4kB (UME) 1321*8kB (UME) 844*16kB (UME) 524*32kB (UME) 300*64kB (UME) 138*128kB (B
[ 238.369184] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 238.369915] 130 total pagecache pages
[ 238.370241] 0 pages in swap cache
[ 238.370533] Swap cache stats: add 0, delete 0, find 0/0
[ 238.370981] Free swap = 0kB
[ 238.371239] Total swap = 0kB
[ 238.371488] 1048445 pages RAM
[ 238.371756] 0 pages HighMem/MovableOnly
[ 238.372090] 306992 pages reserved
[ 238.372376] 0 pages cma reserved
[ 238.372661] 0 pages hwpoisoned
In another instance (older kernel), I was able to observe this
(negative page count :/):
[ 180.896971] Offlined Pages 32768
[ 182.667462] Offlined Pages 32768
[ 184.408117] Offlined Pages 32768
[ 186.026321] Offlined Pages 32768
[ 187.684861] Offlined Pages 32768
[ 189.227013] Offlined Pages 32768
[ 190.830303] Offlined Pages 32768
[ 190.833071] Built 1 zonelists, mobility grouping on. Total pages: -36920272750453009
In another instance (older kernel), I was no longer able to start any
process:
[root@vm ~]# [ 214.348068] Offlined Pages 32768
[ 215.973009] Offlined Pages 32768
cat /proc/meminfo
-bash: fork: Cannot allocate memory
[root@vm ~]# cat /proc/meminfo
-bash: fork: Cannot allocate memory
Fix it by properly adjusting the managed page count when migrating if
the zone changed. The managed page count of the zones now looks after
unplug of the DIMM (and after deflating the balloon) just like before
inflating the balloon (and plugging+onlining the DIMM).
We'll temporarily modify the totalram page count. If this ever becomes a
problem, we can fine tune by providing helpers that don't touch
the totalram pages (e.g., adjust_zone_managed_page_count()).
Please note that fixing up the managed page count is only necessary when
we adjusted the managed page count when inflating - only if we
don't have VIRTIO_BALLOON_F_DEFLATE_ON_OOM. With that feature, the
managed page count is not touched when inflating/deflating.
Reported-by: Yumei Huang <yuhuang@redhat.com>
Fixes: 3dcc0571cd64 ("mm: correctly update zone->managed_pages")
Cc: <stable@vger.kernel.org> # v3.11+
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: virtualization@lists.linux-foundation.org
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
|
Instead of multiplying by page order, virtio balloon divided by page
order. The result is that it can return 0 if there are a bit less
than MAX_ORDER - 1 pages in use, and then shrinker scan won't be called.
Cc: stable@vger.kernel.org
Fixes: 71994620bb25 ("virtio_balloon: replace oom notifier with shrinker")
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
|
|
virtio_balloon_shrinker_scan should return number of system pages freed,
but because it's calling functions that deal with balloon pages, it gets
confused and sometimes returns the number of balloon pages.
It does not matter practically as the exact number isn't
used, but it seems better to be consistent in case someone
starts using this API.
Further, if we ever tried to iteratively leak pages as
virtio_balloon_shrinker_scan tries to do, we'd run into issues - this is
because freed_pages was accumulating total freed pages, but was also
subtracted on each iteration from pages_to_free, which can result in
either leaking less memory than we were supposed to free, or more if
pages_to_free underruns.
On a system with 4K pages we are lucky that we are never asked to leak
more than 128 pages while we can leak up to 256 at a time,
but it looks like a real issue for systems with page size != 4K.
Fixes: 71994620bb25 ("virtio_balloon: replace oom notifier with shrinker")
Reported-by: Khazhismel Kumykov <khazhy@google.com>
Reviewed-by: Wei Wang <wei.w.wang@intel.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
|
Commit 780bc7903a32 ("virtio_ring: Support DMA APIs") makes
virtqueue_add() return -EIO when we fail to map our I/O buffers. This is
a very realistic scenario for guests with encrypted memory, as swiotlb
may run out of space, depending on it's size and the I/O load.
The virtio-blk driver interprets -EIO form virtqueue_add() as an IO
error, despite the fact that swiotlb full is in absence of bugs a
recoverable condition.
Let us change the return code to -ENOMEM, and make the block layer
recover form these failures when virtio-blk encounters the condition
described above.
Cc: stable@vger.kernel.org
Fixes: 780bc7903a32 ("virtio_ring: Support DMA APIs")
Signed-off-by: Halil Pasic <pasic@linux.ibm.com>
Tested-by: Michael Mueller <mimu@linux.ibm.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
|
When VIRTIO_F_RING_EVENT_IDX is negotiated, virtio devices can
use virtqueue_enable_cb_delayed_packed to reduce the number of device
interrupts. At the moment, this is the case for virtio-net when the
napi_tx module parameter is set to false.
In this case, the virtio driver selects an event offset and expects that
the device will send a notification when rolling over the event offset
in the ring. However, if this roll-over happens before the event
suppression structure update, the notification won't be sent. To address
this race condition the driver needs to check wether the device rolled
over the offset after updating the event suppression structure.
With VIRTIO_F_RING_PACKED, the virtio driver did this by reading the
flags field of the descriptor at the specified offset.
Unfortunately, checking at the event offset isn't reliable: if
descriptors are chained (e.g. when INDIRECT is off) not all descriptors
are overwritten by the device, so it's possible that the device skipped
the specific descriptor driver is checking when writing out used
descriptors. If this happens, the driver won't detect the race condition
and will incorrectly expect the device to send a notification.
For virtio-net, the result will be a TX queue stall, with the
transmission getting blocked forever.
With the packed ring, it isn't easy to find a location which is
guaranteed to change upon the roll-over, except the next device
descriptor, as described in the spec:
Writes of device and driver descriptors can generally be
reordered, but each side (driver and device) are only required to
poll (or test) a single location in memory: the next device descriptor after
the one they processed previously, in circular order.
while this might be sub-optimal, let's do exactly this for now.
Cc: stable@vger.kernel.org
Cc: Jason Wang <jasowang@redhat.com>
Fixes: f51f982682e2a ("virtio_ring: leverage event idx in packed ring")
Signed-off-by: Marvin Liu <yong.liu@intel.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
|
The function virtqueue_add_split() DMA-maps the scatterlist buffers. In
case a mapping error occurs the already mapped buffers must be unmapped.
This happens by jumping to the 'unmap_release' label.
In case of indirect descriptors the release is wrong and may leak kernel
memory. Because the implementation assumes that the head descriptor is
already mapped it starts iterating over the descriptor list starting
from the head descriptor. However for indirect descriptors the head
descriptor is never mapped in case of an error.
The fix is to initialize the start index with zero in case of indirect
descriptors and use the 'desc' pointer directly for iterating over the
descriptor chain.
Signed-off-by: Matthias Lange <matthias.lange@kernkonzept.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs mount updates from Al Viro:
"The first part of mount updates.
Convert filesystems to use the new mount API"
* 'work.mount0' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
mnt_init(): call shmem_init() unconditionally
constify ksys_mount() string arguments
don't bother with registering rootfs
init_rootfs(): don't bother with init_ramfs_fs()
vfs: Convert smackfs to use the new mount API
vfs: Convert selinuxfs to use the new mount API
vfs: Convert securityfs to use the new mount API
vfs: Convert apparmorfs to use the new mount API
vfs: Convert openpromfs to use the new mount API
vfs: Convert xenfs to use the new mount API
vfs: Convert gadgetfs to use the new mount API
vfs: Convert oprofilefs to use the new mount API
vfs: Convert ibmasmfs to use the new mount API
vfs: Convert qib_fs/ipathfs to use the new mount API
vfs: Convert efivarfs to use the new mount API
vfs: Convert configfs to use the new mount API
vfs: Convert binfmt_misc to use the new mount API
convenience helper: get_tree_single()
convenience helper get_tree_nodev()
vfs: Kill sget_userns()
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm updates from Dan Williams:
"Primarily just the virtio_pmem driver:
- virtio_pmem
The new virtio_pmem facility introduces a paravirtualized
persistent memory device that allows a guest VM to use DAX
mechanisms to access a host-file with host-page-cache. It arranges
for MAP_SYNC to be disabled and instead triggers a host fsync()
when a 'write-cache flush' command is sent to the virtual disk
device.
- Miscellaneous small fixups"
* tag 'libnvdimm-for-5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
virtio_pmem: fix sparse warning
xfs: disable map_sync for async flush
ext4: disable map_sync for async flush
dax: check synchronous mapping is supported
dm: enable synchronous dax
libnvdimm: add dax_dev sync flag
virtio-pmem: Add virtio pmem driver
libnvdimm: nd_region flush callback support
libnvdimm, namespace: Drop uuid_t implementation detail
|
|
in vm_find_vqs() irq has a wrong type
so, in case of no IRQ resource defined,
wrong parameter will be passed to request_irq()
Signed-off-by: Ihor Matushchak <ihor.matushchak@foobox.net>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Ivan T. Ivanov <iivanov.xz@gmail.com>
|
|
This patch adds virtio-pmem driver for KVM guest.
Guest reads the persistent memory range information from
Qemu over VIRTIO and registers it on nvdimm_bus. It also
creates a nd_region object with the persistent memory
range information so that existing 'nvdimm/pmem' driver
can reserve this into system memory map. This way
'virtio-pmem' driver uses existing functionality of pmem
driver to register persistent memory compatible for DAX
capable filesystems.
This also provides function to perform guest flush over
VIRTIO from 'pmem' driver when userspace performs flush
on DAX memory range.
Signed-off-by: Pankaj Gupta <pagupta@redhat.com>
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jakub Staron <jstaron@google.com>
Tested-by: Jakub Staron <jstaron@google.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
|
|
VIRTIO_MMIO config option block starts with a space, fix that.
Signed-off-by: Fabrizio Castro <fabrizio.castro@bp.renesas.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
|
Convert the virtio_balloon filesystem to the new internal mount API as the old
one will be obsoleted and removed. This allows greater flexibility in
communication of mount parameters between userspace, the VFS and the
filesystem.
See Documentation/filesystems/mount_api.txt for more information.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: "Michael S. Tsirkin" <mst@redhat.com>
cc: Jason Wang <jasowang@redhat.com>
cc: virtualization@lists.linux-foundation.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
Once upon a time we used to set ->d_name of e.g. pipefs root
so that d_path() on pipes would work. These days it's
completely pointless - dentries of pipes are not even connected
to pipefs root. However, mount_pseudo() had set the root
dentry name (passed as the second argument) and callers
kept inventing names to pass to it. Including those that
didn't *have* any non-root dentries to start with...
All of that had been pointless for about 8 years now; it's
time to get rid of that cargo-culting...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
Based on 1 normalized pattern(s):
this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license as published by
the free software foundation either version 2 of the license or at
your option any later version this program is distributed in the
hope that it will be useful but without any warranty without even
the implied warranty of merchantability or fitness for a particular
purpose see the gnu general public license for more details you
should have received a copy of the gnu general public license along
with this program if not write to the free software foundation inc
51 franklin st fifth floor boston ma 02110 1301 usa
extracted by the scancode license scanner the SPDX license identifier
GPL-2.0-or-later
has been chosen to replace the boilerplate/reference in 50 file(s).
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Richard Fontana <rfontana@redhat.com>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190523091649.499889647@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
Based on 1 normalized pattern(s):
this work is licensed under the terms of the gnu gpl version 2 or
later see the copying file in the top level directory
extracted by the scancode license scanner the SPDX license identifier
GPL-2.0-or-later
has been chosen to replace the boilerplate/reference in 6 file(s).
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Richard Fontana <rfontana@redhat.com>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190520075210.858783702@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
Add SPDX license identifiers to all Make/Kconfig files which:
- Have no license information of any form
These files fall under the project license, GPL v2 only. The resulting SPDX
license identifier is:
GPL-2.0-only
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
Add SPDX license identifiers to all files which:
- Have no license information of any form
- Have MODULE_LICENCE("GPL*") inside which was used in the initial
scan/conversion to ignore the file
These files fall under the project license, GPL v2 only. The resulting SPDX
license identifier is:
GPL-2.0-only
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
There are lots of mismatches between comments and codes, this
patch do these comment fixes.
Signed-off-by: Jiang Biao <benbjiang@tencent.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
|
'desc' should be freed before leaving from err handing path.
Fixes: 1ce9e6055fa0 ("virtio_ring: introduce packed ring support")
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
stable@vger.kernel.org
|
|
vring_create_virtqueue() allows the caller to specify via the
may_reduce_num parameter whether the vring code is allowed to
allocate a smaller ring than specified.
However, the split ring allocation code tries to allocate a
smaller ring on allocation failure regardless of what the
caller specified. This may cause trouble for e.g. virtio-pci
in legacy mode, which does not support ring resizing. (The
packed ring code does not resize in any case.)
Let's fix this by bailing out immediately in the split ring code
if the requested size cannot be allocated and may_reduce_num has
not been specified.
While at it, fix a typo in the usage instructions.
Fixes: 2a2d1382fe9d ("virtio: Add improved queue allocation API")
Cc: stable@vger.kernel.org # v4.6+
Signed-off-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Halil Pasic <pasic@linux.ibm.com>
Reviewed-by: Jens Freimann <jfreimann@redhat.com>
|
|
If the msix_affinity_masks is alloced failed, then we'll
try to free some resources in vp_free_vectors() that may
access it directly.
We met the following stack in our production:
[ 29.296767] BUG: unable to handle kernel NULL pointer dereference at (null)
[ 29.311151] IP: [<ffffffffc04fe35a>] vp_free_vectors+0x6a/0x150 [virtio_pci]
[ 29.324787] PGD 0
[ 29.333224] Oops: 0000 [#1] SMP
[...]
[ 29.425175] RIP: 0010:[<ffffffffc04fe35a>] [<ffffffffc04fe35a>] vp_free_vectors+0x6a/0x150 [virtio_pci]
[ 29.441405] RSP: 0018:ffff9a55c2dcfa10 EFLAGS: 00010206
[ 29.453491] RAX: 0000000000000000 RBX: ffff9a55c322c400 RCX: 0000000000000000
[ 29.467488] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff9a55c322c400
[ 29.481461] RBP: ffff9a55c2dcfa20 R08: 0000000000000000 R09: ffffc1b6806ff020
[ 29.495427] R10: 0000000000000e95 R11: 0000000000aaaaaa R12: 0000000000000000
[ 29.509414] R13: 0000000000010000 R14: ffff9a55bd2d9e98 R15: ffff9a55c322c400
[ 29.523407] FS: 00007fdcba69f8c0(0000) GS:ffff9a55c2840000(0000) knlGS:0000000000000000
[ 29.538472] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 29.551621] CR2: 0000000000000000 CR3: 000000003ce52000 CR4: 00000000003607a0
[ 29.565886] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 29.580055] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 29.594122] Call Trace:
[ 29.603446] [<ffffffffc04fe8a2>] vp_request_msix_vectors+0xe2/0x260 [virtio_pci]
[ 29.618017] [<ffffffffc04fedc5>] vp_try_to_find_vqs+0x95/0x3b0 [virtio_pci]
[ 29.632152] [<ffffffffc04ff117>] vp_find_vqs+0x37/0xb0 [virtio_pci]
[ 29.645582] [<ffffffffc057bf63>] init_vq+0x153/0x260 [virtio_blk]
[ 29.658831] [<ffffffffc057c1e8>] virtblk_probe+0xe8/0x87f [virtio_blk]
[...]
Cc: Gonglei <arei.gonglei@huawei.com>
Signed-off-by: Longpeng <longpeng2@huawei.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Gonglei <arei.gonglei@huawei.com>
|
|
A virtio transport is free to implement some of the callbacks in
virtio_config_ops in a matter that they cannot be called from
atomic context (e.g. virtio-ccw, which maps a lot of the callbacks
to channel I/O, which is an inherently asynchronous mechanism).
This can be very surprising for developers using the much more
common virtio-pci transport, just to find out that things break
when used on s390.
The documentation for virtio_config_ops now contains a comment
explaining this, but it makes sense to add a might_sleep() annotation
to various wrapper functions in the virtio core to avoid surprises
later.
Note that annotations are NOT added to two classes of calls:
- direct calls from device drivers (all current callers should be
fine, however)
- calls which clearly won't be made from atomic context (such as
those ultimately coming in via the driver core)
Signed-off-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
|
We've changed to kzalloc the vb struct, so no need to 0-initialize
this field one more time.
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
|
|
There is no need to update the balloon actual register when there is no
ballooning request. This patch avoids update_balloon_size when diff is 0.
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Reviewed-by: Halil Pasic <pasic@linux.ibm.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
|
This function returns the maximum segment size for a single
dma transaction of a virtio device. The possible limit comes
from the SWIOTLB implementation in the Linux kernel, that
has an upper limit of (currently) 256kb of contiguous
memory it can map. Other DMA-API implementations might also
have limits.
Use the new dma_max_mapping_size() function to determine the
maximum mapping size when DMA-API is in use for virtio.
Cc: stable@vger.kernel.org
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
|
There's no reason to expose struct vring_packed in UAPI - if we do we
won't be able to change or drop it, and it's not part of any interface.
Let's move it to virtio_ring.c
Cc: Tiwei Bie <tiwei.bie@intel.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
|
This patch introduces the support for VIRTIO_F_ORDER_PLATFORM.
If this feature is negotiated, the driver must use the barriers
suitable for hardware devices. Otherwise, the device and driver
are assumed to be implemented in software, that is they can be
assumed to run on identical CPUs in an SMP configuration. Thus
a weaker form of memory barriers is sufficient to yield better
performance.
It is recommended that an add-in card based PCI device offers
this feature for portability. The device will fail to operate
further or will operate in a slower emulation mode if this
feature is offered but not accepted.
Signed-off-by: Tiwei Bie <tiwei.bie@intel.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
|
virtio-ccw has deadlock issues with reading the config space inside the
interrupt context, so we tweak the virtballoon_changed implementation
by moving the config read operations into the related workqueue contexts.
The config_read_bitmap is used as a flag to the workqueue callbacks
about the related config fields that need to be read.
The cmd_id_received is also renamed to cmd_id_received_cache, and
the value should be obtained via virtio_balloon_cmd_id_received.
Reported-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Reviewed-by: Halil Pasic <pasic@linux.ibm.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Cc: stable@vger.kernel.org
Fixes: 86a559787e6f ("virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT")
Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
|
|
Some vqs may not need to be allocated when their related feature bits
are disabled. So callers may pass in such vqs with "names = NULL".
Then we skip such vq allocations.
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Cc: stable@vger.kernel.org
Fixes: 86a559787e6f ("virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT")
|
|
When find_vqs, there will be no vq[i] allocation if its corresponding
names[i] is NULL. For example, the caller may pass in names[i] (i=4)
with names[2] being NULL because the related feature bit is turned off,
so technically there are 3 queues on the device, and name[4] should
correspond to the 3rd queue on the device.
So we use queue_idx as the queue index, which is increased only when the
queue exists.
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
|
|
Pull virtio/vhost updates from Michael Tsirkin:
"Features, fixes, cleanups:
- discard in virtio blk
- misc fixes and cleanups"
* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
vhost: correct the related warning message
vhost: split structs into a separate header file
virtio: remove deprecated VIRTIO_PCI_CONFIG()
vhost/vsock: switch to a mutex for vhost_vsock_hash
virtio_blk: add discard and write zeroes support
|
|
VIRTIO_PCI_CONFIG() is deprecated. Use VIRTIO_PCI_CONFIG_OFF() instead.
Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
|
Advertize the packed ring layout support.
Signed-off-by: Tiwei Bie <tiwei.bie@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Leverage the EVENT_IDX feature in packed ring to suppress
events when it's available.
Signed-off-by: Tiwei Bie <tiwei.bie@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Introduce the packed ring support. Packed ring can only be
created by vring_create_virtqueue() and each chunk of packed
ring will be allocated individually. Packed ring can not be
created on preallocated memory by vring_new_virtqueue() or
the likes currently.
Signed-off-by: Tiwei Bie <tiwei.bie@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Cache whether we will use DMA API, instead of doing the
check every time. We are going to check whether DMA API
is used more often in packed ring.
Signed-off-by: Tiwei Bie <tiwei.bie@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Introduce a specific function to create the split ring.
And also move the DMA allocation and size information to
the .split sub-structure.
Signed-off-by: Tiwei Bie <tiwei.bie@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Put the split ring's desc state into the .split sub-structure,
and allocate desc state for split ring separately, this makes
the code more readable and more consistent with what we will
do for packed ring.
Signed-off-by: Tiwei Bie <tiwei.bie@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Introduce a helper to check whether we will use indirect
feature. It will be used by packed ring too.
Signed-off-by: Tiwei Bie <tiwei.bie@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Introduce debug helpers for last_add_time update, check and
invalid. They will be used by packed ring too.
Signed-off-by: Tiwei Bie <tiwei.bie@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Put the split ring specific fields in a sub-struct named
as "split" to avoid misuse after introducing packed ring.
There is no functional change.
Signed-off-by: Tiwei Bie <tiwei.bie@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Put the xxx_split() functions together to make the
code more readable and avoid misuse after introducing
the packed ring. There is no functional change.
Signed-off-by: Tiwei Bie <tiwei.bie@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add _split suffix for split ring specific functions. This
is a preparation for introducing the packed ring support.
There is no functional change.
Signed-off-by: Tiwei Bie <tiwei.bie@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The VIRTIO_BALLOON_F_PAGE_POISON feature bit is used to indicate if the
guest is using page poisoning. Guest writes to the poison_val config
field to tell host about the page poisoning value that is in use.
Suggested-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
|
Negotiation of the VIRTIO_BALLOON_F_FREE_PAGE_HINT feature indicates the
support of reporting hints of guest free pages to host via virtio-balloon.
Currenlty, only free page blocks of MAX_ORDER - 1 are reported. They are
obtained one by one from the mm free list via the regular allocation
function.
Host requests the guest to report free page hints by sending a new cmd id
to the guest via the free_page_report_cmd_id configuration register. When
the guest starts to report, it first sends a start cmd to host via the
free page vq, which acks to host the cmd id received. When the guest
finishes reporting free pages, a stop cmd is sent to host via the vq.
Host may also send a stop cmd id to the guest to stop the reporting.
VIRTIO_BALLOON_CMD_ID_STOP: Host sends this cmd to stop the guest
reporting.
VIRTIO_BALLOON_CMD_ID_DONE: Host sends this cmd to tell the guest that
the reported pages are ready to be freed.
Why does the guest free the reported pages when host tells it is ready to
free?
This is because freeing pages appears to be expensive for live migration.
free_pages() dirties memory very quickly and makes the live migraion not
converge in some cases. So it is good to delay the free_page operation
when the migration is done, and host sends a command to guest about that.
Why do we need the new VIRTIO_BALLOON_CMD_ID_DONE, instead of reusing
VIRTIO_BALLOON_CMD_ID_STOP?
This is because live migration is usually done in several rounds. At the
end of each round, host needs to send a VIRTIO_BALLOON_CMD_ID_STOP cmd to
the guest to stop (or say pause) the reporting. The guest resumes the
reporting when it receives a new command id at the beginning of the next
round. So we need a new cmd id to distinguish between "stop reporting" and
"ready to free the reported pages".
TODO:
- Add a batch page allocation API to amortize the allocation overhead.
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Signed-off-by: Liang Li <liang.z.li@intel.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
|
Pull virtio updates from Michael Tsirkin:
"virtio, vhost: fixes, tweaks
No new features but a bunch of tweaks such as switching balloon from
oom notifier to shrinker"
* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
vhost/scsi: increase VHOST_SCSI_PREALLOC_PROT_SGLS to 2048
vhost: allow vhost-scsi driver to be built-in
virtio: pci-legacy: Validate queue pfn
virtio: mmio-v1: Validate queue PFN
virtio_balloon: replace oom notifier with shrinker
virtio-balloon: kzalloc the vb struct
virtio-balloon: remove BUG() in init_vqs
|
|
Legacy PCI over virtio uses a 32bit PFN for the queue. If the
queue pfn is too large to fit in 32bits, which we could hit on
arm64 systems with 52bit physical addresses (even with 64K page
size), we simply miss out a proper link to the other side of
the queue.
Add a check to validate the PFN, rather than silently breaking
the devices.
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Christoffer Dall <cdall@kernel.org>
Cc: Peter Maydel <peter.maydell@linaro.org>
Cc: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|