Age | Commit message (Collapse) | Author |
|
git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux
KVM: s390: Fixes and features for 4.14
- merge of topic branch tlb-flushing from the s390 tree to get the
no-dat base features
- merge of kvm/master to avoid conflicts with additional sthyi fixes
- wire up the no-dat enhancements in KVM
- multiple epoch facility (z14 feature)
- Configuration z/Architecture Mode
- more sthyi fixes
- gdb server range checking fix
- small code cleanups
|
|
The machine check information is part of the vsie_page.
Signed-off-by: David Hildenbrand <david@redhat.com>
Message-Id: <20170830160603.5452-4-david@redhat.com>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
|
|
Move the real logic that always has to be executed out of the
WARN_ON_ONCE.
Signed-off-by: David Hildenbrand <david@redhat.com>
Message-Id: <20170830160603.5452-3-david@redhat.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
|
|
Looks like the "overflowing" range check is wrong.
|=======b-------a=======|
addr >= a || addr <= b
Signed-off-by: David Hildenbrand <david@redhat.com>
Message-Id: <20170830160603.5452-2-david@redhat.com>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
|
|
Independent of the underlying hardware, kvm will now always handle
SIGP SET ARCHITECTURE as if czam were enabled. Therefore, let's not
only forward that bit but always set it.
While at it, add a comment regarding STHYI.
Signed-off-by: David Hildenbrand <david@redhat.com>
Message-Id: <20170829143108.14703-1-david@redhat.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
|
|
The STFLE bit 147 indicates whether the ESSA no-DAT operation code is
valid, the bit is not normally provided to the host; the host is
instead provided with an SCLP bit that indicates whether guests can
support the feature.
This patch:
* enables the STFLE bit in the guest if the corresponding SCLP bit is
present in the host.
* adds support for migrating the no-DAT bit in the PGSTEs
* fixes the software interpretation of the ESSA instruction that is
used when migrating, both for the new operation code and for the old
"set stable", as per specifications.
Signed-off-by: Claudio Imbrenda <imbrenda@linux.vnet.ibm.com>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
|
|
handle_sthyi() always writes to guest memory if the sthyi function
code is zero in order to fault in the page that later is written to.
However a function code of zero does not necessarily mean that a write
to guest memory happens: if the KVM host is running as a second level
guest under z/VM 6.2 the sthyi instruction is indicated to be
available to the KVM host, however if the instruction is executed it
will always return with a return code that indicates "unsupported
function code".
In such a case handle_sthyi() must not write to guest memory. This
means that the prior write access to fault in the guest page may
result in invalid guest exceptions, and/or invalid data modification.
In order to be architecture compliant simply remove the write_guest()
call.
Given that the guest assumed a write access anyway, this fix does not
qualify for -stable. This just makes sure the sthyi handler is
architecture compliant.
Fixes: 95ca2cb57985 ("KVM: s390: Add sthyi emulation")
Reviewed-by: Janosch Frank <frankja@linux.vnet.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
|
|
Allow for the enablement of MEF and the support for the extended
epoch in SIE and VSIE for the extended guest TOD-Clock.
A new interface is used for getting/setting a guest's extended TOD-Clock
that uses a single ioctl invocation, KVM_S390_VM_TOD_EXT. Since the
host time is a moving target that might see an epoch switch or STP sync
checks we need an atomic ioctl and cannot use the exisiting two
interfaces. The old method of getting and setting the guest TOD-Clock is
still retained and is used when the old ioctls are called.
Signed-off-by: Collin L. Walling <walling@linux.vnet.ibm.com>
Reviewed-by: Janosch Frank <frankja@linux.vnet.ibm.com>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.vnet.ibm.com>
Reviewed-by: Jason J. Herne <jjherne@linux.vnet.ibm.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
|
|
kvm has always supported the concept of starting in z/Arch mode so let's
reflect the feature bit to the guest.
Also, we change sigp set architecture to reject any request to change
architecture modes.
Signed-off-by: Jason J. Herne <jjherne@linux.vnet.ibm.com>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
|
|
Additional fixes on top of these two
- missing inline assembly constraint
- wrong exception handling
are necessary
|
|
According to the SDM, if the "use TPR shadow" VM-execution control is
1, bits 11:0 of the virtual-APIC address must be 0 and the address
should set any bits beyond the processor's physical-address width.
Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
------------[ cut here ]------------
WARNING: CPU: 7 PID: 3861 at /home/kernel/ssd/kvm/arch/x86/kvm//vmx.c:11299 nested_vmx_vmexit+0x176e/0x1980 [kvm_intel]
CPU: 7 PID: 3861 Comm: qemu-system-x86 Tainted: G W OE 4.13.0-rc4+ #11
RIP: 0010:nested_vmx_vmexit+0x176e/0x1980 [kvm_intel]
Call Trace:
? kvm_multiple_exception+0x149/0x170 [kvm]
? handle_emulation_failure+0x79/0x230 [kvm]
? load_vmcs12_host_state+0xa80/0xa80 [kvm_intel]
? check_chain_key+0x137/0x1e0
? reexecute_instruction.part.168+0x130/0x130 [kvm]
nested_vmx_inject_exception_vmexit+0xb7/0x100 [kvm_intel]
? nested_vmx_inject_exception_vmexit+0xb7/0x100 [kvm_intel]
vmx_queue_exception+0x197/0x300 [kvm_intel]
kvm_arch_vcpu_ioctl_run+0x1b0c/0x2c90 [kvm]
? kvm_arch_vcpu_runnable+0x220/0x220 [kvm]
? preempt_count_sub+0x18/0xc0
? restart_apic_timer+0x17d/0x300 [kvm]
? kvm_lapic_restart_hv_timer+0x37/0x50 [kvm]
? kvm_arch_vcpu_load+0x1d8/0x350 [kvm]
kvm_vcpu_ioctl+0x4e4/0x910 [kvm]
? kvm_vcpu_ioctl+0x4e4/0x910 [kvm]
? kvm_dev_ioctl+0xbe0/0xbe0 [kvm]
The flag "nested_run_pending", which can override the decision of which should run
next, L1 or L2. nested_run_pending=1 means that we *must* run L2 next, not L1. This
is necessary in particular when L1 did a VMLAUNCH of L2 and therefore expects L2 to
be run (and perhaps be injected with an event it specified, etc.). Nested_run_pending
is especially intended to avoid switching to L1 in the injection decision-point.
This can be handled just like the other cases in vmx_check_nested_events, instead of
having a special case in vmx_queue_exception.
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
vmx_complete_interrupts() assumes that the exception is always injected,
so it can be dropped by kvm_clear_exception_queue(). However,
an exception cannot be injected immediately if it is: 1) originally
destined to a nested guest; 2) trapped to cause a vmexit; 3) happening
right after VMLAUNCH/VMRESUME, i.e. when nested_run_pending is true.
This patch applies to exceptions the same algorithm that is used for
NMIs, replacing exception.reinject with "exception.injected" (equivalent
to nmi_injected).
exception.pending now represents an exception that is queued and whose
side effects (e.g., update RFLAGS.RF or DR7) have not been applied yet.
If exception.pending is true, the exception might result in a nested
vmexit instead, too (in which case the side effects must not be applied).
exception.injected instead represents an exception that is going to be
injected into the guest at the next vmentry.
Reported-by: Radim Krčmář <rkrcmar@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Use kvm_event_needs_reinjection() encapsulation.
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
update_permission_bitmask currently does a 128-iteration loop to,
essentially, compute a constant array. Computing the 8 bits in parallel
reduces it to 16 iterations, and is enough to speed it up substantially
because many boolean operations in the inner loop become constants or
simplify noticeably.
Because update_permission_bitmask is actually the top item in the profile
for nested vmexits, this speeds up an L2->L1 vmexit by about ten thousand
clock cycles, or up to 30%:
before after
cpuid 35173 25954
vmcall 35122 27079
inl_from_pmtimer 52635 42675
inl_from_qemu 53604 44599
inl_from_kernel 38498 30798
outl_to_kernel 34508 28816
wr_tsc_adjust_msr 34185 26818
rd_tsc_adjust_msr 37409 27049
mmio-no-eventfd:pci-mem 50563 45276
mmio-wildcard-eventfd:pci-mem 34495 30823
mmio-datamatch-eventfd:pci-mem 35612 31071
portio-no-eventfd:pci-io 44925 40661
portio-wildcard-eventfd:pci-io 29708 27269
portio-datamatch-eventfd:pci-io 31135 27164
(I wrote a small C program to compare the tables for all values of CR0.WP,
CR4.SMAP and CR4.SMEP, and they match).
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
This patch exposes 5 level page table feature to the VM.
At the same time, the canonical virtual address checking is
extended to support both 48-bits and 57-bits address width.
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Extends the shadow paging code, so that 5 level shadow page
table can be constructed if VM is running in 5 level paging
mode.
Also extends the ept code, so that 5 level ept table can be
constructed if maxphysaddr of VM exceeds 48 bits. Unlike the
shadow logic, KVM should still use 4 level ept table for a VM
whose physical address width is less than 48 bits, even when
the VM is running in 5 level paging mode.
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
[Unconditionally reset the MMU context in kvm_cpuid_update.
Changing MAXPHYADDR invalidates the reserved bit bitmasks.
- Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Now we have 4 level page table and 5 level page table in 64 bits
long mode, let's rename the PT64_ROOT_LEVEL to PT64_ROOT_4LEVEL,
then we can use PT64_ROOT_5LEVEL for 5 level page table, it's
helpful to make the code more clear.
Also PT64_ROOT_MAX_LEVEL is defined as 4, so that we can just
redefine it to 5 whenever a replacement is needed for 5 level
paging.
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Currently, KVM uses CR3_L_MODE_RESERVED_BITS to check the
reserved bits in CR3. Yet the length of reserved bits in
guest CR3 should be based on the physical address width
exposed to the VM. This patch changes CR3 check logic to
calculate the reserved bits at runtime.
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Return false in kvm_cpuid() when it fails to find the cpuid
entry. Also, this routine(and its caller) is optimized with
a new argument - check_limit, so that the check_cpuid_limit()
fall back can be avoided.
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
A guest may not be configured to support XSAVES/XRSTORS, even when the host
does. If the guest does not support XSAVES/XRSTORS, clear the secondary
execution control so that the processor will raise #UD.
Also clear the "allowed-1" bit for XSAVES/XRSTORS exiting in the
IA32_VMX_PROCBASED_CTLS2 MSR, and pass through VMCS12's control in
the VMCS02.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
A guest may not be configured to support RDSEED, even when the host
does. If the guest does not support RDSEED, intercept the instruction
and synthesize #UD. Also clear the "allowed-1" bit for RDSEED exiting
in the IA32_VMX_PROCBASED_CTLS2 MSR.
Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
A guest may not be configured to support RDRAND, even when the host
does. If the guest does not support RDRAND, intercept the instruction
and synthesize #UD. Also clear the "allowed-1" bit for RDRAND exiting
in the IA32_VMX_PROCBASED_CTLS2 MSR.
Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Currently, secondary execution controls are divided in three groups:
- static, depending mostly on the module arguments or the processor
(vmx_secondary_exec_control)
- static, depending on CPUID (vmx_cpuid_update)
- dynamic, depending on nested VMX or local APIC state
Because walking CPUID is expensive, prepare_vmcs02 is using only
the first group. This however is unnecessarily complicated. Just
cache the static secondary execution controls, and then prepare_vmcs02
does not need to compute them every time. Computation of all static
secondary execution controls is now kept in a single function,
vmx_compute_secondary_exec_control.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Enable the Virtual GIF feature. This is done by setting bit 25 at position
60h in the vmcb.
With this feature enabled, the processor uses bit 9 at position 60h as the
virtual GIF when executing STGI/CLGI instructions.
Since the execution of STGI by the L1 hypervisor does not cause a return to
the outermost (L0) hypervisor, the enable_irq_window and enable_nmi_window
are modified.
The IRQ window will be opened even if GIF is not set, under the assumption
that on resuming the L1 hypervisor the IRQ will be held pending until the
processor executes the STGI instruction.
For the NMI window, the STGI intercept is set. This will assist in opening
the window only when GIF=1.
Signed-off-by: Janakarajan Natarajan <Janakarajan.Natarajan@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Add a new cpufeature definition for Virtual GIF.
Signed-off-by: Janakarajan Natarajan <Janakarajan.Natarajan@amd.com>
Reviewed-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
sthyi should only generate a specification exception if the function
code is zero and the response buffer is not on a 4k boundary.
The current code would also test for unknown function codes if the
response buffer, that is currently only defined for function code 0,
is not on a 4k boundary and incorrectly inject a specification
exception instead of returning with condition code 3 and return code 4
(unsupported function code).
Fix this by moving the boundary check.
Fixes: 95ca2cb57985 ("KVM: s390: Add sthyi emulation")
Cc: <stable@vger.kernel.org> # 4.8+
Reviewed-by: Janosch Frank <frankja@linux.vnet.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
|
|
The sthyi inline assembly misses register r3 within the clobber
list. The sthyi instruction will always write a return code to
register "R2+1", which in this case would be r3. Due to that we may
have register corruption and see host crashes or data corruption
depending on how gcc decided to allocate and use registers during
compile time.
Fixes: 95ca2cb57985 ("KVM: s390: Add sthyi emulation")
Cc: <stable@vger.kernel.org> # 4.8+
Reviewed-by: Janosch Frank <frankja@linux.vnet.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
|
|
We already always set that type but don't check if it is supported. Also
for nVMX, we only support WB for now. Let's just require it.
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
|
|
Don't use shifts, tag them correctly as EPTP and use better matching
names (PWL vs. GAW).
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
|
|
With lightly tweaked defconfig:
text data bss dec hex filename
11259661 5109408 2981888 19350957 12745ad vmlinux.before
11259661 5109408 884736 17253805 10745ad vmlinux.after
Only compile-tested.
Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: pbonzini@redhat.com
Cc: rkrcmar@redhat.com
Cc: tglx@linutronix.de
Cc: mingo@redhat.com
Cc: hpa@zytor.com
Cc: x86@kernel.org
Cc: kvm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
|
|
There is currently some confusion between nested and L1 GPAs. The
assignment to "direct" in kvm_mmu_page_fault tries to fix that, but
it is not enough. What this patch does is fence off the MMIO cache
completely when using shadow nested page tables, since we have neither
a GVA nor an L1 GPA to put in the cache. This also allows some
simplifications in kvm_mmu_page_fault and FNAME(page_fault).
The EPT misconfig likewise does not have an L1 GPA to pass to
kvm_io_bus_write, so that must be skipped for guest mode.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
[Changed comment to say "GPAs" instead of "L1's physical addresses", as
per David's review. - Radim]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
|
|
When a guest causes a page fault which requires emulation, the
vcpu->arch.gpa_available flag is set to indicate that cr2 contains a
valid GPA.
Currently, emulator_read_write_onepage() makes use of gpa_available flag
to avoid a guest page walk for a known MMIO regions. Lets not limit
the gpa_available optimization to just MMIO region. The patch extends
the check to avoid page walk whenever gpa_available flag is set.
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
[Fix EPT=0 according to Wanpeng Li's fix, plus ensure VMX also uses the
new code. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
[Moved "ret < 0" to the else brach, as per David's review. - Radim]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
|
|
Calling handle_mmio_page_fault() has been unnecessary since commit
e9ee956e311d ("KVM: x86: MMU: Move handle_mmio_page_fault() call to
kvm_mmu_page_fault()", 2016-02-22).
handle_mmio_page_fault() can now be made static.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
|
|
Host-initiated writes to the IA32_APIC_BASE MSR do not have to follow
local APIC state transition constraints, but the value written must be
valid.
Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Bailing out immediately if there is no available mmu page to alloc.
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
watchdog: BUG: soft lockup - CPU#5 stuck for 22s! [warn_test:3089]
irq event stamp: 20532
hardirqs last enabled at (20531): [<ffffffff8e9b6908>] restore_regs_and_iret+0x0/0x1d
hardirqs last disabled at (20532): [<ffffffff8e9b7ae8>] apic_timer_interrupt+0x98/0xb0
softirqs last enabled at (8266): [<ffffffff8e9badc6>] __do_softirq+0x206/0x4c1
softirqs last disabled at (8253): [<ffffffff8e083918>] irq_exit+0xf8/0x100
CPU: 5 PID: 3089 Comm: warn_test Tainted: G OE 4.13.0-rc3+ #8
RIP: 0010:kvm_mmu_prepare_zap_page+0x72/0x4b0 [kvm]
Call Trace:
make_mmu_pages_available.isra.120+0x71/0xc0 [kvm]
kvm_mmu_load+0x1cf/0x410 [kvm]
kvm_arch_vcpu_ioctl_run+0x1316/0x1bf0 [kvm]
kvm_vcpu_ioctl+0x340/0x700 [kvm]
? kvm_vcpu_ioctl+0x340/0x700 [kvm]
? __fget+0xfc/0x210
do_vfs_ioctl+0xa4/0x6a0
? __fget+0x11d/0x210
SyS_ioctl+0x79/0x90
entry_SYSCALL_64_fastpath+0x23/0xc2
? __this_cpu_preempt_check+0x13/0x20
This can be reproduced readily by ept=N and running syzkaller tests since
many syzkaller testcases don't setup any memory regions. However, if ept=Y
rmode identity map will be created, then kvm_mmu_calculate_mmu_pages() will
extend the number of VM's mmu pages to at least KVM_MIN_ALLOC_MMU_PAGES
which just hide the issue.
I saw the scenario kvm->arch.n_max_mmu_pages == 0 && kvm->arch.n_used_mmu_pages == 1,
so there is one active mmu page on the list, kvm_mmu_prepare_zap_page() fails
to zap any pages, however prepare_zap_oldest_mmu_page() always returns true.
It incurs infinite loop in make_mmu_pages_available() which causes mmu->lock
softlockup.
This patch fixes it by setting the return value of prepare_zap_oldest_mmu_page()
according to whether or not there is mmu page zapped.
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Let's reuse the function introduced with eptp switching.
We don't explicitly have to check against enable_ept_ad_bits, as this
is implicitly done when checking against nested_vmx_ept_caps in
valid_ept_address().
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
This is the same as commit 147277540bbc ("kvm: svm: Add support for
additional SVM NPF error codes", 2016-11-23), but for Intel processors.
In this case, the exit qualification field's bit 8 says whether the
EPT violation occurred while translating the guest's final physical
address or rather while translating the guest page tables.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Commit 147277540bbc ("kvm: svm: Add support for additional SVM NPF error
codes", 2016-11-23) added a new error code to aid nested page fault
handling. The commit unprotects (kvm_mmu_unprotect_page) the page when
we get a NPF due to guest page table walk where the page was marked RO.
However, if an L0->L2 shadow nested page table can also be marked read-only
when a page is read only in L1's nested page table. If such a page
is accessed by L2 while walking page tables it can cause a nested
page fault (page table walks are write accesses). However, after
kvm_mmu_unprotect_page we may get another page fault, and again in an
endless stream.
To cover this use case, we qualify the new error_code check with
vcpu->arch.mmu_direct_map so that the error_code check would run on L1
guest, and not the L2 guest. This avoids hitting the above scenario.
Fixes: 147277540bbc54119172481c8ef6d930cc9fbfc2
Cc: stable@vger.kernel.org
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Thomas Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Reported by syzkaller:
The kvm-intel.unrestricted_guest=0
WARNING: CPU: 5 PID: 1014 at /home/kernel/data/kvm/arch/x86/kvm//x86.c:7227 kvm_arch_vcpu_ioctl_run+0x38b/0x1be0 [kvm]
CPU: 5 PID: 1014 Comm: warn_test Tainted: G W OE 4.13.0-rc3+ #8
RIP: 0010:kvm_arch_vcpu_ioctl_run+0x38b/0x1be0 [kvm]
Call Trace:
? put_pid+0x3a/0x50
? rcu_read_lock_sched_held+0x79/0x80
? kmem_cache_free+0x2f2/0x350
kvm_vcpu_ioctl+0x340/0x700 [kvm]
? kvm_vcpu_ioctl+0x340/0x700 [kvm]
? __fget+0xfc/0x210
do_vfs_ioctl+0xa4/0x6a0
? __fget+0x11d/0x210
SyS_ioctl+0x79/0x90
entry_SYSCALL_64_fastpath+0x23/0xc2
? __this_cpu_preempt_check+0x13/0x20
The syszkaller folks reported a residual mmio emulation request to userspace
due to vm86 fails to emulate inject real mode interrupt(fails to read CS) and
incurs a triple fault. The vCPU returns to userspace with vcpu->mmio_needed == true
and KVM_EXIT_SHUTDOWN exit reason. However, the syszkaller testcase constructs
several threads to launch the same vCPU, the thread which lauch this vCPU after
the thread whichs get the vcpu->mmio_needed == true and KVM_EXIT_SHUTDOWN will
trigger the warning.
#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>
#include <linux/kvm.h>
#include <stdio.h>
int kvmcpu;
struct kvm_run *run;
void* thr(void* arg)
{
int res;
res = ioctl(kvmcpu, KVM_RUN, 0);
printf("ret1=%d exit_reason=%d suberror=%d\n",
res, run->exit_reason, run->internal.suberror);
return 0;
}
void test()
{
int i, kvm, kvmvm;
pthread_t th[4];
kvm = open("/dev/kvm", O_RDWR);
kvmvm = ioctl(kvm, KVM_CREATE_VM, 0);
kvmcpu = ioctl(kvmvm, KVM_CREATE_VCPU, 0);
run = (struct kvm_run*)mmap(0, 4096, PROT_READ|PROT_WRITE, MAP_SHARED, kvmcpu, 0);
srand(getpid());
for (i = 0; i < 4; i++) {
pthread_create(&th[i], 0, thr, 0);
usleep(rand() % 10000);
}
for (i = 0; i < 4; i++)
pthread_join(th[i], 0);
}
int main()
{
for (;;) {
int pid = fork();
if (pid < 0)
exit(1);
if (pid == 0) {
test();
exit(0);
}
int status;
while (waitpid(pid, &status, __WALL) != pid) {}
}
return 0;
}
This patch fixes it by resetting the vcpu->mmio_needed once we receive
the triple fault to avoid the residue.
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Tested-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
This implements the kvm_arch_vcpu_in_kernel() for ARM, and adjusts
the calls to kvm_vcpu_on_spin().
Signed-off-by: Longpeng(Mike) <longpeng2@huawei.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
This implements kvm_arch_vcpu_in_kernel() for s390. DIAG is a privileged
operation, so it cannot be called from problem state (user mode).
Signed-off-by: Longpeng(Mike) <longpeng2@huawei.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
get_cpl requires vcpu_load, so we must cache the result (whether the
vcpu was preempted when its cpl=0) in kvm_vcpu_arch.
Signed-off-by: Longpeng(Mike) <longpeng2@huawei.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
If a vcpu exits due to request a user mode spinlock, then
the spinlock-holder may be preempted in user mode or kernel mode.
(Note that not all architectures trap spin loops in user mode,
only AMD x86 and ARM/ARM64 currently do).
But if a vcpu exits in kernel mode, then the holder must be
preempted in kernel mode, so we should choose a vcpu in kernel mode
as a more likely candidate for the lock holder.
This introduces kvm_arch_vcpu_in_kernel() to decide whether the
vcpu is in kernel-mode when it's preempted. kvm_vcpu_on_spin's
new argument says the same of the spinning VCPU.
Signed-off-by: Longpeng(Mike) <longpeng2@huawei.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Add guest_cpuid_clear() and use it instead of kvm_find_cpuid_entry().
Also replace some uses of kvm_find_cpuid_entry() with guest_cpuid_has().
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
This patch turns guest_cpuid_has_XYZ(cpuid) into guest_cpuid_has(cpuid,
X86_FEATURE_XYZ), which gets rid of many very similar helpers.
When seeing a X86_FEATURE_*, we can know which cpuid it belongs to, but
this information isn't in common code, so we recreate it for KVM.
Add some BUILD_BUG_ONs to make sure that it runs nicely.
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
bit(X86_FEATURE_NRIPS) is 3 since 2ccd71f1b278 ("x86/cpufeature: Move
some of the scattered feature bits to x86_capability"), so we can
simplify the code.
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
When L2 uses vmfunc, L0 utilizes the associated vmexit to
emulate a switching of the ept pointer by reloading the
guest MMU.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Bandan Das <bsd@redhat.com>
Acked-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
|
|
Expose VMFUNC in MSRs and VMCS fields. No actual VMFUNCs are enabled.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Bandan Das <bsd@redhat.com>
Acked-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
|