Age | Commit message (Collapse) | Author |
|
The largepages debugfs entry is incremented/decremented as shadow
pages are created or destroyed. Clearing it will result in an
underflow, which is harmless to KVM but ugly (and could be
misinterpreted by tools that use debugfs information), so make
this particular statistic read-only.
Cc: kvm-ppc@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fix from Ingo Molnar:
"A kexec fix for the case when GCC_PLUGIN_STACKLEAK=y is enabled"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/purgatory: Disable the stackleak GCC plugin for the purgatory
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security
Pull kernel lockdown mode from James Morris:
"This is the latest iteration of the kernel lockdown patchset, from
Matthew Garrett, David Howells and others.
From the original description:
This patchset introduces an optional kernel lockdown feature,
intended to strengthen the boundary between UID 0 and the kernel.
When enabled, various pieces of kernel functionality are restricted.
Applications that rely on low-level access to either hardware or the
kernel may cease working as a result - therefore this should not be
enabled without appropriate evaluation beforehand.
The majority of mainstream distributions have been carrying variants
of this patchset for many years now, so there's value in providing a
doesn't meet every distribution requirement, but gets us much closer
to not requiring external patches.
There are two major changes since this was last proposed for mainline:
- Separating lockdown from EFI secure boot. Background discussion is
covered here: https://lwn.net/Articles/751061/
- Implementation as an LSM, with a default stackable lockdown LSM
module. This allows the lockdown feature to be policy-driven,
rather than encoding an implicit policy within the mechanism.
The new locked_down LSM hook is provided to allow LSMs to make a
policy decision around whether kernel functionality that would allow
tampering with or examining the runtime state of the kernel should be
permitted.
The included lockdown LSM provides an implementation with a simple
policy intended for general purpose use. This policy provides a coarse
level of granularity, controllable via the kernel command line:
lockdown={integrity|confidentiality}
Enable the kernel lockdown feature. If set to integrity, kernel features
that allow userland to modify the running kernel are disabled. If set to
confidentiality, kernel features that allow userland to extract
confidential information from the kernel are also disabled.
This may also be controlled via /sys/kernel/security/lockdown and
overriden by kernel configuration.
New or existing LSMs may implement finer-grained controls of the
lockdown features. Refer to the lockdown_reason documentation in
include/linux/security.h for details.
The lockdown feature has had signficant design feedback and review
across many subsystems. This code has been in linux-next for some
weeks, with a few fixes applied along the way.
Stephen Rothwell noted that commit 9d1f8be5cf42 ("bpf: Restrict bpf
when kernel lockdown is in confidentiality mode") is missing a
Signed-off-by from its author. Matthew responded that he is providing
this under category (c) of the DCO"
* 'next-lockdown' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (31 commits)
kexec: Fix file verification on S390
security: constify some arrays in lockdown LSM
lockdown: Print current->comm in restriction messages
efi: Restrict efivar_ssdt_load when the kernel is locked down
tracefs: Restrict tracefs when the kernel is locked down
debugfs: Restrict debugfs when the kernel is locked down
kexec: Allow kexec_file() with appropriate IMA policy when locked down
lockdown: Lock down perf when in confidentiality mode
bpf: Restrict bpf when kernel lockdown is in confidentiality mode
lockdown: Lock down tracing and perf kprobes when in confidentiality mode
lockdown: Lock down /proc/kcore
x86/mmiotrace: Lock down the testmmiotrace module
lockdown: Lock down module params that specify hardware parameters (eg. ioport)
lockdown: Lock down TIOCSSERIAL
lockdown: Prohibit PCMCIA CIS storage when the kernel is locked down
acpi: Disable ACPI table override if the kernel is locked down
acpi: Ignore acpi_rsdp kernel param when the kernel has been locked down
ACPI: Limit access to custom_method when the kernel is locked down
x86/msr: Restrict MSR access when the kernel is locked down
x86: Lock down IO port access when the kernel is locked down
...
|
|
Pull more KVM updates from Paolo Bonzini:
"x86 KVM changes:
- The usual accuracy improvements for nested virtualization
- The usual round of code cleanups from Sean
- Added back optimizations that were prematurely removed in 5.2 (the
bare minimum needed to fix the regression was in 5.3-rc8, here
comes the rest)
- Support for UMWAIT/UMONITOR/TPAUSE
- Direct L2->L0 TLB flushing when L0 is Hyper-V and L1 is KVM
- Tell Windows guests if SMT is disabled on the host
- More accurate detection of vmexit cost
- Revert a pvqspinlock pessimization"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (56 commits)
KVM: nVMX: cleanup and fix host 64-bit mode checks
KVM: vmx: fix build warnings in hv_enable_direct_tlbflush() on i386
KVM: x86: Don't check kvm_rebooting in __kvm_handle_fault_on_reboot()
KVM: x86: Drop ____kvm_handle_fault_on_reboot()
KVM: VMX: Add error handling to VMREAD helper
KVM: VMX: Optimize VMX instruction error and fault handling
KVM: x86: Check kvm_rebooting in kvm_spurious_fault()
KVM: selftests: fix ucall on x86
Revert "locking/pvqspinlock: Don't wait if vCPU is preempted"
kvm: nvmx: limit atomic switch MSRs
kvm: svm: Intercept RDPRU
kvm: x86: Add "significant index" flag to a few CPUID leaves
KVM: x86/mmu: Skip invalid pages during zapping iff root_count is zero
KVM: x86/mmu: Explicitly track only a single invalid mmu generation
KVM: x86/mmu: Revert "KVM: x86/mmu: Remove is_obsolete() call"
KVM: x86/mmu: Revert "Revert "KVM: MMU: reclaim the zapped-obsolete page first""
KVM: x86/mmu: Revert "Revert "KVM: MMU: collapse TLB flushes when zap all pages""
KVM: x86/mmu: Revert "Revert "KVM: MMU: zap pages in batch""
KVM: x86/mmu: Revert "Revert "KVM: MMU: add tracepoint for kvm_mmu_invalidate_all_pages""
KVM: x86/mmu: Revert "Revert "KVM: MMU: show mmu_valid_gen in shadow page related tracepoints""
...
|
|
The l1tf_vmx_mitigation is only set to VMENTER_L1D_FLUSH_NOT_REQUIRED
when the ARCH_CAPABILITIES MSR indicates that L1D flush is not required.
However, if the CPU is not affected by L1TF, l1tf_vmx_mitigation will
still be set to VMENTER_L1D_FLUSH_AUTO. This is certainly not the best
option for a !X86_BUG_L1TF CPU.
So force l1tf_vmx_mitigation to VMENTER_L1D_FLUSH_NOT_REQUIRED to make it
more explicit in case users are checking the vmentry_l1d_flush parameter.
Signed-off-by: Waiman Long <longman@redhat.com>
[Patch rewritten accoring to Borislav Petkov's suggestion. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Shadow paging is fundamentally incompatible with the page-modification
log, because the GPAs in the log come from the wrong memory map.
In particular, for the EPT page-modification log, the GPAs in the log come
from L2 rather than L1. (If there was a non-EPT page-modification log,
we couldn't use it for shadow paging because it would log GVAs rather
than GPAs).
Therefore, we need to rely on write protection to record dirty pages.
This has the side effect of bypassing PML, since writes now result in an
EPT violation vmexit.
This is relatively easy to add to KVM, because pretty much the only place
that needs changing is spte_clear_dirty. The first access to the page
already goes through the page fault path and records the correct GPA;
it's only subsequent accesses that are wrong. Therefore, we can equip
set_spte (where the first access happens) to record that the SPTE will
have to be write protected, and then spte_clear_dirty will use this
information to do the right thing.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Currently, we are overloading SPTE_SPECIAL_MASK to mean both
"A/D bits unavailable" and MMIO, where the difference between the
two is determined by mio_mask and mmio_value.
However, the next patch will need two bits to distinguish
availability of A/D bits from write protection. So, while at
it give MMIO its own bit pattern, and move the two bits from
bit 62 to bits 52..53 since Intel is allocating EPT page table
bits from the top.
Reviewed-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
The naming of pgtable_page_{ctor,dtor}() seems to have confused a few
people, and until recently arm64 used these erroneously/pointlessly for
other levels of page table.
To make it incredibly clear that these only apply to the PTE level, and to
align with the naming of pgtable_pmd_page_{ctor,dtor}(), let's rename them
to pgtable_pte_page_{ctor,dtor}().
These changes were generated with the following shell script:
----
git grep -lw 'pgtable_page_.tor' | while read FILE; do
sed -i '{s/pgtable_page_ctor/pgtable_pte_page_ctor/}' $FILE;
sed -i '{s/pgtable_page_dtor/pgtable_pte_page_dtor/}' $FILE;
done
----
... with the documentation re-flowed to remain under 80 columns, and
whitespace fixed up in macros to keep backslashes aligned.
There should be no functional change as a result of this patch.
Link: http://lkml.kernel.org/r/20190722141133.3116-1-mark.rutland@arm.com
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> [m68k]
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
I was surprised to see that the guest reported `fxsave_leak' while the
host did not. After digging deeper I noticed that the bits are simply
masked out during enumeration.
The XSAVEERPTR feature is actually a bug fix on AMD which means the
kernel can disable a workaround.
Pass XSAVEERPTR to the guest if available on the host.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
CLZERO is available to the guest if it is supported on the
host. Therefore, enumerate support for the instruction in
KVM_GET_SUPPORTED_CPUID whenever it is supported on the host.
Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
When the guest CPUID information represents an AMD vCPU, return all
zeroes for queries of undefined CPUID leaves, whether or not they are
in range.
Signed-off-by: Jim Mattson <jmattson@google.com>
Fixes: bd22f5cfcfe8f6 ("KVM: move and fix substitue search for missing CPUID entries")
Reviewed-by: Marc Orr <marcorr@google.com>
Reviewed-by: Peter Shier <pshier@google.com>
Reviewed-by: Jacob Xu <jacobhxu@google.com>
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
For these CPUID leaves, the EDX output is not dependent on the ECX
input (i.e. the SIGNIFCANT_INDEX flag doesn't apply to
EDX). Furthermore, the low byte of the ECX output is always identical
to the low byte of the ECX input. KVM does not produce the correct ECX
and EDX outputs for any undefined subleaves beyond the first.
Special-case these CPUID leaves in kvm_cpuid, so that the ECX and EDX
outputs are properly generated for all undefined subleaves.
Fixes: 0771671749b59a ("KVM: Enhance guest cpuid management")
Fixes: a87f2d3a6eadab ("KVM: x86: Add Intel CPUID.1F cpuid emulation support")
Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Marc Orr <marcorr@google.com>
Reviewed-by: Peter Shier <pshier@google.com>
Reviewed-by: Jacob Xu <jacobhxu@google.com>
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Reported by syzkaller:
WARNING: CPU: 0 PID: 6544 at /home/kernel/data/kvm/arch/x86/kvm//vmx/vmx.c:4689 handle_desc+0x37/0x40 [kvm_intel]
CPU: 0 PID: 6544 Comm: a.out Tainted: G OE 5.3.0-rc4+ #4
RIP: 0010:handle_desc+0x37/0x40 [kvm_intel]
Call Trace:
vmx_handle_exit+0xbe/0x6b0 [kvm_intel]
vcpu_enter_guest+0x4dc/0x18d0 [kvm]
kvm_arch_vcpu_ioctl_run+0x407/0x660 [kvm]
kvm_vcpu_ioctl+0x3ad/0x690 [kvm]
do_vfs_ioctl+0xa2/0x690
ksys_ioctl+0x6d/0x80
__x64_sys_ioctl+0x1a/0x20
do_syscall_64+0x74/0x720
entry_SYSCALL_64_after_hwframe+0x49/0xbe
When CR4.UMIP is set, guest should have UMIP cpuid flag. Current
kvm set_sregs function doesn't have such check when userspace inputs
sregs values. SECONDARY_EXEC_DESC is enabled on writes to CR4.UMIP
in vmx_set_cr4 though guest doesn't have UMIP cpuid flag. The testcast
triggers handle_desc warning when executing ltr instruction since
guest architectural CR4 doesn't set UMIP. This patch fixes it by
adding valid CR4 and CPUID combination checking in __set_sregs.
syzkaller source: https://syzkaller.appspot.com/x/repro.c?x=138efb99600000
Reported-by: syzbot+0f1819555fbdce992df9@syzkaller.appspotmail.com
Cc: stable@vger.kernel.org
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Don't return -E2BIG from __do_cpuid_func when processing function 0BH
or 1FH and the last interesting subleaf occupies the last allocated
entry in the result array.
Cc: Paolo Bonzini <pbonzini@redhat.com>
Fixes: 831bf664e9c1fc ("KVM: Refactor and simplify kvm_dev_ioctl_get_supported_cpuid")
Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Peter Shier <pshier@google.com>
Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
5000 guest cycles delta is easy to encounter on desktop, per-vCPU
lapic_timer_advance_ns always keeps at 1000ns initial value, let's
loosen the filter a bit to let adaptive tuning make progress.
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Add RB_DECLARE_CALLBACKS_MAX, which generates augmented rbtree callbacks
for the case where the augmented value is a scalar whose definition
follows a max(f(node)) pattern. This actually covers all present uses of
RB_DECLARE_CALLBACKS, and saves some (source) code duplication in the
various RBCOMPUTE function definitions.
[walken@google.com: fix mm/vmalloc.c]
Link: http://lkml.kernel.org/r/CANN689FXgK13wDYNh1zKxdipeTuALG4eKvKpsdZqKFJ-rvtGiQ@mail.gmail.com
[walken@google.com: re-add check to check_augmented()]
Link: http://lkml.kernel.org/r/20190727022027.GA86863@google.com
Link: http://lkml.kernel.org/r/20190703040156.56953-3-walken@google.com
Signed-off-by: Michel Lespinasse <walken@google.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: Uladzislau Rezki <urezki@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
KVM was incorrectly checking vmcs12->host_ia32_efer even if the "load
IA32_EFER" exit control was reset. Also, some checks were not using
the new CC macro for tracing.
Cleanup everything so that the vCPU's 64-bit mode is determined
directly from EFER_LMA and the VMCS checks are based on that, which
matches section 26.2.4 of the SDM.
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Fixes: 5845038c111db27902bc220a4f70070fe945871c
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
The following was reported on i386:
arch/x86/kvm/vmx/vmx.c: In function 'hv_enable_direct_tlbflush':
arch/x86/kvm/vmx/vmx.c:503:10: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]
pr_debugs() in this function are more or less useless, let's just
remove them. evmcs->hv_vm_id can use 'unsigned long' instead of 'u64'.
Also, simplify the code a little bit.
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Remove the kvm_rebooting check from VMX/SVM instruction exception fixup
now that kvm_spurious_fault() conditions its BUG() on !kvm_rebooting.
Because the 'cleanup_insn' functionally is also gone, deferring to
kvm_spurious_fault() means __kvm_handle_fault_on_reboot() can eliminate
its .fixup code entirely and have its exception table entry branch
directly to the call to kvm_spurious_fault().
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Remove the variation of __kvm_handle_fault_on_reboot() that accepts a
post-fault cleanup instruction now that its sole user (VMREAD) uses
a different method for handling faults.
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Now that VMREAD flows require a taken branch, courtesy of commit
3901336ed9887 ("x86/kvm: Don't call kvm_spurious_fault() from .fixup")
bite the bullet and add full error handling to VMREAD, i.e. replace the
JMP added by __ex()/____kvm_handle_fault_on_reboot() with a hinted Jcc.
To minimize the code footprint, add a helper function, vmread_error(),
to handle both faults and failures so that the inline flow has a single
CALL.
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Rework the VMX instruction helpers using asm-goto to branch directly
to error/fault "handlers" in lieu of using __ex(), i.e. the generic
____kvm_handle_fault_on_reboot(). Branching directly to fault handling
code during fixup avoids the extra JMP that is inserted after every VMX
instruction when using the generic "fault on reboot" (see commit
3901336ed9887, "x86/kvm: Don't call kvm_spurious_fault() from .fixup").
Opportunistically clean up the helpers so that they all have consistent
error handling and messages.
Leave the usage of ____kvm_handle_fault_on_reboot() (via __ex()) in
kvm_cpu_vmxoff() and nested_vmx_check_vmentry_hw() as is. The VMXOFF
case is not a fast path, i.e. the cleanliness of __ex() is worth the
JMP, and the extra JMP in nested_vmx_check_vmentry_hw() is unavoidable.
Note, VMREAD cannot get the asm-goto treatment as output operands aren't
compatible with GCC's asm-goto due to internal compiler restrictions.
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Explicitly check kvm_rebooting in kvm_spurious_fault() prior to invoking
BUG(), as opposed to assuming the caller has already done so. Letting
kvm_spurious_fault() be called "directly" will allow VMX to better
optimize its low level assembly flows.
As a happy side effect, kvm_spurious_fault() no longer needs to be
marked as a dead end since it doesn't unconditionally BUG().
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Both pgtable_cache_init() and pgd_cache_init() are used to initialize kmem
cache for page table allocations on several architectures that do not use
PAGE_SIZE tables for one or more levels of the page table hierarchy.
Most architectures do not implement these functions and use __weak default
NOP implementation of pgd_cache_init(). Since there is no such default
for pgtable_cache_init(), its empty stub is duplicated among most
architectures.
Rename the definitions of pgd_cache_init() to pgtable_cache_init() and
drop empty stubs of pgtable_cache_init().
Link: http://lkml.kernel.org/r/1566457046-22637-1-git-send-email-rppt@linux.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Will Deacon <will@kernel.org> [arm64]
Acked-by: Thomas Gleixner <tglx@linutronix.de> [x86]
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Patch series "mm: remove quicklist page table caches".
A while ago Nicholas proposed to remove quicklist page table caches [1].
I've rebased his patch on the curren upstream and switched ia64 and sh to
use generic versions of PTE allocation.
[1] https://lore.kernel.org/linux-mm/20190711030339.20892-1-npiggin@gmail.com
This patch (of 3):
Remove page table allocator "quicklists". These have been around for a
long time, but have not got much traction in the last decade and are only
used on ia64 and sh architectures.
The numbers in the initial commit look interesting but probably don't
apply anymore. If anybody wants to resurrect this it's in the git
history, but it's unhelpful to have this code and divergent allocator
behaviour for minor archs.
Also it might be better to instead make more general improvements to page
allocator if this is still so slow.
Link: http://lkml.kernel.org/r/1565250728-21721-2-git-send-email-rppt@linux.ibm.com
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Allowing an unlimited number of MSRs to be specified via the VMX
load/store MSR lists (e.g., vm-entry MSR load list) is bad for two
reasons. First, a guest can specify an unreasonable number of MSRs,
forcing KVM to process all of them in software. Second, the SDM bounds
the number of MSRs allowed to be packed into the atomic switch MSR lists.
Quoting the "Miscellaneous Data" section in the "VMX Capability
Reporting Facility" appendix:
"Bits 27:25 is used to compute the recommended maximum number of MSRs
that should appear in the VM-exit MSR-store list, the VM-exit MSR-load
list, or the VM-entry MSR-load list. Specifically, if the value bits
27:25 of IA32_VMX_MISC is N, then 512 * (N + 1) is the recommended
maximum number of MSRs to be included in each list. If the limit is
exceeded, undefined processor behavior may result (including a machine
check during the VMX transition)."
Because KVM needs to protect itself and can't model "undefined processor
behavior", arbitrarily force a VM-entry to fail due to MSR loading when
the MSR load list is too large. Similarly, trigger an abort during a VM
exit that encounters an MSR load list or MSR store list that is too large.
The MSR list size is intentionally not pre-checked so as to maintain
compatibility with hardware inasmuch as possible.
Test these new checks with the kvm-unit-test "x86: nvmx: test max atomic
switch MSRs".
Suggested-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Peter Shier <pshier@google.com>
Signed-off-by: Marc Orr <marcorr@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
The RDPRU instruction gives the guest read access to the IA32_APERF
MSR and the IA32_MPERF MSR. According to volume 3 of the APM, "When
virtualization is enabled, this instruction can be intercepted by the
Hypervisor. The intercept bit is at VMCB byte offset 10h, bit 14."
Since we don't enumerate the instruction in KVM_SUPPORTED_CPUID,
intercept it and synthesize #UD.
Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Drew Schmitt <dasch@google.com>
Reviewed-by: Jacob Xu <jacobhxu@google.com>
Reviewed-by: Peter Shier <pshier@google.com>
Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
According to the Intel SDM, volume 2, "CPUID," the index is
significant (or partially significant) for CPUID leaves 0FH, 10H, 12H,
17H, 18H, and 1FH.
Add the corresponding flag to these CPUID leaves in do_host_cpuid().
Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Peter Shier <pshier@google.com>
Reviewed-by: Steve Rutherford <srutherford@google.com>
Fixes: a87f2d3a6eadab ("KVM: x86: Add Intel CPUID.1F cpuid emulation support")
Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Do not skip invalid shadow pages when zapping obsolete pages if the
pages' root_count has reached zero, in which case the page can be
immediately zapped and freed.
Update the comment accordingly.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Toggle mmu_valid_gen between '0' and '1' instead of blindly incrementing
the generation. Because slots_lock is held for the entire duration of
zapping obsolete pages, it's impossible for there to be multiple invalid
generations associated with shadow pages at any given time.
Toggling between the two generations (valid vs. invalid) allows changing
mmu_valid_gen from an unsigned long to a u8, which reduces the size of
struct kvm_mmu_page from 160 to 152 bytes on 64-bit KVM, i.e. reduces
KVM's memory footprint by 8 bytes per shadow page.
Set sp->mmu_valid_gen before it is added to active_mmu_pages.
Functionally this has no effect as kvm_mmu_alloc_page() has a single
caller that sets sp->mmu_valid_gen soon thereafter, but visually it is
jarring to see a shadow page being added to the list without its
mmu_valid_gen first being set.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Now that the fast invalidate mechanism has been reintroduced, restore
the performance tweaks for fast invalidation that existed prior to its
removal.
Paraphrasing the original changelog (commit 5ff0568374ed2 was itself a
partial revert):
Don't force reloading the remote mmu when zapping an obsolete page, as
a MMU_RELOAD request has already been issued by kvm_mmu_zap_all_fast()
immediately after incrementing mmu_valid_gen, i.e. after marking pages
obsolete.
This reverts commit 5ff0568374ed2e585376a3832857ade5daccd381.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Now that the fast invalidate mechanism has been reintroduced, restore
the performance tweaks for fast invalidation that existed prior to its
removal.
Paraphrashing the original changelog:
Introduce a per-VM list to track obsolete shadow pages, i.e. pages
which have been deleted from the mmu cache but haven't yet been freed.
When page reclaiming is needed, zap/free the deleted pages first.
This reverts commit 52d5dedc79bdcbac2976159a172069618cf31be5.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
pages""
Now that the fast invalidate mechanism has been reintroduced, restore
the performance tweaks for fast invalidation that existed prior to its
removal.
Paraphrashing the original changelog:
Reload the mmu on all vCPUs after updating the generation number so
that obsolete pages are not used by any vCPUs. This allows collapsing
all TLB flushes during obsolete page zapping into a single flush, as
there is no need to flush when dropping mmu_lock (to reschedule).
Note: a remote TLB flush is still needed before freeing the pages as
other vCPUs may be doing a lockless shadow page walk.
Opportunstically improve the comments restored by the revert (the
code itself is a true revert).
This reverts commit f34d251d66ba263c077ed9d2bbd1874339a4c887.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Now that the fast invalidate mechanism has been reintroduced, restore
the performance tweaks for fast invalidation that existed prior to its
removal.
Paraphrashing the original changelog:
Zap at least 10 shadow pages before releasing mmu_lock to reduce the
overhead associated with re-acquiring the lock.
Note: "10" is an arbitrary number, speculated to be high enough so
that a vCPU isn't stuck zapping obsolete pages for an extended period,
but small enough so that other vCPUs aren't starved waiting for
mmu_lock.
This reverts commit 43d2b14b105fb00b8864c7b0ee7043cc1cc4a969.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
kvm_mmu_invalidate_all_pages""
Now that the fast invalidate mechanism has been reintroduced, restore
the tracepoint associated with said mechanism.
Note, the name of the tracepoint deviates from the original tracepoint
so as to match KVM's current nomenclature.
This reverts commit 42560fb1f3c6c7f730897b7fa7a478bc37e0be50.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
related tracepoints""
Now that the fast invalidate mechanism has been reintroduced, restore
tracing of the generation number in shadow page tracepoints.
This reverts commit b59c4830ca185ba0e9f9e046fb1cd10a4a92627a.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Use the fast invalidate mechasim to zap MMIO sptes on a MMIO generation
wrap. The fast invalidate flow was reintroduced to fix a livelock bug
in kvm_mmu_zap_all() that can occur if kvm_mmu_zap_all() is invoked when
the guest has live vCPUs. I.e. using kvm_mmu_zap_all() to handle the
MMIO generation wrap is theoretically susceptible to the livelock bug.
This effectively reverts commit 4771450c345dc ("Revert "KVM: MMU: drop
kvm_mmu_zap_mmio_sptes""), i.e. restores the behavior of commit
a8eca9dcc656a ("KVM: MMU: drop kvm_mmu_zap_mmio_sptes").
Note, this actually fixes commit 571c5af06e303 ("KVM: x86/mmu:
Voluntarily reschedule as needed when zapping MMIO sptes"), but there
is no need to incrementally revert back to using fast invalidate, e.g.
doing so doesn't provide any bisection or stability benefits.
Fixes: 571c5af06e303 ("KVM: x86/mmu: Voluntarily reschedule as needed when zapping MMIO sptes")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Treat invalid shadow pages as obsolete to fix a bug where an obsolete
and invalid page with a non-zero root count could become non-obsolete
due to mmu_valid_gen wrapping. The bug is largely theoretical with the
current code base, as an unsigned long will effectively never wrap on
64-bit KVM, and userspace would have to deliberately stall a vCPU in
order to keep an obsolete invalid page on the active list while
simultaneously modifying memslots billions of times to trigger a wrap.
The obvious alternative is to use a 64-bit value for mmu_valid_gen,
but it's actually desirable to go in the opposite direction, i.e. using
a smaller 8-bit value to reduce KVM's memory footprint by 8 bytes per
shadow page, and relying on proper treatment of invalid pages instead of
preventing the generation from wrapping.
Note, "Fixes" points at a commit that was at one point reverted, but has
since been restored.
Fixes: 5304b8d37c2a5 ("KVM: MMU: fast invalidate all pages")
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Filter out drastic fluctuation and random fluctuation, remove
timer_advance_adjust_done altogether, the adjustment would be
continuous.
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
As the latest Intel 64 and IA-32 Architectures Software Developer's
Manual, UMWAIT and TPAUSE instructions cause a VM exit if the
RDTSC exiting and enable user wait and pause VM-execution
controls are both 1.
Because KVM never enable RDTSC exiting, the vm-exit for UMWAIT and TPAUSE
should never happen. Considering EXIT_REASON_XSAVES and
EXIT_REASON_XRSTORS is also unexpected VM-exit for KVM. Introduce a common
exit helper handle_unexpected_vmexit() to handle these unexpected VM-exit.
Suggested-by: Sean Christopherson <sean.j.christopherson@intel.com>
Co-developed-by: Jingqi Liu <jingqi.liu@intel.com>
Signed-off-by: Jingqi Liu <jingqi.liu@intel.com>
Signed-off-by: Tao Xu <tao3.xu@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
UMWAIT and TPAUSE instructions use 32bit IA32_UMWAIT_CONTROL at MSR index
E1H to determines the maximum time in TSC-quanta that the processor can
reside in either C0.1 or C0.2.
This patch emulates MSR IA32_UMWAIT_CONTROL in guest and differentiate
IA32_UMWAIT_CONTROL between host and guest. The variable
mwait_control_cached in arch/x86/kernel/cpu/umwait.c caches the MSR value,
so this patch uses it to avoid frequently rdmsr of IA32_UMWAIT_CONTROL.
Co-developed-by: Jingqi Liu <jingqi.liu@intel.com>
Signed-off-by: Jingqi Liu <jingqi.liu@intel.com>
Signed-off-by: Tao Xu <tao3.xu@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
UMONITOR, UMWAIT and TPAUSE are a set of user wait instructions.
This patch adds support for user wait instructions in KVM. Availability
of the user wait instructions is indicated by the presence of the CPUID
feature flag WAITPKG CPUID.0x07.0x0:ECX[5]. User wait instructions may
be executed at any privilege level, and use 32bit IA32_UMWAIT_CONTROL MSR
to set the maximum time.
The behavior of user wait instructions in VMX non-root operation is
determined first by the setting of the "enable user wait and pause"
secondary processor-based VM-execution control bit 26.
If the VM-execution control is 0, UMONITOR/UMWAIT/TPAUSE cause
an invalid-opcode exception (#UD).
If the VM-execution control is 1, treatment is based on the
setting of the “RDTSC exiting†VM-execution control. Because KVM never
enables RDTSC exiting, if the instruction causes a delay, the amount of
time delayed is called here the physical delay. The physical delay is
first computed by determining the virtual delay. If
IA32_UMWAIT_CONTROL[31:2] is zero, the virtual delay is the value in
EDX:EAX minus the value that RDTSC would return; if
IA32_UMWAIT_CONTROL[31:2] is not zero, the virtual delay is the minimum
of that difference and AND(IA32_UMWAIT_CONTROL,FFFFFFFCH).
Because umwait and tpause can put a (psysical) CPU into a power saving
state, by default we dont't expose it to kvm and enable it only when
guest CPUID has it.
Detailed information about user wait instructions can be found in the
latest Intel 64 and IA-32 Architectures Software Developer's Manual.
Co-developed-by: Jingqi Liu <jingqi.liu@intel.com>
Signed-off-by: Jingqi Liu <jingqi.liu@intel.com>
Signed-off-by: Tao Xu <tao3.xu@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Document the intended usage of each emulation type as each exists to
handle an edge case of one kind or another and can be easily
misinterpreted at first glance.
Cc: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
VMX's EPT misconfig flow to handle fast-MMIO path falls back to decoding
the instruction to determine the instruction length when running as a
guest (Hyper-V doesn't fill VMCS.VM_EXIT_INSTRUCTION_LEN because it's
technically not defined for EPT misconfigs). Rather than implement the
slow skip in VMX's generic skip_emulated_instruction(),
handle_ept_misconfig() directly calls kvm_emulate_instruction() with
EMULTYPE_SKIP, which intentionally doesn't do single-step detection, and
so handle_ept_misconfig() misses a single-step #DB.
Rework the EPT misconfig fallback case to route it through
kvm_skip_emulated_instruction() so that single-step #DBs and interrupt
shadow updates are handled automatically. I.e. make VMX's slow skip
logic match SVM's and have the SVM flow not intentionally avoid the
shadow update.
Alternatively, the handle_ept_misconfig() could manually handle single-
step detection, but that results in EMULTYPE_SKIP having split logic for
the interrupt shadow vs. single-step #DBs, and split emulator logic is
largely what led to this mess in the first place.
Modifying SVM to mirror VMX flow isn't really an option as SVM's case
isn't limited to a specific exit reason, i.e. handling the slow skip in
skip_emulated_instruction() is mandatory for all intents and purposes.
Drop VMX's skip_emulated_instruction() wrapper since it can now fail,
and instead WARN if it fails unexpectedly, e.g. if exit_reason somehow
becomes corrupted.
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Fixes: d391f12070672 ("x86/kvm/vmx: do not use vm-exit instruction length for fast MMIO when running nested")
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Deferring emulation failure handling (in some cases) to the caller of
x86_emulate_instruction() has proven fragile, e.g. multiple instances of
KVM not setting run->exit_reason on EMULATE_FAIL, largely due to it
being difficult to discern what emulation types can return what result,
and which combination of types and results are handled where.
Now that x86_emulate_instruction() always handles emulation failure,
i.e. EMULATION_FAIL is only referenced in callers, remove the
emulation_result enums entirely. Per KVM's existing exit handling
conventions, return '0' and '1' for "exit to userspace" and "resume
guest" respectively. Doing so cleans up many callers, e.g. they can
return kvm_emulate_instruction() directly instead of having to interpret
its result.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Now that EMULATE_FAIL is completely unused, remove the last remaning
usage where KVM does something functional in response to EMULATE_FAIL.
Leave the check in place as a WARN_ON_ONCE to provide a better paper
trail when EMULATE_{DONE,FAIL,USER_EXIT} are completely removed.
Opportunistically remove the gotos in handle_invalid_guest_state().
With the EMULATE_FAIL handling gone there is no need to have a common
handler for emulation failure and the gotos only complicate things,
e.g. the signal_pending() check always returns '1', but this is far
from obvious when glancing through the code.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Request triple fault in kvm_inject_realmode_interrupt() instead of
returning EMULATE_FAIL and deferring to the caller. All existing
callers request triple fault and it's highly unlikely Real Mode is
going to acquire new features. While this consolidates a small amount
of code, the real goal is to remove the last reference to EMULATE_FAIL.
No functional change intended.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Consolidate the reporting of emulation failure into kvm_task_switch()
so that it can return EMULATE_USER_EXIT. This helps pave the way for
removing EMULATE_FAIL altogether.
This also fixes a theoretical bug where task switch interception could
suppress an EMULATE_USER_EXIT return.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Kill a few birds with one stone by forcing an exit to userspace on skip
emulation failure. This removes a reference to EMULATE_FAIL, fixes a
bug in handle_ept_misconfig() where it would exit to userspace without
setting run->exit_reason, and fixes a theoretical bug in SVM's
task_switch_interception() where it would overwrite run->exit_reason on
a return of EMULATE_USER_EXIT.
Note, this technically doesn't fully fix task_switch_interception()
as it now incorrectly handles EMULATE_FAIL, but in practice there is no
bug as EMULATE_FAIL will never be returned for EMULTYPE_SKIP.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Immediately inject a #UD and return EMULATE done if emulation fails when
handling an intercepted #UD. This helps pave the way for removing
EMULATE_FAIL altogether.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|