Age | Commit message (Collapse) | Author |
|
According to the Intel SDM, volume 2, "CPUID," the index is
significant (or partially significant) for CPUID leaves 0FH, 10H, 12H,
17H, 18H, and 1FH.
Add the corresponding flag to these CPUID leaves in do_host_cpuid().
Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Peter Shier <pshier@google.com>
Reviewed-by: Steve Rutherford <srutherford@google.com>
Fixes: a87f2d3a6eadab ("KVM: x86: Add Intel CPUID.1F cpuid emulation support")
Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Do not skip invalid shadow pages when zapping obsolete pages if the
pages' root_count has reached zero, in which case the page can be
immediately zapped and freed.
Update the comment accordingly.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Toggle mmu_valid_gen between '0' and '1' instead of blindly incrementing
the generation. Because slots_lock is held for the entire duration of
zapping obsolete pages, it's impossible for there to be multiple invalid
generations associated with shadow pages at any given time.
Toggling between the two generations (valid vs. invalid) allows changing
mmu_valid_gen from an unsigned long to a u8, which reduces the size of
struct kvm_mmu_page from 160 to 152 bytes on 64-bit KVM, i.e. reduces
KVM's memory footprint by 8 bytes per shadow page.
Set sp->mmu_valid_gen before it is added to active_mmu_pages.
Functionally this has no effect as kvm_mmu_alloc_page() has a single
caller that sets sp->mmu_valid_gen soon thereafter, but visually it is
jarring to see a shadow page being added to the list without its
mmu_valid_gen first being set.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Now that the fast invalidate mechanism has been reintroduced, restore
the performance tweaks for fast invalidation that existed prior to its
removal.
Paraphrasing the original changelog (commit 5ff0568374ed2 was itself a
partial revert):
Don't force reloading the remote mmu when zapping an obsolete page, as
a MMU_RELOAD request has already been issued by kvm_mmu_zap_all_fast()
immediately after incrementing mmu_valid_gen, i.e. after marking pages
obsolete.
This reverts commit 5ff0568374ed2e585376a3832857ade5daccd381.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Now that the fast invalidate mechanism has been reintroduced, restore
the performance tweaks for fast invalidation that existed prior to its
removal.
Paraphrashing the original changelog:
Introduce a per-VM list to track obsolete shadow pages, i.e. pages
which have been deleted from the mmu cache but haven't yet been freed.
When page reclaiming is needed, zap/free the deleted pages first.
This reverts commit 52d5dedc79bdcbac2976159a172069618cf31be5.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
pages""
Now that the fast invalidate mechanism has been reintroduced, restore
the performance tweaks for fast invalidation that existed prior to its
removal.
Paraphrashing the original changelog:
Reload the mmu on all vCPUs after updating the generation number so
that obsolete pages are not used by any vCPUs. This allows collapsing
all TLB flushes during obsolete page zapping into a single flush, as
there is no need to flush when dropping mmu_lock (to reschedule).
Note: a remote TLB flush is still needed before freeing the pages as
other vCPUs may be doing a lockless shadow page walk.
Opportunstically improve the comments restored by the revert (the
code itself is a true revert).
This reverts commit f34d251d66ba263c077ed9d2bbd1874339a4c887.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Now that the fast invalidate mechanism has been reintroduced, restore
the performance tweaks for fast invalidation that existed prior to its
removal.
Paraphrashing the original changelog:
Zap at least 10 shadow pages before releasing mmu_lock to reduce the
overhead associated with re-acquiring the lock.
Note: "10" is an arbitrary number, speculated to be high enough so
that a vCPU isn't stuck zapping obsolete pages for an extended period,
but small enough so that other vCPUs aren't starved waiting for
mmu_lock.
This reverts commit 43d2b14b105fb00b8864c7b0ee7043cc1cc4a969.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
kvm_mmu_invalidate_all_pages""
Now that the fast invalidate mechanism has been reintroduced, restore
the tracepoint associated with said mechanism.
Note, the name of the tracepoint deviates from the original tracepoint
so as to match KVM's current nomenclature.
This reverts commit 42560fb1f3c6c7f730897b7fa7a478bc37e0be50.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
related tracepoints""
Now that the fast invalidate mechanism has been reintroduced, restore
tracing of the generation number in shadow page tracepoints.
This reverts commit b59c4830ca185ba0e9f9e046fb1cd10a4a92627a.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Use the fast invalidate mechasim to zap MMIO sptes on a MMIO generation
wrap. The fast invalidate flow was reintroduced to fix a livelock bug
in kvm_mmu_zap_all() that can occur if kvm_mmu_zap_all() is invoked when
the guest has live vCPUs. I.e. using kvm_mmu_zap_all() to handle the
MMIO generation wrap is theoretically susceptible to the livelock bug.
This effectively reverts commit 4771450c345dc ("Revert "KVM: MMU: drop
kvm_mmu_zap_mmio_sptes""), i.e. restores the behavior of commit
a8eca9dcc656a ("KVM: MMU: drop kvm_mmu_zap_mmio_sptes").
Note, this actually fixes commit 571c5af06e303 ("KVM: x86/mmu:
Voluntarily reschedule as needed when zapping MMIO sptes"), but there
is no need to incrementally revert back to using fast invalidate, e.g.
doing so doesn't provide any bisection or stability benefits.
Fixes: 571c5af06e303 ("KVM: x86/mmu: Voluntarily reschedule as needed when zapping MMIO sptes")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Treat invalid shadow pages as obsolete to fix a bug where an obsolete
and invalid page with a non-zero root count could become non-obsolete
due to mmu_valid_gen wrapping. The bug is largely theoretical with the
current code base, as an unsigned long will effectively never wrap on
64-bit KVM, and userspace would have to deliberately stall a vCPU in
order to keep an obsolete invalid page on the active list while
simultaneously modifying memslots billions of times to trigger a wrap.
The obvious alternative is to use a 64-bit value for mmu_valid_gen,
but it's actually desirable to go in the opposite direction, i.e. using
a smaller 8-bit value to reduce KVM's memory footprint by 8 bytes per
shadow page, and relying on proper treatment of invalid pages instead of
preventing the generation from wrapping.
Note, "Fixes" points at a commit that was at one point reverted, but has
since been restored.
Fixes: 5304b8d37c2a5 ("KVM: MMU: fast invalidate all pages")
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Filter out drastic fluctuation and random fluctuation, remove
timer_advance_adjust_done altogether, the adjustment would be
continuous.
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
As the latest Intel 64 and IA-32 Architectures Software Developer's
Manual, UMWAIT and TPAUSE instructions cause a VM exit if the
RDTSC exiting and enable user wait and pause VM-execution
controls are both 1.
Because KVM never enable RDTSC exiting, the vm-exit for UMWAIT and TPAUSE
should never happen. Considering EXIT_REASON_XSAVES and
EXIT_REASON_XRSTORS is also unexpected VM-exit for KVM. Introduce a common
exit helper handle_unexpected_vmexit() to handle these unexpected VM-exit.
Suggested-by: Sean Christopherson <sean.j.christopherson@intel.com>
Co-developed-by: Jingqi Liu <jingqi.liu@intel.com>
Signed-off-by: Jingqi Liu <jingqi.liu@intel.com>
Signed-off-by: Tao Xu <tao3.xu@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
UMWAIT and TPAUSE instructions use 32bit IA32_UMWAIT_CONTROL at MSR index
E1H to determines the maximum time in TSC-quanta that the processor can
reside in either C0.1 or C0.2.
This patch emulates MSR IA32_UMWAIT_CONTROL in guest and differentiate
IA32_UMWAIT_CONTROL between host and guest. The variable
mwait_control_cached in arch/x86/kernel/cpu/umwait.c caches the MSR value,
so this patch uses it to avoid frequently rdmsr of IA32_UMWAIT_CONTROL.
Co-developed-by: Jingqi Liu <jingqi.liu@intel.com>
Signed-off-by: Jingqi Liu <jingqi.liu@intel.com>
Signed-off-by: Tao Xu <tao3.xu@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
UMONITOR, UMWAIT and TPAUSE are a set of user wait instructions.
This patch adds support for user wait instructions in KVM. Availability
of the user wait instructions is indicated by the presence of the CPUID
feature flag WAITPKG CPUID.0x07.0x0:ECX[5]. User wait instructions may
be executed at any privilege level, and use 32bit IA32_UMWAIT_CONTROL MSR
to set the maximum time.
The behavior of user wait instructions in VMX non-root operation is
determined first by the setting of the "enable user wait and pause"
secondary processor-based VM-execution control bit 26.
If the VM-execution control is 0, UMONITOR/UMWAIT/TPAUSE cause
an invalid-opcode exception (#UD).
If the VM-execution control is 1, treatment is based on the
setting of the “RDTSC exiting†VM-execution control. Because KVM never
enables RDTSC exiting, if the instruction causes a delay, the amount of
time delayed is called here the physical delay. The physical delay is
first computed by determining the virtual delay. If
IA32_UMWAIT_CONTROL[31:2] is zero, the virtual delay is the value in
EDX:EAX minus the value that RDTSC would return; if
IA32_UMWAIT_CONTROL[31:2] is not zero, the virtual delay is the minimum
of that difference and AND(IA32_UMWAIT_CONTROL,FFFFFFFCH).
Because umwait and tpause can put a (psysical) CPU into a power saving
state, by default we dont't expose it to kvm and enable it only when
guest CPUID has it.
Detailed information about user wait instructions can be found in the
latest Intel 64 and IA-32 Architectures Software Developer's Manual.
Co-developed-by: Jingqi Liu <jingqi.liu@intel.com>
Signed-off-by: Jingqi Liu <jingqi.liu@intel.com>
Signed-off-by: Tao Xu <tao3.xu@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Document the intended usage of each emulation type as each exists to
handle an edge case of one kind or another and can be easily
misinterpreted at first glance.
Cc: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
VMX's EPT misconfig flow to handle fast-MMIO path falls back to decoding
the instruction to determine the instruction length when running as a
guest (Hyper-V doesn't fill VMCS.VM_EXIT_INSTRUCTION_LEN because it's
technically not defined for EPT misconfigs). Rather than implement the
slow skip in VMX's generic skip_emulated_instruction(),
handle_ept_misconfig() directly calls kvm_emulate_instruction() with
EMULTYPE_SKIP, which intentionally doesn't do single-step detection, and
so handle_ept_misconfig() misses a single-step #DB.
Rework the EPT misconfig fallback case to route it through
kvm_skip_emulated_instruction() so that single-step #DBs and interrupt
shadow updates are handled automatically. I.e. make VMX's slow skip
logic match SVM's and have the SVM flow not intentionally avoid the
shadow update.
Alternatively, the handle_ept_misconfig() could manually handle single-
step detection, but that results in EMULTYPE_SKIP having split logic for
the interrupt shadow vs. single-step #DBs, and split emulator logic is
largely what led to this mess in the first place.
Modifying SVM to mirror VMX flow isn't really an option as SVM's case
isn't limited to a specific exit reason, i.e. handling the slow skip in
skip_emulated_instruction() is mandatory for all intents and purposes.
Drop VMX's skip_emulated_instruction() wrapper since it can now fail,
and instead WARN if it fails unexpectedly, e.g. if exit_reason somehow
becomes corrupted.
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Fixes: d391f12070672 ("x86/kvm/vmx: do not use vm-exit instruction length for fast MMIO when running nested")
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Deferring emulation failure handling (in some cases) to the caller of
x86_emulate_instruction() has proven fragile, e.g. multiple instances of
KVM not setting run->exit_reason on EMULATE_FAIL, largely due to it
being difficult to discern what emulation types can return what result,
and which combination of types and results are handled where.
Now that x86_emulate_instruction() always handles emulation failure,
i.e. EMULATION_FAIL is only referenced in callers, remove the
emulation_result enums entirely. Per KVM's existing exit handling
conventions, return '0' and '1' for "exit to userspace" and "resume
guest" respectively. Doing so cleans up many callers, e.g. they can
return kvm_emulate_instruction() directly instead of having to interpret
its result.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Now that EMULATE_FAIL is completely unused, remove the last remaning
usage where KVM does something functional in response to EMULATE_FAIL.
Leave the check in place as a WARN_ON_ONCE to provide a better paper
trail when EMULATE_{DONE,FAIL,USER_EXIT} are completely removed.
Opportunistically remove the gotos in handle_invalid_guest_state().
With the EMULATE_FAIL handling gone there is no need to have a common
handler for emulation failure and the gotos only complicate things,
e.g. the signal_pending() check always returns '1', but this is far
from obvious when glancing through the code.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Request triple fault in kvm_inject_realmode_interrupt() instead of
returning EMULATE_FAIL and deferring to the caller. All existing
callers request triple fault and it's highly unlikely Real Mode is
going to acquire new features. While this consolidates a small amount
of code, the real goal is to remove the last reference to EMULATE_FAIL.
No functional change intended.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Consolidate the reporting of emulation failure into kvm_task_switch()
so that it can return EMULATE_USER_EXIT. This helps pave the way for
removing EMULATE_FAIL altogether.
This also fixes a theoretical bug where task switch interception could
suppress an EMULATE_USER_EXIT return.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Kill a few birds with one stone by forcing an exit to userspace on skip
emulation failure. This removes a reference to EMULATE_FAIL, fixes a
bug in handle_ept_misconfig() where it would exit to userspace without
setting run->exit_reason, and fixes a theoretical bug in SVM's
task_switch_interception() where it would overwrite run->exit_reason on
a return of EMULATE_USER_EXIT.
Note, this technically doesn't fully fix task_switch_interception()
as it now incorrectly handles EMULATE_FAIL, but in practice there is no
bug as EMULATE_FAIL will never be returned for EMULTYPE_SKIP.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Immediately inject a #UD and return EMULATE done if emulation fails when
handling an intercepted #UD. This helps pave the way for removing
EMULATE_FAIL altogether.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Add an explicit emulation type for forced #UD emulation and use it to
detect that KVM should unconditionally inject a #UD instead of falling
into its standard emulation failure handling.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Immediately inject a #GP when VMware emulation fails and return
EMULATE_DONE instead of propagating EMULATE_FAIL up the stack. This
helps pave the way for removing EMULATE_FAIL altogether.
Rename EMULTYPE_VMWARE to EMULTYPE_VMWARE_GP to document that the x86
emulator is called to handle VMware #GP interception, e.g. why a #GP
is injected on emulation failure for EMULTYPE_VMWARE_GP.
Drop EMULTYPE_NO_UD_ON_FAIL as a standalone type. The "no #UD on fail"
is used only in the VMWare case and is obsoleted by having the emulator
itself reinject #GP.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
The VMware backdoor hooks #GP faults on IN{S}, OUT{S}, and RDPMC, none
of which generate a non-zero error code for their #GP. Re-injecting #GP
instead of attempting emulation on a non-zero error code will allow a
future patch to move #GP injection (for emulation failure) into
kvm_emulate_instruction() without having to plumb in the error code.
Reviewed-and-tested-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Return the single-step emulation result directly instead of via an out
param. Presumably at some point in the past kvm_vcpu_do_singlestep()
could be called with *r==EMULATE_USER_EXIT, but that is no longer the
case, i.e. all callers are happy to overwrite their own return variable.
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
When handling emulation failure, return the emulation result directly
instead of capturing it in a local variable. Future patches will move
additional cases into handle_emulation_failure(), clean up the cruft
before so there isn't an ugly mix of setting a local variable and
returning directly.
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Move the stat.mmio_exits update into x86_emulate_instruction(). This is
both a bug fix, e.g. the current update flows will incorrectly increment
mmio_exits on emulation failure, and a preparatory change to set the
stage for eliminating EMULATE_DONE and company.
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
According to section "Checks Related to Address-Space Size" in Intel SDM
vol 3C, the following checks are performed on vmentry of nested guests:
If the logical processor is outside IA-32e mode (if IA32_EFER.LMA = 0)
at the time of VM entry, the following must hold:
- The "IA-32e mode guest" VM-entry control is 0.
- The "host address-space size" VM-exit control is 0.
If the logical processor is in IA-32e mode (if IA32_EFER.LMA = 1) at the
time of VM entry, the "host address-space size" VM-exit control must be 1.
If the "host address-space size" VM-exit control is 0, the following must
hold:
- The "IA-32e mode guest" VM-entry control is 0.
- Bit 17 of the CR4 field (corresponding to CR4.PCIDE) is 0.
- Bits 63:32 in the RIP field are 0.
If the "host address-space size" VM-exit control is 1, the following must
hold:
- Bit 5 of the CR4 field (corresponding to CR4.PAE) is 1.
- The RIP field contains a canonical address.
On processors that do not support Intel 64 architecture, checks are
performed to ensure that the "IA-32e mode guest" VM-entry control and the
"host address-space size" VM-exit control are both 0.
Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Reviewed-by: Karl Heubaum <karl.heubaum@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
The bit is supposed to be '1' when SMT is not supported or forcefully
disabled and '0' otherwise.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
impossible
Hyper-V 2019 doesn't expose MD_CLEAR CPUID bit to guests when it cannot
guarantee that two virtual processors won't end up running on sibling SMT
threads without knowing about it. This is done as an optimization as in
this case there is nothing the guest can do to protect itself against MDS
and issuing additional flush requests is just pointless. On bare metal the
topology is known, however, when Hyper-V is running nested (e.g. on top of
KVM) it needs an additional piece of information: a confirmation that the
exposed topology (wrt vCPU placement on different SMT threads) is
trustworthy.
NoNonArchitecturalCoreSharing (CPUID 0x40000004 EAX bit 18) is described in
TLFS as follows: "Indicates that a virtual processor will never share a
physical core with another virtual processor, except for virtual processors
that are reported as sibling SMT threads." From KVM we can give such
guarantee in two cases:
- SMT is unsupported or forcefully disabled (just 'disabled' doesn't work
as it can become re-enabled during the lifetime of the guest).
- vCPUs are properly pinned so the scheduler won't put them on sibling
SMT threads (when they're not reported as such).
This patch reports NoNonArchitecturalCoreSharing bit in to userspace in the
first case. The second case is outside of KVM's domain of responsibility
(as vCPU pinning is actually done by someone who manages KVM's userspace -
e.g. libvirt pinning QEMU threads).
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
KVM needs to know if SMT is theoretically possible, this means it is
supported and not forcefully disabled ('nosmt=force'). Create and
export cpu_smt_possible() answering this question.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Reported by syzkaller:
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] PREEMPT SMP KASAN
RIP: 0010:__apic_accept_irq+0x46/0x740 arch/x86/kvm/lapic.c:1029
Call Trace:
kvm_apic_set_irq+0xb4/0x140 arch/x86/kvm/lapic.c:558
stimer_notify_direct arch/x86/kvm/hyperv.c:648 [inline]
stimer_expiration arch/x86/kvm/hyperv.c:659 [inline]
kvm_hv_process_stimers+0x594/0x1650 arch/x86/kvm/hyperv.c:686
vcpu_enter_guest+0x2b2a/0x54b0 arch/x86/kvm/x86.c:7896
vcpu_run+0x393/0xd40 arch/x86/kvm/x86.c:8152
kvm_arch_vcpu_ioctl_run+0x636/0x900 arch/x86/kvm/x86.c:8360
kvm_vcpu_ioctl+0x6cf/0xaf0 arch/x86/kvm/../../../virt/kvm/kvm_main.c:2765
The testcase programs HV_X64_MSR_STIMERn_CONFIG/HV_X64_MSR_STIMERn_COUNT,
in addition, there is no lapic in the kernel, the counters value are small
enough in order that kvm_hv_process_stimers() inject this already-expired
timer interrupt into the guest through lapic in the kernel which triggers
the NULL deferencing. This patch fixes it by don't advertise direct mode
synthetic timers and discarding the inject when lapic is not in kernel.
syzkaller source: https://syzkaller.appspot.com/x/repro.c?x=1752fe0a600000
Reported-by: syzbot+dff25ee91f0c7d5c1695@syzkaller.appspotmail.com
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Zapping collapsible sptes, a.k.a. 4k sptes that can be promoted into a
large page, is only necessary when changing only the dirty logging flag
of a memory region. If the memslot is also being moved, then all sptes
for the memslot are zapped when it is invalidated. When a memslot is
being created, it is impossible for there to be existing dirty mappings,
e.g. KVM can have MMIO sptes, but not present, and thus dirty, sptes.
Note, the comment and logic are shamelessly borrowed from MIPS's version
of kvm_arch_commit_memory_region().
Fixes: 3ea3b7fa9af06 ("kvm: mmu: lazy collapse small sptes into large sptes")
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Remove the duplication code in run_test() of dirty_log_test because
after some reordering of functions now we can directly use the outcome
of vm_create().
Meanwhile, with the new VM_MODE_PXXV48_4K, we can safely revert
b442324b58 too where we stick the x86_64 PA width to 39 bits for
dirty_log_test.
Reviewed-by: Andrew Jones <drjones@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
The naming VM_MODE_P52V48_4K is explicit but unclear when used on
x86_64 machines, because x86_64 machines are having various physical
address width rather than some static values. Here's some examples:
- Intel Xeon E3-1220: 36 bits
- Intel Core i7-8650: 39 bits
- AMD EPYC 7251: 48 bits
All of them are using 48 bits linear address width but with totally
different physical address width (and most of the old machines should
be less than 52 bits).
Let's create a new guest mode called VM_MODE_PXXV48_4K for current
x86_64 tests and make it as the default to replace the old naming of
VM_MODE_P52V48_4K because it shows more clearly that the PA width is
not really a constant. Meanwhile we also stop assuming all the x86
machines are having 52 bits PA width but instead we fetch the real
vm->pa_bits from CPUID 0x80000008 during runtime.
We currently make this exclusively used by x86_64 but no other arch.
As a slight touch up, moving DEBUG macro from dirty_log_test.c to
kvm_util.h so lib can use it too.
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Since we've just removed the dependency of vm type in previous patch,
now we can create the vm much earlier. Note that to move it earlier
we used an approximation of number of extra pages but it should be
fine.
This prepares for the follow up patches to finally remove the
duplication of guest mode parsings.
Reviewed-by: Andrew Jones <drjones@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Rather than passing the vm type from the top level to the end of vm
creation, let's simply keep that as an internal of kvm_vm struct and
decide the type in _vm_create(). Several reasons for doing this:
- The vm type is only decided by physical address width and currently
only used in aarch64, so we've got enough information as long as
we're passing vm_guest_mode into _vm_create(),
- This removes a loop dependency between the vm->type and creation of
vms. That's why now we need to parse vm_guest_mode twice sometimes,
once in run_test() and then again in _vm_create(). The follow up
patches will move on to clean up that as well so we can have a
single place to decide guest machine types and so.
Note that this patch will slightly change the behavior of aarch64
tests in that previously most vm_create() callers will directly pass
in type==0 into _vm_create() but now the type will depend on
vm_guest_mode, however it shouldn't affect any user because all
vm_create() users of aarch64 will be using VM_MODE_DEFAULT guest
mode (which is VM_MODE_P40V48_4K) so at last type will still be zero.
Reviewed-by: Andrew Jones <drjones@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
available
It was discovered that after commit 65efa61dc0d5 ("selftests: kvm: provide
common function to enable eVMCS") hyperv_cpuid selftest is failing on AMD.
The reason is that the commit changed _vcpu_ioctl() to vcpu_ioctl() in the
test and this one can't fail.
Instead of fixing the test is seems to make more sense to not announce
KVM_CAP_HYPERV_ENLIGHTENED_VMCS support if it is definitely missing
(on svm and in case kvm_intel.nested=0).
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Since commit 5158917c7b019 ("KVM: x86: nVMX: Allow nested_enable_evmcs to
be NULL") the code in x86.c is prepared to see nested_enable_evmcs being
NULL and in VMX case it actually is when nesting is disabled. Remove the
unneeded stub from SVM code.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Hyper-V provides direct tlb flush function which helps
L1 Hypervisor to handle Hyper-V tlb flush request from
L2 guest. Add the function support for VMX.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Tianyu Lan <Tianyu.Lan@microsoft.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Hyper-V direct tlb flush function should be enabled for
guest that only uses Hyper-V hypercall. User space
hypervisor(e.g, Qemu) can disable KVM identification in
CPUID and just exposes Hyper-V identification to make
sure the precondition. Add new KVM capability KVM_CAP_
HYPERV_DIRECT_TLBFLUSH for user space to enable Hyper-V
direct tlb function and this function is default to be
disabled in KVM.
Signed-off-by: Tianyu Lan <Tianyu.Lan@microsoft.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
The struct hv_vp_assist_page was defined incorrectly.
The "vtl_control" should be u64[3], "nested_enlightenments
_control" should be a u64 and there are 7 reserved bytes
following "enlighten_vmentry". Fix the definition.
Signed-off-by: Tianyu Lan <Tianyu.Lan@microsoft.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
These MSRs should be enumerated by KVM_GET_MSR_INDEX_LIST, so that
userspace knows that these MSRs may be part of the vCPU state.
Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Eric Hankland <ehankland@google.com>
Reviewed-by: Peter Shier <pshier@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/lee/mfd
Pull MFD updates from Lee Jones:
"New Drivers:
- Add support for Merrifield Basin Cove PMIC
New Device Support:
- Add support for Intel Tiger Lake to Intel LPSS PCI
- Add support for Intel Sky Lake to Intel LPSS PCI
- Add support for ST-Ericsson DB8520 to DB8500 PRCMU
New Functionality:
- Add RTC and PWRC support to MT6323
Fix-ups:
- Clean-up include files; davinci_voicecodec, asic3, sm501, mt6397
- Ignore return values from debugfs_create*(); ab3100-*, ab8500-debugfs, aat2870-core
- Device Tree changes; rn5t618, mt6397
- Use new I2C API; tps80031, 88pm860x-core, ab3100-core, bcm590xx,
da9150-core, max14577, max77693, max77843, max8907,
max8925-i2c, max8997, max8998, palmas, twl-core,
- Remove obsolete code; da9063, jz4740-adc
- Simplify semantics; timberdale, htc-i2cpld
- Add 'fall-through' tags; omap-usb-host, db8500-prcmu
- Remove superfluous prints; ab8500-debugfs, db8500-prcmu, fsl-imx25-tsadc,
intel_soc_pmic_bxtwc, qcom_rpm, sm501
- Trivial rename/whitespace/typo fixes; mt6397-core, MAINTAINERS
- Reorganise code structure; mt6397-*
- Improve code consistency; intel-lpss
- Use MODULE_SOFTDEP() helper; intel-lpss
- Use DEFINE_RES_*() helpers; mt6397-core
Bug Fixes:
- Clean-up resources; max77620
- Prevent input events being dropped on resume; intel-lpss-pci
- Prevent sleeping in IRQ context; ezx-pcap"
* tag 'mfd-next-5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/mfd: (48 commits)
mfd: mt6323: Add MT6323 RTC and PWRC
mfd: mt6323: Replace boilerplate resource code with DEFINE_RES_* macros
mfd: mt6397: Add mutex include
dt-bindings: mfd: mediatek: Add MT6323 Power Controller
dt-bindings: mfd: mediatek: Update RTC to include MT6323
dt-bindings: mfd: mediatek: mt6397: Change to relative paths
mfd: db8500-prcmu: Support the higher DB8520 ARMSS
mfd: intel-lpss: Use MODULE_SOFTDEP() instead of implicit request
mfd: htc-i2cpld: Drop check because i2c_unregister_device() is NULL safe
mfd: sm501: Include the GPIO driver header
mfd: intel-lpss: Add Intel Skylake ACPI IDs
mfd: intel-lpss: Consistently use GENMASK()
mfd: Add support for Merrifield Basin Cove PMIC
mfd: ezx-pcap: Replace mutex_lock with spin_lock
mfd: asic3: Include the right header
MAINTAINERS: altera-sysmgr: Fix typo in a filepath
mfd: mt6397: Extract IRQ related code from core driver
mfd: mt6397: Rename macros to something more readable
mfd: Remove dev_err() usage after platform_get_irq()
mfd: db8500-prcmu: Mark expected switch fall-throughs
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/lee/backlight
Pull backlight updates from Lee Jones:
"Core Frameworks
- Obtain scale type through sysfs
New Functionality:
- Provide Device Tree functionality in rave-sp-backlight
- Calculate if scale type is (non-)linear in pwm_bl
Fix-ups:
- Simplify code in lm3630a_bl
- Trivial rename/whitespace/typo fixes in lms283gf05
- Remove superfluous NULL check in tosa_lcd
- Fix power state initialisation in gpio_backlight
- List supported file in MAINTAINERS
Bug Fixes:
- Kconfig - default to not building unless requested in
{LED,BACKLIGHT}_CLASS_DEVICE"
* tag 'backlight-next-5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/backlight:
backlight: pwm_bl: Set scale type for brightness curves specified in the DT
backlight: pwm_bl: Set scale type for CIE 1931 curves
backlight: Expose brightness curve type through sysfs
MAINTAINERS: Add entry for stable backlight sysfs ABI documentation
backlight: gpio-backlight: Correct initial power state handling
video: backlight: tosa_lcd: drop check because i2c_unregister_device() is NULL safe
video: backlight: Drop default m for {LCD,BACKLIGHT_CLASS_DEVICE}
backlight: lms283gf05: Fix a typo in the description passed to 'devm_gpio_request_one()'
backlight: lm3630a: Switch to use fwnode_property_count_uXX()
backlight: rave-sp: Leave initial state and register with correct device
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci
Pull PCI updates from Bjorn Helgaas:
"Enumeration:
- Consolidate _HPP/_HPX stuff in pci-acpi.c and simplify it
(Krzysztof Wilczynski)
- Fix incorrect PCIe device types and remove dev->has_secondary_link
to simplify code that deals with upstream/downstream ports (Mika
Westerberg)
- After suspend, restore Resizable BAR size bits correctly for 1MB
BARs (Sumit Saxena)
- Enable PCI_MSI_IRQ_DOMAIN support for RISC-V (Wesley Terpstra)
Virtualization:
- Add ACS quirks for iProc PAXB (Abhinav Ratna), Amazon Annapurna
Labs (Ali Saidi)
- Move sysfs SR-IOV functions to iov.c (Kelsey Skunberg)
- Remove group write permissions from sysfs sriov_numvfs,
sriov_drivers_autoprobe (Kelsey Skunberg)
Hotplug:
- Simplify pciehp indicator control (Denis Efremov)
Peer-to-peer DMA:
- Allow P2P DMA between root ports for whitelisted bridges (Logan
Gunthorpe)
- Whitelist some Intel host bridges for P2P DMA (Logan Gunthorpe)
- DMA map P2P DMA requests that traverse host bridge (Logan
Gunthorpe)
Amazon Annapurna Labs host bridge driver:
- Add DT binding and controller driver (Jonathan Chocron)
Hyper-V host bridge driver:
- Fix hv_pci_dev->pci_slot use-after-free (Dexuan Cui)
- Fix PCI domain number collisions (Haiyang Zhang)
- Use instance ID bytes 4 & 5 as PCI domain numbers (Haiyang Zhang)
- Fix build errors on non-SYSFS config (Randy Dunlap)
i.MX6 host bridge driver:
- Limit DBI register length (Stefan Agner)
Intel VMD host bridge driver:
- Fix config addressing issues (Jon Derrick)
Layerscape host bridge driver:
- Add bar_fixed_64bit property to endpoint driver (Xiaowei Bao)
- Add CONFIG_PCI_LAYERSCAPE_EP to build EP/RC drivers separately
(Xiaowei Bao)
Mediatek host bridge driver:
- Add MT7629 controller support (Jianjun Wang)
Mobiveil host bridge driver:
- Fix CPU base address setup (Hou Zhiqiang)
- Make "num-lanes" property optional (Hou Zhiqiang)
Tegra host bridge driver:
- Fix OF node reference leak (Nishka Dasgupta)
- Disable MSI for root ports to work around design problem (Vidya
Sagar)
- Add Tegra194 DT binding and controller support (Vidya Sagar)
- Add support for sideband pins and slot regulators (Vidya Sagar)
- Add PIPE2UPHY support (Vidya Sagar)
Misc:
- Remove unused pci_block_cfg_access() et al (Kelsey Skunberg)
- Unexport pci_bus_get(), etc (Kelsey Skunberg)
- Hide PM, VC, link speed, ATS, ECRC, PTM constants and interfaces in
the PCI core (Kelsey Skunberg)
- Clean up sysfs DEVICE_ATTR() usage (Kelsey Skunberg)
- Mark expected switch fall-through (Gustavo A. R. Silva)
- Propagate errors for optional regulators and PHYs (Thierry Reding)
- Fix kernel command line resource_alignment parameter issues (Logan
Gunthorpe)"
* tag 'pci-v5.4-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (112 commits)
PCI: Add pci_irq_vector() and other stubs when !CONFIG_PCI
arm64: tegra: Add PCIe slot supply information in p2972-0000 platform
arm64: tegra: Add configuration for PCIe C5 sideband signals
PCI: tegra: Add support to enable slot regulators
PCI: tegra: Add support to configure sideband pins
PCI: vmd: Fix shadow offsets to reflect spec changes
PCI: vmd: Fix config addressing when using bus offsets
PCI: dwc: Add validation that PCIe core is set to correct mode
PCI: dwc: al: Add Amazon Annapurna Labs PCIe controller driver
dt-bindings: PCI: Add Amazon's Annapurna Labs PCIe host bridge binding
PCI: Add quirk to disable MSI-X support for Amazon's Annapurna Labs Root Port
PCI/VPD: Prevent VPD access for Amazon's Annapurna Labs Root Port
PCI: Add ACS quirk for Amazon Annapurna Labs root ports
PCI: Add Amazon's Annapurna Labs vendor ID
MAINTAINERS: Add PCI native host/endpoint controllers designated reviewer
PCI: hv: Use bytes 4 and 5 from instance ID as the PCI domain numbers
dt-bindings: PCI: tegra: Add PCIe slot supplies regulator entries
dt-bindings: PCI: tegra: Add sideband pins configuration entries
PCI: tegra: Add Tegra194 PCIe support
PCI: Get rid of dev->has_secondary_link flag
...
|
|
Pull smack updates from Casey Schaufler:
"Four patches for v5.4. Nothing is major.
All but one are in response to mechanically detected potential issues.
The remaining patch cleans up kernel-doc notations"
* tag 'smack-for-5.4-rc1' of git://github.com/cschaufler/smack-next:
smack: use GFP_NOFS while holding inode_smack::smk_lock
security: smack: Fix possible null-pointer dereferences in smack_socket_sock_rcv_skb()
smack: fix some kernel-doc notations
Smack: Don't ignore other bprm->unsafe flags if LSM_UNSAFE_PTRACE is set
|
|
- Fix typos and whitespace errors (Bjorn Helgaas, Krzysztof Wilczynski)
- Remove unnecessary "return" statements (Krzysztof Wilczynski)
- Correct of_irq_parse_pci() function documentation (Lubomir Rintel)
* pci/trivial:
PCI: Remove unnecessary returns
PCI: OF: Correct of_irq_parse_pci() documentation
PCI: Fix typos and whitespace errors
|