2020-04-24KVM: SVM: do not allow VMRUN inside SMMPaolo Bonzini
VMRUN is not supported inside the SMM handler and the behavior is undefined. Just raise a #UD. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
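A sketch of the kind of guard this describes, placed at the top of SVM's VMRUN intercept handler; the surrounding function and exact placement are assumptions, not quoted from the patch:

  /* Sketch: reject VMRUN while in SMM by injecting #UD, per the commit. */
  if (is_smm(vcpu)) {
      kvm_queue_exception(vcpu, UD_VECTOR);
      return 1;    /* exit handled, resume the guest */
  }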
2020-04-24kvm: add capability for halt pollingDavid Matlack
KVM_CAP_HALT_POLL is a per-VM capability that lets userspace control the halt-polling time, allowing halt-polling to be tuned or disabled on particular VMs. With dynamic halt-polling, a VM's VCPUs can poll anywhere in the range [0, halt_poll_ns] on each halt; KVM_CAP_HALT_POLL sets the upper limit on the poll time. Signed-off-by: David Matlack <dmatlack@google.com> Signed-off-by: Jon Cargille <jcargill@google.com> Reviewed-by: Jim Mattson <jmattson@google.com> Message-Id: <20200417221446.108733-1-jcargill@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
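A minimal userspace sketch of enabling the capability on a VM file descriptor; the helper name and the 200us value are illustrative:

  #include <linux/kvm.h>
  #include <sys/ioctl.h>

  /* Set a VM's halt-polling upper bound, in nanoseconds; 0 disables polling. */
  static int set_halt_poll_ns(int vm_fd, unsigned long long poll_ns)
  {
      struct kvm_enable_cap cap = {
          .cap = KVM_CAP_HALT_POLL,
          .args[0] = poll_ns,    /* new per-VM halt_poll_ns limit */
      };

      return ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
  }

  /* Usage: set_halt_poll_ns(vm_fd, 200000);  -- cap polling at 200us */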
2020-04-24KVM: nVMX: Store vmcs.EXIT_QUALIFICATION as an unsigned long, not u32Sean Christopherson
Use an unsigned long for 'exit_qual' in nested_vmx_reflect_vmexit(), as the EXIT_QUALIFICATION field is naturally sized, not a 32-bit field. The bug is most easily observed by doing VMXON (or any VMX instruction) in L2 with a negative displacement, in which case dropping the upper bits on nested VM-Exit results in L1 calculating the wrong virtual address for the memory operand, e.g. "vmxon -0x8(%rbp)" yields:

  Unhandled cpu exception 14 #PF at ip 0000000000400553
  rbp=0000000000537000 cr2=0000000100536ff8

Fixes: fbdd50250396d ("KVM: nVMX: Move VM-Fail check out of nested_vmx_exit_reflected()") Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200423001127.13490-1-sean.j.christopherson@intel.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-04-23KVM: nVMX: Drop a redundant call to vmx_get_intr_info()Sean Christopherson
Drop nested_vmx_l1_wants_exit()'s initialization of intr_info from vmx_get_intr_info() that was inadvertently introduced along with the caching mechanism. EXIT_REASON_EXCEPTION_NMI, the only consumer of intr_info, populates the variable before using it. Fixes: bb53120d67cd ("KVM: VMX: Cache vmcs.EXIT_INTR_INFO using arch avail_reg flags") Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200421075328.14458-2-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-04-23KVM: x86: move nested-related kvm_x86_ops to a separate structPaolo Bonzini
Clean up some of the patching of kvm_x86_ops by moving the ops related to nested virtualization into a separate struct. As a result, these ops will always be non-NULL on VMX. This is not a problem:

* check_nested_events is only called if is_guest_mode(vcpu) returns true
* get_nested_state treats VMXOFF state the same as nested being disabled
* set_nested_state fails if you attempt to set nested state while nesting is disabled
* nested_enable_evmcs could already be called on a CPU without VMX enabled in CPUID
* nested_get_evmcs_version was fixed in the previous patch

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
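A sketch of what such a dedicated nested-ops struct looks like; the member names mirror the callbacks listed above, but the exact signatures are assumptions:

  /* Sketch: nested-virt callbacks pulled out of kvm_x86_ops. */
  struct kvm_x86_nested_ops {
      int (*check_events)(struct kvm_vcpu *vcpu);
      int (*get_state)(struct kvm_vcpu *vcpu,
                       struct kvm_nested_state __user *user_state,
                       unsigned int user_data_size);
      int (*set_state)(struct kvm_vcpu *vcpu,
                       struct kvm_nested_state __user *user_state,
                       struct kvm_nested_state *kvm_state);
      int (*enable_evmcs)(struct kvm_vcpu *vcpu, uint16_t *vmcs_version);
      uint16_t (*get_evmcs_version)(struct kvm_vcpu *vcpu);
  };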
2020-04-23KVM: eVMCS: check if nesting is enabledPaolo Bonzini
In the next patch nested_get_evmcs_version will always be set in kvm_x86_ops for VMX, even if nesting is disabled. Therefore, check whether VMX (aka nesting) is available in the function; the caller will not do the check anymore. Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-04-23KVM: x86: check_nested_events is never NULLPaolo Bonzini
Both Intel and AMD now implement it, so there is no need to check if the callback is implemented. Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-04-21selftests: kvm/set_memory_region_test: do not check RIP if the guest shuts downPaolo Bonzini
On AMD, the state of the VMCB is undefined after a shutdown VMEXIT. KVM takes a very conservative approach to that and resets the guest altogether when that happens. This causes the set_memory_region_test to fail because the RIP is 0xfff0 (the reset vector). Restrict the RIP test to KVM_EXIT_INTERNAL_ERROR in order to fix this. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-04-21KVM: SVM: avoid infinite loop on NPF from bad addressPaolo Bonzini
When a nested page fault is taken from an address that does not have a memslot associated with it, kvm_mmu_do_page_fault returns RET_PF_EMULATE (via mmu_set_spte) and kvm_mmu_page_fault then invokes svm_need_emulation_on_page_fault. The default answer there is to return false, but in this case this just causes the page fault to be retried ad libitum. Since this is not a fast path, and the only other case where it is taken is an erratum, just stick a kvm_vcpu_gfn_to_memslot check in there to detect the common case where the erratum is not happening. This fixes an infinite loop in the new set_memory_region_test. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
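A sketch of the described check; its placement inside svm_need_emulation_on_page_fault() and the exact return handling are assumptions:

  /* Sketch: if the faulting GPA has no memslot, the access is MMIO-like
   * and must go to the emulator instead of retrying the NPF forever. */
  if (!kvm_vcpu_gfn_to_memslot(vcpu, gpa_to_gfn(gpa)))
      return true;    /* emulate rather than retry the page fault */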
2020-04-21tools/kvm_stat: add sample systemd unit fileStefan Raspl
Add a sample unit file as a basis for systemd integration of kvm_stat logs. Signed-off-by: Stefan Raspl <raspl@de.ibm.com> Message-Id: <20200402085705.61155-4-raspl@linux.ibm.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-04-21tools/kvm_stat: Add command line switch '-L' to log to fileStefan Raspl
To integrate with logrotate, we have a signal handler that will re-open the logfile. Assuming we have a systemd unit file with

  ExecStart=kvm_stat -dtc -s 10 -L /var/log/kvm_stat.csv
  ExecReload=/bin/kill -HUP $MAINPID

and a logrotate config featuring

  postrotate
      /bin/systemctl reload kvm_stat.service
  endscript

the overall flow will look like this:

(1) systemd starts kvm_stat, logging to A.
(2) At some point, logrotate runs, moving A to B. kvm_stat continues to write to B at this point.
(3) After rotating, logrotate restarts the kvm_stat unit via systemctl.
(4) The kvm_stat unit sends a SIGHUP to kvm_stat, finally making it switch over to writing to A again.

Note that in order to keep the structure of the csv output intact, we make sure to write the header only once at the beginning of a file, in contrast to the standard log format. This implies that the header is suppressed when appending to an existing file, whereas the standard format starts each append with a header.

Signed-off-by: Stefan Raspl <raspl@de.ibm.com> Message-Id: <20200402085705.61155-3-raspl@linux.ibm.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
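kvm_stat itself is a Python tool; the sketch below shows the same reopen-on-SIGHUP pattern generically in C, with all names illustrative:

  #include <signal.h>
  #include <stdio.h>

  static volatile sig_atomic_t reopen_requested;
  static FILE *logfile;

  static void on_sighup(int sig)
  {
      (void)sig;
      reopen_requested = 1;    /* defer the real work to the main loop */
  }

  /* Called from the main logging loop before each write. */
  static void maybe_reopen(const char *path)
  {
      if (!reopen_requested)
          return;
      reopen_requested = 0;
      if (logfile)
          fclose(logfile);    /* releases the rotated-away file (B) */
      logfile = fopen(path, "a");    /* picks up the fresh file (A) */
  }

  /* Setup, once at startup: signal(SIGHUP, on_sighup); */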
2020-04-21tools/kvm_stat: add command line switch '-z' to skip zero recordsStefan Raspl
When running in logging mode, skip records with all zeros (=empty records) to preserve space when logging to files. Signed-off-by: Stefan Raspl <raspl@de.ibm.com> Message-Id: <20200402085705.61155-2-raspl@linux.ibm.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-04-21KVM: Remove redundant argument to kvm_arch_vcpu_ioctl_runTianjia Zhang
In earlier versions of kvm, 'kvm_run' was an independent structure and was not included in the vcpu structure. At present, 'kvm_run' is already included in the vcpu structure, so the parameter 'kvm_run' is redundant. This patch simplifies the function definition, removes the extra 'kvm_run' parameter, and extracts it from the 'kvm_vcpu' structure if necessary. Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com> Message-Id: <20200416051057.26526-1-tianjia.zhang@linux.alibaba.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
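A before/after sketch of the signature change for the function named in the subject; the in-body fetch is per the commit's description:

  /* Before: callers had to pass the run struct alongside the vcpu. */
  int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run);

  /* After: only the vcpu is passed ... */
  int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu);
  /* ... and inside the body, where needed:
   *     struct kvm_run *kvm_run = vcpu->run;
   */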
2020-04-21KVM: nSVM: Check for CR0.CD and CR0.NW on VMRUN of nested guestsKrish Sadhukhan
According to section "Canonicalization and Consistency Checks" in APM vol. 2, the following guest state combination is illegal: "CR0.CD is zero and CR0.NW is set" Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com> Message-Id: <20200409205035.16830-2-krish.sadhukhan@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
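A sketch of the consistency check; placing it in SVM's nested-VMCB validation follows the patch's intent, but the exact function and failure handling are assumptions:

  /* Illegal per APM vol. 2: CR0.CD clear while CR0.NW set. */
  if (!(vmcb->save.cr0 & X86_CR0_CD) && (vmcb->save.cr0 & X86_CR0_NW))
      return false;    /* fail the canonicalization/consistency checks */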
2020-04-21KVM: X86: Improve latency for single target IPI fastpathWanpeng Li
Observation in cloud environments shows that IPIs and the timer cause most MSR-write vmexits, so let's optimize virtual IPI latency more aggressively and inject the target IPI as soon as possible. Running the kvm-unit-tests/vmexit.flat IPI test on an SKX server, with the adaptive advance lapic timer and adaptive halt-polling disabled to avoid interference, this patch gives another 7% improvement:

  w/o fastpath -> x86.c fastpath      4238 -> 3543    16.4%
  x86.c fastpath -> vmx.c fastpath    3543 -> 3293     7%
  w/o fastpath -> vmx.c fastpath      4238 -> 3293    22.3%

Cc: Haiwei Li <lihaiwei@tencent.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200410174703.1138-3-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-04-21KVM: VMX: Optimize handling of VM-Entry failures in vmx_vcpu_run()Sean Christopherson
Mark the VM-Fail, VM-Exit on VM-Enter, and #MC on VM-Enter paths as 'unlikely' so as to improve code generation so that it favors successful VM-Enter. The performance of successful VM-Enter is far more important, irrespective of whether or not success is actually likely. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200410174703.1138-2-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-04-21KVM: nVMX: Remove non-functional "support" for CR3 target valuesSean Christopherson
Remove all references to cr3_target_value[0-3] and replace the fields in vmcs12 with "dead_space" to preserve the vmcs12 layout. KVM doesn't support emulating CR3-target values, despite a variety of code that implies otherwise, as KVM unconditionally reports '0' for the number of supported CR3-target values. This technically fixes a bug where KVM would incorrectly allow VMREAD and VMWRITE to nonexistent fields, i.e. cr3_target_value[0-3]. Per Intel's SDM, the number of supported CR3-target values reported in VMX_MISC also enumerates the existence of the associated VMCS fields:

  If a future implementation supports more than 4 CR3-target values, they
  will be encoded consecutively following the 4 encodings given here.

Alternatively, the "bug" could be fixed by actually advertising support for 4 CR3-target values, but that'd likely just enable kvm-unit-tests given that no one has complained about lack of support for going on ten years, e.g. KVM, Xen and HyperV don't use CR3-target values. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200416000739.9012-1-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-04-21KVM: x86/mmu: Avoid an extra memslot lookup in try_async_pf() for L2Paolo Bonzini
Create a new function kvm_is_visible_memslot() and use it from kvm_is_visible_gfn(); use the new function in try_async_pf() too, to avoid an extra memslot lookup. Opportunistically squish a multi-line comment into a single-line comment. Note, the end result, KVM_PFN_NOSLOT, is unchanged. Cc: Jim Mattson <jmattson@google.com> Cc: Rick Edgecombe <rick.p.edgecombe@intel.com> Suggested-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
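A sketch of the new helper, following the obvious factoring of kvm_is_visible_gfn(); treat the exact body as illustrative:

  /* Sketch: a memslot is visible to the guest if it exists, is a userspace
   * slot (not an internal KVM slot), and is not marked invalid. */
  static inline bool kvm_is_visible_memslot(struct kvm_memory_slot *memslot)
  {
      return memslot && memslot->id < KVM_USER_MEM_SLOTS &&
             !(memslot->flags & KVM_MEMSLOT_INVALID);
  }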
2020-04-21KVM: x86/mmu: Set @writable to false for non-visible accesses by L2Sean Christopherson
Explicitly set @writable to false in try_async_pf() if the GFN->PFN translation is short-circuited due to the requested GFN not being visible to L2. Leaving @writable ('map_writable' in the callers) uninitialized is ok in that it's never actually consumed, but one has to track it all the way through set_spte() being short-circuited by set_mmio_spte() to understand that the uninitialized variable is benign, and relying on @writable being ignored is an unnecessary risk. Explicitly setting @writable also aligns try_async_pf() with __gfn_to_pfn_memslot(). Cc: Jim Mattson <jmattson@google.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200415214414.10194-2-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
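A sketch of the short-circuit path in try_async_pf() as described; the surrounding code and exact visibility check are assumptions:

  /* Sketch: GFN not visible to L2 -> short-circuit with a no-slot PFN and
   * explicitly report the mapping as not writable. */
  if (is_guest_mode(vcpu) && !kvm_is_visible_gfn(vcpu->kvm, gfn)) {
      *pfn = KVM_PFN_NOSLOT;
      *writable = false;
      return false;    /* no async page fault needed */
  }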
2020-04-21KVM: VMX: Cache vmcs.EXIT_INTR_INFO using arch avail_reg flagsSean Christopherson
Introduce a new "extended register" type, EXIT_INFO_2 (to pair with the nomenclature in .get_exit_info()), and use it to cache VMX's vmcs.EXIT_INTR_INFO. Drop a comment in vmx_recover_nmi_blocking() that is obsoleted by the generic caching mechanism. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200415203454.8296-6-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-04-21KVM: VMX: Cache vmcs.EXIT_QUALIFICATION using arch avail_reg flagsSean Christopherson
Introduce a new "extended register" type, EXIT_INFO_1 (to pair with the nomenclature in .get_exit_info()), and use it to cache VMX's vmcs.EXIT_QUALIFICATION. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200415203454.8296-5-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-04-21KVM: nVMX: Drop manual clearing of segment cache on nested VMCS switchSean Christopherson
Drop the call to vmx_segment_cache_clear() in vmx_switch_vmcs() now that the entire register cache is reset when switching the active VMCS, e.g. vmx_segment_cache_test_set() will reset the segment cache due to VCPU_EXREG_SEGMENTS being unavailable. Move vmx_segment_cache_clear() to vmx.c now that it's no longer invoked by the nested code. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200415203454.8296-4-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-04-21KVM: nVMX: Reset register cache (available and dirty masks) on VMCS switchSean Christopherson
Reset the per-vCPU available and dirty register masks when switching between vmcs01 and vmcs02, as the masks track state relative to the current VMCS. The stale masks don't cause problems in the current code base because the registers are either unconditionally written on nested transitions or, in the case of segment registers, have an additional tracker that is manually reset. Note, by dropping (previously implicitly, now explicitly) the dirty mask when switching the active VMCS, KVM is technically losing writes to the associated fields. But, the only regs that can be dirtied (RIP, RSP and PDPTRs) are unconditionally written on nested transitions, e.g. explicit writeback is a waste of cycles, and a WARN_ON would be rather pointless. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200415203454.8296-3-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
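A sketch of the reset described, using the arch-level mask fields; whether the patch open-codes this or wraps it in a helper is an assumption:

  /* Sketch: drop all cached-available and dirty register state when the
   * active VMCS changes, as the masks are relative to the current VMCS. */
  vcpu->arch.regs_avail = 0;
  vcpu->arch.regs_dirty = 0;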
2020-04-21KVM: nVMX: Invoke ept_save_pdptrs() if and only if PAE paging is enabledSean Christopherson
Invoke ept_save_pdptrs() when restoring L1's host state on a "late" VM-Fail if and only if PAE paging is enabled. This saves a CALL in the common case where L1 is a 64-bit host, and avoids incorrectly marking the PDPTRs as dirty. WARN if ept_save_pdptrs() is called with PAE disabled now that the nested usage pre-checks is_pae_paging(). Barring a bug in KVM's MMU, attempting to read the PDPTRs with PAE disabled is now impossible. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200415203454.8296-2-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-04-21KVM: nVMX: Rename exit_reason to vm_exit_reason for nested VM-ExitSean Christopherson
Use "vm_exit_reason" for code related to injecting a nested VM-Exit to VM-Exits to make it clear that nested_vmx_vmexit() expects the full exit eason, not just the basic exit reason. The basic exit reason (bits 15:0 of vmcs.VM_EXIT_REASON) is colloquially referred to as simply "exit reason". Note, other flows, e.g. vmx_handle_exit(), are intentionally left as is. A future patch will convert vmx->exit_reason to a union + bit-field, and the exempted flows will interact with the unionized of "exit_reason". Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200415175519.14230-10-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-04-21KVM: nVMX: Cast exit_reason to u16 to check for nested EXTERNAL_INTERRUPTSean Christopherson
Explicitly check only the basic exit reason when emulating an external interrupt VM-Exit in nested_vmx_vmexit(). Checking the full exit reason doesn't currently cause problems, but only because the only exit reason modifier supported by KVM is FAILED_VMENTRY, which is mutually exclusive with EXTERNAL_INTERRUPT. Future modifiers, e.g. ENCLAVE_MODE, will coexist with EXTERNAL_INTERRUPT. Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200415175519.14230-9-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-04-21KVM: nVMX: Pull exit_reason from vcpu_vmx in nested_vmx_reflect_vmexit()Sean Christopherson
Grab the exit reason from the vcpu struct in nested_vmx_reflect_vmexit() instead of having the exit reason explicitly passed from the caller. This fixes a discrepancy between VM-Fail and VM-Exit handling, as the VM-Fail case is already handled by checking vcpu_vmx, e.g. the exit reason previously passed on the stack is bogus if vmx->fail is set. Not taking the exit reason on the stack also avoids having to document that nested_vmx_reflect_vmexit() requires the full exit reason, as opposed to just the basic exit reason, which is not at all obvious since the only usages of the full exit reason are for tracing and way down in prepare_vmcs12() where it's propagated to vmcs12. No functional change intended. Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200415175519.14230-8-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-04-21KVM: nVMX: Drop a superfluous WARN on reflecting EXTERNAL_INTERRUPTSean Christopherson
Drop the WARN in nested_vmx_reflect_vmexit() that fires if KVM attempts to reflect an external interrupt. The WARN is blatantly impossible to hit now that nested_vmx_l0_wants_exit(), which is called from nested_vmx_reflect_vmexit(), unconditionally returns true for EXTERNAL_INTERRUPT. Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200415175519.14230-7-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-04-21KVM: nVMX: Split VM-Exit reflection logic into L0 vs. L1 wantsSean Christopherson
Split the logic that determines whether a nested VM-Exit is reflected into L1 into "L0 wants" and "L1 wants" to document the core control flow at a high level. If L0 wants the VM-Exit, e.g. because the exit is due to a hardware event that isn't passed through to L1, then KVM should handle the exit in L0 without considering L1's configuration. Then, if L0 doesn't want the exit, KVM needs to query L1's wants to determine whether or not L1 "caused" the exit, e.g. by setting an exiting control, versus the exit occurring due to an L0 setting, e.g. when L0 intercepts an action that L1 chose to pass-through. Note, this adds an extra read on vmcs.VM_EXIT_INTR_INFO for exceptions. This will be addressed in a future patch via a VMX-wide enhancement, rather than piling on another case where vmx->exit_intr_info is conditionally available. Suggested-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200415175519.14230-6-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-04-21KVM: nVMX: Move nested VM-Exit tracepoint into nested_vmx_reflect_vmexit()Sean Christopherson
Move the tracepoint for nested VM-Exits in preparation of splitting the reflection logic into L1 wants the exit vs. L0 always handles the exit. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200415175519.14230-5-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-04-21KVM: nVMX: Move VM-Fail check out of nested_vmx_exit_reflected()Sean Christopherson
Check for VM-Fail on nested VM-Enter in nested_vmx_reflect_vmexit() in preparation for separating nested_vmx_exit_reflected() into separate "L0 wants the exit" and "L1 wants the exit" helpers. Explicitly set exit_intr_info and exit_qual to zero instead of reading them from vmcs02, as they are invalid on VM-Fail (and thankfully ignored by nested_vmx_vmexit() for nested VM-Fail). Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200415175519.14230-4-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-04-21KVM: nVMX: Uninline nested_vmx_reflect_vmexit(), i.e. move it to nested.cSean Christopherson
Uninline nested_vmx_reflect_vmexit() in preparation of refactoring nested_vmx_exit_reflected() to split up the reflection logic into more consumable chunks, e.g. VM-Fail vs. L1 wants the exit vs. L0 always handles the exit. No functional change intended. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200415175519.14230-3-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-04-21KVM: nVMX: Move reflection check into nested_vmx_reflect_vmexit()Sean Christopherson
Move the call to nested_vmx_exit_reflected() from vmx_handle_exit() into nested_vmx_reflect_vmexit() and change the semantics of the return value for nested_vmx_reflect_vmexit() to indicate whether or not the exit was reflected into L1. nested_vmx_exit_reflected() and nested_vmx_reflect_vmexit() are intrinsically tied together, calling one without simultaneously calling the other makes little sense. No functional change intended. Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200415175519.14230-2-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-04-21kvm_host: unify VM_STAT and VCPU_STAT definitions in a single placeEmanuele Giuseppe Esposito
The macros VM_STAT and VCPU_STAT are redundantly implemented in multiple files, each used by a different architecture to initialize the debugfs entries for statistics. Since they all have the same purpose, they can be unified in a single common definition in include/linux/kvm_host.h Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com> Message-Id: <20200414155625.20559-1-eesposit@redhat.com> Acked-by: Cornelia Huck <cohuck@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
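A sketch of the unified definitions, mirroring the debugfs stat-descriptor pattern; treat the exact macro bodies as assumptions:

  /* Sketch: one common definition for the per-VM and per-vCPU debugfs
   * stat descriptors, instead of per-architecture copies. */
  #define VM_STAT(n, x, ...) \
      { n, offsetof(struct kvm, stat.x), KVM_STAT_VM, ## __VA_ARGS__ }
  #define VCPU_STAT(n, x, ...) \
      { n, offsetof(struct kvm_vcpu, stat.x), KVM_STAT_VCPU, ## __VA_ARGS__ }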
2020-04-21KVM: x86: move kvm_create_vcpu_debugfs after last failure pointPaolo Bonzini
The placement of kvm_create_vcpu_debugfs is more or less irrelevant, since it cannot fail and userspace should not care about the debugfs entries until it knows the vcpu has been created. Moving it after the last failure point removes the need to remove the directory when unwinding the creation. Reviewed-by: Emanuele Giuseppe Esposito <eesposit@redhat.com> Message-Id: <20200331224222.393439-1-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-04-21KVM: SVM: Use do_machine_check to pass MCE to the hostUros Bizjak
Use do_machine_check instead of INT $12 to pass the MCE to the host, the same approach VMX uses. On a related note, there is no reason to limit the use of do_machine_check to 64-bit targets, as is currently done for VMX. MCE handling works for both target families. The patch is only compile-tested for both 64-bit and 32-bit targets; someone should test the passing of the exception by injecting some MCEs into the guest. For a future non-RFC patch, kvm_machine_check should be moved to an appropriate header file. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Joerg Roedel <joro@8bytes.org> Cc: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Message-Id: <20200411153627.3474710-1-ubizjak@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
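A sketch of the helper referenced here, modeled on kvm_machine_check() in VMX; the do_machine_check() signature has varied across kernel versions, so treat the call as illustrative:

  /* Sketch: hand the MCE to the host #MC handler on a fake ring-3 frame,
   * mirroring what VMX's kvm_machine_check() does. */
  static void kvm_machine_check(void)
  {
  #if defined(CONFIG_X86_MCE)
      struct pt_regs regs = {
          .cs    = 3,            /* fake ring 3, whatever the guest ran on */
          .flags = X86_EFLAGS_IF,
      };

      do_machine_check(&regs, 0);    /* error_code argument assumed */
  #endif
  }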
2020-04-21KVM: VMX: Clean cr3/pgd handling in vmx_load_mmu_pgd()Sean Christopherson
Rename @cr3 to @pgd in vmx_load_mmu_pgd() to reflect that it will be loaded into vmcs.EPT_POINTER and not vmcs.GUEST_CR3 when EPT is enabled. Similarly, load guest_cr3 with @pgd if and only if EPT is disabled. This fixes one of the last, if not _the_ last, cases in KVM where a variable that is not strictly a cr3 value uses "cr3" instead of "pgd". Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200320212833.3507-38-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-04-21KVM: x86: Replace "cr3" with "pgd" in "new cr3/pgd" related codeSean Christopherson
Rename functions and variables in kvm_mmu_new_cr3() and related code to replace "cr3" with "pgd", i.e. continue the work started by commit 727a7e27cf88a ("KVM: x86: rename set_cr3 callback and related flags to load_mmu_pgd"). kvm_mmu_new_cr3() and company are not always loading a new CR3, e.g. when nested EPT is enabled "cr3" is actually an EPTP. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200320212833.3507-37-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-04-21KVM: nVMX: Free only the affected contexts when emulating INVEPTSean Christopherson
Add logic to handle_invept() to free only those roots that match the target EPT context when emulating a single-context INVEPT. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200320212833.3507-36-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-04-21KVM: nVMX: Don't flush TLB on nested VMX transitionSean Christopherson
Unconditionally skip the TLB flush triggered when reusing a root for a nested transition as nested_vmx_transition_tlb_flush() ensures the TLB is flushed when needed, regardless of whether the MMU can reuse a cached root (or the last root). Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200320212833.3507-35-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-04-21KVM: nVMX: Skip MMU sync on nested VMX transition when possibleSean Christopherson
Skip the MMU sync when reusing a cached root if EPT is enabled or L1 enabled VPID for L2. If EPT is enabled, guest-physical mappings aren't flushed even if VPID is disabled, i.e. L1 can't expect stale TLB entries to be flushed if it has enabled EPT and L0 isn't shadowing PTEs (for L1 or L2) if L1 has EPT disabled. If VPID is enabled (and EPT is disabled), then L1 can't expect stale TLB entries to be flushed (for itself or L2). Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200320212833.3507-34-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-04-21KVM: x86/mmu: Add module param to force TLB flush on root reuseSean Christopherson
Add a module param, flush_on_reuse, to override skip_tlb_flush and skip_mmu_sync when performing a so-called "fast cr3 switch", i.e. when reusing a cached root. The primary motivation for the control is to provide a fallback mechanism in the event that TLB flushing and/or MMU sync bugs are exposed/introduced by upcoming changes to stop unconditionally flushing on nested VMX transitions. Suggested-by: Jim Mattson <jmattson@google.com> Suggested-by: Junaid Shahid <junaids@google.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200320212833.3507-33-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
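A sketch of the module parameter; the backing variable name is an assumption, only the "flush_on_reuse" knob is named by the commit:

  /* Sketch: opt-in safety valve to force a flush+sync when reusing a root. */
  static bool __read_mostly force_flush_and_sync_on_reuse;
  module_param_named(flush_on_reuse, force_flush_and_sync_on_reuse, bool, 0644);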
2020-04-21KVM: x86/mmu: Add separate override for MMU sync during fast CR3 switchSean Christopherson
Add a separate "skip" override for MMU sync, a future change to avoid TLB flushes on nested VMX transitions may need to sync the MMU even if the TLB flush is unnecessary. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200320212833.3507-32-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-04-21KVM: x86/mmu: Move fast_cr3_switch() side effects to __kvm_mmu_new_cr3()Sean Christopherson
Handle the side effects of a fast CR3 (PGD) switch up a level in __kvm_mmu_new_cr3(), which is the only caller of fast_cr3_switch(). This consolidates handling all side effects in __kvm_mmu_new_cr3() (where freeing the current root when KVM can't do a fast switch is already handled), and ameliorates the pain of adding a second boolean in a future patch to provide a separate "skip" override for the MMU sync. Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200320212833.3507-31-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-04-21KVM: VMX: Don't reload APIC access page if its control is disabledSean Christopherson
Don't reload the APIC access page if its control is disabled, e.g. if the guest is running with x2APIC (likely) or with the local APIC disabled (unlikely), to avoid unnecessary TLB flushes and VMWRITEs. Unconditionally reload the APIC access page and flush the TLB when the guest's virtual APIC transitions to "xAPIC enabled", as any changes to the APIC access page's mapping will not be recorded while the guest's virtual APIC is disabled. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200320212833.3507-30-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-04-21KVM: VMX: Retrieve APIC access page HPA only when necessarySean Christopherson
Move the retrieval of the HPA associated with L1's APIC access page into VMX code to avoid unnecessarily calling gfn_to_page(), e.g. when the vCPU is in guest mode (L2). Alternatively, the optimization logic in VMX could be mirrored into the common x86 code, but that will get ugly fast when further optimizations are introduced. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200320212833.3507-29-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-04-21KVM: nVMX: Reload APIC access page on nested VM-Exit only if necessarySean Christopherson
Defer reloading L1's APIC page by logging the need for a reload and processing it during nested VM-Exit instead of unconditionally reloading the APIC page on nested VM-Exit. This eliminates a TLB flush on the majority of VM-Exits as the APIC page rarely needs to be reloaded. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200320212833.3507-28-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-04-21KVM: nVMX: Selectively use TLB_FLUSH_CURRENT for nested VM-Enter/VM-ExitSean Christopherson
Flush only the current context, as opposed to all contexts, when requesting a TLB flush to handle the scenario where an L1 does not expect a TLB flush, but one is required because L1 and L2 shared an ASID. This occurs if EPT is disabled (no per-EPTP tag), VPID is enabled (hardware doesn't flush unconditionally) and vmcs02 does not have its own VPID due to exhaustion of available VPIDs. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200320212833.3507-27-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-04-21KVM: x86/mmu: Use KVM_REQ_TLB_FLUSH_CURRENT for MMU specific flushesSean Christopherson
Flush only the current ASID/context when requesting a TLB flush due to a change in the current vCPU's MMU to avoid blasting away TLB entries associated with other ASIDs/contexts, e.g. entries cached for L1 when a change in L2's MMU requires a flush. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200320212833.3507-26-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-04-21KVM: x86: Introduce KVM_REQ_TLB_FLUSH_CURRENT to flush current ASIDSean Christopherson
Add KVM_REQ_TLB_FLUSH_CURRENT to allow optimized TLB flushing of VMX's EPTP/VPID contexts[*] from the KVM MMU and/or in a deferred manner, e.g. to flush L2's context during nested VM-Enter. Convert KVM_REQ_TLB_FLUSH to KVM_REQ_TLB_FLUSH_CURRENT in flows where the flush is directly associated with vCPU-scoped instruction emulation, i.e. MOV CR3 and INVPCID. Add a comment in vmx_vcpu_load_vmcs() above its KVM_REQ_TLB_FLUSH to make it clear that it deliberately requests a flush of all contexts. Service any pending flush request on nested VM-Exit as it's possible a nested VM-Exit could occur after requesting a flush for L2. Add the same logic for nested VM-Enter even though it's _extremely_ unlikely for a flush to be pending on nested VM-Enter, but theoretically possible (in the future) due to RSM (SMM) emulation.

[*] Intel also has an Address Space Identifier (ASID) concept, e.g. EPTP+VPID+PCID == ASID, it's just not documented in the SDM because the rules of invalidation are different based on which piece of the ASID is being changed, i.e. whether the EPTP, VPID, or PCID context must be invalidated.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200320212833.3507-25-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
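A sketch of how such a request might be serviced in the vCPU run loop; the callback name and dispatch style are assumptions:

  /* Sketch: service a current-context-only flush request, leaving TLB
   * entries for other ASIDs/contexts (e.g. L1 vs. L2) intact. */
  if (kvm_check_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu))
      kvm_x86_ops.tlb_flush_current(vcpu);    /* callback name assumed */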