summaryrefslogtreecommitdiff
path: root/arch/x86/include
AgeCommit message (Collapse)Author
2020-06-01KVM: nSVM: remove HF_VINTR_MASKPaolo Bonzini
Now that the int_ctl field is stored in svm->nested.ctl.int_ctl, we can use it instead of vcpu->arch.hflags to check whether L2 is running in V_INTR_MASKING mode. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-06-01KVM: nSVM: remove trailing padding for struct vmcb_control_areaPaolo Bonzini
Allow placing the VMCB structs on the stack or in other structs without wasting too much space. Add BUILD_BUG_ON as a quick safeguard against typos. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-31Merge tag 'x86-urgent-2020-05-31' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Thomas Gleixner: "A pile of x86 fixes: - Prevent a memory leak in ioperm which was caused by the stupid assumption that the exit cleanup is always called for current, which is not the case when fork fails after taking a reference on the ioperm bitmap. - Fix an arithmething overflow in the DMA code on 32bit systems - Fill gaps in the xstate copy with defaults instead of leaving them uninitialized - Revert: "Make __X32_SYSCALL_BIT be unsigned long" as it turned out that existing user space fails to build" * tag 'x86-urgent-2020-05-31' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/ioperm: Prevent a memory leak when fork fails x86/dma: Fix max PFN arithmetic overflow on 32 bit systems copy_xstate_to_kernel(): don't leave parts of destination uninitialized x86/syscalls: Revert "x86/syscalls: Make __X32_SYSCALL_BIT be unsigned long"
2020-05-29x86: switch 32bit csum_and_copy_to_user() to user_access_{begin,end}()Al Viro
consolidate HAVE_CSUM_COPY_USER for 32bit and 64bit, while are at it Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-05-29x86: switch both 32bit and 64bit to providing csum_and_copy_from_user()Al Viro
... rather than messing with the wrapper. As a side effect, 32bit variant gets access_ok() into it and can be switched to user_access_begin()/user_access_end() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-05-28x86/ioperm: Prevent a memory leak when fork failsJay Lang
In the copy_process() routine called by _do_fork(), failure to allocate a PID (or further along in the function) will trigger an invocation to exit_thread(). This is done to clean up from an earlier call to copy_thread_tls(). Naturally, the child task is passed into exit_thread(), however during the process, io_bitmap_exit() nullifies the parent's io_bitmap rather than the child's. As copy_thread_tls() has been called ahead of the failure, the reference count on the calling thread's io_bitmap is incremented as we would expect. However, io_bitmap_exit() doesn't accept any arguments, and thus assumes it should trash the current thread's io_bitmap reference rather than the child's. This is pretty sneaky in practice, because in all instances but this one, exit_thread() is called with respect to the current task and everything works out. A determined attacker can issue an appropriate ioctl (i.e. KDENABIO) to get a bitmap allocated, and force a clone3() syscall to fail by passing in a zeroed clone_args structure. The kernel handles the erroneous struct and the buggy code path is followed, and even though the parent's reference to the io_bitmap is trashed, the child still holds a reference and thus the structure will never be freed. Fix this by tweaking io_bitmap_exit() and its subroutines to accept a task_struct argument which to operate on. Fixes: ea5f1cd7ab49 ("x86/ioperm: Remove bitmap if all permissions dropped") Signed-off-by: Jay Lang <jaytlang@mit.edu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable#@vger.kernel.org Link: https://lkml.kernel.org/r/20200524162742.253727-1-jaytlang@mit.edu
2020-05-28x86/spinlock: Remove obsolete ticket spinlock macros and typesWaiman Long
Even though the x86 ticket spinlock code has been removed with cfd8983f03c7 ("x86, locking/spinlocks: Remove ticket (spin)lock implementation") a while ago, there are still some ticket spinlock specific macros and types left in the asm/spinlock_types.h header file that are no longer used. Remove those as well to avoid confusion. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200526122014.25241-1-longman@redhat.com
2020-05-28x86/dma: Fix max PFN arithmetic overflow on 32 bit systemsAlexander Dahl
The intermediate result of the old term (4UL * 1024 * 1024 * 1024) is 4 294 967 296 or 0x100000000 which is no problem on 64 bit systems. The patch does not change the later overall result of 0x100000 for MAX_DMA32_PFN (after it has been shifted by PAGE_SHIFT). The new calculation yields the same result, but does not require 64 bit arithmetic. On 32 bit systems the old calculation suffers from an arithmetic overflow in that intermediate term in braces: 4UL aka unsigned long int is 4 byte wide and an arithmetic overflow happens (the 0x100000000 does not fit in 4 bytes), the in braces result is truncated to zero, the following right shift does not alter that, so MAX_DMA32_PFN evaluates to 0 on 32 bit systems. That wrong value is a problem in a comparision against MAX_DMA32_PFN in the init code for swiotlb in pci_swiotlb_detect_4gb() to decide if swiotlb should be active. That comparison yields the opposite result, when compiling on 32 bit systems. This was not possible before 1b7e03ef7570 ("x86, NUMA: Enable emulation on 32bit too") when that MAX_DMA32_PFN was first made visible to x86_32 (and which landed in v3.0). In practice this wasn't a problem, unless CONFIG_SWIOTLB is active on x86-32. However if one has set CONFIG_IOMMU_INTEL, since c5a5dc4cbbf4 ("iommu/vt-d: Don't switch off swiotlb if bounce page is used") there's a dependency on CONFIG_SWIOTLB, which was not necessarily active before. That landed in v5.4, where we noticed it in the fli4l Linux distribution. We have CONFIG_IOMMU_INTEL active on both 32 and 64 bit kernel configs there (I could not find out why, so let's just say historical reasons). The effect is at boot time 64 MiB (default size) were allocated for bounce buffers now, which is a noticeable amount of memory on small systems like pcengines ALIX 2D3 with 256 MiB memory, which are still frequently used as home routers. We noticed this effect when migrating from kernel v4.19 (LTS) to v5.4 (LTS) in fli4l and got that kernel messages for example: Linux version 5.4.22 (buildroot@buildroot) (gcc version 7.3.0 (Buildroot 2018.02.8)) #1 SMP Mon Nov 26 23:40:00 CET 2018 … Memory: 183484K/261756K available (4594K kernel code, 393K rwdata, 1660K rodata, 536K init, 456K bss , 78272K reserved, 0K cma-reserved, 0K highmem) … PCI-DMA: Using software bounce buffering for IO (SWIOTLB) software IO TLB: mapped [mem 0x0bb78000-0x0fb78000] (64MB) The initial analysis and the suggested fix was done by user 'sourcejedi' at stackoverflow and explicitly marked as GPLv2 for inclusion in the Linux kernel: https://unix.stackexchange.com/a/520525/50007 The new calculation, which does not suffer from that overflow, is the same as for arch/mips now as suggested by Robin Murphy. The fix was tested by fli4l users on round about two dozen different systems, including both 32 and 64 bit archs, bare metal and virtualized machines. [ bp: Massage commit message. ] Fixes: 1b7e03ef7570 ("x86, NUMA: Enable emulation on 32bit too") Reported-by: Alan Jenkins <alan.christopher.jenkins@gmail.com> Suggested-by: Robin Murphy <robin.murphy@arm.com> Signed-off-by: Alexander Dahl <post@lespocky.de> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: stable@vger.kernel.org Link: https://unix.stackexchange.com/q/520065/50007 Link: https://web.nettworks.org/bugs/browse/FFL-2560 Link: https://lkml.kernel.org/r/20200526175749.20742-1-post@lespocky.de
2020-05-28x86/mm: Drop deprecated DISCONTIGMEM support for 32-bitMike Rapoport
The DISCONTIGMEM support was marked as deprecated in v5.2 and since there were no complaints about it for almost 5 releases it can be completely removed. Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lkml.kernel.org/r/20200223094322.15206-1-rppt@kernel.org
2020-05-28KVM: nSVM: inject exceptions via svm_check_nested_eventsPaolo Bonzini
This allows exceptions injected by the emulator to be properly delivered as vmexits. The code also becomes simpler, because we can just let all L0-intercepted exceptions go through the usual path. In particular, our emulation of the VMX #DB exit qualification is very much simplified, because the vmexit injection path can use kvm_deliver_exception_payload to update DR6. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-28KVM: x86: enable event window in inject_pending_eventPaolo Bonzini
In case an interrupt arrives after nested.check_events but before the call to kvm_cpu_has_injectable_intr, we could end up enabling the interrupt window even if the interrupt is actually going to be a vmexit. This is useless rather than harmful, but it really complicates reasoning about SVM's handling of the VINTR intercept. We'd like to never bother with the VINTR intercept if V_INTR_MASKING=1 && INTERCEPT_INTR=1, because in that case there is no interrupt window and we can just exit the nested guest whenever we want. This patch moves the opening of the interrupt window inside inject_pending_event. This consolidates the check for pending interrupt/NMI/SMI in one place, and makes KVM's usage of immediate exits more consistent, extending it beyond just nested virtualization. There are two functional changes here. They only affect corner cases, but overall they simplify the inject_pending_event. - re-injection of still-pending events will also use req_immediate_exit instead of using interrupt-window intercepts. This should have no impact on performance on Intel since it simply replaces an interrupt-window or NMI-window exit for a preemption-timer exit. On AMD, which has no equivalent of the preemption time, it may incur some overhead but an actual effect on performance should only be visible in pathological cases. - kvm_arch_interrupt_allowed and kvm_vcpu_has_events will return true if an interrupt, NMI or SMI is blocked by nested_run_pending. This makes sense because entering the VM will allow it to make progress and deliver the event. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-28perf/x86/rapl: Add AMD Fam17h RAPL supportStephane Eranian
This patch enables AMD Fam17h RAPL support for the Package level metric. The support is as per AMD Fam17h Model31h (Zen2) and model 00-ffh (Zen1) PPR. The same output is available via the energy-pkg pseudo event: $ perf stat -a -I 1000 --per-socket -e power/energy-pkg/ Signed-off-by: Stephane Eranian <eranian@google.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20200527224659.206129-6-eranian@google.com
2020-05-27KVM: x86: Take an unsigned 32-bit int for has_emulated_msr()'s indexSean Christopherson
Take a u32 for the index in has_emulated_msr() to match hardware, which treats MSR indices as unsigned 32-bit values. Functionally, taking a signed int doesn't cause problems with the current code base, but could theoretically cause problems with 32-bit KVM, e.g. if the index were checked via a less-than statement, which would evaluate incorrectly for MSR indices with bit 31 set. Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200218234012.7110-3-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-27x86: Hide the archdata.iommu field behind generic IOMMU_APIKrzysztof Kozlowski
There is a generic, kernel wide configuration symbol for enabling the IOMMU specific bits: CONFIG_IOMMU_API. Implementations (including INTEL_IOMMU and AMD_IOMMU driver) select it so use it here as well. This makes the conditional archdata.iommu field consistent with other platforms and also fixes any compile test builds of other IOMMU drivers, when INTEL_IOMMU or AMD_IOMMU are not selected). For the case when INTEL_IOMMU/AMD_IOMMU and COMPILE_TEST are not selected, this should create functionally equivalent code/choice. With COMPILE_TEST this field could appear if other IOMMU drivers are chosen but neither INTEL_IOMMU nor AMD_IOMMU are not. Reported-by: kbuild test robot <lkp@intel.com> Fixes: e93a1695d7fb ("iommu: Enable compile testing for some of drivers") Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org> Acked-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20200518120855.27822-2-krzk@kernel.org Signed-off-by: Joerg Roedel <jroedel@suse.de>
2020-05-27x86/apb_timer: Drop unused declaration and macroJohan Hovold
Drop an extern declaration that has never been used and a no longer needed macro. Signed-off-by: Johan Hovold <johan@kernel.org> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20200513100944.9171-2-johan@kernel.org
2020-05-27x86/apb_timer: Drop unused TSC calibrationJohan Hovold
Drop the APB-timer TSC calibration, which hasn't been used since the removal of Moorestown support by commit 1a8359e411eb ("x86/mid: Remove Intel Moorestown"). Signed-off-by: Johan Hovold <johan@kernel.org> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20200513100944.9171-1-johan@kernel.org
2020-05-26x86/syscalls: Revert "x86/syscalls: Make __X32_SYSCALL_BIT be unsigned long"Andy Lutomirski
Revert 45e29d119e99 ("x86/syscalls: Make __X32_SYSCALL_BIT be unsigned long") and add a comment to discourage someone else from making the same mistake again. It turns out that some user code fails to compile if __X32_SYSCALL_BIT is unsigned long. See, for example [1] below. [ bp: Massage and do the same thing in the respective tools/ header. ] Fixes: 45e29d119e99 ("x86/syscalls: Make __X32_SYSCALL_BIT be unsigned long") Reported-by: Thorsten Glaser <t.glaser@tarent.de> Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: stable@kernel.org Link: [1] https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=954294 Link: https://lkml.kernel.org/r/92e55442b744a5951fdc9cfee10badd0a5f7f828.1588983892.git.luto@kernel.org
2020-05-25Merge tag 'efi-changes-for-v5.8' of ↵Ingo Molnar
git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi into efi/core More EFI changes for v5.8: - Rename pr_efi/pr_efi_err to efi_info/efi_err, and use them consistently - Simplify and unify initrd loading - Parse the builtin command line on x86 (if provided) - Implement printk() support, including support for wide character strings - Some fixes for issues introduced by the first batch of v5.8 changes - Fix a missing prototypes warning - Simplify GDT handling in early mixed mode thunking code - Some other minor fixes and cleanups Conflicts: drivers/firmware/efi/libstub/efistub.h Signed-off-by: Ingo Molnar <mingo@kernel.org>
2020-05-25Merge tag 'v5.7-rc7' into efi/core, to refresh the branch and pick up fixesIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2020-05-23x86: bitops: fix build regressionNick Desaulniers
This is easily reproducible via CC=clang + CONFIG_STAGING=y + CONFIG_VT6656=m. It turns out that if your config tickles __builtin_constant_p via differences in choices to inline or not, these statements produce invalid assembly: $ cat foo.c long a(long b, long c) { asm("orb %1, %0" : "+q"(c): "r"(b)); return c; } $ gcc foo.c foo.c: Assembler messages: foo.c:2: Error: `%rax' not allowed with `orb' Use the `%b` "x86 Operand Modifier" to instead force register allocation to select a lower-8-bit GPR operand. The "q" constraint only has meaning on -m32 otherwise is treated as "r". Not all GPRs have low-8-bit aliases for -m32. Fixes: 1651e700664b4 ("x86: Fix bitops.h warning with a moved cast") Reported-by: kernelci.org bot <bot@kernelci.org> Suggested-by: Andy Shevchenko <andriy.shevchenko@intel.com> Suggested-by: Brian Gerst <brgerst@gmail.com> Suggested-by: H. Peter Anvin <hpa@zytor.com> Suggested-by: Ilie Halip <ilie.halip@gmail.com> Signed-off-by: Nick Desaulniers <ndesaulniers@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Sedat Dilek <sedat.dilek@gmail.com> Tested-by: Nathan Chancellor <natechancellor@gmail.com> [build, clang-11] Reviewed-by: Nathan Chancellor <natechancellor@gmail.com> Reviewed-By: Brian Gerst <brgerst@gmail.com> Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Marco Elver <elver@google.com> Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Luc Van Oostenryck <luc.vanoostenryck@gmail.com> Cc: Masahiro Yamada <yamada.masahiro@socionext.com> Cc: Daniel Axtens <dja@axtens.net> Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org> Link: http://lkml.kernel.org/r/20200508183230.229464-1-ndesaulniers@google.com Link: https://github.com/ClangBuiltLinux/linux/issues/961 Link: https://lore.kernel.org/lkml/20200504193524.GA221287@google.com/ Link: https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html#x86Operandmodifiers Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-05-20efi/libstub: Add definitions for console input and eventsArvind Sankar
Add the required typedefs etc for using con_in's simple text input protocol, and for using the boottime event services. Also add the prototype for the "stall" boot service. Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu> Link: https://lore.kernel.org/r/20200518190716.751506-19-nivedita@alum.mit.edu Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
2020-05-20x86/hyperv: Split hyperv-tlfs.h into arch dependent and independent filesMichael Kelley
In preparation for adding ARM64 support, split hyperv-tlfs.h into architecture dependent and architecture independent files, similar to what has been done with mshyperv.h. Move architecture independent definitions into include/asm-generic/hyperv-tlfs.h. The split will avoid duplicating significant lines of code in the ARM64 version of hyperv-tlfs.h. The split has no functional impact. Some of the common definitions have "X64" in the symbol name. Change these to remove the "X64" in the architecture independent version of hyperv-tlfs.h, but add aliases with the "X64" in the x86 version so that x86 code will continue to compile. A later patch set will change all the references and allow removal of the aliases. Signed-off-by: Michael Kelley <mikelley@microsoft.com> Link: https://lore.kernel.org/r/20200422195737.10223-4-mikelley@microsoft.com Signed-off-by: Wei Liu <wei.liu@kernel.org>
2020-05-20x86/hyperv: Remove HV_PROCESSOR_POWER_STATE #definesMichael Kelley
The HV_PROCESSOR_POWER_STATE_C<n> #defines date back to year 2010, but they are not in the TLFS v6.0 document and are not used anywhere in Linux. Remove them. Signed-off-by: Michael Kelley <mikelley@microsoft.com> Link: https://lore.kernel.org/r/20200422195737.10223-3-mikelley@microsoft.com Signed-off-by: Wei Liu <wei.liu@kernel.org>
2020-05-20KVM: x86: hyperv: Remove duplicate definitions of Reference TSC PageMichael Kelley
The Hyper-V Reference TSC Page structure is defined twice. struct ms_hyperv_tsc_page has padding out to a full 4 Kbyte page size. But the padding is not needed because the declaration includes a union with HV_HYP_PAGE_SIZE. KVM uses the second definition, which is struct _HV_REFERENCE_TSC_PAGE, because it does not have the padding. Fix the duplication by removing the padding from ms_hyperv_tsc_page. Fix up the KVM code to use it. Remove the no longer used struct _HV_REFERENCE_TSC_PAGE. There is no functional change. Signed-off-by: Michael Kelley <mikelley@microsoft.com> Acked-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Link: https://lore.kernel.org/r/20200422195737.10223-2-mikelley@microsoft.com Signed-off-by: Wei Liu <wei.liu@kernel.org>
2020-05-20Merge tag 'noinstr-x86-kvm-2020-05-16' of ↵Paolo Bonzini
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into HEAD
2020-05-19x86/audit: Fix a -Wmissing-prototypes warning for ia32_classify_syscall()Benjamin Thiel
Lift the prototype of ia32_classify_syscall() into its own header. Signed-off-by: Benjamin Thiel <b.thiel@posteo.de> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20200516123816.2680-1-b.thiel@posteo.de
2020-05-19x86/kvm: Sanitize kvm_async_pf_task_wait()Thomas Gleixner
While working on the entry consolidation I stumbled over the KVM async page fault handler and kvm_async_pf_task_wait() in particular. It took me a while to realize that the randomly sprinkled around rcu_irq_enter()/exit() invocations are just cargo cult programming. Several patches "fixed" RCU splats by curing the symptoms without noticing that the code is flawed from a design perspective. The main problem is that this async injection is not based on a proper handshake mechanism and only respects the minimal requirement, i.e. the guest is not in a state where it has interrupts disabled. Aside of that the actual code is a convoluted one fits it all swiss army knife. It is invoked from different places with different RCU constraints: 1) Host side: vcpu_enter_guest() kvm_x86_ops->handle_exit() kvm_handle_page_fault() kvm_async_pf_task_wait() The invocation happens from fully preemptible context. 2) Guest side: The async page fault interrupted: a) user space b) preemptible kernel code which is not in a RCU read side critical section c) non-preemtible kernel code or a RCU read side critical section or kernel code with CONFIG_PREEMPTION=n which allows not to differentiate between #2b and #2c. RCU is watching for: #1 The vCPU exited and current is definitely not the idle task #2a The #PF entry code on the guest went through enter_from_user_mode() which reactivates RCU #2b There is no preemptible, interrupts enabled code in the kernel which can run with RCU looking away. (The idle task is always non preemptible). I.e. all schedulable states (#1, #2a, #2b) do not need any of this RCU voodoo at all. In #2c RCU is eventually not watching, but as that state cannot schedule anyway there is no point to worry about it so it has to invoke rcu_irq_enter() before running that code. This can be optimized, but this will be done as an extra step in course of the entry code consolidation work. So the proper solution for this is to: - Split kvm_async_pf_task_wait() into schedule and halt based waiting interfaces which share the enqueueing code. - Add comments (condensed form of this changelog) to spare others the time waste and pain of reverse engineering all of this with the help of uncomprehensible changelogs and code history. - Invoke kvm_async_pf_task_wait_schedule() from kvm_handle_page_fault(), user mode and schedulable kernel side async page faults (#1, #2a, #2b) - Invoke kvm_async_pf_task_wait_halt() for the non schedulable kernel case (#2c). For this case also remove the rcu_irq_exit()/enter() pair around the halt as it is just a pointless exercise: - vCPUs can VMEXIT at any random point and can be scheduled out for an arbitrary amount of time by the host and this is not any different except that it voluntary triggers the exit via halt. - The interrupted context could have RCU watching already. So the rcu_irq_exit() before the halt is not gaining anything aside of confusing the reader. Claiming that this might prevent RCU stalls is just an illusion. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com> Acked-by: Paolo Bonzini <pbonzini@redhat.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200505134059.262701431@linutronix.de
2020-05-19x86/kvm: Handle async page faults directly through do_page_fault()Andy Lutomirski
KVM overloads #PF to indicate two types of not-actually-page-fault events. Right now, the KVM guest code intercepts them by modifying the IDT and hooking the #PF vector. This makes the already fragile fault code even harder to understand, and it also pollutes call traces with async_page_fault and do_async_page_fault for normal page faults. Clean it up by moving the logic into do_page_fault() using a static branch. This gets rid of the platform trap_init override mechanism completely. [ tglx: Fixed up 32bit, removed error code from the async functions and massaged coding style ] Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com> Acked-by: Paolo Bonzini <pbonzini@redhat.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200505134059.169270470@linutronix.de
2020-05-19x86: Replace ist_enter() with nmi_enter()Peter Zijlstra
A few exceptions (like #DB and #BP) can happen at any location in the code, this then means that tracers should treat events from these exceptions as NMI-like. The interrupted context could be holding locks with interrupts disabled for instance. Similarly, #MC is an actual NMI-like exception. All of them use ist_enter() which only concerns itself with RCU, but does not do any of the other setup that NMIs need. This means things like: printk() raw_spin_lock_irq(&logbuf_lock); <#DB/#BP/#MC> printk() raw_spin_lock_irq(&logbuf_lock); are entirely possible (well, not really since printk tries hard to play nice, but the concept stands). So replace ist_enter() with nmi_enter(). Also observe that any nmi_enter() caller must be both notrace and NOKPROBE, or in the noinstr text section. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com> Link: https://lkml.kernel.org/r/20200505134101.525508608@linutronix.de
2020-05-19x86/entry: Get rid of ist_begin/end_non_atomic()Thomas Gleixner
This is completely overengineered and definitely not an interface which should be made available to anything else than this particular MCE case. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200505134059.462640294@linutronix.de
2020-05-18x86/cpu: Use RDRAND and RDSEED mnemonics in archrandom.hUros Bizjak
Current minimum required version of binutils is 2.23, which supports RDRAND and RDSEED instruction mnemonics. Replace the byte-wise specification of RDRAND and RDSEED with these proper mnemonics. Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: H. Peter Anvin (Intel) <hpa@zytor.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200508105817.207887-1-ubizjak@gmail.com
2020-05-18Merge tag 'v5.7-rc6' into objtool/core, to pick up fixes and resolve ↵Ingo Molnar
semantic conflict Resolve structural conflict between: 59566b0b622e: ("x86/ftrace: Have ftrace trampolines turn read-only at the end of system boot up") which introduced a new reference to 'ftrace_epilogue', and: 0298739b7983: ("x86,ftrace: Fix ftrace_regs_caller() unwind") Which renamed it to 'ftrace_caller_end'. Rename the new usage site in the merge commit. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2020-05-17Merge tag 'x86_urgent_for_v5.7-rc7' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fix from Borislav Petkov: "A single fix for early boot crashes of kernels built with gcc10 and stack protector enabled" * tag 'x86_urgent_for_v5.7-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86: Fix early boot crash on gcc-10, third try
2020-05-16Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull kvm fixes from Paolo Bonzini: "A new testcase for guest debugging (gdbstub) that exposed a bunch of bugs, mostly for AMD processors. And a few other x86 fixes" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: KVM: x86: Fix off-by-one error in kvm_vcpu_ioctl_x86_setup_mce KVM: x86: Fix pkru save/restore when guest CR4.PKE=0, move it to x86.c KVM: SVM: Disable AVIC before setting V_IRQ KVM: Introduce kvm_make_all_cpus_request_except() KVM: VMX: pass correct DR6 for GD userspace exit KVM: x86, SVM: isolate vcpu->arch.dr6 from vmcb->save.dr6 KVM: SVM: keep DR6 synchronized with vcpu->arch.dr6 KVM: nSVM: trap #DB and #BP to userspace if guest debugging is on KVM: selftests: Add KVM_SET_GUEST_DEBUG test KVM: X86: Fix single-step with KVM_SET_GUEST_DEBUG KVM: X86: Set RTM for DB_VECTOR too for KVM_EXIT_DEBUG KVM: x86: fix DR6 delivery for various cases of #DB injection KVM: X86: Declare KVM_CAP_SET_GUEST_DEBUG properly
2020-05-16x86/fpu: Introduce copy_supervisor_to_kernel()Yu-cheng Yu
The XSAVES instruction takes a mask and saves only the features specified in that mask. The kernel normally specifies that all features be saved. XSAVES also unconditionally uses the "compacted format" which means that all specified features are saved next to each other in memory. If a feature is removed from the mask, all the features after it will "move up" into earlier locations in the buffer. Introduce copy_supervisor_to_kernel(), which saves only supervisor states and then moves those states into the standard location where they are normally found. Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20200512145444.15483-9-yu-cheng.yu@intel.com
2020-05-15kvm: add halt-polling cpu usage statsDavid Matlack
Two new stats for exposing halt-polling cpu usage: halt_poll_success_ns halt_poll_fail_ns Thus sum of these 2 stats is the total cpu time spent polling. "success" means the VCPU polled until a virtual interrupt was delivered. "fail" means the VCPU had to schedule out (either because the maximum poll time was reached or it needed to yield the CPU). To avoid touching every arch's kvm_vcpu_stat struct, only update and export halt-polling cpu usage stats if we're on x86. Exporting cpu usage as a u64 and in nanoseconds means we will overflow at ~500 years, which seems reasonably large. Signed-off-by: David Matlack <dmatlack@google.com> Signed-off-by: Jon Cargille <jcargill@google.com> Reviewed-by: Jim Mattson <jmattson@google.com> Message-Id: <20200508182240.68440-1-jcargill@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-15KVM: nVMX: Migrate the VMX-preemption timerJim Mattson
The hrtimer used to emulate the VMX-preemption timer must be pinned to the same logical processor as the vCPU thread to be interrupted if we want to have any hope of adhering to the architectural specification of the VMX-preemption timer. Even with this change, the emulated VMX-preemption timer VM-exit occasionally arrives too late. Signed-off-by: Jim Mattson <jmattson@google.com> Reviewed-by: Peter Shier <pshier@google.com> Reviewed-by: Oliver Upton <oupton@google.com> Message-Id: <20200508203643.85477-4-jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-15KVM: X86: Introduce more exit_fastpath_completion enum valuesWanpeng Li
Adds a fastpath_t typedef since enum lines are a bit long, and replace EXIT_FASTPATH_SKIP_EMUL_INS with two new exit_fastpath_completion enum values. - EXIT_FASTPATH_EXIT_HANDLED kvm will still go through it's full run loop, but it would skip invoking the exit handler. - EXIT_FASTPATH_REENTER_GUEST complete fastpath, guest can be re-entered without invoking the exit handler or going back to vcpu_run Tested-by: Haiwei Li <lihaiwei@tencent.com> Cc: Haiwei Li <lihaiwei@tencent.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Message-Id: <1588055009-12677-4-git-send-email-wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-15KVM: x86: Print symbolic names of VMX VM-Exit flags in tracesSean Christopherson
Use __print_flags() to display the names of VMX flags in VM-Exit traces and strip the flags when printing the basic exit reason, e.g. so that a failed VM-Entry due to invalid guest state gets recorded as "INVALID_STATE FAILED_VMENTRY" instead of "0x80000021". Opportunstically fix misaligned variables in the kvm_exit and kvm_nested_vmexit_inject tracepoints. Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200508235348.19427-3-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-15KVM: X86: Force ASYNC_PF_PER_VCPU to be power of twoPeter Xu
Forcing the ASYNC_PF_PER_VCPU to be power of two is much easier to be used rather than calling roundup_pow_of_two() from time to time. Do this by adding a BUILD_BUG_ON() inside the hash function. Another point is that generally async pf does not allow concurrency over ASYNC_PF_PER_VCPU after all (see kvm_setup_async_pf()), so it does not make much sense either to have it not a power of two or some of the entries will definitely be wasted. Signed-off-by: Peter Xu <peterx@redhat.com> Message-Id: <20200416155859.267366-1-peterx@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-15KVM: x86/mmu: Drop KVM's hugepage enums in favor of the kernel's enumsSean Christopherson
Replace KVM's PT_PAGE_TABLE_LEVEL, PT_DIRECTORY_LEVEL and PT_PDPE_LEVEL with the kernel's PG_LEVEL_4K, PG_LEVEL_2M and PG_LEVEL_1G. KVM's enums are borderline impossible to remember and result in code that is visually difficult to audit, e.g. if (!enable_ept) ept_lpage_level = 0; else if (cpu_has_vmx_ept_1g_page()) ept_lpage_level = PT_PDPE_LEVEL; else if (cpu_has_vmx_ept_2m_page()) ept_lpage_level = PT_DIRECTORY_LEVEL; else ept_lpage_level = PT_PAGE_TABLE_LEVEL; versus if (!enable_ept) ept_lpage_level = 0; else if (cpu_has_vmx_ept_1g_page()) ept_lpage_level = PG_LEVEL_1G; else if (cpu_has_vmx_ept_2m_page()) ept_lpage_level = PG_LEVEL_2M; else ept_lpage_level = PG_LEVEL_4K; No functional change intended. Suggested-by: Barret Rhoden <brho@google.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200428005422.4235-4-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-15KVM: x86/mmu: Move max hugepage level to a separate #defineSean Christopherson
Rename PT_MAX_HUGEPAGE_LEVEL to KVM_MAX_HUGEPAGE_LEVEL and make it a separate define in anticipation of dropping KVM's PT_*_LEVEL enums in favor of the kernel's PG_LEVEL_* enums. No functional change intended. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200428005422.4235-3-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-15kvm: x86: Cleanup vcpu->arch.guest_xstate_sizeXiaoyao Li
vcpu->arch.guest_xstate_size lost its only user since commit df1daba7d1cb ("KVM: x86: support XSAVES usage in the host"), so clean it up. Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com> Message-Id: <20200429154312.1411-1-xiaoyao.li@intel.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-15KVM: nVMX: Tweak handling of failure code for nested VM-Enter failureSean Christopherson
Use an enum for passing around the failure code for a failed VM-Enter that results in VM-Exit to provide a level of indirection from the final resting place of the failure code, vmcs.EXIT_QUALIFICATION. The exit qualification field is an unsigned long, e.g. passing around 'u32 exit_qual' throws up red flags as it suggests KVM may be dropping bits when reporting errors to L1. This is a red herring because the only defined failure codes are 0, 2, 3, and 4, i.e. don't come remotely close to overflowing a u32. Setting vmcs.EXIT_QUALIFICATION on entry failure is further complicated by the MSR load list, which returns the (1-based) entry that failed, and the number of MSRs to load is a 32-bit VMCS field. At first blush, it would appear that overflowing a u32 is possible, but the number of MSRs that can be loaded is hardcapped at 4096 (limited by MSR_IA32_VMX_MISC). In other words, there are two completely disparate types of data that eventually get stuffed into vmcs.EXIT_QUALIFICATION, neither of which is an 'unsigned long' in nature. This was presumably the reasoning for switching to 'u32' when the related code was refactored in commit ca0bde28f2ed6 ("kvm: nVMX: Split VMCS checks from nested_vmx_run()"). Using an enum for the failure code addresses the technically-possible- but-will-never-happen scenario where Intel defines a failure code that doesn't fit in a 32-bit integer. The enum variables and values will either be automatically sized (gcc 5.4 behavior) or be subjected to some combination of truncation. The former case will simply work, while the latter will trigger a compile-time warning unless the compiler is being particularly unhelpful. Separating the failure code from the failed MSR entry allows for disassociating both from vmcs.EXIT_QUALIFICATION, which avoids the conundrum where KVM has to choose between 'u32 exit_qual' and tracking values as 'unsigned long' that have no business being tracked as such. To cement the split, set vmcs12->exit_qualification directly from the entry error code or failed MSR index instead of bouncing through a local variable. Opportunistically rename the variables in load_vmcs12_host_state() and vmx_set_nested_state() to call out that they're ignored, set exit_reason on demand on nested VM-Enter failure, and add a comment in nested_vmx_load_msr() to call out that returning 'i + 1' can't wrap. No functional change intended. Reported-by: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Jim Mattson <jmattson@google.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200511220529.11402-1-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-15x86: Fix early boot crash on gcc-10, third tryBorislav Petkov
... or the odyssey of trying to disable the stack protector for the function which generates the stack canary value. The whole story started with Sergei reporting a boot crash with a kernel built with gcc-10: Kernel panic — not syncing: stack-protector: Kernel stack is corrupted in: start_secondary CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.6.0-rc5—00235—gfffb08b37df9 #139 Hardware name: Gigabyte Technology Co., Ltd. To be filled by O.E.M./H77M—D3H, BIOS F12 11/14/2013 Call Trace: dump_stack panic ? start_secondary __stack_chk_fail start_secondary secondary_startup_64 -—-[ end Kernel panic — not syncing: stack—protector: Kernel stack is corrupted in: start_secondary This happens because gcc-10 tail-call optimizes the last function call in start_secondary() - cpu_startup_entry() - and thus emits a stack canary check which fails because the canary value changes after the boot_init_stack_canary() call. To fix that, the initial attempt was to mark the one function which generates the stack canary with: __attribute__((optimize("-fno-stack-protector"))) ... start_secondary(void *unused) however, using the optimize attribute doesn't work cumulatively as the attribute does not add to but rather replaces previously supplied optimization options - roughly all -fxxx options. The key one among them being -fno-omit-frame-pointer and thus leading to not present frame pointer - frame pointer which the kernel needs. The next attempt to prevent compilers from tail-call optimizing the last function call cpu_startup_entry(), shy of carving out start_secondary() into a separate compilation unit and building it with -fno-stack-protector, was to add an empty asm(""). This current solution was short and sweet, and reportedly, is supported by both compilers but we didn't get very far this time: future (LTO?) optimization passes could potentially eliminate this, which leads us to the third attempt: having an actual memory barrier there which the compiler cannot ignore or move around etc. That should hold for a long time, but hey we said that about the other two solutions too so... Reported-by: Sergei Trofimovich <slyfox@gentoo.org> Signed-off-by: Borislav Petkov <bp@suse.de> Tested-by: Kalle Valo <kvalo@codeaurora.org> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/20200314164451.346497-1-slyfox@gentoo.org
2020-05-14Merge tag 'trace-v5.7-rc5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull more tracing fixes from Steven Rostedt: "Various tracing fixes: - Fix a crash when having function tracing and function stack tracing on the command line. The ftrace trampolines are created as executable and read only. But the stack tracer tries to modify them with text_poke() which expects all kernel text to still be writable at boot. Keep the trampolines writable at boot, and convert them to read-only with the rest of the kernel. - A selftest was triggering in the ring buffer iterator code, that is no longer valid with the update of keeping the ring buffer writable while a iterator is reading. Just bail after three failed attempts to get an event and remove the warning and disabling of the ring buffer. - While modifying the ring buffer code, decided to remove all the unnecessary BUG() calls" * tag 'trace-v5.7-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: ring-buffer: Remove all BUG() calls ring-buffer: Don't deactivate the ring buffer on failed iterator reads x86/ftrace: Have ftrace trampolines turn read-only at the end of system boot up
2020-05-14x86/fpu/xstate: Update copy_kernel_to_xregs_err() for supervisor statesYu-cheng Yu
The function copy_kernel_to_xregs_err() uses XRSTOR which can work with standard or compacted format without supervisor xstates. However, when supervisor xstates are present, XRSTORS must be used. Fix it by using XRSTORS when supervisor state handling is enabled. I also considered if there were additional cases where XRSTOR might be mistakenly called instead of XRSTORS. There are only three XRSTOR sites in the kernel: 1. copy_kernel_to_xregs_booting(), already switches between XRSTOR and XRSTORS based on X86_FEATURE_XSAVES. 2. copy_user_to_xregs(), which *needs* XRSTOR because it is copying from userspace and must never copy supervisor state with XRSTORS. 3. copy_kernel_to_xregs_err() mistakenly used XRSTOR only. Fix it. [ bp: Massage commit message. ] Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lkml.kernel.org/r/20200512145444.15483-8-yu-cheng.yu@intel.com
2020-05-13KVM: x86/mmu: Capture TDP level when updating CPUIDSean Christopherson
Snapshot the TDP level now that it's invariant (SVM) or dependent only on host capabilities and guest CPUID (VMX). This avoids having to call kvm_x86_ops.get_tdp_level() when initializing a TDP MMU and/or calculating the page role, and thus avoids the associated retpoline. Drop the WARN in vmx_get_tdp_level() as updating CPUID while L2 is active is legal, if dodgy. No functional change intended. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200502043234.12481-11-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-13KVM: VMX: Add proper cache tracking for CR0Sean Christopherson
Move CR0 caching into the standard register caching mechanism in order to take advantage of the availability checks provided by regs_avail. This avoids multiple VMREADs in the (uncommon) case where kvm_read_cr0() is called multiple times in a single VM-Exit, and more importantly eliminates a kvm_x86_ops hook, saves a retpoline on SVM when reading CR0, and squashes the confusing naming discrepancy of "cache_reg" vs. "decache_cr0_guest_bits". No functional change intended. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200502043234.12481-8-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-13KVM: VMX: Add proper cache tracking for CR4Sean Christopherson
Move CR4 caching into the standard register caching mechanism in order to take advantage of the availability checks provided by regs_avail. This avoids multiple VMREADs and retpolines (when configured) during nested VMX transitions as kvm_read_cr4_bits() is invoked multiple times on each transition, e.g. when stuffing CR0 and CR3. As an added bonus, this eliminates a kvm_x86_ops hook, saves a retpoline on SVM when reading CR4, and squashes the confusing naming discrepancy of "cache_reg" vs. "decache_cr4_guest_bits". No functional change intended. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200502043234.12481-7-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>