Age | Commit message (Collapse) | Author |
|
Use the new inline stack switching and remove the old ASM indirect call
implementation.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20210210002512.972714001@linutronix.de
|
|
Convert device interrupts to inline stack switching by replacing the
existing macro implementation with the new inline version. Tweak the
function signature of the actual handler function to have the vector
argument as u32. That allows the inline macro to avoid extra intermediates
and lets the compiler be smarter about the whole thing.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20210210002512.769728139@linutronix.de
|
|
sysvec_spurious_apic_interrupt() calls into the handling body of
__spurious_interrupt() which is not obvious as that function is declared
inside the DEFINE_IDTENTRY_IRQ(spurious_interrupt) macro.
As __spurious_interrupt() is currently always inlined this ends up with two
copies of the same code for no reason.
Split the handling function out and invoke it from both entry points.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20210210002512.469379641@linutronix.de
|
|
The per CPU hardirq_stack_ptr contains the pointer to the irq stack in the
form that it is ready to be assigned to [ER]SP so that the first push ends
up on the top entry of the stack.
But the stack switching on 64 bit has the following rules:
1) Store the current stack pointer (RSP) in the top most stack entry
to allow the unwinder to link back to the previous stack
2) Set RSP to the top most stack entry
3) Invoke functions on the irq stack
4) Pop RSP from the top most stack entry (stored in #1) so it's back
to the original stack.
That requires all stack switching code to decrement the stored pointer by 8
in order to be able to store the current RSP and then set RSP to that
location. That's a pointless exercise.
Do the -8 adjustment right when storing the pointer and make the data type
a void pointer to avoid confusion vs. the struct irq_stack data type which
is on 64bit only used to declare the backing store. Move the definition
next to the inuse flag so they likely end up in the same cache
line. Sticking them into a struct to enforce it is a seperate change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20210210002512.354260928@linutronix.de
|
|
The recursion protection for hard interrupt stacks is an unsigned int per
CPU variable initialized to -1 named __irq_count.
The irq stack switching is only done when the variable is -1, which creates
worse code than just checking for 0. When the stack switching happens it
uses this_cpu_add/sub(1), but there is no reason to do so. It simply can
use straight writes. This is a historical leftover from the low level ASM
code which used inc and jz to make a decision.
Rename it to hardirq_stack_inuse, make it a bool and use plain stores.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20210210002512.228830141@linutronix.de
|
|
Currently REG_SP_INDIRECT is unused but means (%rsp + offset),
change it to mean (%rsp) + offset.
The reason is that we're going to swizzle stack in the middle of a C
function with non-trivial stack footprint. This means that when the
unwinder finds the ToS, it needs to dereference it (%rsp) and then add
the offset to the next frame, resulting in: (%rsp) + offset
This is somewhat unfortunate, since REG_BP_INDIRECT is used (by DRAP)
and thus needs to retain the current (%rbp + offset).
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Miroslav Benes <mbenes@suse.cz>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
|
|
POPF is a rather expensive operation, so don't use it for restoring
irq flags. Instead, test whether interrupts are enabled in the flags
parameter and enable interrupts via STI in that case.
This results in the restore_fl paravirt op to be no longer needed.
Suggested-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210120135555.32594-7-jgross@suse.com
|
|
USERGS_SYSRET64 is used to return from a syscall via SYSRET, but
a Xen PV guest will nevertheless use the IRET hypercall, as there
is no sysret PV hypercall defined.
So instead of testing all the prerequisites for doing a sysret and
then mangling the stack for Xen PV again for doing an iret just use
the iret exit from the beginning.
This can easily be done via an ALTERNATIVE like it is done for the
sysenter compat case already.
It should be noted that this drops the optimization in Xen for not
restoring a few registers when returning to user mode, but it seems
as if the saved instructions in the kernel more than compensate for
this drop (a kernel build in a Xen PV guest was slightly faster with
this patch applied).
While at it remove the stale sysret32 remnants.
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210120135555.32594-6-jgross@suse.com
|
|
SWAPGS is used only for interrupts coming from user mode or for
returning to user mode. So there is no reason to use the PARAVIRT
framework, as it can easily be replaced by an ALTERNATIVE depending
on X86_FEATURE_XENPV.
There are several instances using the PV-aware SWAPGS macro in paths
which are never executed in a Xen PV guest. Replace those with the
plain swapgs instruction. For SWAPGS_UNSAFE_STACK the same applies.
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Andy Lutomirski <luto@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210120135555.32594-5-jgross@suse.com
|
|
Intel Moorestown and Medfield are quite old Intel Atom based
32-bit platforms, which were in limited use in some Android phones,
tablets and consumer electronics more than eight years ago.
There are no bugs or problems ever reported outside from Intel
for breaking any of that platforms for years. It seems no real
users exists who run more or less fresh kernel on it. Commit
05f4434bc130 ("ASoC: Intel: remove mfld_machine") is also in align
with this theory.
Due to above and to reduce a burden of supporting outdated drivers,
remove the support for outdated platforms completely.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Acked-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
ACRN Hypervisor reports hypervisor features via CPUID leaf 0x40000001
which is similar to KVM. A VM can check if it's the privileged VM using
the feature bits. The Service VM is the only privileged VM by design.
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Fengwei Yin <fengwei.yin@intel.com>
Cc: Zhi Wang <zhi.a.wang@intel.com>
Cc: Zhenyu Wang <zhenyuw@linux.intel.com>
Cc: Yu Wang <yu1.wang@intel.com>
Cc: Reinette Chatre <reinette.chatre@intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Acked-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
Signed-off-by: Shuo Liu <shuo.a.liu@intel.com>
Link: https://lore.kernel.org/r/20210207031040.49576-4-shuo.a.liu@intel.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
The ACRN Hypervisor builds an I/O request when a trapped I/O access
happens in User VM. Then, ACRN Hypervisor issues an upcall by sending
a notification interrupt to the Service VM. HSM in the Service VM needs
to hook the notification interrupt to handle I/O requests.
Notification interrupts from ACRN Hypervisor are already supported and
a, currently uninitialized, callback called.
Export two APIs for HSM to setup/remove its callback.
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Fengwei Yin <fengwei.yin@intel.com>
Cc: Zhi Wang <zhi.a.wang@intel.com>
Cc: Zhenyu Wang <zhenyuw@linux.intel.com>
Cc: Yu Wang <yu1.wang@intel.com>
Cc: Reinette Chatre <reinette.chatre@intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Originally-by: Yakui Zhao <yakui.zhao@intel.com>
Reviewed-by: Zhi Wang <zhi.a.wang@intel.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Acked-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Shuo Liu <shuo.a.liu@intel.com>
Link: https://lore.kernel.org/r/20210207031040.49576-3-shuo.a.liu@intel.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
This has been shown in tests:
[ +0.000008] WARNING: CPU: 3 PID: 7620 at kernel/rcu/srcutree.c:374 cleanup_srcu_struct+0xed/0x100
This is essentially a use-after free, although SRCU notices it as
an SRCU cleanup in an invalid context.
== Background ==
SGX has a data structure (struct sgx_encl_mm) which keeps per-mm SGX
metadata. This is separate from struct sgx_encl because, in theory,
an enclave can be mapped from more than one mm. sgx_encl_mm includes
a pointer back to the sgx_encl.
This means that sgx_encl must have a longer lifetime than all of the
sgx_encl_mm's that point to it. That's usually the case: sgx_encl_mm
is freed only after the mmu_notifier is unregistered in sgx_release().
However, there's a race. If the process is exiting,
sgx_mmu_notifier_release() can be called in parallel with sgx_release()
instead of being called *by* it. The mmu_notifier path keeps encl_mm
alive past when sgx_encl can be freed. This inverts the lifetime rules
and means that sgx_mmu_notifier_release() can access a freed sgx_encl.
== Fix ==
Increase encl->refcount when encl_mm->encl is established. Release
this reference when encl_mm is freed. This ensures that encl outlives
encl_mm.
[ bp: Massage commit message. ]
Fixes: 1728ab54b4be ("x86/sgx: Add a page reclaimer")
Reported-by: Haitao Huang <haitao.huang@linux.intel.com>
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lkml.kernel.org/r/20210207221401.29933-1-jarkko@kernel.org
|
|
If the maximum performance level taken for computing the
arch_max_freq_ratio value used in the x86 scale-invariance code is
higher than the one corresponding to the cpuinfo.max_freq value
coming from the acpi_cpufreq driver, the scale-invariant utilization
falls below 100% even if the CPU runs at cpuinfo.max_freq or slightly
faster, which causes the schedutil governor to select a frequency
below cpuinfo.max_freq. That frequency corresponds to a frequency
table entry below the maximum performance level necessary to get to
the "boost" range of CPU frequencies which prevents "boost"
frequencies from being used in some workloads.
While this issue is related to scale-invariance, it may be amplified
by commit db865272d9c4 ("cpufreq: Avoid configuring old governors as
default with intel_pstate") from the 5.10 development cycle which
made it extremely easy to default to schedutil even if the preferred
driver is acpi_cpufreq as long as intel_pstate is built too, because
the mere presence of the latter effectively removes the ondemand
governor from the defaults. Distro kernels are likely to include
both intel_pstate and acpi_cpufreq on x86, so their users who cannot
use intel_pstate or choose to use acpi_cpufreq may easily be
affectecd by this issue.
If CPPC is available, it can be used to address this issue by
extending the frequency tables created by acpi_cpufreq to cover the
entire available frequency range (including "boost" frequencies) for
each CPU, but if CPPC is not there, acpi_cpufreq has no idea what
the maximum "boost" frequency is and the frequency tables created by
it cannot be extended in a meaningful way, so in that case make it
ask the arch scale-invariance code to to use the "nominal" performance
level for CPU utilization scaling in order to avoid the issue at hand.
Fixes: db865272d9c4 ("cpufreq: Avoid configuring old governors as default with intel_pstate")
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
|
|
This functionality has nothing to do with MCE, move it to the thermal
framework and untangle it from MCE.
Requested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Tested-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Link: https://lkml.kernel.org/r/20210202121003.GD18075@zn.tnic
|
|
Move the APIC_LVTTHMR read which needs to happen on the BSP, to
intel_init_thermal(). One less boot dependency.
No functional changes.
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Link: https://lkml.kernel.org/r/20210201142704.12495-2-bp@alien8.de
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull syscall entry fixes from Borislav Petkov:
- For syscall user dispatch, separate prctl operation from syscall
redirection range specification before the API has been made official
in 5.11.
- Ensure tasks using the generic syscall code do trap after returning
from a syscall when single-stepping is requested.
* tag 'core_urgent_for_v5.11_rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
entry: Use different define for selector variable in SUD
entry: Ensure trap after single-step on system call return
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Borislav Petkov:
"I hope this is the last batch of x86/urgent updates for this round:
- Remove superfluous EFI PGD range checks which lead to those
assertions failing with certain kernel configs and LLVM.
- Disable setting breakpoints on facilities involved in #DB exception
handling to avoid infinite loops.
- Add extra serialization to non-serializing MSRs (IA32_TSC_DEADLINE
and x2 APIC MSRs) to adhere to SDM's recommendation and avoid any
theoretical issues.
- Re-add the EPB MSR reading on turbostat so that it works on older
kernels which don't have the corresponding EPB sysfs file.
- Add Alder Lake to the list of CPUs which support split lock.
- Fix %dr6 register handling in order to be able to set watchpoints
with gdb again.
- Disable CET instrumentation in the kernel so that gcc doesn't add
ENDBR64 to kernel code and thus confuse tracing"
* tag 'x86_urgent_for_v5.11_rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/efi: Remove EFI PGD build time checks
x86/debug: Prevent data breakpoints on cpu_dr7
x86/debug: Prevent data breakpoints on __per_cpu_offset
x86/apic: Add extra serialization for non-serializing MSRs
tools/power/turbostat: Fallback to an MSR read for EPB
x86/split_lock: Enable the split lock feature on another Alder Lake CPU
x86/debug: Fix DR6 handling
x86/build: Disable CET instrumentation in the kernel
|
|
Commit 299155244770 ("entry: Drop usage of TIF flags in the generic syscall
code") introduced a bug on architectures using the generic syscall entry
code, in which processes stopped by PTRACE_SYSCALL do not trap on syscall
return after receiving a TIF_SINGLESTEP.
The reason is that the meaning of TIF_SINGLESTEP flag is overloaded to
cause the trap after a system call is executed, but since the above commit,
the syscall call handler only checks for the SYSCALL_WORK flags on the exit
work.
Split the meaning of TIF_SINGLESTEP such that it only means single-step
mode, and create a new type of SYSCALL_WORK to request a trap immediately
after a syscall in single-step mode. In the current implementation, the
SYSCALL_WORK flag shadows the TIF_SINGLESTEP flag for simplicity.
Update x86 to flip this bit when a tracer enables single stepping.
Fixes: 299155244770 ("entry: Drop usage of TIF flags in the generic syscall code")
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Kyle Huey <me@kylehuey.com>
Link: https://lore.kernel.org/r/87h7mtc9pr.fsf_-_@collabora.com
|
|
local_db_save() is called at the start of exc_debug_kernel(), reads DR7 and
disables breakpoints to prevent recursion.
When running in a guest (X86_FEATURE_HYPERVISOR), local_db_save() reads the
per-cpu variable cpu_dr7 to check whether a breakpoint is active or not
before it accesses DR7.
A data breakpoint on cpu_dr7 therefore results in infinite #DB recursion.
Disallow data breakpoints on cpu_dr7 to prevent that.
Fixes: 84b6a3491567a("x86/entry: Optimize local_db_save() for virt")
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20210204152708.21308-2-jiangshanlai@gmail.com
|
|
When FSGSBASE is enabled, paranoid_entry() fetches the per-CPU GSBASE value
via __per_cpu_offset or pcpu_unit_offsets.
When a data breakpoint is set on __per_cpu_offset[cpu] (read-write
operation), the specific CPU will be stuck in an infinite #DB loop.
RCU will try to send an NMI to the specific CPU, but it is not working
either since NMI also relies on paranoid_entry(). Which means it's
undebuggable.
Fixes: eaad981291ee3("x86/entry/64: Introduce the FIND_PERCPU_BASE macro")
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20210204152708.21308-1-jiangshanlai@gmail.com
|
|
PTE insertion is fundamentally racy, and this check doesn't do anything
useful. Quoting Sean:
"Yeah, it can be whacked. The original, never-upstreamed code asserted
that the resolved PFN matched the PFN being installed by the fault
handler as a sanity check on the SGX driver's EPC management. The
WARN assertion got dropped for whatever reason, leaving that useless
chunk."
Jason stumbled over this as a new user of follow_pfn(), and I'm trying
to get rid of unsafe callers of that function so it can be locked down
further.
This is independent prep work for the referenced patch series:
https://lore.kernel.org/dri-devel/20201127164131.2244124-1-daniel.vetter@ffwll.ch/
Fixes: 947c6e11fa43 ("x86/sgx: Add ptrace() support for the SGX driver")
Reported-by: Jason Gunthorpe <jgg@ziepe.ca>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Link: https://lkml.kernel.org/r/20210204184519.2809313-1-daniel.vetter@ffwll.ch
|
|
Jan Kiszka reported that the x2apic_wrmsr_fence() function uses a plain
MFENCE while the Intel SDM (10.12.3 MSR Access in x2APIC Mode) calls for
MFENCE; LFENCE.
Short summary: we have special MSRs that have weaker ordering than all
the rest. Add fencing consistent with current SDM recommendations.
This is not known to cause any issues in practice, only in theory.
Longer story below:
The reason the kernel uses a different semantic is that the SDM changed
(roughly in late 2017). The SDM changed because folks at Intel were
auditing all of the recommended fences in the SDM and realized that the
x2apic fences were insufficient.
Why was the pain MFENCE judged insufficient?
WRMSR itself is normally a serializing instruction. No fences are needed
because the instruction itself serializes everything.
But, there are explicit exceptions for this serializing behavior written
into the WRMSR instruction documentation for two classes of MSRs:
IA32_TSC_DEADLINE and the X2APIC MSRs.
Back to x2apic: WRMSR is *not* serializing in this specific case.
But why is MFENCE insufficient? MFENCE makes writes visible, but
only affects load/store instructions. WRMSR is unfortunately not a
load/store instruction and is unaffected by MFENCE. This means that a
non-serializing WRMSR could be reordered by the CPU to execute before
the writes made visible by the MFENCE have even occurred in the first
place.
This means that an x2apic IPI could theoretically be triggered before
there is any (visible) data to process.
Does this affect anything in practice? I honestly don't know. It seems
quite possible that by the time an interrupt gets to consume the (not
yet) MFENCE'd data, it has become visible, mostly by accident.
To be safe, add the SDM-recommended fences for all x2apic WRMSRs.
This also leaves open the question of the _other_ weakly-ordered WRMSR:
MSR_IA32_TSC_DEADLINE. While it has the same ordering architecture as
the x2APIC MSRs, it seems substantially less likely to be a problem in
practice. While writes to the in-memory Local Vector Table (LVT) might
theoretically be reordered with respect to a weakly-ordered WRMSR like
TSC_DEADLINE, the SDM has this to say:
In x2APIC mode, the WRMSR instruction is used to write to the LVT
entry. The processor ensures the ordering of this write and any
subsequent WRMSR to the deadline; no fencing is required.
But, that might still leave xAPIC exposed. The safest thing to do for
now is to add the extra, recommended LFENCE.
[ bp: Massage commit message, fix typos, drop accidentally added
newline to tools/arch/x86/include/asm/barrier.h. ]
Reported-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20200305174708.F77040DD@viggo.jf.intel.com
|
|
This reverts commit bde9cfa3afe4324ec251e4af80ebf9b7afaf7afe.
Changing the first memory page type from E820_TYPE_RESERVED to
E820_TYPE_RAM makes it a part of "System RAM" resource rather than a
reserved resource and this in turn causes devmem_is_allowed() to treat
is as area that can be accessed but it is filled with zeroes instead of
the actual data as previously.
The change in /dev/mem output causes lilo to fail as was reported at
slakware users forum, and probably other legacy applications will
experience similar problems.
Link: https://www.linuxquestions.org/questions/slackware-14/slackware-current-lilo-vesa-warnings-after-recent-updates-4175689617/#post6214439
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
task_user_regset_view() has nonsensical semantics, but those semantics
appear to be relied on by existing users of PTRACE_GETREGSET and
PTRACE_SETREGSET. (See added comments below for details.)
It shouldn't be used for PTRACE_GETREGS or PTRACE_SETREGS, though. A
native 64-bit ptrace() call and an x32 ptrace() call using GETREGS
or SETREGS wants the 64-bit regset views, and a 32-bit ptrace() call
(native or compat) should use the 32-bit regset.
task_user_regset_view() almost does this except that it will
malfunction if a ptracer is itself ptraced and the outer ptracer
modifies CS on entry to a ptrace() syscall. Hopefully that has never
happened. (The compat ptrace() code already hardcoded the 32-bit
regset, so this change has no effect on that path.)
Improve the situation and deobfuscate the code by hardcoding the
64-bit view in the x32 ptrace() and selecting the view based on the
kernel config in the native ptrace().
I tried to figure out the history behind this API. I naïvely assumed
that PTRAGE_GETREGSET and PTRACE_SETREGSET were ancient APIs that
predated compat, but no. They were introduced by
2225a122ae26 ("ptrace: Add support for generic PTRACE_GETREGSET/PTRACE_SETREGSET")
in 2010, and they are simply a poor design. ELF core dumps have the
ELF e_machine field and a bunch of register sets in ELF notes, and the
pair (e_machine, NT_XXX) indicates the format of the regset blob. But
the new PTRACE_GET/SETREGSET API coopted the NT_XXX numbering without
any way to specify which e_machine was in effect. This is especially
bad on x86, where a process can freely switch between 32-bit and
64-bit mode, and, in fact, the PTRAGE_SETREGSET call itself can cause
this switch to happen. Oops.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/9daa791d0c7eaebd59c5bc2b2af1b0e7bebe707d.1612375698.git.luto@kernel.org
|
|
Force all CPUs to do VMXOFF (via NMI shootdown) during an emergency
reboot if VMX is _supported_, as VMX being off on the current CPU does
not prevent other CPUs from being in VMX root (post-VMXON). This fixes
a bug where a crash/panic reboot could leave other CPUs in VMX root and
prevent them from being woken via INIT-SIPI-SIPI in the new kernel.
Fixes: d176720d34c7 ("x86: disable VMX on all CPUs on reboot")
Cc: stable@vger.kernel.org
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: David P. Reed <dpreed@deepplum.com>
[sean: reworked changelog and further tweaked comment]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20201231002702.2223707-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Export x2apic_mode so that KVM can query whether x2APIC is active
without having to incur the RDMSR in x2apic_enabled(). When Posted
Interrupts are in use for a guest with an assigned device, KVM ends up
checking for x2APIC at least once every time a vCPU halts. KVM could
obviously snapshot x2apic_enabled() to avoid the RDMSR, but that's
rather silly given that x2apic_mode holds the exact info needed by KVM.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210115220354.434807-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Add Alder Lake mobile processor to CPU list to enumerate and enable the
split lock feature.
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20210201190007.4031869-1-fenghua.yu@intel.com
|
|
Tom reported that one of the GDB test-cases failed, and Boris bisected
it to commit:
d53d9bc0cf78 ("x86/debug: Change thread.debugreg6 to thread.virtual_dr6")
The debugging session led us to commit:
6c0aca288e72 ("x86: Ignore trap bits on single step exceptions")
It turns out that TF and data breakpoints are both traps and will be
merged, while instruction breakpoints are faults and will not be merged.
This means 6c0aca288e72 is wrong, only TF and instruction breakpoints
need to be excluded while TF and data breakpoints can be merged.
[ bp: Massage commit message. ]
Fixes: d53d9bc0cf78 ("x86/debug: Change thread.debugreg6 to thread.virtual_dr6")
Fixes: 6c0aca288e72 ("x86: Ignore trap bits on single step exceptions")
Reported-by: Tom de Vries <tdevries@suse.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/YBMAbQGACujjfz%2Bi@hirez.programming.kicks-ass.net
Link: https://lkml.kernel.org/r/20210128211627.GB4348@worktop.programming.kicks-ass.net
|
|
free_ldt_pgtables() uses the MMU gather API for batching TLB flushes
over the call to free_pgd_range(). However, tlb_gather_mmu() expects
to operate on user addresses and so passing LDT_{BASE,END}_ADDR will
confuse the range setting logic in __tlb_adjust_range(), causing the
gather to identify a range starting at TASK_SIZE. Such a large range
will be converted into a 'fullmm' flush by the low-level invalidation
code, so change the caller to invoke tlb_gather_mmu_fullmm() directly.
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Yu Zhao <yuzhao@google.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lkml.kernel.org/r/20210127235347.1402-7-will@kernel.org
|
|
The 'start' and 'end' arguments to tlb_gather_mmu() are no longer
needed now that there is a separate function for 'fullmm' flushing.
Remove the unused arguments and update all callers.
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Yu Zhao <yuzhao@google.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/CAHk-=wjQWa14_4UpfDf=fiineNP+RH74kZeDMo_f1D35xNzq9w@mail.gmail.com
|
|
Since commit 7a30df49f63a ("mm: mmu_gather: remove __tlb_reset_range()
for force flush"), the 'start' and 'end' arguments to tlb_finish_mmu()
are no longer used, since we flush the whole mm in case of a nested
invalidation.
Remove the unused arguments and update all callers.
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Yu Zhao <yuzhao@google.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lkml.kernel.org/r/20210127235347.1402-3-will@kernel.org
|
|
Use sizeof() instead of a constant in fpstate_sanitize_xstate().
Remove use of the address of the 0th array element of ->st_space and
->xmm_space which is equivalent to the array address itself:
No code changed:
# arch/x86/kernel/fpu/xstate.o:
text data bss dec hex filename
9694 899 4 10597 2965 xstate.o.before
9694 899 4 10597 2965 xstate.o.after
md5:
5a43fc70bad8e2a1784f67f01b71aabb xstate.o.before.asm
5a43fc70bad8e2a1784f67f01b71aabb xstate.o.after.asm
[ bp: Massage commit message. ]
Signed-off-by: Yejune Deng <yejune.deng@gmail.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210122071925.41285-1-yejune.deng@gmail.com
|
|
The "oprofile" user-space tools don't use the kernel OPROFILE support
any more, and haven't in a long time. User-space has been converted to
the perf interfaces.
Remove the old oprofile's architecture specific support.
Suggested-by: Christoph Hellwig <hch@infradead.org>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Robert Richter <rric@kernel.org>
Acked-by: William Cohen <wcohen@redhat.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Collect the scattered SME/SEV related feature flags into a dedicated
word. There are now five recognized features in CPUID.0x8000001F.EAX,
with at least one more on the horizon (SEV-SNP). Using a dedicated word
allows KVM to use its automagic CPUID adjustment logic when reporting
the set of supported features to userspace.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Brijesh Singh <brijesh.singh@amd.com>
Link: https://lkml.kernel.org/r/20210122204047.2860075-2-seanjc@google.com
|
|
This is similar to commit
b21ebf2fb4cd ("x86: Treat R_X86_64_PLT32 as R_X86_64_PC32")
but for i386. As far as the kernel is concerned, R_386_PLT32 can be
treated the same as R_386_PC32.
R_386_PLT32/R_X86_64_PLT32 are PC-relative relocation types which
can only be used by branches. If the referenced symbol is defined
externally, a PLT will be used.
R_386_PC32/R_X86_64_PC32 are PC-relative relocation types which can be
used by address taking operations and branches. If the referenced symbol
is defined externally, a copy relocation/canonical PLT entry will be
created in the executable.
On x86-64, there is no PIC vs non-PIC PLT distinction and an
R_X86_64_PLT32 relocation is produced for both `call/jmp foo` and
`call/jmp foo@PLT` with newer (2018) GNU as/LLVM integrated assembler.
This avoids canonical PLT entries (st_shndx=0, st_value!=0).
On i386, there are 2 types of PLTs, PIC and non-PIC. Currently,
the GCC/GNU as convention is to use R_386_PC32 for non-PIC PLT and
R_386_PLT32 for PIC PLT. Copy relocations/canonical PLT entries
are possible ABI issues but GCC/GNU as will likely keep the status
quo because (1) the ABI is legacy (2) the change will drop a GNU
ld diagnostic for non-default visibility ifunc in shared objects.
clang-12 -fno-pic (since [1]) can emit R_386_PLT32 for compiler
generated function declarations, because preventing canonical PLT
entries is weighed over the rare ifunc diagnostic.
Further info for the more interested:
https://github.com/ClangBuiltLinux/linux/issues/1210
https://sourceware.org/bugzilla/show_bug.cgi?id=27169
https://github.com/llvm/llvm-project/commit/a084c0388e2a59b9556f2de0083333232da3f1d6 [1]
[ bp: Massage commit message. ]
Reported-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Fangrui Song <maskray@google.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Reviewed-by: Nathan Chancellor <natechancellor@gmail.com>
Tested-by: Nick Desaulniers <ndesaulniers@google.com>
Tested-by: Nathan Chancellor <natechancellor@gmail.com>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Link: https://lkml.kernel.org/r/20210127205600.1227437-1-maskray@google.com
|
|
Commit
a7e1f67ed29f ("x86/msr: Filter MSR writes")
introduced a module parameter to disable writing to the MSR device file
and tainted the kernel upon writing. As MSR registers can be written by
the X86_IOC_WRMSR_REGS ioctl too, the same filtering and tainting should
be applied to the ioctl as well.
[ bp: Massage commit message and space out statements. ]
Fixes: a7e1f67ed29f ("x86/msr: Filter MSR writes")
Signed-off-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210127122456.13939-1-misono.tomohiro@jp.fujitsu.com
|
|
The OBJECT_FILES_NON_STANDARD annotation is used to tell objtool to
ignore a file. File-level ignores won't work when validating vmlinux.o.
Instead, tell objtool to ignore do_suspend_lowlevel() directly with the
STACK_FRAME_NON_STANDARD annotation.
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Len Brown <len.brown@intel.com>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lore.kernel.org/r/269eda576c53bc9ecc8167c211989111013a67aa.1611263462.git.jpoimboe@redhat.com
|
|
This indirect jump is harmless; annotate it to keep objtool's retpoline
validation happy.
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Len Brown <len.brown@intel.com>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lore.kernel.org/r/a7288e7043265d95c1a5d64f9fd751ead4854bdc.1611263462.git.jpoimboe@redhat.com
|
|
With objtool vmlinux.o validation of return_to_handler(), now that
objtool has visibility inside the retpoline, jumping from EMPTY state to
a proper function state results in a stack state mismatch.
return_to_handler() is actually quite normal despite the underlying
magic. Just annotate it as a normal function.
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lore.kernel.org/r/14f48e623f61dbdcd84cf27a56ed8ccae73199ef.1611263462.git.jpoimboe@redhat.com
|
|
The ORC metadata generated for UNWIND_HINT_FUNC isn't actually very
func-like. With certain usages it can cause stack state mismatches
because it doesn't set the return address (CFI_RA).
Also, users of UNWIND_HINT_RET_OFFSET no longer need to set a custom
return stack offset. Instead they just need to specify a func-like
situation, so the current ret_offset code is hacky for no good reason.
Solve both problems by simplifying the RET_OFFSET handling and
converting it into a more useful UNWIND_HINT_FUNC.
If we end up needing the old 'ret_offset' functionality again in the
future, we should be able to support it pretty easily with the addition
of a custom 'sp_offset' in UNWIND_HINT_FUNC.
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lore.kernel.org/r/db9d1f5d79dddfbb3725ef6d8ec3477ad199948d.1611263462.git.jpoimboe@redhat.com
|
|
Prevent an unreachable objtool warning after the sibling call detection
gets improved. ftrace_stub() is basically a function, annotate it as
such.
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lore.kernel.org/r/6845e1b2fb0723a95740c6674e548ba38c5ea489.1611263461.git.jpoimboe@redhat.com
|
|
Merge misc fixes from Andrew Morton:
"18 patches.
Subsystems affected by this patch series: mm (pagealloc, memcg, kasan,
memory-failure, and highmem), ubsan, proc, and MAINTAINERS"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
MAINTAINERS: add a couple more files to the Clang/LLVM section
proc_sysctl: fix oops caused by incorrect command parameters
powerpc/mm/highmem: use __set_pte_at() for kmap_local()
mips/mm/highmem: use set_pte() for kmap_local()
mm/highmem: prepare for overriding set_pte_at()
sparc/mm/highmem: flush cache and TLB
mm: fix page reference leak in soft_offline_page()
ubsan: disable unsigned-overflow check for i386
kasan, mm: fix resetting page_alloc tags for HW_TAGS
kasan, mm: fix conflicts with init_on_alloc/free
kasan: fix HW_TAGS boot parameters
kasan: fix incorrect arguments passing in kasan_add_zero_shadow
kasan: fix unaligned address is unhandled in kasan_remove_zero_shadow
mm: fix numa stats for thp migration
mm: memcg: fix memcg file_dirty numa stat
mm: memcg/slab: optimize objcg stock draining
mm: fix initialization of struct page for holes in memory layout
x86/setup: don't remove E820_TYPE_RAM for pfn 0
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Borislav Petkov:
- Correct the marking of kthreads which are supposed to run on a
specific, single CPU vs such which are affine to only one CPU, mark
per-cpu workqueue threads as such and make sure that marking
"survives" CPU hotplug. Fix CPU hotplug issues with such kthreads.
- A fix to not push away tasks on CPUs coming online.
- Have workqueue CPU hotplug code use cpu_possible_mask when breaking
affinity on CPU offlining so that pending workers can finish on newly
arrived onlined CPUs too.
- Dump tasks which haven't vacated a CPU which is currently being
unplugged.
- Register a special scale invariance callback which gets called on
resume from RAM to read out APERF/MPERF after resume and thus make
the schedutil scaling governor more precise.
* tag 'sched_urgent_for_v5.11_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched: Relax the set_cpus_allowed_ptr() semantics
sched: Fix CPU hotplug / tighten is_per_cpu_kthread()
sched: Prepare to use balance_push in ttwu()
workqueue: Restrict affinity change to rescuer
workqueue: Tag bound workers with KTHREAD_IS_PER_CPU
kthread: Extract KTHREAD_IS_PER_CPU
sched: Don't run cpu-online with balance_push() enabled
workqueue: Use cpu_possible_mask instead of cpu_active_mask to break affinity
sched/core: Print out straggler tasks in sched_cpu_dying()
x86: PM: Register syscore_ops for scale invariance
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Borislav Petkov:
- Add a new Intel model number for Alder Lake
- Differentiate which aspects of the FPU state get saved/restored when
the FPU is used in-kernel and fix a boot crash on K7 due to early
MXCSR access before CR4.OSFXSR is even set.
- A couple of noinstr annotation fixes
- Correct die ID setting on AMD for users of topology information which
need the correct die ID
- A SEV-ES fix to handle string port IO to/from kernel memory properly
* tag 'x86_urgent_for_v5.11_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/cpu: Add another Alder Lake CPU to the Intel family
x86/mmx: Use KFPU_387 for MMX string operations
x86/fpu: Add kernel_fpu_begin_mask() to selectively initialize state
x86/topology: Make __max_die_per_package available unconditionally
x86: __always_inline __{rd,wr}msr()
x86/mce: Remove explicit/superfluous tracing
locking/lockdep: Avoid noinstr warning for DEBUG_LOCKDEP
locking/lockdep: Cure noinstr fail
x86/sev: Fix nonistr violation
x86/entry: Fix noinstr fail
x86/cpu/amd: Set __max_die_per_package on AMD
x86/sev-es: Handle string port IO to kernel memory properly
|
|
Patch series "mm: fix initialization of struct page for holes in memory layout", v3.
Commit 73a6e474cb37 ("mm: memmap_init: iterate over memblock regions
rather that check each PFN") exposed several issues with the memory map
initialization and these patches fix those issues.
Initially there were crashes during compaction that Qian Cai reported
back in April [1]. It seemed back then that the problem was fixed, but
a few weeks ago Andrea Arcangeli hit the same bug [2] and there was an
additional discussion at [3].
[1] https://lore.kernel.org/lkml/8C537EB7-85EE-4DCF-943E-3CC0ED0DF56D@lca.pw
[2] https://lore.kernel.org/lkml/20201121194506.13464-1-aarcange@redhat.com
[3] https://lore.kernel.org/mm-commits/20201206005401.qKuAVgOXr%akpm@linux-foundation.org
This patch (of 2):
The first 4Kb of memory is a BIOS owned area and to avoid its allocation
for the kernel it was not listed in e820 tables as memory. As the result,
pfn 0 was never recognised by the generic memory management and it is not
a part of neither node 0 nor ZONE_DMA.
If set_pfnblock_flags_mask() would be ever called for the pageblock
corresponding to the first 2Mbytes of memory, having pfn 0 outside of
ZONE_DMA would trigger
VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page), pfn), page);
Along with reserving the first 4Kb in e820 tables, several first pages are
reserved with memblock in several places during setup_arch(). These
reservations are enough to ensure the kernel does not touch the BIOS area
and it is not necessary to remove E820_TYPE_RAM for pfn 0.
Remove the update of e820 table that changes the type of pfn 0 and move
the comment describing why it was done to trim_low_memory_range() that
reserves the beginning of the memory.
Link: https://lkml.kernel.org/r/20210111194017.22696-2-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: David Hildenbrand <david@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Qian Cai <cai@lca.pw>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The implementation was rather buggy. It unconditionally marked PTEs
read-only, even for VM_SHARED mappings. I'm not sure whether this is
actually a problem, but it certainly seems unwise. More importantly, it
released the mmap lock before flushing the TLB, which could allow a racing
CoW operation to falsely believe that the underlying memory was not
writable.
I can't find any users at all of this mechanism, so just remove it.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Stas Sergeev <stsp2@yandex.ru>
Link: https://lkml.kernel.org/r/f3086de0babcab36f69949b5780bde851f719bc8.1611078018.git.luto@kernel.org
|
|
device_initcall() expects a function of type initcall_t, which returns
an integer. Change the signature of sgx_init() to match.
Fixes: e7e0545299d8c ("x86/sgx: Initialize metadata for Enclave Page Cache (EPC) sections")
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Link: https://lkml.kernel.org/r/20210113232311.277302-1-samitolvanen@google.com
|
|
Currently, requesting kernel FPU access doesn't distinguish which parts of
the extended ("FPU") state are needed. This is nice for simplicity, but
there are a few cases in which it's suboptimal:
- The vast majority of in-kernel FPU users want XMM/YMM/ZMM state but do
not use legacy 387 state. These users want MXCSR initialized but don't
care about the FPU control word. Skipping FNINIT would save time.
(Empirically, FNINIT is several times slower than LDMXCSR.)
- Code that wants MMX doesn't want or need MXCSR initialized.
_mmx_memcpy(), for example, can run before CR4.OSFXSR gets set, and
initializing MXCSR will fail because LDMXCSR generates an #UD when the
aforementioned CR4 bit is not set.
- Any future in-kernel users of XFD (eXtended Feature Disable)-capable
dynamic states will need special handling.
Add a more specific API that allows callers to specify exactly what they
want.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Krzysztof Piotr Olędzki <ole@ans.pl>
Link: https://lkml.kernel.org/r/aff1cac8b8fc7ee900cf73e8f2369966621b053f.1611205691.git.luto@kernel.org
|
|
On x86 scale invariace tends to be disabled during resume from
suspend-to-RAM, because the MPERF or APERF MSR values are not as
expected then due to updates taking place after the platform
firmware has been invoked to complete the suspend transition.
That, of course, is not desirable, especially if the schedutil
scaling governor is in use, because the lack of scale invariance
causes it to be less reliable.
To counter that effect, modify init_freq_invariance() to register
a syscore_ops object for scale invariance with the ->resume callback
pointing to init_counter_refs() which will run on the CPU starting
the resume transition (the other CPUs will be taken care of the
"online" operations taking place later).
Fixes: e2b0d619b400 ("x86, sched: check for counters overflow in frequency invariant accounting")
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Link: https://lkml.kernel.org/r/1803209.Mvru99baaF@kreacher
|