Age | Commit message (Collapse) | Author |
|
This changes generic-compat-pmu.c so that it only uses architected
events defined in Power ISA v3.0B, rather than event encodings which,
while common to all the IBM Power Systems implementations, are
nevertheless implementation-specific rather than architected. The
intention is that any CPU implementation designed to conform to Power
ISA v3.0B or later can use generic-compat-pmu.c.
In addition to the existing events for cycles and instructions, this
adds several other architected events, including alternative encodings
for some events. In order to make it possible to measure cycles and
instructions at the same time as each other, we set the CC5-6RUN bit
in MMCR0, which makes PMC5 and PMC6 count instructions and cycles
regardless of the run bit, so their events are now PM_CYC and
PM_INST_CMPL rather than PM_RUN_CYC and PM_RUN_INST_CMPL (the latter
are still available via other event codes).
Note that POWER9 has an erratum where one architected event
(PM_FLOP_CMPL, floating-point operations completed, code 0x100f4) does
not work correctly. Given that there is a specific PMU driver for P9
which will be used in preference to generic-compat-pmu.c, that is not
a real problem.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Reviewed-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/YJD7L9yeoxvxqeYi@thinks.paulus.ozlabs.org
|
|
Instead of making bare calls to get-sensor-state, use
rtas_get_sensor(), which correctly handles busy and extended delay
statuses.
Fixes: ab519a011caa ("powerpc/pseries: Kernel DLPAR Infrastructure")
Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com>
Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210504025329.1713878-1-nathanl@linux.ibm.com
|
|
RTAS_CLOCK_BUSY is unused, remove it.
Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210503175811.1528208-1-nathanl@linux.ibm.com
|
|
There is a spelling mistake "byes" -> "bytes" in a comment of
function drc_pmem_query_stats(). Fix that typo.
Signed-off-by: Kajol Jain <kjain@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210418074003.6651-1-kjain@linux.ibm.com
|
|
Commit a21d1becaa3f ("powerpc: Reintroduce is_kvm_guest() as a fast-path
check") added is_kvm_guest() and changed kvm_para_available() to use it.
is_kvm_guest() checks a static key, kvm_guest, and that static key is
set in check_kvm_guest().
The problem is check_kvm_guest() is only called on pseries, and even
then only in some configurations. That means is_kvm_guest() always
returns false on all non-pseries and some pseries depending on
configuration. That's a bug.
For PR KVM guests this is noticable because they no longer do live
patching of themselves, which can be detected by the omission of a
message in dmesg such as:
KVM: Live patching for a fast VM worked
To fix it make check_kvm_guest() an initcall, to ensure it's always
called at boot. It needs to be core so that it runs before
kvm_guest_init() which is postcore. To be an initcall it needs to return
int, where 0 means success, so update that.
We still call it manually in pSeries_smp_probe(), because that runs
before init calls are run.
Fixes: a21d1becaa3f ("powerpc: Reintroduce is_kvm_guest() as a fast-path check")
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210623130514.2543232-1-mpe@ellerman.id.au
|
|
When we boot from open firmware (OF) using PPC_OF_BOOT_TRAMPOLINE, aka.
prom_init, we run parts of the kernel at an address other than the link
address. That happens because OF loads the kernel above zero (OF is at
zero) and we run prom_init before copying the kernel down to zero.
Currently that works even for non-relocatable kernels, because we do
various fixups to the prom_init code to make it run where it's loaded.
However those fixups are not sufficient if the kernel becomes large
enough. In that case prom_init()'s final call to __start() can end up
generating a plt branch:
bl c000000002000018 <00000078.plt_branch.__start>
That results in the kernel jumping to the linked address of __start,
0xc000000000000000, when really it needs to jump to the
0xc000000000000000 + the runtime address because the kernel is still
running at the load address.
We could do further shenanigans to handle that, see Jordan's patch for
example:
https://lore.kernel.org/linuxppc-dev/20210421021721.1539289-1-jniethe5@gmail.com
However it is much simpler to just require a kernel with prom_init() to
be built relocatable. The result works in all configurations without
further work, and requires less code.
This should have no effect on most people, as our defconfigs and
essentially all distro configs already have RELOCATABLE enabled.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210623130454.2542945-1-mpe@ellerman.id.au
|
|
blrl corrupts the link stack. Instead use bctrl when making function
calls from BPF programs.
Reported-by: Anton Blanchard <anton@ozlabs.org>
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210609090024.1446800-1-naveen.n.rao@linux.vnet.ibm.com
|
|
It is sometimes desirable to run a command on all cpus in xmon. A
typical scenario is to obtain the backtrace from all cpus in xmon if
there is a soft lockup. Add rudimentary support for the same. The
command to be run on all cpus should be prefixed with 'c#'. As an
example, 'c#t' will run 't' command and produce a backtrace on all cpus
in xmon.
Since many xmon commands are not sensible for running in this manner, we
only allow a predefined list of commands -- 'r', 'S' and 't' for now.
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210601074801.617363-1-naveen.n.rao@linux.vnet.ibm.com
|
|
Both these config options are generally enabled in distro kernels.
Enable the same in a few powerpc64 configs to get better coverage and
testing.
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210524120227.3333208-1-naveen.n.rao@linux.vnet.ibm.com
|
|
When arming and disarming probes, we currently assume that instruction
patching can never fail, and don't have a mechanism to surface errors.
Add a warning in case instruction patching ever fails.
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/18d7b1309f938c08ce07738100932b551bdd3a52.1621416666.git.naveen.n.rao@linux.vnet.ibm.com
|
|
In kprobes and xmon, we should exclude both 32-bit and 64-bit variants
of mtmsr and rfi instructions from being stepped. Have IS_RFID() also
detect a rfi instruction similar to IS_MTMSRD().
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/eee32e1b75dae85d471c89b4c0a123ad4b0aabf8.1621416666.git.naveen.n.rao@linux.vnet.ibm.com
|
|
Persistent memory devices like NVDIMMs can loose cached writes in case
something prevents flush on power-fail. Such situations are termed as
dirty shutdown and are exposed to applications as
last-shutdown-state (LSS) flag and a dirty-shutdown-counter(DSC) as
described at [1]. The latter being useful in conditions where multiple
applications want to detect a dirty shutdown event without racing with
one another.
PAPR-NVDIMMs have so far only exposed LSS style flags to indicate a
dirty-shutdown-state. This patch further adds support for DSC via the
"ibm,persistence-failed-count" device tree property of an NVDIMM. This
property is a monotonic increasing 64-bit counter thats an indication
of number of times an NVDIMM has encountered a dirty-shutdown event
causing persistence loss.
Since this value is not expected to change after system-boot hence
papr_scm reads & caches its value during NVDIMM probe and exposes it
as a PAPR sysfs attributed named 'dirty_shutdown' to match the name of
similarly named NFIT sysfs attribute. Also this value is available to
libnvdimm via PAPR_PDSM_HEALTH payload. 'struct nd_papr_pdsm_health'
has been extended to add a new member called 'dimm_dsc' presence of
which is indicated by the newly introduced PDSM_DIMM_DSC_VALID flag.
References:
[1] https://pmem.io/documents/Dirty_Shutdown_Handling-V1.0.pdf
Signed-off-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210624080621.252038-1-vaibhav@linux.ibm.com
|
|
In case performance stats for an nvdimm are not available, reading the
'perf_stats' sysfs file returns an -ENOENT error. A better approach is
to make the 'perf_stats' file entirely invisible to indicate that
performance stats for an nvdimm are unavailable.
So this patch updates 'papr_nd_attribute_group' to add a 'is_visible'
callback implemented as newly introduced 'papr_nd_attribute_visible()'
that returns an appropriate mode in case performance stats aren't
supported in a given nvdimm.
Also the initialization of 'papr_scm_priv.stat_buffer_len' is moved
from papr_scm_nvdimm_init() to papr_scm_probe() so that it value is
available when 'papr_nd_attribute_visible()' is called during nvdimm
initialization.
Even though 'perf_stats' attribute is available since v5.9, there are
no known user-space tools/scripts that are dependent on presence of its
sysfs file. Hence I dont expect any user-space breakage with this
patch.
Fixes: 2d02bf835e57 ("powerpc/papr_scm: Fetch nvdimm performance stats from PHYP")
Signed-off-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210513092349.285021-1-vaibhav@linux.ibm.com
|
|
on ppc32
Trying to use a kprobe on ppc32 results in the below splat:
BUG: Unable to handle kernel data access on read at 0x7c0802a6
Faulting instruction address: 0xc002e9f0
Oops: Kernel access of bad area, sig: 11 [#1]
BE PAGE_SIZE=4K PowerPC 44x Platform
Modules linked in:
CPU: 0 PID: 89 Comm: sh Not tainted 5.13.0-rc1-01824-g3a81c0495fdb #7
NIP: c002e9f0 LR: c0011858 CTR: 00008a47
REGS: c292fd50 TRAP: 0300 Not tainted (5.13.0-rc1-01824-g3a81c0495fdb)
MSR: 00009000 <EE,ME> CR: 24002002 XER: 20000000
DEAR: 7c0802a6 ESR: 00000000
<snip>
NIP [c002e9f0] emulate_step+0x28/0x324
LR [c0011858] optinsn_slot+0x128/0x10000
Call Trace:
opt_pre_handler+0x7c/0xb4 (unreliable)
optinsn_slot+0x128/0x10000
ret_from_syscall+0x0/0x28
The offending instruction is:
81 24 00 00 lwz r9,0(r4)
Here, we are trying to load the second argument to emulate_step():
struct ppc_inst, which is the instruction to be emulated. On ppc64,
structures are passed in registers when passed by value. However, per
the ppc32 ABI, structures are always passed to functions as pointers.
This isn't being adhered to when setting up the call to emulate_step()
in the optprobe trampoline. Fix the same.
Fixes: eacf4c0202654a ("powerpc: Enable OPTPROBES on PPC32")
Cc: stable@vger.kernel.org
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/5bdc8cbc9a95d0779e27c9ddbf42b40f51f883c0.1624425798.git.christophe.leroy@csgroup.eu
|
|
copy-paste contains implicit "copy buffer" state that can contain
arbitrary user data (if the user process executes a copy instruction).
This could be snooped by another process if a context switch hits while
the state is live. So cp_abort is executed on context switch to clear
out possible sensitive data and prevent the leak.
cp_abort is done after the low level _switch(), which means it is never
reached by newly created tasks, so they could snoop on this buffer
between their first and second context switch.
Fix this by doing the cp_abort before calling _switch. Add some
comments which should make the issue harder to miss.
Fixes: 07d2a628bc000 ("powerpc/64s: Avoid cpabort in context switch when possible")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210622053036.474678-1-npiggin@gmail.com
|
|
On booke, SYSCALL_ENTRY macro nests an FTR_SECTION with a #ifdef
CONFIG_KVM_BOOKE_HV.
Duplicate the single instruction alternative to avoid nesting.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/33db61d5f85146262dbe26648f8f87eca3cae393.1622818435.git.christophe.leroy@csgroup.eu
|
|
booke and non booke do pretty similar things in SYSCALL_ENTRY macro
just before calling jumping to transfer_to_syscall().
Do them in transfer_to_syscall() instead.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/552e27fa09394a6bc70585fcdfa237f99a5d1267.1622818435.git.christophe.leroy@csgroup.eu
|
|
To better match non booke version of SYSCALL_ENTRY macro,
interchange r1 and r11 in the booke version.
While at it, in both versions use r1 instead of r11 to save
_NIP and _CCR.
All other uses of r11 will go away in next patch, so don't
bother changing them for now.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/1684c39724a069b0ce1aa82eaee6ec194e354e4e.1622818435.git.christophe.leroy@csgroup.eu
|
|
To better match booke version of SYSCALL_ENTRY macro, interchange
r10 and r12 in the non booke version.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/5ab3a517bc883a2fc905fb2cb5ee9344f37b2cfa.1622818435.git.christophe.leroy@csgroup.eu
|
|
klimit is a global variable initialised at build time with the
value of _end.
This variable is never modified, so _end symbol can be used directly.
Remove klimit.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/9fa9ba6807c17f93f35a582c199c646c4a8bfd9c.1622800638.git.christophe.leroy@csgroup.eu
|
|
Commit aaa229529244 ("powerpc/mm: Add physical address to Linux page
table dump") changed range coalescing to only combine ranges that are
both virtually and physically contiguous, in order to avoid erroneous
combination of unrelated mappings in IOREMAP space.
But in the VMALLOC space, mappings almost never have contiguous
physical pages, so the commit mentionned above leads to dumping one
line per page for vmalloc mappings.
Taking into account the vmalloc always leave a gap between two areas,
we never have two mappings dumped as a single combination even if they
have the exact same flags. The only space that may have encountered
such an issue was the early IOREMAP which is not using vmalloc engine.
But previous commits added gaps between early IO mappings, so it is
not an issue anymore.
That commit created some difficulties with KASAN mappings, see
commit cabe8138b23c ("powerpc: dump as a single line areas mapping a
single physical page.") and with huge page, see
commit b00ff6d8c1c3 ("powerpc/ptdump: Properly handle non standard
page size").
So, almost revert commit aaa229529244 to properly coalesce pages
mapped with the same flags as before, only keep the display of the
first physical address of the range, as it can be usefull especially
for IO mappings.
It brings back powerpc at the same level as other architectures and
simplifies the conversion to GENERIC PTDUMP.
With the patch:
---[ kasan shadow mem start ]---
0xf8000000-0xf8ffffff 0x07000000 16M huge rw present dirty accessed
0xf9000000-0xf91fffff 0x01434000 2M r present accessed
0xf9200000-0xf95affff 0x02104000 3776K rw present dirty accessed
0xfef5c000-0xfeffffff 0x01434000 656K r present accessed
---[ kasan shadow mem end ]---
Before:
---[ kasan shadow mem start ]---
0xf8000000-0xf8ffffff 0x07000000 16M huge rw present dirty accessed
0xf9000000-0xf91fffff 0x01434000 16K r present accessed
0xf9200000-0xf9203fff 0x02104000 16K rw present dirty accessed
0xf9204000-0xf9207fff 0x0213c000 16K rw present dirty accessed
0xf9208000-0xf920bfff 0x02174000 16K rw present dirty accessed
0xf920c000-0xf920ffff 0x02188000 16K rw present dirty accessed
0xf9210000-0xf9213fff 0x021dc000 16K rw present dirty accessed
0xf9214000-0xf9217fff 0x02220000 16K rw present dirty accessed
0xf9218000-0xf921bfff 0x023c0000 16K rw present dirty accessed
0xf921c000-0xf921ffff 0x023d4000 16K rw present dirty accessed
0xf9220000-0xf9227fff 0x023ec000 32K rw present dirty accessed
...
0xf93b8000-0xf93e3fff 0x02614000 176K rw present dirty accessed
0xf93e4000-0xf94c3fff 0x027c0000 896K rw present dirty accessed
0xf94c4000-0xf94c7fff 0x0236c000 16K rw present dirty accessed
0xf94c8000-0xf94cbfff 0x041f0000 16K rw present dirty accessed
0xf94cc000-0xf94cffff 0x029c0000 16K rw present dirty accessed
0xf94d0000-0xf94d3fff 0x041ec000 16K rw present dirty accessed
0xf94d4000-0xf94d7fff 0x0407c000 16K rw present dirty accessed
0xf94d8000-0xf94f7fff 0x041c0000 128K rw present dirty accessed
...
0xf95ac000-0xf95affff 0x042b0000 16K rw present dirty accessed
0xfef5c000-0xfeffffff 0x01434000 16K r present accessed
---[ kasan shadow mem end ]---
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/c56ce1f5c3c75adc9811b1a5f9c410fa74183a8d.1618828806.git.christophe.leroy@csgroup.eu
|
|
Vmalloc system leaves a gap between allocated areas. It helps catching
overflows.
Do the same for IO areas which are allocated with early_ioremap_range()
until slab_is_available().
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/c433e358190fb5d47650463ea1ab755fc7b73e6e.1618828806.git.christophe.leroy@csgroup.eu
|
|
Parse to and export from UUID own type, before dereferencing.
This also fixes wrong comment (Little Endian UUID is something else)
and should eliminate the direct strict types assignments.
Fixes: 43001c52b603 ("powerpc/papr_scm: Use ibm,unit-guid as the iset cookie")
Fixes: 259a948c4ba1 ("powerpc/pseries/scm: Use a specific endian format for storing uuid from the device tree")
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210616134303.58185-1-andriy.shevchenko@linux.intel.com
|
|
The validation done at the start of dlpar_memory_add_by_ic() is an all
of nothing scenario - if any LMBs in the range is marked as RESERVED we
can fail right away.
We then can remove the 'lmbs_available' var and its check with
'lmbs_to_add' since the whole LMB range was already validated in the
previous step.
Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210622133923.295373-4-danielhb413@gmail.com
|
|
After a successful dlpar_add_lmb() call the LMB is marked as reserved.
Later on, depending whether we added enough LMBs or not, we rely on
the marked LMBs to see which ones might need to be removed, and we
remove the reservation of all of them.
These are done in for_each_drmem_lmb() loops without any break
condition. This means that we're going to check all LMBs of the partition
even after going through all the reserved ones.
This patch adds break conditions in both loops to avoid this. The
'lmbs_added' variable was renamed to 'lmbs_reserved', and it's now
being decremented each time a lmb reservation is removed, indicating
if there are still marked LMBs to be processed.
Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210622133923.295373-3-danielhb413@gmail.com
|
|
The function is counting reserved LMBs as available to be added, but
they aren't. This will cause the function to miscalculate the available
LMBs and can trigger errors later on when executing dlpar_add_lmb().
Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210622133923.295373-2-danielhb413@gmail.com
|
|
printk_safe_flush_on_panic() has special lock breaking code for the case
where we panic()ed with the console lock held. It relies on panic IPI
causing other CPUs to mark themselves offline.
Do as most other architectures do.
This effectively reverts commit de6e5d38417e ("powerpc: smp_send_stop do
not offline stopped CPUs"), unfortunately it may result in some false
positive warnings, but the alternative is more situations where we can
crash without getting messages out.
Fixes: de6e5d38417e ("powerpc: smp_send_stop do not offline stopped CPUs")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210623041245.865134-1-npiggin@gmail.com
|
|
32-bit platforms don't have irq soft masking.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210623032909.826010-1-npiggin@gmail.com
|
|
The caller has been moved to C after irq soft-mask state has been
reconciled, and Linux IRQs have been marked as disabled, so this no
longer needs to play games with IRQ internals.
Fixes: 68b34588e202 ("powerpc/64/sycall: Implement syscall entry/exit logic in C")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210623022924.704645-1-npiggin@gmail.com
|
|
PowerVM will not arbitrarily oversubscribe or stop guests, page out the
guest kernel text to a NFS volume connected by carrier pigeon to abacus
based storage, etc., as a KVM host might. So PowerVM guests are not
likely to be killed by the hard lockup watchdog in normal operation,
even with shared processor LPARs which still get a minimum allotment of
CPU time.
Enable the hard lockup detector by default on !KVM guests, which we will
assume is PowerVM. It has been useful in finding problems on bare metal
kernels.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210623021528.702241-1-npiggin@gmail.com
|
|
The PPC_RFI_SRR_DEBUG check added by patch "powerpc/64s: avoid reloading
(H)SRR registers if they are still valid" has a few deficiencies. It
does not fix the actual problem, it's not enabled by default, and it
causes a program check interrupt which can cause more difficulties.
However there are a lot of paths which may clobber SRRs or change return
regs, and difficult to have a high confidence that all paths are covered
without wider testing.
Add a relatively low overhead always-enabled check that catches most
such cases, reports once, and fixes it so the kernel can continue.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Rebase, use switch & INT names, squash in race fix from Nick]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
|
prep_irq_for_user_exit() has only one caller, squash it
inside that caller.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210617155116.2167984-18-npiggin@gmail.com
|
|
prep_irq_for_user_exit() is a superset of
prep_irq_for_kernel_enabled_exit().
Rename prep_irq_for_kernel_enabled_exit() as prep_irq_for_enabled_exit()
and have prep_irq_for_user_exit() use it.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210617155116.2167984-17-npiggin@gmail.com
|
|
prep_irq_for_user_exit() is a superset of
prep_irq_for_kernel_enabled_exit(). In order to allow refactoring in
following patch, interchange the two. This will allow
prep_irq_for_user_exit() to call a renamed version of
prep_irq_for_kernel_enabled_exit().
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210617155116.2167984-16-npiggin@gmail.com
|
|
interrupt_exit_user_prepare() is a superset of
interrupt_exit_user_prepare_main().
Refactor to avoid code duplication.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210617155116.2167984-15-npiggin@gmail.com
|
|
Rename syscall_exit_prepare_main() into interrupt_exit_prepare_main()
Pass it the 'ret' so that it can 'or' it directly instead of
oring twice, once inside the function and once outside.
And remove 'r3' parameter which is not used.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
[np: split out some changes into other patches]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210617155116.2167984-14-npiggin@gmail.com
|
|
Use the restart table facility to return from interrupt or system calls
without disabling MSR[EE] or MSR[RI].
Interrupt return asm is put into the low soft-masked region, to prevent
interrupts being processed here, although they are still taken as masked
interrupts which causes SRRs to be clobbered, and a pending soft-masked
interrupt to require replaying.
The return code uses restart table regions to redirct to a fixup handler
rather than continue with the exit, if such an interrupt happens. In
this case the interrupt return is redirected to a fixup handler which
reloads r1 for the interrupt stack and reloads registers and sets state
up to replay the soft-masked interrupt and try the exit again.
Some types of security exit fallback flushes and barriers are currently
unable to cope with reentrant interrupts, e.g., because they store some
state in the scratch SPR which would be clobbered even by masked
interrupts. For now the interrupts-enabled exits are disabled when these
flushes are used.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Guard unused exit_must_hard_disable() as reported by lkp]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210617155116.2167984-13-npiggin@gmail.com
|
|
Treat code below __end_soft_masked as soft-masked for the purpose
of alternate return. 64s already mostly does this for scv entry.
This will be used to exit from interrupts without disabling MSR[EE].
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210617155116.2167984-12-npiggin@gmail.com
|
|
Prevent interrupt restore from allowing racing hard interrupts going
ahead of previous soft-pending ones, by using the soft-masked restart
handler to allow a store to clear the soft-mask while knowing nothing
is soft-pending.
This probably doesn't matter much in practice, but it's a simple
demonstrator / test case to exercise the restart table logic.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210617155116.2167984-11-npiggin@gmail.com
|
|
The exception table fixup adjusts a failed page fault's interrupt return
location if it was taken at an address specified in the exception table,
to a corresponding fixup handler address.
Introduce a variation of that idea which adds a fixup table for NMIs and
soft-masked asynchronous interrupts. This will be used to protect
certain critical sections that are sensitive to being clobbered by
interrupts coming in (due to using the same SPRs and/or irq soft-mask
state).
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210617155116.2167984-10-npiggin@gmail.com
|
|
This frees up one more register (and takes advantage of that to
clean things up a little bit).
This register will be used in the following patch.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210617155116.2167984-9-npiggin@gmail.com
|
|
This extends the MSR[RI]=0 window a little further into the system
call in order to pair RI and EE enabling with a single mtmsrd.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210617155116.2167984-8-npiggin@gmail.com
|
|
The next patch would like to move interrupt return assembly code to a low
location before general text, so move it into its own file and include via
head_64.S
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210617155116.2167984-7-npiggin@gmail.com
|
|
When an interrupt is taken, the SRR registers are set to return to where
it left off. Unless they are modified in the meantime, or the return
address or MSR are modified, there is no need to reload these registers
when returning from interrupt.
Introduce per-CPU flags that track the validity of SRR and HSRR
registers. These are cleared when returning from interrupt, when
using the registers for something else (e.g., OPAL calls), when
adjusting the return address or MSR of a context, and when context
switching (which changes the return address and MSR).
This improves the performance of interrupt returns.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Fold in fixup patch from Nick]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210617155116.2167984-5-npiggin@gmail.com
|
|
This makes no real difference yet except that HSRR type interrupts will
use hrfid to return. This is important for the next patch.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210617155116.2167984-4-npiggin@gmail.com
|
|
The msr argument is not used, remove it.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210617155116.2167984-3-npiggin@gmail.com
|
|
CONFIG_PPC_BOOK3S should be CONFIG_PPC_BOOK3S_64. restore_math is a
no-op for other configurations.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
[np: split from another patch]
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210617155116.2167984-2-npiggin@gmail.com
|
|
Pass the value of linux_banner to firmware via option vector 7.
Option vector 7 is described in "LoPAR" Linux on Power Architecture
Reference v2.9, in table B.7 on page 824:
An ASCII character formatted null terminated string that describes
the client operating system. The string shall be human readable and
may be displayed on the console.
The string can be up to 256 bytes total, including the nul terminator.
linux_banner contains lots of information, and should make it possible
to identify the exact kernel version that is running:
const char linux_banner[] =
"Linux version " UTS_RELEASE " (" LINUX_COMPILE_BY "@"
LINUX_COMPILE_HOST ") (" LINUX_COMPILER ") " UTS_VERSION "\n";
For example:
Linux version 4.15.0-144-generic (buildd@bos02-ppc64el-018) (gcc
version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)) #148-Ubuntu SMP Sat May 8
02:32:13 UTC 2021 (Ubuntu 4.15.0-144.148-generic 4.15.18)
It's also printed at boot to the console/dmesg, which should make it
possible to correlate what firmware receives with the console/dmesg on
the machine.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210621064938.2021419-2-mpe@ellerman.id.au
|
|
In a subsequent patch we'd like to have something like a strscpy_pad()
implementation usable in prom_init.c.
Currently we have a strcpy() implementation with only one caller, so
convert it into strscpy_pad() and update the caller.
Reviewed-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210621064938.2021419-1-mpe@ellerman.id.au
|
|
When using the Radix MMU our PGD is always 64K, and must be naturally
aligned.
For a 4K page size kernel that means page alignment of swapper_pg_dir is
not sufficient, leading to failure to boot.
Use the existing MAX_PTRS_PER_PGD which has the correct value, and
avoids us hard-coding 64K here.
Fixes: e72421a085a8 ("powerpc: Define swapper_pg_dir[] in C")
Reported-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210624123420.2784187-1-mpe@ellerman.id.au
|