summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2021-02-11powerpc/kexec_file: fix FDT size estimation for kdump kernelHari Bathini
On systems with large amount of memory, loading kdump kernel through kexec_file_load syscall may fail with the below error: "Failed to update fdt with linux,drconf-usable-memory property" This happens because the size estimation for kdump kernel's FDT does not account for the additional space needed to setup usable memory properties. Fix it by accounting for the space needed to include linux,usable-memory & linux,drconf-usable-memory properties while estimating kdump kernel's FDT size. Fixes: 6ecd0163d360 ("powerpc/kexec_file: Add appropriate regions for memory reserve map") Cc: stable@vger.kernel.org # v5.9+ Signed-off-by: Hari Bathini <hbathini@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/161243826811.119001.14083048209224609814.stgit@hbathini
2021-02-11powerpc/mm: Remove dcache flush from memory remove.Aneesh Kumar K.V
We added dcache flush on memory add/remove in commit fb5924fddf9e ("powerpc/mm: Flush cache on memory hot(un)plug") to handle crashes on GPU hotplug. Instead of adding dcache flush in generic memory add/remove routine which is used even for regular memory, we should handle these devices specific flush in the device driver code. memtrace did handle this in the driver and that was removed by commit 7fd6641de28f ("powerpc/powernv/memtrace: Let the arch hotunplug code flush cache"). This patch reverts that commit. The dcache flush in memory add was removed by commit ea458effa88e ("powerpc: Don't flush caches when adding memory") which I don't think is correct. The reason why we require dcache flush in memtrace is to make sure we don't have a dirty cache when we remap a pfn to cache inhibited. We should do that when the memtrace module removes the memory and make the pfn available for HTM traces to map it as cache inhibited. The other device mentioned in commit fb5924fddf9e ("powerpc/mm: Flush cache on memory hot(un)plug") is nvlink device with coherent memory. The support for that was removed in commit 7eb3cf761927 ("powerpc/powernv: remove unused NPU DMA code") and commit 25b2995a35b6 ("mm: remove MEMORY_DEVICE_PUBLIC support") Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210203045812.234439-3-aneesh.kumar@linux.ibm.com
2021-02-11powerpc/mm: Add PG_dcache_clean to indicate dcache clean stateAneesh Kumar K.V
This just add a better name for PG_arch_1. No functional change in this patch. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210203045812.234439-2-aneesh.kumar@linux.ibm.com
2021-02-11powerpc/mm: Enable compound page check for both THP and HugeTLBAneesh Kumar K.V
THP config results in compound pages. Make sure the kernel enables the PageCompound() check with CONFIG_HUGETLB_PAGE disabled and CONFIG_TRANSPARENT_HUGEPAGE enabled. This makes sure we correctly flush the icache with THP pages. flush_dcache_icache_page only matter for platforms that don't support COHERENT_ICACHE. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210203045812.234439-1-aneesh.kumar@linux.ibm.com
2021-02-11powerpc/xive: Assign boolean values to a bool variableJiapeng Chong
Fix the following coccicheck warnings: ./arch/powerpc/kvm/book3s_xive.c:1856:2-17: WARNING: Assignment of 0/1 to bool variable. ./arch/powerpc/kvm/book3s_xive.c:1854:2-17: WARNING: Assignment of 0/1 to bool variable. Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/1612680192-43116-1-git-send-email-jiapeng.chong@linux.alibaba.com
2021-02-11powerpc/32: Preserve cr1 in exception prolog stack check to fix build errorChristophe Leroy
THREAD_ALIGN_SHIFT = THREAD_SHIFT + 1 = PAGE_SHIFT + 1 Maximum PAGE_SHIFT is 18 for 256k pages so THREAD_ALIGN_SHIFT is 19 at the maximum. No need to clobber cr1, it can be preserved when moving r1 into CR when we check stack overflow. This reduces the number of instructions in Machine Check Exception prolog and fixes a build failure reported by the kernel test robot on v5.10 stable when building with RTAS + VMAP_STACK + KVM. That build failure is due to too many instructions in the prolog hence not fitting between 0x200 and 0x300. Allthough the problem doesn't show up in mainline, it is still worth the change. Fixes: 98bf2d3f4970 ("powerpc/32s: Fix RTAS machine check with VMAP stack") Cc: stable@vger.kernel.org Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/5ae4d545e3ac58e133d2599e0deb88843cb494fc.1612768623.git.christophe.leroy@csgroup.eu
2021-02-11powerpc/64s: Remove EXSLB interrupt save areaNicholas Piggin
SLB faults should not be taken while the PACA save areas are live, all memory accesses should be fetches from the kernel text, and access to PACA and the current stack, before C code is called or any other accesses are made. All of these have pinned SLBs so will not take a SLB fault. Therefore EXSLB is not be required. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210208063406.331655-1-npiggin@gmail.com
2021-02-11powerpc/64s: syscall real mode entry use mtmsrd rather than rfidNicholas Piggin
Have the real mode system call entry handler branch to the kernel 0xc000... address and then use mtmsrd to enable the MMU, rather than use SRRs and rfid. Commit 8729c26e675c ("powerpc/64s/exception: Move real to virt switch into the common handler") implemented this style of real mode entry for other interrupt handlers, so this brings system calls into line with them, which is the main motivcation for the change. This tends to be slightly faster due to avoiding the mtsprs, and it also does not clobber the SRR registers, which becomes important in a subsequent change. The real mode entry points don't tend to be too important for performance these days, but it is possible for a hypervisor to run guests in AIL=0 mode for certian reasons. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210208063326.331502-1-npiggin@gmail.com
2021-02-11powerpc/kuap: Restore AMR after replaying soft interruptsAlexey Kardashevskiy
Since de78a9c42a79 ("powerpc: Add a framework for Kernel Userspace Access Protection"), user access helpers call user_{read|write}_access_{begin|end} when user space access is allowed. Commit 890274c2dc4c ("powerpc/64s: Implement KUAP for Radix MMU") made the mentioned helpers program a AMR special register to allow such access for a short period of time, most of the time AMR is expected to block user memory access by the kernel. Since the code accesses the user space memory, unsafe_get_user() calls might_fault() which calls arch_local_irq_restore() if either CONFIG_PROVE_LOCKING or CONFIG_DEBUG_ATOMIC_SLEEP is enabled. arch_local_irq_restore() then attempts to replay pending soft interrupts as KUAP regions have hardware interrupts enabled. If a pending interrupt happens to do user access (performance interrupts do that), it enables access for a short period of time so after returning from the replay, the user access state remains blocked and if a user page fault happens - "Bug: Read fault blocked by AMR!" appears and SIGSEGV is sent. An example trace: Bug: Read fault blocked by AMR! WARNING: CPU: 0 PID: 1603 at /home/aik/p/kernel/arch/powerpc/include/asm/book3s/64/kup-radix.h:145 CPU: 0 PID: 1603 Comm: amr Not tainted 5.10.0-rc6_v5.10-rc6_a+fstn1 #24 NIP: c00000000009ece8 LR: c00000000009ece4 CTR: 0000000000000000 REGS: c00000000dc63560 TRAP: 0700 Not tainted (5.10.0-rc6_v5.10-rc6_a+fstn1) MSR: 8000000000021033 <SF,ME,IR,DR,RI,LE> CR: 28002888 XER: 20040000 CFAR: c0000000001fa928 IRQMASK: 1 GPR00: c00000000009ece4 c00000000dc637f0 c000000002397600 000000000000001f GPR04: c0000000020eb318 0000000000000000 c00000000dc63494 0000000000000027 GPR08: c00000007fe4de68 c00000000dfe9180 0000000000000000 0000000000000001 GPR12: 0000000000002000 c0000000030a0000 0000000000000000 0000000000000000 GPR16: 0000000000000000 0000000000000000 0000000000000000 bfffffffffffffff GPR20: 0000000000000000 c0000000134a4020 c0000000019c2218 0000000000000fe0 GPR24: 0000000000000000 0000000000000000 c00000000d106200 0000000040000000 GPR28: 0000000000000000 0000000000000300 c00000000dc63910 c000000001946730 NIP __do_page_fault+0xb38/0xde0 LR __do_page_fault+0xb34/0xde0 Call Trace: __do_page_fault+0xb34/0xde0 (unreliable) handle_page_fault+0x10/0x2c --- interrupt: 300 at strncpy_from_user+0x290/0x440 LR = strncpy_from_user+0x284/0x440 strncpy_from_user+0x2f0/0x440 (unreliable) getname_flags+0x88/0x2c0 do_sys_openat2+0x2d4/0x5f0 do_sys_open+0xcc/0x140 system_call_exception+0x160/0x240 system_call_common+0xf0/0x27c To fix it save/restore the AMR when replaying interrupts, and also add a check if AMR was not blocked prior to replaying interrupts. Originally found by syzkaller. Fixes: 890274c2dc4c ("powerpc/64s: Implement KUAP for Radix MMU") Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: Nicholas Piggin <npiggin@gmail.com> [mpe: Use normal commit citation format and add full oops log to change log, move kuap_check_amr() into the restore routine to avoid warnings about unreconciled IRQ state] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210202091541.36499-1-aik@ozlabs.ru
2021-02-11powerpc/uaccess: Avoid might_fault() when user access is enabledAlexey Kardashevskiy
The amount of code executed with enabled user space access (unlocked KUAP) should be minimal. However with CONFIG_PROVE_LOCKING or CONFIG_DEBUG_ATOMIC_SLEEP enabled, might_fault() calls into various parts of the kernel, and may even end up replaying interrupts which in turn may access user space and forget to restore the KUAP state. The problem places are: 1. strncpy_from_user (and similar) which unlock KUAP and call unsafe_get_user -> __get_user_allowed -> __get_user_nocheck() with do_allow=false to skip KUAP as the caller took care of it. 2. __unsafe_put_user_goto() which is called with unlocked KUAP. eg: WARNING: CPU: 30 PID: 1 at arch/powerpc/include/asm/book3s/64/kup.h:324 arch_local_irq_restore+0x160/0x190 NIP arch_local_irq_restore+0x160/0x190 LR lock_is_held_type+0x140/0x200 Call Trace: 0xc00000007f392ff8 (unreliable) ___might_sleep+0x180/0x320 __might_fault+0x50/0xe0 filldir64+0x2d0/0x5d0 call_filldir+0xc8/0x180 ext4_readdir+0x948/0xb40 iterate_dir+0x1ec/0x240 sys_getdents64+0x80/0x290 system_call_exception+0x160/0x280 system_call_common+0xf0/0x27c Change __get_user_nocheck() to look at `do_allow` to decide whether to skip might_fault(). Since strncpy_from_user/etc call might_fault() anyway before unlocking KUAP, there should be no visible change. Drop might_fault() in __unsafe_put_user_goto() as it is only called from unsafe_put_user(), which already has KUAP unlocked. Since keeping might_fault() is still desirable for debugging, add calls to it in user_[read|write]_access_begin(). That also allows us to drop the is_kernel_addr() test, because there should be no code using user_[read|write]_access_begin() in order to access a kernel address. Fixes: de78a9c42a79 ("powerpc: Add a framework for Kernel Userspace Access Protection") Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> [mpe: Combine with related patch from myself, merge change logs] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210204121612.32721-1-aik@ozlabs.ru
2021-02-11powerpc/uaccess: Simplify unsafe_put_user() implementationMichael Ellerman
Currently unsafe_put_user() expands to __put_user_goto(), which expands to __put_user_nocheck_goto(). There are no other uses of __put_user_nocheck_goto(), and although there are some other uses of __put_user_goto() those could just use unsafe_put_user(). Every layer of indirection introduces the possibility that some code is calling that layer, and makes keeping track of the required semantics at each point more complicated. So drop __put_user_goto(), and rename __put_user_nocheck_goto() to __unsafe_put_user_goto(). The "nocheck" is implied by "unsafe". Replace the few uses of __put_user_goto() with unsafe_put_user(). Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210208135717.2618798-1-mpe@ellerman.id.au
2021-02-11powerpc/amigaone: Make amigaone_discover_phbs() staticMichael Ellerman
It's only used in setup.c, so make it static. Fixes: 053d58c87029 ("powerpc/amigaone: Move PHB discovery") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210210130804.3190952-3-mpe@ellerman.id.au
2021-02-11powerpc/mm/64s: Fix no previous prototype warningMichael Ellerman
As reported by lkp: arch/powerpc/mm/book3s64/radix_tlb.c:646:6: warning: no previous prototype for function 'exit_lazy_flush_tlb' Fix it by moving the prototype into the existing header. Fixes: 032b7f08932c ("powerpc/64s/radix: serialize_against_pte_lookup IPIs trim mm_cpumask") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210210130804.3190952-2-mpe@ellerman.id.au
2021-02-11powerpc/83xx: Fix build error when CONFIG_PCI=nMichael Ellerman
As reported by lkp: arch/powerpc/platforms/83xx/km83xx.c:183:19: error: 'mpc83xx_setup_pci' undeclared here (not in a function) 183 | .discover_phbs = mpc83xx_setup_pci, | ^~~~~~~~~~~~~~~~~ | mpc83xx_setup_arch There is a stub defined for the CONFIG_PCI=n case, but now that mpc83xx_setup_pci() is being assigned to discover_phbs the correct empty value is NULL. Fixes: 83f84041ff1c ("powerpc/83xx: Move PHB discovery") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210210130804.3190952-1-mpe@ellerman.id.au
2021-02-11powerpc: remove interrupt handler functions from the noinstr sectionNicholas Piggin
The allyesconfig ppc64 kernel fails to link with relocations unable to fit after commit 3a96570ffceb ("powerpc: convert interrupt handlers to use wrappers"), which is due to the interrupt handler functions being put into the .noinstr.text section, which the linker script places on the opposite side of the main .text section from the interrupt entry asm code which calls the handlers. This results in a lot of linker stubs that overwhelm the 252-byte sized space we allow for them, or in the case of BE a .opd relocation link error for some reason. It's not required to put interrupt handlers in the .noinstr section, previously they used NOKPROBE_SYMBOL, so take them out and replace with a NOKPROBE_SYMBOL in the wrapper macro. Remove the explicit NOKPROBE_SYMBOL macros in the interrupt handler functions. This makes a number of interrupt handlers nokprobe that were not prior to the interrupt wrappers commit, but since that commit they were made nokprobe due to being in .noinstr.text, so this fix does not change that. The fixes tag is different to the commit that first exposes the problem because it is where the wrapper macros were introduced. Fixes: 8d41fc618ab8 ("powerpc: interrupt handler wrapper functions") Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Nicholas Piggin <npiggin@gmail.com> [mpe: Slightly fix up comment wording] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210211063636.236420-1-npiggin@gmail.com
2021-02-11powerpc/powernv/pci: Use kzalloc() for phb related allocationsMichael Ellerman
As part of commit fbbefb320214 ("powerpc/pci: Move PHB discovery for PCI_DN using platforms"), I switched some allocations from memblock_alloc() to kmalloc(), otherwise memblock would warn that it was being called after slab init. However I missed that the code relied on the allocations being zeroed, without which we could end up crashing: pci_bus 0000:00: busn_res: [bus 00-ff] end is updated to ff BUG: Unable to handle kernel data access on read at 0x6b6b6b6b6b6b6af7 Faulting instruction address: 0xc0000000000dbc90 Oops: Kernel access of bad area, sig: 11 [#1] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA PowerNV ... NIP pnv_ioda_get_pe_state+0xe0/0x1d0 LR pnv_ioda_get_pe_state+0xb4/0x1d0 Call Trace: pnv_ioda_get_pe_state+0xb4/0x1d0 (unreliable) pnv_pci_config_check_eeh.isra.9+0x78/0x270 pnv_pci_read_config+0xf8/0x160 pci_bus_read_config_dword+0xa4/0x120 pci_bus_generic_read_dev_vendor_id+0x54/0x270 pci_scan_single_device+0xb8/0x140 pci_scan_slot+0x80/0x1b0 pci_scan_child_bus_extend+0x94/0x490 pcibios_scan_phb+0x1f8/0x3c0 pcibios_init+0x8c/0x12c do_one_initcall+0x94/0x510 kernel_init_freeable+0x35c/0x3fc kernel_init+0x2c/0x168 ret_from_kernel_thread+0x5c/0x70 Switch them to kzalloc(). Fixes: fbbefb320214 ("powerpc/pci: Move PHB discovery for PCI_DN using platforms") Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210211112749.3410771-1-mpe@ellerman.id.au
2021-02-09powerpc/64s: Handle program checks in wrong endian during early bootMichael Ellerman
There's a short window during boot where although the kernel is running little endian, any exceptions will cause the CPU to switch back to big endian. This situation persists until we call configure_exceptions(), which calls either the hypervisor or OPAL to configure the CPU so that exceptions will be taken in little endian (via HID0[HILE]). We don't intend to take exceptions during early boot, but one way we sometimes do is via a WARN/BUG etc. Those all boil down to a trap instruction, which will cause a program check exception. The first instruction of the program check handler is an mtsprg, which when executed in the wrong endian is an lhzu with a ~3GB displacement from r3. The content of r3 is random, so that becomes a load from some random location, and depending on the system (installed RAM etc.) can easily lead to a checkstop, or an infinitely recursive page fault. That prevents whatever the WARN/BUG was complaining about being printed to the console, and the user just sees a dead system. We can fix it by having a trampoline at the beginning of the program check handler that detects we are in the wrong endian, and flips us back to the correct endian. We can't flip MSR[LE] using mtmsr (alas), so we have to use rfid. That requires backing up SRR0/1 as well as a GPR. To do that we use SPRG0/2/3 (SPRG1 is already used for the paca). SPRG3 is user readable, but this trampoline is only active very early in boot, and SPRG3 will be reinitialised in vdso_getcpu_init() before userspace starts. With this trampoline in place we can survive a WARN early in boot and print a stack trace, which is eventually printed to the console once the console is up, eg: [83565.758545] kexec_core: Starting new kernel [ 0.000000] ------------[ cut here ]------------ [ 0.000000] static_key_enable_cpuslocked(): static key '0xc000000000ea6160' used before call to jump_label_init() [ 0.000000] WARNING: CPU: 0 PID: 0 at kernel/jump_label.c:166 static_key_enable_cpuslocked+0xfc/0x120 [ 0.000000] Modules linked in: [ 0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted 5.10.0-gcc-8.2.0-dirty #618 [ 0.000000] NIP: c0000000002fd46c LR: c0000000002fd468 CTR: c000000000170660 [ 0.000000] REGS: c000000001227940 TRAP: 0700 Not tainted (5.10.0-gcc-8.2.0-dirty) [ 0.000000] MSR: 9000000002823003 <SF,HV,VEC,VSX,FP,ME,RI,LE> CR: 24882422 XER: 20040000 [ 0.000000] CFAR: 0000000000000730 IRQMASK: 1 [ 0.000000] GPR00: c0000000002fd468 c000000001227bd0 c000000001228300 0000000000000065 [ 0.000000] GPR04: 0000000000000001 0000000000000065 c0000000010cf970 000000000000000d [ 0.000000] GPR08: 0000000000000000 0000000000000000 0000000000000000 c00000000122763f [ 0.000000] GPR12: 0000000000002000 c000000000f8a980 0000000000000000 0000000000000000 [ 0.000000] GPR16: 0000000000000000 0000000000000000 c000000000f88c8e c000000000f88c9a [ 0.000000] GPR20: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 [ 0.000000] GPR24: 0000000000000000 c000000000dea3a8 0000000000000000 c000000000f35114 [ 0.000000] GPR28: 0000002800000000 c000000000f88c9a c000000000f88c8e c000000000ea6160 [ 0.000000] NIP [c0000000002fd46c] static_key_enable_cpuslocked+0xfc/0x120 [ 0.000000] LR [c0000000002fd468] static_key_enable_cpuslocked+0xf8/0x120 [ 0.000000] Call Trace: [ 0.000000] [c000000001227bd0] [c0000000002fd468] static_key_enable_cpuslocked+0xf8/0x120 (unreliable) [ 0.000000] [c000000001227c40] [c0000000002fd4c0] static_key_enable+0x30/0x50 [ 0.000000] [c000000001227c70] [c000000000f6629c] early_page_poison_param+0x58/0x9c [ 0.000000] [c000000001227cb0] [c000000000f351b8] do_early_param+0xa4/0x10c [ 0.000000] [c000000001227d30] [c00000000011e020] parse_args+0x270/0x5e0 [ 0.000000] [c000000001227e20] [c000000000f35864] parse_early_options+0x48/0x5c [ 0.000000] [c000000001227e40] [c000000000f358d0] parse_early_param+0x58/0x84 [ 0.000000] [c000000001227e70] [c000000000f3a368] early_init_devtree+0xc4/0x490 [ 0.000000] [c000000001227f10] [c000000000f3bca0] early_setup+0xc8/0x1c8 [ 0.000000] [c000000001227f90] [000000000000c320] 0xc320 [ 0.000000] Instruction dump: [ 0.000000] 4bfffddd 7c2004ac 39200001 913f0000 4bffffb8 7c651b78 3c82ffac 3c62ffc0 [ 0.000000] 38841b00 3863f310 4bdf03a5 60000000 <0fe00000> 4bffff38 60000000 60000000 [ 0.000000] random: get_random_bytes called from print_oops_end_marker+0x40/0x80 with crng_init=0 [ 0.000000] ---[ end trace 0000000000000000 ]--- [ 0.000000] dt-cpu-ftrs: setup for ISA 3000 Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210202130207.1303975-2-mpe@ellerman.id.au
2021-02-09powerpc/64: Make stack tracing work during very early bootMichael Ellerman
If we try to stack trace very early during boot, either due to a WARN/BUG or manual dump_stack(), we will oops in valid_emergency_stack() when we try to dereference the paca_ptrs array. The fix is simple, we just return false if paca_ptrs isn't allocated yet. The stack pointer definitely isn't part of any emergency stack because we haven't allocated any yet. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210202130207.1303975-1-mpe@ellerman.id.au
2021-02-09powerpc64/idle: Fix SP offsets when saving GPRsChristopher M. Riedl
The idle entry/exit code saves/restores GPRs in the stack "red zone" (Protected Zone according to PowerPC64 ELF ABI v2). However, the offset used for the first GPR is incorrect and overwrites the back chain - the Protected Zone actually starts below the current SP. In practice this is probably not an issue, but it's still incorrect so fix it. Also expand the comments to explain why using the stack "red zone" instead of creating a new stackframe is appropriate here. Signed-off-by: Christopher M. Riedl <cmr@codefail.de> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210206072342.5067-1-cmr@codefail.de
2021-02-09powerpc/32s: Allow constant folding in mtsr()/mfsr()Christophe Leroy
On the same way as we did in wrtee(), add an alternative using mtsr/mfsr instructions instead of mtsrin/mfsrin when the segment register can be determined at compile time. Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/9baed0ff9d76723ec90f1b567ddd4ac1ecc7a190.1612612022.git.christophe.leroy@csgroup.eu
2021-02-09powerpc/32s: mfsrin()/mtsrin() become mfsr()/mtsr()Christophe Leroy
Function names should tell what the function does, not how. mfsrin() and mtsrin() are read/writing segment registers. They are called that way because they are using mfsrin and mtsrin instructions, but it doesn't matter for the caller. In preparation of following patch, change their name to mfsr() and mtsr() in order to make it obvious they manipulate segment registers without messing up with how they do it. Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/f92d99f4349391b77766745900231aa880a0efb5.1612612022.git.christophe.leroy@csgroup.eu
2021-02-09powerpc/32s: Change mfsrin() into a static inline functionChristophe Leroy
mfsrin() is a macro. Change in into an inline function to avoid conflicts in KVM and make it more evolutive. Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/72c7b9879e2e2e6f5c27dadda6486386c2b50f23.1612612022.git.christophe.leroy@csgroup.eu
2021-02-09powerpc/uaccess: Perform barrier_nospec() in KUAP allowance helpersChristophe Leroy
barrier_nospec() in uaccess helpers is there to protect against speculative accesses around access_ok(). When using user_access_begin() sequences together with unsafe_get_user() like macros, barrier_nospec() is called for every single read although we know the access_ok() is done onece. Since all user accesses must be granted by a call to either allow_read_from_user() or allow_read_write_user() which will always happen after the access_ok() check, move the barrier_nospec() there. Reported-by: Christopher M. Riedl <cmr@codefail.de> Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/c72f014730823b413528e90ab6c4d3bcb79f8497.1612692067.git.christophe.leroy@csgroup.eu
2021-02-09powerpc/sstep: Fix darn emulationSandipan Das
Commit 8813ff49607e ("powerpc/sstep: Check instruction validity against ISA version before emulation") introduced a proper way to skip unknown instructions. This makes sure that the same is used for the darn instruction when the range selection bits have a reserved value. Fixes: a23987ef267a ("powerpc: sstep: Add support for darn instruction") Signed-off-by: Sandipan Das <sandipan@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210204080744.135785-2-sandipan@linux.ibm.com
2021-02-09powerpc/sstep: Fix load-store and update emulationSandipan Das
The Power ISA says that the fixed-point load and update instructions must neither use R0 for the base address (RA) nor have the destination (RT) and the base address (RA) as the same register. Similarly, for fixed-point stores and floating-point loads and stores, the instruction is invalid when R0 is used as the base address (RA). This is applicable to the following instructions. * Load Byte and Zero with Update (lbzu) * Load Byte and Zero with Update Indexed (lbzux) * Load Halfword and Zero with Update (lhzu) * Load Halfword and Zero with Update Indexed (lhzux) * Load Halfword Algebraic with Update (lhau) * Load Halfword Algebraic with Update Indexed (lhaux) * Load Word and Zero with Update (lwzu) * Load Word and Zero with Update Indexed (lwzux) * Load Word Algebraic with Update Indexed (lwaux) * Load Doubleword with Update (ldu) * Load Doubleword with Update Indexed (ldux) * Load Floating Single with Update (lfsu) * Load Floating Single with Update Indexed (lfsux) * Load Floating Double with Update (lfdu) * Load Floating Double with Update Indexed (lfdux) * Store Byte with Update (stbu) * Store Byte with Update Indexed (stbux) * Store Halfword with Update (sthu) * Store Halfword with Update Indexed (sthux) * Store Word with Update (stwu) * Store Word with Update Indexed (stwux) * Store Doubleword with Update (stdu) * Store Doubleword with Update Indexed (stdux) * Store Floating Single with Update (stfsu) * Store Floating Single with Update Indexed (stfsux) * Store Floating Double with Update (stfdu) * Store Floating Double with Update Indexed (stfdux) E.g. the following behaviour is observed for an invalid load and update instruction having RA = RT. While a userspace program having an instruction word like 0xe9ce0001, i.e. ldu r14, 0(r14), runs without getting receiving a SIGILL on a Power system (observed on P8 and P9), the outcome of executing that instruction word varies and its behaviour can be considered to be undefined. Attaching an uprobe at that instruction's address results in emulation which currently performs the load as well as writes the effective address back to the base register. This might not match the outcome from hardware. To remove any inconsistencies, this adds additional checks for the aforementioned instructions to make sure that the emulation infrastructure treats them as unknown. The kernel can then fallback to executing such instructions on hardware. Fixes: 0016a4cf5582 ("powerpc: Emulate most Book I instructions in emulate_step()") Signed-off-by: Sandipan Das <sandipan@linux.ibm.com> Reviewed-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210204080744.135785-1-sandipan@linux.ibm.com
2021-02-09powerpc/8xx: Fix software emulation interruptChristophe Leroy
For unimplemented instructions or unimplemented SPRs, the 8xx triggers a "Software Emulation Exception" (0x1000). That interrupt doesn't set reason bits in SRR1 as the "Program Check Exception" does. Go through emulation_assist_interrupt() to set REASON_ILLEGAL. Fixes: fbbcc3bb139e ("powerpc/8xx: Remove SoftwareEmulation()") Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/ad782af87a222efc79cfb06079b0fd23d4224eaf.1612515180.git.christophe.leroy@csgroup.eu
2021-02-09powerpc/perf: Record counter overflow always if SAMPLE_IP is unsetAthira Rajeev
While sampling for marked events, currently we record the sample only if the SIAR valid bit of Sampled Instruction Event Register (SIER) is set. SIAR_VALID bit is used for fetching the instruction address from Sampled Instruction Address Register(SIAR). But there are some usecases, where the user is interested only in the PMU stats at each counter overflow and the exact IP of the overflow event is not required. Dropping SIAR invalid samples will fail to record some of the counter overflows in such cases. Example of such usecase is dumping the PMU stats (event counts) after some regular amount of instructions/events from the userspace (ex: via ptrace). Here counter overflow is indicated to userspace via signal handler, and captured by monitoring and enabling I/O signaling on the event file descriptor. In these cases, we expect to get sample/overflow indication after each specified sample_period. Perf event attribute will not have PERF_SAMPLE_IP set in the sample_type if exact IP of the overflow event is not requested. So while profiling if SAMPLE_IP is not set, just record the counter overflow irrespective of SIAR_VALID check. Suggested-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Athira Rajeev <atrajeev@linux.vnet.ibm.com> [mpe: Reflow comment and if formatting] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/1612516492-1428-1-git-send-email-atrajeev@linux.vnet.ibm.com
2021-02-09powerpc/pseries/dlpar: handle ibm, configure-connector delay statusNathan Lynch
dlpar_configure_connector() has two problems in its handling of ibm,configure-connector's return status: 1. When the status is -2 (busy, call again), we call ibm,configure-connector again immediately without checking whether to schedule, which can result in monopolizing the CPU. 2. Extended delay status (9900..9905) goes completely unhandled, causing the configuration to unnecessarily terminate. Fix both of these issues by using rtas_busy_delay(). Fixes: ab519a011caa ("powerpc/pseries: Kernel DLPAR Infrastructure") Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com> Reviewed-by: Tyrel Datwyler <tyreld@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210107025900.410369-1-nathanl@linux.ibm.com
2021-02-09powerpc/64s: Implement ptep_clear_flush_young that does not flush TLBsNicholas Piggin
Similarly to the x86 commit b13b1d2d8692 ("x86/mm: In the PTE swapout page reclaim case clear the accessed bit instead of flushing the TLB"), implement ptep_clear_flush_young that does not actually flush the TLB in the case the referenced bit is cleared. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20201217134731.488135-8-npiggin@gmail.com
2021-02-09powerpc/64s/radix: serialize_against_pte_lookup IPIs trim mm_cpumaskNicholas Piggin
serialize_against_pte_lookup() performs IPIs to all CPUs in mm_cpumask. Take this opportunity to try trim the CPU out of mm_cpumask. This can reduce the cost of future serialize_against_pte_lookup() and/or the cost of future TLB flushes. Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20201217134731.488135-7-npiggin@gmail.com
2021-02-09powerpc/64s/radix: occasionally attempt to trim mm_cpumaskNicholas Piggin
A single-threaded process that is flushing its own address space is so far the only case where the mm_cpumask is attempted to be trimmed. This patch expands that to flush in other situations, multi-threaded processes and external sources. For now it's a relatively simple occasional trim attempt. The main aim is to add the mechanism, tweaking and tuning can come with more data. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20201217134731.488135-6-npiggin@gmail.com
2021-02-09powerpc/64s/radix: Allow mm_cpumask trimming from external sourcesNicholas Piggin
mm_cpumask trimming is currently restricted to be issued by the current thread of a single-threaded mm. This patch relaxes that and allows the mask to be trimmed from any context. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20201217134731.488135-5-npiggin@gmail.com
2021-02-09powerpc/64s/radix: Check for no TLB flush requiredNicholas Piggin
If there are no CPUs in mm_cpumask, no TLB flush is required at all. This patch adds a check for this case. Currently it's not tested for, in fact mm_is_thread_local() returns false if the current CPU is not in mm_cpumask, so it's treated as a global flush. This can come up in some cases like exec failure before the new mm has ever been switched to. This patch reduces TLBIE instructions required to build a kernel from about 120,000 to 45,000. Another situation it could help is page reclaim, KSM, THP, etc., (i.e., asynch operations external to the process) where the process is sleeping and has all TLBs flushed out of all CPUs. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20201217134731.488135-4-npiggin@gmail.com
2021-02-09powerpc/64s/radix: refactor TLB flush type selectionNicholas Piggin
The logic to decide what kind of TLB flush is required (local, global, or IPI) is spread multiple times over the several kinds of TLB flushes. Move it all into a single function which may issue IPIs if necessary, and also returns a flush type that is to be used. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20201217134731.488135-3-npiggin@gmail.com
2021-02-09powerpc/64s/radix: add warning and comments in mm_cpumask trimNicholas Piggin
Add a comment explaining part of the logic for mm_cpumask trimming, and add a (hopefully graceful) check and warning in case something gets it wrong. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20201217134731.488135-2-npiggin@gmail.com
2021-02-09powerpc/perf: Expose Performance Monitor Counter SPR's as part of extended regsAthira Rajeev
Currently Monitor Mode Control Registers and Sampling registers are part of extended regs. Patch adds support to include Performance Monitor Counter Registers (PMC1 to PMC6 ) as part of extended registers. PMCs are saved in the perf interrupt handler as part of per-cpu array 'pmcs' in struct cpu_hw_events. While capturing the register values for extended regs, fetch these saved PMC values. Simplified the PERF_REG_PMU_MASK_300/31 definition to include PMU SPRs MMCR0 to PMC6. Exclude the unsupported SPRs (MMCR3, SIER2, SIER3) from extended mask value for CPU_FTR_ARCH_300 in the new definition. PERF_REG_EXTENDED_MAX is used to check if any index beyond the extended registers is requested in the sample. Have one PERF_REG_EXTENDED_MAX for CPU_FTR_ARCH_300/CPU_FTR_ARCH_31 since perf_reg_validate function already checks the extended mask for the presence of any unsupported register. Signed-off-by: Athira Rajeev <atrajeev@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/1612335337-1888-3-git-send-email-atrajeev@linux.vnet.ibm.com
2021-02-09powerpc/perf: Include PMCs as part of per-cpu cpuhw_events structAthira Rajeev
To support capturing of PMC's as part of extended registers, the value of SPR's PMC1 to PMC6 has to be saved in the starting of PMI interrupt handler. This is needed since we are resetting the overflown PMC before creating sample and hence directly reading SPRN_PMCx in 'perf_reg_value' will be capturing the modified value. To solve this, add a per-cpu array as part of structure cpu_hw_events and use this array to capture PMC values in the perf interrupt handler. Patch also re-factor's the interrupt handler code to use this per-cpu array instead of current local array. Signed-off-by: Athira Rajeev <atrajeev@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/1612335337-1888-2-git-send-email-atrajeev@linux.vnet.ibm.com
2021-02-09powerpc/pkeys: Remove unused codeSandipan Das
This removes arch_supports_pkeys(), arch_usable_pkeys() and thread_pkey_regs_*() which are remnants from the following: commit 06bb53b33804 ("powerpc: store and restore the pkey state across context switches") commit 2cd4bd192ee9 ("powerpc/pkeys: Fix handling of pkey state across fork()") commit cf43d3b26452 ("powerpc: Enable pkey subsystem") arch_supports_pkeys() and arch_usable_pkeys() were unused since their introduction while thread_pkey_regs_*() became unused after the introduction of the following: commit d5fa30e6993f ("powerpc/book3s64/pkeys: Reset userspace AMR correctly on exec") commit 48a8ab4eeb82 ("powerpc/book3s64/pkeys: Don't update SPRN_AMR when in kernel mode") Signed-off-by: Sandipan Das <sandipan@linux.ibm.com> Reviewed-by: Ram Pai <linuxram@us.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210202150050.75335-1-sandipan@linux.ibm.com
2021-02-09powerpc/44x: Fix a spelling mismach to mismatch in head_44x.SBhaskar Chowdhury
s/mismach/mismatch/ Signed-off-by: Bhaskar Chowdhury <unixbhaskar@gmail.com> Acked-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210202093746.5198-1-unixbhaskar@gmail.com
2021-02-09powerpc: remove unneeded semicolonsChengyang Fan
Remove superfluous semicolons after function definitions. Signed-off-by: Chengyang Fan <cy.fan@huawei.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210125095338.1719405-1-cy.fan@huawei.com
2021-02-09powerpc/akebono: Fix unmet dependency errorsMichael Ellerman
The AKEBONO config has various selects under it, including some with user-selectable dependencies, which means those dependencies can be disabled. This leads to warnings from Kconfig. This can be seen with eg: $ make allnoconfig $ ./scripts/config --file build~/.config -k -e CONFIG_44x -k -e CONFIG_PPC_47x -e CONFIG_AKEBONO $ make olddefconfig WARNING: unmet direct dependencies detected for ATA Depends on [n]: HAS_IOMEM [=y] && BLOCK [=n] Selected by [y]: - AKEBONO [=y] && PPC_47x [=y] WARNING: unmet direct dependencies detected for NETDEVICES Depends on [n]: NET [=n] Selected by [y]: - AKEBONO [=y] && PPC_47x [=y] WARNING: unmet direct dependencies detected for ETHERNET Depends on [n]: NETDEVICES [=y] && NET [=n] Selected by [y]: - AKEBONO [=y] && PPC_47x [=y] WARNING: unmet direct dependencies detected for MMC_SDHCI Depends on [n]: MMC [=n] && HAS_DMA [=y] Selected by [y]: - AKEBONO [=y] && PPC_47x [=y] WARNING: unmet direct dependencies detected for MMC_SDHCI_PLTFM Depends on [n]: MMC [=n] && MMC_SDHCI [=y] Selected by [y]: - AKEBONO [=y] && PPC_47x [=y] The problem is that AKEBONO is using select to enable things that are not true dependencies, but rather things you probably want enabled in an AKEBONO kernel. That is what a defconfig is for. So drop those selects and instead move those symbols into the defconfig. This fixes all the kconfig warnings, and the result of make 44x/akebono_defconfig is the same before and after the patch. Reported-by: Yury Norov <yury.norov@gmail.com> Reported-by: Randy Dunlap <rdunlap@infradead.org> Reported-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Tested-by: Randy Dunlap <rdunlap@infradead.org> Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Acked-by: Florian Fainelli <f.fainelli@gmail.com> Link: https://lore.kernel.org/r/20210201012503.940145-1-mpe@ellerman.id.au
2021-02-09powerpc/64s: runlatch interrupt handling in CNicholas Piggin
There is no need for this to be in asm, use the new intrrupt entry wrapper. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210130130852.2952424-42-npiggin@gmail.com
2021-02-09powerpc/64s: move NMI soft-mask handling to CNicholas Piggin
Saving and restoring soft-mask state can now be done in C using the interrupt handler wrapper functions. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210130130852.2952424-41-npiggin@gmail.com
2021-02-09powerpc: move NMI entry/exit code into wrapperNicholas Piggin
This moves the common NMI entry and exit code into the interrupt handler wrappers. This changes the behaviour of soft-NMI (watchdog) and HMI interrupts, and also MCE interrupts on 64e, by adding missing parts of the NMI entry to them. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210130130852.2952424-40-npiggin@gmail.com
2021-02-09powerpc/pseries/mce: restore msr before returning from handlerNicholas Piggin
The pseries real-mode machine check handler can enable the MMU, and return from the handler with the MMU still enabled. This works, but real-mode handler wrapper exit handlers want to rely on the MMU being in real-mode. So change the pseries handler to restore the MSR after it has finished virtual mode tasks. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/1612702361.lm7fqo56re.astroid@bobo.none
2021-02-09powerpc/64: entry cpu time accounting in CNicholas Piggin
There is no need for this to be in asm, use the new interrupt entry wrapper. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210130130852.2952424-39-npiggin@gmail.com
2021-02-09powerpc/64: move account_stolen_time into its own functionNicholas Piggin
This will be used by interrupt entry as well. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210130130852.2952424-38-npiggin@gmail.com
2021-02-09powerpc/64s: reconcile interrupts in CNicholas Piggin
There is no need for this to be in asm, use the new intrrupt entry wrapper. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210130130852.2952424-37-npiggin@gmail.com
2021-02-09powerpc/64s: move context tracking exit to interrupt exit pathNicholas Piggin
The interrupt handler wrapper functions are not the ideal place to maintain context tracking because after they return, the low level exit code must then determine if there are interrupts to replay, or if the task should be preempted, etc. Those paths (e.g., schedule_user) include their own exception_enter/exit pairs to fix this up but it's a bit hacky (see schedule_user() comments). Ideally context tracking will go to user mode only when there are no more interrupts or context switches or other exit processing work to handle. 64e can not do this because it does not use the C interrupt exit code. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210130130852.2952424-36-npiggin@gmail.com
2021-02-09powerpc: handle irq_enter/irq_exit in interrupt handler wrappersNicholas Piggin
Move irq_enter/irq_exit into asynchronous interrupt handler wrappers. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210130130852.2952424-35-npiggin@gmail.com