summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2017-05-03powerpc/powernv: Fix TCE kill on NVLink2Alistair Popple
Commit 616badd2fb49 ("powerpc/powernv: Use OPAL call for TCE kill on NVLink2") forced all TCE kills to go via the OPAL call for NVLink2. However the PHB3 implementation of TCE kill was still being called directly from some functions which in some circumstances caused a machine check. This patch adds an equivalent IODA2 version of the function which uses the correct invalidation method depending on PHB model and changes all external callers to use it instead. Fixes: 616badd2fb49 ("powerpc/powernv: Use OPAL call for TCE kill on NVLink2") Cc: stable@vger.kernel.org # v4.11+ Signed-off-by: Alistair Popple <alistair@popple.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-05-03powerpc/mm/radix: Drop support for CPUs without lockless tlbieMichael Ellerman
Currently the radix TLB code includes support for CPUs that do *not* have MMU_FTR_LOCKLESS_TLBIE. On those CPUs we are required to take a global spinlock before issuing a tlbie. Radix can only be built for 64-bit Book3s CPUs, and of those, only POWER4, 970, Cell and PA6T do not have MMU_FTR_LOCKLESS_TLBIE. Although it's possible to build a kernel with Radix support that can also boot on those CPUs, we happen to know that in reality none of those CPUs support the Radix MMU, so the code can never actually run on those CPUs. So remove the native_tlbie_lock in the Radix TLB code. Note that there is another lock of the same name in the hash code, which is unaffected by this patch. Reviewed-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-05-03powerpc/book3s/mce: Move add_taint() later in virtual modeMahesh Salgaonkar
machine_check_early() gets called in real mode. The very first time when add_taint() is called, it prints a warning which ends up calling opal call (that uses OPAL_CALL wrapper) for writing it to console. If we get a very first machine check while we are in opal we are doomed. OPAL_CALL overwrites the PACASAVEDMSR in r13 and in this case when we are done with MCE handling the original opal call will use this new MSR on it's way back to opal_return. This usually leads to unexpected behaviour or the kernel to panic. Instead move the add_taint() call later in the virtual mode where it is safe to call. This is broken with current FW level. We got lucky so far for not getting very first MCE hit while in OPAL. But easily reproducible on Mambo. Fixes: 27ea2c420cad ("powerpc: Set the correct kernel taint on machine check errors.") Cc: stable@vger.kernel.org # v4.2+ Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-05-03powerpc/sysfs: Move #ifdef CONFIG_HOTPLUG_CPU out of the function bodyMichael Ellerman
The entire body of unregister_cpu_online() is inside an #ifdef CONFIG_HOTPLUG_CPU block. This is ugly and means we create an empty function when hotplug is disabled for no reason. Instead move the #ifdef out of the function body and define the function to be NULL in the else case. This means we'll pass NULL to cpuhp_setup_state(), but that's fine because it accepts NULL to mean there is no teardown callback, which is exactly what we want. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-05-03powerpc/smp: Document irq enable/disable after migrating IRQsMichael Ellerman
This code was until recently completely undocumented and even now the comment is not very verbose. We've already had one patch sent to remove the IRQ enable/disable because it's "paradoxical and unnecessary". So document it thoroughly to save anyone else from puzzling over it. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-05-03powerpc/mpc52xx: Don't select user-visible RTAS_PROCMichael Ellerman
Otherwise we might select it when its dependenices aren't enabled, leading to a build break. It's default y anyway, so will be on unless someone disables it manually. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-05-03powerpc/powernv: Document cxl dependency on special case in pnv_eeh_reset()Andrew Donnellan
pnv_eeh_reset() has special handling for PEs whose primary bus is the root bus or the bus immediately underneath the root port. The cxl bi-modal card support added in b0b5e5918ad1 ("cxl: Add cxl_check_and_switch_mode() API to switch bi-modal cards") relies on this behaviour when hot-resetting the CAPI adapter following a mode switch. Document this in pnv_eeh_reset() so we don't accidentally break it. Suggested-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Signed-off-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-05-02powerpc/eeh: Clean up and document event handling functionsRussell Currey
Remove unnecessary tags in eeh_handle_normal_event(), and add function comments for eeh_handle_normal_event() and eeh_handle_special_event(). The only functional difference is that in the case of a PE reaching the maximum number of failures, rather than one message telling you of this and suggesting you reseat the device, there are two separate messages. Suggested-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Russell Currey <ruscur@russell.cc> Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-05-02powerpc/eeh: Avoid use after free in eeh_handle_special_event()Russell Currey
eeh_handle_special_event() is called when an EEH event is detected but can't be narrowed down to a specific PE. This function looks through every PE to find one in an erroneous state, then calls the regular event handler eeh_handle_normal_event() once it knows which PE has an error. However, if eeh_handle_normal_event() found that the PE cannot possibly be recovered, it will free it, rendering the passed PE stale. This leads to a use after free in eeh_handle_special_event() as it attempts to clear the "recovering" state on the PE after eeh_handle_normal_event() returns. Thus, make sure the PE is valid when attempting to clear state in eeh_handle_special_event(). Fixes: 8a6b1bc70dbb ("powerpc/eeh: EEH core to handle special event") Cc: stable@vger.kernel.org # v3.11+ Reported-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Russell Currey <ruscur@russell.cc> Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-05-02cxl: Mask slice error interrupts after first occurrenceAlastair D'Silva
In some situations, a faulty AFU slice may create an interrupt storm of slice errors, rendering the machine unusable. Since these interrupts are informational only, present the interrupt once, then mask it off to prevent it from being retriggered until the AFU is reset. Signed-off-by: Alastair D'Silva <alastair@d-silva.org> Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Reviewed-by: Vaibhav Jain <vaibhav@linux.vnet.ibm.com> Acked-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-05-02cxl: Route eeh events to all drivers in cxl_pci_error_detected()Vaibhav Jain
Fix a boundary condition where in some cases an eeh event that results in card reset isn't passed on to a driver attached to the virtual PCI device associated with a slice. This will happen in case when a slice attached device driver returns a value other than PCI_ERS_RESULT_NEED_RESET from the eeh error_detected() callback. This would result in an early return from cxl_pci_error_detected() and other drivers attached to other AFUs on the card wont be notified. The patch fixes this by making sure that all slice attached device-drivers are notified and the return values from error_detected() callback are aggregated in a scheme where request for 'disconnect' trumps all and 'none' trumps 'need_reset'. Fixes: 9e8df8a21963 ("cxl: EEH support") Cc: stable@vger.kernel.org # v4.3+ Signed-off-by: Vaibhav Jain <vaibhav@linux.vnet.ibm.com> Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Acked-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-05-02cxl: Force context lock during EEH flowVaibhav Jain
During an eeh event when the cxl card is fenced and card sysfs attr perst_reloads_same_image is set following warning message is seen in the kernel logs: Adapter context unlocked with 0 active contexts ------------[ cut here ]------------ WARNING: CPU: 12 PID: 627 at ../drivers/misc/cxl/main.c:325 cxl_adapter_context_unlock+0x60/0x80 [cxl] Even though this warning is harmless, it clutters the kernel log during an eeh event. This warning is triggered as the EEH callback cxl_pci_error_detected doesn't obtain a context-lock before forcibly detaching all active context and when context-lock is released during call to cxl_configure_adapter from cxl_pci_slot_reset, a warning in cxl_adapter_context_unlock is triggered. To fix this warning, we acquire the adapter context-lock via cxl_adapter_context_lock() in the eeh callback cxl_pci_error_detected() once all the virtual AFU PHBs are notified and their contexts detached. The context-lock is released in cxl_pci_slot_reset() after the adapter is successfully reconfigured and before the we call the slot_reset callback on slice attached device-drivers. Fixes: 70b565bbdb91 ("cxl: Prevent adapter reset if an active context exists") Cc: stable@vger.kernel.org # v4.9+ Reported-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Signed-off-by: Vaibhav Jain <vaibhav@linux.vnet.ibm.com> Acked-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com> Reviewed-by: Matthew R. Ochs <mrochs@linux.vnet.ibm.com> Tested-by: Uma Krishnan <ukrishn@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-05-01powerpc/64: Allow CONFIG_RELOCATABLE if COMPILE_TESTNicholas Piggin
This was a hack we added to work around the allmodconfig build breaking, see commit fb43e8477ed9 ("powerpc: Disable RELOCATABLE for COMPILE_TEST with PPC64"). Since we merged the thin archives support in commit 43c9127d94d6 ("powerpc: Add option to use thin archives") this hasn't been necessary, so remove it. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-05-01powerpc/xmon: Teach xmon oops about radix vectorsMichael Neuling
Currently if we take an oops caused by an 0x380 or 0x480 exception, we get a print which assumes SLB problems. With radix, these vectors have different meanings. This patch updates the oops message to reflect these different meanings. Signed-off-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-28powerpc/mm/hash: Fix off-by-one in comment about kernel contexts idsMichael Ellerman
Michal Suchánek noticed a comment in book3s/64/mmu-hash.h about the context ids we use for the kernel was inconsistent with the code and other comments in the same file. It should read 1-4 not 1-5. While we're touching it, update "address" to "addresses" which makes more sense as it's referring to more than one address below. Reported-by: Michal Suchánek <msuchanek@suse.de> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-28powerpc/pseries: Enable VFIOAlexey Kardashevskiy
This enables VFIO on pseries host in order to allow VFIO in nested guest under PR KVM or DPDK in a HV guest. This adds support of the VFIO_SPAPR_TCE_IOMMU type. This adds exchange() callback to allow TCE updates by the SPAPR TCE IOMMU driver in VFIO. This initializes DMA32 window parameters in iommu_table_group as as this does not implement VFIO_SPAPR_TCE_v2_IOMMU and VFIO_SPAPR_TCE_IOMMU just reuses the existing DMA32 window. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-28powerpc/powernv: Fix iommu table size calculation hook for small tablesAlexey Kardashevskiy
When the userspace requests a small TCE table (which takes less than the system page size) and more than 1 TCE level, the existing code returns a single page size which is a bug as each additional TCE level requires at least one page and this is what pnv_pci_ioda2_table_alloc_pages() does. And we end up seeing WARN_ON(!ret && ((*ptbl)->it_allocated_size != table_size)) in drivers/vfio/vfio_iommu_spapr_tce.c. This replaces incorrect _ALIGN_UP() (which aligns zero up to zero) with max_t() to fix the bug. Besides removing WARN_ON(), there should be no other changes in behaviour. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-28powerpc/powernv: Check kzalloc() return value in pnv_pci_table_allocAlexey Kardashevskiy
pnv_pci_table_alloc() ignores possible failure from kzalloc_node(), this adds a check. There are 2 callers of pnv_pci_table_alloc(), one already checks for tbl!=NULL, this adds WARN_ON() to the other path which only happens during boot time in IODA1 and not expected to fail. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-28powerpc: Add arch/powerpc/tools directoryNicholas Piggin
Move a couple of existing scripts under there. Remove scripts directory: a script is a tool, a tool is not a script. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-28powerpc: Use the new post-link pass to check relocationsNicholas Piggin
Currently powerpc has to introduce a dependency on its default build target zImage in order to run a relocation check pass over the linked vmlinux. This is deficient because the check is not run if the plain vmlinux target is built, or if one of the other boot targets is built. Switch to using the kbuild post-link pass, added in commit fbe6e37dab97 ("kbuild: add arch specific post-link Makefile") in order to run this check. In future powerpc will use this to do more complicated operations, but initially using it for something simple is a good first step. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-28powerpc/xmon: Wait for secondaries before IPI'ing on system resetNicholas Piggin
An externally triggered system reset (e.g., via QEMU nmi command, or pseries reset button) can cause system reset interrupts on all CPUs. In case this causes xmon to be entered, it is undesirable for the primary (first) CPU into xmon to trigger an NMI IPI to others, because this may cause a nested system reset interrupt. So spin for a time waiting for secondaries to join xmon before performing the NMI IPI, similarly to what the crash dump code does. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> [mpe: Only do it when we come in from system reset, not via sysrq etc.] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-28powerpc/pseries: Implement NMI IPI with H_SIGNAL_SYS_RESETNicholas Piggin
Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-28powerpc: Add struct smp_ops_t.cause_nmi_ipi operationNicholas Piggin
Have the NMI IPI code use this op when the platform defines it. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-28powerpc: Add NMI IPI infrastructureNicholas Piggin
Add a simple NMI IPI system that handles concurrency and reentrancy. The platform does not have to implement a true non-maskable interrupt, the default is to simply use the debugger break IPI message. This has now been co-opted for a general IPI message, and users (debugger and crash) have been reimplemented on top of the NMI system. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> [mpe: Incorporate incremental fixes from Nick] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-28powerpc: Mark system reset as an NMI with nmi_enter/exit()Nicholas Piggin
System reset is a non-maskable interrupt from Linux's point of view (occurs under local_irq_disable()), so it should use nmi_enter/exit. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-28powerpc/64s: Dedicated system reset interrupt stackNicholas Piggin
The system reset interrupt is used for crash/debug situations, so it is desirable to have as little impact on the normal state of the system as possible. Currently it uses the current kernel stack to process the exception. This stores into the stack which may be involved with the crash. The stack pointer may be corrupted, or it may have overflowed. Avoid or minimise these problems by creating a dedicated NMI stack for the system reset interrupt to use. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-28powerpc/64s: Disallow system reset vs system reset reentrancyNicholas Piggin
In preparation for using a dedicated stack for system reset interrupts, prevent a nested system reset from recovering, in order to simplify code that is called in crash/debug path. This allows a system reset interrupt to just use the base stack pointer. Keep an in_nmi nesting counter similarly to the in_mce counter. Consider the interrrupt non-recoverable if it is taken inside another system reset. Interrupt nesting could be allowed similarly to MCE, but system reset is a special case that's not for normal operation, so simplicity wins until there is requirement for nested system reset interrupts. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-28powerpc/64s: Fix system reset vs general interrupt reentrancyNicholas Piggin
The system reset interrupt can occur when MSR_EE=0, and it currently uses the PACA_EXGEN save area. Some PACA_EXGEN interrupts have a window where MSR_RI=1 and MSR_EE=0 when the save area is still in use. A system reset interrupt in this window can lead to undetected corruption when the save area gets overwritten. This patch introduces PACA_EXNMI save area for system reset exceptions, which closes this corruption window. It's also helpful to retain the EXGEN state for debugging situations, even if not considering the recoverability aspect. This patch also moves the PACA_EXMC area down to a less frequently used part of the paca with the new save area. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-28powerpc/64s: Exception macro for stack frame and initial register saveNicholas Piggin
This code is common to a few exceptions, and another user will be added. This causes a trivial change to generated code: - 604: std r9,416(r1) - 608: mfspr r11,314 - 60c: std r11,368(r1) - 610: mfspr r12,315 + 604: mfspr r11,314 + 608: mfspr r12,315 + 60c: std r9,416(r1) + 610: std r11,368(r1) machine_check_powernv_early could also use this, but that requires non trivial changes to generated code, so that's for another patch. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-28powerpc/64s: Add exception macro that does not enable RINicholas Piggin
Subsequent patches will add more non-RI variant exceptions, so create a macro for it rather than open-code it. This does not change generated instructions. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-28powerpc/cbe: Do not process external or decremeter interrupts from sresetNicholas Piggin
Cell will wake from low power state at the system reset interrupt, with the event encoded in SRR1, rather than waking at the interrupt vector that corresponds to that event. The system reset handler for this platform decodes SRR1 event reason and calls the interrupt handler to process it directly from the system reset handlre. A subsequent change will treat the system reset interrupt as a Linux NMI with its own per-CPU stack, and this will no longer work. Remove the external and decrementer handlers from the system reset handler. - The external exception remains raised and will fire again at the EE interrupt vector when system reset returns. - The decrementer is set to 1 so it will be raised again and fire when the system reset returns. It is possible to branch to an idle handler from the system reset interrupt (like POWER does), then restore a normal stack and restore this optimisation. But simplicity wins for now. Tested-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-28powerpc/pasemi: Do not process external or decrementer interrupts from sresetNicholas Piggin
PA Semi will wake from low power state at the system reset interrupt, with the event encoded in SRR1, rather than waking at the interrupt vector that corresponds to that event. The system reset handler for this platform decodes SRR1 event reason and calls the interrupt handler to process it directly from the system reset handlre. A subsequent change will treat the system reset interrupt as a Linux NMI with its own per-CPU stack, and this will no longer work. Remove the external and decrementer handlers from the system reset handler. - The external exception remains raised and will fire again at the EE interrupt vector when system reset returns. - The decrementer is set to 1 so it will be raised again and fire when the system reset returns. It is possible to branch to an idle handler from the system reset interrupt (like POWER does), then restore a normal stack and restore this optimisation. But simplicity wins for now. Tested-by: Christian Zigotzky <chzigotzky@xenosoft.de> Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-28Merge branch 'topic/ppc-kvm' into nextMichael Ellerman
Merge the topic branch we were sharing with kvm-ppc, Paul has also merged it.
2017-04-27powerpc/ftrace/64: Split further based on -mprofile-kernelNaveen N. Rao
Split ftrace_64.S further retaining the core ftrace 64-bit aspects in ftrace_64.S and moving ftrace_caller() and ftrace_graph_caller() into separate files based on -mprofile-kernel. The livepatch routines are all now contained within the mprofile file. Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-27powerpc: Split ftrace bits into a separate fileNaveen N. Rao
entry_*.S now includes a lot more than just kernel entry/exit code. As a first step at cleaning this up, let's split out the ftrace bits into separate files. Also move all related tracing code into a new trace/ subdirectory. No functional changes. Suggested-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-27powerpc/mm: Rename table dump file nameChristophe Leroy
Page table dump debugfs file is named 'kernel_page_tables' on all other architectures implementing it, while is is named 'kernel_pagetables' on powerpc. This patch renames it. Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-27powerpc/mm: On PPC32, display 32 bits addresses in page table dumpChristophe Leroy
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-27powerpc/mm: Fix missing page attributes in page table dumpChristophe Leroy
On some targets, _PAGE_RW is 0 and this is _PAGE_RO which is used. There is also _PAGE_SHARED that is missing. Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-27powerpc/mm: Fix page table dump build on PPC32Christophe Leroy
On PPC32 (eg. mpc885_ads_defconfig), page table dump compilation fails as follows. This is because the memory layout is slightly different on PPC32. This patch adapts it. arch/powerpc/mm/dump_linuxpagetables.c: In function 'walk_pagetables': arch/powerpc/mm/dump_linuxpagetables.c:369:10: error: 'KERN_VIRT_START' undeclared (first use in this function) ... Fixes: 8eb07b187000d ("powerpc/mm: Dump linux pagetables") Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-27powerpc/mm/radix: Optimise tlbiel flush all caseAneesh Kumar K.V
_tlbiel_pid() is called with a ric (Radix Invalidation Control) argument of either RIC_FLUSH_TLB or RIC_FLUSH_ALL. RIC_FLUSH_ALL says to invalidate the entire TLB and the Page Walk Cache (PWC). To flush the whole TLB, we have to iterate over each set (congruence class) of the TLB. Currently we do that and pass RIC_FLUSH_ALL each time. That is not incorrect but it means we flush the PWC 128 times, when once would suffice. Fix it by doing the first flush with the ric value we're passed, and then if it was RIC_FLUSH_ALL, we downgrade it to RIC_FLUSH_TLB, because we know we have just flushed the PWC and don't need to do it again. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> [mpe: Split out of combined patch, tweak logic, rewrite change log] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-27powerpc/mm/radix: Optimise Page Walk Cache flushAneesh Kumar K.V
Currently we implement flushing of the page walk cache (PWC) by calling _tlbiel_pid() with a RIC (Radix Invalidation Control) value of 1 which says to only flush the PWC. But _tlbiel_pid() loops over each set (congruence class) of the TLB, which is not necessary when we're just flushing the PWC. In fact the set argument is ignored for a PWC flush, so essentially we're just flushing the PWC 127 extra times for no benefit. Fix it by adding tlbiel_pwc() which just does a single flush of the PWC. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> [mpe: Split out of combined patch, drop _ in name, rewrite change log] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-26powerpc/powernv: Fix oops on P9 DD1 in cause_ipi()Michael Ellerman
Recently we merged the native xive support for Power9, and then separately some reworks for doorbell IPI support. In isolation both series were OK, but the merged result had a bug in one case. On P9 DD1 we use pnv_p9_dd1_cause_ipi() which tries to use doorbells, and then falls back to the interrupt controller. However the fallback is implemented by calling icp_ops->cause_ipi. But now that xive support is merged we might be using xive, in which case icp_ops is not initialised, it's a xics specific structure. This leads to an oops such as: Unable to handle kernel paging request for data at address 0x00000028 Oops: Kernel access of bad area, sig: 11 [#1] NIP pnv_p9_dd1_cause_ipi+0x74/0xe0 LR smp_muxed_ipi_message_pass+0x54/0x70 To fix it, rather than using icp_ops which might be NULL, have both xics and xive set smp_ops->cause_ipi, and then in the powernv code we save that as ic_cause_ipi before overriding smp_ops->cause_ipi. For paranoia add a WARN_ON() to check if somehow smp_ops->cause_ipi is NULL. Fixes: b866cc2199d6 ("powerpc: Change the doorbell IPI calling convention") Tested-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-26powerpc/powernv: Fix missing attr initialisation in opal_export_attrs()Michael Ellerman
In opal_export_attrs() we dynamically allocate some bin_attributes. They're allocated with kmalloc() and although we initialise most of the fields, we don't initialise write() or mmap(), and in particular we don't initialise the lockdep related fields in the embedded struct attribute. This leads to a lockdep warning at boot: BUG: key c0000000f11906d8 not in .data! WARNING: CPU: 0 PID: 1 at ../kernel/locking/lockdep.c:3136 lockdep_init_map+0x28c/0x2a0 ... Call Trace: lockdep_init_map+0x288/0x2a0 (unreliable) __kernfs_create_file+0x8c/0x170 sysfs_add_file_mode_ns+0xc8/0x240 __machine_initcall_powernv_opal_init+0x60c/0x684 do_one_initcall+0x60/0x1c0 kernel_init_freeable+0x2f4/0x3d4 kernel_init+0x24/0x160 ret_from_kernel_thread+0x5c/0xb0 Fix it by kzalloc'ing the attr, which fixes the uninitialised write() and mmap(), and calling sysfs_bin_attr_init() on it to initialise the lockdep fields. Fixes: 11fe909d2362 ("powerpc/powernv: Add OPAL exports attributes to sysfs") Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-26powerpc/mm: Fix possible out-of-bounds shift in arch_mmap_rnd()Michael Ellerman
The recent patch to add runtime configuration of the ASLR limits added a bug in arch_mmap_rnd() where we may shift an integer (32-bits) by up to 33 bits, leading to undefined behaviour. In practice it exhibits as every process seg faulting instantly, presumably because the rnd value hasn't been restricited by the modulus at all. We didn't notice because it only happens under certain kernel configurations and if the number of bits is actually set to a large value. Fix it by switching to unsigned long. Fixes: 9fea59bd7ca5 ("powerpc/mm: Add support for runtime configuration of ASLR limits") Reported-by: Balbir Singh <bsingharora@gmail.com> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-25powerpc/mm: Ensure IRQs are off in switch_mm()David Gibson
powerpc expects IRQs to already be (soft) disabled when switch_mm() is called, as made clear in the commit message of 9c1e105238c4 ("powerpc: Allow perf_counters to access user memory at interrupt time"). Aside from any race conditions that might exist between switch_mm() and an IRQ, there is also an unconditional hard_irq_disable() in switch_slb(). If that isn't followed at some point by an IRQ enable then interrupts will remain disabled until we return to userspace. It is true that when switch_mm() is called from the scheduler IRQs are off, but not when it's called by use_mm(). Looking closer we see that last year in commit f98db6013c55 ("sched/core: Add switch_mm_irqs_off() and use it in the scheduler") this was made more explicit by the addition of switch_mm_irqs_off() which is now called by the scheduler, vs switch_mm() which is used by use_mm(). Arguably it is a bug in use_mm() to call switch_mm() in a different context than it expects, but fixing that will take time. This was discovered recently when vhost started throwing warnings such as: BUG: sleeping function called from invalid context at kernel/mutex.c:578 in_atomic(): 0, irqs_disabled(): 1, pid: 10768, name: vhost-10760 no locks held by vhost-10760/10768. irq event stamp: 10 hardirqs last enabled at (9): _raw_spin_unlock_irq+0x40/0x80 hardirqs last disabled at (10): switch_slb+0x2e4/0x490 softirqs last enabled at (0): copy_process+0x5e8/0x1260 softirqs last disabled at (0): (null) Call Trace: show_stack+0x88/0x390 (unreliable) dump_stack+0x30/0x44 __might_sleep+0x1c4/0x2d0 mutex_lock_nested+0x74/0x5c0 cgroup_attach_task_all+0x5c/0x180 vhost_attach_cgroups_work+0x58/0x80 [vhost] vhost_worker+0x24c/0x3d0 [vhost] kthread+0xec/0x100 ret_from_kernel_thread+0x5c/0xd4 Prior to commit 04b96e5528ca ("vhost: lockless enqueuing") (Aug 2016) the vhost_worker() would do a spin_unlock_irq() not long after calling use_mm(), which had the effect of reenabling IRQs. Since that commit removed the locking in vhost_worker() the body of the vhost_worker() loop now runs with interrupts off causing the warnings. This patch addresses the problem by making the powerpc code mirror the x86 code, ie. we disable interrupts in switch_mm(), and optimise the scheduler case by defining switch_mm_irqs_off(). Cc: stable@vger.kernel.org # v4.7+ Signed-off-by: David Gibson <david@gibson.dropbear.id.au> [mpe: Flesh out/rewrite change log, add stable] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-25powerpc/sysfs: Fix reference leak of cpu device_nodes present at bootTyrel Datwyler
For CPUs present at boot each logical CPU acquires a reference to the associated device node of the core. This happens in register_cpu() which is called by topology_init(). The result of this is that we end up with a reference held by each thread of the core. However, these references are never freed if the CPU core is DLPAR removed. This patch fixes the reference leaks by acquiring and releasing the references in the CPU hotplug callbacks un/register_cpu_online(). With this patch symmetric reference counting is observed with both CPUs present at boot, and those DLPAR added after boot. Fixes: f86e4718f24b ("driver/core: cpu: initialize of_node in cpu's device struture") Cc: stable@vger.kernel.org # v3.12+ Signed-off-by: Tyrel Datwyler <tyreld@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-25powerpc/pseries: Fix of_node_put() underflow during DLPAR removeTyrel Datwyler
Historically struct device_node references were tracked using a kref embedded as a struct field. Commit 75b57ecf9d1d ("of: Make device nodes kobjects so they show up in sysfs") (Mar 2014) refactored device_nodes to be kobjects such that the device tree could by more simply exposed to userspace using sysfs. Commit 0829f6d1f69e ("of: device_node kobject lifecycle fixes") (Mar 2014) followed up these changes to better control the kobject lifecycle and in particular the referecne counting via of_node_get(), of_node_put(), and of_node_init(). A result of this second commit was that it introduced an of_node_put() call when a dynamic node is detached, in of_node_remove(), that removes the initial kobj reference created by of_node_init(). Traditionally as the original dynamic device node user the pseries code had assumed responsibilty for releasing this final reference in its platform specific DLPAR detach code. This patch fixes a refcount underflow introduced by commit 0829f6d1f6, and recently exposed by the upstreaming of the recount API. Messages like the following are no longer seen in the kernel log with this patch following DLPAR remove operations of cpus and pci devices. rpadlpar_io: slot PHB 72 removed refcount_t: underflow; use-after-free. ------------[ cut here ]------------ WARNING: CPU: 5 PID: 3335 at lib/refcount.c:128 refcount_sub_and_test+0xf4/0x110 Fixes: 0829f6d1f69e ("of: device_node kobject lifecycle fixes") Cc: stable@vger.kernel.org # v3.15+ Signed-off-by: Tyrel Datwyler <tyreld@linux.vnet.ibm.com> [mpe: Make change log commit references more verbose] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-25powerpc/xmon: Deindent the SLB dumping logicMichael Ellerman
Currently the code that dumps SLB entries uses a double-nested if. This means the actual dumping logic is a bit squashed. Deindent it by using continue. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Reviewed-by: Rashmica Gupta <rashmica.g@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-25Merge branch 'topic/kprobes' into nextMichael Ellerman
Although most of these kprobes patches are powerpc specific, there's a couple that touch generic code (with Acks). At the moment there's one conflict with acme's tree, but it's not too bad. Still just in case some other conflicts show up, we've put these in a topic branch so another tree could merge some or all of it if necessary.
2017-04-24powerpc/kprobes: Prefer ftrace when probing function entryNaveen N. Rao
KPROBES_ON_FTRACE avoids much of the overhead of regular kprobes as it eliminates the need for a trap, as well as the need to emulate or single-step instructions. Though OPTPROBES provides us with similar performance, we have limited optprobes trampoline slots. As such, when asked to probe at a function entry, default to using the ftrace infrastructure. With: # cd /sys/kernel/debug/tracing # echo 'p _do_fork' > kprobe_events before patch: # cat ../kprobes/list c0000000000daf08 k _do_fork+0x8 [DISABLED] c000000000044fc0 k kretprobe_trampoline+0x0 [OPTIMIZED] and after patch: # cat ../kprobes/list c0000000000d074c k _do_fork+0xc [DISABLED][FTRACE] c0000000000412b0 k kretprobe_trampoline+0x0 [OPTIMIZED] Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>