summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2017-11-12powerpc/vas: Reduce polling interval for busy stateSukadev Bhattiprolu
A VAS window is normally in "busy" state for only a short duration. Reduce the time we wait for the window to go to "not-busy" state to speed-up vas_win_close() a bit. Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-11-12powerpc/vas: Use helper to unpin/close windowSukadev Bhattiprolu
Use a helper to have the hardware unpin and mark a window closed. Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-11-12powerpc/vas: Drop poll_window_cast_out().Sukadev Bhattiprolu
Polling for window cast out is listed in the spec, but turns out that it is not strictly necessary and slows down window close. Making it a stub for now. Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-11-12powerpc/vas: Cleanup some debug codeSukadev Bhattiprolu
Clean up vas.h and the debug code around ifdef vas_debug. Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-11-12powerpc/vas: Validate window creditsSukadev Bhattiprolu
NX-842, the only user of VAS, sets the window credits to default values but VAS should check the credits against the possible max values. The VAS_WCREDS_MIN is not needed and can be dropped. Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-11-12powerpc/vas: init missing fields from [rt]xattrSukadev Bhattiprolu
Initialize a few missing window context fields from the window attributes specified by the caller. These fields are currently set to their default values by the caller (NX-842), but would be good to apply them anyway. Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-11-10powerpc/64: Set DSCR default initially from SPRNicholas Piggin
Take the DSCR value set by firmware as the dscr_default value, rather than zero. POWER9 recommends DSCR default to a non-zero value. Signed-off-by: From: Nicholas Piggin <npiggin@gmail.com> [mpe: Make record_spr_defaults() __init] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-11-10powerpc/powernv: Avoid waiting for secondary hold spinloop with OPALNicholas Piggin
OPAL boot does not insert secondaries at 0x60 to wait at the secondary hold spinloop. Instead they are started later, and inserted at generic_secondary_smp_init(), which is after the secondary hold spinloop. Avoid waiting on this spinloop when booting with OPAL firmware. This wait always times out that case. This saves 100ms boot time on powernv, and 10s of seconds of real time when booting on the simulator in SMP. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-11-10powerpc/64s/radix: Improve TLB flushing for page table freeingNicholas Piggin
Unmaps that free page tables always flush the entire PID, which is sub-optimal. Provide TLB range flushing with an additional PWC flush that can be use for va range invalidations with PWC flush. Time to munmap N pages of memory including last level page table teardown (after mmap, touch), local invalidate: N 1 2 4 8 16 32 64 vanilla 3.2us 3.3us 3.4us 3.6us 4.1us 5.2us 7.2us patched 1.4us 1.5us 1.7us 1.9us 2.6us 3.7us 6.2us Global invalidate: N 1 2 4 8 16 32 64 vanilla 2.2us 2.3us 2.4us 2.6us 3.2us 4.1us 6.2us patched 2.1us 2.5us 3.4us 5.2us 8.7us 15.7us 6.2us Local invalidates get much better across the board. Global ones have the same issue where multiple tlbies for va flush do get slower than the single tlbie to invalidate the PID. None of this test captures the TLB benefits of avoiding killing everything. Global gets worse, but it is brought in to line with global invalidate for munmap()s that do not free page tables. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-11-10powerpc/64s/radix: Introduce local single page ceiling for TLB range flushNicholas Piggin
The single page flush ceiling is the cut-off point at which we switch from invalidating individual pages, to invalidating the entire process address space in response to a range flush. Introduce a local variant of this heuristic because local and global tlbie have significantly different properties: - Local tlbiel requires 128 instructions to invalidate a PID, global tlbie only 1 instruction. - Global tlbie instructions are expensive broadcast operations. The local ceiling has been made much higher, 2x the number of instructions required to invalidate the entire PID (i.e., 256 pages). Time to mprotect N pages of memory (after mmap, touch), local invalidate: N 32 34 64 128 256 512 vanilla 7.4us 9.0us 14.6us 26.4us 50.2us 98.3us patched 7.4us 7.8us 13.8us 26.4us 51.9us 98.3us The behaviour of both is identical at N=32 and N=512. Between there, the vanilla kernel does a PID invalidate and the patched kernel does a va range invalidate. At N=128, these require the same number of tlbiel instructions, so the patched version can be sen to be cheaper when < 128, and more expensive when > 128. However this does not well capture the cost of invalidated TLB. The additional cost at 256 pages does not seem prohibitive. It may be the case that increasing the limit further would continue to be beneficial to avoid invalidating all of the process's TLB entries. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-11-10powerpc/64s/radix: Optimize flush_tlb_rangeNicholas Piggin
Currently for radix, flush_tlb_range flushes the entire PID, because the Linux mm code does not tell us about page size here for THP vs regular pages. This is quite sub-optimal for small mremap / mprotect / change_protection. So implement va range flushes with two flush passes, one for each page size (regular and THP). The second flush has an order of matnitude fewer tlbie instructions than the first, so it is a relatively small additional cost. There is still room for improvement here with some changes to generic APIs, particularly if there are mostly THP pages to be invalidated, the small page flushes could be reduced. Time to mprotect 1 page of memory (after mmap, touch): vanilla 2.9us 1.8us patched 1.2us 1.6us Time to mprotect 30 pages of memory (after mmap, touch): vanilla 8.2us 7.2us patched 6.9us 17.9us Time to mprotect 34 pages of memory (after mmap, touch): vanilla 9.1us 8.0us patched 9.0us 8.0us 34 pages is the point at which the invalidation switches from va to entire PID, which tlbie can do in a single instruction. This is why in the case of 30 pages, the new code runs slower for this test. This is a deliberate tradeoff already present in the unmap and THP promotion code, the idea is that the benefit from avoiding flushing entire TLB for this PID on all threads in the system. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-11-10powerpc/64s/radix: Implement _tlbie(l)_va_range flush functionsNicholas Piggin
Move the barriers and range iteration down into the _tlbie* level, which improves readability. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-11-10powerpc/64s/radix: Optimize TLB range flush barriersNicholas Piggin
Short range flushes issue a sequences of tlbie(l) instructions for individual effective addresses. These do not all require individual barrier sequences, only one covering all tlbie(l) instructions. Commit f7327e0ba3 ("powerpc/mm/radix: Remove unnecessary ptesync") made a similar optimization for tlbiel for PID flushing. For tlbie, the ISA says: The tlbsync instruction provides an ordering function for the effects of all tlbie instructions executed by the thread executing the tlbsync instruction, with respect to the memory barrier created by a subsequent ptesync instruction executed by the same thread. Time to munmap 30 pages of memory (after mmap, touch): local global vanilla 10.9us 22.3us patched 3.4us 14.4us Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-11-10Merge branch 'fixes' into nextMichael Ellerman
We have some dependencies & conflicts between patches in fixes and things to go in next, both in the radix TLB flush code and the IMC PMU driver. So merge fixes into next.
2017-11-09selftests/powerpc: Check FP/VEC on exception in TMGustavo Romero
Add a self test to check if FP/VEC/VSX registers are sane (restored correctly) after a FP/VEC/VSX unavailable exception is caught during a transaction. This test checks all possibilities in a thread regarding the combination of MSR.[FP|VEC] states in a thread and for each scenario raises a FP/VEC/VSX unavailable exception in transactional state, verifying if vs0 and vs32 registers, which are representatives of FP/VEC/VSX reg sets, are not corrupted. Signed-off-by: Gustavo Romero <gromero@linux.vnet.ibm.com> Signed-off-by: Breno Leitao <leitao@debian.org> Signed-off-by: Cyril Bur <cyrilbur@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-11-08powerpc/xmon: Support dumping software pagetablesBalbir Singh
It would be nice to be able to dump page tables in a particular context. eg: dumping vmalloc space: 0:mon> dv 0xd00037fffff00000 pgd @ 0xc0000000017c0000 pgdp @ 0xc0000000017c00d8 = 0x00000000f10b1000 pudp @ 0xc0000000f10b13f8 = 0x00000000f10d0000 pmdp @ 0xc0000000f10d1ff8 = 0x00000000f1102000 ptep @ 0xc0000000f1102780 = 0xc0000000f1ba018e Maps physical address = 0x00000000f1ba0000 Flags = Accessed Dirty Read Write This patch does not replicate the complex code of dump_pagetable and has no support for bolted linear mapping, thats why I've it's called dump virtual page table support. The format of the PTE can be expanded even further to add more useful information about the flags in the PTE if required. Signed-off-by: Balbir Singh <bsingharora@gmail.com> [mpe: Bike shed the output format, show the pgdir, fix build failures] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-11-07powerpc/mm/hash: Remove stale comment.Michal Suchanek
In commit e6f81a92015b ("powerpc/mm/hash: Support 68 bit VA") the masking is folded into ASM_VSID_SCRAMBLE but the comment about masking is removed only from the firt use of ASM_VSID_SCRAMBLE. Signed-off-by: Michal Suchanek <msuchanek@suse.de> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-11-07powerpc/powernv/ioda: Remove explicit max window size checkAlexey Kardashevskiy
DMA windows can only have a size of power of two on IODA2 hardware and using memory_hotplug_max() to determine the upper limit won't work correcly if it returns not power of two value. This removes the check as the platform code does this check in pnv_pci_ioda2_setup_default_config() anyway; the other client is VFIO and that thing checks against locked_vm limit which prevents the userspace from locking too much memory. It is expected to impact DPDK on machines with non-power-of-two RAM size, mostly. KVM guests are less likely to be affected as usually guests get less than half of hosts RAM. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-11-07powerpc/powernv/cpufreq: Fix the frequency read by /proc/cpuinfoShriya
The call to /proc/cpuinfo in turn calls cpufreq_quick_get() which returns the last frequency requested by the kernel, but may not reflect the actual frequency the processor is running at. This patch makes a call to cpufreq_get() instead which returns the current frequency reported by the hardware. Fixes: fb5153d05a7d ("powerpc: powernv: Implement ppc_md.get_proc_freq()") Signed-off-by: Shriya <shriyak@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-11-06powerpc/64s/idle: avoid POWER9 DD1 and DD2.0 PMU workaround on DD2.1Nicholas Piggin
DD2.1 does not have to save MMCR0 for all state-loss idle states, only after deep idle states (like other PMU registers). Reviewed-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-11-06powerpc/64s/idle: avoid POWER9 DD1 and DD2.0 ERAT workaround on DD2.1Nicholas Piggin
DD2.1 does not have to flush the ERAT after a state-loss idle. Performance testing was done on a DD2.1 using only the stop0 idle state (the shallowest state which supports state loss), using context_switch selftest configured to ping-poing between two threads on the same core and two different cores. Performance improvement for same core is 7.0%, different cores is 14.8%. Reviewed-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-11-06powerpc: add POWER9_DD20 featureNicholas Piggin
Cc: Michael Neuling <mikey@neuling.org> Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-11-06powerpc: Remove facility loadups on transactional {fp, vec, vsx} unavailableCyril Bur
After handling a transactional FP, Altivec or VSX unavailable exception. The return to userspace code will detect that the TIF_RESTORE_TM bit is set and call restore_tm_state(). restore_tm_state() will call restore_math() to ensure that the correct facilities are loaded. This means that all the loadup code in {fp,altivec,vsx}_unavailable_tm() is doing pointless work and can simply be removed. Signed-off-by: Cyril Bur <cyrilbur@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-11-06powerpc: Always save/restore checkpointed regs during treclaim/trecheckpointCyril Bur
Lazy save and restore of FP/Altivec means that a userspace process can be sent to userspace with FP or Altivec disabled and loaded only as required (by way of an FP/Altivec unavailable exception). Transactional Memory complicates this situation as a transaction could be started without FP/Altivec being loaded up. This causes the hardware to checkpoint incorrect registers. Handling FP/Altivec unavailable exceptions while a thread is transactional requires a reclaim and recheckpoint to ensure the CPU has correct state for both sets of registers. tm_reclaim() has optimisations to not always save the FP/Altivec registers to the checkpointed save area. This was originally done because the caller might have information that the checkpointed registers aren't valid due to lazy save and restore. We've also been a little vague as to how tm_reclaim() leaves the FP/Altivec state since it doesn't necessarily always save it to the thread struct. This has lead to an (incorrect) assumption that it leaves the checkpointed state on the CPU. tm_recheckpoint() has similar optimisations in reverse. It may not always reload the checkpointed FP/Altivec registers from the thread struct before the trecheckpoint. It is therefore quite unclear where it expects to get the state from. This didn't help with the assumption made about tm_reclaim(). These optimisations sit in what is by definition a slow path. If a process has to go through a reclaim/recheckpoint then its transaction will be doomed on returning to userspace. This mean that the process will be unable to complete its transaction and be forced to its failure handler. This is already an out if line case for userspace. Furthermore, the cost of copying 64 times 128 bits from registers isn't very long[0] (at all) on modern processors. As such it appears these optimisations have only served to increase code complexity and are unlikely to have had a measurable performance impact. Our transactional memory handling has been riddled with bugs. A cause of this has been difficulty in following the code flow, code complexity has not been our friend here. It makes sense to remove these optimisations in favour of a (hopefully) more stable implementation. This patch does mean that some times the assembly will needlessly save 'junk' registers which will subsequently get overwritten with the correct value by the C code which calls the assembly function. This small inefficiency is far outweighed by the reduction in complexity for general TM code, context switching paths, and transactional facility unavailable exception handler. 0: I tried to measure it once for other work and found that it was hiding in the noise of everything else I was working with. I find it exceedingly likely this will be the case here. Signed-off-by: Cyril Bur <cyrilbur@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-11-06powerpc: Force reload for recheckpoint during tm {fp, vec, vsx} unavailable ↵Cyril Bur
exception Lazy save and restore of FP/Altivec means that a userspace process can be sent to userspace with FP or Altivec disabled and loaded only as required (by way of an FP/Altivec unavailable exception). Transactional Memory complicates this situation as a transaction could be started without FP/Altivec being loaded up. This causes the hardware to checkpoint incorrect registers. Handling FP/Altivec unavailable exceptions while a thread is transactional requires a reclaim and recheckpoint to ensure the CPU has correct state for both sets of registers. tm_reclaim() has optimisations to not always save the FP/Altivec registers to the checkpointed save area. This was originally done because the caller might have information that the checkpointed registers aren't valid due to lazy save and restore. We've also been a little vague as to how tm_reclaim() leaves the FP/Altivec state since it doesn't necessarily always save it to the thread struct. This has lead to an (incorrect) assumption that it leaves the checkpointed state on the CPU. tm_recheckpoint() has similar optimisations in reverse. It may not always reload the checkpointed FP/Altivec registers from the thread struct before the trecheckpoint. It is therefore quite unclear where it expects to get the state from. This didn't help with the assumption made about tm_reclaim(). This patch is a minimal fix for ease of backporting. A more correct fix which removes the msr parameter to tm_reclaim() and tm_recheckpoint() altogether has been upstreamed to apply on top of this patch. Fixes: dc3106690b20 ("powerpc: tm: Always use fp_state and vr_state to store live registers") Signed-off-by: Cyril Bur <cyrilbur@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-11-06powerpc: Don't enable FP/Altivec if not checkpointedCyril Bur
Lazy save and restore of FP/Altivec means that a userspace process can be sent to userspace with FP or Altivec disabled and loaded only as required (by way of an FP/Altivec unavailable exception). Transactional Memory complicates this situation as a transaction could be started without FP/Altivec being loaded up. This causes the hardware to checkpoint incorrect registers. Handling FP/Altivec unavailable exceptions while a thread is transactional requires a reclaim and recheckpoint to ensure the CPU has correct state for both sets of registers. Lazy save and restore of FP/Altivec cannot be done if a process is transactional. If a facility was enabled it must remain enabled whenever a thread is transactional. Commit dc16b553c949 ("powerpc: Always restore FPU/VEC/VSX if hardware transactional memory in use") ensures that the facilities are always enabled if a thread is transactional. A bug in the introduced code may cause it to inadvertently enable a facility that was (and should remain) disabled. The problem with this extraneous enablement is that the registers for the erroneously enabled facility have not been correctly recheckpointed - the recheckpointing code assumed the facility would remain disabled. Further compounding the issue, the transactional {fp,altivec,vsx} unavailable code has been incorrectly using the MSR to enable facilities. The presence of the {FP,VEC,VSX} bit in the regs->msr simply means if the registers are live on the CPU, not if the kernel should load them before returning to userspace. This has worked due to the bug mentioned above. This causes transactional threads which return to their failure handler to observe incorrect checkpointed registers. Perhaps an example will help illustrate the problem: A userspace process is running and uses both FP and Altivec registers. This process then continues to run for some time without touching either sets of registers. The kernel subsequently disables the facilities as part of lazy save and restore. The userspace process then performs a tbegin and the CPU checkpoints 'junk' FP and Altivec registers. The process then performs a floating point instruction triggering a fp unavailable exception in the kernel. The kernel then loads the FP registers - and only the FP registers. Since the thread is transactional it must perform a reclaim and recheckpoint to ensure both the checkpointed registers and the transactional registers are correct. It then (correctly) enables MSR[FP] for the process. Later (on exception exist) the kernel also (inadvertently) enables MSR[VEC]. The process is then returned to userspace. Since the act of loading the FP registers doomed the transaction we know CPU will fail the transaction, restore its checkpointed registers, and return the process to its failure handler. The problem is that we're now running with Altivec enabled and the 'junk' checkpointed registers are restored. The kernel had only recheckpointed FP. This patch solves this by only activating FP/Altivec if userspace was using them when it entered the kernel and not simply if the process is transactional. Fixes: dc16b553c949 ("powerpc: Always restore FPU/VEC/VSX if hardware transactional memory in use") Signed-off-by: Cyril Bur <cyrilbur@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-11-06mtd: powernv_flash: Use opal_async_wait_response_interruptible()Cyril Bur
The OPAL calls performed in this driver shouldn't be using opal_async_wait_response() as this performs a wait_event() which, on long running OPAL calls could result in hung task warnings. wait_event() prevents timely signal delivery which is also undesirable. This patch also attempts to quieten down the use of dev_err() when errors haven't actually occurred and also to return better information up the stack rather than always -EIO. Signed-off-by: Cyril Bur <cyrilbur@gmail.com> Acked-by: Boris Brezillon <boris.brezillon@free-electrons.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-11-06powerpc/powernv: Add OPAL_BUSY to opal_error_code()Cyril Bur
Also export opal_error_code() so that it can be used in modules Signed-off-by: Cyril Bur <cyrilbur@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-11-06powerpc/opal: Add opal_async_wait_response_interruptible() to opal-asyncCyril Bur
This patch adds an _interruptible version of opal_async_wait_response(). This is useful when a long running OPAL call is performed on behalf of a userspace thread, for example, the opal_flash_{read,write,erase} functions performed by the powernv-flash MTD driver. It is foreseeable that these functions would take upwards of two minutes causing the wait_event() to block long enough to cause hung task warnings. Furthermore, wait_event_interruptible() is preferable as otherwise there is no way for signals to stop the process which is going to be confusing in userspace. Signed-off-by: Cyril Bur <cyrilbur@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-11-06powernv/opal-sensor: remove not needed lockStewart Smith
Parallel sensor reads could run out of async tokens due to opal_get_sensor_data grabbing tokens but then doing the sensor read behind a mutex, essentially serializing the (possibly asynchronous and relatively slow) sensor read. It turns out that the mutex isn't needed at all, not only should the OPAL interface allow concurrent reads, the implementation is certainly safe for that, and if any sensor we were reading from somewhere isn't, doing the mutual exclusion in the kernel is the wrong place to do it, OPAL should be doing it for the kernel. So, remove the mutex. Additionally, we shouldn't be printing out an error when we don't get a token as the only way this should happen is if we've been interrupted in down_interruptible() on the semaphore. Reported-by: Robert Lippert <rlippert@google.com> Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com> Signed-off-by: Cyril Bur <cyrilbur@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-11-06powerpc/opal: Rework the opal-async interfaceCyril Bur
Future work will add an opal_async_wait_response_interruptible() which will call wait_event_interruptible(). This work requires extra token state to be tracked as wait_event_interruptible() can return and the caller could release the token before OPAL responds. Currently token state is tracked with two bitfields which are 64 bits big but may not need to be as OPAL informs Linux how many async tokens there are. It also uses an array indexed by token to store response messages for each token. The bitfields make it difficult to add more state and also provide a hard maximum as to how many tokens there can be - it is possible that OPAL will inform Linux that there are more than 64 tokens. Rather than add a bitfield to track the extra state, rework the internals slightly. Signed-off-by: Cyril Bur <cyrilbur@gmail.com> [mpe: Fix __opal_async_get_token() when no tokens are free] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-11-06powerpc/opal: Make __opal_async_{get, release}_token() staticCyril Bur
There are no callers of both __opal_async_get_token() and __opal_async_release_token(). This patch also removes the possibility of "emergency through synchronous call to __opal_async_get_token()" as such it makes more sense to initialise opal_sync_sem for the maximum number of async tokens. Signed-off-by: Cyril Bur <cyrilbur@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-11-06mtd: powernv_flash: Don't return -ERESTARTSYS on interrupted token acquisitionCyril Bur
Because the MTD core might split up a read() or write() from userspace into several calls to the driver, we may fail to get a token but already have done some work, best to return -EINTR back to userspace and have them decide what to do. Signed-off-by: Cyril Bur <cyrilbur@gmail.com> Acked-by: Boris Brezillon <boris.brezillon@free-electrons.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-11-06mtd: powernv_flash: Remove pointless goto in driver initCyril Bur
powernv_flash_probe() has pointless goto statements which jump to the end of the function to simply return a variable. Rather than checking for error and going to the label, just return the error as soon as it is detected. Signed-off-by: Cyril Bur <cyrilbur@gmail.com> Acked-by: Boris Brezillon <boris.brezillon@free-electrons.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-11-06mtd: powernv_flash: Don't treat OPAL_SUCCESS as an errorCyril Bur
While this driver expects to interact asynchronously, OPAL is well within its rights to return OPAL_SUCCESS to indicate that the operation completed without the need for a callback. We shouldn't treat OPAL_SUCCESS as an error rather we should wrap up and return promptly to the caller. Signed-off-by: Cyril Bur <cyrilbur@gmail.com> Acked-by: Boris Brezillon <boris.brezillon@free-electrons.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-11-06mtd: powernv_flash: Use WARN_ON_ONCE() rather than BUG_ON()Cyril Bur
BUG_ON() should be reserved in situations where we can not longer guarantee the integrity of the system. In the case where powernv_flash_async_op() receives an impossible op, we can still guarantee the integrity of the system. Signed-off-by: Cyril Bur <cyrilbur@gmail.com> Acked-by: Boris Brezillon <boris.brezillon@free-electrons.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-11-06powerpc/opal: Fix EBUSY bug in acquiring tokensWilliam A. Kennington III
The current code checks the completion map to look for the first token that is complete. In some cases, a completion can come in but the token can still be on lease to the caller processing the completion. If this completed but unreleased token is the first token found in the bitmap by another tasks trying to acquire a token, then the __test_and_set_bit call will fail since the token will still be on lease. The acquisition will then fail with an EBUSY. This patch reorganizes the acquisition code to look at the opal_async_token_map for an unleased token. If the token has no lease it must have no outstanding completions so we should never see an EBUSY, unless we have leased out too many tokens. Since opal_async_get_token_inrerruptible is protected by a semaphore, we will practically never see EBUSY anymore. Fixes: 8d7248232208 ("powerpc/powernv: Infrastructure to support OPAL async completion") Signed-off-by: William A. Kennington III <wak@google.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-11-06powerpc/eeh: Stop using do_gettimeofday()Arnd Bergmann
This interface is inefficient and deprecated because of the y2038 overflow. ktime_get_seconds() is an appropriate replacement here, since it has sufficient granularity but is more efficient and uses monotonic time. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Acked-by: Russell Currey <ruscur@russell.cc> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-11-06cxl: Rework the implementation of cxl_stop_trace_psl9()Vaibhav Jain
Presently the PSL9 specific cxl_stop_trace_psl9() only stops the RX0 traces on the CXL adapter when a PSL error irq is triggered. The patch updates the function to stop all the traces arrays and move them to the FIN state. The implementation issues the mmio to TRACECFG register to stop the trace array iff it already not in FIN state. This prevents the issue of trace data being reset in case of multiple stop mmio issued for a single trace array. Also the patch does some refactoring of existing cxl_stop_trace_psl9() and cxl_stop_trace_psl8() functions by moving them to 'pci.c' from 'debugfs.c' file and marking them as static. Signed-off-by: Vaibhav Jain <vaibhav@linux.vnet.ibm.com> Acked-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-11-06bpf: take advantage of stack_depth tracking in powerpc JITSandipan Das
Take advantage of stack_depth tracking, originally introduced for x64, in powerpc JIT as well. Round up allocated stack by 16 bytes to make sure it stays aligned for functions called from JITed bpf program. Signed-off-by: Sandipan Das <sandipan@linux.vnet.ibm.com> Reviewed-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-11-06powerpc/tm: Don't check for WARN in TM Bad Thing handlingMichael Ellerman
Currently when we take a TM Bad Thing program check exception, we search the bug table to see if the program check was generated by a WARN/WARN_ON etc. That makes no sense, the WARN macros use trap instructions, which should never generate a TM Bad Thing exception. If they ever did that would be a bug and we should oops. We do have some hand-coded bugs in tm.S, using EMIT_BUG_ENTRY, but those are all BUGs not WARNs, and they all use trap instructions anyway. Almost certainly this check was incorrectly copied from the REASON_TRAP handling in the same function. Remove it. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Acked-By: Michael Neuling <mikey@neuling.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-11-06powerpc/mm: Add a CONFIG option to choose if radix is used by defaultMichael Ellerman
Currently if the hardware supports the radix MMU we will use it, *unless* "disable_radix" is passed on the kernel command line. However some users would like the reverse semantics. ie. The kernel uses the hash MMU by default, unless radix is explicitly requested on the command line. So add a CONFIG option to choose whether we use radix by default or not, and expand the disable_radix command line option to allow "disable_radix=no" which *enables* radix. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-11-06powerpc/64s: Replace CONFIG_PPC_STD_MMU_64 with CONFIG_PPC_BOOK3S_64Michael Ellerman
CONFIG_PPC_STD_MMU_64 indicates support for the "standard" powerpc MMU on 64-bit CPUs. The "standard" MMU refers to the hash page table MMU found in "server" processors, from IBM mainly. Currently CONFIG_PPC_STD_MMU_64 is == CONFIG_PPC_BOOK3S_64. While it's annoying to have two symbols that always have the same value, it's not quite annoying enough to bother removing one. However with the arrival of Power9, we now have the situation where CONFIG_PPC_STD_MMU_64 is enabled, but the kernel is running using the Radix MMU - *not* the "standard" MMU. So it is now actively confusing to use it, because it implies that code is disabled or inactive when the Radix MMU is in use, however that is not necessarily true. So s/CONFIG_PPC_STD_MMU_64/CONFIG_PPC_BOOK3S_64/, and do some minor formatting updates of some of the affected lines. This will be a pain for backports, but c'est la vie. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-11-06powerpc/64: Free up CPU_FTR_ICSWXMichael Ellerman
The last user of CPU_FTR_ICSWX was removed in commit 6ff4d3e96652 ("powerpc: Remove old unused icswx based coprocessor support"), so free the bit up for future use. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-11-06powerpc/mm/hash: Add pr_fmt() to hash_utils64.cAneesh Kumar K.V
Make the printks look a bit nicer by adding a prefix. Radix config now do radix-mmu: Page sizes from device-tree: radix-mmu: Page size shift = 12 AP=0x0 radix-mmu: Page size shift = 16 AP=0x5 radix-mmu: Page size shift = 21 AP=0x1 radix-mmu: Page size shift = 30 AP=0x2 This patch update hash config to do similar dmesg output. With the patch we have hash-mmu: Page sizes from device-tree: hash-mmu: base_shift=12: shift=12, sllp=0x0000, avpnm=0x00000000, tlbiel=1, penc=0 hash-mmu: base_shift=12: shift=16, sllp=0x0000, avpnm=0x00000000, tlbiel=1, penc=7 hash-mmu: base_shift=12: shift=24, sllp=0x0000, avpnm=0x00000000, tlbiel=1, penc=56 hash-mmu: base_shift=16: shift=16, sllp=0x0110, avpnm=0x00000000, tlbiel=1, penc=1 hash-mmu: base_shift=16: shift=24, sllp=0x0110, avpnm=0x00000000, tlbiel=1, penc=8 hash-mmu: base_shift=20: shift=20, sllp=0x0111, avpnm=0x00000000, tlbiel=0, penc=2 hash-mmu: base_shift=24: shift=24, sllp=0x0100, avpnm=0x00000001, tlbiel=0, penc=0 hash-mmu: base_shift=34: shift=34, sllp=0x0120, avpnm=0x000007ff, tlbiel=0, penc=3 Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-11-06powerpc/ipic: Fix status get and status clearChristophe Leroy
IPIC Status is provided by register IPIC_SERSR and not by IPIC_SERMR which is the mask register. Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-11-06powerpc/powernv: Reserve a hole which appears after enabling IOVAlexey Kardashevskiy
In order to make generic IOV code work, the physical function IOV BAR should start from offset of the first VF. Since M64 segments share PE number space across PHB, and some PEs may be in use at the time when IOV is enabled, the existing code shifts the IOV BAR to the index of the first PE/VF. This creates a hole in IOMEM space which can be potentially taken by some other device. This reserves a temporary hole on a parent and releases it when IOV is disabled; the temporary resources are stored in pci_dn to avoid kmalloc/free. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Acked-by: Bjorn Helgaas <bhelgaas@google.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-11-06powerpc/pseries/vio: Dispose of virq mapping on vdevice unregisterTyrel Datwyler
When a vdevice is DLPAR removed from the system the vio subsystem doesn't bother unmapping the virq from the irq_domain. As a result we have a virq mapped to a hardware irq that is no longer valid for the irq_domain. A side effect is that we are left with /proc/irq/<irq#> affinity entries, and attempts to modify the smp_affinity of the irq will fail. In the following observed example the kernel log is spammed by ics_rtas_set_affinity errors after the removal of a VSCSI adapter. This is a result of irqbalance trying to adjust the affinity every 10 seconds. rpadlpar_io: slot U8408.E8E.10A7ACV-V5-C25 removed ics_rtas_set_affinity: ibm,set-xive irq=655385 returns -3 ics_rtas_set_affinity: ibm,set-xive irq=655385 returns -3 This patch fixes the issue by calling irq_dispose_mapping() on the virq of the viodev on unregister. Fixes: f2ab6219969f ("powerpc/pseries: Add PFO support to the VIO bus") Signed-off-by: Tyrel Datwyler <tyreld@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-11-06powerpc/64s/radix: Fix process table entry cache invalidationNicholas Piggin
According to the architecture, the process table entry cache must be flushed with tlbie RIC=2. Currently the process table entry is set to invalid right before the PID is returned to the allocator, with no invalidation. This works on existing implementations that are known to not cache the process table entry for any except the current PIDR. It is architecturally correct and cleaner to invalidate with RIC=2 after clearing the process table entry and before the PID is returned to the allocator. This can be done in arch_exit_mmap that runs before the final flush, and to ensure the final flush (fullmm) is always a RIC=2 variant. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-11-06powerpc/64s/radix: Improve preempt handling in TLB codeNicholas Piggin
Preempt should be consistently disabled for mm_is_thread_local tests, so bring the rest of these under preempt_disable(). Preempt does not need to be disabled for the mm->context.id tests, which allows simplification and removal of gotos. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>