author    Michael Ellerman <mpe@ellerman.id.au>  2020-05-05 21:36:42 +1000
committer Michael Ellerman <mpe@ellerman.id.au>  2020-05-06 15:53:24 +1000
commit    1f12096aca212af8fad3ef58d5673cde691a1452 (patch)
tree      945a7c88e06b348ccd9b01a58149269a5354a765 /arch/powerpc/kernel
parent    140777a3d8dfdb3d3f20ea7707c0f1c0ce1b0aa5 (diff)
parent    75358ea359e7c0dfceb3c7b3d854570b4260cb7f (diff)
Merge the lockless page table walk rework into next
This merges the lockless page table walk rework series from Aneesh.
Because it touches powerpc KVM code, we are sharing it with the kvm-ppc
tree in our topic/ppc-kvm branch.
This is the cover letter from Aneesh:
Avoid IPI while updating page table entries.
Problem Summary:
Termination of a KVM guest with a large guest RAM configuration is slow,
due to the large number of IPIs caused by clearing level 1 PTE (THP)
entries. This is shown in the stack trace below.
- qemu-system-ppc  [kernel.vmlinux]  [k] smp_call_function_many
  - smp_call_function_many
    - 36.09% smp_call_function_many
         serialize_against_pte_lookup
         radix__pmdp_huge_get_and_clear
         zap_huge_pmd
         unmap_page_range
         unmap_vmas
         unmap_region
         __do_munmap
         __vm_munmap
         sys_munmap
         system_call
         __munmap
         qemu_ram_munmap
         qemu_anon_ram_free
         reclaim_ramblock
         call_rcu_thread
         qemu_thread_start
         start_thread
         __clone
Why we need to do an IPI when clearing PMD entries:

This was added as part of commit 13bd817bb884 ("powerpc/thp: Serialize
pmd clear against a linux page table walk").

serialize_against_pte_lookup() makes sure that all parallel lockless
page table walks complete before we convert a PMD PTE entry to a regular
PMD entry. We end up doing that conversion in the following scenarios
(a sketch of the serialization itself follows this list):

1) __split_huge_zero_page_pmd
2) do_huge_pmd_wp_page_fallback
3) MADV_DONTNEED running in parallel with page faults
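For context, a minimal sketch of that serialization, modeled on
serialize_against_pte_lookup() in arch/powerpc/mm/book3s64/pgtable.c
(details simplified):

#include <linux/mm.h>
#include <linux/smp.h>

static void do_nothing(void *unused)
{
}

/*
 * Wait for all in-flight lockless walks: the IPI cannot be delivered
 * to a CPU that has interrupts disabled, so by the time
 * smp_call_function_many() returns (wait=1), every walker that started
 * before the PMD was cleared has finished.
 */
static void serialize_against_pte_lookup(struct mm_struct *mm)
{
	smp_mb();
	smp_call_function_many(mm_cpumask(mm), do_nothing, NULL, 1);
}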
local_irq_disable and lockless page table walk:

The lockless page table walk works under the assumption that we can
dereference the page table contents without holding a lock. For this
to work, we need to make sure we read the page table contents
atomically, and that the page table pages are not freed or released
while we are walking them. We can achieve this either by using RCU-based
freeing for the page table pages, or, if the architecture implements
broadcast tlbie, by blocking the IPI as we walk the page table pages.
To support both of the above schemes, the lockless page table walk is
done with interrupts disabled instead of under rcu_read_lock(), as
sketched below.
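A minimal sketch of the walker side of that contract (the function name
here is illustrative; real callers use __find_linux_pte() as in the
diff at the bottom of this page):

#include <linux/mm.h>
#include <asm/pte-walk.h>

unsigned long lockless_walk_pfn(struct mm_struct *mm, unsigned long addr)
{
	unsigned long flags, pfn = ULONG_MAX;
	unsigned int shift;
	pte_t *ptep, pte;

	local_irq_save(flags);		/* blocks the serializing IPI */
	ptep = __find_linux_pte(mm->pgd, addr, NULL, &shift);
	if (ptep) {
		pte = READ_ONCE(*ptep);	/* snapshot the entry exactly once */
		if (pte_present(pte))
			pfn = pte_pfn(pte);
	}
	local_irq_restore(flags);
	return pfn;
}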
We have two interfaces for lockless page table walks: gup fast and
__find_linux_pte. This patch series makes the __find_linux_pte walk
safe against the conversion of a PMD PTE to a regular PMD.
gup fast:

gup fast is already safe against a THP split, because the kernel now
differentiates between a pmd split and a compound page split. gup fast
can run in parallel with a pmd split, and we prevent gup fast from
running in parallel with a hugepage split by freezing the page refcount
and failing the speculative page ref increment, as sketched below.
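A sketch of that refcount handshake, simplified from the logic in
mm/huge_memory.c and mm/gup.c (the wrapper names here are illustrative):

#include <linux/page_ref.h>

/*
 * Split side: freeze the refcount. This succeeds only if the count
 * equals the expected number of references, and leaves it at zero.
 */
static bool split_try_freeze(struct page *head, int expected_refs)
{
	return page_ref_freeze(head, expected_refs);
}

/*
 * gup fast side: the speculative get refuses to resurrect a zero
 * (frozen) refcount, so it fails while a split is in progress and
 * gup falls back to the slow path.
 */
static bool gup_fast_try_get(struct page *head)
{
	return page_ref_add_unless(head, 1, 0);
}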
Similar to how gup is made safe against a parallel pmd split, this patch
series updates the __find_linux_pte callers to be safe against one as
well. We do that by enforcing the following rules (contrasted in the
sketch after this list):

1) Don't reload the pte value, because it can be updated in parallel.

2) Code should be able to work with a stale PTE value, not only the most
recent one; i.e. the pte value we are looking at may not be the latest
value in the page table.

3) Before looking at the pte value, check for the _PAGE_PTE bit. We now
do this as part of the pte_present() check.
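The mce_power.c hunk in the diff below is a real instance of these
rules; schematically, the change is (fragment, for illustration only):

/* Before: *ptep is dereferenced repeatedly and can change in between */
if (!ptep || pte_special(*ptep))
	return ULONG_MAX;
return pte_pfn(*ptep);

/* After: snapshot once, then use only the (possibly stale) copy;
 * pte_present() also covers the _PAGE_PTE check from rule 3 */
pte = READ_ONCE(*ptep);
if (!pte_present(pte) || pte_special(pte))
	return ULONG_MAX;
return pte_pfn(pte);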
Performance:
This speeds up QEMU guest RAM delete/unplug time, as below.

128-core, 496GB guest:

Without patch:
munmap start: timer = 13162 ms, PID=7684
munmap finish: timer = 95312 ms, PID=7684 - delta = 82150 ms

With patch (up to removing the IPI):
munmap start: timer = 196449 ms, PID=6681
munmap finish: timer = 196488 ms, PID=6681 - delta = 39 ms

With patch (with the tlb invalidate added in pmdp_huge_get_and_clear_full):
munmap start: timer = 196345 ms, PID=6879
munmap finish: timer = 196714 ms, PID=6879 - delta = 369 ms
Link: https://lore.kernel.org/r/20200505071729.54912-1-aneesh.kumar@linux.ibm.com
Diffstat (limited to 'arch/powerpc/kernel')
-rw-r--r--  arch/powerpc/kernel/mce_power.c | 14
-rw-r--r--  arch/powerpc/kernel/pci_64.c    |  2

2 files changed, 10 insertions(+), 6 deletions(-)
diff --git a/arch/powerpc/kernel/mce_power.c b/arch/powerpc/kernel/mce_power.c
index 067b094bfeff..1d18991f3854 100644
--- a/arch/powerpc/kernel/mce_power.c
+++ b/arch/powerpc/kernel/mce_power.c
@@ -27,7 +27,7 @@
  */
 unsigned long addr_to_pfn(struct pt_regs *regs, unsigned long addr)
 {
-	pte_t *ptep;
+	pte_t *ptep, pte;
 	unsigned int shift;
 	unsigned long pfn, flags;
 	struct mm_struct *mm;
@@ -39,19 +39,23 @@ unsigned long addr_to_pfn(struct pt_regs *regs, unsigned long addr)
 
 	local_irq_save(flags);
 	ptep = __find_linux_pte(mm->pgd, addr, NULL, &shift);
+	if (!ptep) {
+		pfn = ULONG_MAX;
+		goto out;
+	}
+	pte = READ_ONCE(*ptep);
 
-	if (!ptep || pte_special(*ptep)) {
+	if (!pte_present(pte) || pte_special(pte)) {
 		pfn = ULONG_MAX;
 		goto out;
 	}
 
 	if (shift <= PAGE_SHIFT)
-		pfn = pte_pfn(*ptep);
+		pfn = pte_pfn(pte);
 	else {
 		unsigned long rpnmask = (1ul << shift) - PAGE_SIZE;
-		pfn = pte_pfn(__pte(pte_val(*ptep) | (addr & rpnmask)));
+		pfn = pte_pfn(__pte(pte_val(pte) | (addr & rpnmask)));
 	}
-
 out:
 	local_irq_restore(flags);
 	return pfn;
diff --git a/arch/powerpc/kernel/pci_64.c b/arch/powerpc/kernel/pci_64.c
index f83d1f69b1dd..30d07fc79dd1 100644
--- a/arch/powerpc/kernel/pci_64.c
+++ b/arch/powerpc/kernel/pci_64.c
@@ -100,7 +100,7 @@ int pcibios_unmap_io_space(struct pci_bus *bus)
 		 pci_name(bus->self));
 
 #ifdef CONFIG_PPC_BOOK3S_64
-	__flush_hash_table_range(&init_mm, res->start + _IO_BASE,
+	__flush_hash_table_range(res->start + _IO_BASE,
 				 res->end + _IO_BASE + 1);
 #endif
 	return 0;