summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2021-01-24mm/highmem: prepare for overriding set_pte_at()Thomas Gleixner
The generic kmap_local() map function uses set_pte_at(), but MIPS requires set_pte() and PowerPC wants __set_pte_at(). Provide arch_kmap_local_set_pte() and default it to set_pte_at(). Link: https://lkml.kernel.org/r/20210112170411.056306194@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Andreas Larsson <andreas@gaisler.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Cercueil <paul@crapouillou.net> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-01-24sparc/mm/highmem: flush cache and TLBThomas Gleixner
Patch series "mm/highmem: Fix fallout from generic kmap_local conversions". The kmap_local conversion wreckaged sparc, mips and powerpc as it missed some of the details in the original implementation. This patch (of 4): The recent conversion to the generic kmap_local infrastructure failed to assign the proper pre/post map/unmap flush operations for sparc. Sparc requires cache flush before map/unmap and tlb flush afterwards. Link: https://lkml.kernel.org/r/20210112170136.078559026@linutronix.de Link: https://lkml.kernel.org/r/20210112170410.905976187@linutronix.de Fixes: 3293efa97807 ("sparc/mm/highmem: Switch to generic kmap atomic") Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reported-by: Andreas Larsson <andreas@gaisler.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Paul Cercueil <paul@crapouillou.net> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-01-24mm: fix page reference leak in soft_offline_page()Dan Williams
The conversion to move pfn_to_online_page() internal to soft_offline_page() missed that the get_user_pages() reference taken by the madvise() path needs to be dropped when pfn_to_online_page() fails. Note the direct sysfs-path to soft_offline_page() does not perform a get_user_pages() lookup. When soft_offline_page() is handed a pfn_valid() && !pfn_to_online_page() pfn the kernel hangs at dax-device shutdown due to a leaked reference. Link: https://lkml.kernel.org/r/161058501210.1840162.8108917599181157327.stgit@dwillia2-desk3.amr.corp.intel.com Fixes: feec24a6139d ("mm, soft-offline: convert parameter to pfn") Signed-off-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Qian Cai <cai@lca.pw> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-01-24ubsan: disable unsigned-overflow check for i386Arnd Bergmann
Building ubsan kernels even for compile-testing introduced these warnings in my randconfig environment: crypto/blake2b_generic.c:98:13: error: stack frame size of 9636 bytes in function 'blake2b_compress' [-Werror,-Wframe-larger-than=] static void blake2b_compress(struct blake2b_state *S, crypto/sha512_generic.c:151:13: error: stack frame size of 1292 bytes in function 'sha512_generic_block_fn' [-Werror,-Wframe-larger-than=] static void sha512_generic_block_fn(struct sha512_state *sst, u8 const *src, lib/crypto/curve25519-fiat32.c:312:22: error: stack frame size of 2180 bytes in function 'fe_mul_impl' [-Werror,-Wframe-larger-than=] static noinline void fe_mul_impl(u32 out[10], const u32 in1[10], const u32 in2[10]) lib/crypto/curve25519-fiat32.c:444:22: error: stack frame size of 1588 bytes in function 'fe_sqr_impl' [-Werror,-Wframe-larger-than=] static noinline void fe_sqr_impl(u32 out[10], const u32 in1[10]) Further testing showed that this is caused by -fsanitize=unsigned-integer-overflow, but is isolated to the 32-bit x86 architecture. The one in blake2b immediately overflows the 8KB stack area architectures, so better ensure this never happens by disabling the option for 32-bit x86. Link: https://lkml.kernel.org/r/20210112202922.2454435-1-arnd@kernel.org Link: https://lore.kernel.org/lkml/20201230154749.746641-1-arnd@kernel.org/ Fixes: d0a3ac549f38 ("ubsan: enable for all*config builds") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Kees Cook <keescook@chromium.org> Reviewed-by: Nathan Chancellor <natechancellor@gmail.com> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Marco Elver <elver@google.com> Cc: George Popescu <georgepope@android.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-01-24kasan, mm: fix resetting page_alloc tags for HW_TAGSAndrey Konovalov
A previous commit added resetting KASAN page tags to kernel_init_free_pages() to avoid false-positives due to accesses to metadata with the hardware tag-based mode. That commit did reset page tags before the metadata access, but didn't restore them after. As the result, KASAN fails to detect bad accesses to page_alloc allocations on some configurations. Fix this by recovering the tag after the metadata access. Link: https://lkml.kernel.org/r/02b5bcd692e912c27d484030f666b350ad7e4ae4.1611074450.git.andreyknvl@google.com Fixes: aa1ef4d7b3f6 ("kasan, mm: reset tags when accessing metadata") Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Marco Elver <elver@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Peter Collingbourne <pcc@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Branislav Rankov <Branislav.Rankov@arm.com> Cc: Kevin Brodsky <kevin.brodsky@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-01-24kasan, mm: fix conflicts with init_on_alloc/freeAndrey Konovalov
A few places where SLUB accesses object's data or metadata were missed in a previous patch. This leads to false positives with hardware tag-based KASAN when bulk allocations are used with init_on_alloc/free. Fix the false-positives by resetting pointer tags during these accesses. (The kasan_reset_tag call is removed from slab_alloc_node, as it's added into maybe_wipe_obj_freeptr.) Link: https://linux-review.googlesource.com/id/I50dd32838a666e173fe06c3c5c766f2c36aae901 Link: https://lkml.kernel.org/r/093428b5d2ca8b507f4a79f92f9929b35f7fada7.1610731872.git.andreyknvl@google.com Fixes: aa1ef4d7b3f67 ("kasan, mm: reset tags when accessing metadata") Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reported-by: Dmitry Vyukov <dvyukov@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Alexander Potapenko <glider@google.com> Cc: Marco Elver <elver@google.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Peter Collingbourne <pcc@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Branislav Rankov <Branislav.Rankov@arm.com> Cc: Kevin Brodsky <kevin.brodsky@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-01-24kasan: fix HW_TAGS boot parametersAndrey Konovalov
The initially proposed KASAN command line parameters are redundant. This change drops the complex "kasan.mode=off/prod/full" parameter and adds a simpler kill switch "kasan=off/on" instead. The new parameter together with the already existing ones provides a cleaner way to express the same set of features. The full set of parameters with this change: kasan=off/on - whether KASAN is enabled kasan.fault=report/panic - whether to only print a report or also panic kasan.stacktrace=off/on - whether to collect alloc/free stack traces Default values: kasan=on kasan.fault=report kasan.stacktrace=on (if CONFIG_DEBUG_KERNEL=y) kasan.stacktrace=off (otherwise) Link: https://linux-review.googlesource.com/id/Ib3694ed90b1e8ccac6cf77dfd301847af4aba7b8 Link: https://lkml.kernel.org/r/4e9c4a4bdcadc168317deb2419144582a9be6e61.1610736745.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Vincenzo Frascino <vincenzo.frascino@arm.com> Reviewed-by: Marco Elver <elver@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Peter Collingbourne <pcc@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Branislav Rankov <Branislav.Rankov@arm.com> Cc: Kevin Brodsky <kevin.brodsky@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-01-24kasan: fix incorrect arguments passing in kasan_add_zero_shadowLecopzer Chen
kasan_remove_zero_shadow() shall use original virtual address, start and size, instead of shadow address. Link: https://lkml.kernel.org/r/20210103063847.5963-1-lecopzer@gmail.com Fixes: 0207df4fa1a86 ("kernel/memremap, kasan: make ZONE_DEVICE with work with KASAN") Signed-off-by: Lecopzer Chen <lecopzer.chen@mediatek.com> Reviewed-by: Andrey Konovalov <andreyknvl@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Alexander Potapenko <glider@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-01-24kasan: fix unaligned address is unhandled in kasan_remove_zero_shadowLecopzer Chen
During testing kasan_populate_early_shadow and kasan_remove_zero_shadow, if the shadow start and end address in kasan_remove_zero_shadow() is not aligned to PMD_SIZE, the remain unaligned PTE won't be removed. In the test case for kasan_remove_zero_shadow(): shadow_start: 0xffffffb802000000, shadow end: 0xffffffbfbe000000 3-level page table: PUD_SIZE: 0x40000000 PMD_SIZE: 0x200000 PAGE_SIZE: 4K 0xffffffbf80000000 ~ 0xffffffbfbdf80000 will not be removed because in kasan_remove_pud_table(), kasan_pmd_table(*pud) is true but the next address is 0xffffffbfbdf80000 which is not aligned to PUD_SIZE. In the correct condition, this should fallback to the next level kasan_remove_pmd_table() but the condition flow always continue to skip the unaligned part. Fix by correcting the condition when next and addr are neither aligned. Link: https://lkml.kernel.org/r/20210103135621.83129-1-lecopzer@gmail.com Fixes: 0207df4fa1a86 ("kernel/memremap, kasan: make ZONE_DEVICE with work with KASAN") Signed-off-by: Lecopzer Chen <lecopzer.chen@mediatek.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: YJ Chiang <yj.chiang@mediatek.com> Cc: Andrey Konovalov <andreyknvl@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-01-24Merge tag 'irq_urgent_for_v5.11_rc5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq fixes from Borislav Petkov: - Fix a kernel panic in mips-cpu due to invalid irq domain hierarchy. - Fix to not lose IPIs on bcm2836. - Fix for a bogus marking of ITS devices as shared due to unitialized stack variable. - Clear a phantom interrupt on qcom-pdc to unblock suspend. - Small cleanups, warning and build fixes. * tag 'irq_urgent_for_v5.11_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: genirq: Export irq_check_status_bit() irqchip/mips-cpu: Set IPI domain parent chip irqchip/pruss: Simplify the TI_PRUSS_INTC Kconfig irqchip/loongson-liointc: Fix build warnings driver core: platform: Add extra error check in devm_platform_get_irqs_affinity() irqchip/bcm2836: Fix IPI acknowledgement after conversion to handle_percpu_devid_irq irqchip/irq-sl28cpld: Convert comma to semicolon genirq/msi: Initialize msi_alloc_info before calling msi_domain_prepare_irqs()
2021-01-24Merge tag 'objtool_urgent_for_v5.11_rc5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull objtool fixes from Borislav Petkov: - Adjust objtool to handle a recent binutils change to not generate unused symbols anymore. - Revert the fail-the-build-on-fatal-errors objtool strategy for now due to the ever-increasing matrix of supported toolchains/plugins and them causing too many such fatal errors currently. - Do not add empty symbols to objdump's rbtree to accommodate clang removing section symbols. * tag 'objtool_urgent_for_v5.11_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: objtool: Don't fail on missing symbol table objtool: Don't fail the kernel build on fatal errors objtool: Don't add empty symbols to the rbtree
2021-01-24Merge tag 'sched_urgent_for_v5.11_rc5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Borislav Petkov: - Correct the marking of kthreads which are supposed to run on a specific, single CPU vs such which are affine to only one CPU, mark per-cpu workqueue threads as such and make sure that marking "survives" CPU hotplug. Fix CPU hotplug issues with such kthreads. - A fix to not push away tasks on CPUs coming online. - Have workqueue CPU hotplug code use cpu_possible_mask when breaking affinity on CPU offlining so that pending workers can finish on newly arrived onlined CPUs too. - Dump tasks which haven't vacated a CPU which is currently being unplugged. - Register a special scale invariance callback which gets called on resume from RAM to read out APERF/MPERF after resume and thus make the schedutil scaling governor more precise. * tag 'sched_urgent_for_v5.11_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched: Relax the set_cpus_allowed_ptr() semantics sched: Fix CPU hotplug / tighten is_per_cpu_kthread() sched: Prepare to use balance_push in ttwu() workqueue: Restrict affinity change to rescuer workqueue: Tag bound workers with KTHREAD_IS_PER_CPU kthread: Extract KTHREAD_IS_PER_CPU sched: Don't run cpu-online with balance_push() enabled workqueue: Use cpu_possible_mask instead of cpu_active_mask to break affinity sched/core: Print out straggler tasks in sched_cpu_dying() x86: PM: Register syscore_ops for scale invariance
2021-01-24Merge tag 'timers_urgent_for_v5.11_rc5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fixes from Borislav Petkov: - Fix an integer overflow in the NTP RTC synchronization which led to the latter happening every 2 seconds instead of the intended every 11 minutes. - Get rid of now unused get_seconds(). * tag 'timers_urgent_for_v5.11_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: ntp: Fix RTC synchronization on 32-bit platforms timekeeping: Remove unused get_seconds()
2021-01-24Merge tag 'x86_urgent_for_v5.11_rc5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Borislav Petkov: - Add a new Intel model number for Alder Lake - Differentiate which aspects of the FPU state get saved/restored when the FPU is used in-kernel and fix a boot crash on K7 due to early MXCSR access before CR4.OSFXSR is even set. - A couple of noinstr annotation fixes - Correct die ID setting on AMD for users of topology information which need the correct die ID - A SEV-ES fix to handle string port IO to/from kernel memory properly * tag 'x86_urgent_for_v5.11_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/cpu: Add another Alder Lake CPU to the Intel family x86/mmx: Use KFPU_387 for MMX string operations x86/fpu: Add kernel_fpu_begin_mask() to selectively initialize state x86/topology: Make __max_die_per_package available unconditionally x86: __always_inline __{rd,wr}msr() x86/mce: Remove explicit/superfluous tracing locking/lockdep: Avoid noinstr warning for DEBUG_LOCKDEP locking/lockdep: Cure noinstr fail x86/sev: Fix nonistr violation x86/entry: Fix noinstr fail x86/cpu/amd: Set __max_die_per_package on AMD x86/sev-es: Handle string port IO to kernel memory properly
2021-01-24Merge tag 'powerpc-5.11-5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc fixes from Michael Ellerman: - Fix a bad interaction between the scv handling and the fallback L1D flush, which could lead to user register corruption. Only affects people using scv (~no one) on machines with old firmware that are missing the L1D flush. - Two small selftest fixes. Thanks to Eirik Fuller, Libor Pechacek, Nicholas Piggin, Sandipan Das, and Tulio Magno Quites Machado Filho. * tag 'powerpc-5.11-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: powerpc/64s: fix scv entry fallback flush vs interrupt selftests/powerpc: Only test lwm/stmw on big endian selftests/powerpc: Fix exit status of pkey tests
2021-01-24Merge tag 'for-linus-2021-01-24' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux Pull misc fixes from Christian Brauner: - Jann reported sparse complaints because of a missing __user annotation in a helper we added way back when we added pidfd_send_signal() to avoid compat syscall handling. Fix it. - Yanfei replaces a reference in a comment to the _do_fork() helper I removed a while ago with a reference to the new kernel_clone() replacement - Alexander Guril added a simple coding style fix * tag 'for-linus-2021-01-24' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux: kthread: remove comments about old _do_fork() helper Kernel: fork.c: Fix coding style: Do not use {} around single-line statements signal: Add missing __user annotation to copy_siginfo_from_user_any
2021-01-24Merge tag '5.11-rc4-smb3' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds
Pull cifs fixes from Steve French: "An important signal handling patch for stable, and two small cleanup patches" * tag '5.11-rc4-smb3' of git://git.samba.org/sfrench/cifs-2.6: cifs: do not fail __smb_send_rqst if non-fatal signals are pending fs/cifs: Simplify bool comparison. fs/cifs: Assign boolean values to a bool variable
2021-01-24mm: fix numa stats for thp migrationShakeel Butt
Currently the kernel is not correctly updating the numa stats for NR_FILE_PAGES and NR_SHMEM on THP migration. Fix that. For NR_FILE_DIRTY and NR_ZONE_WRITE_PENDING, although at the moment there is no need to handle THP migration as kernel still does not have write support for file THP but to be more future proof, this patch adds the THP support for those stats as well. Link: https://lkml.kernel.org/r/20210108155813.2914586-2-shakeelb@google.com Fixes: e71769ae52609 ("mm: enable thp migration for shmem thp") Signed-off-by: Shakeel Butt <shakeelb@google.com> Acked-by: Yang Shi <shy828301@gmail.com> Reviewed-by: Roman Gushchin <guro@fb.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <songmuchun@bytedance.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-01-24mm: memcg: fix memcg file_dirty numa statShakeel Butt
The kernel updates the per-node NR_FILE_DIRTY stats on page migration but not the memcg numa stats. That was not an issue until recently the commit 5f9a4f4a7096 ("mm: memcontrol: add the missing numa_stat interface for cgroup v2") exposed numa stats for the memcg. So fix the file_dirty per-memcg numa stat. Link: https://lkml.kernel.org/r/20210108155813.2914586-1-shakeelb@google.com Fixes: 5f9a4f4a7096 ("mm: memcontrol: add the missing numa_stat interface for cgroup v2") Signed-off-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Yang Shi <shy828301@gmail.com> Reviewed-by: Roman Gushchin <guro@fb.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-01-24mm: memcg/slab: optimize objcg stock drainingRoman Gushchin
Imran Khan reported a 16% regression in hackbench results caused by the commit f2fe7b09a52b ("mm: memcg/slab: charge individual slab objects instead of pages"). The regression is noticeable in the case of a consequent allocation of several relatively large slab objects, e.g. skb's. As soon as the amount of stocked bytes exceeds PAGE_SIZE, drain_obj_stock() and __memcg_kmem_uncharge() are called, and it leads to a number of atomic operations in page_counter_uncharge(). The corresponding call graph is below (provided by Imran Khan): |__alloc_skb | | | |__kmalloc_reserve.isra.61 | | | | | |__kmalloc_node_track_caller | | | | | | | |slab_pre_alloc_hook.constprop.88 | | | obj_cgroup_charge | | | | | | | | | |__memcg_kmem_charge | | | | | | | | | | | |page_counter_try_charge | | | | | | | | | |refill_obj_stock | | | | | | | | | | | |drain_obj_stock.isra.68 | | | | | | | | | | | | | |__memcg_kmem_uncharge | | | | | | | | | | | | | | | |page_counter_uncharge | | | | | | | | | | | | | | | | | |page_counter_cancel | | | | | | | | | | | |__slab_alloc | | | | | | | | | |___slab_alloc | | | | | | | | |slab_post_alloc_hook Instead of directly uncharging the accounted kernel memory, it's possible to refill the generic page-sized per-cpu stock instead. It's a much faster operation, especially on a default hierarchy. As a bonus, __memcg_kmem_uncharge_page() will also get faster, so the freeing of page-sized kernel allocations (e.g. large kmallocs) will become faster. A similar change has been done earlier for the socket memory by the commit 475d0487a2ad ("mm: memcontrol: use per-cpu stocks for socket memory uncharging"). Link: https://lkml.kernel.org/r/20210106042239.2860107-1-guro@fb.com Fixes: f2fe7b09a52b ("mm: memcg/slab: charge individual slab objects instead of pages") Signed-off-by: Roman Gushchin <guro@fb.com> Reported-by: Imran Khan <imran.f.khan@oracle.com> Tested-by: Imran Khan <imran.f.khan@oracle.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Michal Koutn <mkoutny@suse.com> Cc: Michal Koutný <mkoutny@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-01-24mm: fix initialization of struct page for holes in memory layoutMike Rapoport
There could be struct pages that are not backed by actual physical memory. This can happen when the actual memory bank is not a multiple of SECTION_SIZE or when an architecture does not register memory holes reserved by the firmware as memblock.memory. Such pages are currently initialized using init_unavailable_mem() function that iterates through PFNs in holes in memblock.memory and if there is a struct page corresponding to a PFN, the fields if this page are set to default values and the page is marked as Reserved. init_unavailable_mem() does not take into account zone and node the page belongs to and sets both zone and node links in struct page to zero. On a system that has firmware reserved holes in a zone above ZONE_DMA, for instance in a configuration below: # grep -A1 E820 /proc/iomem 7a17b000-7a216fff : Unknown E820 type 7a217000-7bffffff : System RAM unset zone link in struct page will trigger VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page), pfn), page); because there are pages in both ZONE_DMA32 and ZONE_DMA (unset zone link in struct page) in the same pageblock. Update init_unavailable_mem() to use zone constraints defined by an architecture to properly setup the zone link and use node ID of the adjacent range in memblock.memory to set the node link. Link: https://lkml.kernel.org/r/20210111194017.22696-3-rppt@kernel.org Fixes: 73a6e474cb37 ("mm: memmap_init: iterate over memblock regions rather that check each PFN") Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Reported-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Baoquan He <bhe@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: David Hildenbrand <david@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@kernel.org> Cc: Qian Cai <cai@lca.pw> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-01-24x86/setup: don't remove E820_TYPE_RAM for pfn 0Mike Rapoport
Patch series "mm: fix initialization of struct page for holes in memory layout", v3. Commit 73a6e474cb37 ("mm: memmap_init: iterate over memblock regions rather that check each PFN") exposed several issues with the memory map initialization and these patches fix those issues. Initially there were crashes during compaction that Qian Cai reported back in April [1]. It seemed back then that the problem was fixed, but a few weeks ago Andrea Arcangeli hit the same bug [2] and there was an additional discussion at [3]. [1] https://lore.kernel.org/lkml/8C537EB7-85EE-4DCF-943E-3CC0ED0DF56D@lca.pw [2] https://lore.kernel.org/lkml/20201121194506.13464-1-aarcange@redhat.com [3] https://lore.kernel.org/mm-commits/20201206005401.qKuAVgOXr%akpm@linux-foundation.org This patch (of 2): The first 4Kb of memory is a BIOS owned area and to avoid its allocation for the kernel it was not listed in e820 tables as memory. As the result, pfn 0 was never recognised by the generic memory management and it is not a part of neither node 0 nor ZONE_DMA. If set_pfnblock_flags_mask() would be ever called for the pageblock corresponding to the first 2Mbytes of memory, having pfn 0 outside of ZONE_DMA would trigger VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page), pfn), page); Along with reserving the first 4Kb in e820 tables, several first pages are reserved with memblock in several places during setup_arch(). These reservations are enough to ensure the kernel does not touch the BIOS area and it is not necessary to remove E820_TYPE_RAM for pfn 0. Remove the update of e820 table that changes the type of pfn 0 and move the comment describing why it was done to trim_low_memory_range() that reserves the beginning of the memory. Link: https://lkml.kernel.org/r/20210111194017.22696-2-rppt@kernel.org Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Cc: Baoquan He <bhe@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: David Hildenbrand <david@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@kernel.org> Cc: Qian Cai <cai@lca.pw> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-01-24io_uring: account io_uring internal files as REQ_F_INFLIGHTJens Axboe
We need to actively cancel anything that introduces a potential circular loop, where io_uring holds a reference to itself. If the file in question is an io_uring file, then add the request to the inflight list. Cc: stable@vger.kernel.org # 5.9+ Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-24io_uring: fix sleeping under spin in __io_clean_opPavel Begunkov
[ 27.629441] BUG: sleeping function called from invalid context at fs/file.c:402 [ 27.631317] in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 1012, name: io_wqe_worker-0 [ 27.633220] 1 lock held by io_wqe_worker-0/1012: [ 27.634286] #0: ffff888105e26c98 (&ctx->completion_lock) {....}-{2:2}, at: __io_req_complete.part.102+0x30/0x70 [ 27.649249] Call Trace: [ 27.649874] dump_stack+0xac/0xe3 [ 27.650666] ___might_sleep+0x284/0x2c0 [ 27.651566] put_files_struct+0xb8/0x120 [ 27.652481] __io_clean_op+0x10c/0x2a0 [ 27.653362] __io_cqring_fill_event+0x2c1/0x350 [ 27.654399] __io_req_complete.part.102+0x41/0x70 [ 27.655464] io_openat2+0x151/0x300 [ 27.656297] io_issue_sqe+0x6c/0x14e0 [ 27.660991] io_wq_submit_work+0x7f/0x240 [ 27.662890] io_worker_handle_work+0x501/0x8a0 [ 27.664836] io_wqe_worker+0x158/0x520 [ 27.667726] kthread+0x134/0x180 [ 27.669641] ret_from_fork+0x1f/0x30 Instead of cleaning files on overflow, return back overflow cancellation into io_uring_cancel_files(). Previously it was racy to clean REQ_F_OVERFLOW flag, but we got rid of it, and can do it through repetitive attempts targeting all matching requests. Reported-by: Abaci <abaci@linux.alibaba.com> Reported-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-23Merge branch 'mtd/fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mtd/linux Pull mtd fixes from Miquel Raynal. * 'mtd/fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/mtd/linux: mtd: rawnand: omap: Use BCH private fields in the specific OOB layout mtd: spinand: Fix MTD_OPS_AUTO_OOB requests mtd: rawnand: intel: check the mtd name only after setting the variable mtd: rawnand: nandsim: Fix the logic when selecting Hamming soft ECC engine mtd: rawnand: gpmi: fix dst bit offset when extracting raw payload
2021-01-23Merge branch 'i2c/for-current' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux Pull i2c fixes from Wolfram Sang: "Another bunch of driver fixes" * 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux: i2c: sprd: depend on COMMON_CLK to fix compile tests Revert "i2c: imx: Remove unused .id_table support" i2c: octeon: check correct size of maximum RECV_LEN packet i2c: tegra: Create i2c_writesl_vi() to use with VI I2C for filling TX FIFO i2c: bpmp-tegra: Ignore unknown I2C_M flags i2c: tegra: Wait for config load atomically while in ISR
2021-01-23Merge tag 'scsi-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi Pull SCSI fixes from James Bottomley: "Twelve minor fixes, all in drivers or doc. Most of the fixes are pretty obvious (although we had two goes to get the UFS sysfs doc right) and the biggest change is in the ufs driver which they've extensively tested" * tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: scsi: ibmvfc: Set default timeout to avoid crash during migration scsi: target: tcmu: Fix use-after-free of se_cmd->priv scsi: fnic: Fix memleak in vnic_dev_init_devcmd2 scsi: libfc: Avoid invoking response handler twice if ep is already completed scsi: scsi_transport_srp: Don't block target in failfast state scsi: docs: ABI: sysfs-driver-ufs: Rectify table formatting scsi: ufs: Fix tm request when non-fatal error happens scsi: ufs: Fix livelock of ufshcd_clear_ua_wluns() scsi: ibmvfc: Fix missing cast of ibmvfc_event pointer to u64 handle scsi: ufs: ufshcd-pltfrm depends on HAS_IOMEM scsi: megaraid_sas: Fix MEGASAS_IOC_FIRMWARE regression scsi: docs: ABI: sysfs-driver-ufs: Add DeepSleep power mode
2021-01-23Merge tag 'linux-kselftest-kunit-fixes-5.11-rc5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest Pull kunit fixes from Shuah : "Five fixes to the kunit tool and documentation from Daniel Latypov and David Gow" * tag 'linux-kselftest-kunit-fixes-5.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest: kunit: tool: move kunitconfig parsing into __init__, make it optional kunit: tool: fix minor typing issue with None status kunit: tool: surface and address more typing issues Documentation: kunit: include example of a parameterized test kunit: tool: Fix spelling of "diagnostic" in kunit_parser
2021-01-23cifs: do not fail __smb_send_rqst if non-fatal signals are pendingRonnie Sahlberg
RHBZ 1848178 The original intent of returning an error in this function in the patch: "CIFS: Mask off signals when sending SMB packets" was to avoid interrupting packet send in the middle of sending the data (and thus breaking an SMB connection), but we also don't want to fail the request for non-fatal signals even before we have had a chance to try to send it (the reported problem could be reproduced e.g. by exiting a child process when the parent process was in the midst of calling futimens to update a file's timestamps). In addition, since the signal may remain pending when we enter the sending loop, we may end up not sending the whole packet before TCP buffers become full. In this case the code returns -EINTR but what we need here is to return -ERESTARTSYS instead to allow system calls to be restarted. Fixes: b30c74c73c78 ("CIFS: Mask off signals when sending SMB packets") Cc: stable@vger.kernel.org # v5.1+ Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2021-01-22Merge tag 'for-5.11/dm-fixes-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper fixes from Mike Snitzer: - Fix DM integrity crash if "recalculate" used without "internal_hash" - Fix DM integrity "recalculate" support to prevent recalculating checksums if we use internal_hash or journal_hash with a key (e.g. HMAC). Use of crypto as a means to prevent malicious corruption requires further changes and was never a design goal for dm-integrity's primary usecase of detecting accidental corruption. - Fix a benign dm-crypt copy-and-paste bug introduced as part of a fix that was merged for 5.11-rc4. - Fix DM core's dm_get_device() to avoid filesystem lookup to get block device (if possible). * tag 'for-5.11/dm-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm: avoid filesystem lookup in dm_get_dev_t() dm crypt: fix copy and paste bug in crypt_alloc_req_aead dm integrity: conditionally disable "recalculate" feature dm integrity: fix a crash if "recalculate" used without "internal_hash"
2021-01-22Merge tag 'perf-tools-fixes-v5.11-2-2021-01-22' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux Pull more perf tools fixes from Arnaldo Carvalho de Melo: - Fix id index used in Intel PT for heterogeneous systems - Fix overrun issue in 'perf script' for dynamically-allocated PMU type number - Fix 'perf stat' metrics containing the 'duration_time' synthetic event - Fix system PMU 'perf stat' metrics * tag 'perf-tools-fixes-v5.11-2-2021-01-22' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux: perf script: Fix overrun issue for dynamically-allocated PMU type number perf metricgroup: Fix system PMU metrics perf metricgroup: Fix for metrics containing duration_time perf evlist: Fix id index for heterogeneous systems
2021-01-22Merge tag 'arm64-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux Pull arm64 fixes from Catalin Marinas: - Correctly mask out bits 63:60 in a kernel tag check fault address (specified as unknown by the architecture). Previously they were just zeroed but for kernel pointers they need to be all ones. - Fix a panic (unexpected kernel BRK exception) caused by kprobes being reentered due to an interrupt. * tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: arm64: kprobes: Fix Uexpected kernel BRK exception at EL1 kasan, arm64: fix pointer tags in KASAN reports
2021-01-22Merge tag 'ceph-for-5.11-rc5' of git://github.com/ceph/ceph-clientLinus Torvalds
Pull ceph fixes from Ilya Dryomov: "A patch to zero out sensitive cryptographic data and two minor cleanups prompted by the fact that a bunch of code was moved in this cycle" * tag 'ceph-for-5.11-rc5' of git://github.com/ceph/ceph-client: libceph: fix "Boolean result is used in bitwise operation" warning libceph, ceph: disambiguate ceph_connection_operations handlers libceph: zero out session key and connection secret
2021-01-22Merge tag 'fixes-2021-01-22' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock Pull typo fix from Mike Rapoport: "Fix typo in comment of memblock_phys_alloc_try_nid()" * tag 'fixes-2021-01-22' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock: mm/memblock: Fix typo in comment of memblock_phys_alloc_try_nid()
2021-01-22Merge tag 'mmc-v5.11-rc3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc Pull MMC fixes from Ulf Hansson: "MMC core: - Fix initialization of block size when ext_csd isn't present MMC host: - sdhci-brcmstb: Fix mmc timeout errors on S5 suspend - sdhci-of-dwcmshc: Fix request accessing RPMB - sdhci-xenon: Fix 1.8v regulator stabilization" * tag 'mmc-v5.11-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc: mmc: core: don't initialize block size from ext_csd if not present mmc: sdhci-brcmstb: Fix mmc timeout errors on S5 suspend mmc: sdhci-xenon: fix 1.8v regulator stabilization mmc: sdhci-of-dwcmshc: fix rpmb access
2021-01-22Merge tag 'platform-drivers-x86-v5.11-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86 Pull x86 platform driver fixes from Hans de Goede: "A small collection of bug-fixes and model-specific quirks" * tag 'platform-drivers-x86-v5.11-2' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86: platform/x86: thinkpad_acpi: Add P53/73 firmware to fan_quirk_table for dual fan control platform/x86: hp-wmi: Don't log a warning on HPWMI_RET_UNKNOWN_COMMAND errors platform/x86: intel-vbtn: Drop HP Stream x360 Convertible PC 11 from allow-list platform/x86: ideapad-laptop: Disable touchpad_switch for ELAN0634 platform/x86: amd-pmc: Fix CONFIG_DEBUG_FS check platform/x86: thinkpad_acpi: correct palmsensor error checking platform/x86: intel-vbtn: Support for tablet mode on Dell Inspiron 7352 platform/x86: touchscreen_dmi: Add swap-x-y quirk for Goodix touchscreen on Estar Beauty HD tablet platform/x86: i2c-multi-instantiate: Don't create platform device for INT3515 ACPI nodes platform/surface: SURFACE_PLATFORMS should depend on ACPI platform/surface: surface_gpe: Fix non-PM_SLEEP build warnings tools/power/x86/intel-speed-select: Set higher of cpuinfo_max_freq or base_frequency tools/power/x86/intel-speed-select: Set scaling_max_freq to base_frequency
2021-01-22io_uring: fix short read retries for non-reg filesPavel Begunkov
Sockets and other non-regular files may actually expect short reads to happen, don't retry reads for them. Because non-reg files don't set FMODE_BUF_RASYNC and so it won't do second/retry do_read, we can filter out those cases after first do_read() attempt with ret>0. Cc: stable@vger.kernel.org # 5.9+ Suggested-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-22io_uring: fix SQPOLL IORING_OP_CLOSE cancelation stateJens Axboe
IORING_OP_CLOSE is special in terms of cancelation, since it has an intermediate state where we've removed the file descriptor but hasn't closed the file yet. For that reason, it's currently marked with IO_WQ_WORK_NO_CANCEL to prevent cancelation. This ensures that the op is always run even if canceled, to prevent leaving us with a live file but an fd that is gone. However, with SQPOLL, since a cancel request doesn't carry any resources on behalf of the request being canceled, if we cancel before any of the close op has been run, we can end up with io-wq not having the ->files assigned. This can result in the following oops reported by Joseph: BUG: kernel NULL pointer dereference, address: 00000000000000d8 PGD 800000010b76f067 P4D 800000010b76f067 PUD 10b462067 PMD 0 Oops: 0000 [#1] SMP PTI CPU: 1 PID: 1788 Comm: io_uring-sq Not tainted 5.11.0-rc4 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 RIP: 0010:__lock_acquire+0x19d/0x18c0 Code: 00 00 8b 1d fd 56 dd 08 85 db 0f 85 43 05 00 00 48 c7 c6 98 7b 95 82 48 c7 c7 57 96 93 82 e8 9a bc f5 ff 0f 0b e9 2b 05 00 00 <48> 81 3f c0 ca 67 8a b8 00 00 00 00 41 0f 45 c0 89 04 24 e9 81 fe RSP: 0018:ffffc90001933828 EFLAGS: 00010002 RAX: 0000000000000001 RBX: 0000000000000001 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000000000d8 RBP: 0000000000000246 R08: 0000000000000001 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 R13: 0000000000000000 R14: ffff888106e8a140 R15: 00000000000000d8 FS: 0000000000000000(0000) GS:ffff88813bd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000000000d8 CR3: 0000000106efa004 CR4: 00000000003706e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: lock_acquire+0x31a/0x440 ? close_fd_get_file+0x39/0x160 ? __lock_acquire+0x647/0x18c0 _raw_spin_lock+0x2c/0x40 ? close_fd_get_file+0x39/0x160 close_fd_get_file+0x39/0x160 io_issue_sqe+0x1334/0x14e0 ? lock_acquire+0x31a/0x440 ? __io_free_req+0xcf/0x2e0 ? __io_free_req+0x175/0x2e0 ? find_held_lock+0x28/0xb0 ? io_wq_submit_work+0x7f/0x240 io_wq_submit_work+0x7f/0x240 io_wq_cancel_cb+0x161/0x580 ? io_wqe_wake_worker+0x114/0x360 ? io_uring_get_socket+0x40/0x40 io_async_find_and_cancel+0x3b/0x140 io_issue_sqe+0xbe1/0x14e0 ? __lock_acquire+0x647/0x18c0 ? __io_queue_sqe+0x10b/0x5f0 __io_queue_sqe+0x10b/0x5f0 ? io_req_prep+0xdb/0x1150 ? mark_held_locks+0x6d/0xb0 ? mark_held_locks+0x6d/0xb0 ? io_queue_sqe+0x235/0x4b0 io_queue_sqe+0x235/0x4b0 io_submit_sqes+0xd7e/0x12a0 ? _raw_spin_unlock_irq+0x24/0x30 ? io_sq_thread+0x3ae/0x940 io_sq_thread+0x207/0x940 ? do_wait_intr_irq+0xc0/0xc0 ? __ia32_sys_io_uring_enter+0x650/0x650 kthread+0x134/0x180 ? kthread_create_worker_on_cpu+0x90/0x90 ret_from_fork+0x1f/0x30 Fix this by moving the IO_WQ_WORK_NO_CANCEL until _after_ we've modified the fdtable. Canceling before this point is totally fine, and running it in the io-wq context _after_ that point is also fine. For 5.12, we'll handle this internally and get rid of the no-cancel flag, as IORING_OP_CLOSE is the only user of it. Cc: stable@vger.kernel.org Fixes: b5dba59e0cf7 ("io_uring: add support for IORING_OP_CLOSE") Reported-by: "Abaci <abaci@linux.alibaba.com>" Reviewed-and-tested-by: Joseph Qi <joseph.qi@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-22arm64: kprobes: Fix Uexpected kernel BRK exception at EL1Qais Yousef
I was hitting the below panic continuously when attaching kprobes to scheduler functions [ 159.045212] Unexpected kernel BRK exception at EL1 [ 159.053753] Internal error: BRK handler: f2000006 [#1] PREEMPT SMP [ 159.059954] Modules linked in: [ 159.063025] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 5.11.0-rc4-00008-g1e2a199f6ccd #56 [rt-app] <notice> [1] Exiting.[ 159.071166] Hardware name: ARM Juno development board (r2) (DT) [ 159.079689] pstate: 600003c5 (nZCv DAIF -PAN -UAO -TCO BTYPE=--) [ 159.085723] pc : 0xffff80001624501c [ 159.089377] lr : attach_entity_load_avg+0x2ac/0x350 [ 159.094271] sp : ffff80001622b640 [rt-app] <notice> [0] Exiting.[ 159.097591] x29: ffff80001622b640 x28: 0000000000000001 [ 159.105515] x27: 0000000000000049 x26: ffff000800b79980 [ 159.110847] x25: ffff00097ef37840 x24: 0000000000000000 [ 159.116331] x23: 00000024eacec1ec x22: ffff00097ef12b90 [ 159.121663] x21: ffff00097ef37700 x20: ffff800010119170 [rt-app] <notice> [11] Exiting.[ 159.126995] x19: ffff00097ef37840 x18: 000000000000000e [ 159.135003] x17: 0000000000000001 x16: 0000000000000019 [ 159.140335] x15: 0000000000000000 x14: 0000000000000000 [ 159.145666] x13: 0000000000000002 x12: 0000000000000002 [ 159.150996] x11: ffff80001592f9f0 x10: 0000000000000060 [ 159.156327] x9 : ffff8000100f6f9c x8 : be618290de0999a1 [ 159.161659] x7 : ffff80096a4b1000 x6 : 0000000000000000 [ 159.166990] x5 : ffff00097ef37840 x4 : 0000000000000000 [ 159.172321] x3 : ffff000800328948 x2 : 0000000000000000 [ 159.177652] x1 : 0000002507d52fec x0 : ffff00097ef12b90 [ 159.182983] Call trace: [ 159.185433] 0xffff80001624501c [ 159.188581] update_load_avg+0x2d0/0x778 [ 159.192516] enqueue_task_fair+0x134/0xe20 [ 159.196625] enqueue_task+0x4c/0x2c8 [ 159.200211] ttwu_do_activate+0x70/0x138 [ 159.204147] sched_ttwu_pending+0xbc/0x160 [ 159.208253] flush_smp_call_function_queue+0x16c/0x320 [ 159.213408] generic_smp_call_function_single_interrupt+0x1c/0x28 [ 159.219521] ipi_handler+0x1e8/0x3c8 [ 159.223106] handle_percpu_devid_irq+0xd8/0x460 [ 159.227650] generic_handle_irq+0x38/0x50 [ 159.231672] __handle_domain_irq+0x6c/0xc8 [ 159.235781] gic_handle_irq+0xcc/0xf0 [ 159.239452] el1_irq+0xb4/0x180 [ 159.242600] rcu_is_watching+0x28/0x70 [ 159.246359] rcu_read_lock_held_common+0x44/0x88 [ 159.250991] rcu_read_lock_any_held+0x30/0xc0 [ 159.255360] kretprobe_dispatcher+0xc4/0xf0 [ 159.259555] __kretprobe_trampoline_handler+0xc0/0x150 [ 159.264710] trampoline_probe_handler+0x38/0x58 [ 159.269255] kretprobe_trampoline+0x70/0xc4 [ 159.273450] run_rebalance_domains+0x54/0x80 [ 159.277734] __do_softirq+0x164/0x684 [ 159.281406] irq_exit+0x198/0x1b8 [ 159.284731] __handle_domain_irq+0x70/0xc8 [ 159.288840] gic_handle_irq+0xb0/0xf0 [ 159.292510] el1_irq+0xb4/0x180 [ 159.295658] arch_cpu_idle+0x18/0x28 [ 159.299245] default_idle_call+0x9c/0x3e8 [ 159.303265] do_idle+0x25c/0x2a8 [ 159.306502] cpu_startup_entry+0x2c/0x78 [ 159.310436] secondary_start_kernel+0x160/0x198 [ 159.314984] Code: d42000c0 aa1e03e9 d42000c0 aa1e03e9 (d42000c0) After a bit of head scratching and debugging it turned out that it is due to kprobe handler being interrupted by a tick that causes us to go into (I think another) kprobe handler. The culprit was kprobe_breakpoint_ss_handler() returning DBG_HOOK_ERROR which leads to the Unexpected kernel BRK exception. Reverting commit ba090f9cafd5 ("arm64: kprobes: Remove redundant kprobe_step_ctx") seemed to fix the problem for me. Further analysis showed that kcb->kprobe_status is set to KPROBE_REENTER when the error occurs. By teaching kprobe_breakpoint_ss_handler() to handle this status I can no longer reproduce the problem. Fixes: ba090f9cafd5 ("arm64: kprobes: Remove redundant kprobe_step_ctx") Signed-off-by: Qais Yousef <qais.yousef@arm.com> Acked-by: Will Deacon <will@kernel.org> Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Link: https://lore.kernel.org/r/20210122110909.3324607-1-qais.yousef@arm.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2021-01-22sched: Relax the set_cpus_allowed_ptr() semanticsPeter Zijlstra
Now that we have KTHREAD_IS_PER_CPU to denote the critical per-cpu tasks to retain during CPU offline, we can relax the warning in set_cpus_allowed_ptr(). Any spurious kthread that wants to get on at the last minute will get pushed off before it can run. While during CPU online there is no harm, and actual benefit, to allowing kthreads back on early, it simplifies hotplug code and fixes a number of outstanding races. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Lai jiangshan <jiangshanlai@gmail.com> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Tested-by: Valentin Schneider <valentin.schneider@arm.com> Link: https://lkml.kernel.org/r/20210121103507.240724591@infradead.org
2021-01-22sched: Fix CPU hotplug / tighten is_per_cpu_kthread()Peter Zijlstra
Prior to commit 1cf12e08bc4d ("sched/hotplug: Consolidate task migration on CPU unplug") we'd leave any task on the dying CPU and break affinity and force them off at the very end. This scheme had to change in order to enable migrate_disable(). One cannot wait for migrate_disable() to complete while stuck in stop_machine(). Furthermore, since we need at the very least: idle, hotplug and stop threads at any point before stop_machine, we can't break affinity and/or push those away. Under the assumption that all per-cpu kthreads are sanely handled by CPU hotplug, the new code no long breaks affinity or migrates any of them (which then includes the critical ones above). However, there's an important difference between per-cpu kthreads and kthreads that happen to have a single CPU affinity which is lost. The latter class very much relies on the forced affinity breaking and migration semantics previously provided. Use the new kthread_is_per_cpu() infrastructure to tighten is_per_cpu_kthread() and fix the hot-unplug problems stemming from the change. Fixes: 1cf12e08bc4d ("sched/hotplug: Consolidate task migration on CPU unplug") Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Tested-by: Valentin Schneider <valentin.schneider@arm.com> Link: https://lkml.kernel.org/r/20210121103507.102416009@infradead.org
2021-01-22sched: Prepare to use balance_push in ttwu()Peter Zijlstra
In preparation of using the balance_push state in ttwu() we need it to provide a reliable and consistent state. The immediate problem is that rq->balance_callback gets cleared every schedule() and then re-set in the balance_push_callback() itself. This is not a reliable signal, so add a variable that stays set during the entire time. Also move setting it before the synchronize_rcu() in sched_cpu_deactivate(), such that we get guaranteed visibility to ttwu(), which is a preempt-disable region. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Tested-by: Valentin Schneider <valentin.schneider@arm.com> Link: https://lkml.kernel.org/r/20210121103506.966069627@infradead.org
2021-01-22workqueue: Restrict affinity change to rescuerPeter Zijlstra
create_worker() will already set the right affinity using kthread_bind_mask(), this means only the rescuer will need to change it's affinity. Howveer, while in cpu-hot-unplug a regular task is not allowed to run on online&&!active as it would be pushed away quite agressively. We need KTHREAD_IS_PER_CPU to survive in that environment. Therefore set the affinity after getting that magic flag. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Tested-by: Valentin Schneider <valentin.schneider@arm.com> Link: https://lkml.kernel.org/r/20210121103506.826629830@infradead.org
2021-01-22workqueue: Tag bound workers with KTHREAD_IS_PER_CPUPeter Zijlstra
Mark the per-cpu workqueue workers as KTHREAD_IS_PER_CPU. Workqueues have unfortunate semantics in that per-cpu workers are not default flushed and parked during hotplug, however a subset does manual flush on hotplug and hard relies on them for correctness. Therefore play silly games.. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Tested-by: Valentin Schneider <valentin.schneider@arm.com> Link: https://lkml.kernel.org/r/20210121103506.693465814@infradead.org
2021-01-22kthread: Extract KTHREAD_IS_PER_CPUPeter Zijlstra
There is a need to distinguish geniune per-cpu kthreads from kthreads that happen to have a single CPU affinity. Geniune per-cpu kthreads are kthreads that are CPU affine for correctness, these will obviously have PF_KTHREAD set, but must also have PF_NO_SETAFFINITY set, lest userspace modify their affinity and ruins things. However, these two things are not sufficient, PF_NO_SETAFFINITY is also set on other tasks that have their affinities controlled through other means, like for instance workqueues. Therefore another bit is needed; it turns out kthread_create_per_cpu() already has such a bit: KTHREAD_IS_PER_CPU, which is used to make kthread_park()/kthread_unpark() work correctly. Expose this flag and remove the implicit setting of it from kthread_create_on_cpu(); the io_uring usage of it seems dubious at best. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Tested-by: Valentin Schneider <valentin.schneider@arm.com> Link: https://lkml.kernel.org/r/20210121103506.557620262@infradead.org
2021-01-22sched: Don't run cpu-online with balance_push() enabledPeter Zijlstra
We don't need to push away tasks when we come online, mark the push complete right before the CPU dies. XXX hotplug state machine has trouble with rollback here. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Tested-by: Valentin Schneider <valentin.schneider@arm.com> Link: https://lkml.kernel.org/r/20210121103506.415606087@infradead.org
2021-01-22workqueue: Use cpu_possible_mask instead of cpu_active_mask to break affinityLai Jiangshan
The scheduler won't break affinity for us any more, and we should "emulate" the same behavior when the scheduler breaks affinity for us. The behavior is "changing the cpumask to cpu_possible_mask". And there might be some other CPUs online later while the worker is still running with the pending work items. The worker should be allowed to use the later online CPUs as before and process the work items ASAP. If we use cpu_active_mask here, we can't achieve this goal but using cpu_possible_mask can. Fixes: 06249738a41a ("workqueue: Manually break affinity on hotplug") Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Acked-by: Tejun Heo <tj@kernel.org> Tested-by: Paul E. McKenney <paulmck@kernel.org> Tested-by: Valentin Schneider <valentin.schneider@arm.com> Link: https://lkml.kernel.org/r/20210111152638.2417-4-jiangshanlai@gmail.com
2021-01-22sched/core: Print out straggler tasks in sched_cpu_dying()Valentin Schneider
Since commit 1cf12e08bc4d ("sched/hotplug: Consolidate task migration on CPU unplug") tasks are expected to move themselves out of a out-going CPU. For most tasks this will be done automagically via BALANCE_PUSH, but percpu kthreads will have to cooperate and move themselves away one way or another. Currently, some percpu kthreads (workqueues being a notable exemple) do not cooperate nicely and can end up on an out-going CPU at the time sched_cpu_dying() is invoked. Print the dying rq's tasks to shed some light on the stragglers. Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Tested-by: Valentin Schneider <valentin.schneider@arm.com> Link: https://lkml.kernel.org/r/20210113183141.11974-1-valentin.schneider@arm.com
2021-01-22misc: rtsx: init value of aspm_enabledRicky Wu
make sure ASPM state sync with pcr->aspm_enabled init value pcr->aspm_enabled Cc: stable@vger.kernel.org Signed-off-by: Ricky Wu <ricky_wu@realtek.com> Link: https://lore.kernel.org/r/20210122081906.19100-1-ricky_wu@realtek.com Fixes: d928061c3143 ("misc: rtsx: modify en/disable aspm function") Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-22tty: fix up hung_up_tty_write() conversionLinus Torvalds
In commit "tty: implement write_iter", I left the write_iter conversion of the hung up tty case alone, because I incorrectly thought it didn't matter. Jiri showed me the errors of my ways, and pointed out the problems with that incomplete conversion. Fix it all up. Reported-by: Jiri Slaby <jirislaby@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Reviewed-by: Jiri Slaby <jirislaby@kernel.org> Link: https://lore.kernel.org/r/CAHk-=wh+-rGsa=xruEWdg_fJViFG8rN9bpLrfLz=_yBYh2tBhA@mail.gmail.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>