summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2018-06-10Merge branch 'x86-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 updates and fixes from Thomas Gleixner: - Fix the (late) fallout from the vector management rework causing hlist corruption and irq descriptor reference leaks caused by a missing sanity check. The straight forward fix triggered another long standing issue to surface. The pre rework code hid the issue due to being way slower, but now the chance that user space sees an EBUSY error return when updating irq affinities is way higher, though quite a bunch of userspace tools do not handle it properly despite the fact that EBUSY could be returned for at least 10 years. It turned out that the EBUSY return can be avoided completely by utilizing the existing delayed affinity update mechanism for irq remapped scenarios as well. That's a bit more error handling in the kernel, but avoids fruitless fingerpointing discussions with tool developers. - Decouple PHYSICAL_MASK from AMD SME as its going to be required for the upcoming Intel memory encryption support as well. - Handle legacy device ACPI detection properly for newer platforms - Fix the wrong argument ordering in the vector allocation tracepoint - Simplify the IDT setup code for the APIC=n case - Use the proper string helpers in the MTRR code - Remove a stale unused VDSO source file - Convert the microcode update lock to a raw spinlock as its used in atomic context. * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/intel_rdt: Enable CMT and MBM on new Skylake stepping x86/apic/vector: Print APIC control bits in debugfs genirq/affinity: Defer affinity setting if irq chip is busy x86/platform/uv: Use apic_ack_irq() x86/ioapic: Use apic_ack_irq() irq_remapping: Use apic_ack_irq() x86/apic: Provide apic_ack_irq() genirq/migration: Avoid out of line call if pending is not set genirq/generic_pending: Do not lose pending affinity update x86/apic/vector: Prevent hlist corruption and leaks x86/vector: Fix the args of vector_alloc tracepoint x86/idt: Simplify the idt_setup_apic_and_irq_gates() x86/platform/uv: Remove extra parentheses x86/mm: Decouple dynamic __PHYSICAL_MASK from AMD SME x86: Mark native_set_p4d() as __always_inline x86/microcode: Make the late update update_lock a raw lock for RT x86/mtrr: Convert to use strncpy_from_user() helper x86/mtrr: Convert to use match_string() helper x86/vdso: Remove unused file x86/i8237: Register device based on FADT legacy boot flag
2018-06-10Merge branch 'core-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull core fixes from Thomas Gleixner: "A small set of core updates: - Make objtool cope with GCC8 oddities some more - Remove a stale local_irq_save/restore sequence in the signal code along with the stale comment in the RCU code. The underlying issue which led to this has been solved long time ago, but nobody cared to cleanup the hackarounds" * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: signal: Remove no longer required irqsave/restore rcu: Update documentation of rcu_read_unlock() objtool: Fix GCC 8 cold subfunction detection for aliased functions
2018-06-10signal: Remove no longer required irqsave/restoreAnna-Maria Gleixner
Commit a841796f11c9 ("signal: align __lock_task_sighand() irq disabling and RCU") introduced a rcu read side critical section with interrupts disabled. The changelog suggested that a better long-term fix would be "to make rt_mutex_unlock() disable irqs when acquiring the rt_mutex structure's ->wait_lock". This long-term fix has been made in commit b4abf91047cf ("rtmutex: Make wait_lock irq safe") for a different reason. Therefore revert commit a841796f11c9 ("signal: align > __lock_task_sighand() irq disabling and RCU") as the interrupt disable dance is not longer required. The change was tested on the base of b4abf91047cf ("rtmutex: Make wait_lock irq safe") with a four hour run of rcutorture scenario TREE03 with lockdep enabled as suggested by Paul McKenney. Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Cc: bigeasy@linutronix.de Link: https://lkml.kernel.org/r/20180525090507.22248-3-anna-maria@linutronix.de
2018-06-09Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/ideLinus Torvalds
Pull IDE updates from David Miller: "Primarily IRQ disabling avoidance changes from Sebastian Andrzej Siewior" * git://git.kernel.org/pub/scm/linux/kernel/git/davem/ide: ide: don't enable/disable interrupts in force threaded-IRQ mode ide: don't disable interrupts during kmap_atomic() ide: Handle irq disabling consistently alim15x3: move irq-restore before pci_dev_put()
2018-06-08Merge tag 'libnvdimm-for-4.18' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm Pull libnvdimm updates from Dan Williams: "This adds a user for the new 'bytes-remaining' updates to memcpy_mcsafe() that you already received through Ingo via the x86-dax- for-linus pull. Not included here, but still targeting this cycle, is support for handling memory media errors (poison) consumed via userspace dax mappings. Summary: - DAX broke a fundamental assumption of truncate of file mapped pages. The truncate path assumed that it is safe to disconnect a pinned page from a file and let the filesystem reclaim the physical block. With DAX the page is equivalent to the filesystem block. Introduce dax_layout_busy_page() to enable filesystems to wait for pinned DAX pages to be released. Without this wait a filesystem could allocate blocks under active device-DMA to a new file. - DAX arranges for the block layer to be bypassed and uses dax_direct_access() + copy_to_iter() to satisfy read(2) calls. However, the memcpy_mcsafe() facility is available through the pmem block driver. In order to safely handle media errors, via the DAX block-layer bypass, introduce copy_to_iter_mcsafe(). - Fix cache management policy relative to the ACPI NFIT Platform Capabilities Structure to properly elide cache flushes when they are not necessary. The table indicates whether CPU caches are power-fail protected. Clarify that a deep flush is always performed on REQ_{FUA,PREFLUSH} requests" * tag 'libnvdimm-for-4.18' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (21 commits) dax: Use dax_write_cache* helpers libnvdimm, pmem: Do not flush power-fail protected CPU caches libnvdimm, pmem: Unconditionally deep flush on *sync libnvdimm, pmem: Complete REQ_FLUSH => REQ_PREFLUSH acpi, nfit: Remove ecc_unit_size dax: dax_insert_mapping_entry always succeeds libnvdimm, e820: Register all pmem resources libnvdimm: Debug probe times linvdimm, pmem: Preserve read-only setting for pmem devices x86, nfit_test: Add unit test for memcpy_mcsafe() pmem: Switch to copy_to_iter_mcsafe() dax: Report bytes remaining in dax_iomap_actor() dax: Introduce a ->copy_to_iter dax operation uio, lib: Fix CONFIG_ARCH_HAS_UACCESS_MCSAFE compilation xfs, dax: introduce xfs_break_dax_layouts() xfs: prepare xfs_break_layouts() for another layout type xfs: prepare xfs_break_layouts() to be called with XFS_MMAPLOCK_EXCL mm, fs, dax: handle layout changes to pinned dax mappings mm: fix __gup_device_huge vs unmap mm: introduce MEMORY_DEVICE_FS_DAX and CONFIG_DEV_PAGEMAP_OPS ...
2018-06-08Merge branch 'for-4.18/dax' into libnvdimm-for-nextDan Williams
2018-06-08bpf: implement dummy fops for bpf objectsDaniel Borkmann
syzkaller was able to trigger the following warning in do_dentry_open(): WARNING: CPU: 1 PID: 4508 at fs/open.c:778 do_dentry_open+0x4ad/0xe40 fs/open.c:778 Kernel panic - not syncing: panic_on_warn set ... CPU: 1 PID: 4508 Comm: syz-executor867 Not tainted 4.17.0+ #90 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: [...] vfs_open+0x139/0x230 fs/open.c:908 do_last fs/namei.c:3370 [inline] path_openat+0x1717/0x4dc0 fs/namei.c:3511 do_filp_open+0x249/0x350 fs/namei.c:3545 do_sys_open+0x56f/0x740 fs/open.c:1101 __do_sys_openat fs/open.c:1128 [inline] __se_sys_openat fs/open.c:1122 [inline] __x64_sys_openat+0x9d/0x100 fs/open.c:1122 do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287 entry_SYSCALL_64_after_hwframe+0x49/0xbe Problem was that prog and map inodes in bpf fs did not implement a dummy file open operation that would return an error. The patch in do_dentry_open() checks whether f_ops are present and if not bails out with an error. While this may be fine, we really shouldn't be throwing a warning though. Thus follow the model similar to bad_file_ops and reject the request unconditionally with -EIO. Fixes: b2197755b263 ("bpf: add support for persistent maps/progs") Reported-by: syzbot+2e7fcab0f56fdbb330b8@syzkaller.appspotmail.com Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-06-08gcov: remove CONFIG_GCOV_FORMAT_AUTODETECTMasahiro Yamada
CONFIG_GCOV_FORMAT_AUTODETECT compiles either gcc_3_4.c or gcc_4_7.c according to your GCC version. We can achieve the equivalent behavior by setting reasonable dependency with the knowledge of the compiler version. If GCC older than 4.7 is used, GCOV_FORMAT_3_4 is the default, but users are still allowed to select GCOV_FORMAT_4_7 in case the newer format is back-ported. On the other hand, If GCC 4.7 or newer is used, there is no reason to use GCOV_FORMAT_3_4, so it should be hidden. If you downgrade the compiler to GCC 4.7 or older, oldconfig/syncconfig will display a prompt for the choice because GCOV_FORMAT_3_4 becomes visible as a new symbol. Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com> Acked-by: Peter Oberparleiter <oberpar@linux.vnet.ibm.com> Reviewed-by: Kees Cook <keescook@chromium.org>
2018-06-07kernel/hung_task.c: show all hung tasks before panicTetsuo Handa
When we get a hung task it can often be valuable to see _all_ the hung tasks on the system before calling panic(). Quoting from https://syzkaller.appspot.com/text?tag=CrashReport&id=5316056503549952 ---------------------------------------- INFO: task syz-executor0:6540 blocked for more than 120 seconds. Not tainted 4.16.0+ #13 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. syz-executor0 D23560 6540 4521 0x80000004 Call Trace: context_switch kernel/sched/core.c:2848 [inline] __schedule+0x8fb/0x1ef0 kernel/sched/core.c:3490 schedule+0xf5/0x430 kernel/sched/core.c:3549 schedule_preempt_disabled+0x10/0x20 kernel/sched/core.c:3607 __mutex_lock_common kernel/locking/mutex.c:833 [inline] __mutex_lock+0xb7f/0x1810 kernel/locking/mutex.c:893 mutex_lock_nested+0x16/0x20 kernel/locking/mutex.c:908 lo_ioctl+0x8b/0x1b70 drivers/block/loop.c:1355 __blkdev_driver_ioctl block/ioctl.c:303 [inline] blkdev_ioctl+0x1759/0x1e00 block/ioctl.c:601 ioctl_by_bdev+0xa5/0x110 fs/block_dev.c:2060 isofs_get_last_session fs/isofs/inode.c:567 [inline] isofs_fill_super+0x2ba9/0x3bc0 fs/isofs/inode.c:660 mount_bdev+0x2b7/0x370 fs/super.c:1119 isofs_mount+0x34/0x40 fs/isofs/inode.c:1560 mount_fs+0x66/0x2d0 fs/super.c:1222 vfs_kern_mount.part.26+0xc6/0x4a0 fs/namespace.c:1037 vfs_kern_mount fs/namespace.c:2514 [inline] do_new_mount fs/namespace.c:2517 [inline] do_mount+0xea4/0x2b90 fs/namespace.c:2847 ksys_mount+0xab/0x120 fs/namespace.c:3063 SYSC_mount fs/namespace.c:3077 [inline] SyS_mount+0x39/0x50 fs/namespace.c:3074 do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287 entry_SYSCALL_64_after_hwframe+0x42/0xb7 (...snipped...) Showing all locks held in the system: (...snipped...) 2 locks held by syz-executor0/6540: #0: 00000000566d4c39 (&type->s_umount_key#49/1){+.+.}, at: alloc_super fs/super.c:211 [inline] #0: 00000000566d4c39 (&type->s_umount_key#49/1){+.+.}, at: sget_userns+0x3b2/0xe60 fs/super.c:502 /* down_write_nested(&s->s_umount, SINGLE_DEPTH_NESTING); */ #1: 0000000043ca8836 (&lo->lo_ctl_mutex/1){+.+.}, at: lo_ioctl+0x8b/0x1b70 drivers/block/loop.c:1355 /* mutex_lock_nested(&lo->lo_ctl_mutex, 1); */ (...snipped...) 3 locks held by syz-executor7/6541: #0: 0000000043ca8836 (&lo->lo_ctl_mutex/1){+.+.}, at: lo_ioctl+0x8b/0x1b70 drivers/block/loop.c:1355 /* mutex_lock_nested(&lo->lo_ctl_mutex, 1); */ #1: 000000007bf3d3f9 (&bdev->bd_mutex){+.+.}, at: blkdev_reread_part+0x1e/0x40 block/ioctl.c:192 #2: 00000000566d4c39 (&type->s_umount_key#50){.+.+}, at: __get_super.part.10+0x1d3/0x280 fs/super.c:663 /* down_read(&sb->s_umount); */ ---------------------------------------- When reporting an AB-BA deadlock like shown above, it would be nice if trace of PID=6541 is printed as well as trace of PID=6540 before calling panic(). Showing hung tasks up to /proc/sys/kernel/hung_task_warnings could delay calling panic() but normally there should not be so many hung tasks. Link: http://lkml.kernel.org/r/201804050705.BHE57833.HVFOFtSOMQJFOL@I-love.SAKURA.ne.jp Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Acked-by: Dmitry Vyukov <dvyukov@google.com> Cc: Vegard Nossum <vegard.nossum@oracle.com> Cc: Mandeep Singh Baines <msb@chromium.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-07mm: split page_type out from _mapcountMatthew Wilcox
We're already using a union of many fields here, so stop abusing the _mapcount and make page_type its own field. That implies renaming some of the machinery that creates PageBuddy, PageBalloon and PageKmemcg; bring back the PG_buddy, PG_balloon and PG_kmemcg names. As suggested by Kirill, make page_type a bitmask. Because it starts out life as -1 (thanks to sharing the storage with _mapcount), setting a page flag means clearing the appropriate bit. This gives us space for probably twenty or so extra bits (depending how paranoid we want to be about _mapcount underflow). Link: http://lkml.kernel.org/r/20180518194519.3820-3-willy@infradead.org Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-07mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_structYang Shi
mmap_sem is on the hot path of kernel, and it very contended, but it is abused too. It is used to protect arg_start|end and evn_start|end when reading /proc/$PID/cmdline and /proc/$PID/environ, but it doesn't make sense since those proc files just expect to read 4 values atomically and not related to VM, they could be set to arbitrary values by C/R. And, the mmap_sem contention may cause unexpected issue like below: INFO: task ps:14018 blocked for more than 120 seconds. Tainted: G E 4.9.79-009.ali3000.alios7.x86_64 #1 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. ps D 0 14018 1 0x00000004 Call Trace: schedule+0x36/0x80 rwsem_down_read_failed+0xf0/0x150 call_rwsem_down_read_failed+0x18/0x30 down_read+0x20/0x40 proc_pid_cmdline_read+0xd9/0x4e0 __vfs_read+0x37/0x150 vfs_read+0x96/0x130 SyS_read+0x55/0xc0 entry_SYSCALL_64_fastpath+0x1a/0xc5 Both Alexey Dobriyan and Michal Hocko suggested to use dedicated lock for them to mitigate the abuse of mmap_sem. So, introduce a new spinlock in mm_struct to protect the concurrent access to arg_start|end, env_start|end and others, as well as replace write map_sem to read to protect the race condition between prctl and sys_brk which might break check_data_rlimit(), and makes prctl more friendly to other VM operations. This patch just eliminates the abuse of mmap_sem, but it can't resolve the above hung task warning completely since the later access_remote_vm() call needs acquire mmap_sem. The mmap_sem scalability issue will be solved in the future. [yang.shi@linux.alibaba.com: add comment about mmap_sem and arg_lock] Link: http://lkml.kernel.org/r/1524077799-80690-1-git-send-email-yang.shi@linux.alibaba.com Link: http://lkml.kernel.org/r/1523730291-109696-1-git-send-email-yang.shi@linux.alibaba.com Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mateusz Guzik <mguzik@redhat.com> Cc: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-07Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfDavid S. Miller
Daniel Borkmann says: ==================== pull-request: bpf 2018-06-08 The following pull-request contains BPF updates for your *net* tree. The main changes are: 1) Fix in the BPF verifier to reject modified ctx pointers on helper functions, from Daniel. 2) Fix in BPF kselftests for get_cgroup_id_user() helper to only record the cgroup id for a provided pid in order to reduce test failures from processes interferring with the test, from Yonghong. 3) Fix a crash in AF_XDP's mem accounting when the process owning the sock has CAP_IPC_LOCK capabilities set, from Daniel. 4) Fix an issue for AF_XDP on 32 bit machines where XDP_UMEM_PGOFF_*_RING defines need ULL suffixes and use loff_t type as they are otherwise truncated, from Geert. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-07umh: fix race conditionAlexei Starovoitov
kasan reported use-after-free: BUG: KASAN: use-after-free in call_usermodehelper_exec_work+0x2d3/0x310 kernel/umh.c:195 Write of size 4 at addr ffff8801d9202370 by task kworker/u4:2/50 Workqueue: events_unbound call_usermodehelper_exec_work Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x1b9/0x294 lib/dump_stack.c:113 print_address_description+0x6c/0x20b mm/kasan/report.c:256 kasan_report_error mm/kasan/report.c:354 [inline] kasan_report.cold.7+0x242/0x2fe mm/kasan/report.c:412 __asan_report_store4_noabort+0x17/0x20 mm/kasan/report.c:437 call_usermodehelper_exec_work+0x2d3/0x310 kernel/umh.c:195 process_one_work+0xc1e/0x1b50 kernel/workqueue.c:2145 worker_thread+0x1cc/0x1440 kernel/workqueue.c:2279 kthread+0x345/0x410 kernel/kthread.c:240 ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:412 The reason is that 'sub_info' cannot be accessed out of parent task context, since it will be freed by the child. Instead remember the pid in the child task. Fixes: 449325b52b7a ("umh: introduce fork_usermode_blob() helper") Reported-by: syzbot+2c73319c406f1987d156@syzkaller.appspotmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-07bpf: reject passing modified ctx to helper functionsDaniel Borkmann
As commit 28e33f9d78ee ("bpf: disallow arithmetic operations on context pointer") already describes, f1174f77b50c ("bpf/verifier: rework value tracking") removed the specific white-listed cases we had previously where we would allow for pointer arithmetic in order to further generalize it, and allow e.g. context access via modified registers. While the dereferencing of modified context pointers had been forbidden through 28e33f9d78ee, syzkaller did recently manage to trigger several KASAN splats for slab out of bounds access and use after frees by simply passing a modified context pointer to a helper function which would then do the bad access since verifier allowed it in adjust_ptr_min_max_vals(). Rejecting arithmetic on ctx pointer in adjust_ptr_min_max_vals() generally could break existing programs as there's a valid use case in tracing in combination with passing the ctx to helpers as bpf_probe_read(), where the register then becomes unknown at verification time due to adding a non-constant offset to it. An access sequence may look like the following: offset = args->filename; /* field __data_loc filename */ bpf_probe_read(&dst, len, (char *)args + offset); // args is ctx There are two options: i) we could special case the ctx and as soon as we add a constant or bounded offset to it (hence ctx type wouldn't change) we could turn the ctx into an unknown scalar, or ii) we generalize the sanity test for ctx member access into a small helper and assert it on the ctx register that was passed as a function argument. Fwiw, latter is more obvious and less complex at the same time, and one case that may potentially be legitimate in future for ctx member access at least would be for ctx to carry a const offset. Therefore, fix follows approach from ii) and adds test cases to BPF kselftests. Fixes: f1174f77b50c ("bpf/verifier: rework value tracking") Reported-by: syzbot+3d0b2441dbb71751615e@syzkaller.appspotmail.com Reported-by: syzbot+c8504affd4fdd0c1b626@syzkaller.appspotmail.com Reported-by: syzbot+e5190cb881d8660fb1a3@syzkaller.appspotmail.com Reported-by: syzbot+efae31b384d5badbd620@syzkaller.appspotmail.com Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Yonghong Song <yhs@fb.com> Acked-by: Edward Cree <ecree@solarflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-06-07Merge tag 'powerpc-4.18-1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc updates from Michael Ellerman: "Notable changes: - Support for split PMD page table lock on 64-bit Book3S (Power8/9). - Add support for HAVE_RELIABLE_STACKTRACE, so we properly support live patching again. - Add support for patching barrier_nospec in copy_from_user() and syscall entry. - A couple of fixes for our data breakpoints on Book3S. - A series from Nick optimising TLB/mm handling with the Radix MMU. - Numerous small cleanups to squash sparse/gcc warnings from Mathieu Malaterre. - Several series optimising various parts of the 32-bit code from Christophe Leroy. - Removal of support for two old machines, "SBC834xE" and "C2K" ("GEFanuc,C2K"), which is why the diffstat has so many deletions. And many other small improvements & fixes. There's a few out-of-area changes. Some minor ftrace changes OK'ed by Steve, and a fix to our powernv cpuidle driver. Then there's a series touching mm, x86 and fs/proc/task_mmu.c, which cleans up some details around pkey support. It was ack'ed/reviewed by Ingo & Dave and has been in next for several weeks. Thanks to: Akshay Adiga, Alastair D'Silva, Alexey Kardashevskiy, Al Viro, Andrew Donnellan, Aneesh Kumar K.V, Anju T Sudhakar, Arnd Bergmann, Balbir Singh, Cédric Le Goater, Christophe Leroy, Christophe Lombard, Colin Ian King, Dave Hansen, Fabio Estevam, Finn Thain, Frederic Barrat, Gautham R. Shenoy, Haren Myneni, Hari Bathini, Ingo Molnar, Jonathan Neuschäfer, Josh Poimboeuf, Kamalesh Babulal, Madhavan Srinivasan, Mahesh Salgaonkar, Mark Greer, Mathieu Malaterre, Matthew Wilcox, Michael Neuling, Michal Suchanek, Naveen N. Rao, Nicholas Piggin, Nicolai Stange, Olof Johansson, Paul Gortmaker, Paul Mackerras, Peter Rosin, Pridhiviraj Paidipeddi, Ram Pai, Rashmica Gupta, Ravi Bangoria, Russell Currey, Sam Bobroff, Samuel Mendoza-Jonas, Segher Boessenkool, Shilpasri G Bhat, Simon Guo, Souptick Joarder, Stewart Smith, Thiago Jung Bauermann, Torsten Duwe, Vaibhav Jain, Wei Yongjun, Wolfram Sang, Yisheng Xie, YueHaibing" * tag 'powerpc-4.18-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (251 commits) powerpc/64s/radix: Fix missing ptesync in flush_cache_vmap cpuidle: powernv: Fix promotion from snooze if next state disabled powerpc: fix build failure by disabling attribute-alias warning in pci_32 ocxl: Fix missing unlock on error in afu_ioctl_enable_p9_wait() powerpc-opal: fix spelling mistake "Uniterrupted" -> "Uninterrupted" powerpc: fix spelling mistake: "Usupported" -> "Unsupported" powerpc/pkeys: Detach execute_only key on !PROT_EXEC powerpc/powernv: copy/paste - Mask SO bit in CR powerpc: Remove core support for Marvell mv64x60 hostbridges powerpc/boot: Remove core support for Marvell mv64x60 hostbridges powerpc/boot: Remove support for Marvell mv64x60 i2c controller powerpc/boot: Remove support for Marvell MPSC serial controller powerpc/embedded6xx: Remove C2K board support powerpc/lib: optimise PPC32 memcmp powerpc/lib: optimise 32 bits __clear_user() powerpc/time: inline arch_vtime_task_switch() powerpc/Makefile: set -mcpu=860 flag for the 8xx powerpc: Implement csum_ipv6_magic in assembly powerpc/32: Optimise __csum_partial() powerpc/lib: Adjust .balign inside string functions for PPC32 ...
2018-06-07Merge tag 'fuse-update-4.18' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse Pull fuse updates from Miklos Szeredi: "The most interesting part of this update is user namespace support, mostly done by Eric Biederman. This enables safe unprivileged fuse mounts within a user namespace. There are also a couple of fixes for bugs found by syzbot and miscellaneous fixes and cleanups" * tag 'fuse-update-4.18' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: fuse: don't keep dead fuse_conn at fuse_fill_super(). fuse: fix control dir setup and teardown fuse: fix congested state leak on aborted connections fuse: Allow fully unprivileged mounts fuse: Ensure posix acls are translated outside of init_user_ns fuse: add writeback documentation fuse: honor AT_STATX_FORCE_SYNC fuse: honor AT_STATX_DONT_SYNC fuse: Restrict allow_other to the superblock's namespace or a descendant fuse: Support fuse filesystems outside of init_user_ns fuse: Fail all requests with invalid uids or gids fuse: Remove the buggy retranslation of pids in fuse_dev_do_read fuse: return -ECONNABORTED on /dev/fuse read after abort fuse: atomic_o_trunc should truncate pagecache
2018-06-06Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-nextLinus Torvalds
Pull networking updates from David Miller: 1) Add Maglev hashing scheduler to IPVS, from Inju Song. 2) Lots of new TC subsystem tests from Roman Mashak. 3) Add TCP zero copy receive and fix delayed acks and autotuning with SO_RCVLOWAT, from Eric Dumazet. 4) Add XDP_REDIRECT support to mlx5 driver, from Jesper Dangaard Brouer. 5) Add ttl inherit support to vxlan, from Hangbin Liu. 6) Properly separate ipv6 routes into their logically independant components. fib6_info for the routing table, and fib6_nh for sets of nexthops, which thus can be shared. From David Ahern. 7) Add bpf_xdp_adjust_tail helper, which can be used to generate ICMP messages from XDP programs. From Nikita V. Shirokov. 8) Lots of long overdue cleanups to the r8169 driver, from Heiner Kallweit. 9) Add BTF ("BPF Type Format"), from Martin KaFai Lau. 10) Add traffic condition monitoring to iwlwifi, from Luca Coelho. 11) Plumb extack down into fib_rules, from Roopa Prabhu. 12) Add Flower classifier offload support to igb, from Vinicius Costa Gomes. 13) Add UDP GSO support, from Willem de Bruijn. 14) Add documentation for eBPF helpers, from Quentin Monnet. 15) Add TLS tx offload to mlx5, from Ilya Lesokhin. 16) Allow applications to be given the number of bytes available to read on a socket via a control message returned from recvmsg(), from Soheil Hassas Yeganeh. 17) Add x86_32 eBPF JIT compiler, from Wang YanQing. 18) Add AF_XDP sockets, with zerocopy support infrastructure as well. From Björn Töpel. 19) Remove indirect load support from all of the BPF JITs and handle these operations in the verifier by translating them into native BPF instead. From Daniel Borkmann. 20) Add GRO support to ipv6 gre tunnels, from Eran Ben Elisha. 21) Allow XDP programs to do lookups in the main kernel routing tables for forwarding. From David Ahern. 22) Allow drivers to store hardware state into an ELF section of kernel dump vmcore files, and use it in cxgb4. From Rahul Lakkireddy. 23) Various RACK and loss detection improvements in TCP, from Yuchung Cheng. 24) Add TCP SACK compression, from Eric Dumazet. 25) Add User Mode Helper support and basic bpfilter infrastructure, from Alexei Starovoitov. 26) Support ports and protocol values in RTM_GETROUTE, from Roopa Prabhu. 27) Support bulking in ->ndo_xdp_xmit() API, from Jesper Dangaard Brouer. 28) Add lots of forwarding selftests, from Petr Machata. 29) Add generic network device failover driver, from Sridhar Samudrala. * ra.kernel.org:/pub/scm/linux/kernel/git/davem/net-next: (1959 commits) strparser: Add __strp_unpause and use it in ktls. rxrpc: Fix terminal retransmission connection ID to include the channel net: hns3: Optimize PF CMDQ interrupt switching process net: hns3: Fix for VF mailbox receiving unknown message net: hns3: Fix for VF mailbox cannot receiving PF response bnx2x: use the right constant Revert "net: sched: cls: Fix offloading when ingress dev is vxlan" net: dsa: b53: Fix for brcm tag issue in Cygnus SoC enic: fix UDP rss bits netdev-FAQ: clarify DaveM's position for stable backports rtnetlink: validate attributes in do_setlink() mlxsw: Add extack messages for port_{un, }split failures netdevsim: Add extack error message for devlink reload devlink: Add extack to reload and port_{un, }split operations net: metrics: add proper netlink validation ipmr: fix error path when ipmr_new_table fails ip6mr: only set ip6mr_table from setsockopt when ip6mr_new_table succeeds net: hns3: remove unused hclgevf_cfg_func_mta_filter netfilter: provide udp*_lib_lookup for nf_tproxy qed*: Utilize FW 8.37.2.0 ...
2018-06-06Merge tag 'overflow-v4.18-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull overflow updates from Kees Cook: "This adds the new overflow checking helpers and adds them to the 2-factor argument allocators. And this adds the saturating size helpers and does a treewide replacement for the struct_size() usage. Additionally this adds the overflow testing modules to make sure everything works. I'm still working on the treewide replacements for allocators with "simple" multiplied arguments: *alloc(a * b, ...) -> *alloc_array(a, b, ...) and *zalloc(a * b, ...) -> *calloc(a, b, ...) as well as the more complex cases, but that's separable from this portion of the series. I expect to have the rest sent before -rc1 closes; there are a lot of messy cases to clean up. Summary: - Introduce arithmetic overflow test helper functions (Rasmus) - Use overflow helpers in 2-factor allocators (Kees, Rasmus) - Introduce overflow test module (Rasmus, Kees) - Introduce saturating size helper functions (Matthew, Kees) - Treewide use of struct_size() for allocators (Kees)" * tag 'overflow-v4.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: treewide: Use struct_size() for devm_kmalloc() and friends treewide: Use struct_size() for vmalloc()-family treewide: Use struct_size() for kmalloc()-family device: Use overflow helpers for devm_kmalloc() mm: Use overflow helpers in kvmalloc() mm: Use overflow helpers in kmalloc_array*() test_overflow: Add memory allocation overflow tests overflow.h: Add allocation size calculation helpers test_overflow: Report test failures test_overflow: macrofy some more, do more tests for free lib: add runtime test of check_*_overflow functions compiler.h: enable builtin overflow checkers and add fallback code
2018-06-06Merge tag 'trace-v4.18' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing updates from Steven Rostedt: "One new feature was added to ftrace, which is the trace_marker now supports triggers. For example: # cd /sys/kernel/debug/tracing # echo 'snapshot' > events/ftrace/print/trigger # echo 'cause snapshot' > trace_marker The rest of the changes are various clean ups and also one stable fix that was added late in the cycle" * tag 'trace-v4.18' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (21 commits) tracing: Use match_string() instead of open coding it in trace_set_options() branch-check: fix long->int truncation when profiling branches ring-buffer: Fix typo in comment ring-buffer: Fix a bunch of typos in comments tracing/selftest: Add test to test simple snapshot trigger for trace_marker tracing/selftest: Add test to test hist trigger between kernel event and trace_marker tracing/selftest: Add selftests to test trace_marker histogram triggers ftrace/selftest: Fix reset_trigger() to handle triggers with filters ftrace/selftest: Have the reset_trigger code be a bit more careful tracing: Document trace_marker triggers tracing: Allow histogram triggers to access ftrace internal events tracing: Prevent further users of zero size static arrays in trace events tracing: Have zero size length in filter logic be full string tracing: Add trigger file for trace_markers tracefs/ftrace/print tracing: Do not show filter file for ftrace internal events tracing: Add brackets in ftrace event dynamic arrays tracing: Have event_trace_init() called by trace_init_tracefs() tracing: Add __find_event_file() to find event files without restrictions tracing: Do not reference event data in post call triggers tracepoints: Fix the descriptions of tracepoint_probe_register{_prio} ...
2018-06-06Merge tag 'audit-pr-20180605' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit Pull audit updates from Paul Moore: "Another reasonable chunk of audit changes for v4.18, thirteen patches in total. The thirteen patches can mostly be broken down into one of four categories: general bug fixes, accessor functions for audit state stored in the task_struct, negative filter matches on executable names, and extending the (relatively) new seccomp logging knobs to the audit subsystem. The main driver for the accessor functions from Richard are the changes we're working on to associate audit events with containers, but I think they have some standalone value too so I figured it would be good to get them in now. The seccomp/audit patches from Tyler apply the seccomp logging improvements from a few releases ago to audit's seccomp logging; starting with this patchset the changes in /proc/sys/kernel/seccomp/actions_logged should apply to both the standard kernel logging and audit. As usual, everything passes the audit-testsuite and it happens to merge cleanly with your tree" [ Heh, except it had trivial merge conflicts with the SELinux tree that also came in from Paul - Linus ] * tag 'audit-pr-20180605' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit: audit: Fix wrong task in comparison of session ID audit: use existing session info function audit: normalize loginuid read access audit: use new audit_context access funciton for seccomp_actions_logged audit: use inline function to set audit context audit: use inline function to get audit context audit: convert sessionid unset to a macro seccomp: Don't special case audited processes when logging seccomp: Audit attempts to modify the actions_logged sysctl seccomp: Configurable separator for the actions_logged string seccomp: Separate read and write code for actions_logged sysctl audit: allow not equal op for audit by executable audit: add syscall information to FEATURE_CHANGE records
2018-06-06Merge tag 'printk-for-4.18' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk Pull printk updates from Petr Mladek: - Help userspace log daemons to catch up with a flood of messages. They will get woken after each message even if the console is far behind and handled by another process. - Flush printk safe buffers safely even when panic() happens in the normal context. - Fix possible va_list reuse when race happened in printk_safe(). - Remove %pCr printf format to prevent sleeping in the atomic context. - Misc vsprintf code cleanup. * tag 'printk-for-4.18' of git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk: printk: drop in_nmi check from printk_safe_flush_on_panic() lib/vsprintf: Remove atomic-unsafe support for %pCr serial: sh-sci: Stop using printk format %pCr thermal: bcm2835: Stop using printk format %pCr clk: renesas: cpg-mssr: Stop using printk format %pCr printk: fix possible reuse of va_list variable printk: wake up klogd in vprintk_emit vsprintf: Tweak pF/pf comment lib/vsprintf: Mark expected switch fall-through lib/vsprintf: Replace space with '_' before crng is ready lib/vsprintf: Deduplicate pointer_string() lib/vsprintf: Move pointer_string() upper lib/vsprintf: Make flag_spec global lib/vsprintf: Make strspec global lib/vsprintf: Make dec_spec global lib/test_printf: Mark big constant with UL
2018-06-06treewide: Use struct_size() for kmalloc()-familyKees Cook
One of the more common cases of allocation size calculations is finding the size of a structure that has a zero-sized array at the end, along with memory for some number of elements for that array. For example: struct foo { int stuff; void *entry[]; }; instance = kmalloc(sizeof(struct foo) + sizeof(void *) * count, GFP_KERNEL); Instead of leaving these open-coded and prone to type mistakes, we can now use the new struct_size() helper: instance = kmalloc(struct_size(instance, entry, count), GFP_KERNEL); This patch makes the changes for kmalloc()-family (and kvmalloc()-family) uses. It was done via automatic conversion with manual review for the "CHECKME" non-standard cases noted below, using the following Coccinelle script: // pkey_cache = kmalloc(sizeof *pkey_cache + tprops->pkey_tbl_len * // sizeof *pkey_cache->table, GFP_KERNEL); @@ identifier alloc =~ "kmalloc|kzalloc|kvmalloc|kvzalloc"; expression GFP; identifier VAR, ELEMENT; expression COUNT; @@ - alloc(sizeof(*VAR) + COUNT * sizeof(*VAR->ELEMENT), GFP) + alloc(struct_size(VAR, ELEMENT, COUNT), GFP) // mr = kzalloc(sizeof(*mr) + m * sizeof(mr->map[0]), GFP_KERNEL); @@ identifier alloc =~ "kmalloc|kzalloc|kvmalloc|kvzalloc"; expression GFP; identifier VAR, ELEMENT; expression COUNT; @@ - alloc(sizeof(*VAR) + COUNT * sizeof(VAR->ELEMENT[0]), GFP) + alloc(struct_size(VAR, ELEMENT, COUNT), GFP) // Same pattern, but can't trivially locate the trailing element name, // or variable name. @@ identifier alloc =~ "kmalloc|kzalloc|kvmalloc|kvzalloc"; expression GFP; expression SOMETHING, COUNT, ELEMENT; @@ - alloc(sizeof(SOMETHING) + COUNT * sizeof(ELEMENT), GFP) + alloc(CHECKME_struct_size(&SOMETHING, ELEMENT, COUNT), GFP) Signed-off-by: Kees Cook <keescook@chromium.org>
2018-06-06genirq/affinity: Defer affinity setting if irq chip is busyThomas Gleixner
The case that interrupt affinity setting fails with -EBUSY can be handled in the kernel completely by using the already available generic pending infrastructure. If a irq_chip::set_affinity() fails with -EBUSY, handle it like the interrupts for which irq_chip::set_affinity() can only be invoked from interrupt context. Copy the new affinity mask to irq_desc::pending_mask and set the affinity pending bit. The next raised interrupt for the affected irq will check the pending bit and try to set the new affinity from the handler. This avoids that -EBUSY is returned when an affinity change is requested from user space and the previous change has not been cleaned up. The new affinity will take effect when the next interrupt is raised from the device. Fixes: dccfe3147b42 ("x86/vector: Simplify vector move cleanup") Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Song Liu <songliubraving@fb.com> Cc: Joerg Roedel <jroedel@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Song Liu <liu.song.a23@gmail.com> Cc: Dmitry Safonov <0x7f454c46@gmail.com> Cc: stable@vger.kernel.org Cc: Mike Travis <mike.travis@hpe.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Tariq Toukan <tariqt@mellanox.com> Link: https://lkml.kernel.org/r/20180604162224.819273597@linutronix.de
2018-06-06genirq/migration: Avoid out of line call if pending is not setThomas Gleixner
The upcoming fix for the -EBUSY return from affinity settings requires to use the irq_move_irq() functionality even on irq remapped interrupts. To avoid the out of line call, move the check for the pending bit into an inline helper. Preparatory change for the real fix. No functional change. Fixes: dccfe3147b42 ("x86/vector: Simplify vector move cleanup") Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Joerg Roedel <jroedel@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Song Liu <liu.song.a23@gmail.com> Cc: Dmitry Safonov <0x7f454c46@gmail.com> Cc: stable@vger.kernel.org Cc: Mike Travis <mike.travis@hpe.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Tariq Toukan <tariqt@mellanox.com> Cc: Dou Liyang <douly.fnst@cn.fujitsu.com> Link: https://lkml.kernel.org/r/20180604162224.471925894@linutronix.de
2018-06-06genirq/generic_pending: Do not lose pending affinity updateThomas Gleixner
The generic pending interrupt mechanism moves interrupts from the interrupt handler on the original target CPU to the new destination CPU. This is required for x86 and ia64 due to the way the interrupt delivery and acknowledge works if the interrupts are not remapped. However that update can fail for various reasons. Some of them are valid reasons to discard the pending update, but the case, when the previous move has not been fully cleaned up is not a legit reason to fail. Check the return value of irq_do_set_affinity() for -EBUSY, which indicates a pending cleanup, and rearm the pending move in the irq dexcriptor so it's tried again when the next interrupt arrives. Fixes: 996c591227d9 ("x86/irq: Plug vector cleanup race") Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Song Liu <songliubraving@fb.com> Cc: Joerg Roedel <jroedel@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Song Liu <liu.song.a23@gmail.com> Cc: Dmitry Safonov <0x7f454c46@gmail.com> Cc: stable@vger.kernel.org Cc: Mike Travis <mike.travis@hpe.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Tariq Toukan <tariqt@mellanox.com> Link: https://lkml.kernel.org/r/20180604162224.386544292@linutronix.de
2018-06-06rseq: Introduce restartable sequences system callMathieu Desnoyers
Expose a new system call allowing each thread to register one userspace memory area to be used as an ABI between kernel and user-space for two purposes: user-space restartable sequences and quick access to read the current CPU number value from user-space. * Restartable sequences (per-cpu atomics) Restartables sequences allow user-space to perform update operations on per-cpu data without requiring heavy-weight atomic operations. The restartable critical sections (percpu atomics) work has been started by Paul Turner and Andrew Hunter. It lets the kernel handle restart of critical sections. [1] [2] The re-implementation proposed here brings a few simplifications to the ABI which facilitates porting to other architectures and speeds up the user-space fast path. Here are benchmarks of various rseq use-cases. Test hardware: arm32: ARMv7 Processor rev 4 (v7l) "Cubietruck", 2-core x86-64: Intel E5-2630 v3@2.40GHz, 16-core, hyperthreading The following benchmarks were all performed on a single thread. * Per-CPU statistic counter increment getcpu+atomic (ns/op) rseq (ns/op) speedup arm32: 344.0 31.4 11.0 x86-64: 15.3 2.0 7.7 * LTTng-UST: write event 32-bit header, 32-bit payload into tracer per-cpu buffer getcpu+atomic (ns/op) rseq (ns/op) speedup arm32: 2502.0 2250.0 1.1 x86-64: 117.4 98.0 1.2 * liburcu percpu: lock-unlock pair, dereference, read/compare word getcpu+atomic (ns/op) rseq (ns/op) speedup arm32: 751.0 128.5 5.8 x86-64: 53.4 28.6 1.9 * jemalloc memory allocator adapted to use rseq Using rseq with per-cpu memory pools in jemalloc at Facebook (based on rseq 2016 implementation): The production workload response-time has 1-2% gain avg. latency, and the P99 overall latency drops by 2-3%. * Reading the current CPU number Speeding up reading the current CPU number on which the caller thread is running is done by keeping the current CPU number up do date within the cpu_id field of the memory area registered by the thread. This is done by making scheduler preemption set the TIF_NOTIFY_RESUME flag on the current thread. Upon return to user-space, a notify-resume handler updates the current CPU value within the registered user-space memory area. User-space can then read the current CPU number directly from memory. Keeping the current cpu id in a memory area shared between kernel and user-space is an improvement over current mechanisms available to read the current CPU number, which has the following benefits over alternative approaches: - 35x speedup on ARM vs system call through glibc - 20x speedup on x86 compared to calling glibc, which calls vdso executing a "lsl" instruction, - 14x speedup on x86 compared to inlined "lsl" instruction, - Unlike vdso approaches, this cpu_id value can be read from an inline assembly, which makes it a useful building block for restartable sequences. - The approach of reading the cpu id through memory mapping shared between kernel and user-space is portable (e.g. ARM), which is not the case for the lsl-based x86 vdso. On x86, yet another possible approach would be to use the gs segment selector to point to user-space per-cpu data. This approach performs similarly to the cpu id cache, but it has two disadvantages: it is not portable, and it is incompatible with existing applications already using the gs segment selector for other purposes. Benchmarking various approaches for reading the current CPU number: ARMv7 Processor rev 4 (v7l) Machine model: Cubietruck - Baseline (empty loop): 8.4 ns - Read CPU from rseq cpu_id: 16.7 ns - Read CPU from rseq cpu_id (lazy register): 19.8 ns - glibc 2.19-0ubuntu6.6 getcpu: 301.8 ns - getcpu system call: 234.9 ns x86-64 Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz: - Baseline (empty loop): 0.8 ns - Read CPU from rseq cpu_id: 0.8 ns - Read CPU from rseq cpu_id (lazy register): 0.8 ns - Read using gs segment selector: 0.8 ns - "lsl" inline assembly: 13.0 ns - glibc 2.19-0ubuntu6 getcpu: 16.6 ns - getcpu system call: 53.9 ns - Speed (benchmark taken on v8 of patchset) Running 10 runs of hackbench -l 100000 seems to indicate, contrary to expectations, that enabling CONFIG_RSEQ slightly accelerates the scheduler: Configuration: 2 sockets * 8-core Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz (directly on hardware, hyperthreading disabled in BIOS, energy saving disabled in BIOS, turboboost disabled in BIOS, cpuidle.off=1 kernel parameter), with a Linux v4.6 defconfig+localyesconfig, restartable sequences series applied. * CONFIG_RSEQ=n avg.: 41.37 s std.dev.: 0.36 s * CONFIG_RSEQ=y avg.: 40.46 s std.dev.: 0.33 s - Size On x86-64, between CONFIG_RSEQ=n/y, the text size increase of vmlinux is 567 bytes, and the data size increase of vmlinux is 5696 bytes. [1] https://lwn.net/Articles/650333/ [2] http://www.linuxplumbersconf.org/2013/ocw/system/presentations/1695/original/LPC%20-%20PerCpu%20Atomics.pdf Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Joel Fernandes <joelaf@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Watson <davejwatson@fb.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: "H . Peter Anvin" <hpa@zytor.com> Cc: Chris Lameter <cl@linux.com> Cc: Russell King <linux@arm.linux.org.uk> Cc: Andrew Hunter <ahh@google.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com> Cc: Paul Turner <pjt@google.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ben Maurer <bmaurer@fb.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: linux-api@vger.kernel.org Cc: Andy Lutomirski <luto@amacapital.net> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20151027235635.16059.11630.stgit@pjt-glaptop.roam.corp.google.com Link: http://lkml.kernel.org/r/20150624222609.6116.86035.stgit@kitami.mtv.corp.google.com Link: https://lkml.kernel.org/r/20180602124408.8430-3-mathieu.desnoyers@efficios.com
2018-06-05Merge branch 'for-4.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wqLinus Torvalds
Pull workqueue updates from Tejun Heo: - make kworkers report the workqueue it is executing or has executed most recently in /proc/PID/comm (so they show up in ps/top) - CONFIG_SMP shuffle to move stuff which isn't necessary for UP builds inside CONFIG_SMP. * 'for-4.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: move function definitions within CONFIG_SMP block workqueue: Make sure struct worker is accessible for wq_worker_comm() workqueue: Show the latest workqueue name in /proc/PID/{comm,stat,status} proc: Consolidate task->comm formatting into proc_task_name() workqueue: Set worker->desc to workqueue name by default workqueue: Make worker_attach/detach_pool() update worker->pool workqueue: Replace pool->attach_mutex with global wq_pool_attach_mutex
2018-06-05Merge branch 'for-4.18' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: - For cpustat, cgroup has a percpu hierarchical stat mechanism which propagates up the hierarchy lazily. This contains commits to factor out and generalize the mechanism so that it can be used for other cgroup stats too. The original intention was to update memcg stats to use it but memcg went for a different approach, so still the only user is cpustat. The factoring out and generalization still make sense and it's likely that this can be used for other purposes in the future. - cgroup uses kernfs_notify() (which uses fsnotify()) to inform user space of certain events. A rate limiting mechanism is added. - Other misc changes. * 'for-4.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup: css_set_lock should nest inside tasklist_lock rdmacg: Convert to use match_string() helper cgroup: Make cgroup_rstat_updated() ready for root cgroup usage cgroup: Add memory barriers to plug cgroup_rstat_updated() race window cgroup: Add cgroup_subsys->css_rstat_flush() cgroup: Replace cgroup_rstat_mutex with a spinlock cgroup: Factor out and expose cgroup_rstat_*() interface functions cgroup: Reorganize kernel/cgroup/rstat.c cgroup: Distinguish base resource stat implementation from rstat cgroup: Rename stat to rstat cgroup: Rename kernel/cgroup/stat.c to kernel/cgroup/rstat.c cgroup: Limit event generation frequency cgroup: Explicitly remove core interface files
2018-06-05ide: don't enable/disable interrupts in force threaded-IRQ modeSebastian Andrzej Siewior
The interrupts are enabled/disabled so the interrupt handler can run with enabled interrupts while serving the interrupt and not lose other interrupts especially the timer tick. If the system runs with force-threaded interrupts then there is no need to enable the interrupts. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-05tracing: Use match_string() instead of open coding it in trace_set_options()Yisheng Xie
match_string() returns the index of an array for a matching string, which can be used to simplify the code. Link: http://lkml.kernel.org/r/1526546163-4609-1-git-send-email-xieyisheng1@huawei.com Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com> Signed-off-by: Yisheng Xie <xieyisheng1@huawei.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-06-05Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller
Daniel Borkmann says: ==================== pull-request: bpf-next 2018-06-05 The following pull-request contains BPF updates for your *net-next* tree. The main changes are: 1) Add a new BPF hook for sendmsg similar to existing hooks for bind and connect: "This allows to override source IP (including the case when it's set via cmsg(3)) and destination IP:port for unconnected UDP (slow path). TCP and connected UDP (fast path) are not affected. This makes UDP support complete, that is, connected UDP is handled by connect hooks, unconnected by sendmsg ones.", from Andrey. 2) Rework of the AF_XDP API to allow extending it in future for type writer model if necessary. In this mode a memory window is passed to hardware and multiple frames might be filled into that window instead of just one that is the case in the current fixed frame-size model. With the new changes made this can be supported without having to add a new descriptor format. Also, core bits for the zero-copy support for AF_XDP have been merged as agreed upon, where i40e bits will be routed via Jeff later on. Various improvements to documentation and sample programs included as well, all from Björn and Magnus. 3) Given BPF's flexibility, a new program type has been added to implement infrared decoders. Quote: "The kernel IR decoders support the most widely used IR protocols, but there are many protocols which are not supported. [...] There is a 'long tail' of unsupported IR protocols, for which lircd is need to decode the IR. IR encoding is done in such a way that some simple circuit can decode it; therefore, BPF is ideal. [...] user-space can define a decoder in BPF, attach it to the rc device through the lirc chardev.", from Sean. 4) Several improvements and fixes to BPF core, among others, dumping map and prog IDs into fdinfo which is a straight forward way to correlate BPF objects used by applications, removing an indirect call and therefore retpoline in all map lookup/update/delete calls by invoking the callback directly for 64 bit archs, adding a new bpf_skb_cgroup_id() BPF helper for tc BPF programs to have an efficient way of looking up cgroup v2 id for policy or other use cases. Fixes to make sure we zero tunnel/xfrm state that hasn't been filled, to allow context access wrt pt_regs in 32 bit archs for tracing, and last but not least various test cases for fixes that landed in bpf earlier, from Daniel. 5) Get rid of the ndo_xdp_flush API and extend the ndo_xdp_xmit with a XDP_XMIT_FLUSH flag instead which allows to avoid one indirect call as flushing is now merged directly into ndo_xdp_xmit(), from Jesper. 6) Add a new bpf_get_current_cgroup_id() helper that can be used in tracing to retrieve the cgroup id from the current process in order to allow for e.g. aggregation of container-level events, from Yonghong. 7) Two follow-up fixes for BTF to reject invalid input values and related to that also two test cases for BPF kselftests, from Martin. 8) Various API improvements to the bpf_fib_lookup() helper, that is, dropping MPLS bits which are not fully hashed out yet, rejecting invalid helper flags, returning error for unsupported address families as well as renaming flowlabel to flowinfo, from David. 9) Various fixes and improvements to sockmap BPF kselftests in particular in proper error detection and data verification, from Prashant. 10) Two arm32 BPF JIT improvements. One is to fix imm range check with regards to whether immediate fits into 24 bits, and a naming cleanup to get functions related to rsh handling consistent to those handling lsh, from Wang. 11) Two compile warning fixes in BPF, one for BTF and a false positive to silent gcc in stack_map_get_build_id_offset(), from Arnd. 12) Add missing seg6.h header into tools include infrastructure in order to fix compilation of BPF kselftests, from Mathieu. 13) Several formatting cleanups in the BPF UAPI helper description that also fix an error during rst2man compilation, from Quentin. 14) Hide an unused variable in sk_msg_convert_ctx_access() when IPv6 is not built into the kernel, from Yue. 15) Remove a useless double assignment in dev_map_enqueue(), from Colin. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-05Merge tag 'pm-4.18-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management updates from Rafael Wysocki: "These include a significant update of the generic power domains (genpd) and Operating Performance Points (OPP) frameworks, mostly related to the introduction of power domain performance levels, cpufreq updates (new driver for Qualcomm Kryo processors, updates of the existing drivers, some core fixes, schedutil governor improvements), PCI power management fixes, ACPI workaround for EC-based wakeup events handling on resume from suspend-to-idle, and major updates of the turbostat and pm-graph utilities. Specifics: - Introduce power domain performance levels into the the generic power domains (genpd) and Operating Performance Points (OPP) frameworks (Viresh Kumar, Rajendra Nayak, Dan Carpenter). - Fix two issues in the runtime PM framework related to the initialization and removal of devices using device links (Ulf Hansson). - Clean up the initialization of drivers for devices in PM domains (Ulf Hansson, Geert Uytterhoeven). - Fix a cpufreq core issue related to the policy sysfs interface causing CPU online to fail for CPUs sharing one cpufreq policy in some situations (Tao Wang). - Make it possible to use platform-specific suspend/resume hooks in the cpufreq-dt driver and make the Armada 37xx DVFS use that feature (Viresh Kumar, Miquel Raynal). - Optimize policy transition notifications in cpufreq (Viresh Kumar). - Improve the iowait boost mechanism in the schedutil cpufreq governor (Patrick Bellasi). - Improve the handling of deferred frequency updates in the schedutil cpufreq governor (Joel Fernandes, Dietmar Eggemann, Rafael Wysocki, Viresh Kumar). - Add a new cpufreq driver for Qualcomm Kryo (Ilia Lin). - Fix and clean up some cpufreq drivers (Colin Ian King, Dmitry Osipenko, Doug Smythies, Luc Van Oostenryck, Simon Horman, Viresh Kumar). - Fix the handling of PCI devices with the DPM_SMART_SUSPEND flag set and update stale comments in the PCI core PM code (Rafael Wysocki). - Work around an issue related to the handling of EC-based wakeup events in the ACPI PM core during resume from suspend-to-idle if the EC has been put into the low-power mode (Rafael Wysocki). - Improve the handling of wakeup source objects in the PM core (Doug Berger, Mahendran Ganesh, Rafael Wysocki). - Update the driver core to prevent deferred probe from breaking suspend/resume ordering (Feng Kan). - Clean up the PM core somewhat (Bjorn Helgaas, Ulf Hansson, Rafael Wysocki). - Make the core suspend/resume code and cpufreq support the RT patch (Sebastian Andrzej Siewior, Thomas Gleixner). - Consolidate the PM QoS handling in cpuidle governors (Rafael Wysocki). - Fix a possible crash in the hibernation core (Tetsuo Handa). - Update the rockchip-io Adaptive Voltage Scaling (AVS) driver (David Wu). - Update the turbostat utility (fixes, cleanups, new CPU IDs, new command line options, built-in "Low Power Idle" counters support, new POLL and POLL% columns) and add an entry for it to MAINTAINERS (Len Brown, Artem Bityutskiy, Chen Yu, Laura Abbott, Matt Turner, Prarit Bhargava, Srinivas Pandruvada). - Update the pm-graph to version 5.1 (Todd Brandt). - Update the intel_pstate_tracer utility (Doug Smythies)" * tag 'pm-4.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (128 commits) tools/power turbostat: update version number tools/power turbostat: Add Node in output tools/power turbostat: add node information into turbostat calculations tools/power turbostat: remove num_ from cpu_topology struct tools/power turbostat: rename num_cores_per_pkg to num_cores_per_node tools/power turbostat: track thread ID in cpu_topology tools/power turbostat: Calculate additional node information for a package tools/power turbostat: Fix node and siblings lookup data tools/power turbostat: set max_num_cpus equal to the cpumask length tools/power turbostat: if --num_iterations, print for specific number of iterations tools/power turbostat: Add Cannon Lake support tools/power turbostat: delete duplicate #defines x86: msr-index.h: Correct SNB_C1/C3_AUTO_UNDEMOTE defines tools/power turbostat: Correct SNB_C1/C3_AUTO_UNDEMOTE defines tools/power turbostat: add POLL and POLL% column tools/power turbostat: Fix --hide Pk%pc10 tools/power turbostat: Build-in "Low Power Idle" counters support tools/power turbostat: Don't make man pages executable tools/power turbostat: remove blank lines tools/power turbostat: a small C-states dump readability immprovement ...
2018-06-05printk: drop in_nmi check from printk_safe_flush_on_panic()Sergey Senozhatsky
Drop the in_nmi() check from printk_safe_flush_on_panic() and attempt to re-init (IOW unlock) locked logbuf spinlock from panic CPU regardless of its context. Otherwise, theoretically, we can deadlock on logbuf trying to flush per-CPU buffers: a) Panic CPU is running in non-NMI context b) Panic CPU sends out shutdown IPI via reboot vector c) Panic CPU fails to stop all remote CPUs d) Panic CPU sends out shutdown IPI via NMI vector One of the CPUs that we bring down via NMI vector can hold logbuf spin lock (theoretically). Link: http://lkml.kernel.org/r/20180530070350.10131-1-sergey.senozhatsky@gmail.com To: Steven Rostedt <rostedt@goodmis.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: linux-kernel@vger.kernel.org Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Petr Mladek <pmladek@suse.com>
2018-06-04Merge branch 'timers-2038-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull time/Y2038 updates from Thomas Gleixner: - Consolidate SySV IPC UAPI headers - Convert SySV IPC to the new COMPAT_32BIT_TIME mechanism - Cleanup the core interfaces and standardize on the ktime_get_* naming convention. - Convert the X86 platform ops to timespec64 - Remove the ugly temporary timespec64 hack * 'timers-2038-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (22 commits) x86: Convert x86_platform_ops to timespec64 timekeeping: Add more coarse clocktai/boottime interfaces timekeeping: Add ktime_get_coarse_with_offset timekeeping: Standardize on ktime_get_*() naming timekeeping: Clean up ktime_get_real_ts64 timekeeping: Remove timespec64 hack y2038: ipc: Redirect ipc(SEMTIMEDOP, ...) to compat_ksys_semtimedop y2038: ipc: Enable COMPAT_32BIT_TIME y2038: ipc: Use __kernel_timespec y2038: ipc: Report long times to user space y2038: ipc: Use ktime_get_real_seconds consistently y2038: xtensa: Extend sysvipc data structures y2038: powerpc: Extend sysvipc data structures y2038: sparc: Extend sysvipc data structures y2038: parisc: Extend sysvipc data structures y2038: mips: Extend sysvipc data structures y2038: arm64: Extend sysvipc compat data structures y2038: s390: Remove unneeded ipc uapi header files y2038: ia64: Remove unneeded ipc uapi header files y2038: alpha: Remove unneeded ipc uapi header files ...
2018-06-04Merge branch 'timers-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timers and timekeeping updates from Thomas Gleixner: - Core infrastucture work for Y2038 to address the COMPAT interfaces: + Add a new Y2038 safe __kernel_timespec and use it in the core code + Introduce config switches which allow to control the various compat mechanisms + Use the new config switch in the posix timer code to control the 32bit compat syscall implementation. - Prevent bogus selection of CPU local clocksources which causes an endless reselection loop - Remove the extra kthread in the clocksource code which has no value and just adds another level of indirection - The usual bunch of trivial updates, cleanups and fixlets all over the place - More SPDX conversions * 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits) clocksource/drivers/mxs_timer: Switch to SPDX identifier clocksource/drivers/timer-imx-tpm: Switch to SPDX identifier clocksource/drivers/timer-imx-gpt: Switch to SPDX identifier clocksource/drivers/timer-imx-gpt: Remove outdated file path clocksource/drivers/arc_timer: Add comments about locking while read GFRC clocksource/drivers/mips-gic-timer: Add pr_fmt and reword pr_* messages clocksource/drivers/sprd: Fix Kconfig dependency clocksource: Move inline keyword to the beginning of function declarations timer_list: Remove unused function pointer typedef timers: Adjust a kernel-doc comment tick: Prefer a lower rating device only if it's CPU local device clocksource: Remove kthread time: Change nanosleep to safe __kernel_* types time: Change types to new y2038 safe __kernel_* types time: Fix get_timespec64() for y2038 safe compat interfaces time: Add new y2038 safe __kernel_timespec posix-timers: Make compat syscalls depend on CONFIG_COMPAT_32BIT_TIME time: Introduce CONFIG_COMPAT_32BIT_TIME time: Introduce CONFIG_64BIT_TIME in architectures compat: Enable compat_get/put_timespec64 always ...
2018-06-04Merge branch 'irq-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq updates from Thomas Gleixner: - Consolidation of softirq pending: The softirq mask and its accessors/mutators have many implementations scattered around many architectures. Most do the same things consisting in a field in a per-cpu struct (often irq_cpustat_t) accessed through per-cpu ops. We can provide instead a generic efficient version that most of them can use. In fact s390 is the only exception because the field is stored in lowcore. - Support for level!?! triggered MSI (ARM) Over the past couple of years, we've seen some SoCs coming up with ways of signalling level interrupts using a new flavor of MSIs, where the MSI controller uses two distinct messages: one that raises a virtual line, and one that lowers it. The target MSI controller is in charge of maintaining the state of the line. This allows for a much simplified HW signal routing (no need to have hundreds of discrete lines to signal level interrupts if you already have a memory bus), but results in a departure from the current idea the kernel has of MSIs. - Support for Meson-AXG GPIO irqchip - Large stm32 irqchip rework (suspend/resume, hierarchical domains) - More SPDX conversions * 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (36 commits) ARM: dts: stm32: Add exti support to stm32mp157 pinctrl ARM: dts: stm32: Add exti support for stm32mp157c pinctrl/stm32: Add irq_eoi for stm32gpio irqchip irqchip/stm32: Add suspend/resume support for hierarchy domain irqchip/stm32: Add stm32mp1 support with hierarchy domain irqchip/stm32: Prepare common functions irqchip/stm32: Add host and driver data structures irqchip/stm32: Add suspend support irqchip/stm32: Add falling pending register support irqchip/stm32: Checkpatch fix irqchip/stm32: Optimizes and cleans up stm32-exti irq_domain irqchip/meson-gpio: Add support for Meson-AXG SoCs dt-bindings: interrupt-controller: New binding for Meson-AXG SoC dt-bindings: interrupt-controller: Fix the double quotes softirq/s390: Move default mutators of overwritten softirq mask to s390 softirq/x86: Switch to generic local_softirq_pending() implementation softirq/sparc: Switch to generic local_softirq_pending() implementation softirq/powerpc: Switch to generic local_softirq_pending() implementation softirq/parisc: Switch to generic local_softirq_pending() implementation softirq/ia64: Switch to generic local_softirq_pending() implementation ...
2018-06-04Merge branch 'sched-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: - power-aware scheduling improvements (Patrick Bellasi) - NUMA balancing improvements (Mel Gorman) - vCPU scheduling fixes (Rohit Jain) * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/fair: Update util_est before updating schedutil sched/cpufreq: Modify aggregate utilization to always include blocked FAIR utilization sched/deadline/Documentation: Add overrun signal and GRUB-PA documentation sched/core: Distinguish between idle_cpu() calls based on desired effect, introduce available_idle_cpu() sched/wait: Include <linux/wait.h> in <linux/swait.h> sched/numa: Stagger NUMA balancing scan periods for new threads sched/core: Don't schedule threads on pre-empted vCPUs sched/fair: Avoid calling sync_entity_load_avg() unnecessarily sched/fair: Rearrange select_task_rq_fair() to optimize it
2018-06-04Merge branch 'perf-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf updates from Ingo Molnar: "Kernel side changes: - x86 Intel uncore driver cleanups and enhancements (Kan Liang) - group scheduling and other fixes (Song Liu - store frame pointer in the sample traces for better profiling (Alexey Budankov) - compat fixes/enhancements (Eugene Syromiatnikov) Tooling side changes, which you can build and install in a single step via: make -C tools/perf clean install perf annotate: - Support 'perf annotate --group' for non-explicit recorded event "groups", showing multiple columns, one for each event, just like when dealing with explicit event groups (those enclosed with {}) (Jin Yao) - Record min/max LBR cycles (>= Skylake) and add 'perf annotate' TUI hotkey to show it (c) (Jin Yao) perf bpf: - Add infrastructure to help in writing eBPF C programs to be used with '-e name.c' type events in tools such as 'record' and 'trace', with headers for common constructs and an examples directory that will get populated as we add more such helpers and the 'perf bpf' (Arnaldo Carvalho de Melo) perf stat: - Display time in precision based on std deviation (Jiri Olsa) - Add --table option to display time of each run (Jiri Olsa) - Display length strings of each run for --table option (Jiri Olsa) perf buildid-cache: - Add --list and --purge-all options (Ravi Bangoria) perf test: - Let 'perf test list' display subtests (Hendrik Brueckner) perf pti: - Create extra kernel maps to help in decoding samples in x86 PTI entry trampolines (Adrian Hunter) - Copy x86 PTI entry trampoline sections in the kcore copy used for annotation and intel_pt CPU traces decoding (Adrian Hunter) ... and a lot of other fixes, enhancements and cleanups I did not list, see the shortlog and git log for details" * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (111 commits) perf/x86/intel/uncore: Clean up client IMC uncore perf/x86/intel/uncore: Expose uncore_pmu_event*() functions perf/x86/intel/uncore: Support IIO free-running counters on SKX perf/x86/intel/uncore: Add infrastructure for free running counters perf/x86/intel/uncore: Add new data structures for free running counters perf/x86/intel/uncore: Correct fixed counter index check in generic code perf/x86/intel/uncore: Correct fixed counter index check for NHM perf/x86/intel/uncore: Introduce customized event_read() for client IMC uncore perf/x86: Store user space frame-pointer value on a sample perf/core: Wire up compat PERF_EVENT_IOC_QUERY_BPF, PERF_EVENT_IOC_MODIFY_ATTRIBUTES perf/core: Fix bad use of igrab() perf/core: Fix group scheduling with mixed hw and sw events perf kcore_copy: Amend the offset of sections that remap kernel text perf kcore_copy: Copy x86 PTI entry trampoline sections perf kcore_copy: Get rid of kernel_map perf kcore_copy: Iterate phdrs perf kcore_copy: Layout sections perf kcore_copy: Calculate offset from phnum perf kcore_copy: Keep a count of phdrs perf kcore_copy: Keep phdr data in a list ...
2018-06-04Merge branch 'locking-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking updates from Ingo Molnar: - Lots of tidying up changes all across the map for Linux's formal memory/locking-model tooling, by Alan Stern, Akira Yokosawa, Andrea Parri, Paul E. McKenney and SeongJae Park. Notable changes beyond an overall update in the tooling itself is the tidying up of spin_is_locked() semantics, which spills over into the kernel proper as well. - qspinlock improvements: the locking algorithm now guarantees forward progress whereas the previous implementation in mainline could starve threads indefinitely in cmpxchg() loops. Also other related cleanups to the qspinlock code (Will Deacon) - misc smaller improvements, cleanups and fixes all across the locking subsystem * 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (51 commits) locking/rwsem: Simplify the is-owner-spinnable checks tools/memory-model: Add reference for 'Simplifying ARM concurrency' tools/memory-model: Update ASPLOS information MAINTAINERS, tools/memory-model: Update e-mail address for Andrea Parri tools/memory-model: Fix coding style in 'lock.cat' tools/memory-model: Remove out-of-date comments and code from lock.cat tools/memory-model: Improve mixed-access checking in lock.cat tools/memory-model: Improve comments in lock.cat tools/memory-model: Remove duplicated code from lock.cat tools/memory-model: Flag "cumulativity" and "propagation" tests tools/memory-model: Add model support for spin_is_locked() tools/memory-model: Add scripts to test memory model tools/memory-model: Fix coding style in 'linux-kernel.def' tools/memory-model: Model 'smp_store_mb()' tools/memory-order: Update the cheat-sheet to show that smp_mb__after_atomic() orders later RMW operations tools/memory-order: Improve key for SELF and SV tools/memory-model: Fix cheat sheet typo tools/memory-model: Update required version of herdtools7 tools/memory-model: Redefine rb in terms of rcu-fence tools/memory-model: Rename link and rcu-path to rcu-link and rb ...
2018-06-04Merge branch 'core-rcu-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull RCU updates from Ingo Molnar: - updates to the handling of expedited grace periods - updates to reduce lock contention in the rcu_node combining tree [ These are in preparation for the consolidation of RCU-bh, RCU-preempt, and RCU-sched into a single flavor, which was requested by Linus in response to a security flaw whose root cause included confusion between the multiple flavors of RCU ] - torture-test updates that save their users some time and effort - miscellaneous fixes * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (44 commits) rcu/x86: Provide early rcu_cpu_starting() callback torture: Make kvm-find-errors.sh find build warnings rcutorture: Abbreviate kvm.sh summary lines rcutorture: Print end-of-test state in kvm.sh summary rcutorture: Print end-of-test state torture: Fold parse-torture.sh into parse-console.sh torture: Add a script to edit output from failed runs rcu: Update list of rcu_future_grace_period() trace events rcu: Drop early GP request check from rcu_gp_kthread() rcu: Simplify and inline cpu_needs_another_gp() rcu: The rcu_gp_cleanup() function does not need cpu_needs_another_gp() rcu: Make rcu_start_this_gp() check for out-of-range requests rcu: Add funnel locking to rcu_start_this_gp() rcu: Make rcu_start_future_gp() caller select grace period rcu: Inline rcu_start_gp_advanced() into rcu_start_future_gp() rcu: Clear request other than RCU_GP_FLAG_INIT at GP end rcu: Cleanup, don't put ->completed into an int rcu: Switch __rcu_process_callbacks() to rcu_accelerate_cbs() rcu: Avoid __call_rcu_core() root rcu_node ->lock acquisition rcu: Make rcu_migrate_callbacks wake GP kthread when needed ...
2018-06-04Merge branch 'siginfo-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull siginfo updates from Eric Biederman: "This set of changes close the known issues with setting si_code to an invalid value, and with not fully initializing struct siginfo. There remains work to do on nds32, arc, unicore32, powerpc, arm, arm64, ia64 and x86 to get the code that generates siginfo into a simpler and more maintainable state. Most of that work involves refactoring the signal handling code and thus careful code review. Also not included is the work to shrink the in kernel version of struct siginfo. That depends on getting the number of places that directly manipulate struct siginfo under control, as it requires the introduction of struct kernel_siginfo for the in kernel things. Overall this set of changes looks like it is making good progress, and with a little luck I will be wrapping up the siginfo work next development cycle" * 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (46 commits) signal/sh: Stop gcc warning about an impossible case in do_divide_error signal/mips: Report FPE_FLTUNK for undiagnosed floating point exceptions signal/um: More carefully relay signals in relay_signal. signal: Extend siginfo_layout with SIL_FAULT_{MCEERR|BNDERR|PKUERR} signal: Remove unncessary #ifdef SEGV_PKUERR in 32bit compat code signal/signalfd: Add support for SIGSYS signal/signalfd: Remove __put_user from signalfd_copyinfo signal/xtensa: Use force_sig_fault where appropriate signal/xtensa: Consistenly use SIGBUS in do_unaligned_user signal/um: Use force_sig_fault where appropriate signal/sparc: Use force_sig_fault where appropriate signal/sparc: Use send_sig_fault where appropriate signal/sh: Use force_sig_fault where appropriate signal/s390: Use force_sig_fault where appropriate signal/riscv: Replace do_trap_siginfo with force_sig_fault signal/riscv: Use force_sig_fault where appropriate signal/parisc: Use force_sig_fault where appropriate signal/parisc: Use force_sig_mceerr where appropriate signal/openrisc: Use force_sig_fault where appropriate signal/nios2: Use force_sig_fault where appropriate ...
2018-06-04ring-buffer: Fix a bunch of typos in commentsSteven Rostedt (VMware)
An anonymous source sent me a bunch of typo fixes in the comments of ring_buffer.c file. That source did not want to be associated to this patch because they don't want to be known as "one of those" commiters (you know who you are!). They gave me permission to sign this off in my own name. Suggested-by: One-of-those-commiters@YouKnowWhoYouAre.org Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-06-04Merge branch 'work.aio-1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull aio updates from Al Viro: "Majority of AIO stuff this cycle. aio-fsync and aio-poll, mostly. The only thing I'm holding back for a day or so is Adam's aio ioprio - his last-minute fixup is trivial (missing stub in !CONFIG_BLOCK case), but let it sit in -next for decency sake..." * 'work.aio-1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (46 commits) aio: sanitize the limit checking in io_submit(2) aio: fold do_io_submit() into callers aio: shift copyin of iocb into io_submit_one() aio_read_events_ring(): make a bit more readable aio: all callers of aio_{read,write,fsync,poll} treat 0 and -EIOCBQUEUED the same way aio: take list removal to (some) callers of aio_complete() aio: add missing break for the IOCB_CMD_FDSYNC case random: convert to ->poll_mask timerfd: convert to ->poll_mask eventfd: switch to ->poll_mask pipe: convert to ->poll_mask crypto: af_alg: convert to ->poll_mask net/rxrpc: convert to ->poll_mask net/iucv: convert to ->poll_mask net/phonet: convert to ->poll_mask net/nfc: convert to ->poll_mask net/caif: convert to ->poll_mask net/bluetooth: convert to ->poll_mask net/sctp: convert to ->poll_mask net/tipc: convert to ->poll_mask ...
2018-06-04bpf: guard bpf_get_current_cgroup_id() with CONFIG_CGROUPSYonghong Song
Commit bf6fa2c893c5 ("bpf: implement bpf_get_current_cgroup_id() helper") introduced a new helper bpf_get_current_cgroup_id(). The helper has a dependency on CONFIG_CGROUPS. When CONFIG_CGROUPS is not defined, using the helper will result the following verifier error: kernel subsystem misconfigured func bpf_get_current_cgroup_id#80 which is hard for users to interpret. Guarding the reference to bpf_get_current_cgroup_id_proto with CONFIG_CGROUPS will result in below better message: unknown func bpf_get_current_cgroup_id#80 Fixes: bf6fa2c893c5 ("bpf: implement bpf_get_current_cgroup_id() helper") Suggested-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-06-04Merge branch 'hch.procfs' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull procfs updates from Al Viro: "Christoph's proc_create_... cleanups series" * 'hch.procfs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (44 commits) xfs, proc: hide unused xfs procfs helpers isdn/gigaset: add back gigaset_procinfo assignment proc: update SIZEOF_PDE_INLINE_NAME for the new pde fields tty: replace ->proc_fops with ->proc_show ide: replace ->proc_fops with ->proc_show ide: remove ide_driver_proc_write isdn: replace ->proc_fops with ->proc_show atm: switch to proc_create_seq_private atm: simplify procfs code bluetooth: switch to proc_create_seq_data netfilter/x_tables: switch to proc_create_seq_private netfilter/xt_hashlimit: switch to proc_create_{seq,single}_data neigh: switch to proc_create_seq_data hostap: switch to proc_create_{seq,single}_data bonding: switch to proc_create_seq_data rtc/proc: switch to proc_create_single_data drbd: switch to proc_create_single resource: switch to proc_create_seq_data staging/rtl8192u: simplify procfs code jfs: simplify procfs code ...
2018-06-04Merge tag 'for-4.18/block-20180603' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block updates from Jens Axboe: - clean up how we pass around gfp_t and blk_mq_req_flags_t (Christoph) - prepare us to defer scheduler attach (Christoph) - clean up drivers handling of bounce buffers (Christoph) - fix timeout handling corner cases (Christoph/Bart/Keith) - bcache fixes (Coly) - prep work for bcachefs and some block layer optimizations (Kent). - convert users of bio_sets to using embedded structs (Kent). - fixes for the BFQ io scheduler (Paolo/Davide/Filippo) - lightnvm fixes and improvements (Matias, with contributions from Hans and Javier) - adding discard throttling to blk-wbt (me) - sbitmap blk-mq-tag handling (me/Omar/Ming). - remove the sparc jsflash block driver, acked by DaveM. - Kyber scheduler improvement from Jianchao, making it more friendly wrt merging. - conversion of symbolic proc permissions to octal, from Joe Perches. Previously the block parts were a mix of both. - nbd fixes (Josef and Kevin Vigor) - unify how we handle the various kinds of timestamps that the block core and utility code uses (Omar) - three NVMe pull requests from Keith and Christoph, bringing AEN to feature completeness, file backed namespaces, cq/sq lock split, and various fixes - various little fixes and improvements all over the map * tag 'for-4.18/block-20180603' of git://git.kernel.dk/linux-block: (196 commits) blk-mq: update nr_requests when switching to 'none' scheduler block: don't use blocking queue entered for recursive bio submits dm-crypt: fix warning in shutdown path lightnvm: pblk: take bitmap alloc. out of critical section lightnvm: pblk: kick writer on new flush points lightnvm: pblk: only try to recover lines with written smeta lightnvm: pblk: remove unnecessary bio_get/put lightnvm: pblk: add possibility to set write buffer size manually lightnvm: fix partial read error path lightnvm: proper error handling for pblk_bio_add_pages lightnvm: pblk: fix smeta write error path lightnvm: pblk: garbage collect lines with failed writes lightnvm: pblk: rework write error recovery path lightnvm: pblk: remove dead function lightnvm: pass flag on graceful teardown to targets lightnvm: pblk: check for chunk size before allocating it lightnvm: pblk: remove unnecessary argument lightnvm: pblk: remove unnecessary indirection lightnvm: pblk: return NVM_ error on failed submission lightnvm: pblk: warn in case of corrupted write buffer ...
2018-06-04Merge branches 'pm-pci', 'acpi-pm', 'pm-sleep' and 'pm-avs'Rafael J. Wysocki
* pm-pci: PCI / PM: Clean up outdated comments in pci_target_state() PCI / PM: Do not clear state_saved for devices that remain suspended * acpi-pm: ACPI: EC: Dispatch the EC GPE directly on s2idle wake ACPICA: Introduce acpi_dispatch_gpe() * pm-sleep: PM / hibernate: Fix oops at snapshot_write() PM / wakeup: Make s2idle_lock a RAW_SPINLOCK PM / s2idle: Make s2idle_wait_head swait based PM / wakeup: Make events_lock a RAW_SPINLOCK PM / suspend: Prevent might sleep splats * pm-avs: PM / AVS: rockchip-io: add io selectors and supplies for PX30
2018-06-04Merge branches 'pm-cpufreq-sched' and 'pm-cpuidle'Rafael J. Wysocki
* pm-cpufreq-sched: cpufreq: schedutil: Avoid missing updates for one-CPU policies schedutil: Allow cpufreq requests to be made even when kthread kicked cpufreq: Rename cpufreq_can_do_remote_dvfs() cpufreq: schedutil: Cleanup and document iowait boost cpufreq: schedutil: Fix iowait boost reset cpufreq: schedutil: Don't set next_freq to UINT_MAX Revert "cpufreq: schedutil: Don't restrict kthread to related_cpus unnecessarily" * pm-cpuidle: cpuidle: governors: Consolidate PM QoS handling cpuidle: governors: Drop redundant checks related to PM QoS
2018-06-04Merge branches 'pm-qos' and 'pm-core'Rafael J. Wysocki
* pm-qos: PM / QoS: Drop redundant declaration of pm_qos_get_value() * pm-core: PM / runtime: Drop usage count for suppliers at device link removal PM / runtime: Fixup reference counting of device link suppliers at probe PM: wakeup: Use pr_debug() for the "aborting suspend" message PM / core: Drop unused internal inline functions for sysfs PM / core: Drop unused internal functions for pm_qos sysfs PM / core: Drop unused internal inline functions for wakeirqs PM / core: Drop internal unused inline functions for wakeups PM / wakeup: Only update last time for active wakeup sources PM / wakeup: Use seq_open() to show wakeup stats PM / core: Use dev_printk() and symbols in suspend/resume diagnostics PM / core: Simplify initcall_debug_report() timing PM / core: Remove unused initcall_debug_report() arguments PM / core: fix deferred probe breaking suspend resume order
2018-06-03bpf: implement bpf_get_current_cgroup_id() helperYonghong Song
bpf has been used extensively for tracing. For example, bcc contains an almost full set of bpf-based tools to trace kernel and user functions/events. Most tracing tools are currently either filtered based on pid or system-wide. Containers have been used quite extensively in industry and cgroup is often used together to provide resource isolation and protection. Several processes may run inside the same container. It is often desirable to get container-level tracing results as well, e.g. syscall count, function count, I/O activity, etc. This patch implements a new helper, bpf_get_current_cgroup_id(), which will return cgroup id based on the cgroup within which the current task is running. The later patch will provide an example to show that userspace can get the same cgroup id so it could configure a filter or policy in the bpf program based on task cgroup id. The helper is currently implemented for tracing. It can be added to other program types as well when needed. Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>