summaryrefslogtreecommitdiff
path: root/arch/s390/kernel
AgeCommit message (Collapse)Author
2020-03-23s390/topology: remove offline CPUs from CPU topology masksAlexander Gordeev
The CPU topology masks on s390 contain also bits of CPUs which are offline. Currently this is already a problem, since common code scheduler expects e.g. cpu_smt_mask() to reflect reality. This update changes the described behaviour and s390 starts to behave like all other architectures. Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> Reviewed-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2020-03-23s390/numa: remove redundant cpus_with_topology variableAlexander Gordeev
Variable cpus_with_topology is a leftover that became unneeded once the fake NUMA support has been removed. Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2020-03-23s390/cpuinfo: show processor physical addressAlexander Gordeev
Show CPU physical address as reported by STAP instruction Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> Reviewed-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2020-03-11s390/traps: mark test_monitor_call __initHeiko Carstens
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2020-03-11s390/irq: make init_ext_interrupts staticHeiko Carstens
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2020-03-11s390/irq: replace setup_irq() by request_irq()afzal mohammed
request_irq() is preferred over setup_irq(). Invocations of setup_irq() occur after memory allocators are ready. Per tglx[1], setup_irq() existed in olden days when allocators were not ready by the time early interrupts were initialized. Hence replace setup_irq() by request_irq(). [1] https://lkml.kernel.org/r/alpine.DEB.2.20.1710191609480.1971@nanos Signed-off-by: afzal mohammed <afzal.mohd.ma@gmail.com> Message-Id: <20200304005049.5291-1-afzal.mohd.ma@gmail.com> [heiko.carstens@de.ibm.com: replace pr_err with panic] Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2020-03-10s390/cpuinfo: add system topology informationAlexander Gordeev
This update adjusts /proc/cpuinfo format to meet some user level programs expectations. It also makes the layout consistent with x86 where CPU topology is presented as blocks of key-value pairs. Reviewed-by: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2020-03-10s390: prevent leaking kernel address in BEARSven Schnelle
When userspace executes a syscall or gets interrupted, BEAR contains a kernel address when returning to userspace. This make it pretty easy to figure out where the kernel is mapped even with KASLR enabled. To fix this, add lpswe to lowcore and always execute it there, so userspace sees only the lowcore address of lpswe. For this we have to extend both critical_cleanup and the SWITCH_ASYNC macro to also check for lpswe addresses in lowcore. Fixes: b2d24b97b2a9 ("s390/kernel: add support for kernel address space layout randomization (KASLR)") Cc: <stable@vger.kernel.org> # v5.2+ Reviewed-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> Signed-off-by: Sven Schnelle <svens@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2020-03-10s390/cpum_cf: Add new extended counters for IBM z15Thomas Richter
Add CPU measurement counter facility event description for IBM z15. Signed-off-by: Thomas Richter <tmricht@linux.ibm.com> Reviewed-by: Sumanth Korikkar <sumanthk@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2020-02-27s390/mm: remove fake numa supportHeiko Carstens
It turned out that fake numa support is rather useless on s390, since there are no scenarios where there is any performance or other benefit when used. However it does provide maintenance cost and breaks from time to time. Therefore remove it. CONFIG_NUMA is still supported with a very small backend and only one node. This way userspace applications which require NUMA interfaces continue to work. Note that NODES_SHIFT is set to 1 (= 2 nodes) instead of 0 (= 1 node), since there is quite a bit of kernel code which assumes that more than one node is possible if CONFIG_NUMA is enabled. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2020-02-17s390/cpum_sf: Rework sampling buffer allocationThomas Richter
Adjust sampling buffer allocation depending on frequency and correct comments. Investigation on the interrupt handler revealed that almost always one interupt services one SDB, even when running with the maximum frequency of 100000. Very rarely there have been 2 SBD serviced per interrupt. Therefore reduce the number of SBD per CPU. Each SDB is one page in size. The new formula results in freq:4000 n_sdb:32 new:16 freq:10000 n_sdb:80 new:16 freq:20000 n_sdb:159 new:17 freq:40000 n_sdb:318 new:19 freq:50000 n_sdb:397 new:20 freq:62500 n_sdb:497 new:22 freq:83333 n_sdb:662 new:24 freq:100000 n_sdb:794 new:25 Signed-off-by: Thomas Richter <tmricht@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2020-02-05Merge tag 's390-5.6-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux Pull more s390 updates from Vasily Gorbik: "The second round of s390 fixes and features for 5.6: - Add KPROBES_ON_FTRACE support - Add EP11 AES secure keys support - PAES rework and prerequisites for paes-s390 ciphers selftests - Fix page table upgrade for hugetlbfs" * tag 's390-5.6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: s390/pkey/zcrypt: Support EP11 AES secure keys s390/zcrypt: extend EP11 card and queue sysfs attributes s390/zcrypt: add new low level ep11 functions support file s390/zcrypt: ep11 structs rework, export zcrypt_send_ep11_cprb s390/zcrypt: enable card/domain autoselect on ep11 cprbs s390/crypto: enable clear key values for paes ciphers s390/pkey: Add support for key blob with clear key value s390/crypto: Rework on paes implementation s390: support KPROBES_ON_FTRACE s390/mm: fix dynamic pagetable upgrade for hugetlbfs
2020-01-31s390/boot: add dfltcc= kernel command line parameterMikhail Zaslonko
Add the new kernel command line parameter 'dfltcc=' to configure s390 zlib hardware support. Format: { on | off | def_only | inf_only | always } on: s390 zlib hardware support for compression on level 1 and decompression (default) off: No s390 zlib hardware support def_only: s390 zlib hardware support for deflate only (compression on level 1) inf_only: s390 zlib hardware support for inflate only (decompression) always: Same as 'on' but ignores the selected compression level always using hardware support (used for debugging) Link: http://lkml.kernel.org/r/20200103223334.20669-5-zaslonko@linux.ibm.com Signed-off-by: Mikhail Zaslonko <zaslonko@linux.ibm.com> Cc: Chris Mason <clm@fb.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: David Sterba <dsterba@suse.com> Cc: Eduard Shishkin <edward6@linux.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Ilya Leoshkevich <iii@linux.ibm.com> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Richard Purdie <rpurdie@rpsys.net> Cc: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-01-31mm/memblock: define memblock_physmem_add()Anshuman Khandual
On the s390 platform memblock.physmem array is being built by directly calling into memblock_add_range() which is a low level function not intended to be used outside of memblock. Hence lets conditionally add helper functions for physmem array when HAVE_MEMBLOCK_PHYS_MAP is enabled. Also use MAX_NUMNODES instead of 0 as node ID similar to memblock_add() and memblock_reserve(). Make memblock_add_range() a static function as it is no longer getting used outside of memblock. Link: http://lkml.kernel.org/r/1578283835-21969-1-git-send-email-anshuman.khandual@arm.com Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: Mike Rapoport <rppt@linux.ibm.com> Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Collin Walling <walling@linux.ibm.com> Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com> Cc: Philipp Rudo <prudo@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-01-30s390: support KPROBES_ON_FTRACESven Schnelle
Instead of using our own kprobes-on-ftrace handling convert the code to support KPROBES_ON_FTRACE. Signed-off-by: Sven Schnelle <svens@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2020-01-29Merge tag 'threads-v5.6' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux Pull thread management updates from Christian Brauner: "Sargun Dhillon over the last cycle has worked on the pidfd_getfd() syscall. This syscall allows for the retrieval of file descriptors of a process based on its pidfd. A task needs to have ptrace_may_access() permissions with PTRACE_MODE_ATTACH_REALCREDS (suggested by Oleg and Andy) on the target. One of the main use-cases is in combination with seccomp's user notification feature. As a reminder, seccomp's user notification feature was made available in v5.0. It allows a task to retrieve a file descriptor for its seccomp filter. The file descriptor is usually handed of to a more privileged supervising process. The supervisor can then listen for syscall events caught by the seccomp filter of the supervisee and perform actions in lieu of the supervisee, usually emulating syscalls. pidfd_getfd() is needed to expand its uses. There are currently two major users that wait on pidfd_getfd() and one future user: - Netflix, Sargun said, is working on a service mesh where users should be able to connect to a dns-based VIP. When a user connects to e.g. 1.2.3.4:80 that runs e.g. service "foo" they will be redirected to an envoy process. This service mesh uses seccomp user notifications and pidfd to intercept all connect calls and instead of connecting them to 1.2.3.4:80 connects them to e.g. 127.0.0.1:8080. - LXD uses the seccomp notifier heavily to intercept and emulate mknod() and mount() syscalls for unprivileged containers/processes. With pidfd_getfd() more uses-cases e.g. bridging socket connections will be possible. - The patchset has also seen some interest from the browser corner. Right now, Firefox is using a SECCOMP_RET_TRAP sandbox managed by a broker process. In the future glibc will start blocking all signals during dlopen() rendering this type of sandbox impossible. Hence, in the future Firefox will switch to a seccomp-user-nofication based sandbox which also makes use of file descriptor retrieval. The thread for this can be found at https://sourceware.org/ml/libc-alpha/2019-12/msg00079.html With pidfd_getfd() it is e.g. possible to bridge socket connections for the supervisee (binding to a privileged port) and taking actions on file descriptors on behalf of the supervisee in general. Sargun's first version was using an ioctl on pidfds but various people pushed for it to be a proper syscall which he duely implemented as well over various review cycles. Selftests are of course included. I've also added instructions how to deal with merge conflicts below. There's also a small fix coming from the kernel mentee project to correctly annotate struct sighand_struct with __rcu to fix various sparse warnings. We've received a few more such fixes and even though they are mostly trivial I've decided to postpone them until after -rc1 since they came in rather late and I don't want to risk introducing build warnings. Finally, there's a new prctl() command PR_{G,S}ET_IO_FLUSHER which is needed to avoid allocation recursions triggerable by storage drivers that have userspace parts that run in the IO path (e.g. dm-multipath, iscsi, etc). These allocation recursions deadlock the device. The new prctl() allows such privileged userspace components to avoid allocation recursions by setting the PF_MEMALLOC_NOIO and PF_LESS_THROTTLE flags. The patch carries the necessary acks from the relevant maintainers and is routed here as part of prctl() thread-management." * tag 'threads-v5.6' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux: prctl: PR_{G,S}ET_IO_FLUSHER to support controlling memory reclaim sched.h: Annotate sighand_struct with __rcu test: Add test for pidfd getfd arch: wire up pidfd_getfd syscall pid: Implement pidfd_getfd syscall vfs, fdtable: Add fget_task helper
2020-01-29Merge branch 'work.openat2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull openat2 support from Al Viro: "This is the openat2() series from Aleksa Sarai. I'm afraid that the rest of namei stuff will have to wait - it got zero review the last time I'd posted #work.namei, and there had been a leak in the posted series I'd caught only last weekend. I was going to repost it on Monday, but the window opened and the odds of getting any review during that... Oh, well. Anyway, openat2 part should be ready; that _did_ get sane amount of review and public testing, so here it comes" From Aleksa's description of the series: "For a very long time, extending openat(2) with new features has been incredibly frustrating. This stems from the fact that openat(2) is possibly the most famous counter-example to the mantra "don't silently accept garbage from userspace" -- it doesn't check whether unknown flags are present[1]. This means that (generally) the addition of new flags to openat(2) has been fraught with backwards-compatibility issues (O_TMPFILE has to be defined as __O_TMPFILE|O_DIRECTORY|[O_RDWR or O_WRONLY] to ensure old kernels gave errors, since it's insecure to silently ignore the flag[2]). All new security-related flags therefore have a tough road to being added to openat(2). Furthermore, the need for some sort of control over VFS's path resolution (to avoid malicious paths resulting in inadvertent breakouts) has been a very long-standing desire of many userspace applications. This patchset is a revival of Al Viro's old AT_NO_JUMPS[3] patchset (which was a variant of David Drysdale's O_BENEATH patchset[4] which was a spin-off of the Capsicum project[5]) with a few additions and changes made based on the previous discussion within [6] as well as others I felt were useful. In line with the conclusions of the original discussion of AT_NO_JUMPS, the flag has been split up into separate flags. However, instead of being an openat(2) flag it is provided through a new syscall openat2(2) which provides several other improvements to the openat(2) interface (see the patch description for more details). The following new LOOKUP_* flags are added: LOOKUP_NO_XDEV: Blocks all mountpoint crossings (upwards, downwards, or through absolute links). Absolute pathnames alone in openat(2) do not trigger this. Magic-link traversal which implies a vfsmount jump is also blocked (though magic-link jumps on the same vfsmount are permitted). LOOKUP_NO_MAGICLINKS: Blocks resolution through /proc/$pid/fd-style links. This is done by blocking the usage of nd_jump_link() during resolution in a filesystem. The term "magic-links" is used to match with the only reference to these links in Documentation/, but I'm happy to change the name. It should be noted that this is different to the scope of ~LOOKUP_FOLLOW in that it applies to all path components. However, you can do openat2(NO_FOLLOW|NO_MAGICLINKS) on a magic-link and it will *not* fail (assuming that no parent component was a magic-link), and you will have an fd for the magic-link. In order to correctly detect magic-links, the introduction of a new LOOKUP_MAGICLINK_JUMPED state flag was required. LOOKUP_BENEATH: Disallows escapes to outside the starting dirfd's tree, using techniques such as ".." or absolute links. Absolute paths in openat(2) are also disallowed. Conceptually this flag is to ensure you "stay below" a certain point in the filesystem tree -- but this requires some additional to protect against various races that would allow escape using "..". Currently LOOKUP_BENEATH implies LOOKUP_NO_MAGICLINKS, because it can trivially beam you around the filesystem (breaking the protection). In future, there might be similar safety checks done as in LOOKUP_IN_ROOT, but that requires more discussion. In addition, two new flags are added that expand on the above ideas: LOOKUP_NO_SYMLINKS: Does what it says on the tin. No symlink resolution is allowed at all, including magic-links. Just as with LOOKUP_NO_MAGICLINKS this can still be used with NOFOLLOW to open an fd for the symlink as long as no parent path had a symlink component. LOOKUP_IN_ROOT: This is an extension of LOOKUP_BENEATH that, rather than blocking attempts to move past the root, forces all such movements to be scoped to the starting point. This provides chroot(2)-like protection but without the cost of a chroot(2) for each filesystem operation, as well as being safe against race attacks that chroot(2) is not. If a race is detected (as with LOOKUP_BENEATH) then an error is generated, and similar to LOOKUP_BENEATH it is not permitted to cross magic-links with LOOKUP_IN_ROOT. The primary need for this is from container runtimes, which currently need to do symlink scoping in userspace[7] when opening paths in a potentially malicious container. There is a long list of CVEs that could have bene mitigated by having RESOLVE_THIS_ROOT (such as CVE-2017-1002101, CVE-2017-1002102, CVE-2018-15664, and CVE-2019-5736, just to name a few). In order to make all of the above more usable, I'm working on libpathrs[8] which is a C-friendly library for safe path resolution. It features a userspace-emulated backend if the kernel doesn't support openat2(2). Hopefully we can get userspace to switch to using it, and thus get openat2(2) support for free once it's ready. Future work would include implementing things like RESOLVE_NO_AUTOMOUNT and possibly a RESOLVE_NO_REMOTE (to allow programs to be sure they don't hit DoSes though stale NFS handles)" * 'work.openat2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: Documentation: path-lookup: include new LOOKUP flags selftests: add openat2(2) selftests open: introduce openat2(2) syscall namei: LOOKUP_{IN_ROOT,BENEATH}: permit limited ".." resolution namei: LOOKUP_IN_ROOT: chroot-like scoped resolution namei: LOOKUP_BENEATH: O_BENEATH-like scoped resolution namei: LOOKUP_NO_XDEV: block mountpoint crossing namei: LOOKUP_NO_MAGICLINKS: block magic-link resolution namei: LOOKUP_NO_SYMLINKS: block symlink resolution namei: allow set_root() to produce errors namei: allow nd_jump_link() to produce errors nsfs: clean-up ns_get_path() signature to return int namei: only return -ECHILD from follow_dotdot_rcu()
2020-01-29Merge tag 'tty-5.6-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty Pull tty/serial driver updates from Greg KH: "Here are the big set of tty and serial driver updates for 5.6-rc1 Included in here are: - dummy_con cleanups (touches lots of arch code) - sysrq logic cleanups (touches lots of serial drivers) - samsung driver fixes (wasn't really being built) - conmakeshash move to tty subdir out of scripts - lots of small tty/serial driver updates All of these have been in linux-next for a while with no reported issues" * tag 'tty-5.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (140 commits) tty: n_hdlc: Use flexible-array member and struct_size() helper tty: baudrate: SPARC supports few more baud rates tty: baudrate: Synchronise baud_table[] and baud_bits[] tty: serial: meson_uart: Add support for kernel debugger serial: imx: fix a race condition in receive path serial: 8250_bcm2835aux: Document struct bcm2835aux_data serial: 8250_bcm2835aux: Use generic remapping code serial: 8250_bcm2835aux: Allocate uart_8250_port on stack serial: 8250_bcm2835aux: Suppress register_port error on -EPROBE_DEFER serial: 8250_bcm2835aux: Suppress clk_get error on -EPROBE_DEFER serial: 8250_bcm2835aux: Fix line mismatch on driver unbind serial_core: Remove unused member in uart_port vt: Correct comment documenting do_take_over_console() vt: Delete comment referencing non-existent unbind_con_driver() arch/xtensa/setup: Drop dummy_con initialization arch/x86/setup: Drop dummy_con initialization arch/unicore32/setup: Drop dummy_con initialization arch/sparc/setup: Drop dummy_con initialization arch/sh/setup: Drop dummy_con initialization arch/s390/setup: Drop dummy_con initialization ...
2020-01-28Merge tag 's390-5.6-1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux Pull s390 updates from Vasily Gorbik: - Add clang 10 build support. - Fix BUG() implementation to contain precise bug address, which is relevant for kprobes. - Make ftraced function appear in a stacktrace. - Minor perf improvements and refactoring. - Possible deadlock and recovery fixes in pci code. * tag 's390-5.6-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: s390: fix __EMIT_BUG() macro s390/ftrace: generate traced function stack frame s390: adjust -mpacked-stack support check for clang 10 s390/jump_label: use "i" constraint for clang s390/cpum_sf: Use DIV_ROUND_UP s390/cpum_sf: Use kzalloc and minor changes s390/cpum_sf: Convert debug trace to common layout s390/pci: Fix possible deadlock in recover_store() s390/pci: Recover handle in clp_set_pci_fn()
2020-01-28Merge branch 'sched-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: "These were the main changes in this cycle: - More -rt motivated separation of CONFIG_PREEMPT and CONFIG_PREEMPTION. - Add more low level scheduling topology sanity checks and warnings to filter out nonsensical topologies that break scheduling. - Extend uclamp constraints to influence wakeup CPU placement - Make the RT scheduler more aware of asymmetric topologies and CPU capacities, via uclamp metrics, if CONFIG_UCLAMP_TASK=y - Make idle CPU selection more consistent - Various fixes, smaller cleanups, updates and enhancements - please see the git log for details" * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (58 commits) sched/fair: Define sched_idle_cpu() only for SMP configurations sched/topology: Assert non-NUMA topology masks don't (partially) overlap idle: fix spelling mistake "iterrupts" -> "interrupts" sched/fair: Remove redundant call to cpufreq_update_util() sched/psi: create /proc/pressure and /proc/pressure/{io|memory|cpu} only when psi enabled sched/fair: Fix sgc->{min,max}_capacity calculation for SD_OVERLAP sched/fair: calculate delta runnable load only when it's needed sched/cputime: move rq parameter in irqtime_account_process_tick stop_machine: Make stop_cpus() static sched/debug: Reset watchdog on all CPUs while processing sysrq-t sched/core: Fix size of rq::uclamp initialization sched/uclamp: Fix a bug in propagating uclamp value in new cgroups sched/fair: Load balance aggressively for SCHED_IDLE CPUs sched/fair : Improve update_sd_pick_busiest for spare capacity case watchdog: Remove soft_lockup_hrtimer_cnt and related code sched/rt: Make RT capacity-aware sched/fair: Make EAS wakeup placement consider uclamp restrictions sched/fair: Make task_fits_capacity() consider uclamp restrictions sched/uclamp: Rename uclamp_util_with() into uclamp_rq_util_with() sched/uclamp: Make uclamp util helpers use and return UL values ...
2020-01-22s390: fix __EMIT_BUG() macroSven Schnelle
Setting a kprobe on getname_flags() failed: $ echo 'p:tmr1 getname_flags +0(%r2):ustring' > kprobe_events -bash: echo: write error: Invalid argument Debugging the kprobes code showed that the address of getname_flags() is contained in the __bug_table. Kprobes doesn't allow to set probes at BUG() locations. $ objdump -j __bug_table -x build/fs/namei.o [..] 0000000000000108 R_390_PC32 .text+0x00000000000075a8 000000000000010c R_390_PC32 .L223+0x0000000000000004 I was expecting getname_flags() to start with a BUG(), but: 7598: e3 20 10 00 00 04 lg %r2,0(%r1) 759e: c0 f4 00 00 00 00 jg 759e <putname+0x7e> 75a0: R_390_PLT32DBL kmem_cache_free+0x2 75a4: a7 f4 00 01 j 75a6 <putname+0x86> 00000000000075a8 <getname_flags>: 75a8: c0 04 00 00 00 00 brcl 0,75a8 <getname_flags> 75ae: eb 6f f0 48 00 24 stmg %r6,%r15,72(%r15) 75b4: b9 04 00 ef lgr %r14,%r15 75b8: e3 f0 ff a8 ff 71 lay %r15,-88(%r15) So the BUG() is actually the last opcode of the previous function. Fix this by switching to using the MONITOR CALL (MC) instruction, and set the entry in __bug_table to the beginning of that MC. Reviewed-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Sven Schnelle <svens@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2020-01-22s390/ftrace: generate traced function stack frameVasily Gorbik
Currently backtrace from ftraced function does not contain ftraced function itself. e.g. for "path_openat": arch_stack_walk+0x15c/0x2d8 stack_trace_save+0x50/0x68 stack_trace_call+0x15e/0x3d8 ftrace_graph_caller+0x0/0x1c <-- ftrace code do_filp_open+0x7c/0xe8 <-- ftraced function caller do_open_execat+0x76/0x1b8 open_exec+0x52/0x78 load_elf_binary+0x180/0x1160 search_binary_handler+0x8e/0x288 load_script+0x2a8/0x2b8 search_binary_handler+0x8e/0x288 __do_execve_file.isra.39+0x6fa/0xb40 __s390x_sys_execve+0x56/0x68 system_call+0xdc/0x2d8 Ftraced function is expected in the backtrace by ftrace kselftests, which are now failing. It would also be nice to have it for clarity reasons. "ftrace_caller" itself is called without stack frame allocated for it and does not store its caller (ftraced function). Instead it simply allocates a stack frame for "ftrace_trace_function" and sets backchain to point to ftraced function stack frame (which contains ftraced function caller in saved r14). To fix this issue make "ftrace_caller" allocate a stack frame for itself just to store ftraced function for the stack unwinder. As a result backtrace looks like the following: arch_stack_walk+0x15c/0x2d8 stack_trace_save+0x50/0x68 stack_trace_call+0x15e/0x3d8 ftrace_graph_caller+0x0/0x1c <-- ftrace code path_openat+0x6/0xd60 <-- ftraced function do_filp_open+0x7c/0xe8 <-- ftraced function caller do_open_execat+0x76/0x1b8 open_exec+0x52/0x78 load_elf_binary+0x180/0x1160 search_binary_handler+0x8e/0x288 load_script+0x2a8/0x2b8 search_binary_handler+0x8e/0x288 __do_execve_file.isra.39+0x6fa/0xb40 __s390x_sys_execve+0x56/0x68 system_call+0xdc/0x2d8 Reported-by: Sven Schnelle <sven.schnelle@ibm.com> Tested-by: Sven Schnelle <sven.schnelle@ibm.com> Reviewed-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2020-01-22s390/cpum_sf: Use DIV_ROUND_UPThomas Richter
Use macro DIV_ROUND_UP() for calculation of number of SDBT SDBT pages required for index pages. This macro is already used throughout the file. Signed-off-by: Thomas Richter <tmricht@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2020-01-22s390/cpum_sf: Use kzalloc and minor changesThomas Richter
Use kzalloc() to allocate auxiliary buffer structure initialized with all zeroes to avoid random value in trace output. Avoid double access to SBD hardware flags. Signed-off-by: Thomas Richter <tmricht@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2020-01-22s390/cpum_sf: Convert debug trace to common layoutThomas Richter
Convert debug traces to print the head/alert/empty marks consistently as decimal numbers. Add some trace statements to enable easier debugging during auxiliary tracing. Signed-off-by: Thomas Richter <tmricht@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2020-01-18open: introduce openat2(2) syscallAleksa Sarai
/* Background. */ For a very long time, extending openat(2) with new features has been incredibly frustrating. This stems from the fact that openat(2) is possibly the most famous counter-example to the mantra "don't silently accept garbage from userspace" -- it doesn't check whether unknown flags are present[1]. This means that (generally) the addition of new flags to openat(2) has been fraught with backwards-compatibility issues (O_TMPFILE has to be defined as __O_TMPFILE|O_DIRECTORY|[O_RDWR or O_WRONLY] to ensure old kernels gave errors, since it's insecure to silently ignore the flag[2]). All new security-related flags therefore have a tough road to being added to openat(2). Userspace also has a hard time figuring out whether a particular flag is supported on a particular kernel. While it is now possible with contemporary kernels (thanks to [3]), older kernels will expose unknown flag bits through fcntl(F_GETFL). Giving a clear -EINVAL during openat(2) time matches modern syscall designs and is far more fool-proof. In addition, the newly-added path resolution restriction LOOKUP flags (which we would like to expose to user-space) don't feel related to the pre-existing O_* flag set -- they affect all components of path lookup. We'd therefore like to add a new flag argument. Adding a new syscall allows us to finally fix the flag-ignoring problem, and we can make it extensible enough so that we will hopefully never need an openat3(2). /* Syscall Prototype. */ /* * open_how is an extensible structure (similar in interface to * clone3(2) or sched_setattr(2)). The size parameter must be set to * sizeof(struct open_how), to allow for future extensions. All future * extensions will be appended to open_how, with their zero value * acting as a no-op default. */ struct open_how { /* ... */ }; int openat2(int dfd, const char *pathname, struct open_how *how, size_t size); /* Description. */ The initial version of 'struct open_how' contains the following fields: flags Used to specify openat(2)-style flags. However, any unknown flag bits or otherwise incorrect flag combinations (like O_PATH|O_RDWR) will result in -EINVAL. In addition, this field is 64-bits wide to allow for more O_ flags than currently permitted with openat(2). mode The file mode for O_CREAT or O_TMPFILE. Must be set to zero if flags does not contain O_CREAT or O_TMPFILE. resolve Restrict path resolution (in contrast to O_* flags they affect all path components). The current set of flags are as follows (at the moment, all of the RESOLVE_ flags are implemented as just passing the corresponding LOOKUP_ flag). RESOLVE_NO_XDEV => LOOKUP_NO_XDEV RESOLVE_NO_SYMLINKS => LOOKUP_NO_SYMLINKS RESOLVE_NO_MAGICLINKS => LOOKUP_NO_MAGICLINKS RESOLVE_BENEATH => LOOKUP_BENEATH RESOLVE_IN_ROOT => LOOKUP_IN_ROOT open_how does not contain an embedded size field, because it is of little benefit (userspace can figure out the kernel open_how size at runtime fairly easily without it). It also only contains u64s (even though ->mode arguably should be a u16) to avoid having padding fields which are never used in the future. Note that as a result of the new how->flags handling, O_PATH|O_TMPFILE is no longer permitted for openat(2). As far as I can tell, this has always been a bug and appears to not be used by userspace (and I've not seen any problems on my machines by disallowing it). If it turns out this breaks something, we can special-case it and only permit it for openat(2) but not openat2(2). After input from Florian Weimer, the new open_how and flag definitions are inside a separate header from uapi/linux/fcntl.h, to avoid problems that glibc has with importing that header. /* Testing. */ In a follow-up patch there are over 200 selftests which ensure that this syscall has the correct semantics and will correctly handle several attack scenarios. In addition, I've written a userspace library[4] which provides convenient wrappers around openat2(RESOLVE_IN_ROOT) (this is necessary because no other syscalls support RESOLVE_IN_ROOT, and thus lots of care must be taken when using RESOLVE_IN_ROOT'd file descriptors with other syscalls). During the development of this patch, I've run numerous verification tests using libpathrs (showing that the API is reasonably usable by userspace). /* Future Work. */ Additional RESOLVE_ flags have been suggested during the review period. These can be easily implemented separately (such as blocking auto-mount during resolution). Furthermore, there are some other proposed changes to the openat(2) interface (the most obvious example is magic-link hardening[5]) which would be a good opportunity to add a way for userspace to restrict how O_PATH file descriptors can be re-opened. Another possible avenue of future work would be some kind of CHECK_FIELDS[6] flag which causes the kernel to indicate to userspace which openat2(2) flags and fields are supported by the current kernel (to avoid userspace having to go through several guesses to figure it out). [1]: https://lwn.net/Articles/588444/ [2]: https://lore.kernel.org/lkml/CA+55aFyyxJL1LyXZeBsf2ypriraj5ut1XkNDsunRBqgVjZU_6Q@mail.gmail.com [3]: commit 629e014bb834 ("fs: completely ignore unknown open flags") [4]: https://sourceware.org/bugzilla/show_bug.cgi?id=17523 [5]: https://lore.kernel.org/lkml/20190930183316.10190-2-cyphar@cyphar.com/ [6]: https://youtu.be/ggD-eb3yPVs Suggested-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Aleksa Sarai <cyphar@cyphar.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-01-14arch/s390/setup: Drop dummy_con initializationArvind Sankar
con_init in tty/vt.c will now set conswitchp to dummy_con if it's unset. Drop it from arch setup code. Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu> Link: https://lore.kernel.org/r/20191218214506.49252-20-nivedita@alum.mit.edu Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-01-13arch: wire up pidfd_getfd syscallSargun Dhillon
This wires up the pidfd_getfd syscall for all architectures. Signed-off-by: Sargun Dhillon <sargun@sargun.me> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Link: https://lore.kernel.org/r/20200107175927.4558-4-sargun@sargun.me Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2020-01-09s390/setup: Fix secure ipl messagePhilipp Rudo
The new machine loader on z15 always creates an IPL Report block and thus sets the IPL_PL_FLAG_IPLSR even when secure boot is disabled. This causes the wrong message being printed at boot. Fix this by checking for IPL_PL_FLAG_SIPL instead. Fixes: 9641b8cc733f ("s390/ipl: read IPL report at early boot") Signed-off-by: Philipp Rudo <prudo@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2019-12-25Merge tag 'v5.5-rc3' into sched/core, to pick up fixesIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-12-18s390/ftrace: save traced function callerVasily Gorbik
A typical backtrace acquired from ftraced function currently looks like the following (e.g. for "path_openat"): arch_stack_walk+0x15c/0x2d8 stack_trace_save+0x50/0x68 stack_trace_call+0x15a/0x3b8 ftrace_graph_caller+0x0/0x1c 0x3e0007e3c98 <- ftraced function caller (should be do_filp_open+0x7c/0xe8) do_open_execat+0x70/0x1b8 __do_execve_file.isra.0+0x7d8/0x860 __s390x_sys_execve+0x56/0x68 system_call+0xdc/0x2d8 Note random "0x3e0007e3c98" stack value as ftraced function caller. This value causes either imprecise unwinder result or unwinding failure. That "0x3e0007e3c98" comes from r14 of ftraced function stack frame, which it haven't had a chance to initialize since the very first instruction calls ftrace code ("ftrace_caller"). (ftraced function might never save r14 as well). Nevertheless according to s390 ABI any function is called with stack frame allocated for it and r14 contains return address. "ftrace_caller" itself is called with "brasl %r0,ftrace_caller". So, to fix this issue simply always save traced function caller onto ftraced function stack frame. Reported-by: Sven Schnelle <svens@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2019-12-18s390/unwind: stop gracefully at user mode pt_regs in irq stackVasily Gorbik
Consider reaching user mode pt_regs at the bottom of irq stack graceful unwinder termination. This is the case when irq/mcck/ext interrupt arrives while in user mode. Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2019-12-11s390: remove last diag 0x44 callerHeiko Carstens
diag 0x44 is a voluntary undirected yield of a virtual CPU. This has caused a lot of performance issues in the past. There is only one caller left, and that one is only executed if diag 0x9c (directed yield) is not present. Given that all hypervisors implement diag 0x9c anyway, remove the last diag 0x44 to avoid that more callers will be added. Worst case that could happen now, if diag 0x9c is not present, is that a virtual CPU would loop a bit instead of giving its time slice up. diag 0x44 statistics in debugfs are kept and will always be zero, so that user space can tell that there are no calls. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2019-12-11s390/cpum_sf: Avoid SBD overflow condition in irq handlerThomas Richter
The s390 CPU Measurement sampling facility has an overflow condition which fires when all entries in a SBD are used. The measurement alert interrupt is triggered and reads out all samples in this SDB. It then tests the successor SDB, if this SBD is not full, the interrupt handler does not read any samples at all from this SDB The design waits for the hardware to fill this SBD and then trigger another meassurement alert interrupt. This scheme works nicely until an perf_event_overflow() function call discards the sample due to a too high sampling rate. The interrupt handler has logic to read out a partially filled SDB when the perf event overflow condition in linux common code is met. This causes the CPUM sampling measurement hardware and the PMU device driver to operate on the same SBD's trailer entry. This should not happen. This can be seen here using this trace: cpumsf_pmu_add: tear:0xb5286000 hw_perf_event_update: sdbt 0xb5286000 full 1 over 0 flush_all:0 hw_perf_event_update: sdbt 0xb5286008 full 0 over 0 flush_all:0 above shows 1. interrupt hw_perf_event_update: sdbt 0xb5286008 full 1 over 0 flush_all:0 hw_perf_event_update: sdbt 0xb5286008 full 0 over 0 flush_all:0 above shows 2. interrupt ... this goes on fine until... hw_perf_event_update: sdbt 0xb5286068 full 1 over 0 flush_all:0 perf_push_sample1: overflow one or more samples read from the IRQ handler are rejected by perf_event_overflow() and the IRQ handler advances to the next SDB and modifies the trailer entry of a partially filled SDB. hw_perf_event_update: sdbt 0xb5286070 full 0 over 0 flush_all:1 timestamp: 14:32:52.519953 Next time the IRQ handler is called for this SDB the trailer entry shows an overflow count of 19 missed entries. hw_perf_event_update: sdbt 0xb5286070 full 1 over 19 flush_all:1 timestamp: 14:32:52.970058 Remove access to a follow on SDB when event overflow happened. Signed-off-by: Thomas Richter <tmricht@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2019-12-11s390/cpum_sf: Adjust sampling interval to avoid hitting sample limitsThomas Richter
Function perf_event_ever_overflow() and perf_event_account_interrupt() are called every time samples are processed by the interrupt handler. However function perf_event_account_interrupt() has checks to avoid being flooded with interrupts (more then 1000 samples are received per task_tick). Samples are then dropped and a PERF_RECORD_THROTTLED is added to the perf data. The perf subsystem limit calculation is: maximum sample frequency := 100000 --> 1 samples per 10 us task_tick = 10ms = 10000us --> 1000 samples per task_tick The work flow is measurement_alert() uses SDBT head and each SBDT points to 511 SDB pages, each with 126 sample entries. After processing 8 SBDs and for each valid sample calling: perf_event_overflow() perf_event_account_interrupts() there is a considerable amount of samples being dropped, especially when the sample frequency is very high and near the 100000 limit. To avoid the high amount of samples being dropped near the end of a task_tick time frame, increment the sampling interval in case of dropped events. The CPU Measurement sampling facility on the s390 supports only intervals, specifiing how many CPU cycles have to be executed before a sample is generated. Increase the interval when the samples being generated hit the task_tick limit. Signed-off-by: Thomas Richter <tmricht@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2019-12-08sched/rt, s390: Use CONFIG_PREEMPTIONThomas Gleixner
CONFIG_PREEMPTION is selected by CONFIG_PREEMPT and by CONFIG_PREEMPT_RT. Both PREEMPT and PREEMPT_RT require the same functionality which today depends on CONFIG_PREEMPT. Switch the preemption and entry code over to use CONFIG_PREEMPTION. Add PREEMPT_RT output to die(). [bigeasy: +Kconfig, dumpstack.c] Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: linux-s390@vger.kernel.org Link: https://lore.kernel.org/r/20191015191821.11479-18-bigeasy@linutronix.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-12-03Merge tag 's390-5.5-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux Pull more s390 updates from Vasily Gorbik: - Make stack unwinder reliable and suitable for livepatching. Add unwinder testing module. - Fixes for CALL_ON_STACK helper used for stack switching. - Fix unwinding from bpf code. - Fix getcpu and remove compat support in vdso code. - Fix address space control registers initialization. - Save KASLR offset for early dumps. - Handle new FILTERED_BY_HYPERVISOR reply code in crypto code. - Minor perf code cleanup and potential memory leak fix. - Add couple of error messages for corner cases during PCI device creation. * tag 's390-5.5-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (33 commits) s390: remove compat vdso code s390/livepatch: Implement reliable stack tracing for the consistency model s390/unwind: add stack pointer alignment sanity checks s390/unwind: filter out unreliable bogus %r14 s390/unwind: start unwinding from reliable state s390/test_unwind: add program check context tests s390/test_unwind: add irq context tests s390/test_unwind: print verbose unwinding results s390/test_unwind: add CALL_ON_STACK tests s390: fix register clobbering in CALL_ON_STACK s390/test_unwind: require that unwinding ended successfully s390/unwind: add a test for the internal API s390/unwind: always inline get_stack_pointer s390/pci: add error message on device number limit s390/pci: add error message for UID collision s390/cpum_sf: Check for SDBT and SDB consistency s390/cpum_sf: Use TEAR_REG macro consistantly s390/cpum_sf: Remove unnecessary check for pending SDBs s390/cpum_sf: Replace function name in debug statements s390/kaslr: store KASLR offset for early dumps ...
2019-12-01s390: remove compat vdso codeHeiko Carstens
Remove compat vdso code, since there is hardly any compat user space left. Still existing compat user space will have to use system calls instead. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2019-11-30Merge tag 'seccomp-v5.5-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull seccomp updates from Kees Cook: "Mostly this is implementing the new flag SECCOMP_USER_NOTIF_FLAG_CONTINUE, but there are cleanups as well. - implement SECCOMP_USER_NOTIF_FLAG_CONTINUE (Christian Brauner) - fixes to selftests (Christian Brauner) - remove secure_computing() argument (Christian Brauner)" * tag 'seccomp-v5.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: seccomp: rework define for SECCOMP_USER_NOTIF_FLAG_CONTINUE seccomp: fix SECCOMP_USER_NOTIF_FLAG_CONTINUE test seccomp: simplify secure_computing() seccomp: test SECCOMP_USER_NOTIF_FLAG_CONTINUE seccomp: add SECCOMP_USER_NOTIF_FLAG_CONTINUE seccomp: avoid overflow in implicit constant conversion
2019-11-30s390/livepatch: Implement reliable stack tracing for the consistency modelMiroslav Benes
The livepatch consistency model requires reliable stack tracing architecture support in order to work properly. In order to achieve this, two main issues have to be solved. First, reliable and consistent call chain backtracing has to be ensured. Second, the unwinder needs to be able to detect stack corruptions and return errors. The "zSeries ELF Application Binary Interface Supplement" says: "The stack pointer points to the first word of the lowest allocated stack frame. If the "back chain" is implemented this word will point to the previously allocated stack frame (towards higher addresses), except for the first stack frame, which shall have a back chain of zero (NULL). The stack shall grow downwards, in other words towards lower addresses." "back chain" is optional. GCC option -mbackchain enables it. Quoting Martin Schwidefsky [1]: "The compiler is called with the -mbackchain option, all normal C function will store the backchain in the function prologue. All functions written in assembler code should do the same, if you find one that does not we should fix that. The end result is that a task that *voluntarily* called schedule() should have a proper backchain at all times. Dependent on the use case this may or may not be enough. Asynchronous interrupts may stop the CPU at the beginning of a function, if kernel preemption is enabled we can end up with a broken backchain. The production kernels for IBM Z are all compiled *without* kernel preemption. So yes, we might get away without the objtool support. On a side-note, we do have a line item to implement the ORC unwinder for the kernel, that includes the objtool support. Once we have that we can drop the -mbackchain option for the kernel build. That gives us a nice little performance benefit. I hope that the change from backchain to the ORC unwinder will not be too hard to implement in the livepatch tools." Since -mbackchain is enabled by default when the kernel is compiled, the call chain backtracing should be currently ensured and objtool should not be necessary for livepatch purposes. Regarding the second issue, stack corruptions and non-reliable states have to be recognized by the unwinder. Mainly it means to detect preemption or page faults, the end of the task stack must be reached, return addresses must be valid text addresses and hacks like function graph tracing and kretprobes must be properly detected. Unwinding a running task's stack is not a problem, because there is a livepatch requirement that every checked task is blocked, except for the current task. Due to that, the implementation can be much simpler compared to the existing non-reliable infrastructure. We can consider a task's kernel/thread stack only and skip the other stacks. [1] 20180912121106.31ffa97c@mschwideX1 [not archived on lore.kernel.org] Link: https://lkml.kernel.org/r/20191106095601.29986-5-mbenes@suse.cz Reviewed-by: Heiko Carstens <heiko.carstens@de.ibm.com> Tested-by: Miroslav Benes <mbenes@suse.cz> Signed-off-by: Miroslav Benes <mbenes@suse.cz> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2019-11-30s390/unwind: add stack pointer alignment sanity checksMiroslav Benes
ABI requires SP to be aligned 8 bytes, report unwinding error otherwise. Link: https://lkml.kernel.org/r/20191106095601.29986-5-mbenes@suse.cz Reviewed-by: Heiko Carstens <heiko.carstens@de.ibm.com> Tested-by: Miroslav Benes <mbenes@suse.cz> Signed-off-by: Miroslav Benes <mbenes@suse.cz> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2019-11-30s390/unwind: filter out unreliable bogus %r14Vasily Gorbik
Currently unwinder unconditionally returns %r14 from the first frame pointed by %r15 from pt_regs. A task could be interrupted when a function already allocated this frame (if it needs it) for its callees or to store local variables. In that case this frame would contain random values from stack or values stored there by a callee. As we are only interested in %r14 to get potential return address, skip bogus return addresses which doesn't belong to kernel text. This helps to avoid duplicating filtering logic in unwider users, most of which use unwind_get_return_address() and would choke on bogus 0 address returned by it otherwise. Reviewed-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2019-11-30s390/unwind: start unwinding from reliable stateVasily Gorbik
A comment in arch/s390/include/asm/unwind.h says: > If 'first_frame' is not zero unwind_start skips unwind frames until it > reaches the specified stack pointer. > The end of the unwinding is indicated with unwind_done, this can be true > right after unwind_start, e.g. with first_frame!=0 that can not be found. > unwind_next_frame skips to the next frame. > Once the unwind is completed unwind_error() can be used to check if there > has been a situation where the unwinder could not correctly understand > the tasks call chain. With this change backchain unwinder now comply with behaviour described. As well as matches orc unwinder implementation. Now unwinder starts from reliable state, i.e. __unwind_start own stack frame is taken or stack frame generated by __switch_to (ksp) - both known to be valid. In case of pt_regs %r15 is better match for pt_regs psw, than sometimes random "sp" caller passed. Reviewed-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2019-11-30s390/test_unwind: print verbose unwinding resultsVasily Gorbik
Add stack name, sp and reliable information into test unwinding results. Also consider ip outside of kernel text as failure if the state is reported reliable. Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2019-11-30s390/cpum_sf: Check for SDBT and SDB consistencyThomas Richter
Each SBDT is located at a 4KB page and contains 512 entries. Each entry of a SDBT points to a SDB, a 4KB page containing sampled data. The last entry is a link to another SDBT page. When an event is created the function sequence executed is: __hw_perf_event_init() +--> allocate_buffers() +--> realloc_sampling_buffers() +---> alloc_sample_data_block() Both functions realloc_sampling_buffers() and alloc_sample_data_block() allocate pages and the allocation can fail. This is handled correctly and all allocated pages are freed and error -ENOMEM is returned to the top calling function. Finally the event is not created. Once the event has been created, the amount of initially allocated SDBT and SDB can be too low. This is detected during measurement interrupt handling, where the amount of lost samples is calculated. If the number of lost samples is too high considering sampling frequency and already allocated SBDs, the number of SDBs is enlarged during the next execution of cpumsf_pmu_enable(). If more SBDs need to be allocated, functions realloc_sampling_buffers() +---> alloc-sample_data_block() are called to allocate more pages. Page allocation may fail and the returned error is ignored. A SDBT and SDB setup already exists. However the modified SDBTs and SDBs might end up in a situation where the first entry of an SDBT does not point to an SDB, but another SDBT, basicly an SBDT without payload. This can not be handled by the interrupt handler, where an SDBT must have at least one entry pointing to an SBD. Add a check to avoid SDBTs with out payload (SDBs) when enlarging the buffer setup. Signed-off-by: Thomas Richter <tmricht@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2019-11-30s390/cpum_sf: Use TEAR_REG macro consistantlyThomas Richter
The macro TEAR_REG() saves the last used SDBT address in the perf_hw_event structure. This is also done by function hw_reset_registers() which is a one-liner and simply uses macro TEAR_REG(). Remove function hw_reset_registers(), which is only used one time and use macro TEAR_REG() instead. This macro is used throughout the code anyway. Signed-off-by: Thomas Richter <tmricht@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2019-11-30s390/cpum_sf: Remove unnecessary check for pending SDBsThomas Richter
In interrupt handling the function extend_sampling_buffer() is called after checking for a possibly extension. This check is not necessary as the called function itself performs this check again. Signed-off-by: Thomas Richter <tmricht@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2019-11-30s390/cpum_sf: Replace function name in debug statementsThomas Richter
Replace hard coded function names in debug statements by the "%s ...", __func__ construct suggested by checkpatch.pl script. Use consistent debug print format of the form variable blank value. Also add leading 0x for all hex values. Print allocated page addresses consistantly as hex numbers with leading 0x. Signed-off-by: Thomas Richter <tmricht@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2019-11-30s390/kaslr: store KASLR offset for early dumpsGerald Schaefer
The KASLR offset is added to vmcoreinfo in arch_crash_save_vmcoreinfo(), so that it can be found by crash when processing kernel dumps. However, arch_crash_save_vmcoreinfo() is called during a subsys_initcall, so if the kernel crashes before that, we have no vmcoreinfo and no KASLR offset. Fix this by storing the KASLR offset in the lowcore, where the vmcore_info pointer will be stored, and where it can be found by crash. In order to make it distinguishable from a real vmcore_info pointer, mark it as uneven (KASLR offset itself is aligned to THREAD_SIZE). When arch_crash_save_vmcoreinfo() stores the real vmcore_info pointer in the lowcore, it overwrites the KASLR offset. At that point, the KASLR offset is not yet added to vmcoreinfo, so we also need to move the mem_assign_absolute() behind the vmcoreinfo_append_str(). Fixes: b2d24b97b2a9 ("s390/kernel: add support for kernel address space layout randomization (KASLR)") Cc: <stable@vger.kernel.org> # v5.2+ Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2019-11-30s390/unwind: stop gracefully at task pt_regsVasily Gorbik
Consider reaching task pt_regs graceful unwinder termination. Task pt_regs itself never contains a valid state to which a task might return within the kernel context (user task pt_regs is a special case). Since we already avoid printing user task pt_regs and in most cases we don't even bother filling task pt_regs psw and r15 with something reasonable simply skip task pt_regs altogether. With this change unwind_error() now accurately represent whether unwinder reached task pt_regs successfully or failed along the way. Reviewed-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>