summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2017-05-17tracing/kprobes: Enforce kprobes teardown after testingThomas Gleixner
Enabling the tracer selftest triggers occasionally the warning in text_poke(), which warns when the to be modified page is not marked reserved. The reason is that the tracer selftest installs kprobes on functions marked __init for testing. These probes are removed after the tests, but that removal schedules the delayed kprobes_optimizer work, which will do the actual text poke. If the work is executed after the init text is freed, then the warning triggers. The bug can be reproduced reliably when the work delay is increased. Flush the optimizer work and wait for the optimizing/unoptimizing lists to become empty before returning from the kprobes tracer selftest. That ensures that all operations which were queued due to the probes removal have completed. Link: http://lkml.kernel.org/r/20170516094802.76a468bb@gandalf.local.home Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Cc: stable@vger.kernel.org Fixes: 6274de498 ("kprobes: Support delayed unoptimizing") Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-05-17tracing: Move postpone selftests to core from early_initcallSteven Rostedt
I hit the following lockdep splat when booting with ftrace selftests enabled, as well as CONFIG_PREEMPT and LOCKDEP. Testing dynamic ftrace ops #1: (1 0 1 0 0) (1 1 2 0 0) (2 1 3 0 169) (2 2 4 0 50066) ------------[ cut here ]------------ WARNING: CPU: 0 PID: 13 at kernel/rcu/srcutree.c:202 check_init_srcu_struct+0x60/0x70 Modules linked in: CPU: 0 PID: 13 Comm: rcu_tasks_kthre Not tainted 4.12.0-rc1-test+ #587 Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v02.05 05/07/2012 task: ffff880119628040 task.stack: ffffc900006a4000 RIP: 0010:check_init_srcu_struct+0x60/0x70 RSP: 0000:ffffc900006a7d98 EFLAGS: 00010246 RAX: 0000000000000246 RBX: 0000000000000000 RCX: 0000000000000000 RDX: ffff880119628040 RSI: 00000000ffffffff RDI: ffffffff81e5fb40 RBP: ffffc900006a7e20 R08: 00000023b403c000 R09: 0000000000000001 R10: ffffc900006a7e40 R11: 0000000000000000 R12: ffffffff81e5fb40 R13: 0000000000000286 R14: ffff880119628040 R15: ffffc900006a7e98 FS: 0000000000000000(0000) GS:ffff88011ea00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffff88011edff000 CR3: 0000000001e0f000 CR4: 00000000001406f0 Call Trace: ? __synchronize_srcu+0x6e/0x140 ? lock_acquire+0xdc/0x1d0 ? ktime_get_mono_fast_ns+0x5d/0xb0 synchronize_srcu+0x6f/0x110 ? synchronize_srcu+0x6f/0x110 rcu_tasks_kthread+0x20a/0x540 kthread+0x114/0x150 ? __rcu_read_unlock+0x70/0x70 ? kthread_create_on_node+0x40/0x40 ret_from_fork+0x2e/0x40 Code: f6 83 70 06 00 00 03 49 89 c5 74 0d be 01 00 00 00 48 89 df e8 42 fa ff ff 4c 89 ee 4c 89 e7 e8 b7 42 75 00 5b 41 5c 41 5d 5d c3 <0f> ff eb aa 66 90 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 ---[ end trace 5c3f4206ce50f6ac ]--- What happens is that the selftests include a creating of a dynamically allocated ftrace_ops, which requires the use of synchronize_rcu_tasks() which uses srcu, and triggers the above warning. It appears that synchronize_rcu_tasks() is not set up at early_initcall(), but it is at core_initcall(). By moving the tests down to that location works out properly. Link: http://lkml.kernel.org/r/20170517111435.7388c033@gandalf.local.home Acked-by: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-05-17cgroup: Prevent kill_css() from being called more than onceWaiman Long
The kill_css() function may be called more than once under the condition that the css was killed but not physically removed yet followed by the removal of the cgroup that is hosting the css. This patch prevents any harmm from being done when that happens. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org> Cc: stable@vger.kernel.org # v4.5+
2017-05-16genirq: Fix chained interrupt data orderingThomas Gleixner
irq_set_chained_handler_and_data() sets up the chained interrupt and then stores the handler data. That's racy against an immediate interrupt which gets handled before the store of the handler data happened. The handler will dereference a NULL pointer and crash. Cure it by storing handler data before installing the chained handler. Reported-by: Borislav Petkov <bp@alien8.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org
2017-05-15Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds
Pull networking fixes from David Miller: 1) Track alignment in BPF verifier so that legitimate programs won't be rejected on !CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS architectures. 2) Make tail calls work properly in arm64 BPF JIT, from Deniel Borkmann. 3) Make the configuration and semantics Generic XDP make more sense and don't allow both generic XDP and a driver specific instance to be active at the same time. Also from Daniel. 4) Don't crash on resume in xen-netfront, from Vitaly Kuznetsov. 5) Fix use-after-free in VRF driver, from Gao Feng. 6) Use netdev_alloc_skb_ip_align() to avoid unaligned IP headers in qca_spi driver, from Stefan Wahren. 7) Always run cleanup routines in BPF samples when we get SIGTERM, from Andy Gospodarek. 8) The mdio phy code should bring PHYs out of reset using the shared GPIO lines before invoking bus->reset(). From Florian Fainelli. 9) Some USB descriptor access endian fixes in various drivers from Johan Hovold. 10) Handle PAUSE advertisements properly in mlx5 driver, from Gal Pressman. 11) Fix reversed test in mlx5e_setup_tc(), from Saeed Mahameed. 12) Cure netdev leak in AF_PACKET when using timestamping via control messages. From Douglas Caetano dos Santos. 13) netcp doesn't support HWTSTAMP_FILTER_ALl, reject it. From Miroslav Lichvar. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (52 commits) ldmvsw: stop the clean timer at beginning of remove ldmvsw: unregistering netdev before disable hardware net: netcp: fix check of requested timestamping filter ipv6: avoid dad-failures for addresses with NODAD qed: Fix uninitialized data in aRFS infrastructure mdio: mux: fix device_node_continue.cocci warnings net/packet: fix missing net_device reference release net/mlx4_core: Use min3 to select number of MSI-X vectors macvlan: Fix performance issues with vlan tagged packets net: stmmac: use correct pointer when printing normal descriptor ring net/mlx5: Use underlay QPN from the root name space net/mlx5e: IPoIB, Only support regular RQ for now net/mlx5e: Fix setup TC ndo net/mlx5e: Fix ethtool pause support and advertise reporting net/mlx5e: Use the correct pause values for ethtool advertising vmxnet3: ensure that adapter is in proper state during force_close sfc: revert changes to NIC revision numbers net: ch9200: add missing USB-descriptor endianness conversions net: irda: irda-usb: fix firmware name on big-endian hosts net: dsa: mv88e6xxx: add default case to switch ...
2017-05-15sched/core: Call __schedule() from do_idle() without enabling preemptionSteven Rostedt (VMware)
I finally got around to creating trampolines for dynamically allocated ftrace_ops with using synchronize_rcu_tasks(). For users of the ftrace function hook callbacks, like perf, that allocate the ftrace_ops descriptor via kmalloc() and friends, ftrace was not able to optimize the functions being traced to use a trampoline because they would also need to be allocated dynamically. The problem is that they cannot be freed when CONFIG_PREEMPT is set, as there's no way to tell if a task was preempted on the trampoline. That was before Paul McKenney implemented synchronize_rcu_tasks() that would make sure all tasks (except idle) have scheduled out or have entered user space. While testing this, I triggered this bug: BUG: unable to handle kernel paging request at ffffffffa0230077 ... RIP: 0010:0xffffffffa0230077 ... Call Trace: schedule+0x5/0xe0 schedule_preempt_disabled+0x18/0x30 do_idle+0x172/0x220 What happened was that the idle task was preempted on the trampoline. As synchronize_rcu_tasks() ignores the idle thread, there's nothing that lets ftrace know that the idle task was preempted on a trampoline. The idle task shouldn't need to ever enable preemption. The idle task is simply a loop that calls schedule or places the cpu into idle mode. In fact, having preemption enabled is inefficient, because it can happen when idle is just about to call schedule anyway, which would cause schedule to be called twice. Once for when the interrupt came in and was returning back to normal context, and then again in the normal path that the idle loop is running in, which would be pointless, as it had already scheduled. The only reason schedule_preempt_disable() enables preemption is to be able to call sched_submit_work(), which requires preemption enabled. As this is a nop when the task is in the RUNNING state, and idle is always in the running state, there's no reason that idle needs to enable preemption. But that means it cannot use schedule_preempt_disable() as other callers of that function require calling sched_submit_work(). Adding a new function local to kernel/sched/ that allows idle to call the scheduler without enabling preemption, fixes the synchronize_rcu_tasks() issue, as well as removes the pointless spurious schedule calls caused by interrupts happening in the brief window where preemption is enabled just before it calls schedule. Reviewed: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20170414084809.3dacde2a@gandalf.local.home Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-14PM / hibernate: Declare variables as staticPushkar Jambhlekar
Fixing sparse warnings: 'symbol not declared. Should it be static?' Signed-off-by: Pushkar Jambhlekar <pushkar.iit@gmail.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-05-13pid_ns: Fix race between setns'ed fork() and zap_pid_ns_processes()Kirill Tkhai
Imagine we have a pid namespace and a task from its parent's pid_ns, which made setns() to the pid namespace. The task is doing fork(), while the pid namespace's child reaper is dying. We have the race between them: Task from parent pid_ns Child reaper copy_process() .. alloc_pid() .. .. zap_pid_ns_processes() .. disable_pid_allocation() .. read_lock(&tasklist_lock) .. iterate over pids in pid_ns .. kill tasks linked to pids .. read_unlock(&tasklist_lock) write_lock_irq(&tasklist_lock); .. attach_pid(p, PIDTYPE_PID); .. .. .. So, just created task p won't receive SIGKILL signal, and the pid namespace will be in contradictory state. Only manual kill will help there, but does the userspace care about this? I suppose, the most users just inject a task into a pid namespace and wait a SIGCHLD from it. The patch fixes the problem. It simply checks for (pid_ns->nr_hashed & PIDNS_HASH_ADDING) in copy_process(). We do it under the tasklist_lock, and can't skip PIDNS_HASH_ADDING as noted by Oleg: "zap_pid_ns_processes() does disable_pid_allocation() and then takes tasklist_lock to kill the whole namespace. Given that copy_process() checks PIDNS_HASH_ADDING under write_lock(tasklist) they can't race; if copy_process() takes this lock first, the new child will be killed, otherwise copy_process() can't miss the change in ->nr_hashed." If allocation is disabled, we just return -ENOMEM like it's made for such cases in alloc_pid(). v2: Do not move disable_pid_allocation(), do not introduce a new variable in copy_process() and simplify the patch as suggested by Oleg Nesterov. Account the problem with double irq enabling found by Eric W. Biederman. Fixes: c876ad768215 ("pidns: Stop pid allocation when init dies") Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> CC: Andrew Morton <akpm@linux-foundation.org> CC: Ingo Molnar <mingo@kernel.org> CC: Peter Zijlstra <peterz@infradead.org> CC: Oleg Nesterov <oleg@redhat.com> CC: Mike Rapoport <rppt@linux.vnet.ibm.com> CC: Michal Hocko <mhocko@suse.com> CC: Andy Lutomirski <luto@kernel.org> CC: "Eric W. Biederman" <ebiederm@xmission.com> CC: Andrei Vagin <avagin@openvz.org> CC: Cyrill Gorcunov <gorcunov@openvz.org> CC: Serge Hallyn <serge@hallyn.com> Cc: stable@vger.kernel.org Acked-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2017-05-13pid_ns: Sleep in TASK_INTERRUPTIBLE in zap_pid_ns_processesEric W. Biederman
The code can potentially sleep for an indefinite amount of time in zap_pid_ns_processes triggering the hung task timeout, and increasing the system average. This is undesirable. Sleep with a task state of TASK_INTERRUPTIBLE instead of TASK_UNINTERRUPTIBLE to remove these undesirable side effects. Apparently under heavy load this has been allowing Chrome to trigger the hung time task timeout error and cause ChromeOS to reboot. Reported-by: Vovo Yang <vovoy@google.com> Reported-by: Guenter Roeck <linux@roeck-us.net> Tested-by: Guenter Roeck <linux@roeck-us.net> Fixes: 6347e9009104 ("pidns: guarantee that the pidns init will be the last pidns process reaped") Cc: stable@vger.kernel.org Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2017-05-12gcov: support GCC 7.1Martin Liska
Starting from GCC 7.1, __gcov_exit is a new symbol expected to be implemented in a profiling runtime. [akpm@linux-foundation.org: coding-style fixes] [mliska@suse.cz: v2] Link: http://lkml.kernel.org/r/e63a3c59-0149-c97e-4084-20ca8f146b26@suse.cz Link: http://lkml.kernel.org/r/8c4084fa-3885-29fe-5fc4-0d4ca199c785@suse.cz Signed-off-by: Martin Liska <mliska@suse.cz> Acked-by: Peter Oberparleiter <oberpar@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-12time: delete current_fs_time()Deepa Dinamani
All uses of the current_fs_time() function have been replaced by other time interfaces. And, its use cases can be fulfilled by current_time() or ktime_get_* variants. Link: http://lkml.kernel.org/r/1491613030-11599-13-git-send-email-deepa.kernel@gmail.com Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Cc: John Stultz <john.stultz@linaro.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-12Merge branch 'perf-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf updates/fixes from Ingo Molnar: "Mostly tooling updates, but also two kernel fixes: a call chain handling robustness fix and an x86 PMU driver event definition fix" * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/callchain: Force USER_DS when invoking perf_callchain_user() tools build: Fixup sched_getcpu feature test perf tests kmod-path: Don't fail if compressed modules aren't supported perf annotate: Fix AArch64 comment char perf tools: Fix spelling mistakes perf/x86: Fix Broadwell-EP DRAM RAPL events perf config: Refactor a duplicated code for obtaining config file name perf symbols: Allow user probes on versioned symbols perf symbols: Accept symbols starting at address 0 tools lib string: Adopt prefixcmp() from perf and subcmd perf units: Move parse_tag_value() to units.[ch] perf ui gtk: Move gtk .so name to the only place where it is used perf tools: Move HAS_BOOL define to where perl headers are used perf memswap: Split the byteswap memory range wrappers from util.[ch] perf tools: Move event prototypes from util.h to event.h perf buildid: Move prototypes from util.h to build-id.h
2017-05-12Merge branch 'core-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull stackprotector fixlet from Ingo Molnar: "A single fix/enhancement to increase stackprotector canary randomness on 64-bit kernels with very little cost" * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: stackprotector: Increase the per-task stack canary's random range from 32 bits to 64 bits on 64-bit platforms
2017-05-11bpf: Handle multiple variable additions into packet pointers in verifier.David S. Miller
We must accumulate into reg->aux_off rather than use a plain assignment. Add a test for this situation to test_align. Reported-by: Alexei Starovoitov <ast@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-11bpf: Add strict alignment flag for BPF_PROG_LOAD.David S. Miller
Add a new field, "prog_flags", and an initial flag value BPF_F_STRICT_ALIGNMENT. When set, the verifier will enforce strict pointer alignment regardless of the setting of CONFIG_EFFICIENT_UNALIGNED_ACCESS. The verifier, in this mode, will also use a fixed value of "2" in place of NET_IP_ALIGN. This facilitates test cases that will exercise and validate this part of the verifier even when run on architectures where alignment doesn't matter. Signed-off-by: David S. Miller <davem@davemloft.net> Acked-by: Daniel Borkmann <daniel@iogearbox.net>
2017-05-11bpf: Do per-instruction state dumping in verifier when log_level > 1.David S. Miller
If log_level > 1, do a state dump every instruction and emit it in a more compact way (without a leading newline). This will facilitate more sophisticated test cases which inspect the verifier log for register state. Signed-off-by: David S. Miller <davem@davemloft.net> Acked-by: Daniel Borkmann <daniel@iogearbox.net>
2017-05-11bpf: Track alignment of register values in the verifier.David S. Miller
Currently if we add only constant values to pointers we can fully validate the alignment, and properly check if we need to reject the program on !CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS architectures. However, once an unknown value is introduced we only allow byte sized memory accesses which is too restrictive. Add logic to track the known minimum alignment of register values, and propagate this state into registers containing pointers. The most common paradigm that makes use of this new logic is computing the transport header using the IP header length field. For example: struct ethhdr *ep = skb->data; struct iphdr *iph = (struct iphdr *) (ep + 1); struct tcphdr *th; ... n = iph->ihl; th = ((void *)iph + (n * 4)); port = th->dest; The existing code will reject the load of th->dest because it cannot validate that the alignment is at least 2 once "n * 4" is added the the packet pointer. In the new code, the register holding "n * 4" will have a reg->min_align value of 4, because any value multiplied by 4 will be at least 4 byte aligned. (actually, the eBPF code emitted by the compiler in this case is most likely to use a shift left by 2, but the end result is identical) At the critical addition: th = ((void *)iph + (n * 4)); The register holding 'th' will start with reg->off value of 14. The pointer addition will transform that reg into something that looks like: reg->aux_off = 14 reg->aux_off_align = 4 Next, the verifier will look at the th->dest load, and it will see a load offset of 2, and first check: if (reg->aux_off_align % size) which will pass because aux_off_align is 4. reg_off will be computed: reg_off = reg->off; ... reg_off += reg->aux_off; plus we have off==2, and it will thus check: if ((NET_IP_ALIGN + reg_off + off) % size != 0) which evaluates to: if ((NET_IP_ALIGN + 14 + 2) % size != 0) On strict alignment architectures, NET_IP_ALIGN is 2, thus: if ((2 + 14 + 2) % size != 0) which passes. These pointer transformations and checks work regardless of whether the constant offset or the variable with known alignment is added first to the pointer register. Signed-off-by: David S. Miller <davem@davemloft.net> Acked-by: Daniel Borkmann <daniel@iogearbox.net>
2017-05-10Merge tag 'trace-v4.12-3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing fix from Steven Rostedt: "This is a trivial patch that changes a check for a cpumask from a NULL pointer to using cpumask_available(), which will do the check. This is because cpumasks when not allocated are always set, and clang complains about it" * tag 'trace-v4.12-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: tracing: Use cpumask_available() to check if cpumask variable may be used
2017-05-10Merge branch 'core-rcu-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull RCU updates from Ingo Molnar: "The main changes are: - Debloat RCU headers - Parallelize SRCU callback handling (plus overlapping patches) - Improve the performance of Tree SRCU on a CPU-hotplug stress test - Documentation updates - Miscellaneous fixes" * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (74 commits) rcu: Open-code the rcu_cblist_n_lazy_cbs() function rcu: Open-code the rcu_cblist_n_cbs() function rcu: Open-code the rcu_cblist_empty() function rcu: Separately compile large rcu_segcblist functions srcu: Debloat the <linux/rcu_segcblist.h> header srcu: Adjust default auto-expediting holdoff srcu: Specify auto-expedite holdoff time srcu: Expedite first synchronize_srcu() when idle srcu: Expedited grace periods with reduced memory contention srcu: Make rcutorture writer stalls print SRCU GP state srcu: Exact tracking of srcu_data structures containing callbacks srcu: Make SRCU be built by default srcu: Fix Kconfig botch when SRCU not selected rcu: Make non-preemptive schedule be Tasks RCU quiescent state srcu: Expedite srcu_schedule_cbs_snp() callback invocation srcu: Parallelize callback handling kvm: Move srcu_struct fields to end of struct kvm rcu: Fix typo in PER_RCU_NODE_PERIOD header comment rcu: Use true/false in assignment to bool rcu: Use bool value directly ...
2017-05-10Merge tag 'pm-extra-4.12-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull more power management updates from Rafael Wysocki: "These add new CPU IDs to a couple of drivers, fix a possible NULL pointer dereference in the cpuidle core, update DT-related things in the generic power domains framework and finally update the suspend/resume infrastructure to improve the handling of wakeups from suspend-to-idle. Specifics: - Add Intel Gemini Lake CPU IDs to the intel_idle and intel_rapl drivers (David Box). - Add a NULL pointer check to the cpuidle core to prevent it from crashing on platforms with incomplete cpuidle configuration (Fei Li). - Fix DT-related documentation in the generic power domains (genpd) framework and add a MAINTAINERS entry for DT-related material in genpd (Viresh Kumar). - Update the system suspend/resume infrastructure to improve the handling of aborts of suspend transitions in progress in the wakeup framework and rework the suspend-to-idle core loop to make it possible to filter out spurious wakeup events (specifically the ones coming from ACPI) without resuming all the way up to user space every time (Rafael Wysocki)" * tag 'pm-extra-4.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: ACPI / sleep: Ignore spurious SCI wakeups from suspend-to-idle PM / wakeup: Integrate mechanism to abort transitions in progress x86/intel_idle: add Gemini Lake support cpuidle: check dev before usage in cpuidle_use_deepest_state() powercap: intel_rapl: Add support for Gemini Lake PM / Domains: Add DT file to MAINTAINERS PM / Domains: Fix DT example
2017-05-10perf/callchain: Force USER_DS when invoking perf_callchain_user()Will Deacon
Perf can generate and record a user callchain in response to a synchronous request, such as a tracepoint firing. If this happens under set_fs(KERNEL_DS), then we can end up walking the user stack (and dereferencing/saving whatever we find there) without the protections usually afforded by checks such as access_ok. Rather than play whack-a-mole with each architecture's stack unwinding implementation, fix the root of the problem by ensuring that we force USER_DS when invoking perf_callchain_user from the perf core. Reported-by: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Will Deacon <will.deacon@arm.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-09Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds
Pull networking fixes from David Miller: 1) Fix multiqueue in stmmac driver on PCI, from Andy Shevchenko. 2) cdc_ncm doesn't actually fully zero out the padding area is allocates on TX, from Jim Baxter. 3) Don't leak map addresses in BPF verifier, from Daniel Borkmann. 4) If we randomize TCP timestamps, we have to do it everywhere including SYN cookies. From Eric Dumazet. 5) Fix "ethtool -S" crash in aquantia driver, from Pavel Belous. 6) Fix allocation size for ntp filter bitmap in bnxt_en driver, from Dan Carpenter. 7) Add missing memory allocation return value check to DSA loop driver, from Christophe Jaillet. 8) Fix XDP leak on driver unload in qed driver, from Suddarsana Reddy Kalluru. 9) Don't inherit MC list from parent inet connection sockets, another syzkaller spotted gem. Fix from Eric Dumazet. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (43 commits) dccp/tcp: do not inherit mc_list from parent qede: Split PF/VF ndos. qed: Correct doorbell configuration for !4Kb pages qed: Tell QM the number of tasks qed: Fix VF removal sequence qede: Fix XDP memory leak on unload net/mlx4_core: Reduce harmless SRIOV error message to debug level net/mlx4_en: Avoid adding steering rules with invalid ring net/mlx4_en: Change the error print to debug print drivers: net: wimax: i2400m: i2400m-usb: Use time_after for time comparison DECnet: Use container_of() for embedded struct Revert "ipv4: restore rt->fi for reference counting" net: mdio-mux: bcm-iproc: call mdiobus_free() in error path net: ethernet: ti: cpsw: adjust cpsw fifos depth for fullduplex flow control ipv6: reorder ip6_route_dev_notifier after ipv6_dev_notf net: cdc_ncm: Fix TX zero padding stmmac: pci: split out common_default_data() helper stmmac: pci: RX queue routing configuration stmmac: pci: TX and RX queue priority configuration stmmac: pci: set default number of rx and tx queues ...
2017-05-09Merge branches 'pm-domains', 'pm-cpuidle', 'pm-sleep' and 'powercap'Rafael J. Wysocki
* pm-domains: PM / Domains: Add DT file to MAINTAINERS PM / Domains: Fix DT example * pm-cpuidle: x86/intel_idle: add Gemini Lake support cpuidle: check dev before usage in cpuidle_use_deepest_state() * pm-sleep: ACPI / sleep: Ignore spurious SCI wakeups from suspend-to-idle PM / wakeup: Integrate mechanism to abort transitions in progress * powercap: powercap: intel_rapl: Add support for Gemini Lake
2017-05-09Merge branch 'work.misc' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull misc vfs updates from Al Viro: "Assorted bits and pieces from various people. No common topic in this pile, sorry" * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: fs/affs: add rename exchange fs/affs: add rename2 to prepare multiple methods Make stat/lstat/fstatat pass AT_NO_AUTOMOUNT to vfs_statx() fs: don't set *REFERENCED on single use objects fs: compat: Remove warning from COMPATIBLE_IOCTL remove pointless extern of atime_need_update_rcu() fs: completely ignore unknown open flags fs: add a VALID_OPEN_FLAGS fs: remove _submit_bh() fs: constify tree_descr arrays passed to simple_fill_super() fs: drop duplicate header percpu-rwsem.h fs/affs: bugfix: Write files greater than page size on OFS fs/affs: bugfix: enable writes on OFS disks fs/affs: remove node generation check fs/affs: import amigaffs.h fs/affs: bugfix: make symbolic links work again
2017-05-08Merge tag 'trace-v4.12-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull more tracing updates from Steven Rostedt: "These are three simple changes. The first one is just a switch from using strcpy() to strlcpy(). Someone thought that it may cause an overflow bug, but since it only copies comms into a pre-allocated array of TASK_COMM_LEN, and no comm should ever be bigger than that, nor not end with a nul character, this change is more of a safety precaution than fixing anything that is actually broken. The other two changes are simply cleaning and optimizing some code" * tag 'trace-v4.12-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: ftrace: Simplify ftrace_match_record() even more ftrace: Remove an unneeded condition tracing: Use strlcpy() instead of strcpy() in __trace_find_cmdline()
2017-05-08Merge tag 'pci-v4.12-changes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci Pull PCI updates from Bjorn Helgaas: - add framework for supporting PCIe devices in Endpoint mode (Kishon Vijay Abraham I) - use non-postable PCI config space mappings when possible (Lorenzo Pieralisi) - clean up and unify mmap of PCI BARs (David Woodhouse) - export and unify Function Level Reset support (Christoph Hellwig) - avoid FLR for Intel 82579 NICs (Sasha Neftin) - add pci_request_irq() and pci_free_irq() helpers (Christoph Hellwig) - short-circuit config access failures for disconnected devices (Keith Busch) - remove D3 sleep delay when possible (Adrian Hunter) - freeze PME scan before suspending devices (Lukas Wunner) - stop disabling MSI/MSI-X in pci_device_shutdown() (Prarit Bhargava) - disable boot interrupt quirk for ASUS M2N-LR (Stefan Assmann) - add arch-specific alignment control to improve device passthrough by avoiding multiple BARs in a page (Yongji Xie) - add sysfs sriov_drivers_autoprobe to control VF driver binding (Bodong Wang) - allow slots below PCI-to-PCIe "reverse bridges" (Bjorn Helgaas) - fix crashes when unbinding host controllers that don't support removal (Brian Norris) - add driver for MicroSemi Switchtec management interface (Logan Gunthorpe) - add driver for Faraday Technology FTPCI100 host bridge (Linus Walleij) - add i.MX7D support (Andrey Smirnov) - use generic MSI support for Aardvark (Thomas Petazzoni) - make Rockchip driver modular (Brian Norris) - advertise 128-byte Read Completion Boundary support for Rockchip (Shawn Lin) - advertise PCI_EXP_LNKSTA_SLC for Rockchip root port (Shawn Lin) - convert atomic_t to refcount_t in HV driver (Elena Reshetova) - add CPU IRQ affinity in HV driver (K. Y. Srinivasan) - fix PCI bus removal in HV driver (Long Li) - add support for ThunderX2 DMA alias topology (Jayachandran C) - add ThunderX pass2.x 2nd node MCFG quirk (Tomasz Nowicki) - add ITE 8893 bridge DMA alias quirk (Jarod Wilson) - restrict Cavium ACS quirk only to CN81xx/CN83xx/CN88xx devices (Manish Jaggi) * tag 'pci-v4.12-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (146 commits) PCI: Don't allow unbinding host controllers that aren't prepared ARM: DRA7: clockdomain: Change the CLKTRCTRL of CM_PCIE_CLKSTCTRL to SW_WKUP MAINTAINERS: Add PCI Endpoint maintainer Documentation: PCI: Add userguide for PCI endpoint test function tools: PCI: Add sample test script to invoke pcitest tools: PCI: Add a userspace tool to test PCI endpoint Documentation: misc-devices: Add Documentation for pci-endpoint-test driver misc: Add host side PCI driver for PCI test function device PCI: Add device IDs for DRA74x and DRA72x dt-bindings: PCI: dra7xx: Add DT bindings to enable unaligned access PCI: dwc: dra7xx: Workaround for errata id i870 dt-bindings: PCI: dra7xx: Add DT bindings for PCI dra7xx EP mode PCI: dwc: dra7xx: Add EP mode support PCI: dwc: dra7xx: Facilitate wrapper and MSI interrupts to be enabled independently dt-bindings: PCI: Add DT bindings for PCI designware EP mode PCI: dwc: designware: Add EP mode support Documentation: PCI: Add binding documentation for pci-test endpoint function ixgbe: Use pcie_flr() instead of duplicating it IB/hfi1: Use pcie_flr() instead of duplicating it PCI: imx6: Fix spelling mistake: "contol" -> "control" ...
2017-05-08Merge tag 'tty-4.12-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty Pull tty/serial updates from Greg KH: "Here is the "big" TTY/Serial patch updates for 4.12-rc1 Not a lot of new things here, the normal number of serial driver updates and additions, tiny bugs fixed, and some core files split up to make future changes a bit easier for Nicolas's "tiny-tty" work. All of these have been in linux-next for a while" * tag 'tty-4.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (62 commits) serial: small Makefile reordering tty: split job control support into a file of its own tty: move baudrate handling code to a file of its own console: move console_init() out of tty_io.c serial: 8250_early: Add earlycon support for Palmchip UART tty: pl011: use "qdf2400_e44" as the earlycon name for QDF2400 E44 vt: make mouse selection of non-ASCII consistent vt: set mouse selection word-chars to gpm's default imx-serial: Reduce RX DMA startup latency when opening for reading serial: omap: suspend device on probe errors serial: omap: fix runtime-pm handling on unbind tty: serial: omap: add UPF_BOOT_AUTOCONF flag for DT init serial: samsung: Remove useless spinlock serial: samsung: Add missing checks for dma_map_single failure serial: samsung: Use right device for DMA-mapping calls serial: imx: setup DCEDTE early and ensure DCD and RI irqs to be off tty: fix comment typo s/repsonsible/responsible/ tty: amba-pl011: Fix spurious TX interrupts serial: xuartps: Enable clocks in the pm disable case also serial: core: Re-use struct uart_port {name} field ...
2017-05-08tracing: Use cpumask_available() to check if cpumask variable may be usedMatthias Kaehlcke
This fixes the following clang warning: kernel/trace/trace.c:3231:12: warning: address of array 'iter->started' will always evaluate to 'true' [-Wpointer-bool-conversion] if (iter->started) Link: http://lkml.kernel.org/r/20170421234110.117075-1-mka@chromium.org Signed-off-by: Matthias Kaehlcke <mka@chromium.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-05-08trace: make trace_hwlat timestamp y2038 safeDeepa Dinamani
struct timespec is not y2038 safe on 32 bit machines and needs to be replaced by struct timespec64 in order to represent times beyond year 2038 on such machines. Fix all the timestamp representation in struct trace_hwlat and all the corresponding implementations. Link: http://lkml.kernel.org/r/1491613030-11599-3-git-send-email-deepa.kernel@gmail.com Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com> Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Cc: Ingo Molnar <mingo@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-08kernel/power/snapshot.c: use set_memory.h headerLaura Abbott
set_memory_* functions have moved to set_memory.h. Switch to this explicitly. Link: http://lkml.kernel.org/r/1488920133-27229-13-git-send-email-labbott@redhat.com Signed-off-by: Laura Abbott <labbott@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-08kernel/module.c: use set_memory.h headerLaura Abbott
set_memory_* functions have moved to set_memory.h. Switch to this explicitly. Link: http://lkml.kernel.org/r/1488920133-27229-12-git-send-email-labbott@redhat.com Signed-off-by: Laura Abbott <labbott@redhat.com> Acked-by: Jessica Yu <jeyu@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-08mm, vmalloc: use __GFP_HIGHMEM implicitlyMichal Hocko
__vmalloc* allows users to provide gfp flags for the underlying allocation. This API is quite popular $ git grep "=[[:space:]]__vmalloc\|return[[:space:]]*__vmalloc" | wc -l 77 The only problem is that many people are not aware that they really want to give __GFP_HIGHMEM along with other flags because there is really no reason to consume precious lowmemory on CONFIG_HIGHMEM systems for pages which are mapped to the kernel vmalloc space. About half of users don't use this flag, though. This signals that we make the API unnecessarily too complex. This patch simply uses __GFP_HIGHMEM implicitly when allocating pages to be mapped to the vmalloc space. Current users which add __GFP_HIGHMEM are simplified and drop the flag. Link: http://lkml.kernel.org/r/20170307141020.29107-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Cristopher Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-08kcov: simplify interrupt checkDmitry Vyukov
in_interrupt() semantics are confusing and wrong for most users as it also returns true when bh is disabled. Thus we open coded a proper check for interrupts in __sanitizer_cov_trace_pc() with a lengthy explanatory comment. Use the new in_task() predicate instead. Link: http://lkml.kernel.org/r/20170321091026.139655-1-dvyukov@google.com Signed-off-by: Dmitry Vyukov <dvyukov@google.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: James Morse <james.morse@arm.com> Cc: Alexander Popov <alex.popov@linux.com> Cc: Andrey Konovalov <andreyknvl@google.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-08taskstats: add e/u/stime for TGID commandZhang Xiao
The elapsed time, user CPU time and system CPU time for the thread group status request are presently left at zero. Fill these in. [akpm@linux-foundation.org: run ktime_get_ns() a single time] [akpm@linux-foundation.org: include linux/sched/cputime.h for task_cputime()] Link: http://lkml.kernel.org/r/1488508424-12322-1-git-send-email-xiao.zhang@windriver.com Signed-off-by: Zhang Xiao <xiao.zhang@windriver.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-08pidns: expose task pid_ns_for_children to userspaceKirill Tkhai
pid_ns_for_children set by a task is known only to the task itself, and it's impossible to identify it from outside. It's a big problem for checkpoint/restore software like CRIU, because it can't correctly handle tasks, that do setns(CLONE_NEWPID) in proccess of their work. This patch solves the problem, and it exposes pid_ns_for_children to ns directory in standard way with the name "pid_for_children": ~# ls /proc/5531/ns -l | grep pid lrwxrwxrwx 1 root root 0 Jan 14 16:38 pid -> pid:[4026531836] lrwxrwxrwx 1 root root 0 Jan 14 16:38 pid_for_children -> pid:[4026532286] Link: http://lkml.kernel.org/r/149201123914.6007.2187327078064239572.stgit@localhost.localdomain Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Andrei Vagin <avagin@virtuozzo.com> Cc: Andreas Gruenbacher <agruenba@redhat.com> Cc: Kees Cook <keescook@chromium.org> Cc: Michael Kerrisk <mtk.manpages@googlemail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Paul Moore <paul@paul-moore.com> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Ingo Molnar <mingo@kernel.org> Cc: Serge Hallyn <serge@hallyn.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-08pidns: disable pid allocation if pid_ns_prepare_proc() is failed in alloc_pid()Kirill Tkhai
alloc_pidmap() advances pid_namespace::last_pid. When first pid allocation fails, then next created process will have pid 2 and pid_ns_prepare_proc() won't be called. So, pid_namespace::proc_mnt will never be initialized (not to mention that there won't be a child reaper). I saw crash stack of such case on kernel 3.10: BUG: unable to handle kernel NULL pointer dereference at (null) IP: proc_flush_task+0x8f/0x1b0 Call Trace: release_task+0x3f/0x490 wait_consider_task.part.10+0x7ff/0xb00 do_wait+0x11f/0x280 SyS_wait4+0x7d/0x110 We may fix this by restore of last_pid in 0 or by prohibiting of futher allocations. Since there was a similar issue in Oleg Nesterov's commit 314a8ad0f18a ("pidns: fix free_pid() to handle the first fork failure"). and it was fixed via prohibiting allocation, let's follow this way, and do the same. Link: http://lkml.kernel.org/r/149201021004.4863.6762095011554287922.stgit@localhost.localdomain Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Andrei Vagin <avagin@virtuozzo.com> Cc: Andreas Gruenbacher <agruenba@redhat.com> Cc: Kees Cook <keescook@chromium.org> Cc: Michael Kerrisk <mtk.manpages@googlemail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Paul Moore <paul@paul-moore.com> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Ingo Molnar <mingo@kernel.org> Cc: Serge Hallyn <serge@hallyn.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-08ia64: reuse append_elf_note() and final_note() functionsHari Bathini
Get rid of multiple definitions of append_elf_note() & final_note() functions. Reuse these functions compiled under CONFIG_CRASH_CORE Also, define Elf_Word and use it instead of generic u32 or the more specific Elf64_Word. Link: http://lkml.kernel.org/r/149035342324.6881.11667840929850361402.stgit@hbathini.in.ibm.com Signed-off-by: Hari Bathini <hbathini@linux.vnet.ibm.com> Acked-by: Dave Young <dyoung@redhat.com> Acked-by: Tony Luck <tony.luck@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-08crash: move crashkernel parsing and vmcore related code under CONFIG_CRASH_COREHari Bathini
Patch series "kexec/fadump: remove dependency with CONFIG_KEXEC and reuse crashkernel parameter for fadump", v4. Traditionally, kdump is used to save vmcore in case of a crash. Some architectures like powerpc can save vmcore using architecture specific support instead of kexec/kdump mechanism. Such architecture specific support also needs to reserve memory, to be used by dump capture kernel. crashkernel parameter can be a reused, for memory reservation, by such architecture specific infrastructure. This patchset removes dependency with CONFIG_KEXEC for crashkernel parameter and vmcoreinfo related code as it can be reused without kexec support. Also, crashkernel parameter is reused instead of fadump_reserve_mem to reserve memory for fadump. The first patch moves crashkernel parameter parsing and vmcoreinfo related code under CONFIG_CRASH_CORE instead of CONFIG_KEXEC_CORE. The second patch reuses the definitions of append_elf_note() & final_note() functions under CONFIG_CRASH_CORE in IA64 arch code. The third patch removes dependency on CONFIG_KEXEC for firmware-assisted dump (fadump) in powerpc. The next patch reuses crashkernel parameter for reserving memory for fadump, instead of the fadump_reserve_mem parameter. This has the advantage of using all syntaxes crashkernel parameter supports, for fadump as well. The last patch updates fadump kernel documentation about use of crashkernel parameter. This patch (of 5): Traditionally, kdump is used to save vmcore in case of a crash. Some architectures like powerpc can save vmcore using architecture specific support instead of kexec/kdump mechanism. Such architecture specific support also needs to reserve memory, to be used by dump capture kernel. crashkernel parameter can be a reused, for memory reservation, by such architecture specific infrastructure. But currently, code related to vmcoreinfo and parsing of crashkernel parameter is built under CONFIG_KEXEC_CORE. This patch introduces CONFIG_CRASH_CORE and moves the above mentioned code under this config, allowing code reuse without dependency on CONFIG_KEXEC. There is no functional change with this patch. Link: http://lkml.kernel.org/r/149035338104.6881.4550894432615189948.stgit@hbathini.in.ibm.com Signed-off-by: Hari Bathini <hbathini@linux.vnet.ibm.com> Acked-by: Dave Young <dyoung@redhat.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-08fork: free vmapped stacks in cache when cpus are offlineHoeun Ryu
Using virtually mapped stack, kernel stacks are allocated via vmalloc. In the current implementation, two stacks per cpu can be cached when tasks are freed and the cached stacks are used again in task duplications. But the cached stacks may remain unfreed even when cpu are offline. By adding a cpu hotplug callback to free the cached stacks when a cpu goes offline, the pages of the cached stacks are not wasted. Link: http://lkml.kernel.org/r/1487076043-17802-1-git-send-email-hoeun.ryu@gmail.com Signed-off-by: Hoeun Ryu <hoeun.ryu@gmail.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Mateusz Guzik <mguzik@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-08kernel/hung_task.c: defer showing held locksTetsuo Handa
When I was running my testcase which may block hundreds of threads on fs locks, I got lockup due to output from debug_show_all_locks() added by commit b2d4c2edb2e4 ("locking/hung_task: Show all locks"). For example, if 1000 threads were blocked in TASK_UNINTERRUPTIBLE state and 500 out of 1000 threads hold some lock, debug_show_all_locks() from for_each_process_thread() loop will report locks held by 500 threads for 1000 times. This is a too much noise. In order to make sure rcu_lock_break() is called frequently, we should avoid calling debug_show_all_locks() from for_each_process_thread() loop because debug_show_all_locks() effectively calls for_each_process_thread() loop. Let's defer calling debug_show_all_locks() till before panic() or leaving for_each_process_thread() loop. Link: http://lkml.kernel.org/r/1489296834-60436-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reviewed-by: Vegard Nossum <vegard.nossum@oracle.com> Cc: Ingo Molnar <mingo@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-08proc/sysctl: fix the int overflow for jiffies conversionGao Feng
do_proc_dointvec_jiffies_conv() uses LONG_MAX/HZ as the max value to avoid overflow. But actually the *valp is int type, so it still causes overflow. For example, echo 2147483647 > ./sys/net/ipv4/tcp_keepalive_time Then, cat ./sys/net/ipv4/tcp_keepalive_time The output is "-1", it is not expected. Now use INT_MAX/HZ as the max value instead LONG_MAX/HZ to fix it. Link: http://lkml.kernel.org/r/1490109532-9228-1-git-send-email-fgao@ikuai8.com Signed-off-by: Gao Feng <fgao@ikuai8.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-08bpf: don't let ldimm64 leak map addresses on unprivilegedDaniel Borkmann
The patch fixes two things at once: 1) It checks the env->allow_ptr_leaks and only prints the map address to the log if we have the privileges to do so, otherwise it just dumps 0 as we would when kptr_restrict is enabled on %pK. Given the latter is off by default and not every distro sets it, I don't want to rely on this, hence the 0 by default for unprivileged. 2) Printing of ldimm64 in the verifier log is currently broken in that we don't print the full immediate, but only the 32 bit part of the first insn part for ldimm64. Thus, fix this up as well; it's okay to access, since we verified all ldimm64 earlier already (including just constants) through replace_map_fd_with_map_ptr(). Fixes: 1be7f75d1668 ("bpf: enable non-root eBPF programs") Fixes: cbd357008604 ("bpf: verifier (add ability to receive verification log)") Reported-by: Jann Horn <jannh@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-05cpufreq: schedutil: use now as reference when aggregating shared policy requestsJuri Lelli
Currently, sugov_next_freq_shared() uses last_freq_update_time as a reference to decide when to start considering CPU contributions as stale. However, since last_freq_update_time is set by the last CPU that issued a frequency transition, this might cause problems in certain cases. In practice, the detection of stale utilization values fails whenever the CPU with such values was the last to update the policy. For example (and please note again that the SCHED_CPUFREQ_RT flag is not the problem here, but only the detection of after how much time that flag has to be considered stale), suppose a policy with 2 CPUs: CPU0 | CPU1 | | RT task scheduled | SCHED_CPUFREQ_RT is set | CPU1->last_update = now | freq transition to max | last_freq_update_time = now | more than TICK_NSEC nsecs | a small CFS wakes up | CPU0->last_update = now1 | delta_ns(CPU0) < TICK_NSEC* | CPU0's util is considered | delta_ns(CPU1) = | last_freq_update_time - | CPU1->last_update = 0 | < TICK_NSEC | CPU1 is still considered | CPU1->SCHED_CPUFREQ_RT is set | we stay at max (until CPU1 | exits from idle) | * delta_ns is actually negative as now1 > last_freq_update_time While last_freq_update_time is a sensible reference for rate limiting, it doesn't seem to be useful for working around stale CPU states. Fix the problem by always considering now (time) as the reference for deciding when CPUs have stale contributions. Signed-off-by: Juri Lelli <juri.lelli@arm.com> Acked-by: Vincent Guittot <vincent.guittot@linaro.org> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-05-05ACPI / sleep: Ignore spurious SCI wakeups from suspend-to-idleRafael J. Wysocki
The ACPI SCI (System Control Interrupt) is set up as a wakeup IRQ during suspend-to-idle transitions and, consequently, any events signaled through it wake up the system from that state. However, on some systems some of the events signaled via the ACPI SCI while suspended to idle should not cause the system to wake up. In fact, quite often they should just be discarded. Arguably, systems should not resume entirely on such events, but in order to decide which events really should cause the system to resume and which are spurious, it is necessary to resume up to the point when ACPI SCIs are actually handled and processed, which is after executing dpm_resume_noirq() in the system resume path. For this reasons, add a loop around freeze_enter() in which the platforms can process events signaled via multiplexed IRQ lines like the ACPI SCI and add suspend-to-idle hooks that can be used for this purpose to struct platform_freeze_ops. In the ACPI case, the ->wake hook is used for checking if the SCI has triggered while suspended and deferring the interrupt-induced system wakeup until the events signaled through it are actually processed sufficiently to decide whether or not the system should resume. In turn, the ->sync hook allows all of the relevant event queues to be flushed so as to prevent events from being missed due to race conditions. In addition to that, some ACPI code processing wakeup events needs to be modified to use the "hard" version of wakeup triggers, so that it will cause a system resume to happen on device-induced wakeup events even if the "soft" mechanism to prevent the system from suspending is not enabled (that also helps to catch device-induced wakeup events occurring during suspend transitions in progress). Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-05-05Merge tag 'powerpc-4.12-1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc updates from Michael Ellerman: "Highlights include: - Larger virtual address space on 64-bit server CPUs. By default we use a 128TB virtual address space, but a process can request access to the full 512TB by passing a hint to mmap(). - Support for the new Power9 "XIVE" interrupt controller. - TLB flushing optimisations for the radix MMU on Power9. - Support for CAPI cards on Power9, using the "Coherent Accelerator Interface Architecture 2.0". - The ability to configure the mmap randomisation limits at build and runtime. - Several small fixes and cleanups to the kprobes code, as well as support for KPROBES_ON_FTRACE. - Major improvements to handling of system reset interrupts, correctly treating them as NMIs, giving them a dedicated stack and using a new hypervisor call to trigger them, all of which should aid debugging and robustness. - Many fixes and other minor enhancements. Thanks to: Alastair D'Silva, Alexey Kardashevskiy, Alistair Popple, Andrew Donnellan, Aneesh Kumar K.V, Anshuman Khandual, Anton Blanchard, Balbir Singh, Ben Hutchings, Benjamin Herrenschmidt, Bhupesh Sharma, Chris Packham, Christian Zigotzky, Christophe Leroy, Christophe Lombard, Daniel Axtens, David Gibson, Gautham R. Shenoy, Gavin Shan, Geert Uytterhoeven, Guilherme G. Piccoli, Hamish Martin, Hari Bathini, Kees Cook, Laurent Dufour, Madhavan Srinivasan, Mahesh J Salgaonkar, Mahesh Salgaonkar, Masami Hiramatsu, Matt Brown, Matthew R. Ochs, Michael Neuling, Naveen N. Rao, Nicholas Piggin, Oliver O'Halloran, Pan Xinhui, Paul Mackerras, Rashmica Gupta, Russell Currey, Sukadev Bhattiprolu, Thadeu Lima de Souza Cascardo, Tobin C. Harding, Tyrel Datwyler, Uma Krishnan, Vaibhav Jain, Vipin K Parashar, Yang Shi" * tag 'powerpc-4.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (214 commits) powerpc/64s: Power9 has no LPCR[VRMASD] field so don't set it powerpc/powernv: Fix TCE kill on NVLink2 powerpc/mm/radix: Drop support for CPUs without lockless tlbie powerpc/book3s/mce: Move add_taint() later in virtual mode powerpc/sysfs: Move #ifdef CONFIG_HOTPLUG_CPU out of the function body powerpc/smp: Document irq enable/disable after migrating IRQs powerpc/mpc52xx: Don't select user-visible RTAS_PROC powerpc/powernv: Document cxl dependency on special case in pnv_eeh_reset() powerpc/eeh: Clean up and document event handling functions powerpc/eeh: Avoid use after free in eeh_handle_special_event() cxl: Mask slice error interrupts after first occurrence cxl: Route eeh events to all drivers in cxl_pci_error_detected() cxl: Force context lock during EEH flow powerpc/64: Allow CONFIG_RELOCATABLE if COMPILE_TEST powerpc/xmon: Teach xmon oops about radix vectors powerpc/mm/hash: Fix off-by-one in comment about kernel contexts ids powerpc/pseries: Enable VFIO powerpc/powernv: Fix iommu table size calculation hook for small tables powerpc/powernv: Check kzalloc() return value in pnv_pci_table_alloc powerpc: Add arch/powerpc/tools directory ...
2017-05-05Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull namespace updates from Eric Biederman: "This is a set of small fixes that were mostly stumbled over during more significant development. This proc fix and the fix to posix-timers are the most significant of the lot. There is a lot of good development going on but unfortunately it didn't quite make the merge window" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: proc: Fix unbalanced hard link numbers signal: Make kill_proc_info static rlimit: Properly call security_task_setrlimit signal: Remove unused definition of sig_user_definied ia64: Remove unused IA64_TASK_SIGHAND_OFFSET and IA64_SIGHAND_SIGLOCK_OFFSET ipc: Remove unused declaration of recompute_msgmni posix-timers: Correct sanity check in posix_cpu_nsleep sysctl: Remove dead register_sysctl_root
2017-05-05stackprotector: Increase the per-task stack canary's random range from 32 ↵Daniel Micay
bits to 64 bits on 64-bit platforms The stack canary is an 'unsigned long' and should be fully initialized to random data rather than only 32 bits of random data. Signed-off-by: Daniel Micay <danielmicay@gmail.com> Acked-by: Arjan van de Ven <arjan@linux.intel.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Kees Cook <keescook@chromium.org> Cc: Arjan van Ven <arjan@linux.intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: kernel-hardening@lists.openwall.com Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/20170504133209.3053-1-danielmicay@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-03ftrace: Simplify ftrace_match_record() even moreSteven Rostedt (VMware)
Dan Carpenter sent a patch to remove a check in ftrace_match_record() because the logic of the code made the check redundant. I looked deeper into the code, and made the following logic table, with the three variables and the result of the original code. modname mod_matches exclude_mod result ------- ----------- ----------- ------ 0 0 0 return 0 0 0 1 func_match 0 1 * < cannot exist > 1 0 0 return 0 1 0 1 func_match 1 1 0 func_match 1 1 1 return 0 Notice that when mod_matches == exclude mod, the result is always to return 0, and when mod_matches != exclude_mod, then the result is to test the function. This means we only need test if mod_matches is equal to exclude_mod. Cc: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-05-03ftrace: Remove an unneeded conditionDan Carpenter
We know that "mod_matches" is true here so there is no need to check again. Link: http://lkml.kernel.org/r/20170331152130.GA4947@mwanda Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-05-03tracing: Use strlcpy() instead of strcpy() in __trace_find_cmdline()Amey Telawane
Strcpy is inherently not safe, and strlcpy() should be used instead. __trace_find_cmdline() uses strcpy() because the comms saved must have a terminating nul character, but it doesn't hurt to add the extra protection of using strlcpy() instead of strcpy(). Link: http://lkml.kernel.org/r/1493806274-13936-1-git-send-email-amit.pundir@linaro.org Signed-off-by: Amey Telawane <ameyt@codeaurora.org> [AmitP: Cherry-picked this commit from CodeAurora kernel/msm-3.10 https://source.codeaurora.org/quic/la/kernel/msm-3.10/commit/?id=2161ae9a70b12cf18ac8e5952a20161ffbccb477] Signed-off-by: Amit Pundir <amit.pundir@linaro.org> [ Updated change log and removed the "- 1" from len parameter ] Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>