summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2015-12-29bpf: hash: use per-bucket spinlocktom.leiming@gmail.com
Both htab_map_update_elem() and htab_map_delete_elem() can be called from eBPF program, and they may be in kernel hot path, so it isn't efficient to use a per-hashtable lock in this two helpers. The per-hashtable spinlock is used for protecting bucket's hlist, and per-bucket lock is just enough. This patch converts the per-hashtable lock into per-bucket spinlock, so that contention can be decreased a lot. Signed-off-by: Ming Lei <tom.leiming@gmail.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-29bpf: hash: move select_bucket() out of htab's spinlocktom.leiming@gmail.com
The spinlock is just used for protecting the per-bucket hlist, so it isn't needed for selecting bucket. Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-29bpf: hash: use atomic counttom.leiming@gmail.com
Preparing for removing global per-hashtable lock, so the counter need to be defined as aotmic_t first. Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-18bpf: move clearing of A/X into classic to eBPF migration prologueDaniel Borkmann
Back in the days where eBPF (or back then "internal BPF" ;->) was not exposed to user space, and only the classic BPF programs internally translated into eBPF programs, we missed the fact that for classic BPF A and X needed to be cleared. It was fixed back then via 83d5b7ef99c9 ("net: filter: initialize A and X registers"), and thus classic BPF specifics were added to the eBPF interpreter core to work around it. This added some confusion for JIT developers later on that take the eBPF interpreter code as an example for deriving their JIT. F.e. in f75298f5c3fe ("s390/bpf: clear correct BPF accumulator register"), at least X could leak stack memory. Furthermore, since this is only needed for classic BPF translations and not for eBPF (verifier takes care that read access to regs cannot be done uninitialized), more complexity is added to JITs as they need to determine whether they deal with migrations or native eBPF where they can just omit clearing A/X in their prologue and thus reduce image size a bit, see f.e. cde66c2d88da ("s390/bpf: Only clear A and X for converted BPF programs"). In other cases (x86, arm64), A and X is being cleared in the prologue also for eBPF case, which is unnecessary. Lets move this into the BPF migration in bpf_convert_filter() where it actually belongs as long as the number of eBPF JITs are still few. It can thus be done generically; allowing us to remove the quirk from __bpf_prog_run() and to slightly reduce JIT image size in case of eBPF, while reducing code duplication on this matter in current(/future) eBPF JITs. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Michael Holzheu <holzheu@linux.vnet.ibm.com> Tested-by: Michael Holzheu <holzheu@linux.vnet.ibm.com> Cc: Zi Shen Lim <zlim.lnx@gmail.com> Cc: Yang Shi <yang.shi@linaro.org> Acked-by: Yang Shi <yang.shi@linaro.org> Acked-by: Zi Shen Lim <zlim.lnx@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-17Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Conflicts: drivers/net/geneve.c Here we had an overlapping change, where in 'net' the extraneous stats bump was being removed whilst in 'net-next' the final argument to udp_tunnel6_xmit_skb() was being changed. Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-17locking/osq: Fix ordering of node initialisation in osq_lockWill Deacon
The Cavium guys reported a soft lockup on their arm64 machine, caused by commit c55a6ffa6285 ("locking/osq: Relax atomic semantics"): mutex_optimistic_spin+0x9c/0x1d0 __mutex_lock_slowpath+0x44/0x158 mutex_lock+0x54/0x58 kernfs_iop_permission+0x38/0x70 __inode_permission+0x88/0xd8 inode_permission+0x30/0x6c link_path_walk+0x68/0x4d4 path_openat+0xb4/0x2bc do_filp_open+0x74/0xd0 do_sys_open+0x14c/0x228 SyS_openat+0x3c/0x48 el0_svc_naked+0x24/0x28 This is because in osq_lock we initialise the node for the current CPU: node->locked = 0; node->next = NULL; node->cpu = curr; and then publish the current CPU in the lock tail: old = atomic_xchg_acquire(&lock->tail, curr); Once the update to lock->tail is visible to another CPU, the node is then live and can be both read and updated by concurrent lockers. Unfortunately, the ACQUIRE semantics of the xchg operation mean that there is no guarantee the contents of the node will be visible before lock tail is updated. This can lead to lock corruption when, for example, a concurrent locker races to set the next field. Fixes: c55a6ffa6285 ("locking/osq: Relax atomic semantics"): Reported-by: David Daney <ddaney@caviumnetworks.com> Reported-by: Andrew Pinski <andrew.pinski@caviumnetworks.com> Tested-by: Andrew Pinski <andrew.pinski@caviumnetworks.com> Acked-by: Davidlohr Bueso <dave@stgolabs.net> Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: http://lkml.kernel.org/r/1449856001-21177-1-git-send-email-will.deacon@arm.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-12-14net, cgroup: cgroup_sk_updat_lock was missing initializerTejun Heo
bd1060a1d671 ("sock, cgroup: add sock->sk_cgroup") added global spinlock cgroup_sk_update_lock but erroneously skipped initializer leading to uninitialized spinlock warning. Fix it by using DEFINE_SPINLOCK(). Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Dexuan Cui <decui@microsoft.com> Fixes: bd1060a1d671 ("sock, cgroup: add sock->sk_cgroup") Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-13sched/wait: Fix the signal handling fixPeter Zijlstra
Jan Stancek reported that I wrecked things for him by fixing things for Vladimir :/ His report was due to an UNINTERRUPTIBLE wait getting -EINTR, which should not be possible, however my previous patch made this possible by unconditionally checking signal_pending(). We cannot use current->state as was done previously, because the instruction after the store to that variable it can be changed. We must instead pass the initial state along and use that. Fixes: 68985633bccb ("sched/wait: Fix signal handling in bit wait helpers") Reported-by: Jan Stancek <jstancek@redhat.com> Reported-by: Chris Mason <clm@fb.com> Tested-by: Jan Stancek <jstancek@redhat.com> Tested-by: Vladimir Murzin <vladimir.murzin@arm.com> Tested-by: Chris Mason <clm@fb.com> Reviewed-by: Paul Turner <pjt@google.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: tglx@linutronix.de Cc: Oleg Nesterov <oleg@redhat.com> Cc: hpa@zytor.com Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-12-12bpf, inode: allow for rename and link opsDaniel Borkmann
Add support for renaming and hard links to the fs. Most of this can be implemented by using simple library operations under the same constraints that we don't use a reserved name like elsewhere. Linking can be useful to share/manage things like maps across subsystem users. It works within the file system boundary, but is not allowed for directories. Symbolic links are explicitly not implemented here, as it can be better done already by doing bind mounts inside bpf fs to set up shared directories f.e. useful when using volumes in docker containers that map a private working directory into /sys/fs/bpf/ which contains itself a bind mounted path from the host's /sys/fs/bpf/ mount that is shared among multiple containers. For single maps instead of whole directory, hard links can be easily used to do the same. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-12kernel: remove stop_machine() Kconfig dependencyChris Wilson
Currently the full stop_machine() routine is only enabled on SMP if module unloading is enabled, or if the CPUs are hotpluggable. This leads to configurations where stop_machine() is broken as it will then only run the callback on the local CPU with irqs disabled, and not stop the other CPUs or run the callback on them. For example, this breaks MTRR setup on x86 in certain configs since ea8596bb2d8d379 ("kprobes/x86: Remove unused text_poke_smp() and text_poke_smp_batch() functions") as the MTRR is only established on the boot CPU. This patch removes the Kconfig option for STOP_MACHINE and uses the SMP and HOTPLUG_CPU config options to compile the correct stop_machine() for the architecture, removing the false dependency on MODULE_UNLOAD in the process. Link: https://lkml.org/lkml/2014/10/8/124 References: https://bugs.freedesktop.org/show_bug.cgi?id=84794 Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Acked-by: Ingo Molnar <mingo@kernel.org> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Pranith Kumar <bobby.prani@gmail.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Vladimir Davydov <vdavydov@parallels.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: H. Peter Anvin <hpa@linux.intel.com> Cc: Tejun Heo <tj@kernel.org> Cc: Iulia Manda <iulia.manda21@gmail.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Chuck Ebbert <cebbert.lkml@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-12-08sock, cgroup: add sock->sk_cgroupTejun Heo
In cgroup v1, dealing with cgroup membership was difficult because the number of membership associations was unbound. As a result, cgroup v1 grew several controllers whose primary purpose is either tagging membership or pull in configuration knobs from other subsystems so that cgroup membership test can be avoided. net_cls and net_prio controllers are examples of the latter. They allow configuring network-specific attributes from cgroup side so that network subsystem can avoid testing cgroup membership; unfortunately, these are not only cumbersome but also problematic. Both net_cls and net_prio aren't properly hierarchical. Both inherit configuration from the parent on creation but there's no interaction afterwards. An ancestor doesn't restrict the behavior in its subtree in anyway and configuration changes aren't propagated downwards. Especially when combined with cgroup delegation, this is problematic because delegatees can mess up whatever network configuration implemented at the system level. net_prio would allow the delegatees to set whatever priority value regardless of CAP_NET_ADMIN and net_cls the same for classid. While it is possible to solve these issues from controller side by implementing hierarchical allowable ranges in both controllers, it would involve quite a bit of complexity in the controllers and further obfuscate network configuration as it becomes even more difficult to tell what's actually being configured looking from the network side. While not much can be done for v1 at this point, as membership handling is sane on cgroup v2, it'd be better to make cgroup matching behave like other network matches and classifiers than introducing further complications. In preparation, this patch updates sock->sk_cgrp_data handling so that it points to the v2 cgroup that sock was created in until either net_prio or net_cls is used. Once either of the two is used, sock->sk_cgrp_data reverts to its previous role of carrying prioidx and classid. This is to avoid adding yet another cgroup related field to struct sock. As the mode switching can happen at most once per boot, the switching mechanism is aimed at lowering hot path overhead. It may leak a finite, likely small, number of cgroup refs and report spurious prioidx or classid on switching; however, dynamic updates of prioidx and classid have always been racy and lossy - socks between creation and fd installation are never updated, config changes don't update existing sockets at all, and prioidx may index with dead and recycled cgroup IDs. Non-critical inaccuracies from small race windows won't make any noticeable difference. This patch doesn't make use of the pointer yet. The following patch will implement netfilter match for cgroup2 membership. v2: Use sock_cgroup_data to avoid inflating struct sock w/ another cgroup specific field. v3: Add comments explaining why sock_data_prioidx() and sock_data_classid() use different fallback values. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Daniel Wagner <daniel.wagner@bmw-carit.de> CC: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-08Merge branch 'for-4.5-ancestor-test' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Preparatory changes for some new socket cgroup infrastructure and netfilter targets. Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-08Merge branch 'for-4.4-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup fixes from Tejun Heo: "More change than I'd have liked at this stage. The pids controller and the changes made to cgroup core to support it introduced and revealed several important issues. - Assigning membership to a newly created task and migrating it can race leading to incorrect accounting. Oleg fixed it by widening threadgroup synchronization. It looks like we'll be able to merge it with a different percpu rwsem which is used in fork path making things simpler and cheaper. - The recent change to extend cgroup membership to zombies (so that pid accounting can extend till the pid is actually released) missed pinning the underlying data structures leading to use-after-free. Fixed. - v2 hierarchy was calling subsystem callbacks with the wrong target cgroup_subsys_state based on the incorrect assumption that they share the same target. pids is the first controller affected by this. Subsys callbacks updated so that they can deal with multi-target migrations" * 'for-4.4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup_pids: don't account for the root cgroup cgroup: fix handling of multi-destination migration from subtree_control enabling cgroup_freezer: simplify propagation of CGROUP_FROZEN clearing in freezer_attach() cgroup: pids: kill pids_fork(), simplify pids_can_fork() and pids_cancel_fork() cgroup: pids: fix race between cgroup_post_fork() and cgroup_migrate() cgroup: make css_set pin its css's to avoid use-afer-free cgroup: fix cftype->file_offset handling
2015-12-08Merge branch 'perf-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Ingo Molnar: "This tree includes four core perf fixes for misc bugs, three fixes to x86 PMU drivers, and two updates to old email addresses" * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf: Do not send exit event twice perf/x86/intel: Fix INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_NA macro perf/x86/intel: Make L1D_PEND_MISS.FB_FULL not constrained on Haswell perf: Fix PERF_EVENT_IOC_PERIOD deadlock treewide: Remove old email address perf/x86: Fix LBR call stack save/restore perf: Update email address in MAINTAINERS perf/core: Robustify the perf_cgroup_from_task() RCU checks perf/core: Fix RCU problem with cgroup context switching code
2015-12-07Merge branch 'master' into for-4.4-fixesTejun Heo
The following commit which went into mainline through networking tree 3b13758f51de ("cgroups: Allow dynamically changing net_classid") conflicts in net/core/netclassid_cgroup.c with the following pending fix in cgroup/for-4.4-fixes. 1f7dd3e5a6e4 ("cgroup: fix handling of multi-destination migration from subtree_control enabling") The former separates out update_classid() from cgrp_attach() and updates it to walk all fds of all tasks in the target css so that it can be used from both migration and config change paths. The latter drops @css from cgrp_attach(). Resolve the conflict by making cgrp_attach() call update_classid() with the css from the first task. We can revive @tset walking in cgrp_attach() but given that net_cls is v1 only where there always is only one target css during migration, this is fine. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Nina Schiff <ninasc@fb.com>
2015-12-06Merge branch 'sched-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Thomas Gleixner: "This updates contains the following changes: - Fix a signal handling regression in the bit wait functions. - Avoid false positive warnings in the wakeup path. - Initialize the scheduler root domain properly. - Handle gtime calculations in proc/$PID/stat proper. - Add more documentation for the barriers in try_to_wake_up(). - Fix a subtle race in try_to_wake_up() which might cause a task to be scheduled on two cpus - Compile static helper function only when it is used" * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/core: Fix an SMP ordering race in try_to_wake_up() vs. schedule() sched/core: Better document the try_to_wake_up() barriers sched/cputime: Fix invalid gtime in proc sched/core: Clear the root_domain cpumasks in init_rootdomain() sched/core: Remove false-positive warning from wake_up_process() sched/wait: Fix signal handling in bit wait helpers sched/rt: Hide the push_irq_work_func() declaration
2015-12-06perf: Do not send exit event twiceJiri Olsa
In case we monitor events system wide, we get EXIT event (when configured) twice for each task that exited. Note doubled lines with same pid/tid in following example: $ sudo ./perf record -a ^C[ perf record: Woken up 1 times to write data ] [ perf record: Captured and wrote 0.480 MB perf.data (2518 samples) ] $ sudo ./perf report -D | grep EXIT 0 60290687567581 0x59910 [0x38]: PERF_RECORD_EXIT(1250:1250):(1250:1250) 0 60290687568354 0x59948 [0x38]: PERF_RECORD_EXIT(1250:1250):(1250:1250) 0 60290687988744 0x59ad8 [0x38]: PERF_RECORD_EXIT(1250:1250):(1250:1250) 0 60290687989198 0x59b10 [0x38]: PERF_RECORD_EXIT(1250:1250):(1250:1250) 1 60290692567895 0x62af0 [0x38]: PERF_RECORD_EXIT(1253:1253):(1253:1253) 1 60290692568322 0x62b28 [0x38]: PERF_RECORD_EXIT(1253:1253):(1253:1253) 2 60290692739276 0x69a18 [0x38]: PERF_RECORD_EXIT(1252:1252):(1252:1252) 2 60290692739910 0x69a50 [0x38]: PERF_RECORD_EXIT(1252:1252):(1252:1252) The reason is that the cpu contexts are processes each time we call perf_event_task. I'm changing the perf_event_aux logic to serve task_ctx and cpu contexts separately, which ensure we don't get EXIT event generated twice on same cpu context. This does not affect other auxiliary events, as they don't use task_ctx at all. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: David Ahern <dsahern@gmail.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Link: http://lkml.kernel.org/r/1446649205-5822-1-git-send-email-jolsa@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-12-04sched/core: Fix an SMP ordering race in try_to_wake_up() vs. schedule()Peter Zijlstra
Oleg noticed that its possible to falsely observe p->on_cpu == 0 such that we'll prematurely continue with the wakeup and effectively run p on two CPUs at the same time. Even though the overlap is very limited; the task is in the middle of being scheduled out; it could still result in corruption of the scheduler data structures. CPU0 CPU1 set_current_state(...) <preempt_schedule> context_switch(X, Y) prepare_lock_switch(Y) Y->on_cpu = 1; finish_lock_switch(X) store_release(X->on_cpu, 0); try_to_wake_up(X) LOCK(p->pi_lock); t = X->on_cpu; // 0 context_switch(Y, X) prepare_lock_switch(X) X->on_cpu = 1; finish_lock_switch(Y) store_release(Y->on_cpu, 0); </preempt_schedule> schedule(); deactivate_task(X); X->on_rq = 0; if (X->on_rq) // false if (t) while (X->on_cpu) cpu_relax(); context_switch(X, ..) finish_lock_switch(X) store_release(X->on_cpu, 0); Avoid the load of X->on_cpu being hoisted over the X->on_rq load. Reported-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-12-04sched/core: Better document the try_to_wake_up() barriersPeter Zijlstra
Explain how the control dependency and smp_rmb() end up providing ACQUIRE semantics and pair with smp_store_release() in finish_lock_switch(). Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-12-04sched/cputime: Fix invalid gtime in procHiroshi Shimamoto
/proc/stats shows invalid gtime when the thread is running in guest. When vtime accounting is not enabled, we cannot get a valid delta. The delta is calculated with now - tsk->vtime_snap, but tsk->vtime_snap is only updated when vtime accounting is runtime enabled. This patch makes task_gtime() just return gtime without computing the buggy non-existing tickless delta when vtime accounting is not enabled. Use context_tracking_is_enabled() to check if vtime is accounting on some cpu, in which case only we need to check the tickless delta. This way we fix the gtime value regression on machines not running nohz full. The kernel config contains CONFIG_VIRT_CPU_ACCOUNTING_GEN=y and CONFIG_NO_HZ_FULL_ALL=n and boot without nohz_full. I ran and stop a busy loop in VM and see the gtime in host. Dump the 43rd field which shows the gtime in every second: # while :; do awk '{print $3" "$43}' /proc/3955/task/4014/stat; sleep 1; done S 4348 R 7064566 R 7064766 R 7064967 R 7065168 S 4759 S 4759 During running busy loop, it returns large value. After applying this patch, we can see right gtime. # while :; do awk '{print $3" "$43}' /proc/10913/task/10956/stat; sleep 1; done S 5338 R 5365 R 5465 R 5566 R 5666 S 5726 S 5726 Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Chris Metcalf <cmetcalf@ezchip.com> Cc: Christoph Lameter <cl@linux.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Luiz Capitulino <lcapitulino@redhat.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1447948054-28668-2-git-send-email-fweisbec@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-12-04sched/core: Clear the root_domain cpumasks in init_rootdomain()Xunlei Pang
root_domain::rto_mask allocated through alloc_cpumask_var() contains garbage data, this may cause problems. For instance, When doing pull_rt_task(), it may do useless iterations if rto_mask retains some extra garbage bits. Worse still, this violates the isolated domain rule for clustered scheduling using cpuset, because the tasks(with all the cpus allowed) belongs to one root domain can be pulled away into another root domain. The patch cleans the garbage by using zalloc_cpumask_var() instead of alloc_cpumask_var() for root_domain::rto_mask allocation, thereby addressing the issues. Do the same thing for root_domain's other cpumask memembers: dlo_mask, span, and online. Signed-off-by: Xunlei Pang <xlpang@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: <stable@vger.kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1449057179-29321-1-git-send-email-xlpang@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-12-04sched/core: Remove false-positive warning from wake_up_process()Sasha Levin
Because wakeups can (fundamentally) be late, a task might not be in the expected state. Therefore testing against a task's state is racy, and can yield false positives. Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: oleg@redhat.com Fixes: 9067ac85d533 ("wake_up_process() should be never used to wakeup a TASK_STOPPED/TRACED task") Link: http://lkml.kernel.org/r/1448933660-23082-1-git-send-email-sasha.levin@oracle.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-12-04sched/wait: Fix signal handling in bit wait helpersPeter Zijlstra
Vladimir reported getting RCU stall warnings and bisected it back to commit: 743162013d40 ("sched: Remove proliferation of wait_on_bit() action functions") That commit inadvertently reversed the calls to schedule() and signal_pending(), thereby not handling the case where the signal receives while we sleep. Reported-by: Vladimir Murzin <vladimir.murzin@arm.com> Tested-by: Vladimir Murzin <vladimir.murzin@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: mark.rutland@arm.com Cc: neilb@suse.de Cc: oleg@redhat.com Fixes: 743162013d40 ("sched: Remove proliferation of wait_on_bit() action functions") Fixes: cbbce8220949 ("SCHED: add some "wait..on_bit...timeout()" interfaces.") Link: http://lkml.kernel.org/r/20151201130404.GL3816@twins.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-12-04perf: Fix PERF_EVENT_IOC_PERIOD deadlockPeter Zijlstra
Dmitry reported a fairly silly recursive lock deadlock for PERF_EVENT_IOC_PERIOD, fix this by explicitly doing the inactive part of __perf_event_period() instead of calling that function. Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: <stable@vger.kernel.org> Cc: Alexander Potapenko <glider@google.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Kostya Serebryany <kcc@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Fixes: c7999c6f3fed ("perf: Fix PERF_EVENT_IOC_PERIOD migration race") Link: http://lkml.kernel.org/r/20151130115615.GJ17308@twins.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-12-03Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Conflicts: drivers/net/ethernet/renesas/ravb_main.c kernel/bpf/syscall.c net/ipv4/ipmr.c All three conflicts were cases of overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-03Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds
Pull networking fixes from David Miller: "A lot of Thanksgiving turkey leftovers accumulated, here goes: 1) Fix bluetooth l2cap_chan object leak, from Johan Hedberg. 2) IDs for some new iwlwifi chips, from Oren Givon. 3) Fix rtlwifi lockups on boot, from Larry Finger. 4) Fix memory leak in fm10k, from Stephen Hemminger. 5) We have a route leak in the ipv6 tunnel infrastructure, fix from Paolo Abeni. 6) Fix buffer pointer handling in arm64 bpf JIT,f rom Zi Shen Lim. 7) Wrong lockdep annotations in tcp md5 support, fix from Eric Dumazet. 8) Work around some middle boxes which prevent proper handling of TCP Fast Open, from Yuchung Cheng. 9) TCP repair can do huge kmalloc() requests, build paged SKBs instead. From Eric Dumazet. 10) Fix msg_controllen overflow in scm_detach_fds, from Daniel Borkmann. 11) Fix device leaks on ipmr table destruction in ipv4 and ipv6, from Nikolay Aleksandrov. 12) Fix use after free in epoll with AF_UNIX sockets, from Rainer Weikusat. 13) Fix double free in VRF code, from Nikolay Aleksandrov. 14) Fix skb leaks on socket receive queue in tipc, from Ying Xue. 15) Fix ifup/ifdown crach in xgene driver, from Iyappan Subramanian. 16) Fix clearing of persistent array maps in bpf, from Daniel Borkmann. 17) In TCP, for the cross-SYN case, we don't initialize tp->copied_seq early enough. From Eric Dumazet. 18) Fix out of bounds accesses in bpf array implementation when updating elements, from Daniel Borkmann. 19) Fill gaps in RCU protection of np->opt in ipv6 stack, from Eric Dumazet. 20) When dumping proxy neigh entries, we have to accomodate NULL device pointers properly, from Konstantin Khlebnikov. 21) SCTP doesn't release all ipv6 socket resources properly, fix from Eric Dumazet. 22) Prevent underflows of sch->q.qlen for multiqueue packet schedulers, also from Eric Dumazet. 23) Fix MAC and unicast list handling in bnxt_en driver, from Jeffrey Huang and Michael Chan. 24) Don't actively scan radar channels, from Antonio Quartulli" * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (110 commits) net: phy: reset only targeted phy bnxt_en: Setup uc_list mac filters after resetting the chip. bnxt_en: enforce proper storing of MAC address bnxt_en: Fixed incorrect implementation of ndo_set_mac_address net: lpc_eth: remove irq > NR_IRQS check from probe() net_sched: fix qdisc_tree_decrease_qlen() races openvswitch: fix hangup on vxlan/gre/geneve device deletion ipv4: igmp: Allow removing groups from a removed interface ipv6: sctp: implement sctp_v6_destroy_sock() arm64: bpf: add 'store immediate' instruction ipv6: kill sk_dst_lock ipv6: sctp: add rcu protection around np->opt net/neighbour: fix crash at dumping device-agnostic proxy entries sctp: use GFP_USER for user-controlled kmalloc sctp: convert sack_needed and sack_generation to bits ipv6: add complete rcu protection around np->opt bpf: fix allocation warnings in bpf maps and integer overflow mvebu: dts: enable IP checksum with jumbo frames for Armada 38x on Port0 net: mvneta: enable setting custom TX IP checksum limit net: mvneta: fix error path for building skb ...
2015-12-03Merge tag 'trace-v4.4-rc3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing fix from Steven Rostedt: "During the merge window I added a new file that is used to filter trace events on pids. It filters all events where only tasks with their pid in that file exists. It also handles the sched_switch and sched_wakeup trace events where the current task does not have its pid in the file, but the task either being switched to or awaken does. Unfortunately, I forgot about sched_wakeup_new and sched_waking. Both of these tracepoints use the same class as the sched_wakeup tracepoint, and they too should be included in what gets filtered by the set_event_pid file" * tag 'trace-v4.4-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: tracing: Add sched_wakeup_new and sched_waking tracepoints for pid filter
2015-12-03cgroup_pids: don't account for the root cgroupTejun Heo
Because accounting resources for the root cgroup sometimes incurs measureable overhead for workloads which don't care about cgroup and often ends up calculating a number which is available elsewhere in a slightly different form, cgroup is not in the business of providing system-wide statistics. The pids controller which was introduced recently was exposing "pids.current" at the root. This patch disable accounting for root cgroup and removes the file from the root directory. While this is a userland visible behavior change, pids has been available only in one version and was badly broken there, so I don't think this will be noticeable. If it turns out to be a problem, we can reinstate it for v1 hierarchies. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Aleksa Sarai <cyphar@cyphar.com>
2015-12-03cgroup: fix handling of multi-destination migration from subtree_control ↵Tejun Heo
enabling Consider the following v2 hierarchy. P0 (+memory) --- P1 (-memory) --- A \- B P0 has memory enabled in its subtree_control while P1 doesn't. If both A and B contain processes, they would belong to the memory css of P1. Now if memory is enabled on P1's subtree_control, memory csses should be created on both A and B and A's processes should be moved to the former and B's processes the latter. IOW, enabling controllers can cause atomic migrations into different csses. The core cgroup migration logic has been updated accordingly but the controller migration methods haven't and still assume that all tasks migrate to a single target css; furthermore, the methods were fed the css in which subtree_control was updated which is the parent of the target csses. pids controller depends on the migration methods to move charges and this made the controller attribute charges to the wrong csses often triggering the following warning by driving a counter negative. WARNING: CPU: 1 PID: 1 at kernel/cgroup_pids.c:97 pids_cancel.constprop.6+0x31/0x40() Modules linked in: CPU: 1 PID: 1 Comm: systemd Not tainted 4.4.0-rc1+ #29 ... ffffffff81f65382 ffff88007c043b90 ffffffff81551ffc 0000000000000000 ffff88007c043bc8 ffffffff810de202 ffff88007a752000 ffff88007a29ab00 ffff88007c043c80 ffff88007a1d8400 0000000000000001 ffff88007c043bd8 Call Trace: [<ffffffff81551ffc>] dump_stack+0x4e/0x82 [<ffffffff810de202>] warn_slowpath_common+0x82/0xc0 [<ffffffff810de2fa>] warn_slowpath_null+0x1a/0x20 [<ffffffff8118e031>] pids_cancel.constprop.6+0x31/0x40 [<ffffffff8118e0fd>] pids_can_attach+0x6d/0xf0 [<ffffffff81188a4c>] cgroup_taskset_migrate+0x6c/0x330 [<ffffffff81188e05>] cgroup_migrate+0xf5/0x190 [<ffffffff81189016>] cgroup_attach_task+0x176/0x200 [<ffffffff8118949d>] __cgroup_procs_write+0x2ad/0x460 [<ffffffff81189684>] cgroup_procs_write+0x14/0x20 [<ffffffff811854e5>] cgroup_file_write+0x35/0x1c0 [<ffffffff812e26f1>] kernfs_fop_write+0x141/0x190 [<ffffffff81265f88>] __vfs_write+0x28/0xe0 [<ffffffff812666fc>] vfs_write+0xac/0x1a0 [<ffffffff81267019>] SyS_write+0x49/0xb0 [<ffffffff81bcef32>] entry_SYSCALL_64_fastpath+0x12/0x76 This patch fixes the bug by removing @css parameter from the three migration methods, ->can_attach, ->cancel_attach() and ->attach() and updating cgroup_taskset iteration helpers also return the destination css in addition to the task being migrated. All controllers are updated accordingly. * Controllers which don't care whether there are one or multiple target csses can be converted trivially. cpu, io, freezer, perf, netclassid and netprio fall in this category. * cpuset's current implementation assumes that there's single source and destination and thus doesn't support v2 hierarchy already. The only change made by this patchset is how that single destination css is obtained. * memory migration path already doesn't do anything on v2. How the single destination css is obtained is updated and the prep stage of mem_cgroup_can_attach() is reordered to accomodate the change. * pids is the only controller which was affected by this bug. It now correctly handles multi-destination migrations and no longer causes counter underflow from incorrect accounting. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-and-tested-by: Daniel Wagner <daniel.wagner@bmw-carit.de> Cc: Aleksa Sarai <cyphar@cyphar.com>
2015-12-03cgroup_freezer: simplify propagation of CGROUP_FROZEN clearing in ↵Tejun Heo
freezer_attach() If one or more tasks get moved into a frozen css, the frozen state is cleared up from the destination css so that it can be reasserted once the migrated tasks are frozen. freezer_attach() implements this in two separate steps - clearing CGROUP_FROZEN on the target css while processing each task and propagating the clearing upwards after the task loop is done if necessary. This patch merges the two steps. Propagation now takes place inside the task loop. This simplifies the code and prepares it for the fix of multi-destination migration. Signed-off-by: Tejun Heo <tj@kernel.org>
2015-12-02bpf: fix allocation warnings in bpf maps and integer overflowAlexei Starovoitov
For large map->value_size the user space can trigger memory allocation warnings like: WARNING: CPU: 2 PID: 11122 at mm/page_alloc.c:2989 __alloc_pages_nodemask+0x695/0x14e0() Call Trace: [< inline >] __dump_stack lib/dump_stack.c:15 [<ffffffff82743b56>] dump_stack+0x68/0x92 lib/dump_stack.c:50 [<ffffffff81244ec9>] warn_slowpath_common+0xd9/0x140 kernel/panic.c:460 [<ffffffff812450f9>] warn_slowpath_null+0x29/0x30 kernel/panic.c:493 [< inline >] __alloc_pages_slowpath mm/page_alloc.c:2989 [<ffffffff81554e95>] __alloc_pages_nodemask+0x695/0x14e0 mm/page_alloc.c:3235 [<ffffffff816188fe>] alloc_pages_current+0xee/0x340 mm/mempolicy.c:2055 [< inline >] alloc_pages include/linux/gfp.h:451 [<ffffffff81550706>] alloc_kmem_pages+0x16/0xf0 mm/page_alloc.c:3414 [<ffffffff815a1c89>] kmalloc_order+0x19/0x60 mm/slab_common.c:1007 [<ffffffff815a1cef>] kmalloc_order_trace+0x1f/0xa0 mm/slab_common.c:1018 [< inline >] kmalloc_large include/linux/slab.h:390 [<ffffffff81627784>] __kmalloc+0x234/0x250 mm/slub.c:3525 [< inline >] kmalloc include/linux/slab.h:463 [< inline >] map_update_elem kernel/bpf/syscall.c:288 [< inline >] SYSC_bpf kernel/bpf/syscall.c:744 To avoid never succeeding kmalloc with order >= MAX_ORDER check that elem->value_size and computed elem_size are within limits for both hash and array type maps. Also add __GFP_NOWARN to kmalloc(value_size | elem_size) to avoid OOM warnings. Note kmalloc(key_size) is highly unlikely to trigger OOM, since key_size <= 512, so keep those kmalloc-s as-is. Large value_size can cause integer overflows in elem_size and map.pages formulas, so check for that as well. Fixes: aaac3ba95e4c ("bpf: charge user for creation of BPF maps and programs") Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-01bpf, array: fix heap out-of-bounds access when updating elementsDaniel Borkmann
During own review but also reported by Dmitry's syzkaller [1] it has been noticed that we trigger a heap out-of-bounds access on eBPF array maps when updating elements. This happens with each map whose map->value_size (specified during map creation time) is not multiple of 8 bytes. In array_map_alloc(), elem_size is round_up(attr->value_size, 8) and used to align array map slots for faster access. However, in function array_map_update_elem(), we update the element as ... memcpy(array->value + array->elem_size * index, value, array->elem_size); ... where we access 'value' out-of-bounds, since it was allocated from map_update_elem() from syscall side as kmalloc(map->value_size, GFP_USER) and later on copied through copy_from_user(value, uvalue, map->value_size). Thus, up to 7 bytes, we can access out-of-bounds. Same could happen from within an eBPF program, where in worst case we access beyond an eBPF program's designated stack. Since 1be7f75d1668 ("bpf: enable non-root eBPF programs") didn't hit an official release yet, it only affects priviledged users. In case of array_map_lookup_elem(), the verifier prevents eBPF programs from accessing beyond map->value_size through check_map_access(). Also from syscall side map_lookup_elem() only copies map->value_size back to user, so nothing could leak. [1] http://github.com/google/syzkaller Fixes: 28fbcfa08d8e ("bpf: add array type of eBPF maps") Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-01tracing: Add sched_wakeup_new and sched_waking tracepoints for pid filterSteven Rostedt (Red Hat)
The set_event_pid filter relies on attaching to the sched_switch and sched_wakeup tracepoints to see if it should filter the tracing on schedule tracepoints. By adding the callbacks to sched_wakeup, pids in the set_event_pid file will trace the wakeups of those tasks with those pids. But sched_wakeup_new and sched_waking were missed. These two should also be traced. Luckily, these tracepoints share the same class as sched_wakeup which means they can use the same pre and post callbacks as sched_wakeup does. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-11-30Merge tag 'trace-v4.4-rc2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing fixes from Steven Rostedt: "I found two minor bugs while doing development on the ring buffer code. The first is something that's been there since its creation. If a reader reads a page out of the ring buffer before there's any events on it, it can get an out of date timestamp for that event. It may be off by a few microseconds, more if the first event gets discarded. The fix was to only update the reader time stamp when it actually sees an event on the page, instead of just reading the timestamp from the page even if it has no events on it. That timestamp is still volatile until an event is present. The second bug is more recent. Instead of passing around parameters a descriptor was made and the parameters are passed via a single descriptor. This simplified the code a bit. But there was one place that expected the parameter to be passed by value not reference (which a descriptor now does). And it added to the length of the event, which may be ignored later, but the length should not have been increased. The only real problem with this bug is that it may allocate more than was needed for the event" * tag 'trace-v4.4-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: ring-buffer: Put back the length if crossed page with add_timestamp ring-buffer: Update read stamp with first real commit on page
2015-11-30cgroup: pids: kill pids_fork(), simplify pids_can_fork() and pids_cancel_fork()Oleg Nesterov
Now that we know that the forking task can't migrate amd the child is always moved to the same cgroup by cgroup_post_fork()->css_set_move_task() we can change pids_can_fork() and pids_cancel_fork() to just use task_css(current). And since we no longer need to pin this css, we can remove pid_fork(). Note: the patch uses task_css_check(true), perhaps it makes sense to add a helper or change task_css_set_check() to take cgroup_threadgroup_rwsem into account. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Zefan Li <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2015-11-30cgroup: pids: fix race between cgroup_post_fork() and cgroup_migrate()Oleg Nesterov
If the new child migrates to another cgroup before cgroup_post_fork() calls subsys->fork(), then both pids_can_attach() and pids_fork() will do the same pids_uncharge(old_pids) + pids_charge(pids) sequence twice. Change copy_process() to call threadgroup_change_begin/threadgroup_change_end unconditionally. percpu_down_read() is cheap and this allows other cleanups, see the next changes. Also, this way we can unify cgroup_threadgroup_rwsem and dup_mmap_sem. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Zefan Li <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2015-11-30cgroup: make css_set pin its css's to avoid use-afer-freeTejun Heo
A css_set represents the relationship between a set of tasks and css's. css_set never pinned the associated css's. This was okay because tasks used to always disassociate immediately (in RCU sense) - either a task is moved to a different css_set or exits and never accesses css_set again. Unfortunately, afcf6c8b7544 ("cgroup: add cgroup_subsys->free() method and use it to fix pids controller") and patches leading up to it made a zombie hold onto its css_set and deref the associated css's on its release. Nothing pins the css's after exit and it might have already been freed leading to use-after-free. general protection fault: 0000 [#1] PREEMPT SMP task: ffffffff81bf2500 ti: ffffffff81be4000 task.ti: ffffffff81be4000 RIP: 0010:[<ffffffff810fa205>] [<ffffffff810fa205>] pids_cancel.constprop.4+0x5/0x40 ... Call Trace: <IRQ> [<ffffffff810fb02d>] ? pids_free+0x3d/0xa0 [<ffffffff810f8893>] cgroup_free+0x53/0xe0 [<ffffffff8104ed62>] __put_task_struct+0x42/0x130 [<ffffffff81053557>] delayed_put_task_struct+0x77/0x130 [<ffffffff810c6b34>] rcu_process_callbacks+0x2f4/0x820 [<ffffffff810c6af3>] ? rcu_process_callbacks+0x2b3/0x820 [<ffffffff81056e54>] __do_softirq+0xd4/0x460 [<ffffffff81057369>] irq_exit+0x89/0xa0 [<ffffffff81876212>] smp_apic_timer_interrupt+0x42/0x50 [<ffffffff818747f4>] apic_timer_interrupt+0x84/0x90 <EOI> ... Code: 5b 5d c3 48 89 df 48 c7 c2 c9 f9 ae 81 48 c7 c6 91 2c ae 81 e8 1d 94 0e 00 31 c0 5b 5d c3 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 <f0> 48 83 87 e0 00 00 00 ff 78 01 c3 80 3d 08 7a c1 00 00 74 02 RIP [<ffffffff810fa205>] pids_cancel.constprop.4+0x5/0x40 RSP <ffff88001fc03e20> ---[ end trace 89a4a4b916b90c49 ]--- Kernel panic - not syncing: Fatal exception in interrupt Kernel Offset: disabled ---[ end Kernel panic - not syncing: Fatal exception in interrupt Fix it by making css_set pin the associate css's until its release. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Dave Jones <davej@codemonkey.org.uk> Reported-by: Daniel Wagner <daniel.wagner@bmw-carit.de> Link: http://lkml.kernel.org/g/20151120041836.GA18390@codemonkey.org.uk Link: http://lkml.kernel.org/g/5652D448.3080002@bmw-carit.de Fixes: afcf6c8b7544 ("cgroup: add cgroup_subsys->free() method and use it to fix pids controller")
2015-11-25bpf: fix clearing on persistent program array mapsDaniel Borkmann
Currently, when having map file descriptors pointing to program arrays, there's still the issue that we unconditionally flush program array contents via bpf_fd_array_map_clear() in bpf_map_release(). This happens when such a file descriptor is released and is independent of the map's refcount. Having this flush independent of the refcount is for a reason: there can be arbitrary complex dependency chains among tail calls, also circular ones (direct or indirect, nesting limit determined during runtime), and we need to make sure that the map drops all references to eBPF programs it holds, so that the map's refcount can eventually drop to zero and initiate its freeing. Btw, a walk of the whole dependency graph would not be possible for various reasons, one being complexity and another one inconsistency, i.e. new programs can be added to parts of the graph at any time, so there's no guaranteed consistent state for the time of such a walk. Now, the program array pinning itself works, but the issue is that each derived file descriptor on close would nevertheless call unconditionally into bpf_fd_array_map_clear(). Instead, keep track of users and postpone this flush until the last reference to a user is dropped. As this only concerns a subset of references (f.e. a prog array could hold a program that itself has reference on the prog array holding it, etc), we need to track them separately. Short analysis on the refcounting: on map creation time usercnt will be one, so there's no change in behaviour for bpf_map_release(), if unpinned. If we already fail in map_create(), we are immediately freed, and no file descriptor has been made public yet. In bpf_obj_pin_user(), we need to probe for a possible map in bpf_fd_probe_obj() already with a usercnt reference, so before we drop the reference on the fd with fdput(). Therefore, if actual pinning fails, we need to drop that reference again in bpf_any_put(), otherwise we keep holding it. When last reference drops on the inode, the bpf_any_put() in bpf_evict_inode() will take care of dropping the usercnt again. In the bpf_obj_get_user() case, the bpf_any_get() will grab a reference on the usercnt, still at a time when we have the reference on the path. Should we later on fail to grab a new file descriptor, bpf_any_put() will drop it, otherwise we hold it until bpf_map_release() time. Joint work with Alexei. Fixes: b2197755b263 ("bpf: add support for persistent maps/progs") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-24pidns: fix NULL dereference in __task_pid_nr_ns()Eric Dumazet
I got a crash during a "perf top" session that was caused by a race in __task_pid_nr_ns() : pid_nr_ns() was inlined, but apparently compiler chose to read task->pids[type].pid twice, and the pid->level dereference crashed because we got a NULL pointer at the second read : if (pid && ns->level <= pid->level) { // CRASH Just use RCU API properly to solve this race, and not worry about "perf top" crashing hosts :( get_task_pid() can benefit from same fix. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-24ring-buffer: Put back the length if crossed page with add_timestampSteven Rostedt (Red Hat)
Commit fcc742eaad7c "ring-buffer: Add event descriptor to simplify passing data" added a descriptor that holds various data instead of passing around several variables through parameters. The problem was that one of the parameters was modified in a function and the code was designed not to have an effect on that modified parameter. Now that the parameter is a descriptor and any modifications to it are non-volatile, the size of the data could be unnecessarily expanded. Remove the extra space added if a timestamp was added and the event went across the page. Cc: stable@vger.kernel.org # 4.3+ Fixes: fcc742eaad7c "ring-buffer: Add event descriptor to simplify passing data" Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-11-24ring-buffer: Update read stamp with first real commit on pageSteven Rostedt (Red Hat)
Do not update the read stamp after swapping out the reader page from the write buffer. If the reader page is swapped out of the buffer before an event is written to it, then the read_stamp may get an out of date timestamp, as the page timestamp is updated on the first commit to that page. rb_get_reader_page() only returns a page if it has an event on it, otherwise it will return NULL. At that point, check if the page being returned has events and has not been read yet. Then at that point update the read_stamp to match the time stamp of the reader page. Cc: stable@vger.kernel.org # 2.6.30+ Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-11-23treewide: Remove old email addressPeter Zijlstra
There were still a number of references to my old Red Hat email address in the kernel source. Remove these while keeping the Red Hat copyright notices intact. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-11-23sched/rt: Hide the push_irq_work_func() declarationArnd Bergmann
The push_irq_work_func() function is conditionally defined only when both CONFIG_SMP and HAVE_RT_PUSH_IPI are defined, but the forward declaration remains visibile without HAVE_RT_PUSH_IPI, causing a gcc warning in ARM64 allnoconfig: kernel/sched/rt.c:68:13: warning: 'push_irq_work_func' declared 'static' but never defined [-Wunused-function] This changes the code to use the same condition for both the declaration and the function definition, which gets rid of the warning. As Peter Zijlstra, we can possibly get rid of the whole HAVE_RT_PUSH_IPI thing after: 8053871d0f7f ("smp: Fix smp_call_function_single_async() locking") Until that is done, this patch can be used to avoid the warning. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Steven Rostedt <rostedt@goodmis.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: b6366f048e0c ("sched/rt: Use IPI to trigger RT task push migration instead of pulling") Link: http://lkml.kernel.org/r/3828565.oKfGk7yNIT@wuerfel Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-11-23perf/core: Robustify the perf_cgroup_from_task() RCU checksStephane Eranian
This patch reinforces the lockdep checks performed by perf_cgroup_from_tsk() by passing the perf_event_context whenever possible. It is okay to not hold the RCU read lock when we know we hold the ctx->lock. This patch makes sure this property holds. In some functions, such as perf_cgroup_sched_in(), we do not pass the context because we are sure we are holding the RCU read lock. Signed-off-by: Stephane Eranian <eranian@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: edumazet@google.com Link: http://lkml.kernel.org/r/1447322404-10920-3-git-send-email-eranian@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-11-23perf/core: Fix RCU problem with cgroup context switching codeStephane Eranian
The RCU checker detected RCU violation in the cgroup switching routines perf_cgroup_sched_in() and perf_cgroup_sched_out(). We were dereferencing cgroup from task without holding the RCU lock. Fix this by holding the RCU read lock. We move the locking from perf_cgroup_switch() to avoid double locking. Signed-off-by: Stephane Eranian <eranian@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: edumazet@google.com Link: http://lkml.kernel.org/r/1447322404-10920-2-git-send-email-eranian@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-11-20kernel/panic.c: turn off locks debug before releasing console lockVitaly Kuznetsov
Commit 08d78658f393 ("panic: release stale console lock to always get the logbuf printed out") introduced an unwanted bad unlock balance report when panic() is called directly and not from OOPS (e.g. from out_of_memory()). The difference is that in case of OOPS we disable locks debug in oops_enter() and on direct panic call nobody does that. Fixes: 08d78658f393 ("panic: release stale console lock to always get the logbuf printed out") Reported-by: kernel test robot <ying.huang@linux.intel.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com> Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Baoquan He <bhe@redhat.com> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Xie XiuQi <xiexiuqi@huawei.com> Cc: Seth Jennings <sjenning@redhat.com> Cc: "K. Y. Srinivasan" <kys@microsoft.com> Cc: Jan Kara <jack@suse.cz> Cc: Petr Mladek <pmladek@suse.cz> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-20kernel/signal.c: unexport sigsuspend()Richard Weinberger
sigsuspend() is nowhere used except in signal.c itself, so we can mark it static do not pollute the global namespace. But this patch is more than a boring cleanup patch, it fixes a real issue on UserModeLinux. UML has a special console driver to display ttys using xterm, or other terminal emulators, on the host side. Vegard reported that sometimes UML is unable to spawn a xterm and he's facing the following warning: WARNING: CPU: 0 PID: 908 at include/linux/thread_info.h:128 sigsuspend+0xab/0xc0() It turned out that this warning makes absolutely no sense as the UML xterm code calls sigsuspend() on the host side, at least it tries. But as the kernel itself offers a sigsuspend() symbol the linker choose this one instead of the glibc wrapper. Interestingly this code used to work since ever but always blocked signals on the wrong side. Some recent kernel change made the WARN_ON() trigger and uncovered the bug. It is a wonderful example of how much works by chance on computers. :-) Fixes: 68f3f16d9ad0f1 ("new helper: sigsuspend()") Signed-off-by: Richard Weinberger <richard@nod.at> Reported-by: Vegard Nossum <vegard.nossum@oracle.com> Tested-by: Vegard Nossum <vegard.nossum@oracle.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: <stable@vger.kernel.org> [3.5+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-20cgroup: implement cgroup_get_from_path() and expose cgroup_put()Tejun Heo
Implement cgroup_get_from_path() using kernfs_walk_and_get() which obtains a default hierarchy cgroup from its path. This will be used to allow cgroup path based matching from outside cgroup proper - e.g. networking and perf. v2: Add EXPORT_SYMBOL_GPL(cgroup_get_from_path). Signed-off-by: Tejun Heo <tj@kernel.org>
2015-11-20cgroup: record ancestor IDs and reimplement cgroup_is_descendant() using itTejun Heo
cgroup_is_descendant() currently walks up the hierarchy and compares each ancestor to the cgroup in question. While enough for cgroup core usages, this can't be used in hot paths to test cgroup membership. This patch adds cgroup->ancestor_ids[] which records the IDs of all ancestors including self and cgroup->level for the nesting level. This allows testing whether a given cgroup is a descendant of another in three finite steps - testing whether the two belong to the same hierarchy, whether the descendant candidate is at the same or a higher level than the ancestor and comparing the recorded ancestor_id at the matching level. cgroup_is_descendant() is accordingly reimplmented and made inline. Signed-off-by: Tejun Heo <tj@kernel.org>
2015-11-20bpf: add show_fdinfo handler for mapsDaniel Borkmann
Add a handler for show_fdinfo() to be used by the anon-inodes backend for eBPF maps, and dump the map specification there. Not only useful for admins, but also it provides a minimal way to compare specs from ELF vs pinned object. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>