summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2016-07-02PM / hibernate: Recycle safe pages after image restorationRafael J. Wysocki
One of the memory bitmaps used by the hibernation image restoration code is freed after the image has been loaded. That is not quite efficient, though, because the memory pages used for building that bitmap are known to be safe (ie. they were not used by the image kernel before hibernation) and the arch-specific code finalizing the image restoration may need them. In that case it needs to allocate those pages again via the memory management subsystem, check if they are really safe again by consulting the other bitmaps and so on. To avoid that, recycle those pages by putting them into the global list of known safe pages so that they can be given to the arch code right away when necessary. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-07-02PM / hibernate: Simplify mark_unsafe_pages()Rafael J. Wysocki
Rework mark_unsafe_pages() to use a simpler method of clearing all bits in free_pages_map and to set the bits for the "unsafe" pages (ie. pages that were used by the image kernel before hibernation) with the help of duplicate_memory_bitmap(). For this purpose, move the pfn_valid() check from mark_unsafe_pages() to unpack_orig_pfns() where the "unsafe" pages are discovered. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-07-02PM / hibernate: Do not free preallocated safe pages during image restoreRafael J. Wysocki
The core image restoration code preallocates some safe pages (ie. pages that weren't used by the image kernel before hibernation) for future use before allocating the bulk of memory for loading the image data. Those safe pages are then freed so they can be allocated again (with the memory management subsystem's help). That's done to ensure that there will be enough safe pages for temporary data structures needed during image restoration. However, it is not really necessary to free those pages after they have been allocated. They can be added to the (global) list of safe pages right away and then picked up from there when needed without freeing. That reduces the overhead related to using safe pages, especially in the arch-specific code, so modify the code accordingly. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-07-02PM / suspend: show workqueue state in suspend flowRoger Lu
If freezable workqueue aborts suspend flow, show workqueue state for debug purpose. Signed-off-by: Roger Lu <roger.lu@mediatek.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-07-01cgroup: bpf: Add bpf_skb_in_cgroup_protoMartin KaFai Lau
Adds a bpf helper, bpf_skb_in_cgroup, to decide if a skb->sk belongs to a descendant of a cgroup2. It is similar to the feature added in netfilter: commit c38c4597e4bf ("netfilter: implement xt_cgroup cgroup2 path match") The user is expected to populate a BPF_MAP_TYPE_CGROUP_ARRAY which will be used by the bpf_skb_in_cgroup. Modifications to the bpf verifier is to ensure BPF_MAP_TYPE_CGROUP_ARRAY and bpf_skb_in_cgroup() are always used together. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Cc: Alexei Starovoitov <ast@fb.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Tejun Heo <tj@kernel.org> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-01cgroup: bpf: Add BPF_MAP_TYPE_CGROUP_ARRAYMartin KaFai Lau
Add a BPF_MAP_TYPE_CGROUP_ARRAY and its bpf_map_ops's implementations. To update an element, the caller is expected to obtain a cgroup2 backed fd by open(cgroup2_dir) and then update the array with that fd. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Cc: Alexei Starovoitov <ast@fb.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Tejun Heo <tj@kernel.org> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-01cgroup: Add cgroup_get_from_fdMartin KaFai Lau
Add a helper function to get a cgroup2 from a fd. It will be stored in a bpf array (BPF_MAP_TYPE_CGROUP_ARRAY) which will be introduced in the later patch. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Cc: Alexei Starovoitov <ast@fb.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Tejun Heo <tj@kernel.org> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-01bpf: refactor bpf_prog_get and type check into helperDaniel Borkmann
Since bpf_prog_get() and program type check is used in a couple of places, refactor this into a small helper function that we can make use of. Since the non RO prog->aux part is not used in performance critical paths and a program destruction via RCU is rather very unlikley when doing the put, we shouldn't have an issue just doing the bpf_prog_get() + prog->type != type check, but actually not taking the ref at all (due to being in fdget() / fdput() section of the bpf fd) is even cleaner and makes the diff smaller as well, so just go for that. Callsites are changed to make use of the new helper where possible. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-01bpf: generally move prog destruction to RCU deferralDaniel Borkmann
Jann Horn reported following analysis that could potentially result in a very hard to trigger (if not impossible) UAF race, to quote his event timeline: - Set up a process with threads T1, T2 and T3 - Let T1 set up a socket filter F1 that invokes another filter F2 through a BPF map [tail call] - Let T1 trigger the socket filter via a unix domain socket write, don't wait for completion - Let T2 call PERF_EVENT_IOC_SET_BPF with F2, don't wait for completion - Now T2 should be behind bpf_prog_get(), but before bpf_prog_put() - Let T3 close the file descriptor for F2, dropping the reference count of F2 to 2 - At this point, T1 should have looked up F2 from the map, but not finished executing it - Let T3 remove F2 from the BPF map, dropping the reference count of F2 to 1 - Now T2 should call bpf_prog_put() (wrong BPF program type), dropping the reference count of F2 to 0 and scheduling bpf_prog_free_deferred() via schedule_work() - At this point, the BPF program could be freed - BPF execution is still running in a freed BPF program While at PERF_EVENT_IOC_SET_BPF time it's only guaranteed that the perf event fd we're doing the syscall on doesn't disappear from underneath us for whole syscall time, it may not be the case for the bpf fd used as an argument only after we did the put. It needs to be a valid fd pointing to a BPF program at the time of the call to make the bpf_prog_get() and while T2 gets preempted, F2 must have dropped reference to 1 on the other CPU. The fput() from the close() in T3 should also add additionally delay to the reference drop via exit_task_work() when bpf_prog_release() gets called as well as scheduling bpf_prog_free_deferred(). That said, it makes nevertheless sense to move the BPF prog destruction generally after RCU grace period to guarantee that such scenario above, but also others as recently fixed in ceb56070359b ("bpf, perf: delay release of BPF prog after grace period") with regards to tail calls won't happen. Integrating bpf_prog_free_deferred() directly into the RCU callback is not allowed since the invocation might happen from either softirq or process context, so we're not permitted to block. Reviewing all bpf_prog_put() invocations from eBPF side (note, cBPF -> eBPF progs don't use this for their destruction) with call_rcu() look good to me. Since we don't know whether at the time of attaching the program, we're already part of a tail call map, we need to use RCU variant. However, due to this, there won't be severely more stress on the RCU callback queue: situations with above bpf_prog_get() and bpf_prog_put() combo in practice normally won't lead to releases, but even if they would, enough effort/ cycles have to be put into loading a BPF program into the kernel already. Reported-by: Jann Horn <jannh@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-01timers/nohz: Capitalize 'CPU' consistentlyIngo Molnar
While reviewing another patch I noticed that kernel/time/tick-sched.c had a charmingly (confusingly, annoyingly) rich set of variants for spelling 'CPU': cpu cpus CPU CPUs per CPU per-CPU per cpu ... sometimes these were mixed even within the same comment block! Compress these variants down to a single consistent set of: CPU CPUs per-CPU Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-01timers/nohz: Fix several typosWei Jiangang
Signed-off-by: Wei Jiangang <weijg.fnst@cn.fujitsu.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: fenghua.yu@intel.com Link: http://lkml.kernel.org/r/1467175910-2966-2-git-send-email-weijg.fnst@cn.fujitsu.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-06-30cred: Reject inodes with invalid ids in set_create_file_as()Seth Forshee
Using INVALID_[UG]ID for the LSM file creation context doesn't make sense, so return an error if the inode passed to set_create_file_as() has an invalid id. Signed-off-by: Seth Forshee <seth.forshee@canonical.com> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2016-06-30timekeeping: export get_monotonic_coarse64 symbolGregor Boirie
EXPORT_SYMBOL() get_monotonic_coarse64 for new IIO timestamping clock selection usage. This provides user apps the ability to request a particular IIO device to timestamp samples using a monotonic coarse clock granularity. Signed-off-by: Gregor Boirie <gregor.boirie@parrot.com> Signed-off-by: Jonathan Cameron <jic23@kernel.org>
2016-06-30bpf: don't use raw processor id in generic helperDaniel Borkmann
Use smp_processor_id() for the generic helper bpf_get_smp_processor_id() instead of the raw variant. This allows for preemption checks when we have DEBUG_PREEMPT, and otherwise uses the raw variant anyway. We only need to keep the raw variant for socket filters, but we can reuse the helper that is already there from cBPF side. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-30bpf, trace: add BPF_F_CURRENT_CPU flag for bpf_perf_event_readDaniel Borkmann
Follow-up commit to 1e33759c788c ("bpf, trace: add BPF_F_CURRENT_CPU flag for bpf_perf_event_output") to add the same functionality into bpf_perf_event_read() helper. The split of index into flags and index component is also safe here, since such large maps are rejected during map allocation time. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-30bpf, trace: fetch current cpu only onceDaniel Borkmann
We currently have two invocations, which is unnecessary. Fetch it only once and use the smp_processor_id() variant, so we also get preemption checks along with it when DEBUG_PREEMPT is set. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-30bpf: minor cleanups on fd maps and helpersDaniel Borkmann
Some minor cleanups: i) Remove the unlikely() from fd array map lookups and let the CPU branch predictor do its job, scenarios where there is not always a map entry are very well valid. ii) Move the attribute type check in the bpf_perf_event_read() helper a bit earlier so it's consistent wrt checks with bpf_perf_event_output() helper as well. iii) remove some comments that are self-documenting in kprobe_prog_is_valid_access() and therefore make it consistent to tp_prog_is_valid_access() as well. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-30Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Several cases of overlapping changes, except the packet scheduler conflicts which deal with the addition of the free list parameter to qdisc_enqueue(). Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-30Merge branch 'for-mingo' of ↵Ingo Molnar
git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu Pull RCU changes from Paul E. McKenney: - Documentation updates. Just some simple changes, no design-level additions. - Miscellaneous fixes. - Torture-test updates. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-06-29Merge branch 'stable-4.7' of git://git.infradead.org/users/pcmoore/auditLinus Torvalds
Pull audit fixes from Paul Moore: "Two small patches to fix audit problems in 4.7-rcX: the first fixes a potential kref leak, the second removes some header file noise. The first is an important bug fix that really should go in before 4.7 is released, the second is not critical, but falls into the very-nice- to-have category so I'm including in the pull request. Both patches are straightforward, self-contained, and pass our testsuite without problem" * 'stable-4.7' of git://git.infradead.org/users/pcmoore/audit: audit: move audit_get_tty to reduce scope and kabi changes audit: move calcs after alloc and check when logging set loginuid
2016-06-29Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds
Pull networking fixes from David Miller: "I've been traveling so this accumulates more than week or so of bug fixing. It perhaps looks a little worse than it really is. 1) Fix deadlock in ath10k driver, from Ben Greear. 2) Increase scan timeout in iwlwifi, from Luca Coelho. 3) Unbreak STP by properly reinjecting STP packets back into the stack. Regression fix from Ido Schimmel. 4) Mediatek driver fixes (missing malloc failure checks, leaking of scratch memory, wrong indexing when mapping TX buffers, etc.) from John Crispin. 5) Fix endianness bug in icmpv6_err() handler, from Hannes Frederic Sowa. 6) Fix hashing of flows in UDP in the ruseport case, from Xuemin Su. 7) Fix netlink notifications in ovs for tunnels, delete link messages are never emitted because of how the device registry state is handled. From Nicolas Dichtel. 8) Conntrack module leaks kmemcache on unload, from Florian Westphal. 9) Prevent endless jump loops in nft rules, from Liping Zhang and Pablo Neira Ayuso. 10) Not early enough spinlock initialization in mlx4, from Eric Dumazet. 11) Bind refcount leak in act_ipt, from Cong WANG. 12) Missing RCU locking in HTB scheduler, from Florian Westphal. 13) Several small MACSEC bug fixes from Sabrina Dubroca (missing RCU barrier, using heap for SG and IV, and erroneous use of async flag when allocating AEAD conext.) 14) RCU handling fix in TIPC, from Ying Xue. 15) Pass correct protocol down into ipv4_{update_pmtu,redirect}() in SIT driver, from Simon Horman. 16) Socket timer deadlock fix in TIPC from Jon Paul Maloy. 17) Fix potential deadlock in team enslave, from Ido Schimmel. 18) Memory leak in KCM procfs handling, from Jiri Slaby. 19) ESN generation fix in ipv4 ESP, from Herbert Xu. 20) Fix GFP_KERNEL allocations with locks held in act_ife, from Cong WANG. 21) Use after free in netem, from Eric Dumazet. 22) Uninitialized last assert time in multicast router code, from Tom Goff. 23) Skip raw sockets in sock_diag destruction broadcast, from Willem de Bruijn. 24) Fix link status reporting in thunderx, from Sunil Goutham. 25) Limit resegmentation of retransmit queue so that we do not retransmit too large GSO frames. From Eric Dumazet. 26) Delay bpf program release after grace period, from Daniel Borkmann" * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (141 commits) openvswitch: fix conntrack netlink event delivery qed: Protect the doorbell BAR with the write barriers. neigh: Explicitly declare RCU-bh read side critical section in neigh_xmit() e1000e: keep VLAN interfaces functional after rxvlan off cfg80211: fix proto in ieee80211_data_to_8023 for frames without LLC header qlcnic: use the correct ring in qlcnic_83xx_process_rcv_ring_diag() bpf, perf: delay release of BPF prog after grace period net: bridge: fix vlan stats continue counter tcp: do not send too big packets at retransmit time ibmvnic: fix to use list_for_each_safe() when delete items net: thunderx: Fix TL4 configuration for secondary Qsets net: thunderx: Fix link status reporting net/mlx5e: Reorganize ethtool statistics net/mlx5e: Fix number of PFC counters reported to ethtool net/mlx5e: Prevent adding the same vxlan port net/mlx5e: Check for BlueFlame capability before allocating SQ uar net/mlx5e: Change enum to better reflect usage net/mlx5: Add ConnectX-5 PCIe 4.0 to list of supported devices net/mlx5: Update command strings net: marvell: Add separate config ANEG function for Marvell 88E1111 ...
2016-06-29Merge branch 'for-4.7-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup fixes from Tejun Heo: "Three fix patches. Two are for cgroup / css init failure path. The last one makes css_set_lock irq-safe as the deadline scheduler ends up calling put_css_set() from irq context" * 'for-4.7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup: Disable IRQs while holding css_set_lock cgroup: set css->id to -1 during init cgroup: remove redundant cleanup in css_create
2016-06-29bpf, perf: delay release of BPF prog after grace periodDaniel Borkmann
Commit dead9f29ddcc ("perf: Fix race in BPF program unregister") moved destruction of BPF program from free_event_rcu() callback to __free_event(), which is problematic if used with tail calls: if prog A is attached as trace event directly, but at the same time present in a tail call map used by another trace event program elsewhere, then we need to delay destruction via RCU grace period since it can still be in use by the program doing the tail call (the prog first needs to be dropped from the tail call map, then trace event with prog A attached destroyed, so we get immediate destruction). Fixes: dead9f29ddcc ("perf: Fix race in BPF program unregister") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Cc: Jann Horn <jann@thejh.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-28audit: move audit_get_tty to reduce scope and kabi changesRichard Guy Briggs
The only users of audit_get_tty and audit_put_tty are internal to audit, so move it out of include/linux/audit.h to kernel.h and create a proper function rather than inlining it. This also reduces kABI changes. Suggested-by: Paul Moore <pmoore@redhat.com> Signed-off-by: Richard Guy Briggs <rgb@redhat.com> [PM: line wrapped description] Signed-off-by: Paul Moore <paul@paul-moore.com>
2016-06-28audit: move calcs after alloc and check when logging set loginuidRichard Guy Briggs
Move the calculations of values after the allocation in case the allocation fails. This avoids wasting effort in the rare case that it fails, but more importantly saves us extra logic to release the tty ref. Signed-off-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Paul Moore <paul@paul-moore.com>
2016-06-28PM / sleep: make PM notifiers called symmetricallyLianwei Wang
This makes pm notifier PREPARE/POST symmetrical: if PREPARE fails, we will only undo what ever happened on PREPARE. It fixes the unbalanced CPU hotplug enable in CPU PM notifier. Signed-off-by: Lianwei Wang <lianwei.wang@gmail.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-06-27tracing/function_graph: Fix filters for function_graph thresholdJoel Fernandes
Function graph tracer currently ignores filters if tracing_thresh is set. For example, even if set_ftrace_pid is set, then its ignored if tracing_thresh set, resulting in all processes being traced. To fix this, we reuse the same entry function as when tracing_thresh is not set and do everything as in the regular case except for writing the function entry to the ring buffer. Link: http://lkml.kernel.org/r/1466228694-2677-1-git-send-email-agnel.joel@gmail.com Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@redhat.com> Signed-off-by: Joel Fernandes <agnel.joel@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2016-06-27audit: add fields to exclude filter by reusing user filterRichard Guy Briggs
RFE: add additional fields for use in audit filter exclude rules https://github.com/linux-audit/audit-kernel/issues/5 Re-factor and combine audit_filter_type() with audit_filter_user() to use audit_filter_user_rules() to enable the exclude filter to additionally filter on PID, UID, GID, AUID, LOGINUID_SET, SUBJ_*. The process of combining the similar audit_filter_user() and audit_filter_type() functions, required inverting the meaning and including the ALWAYS action of the latter. Include audit_filter_user_rules() into audit_filter(), removing unneeded logic in the process. Keep the check to quit early if the list is empty. Signed-off-by: Richard Guy Briggs <rgb@redhat.com> [PM: checkpatch.pl fixes - whitespace damage, wrapped description] Signed-off-by: Paul Moore <paul@paul-moore.com>
2016-06-27sched/fair: Rework throttle_count syncPeter Zijlstra
Since we already take rq->lock when creating a cgroup, use it to also sync the throttle_count and avoid the extra state and enqueue path branch. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: bsegall@google.com Cc: linux-kernel@vger.kernel.org [ Fixed build warning. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-06-27sched/core: Fix sched_getaffinity() return value kerneldoc commentZev Weiss
Previous version was probably written referencing the man page for glibc's wrapper, but the wrapper's behavior differs from that of the syscall itself in this case. Signed-off-by: Zev Weiss <zev@bewilderbeest.net> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Link: http://lkml.kernel.org/r/1466975603-25408-1-git-send-email-zev@bewilderbeest.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-06-27sched/fair: Reorder cgroup creation codePeter Zijlstra
A future patch needs rq->lock held _after_ we link the task_group into the hierarchy. In order to avoid taking every rq->lock twice, reorder things a little and create online_fair_sched_group() to be called after we link the task_group. All this code is still ran from css_alloc() so css_online() isn't in fact used for this. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: bsegall@google.com Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-06-27sched/fair: Apply more PELT fixesPeter Zijlstra
One additional 'rule' for using update_cfs_rq_load_avg() is that one should call update_tg_load_avg() if it returns true. Add a bunch of comments to hopefully clarify some of the rules: o You need to update cfs_rq _before_ any entity attach/detach, this is important, because while for mathmatical consisency this isn't strictly needed, it is required for the physical interpretation of the model, you attach/detach _now_. o When you modify the cfs_rq avg, you have to then call update_tg_load_avg() in order to propagate changes upwards. o (Fair) entities are always attached, switched_{to,from}_fair() deal with !fair. This directly follows from the definition of the cfs_rq averages, namely that they are a direct sum of all (runnable or blocked) entities on that rq. It is the second rule that this patch enforces, but it adds comments pertaining to all of them. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-06-27sched/fair: Fix PELT integrity for new tasksPeter Zijlstra
Vincent and Yuyang found another few scenarios in which entity tracking goes wobbly. The scenarios are basically due to the fact that new tasks are not immediately attached and thereby differ from the normal situation -- a task is always attached to a cfs_rq load average (such that it includes its blocked contribution) and are explicitly detached/attached on migration to another cfs_rq. Scenario 1: switch to fair class p->sched_class = fair_class; if (queued) enqueue_task(p); ... enqueue_entity() enqueue_entity_load_avg() migrated = !sa->last_update_time (true) if (migrated) attach_entity_load_avg() check_class_changed() switched_from() (!fair) switched_to() (fair) switched_to_fair() attach_entity_load_avg() If @p is a new task that hasn't been fair before, it will have !last_update_time and, per the above, end up in attach_entity_load_avg() _twice_. Scenario 2: change between cgroups sched_move_group(p) if (queued) dequeue_task() task_move_group_fair() detach_task_cfs_rq() detach_entity_load_avg() set_task_rq() attach_task_cfs_rq() attach_entity_load_avg() if (queued) enqueue_task(); ... enqueue_entity() enqueue_entity_load_avg() migrated = !sa->last_update_time (true) if (migrated) attach_entity_load_avg() Similar as with scenario 1, if @p is a new task, it will have !load_update_time and we'll end up in attach_entity_load_avg() _twice_. Furthermore, notice how we do a detach_entity_load_avg() on something that wasn't attached to begin with. As stated above; the problem is that the new task isn't yet attached to the load tracking and thereby violates the invariant assumption. This patch remedies this by ensuring a new task is indeed properly attached to the load tracking on creation, through post_init_entity_util_avg(). Of course, this isn't entirely as straightforward as one might think, since the task is hashed before we call wake_up_new_task() and thus can be poked at. We avoid this by adding TASK_NEW and teaching cpu_cgroup_can_attach() to refuse such tasks. Reported-by: Yuyang Du <yuyang.du@intel.com> Reported-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-06-27sched/cgroup: Fix cpu_cgroup_fork() handlingVincent Guittot
A new fair task is detached and attached from/to task_group with: cgroup_post_fork() ss->fork(child) := cpu_cgroup_fork() sched_move_task() task_move_group_fair() Which is wrong, because at this point in fork() the task isn't fully initialized and it cannot 'move' to another group, because its not attached to any group as yet. In fact, cpu_cgroup_fork() needs a small part of sched_move_task() so we can just call this small part directly instead sched_move_task(). And the task doesn't really migrate because it is not yet attached so we need the following sequence: do_fork() sched_fork() __set_task_cpu() cgroup_post_fork() set_task_rq() # set task group and runqueue wake_up_new_task() select_task_rq() can select a new cpu __set_task_cpu post_init_entity_util_avg attach_task_cfs_rq() activate_task enqueue_task This patch makes that happen. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> [ Added TASK_SET_GROUP to set depth properly. ] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-06-27sched/fair: Fix PELT integrity for new groupsPeter Zijlstra
Vincent reported that when a new task is moved into a new cgroup it gets attached twice to the load tracking: sched_move_task() task_move_group_fair() detach_task_cfs_rq() set_task_rq() attach_task_cfs_rq() attach_entity_load_avg() se->avg.last_load_update = cfs_rq->avg.last_load_update // == 0 enqueue_entity() enqueue_entity_load_avg() update_cfs_rq_load_avg() now = clock() __update_load_avg(&cfs_rq->avg) cfs_rq->avg.last_load_update = now // ages load/util for: now - 0, load/util -> 0 if (migrated) attach_entity_load_avg() se->avg.last_load_update = cfs_rq->avg.last_load_update; // now != 0 The problem is that we don't update cfs_rq load_avg before all entity attach/detach operations. Only enqueue_task() and migrate_task() do this. By fixing this, the above will not happen, because the sched_move_task() attach will have updated cfs_rq's last_load_update time before attach, and in turn the attach will have set the entity's last_load_update stamp. Note that there is a further problem with sched_move_task() calling detach on a task that hasn't yet been attached; this will be taken care of in a subsequent patch. Reported-by: Vincent Guittot <vincent.guittot@linaro.org> Tested-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Yuyang Du <yuyang.du@intel.com> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-06-27sched/fair: Fix and optimize the fork() pathPeter Zijlstra
The task_fork_fair() callback already calls __set_task_cpu() and takes rq->lock. If we move the sched_class::task_fork callback in sched_fork() under the existing p->pi_lock, right after its set_task_cpu() call, we can avoid doing two such calls and omit the IRQ disabling on the rq->lock. Change to __set_task_cpu() to skip the migration bits, this is a new task, not a migration. Similarly, make wake_up_new_task() use __set_task_cpu() for the same reason, the task hasn't actually migrated as it hasn't ever ran. This cures the problem of calling migrate_task_rq_fair(), which does remove_entity_from_load_avg() on tasks that have never been added to the load avg to begin with. This bug would result in transiently messed up load_avg values, averaged out after a few dozen milliseconds. This is probably the reason why this bug was not found for such a long time. Reported-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-06-27locking/qspinlock: Use __this_cpu_dec() instead of full-blown this_cpu_dec()Pan Xinhui
queued_spin_lock_slowpath() should not worry about another queued_spin_lock_slowpath() running in interrupt context and changing node->count by accident, because node->count keeps the same value every time we enter/leave queued_spin_lock_slowpath(). On some architectures this_cpu_dec() will save/restore irq flags, which has high overhead. Use the much cheaper __this_cpu_dec() instead. Signed-off-by: Pan Xinhui <xinhui.pan@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Waiman.Long@hpe.com Link: http://lkml.kernel.org/r/1465886247-3773-1-git-send-email-xinhui.pan@linux.vnet.ibm.com [ Rewrote changelog. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-06-27Merge branch 'sched/urgent' into sched/core, to pick up fixesIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-06-27Merge tag 'v4.7-rc5' into perf/core, to pick up fixesIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-06-27sched/fair: Fix calc_cfs_shares() fixed point arithmetics width confusionPeter Zijlstra
Commit: fde7d22e01aa ("sched/fair: Fix overly small weight for interactive group entities") did something non-obvious but also did it buggy yet latent. The problem was exposed for real by a later commit in the v4.7 merge window: 2159197d6677 ("sched/core: Enable increased load resolution on 64-bit kernels") ... after which tg->load_avg and cfs_rq->load.weight had different units (10 bit fixed point and 20 bit fixed point resp.). Add a comment to explain the use of cfs_rq->load.weight over the 'natural' cfs_rq->avg.load_avg and add scale_load_down() to correct for the difference in unit. Since this is (now, as per a previous commit) the only user of calc_tg_weight(), collapse it. The effects of this bug should be randomly inconsistent SMP-balancing of cgroups workloads. Reported-by: Jirka Hladky <jhladky@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: 2159197d6677 ("sched/core: Enable increased load resolution on 64-bit kernels") Fixes: fde7d22e01aa ("sched/fair: Fix overly small weight for interactive group entities") Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-06-27sched/fair: Fix effective_load() to consistently use smoothed loadPeter Zijlstra
Starting with the following commit: fde7d22e01aa ("sched/fair: Fix overly small weight for interactive group entities") calc_tg_weight() doesn't compute the right value as expected by effective_load(). The difference is in the 'correction' term. In order to ensure \Sum rw_j >= rw_i we cannot use tg->load_avg directly, since that might be lagging a correction on the current cfs_rq->avg.load_avg value. Therefore we use tg->load_avg - cfs_rq->tg_load_avg_contrib + cfs_rq->avg.load_avg. Now, per the referenced commit, calc_tg_weight() doesn't use cfs_rq->avg.load_avg, as is later used in @w, but uses cfs_rq->load.weight instead. So stop using calc_tg_weight() and do it explicitly. The effects of this bug are wake_affine() making randomly poor choices in cgroup-intense workloads. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: <stable@vger.kernel.org> # v4.3+ Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: fde7d22e01aa ("sched/fair: Fix overly small weight for interactive group entities") Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-06-26x86/KASLR, x86/power: Remove x86 hibernation restrictionsKees Cook
With the following fix: 70595b479ce1 ("x86/power/64: Fix crash whan the hibernation code passes control to the image kernel") ... there is no longer a problem with hibernation resuming a KASLR-booted kernel image, so remove the restriction. Signed-off-by: Kees Cook <keescook@chromium.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Baoquan He <bhe@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Len Brown <len.brown@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Linux PM list <linux-pm@vger.kernel.org> Cc: Logan Gunthorpe <logang@deltatee.com> Cc: Pavel Machek <pavel@ucw.cz> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephen Smalley <sds@tycho.nsa.gov> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Yinghai Lu <yinghai@kernel.org> Cc: linux-doc@vger.kernel.org Link: http://lkml.kernel.org/r/20160613221002.GA29719@www.outflux.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-06-25Merge branch 'sched-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Thomas Gleixner: "A couple of scheduler fixes: - force watchdog reset while processing sysrq-w - fix a deadlock when enabling trace events in the scheduler - fixes to the throttled next buddy logic - fixes for the average accounting (missing serialization and underflow handling) - allow kernel threads for fallback to online but not active cpus" * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/core: Allow kthreads to fall back to online && !active cpus sched/fair: Do not announce throttled next buddy in dequeue_task_fair() sched/fair: Initialize throttle_count for new task-groups lazily sched/fair: Fix cfs_rq avg tracking underflow kernel/sysrq, watchdog, sched/core: Reset watchdog on all CPUs while processing sysrq-w sched/debug: Fix deadlock when enabling sched events sched/fair: Fix post_init_entity_util_avg() serialization
2016-06-25Merge branch 'locking-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking fix from Thomas Gleixner: "A single fix to address a race in the static key logic" * 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: locking/static_key: Fix concurrent static_key_slow_inc()
2016-06-25Fix build break in fork.c when THREAD_SIZE < PAGE_SIZEMichael Ellerman
Commit b235beea9e99 ("Clarify naming of thread info/stack allocators") breaks the build on some powerpc configs, where THREAD_SIZE < PAGE_SIZE: kernel/fork.c:235:2: error: implicit declaration of function 'free_thread_stack' kernel/fork.c:355:8: error: assignment from incompatible pointer type stack = alloc_thread_stack_node(tsk, node); ^ Fix it by renaming free_stack() to free_thread_stack(), and updating the return type of alloc_thread_stack_node(). Fixes: b235beea9e99 ("Clarify naming of thread info/stack allocators") Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-06-24Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge misc fixes from Andrew Morton: "Two weeks worth of fixes here" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (41 commits) init/main.c: fix initcall_blacklisted on ia64, ppc64 and parisc64 autofs: don't get stuck in a loop if vfs_write() returns an error mm/page_owner: avoid null pointer dereference tools/vm/slabinfo: fix spelling mistake: "Ocurrences" -> "Occurrences" fs/nilfs2: fix potential underflow in call to crc32_le oom, suspend: fix oom_reaper vs. oom_killer_disable race ocfs2: disable BUG assertions in reading blocks mm, compaction: abort free scanner if split fails mm: prevent KASAN false positives in kmemleak mm/hugetlb: clear compound_mapcount when freeing gigantic pages mm/swap.c: flush lru pvecs on compound page arrival memcg: css_alloc should return an ERR_PTR value on error memcg: mem_cgroup_migrate() may be called with irq disabled hugetlb: fix nr_pmds accounting with shared page tables Revert "mm: disable fault around on emulated access bit architecture" Revert "mm: make faultaround produce old ptes" mailmap: add Boris Brezillon's email mailmap: add Antoine Tenart's email mm, sl[au]b: add __GFP_ATOMIC to the GFP reclaim mask mm: mempool: kasan: don't poot mempool objects in quarantine ...
2016-06-24oom, suspend: fix oom_reaper vs. oom_killer_disable raceMichal Hocko
Tetsuo has reported the following potential oom_killer_disable vs. oom_reaper race: (1) freeze_processes() starts freezing user space threads. (2) Somebody (maybe a kenrel thread) calls out_of_memory(). (3) The OOM killer calls mark_oom_victim() on a user space thread P1 which is already in __refrigerator(). (4) oom_killer_disable() sets oom_killer_disabled = true. (5) P1 leaves __refrigerator() and enters do_exit(). (6) The OOM reaper calls exit_oom_victim(P1) before P1 can call exit_oom_victim(P1). (7) oom_killer_disable() returns while P1 not yet finished (8) P1 perform IO/interfere with the freezer. This situation is unfortunate. We cannot move oom_killer_disable after all the freezable kernel threads are frozen because the oom victim might depend on some of those kthreads to make a forward progress to exit so we could deadlock. It is also far from trivial to teach the oom_reaper to not call exit_oom_victim() because then we would lose a guarantee of the OOM killer and oom_killer_disable forward progress because exit_mm->mmput might block and never call exit_oom_victim. It seems the easiest way forward is to workaround this race by calling try_to_freeze_tasks again after oom_killer_disable. This will make sure that all the tasks are frozen or it bails out. Fixes: 449d777d7ad6 ("mm, oom_reaper: clear TIF_MEMDIE for all tasks queued for oom_reaper") Link: http://lkml.kernel.org/r/1466597634-16199-1-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-06-24Clarify naming of thread info/stack allocatorsLinus Torvalds
We've had the thread info allocated together with the thread stack for most architectures for a long time (since the thread_info was split off from the task struct), but that is about to change. But the patches that move the thread info to be off-stack (and a part of the task struct instead) made it clear how confused the allocator and freeing functions are. Because the common case was that we share an allocation with the thread stack and the thread_info, the two pointers were identical. That identity then meant that we would have things like ti = alloc_thread_info_node(tsk, node); ... tsk->stack = ti; which certainly _worked_ (since stack and thread_info have the same value), but is rather confusing: why are we assigning a thread_info to the stack? And if we move the thread_info away, the "confusing" code just gets to be entirely bogus. So remove all this confusion, and make it clear that we are doing the stack allocation by renaming and clarifying the function names to be about the stack. The fact that the thread_info then shares the allocation is an implementation detail, and not really about the allocation itself. This is a pure renaming and type fix: we pass in the same pointer, it's just that we clarify what the pointer means. The ia64 code that actually only has one single allocation (for all of task_struct, thread_info and kernel thread stack) now looks a bit odd, but since "tsk->stack" is actually not even used there, that oddity doesn't matter. It would be a separate thing to clean that up, I intentionally left the ia64 changes as a pure brute-force renaming and type change. Acked-by: Andy Lutomirski <luto@amacapital.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-06-24libnvdimm, pmem: allow nfit_test to override pmem_direct_access()Dan Williams
Currently phys_to_pfn_t() is an exported symbol to allow nfit_test to override it and indicate that nfit_test-pmem is not device-mapped. Now, we want to enable nfit_test to operate without DMA_CMA and the pmem it provides will no longer be physically contiguous, i.e. won't be capable of supporting direct_access requests larger than a page. Make pmem_direct_access() a weak symbol so that it can be replaced by the tools/testing/nvdimm/ version, and move phys_to_pfn_t() to a static inline now that it no longer needs to be overridden. Acked-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-06-24fs: Limit file caps to the user namespace of the super blockSeth Forshee
Capability sets attached to files must be ignored except in the user namespaces where the mounter is privileged, i.e. s_user_ns and its descendants. Otherwise a vector exists for gaining privileges in namespaces where a user is not already privileged. Add a new helper function, current_in_user_ns(), to test whether a user namespace is the same as or a descendant of another namespace. Use this helper to determine whether a file's capability set should be applied to the caps constructed during exec. --EWB Replaced in_userns with the simpler current_in_userns. Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Seth Forshee <seth.forshee@canonical.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>