summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2016-01-29perf/bpf: Convert perf_event_array to use struct fileAlexei Starovoitov
Robustify refcounting. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: David Ahern <dsahern@gmail.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: Wang Nan <wangnan0@huawei.com> Cc: vince@deater.net Link: http://lkml.kernel.org/r/20160126045947.GA40151@ast-mbp.thefacebook.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-01-29perf: Fix NULL derefPeter Zijlstra
Dan reported: 1229 if (ctx->task == TASK_TOMBSTONE || 1230 !atomic_inc_not_zero(&ctx->refcount)) { 1231 raw_spin_unlock(&ctx->lock); 1232 ctx = NULL; ^^^^^^^^^^ ctx is NULL. 1233 } 1234 1235 WARN_ON_ONCE(ctx->task != task); ^^^^^^^^^^^^^^^^^ The patch adds a NULL dereference. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: David Ahern <dsahern@gmail.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Fixes: 63b6da39bb38 ("perf: Fix perf_event_exit_task() race") Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-01-28Merge tag 'trace-v4.5-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull minor tracing fixes from Steven Rostedt: "This includes three minor fixes, mostly due to cut-and-paste issues. The first is a cut and paste issue that changed the amount of stack to skip when tracing a stack dump from 0 to 6, which basically made the stack disappear for small stack traces. The second fix is just removing an unused field in a struct that is no longer used, and currently just wastes space. The third is another cut-and-paste fix that had a tracepoint recording the wrong field (it was recording the previous field a second time)" * tag 'trace-v4.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: tracing/dma-buf/fence: Fix timeline str value on fence_annotate_wait_on ftrace: Remove unused nr_trampolines var tracing: Fix stacktrace skip depth in trace_buffer_unlock_commit_regs()
2016-01-28perf: Fix race in perf_event_exit_task_context()Peter Zijlstra
There is a race between perf_event_exit_task_context() and orphans_remove_work() which results in a use-after-free. We mark ctx->task with TASK_TOMBSTONE to indicate a context is 'dead', under ctx->lock. After which point event_function_call() on any event of that context will NOP A concurrent orphans_remove_work() will only hold ctx->mutex for the list iteration and not serialize against this. Therefore its possible that orphans_remove_work()'s perf_remove_from_context() call will fail, but we'll continue to free the event, with the result of free'd memory still being on lists and everything. Once perf_event_exit_task_context() gets around to acquiring ctx->mutex it too will iterate the event list, encounter the already free'd event and proceed to free it _again_. This fails with the WARN in free_event(). Plug the race by having perf_event_exit_task_context() hold ctx::mutex over the whole tear-down, thereby 'naturally' serializing against all other sites, including the orphan work. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: alexander.shishkin@linux.intel.com Cc: dsahern@gmail.com Cc: namhyung@kernel.org Link: http://lkml.kernel.org/r/20160125130954.GY6357@twins.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-01-28perf: Fix orphan holePeter Zijlstra
We should set event->owner before we install the event, otherwise there is a hole where the target task can fork() and we'll not inherit the event because it thinks the event is orphaned. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-01-28Merge tag 'seccomp-4.5-rc2' of ↵James Morris
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux into for-linus
2016-01-27PM: APM_EMULATION does not depend on PMArnd Bergmann
The APM emulation code does multiple things, and some of them depend on PM_SLEEP, while the battery management does not. However, selecting the symbol like SHARPSL_PM does causes a Kconfig warning: warning: (SHARPSL_PM && PMAC_APM_EMU) selects APM_EMULATION which has unmet direct dependencies (PM && SYS_SUPPORTS_APM_EMULATION) From all I can tell, this is completely harmless, and we can simply allow APM_EMULATION to be enabled here, even if PM is not. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-01-27seccomp: always propagate NO_NEW_PRIVS on tsyncJann Horn
Before this patch, a process with some permissive seccomp filter that was applied by root without NO_NEW_PRIVS was able to add more filters to itself without setting NO_NEW_PRIVS by setting the new filter from a throwaway thread with NO_NEW_PRIVS. Signed-off-by: Jann Horn <jann@thejh.net> Cc: stable@vger.kernel.org Signed-off-by: Kees Cook <keescook@chromium.org>
2016-01-27tick/nohz: Set the correct expiry when switching to nohz/lowres modeWanpeng Li
commit 0ff53d096422 sets the next tick interrupt to the last jiffies update, i.e. in the past, because the forward operation is invoked before the set operation. There is no resulting damage (yet), but we get an extra pointless tick interrupt. Revert the order so we get the next tick interrupt in the future. Fixes: commit 0ff53d096422 "tick: sched: Force tick interrupt and get rid of softirq magic" Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/1453893967-3458-1-git-send-email-wanpeng.li@hotmail.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-01-26tick/sched: Hide unused oneshot timer codeArnd Bergmann
A couple of functions in kernel/time/tick-sched.c are only relevant for oneshot timer mode, i.e. when hires-timers or nohz mode are enabled. If both are disabled, we get gcc warnings about them: kernel/time/tick-sched.c:98:16: warning: 'tick_init_jiffy_update' defined but not used [-Wunused-function] static ktime_t tick_init_jiffy_update(void) ^ kernel/time/tick-sched.c:112:13: warning: 'tick_sched_do_timer' defined but not used [-Wunused-function] static void tick_sched_do_timer(ktime_t now) ^ kernel/time/tick-sched.c:134:13: warning: 'tick_sched_handle' defined but not used [-Wunused-function] static void tick_sched_handle(struct tick_sched *ts, struct pt_regs *regs) ^ This encloses the whole set of functions in an appropriate ifdef to avoid the warning and to make it clearer when they are used. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Cc: linux-arm-kernel@lists.infradead.org Link: http://lkml.kernel.org/r/1453736525-1959191-1-git-send-email-arnd@arndb.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-01-26irqdomain: Allow domain lookup with DOMAIN_BUS_WIRED tokenMarc Zyngier
Let's take the (outlandish) example of an interrupt controller capable of handling both wired interrupts and PCI MSIs. With the current code, the PCI MSI domain is going to be tagged with DOMAIN_BUS_PCI_MSI, and the wired domain with DOMAIN_BUS_ANY. Things get hairy when we start looking up the domain for a wired interrupt (typically when creating it based on some firmware information - DT or ACPI). In irq_create_fwspec_mapping(), we perform the lookup using DOMAIN_BUS_ANY, which is actually used as a wildcard. This gives us one chance out of two to end up with the wrong domain, and we try to configure a wired interrupt with the MSI domain. Everything grinds to a halt pretty quickly. What we really need to do is to start looking for a domain that would uniquely identify a wired interrupt domain, and only use DOMAIN_BUS_ANY as a fallback. In order to solve this, let's introduce a new DOMAIN_BUS_WIRED token, which is going to be used exactly as described above. Of course, this depends on the irqchip to setup the domain bus_token, and nobody had to implement this so far. Only so far. Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Rob Herring <robh+dt@kernel.org> Cc: Frank Rowand <frowand.list@gmail.com> Cc: Grant Likely <grant.likely@linaro.org> Cc: Thomas Petazzoni <thomas.petazzoni@free-electrons.com> Cc: Jiang Liu <jiang.liu@linux.intel.com> Link: http://lkml.kernel.org/r/1453816347-32720-2-git-send-email-marc.zyngier@arm.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-01-26rtmutex: Make wait_lock irq safeThomas Gleixner
Sasha reported a lockdep splat about a potential deadlock between RCU boosting rtmutex and the posix timer it_lock. CPU0 CPU1 rtmutex_lock(&rcu->rt_mutex) spin_lock(&rcu->rt_mutex.wait_lock) local_irq_disable() spin_lock(&timer->it_lock) spin_lock(&rcu->mutex.wait_lock) --> Interrupt spin_lock(&timer->it_lock) This is caused by the following code sequence on CPU1 rcu_read_lock() x = lookup(); if (x) spin_lock_irqsave(&x->it_lock); rcu_read_unlock(); return x; We could fix that in the posix timer code by keeping rcu read locked across the spinlocked and irq disabled section, but the above sequence is common and there is no reason not to support it. Taking rt_mutex.wait_lock irq safe prevents the deadlock. Reported-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
2016-01-22wrappers for ->i_mutex accessAl Viro
parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested}, inode_foo(inode) being mutex_foo(&inode->i_mutex). Please, use those for access to ->i_mutex; over the coming cycle ->i_mutex will become rwsem, with ->lookup() done with it held only shared. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-01-22Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull more vfs updates from Al Viro: "Embarrassing braino fix + pipe page accounting + fixing an eyesore in find_filesystem() (checking that s1 is equal to prefix of s2 of given length can be done in many ways, but "compare strlen(s1) with length and then do strncmp()" is not a good one...)" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: [regression] fix braino in fs/dlm/user.c pipe: limit the per-user amount of pages allocated in pipes find_filesystem(): simplify comparison
2016-01-22cgroup: make sure a parent css isn't freed before its childrenTejun Heo
There are three subsystem callbacks in css shutdown path - css_offline(), css_released() and css_free(). Except for css_released(), cgroup core didn't guarantee the order of invocation. css_offline() or css_free() could be called on a parent css before its children. This behavior is unexpected and led to bugs in cpu and memory controller. The previous patch updated ordering for css_offline() which fixes the cpu controller issue. While there currently isn't a known bug caused by misordering of css_free() invocations, let's fix it too for consistency. css_free() ordering can be trivially fixed by moving putting of the parent css below css_free() invocation. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org>
2016-01-22cgroup: make sure a parent css isn't offlined before its childrenTejun Heo
There are three subsystem callbacks in css shutdown path - css_offline(), css_released() and css_free(). Except for css_released(), cgroup core didn't guarantee the order of invocation. css_offline() or css_free() could be called on a parent css before its children. This behavior is unexpected and led to bugs in cpu and memory controller. This patch updates offline path so that a parent css is never offlined before its children. Each css keeps online_cnt which reaches zero iff itself and all its children are offline and offline_css() is invoked only after online_cnt reaches zero. This fixes the memory controller bug and allows the fix for cpu controller. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-and-tested-by: Christian Borntraeger <borntraeger@de.ibm.com> Reported-by: Brian Christiansen <brian.o.christiansen@gmail.com> Link: http://lkml.kernel.org/g/5698A023.9070703@de.ibm.com Link: http://lkml.kernel.org/g/CAKB58ikDkzc8REt31WBkD99+hxNzjK4+FBmhkgS+NVrC9vjMSg@mail.gmail.com Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: stable@vger.kernel.org
2016-01-22cpuset: make mm migration asynchronousTejun Heo
If "cpuset.memory_migrate" is set, when a process is moved from one cpuset to another with a different memory node mask, pages in used by the process are migrated to the new set of nodes. This was performed synchronously in the ->attach() callback, which is synchronized against process management. Recently, the synchronization was changed from per-process rwsem to global percpu rwsem for simplicity and optimization. Combined with the synchronous mm migration, this led to deadlocks because mm migration could schedule a work item which may in turn try to create a new worker blocking on the process management lock held from cgroup process migration path. This heavy an operation shouldn't be performed synchronously from that deep inside cgroup migration in the first place. This patch punts the actual migration to an ordered workqueue and updates cgroup process migration and cpuset config update paths to flush the workqueue after all locks are released. This way, the operations still seem synchronous to userland without entangling mm migration with process management synchronization. CPU hotplug can also invoke mm migration but there's no reason for it to wait for mm migrations and thus doesn't synchronize against their completions. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-and-tested-by: Christian Borntraeger <borntraeger@de.ibm.com> Cc: stable@vger.kernel.org # v4.4+
2016-01-22sched/numa: Fix use-after-free bug in the task_numa_compareGavin Guo
The following message can be observed on the Ubuntu v3.13.0-65 with KASan backported: ================================================================== BUG: KASan: use after free in task_numa_find_cpu+0x64c/0x890 at addr ffff880dd393ecd8 Read of size 8 by task qemu-system-x86/3998900 ============================================================================= BUG kmalloc-128 (Tainted: G B ): kasan: bad access detected ----------------------------------------------------------------------------- INFO: Allocated in task_numa_fault+0xc1b/0xed0 age=41980 cpu=18 pid=3998890 __slab_alloc+0x4f8/0x560 __kmalloc+0x1eb/0x280 task_numa_fault+0xc1b/0xed0 do_numa_page+0x192/0x200 handle_mm_fault+0x808/0x1160 __do_page_fault+0x218/0x750 do_page_fault+0x1a/0x70 page_fault+0x28/0x30 SyS_poll+0x66/0x1a0 system_call_fastpath+0x1a/0x1f INFO: Freed in task_numa_free+0x1d2/0x200 age=62 cpu=18 pid=0 __slab_free+0x2ab/0x3f0 kfree+0x161/0x170 task_numa_free+0x1d2/0x200 finish_task_switch+0x1d2/0x210 __schedule+0x5d4/0xc60 schedule_preempt_disabled+0x40/0xc0 cpu_startup_entry+0x2da/0x340 start_secondary+0x28f/0x360 Call Trace: [<ffffffff81a6ce35>] dump_stack+0x45/0x56 [<ffffffff81244aed>] print_trailer+0xfd/0x170 [<ffffffff8124ac36>] object_err+0x36/0x40 [<ffffffff8124cbf9>] kasan_report_error+0x1e9/0x3a0 [<ffffffff8124d260>] kasan_report+0x40/0x50 [<ffffffff810dda7c>] ? task_numa_find_cpu+0x64c/0x890 [<ffffffff8124bee9>] __asan_load8+0x69/0xa0 [<ffffffff814f5c38>] ? find_next_bit+0xd8/0x120 [<ffffffff810dda7c>] task_numa_find_cpu+0x64c/0x890 [<ffffffff810de16c>] task_numa_migrate+0x4ac/0x7b0 [<ffffffff810de523>] numa_migrate_preferred+0xb3/0xc0 [<ffffffff810e0b88>] task_numa_fault+0xb88/0xed0 [<ffffffff8120ef02>] do_numa_page+0x192/0x200 [<ffffffff81211038>] handle_mm_fault+0x808/0x1160 [<ffffffff810d7dbd>] ? sched_clock_cpu+0x10d/0x160 [<ffffffff81068c52>] ? native_load_tls+0x82/0xa0 [<ffffffff81a7bd68>] __do_page_fault+0x218/0x750 [<ffffffff810c2186>] ? hrtimer_try_to_cancel+0x76/0x160 [<ffffffff81a6f5e7>] ? schedule_hrtimeout_range_clock.part.24+0xf7/0x1c0 [<ffffffff81a7c2ba>] do_page_fault+0x1a/0x70 [<ffffffff81a772e8>] page_fault+0x28/0x30 [<ffffffff8128cbd4>] ? do_sys_poll+0x1c4/0x6d0 [<ffffffff810e64f6>] ? enqueue_task_fair+0x4b6/0xaa0 [<ffffffff810233c9>] ? sched_clock+0x9/0x10 [<ffffffff810cf70a>] ? resched_task+0x7a/0xc0 [<ffffffff810d0663>] ? check_preempt_curr+0xb3/0x130 [<ffffffff8128b5c0>] ? poll_select_copy_remaining+0x170/0x170 [<ffffffff810d3bc0>] ? wake_up_state+0x10/0x20 [<ffffffff8112a28f>] ? drop_futex_key_refs.isra.14+0x1f/0x90 [<ffffffff8112d40e>] ? futex_requeue+0x3de/0xba0 [<ffffffff8112e49e>] ? do_futex+0xbe/0x8f0 [<ffffffff81022c89>] ? read_tsc+0x9/0x20 [<ffffffff8111bd9d>] ? ktime_get_ts+0x12d/0x170 [<ffffffff8108f699>] ? timespec_add_safe+0x59/0xe0 [<ffffffff8128d1f6>] SyS_poll+0x66/0x1a0 [<ffffffff81a830dd>] system_call_fastpath+0x1a/0x1f As commit 1effd9f19324 ("sched/numa: Fix unsafe get_task_struct() in task_numa_assign()") points out, the rcu_read_lock() cannot protect the task_struct from being freed in the finish_task_switch(). And the bug happens in the process of calculation of imp which requires the access of p->numa_faults being freed in the following path: do_exit() current->flags |= PF_EXITING; release_task() ~~delayed_put_task_struct()~~ schedule() ... ... rq->curr = next; context_switch() finish_task_switch() put_task_struct() __put_task_struct() task_numa_free() The fix here to get_task_struct() early before end of dst_rq->lock to protect the calculation process and also put_task_struct() in the corresponding point if finally the dst_rq->curr somehow cannot be assigned. Additional credit to Liang Chen who helped fix the error logic and add the put_task_struct() to the place it missed. Signed-off-by: Gavin Guo <gavin.guo@canonical.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Hugh Dickins <hughd@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: jay.vosburgh@canonical.com Cc: liang.chen@canonical.com Link: http://lkml.kernel.org/r/1453264618-17645-1-git-send-email-gavin.guo@canonical.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-01-22ntp: Fix ADJ_SETOFFSET being used w/ ADJ_NANOJohn Stultz
Recently, in commit 37cf4dc3370f I forgot to check if the timeval being passed was actually a timespec (as is signaled with ADJ_NANO). This resulted in that patch breaking ADJ_SETOFFSET users who set ADJ_NANO, by rejecting valid timespecs that were compared with valid timeval ranges. This patch addresses this by checking for the ADJ_NANO flag and using the timepsec check instead in that case. Reported-by: Harald Hoyer <harald@redhat.com> Reported-by: Kay Sievers <kay@vrfy.org> Fixes: 37cf4dc3370f "time: Verify time values in adjtimex ADJ_SETOFFSET to avoid overflow" Signed-off-by: John Stultz <john.stultz@linaro.org> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: Richard Cochran <richardcochran@gmail.com> Cc: Prarit Bhargava <prarit@redhat.com> Cc: David Herrmann <dh.herrmann@gmail.com> Link: http://lkml.kernel.org/r/1453417415-19110-2-git-send-email-john.stultz@linaro.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-01-22cpuidle: fix fallback mechanism for suspend to idle in absence of enter_freezeSudeep Holla
Commit 51164251f5c3 "sched / idle: Drop default_idle_call() fallback from call_cpuidle()" made find_deepest_state() return non-negative value and check all the states with index > 0. Also as a result, find_deepest_state() returns 0 even when enter_freeze callbacks are not implemented and enter_freeze_proper() is called which ends up crashing the kernel. This patch updates the check for index > 0 in cpuidle_enter_freeze and cpuidle_idle_call(when idle_should_freeze is true) to restore the suspend-to-idle functionality in absence of enter_freeze callback. Fixes: 51164251f5c3 "sched / idle: Drop default_idle_call() fallback from call_cpuidle()" Signed-off-by: Sudeep Holla <sudeep.holla@arm.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-01-21Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge third patch-bomb from Andrew Morton: "I'm pretty much done for -rc1 now: - the rest of MM, basically - lib/ updates - checkpatch, epoll, hfs, fatfs, ptrace, coredump, exit - cpu_mask simplifications - kexec, rapidio, MAINTAINERS etc, etc. - more dma-mapping cleanups/simplifications from hch" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (109 commits) MAINTAINERS: add/fix git URLs for various subsystems mm: memcontrol: add "sock" to cgroup2 memory.stat mm: memcontrol: basic memory statistics in cgroup2 memory controller mm: memcontrol: do not uncharge old page in page cache replacement Documentation: cgroup: add memory.swap.{current,max} description mm: free swap cache aggressively if memcg swap is full mm: vmscan: do not scan anon pages if memcg swap limit is hit swap.h: move memcg related stuff to the end of the file mm: memcontrol: replace mem_cgroup_lruvec_online with mem_cgroup_online mm: vmscan: pass memcg to get_scan_count() mm: memcontrol: charge swap to cgroup2 mm: memcontrol: clean up alloc, online, offline, free functions mm: memcontrol: flatten struct cg_proto mm: memcontrol: rein in the CONFIG space madness net: drop tcp_memcontrol.c mm: memcontrol: introduce CONFIG_MEMCG_LEGACY_KMEM mm: memcontrol: allow to disable kmem accounting for cgroup2 mm: memcontrol: account "kmem" consumers in cgroup2 memory controller mm: memcontrol: move kmem accounting code to CONFIG_MEMCG mm: memcontrol: separate kmem code from legacy tcp accounting code ...
2016-01-21Merge tag 'pci-v4.5-changes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci Pull PCI updates from Bjorn Helgaas: "PCI changes for the v4.5 merge window: Enumeration: - Simplify config space size computation (Bjorn Helgaas) - Avoid iterating through ROM outside the resource window (Edward O'Callaghan) - Support PCIe devices with short cfg_size (Jason S. McMullan) - Add Netronome vendor and device IDs (Jason S. McMullan) - Limit config space size for Netronome NFP6000 family (Jason S. McMullan) - Add Netronome NFP4000 PF device ID (Simon Horman) - Limit config space size for Netronome NFP4000 (Simon Horman) - Print warnings for all invalid expansion ROM headers (Vladis Dronov) Resource management: - Fix minimum allocation address overwrite (Christoph Biedl) PCI device hotplug: - acpiphp_ibm: Fix null dereferences on null ibm_slot (Colin Ian King) - pciehp: Always protect pciehp_disable_slot() with hotplug mutex (Guenter Roeck) - shpchp: Constify hpc_ops structure (Julia Lawall) - ibmphp: Remove unneeded NULL test (Julia Lawall) Power management: - Make ASPM sysfs link_state_store() consistent with link_state_show() (Andy Lutomirski) Virtualization - Add function 1 DMA alias quirk for Lite-On/Plextor M6e/Marvell 88SS9183 (Tim Sander) MSI: - Remove empty pci_msi_init_pci_dev() (Bjorn Helgaas) - Mark PCIe/PCI (MSI) IRQ cascade handlers as IRQF_NO_THREAD (Grygorii Strashko) - Initialize MSI capability for all architectures (Guilherme G. Piccoli) - Relax msi_domain_alloc() to support parentless MSI irqdomains (Liu Jiang) ARM Versatile host bridge driver: - Remove unused pci_sys_data structures (Lorenzo Pieralisi) Broadcom iProc host bridge driver: - Hide CONFIG_PCIE_IPROC (Arnd Bergmann) - Do not use 0x in front of %pap (Dmitry V. Krivenok) - Update iProc PCIe device tree binding (Ray Jui) - Add PAXC interface support (Ray Jui) - Add iProc PCIe MSI device tree binding (Ray Jui) - Add iProc PCIe MSI support (Ray Jui) Freescale i.MX6 host bridge driver: - Use gpio_set_value_cansleep() (Fabio Estevam) - Add support for active-low reset GPIO (Petr Štetiar) HiSilicon host bridge driver: - Add support for HiSilicon Hip06 PCIe host controllers (Gabriele Paoloni) Intel VMD host bridge driver: - Export irq_domain_set_info() for module use (Keith Busch) - x86/PCI: Allow DMA ops specific to a PCI domain (Keith Busch) - Use 32 bit PCI domain numbers (Keith Busch) - Add driver for Intel Volume Management Device (VMD) (Keith Busch) Qualcomm host bridge driver: - Document PCIe devicetree bindings (Stanimir Varbanov) - Add Qualcomm PCIe controller driver (Stanimir Varbanov) - dts: apq8064: add PCIe devicetree node (Stanimir Varbanov) - dts: ifc6410: enable PCIe DT node for this board (Stanimir Varbanov) Renesas R-Car host bridge driver: - Add support for R-Car H3 to pcie-rcar (Harunobu Kurokawa) - Allow DT to override default window settings (Phil Edworthy) - Convert to DT resource parsing API (Phil Edworthy) - Revert "PCI: rcar: Build pcie-rcar.c only on ARM" (Phil Edworthy) - Remove unused pci_sys_data struct from pcie-rcar (Phil Edworthy) - Add runtime PM support to pcie-rcar (Phil Edworthy) - Add Gen2 PHY setup to pcie-rcar (Phil Edworthy) - Add gen2 fallback compatibility string for pci-rcar-gen2 (Simon Horman) - Add gen2 fallback compatibility string for pcie-rcar (Simon Horman) Synopsys DesignWare host bridge driver: - Simplify control flow (Bjorn Helgaas) - Make config accessor override checking symmetric (Bjorn Helgaas) - Ensure ATU is enabled before IO/conf space accesses (Stanimir Varbanov) Miscellaneous: - Add of_pci_get_host_bridge_resources() stub (Arnd Bergmann) - Check for PCI_HEADER_TYPE_BRIDGE equality, not bitmask (Bjorn Helgaas) - Fix all whitespace issues (Bogicevic Sasa) - x86/PCI: Simplify pci_bios_{read,write} (Geliang Tang) - Use to_pci_dev() instead of open-coding it (Geliang Tang) - Use kobj_to_dev() instead of open-coding it (Geliang Tang) - Use list_for_each_entry() to simplify code (Geliang Tang) - Fix typos in <linux/msi.h> (Thomas Petazzoni) - x86/PCI: Clarify AMD Fam10h config access restrictions comment (Tomasz Nowicki)" * tag 'pci-v4.5-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (58 commits) PCI: Add function 1 DMA alias quirk for Lite-On/Plextor M6e/Marvell 88SS9183 PCI: Limit config space size for Netronome NFP4000 PCI: Add Netronome NFP4000 PF device ID x86/PCI: Add driver for Intel Volume Management Device (VMD) PCI/AER: Use 32 bit PCI domain numbers x86/PCI: Allow DMA ops specific to a PCI domain irqdomain: Export irq_domain_set_info() for module use PCI: host: Add of_pci_get_host_bridge_resources() stub genirq/MSI: Relax msi_domain_alloc() to support parentless MSI irqdomains PCI: rcar: Add Gen2 PHY setup to pcie-rcar PCI: rcar: Add runtime PM support to pcie-rcar PCI: designware: Make config accessor override checking symmetric PCI: ibmphp: Remove unneeded NULL test ARM: dts: ifc6410: enable PCIe DT node for this board ARM: dts: apq8064: add PCIe devicetree node PCI: hotplug: Use list_for_each_entry() to simplify code PCI: rcar: Remove unused pci_sys_data struct from pcie-rcar PCI: hisi: Add support for HiSilicon Hip06 PCIe host controllers PCI: Avoid iterating through memory outside the resource window PCI: acpiphp_ibm: Fix null dereferences on null ibm_slot ...
2016-01-21perf: Synchronously free aux pages in case of allocation failureAlexander Shishkin
We are currently using asynchronous deallocation in the error path in AUX mmap code, which is unnecessary and also presents a problem for users that wish to probe for the biggest possible buffer size they can get: they'll get -EINVAL on all subsequent attemts to allocate a smaller buffer before the asynchronous deallocation callback frees up the pages from the previous unsuccessful attempt. Currently, gdb does that for allocating AUX buffers for Intel PT traces. More specifically, overwrite mode of AUX pmus that don't support hardware sg (some implementations of Intel PT, for instance) is limited to only one contiguous high order allocation for its buffer and there is no way of knowing its size without trying. This patch changes error path freeing to be synchronous as there won't be any contenders for the AUX pages at that point. Reported-by: Markus Metzger <markus.t.metzger@intel.com> Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: David Ahern <dsahern@gmail.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: vince@deater.net Link: http://lkml.kernel.org/r/1453216469-9509-1-git-send-email-alexander.shishkin@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-01-21perf: Fix perf_event_exit_task() racePeter Zijlstra
There is a race against perf_event_exit_task() vs event_function_call(),find_get_context(),perf_install_in_context() (iow, everyone). Since there is no permanent marker on a context that its dead, it is quite possible that we access (and even modify) a context after its passed through perf_event_exit_task(). For instance, find_get_context() might find the context still installed, but by the time we get to perf_install_in_context() it might already have passed through perf_event_exit_task() and be considered dead, we will however still add the event to it. Solve this by marking a ctx dead by setting its ctx->task value to -1, it must be !0 so we still know its a (former) task context. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: David Ahern <dsahern@gmail.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-01-21perf: Add more assertionsPeter Zijlstra
Try to trigger warnings before races do damage. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: David Ahern <dsahern@gmail.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-01-21perf: Collapse and fix event_function_call() usersPeter Zijlstra
There is one common bug left in all the event_function_call() users, between loading ctx->task and getting to the remote_function(), ctx->task can already have been changed. Therefore we need to double check and retry if ctx->task != current. Insert another trampoline specific to event_function_call() that checks for this and further validates state. This also allows getting rid of the active/inactive functions. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: David Ahern <dsahern@gmail.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-01-21perf: Specialize perf_event_exit_task()Peter Zijlstra
The perf_remove_from_context() usage in __perf_event_exit_task() is different from the other usages in that this site has already detached and scheduled out the task context. This will stand in the way of stronger assertions checking the (task) context scheduling invariants. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: David Ahern <dsahern@gmail.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-01-21perf: Fix task context schedulingPeter Zijlstra
There is a very nasty problem wrt disabling the perf task scheduling hooks. Currently we {set,clear} ctx->is_active on every __perf_event_task_sched_{in,out}, _however_ this means that if we disable these calls we'll have task contexts with ->is_active set that are not active and 'active' task contexts without ->is_active set. This can result in event_function_call() looping on the ctx->is_active condition basically indefinitely. Resolve this by changing things such that contexts without events do not set ->is_active like we used to. From this invariant it trivially follows that if there are no (task) events, every task ctx is inactive and disabling the context switch hooks is harmless. This leaves two places that need attention (and already had accumulated weird and wonderful hacks to work around, without recognising this actual problem). Namely: - perf_install_in_context() will need to deal with installing events in an inactive context, meaning it cannot rely on ctx-is_active for its IPIs. - perf_remove_from_context() will have to mark a context as inactive when it removes the last event. For specific detail, see the patch/comments. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: David Ahern <dsahern@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-01-21perf: Make ctx->is_active and cpuctx->task_ctx consistentPeter Zijlstra
For no apparent reason and to great confusion the rules for ctx->is_active and cpuctx->task_ctx are different. This means that its not always possible to find all active (task) contexts. Fix this such that if ctx->is_active gets set, we also set (or verify) cpuctx->task_ctx. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: David Ahern <dsahern@gmail.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-01-21perf: Optimize perf_sched_events() usagePeter Zijlstra
It doesn't make sense to take up-to _4_ references on perf_sched_events() per event, avoid doing this. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: David Ahern <dsahern@gmail.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-01-21perf: Simplify/fix perf_event_enable() event schedulingPeter Zijlstra
Like perf_enable_on_exec(), perf_event_enable() event scheduling has problems respecting the context hierarchy when trying to schedule events (for example, it will try and add a pinned event without first removing existing flexible events). So simplify it by using the new ctx_resched() call which will DTRT. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: David Ahern <dsahern@gmail.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-01-21perf: Use task_ctx_sched_out()Peter Zijlstra
We have a function that does exactly what we want here, use it. This reduces the amount of cpuctx->task_ctx muckery. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: David Ahern <dsahern@gmail.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-01-21perf: Fix perf_enable_on_exec() event schedulingPeter Zijlstra
There are two problems with the current perf_enable_on_exec() event scheduling: - the newly enabled events will be immediately scheduled irrespective of their ctx event list order. - there's a hole in the ctx->lock between scheduling the events out and putting them back on. Esp. the latter issue is a real problem because a hole in event scheduling leaves the thing in an observable inconsistent state, confusing things. Fix both issues by first doing the enable iteration and at the end, when there are newly enabled events, reschedule the ctx in one go. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: David Ahern <dsahern@gmail.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-01-21perf: Remove stale commentPeter Zijlstra
The comment here is horribly out of date, remove it. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: David Ahern <dsahern@gmail.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-01-21perf: Fix cgroup scheduling in perf_enable_on_exec()Peter Zijlstra
There is a comment that states that perf_event_context_sched_in() will also switch in the cgroup events, I cannot find it does so. Therefore all the resulting logic goes out the window too. Clean that up. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: David Ahern <dsahern@gmail.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-01-21perf: Fix cgroup event schedulingPeter Zijlstra
There appears to be a problem in __perf_event_task_sched_in() wrt cgroup event scheduling. The normal event scheduling order is: CPU pinned Task pinned CPU flexible Task flexible And since perf_cgroup_sched*() only schedules the cpu context, we must call this _before_ adding the task events. Note: double check what happens on the ctx switch optimization where the task ctx isn't scheduled. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: David Ahern <dsahern@gmail.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-01-21perf: Add lockdep assertionsPeter Zijlstra
Make various bugs easier to see. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: David Ahern <dsahern@gmail.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-01-20Merge tag 'pm+acpi-4.5-rc1-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull more power management and ACPI updates from Rafael Wysocki: "This includes fixes on top of the previous batch of PM+ACPI updates and some new material as well. From the new material perspective the most significant are the driver core changes that should allow USB devices to stay suspended over system suspend/resume cycles if they have been runtime-suspended already beforehand. Apart from that, ACPICA is updated to upstream revision 20160108 (cosmetic mostly, but including one fixup on top of the previous ACPICA update) and there are some devfreq updates the didn't make it before (due to timing). A few recent regressions are fixed, most importantly in the cpuidle menu governor and in the ACPI backlight driver and some x86 platform drivers depending on it. Some more bugs are fixed and cleanups are made on top of that. Specifics: - Modify the driver core and the USB subsystem to allow USB devices to stay suspended over system suspend/resume cycles if they have been runtime-suspended already beforehand and fix some bugs on top of these changes (Tomeu Vizoso, Rafael Wysocki). - Update ACPICA to upstream revision 20160108, including updates of the ACPICA's copyright notices, a code fixup resulting from a regression fix that was necessary in the upstream code only (the regression fixed by it has never been present in Linux) and a compiler warning fix (Bob Moore, Lv Zheng). - Fix a recent regression in the cpuidle menu governor that broke it on practically all architectures other than x86 and make a couple of optimizations on top of that fix (Rafael Wysocki). - Clean up the selection of cpuidle governors depending on whether or not the kernel is configured for tickless systems (Jean Delvare). - Revert a recent commit that introduced a regression in the ACPI backlight driver, address the problem it attempted to fix in a different way and revert one more cosmetic change depending on the problematic commit (Hans de Goede). - Add two more ACPI backlight quirks (Hans de Goede). - Fix a few minor problems in the core devfreq code, clean it up a bit and update the MAINTAINERS information related to it (Chanwoo Choi, MyungJoo Ham). - Improve an error message in the ACPI fan driver (Andy Lutomirski). - Fix a recent build regression in the cpupower tool (Shreyas Prabhu)" * tag 'pm+acpi-4.5-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (32 commits) cpuidle: menu: Avoid pointless checks in menu_select() sched / idle: Drop default_idle_call() fallback from call_cpuidle() cpupower: Fix build error in cpufreq-info cpuidle: Don't enable all governors by default cpuidle: Default to ladder governor on ticking systems time: nohz: Expose tick_nohz_enabled ACPICA: Update version to 20160108 ACPICA: Silence a -Wbad-function-cast warning when acpi_uintptr_t is 'uintptr_t' ACPICA: Additional 2016 copyright changes ACPICA: Reduce regression fix divergence from upstream ACPICA ACPI / video: Add disable_backlight_sysfs_if quirk for the Toshiba Satellite R830 ACPI / video: Revert "thinkpad_acpi: Use acpi_video_handles_brightness_key_presses()" ACPI / video: Document acpi_video_handles_brightness_key_presses() a bit ACPI / video: Fix using an uninitialized mutex / list_head in acpi_video_handles_brightness_key_presses() ACPI / video: Revert "ACPI / video: driver must be registered before checking for keypresses" ACPI / fan: Improve acpi_device_update_power error message ACPI / video: Add disable_backlight_sysfs_if quirk for the Toshiba Portege R700 cpuidle: menu: Fix menu_select() for CPUIDLE_DRIVER_STATE_START == 0 MAINTAINERS: Add devfreq-event entry MAINTAINERS: Add missing git repository and directory for devfreq ...
2016-01-20prctl: take mmap sem for writing to protect against othersMateusz Guzik
An unprivileged user can trigger an oops on a kernel with CONFIG_CHECKPOINT_RESTORE. proc_pid_cmdline_read takes mmap_sem for reading and obtains args + env start/end values. These get sanity checked as follows: BUG_ON(arg_start > arg_end); BUG_ON(env_start > env_end); These can be changed by prctl_set_mm. Turns out also takes the semaphore for reading, effectively rendering it useless. This results in: kernel BUG at fs/proc/base.c:240! invalid opcode: 0000 [#1] SMP Modules linked in: virtio_net CPU: 0 PID: 925 Comm: a.out Not tainted 4.4.0-rc8-next-20160105dupa+ #71 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 task: ffff880077a68000 ti: ffff8800784d0000 task.ti: ffff8800784d0000 RIP: proc_pid_cmdline_read+0x520/0x530 RSP: 0018:ffff8800784d3db8 EFLAGS: 00010206 RAX: ffff880077c5b6b0 RBX: ffff8800784d3f18 RCX: 0000000000000000 RDX: 0000000000000002 RSI: 00007f78e8857000 RDI: 0000000000000246 RBP: ffff8800784d3e40 R08: 0000000000000008 R09: 0000000000000001 R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000050 R13: 00007f78e8857800 R14: ffff88006fcef000 R15: ffff880077c5b600 FS: 00007f78e884a740(0000) GS:ffff88007b200000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 00007f78e8361770 CR3: 00000000790a5000 CR4: 00000000000006f0 Call Trace: __vfs_read+0x37/0x100 vfs_read+0x82/0x130 SyS_read+0x58/0xd0 entry_SYSCALL_64_fastpath+0x12/0x76 Code: 4c 8b 7d a8 eb e9 48 8b 9d 78 ff ff ff 4c 8b 7d 90 48 8b 03 48 39 45 a8 0f 87 f0 fe ff ff e9 d1 fe ff ff 4c 8b 7d 90 eb c6 0f 0b <0f> 0b 0f 0b 66 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 RIP proc_pid_cmdline_read+0x520/0x530 ---[ end trace 97882617ae9c6818 ]--- Turns out there are instances where the code just reads aformentioned values without locking whatsoever - namely environ_read and get_cmdline. Interestingly these functions look quite resilient against bogus values, but I don't believe this should be relied upon. The first patch gets rid of the oops bug by grabbing mmap_sem for writing. The second patch is optional and puts locking around aformentioned consumers for safety. Consumers of other fields don't seem to benefit from similar treatment and are left untouched. This patch (of 2): The code was taking the semaphore for reading, which does not protect against readers nor concurrent modifications. The problem could cause a sanity checks to fail in procfs's cmdline reader, resulting in an OOPS. Note that some functions perform an unlocked read of various mm fields, but they seem to be fine despite possible modificaton. Signed-off-by: Mateusz Guzik <mguzik@redhat.com> Acked-by: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Jarod Wilson <jarod@redhat.com> Cc: Jan Stancek <jstancek@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Anshuman Khandual <anshuman.linux@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-20kernel: printk: specify alignment for struct printk_logAndrey Ryabinin
On architectures that have support for efficient unaligned access struct printk_log has 4-byte alignment. Specify alignment attribute in type declaration. The whole point of this patch is to fix deadlock which happening when UBSAN detects unaligned access in printk() thus UBSAN recursively calls printk() with logbuf_lock held by top printk() call. Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Michal Marek <mmarek@suse.cz> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Yury Gribov <y.gribov@samsung.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Kostya Serebryany <kcc@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-20sysctl: enable strict writesKees Cook
SYSCTL_WRITES_WARN was added in commit f4aacea2f5d1 ("sysctl: allow for strict write position handling"), and released in v3.16 in August of 2014. Since then I can find only 1 instance of non-zero offset writing[1], and it was fixed immediately in CRIU[2]. As such, it appears safe to flip this to the strict state now. [1] https://www.google.com/search?q="when%20file%20position%20was%20not%200" [2] http://lists.openvz.org/pipermail/criu/2015-April/019819.html Signed-off-by: Kees Cook <keescook@chromium.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-20kexec: move some memembers and definitions within the scope of CONFIG_KEXEC_FILEXunlei Pang
Move the stuff currently only used by the kexec file code within CONFIG_KEXEC_FILE (and CONFIG_KEXEC_VERIFY_SIG). Also move internal "struct kexec_sha_region" and "struct kexec_buf" into "kexec_internal.h". Signed-off-by: Xunlei Pang <xlpang@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Dave Young <dyoung@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-20kernel/kexec_core.c: use list_for_each_entry_safe in kimage_free_page_listGeliang Tang
Use list_for_each_entry_safe() instead of list_for_each_safe() to simplify the code. Signed-off-by: Geliang Tang <geliangtang@163.com> Cc: Dave Young <dyoung@redhat.com> Cc: Vivek Goyal <vgoyal@redhat.com> Acked-by: Baoquan He <bhe@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-20kexec: set KEXEC_TYPE_CRASH before sanity_check_segment_list()Xunlei Pang
sanity_check_segment_list() checks KEXEC_TYPE_CRASH flag to ensure all the segments of the loaded crash kernel are within the kernel crash resource limits, so set the flag beforehand. Signed-off-by: Xunlei Pang <xlpang@redhat.com> Acked-by: Dave Young <dyoung@redhat.com> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Vivek Goyal <vgoyal@redhat.com> Acked-by: Baoquan He <bhe@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-20kernel/cpu.c: make set_cpu_* static inlinesRasmus Villemoes
Almost all callers of the set_cpu_* functions pass an explicit true or false. Making them static inline thus replaces the function calls with a simple set_bit/clear_bit, saving some .text. Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Acked-by: Rusty Russell <rusty@rustcorp.com.au> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-20kernel/cpu.c: eliminate cpu_*_maskRasmus Villemoes
Replace the variables cpu_possible_mask, cpu_online_mask, cpu_present_mask and cpu_active_mask with macros expanding to expressions of the same type and value, eliminating some indirection. Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Acked-by: Rusty Russell <rusty@rustcorp.com.au> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-20kernel/cpu.c: export __cpu_*_maskRasmus Villemoes
Exporting the cpumasks __cpu_possible_mask and friends will allow us to remove the extra indirection through the cpu_*_mask variables. It will also allow the set_cpu_* functions to become static inlines, which will give a .text reduction. Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Acked-by: Rusty Russell <rusty@rustcorp.com.au> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-20kernel/cpu.c: change type of cpu_possible_bits and friendsRasmus Villemoes
Change cpu_possible_bits and friends (online, present, active) from being bitmaps that happen to have the right size to actually being struct cpumasks. Also rename them to __cpu_xyz_mask. This is mostly a small cleanup in preparation for exporting them and, eventually, eliminating the extra indirection through the cpu_xyz_mask variables. Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Acked-by: Rusty Russell <rusty@rustcorp.com.au> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-20exit: remove unneeded declaration of exit_mm()Dmitry Safonov
Signed-off-by: Dmitry Safonov <0x7f454c46@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-20ptrace: use fsuid, fsgid, effective creds for fs access checksJann Horn
By checking the effective credentials instead of the real UID / permitted capabilities, ensure that the calling process actually intended to use its credentials. To ensure that all ptrace checks use the correct caller credentials (e.g. in case out-of-tree code or newly added code omits the PTRACE_MODE_*CREDS flag), use two new flags and require one of them to be set. The problem was that when a privileged task had temporarily dropped its privileges, e.g. by calling setreuid(0, user_uid), with the intent to perform following syscalls with the credentials of a user, it still passed ptrace access checks that the user would not be able to pass. While an attacker should not be able to convince the privileged task to perform a ptrace() syscall, this is a problem because the ptrace access check is reused for things in procfs. In particular, the following somewhat interesting procfs entries only rely on ptrace access checks: /proc/$pid/stat - uses the check for determining whether pointers should be visible, useful for bypassing ASLR /proc/$pid/maps - also useful for bypassing ASLR /proc/$pid/cwd - useful for gaining access to restricted directories that contain files with lax permissions, e.g. in this scenario: lrwxrwxrwx root root /proc/13020/cwd -> /root/foobar drwx------ root root /root drwxr-xr-x root root /root/foobar -rw-r--r-- root root /root/foobar/secret Therefore, on a system where a root-owned mode 6755 binary changes its effective credentials as described and then dumps a user-specified file, this could be used by an attacker to reveal the memory layout of root's processes or reveal the contents of files he is not allowed to access (through /proc/$pid/cwd). [akpm@linux-foundation.org: fix warning] Signed-off-by: Jann Horn <jann@thejh.net> Acked-by: Kees Cook <keescook@chromium.org> Cc: Casey Schaufler <casey@schaufler-ca.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Morris <james.l.morris@oracle.com> Cc: "Serge E. Hallyn" <serge.hallyn@ubuntu.com> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Willy Tarreau <w@1wt.eu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>