path: root/kernel
Age | Commit message | Author
2016-09-22 | sched/debug: Hide printk() by default | Peter Zijlstra
Dietmar accidentally added an unconditional sched domain printk. Hide it behind the normal sched_debug flag. Reported-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Fixes: cd92bfd3b8cb ("sched/core: Store maximum per-CPU capacity in root domain") [ Fixed !SCHED_DEBUG build failure. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-22 | sched/fair: Fix SCHED_HRTICK bug leading to late preemption of tasks | Srivatsa Vaddagiri
The SCHED_HRTICK feature is useful for preempting SCHED_FAIR tasks on-the-dot (just when they would have exceeded their ideal_runtime). It makes use of a per-CPU hrtimer resource, and hence arming that hrtimer should be based on the total number of SCHED_FAIR tasks a CPU has across its various cfs_rqs, rather than on the number of tasks in a particular cfs_rq (as implemented currently). As a result, with the current code, it's possible for a running task (which is the sole task in its cfs_rq) to be preempted long after its ideal_runtime has elapsed, resulting in increased latency for tasks in other cfs_rqs on the same CPU. Fix this by arming the sched hrtimer based on the total number of SCHED_FAIR tasks a CPU has across its various cfs_rqs. Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org> Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1474075731-11550-1-git-send-email-joonwoop@codeaurora.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
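The difference between the two arming conditions can be sketched in a few lines of C. This is a simplified model, not the kernel source: the `toy_` structures and field names are assumptions made for illustration, with `cfs_h_nr_running` standing in for a CPU-wide count of SCHED_FAIR tasks.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-ins for the scheduler's per-group and per-CPU queues. */
struct toy_cfs_rq {
	unsigned int nr_running;	/* tasks queued in this one cfs_rq only */
};

struct toy_rq {
	unsigned int cfs_h_nr_running;	/* all SCHED_FAIR tasks on this CPU, across cfs_rqs */
};

/* Old condition: only the current task's own cfs_rq is consulted. */
static bool toy_hrtick_needed_old(const struct toy_cfs_rq *cfs_rq)
{
	return cfs_rq->nr_running > 1;
}

/* Fixed condition: the per-CPU timer is armed from the CPU-wide count. */
static bool toy_hrtick_needed_new(const struct toy_rq *rq)
{
	return rq->cfs_h_nr_running > 1;
}
```

With one task in its own cfs_rq and a second task in a sibling cfs_rq on the same CPU, the old check declines to arm the timer while the new one arms it, which is the late-preemption case the message describes.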
2016-09-22 | sched/core: Avoid _cond_resched() for PREEMPT=y | Peter Zijlstra
On fully preemptible kernels _cond_resched() is pointless, so avoid emitting any code for it. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mikulas Patocka <mpatocka@redhat.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
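A minimal sketch of the compile-time pattern this describes, assuming a hypothetical `TOY_CONFIG_PREEMPT` switch in place of the kernel's `CONFIG_PREEMPT`. The `toy_` names are illustrative and are not the kernel's definitions.

```c
/* Hypothetical stand-in for CONFIG_PREEMPT=y; remove this line to model PREEMPT=n. */
#define TOY_CONFIG_PREEMPT 1

#ifndef TOY_CONFIG_PREEMPT
/* Non-preemptible build: an out-of-line function that may reschedule. */
int toy_cond_resched(void);
#else
/* Fully preemptible build: nothing to do, so the call folds to a constant. */
static inline int toy_cond_resched(void)
{
	return 0;	/* 0 means "did not reschedule" in this sketch */
}
#endif
```

Because the preemptible variant is a static inline returning a constant, a caller's `if (toy_cond_resched())` branch can be dropped by the compiler, so no code is emitted for it.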
2016-09-22 | sched/core: Optimize __schedule() | Peter Zijlstra
Oleg noted that by making do_exit() use __schedule() for the TASK_DEAD context switch, we can avoid the TASK_DEAD special case currently in __schedule() because that avoids the extra preempt_disable() from schedule(). In order to facilitate this, create a do_task_dead() helper which we place in the scheduler code, such that it can access __schedule(). Also add some __noreturn annotations to the functions, there's no coming back from do_exit(). Suggested-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Cheng Chao <cs.os.kernel@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: akpm@linux-foundation.org Cc: chris@chris-wilson.co.uk Cc: tj@kernel.org Link: http://lkml.kernel.org/r/20160913163729.GB5012@twins.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-22 | stop_machine: Avoid a sleep and wakeup in stop_one_cpu() | Cheng Chao
In case @cpu == smp_processor_id(), we can avoid a sleep+wakeup cycle by doing a preemption. Callers such as sched_exec() can benefit from this change. Signed-off-by: Cheng Chao <cs.os.kernel@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: akpm@linux-foundation.org Cc: chris@chris-wilson.co.uk Cc: tj@kernel.org Link: http://lkml.kernel.org/r/1473818510-6779-1-git-send-email-cs.os.kernel@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-22 | sched/core: Remove unnecessary initialization in sched_init() | Cheng Chao
init_idle() is called immediately after: current->sched_class = &fair_sched_class; init_idle() sets: current->sched_class = &idle_sched_class; First assignment is superfluous. Signed-off-by: Cheng Chao <cs.os.kernel@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1473819536-7398-1-git-send-email-cs.os.kernel@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-22 | Merge branch 'linus' into sched/core, to pick up fixes | Ingo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-22 | sched/core: Do not use smp_processor_id() with preempt enabled in smpboot_thread_fn() | Con Kolivas
We should not be using smp_processor_id() with preempt enabled. Bug identified and fix provided by Alfred Chen. Reported-by: Alfred Chen <cchalpha@gmail.com> Signed-off-by: Con Kolivas <kernel@kolivas.org> Cc: Alfred Chen <cchalpha@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/2042051.3vvUWIM0vs@hex Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-19 | cgroup: duplicate cgroup reference when cloning sockets | Johannes Weiner
When a socket is cloned, the associated sock_cgroup_data is duplicated but not its reference on the cgroup. As a result, the cgroup reference count will underflow when both sockets are destroyed later on. Fixes: bd1060a1d671 ("sock, cgroup: add sock->sk_cgroup") Link: http://lkml.kernel.org/r/20160914194846.11153-2-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Tejun Heo <tj@kernel.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: <stable@vger.kernel.org> [4.5+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
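The underflow can be reproduced with a toy reference-count model. This is a sketch under stated assumptions: the `toy_` types and functions are hypothetical and only mimic the shape of the problem, not the kernel's `sock_cgroup_data` handling.

```c
#include <stddef.h>

/* Hypothetical, minimal model: a cgroup with a plain integer reference count. */
struct toy_cgroup {
	int refcnt;
};

struct toy_sock {
	struct toy_cgroup *cgrp;
};

/* Associating a socket with a cgroup takes one reference. */
static void toy_sock_attach(struct toy_sock *sk, struct toy_cgroup *cg)
{
	cg->refcnt++;
	sk->cgrp = cg;
}

/* Buggy clone: the pointer is copied, but no reference is taken for the copy. */
static void toy_sock_clone_buggy(struct toy_sock *newsk, const struct toy_sock *sk)
{
	*newsk = *sk;
}

/* Fixed clone: the copy holds its own reference. */
static void toy_sock_clone_fixed(struct toy_sock *newsk, const struct toy_sock *sk)
{
	*newsk = *sk;
	newsk->cgrp->refcnt++;
}

/* Destroying a socket always drops one reference. */
static void toy_sock_destroy(struct toy_sock *sk)
{
	sk->cgrp->refcnt--;
	sk->cgrp = NULL;
}
```

Attaching one socket, cloning it with the buggy variant, and destroying both drops two references after only one was taken. The fixed variant keeps takes and drops balanced.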
2016-09-13 | Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip | Linus Torvalds
Pull scheduler fix from Ingo Molnar: "A try_to_wake_up() memory ordering race fix causing a busy-loop in ttwu()" * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/core: Fix a race between try_to_wake_up() and a woken up task
2016-09-13 | Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip | Linus Torvalds
Pull perf fixes from Ingo Molnar: "This contains: - a set of fixes found by directed-random perf fuzzing efforts by Vince Weaver, Alexander Shishkin and Peter Zijlstra - a cqm driver crash fix - an AMD uncore driver use after free fix" * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/x86/intel: Fix PEBSv3 record drain perf/x86/intel/bts: Kill a silly warning perf/x86/intel/bts: Fix BTS PMI detection perf/x86/intel/bts: Fix confused ordering of PMU callbacks perf/core: Fix aux_mmap_count vs aux_refcount order perf/core: Fix a race between mmap_close() and set_output() of AUX events perf/x86/amd/uncore: Prevent use after free perf/x86/intel/cqm: Check cqm/mbm enabled state in event init perf/core: Remove WARN from perf_event_read()
2016-09-10 | Merge branch 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm | Linus Torvalds
Pull libnvdimm fixes from Dan Williams: "nvdimm fixes for v4.8, two of them are tagged for -stable: - Fix devm_memremap_pages() to use track_pfn_insert(). Otherwise, DAX pmd mappings end up with an uncached pgprot, and unusable performance for the device-dax interface. The device-dax interface appeared in 4.7 so this is tagged for -stable. - Fix a couple VM_BUG_ON() checks in the show_smaps() path to understand DAX pmd entries. This fix is tagged for -stable. - Fix a mis-merge of the nfit machine-check handler to flip the polarity of an if() to match the final version of the patch that Vishal sent for 4.8-rc1. Without this the nfit machine check handler never detects / inserts new 'badblocks' entries which applications use to identify lost portions of files. - For test purposes, fix the nvdimm_clear_poison() path to operate on legacy / simulated nvdimm memory ranges. Without this fix a test can set badblocks, but never clear them on these ranges. - Fix the range checking done by dax_dev_pmd_fault(). This is not tagged for -stable since this problem is mitigated by specifying aligned resources at device-dax setup time. These patches have appeared in a next release over the past week. The recent rebase you can see in the timestamps was to drop an invalid fix as identified by the updated device-dax unit tests [1]. The -mm touches have an ack from Andrew" [1]: "[ndctl PATCH 0/3] device-dax test for recent kernel bugs" https://lists.01.org/pipermail/linux-nvdimm/2016-September/006855.html * 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: libnvdimm: allow legacy (e820) pmem region to clear bad blocks nfit, mce: Fix SPA matching logic in MCE handler mm: fix cache mode of dax pmd mappings mm: fix show_smap() for zone_device-pmd ranges dax: fix mapping size check
2016-09-10 | Revert "sched/fair: Make update_min_vruntime() more readable" | Peter Zijlstra
There's a bug in this commit: 97a7142f157a ("sched/fair: Make update_min_vruntime() more readable") ... when !rb_leftmost && curr we fail to advance min_vruntime. So revert it. Reported-by: Byungchul Park <byungchul.park@lge.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-10 | perf/core: Fix aux_mmap_count vs aux_refcount order | Alexander Shishkin
The order of accesses to ring buffer's aux_mmap_count and aux_refcount has to be preserved across the users, namely perf_mmap_close() and perf_aux_output_begin(), otherwise the inversion can result in the latter holding the last reference to the aux buffer and subsequently free'ing it in atomic context, triggering a warning.
> ------------[ cut here ]------------
> WARNING: CPU: 0 PID: 257 at kernel/events/ring_buffer.c:541 __rb_free_aux+0x11a/0x130
> CPU: 0 PID: 257 Comm: stopbug Not tainted 4.8.0-rc1+ #2596
> Call Trace:
> [<ffffffff810f3e0b>] __warn+0xcb/0xf0
> [<ffffffff810f3f3d>] warn_slowpath_null+0x1d/0x20
> [<ffffffff8121182a>] __rb_free_aux+0x11a/0x130
> [<ffffffff812127a8>] rb_free_aux+0x18/0x20
> [<ffffffff81212913>] perf_aux_output_begin+0x163/0x1e0
> [<ffffffff8100c33a>] bts_event_start+0x3a/0xd0
> [<ffffffff8100c42d>] bts_event_add+0x5d/0x80
> [<ffffffff81203646>] event_sched_in.isra.104+0xf6/0x2f0
> [<ffffffff8120652e>] group_sched_in+0x6e/0x190
> [<ffffffff8120694e>] ctx_sched_in+0x2fe/0x5f0
> [<ffffffff81206ca0>] perf_event_sched_in+0x60/0x80
> [<ffffffff81206d1b>] ctx_resched+0x5b/0x90
> [<ffffffff81207281>] __perf_event_enable+0x1e1/0x240
> [<ffffffff81200639>] event_function+0xa9/0x180
> [<ffffffff81202000>] ? perf_cgroup_attach+0x70/0x70
> [<ffffffff8120203f>] remote_function+0x3f/0x50
> [<ffffffff811971f3>] flush_smp_call_function_queue+0x83/0x150
> [<ffffffff81197bd3>] generic_smp_call_function_single_interrupt+0x13/0x60
> [<ffffffff810a6477>] smp_call_function_single_interrupt+0x27/0x40
> [<ffffffff81a26ea9>] call_function_single_interrupt+0x89/0x90
> [<ffffffff81120056>] finish_task_switch+0xa6/0x210
> [<ffffffff81120017>] ? finish_task_switch+0x67/0x210
> [<ffffffff81a1e83d>] __schedule+0x3dd/0xb50
> [<ffffffff81a1efe5>] schedule+0x35/0x80
> [<ffffffff81128031>] sys_sched_yield+0x61/0x70
> [<ffffffff81a25be5>] entry_SYSCALL_64_fastpath+0x18/0xa8
> ---[ end trace 6235f556f5ea83a9 ]---
This patch puts the checks in perf_aux_output_begin() in the same order as that of perf_mmap_close(). Reported-by: Vince Weaver <vincent.weaver@maine.edu> Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: vince@deater.net Link: http://lkml.kernel.org/r/20160906132353.19887-3-alexander.shishkin@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-10 | perf/core: Fix a race between mmap_close() and set_output() of AUX events | Alexander Shishkin
In the mmap_close() path we need to stop all the AUX events that are writing data to the AUX area that we are unmapping, before we can safely free the pages. To determine if an event needs to be stopped, we're comparing its ->rb against the one that's getting unmapped. However, a SET_OUTPUT ioctl may turn up inside an AUX transaction and swizzle event::rb to some other ring buffer, but the transaction will keep writing data to the old ring buffer until the event gets scheduled out. At this point, mmap_close() will skip over such an event and will proceed to free the AUX area, while it's still being used by this event, which will set off a warning in the mmap_close() path and cause a memory corruption. To avoid this, always stop an AUX event before its ->rb is updated; this will release the (potentially) last reference on the AUX area of the buffer. If the event gets restarted, its new ring buffer will be used. If another SET_OUTPUT comes and switches it back to the old ring buffer that's getting unmapped, it's also fine: this ring buffer's aux_mmap_count will be zero and AUX transactions won't start any more. Reported-by: Vince Weaver <vincent.weaver@maine.edu> Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: vince@deater.net Link: http://lkml.kernel.org/r/20160906132353.19887-2-alexander.shishkin@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-09 | mm: fix cache mode of dax pmd mappings | Dan Williams
track_pfn_insert() in vmf_insert_pfn_pmd() is marking dax mappings as uncacheable rendering them impractical for application usage. DAX-pte mappings are cached and the goal of establishing DAX-pmd mappings is to attain more performance, not dramatically less (3 orders of magnitude). track_pfn_insert() relies on a previous call to reserve_memtype() to establish the expected page_cache_mode for the range. While memremap() arranges for reserve_memtype() to be called, devm_memremap_pages() does not. So, teach track_pfn_insert() and untrack_pfn() how to handle tracking without a vma, and arrange for devm_memremap_pages() to establish the write-back-cache reservation in the memtype tree. Cc: <stable@vger.kernel.org> Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Nilesh Choudhury <nilesh.choudhury@oracle.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reported-by: Toshi Kani <toshi.kani@hpe.com> Reported-by: Kai Zhang <kai.ka.zhang@oracle.com> Acked-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-09-05 | PM / QoS: avoid calling cancel_delayed_work_sync() during early boot | Tejun Heo
of_clk_init() ends up calling into pm_qos_update_request() very early during boot, where irq is expected to stay disabled. pm_qos_update_request() uses cancel_delayed_work_sync(), which correctly assumes that irq is enabled on invocation and unconditionally disables and re-enables it. Gate the cancel_delayed_work_sync() invocation with keventd_up() to avoid enabling irq unexpectedly during early boot. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-and-tested-by: Qiao Zhou <qiaozhou@asrmicro.com> Link: http://lkml.kernel.org/r/d2501c4c-8e7b-bea3-1b01-000b36b5dfe9@asrmicro.com Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-09-05 | sched/debug: Remove several CONFIG_SCHEDSTATS guards | Josh Poimboeuf
Clean up the sched code by removing several of the CONFIG_SCHEDSTATS guards, using schedstat_*() macros where needed.
Code size:
!CONFIG_SCHEDSTATS defconfig:
    text     data      bss       dec     hex  filename
10209818  4368184  1105920  15683922  ef5152  vmlinux.before.nostats
10209818  4368184  1105920  15683922  ef5152  vmlinux.after.nostats
CONFIG_SCHEDSTATS defconfig:
    text     data      bss       dec     hex  filename
10214210  4370040  1105920  15690170  ef69ba  vmlinux.before.stats
10214210  4370680  1105920  15690810  ef6c3a  vmlinux.after.stats
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matt Fleming <matt@codeblueprint.co.uk> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/e51e0ebe5af95ac295de720dd252e7c0d2142e4a.1466184592.git.jpoimboe@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-05 | sched/debug: Rename 'schedstat_val()' -> 'schedstat_val_or_zero()' | Josh Poimboeuf
The schedstat_val() macro's behavior is kind of surprising: when schedstat is runtime disabled, it returns zero. Rename it to schedstat_val_or_zero(). There's also a need for a similar macro which doesn't have the 'if (schedstat_enable())' check, to avoid doing the check twice. Create a new 'schedstat_val()' macro for that. Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matt Fleming <matt@codeblueprint.co.uk> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/3bb1d2367d041fee333b0dde17171e709395b675.1466184592.git.jpoimboe@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
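The two read macros described here can be sketched as follows. This is a sketch, not the kernel header: the `toy_` prefix marks every name as hypothetical, and a plain boolean stands in for the runtime schedstats switch.

```c
#include <stdbool.h>

/* Hypothetical runtime switch; models schedstats being enabled or disabled at run time. */
static bool toy_schedstat_on;

#define toy_schedstat_enabled()		(toy_schedstat_on)

/* Raw read: for callers that have already checked toy_schedstat_enabled(). */
#define toy_schedstat_val(var)		(var)

/* Guarded read: yields 0 whenever stats are runtime-disabled. */
#define toy_schedstat_val_or_zero(var)	(toy_schedstat_enabled() ? (var) : 0)
```

The point of having both is that a caller already inside an `if (toy_schedstat_enabled())` block can use the raw form and skip the second check, while the name of the guarded form now states its surprising return value.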
2016-09-05 | sched/debug: Clean up schedstat macros | Josh Poimboeuf
The schedstat_*() macros are inconsistent: most of them take a pointer and a field which the macro combines, whereas schedstat_set() takes the already combined ptr->field. The already combined ptr->field argument is actually more intuitive and easier to use, and there's no reason to require the user to split the variable up, so convert the macros to use the combined argument. Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matt Fleming <matt@codeblueprint.co.uk> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/54953ca25bb579f3a5946432dee409b0e05222c6.1466184592.git.jpoimboe@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
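The two calling conventions can be compared side by side in a small sketch. The macro and struct names below are hypothetical, and the runtime-enabled check is omitted so that only the argument shape is in focus.

```c
/* Hypothetical structure with one statistics field. */
struct toy_rq {
	unsigned long yld_count;
};

/* Old shape: pointer and field are passed separately and glued together by the macro. */
#define toy_schedstat_inc_split(ptr, field)	do { (ptr)->field++; } while (0)

/* New shape: the caller passes the already combined lvalue. */
#define toy_schedstat_inc(var)			do { (var)++; } while (0)
```

A call such as `toy_schedstat_inc(rq->yld_count)` reads like an ordinary expression and works for any lvalue, whereas the split form only works for a pointer-plus-member pair.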
2016-09-05 | sched/debug: Rename and move enqueue_sleeper() | Josh Poimboeuf
enqueue_sleeper() doesn't actually enqueue, it just handles some statistics and tracepoints. Rename it to update_stats_enqueue_sleeper() and call it from update_stats_enqueue(). Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matt Fleming <matt@codeblueprint.co.uk> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/fb20b7159dc4d028c406c0e8d5f8c439b741615b.1466184592.git.jpoimboe@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-05 | sched/deadline: Fix the intention to re-evaluate tick dependency for offline CPU | Wanpeng Li
The dl task is replenished after the dl task timer fires and starts a new period. It is then enqueued, and its dependency on the tick is re-evaluated in order to restart the tick. However, if the CPU has been hot-unplugged, irq_work_queue will splat since the target CPU is offline. As a result we get: WARNING: CPU: 2 PID: 0 at kernel/irq_work.c:69 irq_work_queue_on+0xad/0xe0 Call Trace: dump_stack+0x99/0xd0 __warn+0xd1/0xf0 warn_slowpath_null+0x1d/0x20 irq_work_queue_on+0xad/0xe0 tick_nohz_full_kick_cpu+0x44/0x50 tick_nohz_dep_set_cpu+0x74/0xb0 enqueue_task_dl+0x226/0x480 activate_task+0x5c/0xa0 dl_task_timer+0x19b/0x2c0 ? push_dl_task.part.31+0x190/0x190 This can be triggered by hot-unplugging the full dynticks CPU that the dl task is running on. We enqueue the dl task on the offline CPU because we need to replenish it for start_dl_timer(). So, as Juri pointed out, what we would need to do is call replenish_dl_entity() directly, instead of enqueue_task_dl(). pi_se shouldn't be a problem, as the task shouldn't be boosted if it was throttled. This patch fixes it by avoiding the whole enqueue+dequeue+enqueue story, by first migrating (set_task_cpu()) and then doing a single enqueue. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Juri Lelli <juri.lelli@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Luca Abeni <luca.abeni@unitn.it> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1472639264-3932-1-git-send-email-wanpeng.li@hotmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-05 | schedcore: Remove duplicated init_task's preempt_notifiers init | seokhoon.yoon
init_task's preempt_notifiers is initialized twice: 1) sched_init() -> INIT_HLIST_HEAD(&init_task.preempt_notifiers) 2) sched_init() -> init_idle(current,) <--- current task is init_task at this time -> __sched_fork(,current) -> INIT_HLIST_HEAD(&p->preempt_notifiers) I think the first one is unnecessary, so remove it. Signed-off-by: seokhoon.yoon <iamyooon@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1471339568-5790-1-git-send-email-iamyooon@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-05 | sched/fair: Fix load_above_capacity fixed point arithmetic width | Dietmar Eggemann
Since commit: 2159197d6677 ("sched/core: Enable increased load resolution on 64-bit kernels") we now have two different fixed point units for load. load_above_capacity has to have 10 bits fixed point unit like PELT, whereas NICE_0_LOAD has 20 bit fixed point unit on 64-bit kernels. Fix this by scaling down NICE_0_LOAD when multiplying load_above_capacity with it. Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Vincent Guittot <vincent.guittot@linaro.org> Acked-by: Morten Rasmussen <morten.rasmussen@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Yuyang Du <yuyang.du@intel.com> Link: http://lkml.kernel.org/r/1470824847-5316-1-git-send-email-dietmar.eggemann@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
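The unit mismatch can be worked through numerically. The sketch below is a simplified model under stated assumptions: the `TOY_` constants mirror the 10-bit and 20-bit fixed-point units named in the message, and the function is an illustrative reduction of the calculation, not the kernel's code.

```c
#include <stdint.h>

/* Assumed units: capacity and PELT use 10 fractional bits, 64-bit load uses 20. */
#define TOY_FIXEDPOINT_SHIFT	10
#define TOY_CAPACITY_SCALE	(UINT64_C(1) << TOY_FIXEDPOINT_SHIFT)
#define TOY_NICE_0_LOAD		(UINT64_C(1) << (2 * TOY_FIXEDPOINT_SHIFT))

/* Drop the extra 10 bits of resolution so the value is back in PELT units. */
static uint64_t toy_scale_load_down(uint64_t w)
{
	return w >> TOY_FIXEDPOINT_SHIFT;
}

/*
 * Load in excess of the group's capacity, expressed in units of nice_0_load.
 * Passing the unscaled 20-bit TOY_NICE_0_LOAD models the bug; passing
 * toy_scale_load_down(TOY_NICE_0_LOAD) models the fix.
 */
static uint64_t toy_load_above_capacity(uint64_t nr_running,
					uint64_t group_capacity,
					uint64_t nice_0_load)
{
	uint64_t load = nr_running * TOY_CAPACITY_SCALE;

	if (load <= group_capacity)
		return 0;

	load -= group_capacity;
	load *= nice_0_load;
	load /= group_capacity;
	return load;
}
```

For three runnable tasks on a group of capacity 1024, the scaled-down multiplier yields 2048, while the unscaled one yields a value 1024 times larger, which is the width error the commit corrects.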
2016-09-05 | sched/deadline: Split cpudl_set() into cpudl_set() and cpudl_clear() | Tommaso Cucinotta
These 2 exercise independent code paths and need different arguments. After this change, you call: cpudl_clear(cp, cpu); cpudl_set(cp, cpu, dl); instead of: cpudl_set(cp, cpu, 0 /* dl */, 0 /* is_valid */); cpudl_set(cp, cpu, dl, 1 /* is_valid */); Signed-off-by: Tommaso Cucinotta <tommaso.cucinotta@sssup.it> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Luca Abeni <luca.abeni@unitn.it> Reviewed-by: Juri Lelli <juri.lelli@arm.com> Cc: Juri Lelli <juri.lelli@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-dl@retis.sssup.it Link: http://lkml.kernel.org/r/1471184828-12644-4-git-send-email-tommaso.cucinotta@sssup.it Signed-off-by: Ingo Molnar <mingo@kernel.org>
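The API split can be illustrated with a toy per-CPU table in place of the real heap. Everything here is hypothetical (`toy_cpudl`, its fields, and the fixed CPU count); the sketch shows only why a flag argument selecting between two unrelated paths reads better as two functions.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_NR_CPUS 4

/* Hypothetical flat table; the real structure is a heap, which is not modeled here. */
struct toy_cpudl {
	uint64_t dl[TOY_NR_CPUS];
	bool valid[TOY_NR_CPUS];
};

/* Old shape: one entry point, with is_valid choosing between two independent paths. */
static void toy_cpudl_set_old(struct toy_cpudl *cp, int cpu, uint64_t dl, int is_valid)
{
	if (!is_valid) {
		cp->valid[cpu] = false;	/* the dl argument is ignored on this path */
		return;
	}
	cp->dl[cpu] = dl;
	cp->valid[cpu] = true;
}

/* New shape: removal takes no deadline argument at all. */
static void toy_cpudl_clear(struct toy_cpudl *cp, int cpu)
{
	cp->valid[cpu] = false;
}

/* New shape: insertion or update always carries a meaningful deadline. */
static void toy_cpudl_set(struct toy_cpudl *cp, int cpu, uint64_t dl)
{
	cp->dl[cpu] = dl;
	cp->valid[cpu] = true;
}
```

After the split, no call site has to pass a dummy deadline of 0 or annotate a bare 0/1 flag with a comment.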
2016-09-05 | sched/deadline: Make CPU heap faster avoiding real swaps on heapify | Tommaso Cucinotta
This change goes from heapify() ops done by swapping with the parent/child, so that the item to fix moves along, to heapify() ops done by just pulling the parent/child chain by one position and then storing the item to fix at the end. On a non-trivial heapify(), this performs roughly half as many stores as swapping does. This has been measured to achieve a speed-up of up to 10% for cpudl_set() calls, with a randomly generated workload of 1K, 10K and 100K random heap insertions and deletions (75% cpudl_set() calls with is_valid=1 and 25% with is_valid=0), and randomly generated CPU IDs, with up to 256 CPUs, as measured on an Intel Core2 Duo. Signed-off-by: Tommaso Cucinotta <tommaso.cucinotta@sssup.it> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Luca Abeni <luca.abeni@unitn.it> Reviewed-by: Juri Lelli <juri.lelli@arm.com> Cc: Juri Lelli <juri.lelli@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-dl@retis.sssup.it Link: http://lkml.kernel.org/r/1471184828-12644-3-git-send-email-tommaso.cucinotta@sssup.it Signed-off-by: Ingo Molnar <mingo@kernel.org>
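The swap-free technique can be sketched on a plain max-heap of 64-bit keys. This is a generic illustration, assuming an array-backed binary heap; the `toy_` functions are hypothetical and leave out the per-CPU index bookkeeping a real CPU heap would need.

```c
#include <stddef.h>
#include <stdint.h>

/* Sift down without swaps: lift the item out, pull larger children up, store once. */
static void toy_heapify_down(uint64_t *heap, size_t size, size_t idx)
{
	uint64_t item = heap[idx];	/* the item to fix; its slot becomes a hole */

	for (;;) {
		size_t l = 2 * idx + 1, r = l + 1, largest = idx;
		uint64_t largest_key = item;

		if (l < size && heap[l] > largest_key) {
			largest = l;
			largest_key = heap[l];
		}
		if (r < size && heap[r] > largest_key) {
			largest = r;
			largest_key = heap[r];
		}
		if (largest == idx)
			break;

		heap[idx] = heap[largest];	/* one store per level, not the two of a swap */
		idx = largest;
	}
	heap[idx] = item;	/* single final store of the item being fixed */
}

/* Sift up without swaps: pull smaller parents down, then store the item once. */
static void toy_heapify_up(uint64_t *heap, size_t idx)
{
	uint64_t item = heap[idx];

	while (idx > 0) {
		size_t parent = (idx - 1) / 2;

		if (heap[parent] >= item)
			break;
		heap[idx] = heap[parent];
		idx = parent;
	}
	heap[idx] = item;
}
```

A swap writes both array slots at every level, whereas this version writes one slot per level plus a final store, which is where the "roughly half" figure in the message comes from.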
2016-09-05 | sched/deadline: Refactor CPU heap code | Tommaso Cucinotta
1. heapify up factored out in new dedicated function heapify_up() (avoids repetition of same code) 2. call to cpudl_change_key() replaced with heapify_up() when cpudl_set actually inserts a new node in the heap 3. cpudl_change_key() replaced with heapify() that heapifies up or down as needed. Signed-off-by: Tommaso Cucinotta <tommaso.cucinotta@sssup.it> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Luca Abeni <luca.abeni@unitn.it> Reviewed-by: Juri Lelli <juri.lelli@arm.com> Cc: Juri Lelli <juri.lelli@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-dl@retis.sssup.it Link: http://lkml.kernel.org/r/1471184828-12644-2-git-send-email-tommaso.cucinotta@sssup.it Signed-off-by: Ingo Molnar <mingo@kernel.org>
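The shape of this refactoring (separate up and down helpers, plus one heapify() that picks the direction) can be sketched on a generic max-heap. The names are hypothetical and the version below deliberately uses plain swaps; it models the structure of the refactor only, not the actual cpudl code.

```c
#include <stddef.h>
#include <stdint.h>

static void toy_swap(uint64_t *a, uint64_t *b)
{
	uint64_t tmp = *a;

	*a = *b;
	*b = tmp;
}

/* Move heap[idx] towards the root while it is larger than its parent. */
static void toy_heapify_up(uint64_t *heap, size_t idx)
{
	while (idx > 0) {
		size_t parent = (idx - 1) / 2;

		if (heap[parent] >= heap[idx])
			break;
		toy_swap(&heap[parent], &heap[idx]);
		idx = parent;
	}
}

/* Move heap[idx] towards the leaves while a child is larger than it. */
static void toy_heapify_down(uint64_t *heap, size_t size, size_t idx)
{
	for (;;) {
		size_t l = 2 * idx + 1, r = l + 1, largest = idx;

		if (l < size && heap[l] > heap[largest])
			largest = l;
		if (r < size && heap[r] > heap[largest])
			largest = r;
		if (largest == idx)
			break;
		toy_swap(&heap[idx], &heap[largest]);
		idx = largest;
	}
}

/* After a key change at idx, restore the heap property in whichever direction is needed. */
static void toy_heapify(uint64_t *heap, size_t size, size_t idx)
{
	if (idx > 0 && heap[(idx - 1) / 2] < heap[idx])
		toy_heapify_up(heap, idx);
	else
		toy_heapify_down(heap, size, idx);
}
```

An insertion at the end of the array only ever needs `toy_heapify_up()`, while an in-place key change goes through `toy_heapify()`, matching points 2 and 3 of the list above.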
2016-09-05 | sched/fair: Make update_min_vruntime() more readable | Byungchul Park
The update_min_vruntime() control flow can be simplified. Signed-off-by: Byungchul Park <byungchul.park@lge.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: minchan.kim@lge.com Link: http://lkml.kernel.org/r/1436088829-25768-1-git-send-email-byungchul.park@lge.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-05 | Merge branch 'sched/urgent' into sched/core, to pick up fixes | Ingo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-05 | sched/core: Fix a race between try_to_wake_up() and a woken up task | Balbir Singh
The origin of the issue I've seen is related to a missing memory barrier between the check for task->state and the check for task->on_rq. The task being woken up is already awake from a schedule() and is doing the following: do { schedule() set_current_state(TASK_(UN)INTERRUPTIBLE); } while (!cond); The waker actually gets stuck doing the following in try_to_wake_up(): while (p->on_cpu) cpu_relax(); Analysis: The instance I've seen involves the following race: CPU1 CPU2 while () { if (cond) break; do { schedule(); set_current_state(TASK_UN..) } while (!cond); wakeup_routine() spin_lock_irqsave(wait_lock) raw_spin_lock_irqsave(wait_lock) wake_up_process() } try_to_wake_up() set_current_state(TASK_RUNNING); .. list_del(&waiter.list); CPU2 wakes up CPU1, but before it can get the wait_lock and set current state to TASK_RUNNING the following occurs: CPU3 wakeup_routine() raw_spin_lock_irqsave(wait_lock) if (!list_empty) wake_up_process() try_to_wake_up() raw_spin_lock_irqsave(p->pi_lock) .. if (p->on_rq && ttwu_wakeup()) .. while (p->on_cpu) cpu_relax() .. CPU3 tries to wake up the task on CPU1 again since it finds it on the wait_queue, CPU1 is spinning on wait_lock, but immediately after CPU2, CPU3 got it. CPU3 checks the state of p on CPU1, it is TASK_UNINTERRUPTIBLE and the task is spinning on the wait_lock. Interestingly since p->on_rq is checked under pi_lock, I've noticed that try_to_wake_up() finds p->on_rq to be 0. This was the most confusing bit of the analysis, but p->on_rq is changed under runqueue lock, rq_lock, the p->on_rq check is not reliable without this fix IMHO. The race is visible (based on the analysis) only when ttwu_queue() does a remote wakeup via ttwu_queue_remote(), in which case the p->on_rq change is not done under the pi_lock.
The result is that after a while the entire system locks up on the raw_spin_lock_irqsave(wait_lock) and the holder spins infinitely. Reproduction of the issue: The issue can be reproduced after a long run on my system with 80 threads, with available memory tweaked to be very low, and running the stress-ng mmapfork memory test. It usually takes a long time to reproduce. I am trying to work on a test case that can reproduce the issue faster, but that's work in progress. I am still testing the changes on my system in a loop and the tests seem OK thus far. Big thanks to Benjamin and Nick for helping debug this as well. Ben helped catch the missing barrier, Nick caught every missing bit in my theory. Signed-off-by: Balbir Singh <bsingharora@gmail.com> [ Updated comment to clarify matching barriers. Many architectures do not have a full barrier in switch_to() so that cannot be relied upon. ] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Alexey Kardashevskiy <aik@ozlabs.ru> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Nicholas Piggin <nicholas.piggin@gmail.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/e02cce7b-d9ca-1ad0-7a61-ea97c7582b37@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-05 | perf/core: Remove WARN from perf_event_read() | Peter Zijlstra
This effectively reverts commit: 71e7bc2bab77 ("perf/core: Check return value of the perf_event_read() IPI") ... and puts in a comment explaining why we ignore the return value. Reported-by: Vegard Nossum <vegard.nossum@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: David Carrillo-Cisneros <davidcc@google.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: 71e7bc2bab77 ("perf/core: Check return value of the perf_event_read() IPI") Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-04 | Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip | Linus Torvalds
Pull timer fixes from Thomas Gleixner: "Two fixlets from the timers department: - A fix for scheduler stalls in the tick idle code affecting NOHZ_FULL kernels - A trivial compile fix" * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: tick/nohz: Fix softlockup on scheduler stalls in kvm guest clocksource/drivers/atmel-pit: Fix compilation error
2016-09-02  tick/nohz: Fix softlockup on scheduler stalls in kvm guest  Wanpeng Li

    tick_nohz_start_idle() is prevented from being called if the idle tick can't be stopped since commit 1f3b0f8243cb934 ("tick/nohz: Optimize nohz idle enter"). As a result, after suspending and resuming the host machine, a full dynticks kvm guest will softlockup:

      NMI watchdog: BUG: soft lockup - CPU#0 stuck for 26s! [swapper/0:0]
      Call Trace:
        default_idle+0x31/0x1a0
        arch_cpu_idle+0xf/0x20
        default_idle_call+0x2a/0x50
        cpu_startup_entry+0x39b/0x4d0
        rest_init+0x138/0x140
        ? rest_init+0x5/0x140
        start_kernel+0x4c1/0x4ce
        ? set_init_arg+0x55/0x55
        ? early_idt_handler_array+0x120/0x120
        x86_64_start_reservations+0x24/0x26
        x86_64_start_kernel+0x142/0x14f

    In addition, cat /proc/stat | grep cpu in guest or host:

      cpu  398 16 5049 15754 5490 0 1 46 0 0
      cpu0 206 5 450 0 0 0 1 14 0 0
      cpu1 81 0 3937 3149 1514 0 0 9 0 0
      cpu2 45 6 332 6052 2243 0 0 11 0 0
      cpu3 65 2 328 6552 1732 0 0 11 0 0

    The idle and iowait fields are unexpectedly 0 for cpu0 (housekeeping).

    The bug is present in both guest and host kernels, and both show the idle and iowait accounting issue on cpu0. However, the host kernel's suspend/resume path etc. will touch the watchdog to avoid the softlockup.

    - The watchdog will not be touched in the tick_nohz_stop_idle path (it needs to be touched since the scheduler stall is expected) if the idle_active flag is not detected.
    - The idle and iowait states will not be accounted when exiting the idle loop (resched or interrupt) if the idle start time and idle_active flag are not set.

    This patch fixes it by reverting commit 1f3b0f8243cb934, since being unable to stop the idle tick does not mean the CPU cannot be idle.

    Fixes: 1f3b0f8243cb934 ("tick/nohz: Optimize nohz idle enter")
    Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
    Cc: Sanjeev Yadav <sanjeev.yadav@spreadtrum.com>
    Cc: Gaurav Jindal <gaurav.jindal@spreadtrum.com>
    Cc: stable@vger.kernel.org
    Cc: kvm@vger.kernel.org
    Cc: Radim Krčmář <rkrcmar@redhat.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Paolo Bonzini <pbonzini@redhat.com>
    Link: http://lkml.kernel.org/r/1472798303-4154-1-git-send-email-wanpeng.li@hotmail.com
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
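The accounting effect described in the two bullet points can be illustrated with a small userspace model. This is only a sketch under stated assumptions, not the kernel code: struct tick_model and all four functions are hypothetical names made up for the illustration, and time is a plain counter.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of per-CPU idle bookkeeping (not the kernel's struct). */
struct tick_model {
    bool idle_active;          /* set when idle entry was recorded */
    unsigned long entry;       /* timestamp of idle entry */
    unsigned long idle_total;  /* accumulated idle time */
};

/* Record the start of an idle period. */
void model_start_idle(struct tick_model *ts, unsigned long now)
{
    ts->entry = now;
    ts->idle_active = true;
}

/* Account the idle period, but only if its start was recorded. */
void model_stop_idle(struct tick_model *ts, unsigned long now)
{
    if (!ts->idle_active)
        return;                /* nothing recorded: idle time is lost */
    ts->idle_total += now - ts->entry;
    ts->idle_active = false;
}

/* "Optimized" entry: record idle start only if the tick can be stopped. */
void model_idle_enter_optimized(struct tick_model *ts, unsigned long now,
                                bool can_stop_tick)
{
    if (can_stop_tick)
        model_start_idle(ts, now);
}

/* Behaviour after the revert: always record idle start. */
void model_idle_enter_reverted(struct tick_model *ts, unsigned long now,
                               bool can_stop_tick)
{
    (void)can_stop_tick;       /* the CPU is idle either way */
    model_start_idle(ts, now);
}
```

In this model, a CPU that can never stop its tick accumulates no idle time with the optimized entry path, which matches the zero idle and iowait columns shown for cpu0 above.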
2016-09-01  Merge branch 'akpm' (patches from Andrew)  Linus Torvalds

    Merge fixes from Andrew Morton:
     "14 fixes"

    * emailed patches from Andrew Morton <akpm@linux-foundation.org>:
      rapidio/tsi721: fix incorrect detection of address translation condition
      rapidio/documentation/mport_cdev: add missing parameter description
      kernel/fork: fix CLONE_CHILD_CLEARTID regression in nscd
      MAINTAINERS: Vladimir has moved
      mm, mempolicy: task->mempolicy must be NULL before dropping final reference
      printk/nmi: avoid direct printk()-s from __printk_nmi_flush()
      treewide: remove references to the now unnecessary DEFINE_PCI_DEVICE_TABLE
      drivers/scsi/wd719x.c: remove last declaration using DEFINE_PCI_DEVICE_TABLE
      mm, vmscan: only allocate and reclaim from zones with pages managed by the buddy allocator
      lib/test_hash.c: fix warning in preprocessor symbol evaluation
      lib/test_hash.c: fix warning in two-dimensional array init
      kconfig: tinyconfig: provide whole choice blocks to avoid warnings
      kexec: fix double-free when failing to relocate the purgatory
      mm, oom: prevent premature OOM killer invocation for high order request
2016-09-01  kernel/fork: fix CLONE_CHILD_CLEARTID regression in nscd  Michal Hocko

    Commit fec1d0115240 ("[PATCH] Disable CLONE_CHILD_CLEARTID for abnormal exit") has caused a subtle regression in nscd which uses CLONE_CHILD_CLEARTID to clear the nscd_certainly_running flag in the shared databases, so that the clients are notified when nscd is restarted. Now, when nscd uses a non-persistent database, clients that have it mapped keep thinking the database is being updated by nscd, when in fact nscd has created a new (anonymous) one (for non-persistent databases it uses an unlinked file as backend).

    The original proposal for the CLONE_CHILD_CLEARTID change claimed (https://lkml.org/lkml/2006/10/25/233):

    : The NPTL library uses the CLONE_CHILD_CLEARTID flag on clone() syscalls
    : on behalf of pthread_create() library calls. This feature is used to
    : request that the kernel clear the thread-id in user space (at an address
    : provided in the syscall) when the thread disassociates itself from the
    : address space, which is done in mm_release().
    :
    : Unfortunately, when a multi-threaded process incurs a core dump (such as
    : from a SIGSEGV), the core-dumping thread sends SIGKILL signals to all of
    : the other threads, which then proceed to clear their user-space tids
    : before synchronizing in exit_mm() with the start of core dumping. This
    : misrepresents the state of process's address space at the time of the
    : SIGSEGV and makes it more difficult for someone to debug NPTL and glibc
    : problems (misleading him/her to conclude that the threads had gone away
    : before the fault).
    :
    : The fix below is to simply avoid the CLONE_CHILD_CLEARTID action if a
    : core dump has been initiated.

    The resulting patch from Roland (https://lkml.org/lkml/2006/10/26/269) seems to have a larger scope than the original patch asked for. It seems that limiting the scope of the check to core dumping should work for the SIGSEGV issue described above.

    [Changelog partly based on Andreas' description]
    Fixes: fec1d0115240 ("[PATCH] Disable CLONE_CHILD_CLEARTID for abnormal exit")
    Link: http://lkml.kernel.org/r/1471968749-26173-1-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko <mhocko@suse.com>
    Tested-by: William Preston <wpreston@suse.com>
    Acked-by: Oleg Nesterov <oleg@redhat.com>
    Cc: Roland McGrath <roland@hack.frob.com>
    Cc: Andreas Schwab <schwab@suse.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
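The difference between the broad check and the narrowed one can be modelled in a few lines of userspace C. This is a sketch for illustration only: the flag names MODEL_SIGNALED and MODEL_GROUP_COREDUMP, struct model_task and both predicates are hypothetical and do not reproduce the kernel's definitions or the exact condition used in mm_release().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flags for this model only. */
#define MODEL_SIGNALED       0x1u  /* task is exiting because of a signal */
#define MODEL_GROUP_COREDUMP 0x2u  /* a core dump has been initiated */

struct model_task {
    unsigned int flags;
    int *clear_child_tid;  /* user-space TID slot, or NULL if unused */
};

/* Broad check: skip the clear on any signal-induced (abnormal) exit. */
bool model_clear_tid_broad(const struct model_task *t)
{
    return t->clear_child_tid != NULL && !(t->flags & MODEL_SIGNALED);
}

/* Narrow check: skip the clear only while a core dump is in progress. */
bool model_clear_tid_narrow(const struct model_task *t)
{
    return t->clear_child_tid != NULL && !(t->flags & MODEL_GROUP_COREDUMP);
}
```

With the broad predicate, a process killed by a plain signal never has its TID slot cleared, which is the situation nscd clients run into. The narrow predicate still leaves the slot untouched during a core dump, which is all the original proposal asked for.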
2016-09-01  mm, mempolicy: task->mempolicy must be NULL before dropping final reference  David Rientjes

    KASAN allocates memory from the page allocator as part of kmem_cache_free(), and that can reference current->mempolicy through any number of allocation functions. It needs to be NULL'd out before the final reference is dropped to prevent a use-after-free bug:

      BUG: KASAN: use-after-free in alloc_pages_current+0x363/0x370 at addr ffff88010b48102c
      CPU: 0 PID: 15425 Comm: trinity-c2 Not tainted 4.8.0-rc2+ #140
      ...
      Call Trace:
        dump_stack
        kasan_object_err
        kasan_report_error
        __asan_report_load2_noabort
        alloc_pages_current     <-- use after free
        depot_save_stack
        save_stack
        kasan_slab_free
        kmem_cache_free
        __mpol_put              <-- free
        do_exit

    This patch sets current->mempolicy to NULL before dropping the final reference.

    Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1608301442180.63329@chino.kir.corp.google.com
    Fixes: cd11016e5f52 ("mm, kasan: stackdepot implementation. Enable stackdepot for SLAB")
    Signed-off-by: David Rientjes <rientjes@google.com>
    Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
    Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
    Cc: Alexander Potapenko <glider@google.com>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Cc: <stable@vger.kernel.org> [4.6+]
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
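The ordering problem can be shown with a small userspace model. It is a sketch only, assuming a free path that calls back into code which looks at the current task's policy pointer. Every name here (struct model_policy, model_current, model_alloc_hook and so on) is hypothetical and stands in for kernel machinery that is not reproduced.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct model_policy { int refcnt; };
struct model_task   { struct model_policy *mempolicy; };

struct model_task model_current;  /* stand-in for "current" */
int model_stale_reads;            /* times the hook reached a dying object */

/* Stand-in for an allocation made from inside the free path; it
 * consults the current task's policy pointer. */
void model_alloc_hook(const struct model_policy *being_freed)
{
    if (model_current.mempolicy == being_freed)
        model_stale_reads++;      /* would be a use-after-free */
}

struct model_policy *model_policy_new(void)
{
    struct model_policy *p = malloc(sizeof(*p));
    if (p)
        p->refcnt = 1;
    return p;
}

/* Drop a reference; the final put runs the hook as part of freeing. */
void model_policy_put(struct model_policy *p)
{
    if (p && --p->refcnt == 0) {
        model_alloc_hook(p);
        free(p);
    }
}

/* Buggy order: the task still points at the object while it is freed. */
void model_exit_buggy(void)
{
    model_policy_put(model_current.mempolicy);
    model_current.mempolicy = NULL;
}

/* Fixed order: clear the task's pointer first, then drop the reference. */
void model_exit_fixed(void)
{
    struct model_policy *p = model_current.mempolicy;

    model_current.mempolicy = NULL;
    model_policy_put(p);
}
```

The fixed variant keeps a local copy of the pointer so that nothing reachable from the task refers to the object by the time its last reference goes away.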
2016-09-01  printk/nmi: avoid direct printk()-s from __printk_nmi_flush()  Sergey Senozhatsky

    __printk_nmi_flush() can be called from nmi_panic(), therefore it has to test whether it's executed in NMI context and thus must route the messages through deferred printk() or via direct printk(). This is to avoid potential deadlocks, as described in commit cf9b1106c81c ("printk/nmi: flush NMI messages on the system panic").

    However there remain two places where __printk_nmi_flush() does unconditional direct printk() calls:

     - pr_err("printk_nmi_flush: internal error ...")
     - pr_cont("\n")

    Factor out print_nmi_seq_line() parts into a new printk_nmi_flush_line() function, which takes care of in_nmi(), and use it in __printk_nmi_flush() for printing and error-reporting.

    Link: http://lkml.kernel.org/r/20160830161354.581-1-sergey.senozhatsky@gmail.com
    Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
    Cc: Petr Mladek <pmladek@suse.com>
    Cc: Jan Kara <jack@suse.cz>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
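The idea of funnelling every message through one context-aware helper can be sketched in userspace as follows. This is an illustration under assumptions, not the patch itself: model_in_nmi stands in for in_nmi(), the two print functions stand in for the direct and deferred printk() paths, and all names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

bool model_in_nmi;         /* stand-in for in_nmi() */
int model_direct_calls;    /* messages printed directly */
int model_deferred_calls;  /* messages queued for later */

/* Stand-in for a direct print; unsafe to call from NMI context. */
void model_print_direct(const char *text, int len)
{
    model_direct_calls++;
    printf("%.*s", len, text);
}

/* Stand-in for a deferred print; only counts the request here. */
void model_print_deferred(const char *text, int len)
{
    (void)text;
    (void)len;
    model_deferred_calls++;
}

/* Single choke point: picks the safe path for the current context. */
void model_flush_line(const char *text, int len)
{
    if (model_in_nmi)
        model_print_deferred(text, len);
    else
        model_print_direct(text, len);
}
```

Because the error message and the trailing newline go through the same helper as ordinary lines, no caller needs its own context check.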
2016-09-01  kconfig: tinyconfig: provide whole choice blocks to avoid warnings  Arnd Bergmann

    Using "make tinyconfig" produces a couple of annoying warnings that show up for build test machines all the time:

      .config:966:warning: override: NOHIGHMEM changes choice state
      .config:965:warning: override: SLOB changes choice state
      .config:963:warning: override: KERNEL_XZ changes choice state
      .config:962:warning: override: CC_OPTIMIZE_FOR_SIZE changes choice state
      .config:933:warning: override: SLOB changes choice state
      .config:930:warning: override: CC_OPTIMIZE_FOR_SIZE changes choice state
      .config:870:warning: override: SLOB changes choice state
      .config:868:warning: override: KERNEL_XZ changes choice state
      .config:867:warning: override: CC_OPTIMIZE_FOR_SIZE changes choice state

    I've made a previous attempt at fixing them and we discussed a number of alternatives.

    I tried changing the Makefile to use "merge_config.sh -n $(fragment-list)" but couldn't get that to work properly.

    This is yet another approach, based on the observation that we do want to see a warning for conflicting 'choice' options, and that we can simply make them non-conflicting by listing all other options as disabled. This is a trivial patch that we can apply independent of plans for other changes.

    Link: http://lkml.kernel.org/r/20160829214952.1334674-2-arnd@arndb.de
    Link: https://storage.kernelci.org/mainline/v4.7-rc6/x86-tinyconfig/build.log https://patchwork.kernel.org/patch/9212749/
    Signed-off-by: Arnd Bergmann <arnd@arndb.de>
    Reviewed-by: Josh Triplett <josh@joshtriplett.org>
    Reviewed-by: Masahiro Yamada <yamada.masahiro@socionext.com>
    Acked-by: Ingo Molnar <mingo@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-09-01  kexec: fix double-free when failing to relocate the purgatory  Thiago Jung Bauermann

    If kexec_apply_relocations fails, kexec_load_purgatory frees pi->sechdrs and pi->purgatory_buf. This is redundant, because in case of error kimage_file_prepare_segments calls kimage_file_post_load_cleanup, which will also free those buffers.

    This causes two warnings like the following, one for pi->sechdrs and the other for pi->purgatory_buf:

      kexec-bzImage64: Loading purgatory failed
      ------------[ cut here ]------------
      WARNING: CPU: 1 PID: 2119 at mm/vmalloc.c:1490 __vunmap+0xc1/0xd0
      Trying to vfree() nonexistent vm area (ffffc90000e91000)
      Modules linked in:
      CPU: 1 PID: 2119 Comm: kexec Not tainted 4.8.0-rc3+ #5
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
      Call Trace:
        dump_stack+0x4d/0x65
        __warn+0xcb/0xf0
        warn_slowpath_fmt+0x4f/0x60
        ? find_vmap_area+0x19/0x70
        ? kimage_file_post_load_cleanup+0x47/0xb0
        __vunmap+0xc1/0xd0
        vfree+0x2e/0x70
        kimage_file_post_load_cleanup+0x5e/0xb0
        SyS_kexec_file_load+0x448/0x680
        ? putname+0x54/0x60
        ? do_sys_open+0x190/0x1f0
        entry_SYSCALL_64_fastpath+0x13/0x8f
      ---[ end trace 158bb74f5950ca2b ]---

    Fix by setting pi->sechdrs and pi->purgatory_buf to NULL, since vfree won't try to free a NULL pointer.

    Link: http://lkml.kernel.org/r/1472083546-23683-1-git-send-email-bauerman@linux.vnet.ibm.com
    Signed-off-by: Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com>
    Acked-by: Baoquan He <bhe@redhat.com>
    Cc: "Eric W. Biederman" <ebiederm@xmission.com>
    Cc: Vivek Goyal <vgoyal@redhat.com>
    Cc: Dave Young <dyoung@redhat.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
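The same pattern works in userspace C, where free(NULL) is likewise a no-op. The sketch below is illustrative only: struct model_purgatory and the two functions are hypothetical stand-ins for the error path in the loader and the common cleanup run by its caller, and malloc()/free() replace the kernel's vmalloc-family calls.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the buffers owned by the purgatory info. */
struct model_purgatory {
    void *sechdrs;
    void *purgatory_buf;
};

/* Error path inside the loader: free the buffers and clear the
 * pointers so that a later cleanup pass sees nothing left to free. */
void model_load_fail(struct model_purgatory *pi)
{
    free(pi->sechdrs);
    pi->sechdrs = NULL;
    free(pi->purgatory_buf);
    pi->purgatory_buf = NULL;
}

/* Common cleanup that the caller runs on any error. Freeing a NULL
 * pointer is a no-op, so running this after model_load_fail() is safe. */
void model_post_load_cleanup(struct model_purgatory *pi)
{
    free(pi->sechdrs);
    pi->sechdrs = NULL;
    free(pi->purgatory_buf);
    pi->purgatory_buf = NULL;
}
```

Without the NULL assignments in the first function, the second call would free both buffers again, which is the double-free the warnings above report.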
2016-09-01  Merge branch 'stable-4.8' of git://git.infradead.org/users/pcmoore/audit  Linus Torvalds

    Pull audit fixes from Paul Moore:
     "Two small patches to fix some bugs with the audit-by-executable functionality we introduced back in v4.3 (both patches are marked for the stable folks)"

    * 'stable-4.8' of git://git.infradead.org/users/pcmoore/audit:
      audit: fix exe_file access in audit_exe_compare
      mm: introduce get_task_exe_file
2016-08-31  audit: fix exe_file access in audit_exe_compare  Mateusz Guzik

    Prior to the change the function would blindly dereference mm, exe_file and exe_file->f_inode, each of which could have been NULL or freed.

    Use get_task_exe_file to safely obtain stable exe_file.

    Signed-off-by: Mateusz Guzik <mguzik@redhat.com>
    Acked-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
    Acked-by: Richard Guy Briggs <rgb@redhat.com>
    Cc: <stable@vger.kernel.org> # 4.3.x
    Signed-off-by: Paul Moore <paul@paul-moore.com>
2016-08-31  mm: introduce get_task_exe_file  Mateusz Guzik

    For more convenient access if one has a pointer to the task.

    As a minor nit, take advantage of the fact that only task lock + rcu are needed to safely grab ->exe_file. This saves the mm refcount dance.

    Use the helper in proc_exe_link.

    Signed-off-by: Mateusz Guzik <mguzik@redhat.com>
    Acked-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
    Acked-by: Richard Guy Briggs <rgb@redhat.com>
    Cc: <stable@vger.kernel.org> # 4.3.x
    Signed-off-by: Paul Moore <paul@paul-moore.com>
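The shape of such a helper, and of a caller that uses it instead of chasing pointers blindly, might look like the userspace sketch below. It is an assumption-laden model, not the kernel implementation: the structs, the plain integer refcount and every function name are hypothetical, and the task lock and RCU protection mentioned above are only indicated by comments.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct model_file { int refcnt; unsigned long ino; };
struct model_mm   { struct model_file *exe_file; };
struct model_task { struct model_mm *mm; };

/* Return the task's executable file with an extra reference held,
 * or NULL if the task has no mm or no exe_file. */
struct model_file *model_get_task_exe_file(struct model_task *task)
{
    struct model_file *f = NULL;

    /* real code would hold the task lock across this block */
    if (task->mm && task->mm->exe_file) {
        f = task->mm->exe_file;
        f->refcnt++;           /* pin the file before the lock is dropped */
    }
    return f;
}

/* Drop a reference taken by model_get_task_exe_file(). */
void model_put_file(struct model_file *f)
{
    if (f)
        f->refcnt--;
}

/* Caller: compare the executable's inode number without ever
 * dereferencing a pointer that might be NULL. */
bool model_exe_compare(struct model_task *task, unsigned long ino)
{
    struct model_file *f = model_get_task_exe_file(task);
    bool match;

    if (!f)
        return false;
    match = (f->ino == ino);
    model_put_file(f);
    return match;
}
```

The caller only ever touches the file while it holds its own reference, so the result does not depend on what happens to the task's mm in the meantime.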
2016-08-30  Merge tag 'seccomp-v4.8-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux  Linus Torvalds

    Pull seccomp fix from Kees Cook:
     "Fix fatal signal delivery after ptrace reordering"

    * tag 'seccomp-v4.8-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
      seccomp: Fix tracer exit notifications during fatal signals
2016-08-30  seccomp: Fix tracer exit notifications during fatal signals  Kees Cook

    This fixes a ptrace vs fatal pending signals bug as manifested in seccomp now that seccomp was reordered to happen after ptrace. The short version is that seccomp should not attempt to call do_exit() while fatal signals are pending under a tracer. The existing code was trying to be as defensively paranoid as possible, but it now ends up confusing ptrace. Instead, the syscall can just be skipped (which solves the original concern that the do_exit() was addressing) and normal signal handling, tracer notification, and process death can happen.

    Paraphrasing from the original bug report:

    If a tracee task is in a PTRACE_EVENT_SECCOMP trap, or has been resumed after such a trap but not yet been scheduled, and another task in the thread-group calls exit_group(), then the tracee task exits without the ptracer receiving a PTRACE_EVENT_EXIT notification. Test case here: https://gist.github.com/khuey/3c43ac247c72cef8c956ca73281c9be7

    The bug happens because when __seccomp_filter() detects fatal_signal_pending(), it calls do_exit() without dequeuing the fatal signal. When do_exit() sends the PTRACE_EVENT_EXIT notification and that task is descheduled, __schedule() notices that there is a fatal signal pending and changes its state from TASK_TRACED to TASK_RUNNING. That prevents the ptracer's waitpid() from returning the ptrace event. A more detailed analysis is here: https://github.com/mozilla/rr/issues/1762#issuecomment-237396255.

    Reported-by: Robert O'Callahan <robert@ocallahan.org>
    Reported-by: Kyle Huey <khuey@kylehuey.com>
    Tested-by: Kyle Huey <khuey@kylehuey.com>
    Fixes: 93e35efb8de4 ("x86/ptrace: run seccomp after ptrace")
    Signed-off-by: Kees Cook <keescook@chromium.org>
    Acked-by: Oleg Nesterov <oleg@redhat.com>
    Acked-by: James Morris <james.l.morris@oracle.com>
2016-08-30  Merge branch 'for-4.8-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup  Linus Torvalds

    Pull cgroup fixes from Tejun Heo:
     "Two fixes for cgroup.

       - There still was a hole in enforcing cpuset rules, fixed by Li.

       - The recent switch to global percpu_rwsem for threadgroup locking revealed a couple issues in how percpu_rwsem is implemented and used by cgroup.

         Balbir found that the read locking section was too wide, unnecessarily including operations which can often depend on IOs.

      With percpu_rwsem updates (coming through a different tree) and reduction of read locking section, all the reported locking latency issues, including the android one, are resolved.

      It looks like we can keep global percpu_rwsem locking for now. If there actually are cases which can't be resolved, we can go back to more complex per-signal_struct locking"

    * 'for-4.8-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
      cgroup: reduce read locked section of cgroup_threadgroup_rwsem during fork
      cpuset: make sure new tasks conform to the current config of the cpuset
2016-08-28  Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  Linus Torvalds

    Pull perf fixes from Thomas Gleixner:
     "A few fixes from the perf department

       - prevent an imbalanced preemption disable in the events teardown code

       - prevent out of bound access in perf userspace

       - make perf tools compile with UCLIBC again

       - a fix for the userspace unwinder utility"

    * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
      perf/core: Use this_cpu_ptr() when stopping AUX events
      perf evsel: Do not access outside hw cache name arrays
      tools lib: Reinstate strlcpy() header guard with __UCLIBC__
      perf unwind: Use addr_location::addr instead of ip for entries
2016-08-28  Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  Linus Torvalds

    Pull irq fixes from Thomas Gleixner:
     "This lot provides:

       - plug a hotplug race in the new affinity infrastructure

       - a fix for the trigger type of chained interrupts

       - plug a potential memory leak in the core code

       - a few fixes for ARM and MIPS GICs"

    * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
      irqchip/mips-gic: Implement activate op for device domain
      irqchip/mips-gic: Cleanup chip and handler setup
      genirq/affinity: Use get/put_online_cpus around cpumask operations
      genirq: Fix potential memleak when failing to get irq pm
      irqchip/gicv3-its: Disable the ITS before initializing it
      irqchip/gicv3: Remove disabling redistributor and group1 non-secure interrupts
      irqchip/gic: Allow self-SGIs for SMP on UP configurations
      genirq: Correctly configure the trigger on chained interrupts
2016-08-28  Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  Linus Torvalds

    Pull timer fixes from Thomas Gleixner:
     "A few updates for timers & co:

       - prevent a livelock in the timekeeping code when debugging is enabled

       - prevent out of bounds access in the timekeeping debug code

       - various fixes in clocksource drivers

       - a new maintainers entry"

    * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
      clocksource/drivers/sun4i: Clear interrupts after stopping timer in probe function
      drivers/clocksource/pistachio: Fix memory corruption in init
      clocksource/drivers/timer-atmel-pit: Enable mck clock
      clocksource/drivers/pxa: Fix include files for compilation
      MAINTAINERS: Add ARM ARCHITECTED TIMER entry
      timekeeping: Cap array access in timekeeping_debug
      timekeeping: Avoid taking lock in NMI path with CONFIG_DEBUG_TIMEKEEPING
2016-08-26  Merge branch 'akpm' (patches from Andrew)  Linus Torvalds

    Merge fixes from Andrew Morton:
     "11 fixes"

    * emailed patches from Andrew Morton <akpm@linux-foundation.org>:
      mm: silently skip readahead for DAX inodes
      dax: fix device-dax region base
      fs/seq_file: fix out-of-bounds read
      mm: memcontrol: avoid unused function warning
      mm: clarify COMPACTION Kconfig text
      treewide: replace config_enabled() with IS_ENABLED() (2nd round)
      printk: fix parsing of "brl=" option
      soft_dirty: fix soft_dirty during THP split
      sysctl: handle error writing UINT_MAX to u32 fields
      get_maintainer: quiet noisy implicit -f vcs_file_exists checking
      byteswap: don't use __builtin_bswap*() with sparse
2016-08-26  Merge branch 'for-linus' of git://git.kernel.dk/linux-block  Linus Torvalds

    Pull block fixes from Jens Axboe:
     "Here's a set of block fixes for the current 4.8-rc release. This contains:

       - a fix for a secure erase regression, from Adrian.

       - a fix for an mmc use-after-free bug regression, also from Adrian.

       - potential zero pointer dereference in bdev freezing, from Andrey.

       - a race fix for blk_set_queue_dying() from Bart.

       - a set of xen blkfront fixes from Bob Liu.

       - three small fixes for bcache, from Eric and Kent.

       - a fix for a potential invalid NVMe state transition, from Gabriel.

       - blk-mq CPU offline fix, preventing us from issuing and completing a request on the wrong queue. From me.

       - revert two previous floppy changes, since they caused a user visible regression. A better fix is in the works.

       - ensure that we don't send down bios that have more than 256 elements in them. Fixes a crash with bcache, for example. From Ming.

       - a fix for dereferencing an error pointer with cgroup writeback. Fixes a regression. From Vegard"

    * 'for-linus' of git://git.kernel.dk/linux-block:
      mmc: fix use-after-free of struct request
      Revert "floppy: refactor open() flags handling"
      Revert "floppy: fix open(O_ACCMODE) for ioctl-only open"
      fs/block_dev: fix potential NULL ptr deref in freeze_bdev()
      blk-mq: improve warning for running a queue on the wrong CPU
      blk-mq: don't overwrite rq->mq_ctx
      block: make sure a big bio is split into at most 256 bvecs
      nvme: Fix nvme_get/set_features() with a NULL result pointer
      bdev: fix NULL pointer dereference
      xen-blkfront: free resources if xlvbd_alloc_gendisk fails
      xen-blkfront: introduce blkif_set_queue_limits()
      xen-blkfront: fix places not updated after introducing 64KB page granularity
      bcache: pr_err: more meaningful error message when nr_stripes is invalid
      bcache: RESERVE_PRIO is too small by one when prio_buckets() is a power of two.
      bcache: register_bcache(): call blkdev_put() when cache_alloc() fails
      block: Fix race triggered by blk_set_queue_dying()
      block: Fix secure erase
      nvme: Prevent controller state invalid transition