path: root/kernel/sched/clock.c
Age | Commit message | Author
2016-04-13 | sched/clock: Make local_clock()/cpu_clock() inline | Daniel Lezcano
The local_clock/cpu_clock functions were changed to prevent a double identical test with sched_clock_cpu() when HAVE_UNSTABLE_SCHED_CLOCK is set. That resulted in one line functions. As these functions are in all the cases one line functions and in the hot path, it is useful to specify them as static inline in order to give a strong hint to the compiler. After verification, it appears the compiler does not inline them without this hint. Change those functions to static inline. sched_clock_cpu() is called via the inlined local_clock()/cpu_clock() functions from sched.h. So any module code including sched.h will reference sched_clock_cpu(). Thus it must be exported with the EXPORT_SYMBOL_GPL macro. Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1460385514-14700-2-git-send-email-daniel.lezcano@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-04-13 | sched/clock: Remove pointless test in cpu_clock/local_clock | Daniel Lezcano
In case the HAVE_UNSTABLE_SCHED_CLOCK config is set, the cpu_clock() version checks if sched_clock_stable() is not set and calls sched_clock_cpu(), otherwise it calls sched_clock(). sched_clock_cpu() checks also if sched_clock_stable() is set and, if true, calls sched_clock(). sched_clock() will be called in sched_clock_cpu() if sched_clock_stable() is true. Remove the duplicate test by directly calling sched_clock_cpu() and let the static key act in this function instead. Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1460385514-14700-1-git-send-email-daniel.lezcano@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
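The removed test can be illustrated with a small userspace sketch. The function names mirror those in the commit message, but the bodies are simplified stand-ins and not the kernel code: `clock_is_stable` is a plain flag assumed in place of the static key, and `filtered_cpu_clock()` is a hypothetical helper that returns fixed values so the two paths can be told apart.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: a plain flag stands in for the sched_clock_stable() static key. */
static bool clock_is_stable = true;

static bool sched_clock_stable(void) { return clock_is_stable; }

/* Stand-in for the raw clock; returns a fixed value for illustration. */
static uint64_t sched_clock(void) { return 1000; }

/* Hypothetical stand-in for the filtered per-CPU clock (unstable case). */
static uint64_t filtered_cpu_clock(int cpu) { return 2000 + (uint64_t)cpu; }

/* sched_clock_cpu() already picks the source based on stability. */
static uint64_t sched_clock_cpu(int cpu)
{
    if (sched_clock_stable())
        return sched_clock();
    return filtered_cpu_clock(cpu);
}

/* Before: cpu_clock() repeats the test that sched_clock_cpu() performs. */
static uint64_t cpu_clock_before(int cpu)
{
    if (!sched_clock_stable())
        return sched_clock_cpu(cpu);
    return sched_clock();
}

/* After: a one-line wrapper; the stability test lives in one place only. */
static inline uint64_t cpu_clock_after(int cpu)
{
    return sched_clock_cpu(cpu);
}
```

Both variants return the same value whether the flag is set or cleared, which is why the outer test could be dropped.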
2016-03-02 | sched-clock: Migrate to use new tick dependency mask model | Frederic Weisbecker
Instead of checking sched_clock_stable from the nohz subsystem to verify its tick dependency, migrate it to the new mask in order to include it to the all-in-one check. Reviewed-by: Chris Metcalf <cmetcalf@ezchip.com> Cc: Christoph Lameter <cl@linux.com> Cc: Chris Metcalf <cmetcalf@ezchip.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Luiz Capitulino <lcapitulino@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
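As a rough illustration of a dependency-mask model, the sketch below keeps one bit per subsystem that needs the tick, so that a single comparison answers whether the tick may stop. The enum names and helper functions are hypothetical and chosen for this example; they are not the kernel's identifiers.

```c
#include <stdbool.h>

/* Illustrative dependency bits; these names are hypothetical. */
enum {
    DEP_POSIX_TIMER    = 1u << 0,
    DEP_PERF_EVENTS    = 1u << 1,
    DEP_CLOCK_UNSTABLE = 1u << 2,
};

/* One word holds every reason the tick must keep running. */
static unsigned int tick_dep_mask;

static void tick_dep_set(unsigned int bit)   { tick_dep_mask |= bit; }
static void tick_dep_clear(unsigned int bit) { tick_dep_mask &= ~bit; }

/* All-in-one check: the tick may stop only when no dependency is set. */
static bool can_stop_tick(void) { return tick_dep_mask == 0; }
```

In this model an unstable sched clock would simply set its bit when it is detected, instead of being queried separately by the nohz code.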
2016-01-11 | Merge branch 'for-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq | Linus Torvalds
Pull workqueue update from Tejun Heo: "Workqueue changes for v4.5. One cleanup patch and three to improve the debuggability. Workqueue now has a stall detector which dumps workqueue state if any worker pool hasn't made forward progress over a certain amount of time (30s by default) and also triggers a warning if a workqueue which can be used in memory reclaim path tries to wait on something which can't be. These should make workqueue hangs a lot easier to debug."

* 'for-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  workqueue: simplify the apply_workqueue_attrs_locked()
  workqueue: implement lockup detector
  watchdog: introduce touch_softlockup_watchdog_sched()
  workqueue: warn if memory reclaim tries to flush !WQ_MEM_RECLAIM workqueue
2015-12-08 | watchdog: introduce touch_softlockup_watchdog_sched() | Tejun Heo
touch_softlockup_watchdog() is used to tell watchdog that scheduler stall is expected. One group of usage is from paths where the task may not be able to yield for a long time such as performing slow PIO to finicky device and coming out of suspend. The other is to account for scheduler and timer going idle. For scheduler softlockup detection, there's no reason to distinguish the two cases; however, workqueue lockup detector is planned and it can use the same signals from the former group while the latter would spuriously prevent detection. This patch introduces a new function touch_softlockup_watchdog_sched() and converts the latter group to call it instead. For now, it just calls touch_softlockup_watchdog() and there's no functional difference. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Ulrich Obergfell <uobergfe@redhat.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Andrew Morton <akpm@linux-foundation.org>
2015-11-23 | treewide: Remove old email address | Peter Zijlstra
There were still a number of references to my old Red Hat email address in the kernel source. Remove these while keeping the Red Hat copyright notices intact. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-02-12 | kernel/sched/clock.c: add another clock for use with the soft lockup watchdog | Cyril Bur
When the hypervisor pauses a virtualised kernel the kernel will observe a jump in timebase, this can cause spurious messages from the softlockup detector. Whilst these messages are harmless, they are accompanied with a stack trace which causes undue concern and more problematically the stack trace in the guest has nothing to do with the observed problem and can only be misleading. Furthermore, on POWER8 this is completely avoidable with the introduction of the Virtual Time Base (VTB) register. This patch (of 2): This permits the use of arch specific clocks for which virtualised kernels can use their notion of 'running' time, not the elapsed wall time which will include host execution time. Signed-off-by: Cyril Bur <cyrilbur@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Andrew Jones <drjones@redhat.com> Acked-by: Don Zickus <dzickus@redhat.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Ulrich Obergfell <uobergfe@redhat.com> Cc: chai wen <chaiw.fnst@cn.fujitsu.com> Cc: Fabian Frederick <fabf@skynet.be> Cc: Aaron Tomlin <atomlin@redhat.com> Cc: Ben Zhang <benzh@chromium.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: John Stultz <john.stultz@linaro.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
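The difference between wall time and 'running' time can be sketched with a simulated guest. Everything below is a userspace model under stated assumptions: the state variables, `guest_runs()`, `guest_paused()` and `lockup_detected()` are hypothetical helpers for this example, and subtracting paused time is only one way an architecture could derive a running clock.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simulated guest state, in nanoseconds (hypothetical names). */
static uint64_t wall_ns;    /* elapsed time as seen through the timebase */
static uint64_t paused_ns;  /* time the hypervisor kept the guest paused */

/* Wall-style clock: includes host execution time. */
static uint64_t local_clock(void) { return wall_ns; }

/* 'Running' clock: only the time the guest actually executed. */
static uint64_t running_clock(void) { return wall_ns - paused_ns; }

/* A watchdog fires when more than `threshold` ns pass without a touch. */
static bool lockup_detected(uint64_t now, uint64_t last_touch,
                            uint64_t threshold)
{
    return now - last_touch > threshold;
}

/* Simulate the guest executing for `ns`. */
static void guest_runs(uint64_t ns) { wall_ns += ns; }

/* Simulate the hypervisor pausing the guest for `ns`. */
static void guest_paused(uint64_t ns)
{
    wall_ns += ns;
    paused_ns += ns;
}
```

With a 20 second threshold, a 60 second pause trips a detector that reads the wall-style clock, while one that reads the running clock sees only the couple of seconds the guest really ran.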
2014-08-26 | time: Replace __get_cpu_var uses | Christoph Lameter
Convert uses of __get_cpu_var for creating an address from a percpu offset to this_cpu_ptr. The two cases where get_cpu_var is used to actually access a percpu variable are changed to use this_cpu_read/raw_cpu_read. Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2014-04-07 | kernel: use macros from compiler.h instead of __attribute__((...)) | Gideon Israel Dsouza
To increase compiler portability there is <linux/compiler.h> which provides convenience macros for various gcc constructs. Eg: __weak for __attribute__((weak)). I've replaced all instances of gcc attributes with the right macro in the kernel subsystem. Signed-off-by: Gideon Israel Dsouza <gidisrael@gmail.com> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-03-11 | sched/clock: Prevent tracing recursion in sched_clock_cpu() | Fernando Luis Vazquez Cao
Prevent tracing of preempt_disable/enable() in sched_clock_cpu(). When CONFIG_DEBUG_PREEMPT is enabled, preempt_disable/enable() are traced and this causes trace_clock() users (and probably others) to go into an infinite recursion. Systems with a stable sched_clock() are not affected. This problem is similar to that fixed by upstream commit 95ef1e52922 ("KVM guest: prevent tracing recursion with kvmclock"). Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Steven Rostedt <rostedt@goodmis.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1394083528.4524.3.camel@nexus Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-23 | sched/clock: Fixup early initialization | Peter Zijlstra
The code would assume sched_clock_stable() and switch to !stable later, this switch brings a discontinuity in time. The discontinuity on switching from stable to unstable was always present, but previously we would set stable/unstable before initializing TSC and usually stick to the one we start out with. So the static_key bits brought an extra switch where there previously wasn't one. Things are further complicated by the fact that we cannot use static_key as early as we usually call set_sched_clock_stable(). Fix things by tracking the stable state in a regular variable and only set the static_key to the right state on sched_clock_init(), which is run right after late_time_init->tsc_init(). Before this we would not be using the TSC anyway. Reported-and-Tested-by: Sasha Levin <sasha.levin@oracle.com> Reported-by: dyoung@redhat.com Fixes: 35af99e646c7 ("sched/clock, x86: Use a static_key for sched_clock_stable") Cc: jacob.jun.pan@linux.intel.com Cc: Mike Galbraith <bitbucket@online.de> Cc: hpa@zytor.com Cc: paulmck@linux.vnet.ibm.com Cc: John Stultz <john.stultz@linaro.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: lenb@kernel.org Cc: rjw@rjwysocki.net Cc: Eliezer Tamir <eliezer.tamir@linux.intel.com> Cc: rui.zhang@intel.com Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20140122115918.GG3694@twins.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
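The approach of recording the state early and applying it once at init can be sketched as follows. This is a simplified userspace model, not the kernel implementation: `key_stable` is a plain boolean assumed in place of the static_key, and `key_ready`, `early_stable_request` and `key_flips` are hypothetical names introduced for the example.

```c
#include <stdbool.h>

/* Early-boot state kept in an ordinary variable; safe to write any time. */
static bool early_stable_request;

/* Assumption: a boolean models the static key, unusable before init. */
static bool key_ready;
static bool key_stable;
static int  key_flips;   /* counts how often the key changed state */

static void set_sched_clock_stable(void)
{
    early_stable_request = true;
    if (key_ready && !key_stable) {
        key_stable = true;
        key_flips++;
    }
}

static void clear_sched_clock_stable(void)
{
    early_stable_request = false;
    if (key_ready && key_stable) {
        key_stable = false;
        key_flips++;
    }
}

/* Runs once the clock hardware is set up: apply the recorded state. */
static void sched_clock_init(void)
{
    key_ready = true;
    if (early_stable_request) {
        key_stable = true;
        key_flips++;
    }
}

static bool sched_clock_stable(void) { return key_stable; }
```

However many times the early callers change their mind, the key itself flips at most once at init, which avoids the extra discontinuity the message describes.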
2014-01-13 | sched/clock: Fix up clear_sched_clock_stable() | Peter Zijlstra
The below tells us the static_key conversion has a problem; since the exact point of clearing that flag isn't too important, delay the flip and use a workqueue to process it. [ ] TSC synchronization [CPU#0 -> CPU#22]: [ ] Measured 8 cycles TSC warp between CPUs, turning off TSC clock. [ ] [ ] ====================================================== [ ] [ INFO: possible circular locking dependency detected ] [ ] 3.13.0-rc3-01745-g848b0d0322cb-dirty #637 Not tainted [ ] ------------------------------------------------------- [ ] swapper/0/1 is trying to acquire lock: [ ] (jump_label_mutex){+.+...}, at: [<ffffffff8115a637>] jump_label_lock+0x17/0x20 [ ] [ ] but task is already holding lock: [ ] (cpu_hotplug.lock){+.+.+.}, at: [<ffffffff8109408b>] cpu_hotplug_begin+0x2b/0x60 [ ] [ ] which lock already depends on the new lock. [ ] [ ] [ ] the existing dependency chain (in reverse order) is: [ ] [ ] -> #1 (cpu_hotplug.lock){+.+.+.}: [ ] [<ffffffff810def00>] lock_acquire+0x90/0x130 [ ] [<ffffffff81661f83>] mutex_lock_nested+0x63/0x3e0 [ ] [<ffffffff81093fdc>] get_online_cpus+0x3c/0x60 [ ] [<ffffffff8104cc67>] arch_jump_label_transform+0x37/0x130 [ ] [<ffffffff8115a3cf>] __jump_label_update+0x5f/0x80 [ ] [<ffffffff8115a48d>] jump_label_update+0x9d/0xb0 [ ] [<ffffffff8115aa6d>] static_key_slow_inc+0x9d/0xb0 [ ] [<ffffffff810c0f65>] sched_feat_set+0xf5/0x100 [ ] [<ffffffff810c5bdc>] set_numabalancing_state+0x2c/0x30 [ ] [<ffffffff81d12f3d>] numa_policy_init+0x1af/0x1b7 [ ] [<ffffffff81cebdf4>] start_kernel+0x35d/0x41f [ ] [<ffffffff81ceb5a5>] x86_64_start_reservations+0x2a/0x2c [ ] [<ffffffff81ceb6a2>] x86_64_start_kernel+0xfb/0xfe [ ] [ ] -> #0 (jump_label_mutex){+.+...}: [ ] [<ffffffff810de141>] __lock_acquire+0x1701/0x1eb0 [ ] [<ffffffff810def00>] lock_acquire+0x90/0x130 [ ] [<ffffffff81661f83>] mutex_lock_nested+0x63/0x3e0 [ ] [<ffffffff8115a637>] jump_label_lock+0x17/0x20 [ ] [<ffffffff8115aa3b>] static_key_slow_inc+0x6b/0xb0 [ ] [<ffffffff810ca775>] 
clear_sched_clock_stable+0x15/0x20 [ ] [<ffffffff810503b3>] mark_tsc_unstable+0x23/0x70 [ ] [<ffffffff810772cb>] check_tsc_sync_source+0x14b/0x150 [ ] [<ffffffff81076612>] native_cpu_up+0x3a2/0x890 [ ] [<ffffffff810941cb>] _cpu_up+0xdb/0x160 [ ] [<ffffffff810942c9>] cpu_up+0x79/0x90 [ ] [<ffffffff81d0af6b>] smp_init+0x60/0x8c [ ] [<ffffffff81cebf42>] kernel_init_freeable+0x8c/0x197 [ ] [<ffffffff8164e32e>] kernel_init+0xe/0x130 [ ] [<ffffffff8166beec>] ret_from_fork+0x7c/0xb0 [ ] [ ] other info that might help us debug this: [ ] [ ] Possible unsafe locking scenario: [ ] [ ] CPU0 CPU1 [ ] ---- ---- [ ] lock(cpu_hotplug.lock); [ ] lock(jump_label_mutex); [ ] lock(cpu_hotplug.lock); [ ] lock(jump_label_mutex); [ ] [ ] *** DEADLOCK *** [ ] [ ] 2 locks held by swapper/0/1: [ ] #0: (cpu_add_remove_lock){+.+.+.}, at: [<ffffffff81094037>] cpu_maps_update_begin+0x17/0x20 [ ] #1: (cpu_hotplug.lock){+.+.+.}, at: [<ffffffff8109408b>] cpu_hotplug_begin+0x2b/0x60 [ ] [ ] stack backtrace: [ ] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.13.0-rc3-01745-g848b0d0322cb-dirty #637 [ ] Hardware name: Supermicro X8DTN/X8DTN, BIOS 4.6.3 01/08/2010 [ ] ffffffff82c9c270 ffff880236843bb8 ffffffff8165c5f5 ffffffff82c9c270 [ ] ffff880236843bf8 ffffffff81658c02 ffff880236843c80 ffff8802368586a0 [ ] ffff880236858678 0000000000000001 0000000000000002 ffff880236858000 [ ] Call Trace: [ ] [<ffffffff8165c5f5>] dump_stack+0x4e/0x7a [ ] [<ffffffff81658c02>] print_circular_bug+0x1f9/0x207 [ ] [<ffffffff810de141>] __lock_acquire+0x1701/0x1eb0 [ ] [<ffffffff816680ff>] ? __atomic_notifier_call_chain+0x8f/0xb0 [ ] [<ffffffff810def00>] lock_acquire+0x90/0x130 [ ] [<ffffffff8115a637>] ? jump_label_lock+0x17/0x20 [ ] [<ffffffff8115a637>] ? jump_label_lock+0x17/0x20 [ ] [<ffffffff81661f83>] mutex_lock_nested+0x63/0x3e0 [ ] [<ffffffff8115a637>] ? 
jump_label_lock+0x17/0x20 [ ] [<ffffffff8115a637>] jump_label_lock+0x17/0x20 [ ] [<ffffffff8115aa3b>] static_key_slow_inc+0x6b/0xb0 [ ] [<ffffffff810ca775>] clear_sched_clock_stable+0x15/0x20 [ ] [<ffffffff810503b3>] mark_tsc_unstable+0x23/0x70 [ ] [<ffffffff810772cb>] check_tsc_sync_source+0x14b/0x150 [ ] [<ffffffff81076612>] native_cpu_up+0x3a2/0x890 [ ] [<ffffffff810941cb>] _cpu_up+0xdb/0x160 [ ] [<ffffffff810942c9>] cpu_up+0x79/0x90 [ ] [<ffffffff81d0af6b>] smp_init+0x60/0x8c [ ] [<ffffffff81cebf42>] kernel_init_freeable+0x8c/0x197 [ ] [<ffffffff8164e320>] ? rest_init+0xd0/0xd0 [ ] [<ffffffff8164e32e>] kernel_init+0xe/0x130 [ ] [<ffffffff8166beec>] ret_from_fork+0x7c/0xb0 [ ] [<ffffffff8164e320>] ? rest_init+0xd0/0xd0 [ ] ------------[ cut here ]------------ [ ] WARNING: CPU: 0 PID: 1 at /usr/src/linux-2.6/kernel/smp.c:374 smp_call_function_many+0xad/0x300() [ ] Modules linked in: [ ] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.13.0-rc3-01745-g848b0d0322cb-dirty #637 [ ] Hardware name: Supermicro X8DTN/X8DTN, BIOS 4.6.3 01/08/2010 [ ] 0000000000000009 ffff880236843be0 ffffffff8165c5f5 0000000000000000 [ ] ffff880236843c18 ffffffff81093d8c 0000000000000000 0000000000000000 [ ] ffffffff81ccd1a0 ffffffff810ca951 0000000000000000 ffff880236843c28 [ ] Call Trace: [ ] [<ffffffff8165c5f5>] dump_stack+0x4e/0x7a [ ] [<ffffffff81093d8c>] warn_slowpath_common+0x8c/0xc0 [ ] [<ffffffff810ca951>] ? sched_clock_tick+0x1/0xa0 [ ] [<ffffffff81093dda>] warn_slowpath_null+0x1a/0x20 [ ] [<ffffffff8110b72d>] smp_call_function_many+0xad/0x300 [ ] [<ffffffff8104f200>] ? arch_unregister_cpu+0x30/0x30 [ ] [<ffffffff8104f200>] ? arch_unregister_cpu+0x30/0x30 [ ] [<ffffffff810ca951>] ? sched_clock_tick+0x1/0xa0 [ ] [<ffffffff8110ba96>] smp_call_function+0x46/0x80 [ ] [<ffffffff8104f200>] ? arch_unregister_cpu+0x30/0x30 [ ] [<ffffffff8110bb3c>] on_each_cpu+0x3c/0xa0 [ ] [<ffffffff810ca950>] ? sched_clock_idle_sleep_event+0x20/0x20 [ ] [<ffffffff810ca951>] ? 
sched_clock_tick+0x1/0xa0 [ ] [<ffffffff8104f964>] text_poke_bp+0x64/0xd0 [ ] [<ffffffff810ca950>] ? sched_clock_idle_sleep_event+0x20/0x20 [ ] [<ffffffff8104ccde>] arch_jump_label_transform+0xae/0x130 [ ] [<ffffffff8115a3cf>] __jump_label_update+0x5f/0x80 [ ] [<ffffffff8115a48d>] jump_label_update+0x9d/0xb0 [ ] [<ffffffff8115aa6d>] static_key_slow_inc+0x9d/0xb0 [ ] [<ffffffff810ca775>] clear_sched_clock_stable+0x15/0x20 [ ] [<ffffffff810503b3>] mark_tsc_unstable+0x23/0x70 [ ] [<ffffffff810772cb>] check_tsc_sync_source+0x14b/0x150 [ ] [<ffffffff81076612>] native_cpu_up+0x3a2/0x890 [ ] [<ffffffff810941cb>] _cpu_up+0xdb/0x160 [ ] [<ffffffff810942c9>] cpu_up+0x79/0x90 [ ] [<ffffffff81d0af6b>] smp_init+0x60/0x8c [ ] [<ffffffff81cebf42>] kernel_init_freeable+0x8c/0x197 [ ] [<ffffffff8164e320>] ? rest_init+0xd0/0xd0 [ ] [<ffffffff8164e32e>] kernel_init+0xe/0x130 [ ] [<ffffffff8166beec>] ret_from_fork+0x7c/0xb0 [ ] [<ffffffff8164e320>] ? rest_init+0xd0/0xd0 [ ] ---[ end trace 6ff1df5620c49d26 ]--- [ ] tsc: Marking TSC unstable due to check_tsc_sync_source failed Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/n/tip-v55fgqj3nnyqnngmvuu8ep6h@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
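The idea of delaying the flip can be modelled in userspace as below. This is only a sketch under assumptions: the one-slot queue and `schedule_work_model()` are hypothetical stand-ins and not the kernel workqueue API, and `hotplug_lock_held` is a boolean that models holding cpu_hotplug.lock.

```c
#include <stdbool.h>
#include <stddef.h>

typedef void (*work_fn)(void);

/* Hypothetical one-slot stand-in for a workqueue. */
static work_fn pending_work;

static void schedule_work_model(work_fn fn) { pending_work = fn; }

/* Runs deferred work later, from a context with no conflicting lock held. */
static void run_pending_work(void)
{
    work_fn fn = pending_work;

    pending_work = NULL;
    if (fn)
        fn();
}

static bool hotplug_lock_held;      /* models cpu_hotplug.lock */
static bool clock_stable = true;
static bool lock_order_violation;

/* Flipping the key takes jump_label_mutex; in this model that must not
 * happen while the hotplug lock is held. */
static void flip_key_to_unstable(void)
{
    if (hotplug_lock_held)
        lock_order_violation = true;
    clock_stable = false;
}

/* Deferred variant: only queue the flip, whatever the caller holds. */
static void clear_sched_clock_stable(void)
{
    schedule_work_model(flip_key_to_unstable);
}
```

The clock stays marked stable for a short while after the request, which the commit message accepts because the exact point of clearing the flag is not important.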
2014-01-13 | sched/clock, x86: Use a static_key for sched_clock_stable | Peter Zijlstra
In order to avoid the runtime condition and variable load turn sched_clock_stable into a static_key. Also provide a shorter implementation of local_clock() and cpu_clock(int) when sched_clock_stable==1.

                        MAINLINE   PRE        POST

    sched_clock_stable: 1          1          1
    (cold) sched_clock: 329841     221876     215295
    (cold) local_clock: 301773     234692     220773
    (warm) sched_clock: 38375      25602      25659
    (warm) local_clock: 100371     33265      27242
    (warm) rdtsc:       27340      24214      24208

    sched_clock_stable: 0          0          0
    (cold) sched_clock: 382634     235941     237019
    (cold) local_clock: 396890     297017     294819
    (warm) sched_clock: 38194      25233      25609
    (warm) local_clock: 143452     71234      71232
    (warm) rdtsc:       27345      24245      24243

Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/n/tip-eummbdechzz37mwmpags1gjr@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-13 | sched/clock: Remove local_irq_disable() from the clocks | Peter Zijlstra
Now that x86 no longer requires IRQs disabled for sched_clock() and ia64 never had this requirement (it doesn't seem to do cpufreq at all), we can remove the requirement of disabling IRQs.

                        MAINLINE   PRE        POST

    sched_clock_stable: 1          1          1
    (cold) sched_clock: 329841     257223     221876
    (cold) local_clock: 301773     309889     234692
    (warm) sched_clock: 38375      25280      25602
    (warm) local_clock: 100371     85268      33265
    (warm) rdtsc:       27340      24247      24214

    sched_clock_stable: 0          0          0
    (cold) sched_clock: 382634     301224     235941
    (cold) local_clock: 396890     399870     297017
    (warm) sched_clock: 38194      25630      25233
    (warm) local_clock: 143452     129629     71234
    (warm) rdtsc:       27345      24307      24245

Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/n/tip-36e5kohiasnr106d077mgubp@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-08 | sched_clock: Prevent 64bit inatomicity on 32bit systems | Thomas Gleixner
The sched_clock_remote() implementation has the following inatomicity problem on 32bit systems when accessing the remote scd->clock, which is a 64bit value.

    CPU0                        CPU1

    sched_clock_local()         sched_clock_remote(CPU0)
    ...
                                remote_clock = scd[CPU0]->clock
                                    read_low32bit(scd[CPU0]->clock)
    cmpxchg64(scd->clock,...)
                                    read_high32bit(scd[CPU0]->clock)

While the update of scd->clock is using an atomic64 mechanism, the readout on the remote cpu is not, which can cause completely bogus readouts.

It is a quite rare problem, because it requires the update to hit the narrow race window between the low/high readout and the update must go across the 32bit boundary.

The resulting misbehaviour is that CPU1 will see the sched_clock on CPU0 ~4 seconds ahead of its own and update CPU1's sched_clock value to this bogus timestamp. This stays that way due to the clamping implementation for about 4 seconds until the synchronization with CLOCK_MONOTONIC undoes the problem.

The issue is hard to observe, because it might only result in a less accurate SCHED_OTHER timeslicing behaviour. To create observable damage on realtime scheduling classes, it is necessary that the bogus update of CPU1 sched_clock happens in the context of a realtime thread, which then gets charged 4 seconds of RT runtime, which causes the RT throttler mechanism to trigger and prevent scheduling of RT tasks for a little less than 4 seconds. So this is quite unlikely as well.

The issue was quite hard to decode as the reproduction time is between 2 days and 3 weeks and intrusive tracing makes it less likely, but the following trace recorded with trace_clock=global, which uses sched_clock_local(), gave the final hint:

    <idle>-0   0d..30 400269.477150: hrtimer_cancel: hrtimer=0xf7061e80
    <idle>-0   0d..30 400269.477151: hrtimer_start:  hrtimer=0xf7061e80 ...
    irq/20-S-587 1d..32 400273.772118: sched_wakeup:   comm= ... target_cpu=0
    <idle>-0   0dN.30 400273.772118: hrtimer_cancel: hrtimer=0xf7061e80

What happens is that CPU0 goes idle and invokes sched_clock_idle_sleep_event() which invokes sched_clock_local() and CPU1 runs a remote wakeup for CPU0 at the same time, which invokes sched_clock_remote(). The time jump gets propagated to CPU0 via sched_clock_remote() and stays stale on both cores for ~4 seconds.

There are only two other possibilities, which could cause a stale sched clock:

1) ktime_get() which reads out CLOCK_MONOTONIC returns a sporadic wrong value.

2) sched_clock() which reads the TSC returns a sporadic wrong value.

#1 can be excluded because sched_clock would continue to increase for one jiffy and then go stale.

#2 can be excluded because it would not make the clock jump forward. It would just result in a stale sched_clock for one jiffy.

After quite some brain twisting and finding the same pattern on other traces, sched_clock_remote() remained the only place which could cause such a problem and as explained above it's indeed racy on 32bit systems.

So while on 64bit systems the readout is atomic, we need to verify the remote readout on 32bit machines. We need to protect the local->clock readout in sched_clock_remote() on 32bit as well because an NMI could hit between the low and the high readout, call sched_clock_local() and modify local->clock.

Thanks to Siegfried Wulsch for bearing with my debug requests and going through the tedious tasks of running a bunch of reproducer systems to generate the debug information which let me decode the issue.

Reported-by: Siegfried Wulsch <Siegfried.Wulsch@rovema.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1304051544160.21884@ionos Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org
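The size of the jump follows from the arithmetic of a torn read: the clock counts nanoseconds, and 2^32 ns is about 4.29 seconds. The sketch below replays the race deterministically in userspace. `torn_read()` is a hypothetical helper written for this illustration; it does not reproduce the kernel code or the fix, it only shows what value a reader assembles when the write lands between its two half reads.

```c
#include <stdint.h>

/*
 * Model a 64-bit value read as two 32-bit halves, as on a 32-bit CPU
 * without an atomic 64-bit load. `before` is the value when the low half
 * is read; `update` is the value after a remote write that lands before
 * the high half is read.
 */
static uint64_t torn_read(uint64_t before, uint64_t update)
{
    uint32_t low  = (uint32_t)before;          /* low half, old value  */
    uint32_t high = (uint32_t)(update >> 32);  /* high half, new value */

    return ((uint64_t)high << 32) | low;
}
```

For example, with the clock at 0x00000000FFFFFF00 before the update and 0x0000000100000010 after it, the reader assembles 0x00000001FFFFFF00, which is 0xFFFFFEF0 ns (roughly 4.29 seconds) ahead of the true value. The error only appears when the update carries into the high word, which matches the message's remark that the update must cross the 32bit boundary.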
2011-11-17 | sched: Move all scheduler bits into kernel/sched/ | Peter Zijlstra
There's too many sched*.[ch] files in kernel/, give them their own directory. (No code changed, other than Makefile glue added.) Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>