This reverts commit 5258f386ea4e8454bc801fb443e8a4217da1947c,
because the underlying autogroups bug got fixed upstream in
a better way, via:
fd8ef11730f1 Revert "sched, autogroup: Stop going ahead if autogroup is disabled"
Cc: Mike Galbraith <efault@gmx.de>
Cc: Yong Zhang <yong.zhang0@gmail.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks into sched/core
Pull more cputime cleanups from Frederic Weisbecker:
* Get rid of underscores polluting the vtime namespace
* Consolidate context switch and tick handling
* Improve debuggability by detecting irq unsafe callers
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks into sched/core
Pull cputime cleanups from Frederic Weisbecker:
* Improve naming and code location
* Consolidate adjustment code
* Comment the adjustment code
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Pick up the autogroups fix and other fixes.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux
Pull module signing fixes from Rusty Russell:
"David gave me these a month ago, during my git workflow churn :("
* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux:
ASN.1: Fix an indefinite length skip error
MODSIGN: Don't use enum-type bitfields in module signature info block
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull watchdog fix from Thomas Gleixner:
"Trivial CPU hotplug regression fix for the watchdog code"
* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
watchdog: Fix CPU hotplug regression
|
|
Don't use enum-type bitfields in the module signature info block as we can't be
certain how the compiler will handle them. As I understand it, it is arch
dependent, and it is possible for the compiler to rearrange them based on
endianness and to insert a byte of padding to pad the three enums out to four
bytes.
Instead use u8 fields for these, which the compiler should emit in the right
order without padding.
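For illustration, the resulting info block has a layout along these lines (a
sketch; the field names approximate the in-tree module_signature and are not
a verbatim copy):

struct module_signature {
        u8      algo;           /* Public-key crypto algorithm [enum pkey_algo] */
        u8      hash;           /* Digest algorithm [enum pkey_hash_algo] */
        u8      id_type;        /* Key identifier type [enum pkey_id_type] */
        u8      signer_len;     /* Length of signer's name */
        u8      key_id_len;     /* Length of key identifier */
        u8      __pad[3];
        __be32  sig_len;        /* Length of signature data */
};

With plain u8 members every compiler emits the fields in declaration order
with no hidden padding, regardless of arch or endianness.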
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
|
|
Norbert reported:
"3.7-rc6 booted with nmi_watchdog=0 fails to suspend to RAM or
offline CPUs. It's reproducible with a KVM guest and physical
system."
The reason is that commit bcd951cf (watchdog: Use hotplug thread
infrastructure) failed to take this case into account, so the CPU offline
code gets stuck in the teardown function because it accesses
uninitialized data structures.
Add a check for watchdog_enabled into that path to cure the issue.
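A minimal sketch of the idea (assuming the check lands at the top of the
per-cpu teardown callback; the exact function and placement may differ):

static void watchdog_disable(unsigned int cpu)
{
        /*
         * Nothing was set up when the watchdog was disabled on the
         * command line, so don't touch the uninitialized per-cpu state.
         */
        if (!watchdog_enabled)
                return;

        /* ... regular hrtimer/perf event teardown ... */
}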
Reported-and-tested-by: Norbert Warmuth <nwarmuth@t-online.de>
Tested-by: Joseph Salisbury <joseph.salisbury@canonical.com>
Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1211231033230.2701@ionos
Link: http://bugs.launchpad.net/bugs/1079534
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux
Pull module fixes from Rusty Russell:
"Module signing build fixes for blackfin and metag"
* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux:
modsign: add symbol prefix to certificate list
linux/kernel.h: define SYMBOL_PREFIX
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue fixes from Tejun Heo:
"So, safe fixes my ass.
Commit 8852aac25e79 ("workqueue: mod_delayed_work_on() shouldn't queue
timer on 0 delay") had the side-effect of performing delayed_work
sanity checks even when @delay is 0, which should be fine for any sane
use cases.
Unfortunately, megaraid was being overly ingenious. It seemingly
wanted to use cancel_delayed_work_sync() before cancel_work_sync() was
introduced, but didn't want to waste the space for full delayed_work
as it was only going to use 0 @delay. So, it only allocated space for
struct work_struct and then cast it to struct delayed_work and passed
it into delayed_work functions - truly awesome engineering tradeoff to
save some bytes.
Xiaotian fixed it by making megaraid allocate a full delayed_work for
now. It should be converted to use work_struct and cancel_work_sync(),
but I think we'd better do that after 3.7.
I added another commit to change BUG_ON()s in __queue_delayed_work()
to WARN_ON_ONCE()s so that the kernel doesn't crash even if there are
more such abuses."
* 'for-3.7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: convert BUG_ON()s in __queue_delayed_work() to WARN_ON_ONCE()s
megaraid: fix BUG_ON() from incorrect use of delayed work
|
|
8852aac25e ("workqueue: mod_delayed_work_on() shouldn't queue timer on
0 delay") unexpectedly uncovered a very nasty abuse of delayed_work in
megaraid - it allocated a work_struct, cast it to delayed_work and
then passed that into queue_delayed_work().
Previously, this was okay because a 0 @delay short-circuited to
queue_work() before doing anything with the delayed_work. 8852aac25e
moved the 0 @delay test into __queue_delayed_work(), after the sanity
checks on delayed_work, making megaraid trigger BUG_ON().
Although megaraid is already fixed by c1d390d8e6 ("megaraid: fix
BUG_ON() from incorrect use of delayed work"), this patch converts
BUG_ON()s in __queue_delayed_work() to WARN_ON_ONCE()s so that such
abusers, if there are more, trigger warning but don't crash the
machine.
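The conversion amounts to something like the following in
__queue_delayed_work() (a sketch; the exact set of sanity checks may
differ):

/*
 * Warn instead of crashing when a bare work_struct has been cast to
 * delayed_work: the embedded timer is garbage in that case.
 */
WARN_ON_ONCE(timer->function != delayed_work_timer_fn);
WARN_ON_ONCE(timer_pending(timer));
WARN_ON_ONCE(!list_empty(&work->entry));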
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Xiaotian Feng <xtfeng@gmail.com>
|
|
This reverts commit 800d4d30c8f20bd728e5741a3b77c4859a613f7c.
Between commits 8323f26ce342 ("sched: Fix race in task_group()") and
800d4d30c8f2 ("sched, autogroup: Stop going ahead if autogroup is
disabled"), autogroup is a wreck.
With both applied, all you have to do to crash a box is disable
autogroup during boot up, then reboot.. boom, NULL pointer dereference
due to commit 800d4d30c8f2 not allowing autogroup to move things, and
commit 8323f26ce342 making that the only way to switch runqueues:
BUG: unable to handle kernel NULL pointer dereference at (null)
IP: [<ffffffff81063ac0>] effective_load.isra.43+0x50/0x90
Pid: 7047, comm: systemd-user-se Not tainted 3.6.8-smp #7 MEDIONPC MS-7502/MS-7502
RIP: effective_load.isra.43+0x50/0x90
Process systemd-user-se (pid: 7047, threadinfo ffff880221dde000, task ffff88022618b3a0)
Call Trace:
select_task_rq_fair+0x255/0x780
try_to_wake_up+0x156/0x2c0
wake_up_state+0xb/0x10
signal_wake_up+0x28/0x40
complete_signal+0x1d6/0x250
__send_signal+0x170/0x310
send_signal+0x40/0x80
do_send_sig_info+0x47/0x90
group_send_sig_info+0x4a/0x70
kill_pid_info+0x3a/0x60
sys_kill+0x97/0x1a0
? vfs_read+0x120/0x160
? sys_read+0x45/0x90
system_call_fastpath+0x16/0x1b
Code: 49 0f af 41 50 31 d2 49 f7 f0 48 83 f8 01 48 0f 46 c6 48 2b 07 48 8b bf 40 01 00 00 48 85 ff 74 3a 45 31 c0 48 8b 8f 50 01 00 00 <48> 8b 11 4c 8b 89 80 00 00 00 49 89 d2 48 01 d0 45 8b 59 58 4c
RIP [<ffffffff81063ac0>] effective_load.isra.43+0x50/0x90
RSP <ffff880221ddfbd8>
CR2: 0000000000000000
Signed-off-by: Mike Galbraith <efault@gmx.de>
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: Yong Zhang <yong.zhang0@gmail.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: stable@vger.kernel.org # 2.6.39+
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Add the arch symbol prefix (if applicable) to the asm definition of
modsign_certificate_list and modsign_certificate_list_end. This uses the
recently defined SYMBOL_PREFIX which is derived from
CONFIG_SYMBOL_PREFIX.
This fixes the build of module signing on the blackfin and metag
architectures.
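The pattern is roughly the following (a sketch, not the verbatim hunk):

asm(".section .init.data,\"aw\"\n"
    SYMBOL_PREFIX "modsign_certificate_list:\n"
    ".incbin \"signing_key.x509\"\n"
    ".incbin \"extra_certificates\"\n"
    SYMBOL_PREFIX "modsign_certificate_list_end:");

On architectures such as blackfin, where CONFIG_SYMBOL_PREFIX is "_", C
symbols are only visible from asm with that leading prefix, so the
unprefixed names failed to link.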
Signed-off-by: James Hogan <james.hogan@imgtec.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: David Howells <dhowells@redhat.com>
Cc: Mike Frysinger <vapier@gentoo.org>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
|
|
8376fe22c7 ("workqueue: implement mod_delayed_work[_on]()")
implemented mod_delayed_work[_on]() using the improved
try_to_grab_pending(). The function is later used, among others, to
replace [__]cancel_delayed_work() + queue_delayed_work() combinations.
Unfortunately, a delayed_work item w/ zero @delay is handled slightly
differently by mod_delayed_work_on() compared to
queue_delayed_work_on(). The latter skips the timer altogether and
directly queues the item using queue_work_on(), while the former
schedules a timer which will expire on the closest tick. This means
that, when @delay is zero, [__]cancel_delayed_work() +
queue_delayed_work_on() makes the target item immediately executable,
while mod_delayed_work_on() may induce a delay of up to a full tick.
This somewhat subtle difference breaks some of the converted users;
e.g. block queue plugging uses delayed_work for deferred processing
and uses mod_delayed_work_on() when the queue needs to be unplugged
immediately. The above problem manifested as a noticeably higher
number of context switches under certain circumstances.
The difference in behavior was caused by missing special case handling
for 0 delay in mod_delayed_work_on() compared to
queue_delayed_work_on(). Joonsoo Kim posted a patch to add it -
("workqueue: optimize mod_delayed_work_on() when @delay == 0")[1].
The patch was queued for 3.8 but it was described as optimization and
I missed that it was a correctness issue.
As both queue_delayed_work_on() and mod_delayed_work_on() use
__queue_delayed_work() for queueing, it seems that the better approach
is to move the 0 delay special handling to the function instead of
duplicating it in mod_delayed_work_on().
Fix the problem by moving 0 delay special case handling from
queue_delayed_work_on() to __queue_delayed_work(). This replaces
Joonsoo's patch.
[1] http://thread.gmane.org/gmane.linux.kernel/1379011/focus=1379012
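In effect the short-circuit moves into the shared helper, along these
lines (a sketch):

static void __queue_delayed_work(int cpu, struct workqueue_struct *wq,
                                 struct delayed_work *dwork,
                                 unsigned long delay)
{
        /*
         * Zero delay means "now": bypass the timer entirely so that
         * queue_delayed_work_on() and mod_delayed_work_on() agree.
         */
        if (!delay) {
                __queue_work(cpu, wq, &dwork->work);
                return;
        }

        /* ... arm dwork->timer as before ... */
}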
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: Anders Kaseorg <andersk@MIT.EDU>
Reported-and-tested-by: Zlatko Calusic <zlatko.calusic@iskon.hr>
LKML-Reference: <alpine.DEB.2.00.1211280953350.26602@dr-wily.mit.edu>
LKML-Reference: <50A78AA9.5040904@iskon.hr>
Cc: Joonsoo Kim <js1304@gmail.com>
|
|
A rescue thread exiting TASK_INTERRUPTIBLE can lead to a task scheduling
off, never to be seen again. In the case where this occurred, an exiting
thread hit reiserfs' homebrew conditional resched while holding a mutex,
bringing the box to its knees.
PID: 18105 TASK: ffff8807fd412180 CPU: 5 COMMAND: "kdmflush"
#0 [ffff8808157e7670] schedule at ffffffff8143f489
#1 [ffff8808157e77b8] reiserfs_get_block at ffffffffa038ab2d [reiserfs]
#2 [ffff8808157e79a8] __block_write_begin at ffffffff8117fb14
#3 [ffff8808157e7a98] reiserfs_write_begin at ffffffffa0388695 [reiserfs]
#4 [ffff8808157e7ad8] generic_perform_write at ffffffff810ee9e2
#5 [ffff8808157e7b58] generic_file_buffered_write at ffffffff810eeb41
#6 [ffff8808157e7ba8] __generic_file_aio_write at ffffffff810f1a3a
#7 [ffff8808157e7c58] generic_file_aio_write at ffffffff810f1c88
#8 [ffff8808157e7cc8] do_sync_write at ffffffff8114f850
#9 [ffff8808157e7dd8] do_acct_process at ffffffff810a268f
[exception RIP: kernel_thread_helper]
RIP: ffffffff8144a5c0 RSP: ffff8808157e7f58 RFLAGS: 00000202
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffffff8107af60 RDI: ffff8803ee491d18
RBP: 0000000000000000 R8: 0000000000000000 R9: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
Signed-off-by: Mike Galbraith <mgalbraith@suse.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
"This is mostly about unbreaking architectures that took the UAPI
changes in the v3.7 cycle, plus misc fixes."
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf kvm: Fix building perf kvm on non x86 arches
perf kvm: Rename perf_kvm to perf_kvm_stat
perf: Make perf build for x86 with UAPI disintegration applied
perf powerpc: Use uapi/unistd.h to fix build error
tools: Pass the target in descend
tools: Honour the O= flag when tool build called from a higher Makefile
tools: Define a Makefile function to do subdir processing
x86: Export asm/{svm.h,vmx.h,perf_regs.h}
perf tools: Fix strbuf_addf() when the buffer needs to grow
perf header: Fix numa topology printing
perf, powerpc: Fix hw breakpoints returning -ENOSPC
|
|
The reason for the scaling and monotonicity correction performed
by cputime_adjust() may not be immediately clear to the reviewer.
Add some comments to explain what happens there.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
|
|
task_cputime_adjusted() and thread_group_cputime_adjusted()
essentially share the same code. They just don't use the same
source:
* The first function uses the cputime in the task struct and the
previous adjusted snapshot that ensures monotonicity.
* The second adds the cputime of all tasks in the group and the
previous adjusted snapshot of the whole group from the signal
structure.
Just consolidate the common code that does the adjustment. These
functions just need to fetch the values from the appropriate
source.
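The resulting shape, as a sketch (assuming the common helper is called
cputime_adjust() and takes the snapshot to adjust plus the previous
values):

void task_cputime_adjusted(struct task_struct *p,
                           cputime_t *ut, cputime_t *st)
{
        struct task_cputime cputime = {
                .utime = p->utime,
                .stime = p->stime,
                .sum_exec_runtime = p->se.sum_exec_runtime,
        };

        /* Scaling and monotonicity handling live in one place now. */
        cputime_adjust(&cputime, &p->prev_cputime, ut, st);
}

void thread_group_cputime_adjusted(struct task_struct *p,
                                   cputime_t *ut, cputime_t *st)
{
        struct task_cputime cputime;

        thread_group_cputime(p, &cputime);
        cputime_adjust(&cputime, &p->signal->prev_cputime, ut, st);
}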
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
|
|
We have thread_group_cputime() and thread_group_times(). The naming
doesn't provide enough information about the difference between
these two APIs.
To lower the confusion, rename thread_group_times() to
thread_group_cputime_adjusted(). This name better suggests that
it's a version of thread_group_cputime() that applies some stabilization
to the raw cputime values, i.e. here: scaling on top of the CFS runtime
stats and bounding the values from below for monotonicity.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
|
|
thread_group_cputime() is a general cputime API that is not only
used by the posix CPU timers. Let's move this helper to the sched code.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
|
|
Dave Jones reported a bug with futex_lock_pi() that his trinity test
exposed. Sometime between queue_me() and taking the q.lock_ptr, the
lock_ptr became NULL, resulting in a crash.
While futex_wake() is careful to not call wake_futex() on futex_q's with
a pi_state or an rt_waiter (which are either waiting for a
futex_unlock_pi() or a PI futex_requeue()), futex_wake_op() and
futex_requeue() do not perform the same test.
Update futex_wake_op() and futex_requeue() to test for q.pi_state and
q.rt_waiter and abort with -EINVAL if detected. To ensure any future
breakage is caught, add a WARN() to wake_futex() if the same condition
is true.
This fix has seen 3 hours of testing with "trinity -c futex" on an
x86_64 VM with 4 CPUs.
[akpm@linux-foundation.org: tidy up the WARN()]
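The guards amount to roughly the following (a sketch; the WARN() wording
is illustrative):

static void wake_futex(struct futex_q *q)
{
        /* PI waiters must never be woken through this path. */
        if (WARN(q->pi_state || q->rt_waiter,
                 "refusing to wake PI futex\n"))
                return;

        /* ... normal wakeup ... */
}

with the matching bail-out in futex_wake_op() and futex_requeue():

        if (this->pi_state || this->rt_waiter) {
                ret = -EINVAL;
                goto out_unlock;
        }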
Signed-off-by: Darren Hart <dvhart@linux.intel.com>
Reported-by: Dave Jones <davej@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: John Kacur <jkacur@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
In get_sample_period(), unsigned long is not enough to hold:
watchdog_thresh * 2 * (NSEC_PER_SEC / 5)
Case 1: watchdog_thresh is 10 by default; the sample value will be 0xEE6B2800.
Case 2: with watchdog_thresh set to 20, the sample value will be 0x1DCD65000.
In case 2 we need a u64 to express the sample period. Otherwise,
changing the threshold through /proc often cannot succeed.
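A sketch of the fixed helper, assuming the formula quoted above:

static u64 get_sample_period(void)
{
        /*
         * The sample period is 2/5 of the softlockup threshold,
         * converted to ns. The u64 arithmetic keeps thresholds of
         * 20s and above from overflowing a 32-bit unsigned long.
         */
        return (u64)watchdog_thresh * 2 * (NSEC_PER_SEC / 5);
}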
Signed-off-by: liu chuansheng <chuansheng.liu@intel.com>
Acked-by: Don Zickus <dzickus@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
vtime_account() is only called from irq entry. irqs
are always disabled at this point so we can safely
remove the irq disabling guards on that function.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
|
|
On ia64 and powerpc, the vtime context switch only consists of flushing
pending system and user time, plus a little arch housekeeping.
Consolidate that into a generic implementation. s390 is
a special case, because pending user and system time accounting
there is hard to dissociate, so it keeps its own implementation.
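The generic version looks roughly like this (a sketch; arches keep an
arch_vtime_task_switch() hook for their remaining housekeeping):

#ifndef __ARCH_HAS_VTIME_TASK_SWITCH
void vtime_task_switch(struct task_struct *prev)
{
        /* Flush whatever the outgoing task accumulated. */
        if (is_idle_task(prev))
                vtime_account_idle(prev);
        else
                vtime_account_system(prev);

        vtime_account_user(prev);
        arch_vtime_task_switch(prev);
}
#endif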
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
|
|
Prepending the irq-unsafe vtime APIs with underscores was actually
a bad idea: the result is a big mess in the API namespace, and it
is only waiting to be extended further. Also, these helpers
are always called from irq-safe callers, except kvm. Just
provide a vtime_account_system_irqsafe() for that specific
case, so that we can remove the underscore prefix from the other
vtime functions.
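The wrapper is just the usual irq-save dance (a sketch):

void vtime_account_system_irqsafe(struct task_struct *tsk)
{
        unsigned long flags;

        local_irq_save(flags);
        vtime_account_system(tsk);
        local_irq_restore(flags);
}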
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
|
|
Merge in fixes before we queue up dependent bits, to avoid conflicts.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Merge Linux 3.7-rc5, to pick up fixes.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull futex fix from Thomas Gleixner:
"Single fix for a long standing futex race when taking over a futex
whose owner died. You can end up with two owners, which violates
quite some rules."
* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
futex: Handle futex_pi OWNER_DIED take over correctly
|
|
Siddhesh analyzed a failure in the take over of pi futexes in case the
owner died and provided a workaround.
See: http://sourceware.org/bugzilla/show_bug.cgi?id=14076
The detailed problem analysis shows:
Futex F is initialized with PTHREAD_PRIO_INHERIT and
PTHREAD_MUTEX_ROBUST_NP attributes.
T1 lock_futex_pi(F);
T2 lock_futex_pi(F);
--> T2 blocks on the futex and creates pi_state which is associated
to T1.
T1 exits
--> exit_robust_list() runs
--> Futex F userspace value TID field is set to 0 and
FUTEX_OWNER_DIED bit is set.
T3 lock_futex_pi(F);
--> Succeeds due to the check for F's userspace TID field == 0
--> Claims ownership of the futex and sets its own TID into the
userspace TID field of futex F
--> returns to user space
T1 --> exit_pi_state_list()
--> Transfers pi_state to waiter T2 and wakes T2 via
rt_mutex_unlock(&pi_state->mutex)
T2 --> acquires pi_state->mutex and gains real ownership of the
pi_state
--> Claims ownership of the futex and sets its own TID into the
userspace TID field of futex F
--> returns to user space
T3 --> observes inconsistent state
This problem is independent of UP/SMP, preemptible/non preemptible
kernels, or process shared vs. private. The only difference is that
certain configurations are more likely to expose it.
So, as Siddhesh correctly analyzed, the following check in
futex_lock_pi_atomic() is the culprit:
if (unlikely(ownerdied || !(curval & FUTEX_TID_MASK))) {
We check the userspace value for a TID value of 0 and take over the
futex unconditionally if that's true.
AFAICT this check is there as it is correct for a different corner
case of futexes: the WAITERS bit became stale.
Now the proposed change
- if (unlikely(ownerdied || !(curval & FUTEX_TID_MASK))) {
+ if (unlikely(ownerdied ||
+ !(curval & (FUTEX_TID_MASK | FUTEX_WAITERS)))) {
solves the problem, but it's not obvious why, and it wrecks the
"stale WAITERS bit" case.
What happens is that, due to the WAITERS bit being set (T2 is blocked
on that futex), T3 is forced to go through lookup_pi_state(), which
in the above case returns an existing pi_state and therefore forces T3
to legitimately fight with T2 over ownership of the pi_state (via
pi_state->mutex). Problem solved!
Though that does not work for the "WAITERS bit is stale" problem,
because if lookup_pi_state() does not find an existing pi_state it
returns -ESRCH (due to TID == 0), which causes futex_lock_pi() to
return -ESRCH to user space because the OWNER_DIED bit is not set.
Now there is a different solution to that problem. Do not look at the
user space value at all and enforce a lookup of possibly available
pi_state. If pi_state can be found, then the new incoming locker T3
blocks on that pi_state and legitimately races with T2 to acquire the
rt_mutex and the pi_state, and therefore the proper ownership of the
user space futex.
lookup_pi_state() has the correct order of checks. It first tries to
find a pi_state associated with the user space futex and only if that
fails it checks for futex TID value = 0. If no pi_state is available
nothing can create new state at that point because this happens with
the hash bucket lock held.
So the above scenario changes to:
T1 lock_futex_pi(F);
T2 lock_futex_pi(F);
--> T2 blocks on the futex and creates pi_state which is associated
to T1.
T1 exits
--> exit_robust_list() runs
--> Futex F userspace value TID field is set to 0 and
FUTEX_OWNER_DIED bit is set.
T3 lock_futex_pi(F);
--> Finds pi_state and blocks on pi_state->rt_mutex
T1 --> exit_pi_state_list()
--> Transfers pi_state to waiter T2 and wakes it via
rt_mutex_unlock(&pi_state->mutex)
T2 --> acquires pi_state->mutex and gains ownership of the pi_state
--> Claims ownership of the futex and sets its own TID into the
userspace TID field of futex F
--> returns to user space
This covers all gazillion points on which T3 might come in between
T1's exit_robust_list() clearing the TID field and T2 fixing it up. It
also solves the "WAITERS bit stale" problem by forcing the take over.
Another benefit of changing the code this way is that it makes it less
dependent on untrusted user space values and therefore minimizes the
possible wreckage which might be inflicted.
As usual after staring for too long at the futex code my brain hurts
so much that I really want to ditch that whole optimization of
avoiding the syscall for the non contended case for PI futexes and rip
out the maze of corner case handling code. Unfortunately we can't as
user space relies on that existing behaviour, but at least thinking
about it helps me to preserve my mental sanity. Maybe we should
nevertheless :)
Reported-and-tested-by: Siddhesh Poyarekar <siddhesh.poyarekar@gmail.com>
Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1210232138540.2756@ionos
Acked-by: Darren Hart <dvhart@linux.intel.com>
Cc: stable@vger.kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Masaki found and patched a kallsyms issue: the last symbol in a
module's symtab wasn't transferred. This is because we manually copy
the zeroth entry (which is always empty), then copy the rest in a loop
starting at 1, though reading from src[0]. His fix was minimal; I prefer
to rewrite the loops in more standard form.
There are two loops: one to get the size, and one to copy. Make these
identical: always count entry 0 and any defined symbol in an allocated
non-init section.
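In standard form both loops apply one predicate, so the count and the
copy can no longer drift apart. A sketch (is_core_symbol() is the
kernel's existing "defined symbol in an allocated non-init section"
test; string-table fixups elided):

/* Count entry 0 plus every symbol we intend to keep... */
for (ndst = i = 0; i < nsrc; i++)
        if (i == 0 ||
            is_core_symbol(src + i, info->sechdrs, info->hdr->e_shnum))
                ndst++;

/* ...then copy with the identical test. */
for (ndst = i = 0; i < nsrc; i++) {
        if (i == 0 ||
            is_core_symbol(src + i, info->sechdrs, info->hdr->e_shnum))
                dst[ndst++] = src[i];
}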
This bug has existed since the following commit was introduced:
module: reduce symbol table for loaded modules (v2)
commit: 4a4962263f07d14660849ec134ee42b63e95ea9a
LKML: http://lkml.org/lkml/2012/10/24/27
Reported-by: Masaki Kimura <masaki.kimura.kz@hitachi.com>
Cc: stable@kernel.org
|
|
Due to these two commits:
8323f26ce342 sched: Fix race in task_group()
800d4d30c8f2 sched, autogroup: Stop going ahead if autogroup is disabled
... autogroup scheduling's dynamic knobs are wrecked.
With both patches applied, all you have to do to crash a box is
disable autogroup during boot up, then reboot.. boom, NULL pointer
dereference due to 800d4d30 not allowing autogroup to move things,
and 8323f26ce making that the only way to switch runqueues.
Remove most of the (dysfunctional) knobs and turn the remaining
sched_autogroup_enabled knob readonly.
If the user fiddles with cgroups hereafter, once tasks
are moved, autogroup won't mess with them again unless
they call setsid().
No knobs, no glitz, nada, just a cute little thing folks can
turn on if they don't want to muck about with cgroups and/or
systemd.
Signed-off-by: Mike Galbraith <efault@gmx.de>
Cc: Xiaotian Feng <xtfeng@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Xiaotian Feng <dannyfeng@tencent.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: <stable@vger.kernel.org> # v3.6
Link: http://lkml.kernel.org/r/1351451963.4999.8.camel@maggy.simpson.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
I've been trying to get hardware breakpoints with perf to work
on POWER7 but I'm getting the following:
% perf record -e mem:0x10000000 true
Error: sys_perf_event_open() syscall returned with 28 (No space left on device). /bin/dmesg may provide additional information.
Fatal: No CONFIG_PERF_EVENTS=y kernel support configured?
true: Terminated
(FWIW, adding -a makes it work fine)
Debugging it seems that __reserve_bp_slot() is returning ENOSPC
because it thinks there are no free breakpoint slots on this
CPU.
I have 2 CPUs, so perf userspace is doing two perf_event_open
syscalls to add a counter to each CPU [1]. The first syscall
succeeds but the second fails.
On this second syscall, fetch_bp_busy_slots() sets slots.pinned
to 1, despite there being no breakpoint on this CPU. This is
because the call to task_bp_pinned() checks all CPUs, rather
than just the current CPU. POWER7 only has one hardware
breakpoint per CPU (ie. HBP_NUM=1), so we return ENOSPC.
The following patch fixes this by checking the associated CPU
for each breakpoint in task_bp_pinned. I'm not familiar with
this code, so it's provided as a reference to the above issue.
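For reference, the per-CPU filter amounts to one extra condition in the
walk (a sketch of the patched helper):

static int task_bp_pinned(int cpu, struct perf_event *bp,
                          enum bp_type_idx type)
{
        struct task_struct *tsk = bp->hw.bp_target;
        struct perf_event *iter;
        int count = 0;

        list_for_each_entry(iter, &bp_task_head, hw.bp_list) {
                if (iter->hw.bp_target == tsk &&
                    find_slot_idx(iter) == type &&
                    cpu == iter->cpu)   /* count this CPU only */
                        count += hw_breakpoint_weight(iter);
        }
        return count;
}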
Signed-off-by: Michael Neuling <mikey@neuling.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Michael Ellerman <michael@ellerman.id.au>
Cc: Jovi Zhang <bookjovi@gmail.com>
Cc: K Prasad <prasad@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/1351268936-2956-1-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
vtime_account() doesn't have the same role in
CONFIG_VIRT_CPU_ACCOUNTING and CONFIG_IRQ_TIME_ACCOUNTING.
In the first case it handles time accounting in any context. In
the second case it only handles irq time accounting.
So when vtime_account() is called from outside vtime_account_irq_*(),
the call is pointless under CONFIG_IRQ_TIME_ACCOUNTING.
To fix the confusion, change vtime_account() to irqtime_account_irq()
in CONFIG_IRQ_TIME_ACCOUNTING. This way we ensure future vtime_account()
callers won't waste cycles in the irqtime APIs.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
|
|
With CONFIG_VIRT_CPU_ACCOUNTING, when vtime_account()
is called in irq entry/exit, we perform a check on the
context: if we are interrupting the idle task we
account the pending cputime to idle, otherwise account
to system time or its sub-areas: tsk->stime, hardirq time,
softirq time, ...
However this check for idle only concerns the hardirq entry
and softirq entry:
* Hardirq may directly interrupt the idle task, in which case
we need to flush the pending CPU time to idle.
* The idle task may be directly interrupted by a softirq if
it calls local_bh_enable(). There is probably no such call
in any idle task, but we need to cover every case. Ksoftirqd
is not concerned, because the idle time is flushed on context
switch, and softirqs running at the end of a hardirq already have
the idle time flushed by the hardirq entry.
In the other cases we always account to system/irq time:
* On hardirq exit we account the time to hardirq time.
* On softirq exit we account the time to softirq time.
To optimize this and avoid the indirect call to vtime_account()
and the checks it performs, specialize the vtime irq APIs and
only perform the check on irq entry. Irq exit can directly call
vtime_account_system().
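The specialized pair ends up looking roughly like this (a sketch; the
names and the wiring through irq entry/exit are approximate):

static inline void vtime_account_irq_enter(struct task_struct *tsk)
{
        /* May be interrupting idle: needs the context check. */
        vtime_account(tsk);
}

static inline void vtime_account_irq_exit(struct task_struct *tsk)
{
        /* We were in irq context: account straight to system time. */
        vtime_account_system(tsk);
}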
CONFIG_IRQ_TIME_ACCOUNTING behaviour doesn't change; it directly
maps to its own vtime_account() implementation. One may want
to take advantage of the new APIs to optimize irq time accounting
as well in the future.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
|
|
vtime_account_system() currently has only one caller, vtime_account(),
which is irq-safe.
Now we are going to call it from other places like kvm where
irqs are not always disabled by the time we account the cputime.
So let's make it irqsafe. The arch implementation part is now
prefixed with "__".
The vtime_account_idle() arch implementation is prefixed accordingly,
to stay consistent.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
|
|
Merge misc fixes from Andrew Morton:
"18 total. 15 fixes and some updates to a device_cgroup patchset which
bring it up to date with the version which I should have merged in the
first place."
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (18 patches)
fs/compat_ioctl.c: VIDEO_SET_SPU_PALETTE missing error check
gen_init_cpio: avoid stack overflow when expanding
drivers/rtc/rtc-imxdi.c: add missing spin lock initialization
mm, numa: avoid setting zone_reclaim_mode unless a node is sufficiently distant
pidns: limit the nesting depth of pid namespaces
drivers/dma/dw_dmac: make driver's endianness configurable
mm/mmu_notifier: allocate mmu_notifier in advance
tools/testing/selftests/epoll/test_epoll.c: fix build
UAPI: fix tools/vm/page-types.c
mm/page_alloc.c:alloc_contig_range(): return early for err path
rbtree: include linux/compiler.h for definition of __always_inline
genalloc: stop crashing the system when destroying a pool
backlight: ili9320: add missing SPI dependency
device_cgroup: add proper checking when changing default behavior
device_cgroup: stop using simple_strtoul()
device_cgroup: rename deny_all to behavior
cgroup: fix invalid rcu dereference
mm: fix XFS oops due to dirty pages without buffers on s390
|
|
If one includes documentation for an external tool, it should be
correct. This is not:
1. Overriding the input to rngd should typically be neither
necessary nor desired. This is especially so since newer
versions of rngd support a number of different *types* of sources.
2. The default kernel-exported device is called /dev/hwrng not
/dev/hwrandom nor /dev/hw_random (both of which were used in the
past; however, kernel and udev seem to have converged on
/dev/hwrng.)
Overall it is better if the documentation for rngd is kept with rngd
rather than in a kernel Makefile.
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Jeff Garzik <jgarzik@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
'struct pid' is a "variable sized struct" - a header with an array of
upids at the end.
The size of the array depends on the level (depth) of pid namespaces. The
nesting level of a pidns is currently not limited, so 'struct pid' can be
more than one page.
It looks reasonable that it should be less than a page. MAX_PID_NS_LEVEL is
not calculated from PAGE_SIZE, because in that case it would depend on
architectures and config options, and it would have to be reduced if someone
adds new fields to struct pid or struct upid.
I suggest setting MAX_PID_NS_LEVEL = 32, because that preserves the ability
to expand "struct pid" and it's more than enough for all use cases known to
me.
When someone finds a reasonable use case, we can add a config option or a
sysctl parameter.
In addition it will reduce the effect of another problem, when we have
many nested namespaces and the oldest one starts dying.
zap_pid_ns_processes() will be called for each namespace, and find_vpid()
will be called for each process in a namespace. find_vpid() will be called
a minimum of max_level^2 / 2 times. The reason is that when we find a
bit in pidmap, we can't determine whether this pidns is the topmost one
for this process or not.
find_vpid() is a heavy operation, so a fork bomb which creates many nested
namespaces can make a system inaccessible for a long time. For example, my
system becomes inaccessible for a few minutes with 4000 processes.
[akpm@linux-foundation.org: return -EINVAL in response to excessive nesting, not -ENOMEM]
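The limit itself is a couple of lines in create_pid_namespace() (a
sketch):

#define MAX_PID_NS_LEVEL 32

static struct pid_namespace *create_pid_namespace(
                struct pid_namespace *parent_pid_ns)
{
        unsigned int level = parent_pid_ns->level + 1;
        int err = -EINVAL;

        /*
         * Refuse absurd nesting rather than allocating an ever-growing,
         * multi-page 'struct pid' for every process.
         */
        if (level > MAX_PID_NS_LEVEL)
                goto out;

        /* ... regular setup ... */
out:
        return ERR_PTR(err);
}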
Signed-off-by: Andrew Vagin <avagin@openvz.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fixes from Tejun Heo:
"This pull request contains three fixes.
Two are reverts of task_lock() removal in cgroup fork path. The
optimizations incorrectly assumed that threadgroup_lock can protect
process forks (as opposed to thread creations) too. Further cleanup
of cgroup fork path is scheduled.
The third fixes cgroup emptiness notification loss."
* 'for-3.7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
Revert "cgroup: Remove task_lock() from cgroup_post_fork()"
Revert "cgroup: Drop task_lock(parent) on cgroup_fork()"
cgroup: notify_on_release may not be triggered in some cases
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue fix from Tejun Heo:
"This pull request contains one patch from Dan Magenheimer to fix
cancel_delayed_work() regression introduced by its reimplementation
using try_to_grab_pending(). The reimplementation made it incorrectly
return %true when the work item is idle.
There aren't too many consumers of the return value but it broke at
least ramster."
* 'for-3.7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: cancel_delayed_work() should return %false if work item is idle
|
|
57b30ae77b ("workqueue: reimplement cancel_delayed_work() using
try_to_grab_pending()") made cancel_delayed_work() always return %true
unless someone else is also trying to cancel the work item, which is
broken - if the target work item is idle, the return value should be
%false.
try_to_grab_pending() indicates that the target work item was idle with a
zero return value. Use that for the return value. Note that this brings
cancel_delayed_work() in line with __cancel_work_timer() in return
value handling.
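With that, the function reads along these lines (a sketch of the fixed
flow; the pending-clearing details are elided):

bool cancel_delayed_work(struct delayed_work *dwork)
{
        unsigned long flags;
        int ret;

        do {
                ret = try_to_grab_pending(&dwork->work, true, &flags);
        } while (unlikely(ret == -EAGAIN));

        if (unlikely(ret < 0))
                return false;

        /* ... clear the pending state ... */
        local_irq_restore(flags);

        /* 1 if a pending item was grabbed, 0 if it was idle. */
        return ret;
}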
Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
LKML-Reference: <444a6439-b1a4-4740-9e7e-bc37267cfe73@default>
|
|
Add some scribbles on how and why the load-balancer works..
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1341316406.23484.64.camel@twins
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
While per-entity load-tracking is generally useful beyond computing shares
distribution (e.g. for runnable-based load-balance (in progress), governors,
power-management, etc.), these facilities are not yet consumers of this data.
This may be trivially reverted when the information is required; until then,
avoid paying the overhead for calculations we will not use.
Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Ben Segall <bsegall@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20120823141507.422162369@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
__update_entity_runnable_avg forms the core of maintaining an entity's runnable
load average. In this function we charge the accumulated run-time since last
update and handle appropriate decay. In some cases, e.g. a waking task, this
time interval may be much larger than our period unit.
Fortunately we can exploit some properties of our series to perform decay for a
blocked update in constant time and account the contribution for a running
update in essentially-constant* time.
[*]: For any running entity they should be performing updates at the tick which
gives us a soft limit of 1 jiffy between updates, and we can compute up to a
32 jiffy update in a single pass.
C program to generate the magic constants in the arrays:
#include <math.h>
#include <stdio.h>

#define N 32
#define WMULT_SHIFT 32

const long WMULT_CONST = ((1UL << N) - 1); /* assumes 64-bit long */
double y;

long runnable_avg_yN_inv[N];

void calc_mult_inv() {
        int i;
        double yn = 0;

        printf("inverses\n");
        for (i = 0; i < N; i++) {
                yn = (double)WMULT_CONST * pow(y, i);
                runnable_avg_yN_inv[i] = yn;
                printf("%2d: 0x%8lx\n", i, runnable_avg_yN_inv[i]);
        }
        printf("\n");
}

long mult_inv(long c, int n) {
        return (c * runnable_avg_yN_inv[n]) >> WMULT_SHIFT;
}

void calc_yn_sum(int n)
{
        int i;
        double sum = 0, sum_fl = 0;

        /*
         * We take the floored sum to ensure the sum of partial sums is never
         * larger than the actual sum.
         */
        printf("sum y^n\n");
        printf(" %8s %8s %8s\n", "exact", "floor", "error");
        for (i = 1; i <= n; i++) {
                sum = (y * sum + y * 1024);
                sum_fl = floor(y * sum_fl + y * 1024);
                printf("%2d: %8.0f %8.0f %8.0f\n", i, sum, sum_fl,
                       sum_fl - sum);
        }
        printf("\n");
}

void calc_conv(long n) {
        long old_n;
        int i = -1;

        printf("convergence (LOAD_AVG_MAX, LOAD_AVG_MAX_N)\n");
        do {
                old_n = n;
                n = mult_inv(n, 1) + 1024;
                i++;
        } while (n != old_n);
        printf("%d> %ld\n", i - 1, n);
        printf("\n");
}

int main(void) {
        y = pow(0.5, 1/(double)N);
        calc_mult_inv();
        calc_conv(1024);
        calc_yn_sum(N);
        return 0;
}
[ Compile with -lm ]
Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Ben Segall <bsegall@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20120823141507.277808946@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Now that our measurement intervals are small (~1ms) we can amortize the
posting of update_shares() to roughly once per period overflow. This is
a large cost saving for frequently switching tasks.
Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Ben Segall <bsegall@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20120823141507.200772172@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Now that running entities maintain their own load-averages the work we must do
in update_shares() is largely restricted to the periodic decay of blocked
entities. This allows us to be a little less pessimistic regarding our
occupancy on rq->lock and the associated rq->clock updates required.
Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Ben Segall <bsegall@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20120823141507.133999170@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Now that the machinery is in place to compute contributed load in a
bottom-up fashion, replace the shares distribution code within
update_shares() accordingly.
Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Ben Segall <bsegall@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20120823141507.061208672@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
With bandwidth control, tracked entities may cease execution according to
user-specified bandwidth limits. Charging this time as either throttled or
blocked, however, is incorrect and would falsely skew in either direction.
What we actually want is for any throttled periods to be "invisible" to
load-tracking as they are removed from the system for that interval and
contribute normally otherwise.
Do this by moderating the progression of time to omit any periods in which the
entity belonged to a throttled hierarchy.
Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Ben Segall <bsegall@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20120823141506.998912151@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Entities of equal weight should receive equitable distribution of cpu time.
This is challenging in the case of a task_group's shares as execution may be
occurring on multiple cpus simultaneously.
To handle this we divide up the shares into weights proportionate with the load
on each cfs_rq. This does not, however, account for the fact that the sum
of the parts may be less than one cpu, so we need to normalize:
load(tg) = min(runnable_avg(tg), 1) * tg->shares
Where runnable_avg is the aggregate time in which the task_group had runnable
children.
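Read as fixed-point code, with NICE_0_LOAD (1024) standing in for "one
cpu", the formula is (a hypothetical helper, for illustration only):

static unsigned long tg_effective_load(unsigned long runnable_avg,
                                       unsigned long shares)
{
        /* min(runnable_avg(tg), 1): clamp at one fully-runnable cpu. */
        if (runnable_avg > NICE_0_LOAD)
                runnable_avg = NICE_0_LOAD;

        return shares * runnable_avg / NICE_0_LOAD;
}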
Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Ben Segall <bsegall@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20120823141506.930124292@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Unlike task entities, which have a fixed weight, group entities instead own
a fraction of their parenting task_group's shares as their contributed
weight.
Compute this fraction so that we can correctly account hierarchies and shared
entity nodes.
Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Ben Segall <bsegall@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20120823141506.855074415@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|