summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2009-06-18ptrace: do_notify_parent_cldstop: fix the wrong ->nsproxy usageOleg Nesterov
If the non-traced sub-thread calls do_notify_parent_cldstop(), we send the notification to group_leader->real_parent and we report group_leader's pid. But, if group_leader is traced we use the wrong ->parent->nsproxy->pid_ns, the tracer and parent can live in different namespaces. Change the code to use "parent" instead of tsk->parent. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Roland McGrath <roland@redhat.com> Acked-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-18ptrace: wait_task_zombie: s/->parent/->real_parent/Oleg Nesterov
Change wait_task_zombie() to use ->real_parent instead of ->parent. We could even use current afaics, but ->real_parent is more clean. We know that the child is not ptrace_reparented() and thus they are equal. But we should avoid using task_struct->parent, we are going to remove it. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-18ptrace_get_task_struct: s/tasklist/rcu/, make it staticOleg Nesterov
- Use rcu_read_lock() instead of tasklist_lock to find/get the task in ptrace_get_task_struct(). - Make it static, it has no callers outside of ptrace.c. - The comment doesn't match the reality, this helper does not do any checks. Beacuse it is really trivial and static I removed the whole comment. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-18ptrace: do not use task_lock() for attachOleg Nesterov
Remove the "Nasty, nasty" lock dance in ptrace_attach()/ptrace_traceme() - from now task_lock() has nothing to do with ptrace at all. With the recent changes nobody uses task_lock() to serialize with ptrace, but in fact it was never needed and it was never used consistently. However ptrace_attach() calls __ptrace_may_access() and needs task_lock() to pin task->mm for get_dumpable(). But we can call __ptrace_may_access() before we take tasklist_lock, ->cred_exec_mutex protects us against do_execve() path which can change creds and MMF_DUMP* flags. (ugly, but we can't use ptrace_may_access() because it hides the error code, so we have to take task_lock() and use __ptrace_may_access()). NOTE: this change assumes that LSM hooks, security_ptrace_may_access() and security_ptrace_traceme(), can be called without task_lock() held. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Chris Wright <chrisw@sous-sol.org> Acked-by: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-18ptrace: cleanup check/set of PT_PTRACED during attachOleg Nesterov
ptrace_attach() and ptrace_traceme() are the last functions which look as if the untraced task can have task->ptrace != 0, this must not be possible. Change the code to just check ->ptrace != 0 and s/|=/=/ to set PT_PTRACED. Also, a couple of trivial whitespace cleanups in ptrace_attach(). And move ptrace_traceme() up near ptrace_attach() to keep them close to each other. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Chris Wright <chrisw@sous-sol.org> Acked-by: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-18ptrace: ptrace_attach: check PF_KTHREAD + exit_state instead of ->mmOleg Nesterov
- Add PF_KTHREAD check to prevent attaching to the kernel thread with a borrowed ->mm. With or without this change we can race with daemonize() which can set PF_KTHREAD or clear ->mm after ptrace_attach() does the check, but this doesn't matter because reparent_to_kthreadd() does ptrace_unlink(). - Kill "!task->mm" check. We don't really care about ->mm != NULL, and the task can call exit_mm() right after we drop task_lock(). What we need is to make sure we can't attach after exit_notify(), check task->exit_state != 0 instead. Also, move the "already traced" check down for cosmetic reasons. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Chris Wright <chrisw@sous-sol.org> Acked-by: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-18ptrace: do not use task->ptrace directly in core kernelOleg Nesterov
No functional changes. - Nobody except ptrace.c & co should use ptrace flags directly, we have task_ptrace() for that. - No need to specially check PT_PTRACED, we must not have other PT_ bits set without PT_PTRACED. And no need to know this flag exists. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-18ptrace: mm_need_new_owner: use ->real_parent to search in the siblingsOleg Nesterov
"Search in the siblings" should use ->real_parent, not ->parent. If the task is traced then ->parent == tracer, while the task's parent is always ->real_parent. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-18allow_signal: kill the bogus ->mm check, add a note about CLONE_SIGHANDOleg Nesterov
allow_signal() checks ->mm == NULL. Not sure why. Perhaps to make sure current is the kernel thread. But this helper must not be used unless we are the kernel thread, kill this check. Also, document the fact that the CLONE_SIGHAND kthread must not use allow_signal(), unless the caller really wants to change the parent's ->sighand->action as well. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-18memcg: add interface to reset limitsDaisuke Nishimura
We don't have an interface to reset mem.limit or memsw.limit now. This patch allows to reset mem.limit or memsw.limit when they are being set to -1. Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com> Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-18cgroups: forbid noprefix if mounting more than just cpuset subsystemLi Zefan
The 'noprefix' option was introduced for backwards-compatibility of cpuset, but actually it can be used when mounting other subsystems. This results in possibility of name collision, and now the collision can really happen, because we have 'stat' file in both memory and cpuacct subsystem: # mount -t cgroup -o noprefix,memory,cpuacct xxx /mnt Cgroup will happily mount the 2 subsystems, but only 'stat' file of memory subsys can be seen. We don't want users to use nopreifx, and also want to avoid name collision, so we change to allow noprefix only if mounting just the cpuset subsystem. [akpm@linux-foundation.org: fix shift for cpuset_subsys_id >= 32] Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Paul Menage <menage@google.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Acked-by: Dhaval Giani <dhaval@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-18softirq: introduce statistics for softirqKeika Kobayashi
Statistics for softirq doesn't exist. It will be helpful like statistics for interrupts. This patch introduces counting the number of softirq, which will be exported in /proc/softirqs. When softirq handler consumes much CPU time, /proc/stat is like the following. $ while :; do cat /proc/stat | head -n1 ; sleep 10 ; done cpu 88 0 408 739665 583 28 2 0 0 cpu 450 0 1090 740970 594 28 1294 0 0 ^^^^ softirq In such a situation, /proc/softirqs shows us which softirq handler is invoked. We can see the increase rate of softirqs. <before> $ cat /proc/softirqs CPU0 CPU1 CPU2 CPU3 HI 0 0 0 0 TIMER 462850 462805 462782 462718 NET_TX 0 0 0 365 NET_RX 2472 2 2 40 BLOCK 0 0 381 1164 TASKLET 0 0 0 224 SCHED 462654 462689 462698 462427 RCU 3046 2423 3367 3173 <after> $ cat /proc/softirqs CPU0 CPU1 CPU2 CPU3 HI 0 0 0 0 TIMER 463361 465077 465056 464991 NET_TX 53 0 1 365 NET_RX 3757 2 2 40 BLOCK 0 0 398 1170 TASKLET 0 0 0 224 SCHED 463074 464318 464612 463330 RCU 3505 2948 3947 3673 When CPU TIME of softirq is high, the rates of increase is the following. TIMER : 220/sec : CPU1-3 NET_TX : 5/sec : CPU0 NET_RX : 120/sec : CPU0 SCHED : 40-200/sec : all CPU RCU : 45-58/sec : all CPU The rates of increase in an idle mode is the following. TIMER : 250/sec SCHED : 250/sec RCU : 2/sec It seems many softirqs for receiving packets and rcu are invoked. This gives us help for checking system. Signed-off-by: Keika Kobayashi <kobayashi.kk@ncos.nec.co.jp> Reviewed-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Eric Dumazet <dada1@cosmosbay.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-18perf_counter: Add event overlow handlingPeter Zijlstra
Alternative method of mmap() data output handling that provides better overflow management and a more reliable data stream. Unlike the previous method, that didn't have any user->kernel feedback and relied on userspace keeping up, this method relies on userspace writing its last read position into the control page. It will ensure new output doesn't overwrite not-yet read events, new events for which there is no space left are lost and the overflow counter is incremented, providing exact event loss numbers. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-06-17ring-buffer: have benchmark test print to trace bufferSteven Rostedt
Currently the output of the ring buffer benchmark/test prints to the console. This test runs for ten seconds every ten seconds and ouputs the result after every iteration. This needlessly fills up the logs. This patch makes the ring buffer benchmark/test print to the ftrace buffer using trace_printk. To view the test results, you must examine the debug/tracing/trace file. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-06-17ring-buffer: do not grab locks in nmiSteven Rostedt
If ftrace_dump_on_oops is set, and an NMI detects a lockup, then it will need to read from the ring buffer. But the read side of the ring buffer still takes locks. This patch adds a check on the read side that if it is in an NMI, then it will disable the ring buffer and not take any locks. Reads can still happen on a disabled ring buffer. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-06-17ring-buffer: add locks around rb_per_cpu_emptySteven Rostedt
The checking of whether the buffer is empty or not needs to be serialized among the readers. Add the reader spin lock around it. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-06-17ring-buffer: check for less than two in size allocationSteven Rostedt
The ring buffer must have at least two pages allocated for the reader page swap to work. The page count check will miss the case of a zero size passed in. Even though a zero size ring buffer would probably fail an allocation, making the min size check for less than two instead of equal to one makes the code a bit more robust. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-06-17ring-buffer: remove useless compile check for buffer_page sizeSteven Rostedt
The original version of the ring buffer had a hack to map the page struct that held the pages of the buffer to also be the structure that the ring buffer would keep the pages in a link list. This overlap of the page struct was very dangerous and that hack was removed a while ago. But there was a check to make sure the buffer_page never became bigger than the page struct, and would fail the compile if it did. The check was only meaningful when we had the hack. Now that we have separate allocated descriptors for the buffer pages, we can remove this check. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-06-17Merge branch 'linux-next' of git://git.infradead.org/ubifs-2.6Linus Torvalds
* 'linux-next' of git://git.infradead.org/ubifs-2.6: UBIFS: start using hrtimers hrtimer: export ktime_add_safe UBIFS: do not forget to register BDI device UBIFS: allow sync option in rootflags UBIFS: remove dead code UBIFS: use anonymous device UBIFS: return proper error code if the compr is not present UBIFS: return error if link and unlink race UBIFS: reset no_space flag after inode deletion
2009-06-17sched: Fix out of scope variable access in sched_slice()Christian Engelmayer
Access to local variable lw is aliased by usage of pointer load. Access to pointer load in calc_delta_mine() happens when lw is already out of scope. [ Reported by static code analysis. ] Signed-off-by: Christian Engelmayer <christian.engelmayer@frequentis.com> LKML-Reference: <20090616103512.0c846e51@frequentis.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-06-17sched: Hide runqueues from direct refer at source code levelHitoshi Mitake
There are some points which refer the per-cpu value "runqueues" directly. sched.c provides nice abstraction, such as cpu_rq() and this_rq(), so we should use these macros when looking runqueues. Signed-off-by: Hitoshi Mitake <mitake@dcl.info.waseda.ac.jp> LKML-Reference: <20090617.222055.374768827975756908.mitake@dcl.info.waseda.ac.jp> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-06-17sched: Remove unneeded __ref tagLi Zefan
Those two functions no longer call alloc_bootmmem_cpumask_var(), so no need to tag them with __init_refok. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> LKML-Reference: <4A35DD5B.9050106@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-06-17Merge branch 'linus' into perfcounters/coreIngo Molnar
Conflicts: arch/x86/include/asm/kmap_types.h include/linux/mm.h include/asm-generic/kmap_types.h Merge reason: We crossed changes with kmap_types.h cleanups in mainline. Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-06-16Merge branch 'akpm'Linus Torvalds
* akpm: (182 commits) fbdev: bf54x-lq043fb: use kzalloc over kmalloc/memset fbdev: *bfin*: fix __dev{init,exit} markings fbdev: *bfin*: drop unnecessary calls to memset fbdev: bfin-t350mcqb-fb: drop unused local variables fbdev: blackfin has __raw I/O accessors, so use them in fb.h fbdev: s1d13xxxfb: add accelerated bitblt functions tcx: use standard fields for framebuffer physical address and length fbdev: add support for handoff from firmware to hw framebuffers intelfb: fix a bug when changing video timing fbdev: use framebuffer_release() for freeing fb_info structures radeon: P2G2CLK_ALWAYS_ONb tested twice, should 2nd be P2G2CLK_DAC_ALWAYS_ONb? s3c-fb: CPUFREQ frequency scaling support s3c-fb: fix resource releasing on error during probing carminefb: fix possible access beyond end of carmine_modedb[] acornfb: remove fb_mmap function mb862xxfb: use CONFIG_OF instead of CONFIG_PPC_OF mb862xxfb: restrict compliation of platform driver to PPC Samsung SoC Framebuffer driver: add Alpha Channel support atmel-lcdc: fix pixclock upper bound detection offb: use framebuffer_alloc() to allocate fb_info struct ... Manually fix up conflicts due to kmemcheck in mm/slab.c
2009-06-16slow-work: use round_jiffies() for thread pool's cull and OOM timersChris Peterson
Round the slow work queue's cull and OOM timeouts to whole second boundary with round_jiffies(). The slow work queue uses a pair of timers to cull idle threads and, after OOM, to delay new thread creation. This patch also extracts the mod_timer() logic for the cull timer into a separate helper function. By rounding non-time-critical timers such as these to whole seconds, they will be batched up to fire at the same time rather than being spread out. This allows the CPU wake up less, which saves power. Signed-off-by: Chris Peterson <cpeterso@cpeterso.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16groups: move code to kernel/groups.cAlexey Dobriyan
Move supplementary groups implementation to kernel/groups.c . kernel/sys.c already accumulated quite a few random stuff. Do strictly copy/paste + add required headers to compile. Compile-tested on many configs and archs. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16kernel/kfifo.c: replace conditional test with is_power_of_2()Robert P. J. Day
Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16mm: remove CONFIG_UNEVICTABLE_LRU config optionKOSAKI Motohiro
Currently, nobody wants to turn UNEVICTABLE_LRU off. Thus this configurability is unnecessary. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Andi Kleen <andi@firstfloor.org> Acked-by: Minchan Kim <minchan.kim@gmail.com> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Matt Mackall <mpm@selenic.com> Cc: Rik van Riel <riel@redhat.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16mm, PM/Freezer: Disable OOM killer when tasks are frozenRafael J. Wysocki
Currently, the following scenario appears to be possible in theory: * Tasks are frozen for hibernation or suspend. * Free pages are almost exhausted. * Certain piece of code in the suspend code path attempts to allocate some memory using GFP_KERNEL and allocation order less than or equal to PAGE_ALLOC_COSTLY_ORDER. * __alloc_pages_internal() cannot find a free page so it invokes the OOM killer. * The OOM killer attempts to kill a task, but the task is frozen, so it doesn't die immediately. * __alloc_pages_internal() jumps to 'restart', unsuccessfully tries to find a free page and invokes the OOM killer. * No progress can be made. Although it is now hard to trigger during hibernation due to the memory shrinking carried out by the hibernation code, it is theoretically possible to trigger during suspend after the memory shrinking has been removed from that code path. Moreover, since memory allocations are going to be used for the hibernation memory shrinking, it will be even more likely to happen during hibernation. To prevent it from happening, introduce the oom_killer_disabled switch that will cause __alloc_pages_internal() to fail in the situations in which the OOM killer would have been called and make the freezer set this switch after tasks have been successfully frozen. [akpm@linux-foundation.org: be nicer to the namespace] Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Cc: Fengguang Wu <fengguang.wu@gmail.com> Cc: David Rientjes <rientjes@google.com> Acked-by: Pavel Machek <pavel@ucw.cz> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16page allocator: do not check NUMA node ID when the caller knows the node is ↵Mel Gorman
valid Callers of alloc_pages_node() can optionally specify -1 as a node to mean "allocate from the current node". However, a number of the callers in fast paths know for a fact their node is valid. To avoid a comparison and branch, this patch adds alloc_pages_exact_node() that only checks the nid with VM_BUG_ON(). Callers that know their node is valid are then converted. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Christoph Lameter <cl@linux-foundation.org> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi> Acked-by: Paul Mundt <lethal@linux-sh.org> [for the SLOB NUMA bits] Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16cpuset,mm: update tasks' mems_allowed in timeMiao Xie
Fix allocating page cache/slab object on the unallowed node when memory spread is set by updating tasks' mems_allowed after its cpuset's mems is changed. In order to update tasks' mems_allowed in time, we must modify the code of memory policy. Because the memory policy is applied in the process's context originally. After applying this patch, one task directly manipulates anothers mems_allowed, and we use alloc_lock in the task_struct to protect mems_allowed and memory policy of the task. But in the fast path, we didn't use lock to protect them, because adding a lock may lead to performance regression. But if we don't add a lock,the task might see no nodes when changing cpuset's mems_allowed to some non-overlapping set. In order to avoid it, we set all new allowed nodes, then clear newly disallowed ones. [lee.schermerhorn@hp.com: The rework of mpol_new() to extract the adjusting of the node mask to apply cpuset and mpol flags "context" breaks set_mempolicy() and mbind() with MPOL_PREFERRED and a NULL nodemask--i.e., explicit local allocation. Fix this by adding the check for MPOL_PREFERRED and empty node mask to mpol_new_mpolicy(). Remove the now unneeded 'nodes = NULL' from mpol_new(). Note that mpol_new_mempolicy() is always called with a non-NULL 'nodes' parameter now that it has been removed from mpol_new(). Therefore, we don't need to test nodes for NULL before testing it for 'empty'. However, just to be extra paranoid, add a VM_BUG_ON() to verify this assumption.] [lee.schermerhorn@hp.com: I don't think the function name 'mpol_new_mempolicy' is descriptive enough to differentiate it from mpol_new(). This function applies cpuset set context, usually constraining nodes to those allowed by the cpuset. However, when the 'RELATIVE_NODES flag is set, it also translates the nodes. So I settled on 'mpol_set_nodemask()', because the comment block for mpol_new() mentions that we need to call this function to "set nodes". Some additional minor line length, whitespace and typo cleanup.] Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Paul Menage <menage@google.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Yasunori Goto <y-goto@jp.fujitsu.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16cpusets: update tasks' page/slab spread flags in timeMiao Xie
Fix the bug that the kernel didn't spread page cache/slab object evenly over all the allowed nodes when spread flags were set by updating tasks' page/slab spread flags in time. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Paul Menage <menage@google.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Yasunori Goto <y-goto@jp.fujitsu.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16cpusets: restructure the function cpuset_update_task_memory_state()Miao Xie
The kernel still allocates the page caches on old node after modifying its cpuset's mems when 'memory_spread_page' was set, or it didn't spread the page cache evenly over all the nodes that faulting task is allowed to usr after memory_spread_page was set. it is caused by the old mem_allowed and flags of the task, the current kernel doesn't updates them unless some function invokes cpuset_update_task_memory_state(), it is too late sometimes.We must update the mem_allowed and the flags of the tasks in time. Slab has the same problem. The following patches fix this bug by updating tasks' mem_allowed and spread flag after its cpuset's mems or spread flag is changed. This patch: Extract a function from cpuset_update_task_memory_state(). It will be used later for update tasks' page/slab spread flags after its cpuset's flag is set Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Paul Menage <menage@google.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Yasunori Goto <y-goto@jp.fujitsu.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16ring-buffer: remove useless warn on checkSteven Rostedt
A check if "write > BUF_PAGE_SIZE" is done right after a if (write > BUF_PAGE_SIZE) return ...; Thus the check is actually testing the compiler and not the kernel. This is useless, remove it. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-06-16ring-buffer: use BUF_PAGE_HDR_SIZE in calculating indexSteven Rostedt
The index of the event is found by masking PAGE_MASK to it and subtracting the header size. Currently the header size is calculate by PAGE_SIZE - BUF_PAGE_SIZE, when we already have a macro BUF_PAGE_HDR_SIZE to define it. If we want to change BUF_PAGE_SIZE to something less than filling the rest of the page (this is done for debugging), then we break the algorithm to find the index. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-06-16tracing/filters: fix race between filter setting and module unloadLi Zefan
Module unload is protected by event_mutex, while setting filter is protected by filter_mutex. This leads to the race: echo 'bar == 0 || bar == 10' \ | > sample/filter | | insmod sample.ko add_pred("bar == 0") | -> n_preds == 1 | add_pred("bar == 100") | -> n_preds == 2 | | rmmod sample.ko | insmod sample.ko add_pred("&&") | -> n_preds == 1 (should be 3) | Now event->filter->preds is corrupted. An then when filter_match_preds() is called, the WARN_ON() in it will be triggered. To avoid the race, we remove filter_mutex, and replace it with event_mutex. [ Impact: prevent corruption of filters by module removing and loading ] Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> LKML-Reference: <4A375A4D.6000205@cn.fujitsu.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-06-16tracing/filters: free filter_string in destroy_preds()Li Zefan
filter->filter_string is not freed when unloading a module: # insmod trace-events-sample.ko # echo "bar < 100" > /mnt/tracing/events/sample/foo_bar/filter # rmmod trace-events-sample.ko [ Impact: fix memory leak when unloading module ] Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> LKML-Reference: <4A375A30.9060802@cn.fujitsu.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-06-16ring-buffer: use commit counters for commit pointer accountingSteven Rostedt
The ring buffer is made up of three sets of pointers. The head page pointer, which points to the next page for the reader to get. The commit pointer and commit index, which points to the page and index of the last committed write respectively. The tail pointer and tail index, which points to the page and the index of the last reserved data respectively (non committed). The commit pointer is only moved forward by the outer most writer. If a nested writer comes in, it will not move the pointer forward. The current implementation has a flaw. It assumes that the outer most writer successfully reserved data. There's a small race window where the outer most writer could find the tail pointer, but a nested writer could come in (via interrupt) and move the tail forward, and even the commit forward. The outer writer would not realized the commit moved forward and the accounting will break. This patch changes the design to use counters in the per cpu buffers to keep track of commits. The counters are incremented at the start of the commit, and decremented at the end. If the end commit counter is 1, then it moves the commit pointers. A loop is made to check for races between checking and moving the commit pointers. Only the outer commit should move the pointers anyway. The test of knowing if a reserve is equal to the last commit update is still needed to know for time keeping. The time code is much less racey than the commit updates. This change not only solves the mentioned race, but also makes the code simpler. [ Impact: fix commit race and simplify code ] Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-06-16ring-buffer: remove unused variableSteven Rostedt
Fix the compiler error: kernel/trace/ring_buffer.c: In function 'rb_move_tail': kernel/trace/ring_buffer.c:1236: warning: unused variable 'event' Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-06-16Merge branch 'for-linus2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vegard/kmemcheck * 'for-linus2' of git://git.kernel.org/pub/scm/linux/kernel/git/vegard/kmemcheck: (39 commits) signal: fix __send_signal() false positive kmemcheck warning fs: fix do_mount_root() false positive kmemcheck warning fs: introduce __getname_gfp() trace: annotate bitfields in struct ring_buffer_event net: annotate struct sock bitfield c2port: annotate bitfield for kmemcheck net: annotate inet_timewait_sock bitfields ieee1394/csr1212: fix false positive kmemcheck report ieee1394: annotate bitfield net: annotate bitfields in struct inet_sock net: use kmemcheck bitfields API for skbuff kmemcheck: introduce bitfield API kmemcheck: add opcode self-testing at boot x86: unify pte_hidden x86: make _PAGE_HIDDEN conditional kmemcheck: make kconfig accessible for other architectures kmemcheck: enable in the x86 Kconfig kmemcheck: add hooks for the page allocator kmemcheck: add hooks for page- and sg-dma-mappings kmemcheck: don't track page tables ...
2009-06-16Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6Linus Torvalds
* git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6: (64 commits) debugfs: use specified mode to possibly mark files read/write only debugfs: Fix terminology inconsistency of dir name to mount debugfs filesystem. xen: remove driver_data direct access of struct device from more drivers usb: gadget: at91_udc: remove driver_data direct access of struct device uml: remove driver_data direct access of struct device block/ps3: remove driver_data direct access of struct device s390: remove driver_data direct access of struct device parport: remove driver_data direct access of struct device parisc: remove driver_data direct access of struct device of_serial: remove driver_data direct access of struct device mips: remove driver_data direct access of struct device ipmi: remove driver_data direct access of struct device infiniband: ehca: remove driver_data direct access of struct device ibmvscsi: gadget: at91_udc: remove driver_data direct access of struct device hvcs: remove driver_data direct access of struct device xen block: remove driver_data direct access of struct device thermal: remove driver_data direct access of struct device scsi: remove driver_data direct access of struct device pcmcia: remove driver_data direct access of struct device PCIE: remove driver_data direct access of struct device ... Manually fix up trivial conflicts due to different direct driver_data direct access fixups in drivers/block/{ps3disk.c,ps3vram.c}
2009-06-16printk: add KERN_DEFAULT loglevel to print_modules()Linus Torvalds
Several WARN_ON() messages omit the '\n' at the end of the string, which is a simple (and understandable) error. The next line printed after that warning line is usually the current module list, and that printk does not have a log-level marker - resulting in one long mixed-up line. Adding this loglevel marker will now avoid this unreadable mess. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16printk: Add KERN_DEFAULT printk log-levelLinus Torvalds
This adds a KERN_DEFAULT loglevel marker, for when you cannot decide which loglevel you want, and just want to keep an existing printk with the default loglevel. The difference between having KERN_DEFAULT and having no log-level marker at all is two-fold: - having the log-level marker will now force a new-line if the previous printout had not added one (perhaps because it forgot, but perhaps because it expected a continuation) - having a log-level marker is required if you are printing out a message that otherwise itself could perhaps otherwise be mistaken for a log-level. Signed-of-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16printk: clean up handling of log-levels and newlinesLinus Torvalds
It used to be that we would only look at the log-level in a printk() after explicit newlines, which can cause annoying problems when the previous printk() did not end with a '\n'. In that case, the log-level marker would be just printed out in the middle of the line, and be seen as just noise rather than change the logging level. This changes things to always look at the log-level in the first bytes of the printout. If a log level marker is found, it is always used as the log-level. Additionally, if no newline existed, one is added (unless the log-level is the explicit KERN_CONT marker, to explicitly show that it's a continuation of a previous line). Acked-by: Arjan van de Ven <arjan@infradead.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16ring-buffer: have benchmark test handle discarded eventsSteven Rostedt
With the addition of commit: c7b0930857e2278f2e7714db6294e94c57f623b0 ring-buffer: prevent adding write in discarded area The ring buffer may now add discarded events when a write passes the end of a buffer page. Before, a discarded event was only added when the tracer deliberately created one. The ring buffer benchmark test does not handle discarded events when it reads the buffer and fails when it encounters one. Also fix the increment for large data entries (luckily, the test did not add any yet). [ Impact: fix false failure of ring buffer self test ] Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-06-15debugfs: Fix terminology inconsistency of dir name to mount debugfs filesystem.GeunSik Lim
Many developers use "/debug/" or "/debugfs/" or "/sys/kernel/debug/" directory name to mount debugfs filesystem for ftrace according to ./Documentation/tracers/ftrace.txt file. And, three directory names(ex:/debug/, /debugfs/, /sys/kernel/debug/) is existed in kernel source like ftrace, DRM, Wireless, Documentation, Network[sky2]files to mount debugfs filesystem. debugfs means debug filesystem for debugging easy to use by greg kroah hartman. "/sys/kernel/debug/" name is suitable as directory name of debugfs filesystem. - debugfs related reference: http://lwn.net/Articles/334546/ Fix inconsistency of directory name to mount debugfs filesystem. * From Steven Rostedt - find_debugfs() and tracing_files() in this patch. Signed-off-by: GeunSik Lim <geunsik.lim@samsung.com> Acked-by : Inaky Perez-Gonzalez <inaky@linux.intel.com> Reviewed-by : Steven Rostedt <rostedt@goodmis.org> Reviewed-by : James Smart <james.smart@emulex.com> CC: Jiri Kosina <trivial@kernel.org> CC: David Airlie <airlied@linux.ie> CC: Peter Osterlund <petero2@telia.com> CC: Ananth N Mavinakayanahalli <ananth@in.ibm.com> CC: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com> CC: Masami Hiramatsu <mhiramat@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-06-15sched: delayed cleanup of user_structKay Sievers
During bootup performance tracing we see repeated occurrences of /sys/kernel/uid/* events for the same uid, leading to a, in this case, rather pointless userspace processing for the same uid over and over. This is usually caused by tools which change their uid to "nobody", to run without privileges to read data supplied by untrusted users. This change delays the execution of the (already existing) scheduled work, to cleanup the uid after one second, so the allocated and announced uid can possibly be re-used by another process. This is the current behavior, where almost every invocation of a binary, which changes the uid, creates two events: $ read START < /sys/kernel/uevent_seqnum; \ for i in `seq 100`; do su --shell=/bin/true bin; done; \ read END < /sys/kernel/uevent_seqnum; \ echo $(($END - $START)) 178 With the delayed cleanup, we get only two events, and userspace finishes a bit faster too: $ read START < /sys/kernel/uevent_seqnum; \ for i in `seq 100`; do su --shell=/bin/true bin; done; \ read END < /sys/kernel/uevent_seqnum; \ echo $(($END - $START)) 1 Acked-by: Dhaval Giani <dhaval@linux.vnet.ibm.com> Signed-off-by: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-06-15Merge branch 'timers-for-linus-migration' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'timers-for-linus-migration' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: timers: Logic to move non pinned timers timers: /proc/sys sysctl hook to enable timer migration timers: Identifying the existing pinned timers timers: Framework for identifying pinned timers timers: allow deferrable timers for intervals tv2-tv5 to be deferred Fix up conflicts in kernel/sched.c and kernel/timer.c manually
2009-06-15Merge branch 'timers-for-linus-clockevents' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'timers-for-linus-clockevents' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: clockevent: export register_device and delta2ns clockevents: tick_broadcast_device can become static
2009-06-15Merge branch 'timers-for-linus-clocksource' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'timers-for-linus-clocksource' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: clocksource: prevent selection of low resolution clocksourse also for nohz=on clocksource: sanity check sysfs clocksource changes