summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2009-09-22ksm: fix deadlock with munlock in exit_mmapAndrea Arcangeli
Rawhide users have reported hang at startup when cryptsetup is run: the same problem can be simply reproduced by running a program int main() { mlockall(MCL_CURRENT | MCL_FUTURE); return 0; } The problem is that exit_mmap() applies munlock_vma_pages_all() to clean up VM_LOCKED areas, and its current implementation (stupidly) tries to fault in absent pages, for example where PROT_NONE prevented them being faulted in when mlocking. Whereas the "ksm: fix oom deadlock" patch, knowing there's a race by which KSM might try to fault in pages after exit_mmap() had finally zapped the range, backs out of such faults doing nothing when its ksm_test_exit() notices mm_users 0. So revert that part of "ksm: fix oom deadlock" which moved the ksm_exit() call from before exit_mmap() to the middle of exit_mmap(); and remove those ksm_test_exit() checks from the page fault paths, so allowing the munlocking to proceed without interference. ksm_exit, if there are rmap_items still chained on this mm slot, takes mmap_sem write side: so preventing KSM from working on an mm while exit_mmap runs. And KSM will bail out as soon as it notices that mm_users is already zero, thanks to its internal ksm_test_exit checks. So that when a task is killed by OOM killer or the user, KSM will not indefinitely prevent it from running exit_mmap to release its memory. This does break a part of what "ksm: fix oom deadlock" was trying to achieve. When unmerging KSM (echo 2 >/sys/kernel/mm/ksm), and even when ksmd itself has to cancel a KSM page, it is possible that the first OOM-kill victim would be the KSM process being faulted: then its memory won't be freed until a second victim has been selected (freeing memory for the unmerging fault to complete). But the OOM killer is already liable to kill a second victim once the intended victim's p->mm goes to NULL: so there's not much point in rejecting this KSM patch before fixing that OOM behaviour. It is very much more important to allow KSM users to boot up, than to haggle over an unlikely and poorly supported OOM case. We also intend to fix munlocking to not fault pages: at which point this patch _could_ be reverted; though that would be controversial, so we hope to find a better solution. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Justin M. Forbes <jforbes@redhat.com> Acked-for-now-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Izik Eidus <ieidus@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22ksm: fix oom deadlockHugh Dickins
There's a now-obvious deadlock in KSM's out-of-memory handling: imagine ksmd or KSM_RUN_UNMERGE handling, holding ksm_thread_mutex, trying to allocate a page to break KSM in an mm which becomes the OOM victim (quite likely in the unmerge case): it's killed and goes to exit, and hangs there waiting to acquire ksm_thread_mutex. Clearly we must not require ksm_thread_mutex in __ksm_exit, simple though that made everything else: perhaps use mmap_sem somehow? And part of the answer lies in the comments on unmerge_ksm_pages: __ksm_exit should also leave all the rmap_item removal to ksmd. But there's a fundamental problem, that KSM relies upon mmap_sem to guarantee the consistency of the mm it's dealing with, yet exit_mmap tears down an mm without taking mmap_sem. And bumping mm_users won't help at all, that just ensures that the pages the OOM killer assumes are on their way to being freed will not be freed. The best answer seems to be, to move the ksm_exit callout from just before exit_mmap, to the middle of exit_mmap: after the mm's pages have been freed (if the mmu_gather is flushed), but before its page tables and vma structures have been freed; and down_write,up_write mmap_sem there to serialize with KSM's own reliance on mmap_sem. But KSM then needs to be careful, whenever it downs mmap_sem, to check that the mm is not already exiting: there's a danger of using find_vma on a layout that's being torn apart, or writing into page tables which have been freed for reuse; and even do_anonymous_page and __do_fault need to check they're not being called by break_ksm to reinstate a pte after zap_pte_range has zapped that page table. Though it might be clearer to add an exiting flag, set while holding mmap_sem in __ksm_exit, that wouldn't cover the issue of reinstating a zapped pte. All we need is to check whether mm_users is 0 - but must remember that ksmd may detect that before __ksm_exit is reached. So, ksm_test_exit(mm) added to comment such checks on mm->mm_users. __ksm_exit now has to leave clearing up the rmap_items to ksmd, that needs ksm_thread_mutex; but shift the exiting mm just after the ksm_scan cursor so that it will soon be dealt with. __ksm_enter raise mm_count to hold the mm_struct, ksmd's exit processing (exactly like its processing when it finds all VM_MERGEABLEs unmapped) mmdrop it, similar procedure for KSM_RUN_UNMERGE (which has stopped ksmd). But also give __ksm_exit a fast path: when there's no complication (no rmap_items attached to mm and it's not at the ksm_scan cursor), it can safely do all the exiting work itself. This is not just an optimization: when ksmd is not running, the raised mm_count would otherwise leak mm_structs. Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Acked-by: Izik Eidus <ieidus@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22ksm: the mm interface to ksmHugh Dickins
This patch presents the mm interface to a dummy version of ksm.c, for better scrutiny of that interface: the real ksm.c follows later. When CONFIG_KSM is not set, madvise(2) reject MADV_MERGEABLE and MADV_UNMERGEABLE with EINVAL, since that seems more helpful than pretending that they can be serviced. But when CONFIG_KSM=y, accept them even if KSM is not currently running, and even on areas which KSM will not touch (e.g. hugetlb or shared file or special driver mappings). Like other madvices, report ENOMEM despite success if any area in the range is unmapped, and use EAGAIN to report out of memory. Define vma flag VM_MERGEABLE to identify an area on which KSM may try merging pages: leave it to ksm_madvise() to decide whether to set it. Define mm flag MMF_VM_MERGEABLE to identify an mm which might contain VM_MERGEABLE areas, to minimize callouts when forking or exiting. Based upon earlier patches by Chris Wright and Izik Eidus. Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Signed-off-by: Chris Wright <chrisw@redhat.com> Signed-off-by: Izik Eidus <ieidus@redhat.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Avi Kivity <avi@redhat.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22mm: oom analysis: Show kernel stack usage in /proc/meminfo and OOM log outputKOSAKI Motohiro
The amount of memory allocated to kernel stacks can become significant and cause OOM conditions. However, we do not display the amount of memory consumed by stacks. Add code to display the amount of memory used for stacks in /proc/meminfo. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Christoph Lameter <cl@linux-foundation.org> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22const: mark remaining inode_operations as constAlexey Dobriyan
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22const: mark remaining super_operations constAlexey Dobriyan
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-21Merge branch 'perfcounters-rename-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perfcounters-rename-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: perf: Tidy up after the big rename perf: Do the big rename: Performance Counters -> Performance Events perf_counter: Rename 'event' to event_id/hw_event perf_counter: Rename list_entry -> group_entry, counter_list -> group_list Manually resolved some fairly trivial conflicts with the tracing tree in include/trace/ftrace.h and kernel/trace/trace_syscalls.c.
2009-09-21Merge branch 'core-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: rcu: Fix whitespace inconsistencies rcu: Fix thinko, actually initialize full tree rcu: Apply results of code inspection of kernel/rcutree_plugin.h rcu: Add WARN_ON_ONCE() consistency checks covering state transitions rcu: Fix synchronize_rcu() for TREE_PREEMPT_RCU rcu: Simplify rcu_read_unlock_special() quiescent-state accounting rcu: Add debug checks to TREE_PREEMPT_RCU for premature grace periods rcu: Kconfig help needs to say that TREE_PREEMPT_RCU scales down rcutorture: Occasionally delay readers enough to make RCU force_quiescent_state rcu: Initialize multi-level RCU grace periods holding locks rcu: Need to update rnp->gpnum if preemptable RCU is to be reliable
2009-09-21Merge branch 'sched-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: sched: Simplify sys_sched_rr_get_interval() system call sched: Fix potential NULL derference of doms_cur sched: Fix raciness in runqueue_is_locked() sched: Re-add lost cpu_allowed check to sched_fair.c::select_task_rq_fair() sched: Remove unneeded indentation in sched_fair.c::place_entity()
2009-09-21Merge branch 'tracing-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: kernel/profile.c: Switch /proc/irq/prof_cpu_mask to seq_file tracing: Export trace_profile_buf symbols tracing/events: use list_for_entry_continue tracing: remove max_tracer_type_len function-graph: use ftrace_graph_funcs directly tracing: Remove markers tracing: Allocate the ftrace event profile buffer dynamically tracing: Factorize the events profile accounting
2009-09-21perf: Tidy up after the big renameIngo Molnar
- provide compatibility Kconfig entry for existing PERF_COUNTERS .config's - provide courtesy copy of old perf_counter.h, for user-space projects - small indentation fixups - fix up MAINTAINERS - fix small x86 printout fallout - fix up small PowerPC comment fallout (use 'counter' as in register) Reviewed-by: Arjan van de Ven <arjan@linux.intel.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-09-21perf: Do the big rename: Performance Counters -> Performance EventsIngo Molnar
Bye-bye Performance Counters, welcome Performance Events! In the past few months the perfcounters subsystem has grown out its initial role of counting hardware events, and has become (and is becoming) a much broader generic event enumeration, reporting, logging, monitoring, analysis facility. Naming its core object 'perf_counter' and naming the subsystem 'perfcounters' has become more and more of a misnomer. With pending code like hw-breakpoints support the 'counter' name is less and less appropriate. All in one, we've decided to rename the subsystem to 'performance events' and to propagate this rename through all fields, variables and API names. (in an ABI compatible fashion) The word 'event' is also a bit shorter than 'counter' - which makes it slightly more convenient to write/handle as well. Thanks goes to Stephane Eranian who first observed this misnomer and suggested a rename. User-space tooling and ABI compatibility is not affected - this patch should be function-invariant. (Also, defconfigs were not touched to keep the size down.) This patch has been generated via the following script: FILES=$(find * -type f | grep -vE 'oprofile|[^K]config') sed -i \ -e 's/PERF_EVENT_/PERF_RECORD_/g' \ -e 's/PERF_COUNTER/PERF_EVENT/g' \ -e 's/perf_counter/perf_event/g' \ -e 's/nb_counters/nb_events/g' \ -e 's/swcounter/swevent/g' \ -e 's/tpcounter_event/tp_event/g' \ $FILES for N in $(find . -name perf_counter.[ch]); do M=$(echo $N | sed 's/perf_counter/perf_event/g') mv $N $M done FILES=$(find . -name perf_event.*) sed -i \ -e 's/COUNTER_MASK/REG_MASK/g' \ -e 's/COUNTER/EVENT/g' \ -e 's/\<event\>/event_id/g' \ -e 's/counter/event/g' \ -e 's/Counter/Event/g' \ $FILES ... to keep it as correct as possible. This script can also be used by anyone who has pending perfcounters patches - it converts a Linux kernel tree over to the new naming. We tried to time this change to the point in time where the amount of pending patches is the smallest: the end of the merge window. Namespace clashes were fixed up in a preparatory patch - and some stylistic fallout will be fixed up in a subsequent patch. ( NOTE: 'counters' are still the proper terminology when we deal with hardware registers - and these sed scripts are a bit over-eager in renaming them. I've undone some of that, but in case there's something left where 'counter' would be better than 'event' we can undo that on an individual basis instead of touching an otherwise nicely automated patch. ) Suggested-by: Stephane Eranian <eranian@google.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Paul Mackerras <paulus@samba.org> Reviewed-by: Arjan van de Ven <arjan@linux.intel.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: David Howells <dhowells@redhat.com> Cc: Kyle McMartin <kyle@mcmartin.ca> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: <linux-arch@vger.kernel.org> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-09-21perf_counter: Rename 'event' to event_id/hw_eventIngo Molnar
In preparation to the renames, to avoid a namespace clash. Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-09-21perf_counter: Rename list_entry -> group_entry, counter_list -> group_listIngo Molnar
This is in preparation of the big rename, but also makes sense in a standalone way: 'list_entry' is a bad name as we already have a list_entry() in list.h. Also, the 'counter list' is too vague, it doesnt tell us the purpose of that list. Clarify these names to show that it's all about the group hiearchy. Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-09-21sched: Simplify sys_sched_rr_get_interval() system callPeter Williams
By removing the need for it to know details of scheduling classes. This allows PlugSched to define orthogonal scheduling classes. Signed-off-by: Peter Williams <pwil3058@bigpond.net.au> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> LKML-Reference: <06d1b89ee15a0eef82d7.1253496713@mudlark.pw.nest> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-09-20Merge git://git.kernel.org/pub/scm/linux/kernel/git/jaswinder/linux-2.6Linus Torvalds
* git://git.kernel.org/pub/scm/linux/kernel/git/jaswinder/linux-2.6: includecheck fix: x86, cpu/common.c includecheck fix: kernel/trace, ring_buffer.c includecheck fix: include/linux, ftrace.h includecheck fix: include/linux, page_cgroup.h includecheck fix: include/linux, aio.h includecheck fix: include/drm, drm_memory.h includecheck fix: include/acpi, acpi_bus.h includecheck fix: drivers/xen, evtchn.c includecheck fix: drivers/video, vgacon.c includecheck fix: drivers/scsi, ibmvscsi.c includecheck fix: drivers/scsi, libfcoe.c includecheck fix: x86, shadow.c includecheck fix: x86, traps.c includecheck fix: um, helper.c includecheck fix: s390, sys_s390.c
2009-09-20Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty-2.6Linus Torvalds
* git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty-2.6: (79 commits) USB serial: update the console driver usb-serial: straighten out serial_open usb-serial: add missing tests and debug lines usb-serial: rename subroutines usb-serial: fix termios initialization logic usb-serial: acquire references when a new tty is installed usb-serial: change logic of serial lookups usb-serial: put subroutines in logical order usb-serial: change referencing of port and serial structures tty: Char: mxser, use THRE for ASPP_OQUEUE ioctl tty: Char: mxser, add support for CP112UL uartlite: support shared interrupt lines tty: USB: serial/mct_u232, fix tty refcnt tty: riscom8, fix tty refcnt tty: riscom8, fix shutdown declaration TTY: fix typos tty: Power: fix suspend vt regression tty: vt: use printk_once tty: handle VT specific compat ioctls in vt driver n_tty: move echoctl check and clean up logic ...
2009-09-20Merge branch 'perfcounters-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perfcounters-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (58 commits) perf_counter: Fix perf_copy_attr() pointer arithmetic perf utils: Use a define for the maximum length of a trace event perf: Add timechart help text and add timechart to "perf help" tracing, x86, cpuidle: Move the end point of a C state in the power tracer perf utils: Be consistent about minimum text size in the svghelper perf timechart: Add "perf timechart record" perf: Add the timechart tool perf: Add a SVG helper library file tracing, perf: Convert the power tracer into an event tracer perf: Add a sample_event type to the event_union perf: Allow perf utilities to have "callback" options without arguments perf: Store trace event name/id pairs in perf.data perf: Add a timestamp to fork events sched_clock: Make it NMI safe perf_counter: Fix up swcounter throttling x86, perf_counter, bts: Optimize BTS overflow handling perf sched: Add --input=file option to builtin-sched.c perf trace: Sample timestamp and cpu when using record flag perf tools: Increase MAX_EVENT_LENGTH perf tools: Fix memory leak in read_ftrace_printk() ...
2009-09-20sched: Fix potential NULL derference of doms_curYong Zhang
If CONFIG_CPUMASK_OFFSTACK is enabled but doms_cur alloc failed in arch_init_sched_domains(), doms_cur will move back to fallback_doms. But this time, fallback_doms has not been initialized yet. Signed-off-by: Yong Zhang <yong.zhang0@gmail.com> Cc: a.p.zijlstra@chello.nl LKML-Reference: <1252930816-7672-1-git-send-email-yong.zhang0@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-09-20kernel/profile.c: Switch /proc/irq/prof_cpu_mask to seq_fileAlexey Dobriyan
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-09-20sched: Fix raciness in runqueue_is_locked()Andrew Morton
runqueue_is_locked() is unavoidably racy due to a poor interface design. It does cpu = get_cpu() ret = some_perpcu_thing(cpu); put_cpu(cpu); return ret; Its return value is unreliable. Fix. Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Steven Rostedt <rostedt@goodmis.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <200909191855.n8JItiko022148@imap1.linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-09-20tracing: Export trace_profile_buf symbolsPeter Zijlstra
ERROR: "trace_profile_buf_nmi" [fs/jbd2/jbd2.ko] undefined! ERROR: "trace_profile_buf" [fs/jbd2/jbd2.ko] undefined! ERROR: "trace_profile_buf_nmi" [fs/ext4/ext4.ko] undefined! ERROR: "trace_profile_buf" [fs/ext4/ext4.ko] undefined! ERROR: "trace_profile_buf_nmi" [arch/x86/kvm/kvm.ko] undefined! ERROR: "trace_profile_buf" [arch/x86/kvm/kvm.ko] undefined! Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <1253442878.7542.3.camel@laptop> [ fixed whitespace noise and checkpatch complaint ] Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-09-20includecheck fix: kernel/trace, ring_buffer.cJaswinder Singh Rajput
fix the following 'make includecheck' warning: kernel/trace/ring_buffer.c: trace.h is included more than once. Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Sam Ravnborg <sam@ravnborg.org> LKML-Reference: <1247068617.4382.107.camel@ht.satnam>
2009-09-19vt: remove power stuff from kernel/powerAlan Cox
In the past someone gratuitiously borrowed chunks of kernel internal vt code and dumped them in kernel/power. They have all sorts of deep relations with the vt code so put them in the vt tree instead Signed-off-by: Alan Cox <alan@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-09-19kfifo: Use "const" definitionsAlan Cox
Currently kfifo cannot be used by parts of the kernel that use "const" properly as kfifo itself does not use const for passed data blocks which are indeed const. Signed-off-by: Alan Cox <alan@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-09-19perf_counter: Fix perf_copy_attr() pointer arithmeticIan Schram
There is still some weird code in per_copy_attr(). Which supposedly checks that all bytes trailing a struct are zero. It doesn't seem to get pointer arithmetic right. Since it increments an iterating pointer by sizeof(unsigned long) rather than 1. Signed-off-by: Ian Schram <ischram@telenet.be> [ v2: clean up the messy PTR_ALIGN logic as well. ] Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: <stable@kernel.org> # for v2.6.31.x LKML-Reference: <4AB3DEE2.3030600@telenet.be> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-09-19Merge branch 'tip/tracing/core' of ↵Ingo Molnar
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace into tracing/urgent
2009-09-19tracing/events: use list_for_entry_continueLi Zefan
Simplify s_next() and t_next(). Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> LKML-Reference: <4AB32389.1030005@cn.fujitsu.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-09-19tracing: remove max_tracer_type_lenLi Zefan
Limit the length of a tracer's name within 100 chars, and then we don't have to play with max_tracer_type_len. Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> LKML-Reference: <4AB32377.9020601@cn.fujitsu.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-09-19function-graph: use ftrace_graph_funcs directlyLi Zefan
No need to store ftrace_graph_funcs in file->private. Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> LKML-Reference: <4AB32364.7020602@cn.fujitsu.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-09-19sched: Re-add lost cpu_allowed check to sched_fair.c::select_task_rq_fair()Mike Galbraith
While doing some testing, I pinned mplayer, only to find it following X around like a puppy. Looking at commit c88d591, I found a cpu_allowed check that went AWOL. I plugged it back in where it looks like it needs to go, and now when I say "sit, stay!", mplayer obeys again. 'c88d591 sched: Merge select_task_rq_fair() and sched_balance_self()' accidentally dropped the check, causing wake_affine() to pull pinned tasks - put it back. [ v2: use a cheaper version from Peter ] Signed-off-by: Mike Galbraith <efault@gmx.de> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-09-19Merge branch 'tracing/core-v3' of ↵Ingo Molnar
git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing into tracing/urgent
2009-09-19tracing, perf: Convert the power tracer into an event tracerArjan van de Ven
This patch converts the existing power tracer into an event tracer, so that power events (C states and frequency changes) can be tracked via "perf". This also removes the perl script that was used to demo the tracer; its functionality is being replaced entirely with timechart. Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <20090912130542.6d314860@infradead.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-09-19perf: Add a timestamp to fork eventsArjan van de Ven
perf timechart needs to know when a process forked, in order to be able to visualize properly when tasks start. This patch adds a time field to the event structure, and fills it in appropriately. Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <20090912130341.51ad2de2@infradead.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-09-19Merge branch 'linus' into perfcounters/coreIngo Molnar
Merge reason: Bring in tracing changes we depend on. Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-09-19rcu: Fix whitespace inconsistenciesPaul E. McKenney
Fix a number of whitespace ^Ierrors in the include/linux/rcu* and the kernel/rcu* files. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: akpm@linux-foundation.org Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu LKML-Reference: <20090918172819.GA24405@linux.vnet.ibm.com> [ did more checkpatch fixlets ] Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-09-19rcu: Fix thinko, actually initialize full treePaul E. McKenney
Commit de078d8 ("rcu: Need to update rnp->gpnum if preemptable RCU is to be reliable") repeatedly and incorrectly initializes the root rcu_node structure's ->gpnum field rather than initializing the ->gpnum field of each node in the tree. Fix this. Also add an additional consistency check to catch this in the future. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: akpm@linux-foundation.org Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu LKML-Reference: <125329262011-git-send-email-> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-09-19rcu: Apply results of code inspection of kernel/rcutree_plugin.hPaul E. McKenney
o Drop the calls to cpu_quiet() from the online/offline code. These are unnecessary, since force_quiescent_state() will clean up, and removing them simplifies the code a bit. o Add a warning to check that we don't enqueue the same blocked task twice onto the ->blocked_tasks[] lists. o Rework the phase computation in rcu_preempt_note_context_switch() to be more readable, as suggested by Josh Triplett. o Disable irqs to close a race between the scheduling clock interrupt and rcu_preempt_note_context_switch() WRT the ->rcu_read_unlock_special field. o Add comments to rnp->lock acquisition and release within rcu_read_unlock_special() noting that irqs are already disabled. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: akpm@linux-foundation.org Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu LKML-Reference: <12532926201851-git-send-email-> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-09-19rcu: Add WARN_ON_ONCE() consistency checks covering state transitionsPaul E. McKenney
o Verify that qsmask bits stay clear through GP initialization. o Verify that cpu_quiet_msk_finish() is never invoked unless there actually is an RCU grace period in progress. o Verify that all internal-node rcu_node structures have empty blocked_tasks[] lists. o Verify that child rcu_node structure's bits remain clear after acquiring parent's lock. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: akpm@linux-foundation.org Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu LKML-Reference: <12532926191947-git-send-email-> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-09-18tracing: Remove markersChristoph Hellwig
Now that the last users of markers have migrated to the event tracer we can kill off the (now orphan) support code. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <20090917173527.GA1699@lst.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-09-18sched_clock: Make it NMI safePeter Zijlstra
Arjan complained about the suckyness of TSC on modern machines, and asked if we could do something about that for PERF_SAMPLE_TIME. Make cpu_clock() NMI safe by removing the spinlock and using cmpxchg. This also makes it smaller and more robust. Affects architectures that use HAVE_UNSTABLE_SCHED_CLOCK, i.e. IA64 and x86. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-09-18perf_counter: Fix up swcounter throttlingPeter Zijlstra
/me dons the brown paper bag. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-09-18x86, perf_counter, bts: Optimize BTS overflow handlingMarkus Metzger
Draining the BTS buffer on a buffer overflow interrupt takes too long resulting in a kernel lockup when tracing the kernel. Restructure perf_counter sampling into sample creation and sample output. Prepare a single reference sample for BTS sampling and update the from and to address fields when draining the BTS buffer. Drain the entire BTS buffer between a single perf_output_begin() / perf_output_end() pair. Signed-off-by: Markus Metzger <markus.t.metzger@intel.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20090915130023.A16204@sedona.ch.intel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-09-18headers: taskstats_kern.h trimAlexey Dobriyan
Remove net/genetlink.h inclusion, now sched.c won't be recompiled because of some networking changes. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-18Merge branch 'timers-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'timers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (34 commits) time: Prevent 32 bit overflow with set_normalized_timespec() clocksource: Delay clocksource down rating to late boot clocksource: clocksource_select must be called with mutex locked clocksource: Resolve cpu hotplug dead lock with TSC unstable, fix crash timers: Drop a function prototype clocksource: Resolve cpu hotplug dead lock with TSC unstable timer.c: Fix S/390 comments timekeeping: Fix invalid getboottime() value timekeeping: Fix up read_persistent_clock() breakage on sh timekeeping: Increase granularity of read_persistent_clock(), build fix time: Introduce CLOCK_REALTIME_COARSE x86: Do not unregister PIT clocksource on PIT oneshot setup/shutdown clocksource: Avoid clocksource watchdog circular locking dependency clocksource: Protect the watchdog rating changes with clocksource_mutex clocksource: Call clocksource_change_rating() outside of watchdog_lock timekeeping: Introduce read_boot_clock timekeeping: Increase granularity of read_persistent_clock() timekeeping: Update clocksource with stop_machine timekeeping: Add timekeeper read_clock helper functions timekeeping: Move NTP adjusted clock multiplier to struct timekeeper ... Fix trivial conflict due to MIPS lemote -> loongson renaming.
2009-09-18sched: Remove unneeded indentation in sched_fair.c::place_entity()Mike Galbraith
Signed-off-by: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1253258365.22787.33.camel@marge.simson.net> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-09-18tracing: Allocate the ftrace event profile buffer dynamicallyFrederic Weisbecker
Currently the trace event profile buffer is allocated in the stack. But this may be too much for the stack, as the events can have large statically defined field size and can also grow with dynamic arrays. Allocate two per cpu buffer for all profiled events. The first cpu buffer is used to host every non-nmi context traces. It is protected by disabling the interrupts while writing and committing the trace. The second buffer is reserved for nmi. So that there is no race between them and the first buffer. The whole write/commit section is rcu protected because we release these buffers while deactivating the last profiling trace event. v2: Move the buffers from trace_event to be global, as pointed by Steven Rostedt. v3: Fix the syscall events to handle the profiling buffer races by disabling interrupts, now that the buffers are globals. Suggested-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Jason Baron <jbaron@redhat.com> Cc: Masami Hiramatsu <mhiramat@redhat.com> Cc: Ingo Molnar <mingo@elte.hu>
2009-09-18tracing: Factorize the events profile accountingFrederic Weisbecker
Factorize the events enabling accounting in a common tracing core helper. This reduces the size of the profile_enable() and profile_disable() callbacks for each trace events. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> Acked-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Jason Baron <jbaron@redhat.com> Cc: Masami Hiramatsu <mhiramat@redhat.com> Cc: Ingo Molnar <mingo@elte.hu>
2009-09-17Merge branch 'sched-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (37 commits) sched: Fix SD_POWERSAVING_BALANCE|SD_PREFER_LOCAL vs SD_WAKE_AFFINE sched: Stop buddies from hogging the system sched: Add new wakeup preemption mode: WAKEUP_RUNNING sched: Fix TASK_WAKING & loadaverage breakage sched: Disable wakeup balancing sched: Rename flags to wake_flags sched: Clean up the load_idx selection in select_task_rq_fair sched: Optimize cgroup vs wakeup a bit sched: x86: Name old_perf in a unique way sched: Implement a gentler fair-sleepers feature sched: Add SD_PREFER_LOCAL sched: Add a few SYNC hint knobs to play with sched: Fix sync wakeups again sched: Add WF_FORK sched: Rename sync arguments sched: Rename select_task_rq() argument sched: Feature to disable APERF/MPERF cpu_power x86: sched: Provide arch implementations using aperf/mperf x86: Add generic aperf/mperf code x86: Move APERF/MPERF into a X86_FEATURE ... Fix up trivial conflict in arch/x86/include/asm/processor.h due to nearby addition of amd_get_nb_id() declaration from the EDAC merge.
2009-09-18rcu: Fix synchronize_rcu() for TREE_PREEMPT_RCUPaul E. McKenney
The redirection of synchronize_sched() to synchronize_rcu() was appropriate for TREE_RCU, but not for TREE_PREEMPT_RCU. Fix this by creating an underlying synchronize_sched(). TREE_RCU then redirects synchronize_rcu() to synchronize_sched(), while TREE_PREEMPT_RCU has its own version of synchronize_rcu(). Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: akpm@linux-foundation.org Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu LKML-Reference: <12528585111916-git-send-email-> Signed-off-by: Ingo Molnar <mingo@elte.hu>