summaryrefslogtreecommitdiff
path: root/kernel/trace
AgeCommit message (Collapse)Author
2015-06-26Merge tag 'trace-v4.2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing updates from Steven Rostedt: "This patch series contains several clean ups and even a new trace clock "monitonic raw". Also some enhancements to make the ring buffer even faster. But the biggest and most noticeable change is the renaming of the ftrace* files, structures and variables that have to deal with trace events. Over the years I've had several developers tell me about their confusion with what ftrace is compared to events. Technically, "ftrace" is the infrastructure to do the function hooks, which include tracing and also helps with live kernel patching. But the trace events are a separate entity altogether, and the files that affect the trace events should not be named "ftrace". These include: include/trace/ftrace.h -> include/trace/trace_events.h include/linux/ftrace_event.h -> include/linux/trace_events.h Also, functions that are specific for trace events have also been renamed: ftrace_print_*() -> trace_print_*() (un)register_ftrace_event() -> (un)register_trace_event() ftrace_event_name() -> trace_event_name() ftrace_trigger_soft_disabled() -> trace_trigger_soft_disabled() ftrace_define_fields_##call() -> trace_define_fields_##call() ftrace_get_offsets_##call() -> trace_get_offsets_##call() Structures have been renamed: ftrace_event_file -> trace_event_file ftrace_event_{call,class} -> trace_event_{call,class} ftrace_event_buffer -> trace_event_buffer ftrace_subsystem_dir -> trace_subsystem_dir ftrace_event_raw_##call -> trace_event_raw_##call ftrace_event_data_offset_##call-> trace_event_data_offset_##call ftrace_event_type_funcs_##call -> trace_event_type_funcs_##call And a few various variables and flags have also been updated. This has been sitting in linux-next for some time, and I have not heard a single complaint about this rename breaking anything. Mostly because these functions, variables and structures are mostly internal to the tracing system and are seldom (if ever) used by anything external to that" * tag 'trace-v4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (33 commits) ring_buffer: Allow to exit the ring buffer benchmark immediately ring-buffer-benchmark: Fix the wrong type ring-buffer-benchmark: Fix the wrong param in module_param ring-buffer: Add enum names for the context levels ring-buffer: Remove useless unused tracing_off_permanent() ring-buffer: Give NMIs a chance to lock the reader_lock ring-buffer: Add trace_recursive checks to ring_buffer_write() ring-buffer: Allways do the trace_recursive checks ring-buffer: Move recursive check to per_cpu descriptor ring-buffer: Add unlikelys to make fast path the default tracing: Rename ftrace_get_offsets_##call() to trace_event_get_offsets_##call() tracing: Rename ftrace_define_fields_##call() to trace_event_define_fields_##call() tracing: Rename ftrace_event_type_funcs_##call to trace_event_type_funcs_##call tracing: Rename ftrace_data_offset_##call to trace_event_data_offset_##call tracing: Rename ftrace_raw_##call event structures to trace_event_raw_##call tracing: Rename ftrace_trigger_soft_disabled() to trace_trigger_soft_disabled() tracing: Rename FTRACE_EVENT_FL_* flags to EVENT_FILE_FL_* tracing: Rename struct ftrace_subsystem_dir to trace_subsystem_dir tracing: Rename ftrace_event_name() to trace_event_name() tracing: Rename FTRACE_MAX_EVENT to TRACE_EVENT_TYPE_MAX ...
2015-06-26Merge tag 'trace-fixes-4.1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing fixes from Steven Rostedt: "This isn't my 4.2 pull request (yet). I found a few more bugs that I would have sent to fix 4.1, but since 4.1 is already out, I'm sending this before sending my 4.2 request (which is ready to go). After fixing the previous filter issue reported by Vince Weaver, I could not come up with a situation where the operand counter (cnt) could go below zero, so I added a WARN_ON_ONCE(cnt < 0). Vince was able to trigger that warn on with his fuzzer test, but didn't have a filter input that caused it. Later, Sasha Levin was able to trigger that same warning, and was able to give me the filter string that triggered it. It was simply a single operation ">". I wrapped the filtering code in a userspace program such that I could single step through the logic. With a single operator the operand counter can legitimately go below zero, and should be reported to the user as an error, but should not produce a kernel warning. The WARN_ON_ONCE(cnt < 0) should be just a "if (cnt < 0) break;" and the code following it will produce the error message for the user. While debugging this, I found that there was another bug that let the pointer to the filter string go beyond the filter string. This too was fixed. Finally, there was a typo in a stub function that only gets compiled if trace events is disabled but tracing is enabled (I'm not even sure that's possible)" * tag 'trace-fixes-4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: tracing: Fix typo from "static inlin" to "static inline" tracing/filter: Do not allow infix to exceed end of string tracing/filter: Do not WARN on operand count going below zero
2015-06-25kernel/trace/blktrace.c: use strreplace() in do_blk_trace_setup()Rasmus Villemoes
Part of the disassembly of do_blk_trace_setup: 231b: e8 00 00 00 00 callq 2320 <do_blk_trace_setup+0x50> 231c: R_X86_64_PC32 strlen+0xfffffffffffffffc 2320: eb 0a jmp 232c <do_blk_trace_setup+0x5c> 2322: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1) 2328: 48 83 c3 01 add $0x1,%rbx 232c: 48 39 d8 cmp %rbx,%rax 232f: 76 47 jbe 2378 <do_blk_trace_setup+0xa8> 2331: 41 80 3c 1c 2f cmpb $0x2f,(%r12,%rbx,1) 2336: 75 f0 jne 2328 <do_blk_trace_setup+0x58> 2338: 41 c6 04 1c 5f movb $0x5f,(%r12,%rbx,1) 233d: 4c 89 e7 mov %r12,%rdi 2340: e8 00 00 00 00 callq 2345 <do_blk_trace_setup+0x75> 2341: R_X86_64_PC32 strlen+0xfffffffffffffffc 2345: eb e1 jmp 2328 <do_blk_trace_setup+0x58> Yep, that's right: gcc isn't smart enough to realize that replacing '/' by '_' cannot change the strlen(), so we call it again and again (at least when a '/' is found). Even if gcc were that smart, this construction would still loop over the string twice, once for the initial strlen() call and then the open-coded loop. Let's simply use strreplace() instead. Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Acked-by: Steven Rostedt <rostedt@goodmis.org> Liked-by: Jens Axboe <axboe@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-25kernel/trace/trace_events_filter.c: use strreplace()Rasmus Villemoes
There's no point in starting over every time we see a ','... Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Acked-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-25tracing: Fix typo from "static inlin" to "static inline"Steven Rostedt (Red Hat)
The trace.h header when called without CONFIG_EVENT_TRACING enabled (seldom done), will not compile because of a typo in the protocol of trace_event_enum_update(). Cc: stable@vger.kernel.org # 4.1+ Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-06-25tracing/filter: Do not allow infix to exceed end of stringSteven Rostedt (Red Hat)
While debugging a WARN_ON() for filtering, I found that it is possible for the filter string to be referenced after its end. With the filter: # echo '>' > /sys/kernel/debug/events/ext4/ext4_truncate_exit/filter The filter_parse() function can call infix_get_op() which calls infix_advance() that updates the infix filter pointers for the cnt and tail without checking if the filter is already at the end, which will put the cnt to zero and the tail beyond the end. The loop then calls infix_next() that has ps->infix.cnt--; return ps->infix.string[ps->infix.tail++]; The cnt will now be below zero, and the tail that is returned is already passed the end of the filter string. So far the allocation of the filter string usually has some buffer that is zeroed out, but if the filter string is of the exact size of the allocated buffer there's no guarantee that the charater after the nul terminating character will be zero. Luckily, only root can write to the filter. Cc: stable@vger.kernel.org # 2.6.33+ Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-06-25tracing/filter: Do not WARN on operand count going below zeroSteven Rostedt (Red Hat)
When testing the fix for the trace filter, I could not come up with a scenario where the operand count goes below zero, so I added a WARN_ON_ONCE(cnt < 0) to the logic. But there is legitimate case that it can happen (although the filter would be wrong). # echo '>' > /sys/kernel/debug/events/ext4/ext4_truncate_exit/filter That is, a single operation without any operands will hit the path where the WARN_ON_ONCE() can trigger. Although this is harmless, and the filter is reported as a error. But instead of spitting out a warning to the kernel dmesg, just fail nicely and report it via the proper channels. Link: http://lkml.kernel.org/r/558C6082.90608@oracle.com Reported-by: Vince Weaver <vincent.weaver@maine.edu> Reported-by: Sasha Levin <sasha.levin@oracle.com> Cc: stable@vger.kernel.org # 2.6.33+ Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-06-24Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-nextLinus Torvalds
Pull networking updates from David Miller: 1) Add TX fast path in mac80211, from Johannes Berg. 2) Add TSO/GRO support to ibmveth, from Thomas Falcon 3) Move away from cached routes in ipv6, just like ipv4, from Martin KaFai Lau. 4) Lots of new rhashtable tests, from Thomas Graf. 5) Run ingress qdisc lockless, from Alexei Starovoitov. 6) Allow servers to fetch TCP packet headers for SYN packets of new connections, for fingerprinting. From Eric Dumazet. 7) Add mode parameter to pktgen, for testing receive. From Alexei Starovoitov. 8) Cache access optimizations via simplifications of build_skb(), from Alexander Duyck. 9) Move page frag allocator under mm/, also from Alexander. 10) Add xmit_more support to hv_netvsc, from KY Srinivasan. 11) Add a counter guard in case we try to perform endless reclassify loops in the packet scheduler. 12) Extern flow dissector to be programmable and use it in new "Flower" classifier. From Jiri Pirko. 13) AF_PACKET fanout rollover fixes, performance improvements, and new statistics. From Willem de Bruijn. 14) Add netdev driver for GENEVE tunnels, from John W Linville. 15) Add ingress netfilter hooks and filtering, from Pablo Neira Ayuso. 16) Fix handling of epoll edge triggers in TCP, from Eric Dumazet. 17) Add an ECN retry fallback for the initial TCP handshake, from Daniel Borkmann. 18) Add tail call support to BPF, from Alexei Starovoitov. 19) Add several pktgen helper scripts, from Jesper Dangaard Brouer. 20) Add zerocopy support to AF_UNIX, from Hannes Frederic Sowa. 21) Favor even port numbers for allocation to connect() requests, and odd port numbers for bind(0), in an effort to help avoid ip_local_port_range exhaustion. From Eric Dumazet. 22) Add Cavium ThunderX driver, from Sunil Goutham. 23) Allow bpf programs to access skb_iif and dev->ifindex SKB metadata, from Alexei Starovoitov. 24) Add support for T6 chips in cxgb4vf driver, from Hariprasad Shenai. 25) Double TCP Small Queues default to 256K to accomodate situations like the XEN driver and wireless aggregation. From Wei Liu. 26) Add more entropy inputs to flow dissector, from Tom Herbert. 27) Add CDG congestion control algorithm to TCP, from Kenneth Klette Jonassen. 28) Convert ipset over to RCU locking, from Jozsef Kadlecsik. 29) Track and act upon link status of ipv4 route nexthops, from Andy Gospodarek. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1670 commits) bridge: vlan: flush the dynamically learned entries on port vlan delete bridge: multicast: add a comment to br_port_state_selection about blocking state net: inet_diag: export IPV6_V6ONLY sockopt stmmac: troubleshoot unexpected bits in des0 & des1 net: ipv4 sysctl option to ignore routes when nexthop link is down net: track link-status of ipv4 nexthops net: switchdev: ignore unsupported bridge flags net: Cavium: Fix MAC address setting in shutdown state drivers: net: xgene: fix for ACPI support without ACPI ip: report the original address of ICMP messages net/mlx5e: Prefetch skb data on RX net/mlx5e: Pop cq outside mlx5e_get_cqe net/mlx5e: Remove mlx5e_cq.sqrq back-pointer net/mlx5e: Remove extra spaces net/mlx5e: Avoid TX CQE generation if more xmit packets expected net/mlx5e: Avoid redundant dev_kfree_skb() upon NOP completion net/mlx5e: Remove re-assignment of wq type in mlx5e_enable_rq() net/mlx5e: Use skb_shinfo(skb)->gso_segs rather than counting them net/mlx5e: Static mapping of netdev priv resources to/from netdev TX queues net/mlx4_en: Use HW counters for rx/tx bytes/packets in PF device ...
2015-06-17tracing: Have filter check for balanced opsSteven Rostedt
When the following filter is used it causes a warning to trigger: # cd /sys/kernel/debug/tracing # echo "((dev==1)blocks==2)" > events/ext4/ext4_truncate_exit/filter -bash: echo: write error: Invalid argument # cat events/ext4/ext4_truncate_exit/filter ((dev==1)blocks==2) ^ parse_error: No error ------------[ cut here ]------------ WARNING: CPU: 2 PID: 1223 at kernel/trace/trace_events_filter.c:1640 replace_preds+0x3c5/0x990() Modules linked in: bnep lockd grace bluetooth ... CPU: 3 PID: 1223 Comm: bash Tainted: G W 4.1.0-rc3-test+ #450 Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v02.05 05/07/2012 0000000000000668 ffff8800c106bc98 ffffffff816ed4f9 ffff88011ead0cf0 0000000000000000 ffff8800c106bcd8 ffffffff8107fb07 ffffffff8136b46c ffff8800c7d81d48 ffff8800d4c2bc00 ffff8800d4d4f920 00000000ffffffea Call Trace: [<ffffffff816ed4f9>] dump_stack+0x4c/0x6e [<ffffffff8107fb07>] warn_slowpath_common+0x97/0xe0 [<ffffffff8136b46c>] ? _kstrtoull+0x2c/0x80 [<ffffffff8107fb6a>] warn_slowpath_null+0x1a/0x20 [<ffffffff81159065>] replace_preds+0x3c5/0x990 [<ffffffff811596b2>] create_filter+0x82/0xb0 [<ffffffff81159944>] apply_event_filter+0xd4/0x180 [<ffffffff81152bbf>] event_filter_write+0x8f/0x120 [<ffffffff811db2a8>] __vfs_write+0x28/0xe0 [<ffffffff811dda43>] ? __sb_start_write+0x53/0xf0 [<ffffffff812e51e0>] ? security_file_permission+0x30/0xc0 [<ffffffff811dc408>] vfs_write+0xb8/0x1b0 [<ffffffff811dc72f>] SyS_write+0x4f/0xb0 [<ffffffff816f5217>] system_call_fastpath+0x12/0x6a ---[ end trace e11028bd95818dcd ]--- Worse yet, reading the error message (the filter again) it says that there was no error, when there clearly was. The issue is that the code that checks the input does not check for balanced ops. That is, having an op between a closed parenthesis and the next token. This would only cause a warning, and fail out before doing any real harm, but it should still not caues a warning, and the error reported should work: # cd /sys/kernel/debug/tracing # echo "((dev==1)blocks==2)" > events/ext4/ext4_truncate_exit/filter -bash: echo: write error: Invalid argument # cat events/ext4/ext4_truncate_exit/filter ((dev==1)blocks==2) ^ parse_error: Meaningless filter expression And give no kernel warning. Link: http://lkml.kernel.org/r/20150615175025.7e809215@gandalf.local.home Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Ingo Molnar <mingo@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: stable@vger.kernel.org # 2.6.31+ Reported-by: Vince Weaver <vincent.weaver@maine.edu> Tested-by: Vince Weaver <vincent.weaver@maine.edu> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-06-15bpf: let kprobe programs use bpf_get_smp_processor_id() helperAlexei Starovoitov
It's useful to do per-cpu histograms. Suggested-by: Daniel Wagner <daniel.wagner@bmw-carit.de> Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-15bpf: allow networking programs to use bpf_trace_printk() for debuggingAlexei Starovoitov
bpf_trace_printk() is a helper function used to debug eBPF programs. Let socket and TC programs use it as well. Note, it's DEBUG ONLY helper. If it's used in the program, the kernel will print warning banner to make sure users don't use it in production. Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-15bpf: introduce current->pid, tgid, uid, gid, comm accessorsAlexei Starovoitov
eBPF programs attached to kprobes need to filter based on current->pid, uid and other fields, so introduce helper functions: u64 bpf_get_current_pid_tgid(void) Return: current->tgid << 32 | current->pid u64 bpf_get_current_uid_gid(void) Return: current_gid << 32 | current_uid bpf_get_current_comm(char *buf, int size_of_buf) stores current->comm into buf They can be used from the programs attached to TC as well to classify packets based on current task fields. Update tracex2 example to print histogram of write syscalls for each process instead of aggregated for all. Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-15ring_buffer: Allow to exit the ring buffer benchmark immediatelyPetr Mladek
It takes a while until the ring_buffer_benchmark module is removed when the ring buffer hammer is running. It is because it takes few seconds and kthread_should_stop() is not being checked. This patch adds the check for kthread termination into the producer. It uses the existing @kill_test flag to finish the kthreads as cleanly as possible. It disables printing the "ERROR" message when the kthread is going. It makes sure that producer does not go into the 10sec sleep when it is being killed. Finally, it does not call wait_to_die() when kthread_should_stop() already returns true. Link: http://lkml.kernel.org/r/20150615155428.GD3135@pathway.suse.cz Signed-off-by: Petr Mladek <pmladek@suse.cz> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-06-13Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
2015-06-11ring-buffer-benchmark: Fix the wrong sched_priority of producerWang Long
The producer should be used producer_fifo as its sched_priority, so correct it. Link: http://lkml.kernel.org/r/1433923957-67842-1-git-send-email-long.wanglong@huawei.com Cc: stable@vger.kernel.org # 2.6.33+ Signed-off-by: Wang Long <long.wanglong@huawei.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-06-10ring-buffer-benchmark: Fix the wrong typeWang Long
The macro 'module_param' shows that the type of the variable disable_reader and write_iteration is unsigned integer. so, we change their type form int to unsigned int. Link: http://lkml.kernel.org/r/1433923927-67782-1-git-send-email-long.wanglong@huawei.com Signed-off-by: Wang Long <long.wanglong@huawei.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-06-10ring-buffer-benchmark: Fix the wrong param in module_paramWang Long
The {producer|consumer}_{nice|fifo} parameters are integer type, we should use 'int' as the second param in module_param. For example(consumer_fifo): the default value of consumer_fifo is -1. Without this patch: # cat /sys/module/ring_buffer_benchmark/parameters/consumer_fifo 4294967295 With this patch: # cat /sys/module/ring_buffer_benchmark/parameters/consumer_fifo -1 Link: http://lkml.kernel.org/r/1433923873-67712-1-git-send-email-long.wanglong@huawei.com Signed-off-by: Wang Long <long.wanglong@huawei.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-05-31ebpf: allow bpf_ktime_get_ns_proto also for networkingDaniel Borkmann
As this is already exported from tracing side via commit d9847d310ab4 ("tracing: Allow BPF programs to call bpf_ktime_get_ns()"), we might as well want to move it to the core, so also networking users can make use of it, e.g. to measure diffs for certain flows from ingress/egress. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Alexei Starovoitov <ast@plumgrid.com> Cc: Ingo Molnar <mingo@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-29ring-buffer: Add enum names for the context levelsSteven Rostedt (Red Hat)
Instead of having hard coded numbers for the context levels, use enums to describe them more. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-05-28ring-buffer: Remove useless unused tracing_off_permanent()Steven Rostedt (Red Hat)
The tracing_off_permanent() call is a way to disable all ring_buffers. Nothing uses it and nothing should use it, as tracing_off() and friends are better, as they disable the ring buffers related to tracing. The tracing_off_permanent() even disabled non tracing ring buffers. This is a bit drastic, and was added to handle NMIs doing outputs that could corrupt the ring buffer when only tracing used them. It is now obsolete and adds a little overhead, it should be removed. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-05-28ring-buffer: Give NMIs a chance to lock the reader_lockSteven Rostedt (Red Hat)
Currently, if an NMI does a dump of a ring buffer, it disables all ring buffers from ever doing any writes again. This is because it wont take the locks for the cpu_buffer and this can cause corruption if it preempted a read, or a read happens on another CPU for the current cpu buffer. This is a bit overkill. First, it should at least try to take the lock, and if it fails then disable it. Also, there's no need to disable all ring buffers, even those that are unrelated to what is being read. Only disable the per cpu ring buffer that is being read if it can not get the lock for it. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-05-27ring-buffer: Add trace_recursive checks to ring_buffer_write()Steven Rostedt (Red Hat)
The ring_buffer_write() function isn't protected by the trace recursive writes. Luckily, this function is not used as much and is unlikely to ever recurse. But it should still have the protection, because even a call to ring_buffer_lock_reserve() could cause ring buffer corruption if called when ring_buffer_write() is being used. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-05-27ring-buffer: Allways do the trace_recursive checksSteven Rostedt (Red Hat)
Currently the trace_recursive checks are only done if CONFIG_TRACING is enabled. That was because there use to be a dependency with tracing for the recursive checks (it used the task_struct trace recursive variable). But now it uses its own variable and there is no dependency. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-05-27ring-buffer: Move recursive check to per_cpu descriptorSteven Rostedt (Red Hat)
Instead of using a global per_cpu variable to perform the recursive checks into the ring buffer, use the already existing per_cpu descriptor that is part of the ring buffer itself. Not only does this simplify the code, it also allows for one ring buffer to be used within the guts of the use of another ring buffer. For example trace_printk() can now be used within the ring buffer to record changes done by an instance into the main ring buffer. The recursion checks will prevent the trace_printk() itself from causing recursive issues with the main ring buffer (it is just ignored), but the recursive checks wont prevent the trace_printk() from recording other ring buffers. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-05-21ring-buffer: Add unlikelys to make fast path the defaultSteven Rostedt (Red Hat)
I was running the trace_event benchmark and noticed that the times to record a trace_event was all over the place. I looked at the assembly of the ring_buffer_lock_reserver() and saw this: <ring_buffer_lock_reserve>: 31 c0 xor %eax,%eax 48 83 3d 76 47 bd 00 cmpq $0x1,0xbd4776(%rip) # ffffffff81d10d60 <ring_buffer_flags> 01 55 push %rbp 48 89 e5 mov %rsp,%rbp 75 1d jne ffffffff8113c60d <ring_buffer_lock_reserve+0x2d> 65 ff 05 69 e3 ec 7e incl %gs:0x7eece369(%rip) # a960 <__preempt_count> 8b 47 08 mov 0x8(%rdi),%eax 85 c0 test %eax,%eax +---- 74 12 je ffffffff8113c610 <ring_buffer_lock_reserve+0x30> | 65 ff 0d 5b e3 ec 7e decl %gs:0x7eece35b(%rip) # a960 <__preempt_count> | 0f 84 85 00 00 00 je ffffffff8113c690 <ring_buffer_lock_reserve+0xb0> | 31 c0 xor %eax,%eax | 5d pop %rbp | c3 retq | 90 nop +---> 65 44 8b 05 48 e3 ec mov %gs:0x7eece348(%rip),%r8d # a960 <__preempt_count> 7e 41 81 e0 ff ff ff 7f and $0x7fffffff,%r8d b0 08 mov $0x8,%al 65 8b 0d 58 36 ed 7e mov %gs:0x7eed3658(%rip),%ecx # fc80 <current_context> 41 f7 c0 00 ff 1f 00 test $0x1fff00,%r8d 74 1e je ffffffff8113c64f <ring_buffer_lock_reserve+0x6f> 41 f7 c0 00 00 10 00 test $0x100000,%r8d b0 01 mov $0x1,%al 75 13 jne ffffffff8113c64f <ring_buffer_lock_reserve+0x6f> 41 81 e0 00 00 0f 00 and $0xf0000,%r8d 49 83 f8 01 cmp $0x1,%r8 19 c0 sbb %eax,%eax 83 e0 02 and $0x2,%eax 83 c0 02 add $0x2,%eax 85 c8 test %ecx,%eax 75 ab jne ffffffff8113c5fe <ring_buffer_lock_reserve+0x1e> 09 c8 or %ecx,%eax 65 89 05 24 36 ed 7e mov %eax,%gs:0x7eed3624(%rip) # fc80 <current_context> The arrow is the fast path. After adding the unlikely's, the fast path looks a bit better: <ring_buffer_lock_reserve>: 31 c0 xor %eax,%eax 48 83 3d 76 47 bd 00 cmpq $0x1,0xbd4776(%rip) # ffffffff81d10d60 <ring_buffer_flags> 01 55 push %rbp 48 89 e5 mov %rsp,%rbp 75 7b jne ffffffff8113c66b <ring_buffer_lock_reserve+0x8b> 65 ff 05 69 e3 ec 7e incl %gs:0x7eece369(%rip) # a960 <__preempt_count> 8b 47 08 mov 0x8(%rdi),%eax 85 c0 test %eax,%eax 0f 85 9f 00 00 00 jne ffffffff8113c6a1 <ring_buffer_lock_reserve+0xc1> 65 8b 0d 57 e3 ec 7e mov %gs:0x7eece357(%rip),%ecx # a960 <__preempt_count> 81 e1 ff ff ff 7f and $0x7fffffff,%ecx b0 08 mov $0x8,%al 65 8b 15 68 36 ed 7e mov %gs:0x7eed3668(%rip),%edx # fc80 <current_context> f7 c1 00 ff 1f 00 test $0x1fff00,%ecx 75 50 jne ffffffff8113c670 <ring_buffer_lock_reserve+0x90> 85 d0 test %edx,%eax 75 7d jne ffffffff8113c6a1 <ring_buffer_lock_reserve+0xc1> 09 d0 or %edx,%eax 65 89 05 53 36 ed 7e mov %eax,%gs:0x7eed3653(%rip) # fc80 <current_context> 65 8b 05 fc da ec 7e mov %gs:0x7eecdafc(%rip),%eax # a130 <cpu_number> 89 c2 mov %eax,%edx Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-05-21bpf: allow bpf programs to tail-call other bpf programsAlexei Starovoitov
introduce bpf_tail_call(ctx, &jmp_table, index) helper function which can be used from BPF programs like: int bpf_prog(struct pt_regs *ctx) { ... bpf_tail_call(ctx, &jmp_table, index); ... } that is roughly equivalent to: int bpf_prog(struct pt_regs *ctx) { ... if (jmp_table[index]) return (*jmp_table[index])(ctx); ... } The important detail that it's not a normal call, but a tail call. The kernel stack is precious, so this helper reuses the current stack frame and jumps into another BPF program without adding extra call frame. It's trivially done in interpreter and a bit trickier in JITs. In case of x64 JIT the bigger part of generated assembler prologue is common for all programs, so it is simply skipped while jumping. Other JITs can do similar prologue-skipping optimization or do stack unwind before jumping into the next program. bpf_tail_call() arguments: ctx - context pointer jmp_table - one of BPF_MAP_TYPE_PROG_ARRAY maps used as the jump table index - index in the jump table Since all BPF programs are idenitified by file descriptor, user space need to populate the jmp_table with FDs of other BPF programs. If jmp_table[index] is empty the bpf_tail_call() doesn't jump anywhere and program execution continues as normal. New BPF_MAP_TYPE_PROG_ARRAY map type is introduced so that user space can populate this jmp_table array with FDs of other bpf programs. Programs can share the same jmp_table array or use multiple jmp_tables. The chain of tail calls can form unpredictable dynamic loops therefore tail_call_cnt is used to limit the number of calls and currently is set to 32. Use cases: Acked-by: Daniel Borkmann <daniel@iogearbox.net> ========== - simplify complex programs by splitting them into a sequence of small programs - dispatch routine For tracing and future seccomp the program may be triggered on all system calls, but processing of syscall arguments will be different. It's more efficient to implement them as: int syscall_entry(struct seccomp_data *ctx) { bpf_tail_call(ctx, &syscall_jmp_table, ctx->nr /* syscall number */); ... default: process unknown syscall ... } int sys_write_event(struct seccomp_data *ctx) {...} int sys_read_event(struct seccomp_data *ctx) {...} syscall_jmp_table[__NR_write] = sys_write_event; syscall_jmp_table[__NR_read] = sys_read_event; For networking the program may call into different parsers depending on packet format, like: int packet_parser(struct __sk_buff *skb) { ... parse L2, L3 here ... __u8 ipproto = load_byte(skb, ... offsetof(struct iphdr, protocol)); bpf_tail_call(skb, &ipproto_jmp_table, ipproto); ... default: process unknown protocol ... } int parse_tcp(struct __sk_buff *skb) {...} int parse_udp(struct __sk_buff *skb) {...} ipproto_jmp_table[IPPROTO_TCP] = parse_tcp; ipproto_jmp_table[IPPROTO_UDP] = parse_udp; - for TC use case, bpf_tail_call() allows to implement reclassify-like logic - bpf_map_update_elem/delete calls into BPF_MAP_TYPE_PROG_ARRAY jump table are atomic, so user space can build chains of BPF programs on the fly Implementation details: ======================= - high performance of bpf_tail_call() is the goal. It could have been implemented without JIT changes as a wrapper on top of BPF_PROG_RUN() macro, but with two downsides: . all programs would have to pay performance penalty for this feature and tail call itself would be slower, since mandatory stack unwind, return, stack allocate would be done for every tailcall. . tailcall would be limited to programs running preempt_disabled, since generic 'void *ctx' doesn't have room for 'tail_call_cnt' and it would need to be either global per_cpu variable accessed by helper and by wrapper or global variable protected by locks. In this implementation x64 JIT bypasses stack unwind and jumps into the callee program after prologue. - bpf_prog_array_compatible() ensures that prog_type of callee and caller are the same and JITed/non-JITed flag is the same, since calling JITed program from non-JITed is invalid, since stack frames are different. Similarly calling kprobe type program from socket type program is invalid. - jump table is implemented as BPF_MAP_TYPE_PROG_ARRAY to reuse 'map' abstraction, its user space API and all of verifier logic. It's in the existing arraymap.c file, since several functions are shared with regular array map. Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-13tracing: Rename ftrace_raw_##call event structures to trace_event_raw_##callSteven Rostedt (Red Hat)
The name "ftrace" really refers to the function hook infrastructure. It is not about the trace_events. The ftrace_raw_##call structures are built by macros for trace events. They have nothing to do with function tracing. Rename them. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-05-13tracing: Rename ftrace_trigger_soft_disabled() to trace_trigger_soft_disabled()Steven Rostedt (Red Hat)
The name "ftrace" really refers to the function hook infrastructure. It is not about the trace_events. The ftrace_trigger_soft_disabled() tests if a trace_event is soft disabled (called but not traced), and returns true if it is. It has nothing to do with function tracing and should be renamed. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-05-13tracing: Rename FTRACE_EVENT_FL_* flags to EVENT_FILE_FL_*Steven Rostedt (Red Hat)
The name "ftrace" really refers to the function hook infrastructure. It is not about the trace_events. The FTRACE_EVENT_FL_* flags are flags to do with the trace_event files in the tracefs directory. They are not related to function tracing. Rename them to a more descriptive name. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-05-13tracing: Rename struct ftrace_subsystem_dir to trace_subsystem_dirSteven Rostedt (Red Hat)
The name "ftrace" really refers to the function hook infrastructure. It is not about the trace_events. The structure ftrace_subsystem_dir holds the information about trace event subsystems. It should not be named ftrace, rename it to trace_subsystem_dir. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-05-13tracing: Rename ftrace_event_name() to trace_event_name()Steven Rostedt (Red Hat)
The name "ftrace" really refers to the function hook infrastructure. It is not about the trace_events. ftrace_event_name() returns the name of an event tracepoint, has nothing to do with function tracing. Rename it to trace_event_name(). Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-05-13tracing: Rename FTRACE_MAX_EVENT to TRACE_EVENT_TYPE_MAXSteven Rostedt (Red Hat)
The name "ftrace" really refers to the function hook infrastructure. It is not about the trace_events. Rename the max trace_event type size to something more descriptive and appropriate. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-05-13tracing: Rename ftrace_output functions to trace_outputSteven Rostedt (Red Hat)
The name "ftrace" really refers to the function hook infrastructure. It is not about the trace_events. The ftrace_output_*() and ftrace_raw_output_*() functions represent the trace_event code. Rename them to just trace_output or trace_raw_output. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-05-13tracing: Rename ftrace_event_buffer to trace_event_buffer.Steven Rostedt (Red Hat)
The name "ftrace" really refers to the function hook infrastructure. It is not about the trace_events. The ftrace_event_buffer functions and data structures are for trace_events and not for function hooks. Rename them to trace_event_buffer*. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-05-13tracing: Rename ftrace_event_{call,class} to trace_event_{call,class}Steven Rostedt (Red Hat)
The name "ftrace" really refers to the function hook infrastructure. It is not about the trace_events. The structures ftrace_event_call and ftrace_event_class have nothing to do with the function hooks, and are really trace_event structures. Rename ftrace_event_* to trace_event_*. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-05-13tracing: Rename ftrace_event_file to trace_event_fileSteven Rostedt (Red Hat)
The name "ftrace" really refers to the function hook infrastructure. It is not about the trace_events. The structure ftrace_event_file is really about trace events and not "ftrace". Rename it to trace_event_file. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-05-13tracing: Rename (un)register_ftrace_event() to (un)register_trace_event()Steven Rostedt (Red Hat)
The name "ftrace" really refers to the function hook infrastructure. It is not about the trace_events. The functions (un)register_ftrace_event() is really about trace_events, and the name should be register_trace_event() instead. Also renamed ftrace_event_reg() to trace_event_reg() for the same reason. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-05-13tracing: Rename ftrace_print_*() functions ta trace_print_*()Steven Rostedt (Red Hat)
The name "ftrace" really refers to the function hook infrastructure. It is not about the trace_events. The functions ftrace_print_*() are not part of the function infrastructure, and the names can be confusing. Rename them to be trace_print_*(). Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-05-13tracing: Rename ftrace_event.h to trace_events.hSteven Rostedt (Red Hat)
The term "ftrace" is really the infrastructure of the function hooks, and not the trace events. Rename ftrace_event.h to trace_events.h to represent the trace_event infrastructure and decouple the term ftrace from it. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-05-12ftrace: Provide trace clock monotonic rawDrew Richardson
Expose the NMI safe accessor to the monotonic raw clock to the tracer. The mono clock was added with commit 1b3e5c0936046e7e023149ddc8946d21c2ea20eb. The advantage of the monotonic raw clock is that it will advance more constantly than the monotonic clock. Imagine someone is trying to optimize a particular program to reduce instructions executed for a given workload while minimizing the effect on runtime. Also suppose that NTP is running and potentially making larger adjustments to the monotonic clock. If NTP is adjusting the monotonic clock to advance more rapidly, the program will appear to use fewer instructions per second but run longer than if the monotonic raw clock had been used. The total number of instructions observed would be the same regardless of the clock source used, but how it's attributed to time would be affected. Conversely if NTP is adjusting the monotonic clock to advance more slowly, the program will appear to use more instructions per second but run more quickly. Of course there are many sources that can cause jitter in performance measurements on modern processors, but let's remove NTP from the list. The monotonic raw clock can also be useful for tracing early boot, e.g. when debugging issues with NTP. Link: http://lkml.kernel.org/r/20150508143037.GB1276@dreric01-Precision-T1650 Acked-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: John Stultz <john.stultz@linaro.org> Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Drew Richardson <drew.richardson@arm.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-05-12tracing: Export tracing clock functionsJerry Snitselaar
Critical tracepoint hooks should never call anything that takes a lock, so they are unable to call getrawmonotonic() or ktime_get(). Export the rest of the tracing clock functions so can be used in tracepoint hooks. Background: We have a customer that adds their own module and registers a tracepoint hook to sched_wakeup. They were using ktime_get() for a time source, but it grabs a seq lock and caused a deadlock to occur. Link: http://lkml.kernel.org/r/1430406624-22609-1-git-send-email-jsnitsel@redhat.com Signed-off-by: Jerry Snitselaar <jsnitsel@redhat.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-05-06tracing: Make ftrace_print_array_seq compute buf_lenAlex Bennée
The only caller to this function (__print_array) was getting it wrong by passing the array length instead of buffer length. As the element size was already being passed for other reasons it seems reasonable to push the calculation of buffer length into the function. Link: http://lkml.kernel.org/r/1430320727-14582-1-git-send-email-alex.bennee@linaro.org Signed-off-by: Alex Bennée <alex.bennee@linaro.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-04-26Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull fourth vfs update from Al Viro: "d_inode() annotations from David Howells (sat in for-next since before the beginning of merge window) + four assorted fixes" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: RCU pathwalk breakage when running into a symlink overmounting something fix I_DIO_WAKEUP definition direct-io: only inc/dec inode->i_dio_count for file systems fs/9p: fix readdir() VFS: assorted d_backing_inode() annotations VFS: fs/inode.c helpers: d_inode() annotations VFS: fs/cachefiles: d_backing_inode() annotations VFS: fs library helpers: d_inode() annotations VFS: assorted weird filesystems: d_inode() annotations VFS: normal filesystems (and lustre): d_inode() annotations VFS: security/: d_inode() annotations VFS: security/: d_backing_inode() annotations VFS: net/: d_inode() annotations VFS: net/unix: d_backing_inode() annotations VFS: kernel/: d_inode() annotations VFS: audit: d_backing_inode() annotations VFS: Fix up some ->d_inode accesses in the chelsio driver VFS: Cachefiles should perform fs modifications on the top layer only VFS: AF_UNIX sockets should call mknod on the top layer only
2015-04-22Merge tag 'trace-v4.1-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing fixes from Steven Rostedt: "This adds three fixes for the tracing code. The first is a bug when ftrace_dump_on_oops is triggered in atomic context and function graph tracer is the tracer that is being reported. The second fix is bad parsing of the trace_events from the kernel command line, where it would ignore specific events if the system name is used with defining the event(it enables all events within the system). The last one is a fix to the TRACE_DEFINE_ENUM(), where a check was missing to see if the ptr was incremented to the end of the string, but the loop increments it again and can miss the nul delimiter to stop processing" * tag 'trace-v4.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: tracing: Fix possible out of bounds memory access when parsing enums tracing: Fix incorrect enabling of trace events by boot cmdline tracing: Handle ftrace_dump() atomic context in graph_trace_open()
2015-04-17tracing: Fix possible out of bounds memory access when parsing enumsSteven Rostedt (Red Hat)
The code that replaces the enum names with the enum values in the tracepoints' format files could possible miss the end of string nul character. This was caused by processing things like backslashes, quotes and other tokens. After processing the tokens, a check for the nul character needed to be done before continuing the loop, because the loop incremented the pointer before doing the check, which could bypass the nul character. Link: http://lkml.kernel.org/r/552E661D.5060502@oracle.com Reported-by: Sasha Levin <sasha.levin@oracle.com> # via KASan Tested-by: Andrey Ryabinin <a.ryabinin@samsung.com> Fixes: 0c564a538aa9 "tracing: Add TRACE_DEFINE_ENUM() macro to map enums to their values" Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-04-16tracing: Fix incorrect enabling of trace events by boot cmdlineJoonsoo Kim
There is a problem that trace events are not properly enabled with boot cmdline. The problem is that if we pass "trace_event=kmem:mm_page_alloc" to the boot cmdline, it enables all kmem trace events, and not just the page_alloc event. This is caused by the parsing mechanism. When we parse the cmdline, the buffer contents is modified due to tokenization. And, if we use this buffer again, we will get the wrong result. Unfortunately, this buffer is be accessed three times to set trace events properly at boot time. So, we need to handle this situation. There is already code handling ",", but we need another for ":". This patch adds it. Link: http://lkml.kernel.org/r/1429159484-22977-1-git-send-email-iamjoonsoo.kim@lge.com Cc: stable@vger.kernel.org # 3.19+ Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> [ added missing return ret; ] Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-04-16tracing: Handle ftrace_dump() atomic context in graph_trace_open()Rabin Vincent
graph_trace_open() can be called in atomic context from ftrace_dump(). Use GFP_ATOMIC for the memory allocations when that's the case, in order to avoid the following splat. BUG: sleeping function called from invalid context at mm/slab.c:2849 in_atomic(): 1, irqs_disabled(): 128, pid: 0, name: swapper/0 Backtrace: .. [<8004dc94>] (__might_sleep) from [<801371f4>] (kmem_cache_alloc_trace+0x160/0x238) r7:87800040 r6:000080d0 r5:810d16e8 r4:000080d0 [<80137094>] (kmem_cache_alloc_trace) from [<800cbd60>] (graph_trace_open+0x30/0xd0) r10:00000100 r9:809171a8 r8:00008e28 r7:810d16f0 r6:00000001 r5:810d16e8 r4:810d16f0 [<800cbd30>] (graph_trace_open) from [<800c79c4>] (trace_init_global_iter+0x50/0x9c) r8:00008e28 r7:808c853c r6:00000001 r5:810d16e8 r4:810d16f0 r3:800cbd30 [<800c7974>] (trace_init_global_iter) from [<800c7aa0>] (ftrace_dump+0x90/0x2ec) r4:810d2580 r3:00000000 [<800c7a10>] (ftrace_dump) from [<80414b2c>] (sysrq_ftrace_dump+0x1c/0x20) r10:00000100 r9:809171a8 r8:808f6e7c r7:00000001 r6:00000007 r5:0000007a r4:808d5394 [<80414b10>] (sysrq_ftrace_dump) from [<800169b8>] (return_to_handler+0x0/0x18) [<80415498>] (__handle_sysrq) from [<800169b8>] (return_to_handler+0x0/0x18) r8:808c8100 r7:808c8444 r6:00000101 r5:00000010 r4:84eb3210 [<80415668>] (handle_sysrq) from [<800169b8>] (return_to_handler+0x0/0x18) [<8042a760>] (pl011_int) from [<800169b8>] (return_to_handler+0x0/0x18) r10:809171bc r9:809171a8 r8:00000001 r7:00000026 r6:808c6000 r5:84f01e60 r4:8454fe00 [<8007782c>] (handle_irq_event_percpu) from [<80077b44>] (handle_irq_event+0x4c/0x6c) r10:808c7ef0 r9:87283e00 r8:00000001 r7:00000000 r6:8454fe00 r5:84f01e60 r4:84f01e00 [<80077af8>] (handle_irq_event) from [<8007aa28>] (handle_fasteoi_irq+0xf0/0x1ac) r6:808f52a4 r5:84f01e60 r4:84f01e00 r3:00000000 [<8007a938>] (handle_fasteoi_irq) from [<80076dc0>] (generic_handle_irq+0x3c/0x4c) r6:00000026 r5:00000000 r4:00000026 r3:8007a938 [<80076d84>] (generic_handle_irq) from [<80077128>] (__handle_domain_irq+0x8c/0xfc) r4:808c1e38 r3:0000002e [<8007709c>] (__handle_domain_irq) from [<800087b8>] (gic_handle_irq+0x34/0x6c) r10:80917748 r9:00000001 r8:88802100 r7:808c7ef0 r6:808c8fb0 r5:00000015 r4:8880210c r3:808c7ef0 [<80008784>] (gic_handle_irq) from [<80014044>] (__irq_svc+0x44/0x7c) Link: http://lkml.kernel.org/r/1428953721-31349-1-git-send-email-rabin@rab.in Link: http://lkml.kernel.org/r/1428957012-2319-1-git-send-email-rabin@rab.in Cc: stable@vger.kernel.org # 3.13+ Signed-off-by: Rabin Vincent <rabin@rab.in> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-04-15tracing: remove use of seq_printf return valueJoe Perches
The seq_printf return value, because it's frequently misused, will eventually be converted to void. See: commit 1f33c41c03da ("seq_file: Rename seq_overflow() to seq_has_overflowed() and make public") Miscellanea: o Remove unused return value from trace_lookup_stack Signed-off-by: Joe Perches <joe@perches.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> Cc: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-15VFS: kernel/: d_inode() annotationsDavid Howells
relayfs and tracefs are dealing with inodes of their own; those two act as filesystem drivers Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-14Merge branch 'perf-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf changes from Ingo Molnar: "Core kernel changes: - One of the more interesting features in this cycle is the ability to attach eBPF programs (user-defined, sandboxed bytecode executed by the kernel) to kprobes. This allows user-defined instrumentation on a live kernel image that can never crash, hang or interfere with the kernel negatively. (Right now it's limited to root-only, but in the future we might allow unprivileged use as well.) (Alexei Starovoitov) - Another non-trivial feature is per event clockid support: this allows, amongst other things, the selection of different clock sources for event timestamps traced via perf. This feature is sought by people who'd like to merge perf generated events with external events that were measured with different clocks: - cluster wide profiling - for system wide tracing with user-space events, - JIT profiling events etc. Matching perf tooling support is added as well, available via the -k, --clockid <clockid> parameter to perf record et al. (Peter Zijlstra) Hardware enablement kernel changes: - x86 Intel Processor Trace (PT) support: which is a hardware tracer on steroids, available on Broadwell CPUs. The hardware trace stream is directly output into the user-space ring-buffer, using the 'AUX' data format extension that was added to the perf core to support hardware constraints such as the necessity to have the tracing buffer physically contiguous. This patch-set was developed for two years and this is the result. A simple way to make use of this is to use BTS tracing, the PT driver emulates BTS output - available via the 'intel_bts' PMU. More explicit PT specific tooling support is in the works as well - will probably be ready by 4.2. (Alexander Shishkin, Peter Zijlstra) - x86 Intel Cache QoS Monitoring (CQM) support: this is a hardware feature of Intel Xeon CPUs that allows the measurement and allocation/partitioning of caches to individual workloads. These kernel changes expose the measurement side as a new PMU driver, which exposes various QoS related PMU events. (The partitioning change is work in progress and is planned to be merged as a cgroup extension.) (Matt Fleming, Peter Zijlstra; CPU feature detection by Peter P Waskiewicz Jr) - x86 Intel Haswell LBR call stack support: this is a new Haswell feature that allows the hardware recording of call chains, plus tooling support. To activate this feature you have to enable it via the new 'lbr' call-graph recording option: perf record --call-graph lbr perf report or: perf top --call-graph lbr This hardware feature is a lot faster than stack walk or dwarf based unwinding, but has some limitations: - It reuses the current LBR facility, so LBR call stack and branch record can not be enabled at the same time. - It is only available for user-space callchains. (Yan, Zheng) - x86 Intel Broadwell CPU support and various event constraints and event table fixes for earlier models. (Andi Kleen) - x86 Intel HT CPUs event scheduling workarounds. This is a complex CPU bug affecting the SNB,IVB,HSW families that results in counter value corruption. The mitigation code is automatically enabled and is transparent. (Maria Dimakopoulou, Stephane Eranian) The perf tooling side had a ton of changes in this cycle as well, so I'm only able to list the user visible changes here, in addition to the tooling changes outlined above: User visible changes affecting all tools: - Improve support of compressed kernel modules (Jiri Olsa) - Save DSO loading errno to better report errors (Arnaldo Carvalho de Melo) - Bash completion for subcommands (Yunlong Song) - Add 'I' event modifier for perf_event_attr.exclude_idle bit (Jiri Olsa) - Support missing -f to override perf.data file ownership. (Yunlong Song) - Show the first event with an invalid filter (David Ahern, Arnaldo Carvalho de Melo) User visible changes in individual tools: 'perf data': New tool for converting perf.data to other formats, initially for the CTF (Common Trace Format) from LTTng (Jiri Olsa, Sebastian Siewior) 'perf diff': Add --kallsyms option (David Ahern) 'perf list': Allow listing events with 'tracepoint' prefix (Yunlong Song) Sort the output of the command (Yunlong Song) 'perf kmem': Respect -i option (Jiri Olsa) Print big numbers using thousands' group (Namhyung Kim) Allow -v option (Namhyung Kim) Fix alignment of slab result table (Namhyung Kim) 'perf probe': Support multiple probes on different binaries on the same command line (Masami Hiramatsu) Support unnamed union/structure members data collection. (Masami Hiramatsu) Check kprobes blacklist when adding new events. (Masami Hiramatsu) 'perf record': Teach 'perf record' about perf_event_attr.clockid (Peter Zijlstra) Support recording running/enabled time (Andi Kleen) 'perf sched': Improve the performance of 'perf sched replay' on high CPU core count machines (Yunlong Song) 'perf report' and 'perf top': Allow annotating entries in callchains in the hists browser (Arnaldo Carvalho de Melo) Indicate which callchain entries are annotated in the TUI hists browser (Arnaldo Carvalho de Melo) Add pid/tid filtering to 'report' and 'script' commands (David Ahern) Consider PERF_RECORD_ events with cpumode == 0 in 'perf top', removing one cause of long term memory usage buildup, i.e. not processing PERF_RECORD_EXIT events (Arnaldo Carvalho de Melo) 'perf stat': Report unsupported events properly (Suzuki K. Poulose) Output running time and run/enabled ratio in CSV mode (Andi Kleen) 'perf trace': Handle legacy syscalls tracepoints (David Ahern, Arnaldo Carvalho de Melo) Only insert blank duration bracket when tracing syscalls (Arnaldo Carvalho de Melo) Filter out the trace pid when no threads are specified (Arnaldo Carvalho de Melo) Dump stack on segfaults (Arnaldo Carvalho de Melo) No need to explicitely enable evsels for workload started from perf, let it be enabled via perf_event_attr.enable_on_exec, removing some events that take place in the 'perf trace' before a workload is really started by it. (Arnaldo Carvalho de Melo) Allow mixing with tracepoints and suppressing plain syscalls. (Arnaldo Carvalho de Melo) There's also been a ton of infrastructure work done, such as the split-out of perf's build system into tools/build/ and other changes - see the shortlog and changelog for details" * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (358 commits) perf/x86/intel/pt: Clean up the control flow in pt_pmu_hw_init() perf evlist: Fix type for references to data_head/tail perf probe: Check the orphaned -x option perf probe: Support multiple probes on different binaries perf buildid-list: Fix segfault when show DSOs with hits perf tools: Fix cross-endian analysis perf tools: Fix error path to do closedir() when synthesizing threads perf tools: Fix synthesizing fork_event.ppid for non-main thread perf tools: Add 'I' event modifier for exclude_idle bit perf report: Don't call map__kmap if map is NULL. perf tests: Fix attr tests perf probe: Fix ARM 32 building error perf tools: Merge all perf_event_attr print functions perf record: Add clockid parameter perf sched replay: Use replay_repeat to calculate the runavg of cpu usage instead of the default value 10 perf sched replay: Support using -f to override perf.data file ownership perf sched replay: Fix the EMFILE error caused by the limitation of the maximum open files perf sched replay: Handle the dead halt of sem_wait when create_tasks() fails for any task perf sched replay: Fix the segmentation fault problem caused by pr_err in threads perf sched replay: Realloc the memory of pid_to_task stepwise to adapt to the different pid_max configurations ...