path: root/kernel/trace
2020-04-14  tracing: Fix the race between registering 'snapshot' event trigger and triggering 'snapshot' operation  [Xiao Yang]
A traced event can trigger the 'snapshot' operation (i.e. call snapshot_trigger() or snapshot_count_trigger()) once register_snapshot_trigger() has completed registration but before it has allocated the buffer for the 'snapshot' event trigger. In that rare case, the 'snapshot' operation detects the lack of an allocated buffer, so make register_snapshot_trigger() allocate the buffer first.

trigger-snapshot.tc in kselftest reproduces the issue on a slow vm:
-----------------------------------------------------------
cat trace
...
ftracetest-3028  [002] ....   236.784290: sched_process_fork: comm=ftracetest pid=3028 child_comm=ftracetest child_pid=3036
     <...>-2875  [003] ....   240.460335: tracing_snapshot_instance_cond: *** SNAPSHOT NOT ALLOCATED ***
     <...>-2875  [003] ....   240.460338: tracing_snapshot_instance_cond: *** stopping trace here! ***
-----------------------------------------------------------

Link: http://lkml.kernel.org/r/20200414015145.66236-1-yangx.jy@cn.fujitsu.com
Cc: stable@vger.kernel.org
Fixes: 93e31ffbf417a ("tracing: Add 'snapshot' event trigger command")
Signed-off-by: Xiao Yang <yangx.jy@cn.fujitsu.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
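The ordering fix reads naturally as code. A minimal sketch, assuming tracing_alloc_snapshot_instance() is the allocation step (the helper name comes from the tracing code, not from this log):

```c
/* Sketch: allocate the snapshot buffer *before* the trigger becomes
 * visible, so a traced event can never fire it while the buffer is
 * still missing. */
static int register_snapshot_trigger(char *glob,
				     struct event_trigger_ops *ops,
				     struct event_trigger_data *data,
				     struct trace_event_file *file)
{
	int ret = tracing_alloc_snapshot_instance(file->tr);

	if (ret < 0)
		return ret;

	/* Only now make the trigger reachable by traced events. */
	return register_trigger(glob, ops, data, file);
}
```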
2020-04-05  Merge tag 'trace-v5.7' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace  [Linus Torvalds]
Pull tracing updates from Steven Rostedt:
 "New tracing features:

  - The ring buffer is no longer disabled when reading the trace file.
    The trace_pipe file was made to be used for live tracing and reading
    as it acted like the normal producer/consumer. As the trace file
    would not consume the data, the easy way of handling it was to just
    disable writes to the ring buffer. This came as a surprise to the
    BPF folks, who complained about lost events due to reading. This is
    no longer an issue. If someone wants to keep the old disabling,
    there's a new option "pause-on-trace" that can be set.

  - New set_ftrace_notrace_pid file. PIDs in this file will not be
    traced by the function tracer. Similar to set_ftrace_pid, which
    makes the function tracer only trace those tasks with PIDs in the
    file, the set_ftrace_notrace_pid does the reverse.

  - New set_event_notrace_pid file. PIDs in this file will cause events
    not to be traced if triggered by a task with a matching PID. Similar
    to the set_event_pid file but will not be traced. Note, sched_waking
    and sched_switch events may still be traced if one of the tasks
    referenced by those events contains a PID that is allowed to be
    traced.

 Tracing related features:

  - New bootconfig option, that is attached to the initrd file. If
    bootconfig is on the command line, then the initrd file is searched
    looking for a bootconfig appended at the end.

  - New GPU tracepoint infrastructure to help the gfx drivers to get off
    debugfs (acked by Greg Kroah-Hartman)

 And other minor updates and fixes"

* tag 'trace-v5.7' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (27 commits)
  tracing: Do not allocate buffer in trace_find_next_entry() in atomic
  tracing: Add documentation on set_ftrace_notrace_pid and set_event_notrace_pid
  selftests/ftrace: Add test to test new set_event_notrace_pid file
  selftests/ftrace: Add test to test new set_ftrace_notrace_pid file
  tracing: Create set_event_notrace_pid to not trace tasks
  ftrace: Create set_ftrace_notrace_pid to not trace tasks
  ftrace: Make function trace pid filtering a bit more exact
  ftrace/kprobe: Show the maxactive number on kprobe_events
  tracing: Have the document reflect that the trace file keeps tracing enabled
  ring-buffer/tracing: Have iterator acknowledge dropped events
  tracing: Do not disable tracing when reading the trace file
  ring-buffer: Do not disable recording when there is an iterator
  ring-buffer: Make resize disable per cpu buffer instead of total buffer
  ring-buffer: Optimize rb_iter_head_event()
  ring-buffer: Do not die if rb_iter_peek() fails more than thrice
  ring-buffer: Have rb_iter_head_event() handle concurrent writer
  ring-buffer: Add page_stamp to iterator for synchronization
  ring-buffer: Rename ring_buffer_read() to read_buffer_iter_advance()
  ring-buffer: Have ring_buffer_empty() not depend on tracing stopped
  tracing: Save off entry when peeking at next entry
  ...
2020-04-03  tracing: Do not allocate buffer in trace_find_next_entry() in atomic  [Steven Rostedt (VMware)]
When dumping out the trace data in latency format, a check is made to peek at the next event to compare its timestamp to the current one, and if the delta is of a greater size, it will add a marker showing so. But to do this, it needs to save the current event, otherwise peeking at the next event will remove the current event. To save the event, a temp buffer is used, and if the event is bigger than the temp buffer, the temp buffer is freed and a bigger buffer is allocated.

This allocation is a problem when called in atomic context. The only way this gets called in atomic context is via ftrace_dump(). Thus, use a static buffer of 128 bytes (which covers most events), and if the event is bigger than that, simply return NULL. The callers of trace_find_next_entry() need to handle a NULL case, as that's what would happen if the allocation failed.

Link: https://lore.kernel.org/r/20200326091256.GR11705@shao2-debian
Fixes: ff895103a84ab ("tracing: Save off entry when peeking at next entry")
Reported-by: kernel test robot <rong.a.chen@intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
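A hedged sketch of the fallback path described above; the buffer name and iterator fields are illustrative, not quoted from the patch:

```c
/* 128 bytes covers the vast majority of trace events. */
#define STATIC_TEMP_BUF_SIZE	128
static char static_temp_buf[STATIC_TEMP_BUF_SIZE];

/* Inside trace_find_next_entry(), when running in atomic context
 * (e.g. from ftrace_dump()) no allocation is allowed: */
	if (iter->ent_size > STATIC_TEMP_BUF_SIZE)
		return NULL;	/* callers already cope with NULL */

	/* Save the current event so peeking at the next one
	 * cannot clobber it. */
	memcpy(static_temp_buf, iter->ent, iter->ent_size);
	iter->temp = static_temp_buf;
```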
2020-03-30  Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next  [David S. Miller]
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-30  bpf: Introduce BPF_PROG_TYPE_LSM  [KP Singh]
Introduce types and configs for bpf programs that can be attached to LSM hooks. The programs can be enabled by the config option CONFIG_BPF_LSM.

Signed-off-by: KP Singh <kpsingh@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Brendan Jackman <jackmanb@google.com>
Reviewed-by: Florent Revest <revest@google.com>
Reviewed-by: Thomas Garnier <thgarnie@google.com>
Acked-by: Yonghong Song <yhs@fb.com>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: James Morris <jamorris@linux.microsoft.com>
Link: https://lore.kernel.org/bpf/20200329004356.27286-2-kpsingh@chromium.org
2020-03-27  tracing: Create set_event_notrace_pid to not trace tasks  [Steven Rostedt (VMware)]
There's currently a way to select a task that should only have its events traced, but there's no way to select a task not to have its events traced. Add a set_event_notrace_pid file that acts the same as set_event_pid (and is also affected by event-fork), but the task pids in this file will not be traced even if they are listed in the set_event_pid file. This makes it easy for tools like trace-cmd to "hide" itself from being traced by events when it is recording other tasks.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
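As a usage sketch (not from the commit): a recording tool can write its own PID into the new file so its events stay hidden; the tracefs mount point /sys/kernel/tracing is an assumption about the local setup.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	/* Path assumes tracefs is mounted at /sys/kernel/tracing. */
	int fd = open("/sys/kernel/tracing/set_event_notrace_pid", O_WRONLY);
	char buf[16];

	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* Events triggered by this PID will no longer be recorded. */
	snprintf(buf, sizeof(buf), "%d", getpid());
	if (write(fd, buf, strlen(buf)) < 0)
		perror("write");
	close(fd);
	return 0;
}
```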
2020-03-27  ftrace: Create set_ftrace_notrace_pid to not trace tasks  [Steven Rostedt (VMware)]
There's currently a way to select a task that should be the only one traced by the function tracer, but there's no way to select a task not to be traced by it. Add a set_ftrace_notrace_pid file that acts the same as set_ftrace_pid (and is also affected by function-fork), but the task pids in this file will not be traced even if they are listed in the set_ftrace_pid file. This makes it easy for tools like trace-cmd to "hide" itself from the function tracer when it is recording other tasks.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-03-27  ftrace: Make function trace pid filtering a bit more exact  [Steven Rostedt (VMware)]
The set_ftrace_pid file is used to filter function tracing to only trace tasks that are listed in that file. Instead of testing the pids listed in that file (it's a bitmask) at each function trace event, the logic is done via a sched_switch hook. A flag is set when the next task to run is in the list of pids in the set_ftrace_pid file. But the sched_switch hook is not at the exact location of when the task switches, and the flag gets set before the task to be traced actually runs. This leaves a residue of traced functions that do not belong to the pid that should be filtered on.

Change the logic slightly: instead of keeping a boolean flag to test, record the pid that should be traced, with special values for "never trace" and "always trace". Then at each function call, check whether the function should be ignored, or whether the current pid matches the recorded one, and only trace on a match (or when the special "always trace" value is set). A sketch of the resulting check follows below.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
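A hedged sketch of the per-function check; the constant names and the per-cpu field mirror the description above but are not quoted from the patch:

```c
#define FTRACE_PID_IGNORE	-1	/* do not trace on this CPU */
#define FTRACE_PID_TRACE	-2	/* trace everything on this CPU */

static void ftrace_pid_func(unsigned long ip, unsigned long parent_ip,
			    struct ftrace_ops *op, struct pt_regs *regs)
{
	struct trace_array *tr = op->private;
	int pid = this_cpu_read(tr->array_buffer.data->ftrace_ignore_pid);

	if (pid == FTRACE_PID_IGNORE)
		return;
	/* The more exact check: trace only if the recorded pid is the
	 * *current* task's pid, or the "trace all" value is set. */
	if (pid != FTRACE_PID_TRACE && pid != current->pid)
		return;

	op->saved_func(ip, parent_ip, op, regs);
}
```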
2020-03-27  ftrace/kprobe: Show the maxactive number on kprobe_events  [Masami Hiramatsu]
Show the maxactive parameter on kprobe_events. This allows the user to save the current configuration and restore it without losing the maxactive parameter.

Link: http://lkml.kernel.org/r/4762764a-6df7-bc93-ed60-e336146dce1f@gmail.com
Link: http://lkml.kernel.org/r/158503528846.22706.5549974121212526020.stgit@devnote2
Cc: stable@vger.kernel.org
Fixes: 696ced4fb1d76 ("tracing/kprobes: expose maxactive for kretprobe in kprobe_events")
Reported-by: Taeung Song <treeze.taeung@gmail.com>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-03-27  ring-buffer/tracing: Have iterator acknowledge dropped events  [Steven Rostedt (VMware)]
Have the ring_buffer_iterator set a flag if events were dropped as it goes to peek at the next event. Have the trace file display this fact, if it happened, with a "LOST EVENTS" message.

Link: http://lkml.kernel.org/r/20200317213417.045858900@goodmis.org
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-03-27  tracing: Do not disable tracing when reading the trace file  [Steven Rostedt (VMware)]
When opening the "trace" file, it is no longer necessary to disable tracing. Note, a new option called "pause-on-trace" is created; when set, it causes the trace file to emulate its original behavior.

Link: http://lkml.kernel.org/r/20200317213416.903351225@goodmis.org
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-03-27  ring-buffer: Do not disable recording when there is an iterator  [Steven Rostedt (VMware)]
Now that the iterator can handle a concurrent writer, do not disable writing to the ring buffer when there is an iterator present.

Link: http://lkml.kernel.org/r/20200317213416.759770696@goodmis.org
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-03-27  ring-buffer: Make resize disable per cpu buffer instead of total buffer  [Steven Rostedt (VMware)]
Even when the ring buffer is writable while the trace file is being read, it must still not be resized. But since tracers can be activated while the trace file is being read, the irqsoff tracer can modify the per CPU buffers, and this can cause the reader of the trace file to update the wrong buffer's resize-disable bit, as the irqsoff tracer swaps out cpu buffers.

By making the resize-disable bit per cpu_buffer, the update follows the cpu_buffer even if it's swapped out with the snapshot buffer, and the release of the trace file modifies the same data as the open did.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
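A rough sketch of the data-structure move, with field and type names assumed from the ring buffer code; the point is only that the counter now travels with the cpu_buffer when it is swapped:

```c
struct ring_buffer_per_cpu {
	int		cpu;
	atomic_t	record_disabled;
	/* Moved here from the top-level buffer struct, so the disable
	 * count follows this buffer through swaps with the snapshot
	 * buffer. */
	atomic_t	resize_disabled;
	/* ... */
};

/* Readers of the trace file now pin resizing per cpu_buffer: */
atomic_inc(&cpu_buffer->resize_disabled);
```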
2020-03-25  Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net  [David S. Miller]
Overlapping header include additions in macsec.c.

A bug fix in 'net' overlapping with the removal of the 'version' string in ena_netdev.c.

Overlapping test additions in the selftests Makefile.

Overlapping PCI ID table adjustments in the iwlwifi driver.

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-19  ring-buffer: Optimize rb_iter_head_event()  [Steven Rostedt (VMware)]
As it is fine to perform several "peeks" of event data in the ring buffer via the iterator before moving it forward, do not re-read the event, just return what was read before. Otherwise, it can cause inconsistent results, especially when testing multiple CPU buffers to interleave them.

Link: http://lkml.kernel.org/r/20200317213416.592032170@goodmis.org
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-03-19  ring-buffer: Do not die if rb_iter_peek() fails more than thrice  [Steven Rostedt (VMware)]
The iterator now reads a live buffer: if the event being read is on a page that a writer crosses, the read fails and is retried. This means the condition in rb_iter_peek() that allows only three retries is no longer valid. Allow rb_iter_peek() to retry more than three times without killing the ring buffer, but only if rb_iter_head_event() has failed at least once.

Link: http://lkml.kernel.org/r/20200317213416.452888193@goodmis.org
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-03-19  ring-buffer: Have rb_iter_head_event() handle concurrent writer  [Steven Rostedt (VMware)]
Have the ring_buffer_iter structure keep a place to store an event, such that it cannot be overwritten by a writer, and load it via rb_iter_head_event() in such a way that it returns NULL and resets the iter to the start of the current page if a writer updated the page.

Link: http://lkml.kernel.org/r/20200317213416.306959216@goodmis.org
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
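A hedged sketch of the copy-then-verify read, combining this change with the page_stamp added in the commit below; helper names mirror the ring buffer code but the details are simplified:

```c
static struct ring_buffer_event *
rb_iter_head_event(struct ring_buffer_iter *iter)
{
	struct ring_buffer_event *event =
		__rb_page_index(iter->head_page, iter->head);

	/* Copy the event into iterator-private storage first ... */
	memcpy(iter->event, event, rb_event_length(event));
	/* ... then verify no writer entered the page in the meantime. */
	smp_rmb();
	if (iter->page_stamp != iter->head_page->page->time_stamp) {
		/* A writer moved in: restart at the top of the page and
		 * let the caller (rb_iter_peek()) retry. */
		iter->head = 0;
		return NULL;
	}
	return iter->event;	/* safe: writers can't touch this copy */
}
```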
2020-03-19  ring-buffer: Add page_stamp to iterator for synchronization  [Steven Rostedt (VMware)]
Have the ring_buffer_iter structure contain a page_stamp, such that it can be used to see if the writer entered the page the iterator is on. When going to a new page, the iterator will record the time stamp of that page. When reading events, it can copy the event to an internal buffer on the iterator (to be implemented later), then check the page's time stamp with its own to see if the writer entered the page. If so, it will need to try to read the event again.

Link: http://lkml.kernel.org/r/20200317213416.163549674@goodmis.org
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-03-19  ring-buffer: Rename ring_buffer_read() to read_buffer_iter_advance()  [Steven Rostedt (VMware)]
When the ring buffer was first created, the iterator followed the normal producer/consumer operations where it had both a peek() operation, that just returned the event at the current location, and a read(), that would return the event at the current location and also increment the iterator such that the next peek() or read() will return the next event.

The only use of ring_buffer_read() is currently to move the iterator to the next location, and nothing actually reads the event it returns. Rename this function after its actual use case, ring_buffer_iter_advance(), which also adds the "iter" part to the name, making it more meaningful. As the timestamp returned by ring_buffer_read() was never used, there's no reason the new version should bother returning it; it also becomes a void function.

Link: http://lkml.kernel.org/r/20200317213416.018928618@goodmis.org
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-03-19  ring-buffer: Have ring_buffer_empty() not depend on tracing stopped  [Steven Rostedt (VMware)]
It was reported that when the trace file is read, tracing is disabled, because the iterator expects that the buffer it reads is not being updated. Several steps are needed to make the iterator handle a writer, by testing if things have changed as it reads. This step makes ring_buffer_empty() expect the buffer to be changing. Note, if the current location of the iterator is overwritten, it will return false, as new data is being added. Note that this means data will be skipped.

Link: http://lkml.kernel.org/r/20200317213415.870741809@goodmis.org
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-03-19  tracing: Save off entry when peeking at next entry  [Steven Rostedt (VMware)]
In order to have the iterator read the buffer even while it's being updated, the ring buffer iterator must save each event in a separate location outside the ring buffer, such that its use is immutable.

There's one use case that saves off the event returned from the ring buffer iterator and calls it again to look at the next event, before going back to use the first event. As the ring buffer iterator will only have a single copy, this use case is no longer supported. Instead, have the one use case create its own buffer to store the first event when looking at the next event. This way, when looking at the first event again, it won't have been corrupted by the second read.

Link: http://lkml.kernel.org/r/20200317213415.722539921@goodmis.org
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-03-19  tracing: Use address-of operator on section symbols  [Nathan Chancellor]
Clang warns:

../kernel/trace/trace.c:9335:33: warning: array comparison always
evaluates to true [-Wtautological-compare]
        if (__stop___trace_bprintk_fmt != __start___trace_bprintk_fmt)
                                       ^
1 warning generated.

These are not true arrays; they are linker-defined symbols, which are just addresses. Using the address-of operator silences the warning and does not change the runtime result of the check (tested with some print statements compiled in with clang + ld.lld and gcc + ld.bfd in QEMU).

Link: http://lkml.kernel.org/r/20200220051011.26113-1-natechancellor@gmail.com
Link: https://github.com/ClangBuiltLinux/linux/issues/893
Suggested-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
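One plausible spelling of the change (the exact form of the address-of expression is an assumption; only the before-comparison is quoted by the warning above):

```c
extern const char __start___trace_bprintk_fmt[];
extern const char __stop___trace_bprintk_fmt[];

/* Before: clang treats the two section symbols as distinct arrays and
 * folds the comparison to 'true'. */
if (__stop___trace_bprintk_fmt != __start___trace_bprintk_fmt)
	/* ... */;

/* After: taking the address makes it an ordinary pointer comparison,
 * with identical generated code. */
if (&__stop___trace_bprintk_fmt[0] != &__start___trace_bprintk_fmt[0])
	/* ... */;
```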
2020-03-13  Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next  [David S. Miller]
Daniel Borkmann says:

====================
pull-request: bpf-next 2020-03-13

The following pull-request contains BPF updates for your *net-next* tree.

We've added 86 non-merge commits during the last 12 day(s) which contain a total of 107 files changed, 5771 insertions(+), 1700 deletions(-).

The main changes are:

1) Add modify_return attach type, which allows attaching to a function via BPF trampoline, runs after the fentry and before the fexit programs, and can pass a return code to the original caller, from KP Singh.

2) Generalize BPF's kallsyms handling and add BPF trampoline and dispatcher objects to be visible in /proc/kallsyms so they can be annotated in stack traces, from Jiri Olsa.

3) Extend BPF sockmap to allow for UDP next to the existing TCP support, in order to enable BPF based socket dispatch, from Lorenz Bauer.

4) Introduce a new bpftool 'prog profile' command which attaches to existing BPF programs via fentry and fexit hooks and reads out hardware counters during that period, from Song Liu. Example usage:

   bpftool prog profile id 337 duration 3 cycles instructions llc_misses

   4228 run_cnt
   3403698 cycles        (84.08%)
   3525294 instructions  #  1.04 insn per cycle  (84.05%)
   13 llc_misses         #  3.69 LLC misses per million isns  (83.50%)

5) Batch of improvements to libbpf, bpftool and BPF selftests. Also addition of a new bpf_link abstraction to keep in particular BPF tracing programs attached even when the application owning them exits, from Andrii Nakryiko.

6) New bpf_get_ns_current_pid_tgid() helper for tracing to perform PID filtering, returning the PID as seen inside a given pid namespace, from Carlos Neira.

7) Refactor of RISC-V JIT code to move out common pieces and addition of a new RV32G BPF JIT compiler, from Luke Nelson.

8) Add gso_size context member to __sk_buff in order to be able to know whether a given skb is GSO or not, from Willem de Bruijn.

9) Add a new bpf_xdp_output() helper which reuses XDP's existing perf RB output implementation but can be called from tracepoint programs, from Eelco Chaudron.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-13  Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf  [David S. Miller]
Alexei Starovoitov says:

====================
pull-request: bpf 2020-03-12

The following pull-request contains BPF updates for your *net* tree.

We've added 12 non-merge commits during the last 8 day(s) which contain a total of 12 files changed, 161 insertions(+), 15 deletions(-).

The main changes are:

1) Andrii fixed two bugs in cgroup-bpf.
2) John fixed sockmap.
3) Luke fixed x32 jit.
4) Martin fixed two issues in struct_ops.
5) Yonghong fixed bpf_send_signal.
6) Yoshiki fixed BTF enum.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-12  Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net  [David S. Miller]
Minor overlapping changes, nothing serious. Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-12  bpf: Add bpf_xdp_output() helper  [Eelco Chaudron]
Introduce a new helper that reuses the existing xdp perf_event output implementation, but can be called from raw_tracepoint programs that receive a 'struct xdp_buff *' as a tracepoint argument.

Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/158348514556.2239.11050972434793741444.stgit@xdp-tutorial
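A hedged example of calling the new helper from a raw_tracepoint program; the attach point name and the map/metadata layout are illustrative (modeled on bpf_perf_event_output() conventions), not taken from this log:

```c
// SPDX-License-Identifier: GPL-2.0
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
	__uint(key_size, sizeof(int));
	__uint(value_size, sizeof(int));
} perf_buf_map SEC(".maps");

struct meta {
	__u32 ifindex;
	__u32 pkt_len;
};

/* Attach to any raw tracepoint that receives a 'struct xdp_buff *' as
 * its first argument; the name below is a placeholder. */
SEC("raw_tracepoint/some_xdp_tracepoint")
int trace_xdp(struct bpf_raw_tracepoint_args *ctx)
{
	void *xdp = (void *)ctx->args[0];
	struct meta meta = { .ifindex = 0, .pkt_len = 64 };

	/* As with bpf_perf_event_output(), the upper 32 bits of the
	 * flags select how many packet bytes to append. */
	bpf_xdp_output(xdp, &perf_buf_map,
		       ((__u64)meta.pkt_len << 32) | BPF_F_CURRENT_CPU,
		       &meta, sizeof(meta));
	return 0;
}

char _license[] SEC("license") = "GPL";
```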
2020-03-12  bpf: Added new helper bpf_get_ns_current_pid_tgid  [Carlos Neira]
New bpf helper bpf_get_ns_current_pid_tgid. This helper returns the pid and tgid of the current task in the pid namespace matching the dev_t and inode number provided. This allows us to instrument a process inside a container.

Signed-off-by: Carlos Neira <cneirabustos@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20200304204157.58695-3-cneirabustos@gmail.com
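A hedged usage sketch: userspace stat()s /proc/<pid>/ns/pid to obtain the namespace's dev/inode pair and passes it in; the attach point and global names are illustrative.

```c
// SPDX-License-Identifier: GPL-2.0
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Filled in by the loader from stat("/proc/<pid>/ns/pid"):
 * st_dev and st_ino of the target pid namespace. */
const volatile __u64 pidns_dev;
const volatile __u64 pidns_ino;

SEC("tracepoint/syscalls/sys_enter_openat")
int trace_openat(void *ctx)
{
	struct bpf_pidns_info ns = {};

	/* Fails unless the current task is in the given pid namespace. */
	if (bpf_get_ns_current_pid_tgid(pidns_dev, pidns_ino,
					&ns, sizeof(ns)))
		return 0;

	bpf_printk("pid as seen in container: %u\n", ns.pid);
	return 0;
}

char _license[] SEC("license") = "GPL";
```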
2020-03-11  Merge tag 'trace-v5.6-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace  [Linus Torvalds]
Pull ftrace fix from Steven Rostedt:
 "Have ftrace lookup_rec() return a consistent record otherwise it can break live patching"

* tag 'trace-v5.6-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  ftrace: Return the first found result in lookup_rec()
2020-03-11  ftrace: Return the first found result in lookup_rec()  [Artem Savkov]
It appears that ip ranges can overlap. In that case lookup_rec() returns whatever result it got last, even if it found nothing in the last searched page. This breaks an obscure livepatch late module patching usecase:

- load livepatch
- load the patched module
- unload livepatch
- try to load livepatch again

To fix this, return from lookup_rec() as soon as it finds the record containing the searched-for ip. This is how it used to work prior to lookup_rec()'s introduction.

Link: http://lkml.kernel.org/r/20200306174317.21699-1-asavkov@redhat.com
Cc: stable@vger.kernel.org
Fixes: 7e16f581a817 ("ftrace: Separate out functionality from ftrace_location_range()")
Signed-off-by: Artem Savkov <asavkov@redhat.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
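A hedged sketch of the fixed lookup; the page layout and comparator are assumed from the dyn_ftrace code, and the essential line is the early break:

```c
static struct dyn_ftrace *lookup_rec(unsigned long start, unsigned long end)
{
	struct ftrace_page *pg;
	struct dyn_ftrace *rec = NULL;
	struct dyn_ftrace key;

	key.ip = start;
	key.flags = end;	/* flags is overloaded as the range end */

	for (pg = ftrace_pages_start; pg; pg = pg->next) {
		if (end < pg->records[0].ip ||
		    start >= (pg->records[pg->index - 1].ip +
			      MCOUNT_INSN_SIZE))
			continue;
		rec = bsearch(&key, pg->records, pg->index,
			      sizeof(struct dyn_ftrace), ftrace_cmp_recs);
		/* The fix: stop at the first page containing the record,
		 * instead of letting a later miss overwrite 'rec'. */
		if (rec)
			break;
	}
	return rec;
}
```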
2020-03-07  Merge tag 'block-5.6-2020-03-07' of git://git.kernel.dk/linux-block  [Linus Torvalds]
Pull block fixes from Jens Axboe:
 "Here are a few fixes that should go into this release. This contains:

  - Revert of a bad bcache patch from this merge window

  - Removed unused function (Daniel)

  - Fixup for the blktrace fix from Jan from this release (Cengiz)

  - Fix of deeper level bfqq overwrite in BFQ (Carlo)"

* tag 'block-5.6-2020-03-07' of git://git.kernel.dk/linux-block:
  block, bfq: fix overwrite of bfq_group pointer in bfq_find_set_group()
  blktrace: fix dereference after null check
  Revert "bcache: ignore pending signals when creating gc and allocator thread"
  block: Remove used kblockd_schedule_work_on()
2020-03-05  bpf: Fix bpf_prog_test_run_tracing for !CONFIG_NET  [KP Singh]
test_run.o is not built when CONFIG_NET is not set, and bpf_prog_test_run_tracing being referenced in bpf_trace.o causes the linker error:

ld: kernel/trace/bpf_trace.o:(.rodata+0x38): undefined reference to `bpf_prog_test_run_tracing'

Add a __weak function in bpf_trace.c to handle this.

Fixes: da00d2f117a0 ("bpf: Add test ops for BPF_PROG_TYPE_TRACING")
Signed-off-by: KP Singh <kpsingh@google.com>
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200305220127.29109-1-kpsingh@chromium.org
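The stub presumably takes the usual weak-fallback form; a sketch, with the signature assumed from the BPF test-run API:

```c
/* In kernel/trace/bpf_trace.c: resolves the reference when
 * CONFIG_NET=n; the real implementation in net/bpf/test_run.c
 * overrides this weak symbol whenever it is built. */
int __weak bpf_prog_test_run_tracing(struct bpf_prog *prog,
				     const union bpf_attr *kattr,
				     union bpf_attr __user *uattr)
{
	return -ENOTSUPP;
}
```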
2020-03-05  bpf: Fix deadlock with rq_lock in bpf_send_signal()  [Yonghong Song]
When experimenting with the bpf_send_signal() helper in our production environment (5.2 based), we experienced a deadlock in NMI mode:

   #5 [ffffc9002219f770] queued_spin_lock_slowpath at ffffffff8110be24
   #6 [ffffc9002219f770] _raw_spin_lock_irqsave at ffffffff81a43012
   #7 [ffffc9002219f780] try_to_wake_up at ffffffff810e7ecd
   #8 [ffffc9002219f7e0] signal_wake_up_state at ffffffff810c7b55
   #9 [ffffc9002219f7f0] __send_signal at ffffffff810c8602
  #10 [ffffc9002219f830] do_send_sig_info at ffffffff810ca31a
  #11 [ffffc9002219f868] bpf_send_signal at ffffffff8119d227
  #12 [ffffc9002219f988] bpf_overflow_handler at ffffffff811d4140
  #13 [ffffc9002219f9e0] __perf_event_overflow at ffffffff811d68cf
  #14 [ffffc9002219fa10] perf_swevent_overflow at ffffffff811d6a09
  #15 [ffffc9002219fa38] ___perf_sw_event at ffffffff811e0f47
  #16 [ffffc9002219fc30] __schedule at ffffffff81a3e04d
  #17 [ffffc9002219fc90] schedule at ffffffff81a3e219
  #18 [ffffc9002219fca0] futex_wait_queue_me at ffffffff8113d1b9
  #19 [ffffc9002219fcd8] futex_wait at ffffffff8113e529
  #20 [ffffc9002219fdf0] do_futex at ffffffff8113ffbc
  #21 [ffffc9002219fec0] __x64_sys_futex at ffffffff81140d1c
  #22 [ffffc9002219ff38] do_syscall_64 at ffffffff81002602
  #23 [ffffc9002219ff50] entry_SYSCALL_64_after_hwframe at ffffffff81c00068

The above call stack is very similar to an issue reported by commit eac9153f2b58 ("bpf/stackmap: Fix deadlock with rq_lock in bpf_get_stack()") by Song Liu. The only difference is the bpf_send_signal() helper instead of the bpf_get_stack() helper.

The above deadlock is triggered with a perf_sw_event. Similar to commit eac9153f2b58, the below almost identical reproducer uses the tracepoint sched/sched_switch so the issue can be easily caught.

  /* stress_test.c */
  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/mman.h>
  #include <pthread.h>
  #include <sys/types.h>
  #include <sys/stat.h>
  #include <fcntl.h>

  #define THREAD_COUNT 1000
  char *filename;

  void *worker(void *p)
  {
          void *ptr;
          int fd;
          char *pptr;

          fd = open(filename, O_RDONLY);
          if (fd < 0)
                  return NULL;
          while (1) {
                  struct timespec ts = {0, 1000 + rand() % 2000};

                  ptr = mmap(NULL, 4096 * 64, PROT_READ, MAP_PRIVATE, fd, 0);
                  usleep(1);
                  if (ptr == MAP_FAILED) {
                          printf("failed to mmap\n");
                          break;
                  }
                  munmap(ptr, 4096 * 64);
                  usleep(1);
                  pptr = malloc(1);
                  usleep(1);
                  pptr[0] = 1;
                  usleep(1);
                  free(pptr);
                  usleep(1);
                  nanosleep(&ts, NULL);
          }
          close(fd);
          return NULL;
  }

  int main(int argc, char *argv[])
  {
          void *ptr;
          int i;
          pthread_t threads[THREAD_COUNT];

          if (argc < 2)
                  return 0;

          filename = argv[1];

          for (i = 0; i < THREAD_COUNT; i++) {
                  if (pthread_create(threads + i, NULL, worker, NULL)) {
                          fprintf(stderr, "Error creating thread\n");
                          return 0;
                  }
          }

          for (i = 0; i < THREAD_COUNT; i++)
                  pthread_join(threads[i], NULL);
          return 0;
  }

and the following steps:

1. run `stress_test /bin/ls` in one window
2. hack bcc trace.py with the following change:

  --- a/tools/trace.py
  +++ b/tools/trace.py
  @@ -513,6 +513,7 @@ BPF_PERF_OUTPUT(%s);
           __data.tgid = __tgid;
           __data.pid = __pid;
           bpf_get_current_comm(&__data.comm, sizeof(__data.comm));
  +        bpf_send_signal(10);
           %s
           %s
           %s.perf_submit(%s, &__data, sizeof(__data));

3. in a different window run

  ./trace.py -p $(pidof stress_test) t:sched:sched_switch

The deadlock can be reproduced in our production system.

Similar to Song's fix, the fix is to delay sending the signal if irqs are disabled, to avoid deadlocks involving rq_lock. With this change, the above stress test in our production system no longer causes deadlock.

I also implemented a scaled-down version of the reproducer in the selftests (a subsequent commit). With the latest bpf-next, it complains about the following potential deadlock.

  [   32.832450] -> #1 (&p->pi_lock){-.-.}:
  [   32.833100]        _raw_spin_lock_irqsave+0x44/0x80
  [   32.833696]        task_rq_lock+0x2c/0xa0
  [   32.834182]        task_sched_runtime+0x59/0xd0
  [   32.834721]        thread_group_cputime+0x250/0x270
  [   32.835304]        thread_group_cputime_adjusted+0x2e/0x70
  [   32.835959]        do_task_stat+0x8a7/0xb80
  [   32.836461]        proc_single_show+0x51/0xb0
  ...
  [   32.839512] -> #0 (&(&sighand->siglock)->rlock){....}:
  [   32.840275]        __lock_acquire+0x1358/0x1a20
  [   32.840826]        lock_acquire+0xc7/0x1d0
  [   32.841309]        _raw_spin_lock_irqsave+0x44/0x80
  [   32.841916]        __lock_task_sighand+0x79/0x160
  [   32.842465]        do_send_sig_info+0x35/0x90
  [   32.842977]        bpf_send_signal+0xa/0x10
  [   32.843464]        bpf_prog_bc13ed9e4d3163e3_send_signal_tp_sched+0x465/0x1000
  [   32.844301]        trace_call_bpf+0x115/0x270
  [   32.844809]        perf_trace_run_bpf_submit+0x4a/0xc0
  [   32.845411]        perf_trace_sched_switch+0x10f/0x180
  [   32.846014]        __schedule+0x45d/0x880
  [   32.846483]        schedule+0x5f/0xd0
  ...
  [   32.853148] Chain exists of:
  [   32.853148]   &(&sighand->siglock)->rlock --> &p->pi_lock --> &rq->lock
  [   32.853148]
  [   32.854451]  Possible unsafe locking scenario:
  [   32.854451]
  [   32.855173]        CPU0                    CPU1
  [   32.855745]        ----                    ----
  [   32.856278]   lock(&rq->lock);
  [   32.856671]                                lock(&p->pi_lock);
  [   32.857332]                                lock(&rq->lock);
  [   32.857999]   lock(&(&sighand->siglock)->rlock);

Deadlock happens on CPU0 when it tries to acquire &sighand->siglock, but it is held by CPU1, while CPU1 tries to grab &rq->lock and cannot get it. This is not exactly the callstack from our production environment, but the symptom is similar: both locks are acquired with spin_lock_irqsave(), and both involve rq_lock. The fix to delay sending the signal when irqs are disabled also fixes this issue.

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Cc: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20200304191104.2796501-1-yhs@fb.com
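A hedged sketch of the fix's shape inside bpf_send_signal(): when IRQs are disabled (so rq_lock or pi_lock may be held), defer the signal to irq_work instead of sending it inline. The irq_work plumbing already existed for the NMI case; the names below are approximate.

```c
	if (irqs_disabled()) {
		struct send_signal_irq_work *work;

		/* Validate early; errors would be lost once deferred. */
		if (unlikely(!valid_signal(sig)))
			return -EINVAL;

		work = this_cpu_ptr(&send_signal_work);
		work->sig = sig;
		/* The irq_work callback sends the signal later, with
		 * IRQs enabled, so taking p->pi_lock/rq->lock is safe. */
		irq_work_queue(&work->irq_work);
		return 0;
	}

	return group_send_sig_info(sig, SEND_SIG_PRIV, current, PIDTYPE_TGID);
```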
2020-03-05  blktrace: fix dereference after null check  [Cengiz Can]
There was a recent change in blktrace.c that added a RCU protection to `q->blk_trace` in order to fix a use-after-free issue during access.

However the change missed an edge case that can lead to dereferencing of the `bt` pointer even when it's NULL. Coverity static analyzer marked this as a FORWARD_NULL issue with CID 1460458.

```
/kernel/trace/blktrace.c: 1904 in sysfs_blk_trace_attr_store()
1898            ret = 0;
1899            if (bt == NULL)
1900                    ret = blk_trace_setup_queue(q, bdev);
1901
1902            if (ret == 0) {
1903                    if (attr == &dev_attr_act_mask)
>>>     CID 1460458:  Null pointer dereferences  (FORWARD_NULL)
>>>     Dereferencing null pointer "bt".
1904                            bt->act_mask = value;
1905                    else if (attr == &dev_attr_pid)
1906                            bt->pid = value;
1907                    else if (attr == &dev_attr_start_lba)
1908                            bt->start_lba = value;
1909                    else if (attr == &dev_attr_end_lba)
```

Added a reassignment with RCU annotation to fix the issue.

Fixes: c780e86dd48 ("blktrace: Protect q->blk_trace with RCU")
Cc: stable@vger.kernel.org
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Bob Liu <bob.liu@oracle.com>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Cengiz Can <cengiz@kernel.wtf>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
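The reassignment fix, roughly: after blk_trace_setup_queue() publishes a new blk_trace, re-read the pointer under the mutex before using it. The RCU-annotated accessor matches the protection added by the Fixes'd commit; the exact line placement is an assumption.

```c
	ret = 0;
	if (bt == NULL) {
		ret = blk_trace_setup_queue(q, bdev);
		/* Re-read the pointer that setup just published, so the
		 * assignments below never dereference a stale NULL 'bt'. */
		bt = rcu_dereference_protected(q->blk_trace,
				lockdep_is_held(&q->blk_trace_mutex));
	}

	if (ret == 0) {
		if (attr == &dev_attr_act_mask)
			bt->act_mask = value;
		/* ... */
	}
```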
2020-03-04  bpf: Add test ops for BPF_PROG_TYPE_TRACING  [KP Singh]
The current fexit and fentry tests rely on a different program to exercise the functions they attach to. Instead of doing this, implement the test operations for tracing, which will also be used for BPF_MODIFY_RETURN in a subsequent patch. Also, clean up the fexit test to use the generated skeleton.

Signed-off-by: KP Singh <kpsingh@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200304191853.1529-7-kpsingh@chromium.org
2020-03-03  tracing: Remove unused TRACE_BUFFER bits  [Steven Rostedt (VMware)]
Commit 567cd4da54ff ("ring-buffer: User context bit recursion checking") added the TRACE_BUFFER bits to be used in the current task's trace_recursion field. But the final submission of the logic removed the use of those bits while never removing the bits themselves (they were never used in upstream Linux). These can be safely removed.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-03-03  tracing: Have hwlat ts be first instance and record count of instances  [Steven Rostedt (VMware)]
The hwlat tracer runs a loop of width time during a given window. It then reports the max latency over a given threshold and records a timestamp. But this timestamp is the time after the width has finished, not the time at which the latency actually triggered.

Record the actual time when the latency was greater than the threshold, as well as the number of times it was greater within a given width per window.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-02-29  Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next  [David S. Miller]
Alexei Starovoitov says:

====================
pull-request: bpf-next 2020-02-28

The following pull-request contains BPF updates for your *net-next* tree.

We've added 41 non-merge commits during the last 7 day(s) which contain a total of 49 files changed, 1383 insertions(+), 499 deletions(-).

The main changes are:

1) BPF and Real-Time nicely co-exist.
2) bpftool feature improvements.
3) retrieve bpf_sk_storage via INET_DIAG.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-28  Merge tag 'block-5.6-2020-02-28' of git://git.kernel.dk/linux-block  [Linus Torvalds]
Pull block fixes from Jens Axboe:

 - Passthrough insertion fix (Ming)
 - Kill off some unused arguments (John)
 - blktrace RCU fix (Jan)
 - Dead fields removal for null_blk (Dongli)
 - NVMe polled IO fix (Bijan)

* tag 'block-5.6-2020-02-28' of git://git.kernel.dk/linux-block:
  nvme-pci: Hold cq_poll_lock while completing CQEs
  blk-mq: Remove some unused function arguments
  null_blk: remove unused fields in 'nullb_cmd'
  blktrace: Protect q->blk_trace with RCU
  blk-mq: insert passthrough request into hctx->dispatch directly
2020-02-27  Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net  [David S. Miller]
The mptcp conflict was overlapping additions.

The SMC conflict was an addition and a removal happening at the same time.

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26  Merge tag 'trace-v5.6-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace  [Linus Torvalds]
Pull tracing and bootconfig updates:
 "Fixes and changes to bootconfig before it goes live in a release.

  Change in API of bootconfig (before it comes live in a release):
  - Have a magic value "BOOTCONFIG" in initrd to know a bootconfig exists
  - Set CONFIG_BOOT_CONFIG to 'n' by default
  - Show error if "bootconfig" is on the cmdline but not compiled in
  - Prevent redefining the same value
  - Have a way to append values
  - Added a SELECT BLK_DEV_INITRD to fix a build failure

  Synthetic event fixes:
  - Switch to raw_smp_processor_id() for recording CPU value in preempt
    section (no care for what the value actually is)
  - Fix samples always recording u64 values
  - Fix endianness
  - Check number of values matches number of fields
  - Fix a printing bug

  Fix of trace_printk() breaking postponed start up tests

  Make a function static that is only used in a single file"

* tag 'trace-v5.6-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  bootconfig: Fix CONFIG_BOOTTIME_TRACING dependency issue
  bootconfig: Add append value operator support
  bootconfig: Prohibit re-defining value on same key
  bootconfig: Print array as multiple commands for legacy command line
  bootconfig: Reject subkey and value on same parent key
  tools/bootconfig: Remove unneeded error message silencer
  bootconfig: Add bootconfig magic word for indicating bootconfig explicitly
  bootconfig: Set CONFIG_BOOT_CONFIG=n by default
  tracing: Clear trace_state when starting trace
  bootconfig: Mark boot_config_checksum() static
  tracing: Disable trace_printk() on post poned tests
  tracing: Have synthetic event test use raw_smp_processor_id()
  tracing: Fix number printing bug in print_synth_event()
  tracing: Check that number of vals matches number of synth event fields
  tracing: Make synth_event trace functions endian-correct
  tracing: Make sure synth_event_trace() example always uses u64
2020-02-25  bootconfig: Fix CONFIG_BOOTTIME_TRACING dependency issue  [Masami Hiramatsu]
Commit d8a953ddde5e ("bootconfig: Set CONFIG_BOOT_CONFIG=n by default") also changed CONFIG_BOOTTIME_TRACING to select CONFIG_BOOT_CONFIG, to show the boot-time tracing on the menu, which introduced a wrong dependency with BLK_DEV_INITRD as below.

WARNING: unmet direct dependencies detected for BOOT_CONFIG
  Depends on [n]: BLK_DEV_INITRD [=n]
  Selected by [y]:
  - BOOTTIME_TRACING [=y] && TRACING_SUPPORT [=y] && FTRACE [=y] && TRACING [=y]

Make CONFIG_BOOT_CONFIG select CONFIG_BLK_DEV_INITRD to fix this error, and make CONFIG_BOOTTIME_TRACING=n by default, so that both boot-time tracing and boot configuration are off by default but still appear on the menu list.

Link: http://lkml.kernel.org/r/158264140162.23842.11237423518607465535.stgit@devnote2
Fixes: d8a953ddde5e ("bootconfig: Set CONFIG_BOOT_CONFIG=n by default")
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Compiled-tested-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-02-25  blktrace: Protect q->blk_trace with RCU  [Jan Kara]
KASAN is reporting that __blk_add_trace() has a use-after-free issue when accessing q->blk_trace. Indeed the switching of block tracing (and thus the eventual freeing of q->blk_trace) is completely unsynchronized with the currently running tracing, so it can happen that the blk_trace structure is freed while __blk_add_trace() is still working on it. Protect accesses to q->blk_trace by RCU during tracing, and make sure we wait for the end of the RCU grace period when shutting down tracing. Luckily that is a rare enough event that we can afford it. Note that postponing the freeing of blk_trace to an RCU callback should better be avoided, as it could have unexpected user-visible side effects: the debugfs files would still exist for a short while after block tracing has been shut down.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=205711
CC: stable@vger.kernel.org
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Tested-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reported-by: Tristan Madani <tristmd@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
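The read side presumably follows the standard RCU pattern; a sketch (field names from blktrace, control flow simplified, record_blktrace_event() is a stand-in for the real tracing work):

```c
/* Tracing path: every access to q->blk_trace now happens inside an
 * RCU read-side critical section. */
rcu_read_lock();
bt = rcu_dereference(q->blk_trace);
if (bt)
	record_blktrace_event(bt, ...);	/* illustrative */
rcu_read_unlock();

/* Teardown: unpublish, then wait out all readers before freeing,
 * rather than freeing from an RCU callback (which would keep the
 * debugfs files visible after shutdown). */
rcu_assign_pointer(q->blk_trace, NULL);
synchronize_rcu();
blk_trace_free(bt);
```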
2020-02-24  bpf/trace: Remove redundant preempt_disable from trace_call_bpf()  [Thomas Gleixner]
Similar to __bpf_trace_run this is redundant because __bpf_trace_run() is invoked from a trace point via __DO_TRACE() which already disables preemption _before_ invoking any of the functions which are attached to a trace point. Remove it and add a cant_sleep() check.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200224145643.059995527@linutronix.de
2020-02-24  bpf: disable preemption for bpf progs attached to uprobe  [Alexei Starovoitov]
trace_call_bpf() no longer disables preemption on its own. All callers of this function have to do it explicitly.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
2020-02-24  bpf/trace: Remove EXPORT from trace_call_bpf()  [Thomas Gleixner]
All callers are built in. No point to export this.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2020-02-24  bpf/tracing: Remove redundant preempt_disable() in __bpf_trace_run()  [Thomas Gleixner]
__bpf_trace_run() disables preemption around the BPF_PROG_RUN() invocation. This is redundant because __bpf_trace_run() is invoked from a trace point via __DO_TRACE() which already disables preemption _before_ invoking any of the functions which are attached to a trace point. Remove it and add a cant_sleep() check.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200224145642.847220186@linutronix.de
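The resulting runner is tiny; a sketch matching the description (preemption is already off courtesy of __DO_TRACE(), so the function only asserts the context):

```c
static __always_inline void
__bpf_trace_run(struct bpf_prog *prog, u64 *args)
{
	cant_sleep();	/* documents and asserts the atomic context */
	rcu_read_lock();
	(void) BPF_PROG_RUN(prog, args);
	rcu_read_unlock();
}
```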
2020-02-20  bootconfig: Set CONFIG_BOOT_CONFIG=n by default  [Masami Hiramatsu]
Set CONFIG_BOOT_CONFIG=n by default. This also warns the user if CONFIG_BOOT_CONFIG=n but "bootconfig" is given on the kernel command line.

Link: http://lkml.kernel.org/r/158220111291.26565.9036889083940367969.stgit@devnote2
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-02-20  tracing: Clear trace_state when starting trace  [Masami Hiramatsu]
Clear the trace_state data structure when starting a trace in the __synth_event_trace_start() internal function. Currently trace_state is initialized only in the synth_event_trace_start() API, but the trace_state in synth_event_trace() and synth_event_trace_array() is on the stack without initialization. This means those APIs will see wrong parameters and will skip the closing process in __synth_event_trace_end() because trace_state->disabled may be !0.

Link: http://lkml.kernel.org/r/158193315899.8868.1781259176894639952.stgit@devnote2
Reviewed-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
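The fix is presumably a one-line clear at the top of the shared start path; a sketch:

```c
static int __synth_event_trace_start(struct trace_event_file *file,
				     struct synth_event_trace_state *trace_state)
{
	/* Callers may pass stack-allocated state: clear it so stale
	 * values (e.g. trace_state->disabled) cannot skip the
	 * close-out in __synth_event_trace_end(). */
	memset(trace_state, '\0', sizeof(*trace_state));
	/* ... */
	return 0;
}
```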
2020-02-20  tracing: Disable trace_printk() on post poned tests  [Steven Rostedt (VMware)]
The tracing selftests check various aspects of the tracing infrastructure, and one of them is filtering. If trace_printk() is active during a self test, it can cause the filtering to fail, which will disable that part of the trace. To keep the selftests from failing because of trace_printk() calls, trace_printk() checks the variable tracing_selftest_running, and if set, it does not write to the tracing buffer.

As some tracers were registered earlier in boot, the selftests they triggered would fail because not all of the infrastructure was set up for the full selftest. Thus, some of the tests were postponed to when their infrastructure was ready (namely the file system code). The postpone code did not set the tracing_selftest_running variable, and could fail if a trace_printk() was added and executed during their run.

Cc: stable@vger.kernel.org
Fixes: 9afecfbb95198 ("tracing: Postpone tracer start-up tests till the system is more robust")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
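The fix is presumably to wrap the deferred runs the same way the early boot path does; a sketch, with the list name and runner assumed from the postponed-selftest code:

```c
	/* In the deferred (fs-ready) selftest runner: */
	tracing_selftest_running = true;
	list_for_each_entry_safe(p, n, &postponed_selftests, list) {
		/* trace_printk() now stays out of the buffer while the
		 * filtering tests run. */
		ret = run_tracer_selftest(p->type);
		/* ... */
	}
	tracing_selftest_running = false;
```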
2020-02-20  tracing: Have synthetic event test use raw_smp_processor_id()  [Steven Rostedt (VMware)]
The test code that tests synthetic event creation pushes in, as one of its test fields, the current CPU using smp_processor_id(). As this is just something to see if the value is correctly passed in, and the actual CPU used does not matter, use raw_smp_processor_id(); otherwise, with debug preemption enabled, a warning happens because smp_processor_id() is called without preemption enabled.

Link: http://lkml.kernel.org/r/20200220162950.35162579@gandalf.local.home
Reviewed-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>