diff options
Diffstat (limited to 'Documentation')
-rw-r--r-- | Documentation/admin-guide/bootconfig.rst | 30 | ||||
-rw-r--r-- | Documentation/admin-guide/kernel-parameters.txt | 13 | ||||
-rw-r--r-- | Documentation/trace/boottime-trace.rst | 6 | ||||
-rw-r--r-- | Documentation/trace/hwlat_detector.rst | 13 | ||||
-rw-r--r-- | Documentation/trace/index.rst | 2 | ||||
-rw-r--r-- | Documentation/trace/osnoise-tracer.rst | 152 | ||||
-rw-r--r-- | Documentation/trace/timerlat-tracer.rst | 181 |
7 files changed, 389 insertions, 8 deletions
diff --git a/Documentation/admin-guide/bootconfig.rst b/Documentation/admin-guide/bootconfig.rst index 452b7dcd7f6b..6a79f2e59396 100644 --- a/Documentation/admin-guide/bootconfig.rst +++ b/Documentation/admin-guide/bootconfig.rst @@ -89,13 +89,35 @@ you can use ``+=`` operator. For example:: In this case, the key ``foo`` has ``bar``, ``baz`` and ``qux``. -However, a sub-key and a value can not co-exist under a parent key. -For example, following config is NOT allowed.:: +Moreover, sub-keys and a value can coexist under a parent key. +For example, following config is allowed.:: foo = value1 - foo.bar = value2 # !ERROR! subkey "bar" and value "value1" can NOT co-exist - foo.bar := value2 # !ERROR! even with the override operator, this is NOT allowed. + foo.bar = value2 + foo := value3 # This will update foo's value. + +Note, since there is no syntax to put a raw value directly under a +structured key, you have to define it outside of the brace. For example:: + + foo { + bar = value1 + bar { + baz = value2 + qux = value3 + } + } + +Also, the order of the value node under a key is fixed. If there +are a value and subkeys, the value is always the first child node +of the key. Thus if user specifies subkeys first, e.g.:: + + foo.bar = value1 + foo = value2 + +In the program (and /proc/bootconfig), it will be shown as below:: + foo = value2 + foo.bar = value1 Comments -------- diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 0081d83457e8..d6717b747694 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -5672,12 +5672,25 @@ Note, echoing 1 into this file without the tracepoint_printk kernel cmdline option has no effect. + The tp_printk_stop_on_boot (see below) can also be used + to stop the printing of events to console at + late_initcall_sync. + ** CAUTION ** Having tracepoints sent to printk() and activating high frequency tracepoints such as irq or sched, can cause the system to live lock. + tp_printk_stop_on_boot[FTRACE] + When tp_printk (above) is set, it can cause a lot of noise + on the console. It may be useful to only include the + printing of events during boot up, as user space may + make the system inoperable. + + This command line option will stop the printing of events + to console at the late_initcall_sync() time frame. + traceoff_on_warning [FTRACE] enable this option to disable tracing when a warning is hit. This turns off "tracing_on". Tracing can diff --git a/Documentation/trace/boottime-trace.rst b/Documentation/trace/boottime-trace.rst index 89b64334929b..8053898cfeb4 100644 --- a/Documentation/trace/boottime-trace.rst +++ b/Documentation/trace/boottime-trace.rst @@ -99,6 +99,12 @@ These options are setting per-event options. ftrace.[instance.INSTANCE.]event.GROUP.EVENT.enable Enable GROUP:EVENT tracing. +ftrace.[instance.INSTANCE.]event.GROUP.enable + Enable all event tracing within GROUP. + +ftrace.[instance.INSTANCE.]event.enable + Enable all event tracing. + ftrace.[instance.INSTANCE.]event.GROUP.EVENT.filter = FILTER Set FILTER rule to the GROUP:EVENT. diff --git a/Documentation/trace/hwlat_detector.rst b/Documentation/trace/hwlat_detector.rst index 5739349649c8..de94b499b0bc 100644 --- a/Documentation/trace/hwlat_detector.rst +++ b/Documentation/trace/hwlat_detector.rst @@ -76,8 +76,13 @@ in /sys/kernel/tracing: - tracing_cpumask - the CPUs to move the hwlat thread across - hwlat_detector/width - specified amount of time to spin within window (usecs) - hwlat_detector/window - amount of time between (width) runs (usecs) + - hwlat_detector/mode - the thread mode -The hwlat detector's kernel thread will migrate across each CPU specified in -tracing_cpumask between each window. To limit the migration, either modify -tracing_cpumask, or modify the hwlat kernel thread (named [hwlatd]) CPU -affinity directly, and the migration will stop. +By default, one hwlat detector's kernel thread will migrate across each CPU +specified in cpumask at the beginning of a new window, in a round-robin +fashion. This behavior can be changed by changing the thread mode, +the available options are: + + - none: do not force migration + - round-robin: migrate across each CPU specified in cpumask [default] + - per-cpu: create one thread for each cpu in tracing_cpumask diff --git a/Documentation/trace/index.rst b/Documentation/trace/index.rst index f634b36fd3aa..3769b9b7aed8 100644 --- a/Documentation/trace/index.rst +++ b/Documentation/trace/index.rst @@ -23,6 +23,8 @@ Linux Tracing Technologies histogram-design boottime-trace hwlat_detector + osnoise-tracer + timerlat-tracer intel_th ring-buffer-design stm diff --git a/Documentation/trace/osnoise-tracer.rst b/Documentation/trace/osnoise-tracer.rst new file mode 100644 index 000000000000..b648cb9bf1f0 --- /dev/null +++ b/Documentation/trace/osnoise-tracer.rst @@ -0,0 +1,152 @@ +============== +OSNOISE Tracer +============== + +In the context of high-performance computing (HPC), the Operating System +Noise (*osnoise*) refers to the interference experienced by an application +due to activities inside the operating system. In the context of Linux, +NMIs, IRQs, SoftIRQs, and any other system thread can cause noise to the +system. Moreover, hardware-related jobs can also cause noise, for example, +via SMIs. + +hwlat_detector is one of the tools used to identify the most complex +source of noise: *hardware noise*. + +In a nutshell, the hwlat_detector creates a thread that runs +periodically for a given period. At the beginning of a period, the thread +disables interrupt and starts sampling. While running, the hwlatd +thread reads the time in a loop. As interrupts are disabled, threads, +IRQs, and SoftIRQs cannot interfere with the hwlatd thread. Hence, the +cause of any gap between two different reads of the time roots either on +NMI or in the hardware itself. At the end of the period, hwlatd enables +interrupts and reports the max observed gap between the reads. It also +prints a NMI occurrence counter. If the output does not report NMI +executions, the user can conclude that the hardware is the culprit for +the latency. The hwlat detects the NMI execution by observing +the entry and exit of a NMI. + +The osnoise tracer leverages the hwlat_detector by running a +similar loop with preemption, SoftIRQs and IRQs enabled, thus allowing +all the sources of *osnoise* during its execution. Using the same approach +of hwlat, osnoise takes note of the entry and exit point of any +source of interferences, increasing a per-cpu interference counter. The +osnoise tracer also saves an interference counter for each source of +interference. The interference counter for NMI, IRQs, SoftIRQs, and +threads is increased anytime the tool observes these interferences' entry +events. When a noise happens without any interference from the operating +system level, the hardware noise counter increases, pointing to a +hardware-related noise. In this way, osnoise can account for any +source of interference. At the end of the period, the osnoise tracer +prints the sum of all noise, the max single noise, the percentage of CPU +available for the thread, and the counters for the noise sources. + +Usage +----- + +Write the ASCII text "osnoise" into the current_tracer file of the +tracing system (generally mounted at /sys/kernel/tracing). + +For example:: + + [root@f32 ~]# cd /sys/kernel/tracing/ + [root@f32 tracing]# echo osnoise > current_tracer + +It is possible to follow the trace by reading the trace trace file:: + + [root@f32 tracing]# cat trace + # tracer: osnoise + # + # _-----=> irqs-off + # / _----=> need-resched + # | / _---=> hardirq/softirq + # || / _--=> preempt-depth MAX + # || / SINGLE Interference counters: + # |||| RUNTIME NOISE % OF CPU NOISE +-----------------------------+ + # TASK-PID CPU# |||| TIMESTAMP IN US IN US AVAILABLE IN US HW NMI IRQ SIRQ THREAD + # | | | |||| | | | | | | | | | | + <...>-859 [000] .... 81.637220: 1000000 190 99.98100 9 18 0 1007 18 1 + <...>-860 [001] .... 81.638154: 1000000 656 99.93440 74 23 0 1006 16 3 + <...>-861 [002] .... 81.638193: 1000000 5675 99.43250 202 6 0 1013 25 21 + <...>-862 [003] .... 81.638242: 1000000 125 99.98750 45 1 0 1011 23 0 + <...>-863 [004] .... 81.638260: 1000000 1721 99.82790 168 7 0 1002 49 41 + <...>-864 [005] .... 81.638286: 1000000 263 99.97370 57 6 0 1006 26 2 + <...>-865 [006] .... 81.638302: 1000000 109 99.98910 21 3 0 1006 18 1 + <...>-866 [007] .... 81.638326: 1000000 7816 99.21840 107 8 0 1016 39 19 + +In addition to the regular trace fields (from TASK-PID to TIMESTAMP), the +tracer prints a message at the end of each period for each CPU that is +running an osnoise/ thread. The osnoise specific fields report: + + - The RUNTIME IN US reports the amount of time in microseconds that + the osnoise thread kept looping reading the time. + - The NOISE IN US reports the sum of noise in microseconds observed + by the osnoise tracer during the associated runtime. + - The % OF CPU AVAILABLE reports the percentage of CPU available for + the osnoise thread during the runtime window. + - The MAX SINGLE NOISE IN US reports the maximum single noise observed + during the runtime window. + - The Interference counters display how many each of the respective + interference happened during the runtime window. + +Note that the example above shows a high number of HW noise samples. +The reason being is that this sample was taken on a virtual machine, +and the host interference is detected as a hardware interference. + +Tracer options +--------------------- + +The tracer has a set of options inside the osnoise directory, they are: + + - osnoise/cpus: CPUs at which a osnoise thread will execute. + - osnoise/period_us: the period of the osnoise thread. + - osnoise/runtime_us: how long an osnoise thread will look for noise. + - osnoise/stop_tracing_us: stop the system tracing if a single noise + higher than the configured value happens. Writing 0 disables this + option. + - osnoise/stop_tracing_total_us: stop the system tracing if total noise + higher than the configured value happens. Writing 0 disables this + option. + - tracing_threshold: the minimum delta between two time() reads to be + considered as noise, in us. When set to 0, the default value will + will be used, which is currently 5 us. + +Additional Tracing +------------------ + +In addition to the tracer, a set of tracepoints were added to +facilitate the identification of the osnoise source. + + - osnoise:sample_threshold: printed anytime a noise is higher than + the configurable tolerance_ns. + - osnoise:nmi_noise: noise from NMI, including the duration. + - osnoise:irq_noise: noise from an IRQ, including the duration. + - osnoise:softirq_noise: noise from a SoftIRQ, including the + duration. + - osnoise:thread_noise: noise from a thread, including the duration. + +Note that all the values are *net values*. For example, if while osnoise +is running, another thread preempts the osnoise thread, it will start a +thread_noise duration at the start. Then, an IRQ takes place, preempting +the thread_noise, starting a irq_noise. When the IRQ ends its execution, +it will compute its duration, and this duration will be subtracted from +the thread_noise, in such a way as to avoid the double accounting of the +IRQ execution. This logic is valid for all sources of noise. + +Here is one example of the usage of these tracepoints:: + + osnoise/8-961 [008] d.h. 5789.857532: irq_noise: local_timer:236 start 5789.857529929 duration 1845 ns + osnoise/8-961 [008] dNh. 5789.858408: irq_noise: local_timer:236 start 5789.858404871 duration 2848 ns + migration/8-54 [008] d... 5789.858413: thread_noise: migration/8:54 start 5789.858409300 duration 3068 ns + osnoise/8-961 [008] .... 5789.858413: sample_threshold: start 5789.858404555 duration 8812 ns interferences 2 + +In this example, a noise sample of 8 microseconds was reported in the last +line, pointing to two interferences. Looking backward in the trace, the +two previous entries were about the migration thread running after a +timer IRQ execution. The first event is not part of the noise because +it took place one millisecond before. + +It is worth noticing that the sum of the duration reported in the +tracepoints is smaller than eight us reported in the sample_threshold. +The reason roots in the overhead of the entry and exit code that happens +before and after any interference execution. This justifies the dual +approach: measuring thread and tracing. diff --git a/Documentation/trace/timerlat-tracer.rst b/Documentation/trace/timerlat-tracer.rst new file mode 100644 index 000000000000..c7cbb557aee7 --- /dev/null +++ b/Documentation/trace/timerlat-tracer.rst @@ -0,0 +1,181 @@ +############### +Timerlat tracer +############### + +The timerlat tracer aims to help the preemptive kernel developers to +find souces of wakeup latencies of real-time threads. Like cyclictest, +the tracer sets a periodic timer that wakes up a thread. The thread then +computes a *wakeup latency* value as the difference between the *current +time* and the *absolute time* that the timer was set to expire. The main +goal of timerlat is tracing in such a way to help kernel developers. + +Usage +----- + +Write the ASCII text "timerlat" into the current_tracer file of the +tracing system (generally mounted at /sys/kernel/tracing). + +For example:: + + [root@f32 ~]# cd /sys/kernel/tracing/ + [root@f32 tracing]# echo timerlat > current_tracer + +It is possible to follow the trace by reading the trace trace file:: + + [root@f32 tracing]# cat trace + # tracer: timerlat + # + # _-----=> irqs-off + # / _----=> need-resched + # | / _---=> hardirq/softirq + # || / _--=> preempt-depth + # || / + # |||| ACTIVATION + # TASK-PID CPU# |||| TIMESTAMP ID CONTEXT LATENCY + # | | | |||| | | | | + <idle>-0 [000] d.h1 54.029328: #1 context irq timer_latency 932 ns + <...>-867 [000] .... 54.029339: #1 context thread timer_latency 11700 ns + <idle>-0 [001] dNh1 54.029346: #1 context irq timer_latency 2833 ns + <...>-868 [001] .... 54.029353: #1 context thread timer_latency 9820 ns + <idle>-0 [000] d.h1 54.030328: #2 context irq timer_latency 769 ns + <...>-867 [000] .... 54.030330: #2 context thread timer_latency 3070 ns + <idle>-0 [001] d.h1 54.030344: #2 context irq timer_latency 935 ns + <...>-868 [001] .... 54.030347: #2 context thread timer_latency 4351 ns + + +The tracer creates a per-cpu kernel thread with real-time priority that +prints two lines at every activation. The first is the *timer latency* +observed at the *hardirq* context before the activation of the thread. +The second is the *timer latency* observed by the thread. The ACTIVATION +ID field serves to relate the *irq* execution to its respective *thread* +execution. + +The *irq*/*thread* splitting is important to clarify at which context +the unexpected high value is coming from. The *irq* context can be +delayed by hardware related actions, such as SMIs, NMIs, IRQs +or by a thread masking interrupts. Once the timer happens, the delay +can also be influenced by blocking caused by threads. For example, by +postponing the scheduler execution via preempt_disable(), by the +scheduler execution, or by masking interrupts. Threads can +also be delayed by the interference from other threads and IRQs. + +Tracer options +--------------------- + +The timerlat tracer is built on top of osnoise tracer. +So its configuration is also done in the osnoise/ config +directory. The timerlat configs are: + + - cpus: CPUs at which a timerlat thread will execute. + - timerlat_period_us: the period of the timerlat thread. + - osnoise/stop_tracing_us: stop the system tracing if a + timer latency at the *irq* context higher than the configured + value happens. Writing 0 disables this option. + - stop_tracing_total_us: stop the system tracing if a + timer latency at the *thread* context higher than the configured + value happens. Writing 0 disables this option. + - print_stack: save the stack of the IRQ ocurrence, and print + it afte the *thread context* event". + +timerlat and osnoise +---------------------------- + +The timerlat can also take advantage of the osnoise: traceevents. +For example:: + + [root@f32 ~]# cd /sys/kernel/tracing/ + [root@f32 tracing]# echo timerlat > current_tracer + [root@f32 tracing]# echo 1 > events/osnoise/enable + [root@f32 tracing]# echo 25 > osnoise/stop_tracing_total_us + [root@f32 tracing]# tail -10 trace + cc1-87882 [005] d..h... 548.771078: #402268 context irq timer_latency 13585 ns + cc1-87882 [005] dNLh1.. 548.771082: irq_noise: local_timer:236 start 548.771077442 duration 7597 ns + cc1-87882 [005] dNLh2.. 548.771099: irq_noise: qxl:21 start 548.771085017 duration 7139 ns + cc1-87882 [005] d...3.. 548.771102: thread_noise: cc1:87882 start 548.771078243 duration 9909 ns + timerlat/5-1035 [005] ....... 548.771104: #402268 context thread timer_latency 39960 ns + +In this case, the root cause of the timer latency does not point to a +single cause, but to multiple ones. Firstly, the timer IRQ was delayed +for 13 us, which may point to a long IRQ disabled section (see IRQ +stacktrace section). Then the timer interrupt that wakes up the timerlat +thread took 7597 ns, and the qxl:21 device IRQ took 7139 ns. Finally, +the cc1 thread noise took 9909 ns of time before the context switch. +Such pieces of evidence are useful for the developer to use other +tracing methods to figure out how to debug and optimize the system. + +It is worth mentioning that the *duration* values reported +by the osnoise: events are *net* values. For example, the +thread_noise does not include the duration of the overhead caused +by the IRQ execution (which indeed accounted for 12736 ns). But +the values reported by the timerlat tracer (timerlat_latency) +are *gross* values. + +The art below illustrates a CPU timeline and how the timerlat tracer +observes it at the top and the osnoise: events at the bottom. Each "-" +in the timelines means circa 1 us, and the time moves ==>:: + + External timer irq thread + clock latency latency + event 13585 ns 39960 ns + | ^ ^ + v | | + |-------------| | + |-------------+-------------------------| + ^ ^ + ======================================================================== + [tmr irq] [dev irq] + [another thread...^ v..^ v.......][timerlat/ thread] <-- CPU timeline + ========================================================================= + |-------| |-------| + |--^ v-------| + | | | + | | + thread_noise: 9909 ns + | +-> irq_noise: 6139 ns + +-> irq_noise: 7597 ns + +IRQ stacktrace +--------------------------- + +The osnoise/print_stack option is helpful for the cases in which a thread +noise causes the major factor for the timer latency, because of preempt or +irq disabled. For example:: + + [root@f32 tracing]# echo 500 > osnoise/stop_tracing_total_us + [root@f32 tracing]# echo 500 > osnoise/print_stack + [root@f32 tracing]# echo timerlat > current_tracer + [root@f32 tracing]# tail -21 per_cpu/cpu7/trace + insmod-1026 [007] dN.h1.. 200.201948: irq_noise: local_timer:236 start 200.201939376 duration 7872 ns + insmod-1026 [007] d..h1.. 200.202587: #29800 context irq timer_latency 1616 ns + insmod-1026 [007] dN.h2.. 200.202598: irq_noise: local_timer:236 start 200.202586162 duration 11855 ns + insmod-1026 [007] dN.h3.. 200.202947: irq_noise: local_timer:236 start 200.202939174 duration 7318 ns + insmod-1026 [007] d...3.. 200.203444: thread_noise: insmod:1026 start 200.202586933 duration 838681 ns + timerlat/7-1001 [007] ....... 200.203445: #29800 context thread timer_latency 859978 ns + timerlat/7-1001 [007] ....1.. 200.203446: <stack trace> + => timerlat_irq + => __hrtimer_run_queues + => hrtimer_interrupt + => __sysvec_apic_timer_interrupt + => asm_call_irq_on_stack + => sysvec_apic_timer_interrupt + => asm_sysvec_apic_timer_interrupt + => delay_tsc + => dummy_load_1ms_pd_init + => do_one_initcall + => do_init_module + => __do_sys_finit_module + => do_syscall_64 + => entry_SYSCALL_64_after_hwframe + +In this case, it is possible to see that the thread added the highest +contribution to the *timer latency* and the stack trace, saved during +the timerlat IRQ handler, points to a function named +dummy_load_1ms_pd_init, which had the following code (on purpose):: + + static int __init dummy_load_1ms_pd_init(void) + { + preempt_disable(); + mdelay(1); + preempt_enable(); + return 0; + + } |