Age | Commit message (Collapse) | Author |
|
The patch 'tracing: Fix histogram code when expression has same var as
value' added code to return an existing variable reference when
creating a new variable reference, which resulted in var_ref_vals
slots being reused instead of being duplicated.
The implementation of the trace action assumes that the end of the
var_ref_vals array starting at action_data.var_ref_idx corresponds to
the values that will be assigned to the trace params. The patch
mentioned above invalidates that assumption, which means that each
param needs to explicitly specify its index into var_ref_vals.
This fix changes action_data.var_ref_idx to an array of var ref
indexes to account for that.
Link: https://lore.kernel.org/r/1580335695.6220.8.camel@kernel.org
Fixes: 8bcebc77e85f ("tracing: Fix histogram code when expression has same var as value")
Signed-off-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
|
Have trace_boot_add_synth_event() use the synth_event interface.
Also, rename synth_event_run_cmd() to synth_event_run_command() now
that trace_boot's version is gone.
Link: http://lkml.kernel.org/r/94f1fa0e31846d0bddca916b8663404b20559e34.1580323897.git.zanussi@kernel.org
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
|
Replace open coded single-linked list iteration loop with for_each_console()
helper in use.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
|
|
The point bp is assigned a value that is never read, it is being
re-assigned later to bp = &kdb_breakpoints[lowbp] in a for-loop.
Remove the redundant assignment.
Addresses-Coverity ("Unused value")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Link: https://lore.kernel.org/r/20191128130753.181246-1-colin.king@canonical.com
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
|
|
If you switch to a sleeping task with the "pid" command and then type
"rd", kdb tells you this:
No current kdb registers. You may need to select another task
diag: -17: Invalid register name
The first message makes sense, but not the second. Fix it by just
returning 0 after commands accessing the current registers finish if
we've already printed the "No current kdb registers" error.
While fixing kdb_rd(), change the function to use "if" rather than
"ifdef". It cleans the function up a bit and any modern compiler will
have no trouble handling still producing good code.
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Link: https://lore.kernel.org/r/20191109111624.5.I121f4c6f0c19266200bf6ef003de78841e5bfc3d@changeid
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
|
|
Some (but not all?) of the kdb backtrace paths would cause the
kdb_current_task and kdb_current_regs to remain changed. As discussed
in a review of a previous patch [1], this doesn't seem intuitive, so
let's fix that.
...but, it turns out that there's actually no longer any reason to set
the current task / current regs while backtracing anymore anyway. As
of commit 2277b492582d ("kdb: Fix stack crawling on 'running' CPUs
that aren't the master") if we're backtracing on a task running on a
CPU we ask that CPU to do the backtrace itself. Linux can do that
without anything fancy. If we're doing backtrace on a sleeping task
we can also do that fine without updating globals. So this patch
mostly just turns into deleting a bunch of code.
[1] https://lore.kernel.org/r/20191010150735.dhrj3pbjgmjrdpwr@holly.lan
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Link: https://lore.kernel.org/r/20191109111624.4.Ibc3d982bbeb9e46872d43973ba808cd4c79537c7@changeid
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
|
|
The kdb_current_task variable has been declared in
"kernel/debug/kdb/kdb_private.h" since 2010 when kdb was added to the
mainline kernel. This is not a public header. There should be no
reason that kdb_current_task should be exported and there are no
in-kernel users that need it. Remove the export.
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Link: https://lore.kernel.org/r/20191109111623.3.I14b22b5eb15ca8f3812ab33e96621231304dc1f7@changeid
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
|
|
As of the patch ("MIPS: kdb: Remove old workaround for backtracing on
other CPUs") there is no reason for kdb_current_regs to be in the
public "kdb.h". Let's move it next to kdb_current_task.
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Link: https://lore.kernel.org/r/20191109111623.2.Iadbfb484e90b557cc4b5ac9890bfca732cd99d77@changeid
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
|
|
5153faac18d2 ("cgroup: remove cgroup_enable_task_cg_lists()
optimization") removed lazy initialization of css_sets so that new
tasks are always lniked to its css_set. In the process, it incorrectly
ended up adding init_tasks to root css_set. They show up as PID 0's in
root's cgroup.procs triggering warnings in systemd and generally
confusing people.
Fix it by skip css_set linking for init_tasks.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: https://github.com/joanbm
Link: https://github.com/systemd/systemd/issues/14682
Fixes: 5153faac18d2 ("cgroup: remove cgroup_enable_task_cg_lists() optimization")
Cc: stable@vger.kernel.org # v5.5+
|
|
Move all the tracing selftest configs to the bottom of the tracing menu.
There's no reason for them to be interspersed throughout.
Also, move the bootconfig menu to the top.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
|
Move the config that enables the mmiotracer with the other tracers such that
all the tracers are together.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
|
The MMIO test module was by itself, move it to the other test modules. Also,
add the text "Test module" to PREEMPTIRQ_DELAY_TEST as that create a test
module as well.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
|
The features that depend on the function tracer were spread out through the
tracing menu, pull them together as it is easier to manage.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
|
Add a test module that checks the basic functionality of the in-kernel
kprobe event command generation API by creating kprobe events from a
module.
Link: http://lkml.kernel.org/r/97e502b204f9dba948e3fa3a4315448298218787.1580323897.git.zanussi@kernel.org
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
|
Have trace_boot_add_kprobe_event() use the kprobe_event interface.
Also, rename kprobe_event_run_cmd() to kprobe_event_run_command() now
that trace_boot's version is gone.
Link: http://lkml.kernel.org/r/af5429d11291ab1e9a85a0ff944af3b2bcf193c7.1580323897.git.zanussi@kernel.org
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
|
Add functions used to generate kprobe event commands, built on top of
the dynevent_cmd interface.
kprobe_event_gen_cmd_start() is used to create a kprobe event command
using a variable arg list, and kretprobe_event_gen_cmd_start() does
the same for kretprobe event commands. kprobe_event_add_fields() can
be used to add single fields one by one or as a group. Once all
desired fields are added, kprobe_event_gen_cmd_end() or
kretprobe_event_gen_cmd_end() respectively are used to actually
execute the command and create the event.
Link: http://lkml.kernel.org/r/95cc4696502bb6017f9126f306a45ad19b4cc14f.1580323897.git.zanussi@kernel.org
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
|
Add a test module that checks the basic functionality of the in-kernel
synthetic event generation API by generating and tracing synthetic
events from a module.
Link: http://lkml.kernel.org/r/fcb4dd9eb9eefb70ab20538d3529d51642389664.1580323897.git.zanussi@kernel.org
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
|
Add an exported function named synth_event_trace(), allowing modules
or other kernel code to trace synthetic events.
Also added are several functions that allow the same functionality to
be broken out in a piecewise fashion, which are useful in situations
where tracing an event from a full array of values would be
cumbersome. Those functions are synth_event_trace_start/end() and
synth_event_add_(next)_val().
Link: http://lkml.kernel.org/r/7a84de5f1854acf4144b57efe835ca645afa764f.1580323897.git.zanussi@kernel.org
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
|
Add functions used to generate synthetic event commands, built on top
of the dynevent_cmd interface.
synth_event_gen_cmd_start() is used to create a synthetic event
command using a variable arg list and
synth_event_gen_cmd_array_start() does the same thing but using an
array of field descriptors. synth_event_add_field(),
synth_event_add_field_str() and synth_event_add_fields() can be used
to add single fields one by one or as a group. Once all desired
fields are added, synth_event_gen_cmd_end() is used to actually
execute the command and create the event.
synth_event_create() does everything, including creating the event, in
a single call.
Link: http://lkml.kernel.org/r/38fef702fad5ef208009f459552f34a94befd860.1580323897.git.zanussi@kernel.org
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
|
Add an interface used to build up dynamic event creation commands,
such as synthetic and kprobe events. Interfaces specific to those
particular types of events and others can be built on top of this
interface.
Command creation is started by first using the dynevent_cmd_init()
function to initialize the dynevent_cmd object. Following that, args
are appended and optionally checked by the dynevent_arg_add() and
dynevent_arg_pair_add() functions, which use objects representing
arguments and pairs of arguments, initialized respectively by
dynevent_arg_init() and dynevent_arg_pair_init(). Finally, once all
args have been successfully added, the command is finalized and
actually created using dynevent_create().
The code here for actually printing into the dyn_event->cmd buffer
using snprintf() etc was adapted from v4 of Masami's 'tracing/boot:
Add synthetic event support' patch.
Link: http://lkml.kernel.org/r/1f65fa44390b6f238f6036777c3784ced1dcc6a0.1580323897.git.zanussi@kernel.org
Signed-off-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
|
create_or_delete_synth_event() contains code to delete a synthetic
event, which would be useful on its own - specifically, it would be
useful to allow event-creating modules to call it separately.
Separate out the delete code from that function and create an exported
function named synth_event_delete().
Link: http://lkml.kernel.org/r/050db3b06df7f0a4b8a2922da602d1d879c7c1c2.1580323897.git.zanussi@kernel.org
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
|
Add a function to get an event file and prevent it from going away on
module or instance removal.
trace_get_event_file() will find an event file in a given instance (if
instance is NULL, it assumes the top trace array) and return it,
pinning the instance's trace array as well as the event's module, if
applicable, so they won't go away while in use.
trace_put_event_file() does the matching release.
Link: http://lkml.kernel.org/r/bb31ac4bdda168d5ed3c4b5f5a4c8f633e8d9118.1580323897.git.zanussi@kernel.org
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Tom Zanussi <zanussi@kernel.org>
[ Moved trace_array_put() to end of trace_put_event_file() ]
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
|
Add a new trace_array_find() function that can be used to find a trace
array given the instance name, and replace existing code that does the
same thing with it. Also add trace_array_find_get() which does the
same but returns the trace array after upping its refcount.
Also make both available for use outside of trace.c.
Link: http://lkml.kernel.org/r/cb68528c975eba95bee4561ac67dd1499423b2e5.1580323897.git.zanussi@kernel.org
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
|
if seq_file .next fuction does not change position index,
read after some lseek can generate unexpected output.
Without patch:
# dd bs=30 skip=1 if=/sys/kernel/tracing/events/sched/sched_switch/trigger
dd: /sys/kernel/tracing/events/sched/sched_switch/trigger: cannot skip to specified offset
n traceoff snapshot stacktrace enable_event disable_event enable_hist disable_hist hist
# Available triggers:
# traceon traceoff snapshot stacktrace enable_event disable_event enable_hist disable_hist hist
6+1 records in
6+1 records out
206 bytes copied, 0.00027916 s, 738 kB/s
Notice the printing of "# Available triggers:..." after the line.
With the patch:
# dd bs=30 skip=1 if=/sys/kernel/tracing/events/sched/sched_switch/trigger
dd: /sys/kernel/tracing/events/sched/sched_switch/trigger: cannot skip to specified offset
n traceoff snapshot stacktrace enable_event disable_event enable_hist disable_hist hist
2+1 records in
2+1 records out
88 bytes copied, 0.000526867 s, 167 kB/s
It only prints the end of the file, and does not restart.
Link: http://lkml.kernel.org/r/3c35ee24-dd3a-8119-9c19-552ed253388a@virtuozzo.com
https://bugzilla.kernel.org/show_bug.cgi?id=206283
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
|
if seq_file .next fuction does not change position index,
read after some lseek can generate unexpected output.
Link: http://lkml.kernel.org/r/7ad85b22-1866-977c-db17-88ac438bc764@virtuozzo.com
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
[ This is not a bug fix, it just makes it "technically correct"
which is why I applied it. NULL is only returned on an anomaly
which triggers a WARN_ON ]
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
|
if seq_file .next fuction does not change position index,
read after some lseek can generate unexpected output.
Without patch:
# dd bs=4 skip=1 if=/sys/kernel/tracing/set_ftrace_pid
dd: /sys/kernel/tracing/set_ftrace_pid: cannot skip to specified offset
id
no pid
2+1 records in
2+1 records out
10 bytes copied, 0.000213285 s, 46.9 kB/s
Notice the "id" followed by "no pid".
With the patch:
# dd bs=4 skip=1 if=/sys/kernel/tracing/set_ftrace_pid
dd: /sys/kernel/tracing/set_ftrace_pid: cannot skip to specified offset
id
0+1 records in
0+1 records out
3 bytes copied, 0.000202112 s, 14.8 kB/s
Notice that it only prints "id" and not the "no pid" afterward.
Link: http://lkml.kernel.org/r/4f87c6ad-f114-30bb-8506-c32274ce2992@virtuozzo.com
https://bugzilla.kernel.org/show_bug.cgi?id=206283
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
|
Reading the sched_cmdline_ref and sched_tgid_ref initial state within
tracing_start_sched_switch without holding the sched_register_mutex is
racy against concurrent updates, which can lead to tracepoint probes
being registered more than once (and thus trigger warnings within
tracepoint.c).
[ May be the fix for this bug ]
Link: https://lore.kernel.org/r/000000000000ab6f84056c786b93@google.com
Link: http://lkml.kernel.org/r/20190817141208.15226-1-mathieu.desnoyers@efficios.com
Cc: stable@vger.kernel.org
CC: Steven Rostedt (VMware) <rostedt@goodmis.org>
CC: Joel Fernandes (Google) <joel@joelfernandes.org>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Paul E. McKenney <paulmck@linux.ibm.com>
Reported-by: syzbot+774fddf07b7ab29a1e55@syzkaller.appspotmail.com
Fixes: d914ba37d7145 ("tracing: Add support for recording tgid of tasks")
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull mmu_notifier updates from Jason Gunthorpe:
"This small series revises the names in mmu_notifier to make the code
clearer and more readable"
* tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
mm/mmu_notifiers: Use 'interval_sub' as the variable for mmu_interval_notifier
mm/mmu_notifiers: Use 'subscription' as the variable name for mmu_notifier
mm/mmu_notifier: Rename struct mmu_notifier_mm to mmu_notifier_subscriptions
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux
Pull thread management updates from Christian Brauner:
"Sargun Dhillon over the last cycle has worked on the pidfd_getfd()
syscall.
This syscall allows for the retrieval of file descriptors of a process
based on its pidfd. A task needs to have ptrace_may_access()
permissions with PTRACE_MODE_ATTACH_REALCREDS (suggested by Oleg and
Andy) on the target.
One of the main use-cases is in combination with seccomp's user
notification feature. As a reminder, seccomp's user notification
feature was made available in v5.0. It allows a task to retrieve a
file descriptor for its seccomp filter. The file descriptor is usually
handed of to a more privileged supervising process. The supervisor can
then listen for syscall events caught by the seccomp filter of the
supervisee and perform actions in lieu of the supervisee, usually
emulating syscalls. pidfd_getfd() is needed to expand its uses.
There are currently two major users that wait on pidfd_getfd() and one
future user:
- Netflix, Sargun said, is working on a service mesh where users
should be able to connect to a dns-based VIP. When a user connects
to e.g. 1.2.3.4:80 that runs e.g. service "foo" they will be
redirected to an envoy process. This service mesh uses seccomp user
notifications and pidfd to intercept all connect calls and instead
of connecting them to 1.2.3.4:80 connects them to e.g.
127.0.0.1:8080.
- LXD uses the seccomp notifier heavily to intercept and emulate
mknod() and mount() syscalls for unprivileged containers/processes.
With pidfd_getfd() more uses-cases e.g. bridging socket connections
will be possible.
- The patchset has also seen some interest from the browser corner.
Right now, Firefox is using a SECCOMP_RET_TRAP sandbox managed by a
broker process. In the future glibc will start blocking all signals
during dlopen() rendering this type of sandbox impossible. Hence,
in the future Firefox will switch to a seccomp-user-nofication
based sandbox which also makes use of file descriptor retrieval.
The thread for this can be found at
https://sourceware.org/ml/libc-alpha/2019-12/msg00079.html
With pidfd_getfd() it is e.g. possible to bridge socket connections
for the supervisee (binding to a privileged port) and taking actions
on file descriptors on behalf of the supervisee in general.
Sargun's first version was using an ioctl on pidfds but various people
pushed for it to be a proper syscall which he duely implemented as
well over various review cycles. Selftests are of course included.
I've also added instructions how to deal with merge conflicts below.
There's also a small fix coming from the kernel mentee project to
correctly annotate struct sighand_struct with __rcu to fix various
sparse warnings. We've received a few more such fixes and even though
they are mostly trivial I've decided to postpone them until after -rc1
since they came in rather late and I don't want to risk introducing
build warnings.
Finally, there's a new prctl() command PR_{G,S}ET_IO_FLUSHER which is
needed to avoid allocation recursions triggerable by storage drivers
that have userspace parts that run in the IO path (e.g. dm-multipath,
iscsi, etc). These allocation recursions deadlock the device.
The new prctl() allows such privileged userspace components to avoid
allocation recursions by setting the PF_MEMALLOC_NOIO and
PF_LESS_THROTTLE flags. The patch carries the necessary acks from the
relevant maintainers and is routed here as part of prctl()
thread-management."
* tag 'threads-v5.6' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
prctl: PR_{G,S}ET_IO_FLUSHER to support controlling memory reclaim
sched.h: Annotate sighand_struct with __rcu
test: Add test for pidfd getfd
arch: wire up pidfd_getfd syscall
pid: Implement pidfd_getfd syscall
vfs, fdtable: Add fget_task helper
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
Pull Kselftest kunit updates from Shuah Khan:
"This kunit update consists of:
- Support for building kunit as a module from Alan Maguire
- AppArmor KUnit tests for policy unpack from Mike Salvatore"
* tag 'linux-kselftest-5.6-rc1-kunit' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
kunit: building kunit as a module breaks allmodconfig
kunit: update documentation to describe module-based build
kunit: allow kunit to be loaded as a module
kunit: remove timeout dependence on sysctl_hung_task_timeout_seconds
kunit: allow kunit tests to be loaded as a module
kunit: hide unexported try-catch interface in try-catch-impl.h
kunit: move string-stream.h to lib/kunit
apparmor: add AppArmor KUnit tests for policy unpack
|
|
git://git.kernel.org:/pub/scm/linux/kernel/git/arnd/playground
Pull y2038 updates from Arnd Bergmann:
"Core, driver and file system changes
These are updates to device drivers and file systems that for some
reason or another were not included in the kernel in the previous
y2038 series.
I've gone through all users of time_t again to make sure the kernel is
in a long-term maintainable state, replacing all remaining references
to time_t with safe alternatives.
Some related parts of the series were picked up into the nfsd, xfs,
alsa and v4l2 trees. A final set of patches in linux-mm removes the
now unused time_t/timeval/timespec types and helper functions after
all five branches are merged for linux-5.6, ensuring that no new users
get merged.
As a result, linux-5.6, or my backport of the patches to 5.4 [1],
should be the first release that can serve as a base for a 32-bit
system designed to run beyond year 2038, with a few remaining caveats:
- All user space must be compiled with a 64-bit time_t, which will be
supported in the coming musl-1.2 and glibc-2.32 releases, along
with installed kernel headers from linux-5.6 or higher.
- Applications that use the system call interfaces directly need to
be ported to use the time64 syscalls added in linux-5.1 in place of
the existing system calls. This impacts most users of futex() and
seccomp() as well as programming languages that have their own
runtime environment not based on libc.
- Applications that use a private copy of kernel uapi header files or
their contents may need to update to the linux-5.6 version, in
particular for sound/asound.h, xfs/xfs_fs.h, linux/input.h,
linux/elfcore.h, linux/sockios.h, linux/timex.h and
linux/can/bcm.h.
- A few remaining interfaces cannot be changed to pass a 64-bit
time_t in a compatible way, so they must be configured to use
CLOCK_MONOTONIC times or (with a y2106 problem) unsigned 32-bit
timestamps. Most importantly this impacts all users of 'struct
input_event'.
- All y2038 problems that are present on 64-bit machines also apply
to 32-bit machines. In particular this affects file systems with
on-disk timestamps using signed 32-bit seconds: ext4 with
ext3-style small inodes, ext2, xfs (to be fixed soon) and ufs"
[1] https://git.kernel.org/pub/scm/linux/kernel/git/arnd/playground.git/log/?h=y2038-endgame
* tag 'y2038-drivers-for-v5.6-signed' of git://git.kernel.org:/pub/scm/linux/kernel/git/arnd/playground: (21 commits)
Revert "drm/etnaviv: reject timeouts with tv_nsec >= NSEC_PER_SEC"
y2038: sh: remove timeval/timespec usage from headers
y2038: sparc: remove use of struct timex
y2038: rename itimerval to __kernel_old_itimerval
y2038: remove obsolete jiffies conversion functions
nfs: fscache: use timespec64 in inode auxdata
nfs: fix timstamp debug prints
nfs: use time64_t internally
sunrpc: convert to time64_t for expiry
drm/etnaviv: avoid deprecated timespec
drm/etnaviv: reject timeouts with tv_nsec >= NSEC_PER_SEC
drm/msm: avoid using 'timespec'
hfs/hfsplus: use 64-bit inode timestamps
hostfs: pass 64-bit timestamps to/from user space
packet: clarify timestamp overflow
tsacct: add 64-bit btime field
acct: stop using get_seconds()
um: ubd: use 64-bit time_t where possible
xtensa: ISS: avoid struct timeval
dlm: use SO_SNDTIMEO_NEW instead of SO_SNDTIMEO_OLD
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk
Pull printk update from Petr Mladek:
"Prevent replaying log on all consoles"
* tag 'printk-for-5.6' of git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk:
printk: fix exclusive_console replaying
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull openat2 support from Al Viro:
"This is the openat2() series from Aleksa Sarai.
I'm afraid that the rest of namei stuff will have to wait - it got
zero review the last time I'd posted #work.namei, and there had been a
leak in the posted series I'd caught only last weekend. I was going to
repost it on Monday, but the window opened and the odds of getting any
review during that... Oh, well.
Anyway, openat2 part should be ready; that _did_ get sane amount of
review and public testing, so here it comes"
From Aleksa's description of the series:
"For a very long time, extending openat(2) with new features has been
incredibly frustrating. This stems from the fact that openat(2) is
possibly the most famous counter-example to the mantra "don't silently
accept garbage from userspace" -- it doesn't check whether unknown
flags are present[1].
This means that (generally) the addition of new flags to openat(2) has
been fraught with backwards-compatibility issues (O_TMPFILE has to be
defined as __O_TMPFILE|O_DIRECTORY|[O_RDWR or O_WRONLY] to ensure old
kernels gave errors, since it's insecure to silently ignore the
flag[2]). All new security-related flags therefore have a tough road
to being added to openat(2).
Furthermore, the need for some sort of control over VFS's path
resolution (to avoid malicious paths resulting in inadvertent
breakouts) has been a very long-standing desire of many userspace
applications.
This patchset is a revival of Al Viro's old AT_NO_JUMPS[3] patchset
(which was a variant of David Drysdale's O_BENEATH patchset[4] which
was a spin-off of the Capsicum project[5]) with a few additions and
changes made based on the previous discussion within [6] as well as
others I felt were useful.
In line with the conclusions of the original discussion of
AT_NO_JUMPS, the flag has been split up into separate flags. However,
instead of being an openat(2) flag it is provided through a new
syscall openat2(2) which provides several other improvements to the
openat(2) interface (see the patch description for more details). The
following new LOOKUP_* flags are added:
LOOKUP_NO_XDEV:
Blocks all mountpoint crossings (upwards, downwards, or through
absolute links). Absolute pathnames alone in openat(2) do not
trigger this. Magic-link traversal which implies a vfsmount jump is
also blocked (though magic-link jumps on the same vfsmount are
permitted).
LOOKUP_NO_MAGICLINKS:
Blocks resolution through /proc/$pid/fd-style links. This is done
by blocking the usage of nd_jump_link() during resolution in a
filesystem. The term "magic-links" is used to match with the only
reference to these links in Documentation/, but I'm happy to change
the name.
It should be noted that this is different to the scope of
~LOOKUP_FOLLOW in that it applies to all path components. However,
you can do openat2(NO_FOLLOW|NO_MAGICLINKS) on a magic-link and it
will *not* fail (assuming that no parent component was a
magic-link), and you will have an fd for the magic-link.
In order to correctly detect magic-links, the introduction of a new
LOOKUP_MAGICLINK_JUMPED state flag was required.
LOOKUP_BENEATH:
Disallows escapes to outside the starting dirfd's
tree, using techniques such as ".." or absolute links. Absolute
paths in openat(2) are also disallowed.
Conceptually this flag is to ensure you "stay below" a certain
point in the filesystem tree -- but this requires some additional
to protect against various races that would allow escape using
"..".
Currently LOOKUP_BENEATH implies LOOKUP_NO_MAGICLINKS, because it
can trivially beam you around the filesystem (breaking the
protection). In future, there might be similar safety checks done
as in LOOKUP_IN_ROOT, but that requires more discussion.
In addition, two new flags are added that expand on the above ideas:
LOOKUP_NO_SYMLINKS:
Does what it says on the tin. No symlink resolution is allowed at
all, including magic-links. Just as with LOOKUP_NO_MAGICLINKS this
can still be used with NOFOLLOW to open an fd for the symlink as
long as no parent path had a symlink component.
LOOKUP_IN_ROOT:
This is an extension of LOOKUP_BENEATH that, rather than blocking
attempts to move past the root, forces all such movements to be
scoped to the starting point. This provides chroot(2)-like
protection but without the cost of a chroot(2) for each filesystem
operation, as well as being safe against race attacks that
chroot(2) is not.
If a race is detected (as with LOOKUP_BENEATH) then an error is
generated, and similar to LOOKUP_BENEATH it is not permitted to
cross magic-links with LOOKUP_IN_ROOT.
The primary need for this is from container runtimes, which
currently need to do symlink scoping in userspace[7] when opening
paths in a potentially malicious container.
There is a long list of CVEs that could have bene mitigated by
having RESOLVE_THIS_ROOT (such as CVE-2017-1002101,
CVE-2017-1002102, CVE-2018-15664, and CVE-2019-5736, just to name a
few).
In order to make all of the above more usable, I'm working on
libpathrs[8] which is a C-friendly library for safe path resolution.
It features a userspace-emulated backend if the kernel doesn't support
openat2(2). Hopefully we can get userspace to switch to using it, and
thus get openat2(2) support for free once it's ready.
Future work would include implementing things like
RESOLVE_NO_AUTOMOUNT and possibly a RESOLVE_NO_REMOTE (to allow
programs to be sure they don't hit DoSes though stale NFS handles)"
* 'work.openat2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
Documentation: path-lookup: include new LOOKUP flags
selftests: add openat2(2) selftests
open: introduce openat2(2) syscall
namei: LOOKUP_{IN_ROOT,BENEATH}: permit limited ".." resolution
namei: LOOKUP_IN_ROOT: chroot-like scoped resolution
namei: LOOKUP_BENEATH: O_BENEATH-like scoped resolution
namei: LOOKUP_NO_XDEV: block mountpoint crossing
namei: LOOKUP_NO_MAGICLINKS: block magic-link resolution
namei: LOOKUP_NO_SYMLINKS: block symlink resolution
namei: allow set_root() to produce errors
namei: allow nd_jump_link() to produce errors
nsfs: clean-up ns_get_path() signature to return int
namei: only return -ECHILD from follow_dotdot_rcu()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu
Pull RCU warning removal from Paul McKenney:
"A single commit that fixes an embarrassing bug discussed here:
https://lore.kernel.org/lkml/20200125131425.GB16136@zn.tnic/
which apparently also affects smaller systems"
[ This was sent to Ingo, but since I see the issue on the laptop I use for
testing during the merge window, I'm doing the pull directly - Linus ]
* 'urgent-for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu:
rcu: Forgive slow expedited grace periods at boot time
|
|
Instead of using a locally defined "struct bpf_verifier_log log = {}",
btf_struct_ops_init() should reuse the "log" from its calling
function "btf_parse_vmlinux()". It should also resolve the
frame-size too large compiler warning in some ARCH.
Fixes: 27ae7997a661 ("bpf: Introduce BPF_PROG_TYPE_STRUCT_OPS")
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200127175145.1154438-1-kafai@fb.com
|
|
Move external function declarations into kernel/trace/trace.h
from trace_boot.c for tracing subsystem internal use.
Link: http://lkml.kernel.org/r/158029060405.12381.11944554430359702545.stgit@devnote2
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
|
Include some required (but currently indirectly included)
headers and sort it alphabetically.
Link: http://lkml.kernel.org/r/158029059514.12381.6597832266860248781.stgit@devnote2
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
|
The 'hist:' prefix gets stripped from the command text during command
processing, but should be added back when displaying the command
during error processing.
Not only because it's what should be displayed but also because not
having it means the test cases fail because the caret is miscalculated
by the length of the prefix string.
Link: http://lkml.kernel.org/r/449df721f560042e22382f67574bcc5b4d830d3d.1561743018.git.zanussi@kernel.org
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
|
Add error codes and messages for all the error paths leading to sort
specification parsing errors.
Link: http://lkml.kernel.org/r/237830dc05e583fbb53664d817a784297bf961be.1561743018.git.zanussi@kernel.org
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
|
In the process of adding better error messages for sorting, I realized
that strsep was being used incorrectly and some of the error paths I
was expecting to be hit weren't and just fell through to the common
invalid key error case.
It also became obvious that for keyword assignments, it wasn't
necessary to save the full assignment and reparse it later, and having
a common empty-assignment check would also make more sense in terms of
error processing.
Change the code to fix these problems and simplify it for new error
message changes in a subsequent patch.
Link: http://lkml.kernel.org/r/1c3ef0b6655deaf345f6faee2584a0298ac2d743.1561743018.git.zanussi@kernel.org
Fixes: e62347d24534 ("tracing: Add hist trigger support for user-defined sorting ('sort=' param)")
Fixes: 7ef224d1d0e3 ("tracing: Add 'hist' event trigger command")
Fixes: a4072fe85ba3 ("tracing: Add a clock attribute for hist triggers")
Reported-by: Masami Hiramatsu <mhiramat@kernel.org>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml
Pull UML updates from Anton Ivanov:
"I am sending this on behalf of Richard who is traveling.
This contains the following changes for UML:
- Fix for time travel mode
- Disable CONFIG_CONSTRUCTORS again
- A new command line option to have an non-raw serial line
- Preparations to remove obsolete UML network drivers"
* tag 'for-linus-5.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml:
um: Fix time-travel=inf-cpu with xor/raid6
Revert "um: Enable CONFIG_CONSTRUCTORS"
um: Mark non-vector net transports as obsolete
um: Add an option to make serial driver non-raw
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fix from Steven Rostedt:
"Kprobe events added 'ustring' to distinguish reading strings from
kernel space or user space.
But the creating of the event format file only checks for 'string' to
display string formats. 'ustring' must also be handled"
* tag 'trace-v5.5-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing/kprobes: Have uname use __get_str() in print_fmt
|
|
Pull networking updates from David Miller:
1) Add WireGuard
2) Add HE and TWT support to ath11k driver, from John Crispin.
3) Add ESP in TCP encapsulation support, from Sabrina Dubroca.
4) Add variable window congestion control to TIPC, from Jon Maloy.
5) Add BCM84881 PHY driver, from Russell King.
6) Start adding netlink support for ethtool operations, from Michal
Kubecek.
7) Add XDP drop and TX action support to ena driver, from Sameeh
Jubran.
8) Add new ipv4 route notifications so that mlxsw driver does not have
to handle identical routes itself. From Ido Schimmel.
9) Add BPF dynamic program extensions, from Alexei Starovoitov.
10) Support RX and TX timestamping in igc, from Vinicius Costa Gomes.
11) Add support for macsec HW offloading, from Antoine Tenart.
12) Add initial support for MPTCP protocol, from Christoph Paasch,
Matthieu Baerts, Florian Westphal, Peter Krystad, and many others.
13) Add Octeontx2 PF support, from Sunil Goutham, Geetha sowjanya, Linu
Cherian, and others.
* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1469 commits)
net: phy: add default ARCH_BCM_IPROC for MDIO_BCM_IPROC
udp: segment looped gso packets correctly
netem: change mailing list
qed: FW 8.42.2.0 debug features
qed: rt init valid initialization changed
qed: Debug feature: ilt and mdump
qed: FW 8.42.2.0 Add fw overlay feature
qed: FW 8.42.2.0 HSI changes
qed: FW 8.42.2.0 iscsi/fcoe changes
qed: Add abstraction for different hsi values per chip
qed: FW 8.42.2.0 Additional ll2 type
qed: Use dmae to write to widebus registers in fw_funcs
qed: FW 8.42.2.0 Parser offsets modified
qed: FW 8.42.2.0 Queue Manager changes
qed: FW 8.42.2.0 Expose new registers and change windows
qed: FW 8.42.2.0 Internal ram offsets modifications
MAINTAINERS: Add entry for Marvell OcteonTX2 Physical Function driver
Documentation: net: octeontx2: Add RVU HW and drivers overview
octeontx2-pf: ethtool RSS config support
octeontx2-pf: Add basic ethtool support
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto updates from Herbert Xu:
"API:
- Removed CRYPTO_TFM_RES flags
- Extended spawn grabbing to all algorithm types
- Moved hash descsize verification into API code
Algorithms:
- Fixed recursive pcrypt dead-lock
- Added new 32 and 64-bit generic versions of poly1305
- Added cryptogams implementation of x86/poly1305
Drivers:
- Added support for i.MX8M Mini in caam
- Added support for i.MX8M Nano in caam
- Added support for i.MX8M Plus in caam
- Added support for A33 variant of SS in sun4i-ss
- Added TEE support for Raven Ridge in ccp
- Added in-kernel API to submit TEE commands in ccp
- Added AMD-TEE driver
- Added support for BCM2711 in iproc-rng200
- Added support for AES256-GCM based ciphers for chtls
- Added aead support on SEC2 in hisilicon"
* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (244 commits)
crypto: arm/chacha - fix build failured when kernel mode NEON is disabled
crypto: caam - add support for i.MX8M Plus
crypto: x86/poly1305 - emit does base conversion itself
crypto: hisilicon - fix spelling mistake "disgest" -> "digest"
crypto: chacha20poly1305 - add back missing test vectors and test chunking
crypto: x86/poly1305 - fix .gitignore typo
tee: fix memory allocation failure checks on drv_data and amdtee
crypto: ccree - erase unneeded inline funcs
crypto: ccree - make cc_pm_put_suspend() void
crypto: ccree - split overloaded usage of irq field
crypto: ccree - fix PM race condition
crypto: ccree - fix FDE descriptor sequence
crypto: ccree - cc_do_send_request() is void func
crypto: ccree - fix pm wrongful error reporting
crypto: ccree - turn errors to debug msgs
crypto: ccree - fix AEAD decrypt auth fail
crypto: ccree - fix typo in comment
crypto: ccree - fix typos in error msgs
crypto: atmel-{aes,sha,tdes} - Retire crypto_platform_data
crypto: x86/sha - Eliminate casts on asm implementations
...
|
|
When a running task is moved on a throttled task group and there is no
other task enqueued on the CPU, the task can keep running using 100% CPU
whatever the allocated bandwidth for the group and although its cfs rq is
throttled. Furthermore, the group entity of the cfs_rq and its parents are
not enqueued but only set as curr on their respective cfs_rqs.
We have the following sequence:
sched_move_task
-dequeue_task: dequeue task and group_entities.
-put_prev_task: put task and group entities.
-sched_change_group: move task to new group.
-enqueue_task: enqueue only task but not group entities because cfs_rq is
throttled.
-set_next_task : set task and group_entities as current sched_entity of
their cfs_rq.
Another impact is that the root cfs_rq runnable_load_avg at root rq stays
null because the group_entities are not enqueued. This situation will stay
the same until an "external" event triggers a reschedule. Let trigger it
immediately instead.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Ben Segall <bsegall@google.com>
Link: https://lkml.kernel.org/r/1579011236-31256-1-git-send-email-vincent.guittot@linaro.org
|
|
On a machine, CPU 0 is used for housekeeping, the other 39 CPUs in the
same socket are in nohz_full mode. We can observe huge time burn in the
loop for seaching nearest busy housekeeper cpu by ftrace.
2) | get_nohz_timer_target() {
2) 0.240 us | housekeeping_test_cpu();
2) 0.458 us | housekeeping_test_cpu();
...
2) 0.292 us | housekeeping_test_cpu();
2) 0.240 us | housekeeping_test_cpu();
2) 0.227 us | housekeeping_any_cpu();
2) + 43.460 us | }
This patch optimizes the searching logic by finding a nearest housekeeper
CPU in the housekeeping cpumask, it can minimize the worst searching time
from ~44us to < 10us in my testing. In addition, the last iterated busy
housekeeper can become a random candidate while current CPU is a better
fallback if it is a housekeeper.
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/1578876627-11938-1-git-send-email-wanpengli@tencent.com
|
|
The check to ensure that the new written value into cpu.uclamp.{min,max}
is within range, [0:100], wasn't working because of the signed
comparison
7301 if (req.percent > UCLAMP_PERCENT_SCALE) {
7302 req.ret = -ERANGE;
7303 return req;
7304 }
# echo -1 > cpu.uclamp.min
# cat cpu.uclamp.min
42949671.96
Cast req.percent into u64 to force the comparison to be unsigned and
work as intended in capacity_from_percent().
# echo -1 > cpu.uclamp.min
sh: write error: Numerical result out of range
Fixes: 2480c093130f ("sched/uclamp: Extend CPU's cgroup controller")
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20200114210947.14083-1-qais.yousef@arm.com
|
|
The CPU load balancer balances between different domains to spread load
and strives to have equal balance everywhere. Communicating tasks can
migrate so they are topologically close to each other but these decisions
are independent. On a lightly loaded NUMA machine, two communicating tasks
pulled together at wakeup time can be pushed apart by the load balancer.
In isolation, the load balancer decision is fine but it ignores the tasks
data locality and the wakeup/LB paths continually conflict. NUMA balancing
is also a factor but it also simply conflicts with the load balancer.
This patch allows a fixed degree of imbalance of two tasks to exist
between NUMA domains regardless of utilisation levels. In many cases,
this prevents communicating tasks being pulled apart. It was evaluated
whether the imbalance should be scaled to the domain size. However, no
additional benefit was measured across a range of workloads and machines
and scaling adds the risk that lower domains have to be rebalanced. While
this could change again in the future, such a change should specify the
use case and benefit.
The most obvious impact is on netperf TCP_STREAM -- two simple
communicating tasks with some softirq offload depending on the
transmission rate.
2-socket Haswell machine 48 core, HT enabled
netperf-tcp -- mmtests config config-network-netperf-unbound
baseline lbnuma-v3
Hmean 64 568.73 ( 0.00%) 577.56 * 1.55%*
Hmean 128 1089.98 ( 0.00%) 1128.06 * 3.49%*
Hmean 256 2061.72 ( 0.00%) 2104.39 * 2.07%*
Hmean 1024 7254.27 ( 0.00%) 7557.52 * 4.18%*
Hmean 2048 11729.20 ( 0.00%) 13350.67 * 13.82%*
Hmean 3312 15309.08 ( 0.00%) 18058.95 * 17.96%*
Hmean 4096 17338.75 ( 0.00%) 20483.66 * 18.14%*
Hmean 8192 25047.12 ( 0.00%) 27806.84 * 11.02%*
Hmean 16384 27359.55 ( 0.00%) 33071.88 * 20.88%*
Stddev 64 2.16 ( 0.00%) 2.02 ( 6.53%)
Stddev 128 2.31 ( 0.00%) 2.19 ( 5.05%)
Stddev 256 11.88 ( 0.00%) 3.22 ( 72.88%)
Stddev 1024 23.68 ( 0.00%) 7.24 ( 69.43%)
Stddev 2048 79.46 ( 0.00%) 71.49 ( 10.03%)
Stddev 3312 26.71 ( 0.00%) 57.80 (-116.41%)
Stddev 4096 185.57 ( 0.00%) 96.15 ( 48.19%)
Stddev 8192 245.80 ( 0.00%) 100.73 ( 59.02%)
Stddev 16384 207.31 ( 0.00%) 141.65 ( 31.67%)
In this case, there was a sizable improvement to performance and
a general reduction in variance. However, this is not univeral.
For most machines, the impact was roughly a 3% performance gain.
Ops NUMA base-page range updates 19796.00 292.00
Ops NUMA PTE updates 19796.00 292.00
Ops NUMA PMD updates 0.00 0.00
Ops NUMA hint faults 16113.00 143.00
Ops NUMA hint local faults % 8407.00 142.00
Ops NUMA hint local percent 52.18 99.30
Ops NUMA pages migrated 4244.00 1.00
Without the patch, only 52.18% of sampled accesses are local. In an
earlier changelog, 100% of sampled accesses are local and indeed on
most machines, this was still the case. In this specific case, the
local sampled rates was 99.3% but note the "base-page range updates"
and "PTE updates". The activity with the patch is negligible as were
the number of faults. The small number of pages migrated were related to
shared libraries. A 2-socket Broadwell showed better results on average
but are not presented for brevity as the performance was similar except
it showed 100% of the sampled NUMA hints were local. The patch holds up
for a 4-socket Haswell, an AMD EPYC and AMD Epyc 2 machine.
For dbench, the impact depends on the filesystem used and the number of
clients. On XFS, there is little difference as the clients typically
communicate with workqueues which have a separate class of scheduler
problem at the moment. For ext4, performance is generally better,
particularly for small numbers of clients as NUMA balancing activity is
negligible with the patch applied.
A more interesting example is the Facebook schbench which uses a
number of messaging threads to communicate with worker threads. In this
configuration, one messaging thread is used per NUMA node and the number of
worker threads is varied. The 50, 75, 90, 95, 99, 99.5 and 99.9 percentiles
for response latency is then reported.
Lat 50.00th-qrtle-1 44.00 ( 0.00%) 37.00 ( 15.91%)
Lat 75.00th-qrtle-1 53.00 ( 0.00%) 41.00 ( 22.64%)
Lat 90.00th-qrtle-1 57.00 ( 0.00%) 42.00 ( 26.32%)
Lat 95.00th-qrtle-1 63.00 ( 0.00%) 43.00 ( 31.75%)
Lat 99.00th-qrtle-1 76.00 ( 0.00%) 51.00 ( 32.89%)
Lat 99.50th-qrtle-1 89.00 ( 0.00%) 52.00 ( 41.57%)
Lat 99.90th-qrtle-1 98.00 ( 0.00%) 55.00 ( 43.88%)
Lat 50.00th-qrtle-2 42.00 ( 0.00%) 42.00 ( 0.00%)
Lat 75.00th-qrtle-2 48.00 ( 0.00%) 47.00 ( 2.08%)
Lat 90.00th-qrtle-2 53.00 ( 0.00%) 52.00 ( 1.89%)
Lat 95.00th-qrtle-2 55.00 ( 0.00%) 53.00 ( 3.64%)
Lat 99.00th-qrtle-2 62.00 ( 0.00%) 60.00 ( 3.23%)
Lat 99.50th-qrtle-2 63.00 ( 0.00%) 63.00 ( 0.00%)
Lat 99.90th-qrtle-2 68.00 ( 0.00%) 66.00 ( 2.94%
For higher worker threads, the differences become negligible but it's
interesting to note the difference in wakeup latency at low utilisation
and mpstat confirms that activity was almost all on one node until
the number of worker threads increase.
Hackbench generally showed neutral results across a range of machines.
This is different to earlier versions of the patch which allowed imbalances
for higher degrees of utilisation. perf bench pipe showed negligible
differences in overall performance as the differences are very close to
the noise.
An earlier prototype of the patch showed major regressions for NAS C-class
when running with only half of the available CPUs -- 20-30% performance
hits were measured at the time. With this version of the patch, the impact
is negligible with small gains/losses within the noise measured. This is
because the number of threads far exceeds the small imbalance the aptch
cares about. Similarly, there were report of regressions for the autonuma
benchmark against earlier versions but again, normal load balancing now
applies for that workload.
In general, the patch simply seeks to avoid unnecessary cross-node
migrations in the basic case where imbalances are very small. For low
utilisation communicating workloads, this patch generally behaves better
with less NUMA balancing activity. For high utilisation, there is no
change in behaviour.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Phil Auld <pauld@redhat.com>
Tested-by: Phil Auld <pauld@redhat.com>
Link: https://lkml.kernel.org/r/20200114101319.GO3466@techsingularity.net
|
|
The way loadavg is tracked during nohz only pays attention to the load
upon entering nohz. This can be particularly noticeable if full nohz is
entered while non-idle, and then the cpu goes idle and stays that way for
a long time.
Use the remote tick to ensure that full nohz cpus report their deltas
within a reasonable time.
[ swood: Added changelog and removed recheck of stopped tick. ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Scott Wood <swood@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/1578736419-14628-3-git-send-email-swood@redhat.com
|
|
This will be used in the next patch to get a loadavg update from
nohz cpus. The delta check is skipped because idle_sched_class
doesn't update se.exec_start.
Signed-off-by: Scott Wood <swood@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/1578736419-14628-2-git-send-email-swood@redhat.com
|