Currently the verifier does not track imm across alu operations when
the source register is of unknown type. This adds additional pattern
matching to catch this and track imm. We've seen LLVM generating this
pattern while working on cilium.
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Currently, bpf_trace_printk does not support the common format
specifier '%i', although vsprintf, which the bpf helper eventually
calls, does. If users are used to '%i' and make use of it,
bpf_trace_printk just returns an error without dumping anything
to the trace pipe, so add support for '%i' to the helper.
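A minimal sketch of the idea, not the helper's actual code: a simplified validator that accepts only a small set of integer conversion characters. The name conv_allowed is hypothetical, and the real format checking in the helper covers more cases (such as length modifiers) than shown here.

```c
#include <stdbool.h>

/*
 * Hypothetical, simplified check: is this conversion character one the
 * helper would accept for an integer argument?  'i' is the newly added
 * case; vsprintf treats it as a signed decimal, like 'd'.
 */
static inline bool conv_allowed(char c)
{
	switch (c) {
	case 'd':
	case 'i':	/* previously rejected */
	case 'u':
	case 'x':
		return true;
	default:
		return false;
	}
}
```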
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
We already export through fdinfo whether a prog is JITed or not.
A program load can fail when the prog and the tail call map disagree
on the JITed property, i.e. one is JITed and the other is not, so we
can facilitate error reporting in loaders like iproute2 by also exporting
owner_jited of the tail call map. We already export owner_prog_type
through this facility, so a parser can pick up both for comparison.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This work tries to make the semantics and code around the
narrower ctx access a bit easier to follow. Right now
everything is done inside the .is_valid_access(). Offset
matching is done differently for read/write types, meaning
writes don't support narrower access and thus matching only
on offsetof(struct foo, bar) is enough whereas for read
case that supports narrower access we must check for
offsetof(struct foo, bar) ... offsetof(struct foo, bar) +
sizeof(<bar>) - 1 for each of the cases. For read cases of
individual members that don't support narrower access (like
packet pointers or skb->cb[] case which has its own narrow
access logic), we check as usual only offsetof(struct foo,
bar) like in write case. Then, for the case where narrower
access is allowed, we also need to set the aux info for the
access. Meaning, ctx_field_size and converted_op_size have
to be set. First is the original field size e.g. sizeof(<bar>)
as in above example from the user facing ctx, and latter
one is the target size after actual rewrite happened, thus
for the kernel facing ctx. Here, too, we need the range match,
and when changing either we need to keep convert_ctx_access() and
the converted_op_size set from is_valid_access() in sync, as the
two are not at the same location.
We can simplify the code a bit: check_ctx_access() becomes
simpler in that we only store ctx_field_size as a meta data
and later in convert_ctx_accesses() we fetch the target_size
right from the location where we do convert. Should the verifier
be misconfigured we do reject for BPF_WRITE cases or target_size
that are not provided. For the subsystems, we always work on
ranges in is_valid_access() and add small helpers for ranges
and narrow access, convert_ctx_accesses() sets target_size
for the relevant instruction.
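The difference between range matching for reads and exact matching for writes can be sketched as follows. This is only an illustration under stated assumptions: struct foo, its members, and the ctx_range_* and foo_bar_*_ok names are hypothetical, and size and alignment checks of the access itself are left out.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user facing ctx, standing in for "struct foo" above. */
struct foo {
	uint32_t baz;
	uint32_t bar;
};

/* First and last byte offset covered by a member. */
#define ctx_range_first(TYPE, MEMBER)	offsetof(TYPE, MEMBER)
#define ctx_range_last(TYPE, MEMBER) \
	(offsetof(TYPE, MEMBER) + sizeof(((TYPE *)0)->MEMBER) - 1)

/* Read of 'bar': any offset inside the field may start a narrower access. */
static inline bool foo_bar_read_ok(size_t off)
{
	return off >= ctx_range_first(struct foo, bar) &&
	       off <= ctx_range_last(struct foo, bar);
}

/* Write of 'bar': no narrower access, so only the exact offset matches. */
static inline bool foo_bar_write_ok(size_t off)
{
	return off == offsetof(struct foo, bar);
}
```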
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Cc: Yonghong Song <yhs@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Created a new BPF program type, BPF_PROG_TYPE_SOCK_OPS, and a corresponding
struct that allows BPF programs of this type to access some of the
socket's fields (such as IP addresses, ports, etc.). It uses the
existing bpf cgroups infrastructure so the programs can be attached per
cgroup with full inheritance support. The program will be called at
appropriate times to set relevant connection parameters such as buffer
sizes, SYN and SYN-ACK RTOs, etc., based on connection information such
as IP addresses, port numbers, etc.
Although there are already 3 mechanisms to set parameters (sysctls,
route metrics and setsockopts), this new mechanism provides some
distinct advantages. Unlike sysctls, it can set parameters per
connection. In contrast to route metrics, it can also use port numbers
and information provided by a user level program. In addition, it could
set parameters probabilistically for evaluation purposes (i.e. do
something different on 10% of the flows and compare results with the
other 90% of the flows). Also, in cases where IPv6 addresses contain
geographic information, the rules to make changes based on the distance
(or RTT) between the hosts are much easier than route metric rules and
can be global. Finally, unlike setsockopt, it does not require
application changes and it can be updated easily at any time.
Although the bpf cgroup framework already contains a sock related
program type (BPF_PROG_TYPE_CGROUP_SOCK), I created the new type
(BPF_PROG_TYPE_SOCK_OPS) because the existing type expects to be called
only once during the connection's lifetime. In contrast, the new
program type will be called multiple times from different places in the
network stack code. For example, before sending SYN and SYN-ACKs to set
an appropriate timeout, when the connection is established to set
congestion control, etc. As a result it has an "op" field to specify the
type of operation requested.
The purpose of this new program type is to simplify setting connection
parameters, such as buffer sizes, TCP's SYN RTO, etc. For example, it is
easy to use facebook's internal IPv6 addresses to determine if both hosts
of a connection are in the same datacenter. Therefore, it is easy to
write a BPF program to choose a small SYN RTO value when both hosts are
in the same datacenter.
This patch only contains the framework to support the new BPF program
type, following patches add the functionality to set various connection
parameters.
This patch defines a new BPF program type: BPF_PROG_TYPE_SOCK_OPS
and a new bpf syscall command to load a new program of this type:
BPF_PROG_LOAD_SOCKET_OPS.
Two new corresponding structs (one for the kernel one for the user/BPF
program):
/* kernel version */
struct bpf_sock_ops_kern {
	struct sock *sk;
	__u32 op;
	union {
		__u32 reply;
		__u32 replylong[4];
	};
};
/* user version
 * Some fields are in network byte order reflecting the sock struct
 * Use the bpf_ntohl helper macro in samples/bpf/bpf_endian.h to
 * convert them to host byte order.
 */
struct bpf_sock_ops {
	__u32 op;
	union {
		__u32 reply;
		__u32 replylong[4];
	};
	__u32 family;
	__u32 remote_ip4;	/* In network byte order */
	__u32 local_ip4;	/* In network byte order */
	__u32 remote_ip6[4];	/* In network byte order */
	__u32 local_ip6[4];	/* In network byte order */
	__u32 remote_port;	/* In network byte order */
	__u32 local_port;	/* In host byte order */
};
Currently there are two types of ops. The first type expects the BPF
program to return a value which is then used by the caller (or a
negative value to indicate the operation is not supported). The second
type expects state changes to be done by the BPF program, for example
through a setsockopt BPF helper function, and they ignore the return
value.
The reply fields of the bpf_sock_ops struct are there in case a bpf
program needs to return a value larger than an integer.
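The two kinds of ops described above could be handled along these lines. This is a plain C model, not a loadable BPF program: struct demo_sock_ops, the DEMO_OP_* values and the returned number are all hypothetical, since this patch only adds the framework and defines no concrete ops.

```c
#include <stdint.h>

/* Hypothetical op codes, for illustration only. */
enum {
	DEMO_OP_GET_SYN_RTO = 1,	/* first type: return value is used */
	DEMO_OP_CONN_ESTABLISHED = 2,	/* second type: return value is ignored */
};

/* Reduced local stand-in for the user version of the struct. */
struct demo_sock_ops {
	uint32_t op;
	uint32_t reply;
};

static inline int demo_sock_ops_prog(struct demo_sock_ops *ctx)
{
	switch (ctx->op) {
	case DEMO_OP_GET_SYN_RTO:
		return 10;	/* arbitrary illustrative value handed to the caller */
	case DEMO_OP_CONN_ESTABLISHED:
		/* state changes would be made through a helper here */
		return 0;
	default:
		return -1;	/* negative value: operation not supported */
	}
}
```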
Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
A set of overlapping changes in macvlan and the rocker
driver, nothing serious.
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Pull networking fixes from David Miller:
1) Need to access netdev->num_rx_queues behind an accessor in netvsc
driver otherwise the build breaks with some configs, from Arnd
Bergmann.
2) Add dummy xfrm_dev_event() so that build doesn't fail when
CONFIG_XFRM_OFFLOAD is not set. From Hangbin Liu.
3) Don't OOPS when pfkey_msg2xfrm_state() signals an error, from Dan
Carpenter.
4) Fix MCDI command size for filter operations in sfc driver, from
Martin Habets.
5) Fix UFO segmenting so that we don't calculate incorrect checksums,
from Michal Kubecek.
6) When ipv6 datagram connects fail, reset destination address and
port. From Wei Wang.
7) TCP disconnect must reset the cached receive DST, from WANG Cong.
8) Fix sign extension bug on 32-bit in dev_get_stats(), from Eric
Dumazet.
9) fman driver has to depend on HAS_DMA, from Madalin Bucur.
10) Fix bpf pointer leak with xadd in verifier, from Daniel Borkmann.
11) Fix negative page counts with GRO, from Michal Kubecek.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (41 commits)
sfc: fix attempt to translate invalid filter ID
net: handle NAPI_GRO_FREE_STOLEN_HEAD case also in napi_frags_finish()
bpf: prevent leaking pointer via xadd on unpriviledged
arcnet: com20020-pci: add missing pdev setup in netdev structure
arcnet: com20020-pci: fix dev_id calculation
arcnet: com20020: remove needless base_addr assignment
Trivial fix to spelling mistake in arc_printk message
arcnet: change irq handler to lock irqsave
rocker: move dereference before free
mlxsw: spectrum_router: Fix NULL pointer dereference
net: sched: Fix one possible panic when no destroy callback
virtio-net: serialize tx routine during reset
net: usb: asix88179_178a: Add support for the Belkin B2B128
fsl/fman: add dependency on HAS_DMA
net: prevent sign extension in dev_get_stats()
tcp: reset sk_rx_dst in tcp_disconnect()
net: ipv6: reset daddr and dport in sk if connect() fails
bnx2x: Don't log mc removal needlessly
bnxt_en: Fix netpoll handling.
bnxt_en: Add missing logic to handle TPA end error conditions.
...
|
|
Leaking kernel addresses on unprivileged is generally disallowed,
for example, verifier rejects the following:
0: (b7) r0 = 0
1: (18) r2 = 0xffff897e82304400
3: (7b) *(u64 *)(r1 +48) = r2
R2 leaks addr into ctx
Doing pointer arithmetic on them is also forbidden, so that they
don't turn into unknown value and then get leaked out. However,
there's xadd as a special case, where we don't check the src reg
for being a pointer register, e.g. the following will pass:
0: (b7) r0 = 0
1: (7b) *(u64 *)(r1 +48) = r0
2: (18) r2 = 0xffff897e82304400 ; map
4: (db) lock *(u64 *)(r1 +48) += r2
5: (95) exit
We could store the pointer into skb->cb, lose the type context,
and then read it out from there again to leak it eventually out
of a map value. Or more easily in a different variant, too:
0: (bf) r6 = r1
1: (7a) *(u64 *)(r10 -8) = 0
2: (bf) r2 = r10
3: (07) r2 += -8
4: (18) r1 = 0x0
6: (85) call bpf_map_lookup_elem#1
7: (15) if r0 == 0x0 goto pc+3
R0=map_value(ks=8,vs=8,id=0),min_value=0,max_value=0 R6=ctx R10=fp
8: (b7) r3 = 0
9: (7b) *(u64 *)(r0 +0) = r3
10: (db) lock *(u64 *)(r0 +0) += r6
11: (b7) r0 = 0
12: (95) exit
from 7 to 11: R0=inv,min_value=0,max_value=0 R6=ctx R10=fp
11: (b7) r0 = 0
12: (95) exit
Prevent this by checking xadd src reg for pointer types. Also
add a couple of test cases related to this.
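A toy model of the added check, assuming a much reduced register state: the enum, the demo_* names and the allow_ptr_leaks parameter are illustrative stand-ins, not the verifier's real data structures.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical, reduced set of register states. */
enum demo_reg_type {
	DEMO_SCALAR,		/* plain value, safe to store */
	DEMO_PTR_TO_CTX,	/* kernel pointer */
	DEMO_PTR_TO_MAP_VALUE,	/* kernel pointer */
};

static inline bool demo_is_pointer(enum demo_reg_type t)
{
	return t != DEMO_SCALAR;
}

/*
 * Check the source register of an xadd.  Returns 0 if the instruction
 * may proceed, -EACCES if it would add a kernel address into memory
 * while pointer leaks are not allowed (the unprivileged case).
 */
static inline int demo_check_xadd_src(enum demo_reg_type src,
				      bool allow_ptr_leaks)
{
	if (!allow_ptr_leaks && demo_is_pointer(src))
		return -EACCES;
	return 0;
}
```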
Fixes: 1be7f75d1668 ("bpf: enable non-root eBPF programs")
Fixes: 17a5267067f3 ("bpf: verifier (add verifier core)")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The index is off-by-one when fp->aux->stack_depth
is already a multiple of 32. In particular,
if stack_depth is 512, the index will be 16, which is
out of bounds for the interpreters array.
The fix is to round_up and then subtract 1 instead of using round_down.
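The arithmetic can be checked with a small standalone sketch. The function names are hypothetical, and two details are assumptions rather than statements from this message: that the table has 16 entries (512 / 32) and that a depth of 0 is treated as 1.

```c
#include <stdint.h>

#define INTERP_STEP	32u	/* one interpreter per 32 bytes of stack */
#define INTERP_COUNT	16u	/* assumed: 512 / 32, valid indexes 0..15 */

/* Old computation: round down, then divide. */
static inline uint32_t interp_index_old(uint32_t stack_depth)
{
	return (stack_depth / INTERP_STEP * INTERP_STEP) / INTERP_STEP;
}

/* Fixed computation: round up, divide, then subtract 1. */
static inline uint32_t interp_index_new(uint32_t stack_depth)
{
	uint32_t depth = stack_depth ? stack_depth : 1;	/* assumed: 0 acts as 1 */
	uint32_t rounded = (depth + INTERP_STEP - 1) / INTERP_STEP * INTERP_STEP;

	return rounded / INTERP_STEP - 1;
}
```

With stack_depth 512 the old computation yields 16 and the new one yields 15, the last valid slot under the assumed table size.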
[ 22.318680] ==================================================================
[ 22.319745] BUG: KASAN: global-out-of-bounds in bpf_prog_select_runtime+0x48a/0x670
[ 22.320737] Read of size 8 at addr ffffffff82aadae0 by task sockex3/1946
[ 22.321646]
[ 22.321858] CPU: 1 PID: 1946 Comm: sockex3 Tainted: G W 4.12.0-rc6-01680-g2ee87db3a287 #22
[ 22.323061] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-1.el7.centos 04/01/2014
[ 22.324260] Call Trace:
[ 22.324612] dump_stack+0x67/0x99
[ 22.325081] print_address_description+0x1e8/0x290
[ 22.325734] ? bpf_prog_select_runtime+0x48a/0x670
[ 22.326360] kasan_report+0x265/0x350
[ 22.326860] __asan_report_load8_noabort+0x19/0x20
[ 22.327484] bpf_prog_select_runtime+0x48a/0x670
[ 22.328109] bpf_prog_load+0x626/0xd40
[ 22.328637] ? __bpf_prog_charge+0xc0/0xc0
[ 22.329222] ? check_nnp_nosuid.isra.61+0x100/0x100
[ 22.329890] ? __might_fault+0xf6/0x1b0
[ 22.330446] ? lock_acquire+0x360/0x360
[ 22.331013] SyS_bpf+0x67c/0x24d0
[ 22.331491] ? trace_hardirqs_on+0xd/0x10
[ 22.332049] ? __getnstimeofday64+0xaf/0x1c0
[ 22.332635] ? bpf_prog_get+0x20/0x20
[ 22.333135] ? __audit_syscall_entry+0x300/0x600
[ 22.333770] ? syscall_trace_enter+0x540/0xdd0
[ 22.334339] ? exit_to_usermode_loop+0xe0/0xe0
[ 22.334950] ? do_syscall_64+0x48/0x410
[ 22.335446] ? bpf_prog_get+0x20/0x20
[ 22.335954] do_syscall_64+0x181/0x410
[ 22.336454] entry_SYSCALL64_slow_path+0x25/0x25
[ 22.337121] RIP: 0033:0x7f263fe81f19
[ 22.337618] RSP: 002b:00007ffd9a3440c8 EFLAGS: 00000202 ORIG_RAX: 0000000000000141
[ 22.338619] RAX: ffffffffffffffda RBX: 0000000000aac5fb RCX: 00007f263fe81f19
[ 22.339600] RDX: 0000000000000030 RSI: 00007ffd9a3440d0 RDI: 0000000000000005
[ 22.340470] RBP: 0000000000a9a1e0 R08: 0000000000a9a1e0 R09: 0000009d00000001
[ 22.341430] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000010000
[ 22.342411] R13: 0000000000a9a023 R14: 0000000000000001 R15: 0000000000000003
[ 22.343369]
[ 22.343593] The buggy address belongs to the variable:
[ 22.344241] interpreters+0x80/0x980
[ 22.344708]
[ 22.344908] Memory state around the buggy address:
[ 22.345556] ffffffff82aad980: 00 00 00 04 fa fa fa fa 04 fa fa fa fa fa fa fa
[ 22.346449] ffffffff82aada00: 00 00 00 00 00 fa fa fa fa fa fa fa 00 00 00 00
[ 22.347361] >ffffffff82aada80: 00 00 00 00 00 00 00 00 00 00 00 00 fa fa fa fa
[ 22.348301] ^
[ 22.349142] ffffffff82aadb00: 00 01 fa fa fa fa fa fa 00 00 00 00 00 00 00 00
[ 22.350058] ffffffff82aadb80: 00 00 07 fa fa fa fa fa 00 00 05 fa fa fa fa fa
[ 22.350984] ==================================================================
Fixes: b870aa901f4b ("bpf: use different interpreter depending on required stack size")
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This patch allows userspace to do BPF_MAP_LOOKUP_ELEM on
BPF_MAP_TYPE_PROG_ARRAY,
BPF_MAP_TYPE_ARRAY_OF_MAPS and
BPF_MAP_TYPE_HASH_OF_MAPS.
The lookup returns a prog-id or map-id to the userspace.
The userspace can then use the BPF_PROG_GET_FD_BY_ID
or BPF_MAP_GET_FD_BY_ID to get a fd.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fixes from Thomas Gleixner:
"A few fixes for timekeeping and timers:
- Plug a subtle race due to a missing READ_ONCE() in the timekeeping
code where reloading of a pointer results in an inconsistent
callback argument being supplied to the clocksource->read function.
- Correct the CLOCK_MONOTONIC_RAW sub-nanosecond accounting in the
time keeping core code, to prevent a possible discontinuity.
- Apply a similar fix to the arm64 vdso clock_gettime()
implementation
- Add missing includes to clocksource drivers, which relied on
indirect includes which fail in certain configs.
- Use the proper iomem pointer for read/iounmap in a probe function"
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
arm64/vdso: Fix nsec handling for CLOCK_MONOTONIC_RAW
time: Fix CLOCK_MONOTONIC_RAW sub-nanosecond accounting
time: Fix clock->read(clock) race around clocksource changes
clocksource: Explicitly include linux/clocksource.h when needed
clocksource/drivers/arm_arch_timer: Fix read and iounmap of incorrect variable
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Thomas Gleixner:
"Three fixlets for perf:
- Return the proper error code if aux buffers for an event are not
supported.
- Calculate the probe offset for inlined functions correctly
- Update the Skylake DTLB load/store miss event so it can count 1G
TLB entries as well"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf probe: Fix probe definition for inlined functions
perf/x86/intel: Add 1G DTLB load/store miss support for SKL
perf/aux: Correct return code of rb_alloc_aux() if !has_aux(ev)
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull timer fix from Eric Biederman:
"This fixes an issue of confusing injected signals with the signals
from posix timers that has existed since posix timers have been in the
kernel.
This patch is slightly simpler than my earlier version of this patch
as I discovered in testing that I had misspelled "#ifdef
CONFIG_POSIX_TIMERS". So I deleted that unnecessary test and made
setting of resched_timer unconditional.
I have tested this and verified that without this patch there is a
nasty hang that is easy to trigger, and with this patch everything
works properly"
Thomas Gleixner dixit:
"It fixes the problem at hand and covers the ptrace case as well, which
I missed.
Reviewed-and-tested-by: Thomas Gleixner <tglx@linutronix.de>"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
signal: Only reschedule timers on signals timers have sent
|
|
Commit 31fd85816dbe ("bpf: permits narrower load from bpf program
context fields") permits narrower load for certain ctx fields.
The commit however will already generate a masking even if
the prog-specific ctx conversion produces the result with
narrower size.
For example, for __sk_buff->protocol, the ctx conversion
loads the data into register with 2-byte load.
A narrower 2-byte load should not generate masking.
For __sk_buff->vlan_present, the conversion function
sets the result as either 0 or 1, essentially a byte.
The narrower 2-byte or 1-byte load should not generate masking.
To avoid unnecessary masking, prog-specific *_is_valid_access
now passes converted_op_size back to verifier, which indicates
the valid data width after perceived future conversion.
Based on this information, verifier is able to avoid
unnecessary masking.
Since we want more information back from prog-specific
*_is_valid_access checking, all of them are packed into
one data structure for more clarity.
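The decision described above can be sketched as a single comparison. The helper name is hypothetical and this is a simplified model of the rule, not the verifier code: masking is only needed when the load asks for fewer bytes than the conversion produces.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * load_size:         bytes requested by the program's narrower load
 * converted_op_size: bytes of valid data the ctx conversion produces
 */
static inline bool demo_needs_mask(uint32_t load_size,
				   uint32_t converted_op_size)
{
	return load_size < converted_op_size;
}
```

For the two fields mentioned: a 2-byte load of protocol (converted size 2) needs no mask, and neither a 2-byte nor a 1-byte load of vlan_present (converted size 1) does.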
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Two entries being added at the same time to the IFLA
policy table, whilst parallel bug fixes to decnet
routing dst handling overlapping with the dst gc removal
in net-next.
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jikos/livepatching
Pull livepatching fix from Jiri Kosina:
"Fix the way how livepatches are being stacked with respect to RCU,
from Petr Mladek"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/livepatching:
livepatch: Fix stacking of patches with respect to RCU
|
|
If the event for which an AUX area is about to be allocated does
not support setting up an AUX area, rb_alloc_aux() returns -ENOTSUPP.
This error condition is returned unfiltered to user space,
and, for example, the perf tool fails with:
failed to mmap with 524 (INTERNAL ERROR: strerror_r(524, 0x3fff497a1c8, 512)=22)
This error can be easily seen with "perf record -m 128,256 -e cpu-clock".
The 524 error code maps to -ENOTSUPP (in rb_alloc_aux()). The -ENOTSUPP
error code shall be only used within the kernel. So the correct error
code would then be -EOPNOTSUPP.
With this commit, the perf tool then reports:
failed to mmap with 95 (Operation not supported)
which is more clear.
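As a rough illustration of the change, assuming a hypothetical helper that stands in for the has_aux() check: the kernel-internal code 524 has no name in userspace errno.h, while EOPNOTSUPP does, which is why strerror can describe the latter.

```c
#include <errno.h>
#include <stdbool.h>

/* Kernel-internal code, value taken from the perf output quoted above. */
#define DEMO_KERNEL_ENOTSUPP	524

/* Hypothetical stand-in: what should reach user space when AUX is missing. */
static inline int demo_aux_alloc_result(bool event_has_aux)
{
	if (!event_has_aux)
		return -EOPNOTSUPP;	/* userspace-visible errno */
	return 0;
}
```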
Signed-off-by: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
Acked-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Pu Hou <bjhoupu@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Thomas-Mich Richter <tmricht@linux.vnet.ibm.com>
Cc: acme@kernel.org
Cc: linux-s390@vger.kernel.org
Link: http://lkml.kernel.org/r/1497954399-6355-1-git-send-email-brueckner@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
rcu_read_(un)lock(), list_*_rcu(), and synchronize_rcu() are used for a secure
access and manipulation of the list of patches that modify the same function.
In particular, it is the variable func_stack that is accessible from the ftrace
handler via struct ftrace_ops and klp_ops.
Of course, it synchronizes also some states of the patch on the top of the
stack, e.g. func->transition in klp_ftrace_handler.
At the same time, this mechanism guards also the manipulation of
task->patch_state. It is modified according to the state of the transition and
the state of the process.
Now, all this works well as long as RCU works well. Sadly livepatching might
get into some corner cases when this is not true. For example, RCU is not
watching when rcu_read_lock() is taken in idle threads. It is because they
might sleep and prevent reaching the grace period for too long.
There are ways how to make RCU watching even in idle threads, see
rcu_irq_enter(). But there is a small location inside RCU infrastructure when
even this does not work.
This small problematic location can be detected either before calling
rcu_irq_enter() by rcu_irq_enter_disabled() or later by rcu_is_watching().
Sadly, there is no safe way how to handle it. Once we detect that RCU was not
watching, we might see inconsistent state of the function stack and the related
variables in klp_ftrace_handler(). Then we could make a wrong decision, use an
incompatible implementation of the function and break the consistency of the
system. We could warn but we could not avoid the damage.
Fortunately, ftrace has similar problems and they seem to be solved well there.
It uses a heavy weight implementation of some RCU operations. In particular, it
replaces:
+ rcu_read_lock() with preempt_disable_notrace()
+ rcu_read_unlock() with preempt_enable_notrace()
+ synchronize_rcu() with schedule_on_each_cpu(sync_work)
My understanding is that this is an RCU implementation from the stone age. It meets
the core RCU requirements but it is rather inefficient. In particular, it does not
allow batching or speeding up the synchronize calls.
On the other hand, it is very trivial. It allows to safely trace and/or
livepatch even the RCU core infrastructure. And the efficiency is not a
big issue because using ftrace or livepatches on production systems is a rare
operation. The safety is much more important than a negligible extra load.
Note that the alternative implementation follows the RCU principles. Therefore,
we could and actually must use list_*_rcu() variants when manipulating the
func_stack. These functions allow to access the pointers in the right
order and with the right barriers. But they do not use any other
information that would be set only by rcu_read_lock().
Also note that there are actually two problems solved in ftrace:
First, it cares about the consistency of RCU read sections. It is being solved
the way as described and used in this patch.
Second, ftrace needs to make sure that nobody is inside the dynamic trampoline
when it is being freed. For this, it also calls synchronize_rcu_tasks() in
preemptive kernel in ftrace_shutdown().
Livepatch has similar problem but it is solved by ftrace for free.
klp_ftrace_handler() is a good guy and never sleeps. In addition, it is
registered with FTRACE_OPS_FL_DYNAMIC. It causes that
unregister_ftrace_function() calls:
* schedule_on_each_cpu(ftrace_sync) - always
* synchronize_rcu_tasks() - in preemptive kernel
The effect is that nobody is inside either the dynamic trampoline or
the ftrace handler after unregister_ftrace_function() returns.
[jkosina@suse.cz: reformat changelog, fix comment]
Signed-off-by: Petr Mladek <pmladek@suse.com>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Miroslav Benes <mbenes@suse.cz>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
|
|
Due to how the MONOTONIC_RAW accumulation logic was handled,
there is the potential for a 1ns discontinuity when we do
accumulations. This small discontinuity has for the most part
gone un-noticed, but since ARM64 enabled CLOCK_MONOTONIC_RAW
in their vDSO clock_gettime implementation, we've seen failures
with the inconsistency-check test in kselftest.
This patch addresses the issue by using the same sub-ns
accumulation handling that CLOCK_MONOTONIC uses, which avoids
the issue for in-kernel users.
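A toy illustration of why keeping the sub-nanosecond remainder matters; it is not the timekeeping code. The mult and shift values and the function names are made up, and each interval is taken to be one cycle of 1.5 ns.

```c
#include <stdint.h>

#define DEMO_MULT	3u	/* toy values: one cycle is 3 / 2^1 = 1.5 ns */
#define DEMO_SHIFT	1u

/* Truncating: convert each interval to whole ns, then add. */
static inline uint64_t demo_accumulate_truncating(uint32_t intervals)
{
	uint64_t nsec = 0;

	for (uint32_t i = 0; i < intervals; i++)
		nsec += (uint64_t)DEMO_MULT >> DEMO_SHIFT;	/* drops 0.5 ns */
	return nsec;
}

/* Sub-ns: accumulate in shifted units, convert to ns only when reading. */
static inline uint64_t demo_accumulate_subns(uint32_t intervals)
{
	uint64_t shifted_nsec = 0;

	for (uint32_t i = 0; i < intervals; i++)
		shifted_nsec += DEMO_MULT;	/* keeps the fractional part */
	return shifted_nsec >> DEMO_SHIFT;
}
```

After two intervals the truncating variant reports 2 ns while the sub-ns variant reports 3 ns, so a reader that computes the time from raw cycles can disagree with the truncated accumulator.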
Since the ARM64 vDSO implementation has its own clock_gettime
calculation logic, this patch reduces the frequency of errors,
but failures are still seen. The ARM64 vDSO will need to be
updated to include the sub-nanosecond xtime_nsec values in its
calculation for this issue to be completely fixed.
Signed-off-by: John Stultz <john.stultz@linaro.org>
Tested-by: Daniel Mentz <danielmentz@google.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Kevin Brodsky <kevin.brodsky@arm.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Stephen Boyd <stephen.boyd@linaro.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: "stable #4 . 8+" <stable@vger.kernel.org>
Cc: Miroslav Lichvar <mlichvar@redhat.com>
Link: http://lkml.kernel.org/r/1496965462-20003-3-git-send-email-john.stultz@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
In tests, which exercise switching of clocksources, a NULL
pointer dereference can be observed on ARM64 platforms in the
clocksource read() function:
u64 clocksource_mmio_readl_down(struct clocksource *c)
{
	return ~(u64)readl_relaxed(to_mmio_clksrc(c)->reg) & c->mask;
}
This is called from the core timekeeping code via:
cycle_now = tkr->read(tkr->clock);
tkr->read is the cached tkr->clock->read() function pointer.
When the clocksource is changed then tkr->clock and tkr->read
are updated sequentially. The code above results in a sequential
load operation of tkr->read and tkr->clock as well.
If the store to tkr->clock hits between the loads of tkr->read
and tkr->clock, then the old read() function is called with the
new clock pointer. As a consequence the read() function
dereferences a different data structure and the resulting 'reg'
pointer can point anywhere including NULL.
This problem was introduced when the timekeeping code was
switched over to use struct tk_read_base. Before that, it was
theoretically possible as well when the compiler decided to
reload clock in the code sequence:
now = tk->clock->read(tk->clock);
Add a helper function which avoids the issue by reading
tk_read_base->clock once into a local variable clk and then issue
the read function via clk->read(clk). This guarantees that the
read() function always gets the proper clocksource pointer handed
in.
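A userspace sketch of such a helper, under stated assumptions: the demo_* types are reduced stand-ins for the kernel structures, DEMO_READ_ONCE imitates the kernel's READ_ONCE() with a volatile access (using the GNU __typeof__ extension), and demo_read_cycles exists only to exercise the helper.

```c
#include <stdint.h>

/* Userspace stand-in for the kernel's READ_ONCE(): force a single load. */
#define DEMO_READ_ONCE(x)	(*(const volatile __typeof__(x) *)&(x))

struct demo_clocksource {
	uint64_t (*read)(struct demo_clocksource *cs);
	uint64_t cycles;
};

struct demo_tk_read_base {
	struct demo_clocksource *clock;
};

/*
 * Load the clock pointer once, then use that same pointer both to look
 * up read() and as its argument, so the two can never disagree.
 */
static inline uint64_t demo_tk_clock_read(struct demo_tk_read_base *tkr)
{
	struct demo_clocksource *clk = DEMO_READ_ONCE(tkr->clock);

	return clk->read(clk);
}

/* Trivial read() implementation, used only for the usage example. */
static inline uint64_t demo_read_cycles(struct demo_clocksource *cs)
{
	return cs->cycles;
}
```

Pointing a demo_tk_read_base at a clocksource whose read is demo_read_cycles and calling demo_tk_clock_read() returns that clocksource's cycles value.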
Since there is now no use for the tkr.read pointer, this patch
also removes it, and to address stopping the fast timekeeper
during suspend/resume, it introduces a dummy clocksource to use
rather than just a dummy read function.
Signed-off-by: John Stultz <john.stultz@linaro.org>
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Stephen Boyd <stephen.boyd@linaro.org>
Cc: stable <stable@vger.kernel.org>
Cc: Miroslav Lichvar <mlichvar@redhat.com>
Cc: Daniel Mentz <danielmentz@google.com>
Link: http://lkml.kernel.org/r/1496965462-20003-2-git-send-email-john.stultz@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fixes from Thomas Gleixner:
"Three fixlets for timers:
- Two hot-fixes for the alarmtimer based posix timers, which prevent
a nasty DOS by self rescheduling timers. The proper cleanup of that
mess is queued for 4.13
- Make a function static"
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
tick/broadcast: Make tick_broadcast_setup_oneshot() static
alarmtimer: Rate limit periodic intervals
alarmtimer: Prevent overflow of relative timers
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Thomas Gleixner:
"Two small fixes for the scheduler core:
- Use the proper switch_mm() variant in idle_task_exit() because that
code is not called with interrupts disabled.
- Fix a confusing typo in a printk"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/core: Idle_task_exit() shouldn't use switch_mm_irqs_off()
sched/fair: Fix typo in printk message
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fix from Thomas Gleixner:
"Add a missing resource release to an error path"
* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
genirq: Release resources in __setup_irq() error path
|
|
Thomas Gleixner wrote:
> The CRIU support added a 'feature' which allows a user space task to send
> arbitrary (kernel) signals to itself. The changelog says:
>
> The kernel prevents sending of siginfo with positive si_code, because
> these codes are reserved for kernel. I think we can allow a task to
> send such a siginfo to itself. This operation should not be dangerous.
>
> Quite contrary to that claim, it turns out that it is outright dangerous
> for signals with info->si_code == SI_TIMER. The following code sequence in
> a user space task allows to crash the kernel:
>
> id = timer_create(CLOCK_XXX, ..... signo = SIGX);
> timer_set(id, ....);
> info->si_signo = SIGX;
> info->si_code = SI_TIMER:
> info->_sifields._timer._tid = id;
> info->_sifields._timer._sys_private = 2;
> rt_[tg]sigqueueinfo(..., SIGX, info);
> sigemptyset(&sigset);
> sigaddset(&sigset, SIGX);
> rt_sigtimedwait(sigset, info);
>
> For timers based on CLOCK_PROCESS_CPUTIME_ID, CLOCK_THREAD_CPUTIME_ID this
> results in a kernel crash because sigwait() dequeues the signal and the
> dequeue code observes:
>
> info->si_code == SI_TIMER && info->_sifields._timer._sys_private != 0
>
> which triggers the following callchain:
>
> do_schedule_next_timer() -> posix_cpu_timer_schedule() -> arm_timer()
>
> arm_timer() executes a list_add() on the timer, which is already armed via
> the timer_set() syscall. That's a double list add which corrupts the posix
> cpu timer list. As a consequence the kernel crashes on the next operation
> touching the posix cpu timer list.
>
> Posix clocks which are internally implemented based on hrtimers are not
> affected by this because hrtimer_start() can handle already armed timers
> nicely, but it's a reliable way to trigger the WARN_ON() in
> hrtimer_forward(), which complains about calling that function on an
> already armed timer.
This problem has existed since the posix timer code was merged into
2.5.63. A few releases earlier in 2.5.60 ptrace gained the ability to
inject not just a signal (which linux has supported since 1.0) but the
full siginfo of a signal.
The core problem is that the code will reschedule in response to
signals getting dequeued not just for signals the timers sent but
for other signals that happen to have a si_code of SI_TIMER.
Avoid this confusion by testing to see if the queued signal was
preallocated as all timer signals are preallocated, and so far
only the timer code preallocates signals.
Move the check for if a timer needs to be rescheduled up into
collect_signal where the preallocation check must be performed,
and pass the result back to dequeue_signal where the code reschedules
timers. This makes it clear why the code cares about preallocated
timers.
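The check described above might look roughly like this in isolation. The struct, the flag bit and the constant values are hypothetical stand-ins chosen for the sketch, not the kernel's definitions.

```c
#include <stdbool.h>

#define DEMO_SIGQUEUE_PREALLOC	1	/* hypothetical flag: entry was preallocated */
#define DEMO_SI_TIMER		(-2)	/* hypothetical stand-in for SI_TIMER */

/* Reduced stand-in for a queued signal. */
struct demo_sigqueue {
	int flags;
	int si_code;
	int sys_private;
};

/*
 * Only a preallocated entry can have come from the timer code, so an
 * injected siginfo that merely claims SI_TIMER never triggers a
 * reschedule.
 */
static inline bool demo_needs_resched(const struct demo_sigqueue *q)
{
	return (q->flags & DEMO_SIGQUEUE_PREALLOC) &&
	       q->si_code == DEMO_SI_TIMER &&
	       q->sys_private != 0;
}
```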
Cc: stable@vger.kernel.org
Reported-by: Thomas Gleixner <tglx@linutronix.de>
History Tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
Reference: 66dd34ad31e5 ("signal: allow to send any siginfo to itself")
Reference: 1669ce53e2ff ("Add PTRACE_GETSIGINFO and PTRACE_SETSIGINFO")
Fixes: db8b50ba75f2 ("[PATCH] POSIX clocks & timers")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
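A minimal user-space sketch of the check described above, not the kernel code itself. The struct and function names (`sigq_entry`, `needs_resched_old`, `needs_resched_new`) are hypothetical. The model assumes each queued entry records whether the timer code preallocated it.

```c
#include <stdbool.h>

/* SI_TIMER is -2 on Linux; redefined here so the sketch is self-contained. */
#define SI_TIMER_CODE (-2)

/* Hypothetical, simplified model of one queued signal. */
struct sigq_entry {
    int  si_code;       /* code carried in the siginfo */
    int  sys_private;   /* models _sifields._timer._sys_private */
    bool preallocated;  /* true only for entries allocated by the timer code */
};

/* Old behaviour: trusts siginfo fields that user space can forge. */
static bool needs_resched_old(const struct sigq_entry *q)
{
    return q->si_code == SI_TIMER_CODE && q->sys_private != 0;
}

/* Behaviour described above: additionally require a preallocated entry. */
static bool needs_resched_new(const struct sigq_entry *q)
{
    return q->preallocated &&
           q->si_code == SI_TIMER_CODE &&
           q->sys_private != 0;
}
```

With a forged entry such as `{ SI_TIMER_CODE, 2, false }`, the old predicate asks for a reschedule while the new one does not. That is the case the reproducer in the quoted report exercises.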
|
|
The conflicts were two cases of overlapping changes in
batman-adv and the qed driver.
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
* pm-cpufreq:
cpufreq: conservative: Allow down_threshold to take values from 1 to 10
Revert "cpufreq: schedutil: Reduce frequencies slower"
* pm-cpuidle:
cpuidle: dt: Add missing 'of_node_put()'
* pm-devfreq:
PM / devfreq: exynos-ppmu: Staticize event list
PM / devfreq: exynos-ppmu: Handle return value of clk_prepare_enable
PM / devfreq: exynos-nocp: Handle return value of clk_prepare_enable
|
|
Currently, the verifier will reject a program if it contains a
narrower load from the bpf context structure. For example,
__u8 h = __sk_buff->hash, or
__u16 p = __sk_buff->protocol
__u32 sample_period = bpf_perf_event_data->sample_period
which are narrower loads of 4-byte or 8-byte field.
This patch solves the issue by:
. Introduce a new parameter ctx_field_size to carry the
field size of narrower load from prog type
specific *__is_valid_access validator back to verifier.
. The non-zero ctx_field_size for a memory access indicates
(1). underlying prog type specific convert_ctx_accesses
supporting non-whole-field access
(2). the current insn is a narrower or whole field access.
. In verifier, for such loads where load memory size is
less than ctx_field_size, verifier transforms it
to a full field load followed by proper masking.
. Currently, __sk_buff and bpf_perf_event_data->sample_period
support narrower loads.
. Narrower stores are still not allowed as typical ctx stores
are just normal stores.
Because of this change, some tests in verifier will fail and
these tests are removed. As a bonus, rename some out of bound
__sk_buff->cb access to proper field name and remove two
redundant "skb cb oob" tests.
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
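The "full field load followed by proper masking" step can be sketched in plain C as below. This is an illustration only, not the verifier's instruction rewrite. The helper name `narrow_load` is hypothetical, and the sketch assumes the narrow access starts at the field's own offset and wants the low-order bytes of the value.

```c
#include <stdint.h>

/*
 * Hypothetical helper: emulate a narrow load of `size` bytes (1, 2, 4 or 8)
 * by taking the value of the whole field and masking off the upper bytes.
 */
static uint64_t narrow_load(uint64_t full_field, unsigned int size)
{
    if (size >= 8)
        return full_field;  /* whole-field access, nothing to mask */

    /* Keep only the low `size` bytes of the full field value. */
    return full_field & ((UINT64_C(1) << (size * 8)) - 1);
}
```

For instance, a 1-byte read of a 4-byte field holding 0x12345678 yields 0x78 in this model.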
|
|
In case __irq_set_trigger() fails the resources requested via
irq_request_resources() are not released.
Add the missing release call into the error handling path.
Fixes: c1bacbae8192 ("genirq: Provide irq_request/release_resources chip callbacks")
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/655538f5-cb20-a892-ff15-fbd2dd1fa4ec@gmail.com
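A small user-space sketch of the error-path pattern this fix restores. Every name here (`fake_irq`, `fake_setup_irq` and the helpers) is a hypothetical stand-in. Only the ordering of request, trigger setup and release mirrors the description above.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the state touched by the chip callbacks. */
struct fake_irq {
    int  resources_held;  /* incremented on request, decremented on release */
    bool trigger_fails;   /* force the trigger step to fail */
};

static int fake_request_resources(struct fake_irq *irq)
{
    irq->resources_held++;
    return 0;
}

static void fake_release_resources(struct fake_irq *irq)
{
    irq->resources_held--;
}

static int fake_set_trigger(struct fake_irq *irq)
{
    return irq->trigger_fails ? -1 : 0;
}

static int fake_setup_irq(struct fake_irq *irq)
{
    int ret = fake_request_resources(irq);

    if (ret)
        return ret;

    ret = fake_set_trigger(irq);
    if (ret)
        goto out_release;  /* without this jump the resources would leak */

    return 0;

out_release:
    fake_release_resources(irq);
    return ret;
}
```

When the trigger step fails, `resources_held` is back at its starting value after `fake_setup_irq()` returns.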
|
|
This function isn't used outside of tick-broadcast.c, so let's
mark it static.
Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
Link: http://lkml.kernel.org/r/20170608063603.13276-1-sboyd@codeaurora.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Revert commit 39b64aa1c007 (cpufreq: schedutil: Reduce frequencies
slower) that introduced unintentional changes in behavior leading
to adverse effects on some systems.
Reported-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
idle_task_exit() can be called with IRQs enabled on x86 and therefore
should use switch_mm(), not switch_mm_irqs_off().
This doesn't seem to cause any problems right now, but it will
confuse my upcoming TLB flush changes. Nonetheless, I think it
should be backported because it's trivial. There won't be any
meaningful performance impact because idle_task_exit() is only
used when offlining a CPU.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Fixes: f98db6013c55 ("sched/core: Add switch_mm_irqs_off() and use it in the scheduler")
Link: http://lkml.kernel.org/r/ca3d1a9fa93a0b49f5a8ff729eda3640fb6abdf9.1497034141.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
'schedstats' kernel parameter should be set to enable/disable, so
correct the printk hint saying that it should be set to 'enable'
rather than 'enabled' to enable scheduler tracepoints.
Signed-off-by: Marcin Nowakowski <marcin.nowakowski@imgtec.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1496995229-31245-1-git-send-email-marcin.nowakowski@imgtec.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Right now, we don't reset the id of spilled registers in case of
clear_all_pkt_pointers(). Given pkt_pointers are highly likely to
contain an id, do so by reusing __mark_reg_unknown_value().
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Whenever we set the register to the type CONST_IMM, we currently don't
reset the id to 0. id member is not used in CONST_IMM case, so don't
let it become stale, where pruning won't be able to match later on.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
spilled_regs[] state is only used for stack slots of type STACK_SPILL,
never for STACK_MISC. Right now, in states_equal(), even if we have
old and current stack state of type STACK_MISC, we compare spilled_regs[]
for that particular offset. Just skip these like we do everywhere else.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
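A simplified sketch of the comparison rule described above. The types and names (`fake_state`, `fake_states_equal`, the `SLOT_*` values) are hypothetical, and the real states_equal() has further cases this model leaves out. It only shows that spilled register state is compared for spill slots and nowhere else.

```c
#include <stdbool.h>

/* Hypothetical slot types, loosely modelled on the stack slot kinds. */
enum slot_type { SLOT_INVALID, SLOT_SPILL, SLOT_MISC };

#define NSLOTS 4

struct fake_state {
    enum slot_type type[NSLOTS];
    int spilled[NSLOTS];  /* meaningful only where type == SLOT_SPILL */
};

static bool fake_states_equal(const struct fake_state *old,
                              const struct fake_state *cur)
{
    for (int i = 0; i < NSLOTS; i++) {
        if (old->type[i] != cur->type[i])
            return false;
        if (old->type[i] != SLOT_SPILL)
            continue;  /* no spilled register here, so skip the compare */
        if (old->spilled[i] != cur->spilled[i])
            return false;
    }
    return true;
}
```

Leftover values in `spilled[]` under a `SLOT_MISC` slot no longer make two otherwise identical states compare as different.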
|
|
perf_sample_data consumes 386 bytes on the stack. Reduce this excessive
stack usage by moving it to a per-CPU buffer. This is allowed due to preemption
being disabled for tracing, xdp and tc programs, thus at all times
only one program can run on a specific CPU and programs cannot run
from interrupt. We similarly also handle bpf_pt_regs.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull CPU hotplug fix from Ingo Molnar:
"An error handling corner case fix"
* 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
cpu/hotplug: Drop the device lock on error
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RCU fixes from Ingo Molnar:
"Fix an SRCU bug affecting KVM IRQ injection"
* 'rcu-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
srcu: Allow use of Classic SRCU from both process and interrupt context
srcu: Allow use of Tiny/Tree SRCU from both process and interrupt context
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
"This is mostly tooling fixes, plus an instruction pointer filtering
fix.
It's more fixes than usual - Arnaldo got back from a longer vacation
and there was a backlog"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (27 commits)
perf symbols: Kill dso__build_id_is_kmod()
perf symbols: Keep DSO->symtab_type after decompress
perf tests: Decompress kernel module before objdump
perf tools: Consolidate error path in __open_dso()
perf tools: Decompress kernel module when reading DSO data
perf annotate: Use dso__decompress_kmodule_path()
perf tools: Introduce dso__decompress_kmodule_{fd,path}
perf tools: Fix a memory leak in __open_dso()
perf annotate: Fix symbolic link of build-id cache
perf/core: Drop kernel samples even though :u is specified
perf script python: Remove dups in documentation examples
perf script python: Updated trace_unhandled() signature
perf script python: Fix wrong code snippets in documentation
perf script: Fix documentation errors
perf script: Fix outdated comment for perf-trace-python
perf probe: Fix examples section of documentation
perf report: Ensure the perf DSO mapping matches what libdw sees
perf report: Include partial stacks unwound with libdw
perf annotate: Add missing powerpc triplet
perf test: Disable breakpoint signal tests for powerpc
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into rcu/urgent
Pull RCU fix from Paul E. McKenney:
" This series enables srcu_read_lock() and srcu_read_unlock() to be used from
interrupt handlers, which fixes a bug in KVM's use of SRCU in delivery
of interrupts to guest OSes. "
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fixes from Rafael Wysocki:
"These revert one problematic commit related to system sleep and fix
one recent intel_pstate regression.
Specifics:
- Revert a recent commit that attempted to avoid spurious wakeups
from suspend-to-idle via ACPI SCI, but introduced regressions on
some systems (Rafael Wysocki).
We will get back to the problem it tried to address in the next
cycle.
- Fix a possible division by 0 during intel_pstate initialization
due to a missing check (Rafael Wysocki)"
* tag 'pm-4.12-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
Revert "ACPI / sleep: Ignore spurious SCI wakeups from suspend-to-idle"
cpufreq: intel_pstate: Avoid division by 0 in min_perf_pct_min()
|
|
* intel_pstate:
cpufreq: intel_pstate: Avoid division by 0 in min_perf_pct_min()
* pm-sleep:
Revert "ACPI / sleep: Ignore spurious SCI wakeups from suspend-to-idle"
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk
Pull printk fix from Petr Mladek:
"This reverts a fix added into 4.12-rc1. It caused the kernel log to be
printed on another console when two consoles of the same type were
defined, e.g. console=ttyS0 console=ttyS1.
This configuration was never supported by kernel itself, but it
started to make sense with systemd. In other words, the commit broke
userspace"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk:
Revert "printk: fix double printing with earlycon"
|
|
Linu Cherian reported a WARN in cleanup_srcu_struct() when shutting
down a guest running iperf on a VFIO assigned device. This happens
because irqfd_wakeup() calls srcu_read_lock(&kvm->irq_srcu) in interrupt
context, while a worker thread does the same inside kvm_set_irq(). If the
interrupt happens while the worker thread is executing __srcu_read_lock(),
updates to the Classic SRCU ->lock_count[] field or the Tree SRCU
->srcu_lock_count[] field can be lost.
The docs say you are not supposed to call srcu_read_lock() and
srcu_read_unlock() from irq context, but KVM interrupt injection happens
from (host) interrupt context and it would be nice if SRCU supported the
use case. KVM is using SRCU here not really for the "sleepable" part,
but rather due to its IPI-free fast detection of grace periods. It is
therefore not desirable to switch back to RCU, which would effectively
revert commit 719d93cd5f5c ("kvm/irqchip: Speed up KVM_SET_GSI_ROUTING",
2014-01-16).
However, the docs are overly conservative. You can have an SRCU instance
that only has users in irq context, and you can mix process and irq context
as long as process context users disable interrupts. In addition,
__srcu_read_unlock() actually uses this_cpu_dec() on both Tree SRCU and
Classic SRCU. For those two implementations, only srcu_read_lock()
is unsafe.
When Classic SRCU's __srcu_read_unlock() was changed to use this_cpu_dec(),
in commit 5a41344a3d83 ("srcu: Simplify __srcu_read_unlock() via
this_cpu_dec()", 2012-11-29), __srcu_read_lock() did two increments.
Therefore it kept __this_cpu_inc(), with preempt_disable/enable in
the caller. Tree SRCU however only does one increment, so on most
architectures it is more efficient for __srcu_read_lock() to use
this_cpu_inc(), and any performance differences appear to be down in
the noise.
Cc: stable@vger.kernel.org
Fixes: 719d93cd5f5c ("kvm/irqchip: Speed up KVM_SET_GSI_ROUTING")
Reported-by: Linu Cherian <linuc.decode@gmail.com>
Suggested-by: Linu Cherian <linuc.decode@gmail.com>
Cc: kvm@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
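The lost update described above can be modelled deterministically in user space. This is only a sketch. The names `inc_split`, `inc_single` and `irq_handler` are hypothetical, and the "interrupt" is simulated by a callback invoked at a fixed point. It contrasts an increment that may be compiled as separate load and store steps, as __this_cpu_inc() can be, with one that is a single step with respect to interrupts on the local CPU, as this_cpu_inc() is.

```c
/* Hypothetical model of one per-CPU reader counter. */
static long counter;

/* The interrupt-context reader takes its own reference. */
static void irq_handler(void)
{
    counter++;
}

/* Increment done as load/add/store; the interrupt lands in the middle. */
static void inc_split(void (*irq)(void))
{
    long tmp = counter;  /* load */

    if (irq)
        irq();           /* interrupt arrives between load and store */
    counter = tmp + 1;   /* store overwrites the handler's increment */
}

/* Increment modelled as indivisible; the interrupt can only run afterwards. */
static void inc_single(void (*irq)(void))
{
    counter++;
    if (irq)
        irq();
}
```

Starting from zero, `inc_split(irq_handler)` leaves the counter at 1, so one reference is lost, while `inc_single(irq_handler)` leaves it at 2.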
|
|
Linu Cherian reported a WARN in cleanup_srcu_struct() when shutting
down a guest running iperf on a VFIO assigned device. This happens
because irqfd_wakeup() calls srcu_read_lock(&kvm->irq_srcu) in interrupt
context, while a worker thread does the same inside kvm_set_irq(). If the
interrupt happens while the worker thread is executing __srcu_read_lock(),
updates to the Classic SRCU ->lock_count[] field or the Tree SRCU
->srcu_lock_count[] field can be lost.
The docs say you are not supposed to call srcu_read_lock() and
srcu_read_unlock() from irq context, but KVM interrupt injection happens
from (host) interrupt context and it would be nice if SRCU supported the
use case. KVM is using SRCU here not really for the "sleepable" part,
but rather due to its IPI-free fast detection of grace periods. It is
therefore not desirable to switch back to RCU, which would effectively
revert commit 719d93cd5f5c ("kvm/irqchip: Speed up KVM_SET_GSI_ROUTING",
2014-01-16).
However, the docs are overly conservative. You can have an SRCU instance
that only has users in irq context, and you can mix process and irq context
as long as process context users disable interrupts. In addition,
__srcu_read_unlock() actually uses this_cpu_dec() on both Tree SRCU and
Classic SRCU. For those two implementations, only srcu_read_lock()
is unsafe.
When Classic SRCU's __srcu_read_unlock() was changed to use this_cpu_dec(),
in commit 5a41344a3d83 ("srcu: Simplify __srcu_read_unlock() via
this_cpu_dec()", 2012-11-29), __srcu_read_lock() did two increments.
Therefore it kept __this_cpu_inc(), with preempt_disable/enable in
the caller. Tree SRCU however only does one increment, so on most
architectures it is more efficient for __srcu_read_lock() to use
this_cpu_inc(), and any performance differences appear to be down in
the noise.
Unlike Classic and Tree SRCU, Tiny SRCU does increments and decrements on
a single variable. Therefore, as Peter Zijlstra pointed out, Tiny SRCU's
implementation already supports mixed-context use of srcu_read_lock()
and srcu_read_unlock(), at least as long as uses of srcu_read_lock()
and srcu_read_unlock() in each handler are nested and paired properly.
In other words, it is still illegal to (say) invoke srcu_read_lock()
in an interrupt handler and to invoke the matching srcu_read_unlock()
in a softirq handler. Therefore, the only change required for Tiny SRCU
is to its comments.
Fixes: 719d93cd5f5c ("kvm/irqchip: Speed up KVM_SET_GSI_ROUTING")
Reported-by: Linu Cherian <linuc.decode@gmail.com>
Suggested-by: Linu Cherian <linuc.decode@gmail.com>
Cc: kvm@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
This reverts commit cf39bf58afdaabc0b86f141630fb3fd18190294e.
The commit caused a regression for users that define both console=ttyS1
and console=ttyS0 on the command line, see
https://lkml.kernel.org/r/20170509082915.GA13236@bistromath.localdomain
The kernel log messages always appeared only on one serial port. It is
even documented in Documentation/admin-guide/serial-console.rst:
"Note that you can only define one console per device type (serial,
video)."
The above mentioned commit changed the order in which the command line
parameters are searched. As a result, the kernel log messages go to
the last mentioned ttyS* instead of the first one.
We long thought that using two console=ttyS* on the command line
did not make sense. But then we realized that console= parameters
were handled also by systemd, see
http://0pointer.de/blog/projects/serial-console.html
"By default systemd will instantiate one serial-getty@.service on
the main kernel console, if it is not a virtual terminal."
where
"[4] If multiple kernel consoles are used simultaneously, the main
console is the one listed first in /sys/class/tty/console/active,
which is the last one listed on the kernel command line."
This puts the original report into another light. The system is running
in qemu. The first serial port is used to store the messages into a file.
The second one is used to log in to the system via a socket. It depends
on systemd and the historic kernel behavior.
In other words, systemd makes it meaningful to define both
console=ttyS1 console=ttyS0 on the command line. The kernel fix
caused a regression related to userspace (systemd) and needs to be
reverted.
In addition, it turned out that the fix helped only partially.
The messages still were duplicated when the boot console was
removed early by late_initcall(printk_late_init). Then the entire
log was replayed when the same console was registered as a normal one.
Link: 20170606160339.GC7604@pathway.suse.cz
Cc: Aleksey Makarov <aleksey.makarov@linaro.org>
Cc: Sabrina Dubroca <sd@queasysnail.net>
Cc: Sudeep Holla <sudeep.holla@arm.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Peter Hurley <peter@hurleysoftware.com>
Cc: Jiri Slaby <jslaby@suse.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: "Nair, Jayachandran" <Jayachandran.Nair@cavium.com>
Cc: linux-serial@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Reported-by: Sabrina Dubroca <sd@queasysnail.net>
Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
|
|
When doing sampling, for example:
perf record -e cycles:u ...
On workloads that do a lot of kernel entry/exits we see kernel
samples, even though :u is specified. This is due to skid.
This might be a security issue because it can leak kernel addresses even
though kernel sampling support is disabled.
The patch drops the kernel samples if exclude_kernel is specified.
For example, test on Haswell desktop:
perf record -e cycles:u <mgen>
perf report --stdio
Before patch applied:
99.77% mgen mgen [.] buf_read
0.20% mgen mgen [.] rand_buf_init
0.01% mgen [kernel.vmlinux] [k] apic_timer_interrupt
0.00% mgen mgen [.] last_free_elem
0.00% mgen libc-2.23.so [.] __random_r
0.00% mgen libc-2.23.so [.] _int_malloc
0.00% mgen mgen [.] rand_array_init
0.00% mgen [kernel.vmlinux] [k] page_fault
0.00% mgen libc-2.23.so [.] __random
0.00% mgen libc-2.23.so [.] __strcasestr
0.00% mgen ld-2.23.so [.] strcmp
0.00% mgen ld-2.23.so [.] _dl_start
0.00% mgen libc-2.23.so [.] sched_setaffinity@@GLIBC_2.3.4
0.00% mgen ld-2.23.so [.] _start
We can see kernel symbols apic_timer_interrupt and page_fault.
After patch applied:
99.79% mgen mgen [.] buf_read
0.19% mgen mgen [.] rand_buf_init
0.00% mgen libc-2.23.so [.] __random_r
0.00% mgen mgen [.] rand_array_init
0.00% mgen mgen [.] last_free_elem
0.00% mgen libc-2.23.so [.] vfprintf
0.00% mgen libc-2.23.so [.] rand
0.00% mgen libc-2.23.so [.] __random
0.00% mgen libc-2.23.so [.] _int_malloc
0.00% mgen libc-2.23.so [.] _IO_doallocbuf
0.00% mgen ld-2.23.so [.] do_lookup_x
0.00% mgen ld-2.23.so [.] open_verify.constprop.7
0.00% mgen ld-2.23.so [.] _dl_important_hwcaps
0.00% mgen libc-2.23.so [.] sched_setaffinity@@GLIBC_2.3.4
0.00% mgen ld-2.23.so [.] _start
There are only userspace symbols.
Signed-off-by: Jin Yao <yao.jin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <stable@vger.kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: acme@kernel.org
Cc: jolsa@kernel.org
Cc: kan.liang@intel.com
Cc: mark.rutland@arm.com
Cc: will.deacon@arm.com
Cc: yao.jin@intel.com
Link: http://lkml.kernel.org/r/1495706947-3744-1-git-send-email-yao.jin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
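The filtering rule described above amounts to one predicate, sketched here in user-space C. The struct and function names (`fake_event_attr`, `fake_sample`, `keep_sample`) are hypothetical. The `in_kernel` flag stands in for whatever test the kernel applies to the sampled registers.

```c
#include <stdbool.h>

/* Hypothetical model of the relevant event attribute. */
struct fake_event_attr {
    bool exclude_kernel;  /* set when the event is opened with :u */
};

/* Hypothetical model of one sample. */
struct fake_sample {
    bool in_kernel;  /* the sampled instruction pointer is in kernel mode */
};

/* Return true if the sample should be recorded, false if dropped. */
static bool keep_sample(const struct fake_event_attr *attr,
                        const struct fake_sample *s)
{
    /* Skid can yield a kernel IP even for a user-only event: drop it. */
    if (attr->exclude_kernel && s->in_kernel)
        return false;
    return true;
}
```

Only the combination of `exclude_kernel` and a kernel-mode sample is dropped. Events that do not exclude the kernel keep all their samples.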
|
|
Just some simple overlapping changes in marvell PHY driver
and the DSA core code.
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Revert commit eed4d47efe95 (ACPI / sleep: Ignore spurious SCI wakeups
from suspend-to-idle) as it turned out to be premature and triggered
a number of different issues on various systems.
That includes, but is not limited to, premature suspend-to-RAM aborts
on Dell XPS 13 (9343) reported by Dominik.
The issue the commit in question attempted to address is real and
will need to be taken care of going forward, but evidently more work
is needed for this purpose.
Reported-by: Dominik Brodowski <linux@dominikbrodowski.net>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
Commit fb9a307d11d6 ("bpf: Allow CGROUP_SKB eBPF program to
access sk_buff") enabled programs of BPF_PROG_TYPE_CGROUP_SKB
type to use ld_abs/ind instructions. However, at this point,
we cannot use them, since offsets relative to SKF_LL_OFF will
end up pointing skb_mac_header(skb) out of bounds since in the
egress path it is not yet set at that point in time, but only
after __dev_queue_xmit() did a general reset on the mac header.
bpf_internal_load_pointer_neg_helper() will then end up reading
data from a wrong offset.
BPF_PROG_TYPE_CGROUP_SKB programs can use bpf_skb_load_bytes()
already to access packet data, which is also more flexible than
the insns carried over from cBPF.
Fixes: fb9a307d11d6 ("bpf: Allow CGROUP_SKB eBPF program to access sk_buff")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Cc: Chenbo Feng <fengc@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|