summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2020-06-30bpf: Fix an incorrect branch elimination by verifierYonghong Song
Wenbo reported an issue in [1] where a checking of null pointer is evaluated as always false. In this particular case, the program type is tp_btf and the pointer to compare is a PTR_TO_BTF_ID. The current verifier considers PTR_TO_BTF_ID always reprents a non-null pointer, hence all PTR_TO_BTF_ID compares to 0 will be evaluated as always not-equal, which resulted in the branch elimination. For example, struct bpf_fentry_test_t { struct bpf_fentry_test_t *a; }; int BPF_PROG(test7, struct bpf_fentry_test_t *arg) { if (arg == 0) test7_result = 1; return 0; } int BPF_PROG(test8, struct bpf_fentry_test_t *arg) { if (arg->a == 0) test8_result = 1; return 0; } In above bpf programs, both branch arg == 0 and arg->a == 0 are removed. This may not be what developer expected. The bug is introduced by Commit cac616db39c2 ("bpf: Verifier track null pointer branch_taken with JNE and JEQ"), where PTR_TO_BTF_ID is considered to be non-null when evaluting pointer vs. scalar comparison. This may be added considering we have PTR_TO_BTF_ID_OR_NULL in the verifier as well. PTR_TO_BTF_ID_OR_NULL is added to explicitly requires a non-NULL testing in selective cases. The current generic pointer tracing framework in verifier always assigns PTR_TO_BTF_ID so users does not need to check NULL pointer at every pointer level like a->b->c->d. We may not want to assign every PTR_TO_BTF_ID as PTR_TO_BTF_ID_OR_NULL as this will require a null test before pointer dereference which may cause inconvenience for developers. But we could avoid branch elimination to preserve original code intention. This patch simply removed PTR_TO_BTD_ID from reg_type_not_null() in verifier, which prevented the above branches from being eliminated. [1]: https://lore.kernel.org/bpf/79dbb7c0-449d-83eb-5f4f-7af0cc269168@fb.com/T/ Fixes: cac616db39c2 ("bpf: Verifier track null pointer branch_taken with JNE and JEQ") Reported-by: Wenbo Zhang <ethercflow@gmail.com> Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/20200630171240.2523722-1-yhs@fb.com
2020-06-30bpf, netns: Fix use-after-free in pernet pre_exit callbackJakub Sitnicki
Iterating over BPF links attached to network namespace in pre_exit hook is not safe, even if there is just one. Once link gets auto-detached, that is its back-pointer to net object is set to NULL, the link can be released and freed without waiting on netns_bpf_mutex, effectively causing the list element we are operating on to be freed. This leads to use-after-free when trying to access the next element on the list, as reported by KASAN. Bug can be triggered by destroying a network namespace, while also releasing a link attached to this network namespace. | ================================================================== | BUG: KASAN: use-after-free in netns_bpf_pernet_pre_exit+0xd9/0x130 | Read of size 8 at addr ffff888119e0d778 by task kworker/u8:2/177 | | CPU: 3 PID: 177 Comm: kworker/u8:2 Not tainted 5.8.0-rc1-00197-ga0c04c9d1008-dirty #776 | Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014 | Workqueue: netns cleanup_net | Call Trace: | dump_stack+0x9e/0xe0 | print_address_description.constprop.0+0x3a/0x60 | ? netns_bpf_pernet_pre_exit+0xd9/0x130 | kasan_report.cold+0x1f/0x40 | ? netns_bpf_pernet_pre_exit+0xd9/0x130 | netns_bpf_pernet_pre_exit+0xd9/0x130 | cleanup_net+0x30b/0x5b0 | ? unregister_pernet_device+0x50/0x50 | ? rcu_read_lock_bh_held+0xb0/0xb0 | ? _raw_spin_unlock_irq+0x24/0x50 | process_one_work+0x4d1/0xa10 | ? lock_release+0x3e0/0x3e0 | ? pwq_dec_nr_in_flight+0x110/0x110 | ? rwlock_bug.part.0+0x60/0x60 | worker_thread+0x7a/0x5c0 | ? process_one_work+0xa10/0xa10 | kthread+0x1e3/0x240 | ? kthread_create_on_node+0xd0/0xd0 | ret_from_fork+0x1f/0x30 | | Allocated by task 280: | save_stack+0x1b/0x40 | __kasan_kmalloc.constprop.0+0xc2/0xd0 | netns_bpf_link_create+0xfe/0x650 | __do_sys_bpf+0x153a/0x2a50 | do_syscall_64+0x59/0x300 | entry_SYSCALL_64_after_hwframe+0x44/0xa9 | | Freed by task 198: | save_stack+0x1b/0x40 | __kasan_slab_free+0x12f/0x180 | kfree+0xed/0x350 | process_one_work+0x4d1/0xa10 | worker_thread+0x7a/0x5c0 | kthread+0x1e3/0x240 | ret_from_fork+0x1f/0x30 | | The buggy address belongs to the object at ffff888119e0d700 | which belongs to the cache kmalloc-192 of size 192 | The buggy address is located 120 bytes inside of | 192-byte region [ffff888119e0d700, ffff888119e0d7c0) | The buggy address belongs to the page: | page:ffffea0004678340 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 | flags: 0x2fffe0000000200(slab) | raw: 02fffe0000000200 ffffea00045ba8c0 0000000600000006 ffff88811a80ea80 | raw: 0000000000000000 0000000000100010 00000001ffffffff 0000000000000000 | page dumped because: kasan: bad access detected | | Memory state around the buggy address: | ffff888119e0d600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb | ffff888119e0d680: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc | >ffff888119e0d700: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb | ^ | ffff888119e0d780: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc | ffff888119e0d800: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb | ================================================================== Remove the "fast-path" for releasing a link that got auto-detached by a dying network namespace to fix it. This way as long as link is on the list and netns_bpf mutex is held, we have a guarantee that link memory can be accessed. An alternative way to fix this issue would be to safely iterate over the list of links and ensure there is no access to link object after detaching it. But, at the moment, optimizing synchronization overhead on link release without a workload in mind seems like an overkill. Fixes: ab53cad90eb1 ("bpf, netns: Keep a list of attached bpf_link's") Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20200630164541.1329993-1-jakub@cloudflare.com
2020-06-30Merge branch 'fix-sockmap-flow_dissector-uapi'Alexei Starovoitov
Lorenz Bauer says: ==================== Both sockmap and flow_dissector ingnore various arguments passed to BPF_PROG_ATTACH and BPF_PROG_DETACH. We can fix the attach case by checking that the unused arguments are zero. I considered requiring target_fd to be -1 instead of 0, but this leads to a lot of churn in selftests. There is also precedent in that bpf_iter already expects 0 for a similar field. I think that we can come up with a work around for fd 0 should we need to in the future. The detach case is more problematic: both cgroups and lirc2 verify that attach_bpf_fd matches the currently attached program. This way you need access to the program fd to be able to remove it. Neither sockmap nor flow_dissector do this. flow_dissector even has a check for CAP_NET_ADMIN because of this. The patch set addresses this by implementing the desired behaviour. There is a possibility for user space breakage: any callers that don't provide the correct fd will fail with ENOENT. For sockmap the risk is low: even the selftests assume that sockmap works the way I described. For flow_dissector the story is less straightforward, and the selftests use a variety of arguments. I've includes fixes tags for the oldest commits that allow an easy backport, however the behaviour dates back to when sockmap and flow_dissector were introduced. What is the best way to handle these? This set is based on top of Jakub's work "bpf, netns: Prepare for multi-prog attachment" available at https://lore.kernel.org/bpf/87k0zwmhtb.fsf@cloudflare.com/T/ Since v1: - Adjust selftests - Implement detach behaviour ==================== Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2020-06-30selftests: bpf: Pass program to bpf_prog_detach in flow_dissectorLorenz Bauer
Calling bpf_prog_detach is incorrect, since it takes target_fd as its argument. The intention here is to pass it as attach_bpf_fd, so use bpf_prog_detach2 and pass zero for target_fd. Fixes: 06716e04a043 ("selftests/bpf: Extend test_flow_dissector to cover link creation") Signed-off-by: Lorenz Bauer <lmb@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200629095630.7933-7-lmb@cloudflare.com
2020-06-30selftests: bpf: Pass program and target_fd in flow_dissector_reattachLorenz Bauer
Pass 0 as target_fd when attaching and detaching flow dissector. Additionally, pass the expected program when detaching. Fixes: 1f043f87bb59 ("selftests/bpf: Add tests for attaching bpf_link to netns") Signed-off-by: Lorenz Bauer <lmb@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200629095630.7933-6-lmb@cloudflare.com
2020-06-30bpf: sockmap: Require attach_bpf_fd when detaching a programLorenz Bauer
The sockmap code currently ignores the value of attach_bpf_fd when detaching a program. This is contrary to the usual behaviour of checking that attach_bpf_fd represents the currently attached program. Ensure that attach_bpf_fd is indeed the currently attached program. It turns out that all sockmap selftests already do this, which indicates that this is unlikely to cause breakage. Fixes: 604326b41a6f ("bpf, sockmap: convert to generic sk_msg interface") Signed-off-by: Lorenz Bauer <lmb@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200629095630.7933-5-lmb@cloudflare.com
2020-06-30bpf: sockmap: Check value of unused args to BPF_PROG_ATTACHLorenz Bauer
Using BPF_PROG_ATTACH on a sockmap program currently understands no flags or replace_bpf_fd, but accepts any value. Return EINVAL instead. Fixes: 604326b41a6f ("bpf, sockmap: convert to generic sk_msg interface") Signed-off-by: Lorenz Bauer <lmb@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200629095630.7933-4-lmb@cloudflare.com
2020-06-30bpf: flow_dissector: Check value of unused flags to BPF_PROG_DETACHLorenz Bauer
Using BPF_PROG_DETACH on a flow dissector program supports neither attach_flags nor attach_bpf_fd. Yet no value is enforced for them. Enforce that attach_flags are zero, and require the current program to be passed via attach_bpf_fd. This allows us to remove the check for CAP_SYS_ADMIN, since userspace can now no longer remove arbitrary flow dissector programs. Fixes: b27f7bb590ba ("flow_dissector: Move out netns_bpf prog callbacks") Signed-off-by: Lorenz Bauer <lmb@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200629095630.7933-3-lmb@cloudflare.com
2020-06-30bpf: flow_dissector: Check value of unused flags to BPF_PROG_ATTACHLorenz Bauer
Using BPF_PROG_ATTACH on a flow dissector program supports neither target_fd, attach_flags or replace_bpf_fd but accepts any value. Enforce that all of them are zero. This is fine for replace_bpf_fd since its presence is indicated by BPF_F_REPLACE. It's more problematic for target_fd, since zero is a valid fd. Should we want to use the flag later on we'd have to add an exception for fd 0. The alternative is to force a value like -1. This requires more changes to tests. There is also precedent for using 0, since bpf_iter uses this for target_fd as well. Fixes: b27f7bb590ba ("flow_dissector: Move out netns_bpf prog callbacks") Signed-off-by: Lorenz Bauer <lmb@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200629095630.7933-2-lmb@cloudflare.com
2020-06-30Merge branch 'bpf-multi-prog-prep'Alexei Starovoitov
Jakub Sitnicki says: ==================== This patch set prepares ground for link-based multi-prog attachment for future netns attach types, with BPF_SK_LOOKUP attach type in mind [0]. Two changes are needed in order to attach and run a series of BPF programs: 1) an bpf_prog_array of programs to run (patch #2), and 2) a list of attached links to keep track of attachments (patch #3). Nothing changes for BPF flow_dissector. Just as before only one program can be attached to netns. In v3 I've simplified patch #2 that introduces bpf_prog_array to take advantage of the fact that it will hold at most one program for now. In particular, I'm no longer using bpf_prog_array_copy. It turned out to be less suitable for link operations than I thought as it fails to append the same BPF program. bpf_prog_array_replace_item is also gone, because we know we always want to replace the first element in prog_array. Naturally the code that handles bpf_prog_array will need change once more when there is a program type that allows multi-prog attachment. But I feel it will be better to do it gradually and present it together with tests that actually exercise multi-prog code paths. [0] https://lore.kernel.org/bpf/20200511185218.1422406-1-jakub@cloudflare.com/ v2 -> v3: - Don't check if run_array is null in link update callback. (Martin) - Allow updating the link with the same BPF program. (Andrii) - Add patch #4 with a test for the above case. - Kill bpf_prog_array_replace_item. Access the run_array directly. - Switch from bpf_prog_array_copy() to bpf_prog_array_alloc(1, ...). - Replace rcu_deref_protected & RCU_INIT_POINTER with rcu_replace_pointer. - Drop Andrii's Ack from patch #2. Code changed. v1 -> v2: - Show with a (void) cast that bpf_prog_array_replace_item() return value is ignored on purpose. (Andrii) - Explain why bpf-cgroup cannot replace programs in bpf_prog_array based on bpf_prog pointer comparison in patch #2 description. (Andrii) ==================== Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2020-06-30selftests/bpf: Test updating flow_dissector link with same programJakub Sitnicki
This case, while not particularly useful, is worth covering because we expect the operation to succeed as opposed when re-attaching the same program directly with PROG_ATTACH. While at it, update the tests summary that fell out of sync when tests extended to cover links. Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20200625141357.910330-5-jakub@cloudflare.com
2020-06-30bpf, netns: Keep a list of attached bpf_link'sJakub Sitnicki
To support multi-prog link-based attachments for new netns attach types, we need to keep track of more than one bpf_link per attach type. Hence, convert net->bpf.links into a list, that currently can be either empty or have just one item. Instead of reusing bpf_prog_list from bpf-cgroup, we link together bpf_netns_link's themselves. This makes list management simpler as we don't have to allocate, initialize, and later release list elements. We can do this because multi-prog attachment will be available only for bpf_link, and we don't need to build a list of programs attached directly and indirectly via links. No functional changes intended. Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20200625141357.910330-4-jakub@cloudflare.com
2020-06-30bpf, netns: Keep attached programs in bpf_prog_arrayJakub Sitnicki
Prepare for having multi-prog attachments for new netns attach types by storing programs to run in a bpf_prog_array, which is well suited for iterating over programs and running them in sequence. After this change bpf(PROG_QUERY) may block to allocate memory in bpf_prog_array_copy_to_user() for collected program IDs. This forces a change in how we protect access to the attached program in the query callback. Because bpf_prog_array_copy_to_user() can sleep, we switch from an RCU read lock to holding a mutex that serializes updaters. Because we allow only one BPF flow_dissector program to be attached to netns at all times, the bpf_prog_array pointed by net->bpf.run_array is always either detached (null) or one element long. No functional changes intended. Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/20200625141357.910330-3-jakub@cloudflare.com
2020-06-30flow_dissector: Pull BPF program assignment up to bpf-netnsJakub Sitnicki
Prepare for using bpf_prog_array to store attached programs by moving out code that updates the attached program out of flow dissector. Managing bpf_prog_array is more involved than updating a single bpf_prog pointer. This will let us do it all from one place, bpf/net_namespace.c, in the subsequent patch. No functional change intended. Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20200625141357.910330-2-jakub@cloudflare.com
2020-06-30bpf: Enforce BPF ringbuf size to be the power of 2Andrii Nakryiko
BPF ringbuf assumes the size to be a multiple of page size and the power of 2 value. The latter is important to avoid division while calculating position inside the ring buffer and using (N-1) mask instead. This patch fixes omission to enforce power-of-2 size rule. Fixes: 457f44363a88 ("bpf: Implement BPF ring buffer and verifier support for it") Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20200630061500.1804799-1-andriin@fb.com
2020-06-30xsk: Use dma_need_sync instead of reimplenting itChristoph Hellwig
Use the dma_need_sync helper instead of (not always entirely correctly) poking into the dma-mapping internals. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20200629130359.2690853-5-hch@lst.de
2020-06-30xsk: Remove a double pool->dev assignment in xp_dma_mapChristoph Hellwig
->dev is already assigned at the top of the function, remove the duplicate one at the end. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20200629130359.2690853-4-hch@lst.de
2020-06-30xsk: Replace the cheap_dma flag with a dma_need_sync flagChristoph Hellwig
Invert the polarity and better name the flag so that the use case is properly documented. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20200629130359.2690853-3-hch@lst.de
2020-06-30dma-mapping: Add a new dma_need_sync APIChristoph Hellwig
Add a new API to check if calls to dma_sync_single_for_{device,cpu} are required for a given DMA streaming mapping. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20200629130359.2690853-2-hch@lst.de
2020-06-28Merge branch 'fix-sockmap'Alexei Starovoitov
John Fastabend says: ==================== Fix a splat introduced by recent changes to avoid skipping ingress policy when kTLS is enabled. The RCU splat was introduced because in the non-TLS case the caller is wrapped in an rcu_read_lock/unlock. But, in the TLS case we have a reference to the psock and the caller did not wrap its call in rcu_read_lock/unlock. To fix extend the RCU section to include the redirect case which was missed. From v1->v2 I changed the location a bit to simplify the code some. See patch 1. But, then Martin asked why it was not needed in the non-TLS case. The answer for patch 1 was, as stated above, because the caller has the rcu read lock. However, there was still a missing case where a BPF user could in-theory line up a set of parameters to hit a case where the code was entered from strparser side from a different context then the initial caller. To hit this user would need a parser program to return value greater than skb->len then an ENOMEM error could happen in the strparser codepath triggering strparser to retry from a workqueue and without rcu_read_lock original caller used. See patch 2 for details. Finally, we don't actually have any selftests for parser returning a value geater than skb->len so add one in patch 3. This is especially needed because at least I don't have any code that uses the parser to return value greater than skb->len. So I wouldn't have caught any errors here in my own testing. Thanks, John v1->v2: simplify code in patch 1 some and add patches 2 and 3. ==================== Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2020-06-28bpf, sockmap: Add ingres skb tests that utilize merge skbsJohn Fastabend
Add a test to check strparser merging skbs is working. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/159312681884.18340.4922800172600252370.stgit@john-XPS-13-9370
2020-06-28bpf, sockmap: RCU dereferenced psock may be used outside RCU blockJohn Fastabend
If an ingress verdict program specifies message sizes greater than skb->len and there is an ENOMEM error due to memory pressure we may call the rcv_msg handler outside the strp_data_ready() caller context. This is because on an ENOMEM error the strparser will retry from a workqueue. The caller currently protects the use of psock by calling the strp_data_ready() inside a rcu_read_lock/unlock block. But, in above workqueue error case the psock is accessed outside the read_lock/unlock block of the caller. So instead of using psock directly we must do a look up against the sk again to ensure the psock is available. There is an an ugly piece here where we must handle the case where we paused the strp and removed the psock. On psock removal we first pause the strparser and then remove the psock. If the strparser is paused while an skb is scheduled on the workqueue the skb will be dropped on the flow and kfree_skb() is called. If the workqueue manages to get called before we pause the strparser but runs the rcvmsg callback after the psock is removed we will hit the unlikely case where we run the sockmap rcvmsg handler but do not have a psock. For now we will follow strparser logic and drop the skb on the floor with skb_kfree(). This is ugly because the data is dropped. To date this has not caused problems in practice because either the application controlling the sockmap is coordinating with the datapath so that skbs are "flushed" before removal or we simply wait for the sock to be closed before removing it. This patch fixes the describe RCU bug and dropping the skb doesn't make things worse. Future patches will improve this by allowing the normal case where skbs are not merged to skip the strparser altogether. In practice many (most?) use cases have no need to merge skbs so its both a code complexity hit as seen above and a performance issue. For example, in the Cilium case we always set the strparser up to return sbks 1:1 without any merging and have avoided above issues. Fixes: e91de6afa81c1 ("bpf: Fix running sk_skb program types with ktls") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/159312679888.18340.15248924071966273998.stgit@john-XPS-13-9370
2020-06-28bpf, sockmap: RCU splat with redirect and strparser error or TLSJohn Fastabend
There are two paths to generate the below RCU splat the first and most obvious is the result of the BPF verdict program issuing a redirect on a TLS socket (This is the splat shown below). Unlike the non-TLS case the caller of the *strp_read() hooks does not wrap the call in a rcu_read_lock/unlock. Then if the BPF program issues a redirect action we hit the RCU splat. However, in the non-TLS socket case the splat appears to be relatively rare, because the skmsg caller into the strp_data_ready() is wrapped in a rcu_read_lock/unlock. Shown here, static void sk_psock_strp_data_ready(struct sock *sk) { struct sk_psock *psock; rcu_read_lock(); psock = sk_psock(sk); if (likely(psock)) { if (tls_sw_has_ctx_rx(sk)) { psock->parser.saved_data_ready(sk); } else { write_lock_bh(&sk->sk_callback_lock); strp_data_ready(&psock->parser.strp); write_unlock_bh(&sk->sk_callback_lock); } } rcu_read_unlock(); } If the above was the only way to run the verdict program we would be safe. But, there is a case where the strparser may throw an ENOMEM error while parsing the skb. This is a result of a failed skb_clone, or alloc_skb_for_msg while building a new merged skb when the msg length needed spans multiple skbs. This will in turn put the skb on the strp_wrk workqueue in the strparser code. The skb will later be dequeued and verdict programs run, but now from a different context without the rcu_read_lock()/unlock() critical section in sk_psock_strp_data_ready() shown above. In practice I have not seen this yet, because as far as I know most users of the verdict programs are also only working on single skbs. In this case no merge happens which could trigger the above ENOMEM errors. In addition the system would need to be under memory pressure. For example, we can't hit the above case in selftests because we missed having tests to merge skbs. (Added in later patch) To fix the below splat extend the rcu_read_lock/unnlock block to include the call to sk_psock_tls_verdict_apply(). This will fix both TLS redirect case and non-TLS redirect+error case. Also remove psock from the sk_psock_tls_verdict_apply() function signature its not used there. [ 1095.937597] WARNING: suspicious RCU usage [ 1095.940964] 5.7.0-rc7-02911-g463bac5f1ca79 #1 Tainted: G W [ 1095.944363] ----------------------------- [ 1095.947384] include/linux/skmsg.h:284 suspicious rcu_dereference_check() usage! [ 1095.950866] [ 1095.950866] other info that might help us debug this: [ 1095.950866] [ 1095.957146] [ 1095.957146] rcu_scheduler_active = 2, debug_locks = 1 [ 1095.961482] 1 lock held by test_sockmap/15970: [ 1095.964501] #0: ffff9ea6b25de660 (sk_lock-AF_INET){+.+.}-{0:0}, at: tls_sw_recvmsg+0x13a/0x840 [tls] [ 1095.968568] [ 1095.968568] stack backtrace: [ 1095.975001] CPU: 1 PID: 15970 Comm: test_sockmap Tainted: G W 5.7.0-rc7-02911-g463bac5f1ca79 #1 [ 1095.977883] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014 [ 1095.980519] Call Trace: [ 1095.982191] dump_stack+0x8f/0xd0 [ 1095.984040] sk_psock_skb_redirect+0xa6/0xf0 [ 1095.986073] sk_psock_tls_strp_read+0x1d8/0x250 [ 1095.988095] tls_sw_recvmsg+0x714/0x840 [tls] v2: Improve commit message to identify non-TLS redirect plus error case condition as well as more common TLS case. In the process I decided doing the rcu_read_unlock followed by the lock/unlock inside branches was unnecessarily complex. We can just extend the current rcu block and get the same effeective without the shuffling and branching. Thanks Martin! Fixes: e91de6afa81c1 ("bpf: Fix running sk_skb program types with ktls") Reported-by: Jakub Sitnicki <jakub@cloudflare.com> Reported-by: kernel test robot <rong.a.chen@intel.com> Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Jakub Sitnicki <jakub@cloudflare.com> Link: https://lore.kernel.org/bpf/159312677907.18340.11064813152758406626.stgit@john-XPS-13-9370
2020-06-25libbpf: Adjust SEC short cut for expected attach type BPF_XDP_DEVMAPJesper Dangaard Brouer
Adjust the SEC("xdp_devmap/") prog type prefix to contain a slash "/" for expected attach type BPF_XDP_DEVMAP. This is consistent with other prog types like tracing. Fixes: 2778797037a6 ("libbpf: Add SEC name for xdp programs attached to device map") Suggested-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/159309521882.821855.6873145686353617509.stgit@firesoul
2020-06-25bpf: Do not allow btf_ctx_access with __int128 typesJohn Fastabend
To ensure btf_ctx_access() is safe the verifier checks that the BTF arg type is an int, enum, or pointer. When the function does the BTF arg lookup it uses the calculation 'arg = off / 8' using the fact that registers are 8B. This requires that the first arg is in the first reg, the second in the second, and so on. However, for __int128 the arg will consume two registers by default LLVM implementation. So this will cause the arg layout assumed by the 'arg = off / 8' calculation to be incorrect. Because __int128 is uncommon this patch applies the easiest fix and will force int types to be sizeof(u64) or smaller so that they will fit in a single register. v2: remove unneeded parens per Andrii's feedback Fixes: 9e15db66136a1 ("bpf: Implement accurate raw_tp context access via BTF") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/159303723962.11287.13309537171132420717.stgit@john-Precision-5820-Tower
2020-06-23bpf: Fix formatting in documentation for BPF helpersQuentin Monnet
When producing the bpf-helpers.7 man page from the documentation from the BPF user space header file, rst2man complains: <stdin>:2636: (ERROR/3) Unexpected indentation. <stdin>:2640: (WARNING/2) Block quote ends without a blank line; unexpected unindent. Let's fix formatting for the relevant chunk (item list in bpf_ringbuf_query()'s description), and for a couple other functions. Signed-off-by: Quentin Monnet <quentin@isovalent.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/20200623153935.6215-1-quentin@isovalent.com
2020-06-23bpf: Restore behaviour of CAP_SYS_ADMIN allowing the loading of networking ↵Maciej Żenczykowski
bpf programs This is a fix for a regression in commit 2c78ee898d8f ("bpf: Implement CAP_BPF"). Before the above commit it was possible to load network bpf programs with just the CAP_SYS_ADMIN privilege. The Android bpfloader happens to run in such a configuration (it has SYS_ADMIN but not NET_ADMIN) and creates maps and loads bpf programs for later use by Android's netd (which has NET_ADMIN but not SYS_ADMIN). Fixes: 2c78ee898d8f ("bpf: Implement CAP_BPF") Reported-by: John Stultz <john.stultz@linaro.org> Signed-off-by: Maciej Żenczykowski <maze@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Tested-by: John Stultz <john.stultz@linaro.org> Link: https://lore.kernel.org/bpf/20200620212616.93894-1-zenczykowski@gmail.com
2020-06-23bpf: Set the number of exception entries properly for subprogramsYonghong Song
Currently, if a bpf program has more than one subprograms, each program will be jitted separately. For programs with bpf-to-bpf calls the prog->aux->num_exentries is not setup properly. For example, with bpf_iter_netlink.c modified to force one function to be not inlined and with CONFIG_BPF_JIT_ALWAYS_ON the following error is seen: $ ./test_progs -n 3/3 ... libbpf: failed to load program 'iter/netlink' libbpf: failed to load object 'bpf_iter_netlink' libbpf: failed to load BPF skeleton 'bpf_iter_netlink': -4007 test_netlink:FAIL:bpf_iter_netlink__open_and_load skeleton open_and_load failed #3/3 netlink:FAIL The dmesg shows the following errors: ex gen bug which is triggered by the following code in arch/x86/net/bpf_jit_comp.c: if (excnt >= bpf_prog->aux->num_exentries) { pr_err("ex gen bug\n"); return -EFAULT; } This patch fixes the issue by computing proper num_exentries for each subprogram before calling JIT. Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2020-06-23libbpf: Fix CO-RE relocs against .text sectionAndrii Nakryiko
bpf_object__find_program_by_title(), used by CO-RE relocation code, doesn't return .text "BPF program", if it is a function storage for sub-programs. Because of that, any CO-RE relocation in helper non-inlined functions will fail. Fix this by searching for .text-corresponding BPF program manually. Adjust one of bpf_iter selftest to exhibit this pattern. Fixes: ddc7c3042614 ("libbpf: implement BPF CO-RE offset relocation algorithm") Reported-by: Yonghong Song <yhs@fb.com> Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20200619230423.691274-1-andriin@fb.com
2020-06-22libbpf: Forward-declare bpf_stats_type for systems with outdated UAPI headersAndrii Nakryiko
Systems that doesn't yet have the very latest linux/bpf.h header, enum bpf_stats_type will be undefined, causing compilation warnings. Prevents this by forward-declaring enum. Fixes: 0bee106716cf ("libbpf: Add support for command BPF_ENABLE_STATS") Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20200621031159.2279101-1-andriin@fb.com
2020-06-22MAINTAINERS: update email address for Felix FietkauFelix Fietkau
The old address has been bouncing for a while now Signed-off-by: Felix Fietkau <nbd@nbd.name> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-20net: Add MODULE_DESCRIPTION entries to network modulesRob Gill
The user tool modinfo is used to get information on kernel modules, including a description where it is available. This patch adds a brief MODULE_DESCRIPTION to the following modules: 9p drop_monitor esp4_offload esp6_offload fou fou6 ila sch_fq sch_fq_codel sch_hhf Signed-off-by: Rob Gill <rrobgill@protonmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-20rxrpc: Fix notification call on completion of discarded callsDavid Howells
When preallocated service calls are being discarded, they're passed to ->discard_new_call() to have the caller clean up any attached higher-layer preallocated pieces before being marked completed. However, the act of marking them completed now invokes the call's notification function - which causes a problem because that function might assume that the previously freed pieces of memory are still there. Fix this by setting a dummy notification function on the socket after calling ->discard_new_call(). This results in the following kasan message when the kafs module is removed. ================================================================== BUG: KASAN: use-after-free in afs_wake_up_async_call+0x6aa/0x770 fs/afs/rxrpc.c:707 Write of size 1 at addr ffff8880946c39e4 by task kworker/u4:1/21 CPU: 0 PID: 21 Comm: kworker/u4:1 Not tainted 5.8.0-rc1-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Workqueue: netns cleanup_net Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x18f/0x20d lib/dump_stack.c:118 print_address_description.constprop.0.cold+0xd3/0x413 mm/kasan/report.c:383 __kasan_report mm/kasan/report.c:513 [inline] kasan_report.cold+0x1f/0x37 mm/kasan/report.c:530 afs_wake_up_async_call+0x6aa/0x770 fs/afs/rxrpc.c:707 rxrpc_notify_socket+0x1db/0x5d0 net/rxrpc/recvmsg.c:40 __rxrpc_set_call_completion.part.0+0x172/0x410 net/rxrpc/recvmsg.c:76 __rxrpc_call_completed net/rxrpc/recvmsg.c:112 [inline] rxrpc_call_completed+0xca/0xf0 net/rxrpc/recvmsg.c:111 rxrpc_discard_prealloc+0x781/0xab0 net/rxrpc/call_accept.c:233 rxrpc_listen+0x147/0x360 net/rxrpc/af_rxrpc.c:245 afs_close_socket+0x95/0x320 fs/afs/rxrpc.c:110 afs_net_exit+0x1bc/0x310 fs/afs/main.c:155 ops_exit_list.isra.0+0xa8/0x150 net/core/net_namespace.c:186 cleanup_net+0x511/0xa50 net/core/net_namespace.c:603 process_one_work+0x965/0x1690 kernel/workqueue.c:2269 worker_thread+0x96/0xe10 kernel/workqueue.c:2415 kthread+0x3b5/0x4a0 kernel/kthread.c:291 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:293 Allocated by task 6820: save_stack+0x1b/0x40 mm/kasan/common.c:48 set_track mm/kasan/common.c:56 [inline] __kasan_kmalloc mm/kasan/common.c:494 [inline] __kasan_kmalloc.constprop.0+0xbf/0xd0 mm/kasan/common.c:467 kmem_cache_alloc_trace+0x153/0x7d0 mm/slab.c:3551 kmalloc include/linux/slab.h:555 [inline] kzalloc include/linux/slab.h:669 [inline] afs_alloc_call+0x55/0x630 fs/afs/rxrpc.c:141 afs_charge_preallocation+0xe9/0x2d0 fs/afs/rxrpc.c:757 afs_open_socket+0x292/0x360 fs/afs/rxrpc.c:92 afs_net_init+0xa6c/0xe30 fs/afs/main.c:125 ops_init+0xaf/0x420 net/core/net_namespace.c:151 setup_net+0x2de/0x860 net/core/net_namespace.c:341 copy_net_ns+0x293/0x590 net/core/net_namespace.c:482 create_new_namespaces+0x3fb/0xb30 kernel/nsproxy.c:110 unshare_nsproxy_namespaces+0xbd/0x1f0 kernel/nsproxy.c:231 ksys_unshare+0x43d/0x8e0 kernel/fork.c:2983 __do_sys_unshare kernel/fork.c:3051 [inline] __se_sys_unshare kernel/fork.c:3049 [inline] __x64_sys_unshare+0x2d/0x40 kernel/fork.c:3049 do_syscall_64+0x60/0xe0 arch/x86/entry/common.c:359 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Freed by task 21: save_stack+0x1b/0x40 mm/kasan/common.c:48 set_track mm/kasan/common.c:56 [inline] kasan_set_free_info mm/kasan/common.c:316 [inline] __kasan_slab_free+0xf7/0x140 mm/kasan/common.c:455 __cache_free mm/slab.c:3426 [inline] kfree+0x109/0x2b0 mm/slab.c:3757 afs_put_call+0x585/0xa40 fs/afs/rxrpc.c:190 rxrpc_discard_prealloc+0x764/0xab0 net/rxrpc/call_accept.c:230 rxrpc_listen+0x147/0x360 net/rxrpc/af_rxrpc.c:245 afs_close_socket+0x95/0x320 fs/afs/rxrpc.c:110 afs_net_exit+0x1bc/0x310 fs/afs/main.c:155 ops_exit_list.isra.0+0xa8/0x150 net/core/net_namespace.c:186 cleanup_net+0x511/0xa50 net/core/net_namespace.c:603 process_one_work+0x965/0x1690 kernel/workqueue.c:2269 worker_thread+0x96/0xe10 kernel/workqueue.c:2415 kthread+0x3b5/0x4a0 kernel/kthread.c:291 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:293 The buggy address belongs to the object at ffff8880946c3800 which belongs to the cache kmalloc-1k of size 1024 The buggy address is located 484 bytes inside of 1024-byte region [ffff8880946c3800, ffff8880946c3c00) The buggy address belongs to the page: page:ffffea000251b0c0 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 flags: 0xfffe0000000200(slab) raw: 00fffe0000000200 ffffea0002546508 ffffea00024fa248 ffff8880aa000c40 raw: 0000000000000000 ffff8880946c3000 0000000100000002 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff8880946c3880: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff8880946c3900: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb >ffff8880946c3980: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ^ ffff8880946c3a00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff8880946c3a80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ================================================================== Reported-by: syzbot+d3eccef36ddbd02713e9@syzkaller.appspotmail.com Fixes: 5ac0d62226a0 ("rxrpc: Fix missing notification") Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-20Merge tag 'ieee802154-for-davem-2020-06-19' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/sschmidt/wpan Stefan Schmidt says: ==================== pull-request: ieee802154 for net 2020-06-19 An update from ieee802154 for your *net* tree. Just two small maintenance fixes to update references to the new project homepage. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-20tc-testing: update geneve options match in tunnel_key unit testsHangbin Liu
Since iproute2 commit f72c3ad00f3b ("tc: m_tunnel_key: add options support for vxlan"), the geneve opt output use key word "geneve_opts" instead of "geneve_opt". To make compatibility for both old and new iproute2, let's accept both "geneve_opt" and "geneve_opts". Suggested-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Tested-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-20r8169: fix firmware not resetting tp->ocp_baseHeiner Kallweit
Typically the firmware takes care that tp->ocp_base is reset to its default value. That's not the case (at least) for RTL8117. As a result subsequent PHY access reads/writes the wrong page and the link is broken. Fix this be resetting tp->ocp_base explicitly. Fixes: 229c1e0dfd3d ("r8169: load firmware for RTL8168fp/RTL8117") Reported-by: Aaron Ma <mapengyu@gmail.com> Tested-by: Aaron Ma <mapengyu@gmail.com> Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-20ibmvnic: continue to init in CRQ reset returns H_CLOSEDDany Madden
Continue the reset path when partner adapter is not ready or H_CLOSED is returned from reset crq. This patch allows the CRQ init to proceed to establish a valid CRQ for traffic to flow after reset. Signed-off-by: Dany Madden <drt@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-20ionic: tame the watchdog timer on reconfigShannon Nelson
Even with moving netif_tx_disable() to an earlier point when taking down the queues for a reconfiguration, we still end up with the occasional netdev watchdog Tx Timeout complaint. The old method of using netif_trans_update() works fine for queue 0, but has no effect on the remaining queues. Using netif_device_detach() allows us to signal to the watchdog to ignore us for the moment. Fixes: beead698b173 ("ionic: Add the basic NDO callbacks for netdev support") Signed-off-by: Shannon Nelson <snelson@pensando.io> Acked-by: Jonathan Toppins <jtoppins@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-19selftests/net: report etf errors correctlyWillem de Bruijn
The ETF qdisc can queue skbs that it could not pace on the errqueue. Address a few issues in the selftest - recv buffer size was too small, and incorrectly calculated - compared errno to ee_code instead of ee_errno - missed invalid request error type v2: - fix a few checkpatch --strict indentation warnings Fixes: ea6a547669b3 ("selftests/net: make so_txtime more robust to timer variance") Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-19ibmveth: Fix max MTU limitThomas Falcon
The max MTU limit defined for ibmveth is not accounting for virtual ethernet buffer overhead, which is twenty-two additional bytes set aside for the ethernet header and eight additional bytes of an opaque handle reserved for use by the hypervisor. Update the max MTU to reflect this overhead. Fixes: d894be57ca92 ("ethernet: use net core MTU range checking in more drivers") Fixes: 110447f8269a ("ethernet: fix min/max MTU typos") Signed-off-by: Thomas Falcon <tlfalcon@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-19Merge branch 'several-fixes-for-indirect-flow_blocks-offload'David S. Miller
wenxu says: ==================== several fixes for indirect flow_blocks offload v2: patch2: store the cb_priv of representor to the flow_block_cb->indr.cb_priv in the driver. And make the correct check with the statments this->indr.cb_priv == cb_priv patch4: del the driver list only in the indriect cleanup callbacks v3: add the cover letter and changlogs. v4: collapsed 1/4, 2/4, 4/4 in v3 to one fix Add the prepare patch 1 and 2 v5: patch1: place flow_indr_block_cb_alloc() right before flow_indr_dev_setup_offload() to avoid moving flow_block_indr_init() This series fixes commit 1fac52da5942 ("net: flow_offload: consolidate indirect flow_block infrastructure") that revists the flow_block infrastructure. patch #1 #2: prepare for fix patch #3 add and use flow_indr_block_cb_alloc/remove function patch #3: fix flow_indr_dev_unregister path If the representor is removed, then identify the indirect flow_blocks that need to be removed by the release callback and the port representor structure. To identify the port representor structure, a new indr.cb_priv field needs to be introduced. The flow_block also needs to be removed from the driver list from the cleanup path patch#4 fix block->nooffloaddevcnt warning dmesg log. When a indr device add in offload success. The block->nooffloaddevcnt should be 0. After the representor go away. When the dir device go away the flow_block UNBIND operation with -EOPNOTSUPP which lead the warning demesg log. The block->nooffloaddevcnt should always count for indr block. even the indr block offload successful. The representor maybe gone away and the ingress qdisc can work in software mode. ==================== Reviewed-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-19net/sched: cls_api: fix nooffloaddevcnt warning dmesg logwenxu
The block->nooffloaddevcnt should always count for indr block. even the indr block offload successful. The representor maybe gone away and the ingress qdisc can work in software mode. block->nooffloaddevcnt warning with following dmesg log: [ 760.667058] ##################################################### [ 760.668186] ## TEST test-ecmp-add-vxlan-encap-disable-sriov.sh ## [ 760.669179] ##################################################### [ 761.780655] :test: Fedora 30 (Thirty) [ 761.783794] :test: Linux reg-r-vrt-018-180 5.7.0+ [ 761.822890] :test: NIC ens1f0 FW 16.26.6000 PCI 0000:81:00.0 DEVICE 0x1019 ConnectX-5 Ex [ 761.860244] mlx5_core 0000:81:00.0 ens1f0: Link up [ 761.880693] IPv6: ADDRCONF(NETDEV_CHANGE): ens1f0: link becomes ready [ 762.059732] mlx5_core 0000:81:00.1 ens1f1: Link up [ 762.234341] :test: unbind vfs of ens1f0 [ 762.257825] :test: Change ens1f0 eswitch (0000:81:00.0) mode to switchdev [ 762.291363] :test: unbind vfs of ens1f1 [ 762.306914] :test: Change ens1f1 eswitch (0000:81:00.1) mode to switchdev [ 762.309237] mlx5_core 0000:81:00.1: E-Switch: Disable: mode(LEGACY), nvfs(2), active vports(3) [ 763.282598] mlx5_core 0000:81:00.1: E-Switch: Supported tc offload range - chains: 4294967294, prios: 4294967295 [ 763.362825] mlx5_core 0000:81:00.1: MLX5E: StrdRq(1) RqSz(8) StrdSz(2048) RxCqeCmprss(0) [ 763.444465] mlx5_core 0000:81:00.1 ens1f1: renamed from eth0 [ 763.460088] mlx5_core 0000:81:00.1: MLX5E: StrdRq(1) RqSz(8) StrdSz(2048) RxCqeCmprss(0) [ 763.502586] mlx5_core 0000:81:00.1: MLX5E: StrdRq(1) RqSz(8) StrdSz(2048) RxCqeCmprss(0) [ 763.552429] ens1f1_0: renamed from eth0 [ 763.569569] mlx5_core 0000:81:00.1: E-Switch: Enable: mode(OFFLOADS), nvfs(2), active vports(3) [ 763.629694] ens1f1_1: renamed from eth1 [ 764.631552] IPv6: ADDRCONF(NETDEV_CHANGE): ens1f1_0: link becomes ready [ 764.670841] :test: unbind vfs of ens1f0 [ 764.681966] :test: unbind vfs of ens1f1 [ 764.726762] mlx5_core 0000:81:00.0 ens1f0: Link up [ 764.766511] mlx5_core 0000:81:00.1 ens1f1: Link up [ 764.797325] :test: Add multipath vxlan encap rule and disable sriov [ 764.798544] :test: config multipath route [ 764.812732] mlx5_core 0000:81:00.0: lag map port 1:2 port 2:2 [ 764.874556] mlx5_core 0000:81:00.0: modify lag map port 1:1 port 2:2 [ 765.603681] :test: OK [ 765.659048] IPv6: ADDRCONF(NETDEV_CHANGE): ens1f1_1: link becomes ready [ 765.675085] :test: verify rule in hw [ 765.694237] IPv6: ADDRCONF(NETDEV_CHANGE): ens1f0: link becomes ready [ 765.711892] IPv6: ADDRCONF(NETDEV_CHANGE): ens1f1: link becomes ready [ 766.979230] :test: OK [ 768.125419] :test: OK [ 768.127519] :test: - disable sriov ens1f1 [ 768.131160] pci 0000:81:02.2: Removing from iommu group 75 [ 768.132646] pci 0000:81:02.3: Removing from iommu group 76 [ 769.179749] mlx5_core 0000:81:00.1: E-Switch: Disable: mode(OFFLOADS), nvfs(2), active vports(3) [ 769.455627] mlx5_core 0000:81:00.0: modify lag map port 1:1 port 2:1 [ 769.703990] mlx5_core 0000:81:00.1: MLX5E: StrdRq(1) RqSz(8) StrdSz(2048) RxCqeCmprss(0) [ 769.988637] mlx5_core 0000:81:00.1 ens1f1: renamed from eth0 [ 769.990022] :test: - disable sriov ens1f0 [ 769.994922] pci 0000:81:00.2: Removing from iommu group 73 [ 769.997048] pci 0000:81:00.3: Removing from iommu group 74 [ 771.035813] mlx5_core 0000:81:00.0: E-Switch: Disable: mode(OFFLOADS), nvfs(2), active vports(3) [ 771.339091] ------------[ cut here ]------------ [ 771.340812] WARNING: CPU: 6 PID: 3448 at net/sched/cls_api.c:749 tcf_block_offload_unbind.isra.0+0x5c/0x60 [ 771.341728] Modules linked in: act_mirred act_tunnel_key cls_flower dummy vxlan ip6_udp_tunnel udp_tunnel sch_ingress nfsv3 nfs_acl nfs lockd grace fscache tun bridge stp llc sunrpc rdma_ucm rdma_cm iw_cm ib_cm mlx5_ib ib_uverbs ib_core mlx5_core intel_rapl_msr intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp mlxfw act_ct nf_flow_table kvm_intel nf_nat kvm nf_conntrack irqbypass crct10dif_pclmul igb crc32_pclmul nf_defrag_ipv6 libcrc32c nf_defrag_ipv4 crc32c_intel ghash_clmulni_intel ptp ipmi_ssif intel_cstate pps_c ore ses intel_uncore mei_me iTCO_wdt joydev ipmi_si iTCO_vendor_support i2c_i801 enclosure mei ioatdma dca lpc_ich wmi ipmi_devintf pcspkr acpi_power_meter ipmi_msghandler acpi_pad ast i2c_algo_bit drm_vram_helper drm_kms_helper drm_ttm_helper ttm drm mpt3sas raid_class scsi_transport_sas [ 771.347818] CPU: 6 PID: 3448 Comm: test-ecmp-add-v Not tainted 5.7.0+ #1146 [ 771.348727] Hardware name: Supermicro SYS-2028TP-DECR/X10DRT-P, BIOS 2.0b 03/30/2017 [ 771.349646] RIP: 0010:tcf_block_offload_unbind.isra.0+0x5c/0x60 [ 771.350553] Code: 4a fd ff ff 83 f8 a1 74 0e 5b 4c 89 e7 5d 41 5c 41 5d e9 07 93 89 ff 8b 83 a0 00 00 00 8d 50 ff 89 93 a0 00 00 00 85 c0 75 df <0f> 0b eb db 0f 1f 44 00 00 41 57 41 56 41 55 41 89 cd 41 54 49 89 [ 771.352420] RSP: 0018:ffffb33144cd3b00 EFLAGS: 00010246 [ 771.353353] RAX: 0000000000000000 RBX: ffff8b37cf4b2800 RCX: 0000000000000000 [ 771.354294] RDX: 00000000ffffffff RSI: ffff8b3b9aad0000 RDI: ffffffff8d5c6e20 [ 771.355245] RBP: ffff8b37eb546948 R08: ffffffffc0b7a348 R09: ffff8b3b9aad0000 [ 771.356189] R10: 0000000000000001 R11: ffff8b3ba7a0a1c0 R12: ffff8b37cf4b2850 [ 771.357123] R13: ffff8b3b9aad0000 R14: ffff8b37cf4b2820 R15: ffff8b37cf4b2820 [ 771.358039] FS: 00007f8a19b6e740(0000) GS:ffff8b3befa00000(0000) knlGS:0000000000000000 [ 771.358965] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 771.359885] CR2: 00007f3afb91c1a0 CR3: 000000045133c004 CR4: 00000000001606e0 [ 771.360825] Call Trace: [ 771.361764] __tcf_block_put+0x84/0x150 [ 771.362712] ingress_destroy+0x1b/0x20 [sch_ingress] [ 771.363658] qdisc_destroy+0x3e/0xc0 [ 771.364594] dev_shutdown+0x7a/0xa5 [ 771.365522] rollback_registered_many+0x20d/0x530 [ 771.366458] ? netdev_upper_dev_unlink+0x15d/0x1c0 [ 771.367387] unregister_netdevice_many.part.0+0xf/0x70 [ 771.368310] vxlan_netdevice_event+0xa4/0x110 [vxlan] [ 771.369454] notifier_call_chain+0x4c/0x70 [ 771.370579] rollback_registered_many+0x2f5/0x530 [ 771.371719] rollback_registered+0x56/0x90 [ 771.372843] unregister_netdevice_queue+0x73/0xb0 [ 771.373982] unregister_netdev+0x18/0x20 [ 771.375168] mlx5e_vport_rep_unload+0x56/0xc0 [mlx5_core] [ 771.376327] esw_offloads_disable+0x81/0x90 [mlx5_core] [ 771.377512] mlx5_eswitch_disable_locked.cold+0xcb/0x1af [mlx5_core] [ 771.378679] mlx5_eswitch_disable+0x44/0x60 [mlx5_core] [ 771.379822] mlx5_device_disable_sriov+0xad/0xb0 [mlx5_core] [ 771.380968] mlx5_core_sriov_configure+0xc1/0xe0 [mlx5_core] [ 771.382087] sriov_numvfs_store+0xfc/0x130 [ 771.383195] kernfs_fop_write+0xce/0x1b0 [ 771.384302] vfs_write+0xb6/0x1a0 [ 771.385410] ksys_write+0x5f/0xe0 [ 771.386500] do_syscall_64+0x5b/0x1d0 [ 771.387569] entry_SYSCALL_64_after_hwframe+0x44/0xa9 Fixes: 0fdcf78d5973 ("net: use flow_indr_dev_setup_offload()") Signed-off-by: wenxu <wenxu@ucloud.cn> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-19net: flow_offload: fix flow_indr_dev_unregister pathwenxu
If the representor is removed, then identify the indirect flow_blocks that need to be removed by the release callback and the port representor structure. To identify the port representor structure, a new indr.cb_priv field needs to be introduced. The flow_block also needs to be removed from the driver list from the cleanup path. Fixes: 1fac52da5942 ("net: flow_offload: consolidate indirect flow_block infrastructure") Signed-off-by: wenxu <wenxu@ucloud.cn> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-19flow_offload: use flow_indr_block_cb_alloc/remove functionwenxu
Prepare fix the bug in the next patch. use flow_indr_block_cb_alloc/remove function and remove the __flow_block_indr_binding. Signed-off-by: wenxu <wenxu@ucloud.cn> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-19flow_offload: add flow_indr_block_cb_alloc/remove functionwenxu
Add flow_indr_block_cb_alloc/remove function for next fix patch. Signed-off-by: wenxu <wenxu@ucloud.cn> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-19geneve: allow changing DF behavior after creationSabrina Dubroca
Currently, trying to change the DF parameter of a geneve device does nothing: # ip -d link show geneve1 14: geneve1: <snip> link/ether <snip> geneve id 1 remote 10.0.0.1 ttl auto df set dstport 6081 <snip> # ip link set geneve1 type geneve id 1 df unset # ip -d link show geneve1 14: geneve1: <snip> link/ether <snip> geneve id 1 remote 10.0.0.1 ttl auto df set dstport 6081 <snip> We just need to update the value in geneve_changelink. Fixes: a025fb5f49ad ("geneve: Allow configuration of DF behaviour") Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-19enetc: Fix HW_VLAN_CTAG_TX|RX togglingClaudiu Manoil
VLAN tag insertion/extraction offload is correctly activated at probe time but deactivation of this feature (i.e. via ethtool) is broken. Toggling works only for Tx/Rx ring 0 of a PF, and is ignored for the other rings, including the VF rings. To fix this, the existing VLAN offload toggling code was extended to all the rings assigned to a netdevice, instead of the default ring 0 (likely a leftover from the early validation days of this feature). And the code was moved to the common set_features() function to fix toggling for the VF driver too. Fixes: d4fd0404c1c9 ("enetc: Introduce basic PF and VF ENETC ethernet drivers") Signed-off-by: Claudiu Manoil <claudiu.manoil@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-19net: macb: undo operations in case of failureClaudiu Beznea
Undo previously done operation in case macb_phylink_connect() fails. Since macb_reset_hw() is the 1st undo operation the napi_exit label was renamed to reset_hw. Fixes: 7897b071ac3b ("net: macb: convert to phylink") Signed-off-by: Claudiu Beznea <claudiu.beznea@microchip.com> Acked-by: Nicolas Ferre <nicolas.ferre@microchip.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-19Merge tag 'rxrpc-fixes-20200618' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs David Howells says: ==================== rxrpc: Performance drop fix and other fixes Here are three fixes for rxrpc: (1) Fix a trace symbol mapping. It doesn't seem to let you map to "". (2) Fix the handling of the remote receive window size when it increases beyond the size we can support for our transmit window. (3) Fix a performance drop caused by retransmitted packets being accidentally marked as already ACK'd. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-19Merge branch 'net-phy-MDIO-bus-scanning-fixes'David S. Miller
Florian Fainelli says: ==================== net: phy: MDIO bus scanning fixes This patch series fixes two problems with the current MDIO bus scanning logic which was identified while moving from 4.9 to 5.4 on devices that do rely on scanning the MDIO bus at runtime because they use pluggable cards. Changes in v2: - added comment explaining the special value of -ENODEV - added Andrew's Reviewed-by tag ==================== Signed-off-by: David S. Miller <davem@davemloft.net>