summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2020-11-20net: add annotation for sock_{lock,unlock}_fastPaolo Abeni
The static checker is fooled by the non-static locking scheme implemented by the mentioned helpers. Let's make its life easier adding some unconditional annotation so that the helpers are now interpreted as a plain spinlock from sparse. v1 -> v2: - add __releases() annotation to unlock_sock_fast() Signed-off-by: Paolo Abeni <pabeni@redhat.com> Link: https://lore.kernel.org/r/6ed7ae627d8271fb7f20e0a9c6750fbba1ac2635.1605634911.git.pabeni@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-20net: openvswitch: Be liberal in tcp conntrack.Numan Siddique
There is no easy way to distinguish if a conntracked tcp packet is marked invalid because of tcp_in_window() check error or because it doesn't belong to an existing connection. With this patch, openvswitch sets liberal tcp flag for the established sessions so that out of window packets are not marked invalid. A helper function - nf_ct_set_tcp_be_liberal(nf_conn) is added which sets this flag for both the directions of the nf_conn. Suggested-by: Florian Westphal <fw@strlen.de> Signed-off-by: Numan Siddique <nusiddiq@redhat.com> Acked-by: Florian Westphal <fw@strlen.de> Link: https://lore.kernel.org/r/20201116130126.3065077-1-nusiddiq@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-19IPv6: RTM_GETROUTE: Add RTA_ENCAP to resultOliver Herms
This patch adds an IPv6 routes encapsulation attribute to the result of netlink RTM_GETROUTE requests (i.e. ip route get 2001:db8::). Signed-off-by: Oliver Herms <oliver.peter.herms@gmail.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20201118230651.GA8861@tws Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-19mptcp: update rtx timeout only if required.Paolo Abeni
We must start the retransmission timer only there are pending data in the rtx queue. Otherwise we can hit a WARN_ON in mptcp_reset_timer(), as syzbot demonstrated. Reported-and-tested-by: syzbot+42aa53dafb66a07e5a24@syzkaller.appspotmail.com Fixes: d9ca1de8c0cd ("mptcp: move page frag allocation in mptcp_sendmsg()") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org> Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org> Link: https://lore.kernel.org/r/1a72039f112cae048c44d398ffa14e0a1432db3d.1605737083.git.pabeni@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-19devlink: move flash end and begin to core devlinkJacob Keller
When performing a flash update via devlink, device drivers may inform user space of status updates via devlink_flash_update_(begin|end|timeout|status)_notify functions. It is expected that drivers do not send any status notifications unless they send a begin and end message. If a driver sends a status notification without sending the appropriate end notification upon finishing (regardless of success or failure), the current implementation of the devlink userspace program can get stuck endlessly waiting for the end notification that will never come. The current ice driver implementation may send such a status message without the appropriate end notification in rare cases. Fixing the ice driver is relatively simple: we just need to send the begin_notify at the start of the function and always send an end_notify no matter how the function exits. Rather than assuming driver authors will always get this right in the future, lets just fix the API so that it is not possible to get wrong. Make devlink_flash_update_begin_notify and devlink_flash_update_end_notify static, and call them in devlink.c core code. Always send the begin_notify just before calling the driver's flash_update routine. Always send the end_notify just after the routine returns regardless of success or failure. Doing this makes the status notification easier to use from the driver, as it no longer needs to worry about catching failures and cleaning up by calling devlink_flash_update_end_notify. It is now no longer possible to do the wrong thing in this regard. We also save a couple of lines of code in each driver. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Acked-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-19devlink: move request_firmware out of driverJacob Keller
All drivers which implement the devlink flash update support, with the exception of netdevsim, use either request_firmware or request_firmware_direct to locate the firmware file. Rather than having each driver do this separately as part of its .flash_update implementation, perform the request_firmware within net/core/devlink.c Replace the file_name parameter in the struct devlink_flash_update_params with a pointer to the fw object. Use request_firmware rather than request_firmware_direct. Although most Linux distributions today do not have the fallback mechanism implemented, only about half the drivers used the _direct request, as compared to the generic request_firmware. In the event that a distribution does support the fallback mechanism, the devlink flash update ought to be able to use it to provide the firmware contents. For distributions which do not support the fallback userspace mechanism, there should be essentially no difference between request_firmware and request_firmware_direct. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Acked-by: Shannon Nelson <snelson@pensando.io> Acked-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-19Merge https://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-19Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfJakub Kicinski
Alexei Starovoitov says: ==================== 1) libbpf should not attempt to load unused subprogs, from Andrii. 2) Make strncpy_from_user() mask out bytes after NUL terminator, from Daniel. 3) Relax return code check for subprograms in the BPF verifier, from Dmitrii. 4) Fix several sockmap issues, from John. * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: fail_function: Remove a redundant mutex unlock selftest/bpf: Test bpf_probe_read_user_str() strips trailing bytes after NUL lib/strncpy_from_user.c: Mask out bytes after NUL terminator. libbpf: Fix VERSIONED_SYM_COUNT number parsing bpf, sockmap: Avoid failures from skb_to_sgvec when skb has frag_list bpf, sockmap: Handle memory acct if skb_verdict prog redirects to self bpf, sockmap: Avoid returning unneeded EAGAIN when redirecting to self bpf, sockmap: Use truesize with sk_rmem_schedule() bpf, sockmap: Ensure SO_RCVBUF memory is observed on ingress redirect bpf, sockmap: Fix partial copy_page_to_iter so progress can still be made selftests/bpf: Fix error return code in run_getsockopt_test() bpf: Relax return code check for subprograms tools, bpftool: Add missing close before bpftool net attach exit MAINTAINERS/bpf: Update Andrii's entry. selftests/bpf: Fix unused attribute usage in subprogs_unused test bpf: Fix unsigned 'datasec_id' compared with zero in check_pseudo_btf_id bpf: Fix passing zero to PTR_ERR() in bpf_btf_printf_prepare libbpf: Don't attempt to load unused subprog as an entry-point BPF program ==================== Link: https://lore.kernel.org/r/20201119200721.288-1-alexei.starovoitov@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-19net/smc: fix direct access to ib_gid_addr->ndev in smc_ib_determine_gid()Karsten Graul
Sparse complaints 3 times about: net/smc/smc_ib.c:203:52: warning: incorrect type in argument 1 (different address spaces) net/smc/smc_ib.c:203:52: expected struct net_device const *dev net/smc/smc_ib.c:203:52: got struct net_device [noderef] __rcu *const ndev Fix that by using the existing and validated ndev variable instead of accessing attr->ndev directly. Fixes: 5102eca9039b ("net/smc: Use rdma_read_gid_l2_fields to L2 fields") Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-19net/smc: fix matching of existing link groupsKarsten Graul
With the multi-subnet support of SMC-Dv2 the match for existing link groups should not include the vlanid of the network device. Set ini->smcd_version accordingly before the call to smc_conn_create() and use this value in smc_conn_create() to skip the vlanid check. Fixes: 5c21c4ccafe8 ("net/smc: determine accepted ISM devices") Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-19ipv6: Remove dependency of ipv6_frag_thdr_truncated on ipv6 moduleGeorg Kohmann
IPV6=m NF_DEFRAG_IPV6=y ld: net/ipv6/netfilter/nf_conntrack_reasm.o: in function `nf_ct_frag6_gather': net/ipv6/netfilter/nf_conntrack_reasm.c:462: undefined reference to `ipv6_frag_thdr_truncated' Netfilter is depending on ipv6 symbol ipv6_frag_thdr_truncated. This dependency is forcing IPV6=y. Remove this dependency by moving ipv6_frag_thdr_truncated out of ipv6. This is the same solution as used with a similar issues: Referring to commit 70b095c843266 ("ipv6: remove dependency of nf_defrag_ipv6 on ipv6 module") Fixes: 9d9e937b1c8b ("ipv6/netfilter: Discard first fragment not including all headers") Reported-by: Randy Dunlap <rdunlap@infradead.org> Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Georg Kohmann <geokohma@cisco.com> Acked-by: Pablo Neira Ayuso <pablo@netfilter.org> Acked-by: Randy Dunlap <rdunlap@infradead.org> # build-tested Link: https://lore.kernel.org/r/20201119095833.8409-1-geokohma@cisco.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-18net: bridge: replace struct br_vlan_stats with pcpu_sw_netstatsHeiner Kallweit
Struct br_vlan_stats duplicates pcpu_sw_netstats (apart from br_vlan_stats not defining an alignment requirement), therefore switch to using the latter one. Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Link: https://lore.kernel.org/r/04d25c3d-c5f6-3611-6d37-c2f40243dae2@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-18atm: nicstar: Replace in_interrupt() usageSebastian Andrzej Siewior
push_scqe() uses in_interrupt() to figure out if it is allowed to sleep. The usage of in_interrupt() in drivers is phased out and Linus clearly requested that code which changes behaviour depending on context should either be separated or the context be conveyed in an argument passed by the caller, which usually knows the context. Aside of that in_interrupt() is not correct as it does not catch preempt disabled regions which neither can sleep. ns_send() (the only caller of push_scqe()) has the following callers: - vcc_sendmsg() used as proto_ops::sendmsg is expected to be invoked in preemtible context. -> vcc->dev->ops->send() (ns_send()) - atm_vcc::send via atmdev_ops::send either directly (pointer copied by atm_init_aal34() or atm_init_aal5()) or via atm_send_aal0(). This is invoked by drivers (like br2684, clip, pppoatm, ...) which are called from net_device_ops::ndo_start_xmit with BH disabled. Add atmdev_ops::send_bh which is used by callers from BH context (atm_send_aal*()) and if this callback missing then ::send is used instead. Implement this callback in nicstar and use it to replace in_interrupt(). Cc: Chas Williams <3chas3@gmail.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-18net: Have netpoll bring-up DSA management interfaceFlorian Fainelli
DSA network devices rely on having their DSA management interface up and running otherwise their ndo_open() will return -ENETDOWN. Without doing this it would not be possible to use DSA devices as netconsole when configured on the command line. These devices also do not utilize the upper/lower linking so the check about the netpoll device having upper is not going to be a problem. The solution adopted here is identical to the one done for net/ipv4/ipconfig.c with 728c02089a0e ("net: ipv4: handle DSA enabled master network devices"), with the network namespace scope being restricted to that of the process configuring netpoll. Fixes: 04ff53f96a93 ("net: dsa: Add netconsole support") Tested-by: Vladimir Oltean <olteanv@gmail.com> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Link: https://lore.kernel.org/r/20201117035236.22658-1-f.fainelli@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-18ah6: fix error return code in ah6_input()Zhang Changzhong
Fix to return a negative error code from the error handling case instead of 0, as done elsewhere in this function. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Zhang Changzhong <zhangchangzhong@huawei.com> Link: https://lore.kernel.org/r/1605581105-35295-1-git-send-email-zhangchangzhong@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-17ipv4: use IS_ENABLED instead of ifdefFlorian Klink
Checking for ifdef CONFIG_x fails if CONFIG_x=m. Use IS_ENABLED instead, which is true for both built-ins and modules. Otherwise, a > ip -4 route add 1.2.3.4/32 via inet6 fe80::2 dev eth1 fails with the message "Error: IPv6 support not enabled in kernel." if CONFIG_IPV6 is `m`. In the spirit of b8127113d01e53adba15b41aefd37b90ed83d631. Fixes: d15662682db2 ("ipv4: Allow ipv6 gateway with ipv4 routes") Cc: Kim Phillips <kim.phillips@arm.com> Signed-off-by: Florian Klink <flokli@flokli.de> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20201115224509.2020651-1-flokli@flokli.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-17inet_diag: Fix error path to cancel the meseage in inet_req_diag_fill()Wang Hai
nlmsg_cancel() needs to be called in the error path of inet_req_diag_fill to cancel the message. Fixes: d545caca827b ("net: inet: diag: expose the socket mark to privileged processes.") Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Wang Hai <wanghai38@huawei.com> Link: https://lore.kernel.org/r/20201116082018.16496-1-wanghai38@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-18bpf, sockmap: Avoid failures from skb_to_sgvec when skb has frag_listJohn Fastabend
When skb has a frag_list its possible for skb_to_sgvec() to fail. This happens when the scatterlist has fewer elements to store pages than would be needed for the initial skb plus any of its frags. This case appears rare, but is possible when running an RX parser/verdict programs exposed to the internet. Currently, when this happens we throw an error, break the pipe, and kfree the msg. This effectively breaks the application or forces it to do a retry. Lets catch this case and handle it by doing an skb_linearize() on any skb we receive with frags. At this point skb_to_sgvec should not fail because the failing conditions would require frags to be in place. Fixes: 604326b41a6fb ("bpf, sockmap: convert to generic sk_msg interface") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com> Link: https://lore.kernel.org/bpf/160556576837.73229.14800682790808797635.stgit@john-XPS-13-9370
2020-11-18bpf, sockmap: Handle memory acct if skb_verdict prog redirects to selfJohn Fastabend
If the skb_verdict_prog redirects an skb knowingly to itself, fix your BPF program this is not optimal and an abuse of the API please use SK_PASS. That said there may be cases, such as socket load balancing, where picking the socket is hashed based or otherwise picks the same socket it was received on in some rare cases. If this happens we don't want to confuse userspace giving them an EAGAIN error if we can avoid it. To avoid double accounting in these cases. At the moment even if the skb has already been charged against the sockets rcvbuf and forward alloc we check it again and do set_owner_r() causing it to be orphaned and recharged. For one this is useless work, but more importantly we can have a case where the skb could be put on the ingress queue, but because we are under memory pressure we return EAGAIN. The trouble here is the skb has already been accounted for so any rcvbuf checks include the memory associated with the packet already. This rolls up and can result in unnecessary EAGAIN errors in userspace read() calls. Fix by doing an unlikely check and skipping checks if skb->sk == sk. Fixes: 51199405f9672 ("bpf: skb_verdict, support SK_PASS on RX BPF path") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com> Link: https://lore.kernel.org/bpf/160556574804.73229.11328201020039674147.stgit@john-XPS-13-9370
2020-11-18bpf, sockmap: Avoid returning unneeded EAGAIN when redirecting to selfJohn Fastabend
If a socket redirects to itself and it is under memory pressure it is possible to get a socket stuck so that recv() returns EAGAIN and the socket can not advance for some time. This happens because when redirecting a skb to the same socket we received the skb on we first check if it is OK to enqueue the skb on the receiving socket by checking memory limits. But, if the skb is itself the object holding the memory needed to enqueue the skb we will keep retrying from kernel side and always fail with EAGAIN. Then userspace will get a recv() EAGAIN error if there are no skbs in the psock ingress queue. This will continue until either some skbs get kfree'd causing the memory pressure to reduce far enough that we can enqueue the pending packet or the socket is destroyed. In some cases its possible to get a socket stuck for a noticeable amount of time if the socket is only receiving skbs from sk_skb verdict programs. To reproduce I make the socket memory limits ridiculously low so sockets are always under memory pressure. More often though if under memory pressure it looks like a spurious EAGAIN error on user space side causing userspace to retry and typically enough has moved on the memory side that it works. To fix skip memory checks and skb_orphan if receiving on the same sock as already assigned. For SK_PASS cases this is easy, its always the same socket so we can just omit the orphan/set_owner pair. For backlog cases we need to check skb->sk and decide if the orphan and set_owner pair are needed. Fixes: 51199405f9672 ("bpf: skb_verdict, support SK_PASS on RX BPF path") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com> Link: https://lore.kernel.org/bpf/160556572660.73229.12566203819812939627.stgit@john-XPS-13-9370
2020-11-18bpf, sockmap: Use truesize with sk_rmem_schedule()John Fastabend
We use skb->size with sk_rmem_scheduled() which is not correct. Instead use truesize to align with socket and tcp stack usage of sk_rmem_schedule. Suggested-by: Daniel Borkman <daniel@iogearbox.net> Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com> Link: https://lore.kernel.org/bpf/160556570616.73229.17003722112077507863.stgit@john-XPS-13-9370
2020-11-18bpf, sockmap: Ensure SO_RCVBUF memory is observed on ingress redirectJohn Fastabend
Fix sockmap sk_skb programs so that they observe sk_rcvbuf limits. This allows users to tune SO_RCVBUF and sockmap will honor them. We can refactor the if(charge) case out in later patches. But, keep this fix to the point. Fixes: 51199405f9672 ("bpf: skb_verdict, support SK_PASS on RX BPF path") Suggested-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com> Link: https://lore.kernel.org/bpf/160556568657.73229.8404601585878439060.stgit@john-XPS-13-9370
2020-11-18bpf, sockmap: Fix partial copy_page_to_iter so progress can still be madeJohn Fastabend
If copy_page_to_iter() fails or even partially completes, but with fewer bytes copied than expected we currently reset sg.start and return EFAULT. This proves problematic if we already copied data into the user buffer before we return an error. Because we leave the copied data in the user buffer and fail to unwind the scatterlist so kernel side believes data has been copied and user side believes data has _not_ been received. Expected behavior should be to return number of bytes copied and then on the next read we need to return the error assuming its still there. This can happen if we have a copy length spanning multiple scatterlist elements and one or more complete before the error is hit. The error is rare enough though that my normal testing with server side programs, such as nginx, httpd, envoy, etc., I have never seen this. The only reliable way to reproduce that I've found is to stream movies over my browser for a day or so and wait for it to hang. Not very scientific, but with a few extra WARN_ON()s in the code the bug was obvious. When we review the errors from copy_page_to_iter() it seems we are hitting a page fault from copy_page_to_iter_iovec() where the code checks fault_in_pages_writeable(buf, copy) where buf is the user buffer. It also seems typical server applications don't hit this case. The other way to try and reproduce this is run the sockmap selftest tool test_sockmap with data verification enabled, but it doesn't reproduce the fault. Perhaps we can trigger this case artificially somehow from the test tools. I haven't sorted out a way to do that yet though. Fixes: 604326b41a6fb ("bpf, sockmap: convert to generic sk_msg interface") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com> Link: https://lore.kernel.org/bpf/160556566659.73229.15694973114605301063.stgit@john-XPS-13-9370
2020-11-17net/tls: Fix wrong record sn in async mode of device resyncTariq Toukan
In async_resync mode, we log the TCP seq of records until the async request is completed. Later, in case one of the logged seqs matches the resync request, we return it, together with its record serial number. Before this fix, we mistakenly returned the serial number of the current record instead. Fixes: ed9b7646b06a ("net/tls: Add asynchronous resync") Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Boris Pismenny <borisp@nvidia.com> Link: https://lore.kernel.org/r/20201115131448.2702-1-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-17net: datagram: fix some kernel-doc markupsMauro Carvalho Chehab
Some identifiers have different names between their prototypes and the kernel-doc markup. Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org> Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-17net: wan: Delete the DLCI / SDLA driversXie He
The DLCI driver (dlci.c) implements the Frame Relay protocol. However, we already have another newer and better implementation of Frame Relay provided by the HDLC_FR driver (hdlc_fr.c). The DLCI driver's implementation of Frame Relay is used by only one hardware driver in the kernel - the SDLA driver (sdla.c). The SDLA driver provides Frame Relay support for the Sangoma S50x devices. However, the vendor provides their own driver (along with their own multi-WAN-protocol implementations including Frame Relay), called WANPIPE. I believe most users of the hardware would use the vendor-provided WANPIPE driver instead. (The WANPIPE driver was even once in the kernel, but was deleted in commit 8db60bcf3021 ("[WAN]: Remove broken and unmaintained Sangoma drivers.") because the vendor no longer updated the in-kernel WANPIPE driver.) Cc: Mike McLagan <mike.mclagan@linux.org> Signed-off-by: Xie He <xie.he.0141@gmail.com> Link: https://lore.kernel.org/r/20201114150921.685594-1-xie.he.0141@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-17tcp: only postpone PROBE_RTT if RTT is < current min_rtt estimateRyan Sharpelletti
During loss recovery, retransmitted packets are forced to use TCP timestamps to calculate the RTT samples, which have a millisecond granularity. BBR is designed using a microsecond granularity. As a result, multiple RTT samples could be truncated to the same RTT value during loss recovery. This is problematic, as BBR will not enter PROBE_RTT if the RTT sample is <= the current min_rtt sample, meaning that if there are persistent losses, PROBE_RTT will constantly be pushed off and potentially never re-entered. This patch makes sure that BBR enters PROBE_RTT by checking if RTT sample is < the current min_rtt sample, rather than <=. The Netflix transport/TCP team discovered this bug in the Linux TCP BBR code during lab tests. Fixes: 0f8782ea1497 ("tcp_bbr: add BBR congestion control") Signed-off-by: Ryan Sharpelletti <sharpelletti@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Link: https://lore.kernel.org/r/20201116174412.1433277-1-sharpelletti.kdev@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-17net: dsa: tag_dsa: Use a consistent comment styleTobias Waldekranz
Use a consistent style of one-line/multi-line comments throughout the file. Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-17net: dsa: tag_dsa: Unify regular and ethertype DSA taggersTobias Waldekranz
Ethertype DSA encodes exactly the same information in the DSA tag as the non-ethertype variety. So refactor out the common parts and reuse them for both protocols. This is ensures tag parsing and generation is always consistent across all mv88e6xxx chips. While we are at it, explicitly deal with all possible CPU codes on receive, making sure to set offload_fwd_mark as appropriate. Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-17net: dsa: tag_dsa: Allow forwarding of redirected IGMP trafficTobias Waldekranz
When receiving an IGMP/MLD frame with a TO_CPU tag, the switch has not performed any forwarding of it. This means that we should not set the offload_fwd_mark on the skb, in case a software bridge wants it forwarded. This is a port of: 1ed9ec9b08ad ("dsa: Allow forwarding of redirected IGMP traffic") Which corrected the issue for chips using EDSA tags, but not for those using regular DSA tags. Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-16net/tls: fix corrupted data in recvmsgVadim Fedorenko
If tcp socket has more data than Encrypted Handshake Message then tls_sw_recvmsg will try to decrypt next record instead of returning full control message to userspace as mentioned in comment. The next message - usually Application Data - gets corrupted because it uses zero copy for decryption that's why the data is not stored in skb for next iteration. Revert check to not decrypt next record if current is not Application Data. Fixes: 692d7b5d1f91 ("tls: Fix recvmsg() to be able to peek across multiple records") Signed-off-by: Vadim Fedorenko <vfedorenko@novek.ru> Link: https://lore.kernel.org/r/1605413760-21153-1-git-send-email-vfedorenko@novek.ru Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-16net: bridge: add missing counters to ndo_get_stats64 callbackHeiner Kallweit
In br_forward.c and br_input.c fields dev->stats.tx_dropped and dev->stats.multicast are populated, but they are ignored in ndo_get_stats64. Fixes: 28172739f0a2 ("net: fix 64 bit counters on 32 bit arches") Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Link: https://lore.kernel.org/r/58ea9963-77ad-a7cf-8dfd-fc95ab95f606@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-16mptcp: send explicit ack on delayed ack_seq incrPaolo Abeni
When the worker moves some bytes from the OoO queue into the receive queue, the msk->ask_seq is updated, the MPTCP-level ack carrying that value needs to wait the next ingress packet, possibly slowing down or hanging the peer Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-16mptcp: keep track of advertised windows right edgeFlorian Westphal
Before sending 'x' new bytes also check that the new snd_una would be within the permitted receive window. For every ACK that also contains a DSS ack, check whether its tcp-level receive window would advance the current mptcp window right edge and update it if so. Signed-off-by: Florian Westphal <fw@strlen.de> Co-developed-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-16mptcp: rework poll+nospace handlingFlorian Westphal
MPTCP maintains a status bit, MPTCP_SEND_SPACE, that is set when at least one subflow and the mptcp socket itself are writeable. mptcp_poll returns EPOLLOUT if the bit is set. mptcp_sendmsg makes sure MPTCP_SEND_SPACE gets cleared when last write has used up all subflows or the mptcp socket wmem. This reworks nospace handling as follows: MPTCP_SEND_SPACE is replaced with MPTCP_NOSPACE, i.e. inverted meaning. This bit is set when the mptcp socket is not writeable. The mptcp-level ack path schedule will then schedule the mptcp worker to allow it to free already-acked data (and reduce wmem usage). This will then wake userspace processes that wait for a POLLOUT event. sendmsg will set MPTCP_NOSPACE only when it has to wait for more wmem (blocking I/O case). poll path will set MPTCP_NOSPACE in case the mptcp socket is not writeable. Normal tcp-level notification (SOCK_NOSPACE) is only enabled in case the subflow socket has no available wmem. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-16mptcp: try to push pending data on snd una updatesPaolo Abeni
After the previous patch we may end-up with unsent data in the write buffer. If such buffer is full, the writer will block for unlimited time. We need to trigger the MPTCP xmit path even for the subflow rx path, on MPTCP snd_una updates. Keep things simple and just schedule the work queue if needed. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-16mptcp: move page frag allocation in mptcp_sendmsg()Paolo Abeni
mptcp_sendmsg() is refactored so that first it copies the data provided from user space into the send queue, and then tries to spool the send queue via sendmsg_frag. There a subtle change in the mptcp level collapsing on consecutive data fragment: we now allow that only on unsent data. The latter don't need to deal with msghdr data anymore and can be simplified in a relevant way. snd_nxt and write_seq are now tracked independently. Overall this allows some relevant cleanup and will allow sending pending mptcp data on msk una update in later patch. Co-developed-by: Florian Westphal <fw@strlen.de> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-16mptcp: refactor shutdown and closePaolo Abeni
We must not close the subflows before all the MPTCP level data, comprising the DATA_FIN has been acked at the MPTCP level, otherwise we could be unable to retransmit as needed. __mptcp_wr_shutdown() shutdown is responsible to check for the correct status and close all subflows. Is called by the output path after spooling any data and at shutdown/close time. In a similar way, __mptcp_destroy_sock() is responsible to clean-up the MPTCP level status, and is called when the msk transition to TCP_CLOSE. The protocol level close() does not force anymore the TCP_CLOSE status, but orphan the msk socket and all the subflows. Orphaned msk sockets are forciby closed after a timeout or when all MPTCP-level data is acked. There is a caveat about keeping the orphaned subflows around: the TCP stack can asynchronusly call tcp_cleanup_ulp() on them via tcp_close(). To prevent accessing freed memory on later MPTCP level operations, the msk acquires a reference to each subflow socket and prevent subflow_ulp_release() from releasing the subflow context before __mptcp_destroy_sock(). The additional subflow references are released by __mptcp_done() and the async ULP release is detected checking ULP ops. If such field has been already cleared by the ULP release path, the dangling context is freed directly by __mptcp_done(). Co-developed-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-16mptcp: introduce MPTCP snd_nxtPaolo Abeni
Track the next MPTCP sequence number used on xmit, currently always equal to write_next. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-16mptcp: add accounting for pending dataPaolo Abeni
Preparation patch to track the data pending in the msk write queue. No functional change introduced here Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-16mptcp: reduce the arguments of mptcp_sendmsg_fragPaolo Abeni
The current argument list is pretty long and quite unreadable, move many of them into a specific struct. Later patches will add more stuff to such struct. Additionally drop the 'timeo' argument, now unused. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-16mptcp: introduce mptcp_schedule_workPaolo Abeni
remove some of code duplications an allow preventing rescheduling on close. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-16tcp: factor out __tcp_close() helperPaolo Abeni
unlocked version of protocol level close, will be used by MPTCP to allow decouple orphaning and subflow level close. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-16mptcp: use tcp_build_frag()Paolo Abeni
mptcp_push_pending() is called even on orphaned msk (and orphaned subflows), if there is outstanding data at close() time. To cope with the above MPTCP needs to handle explicitly the allocation failure on xmit. The newly introduced do_tcp_sendfrag() allows that, just plug it. We can additionally drop a couple of sanity checks, duplicate in the TCP code. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-16tcp: factor out tcp_build_frag()Paolo Abeni
Will be needed by the next patch, as MPTCP needs to handle directly the error/memory-allocation-needed path. No functional changes intended. Additionally let MPTCP code access the tcp_remove_empty_skb() helper. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-16ipv6/netfilter: Discard first fragment not including all headersGeorg Kohmann
Packets are processed even though the first fragment don't include all headers through the upper layer header. This breaks TAHI IPv6 Core Conformance Test v6LC.1.3.6. Referring to RFC8200 SECTION 4.5: "If the first fragment does not include all headers through an Upper-Layer header, then that fragment should be discarded and an ICMP Parameter Problem, Code 3, message should be sent to the source of the fragment, with the Pointer field set to zero." The fragment needs to be validated the same way it is done in commit 2efdaaaf883a ("IPv6: reply ICMP error if the first fragment don't include all headers") for ipv6. Wrap the validation into a common function, ipv6_frag_thdr_truncated() to check for truncation in the upper layer header. This validation does not fullfill all aspects of RFC 8200, section 4.5, but is at the moment sufficient to pass mentioned TAHI test. In netfilter, utilize the fragment offset returned by find_prev_fhdr() to let ipv6_frag_thdr_truncated() start it's traverse from the fragment header. Return 0 to drop the fragment in the netfilter. This is the same behaviour as used on other protocol errors in this function, e.g. when nf_ct_frag6_queue() returns -EPROTO. The Fragment will later be picked up by ipv6_frag_rcv() in reassembly.c. ipv6_frag_rcv() will then send an appropriate ICMP Parameter Problem message back to the source. References commit 2efdaaaf883a ("IPv6: reply ICMP error if the first fragment don't include all headers") Signed-off-by: Georg Kohmann <geokohma@cisco.com> Acked-by: Pablo Neira Ayuso <pablo@netfilter.org> Link: https://lore.kernel.org/r/20201111115025.28879-1-geokohma@cisco.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-16treewide: rename nla_strlcpy to nla_strscpy.Francis Laniel
Calls to nla_strlcpy are now replaced by calls to nla_strscpy which is the new name of this function. Signed-off-by: Francis Laniel <laniel_francis@privacyrequired.com> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-16Modify return value of nla_strlcpy to match that of strscpy.Francis Laniel
nla_strlcpy now returns -E2BIG if src was truncated when written to dst. It also returns this error value if dstsize is 0 or higher than INT_MAX. For example, if src is "foo\0" and dst is 3 bytes long, the result will be: 1. "foG" after memcpy (G means garbage). 2. "fo\0" after memset. 3. -E2BIG is returned because src was not completely written into dst. The callers of nla_strlcpy were modified to take into account this modification. Signed-off-by: Francis Laniel <laniel_francis@privacyrequired.com> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-16Merge tag 'linux-can-fixes-for-5.10-20201115' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can Marc Kleine-Budde says: ==================== pull-request: can 2020-11-15 Anant Thazhemadam contributed two patches for the AF_CAN that prevent potential access of uninitialized member in can_rcv() and canfd_rcv(). The next patch is by Alejandro Concepcion Rodriguez and changes can_restart() to use the correct function to push a skb into the networking stack from process context. Zhang Qilong's patch fixes a memory leak in the error path of the ti_hecc's probe function. A patch by me fixes mcba_usb_start_xmit() function in the mcba_usb driver, to first fill the skb and then pass it to can_put_echo_skb(). Colin Ian King's patch fixes a potential integer overflow on shift in the peak_usb driver. The next two patches target the flexcan driver, a patch by me adds the missing "req_bit" to the stop mode property comment (which was broken during net-next for v5.10). Zhang Qilong's patch fixes the failure handling of pm_runtime_get_sync(). The next seven patches target the m_can driver including the tcan4x5x spi driver glue code. Enric Balletbo i Serra's patch for the tcan4x5x Kconfig fix the REGMAP_SPI dependency handling. A patch by me for the tcan4x5x driver's probe() function adds missing error handling to for devm_regmap_init(), and in tcan4x5x_can_remove() the order of deregistration is fixed. Wu Bo's patch for the m_can driver fixes the state change handling in m_can_handle_state_change(). Two patches by Dan Murphy first introduce m_can_class_free_dev() and then make use of it to fix the freeing of the can device. A patch by Faiz Abbas add a missing shutdown of the CAN controller in the m_can_stop() function. * tag 'linux-can-fixes-for-5.10-20201115' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can: can: m_can: m_can_stop(): set device to software init mode before closing can: m_can: Fix freeing of can device from peripherials can: m_can: m_can_class_free_dev(): introduce new function can: m_can: m_can_handle_state_change(): fix state change can: tcan4x5x: tcan4x5x_can_remove(): fix order of deregistration can: tcan4x5x: tcan4x5x_can_probe(): add missing error checking for devm_regmap_init() can: tcan4x5x: replace depends on REGMAP_SPI with depends on SPI can: flexcan: fix failure handling of pm_runtime_get_sync() can: flexcan: flexcan_setup_stop_mode(): add missing "req_bit" to stop mode property comment can: peak_usb: fix potential integer overflow on shift of a int can: mcba_usb: mcba_usb_start_xmit(): first fill skb, then pass to can_put_echo_skb() can: ti_hecc: Fix memleak in ti_hecc_probe can: dev: can_restart(): post buffer from the right context can: af_can: prevent potential access of uninitialized member in canfd_rcv() can: af_can: prevent potential access of uninitialized member in can_rcv() ==================== Link: https://lore.kernel.org/r/20201115174131.2089251-1-mkl@pengutronix.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-15can: af_can: prevent potential access of uninitialized member in canfd_rcv()Anant Thazhemadam
In canfd_rcv(), cfd->len is uninitialized when skb->len = 0, and this uninitialized cfd->len is accessed nonetheless by pr_warn_once(). Fix this uninitialized variable access by checking cfd->len's validity condition (cfd->len > CANFD_MAX_DLEN) separately after the skb->len's condition is checked, and appropriately modify the log messages that are generated as well. In case either of the required conditions fail, the skb is freed and NET_RX_DROP is returned, same as before. Fixes: d4689846881d ("can: af_can: canfd_rcv(): replace WARN_ONCE by pr_warn_once") Reported-by: syzbot+9bcb0c9409066696d3aa@syzkaller.appspotmail.com Tested-by: Anant Thazhemadam <anant.thazhemadam@gmail.com> Signed-off-by: Anant Thazhemadam <anant.thazhemadam@gmail.com> Link: https://lore.kernel.org/r/20201103213906.24219-3-anant.thazhemadam@gmail.com Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>