summaryrefslogtreecommitdiff
path: root/net/ipv4
AgeCommit message (Collapse)Author
2021-04-01udp: Implement ->read_sock() for sockmapCong Wang
This is similar to tcp_read_sock(), except we do not need to worry about connections, we just need to retrieve skb from UDP receive queue. Note, the return value of ->read_sock() is unused in sk_psock_verdict_data_ready(), and UDP still does not support splice() due to lack of ->splice_read(), so users can not reach udp_read_sock() directly. Signed-off-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20210331023237.41094-12-xiyou.wangcong@gmail.com
2021-04-01sock: Introduce sk->sk_prot->psock_update_sk_prot()Cong Wang
Currently sockmap calls into each protocol to update the struct proto and replace it. This certainly won't work when the protocol is implemented as a module, for example, AF_UNIX. Introduce a new ops sk->sk_prot->psock_update_sk_prot(), so each protocol can implement its own way to replace the struct proto. This also helps get rid of symbol dependencies on CONFIG_INET. Signed-off-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20210331023237.41094-11-xiyou.wangcong@gmail.com
2021-04-01skmsg: Introduce a spinlock to protect ingress_msgCong Wang
Currently we rely on lock_sock to protect ingress_msg, it is too big for this, we can actually just use a spinlock to protect this list like protecting other skb queues. __tcp_bpf_recvmsg() is still special because of peeking, it still has to use lock_sock. Signed-off-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Jakub Sitnicki <jakub@cloudflare.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20210331023237.41094-3-xiyou.wangcong@gmail.com
2021-03-31tcp: convert tcp_comp_sack_nr sysctl to u8Eric Dumazet
tcp_comp_sack_nr max value was already 255. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-31ipv4: convert igmp_link_local_mcast_reports sysctl to u8Eric Dumazet
This sysctl is a bool, can use less storage. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-31ipv4: convert fib_multipath_{use_neigh|hash_policy} sysctls to u8Eric Dumazet
Make room for better packing of netns_ipv4 Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-31ipv4: convert udp_l3mdev_accept sysctl to u8Eric Dumazet
Reduce footprint of sysctls. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-31ipv4: convert fib_notify_on_flag_change sysctl to u8Eric Dumazet
Reduce footprint of sysctls. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-31Merge branch 'master' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec Steffen Klassert says: ==================== pull request (net): ipsec 2021-03-31 1) Fix ipv4 pmtu checks for xfrm anf vti interfaces. From Eyal Birger. 2) There are situations where the socket passed to xfrm_output_resume() is not the same as the one attached to the skb. Use the socket passed to xfrm_output_resume() to avoid lookup failures when xfrm is used with VRFs. From Evan Nimmo. 3) Make the xfrm_state_hash_generation sequence counter per network namespace because but its write serialization lock is also per network namespace. Write protection is insufficient otherwise. From Ahmed S. Darwish. 4) Fixup sctp featue flags when used with esp offload. From Xin Long. 5) xfrm BEET mode doesn't support fragments for inner packets. This is a limitation of the protocol, so no fix possible. Warn at least to notify the user about that situation. From Xin Long. 6) Fix NULL pointer dereference on policy lookup when namespaces are uses in combination with esp offload. 7) Fix incorrect transformation on esp offload when packets get segmented at layer 3. 8) Fix some user triggered usages of WARN_ONCE in the xfrm compat layer. From Dmitry Safonov. Please pull or let me know if there are problems. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-30net: fix icmp_echo_enable_probe sysctlEric Dumazet
sysctl_icmp_echo_enable_probe is an u8. ipv4_net_table entry should use .maxlen = sizeof(u8). .proc_handler = proc_dou8vec_minmax, Fixes: f1b8fa9fa586 ("net: add sysctl for enabling RFC 8335 PROBE messages") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Andreas Roeseler <andreas.a.roeseler@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-30udp: never accept GSO_FRAGLIST packetsPaolo Abeni
Currently the UDP protocol delivers GSO_FRAGLIST packets to the sockets without the expected segmentation. This change addresses the issue introducing and maintaining a couple of new fields to explicitly accept SKB_GSO_UDP_L4 or GSO_FRAGLIST packets. Additionally updates udp_unexpected_gso() accordingly. UDP sockets enabling UDP_GRO stil keep accept_udp_fraglist zeroed. v1 -> v2: - use 2 bits instead of a whole GSO bitmask (Willem) Fixes: 9fd1ff5d2ac7 ("udp: Support UDP fraglist GRO/GSO.") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-30udp: properly complete L4 GRO over UDP tunnel packetPaolo Abeni
After the previous patch, the stack can do L4 UDP aggregation on top of a UDP tunnel. In such scenario, udp{4,6}_gro_complete will be called twice. This function will enter its is_flist branch immediately, even though that is only correct on the second call, as GSO_FRAGLIST is only relevant for the inner packet. Instead, we need to try first UDP tunnel-based aggregation, if the GRO packet requires that. This patch changes udp{4,6}_gro_complete to skip the frag list processing when while encap_mark == 1, identifying processing of the outer tunnel header. Additionally, clears the field in udp_gro_complete() so that we can enter the frag list path on the next round, for the inner header. v1 -> v2: - hopefully clarified the commit message Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-30udp: skip L4 aggregation for UDP tunnel packetsPaolo Abeni
If NETIF_F_GRO_FRAGLIST or NETIF_F_GRO_UDP_FWD are enabled, and there are UDP tunnels available in the system, udp_gro_receive() could end-up doing L4 aggregation (either SKB_GSO_UDP_L4 or SKB_GSO_FRAGLIST) at the outer UDP tunnel level for packets effectively carrying and UDP tunnel header. That could cause inner protocol corruption. If e.g. the relevant packets carry a vxlan header, different vxlan ids will be ignored/ aggregated to the same GSO packet. Inner headers will be ignored, too, so that e.g. TCP over vxlan push packets will be held in the GRO engine till the next flush, etc. Just skip the SKB_GSO_UDP_L4 and SKB_GSO_FRAGLIST code path if the current packet could land in a UDP tunnel, and let udp_gro_receive() do GRO via udp_sk(sk)->gro_receive. The check implemented in this patch is broader than what is strictly needed, as the existing UDP tunnel could be e.g. configured on top of a different device: we could end-up skipping GRO at-all for some packets. Anyhow, that is a very thin corner case and covering it will add quite a bit of complexity. v1 -> v2: - hopefully clarify the commit message Fixes: 9fd1ff5d2ac7 ("udp: Support UDP fraglist GRO/GSO.") Fixes: 36707061d6ba ("udp: allow forwarding of plain (non-fraglisted) UDP GRO packets") Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-30udp: fixup csum for GSO receive slow pathPaolo Abeni
When UDP packets generated locally by a socket with UDP_SEGMENT traverse the following path: UDP tunnel(xmit) -> veth (segmentation) -> veth (gro) -> UDP tunnel (rx) -> UDP socket (no UDP_GRO) ip_summed will be set to CHECKSUM_PARTIAL at creation time and such checksum mode will be preserved in the above path up to the UDP tunnel receive code where we have: __iptunnel_pull_header() -> skb_pull_rcsum() -> skb_postpull_rcsum() -> __skb_postpull_rcsum() The latter will convert the skb to CHECKSUM_NONE. The UDP GSO packet will be later segmented as part of the rx socket receive operation, and will present a CHECKSUM_NONE after segmentation. Additionally the segmented packets UDP CB still refers to the original GSO packet len. Overall that causes unexpected/wrong csum validation errors later in the UDP receive path. We could possibly address the issue with some additional checks and csum mangling in the UDP tunnel code. Since the issue affects only this UDP receive slow path, let's set a suitable csum status there. Note that SKB_GSO_UDP_L4 or SKB_GSO_FRAGLIST packets lacking an UDP encapsulation present a valid checksum when landing to udp_queue_rcv_skb(), as the UDP checksum has been validated by the GRO engine. v2 -> v3: - even more verbose commit message and comments v1 -> v2: - restrict the csum update to the packets strictly needing them - hopefully clarify the commit message and code comments Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-31netfilter: nf_log_arp: merge with nf_log_syslogFlorian Westphal
similar to previous change: nf_log_syslog now covers ARP logging as well, the old nf_log_arp module is removed. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-03-31netfilter: nf_log_ipv4: rename to nf_log_syslogFlorian Westphal
Netfilter has multiple log modules: nf_log_arp nf_log_bridge nf_log_ipv4 nf_log_ipv6 nf_log_netdev nfnetlink_log nf_log_common With the exception of nfnetlink_log (packet is sent to userspace for dissection/logging), all of them log to the kernel ringbuffer. This is the first part of a series to merge all modules except nfnetlink_log into a single module: nf_log_syslog. This allows to reduce code. After the series, only two log modules remain: nfnetlink_log and nf_log_syslog. The latter provides the same functionality as the old per-af log modules. This renames nf_log_ipv4 to nf_log_syslog. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-03-30icmp: add response to RFC 8335 PROBE messagesAndreas Roeseler
Modify the icmp_rcv function to check PROBE messages and call icmp_echo if a PROBE request is detected. Modify the existing icmp_echo function to respond ot both ping and PROBE requests. This was tested using a custom modification to the iputils package and wireshark. It supports IPV4 probing by name, ifindex, and probing by both IPV4 and IPV6 addresses. It currently does not support responding to probes off the proxy node (see RFC 8335 Section 2). The modification to the iputils package is still in development and can be found here: https://github.com/Juniper-Clinic-2020/iputils.git. It supports full sending functionality of PROBE requests, but currently does not parse the response messages, which is why Wireshark is required to verify the sent and recieved PROBE messages. The modification adds the ``-e'' flag to the command which allows the user to specify the interface identifier to query the probed host. An example usage would be <./ping -4 -e 1 [destination]> to send a PROBE request of ifindex 1 to the destination node. Signed-off-by: Andreas Roeseler <andreas.a.roeseler@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-30net: add support for sending RFC 8335 PROBE messagesAndreas Roeseler
Modify the ping_supported function to support PROBE message types. This allows tools such as the ping command in the iputils package to be modified to send PROBE requests through the existing framework for sending ping requests. Signed-off-by: Andreas Roeseler <andreas.a.roeseler@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-30net: add sysctl for enabling RFC 8335 PROBE messagesAndreas Roeseler
Section 8 of RFC 8335 specifies potential security concerns of responding to PROBE requests, and states that nodes that support PROBE functionality MUST be able to enable/disable responses and that responses MUST be disabled by default Signed-off-by: Andreas Roeseler <andreas.a.roeseler@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-29bpf: tcp: Limit calling some tcp cc functions to CONFIG_DYNAMIC_FTRACEMartin KaFai Lau
pahole currently only generates the btf_id for external function and ftrace-able function. Some functions in the bpf_tcp_ca_kfunc_ids are static (e.g. cubictcp_init). Thus, unless CONFIG_DYNAMIC_FTRACE is set, btf_ids for those functions will not be generated and the compilation fails during resolve_btfids. This patch limits those functions to CONFIG_DYNAMIC_FTRACE. I will address the pahole generation in a followup and then remove the CONFIG_DYNAMIC_FTRACE limitation. Fixes: e78aea8b2170 ("bpf: tcp: Put some tcp cong functions in allowlist for bpf-tcp-cc") Reported-by: Cong Wang <xiyou.wangcong@gmail.com> Reported-by: Lorenz Bauer <lmb@cloudflare.com> Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20210329221357.834438-1-kafai@fb.com
2021-03-29tcp: fix tcp_min_tso_segs sysctlEric Dumazet
tcp_min_tso_segs is now stored in u8, so max value is 255. 255 limit is enforced by proc_dou8vec_minmax(). We can therefore remove the gso_max_segs variable. Fixes: 47996b489bdc ("tcp: convert elligible sysctls to u8") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-29xfrm: Provide private skb extensions for segmented and hw offloaded ESP packetsSteffen Klassert
Commit 94579ac3f6d0 ("xfrm: Fix double ESP trailer insertion in IPsec crypto offload.") added a XFRM_XMIT flag to avoid duplicate ESP trailer insertion on HW offload. This flag is set on the secpath that is shared amongst segments. This lead to a situation where some segments are not transformed correctly when segmentation happens at layer 3. Fix this by using private skb extensions for segmented and hw offloaded ESP packets. Fixes: 94579ac3f6d0 ("xfrm: Fix double ESP trailer insertion in IPsec crypto offload.") Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2021-03-28bpf: tcp: Fix an error in the bpf_tcp_ca_kfunc_ids listMartin KaFai Lau
There is a typo in the bbr function, s/even/event/. This patch fixes it. Fixes: e78aea8b2170 ("bpf: tcp: Put some tcp cong functions in allowlist for bpf-tcp-cc") Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20210329003213.2274210-1-kafai@fb.com
2021-03-28nexthop: Rename artifacts related to legacy multipath nexthop groupsPetr Machata
After resilient next-hop groups have been added recently, there are two types of multipath next-hop groups: the legacy "mpath", and the new "resilient". Calling the legacy next-hop group type "mpath" is unfortunate, because that describes the fact that a packet could be forwarded in one of several paths, which is also true for the resilient next-hop groups. Therefore, to make the naming clearer, rename various artifacts to reflect the assumptions made. Therefore as of this patch: - The flag for multipath groups is nh_grp_entry::is_multipath. This includes the legacy and resilient groups, as well as any future group types that behave as multipath groups. Functions that assume this have "mpath" in the name. - The flag for legacy multipath groups is nh_grp_entry::hash_threshold. Functions that assume this have "hthr" in the name. - The flag for resilient groups is nh_grp_entry::resilient. Functions that assume this have "res" in the name. Besides the above, struct nh_grp_entry::mpath was renamed to ::hthr as well. UAPI artifacts were obviously left intact. Suggested-by: David Ahern <dsahern@gmail.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-28ipv4: tcp_lp.c: Couple of typo fixesBhaskar Chowdhury
s/resrved/reserved/ s/within/within/ Signed-off-by: Bhaskar Chowdhury <unixbhaskar@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-28ipv4: ip_output.c: Couple of typo fixesBhaskar Chowdhury
s/readibility/readability/ s/insufficent/insufficient/ Signed-off-by: Bhaskar Chowdhury <unixbhaskar@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-28bpf: tcp: Remove comma which is causing build errorAtul Gopinathan
Currently, building the bpf-next source with the CONFIG_BPF_SYSCALL enabled is causing a compilation error: "net/ipv4/bpf_tcp_ca.c:209:28: error: expected identifier or '(' before ',' token" Fix this by removing an unnecessary comma. Fixes: e78aea8b2170 ("bpf: tcp: Put some tcp cong functions in allowlist for bpf-tcp-cc") Reported-by: syzbot+0b74d8ec3bf0cc4e4209@syzkaller.appspotmail.com Signed-off-by: Atul Gopinathan <atulgopinathan@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20210328120515.113895-1-atulgopinathan@gmail.com
2021-03-26bpf: tcp: Put some tcp cong functions in allowlist for bpf-tcp-ccMartin KaFai Lau
This patch puts some tcp cong helper functions, tcp_slow_start() and tcp_cong_avoid_ai(), into the allowlist for the bpf-tcp-cc program. A few tcp cc implementation functions are also put into the allowlist. A potential use case is the bpf-tcp-cc implementation may only want to override a subset of a tcp_congestion_ops. For others, the bpf-tcp-cc can directly call the kernel counter parts instead of re-implementing (or copy-and-pasting) them to the bpf program. They will only be available to the bpf-tcp-cc typed program. The allowlist functions are not bounded to a fixed ABI contract. When any of them has changed, the bpf-tcp-cc program has to be changed like any in-tree/out-of-tree kernel tcp-cc implementations do also. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20210325015201.1546345-1-kafai@fb.com
2021-03-26tcp: Rename bictcp function prefix to cubictcpMartin KaFai Lau
The cubic functions in tcp_cubic.c are using the bictcp prefix as in tcp_bic.c. This patch gives it the proper name cubictcp because the later patch will allow the bpf prog to directly call the cubictcp implementation. Renaming them will avoid the name collision when trying to find the intended one to call during bpf prog load time. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20210325015155.1545532-1-kafai@fb.com
2021-03-25tcp: convert elligible sysctls to u8Eric Dumazet
Many tcp sysctls are either bools or small ints that can fit into u8. Reducing space taken by sysctls can save few cache line misses when sending/receiving data while cpu caches are empty, for example after cpu idle period. This is hard to measure with typical network performance tests, but after this patch, struct netns_ipv4 has shrunk by three cache lines. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-25inet: convert tcp_early_demux and udp_early_demux to u8Eric Dumazet
For these sysctls, their dedicated helpers have to use proc_dou8vec_minmax(). Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-25ipv4: convert ip_forward_update_priority sysctl to u8Eric Dumazet
This sysctl uses ip_fwd_update_priority() helper, so the conversion needs to change it. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-25ipv4: shrink netns_ipv4 with sysctl conversionsEric Dumazet
These sysctls that can fit in one byte instead of one int are converted to save space and thus reduce cache line misses. - icmp_echo_ignore_all, icmp_echo_ignore_broadcasts, - icmp_ignore_bogus_error_responses, icmp_errors_use_inbound_ifaddr - tcp_ecn, tcp_ecn_fallback - ip_default_ttl, ip_no_pmtu_disc, ip_fwd_use_pmtu - ip_nonlocal_bind, ip_autobind_reuse - ip_dynaddr, ip_early_demux, raw_l3mdev_accept - nexthop_compat_mode, fwmark_reflect Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-25net: ipv4: Fix some typosLu Wei
Modify "accomodate" to "accommodate" in net/ipv4/esp4.c. Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Lu Wei <luwei32@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-25Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller
Alexei Starovoitov says: ==================== pull-request: bpf-next 2021-03-24 The following pull-request contains BPF updates for your *net-next* tree. We've added 37 non-merge commits during the last 15 day(s) which contain a total of 65 files changed, 3200 insertions(+), 738 deletions(-). The main changes are: 1) Static linking of multiple BPF ELF files, from Andrii. 2) Move drop error path to devmap for XDP_REDIRECT, from Lorenzo. 3) Spelling fixes from various folks. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-25Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netDavid S. Miller
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-24inet: use bigger hash table for IP ID generationEric Dumazet
In commit 73f156a6e8c1 ("inetpeer: get rid of ip_id_count") I used a very small hash table that could be abused by patient attackers to reveal sensitive information. Switch to a dynamic sizing, depending on RAM size. Typical big hosts will now use 128x more storage (2 MB) to get a similar increase in security and reduction of hash collisions. As a bonus, use of alloc_large_system_hash() spreads allocated memory among all NUMA nodes. Fixes: 73f156a6e8c1 ("inetpeer: get rid of ip_id_count") Reported-by: Amit Klein <aksecurity@gmail.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Willy Tarreau <w@1wt.eu> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-22net: ipconfig: ic_dev can be NULL in ic_close_devsVladimir Oltean
ic_close_dev contains a generalization of the logic to not close a network interface if it's the host port for a DSA switch. This logic is disguised behind an iteration through the lowers of ic_dev in ic_close_dev. When no interface for ipconfig can be found, ic_dev is NULL, and ic_close_dev: - dereferences a NULL pointer when assigning selected_dev - would attempt to search through the lower interfaces of a NULL net_device pointer So we should protect against that case. The "lower_dev" iterator variable was shortened to "lower" in order to keep the 80 character limit. Fixes: f68cbaed67cb ("net: ipconfig: avoid use-after-free in ic_close_devs") Fixes: 46acf7bdbc72 ("Revert "net: ipv4: handle DSA enabled master network devices"") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Tested-by: Heiko Thiery <heiko.thiery@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-22esp: delete NETIF_F_SCTP_CRC bit from features for esp offloadXin Long
Now in esp4/6_gso_segment(), before calling inner proto .gso_segment, NETIF_F_CSUM_MASK bits are deleted, as HW won't be able to do the csum for inner proto due to the packet encrypted already. So the UDP/TCP packet has to do the checksum on its own .gso_segment. But SCTP is using CRC checksum, and for that NETIF_F_SCTP_CRC should be deleted to make SCTP do the csum in own .gso_segment as well. In Xiumei's testing with SCTP over IPsec/veth, the packets are kept dropping due to the wrong CRC checksum. Reported-by: Xiumei Mu <xmu@redhat.com> Fixes: 7862b4058b9f ("esp: Add gso handlers for esp4 and esp6") Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2021-03-18Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter fixes for net 1) Several patches to testore use of memory barriers instead of RCU to ensure consistent access to ruleset, from Mark Tomlinson. 2) Fix dump of expectation via ctnetlink, from Florian Westphal. 3) GRE helper works for IPv6, from Ludovic Senecaux. 4) Set error on unsupported flowtable flags. 5) Use delayed instead of deferrable workqueue in the flowtable, from Yinjun Zhang. 6) Fix spurious EEXIST in case of add-after-delete flowtable in the same batch. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-17bpf: net: Emit anonymous enum with BPF_TCP_CLOSE value explicitlyYonghong Song
The selftest failed to compile with clang-built bpf-next. Adding LLVM=1 to your vmlinux and selftest build will use clang. The error message is: progs/test_sk_storage_tracing.c:38:18: error: use of undeclared identifier 'BPF_TCP_CLOSE' if (newstate == BPF_TCP_CLOSE) ^ 1 error generated. make: *** [Makefile:423: /bpf-next/tools/testing/selftests/bpf/test_sk_storage_tracing.o] Error 1 The reason for the failure is that BPF_TCP_CLOSE, a value of an anonymous enum defined in uapi bpf.h, is not defined in vmlinux.h. gcc does not have this problem. Since vmlinux.h is derived from BTF which is derived from vmlinux DWARF, that means gcc-produced vmlinux DWARF has BPF_TCP_CLOSE while llvm-produced vmlinux DWARF does not have. BPF_TCP_CLOSE is referenced in net/ipv4/tcp.c as BUILD_BUG_ON((int)BPF_TCP_CLOSE != (int)TCP_CLOSE); The following test mimics the above BUILD_BUG_ON, preprocessed with clang compiler, and shows gcc DWARF contains BPF_TCP_CLOSE while llvm DWARF does not. $ cat t.c enum { BPF_TCP_ESTABLISHED = 1, BPF_TCP_CLOSE = 7, }; enum { TCP_ESTABLISHED = 1, TCP_CLOSE = 7, }; int test() { do { extern void __compiletime_assert_767(void) ; if ((int)BPF_TCP_CLOSE != (int)TCP_CLOSE) __compiletime_assert_767(); } while (0); return 0; } $ clang t.c -O2 -c -g && llvm-dwarfdump t.o | grep BPF_TCP_CLOSE $ gcc t.c -O2 -c -g && llvm-dwarfdump t.o | grep BPF_TCP_CLOSE DW_AT_name ("BPF_TCP_CLOSE") Further checking clang code find clang actually tried to evaluate condition at compile time. If it is definitely true/false, it will perform optimization and the whole if condition will be removed before generating IR/debuginfo. This patch explicited add an expression after the above mentioned BUILD_BUG_ON in net/ipv4/tcp.c like (void)BPF_TCP_ESTABLISHED to enable generation of debuginfo for the anonymous enum which also includes BPF_TCP_CLOSE. Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20210317174132.589276-1-yhs@fb.com
2021-03-16net: ipv4: route.c: simplify procfs codeYejune Deng
proc_creat_seq() that directly take a struct seq_operations, and deal with network namespaces in ->open. Signed-off-by: Yejune Deng <yejune.deng@gmail.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-15tcp: relookup sock for RST+ACK packets handled by obsolete req sockAlexander Ovechkin
Currently tcp_check_req can be called with obsolete req socket for which big socket have been already created (because of CPU race or early demux assigning req socket to multiple packets in gro batch). Commit e0f9759f530bf789e984 ("tcp: try to keep packet if SYN_RCV race is lost") added retry in case when tcp_check_req is called for PSH|ACK packet. But if client sends RST+ACK immediatly after connection being established (it is performing healthcheck, for example) retry does not occur. In that case tcp_check_req tries to close req socket, leaving big socket active. Fixes: e0f9759f530 ("tcp: try to keep packet if SYN_RCV race is lost") Signed-off-by: Alexander Ovechkin <ovov@yandex-team.ru> Reported-by: Oleg Senin <olegsenin@yandex-team.ru> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-15Revert "netfilter: x_tables: Switch synchronization to RCU"Mark Tomlinson
This reverts commit cc00bcaa589914096edef7fb87ca5cee4a166b5c. This (and the preceding) patch basically re-implemented the RCU mechanisms of patch 784544739a25. That patch was replaced because of the performance problems that it created when replacing tables. Now, we have the same issue: the call to synchronize_rcu() makes replacing tables slower by as much as an order of magnitude. Prior to using RCU a script calling "iptables" approx. 200 times was taking 1.16s. With RCU this increased to 11.59s. Revert these patches and fix the issue in a different way. Signed-off-by: Mark Tomlinson <mark.tomlinson@alliedtelesis.co.nz> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-03-15Revert "netfilter: x_tables: Update remaining dereference to RCU"Mark Tomlinson
This reverts commit 443d6e86f821a165fae3fc3fc13086d27ac140b1. This (and the following) patch basically re-implemented the RCU mechanisms of patch 784544739a25. That patch was replaced because of the performance problems that it created when replacing tables. Now, we have the same issue: the call to synchronize_rcu() makes replacing tables slower by as much as an order of magnitude. Revert these patches and fix the issue in a different way. Signed-off-by: Mark Tomlinson <mark.tomlinson@alliedtelesis.co.nz> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-03-12net: ipv4: route.c: Fix indentation of multi line comment.Shubhankar Kuranagatti
All comment lines inside the comment block have been aligned. Every line of comment starts with a * (uniformity in code). Signed-off-by: Shubhankar Kuranagatti <shubhankarvk@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-12esp4: Simplify the calculation of variablesJiapeng Chong
Fix the following coccicheck warnings: ./net/ipv4/esp4.c:757:16-18: WARNING !A || A && B is equivalent to !A || B. Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2021-03-11tcp: remove obsolete check in __tcp_retransmit_skb()Eric Dumazet
TSQ provides a nice way to avoid bufferbloat on individual socket, including retransmit packets. We can get rid of the old heuristic: /* Do not sent more than we queued. 1/4 is reserved for possible * copying overhead: fragmentation, tunneling, mangling etc. */ if (refcount_read(&sk->sk_wmem_alloc) > min_t(u32, sk->sk_wmem_queued + (sk->sk_wmem_queued >> 2), sk->sk_sndbuf)) return -EAGAIN; This heuristic was giving false positives according to Jakub, whenever TX completions are delayed above RTT. (Ack packets are processed by TCP stack before clones are orphaned/freed) Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Jakub Kicinski <kuba@kernel.org> Cc: Neal Cardwell <ncardwell@google.com> Cc: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-11tcp: consider using standard rtx logic in tcp_rcv_fastopen_synack()Eric Dumazet
Jakub reported Data included in a Fastopen SYN that had to be retransmit would have to wait for an RTO if TX completions are slow, even with prior fix. This is because tcp_rcv_fastopen_synack() does not use standard rtx logic, meaning TSQ handler exits early in tcp_tsq_write() because tp->lost_out == tp->retrans_out Lets make tcp_rcv_fastopen_synack() use standard rtx logic, by using tcp_mark_skb_lost() on the skb thats needs to be sent again. Not this raised a warning in tcp_fastretrans_alert() during my tests since we consider the data not being aknowledged by the receiver does not mean packet was lost on the network. Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Jakub Kicinski <kuba@kernel.org> Cc: Neal Cardwell <ncardwell@google.com> Cc: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-11tcp: plug skb_still_in_host_queue() to TSQEric Dumazet
Jakub and Neil reported an increase of RTO timers whenever TX completions are delayed a bit more (by increasing NIC TX coalescing parameters) Main issue is that TCP stack has a logic preventing a packet being retransmit if the prior clone has not yet been orphaned or freed. This logic came with commit 1f3279ae0c13 ("tcp: avoid retransmits of TCP packets hanging in host queues") Thankfully, in the case skb_still_in_host_queue() detects the initial clone is still in flight, it can use TSQ logic that will eventually retry later, at the moment the clone is freed or orphaned. Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Neil Spring <ntspring@fb.com> Reported-by: Jakub Kicinski <kuba@kernel.org> Cc: Neal Cardwell <ncardwell@google.com> Cc: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>