summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2019-07-05bridge: add br_vlan_get_proto()wenxu
This new function allows you to fetch the bridge port vlan protocol. Signed-off-by: wenxu <wenxu@ucloud.cn> Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-07-05netfilter: nft_meta_bridge: add NFT_META_BRI_IIFPVID supportwenxu
This patch allows you to match on the bridge port pvid, eg. nft add rule bridge firewall zones counter meta ibrpvid 10 Signed-off-by: wenxu <wenxu@ucloud.cn> Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-07-05bridge: add br_vlan_get_pvid_rcu()Pablo Neira Ayuso
This new function allows you to fetch bridge pvid from packet path. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
2019-07-05netfilter: nft_meta_bridge: Remove the br_private.h headerwenxu
nft_bridge_meta should not access the bridge internal API. Signed-off-by: wenxu <wenxu@ucloud.cn> Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-07-05netfilter: nft_meta: move bridge meta keys into nft_meta_bridgewenxu
Separate bridge meta key from nft_meta to meta_bridge to avoid a dependency between the bridge module and nft_meta when using the bridge API available through include/linux/if_bridge.h Signed-off-by: wenxu <wenxu@ucloud.cn> Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-07-05ipvs: strip gre tunnel headers from icmp errorsJulian Anastasov
Recognize GRE tunnels in received ICMP errors and properly strip the tunnel headers. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-07-05netfilter: nf_tables: Add synproxy supportFernando Fernandez Mancera
Add synproxy support for nf_tables. This behaves like the iptables synproxy target but it is structured in a way that allows us to propose improvements in the future. Signed-off-by: Fernando Fernandez Mancera <ffmancera@riseup.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-07-04ipvs: allow tunneling with gre encapsulationVadim Fedorenko
windows real servers can handle gre tunnels, this patch allows gre encapsulation with the tunneling method, thereby letting ipvs be load balancer for windows-based services Signed-off-by: Vadim Fedorenko <vfedorenko@yandex-team.ru> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-07-04netfilter: nf_queue: remove unused hook entries pointerFlorian Westphal
Its not used anywhere, so remove this. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-07-04netfilter: nf_log: Replace a seq_printf() call by seq_puts() in seq_show()Markus Elfring
A string which did not contain a data format specification should be put into a sequence. Thus use the corresponding function “seq_puts”. This issue was detected by using the Coccinelle software. Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-07-04netfilter: rename nf_SYNPROXY.h to nf_synproxy.hPablo Neira Ayuso
Uppercase is a reminiscence from the iptables infrastructure, rename this header before this is included in stable kernels. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-06-25tipc: simplify stale link failure criteriaJon Maloy
In commit a4dc70d46cf1 ("tipc: extend link reset criteria for stale packet retransmission") we made link retransmission failure events dependent on the link tolerance, and not only of the number of failed retransmission attempts, as we did earlier. This works well. However, keeping the original, additional criteria of 99 failed retransmissions is now redundant, and may in some cases lead to failure detection times in the order of minutes instead of the expected 1.5 sec link tolerance value. We now remove this criteria altogether. Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-25Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-nextPablo Neira Ayuso
Resolve conflict between d2912cb15bdd ("treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 500") removing the GPL disclaimer and fe03d4745675 ("Update my email address") which updates Jozsef Kadlecsik's email. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-06-24ip6_fib: Don't discard nodes with valid routing information in fib6_locate_1()Stefano Brivio
When we perform an inexact match on FIB nodes via fib6_locate_1(), longer prefixes will be preferred to shorter ones. However, it might happen that a node, with higher fn_bit value than some other, has no valid routing information. In this case, we'll pick that node, but it will be discarded by the check on RTN_RTINFO in fib6_locate(), and we might miss nodes with valid routing information but with lower fn_bit value. This is apparent when a routing exception is created for a default route: # ip -6 route list fc00:1::/64 dev veth_A-R1 proto kernel metric 256 pref medium fc00:2::/64 dev veth_A-R2 proto kernel metric 256 pref medium fc00:4::1 via fc00:2::2 dev veth_A-R2 metric 1024 pref medium fe80::/64 dev veth_A-R1 proto kernel metric 256 pref medium fe80::/64 dev veth_A-R2 proto kernel metric 256 pref medium default via fc00:1::2 dev veth_A-R1 metric 1024 pref medium # ip -6 route list cache fc00:4::1 via fc00:2::2 dev veth_A-R2 metric 1024 expires 593sec mtu 1500 pref medium fc00:3::1 via fc00:1::2 dev veth_A-R1 metric 1024 expires 593sec mtu 1500 pref medium # ip -6 route flush cache # node for default route is discarded Failed to send flush request: No such process # ip -6 route list cache fc00:3::1 via fc00:1::2 dev veth_A-R1 metric 1024 expires 586sec mtu 1500 pref medium Check right away if the node has a RTN_RTINFO flag, before replacing the 'prev' pointer, that indicates the longest matching prefix found so far. Fixes: 38fbeeeeccdb ("ipv6: prepare fib6_locate() for exception table") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-24ipv6: Dump route exceptions if requestedStefano Brivio
Since commit 2b760fcf5cfb ("ipv6: hook up exception table to store dst cache"), route exceptions reside in a separate hash table, and won't be found by walking the FIB, so they won't be dumped to userspace on a RTM_GETROUTE message. This causes 'ip -6 route list cache' and 'ip -6 route flush cache' to have no function anymore: # ip -6 route get fc00:3::1 fc00:3::1 via fc00:1::2 dev veth_A-R1 src fc00:1::1 metric 1024 expires 539sec mtu 1400 pref medium # ip -6 route get fc00:4::1 fc00:4::1 via fc00:2::2 dev veth_A-R2 src fc00:2::1 metric 1024 expires 536sec mtu 1500 pref medium # ip -6 route list cache # ip -6 route flush cache # ip -6 route get fc00:3::1 fc00:3::1 via fc00:1::2 dev veth_A-R1 src fc00:1::1 metric 1024 expires 520sec mtu 1400 pref medium # ip -6 route get fc00:4::1 fc00:4::1 via fc00:2::2 dev veth_A-R2 src fc00:2::1 metric 1024 expires 519sec mtu 1500 pref medium because iproute2 lists cached routes using RTM_GETROUTE, and flushes them by listing all the routes, and deleting them with RTM_DELROUTE one by one. If cached routes are requested using the RTM_F_CLONED flag together with strict checking, or if no strict checking is requested (and hence we can't consistently apply filters), look up exceptions in the hash table associated with the current fib6_info in rt6_dump_route(), and, if present and not expired, add them to the dump. We might be unable to dump all the entries for a given node in a single message, so keep track of how many entries were handled for the current node in fib6_walker, and skip that amount in case we start from the same partially dumped node. When a partial dump restarts, as the starting node might change when 'sernum' changes, we have no guarantee that we need to skip the same amount of in-node entries. Therefore, we need two counters, and we need to zero the in-node counter if the node from which the dump is resumed differs. Note that, with the current version of iproute2, this only fixes the 'ip -6 route list cache': on a flush command, iproute2 doesn't pass RTM_F_CLONED and, due to this inconsistency, 'ip -6 route flush cache' is still unable to fetch the routes to be flushed. This will be addressed in a patch for iproute2. To flush cached routes, a procfs entry could be introduced instead: that's how it works for IPv4. We already have a rt6_flush_exception() function ready to be wired to it. However, this would not solve the issue for listing. Versions of iproute2 and kernel tested: iproute2 kernel 4.14.0 4.15.0 4.19.0 5.0.0 5.1.0 5.1.0, patched 3.18 list + + + + + + flush + + + + + + 4.4 list + + + + + + flush + + + + + + 4.9 list + + + + + + flush + + + + + + 4.14 list + + + + + + flush + + + + + + 4.15 list flush 4.19 list flush 5.0 list flush 5.1 list flush with list + + + + + + fix flush + + + + v7: - Explain usage of "skip" counters in commit message (suggested by David Ahern) v6: - Rebase onto net-next, use recently introduced nexthop walker - Make rt6_nh_dump_exceptions() a separate function (suggested by David Ahern) v5: - Use dump_routes and dump_exceptions from filter, ignore NLM_F_MATCH, update test results (flushing works with iproute2 < 5.0.0 now) v4: - Split NLM_F_MATCH and strict check handling in separate patches - Filter routes using RTM_F_CLONED: if it's not set, only return non-cached routes, and if it's set, only return cached routes: change requested by David Ahern and Martin Lau. This implies that iproute2 needs a separate patch to be able to flush IPv6 cached routes. This is not ideal because we can't fix the breakage caused by 2b760fcf5cfb entirely in kernel. However, two years have passed since then, and this makes it more tolerable v3: - More descriptive comment about expired exceptions in rt6_dump_route() - Swap return values of rt6_dump_route() (suggested by Martin Lau) - Don't zero skip_in_node in case we don't dump anything in a given pass (also suggested by Martin Lau) - Remove check on RTM_F_CLONED altogether: in the current UAPI semantic, it's just a flag to indicate the route was cloned, not to filter on routes v2: Add tracking of number of entries to be skipped in current node after a partial dump. As we restart from the same node, if not all the exceptions for a given node fit in a single message, the dump will not terminate, as suggested by Martin Lau. This is a concrete possibility, setting up a big number of exceptions for the same route actually causes the issue, suggested by David Ahern. Reported-by: Jianlin Shi <jishi@redhat.com> Fixes: 2b760fcf5cfb ("ipv6: hook up exception table to store dst cache") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-24ipv6/route: Change return code of rt6_dump_route() for partial node dumpsStefano Brivio
In the next patch, we are going to add optional dump of exceptions to rt6_dump_route(). Change the return code of rt6_dump_route() to accomodate partial node dumps: we might dump multiple routes per node, and might be able to dump only a given number of them, so fib6_dump_node() will need to know how many routes have been dumped on partial dump, to restart the dump from the point where it was interrupted. Note that fib6_dump_node() is the only caller and already handles all non-negative return codes as success: those become -1 to signal that we're done with the node. If we fail, return 0, as we were unable to dump the single route in the node, but we're not done with it. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-24ipv6/route: Don't match on fc_nh_id if not set in ip6_route_del()Stefano Brivio
If fc_nh_id isn't set, we shouldn't try to match against it. This actually matters just for the RTF_CACHE below (where this case is already handled): if iproute2 gets a route exception and tries to delete it, it won't reference it by fc_nh_id, even if a nexthop object might be associated to the originating route. Fixes: 5b98324ebe29 ("ipv6: Allow routes to use nexthop objects") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-24Revert "net/ipv6: Bail early if user only wants cloned entries"Stefano Brivio
This reverts commit 08e814c9e8eb5a982cbd1e8f6bd255d97c51026f: as we are preparing to fix listing and dumping of IPv6 cached routes, we need to allow RTM_F_CLONED as a flag to match routes against while dumping them. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-24ipv4: Dump route exceptions if requestedStefano Brivio
Since commit 4895c771c7f0 ("ipv4: Add FIB nexthop exceptions."), cached exception routes are stored as a separate entity, so they are not dumped on a FIB dump, even if the RTM_F_CLONED flag is passed. This implies that the command 'ip route list cache' doesn't return any result anymore. If the RTM_F_CLONED is passed, and strict checking requested, retrieve nexthop exception routes and dump them. If no strict checking is requested, filtering can't be performed consistently: dump everything in that case. With this, we need to add an argument to the netlink callback in order to track how many entries were already dumped for the last leaf included in a partial netlink dump. A single additional argument is sufficient, even if we traverse logically nested structures (nexthop objects, hash table buckets, bucket chains): it doesn't matter if we stop in the middle of any of those, because they are always traversed the same way. As an example, s_i values in [], s_fa values in (): node (fa) #1 [1] nexthop #1 bucket #1 -> #0 in chain (1) bucket #2 -> #0 in chain (2) -> #1 in chain (3) -> #2 in chain (4) bucket #3 -> #0 in chain (5) -> #1 in chain (6) nexthop #2 bucket #1 -> #0 in chain (7) -> #1 in chain (8) bucket #2 -> #0 in chain (9) -- node (fa) #2 [2] nexthop #1 bucket #1 -> #0 in chain (1) -> #1 in chain (2) bucket #2 -> #0 in chain (3) it doesn't matter if we stop at (3), (4), (7) for "node #1", or at (2) for "node #2": walking flattens all that. It would even be possible to drop the distinction between the in-tree (s_i) and in-node (s_fa) counter, but a further improvement might advise against this. This is only as accurate as the existing tracking mechanism for leaves: if a partial dump is restarted after exceptions are removed or expired, we might skip some non-dumped entries. To improve this, we could attach a 'sernum' attribute (similar to the one used for IPv6) to nexthop entities, and bump this counter whenever exceptions change: having a distinction between the two counters would make this more convenient. Listing of exception routes (modified routes pre-3.5) was tested against these versions of kernel and iproute2: iproute2 kernel 4.14.0 4.15.0 4.19.0 5.0.0 5.1.0 3.5-rc4 + + + + + 4.4 4.9 4.14 4.15 4.19 5.0 5.1 fixed + + + + + v7: - Move loop over nexthop objects to route.c, and pass struct fib_info and table ID to it, not a struct fib_alias (suggested by David Ahern) - While at it, note that the NULL check on fa->fa_info is redundant, and the check on RTNH_F_DEAD is also not consistent with what's done with regular route listing: just keep it for nhc_flags - Rename entry point function for dumping exceptions to fib_dump_info_fnhe(), and rearrange arguments for consistency with fib_dump_info() - Rename fnhe_dump_buckets() to fnhe_dump_bucket() and make it handle one bucket at a time - Expand commit message to describe why we can have a single "skip" counter for all exceptions stored in bucket chains in nexthop objects (suggested by David Ahern) v6: - Rebased onto net-next - Loop over nexthop paths too. Move loop over fnhe buckets to route.c, avoids need to export rt_fill_info() and to touch exceptions from fib_trie.c. Pass NULL as flow to rt_fill_info(), it now allows that (suggested by David Ahern) Fixes: 4895c771c7f0 ("ipv4: Add FIB nexthop exceptions.") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-24ipv4/route: Allow NULL flowinfo in rt_fill_info()Stefano Brivio
In the next patch, we're going to use rt_fill_info() to dump exception routes upon RTM_GETROUTE with NLM_F_ROOT, meaning userspace is requesting a dump and not a specific route selection, which in turn implies the input interface is not relevant. Update rt_fill_info() to handle a NULL flowinfo. v7: If fl4 is NULL, explicitly set r->rtm_tos to 0: it's not initialised otherwise (spotted by David Ahern) v6: New patch Suggested-by: David Ahern <dsahern@gmail.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-24ipv4/fib_frontend: Allow RTM_F_CLONED flag to be used for filteringStefano Brivio
This functionally reverts the check introduced by commit e8ba330ac0c5 ("rtnetlink: Update fib dumps for strict data checking") as modified by commit e4e92fb160d7 ("net/ipv4: Bail early if user only wants prefix entries"). As we are preparing to fix listing of IPv4 cached routes, we need to give userspace a way to request them. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-24fib_frontend, ip6_fib: Select routes or exceptions dump from RTM_F_CLONEDStefano Brivio
The following patches add back the ability to dump IPv4 and IPv6 exception routes, and we need to allow selection of regular routes or exceptions. Use RTM_F_CLONED as filter to decide whether to dump routes or exceptions: iproute2 passes it in dump requests (except for IPv6 cache flush requests, this will be fixed in iproute2) and this used to work as long as exceptions were stored directly in the FIB, for both IPv4 and IPv6. Caveat: if strict checking is not requested (that is, if the dump request doesn't go through ip_valid_fib_dump_req()), we can't filter on protocol, tables or route types. In this case, filtering on RTM_F_CLONED would be inconsistent: we would fix 'ip route list cache' by returning exception routes and at the same time introduce another bug in case another selector is present, e.g. on 'ip route list cache table main' we would return all exception routes, without filtering on tables. Keep this consistent by applying no filters at all, and dumping both routes and exceptions, if strict checking is not requested. iproute2 currently filters results anyway, and no unwanted results will be presented to the user. The kernel will just dump more data than needed. v7: No changes v6: Rebase onto net-next, no changes v5: New patch: add dump_routes and dump_exceptions flags in filter and simply clear the unwanted one if strict checking is enabled, don't ignore NLM_F_MATCH and don't set filter_set if NLM_F_MATCH is set. Skip filtering altogether if no strict checking is requested: selecting routes or exceptions only would be inconsistent with the fact we can't filter on tables. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-24ipv4: fix confirm_addr_indev() when enable route_localnetShijie Luo
When arp_ignore=3, the NIC won't reply for scope host addresses, but if enable route_locanet, we need to reply ip address with head 127 and scope RT_SCOPE_HOST. Fixes: d0daebc3d622 ("ipv4: Add interface option to enable routing of 127.0.0.0/8") Signed-off-by: Shijie Luo <luoshijie1@huawei.com> Signed-off-by: Zhiqiang Liu <liuzhiqiang26@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-24ipv4: fix inet_select_addr() when enable route_localnetShijie Luo
Suppose we have two interfaces eth0 and eth1 in two hosts, follow the same steps in the two hosts: # sysctl -w net.ipv4.conf.eth1.route_localnet=1 # sysctl -w net.ipv4.conf.eth1.arp_announce=2 # ip route del 127.0.0.0/8 dev lo table local and then set ip to eth1 in host1 like: # ifconfig eth1 127.25.3.4/24 set ip to eth2 in host2 and ping host1: # ifconfig eth1 127.25.3.14/24 # ping -I eth1 127.25.3.4 Well, host2 cannot connect to host1. When set a ip address with head 127, the scope of the address defaults to RT_SCOPE_HOST. In this situation, host2 will use arp_solicit() to send a arp request for the mac address of host1 with ip address 127.25.3.14. When arp_announce=2, inet_select_addr() cannot select a correct saddr with condition ifa->ifa_scope > scope, because ifa_scope is RT_SCOPE_HOST and scope is RT_SCOPE_LINK. Then, inet_select_addr() will go to no_in_dev to lookup all interfaces to find a primary ip and finally get the primary ip of eth0. Here I add a localnet_scope defaults to RT_SCOPE_HOST, and when route_localnet is enabled, this value changes to RT_SCOPE_LINK to make inet_select_addr() find a correct primary ip as saddr of arp request. Fixes: d0daebc3d622 ("ipv4: Add interface option to enable routing of 127.0.0.0/8") Signed-off-by: Shijie Luo <luoshijie1@huawei.com> Signed-off-by: Zhiqiang Liu <liuzhiqiang26@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-24tipc: remove the unnecessary msg->req check from tipc_nl_compat_bearer_setXin Long
tipc_nl_compat_bearer_set() is only called by tipc_nl_compat_link_set() which already does the check for msg->req check, so remove it from tipc_nl_compat_bearer_set(), and do the same in tipc_nl_compat_media_set(). Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-24tipc: fix missing indentation in source codejohn.rutherford@dektech.com.au
Fix misalignment of policy statement in netlink.c due to automatic spatch code transformation. Fixes: 3b0f31f2b8c9 ("genetlink: make policy common to family") Acked-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: John Rutherford <john.rutherford@dektech.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-23ipv6: convert major tx path to use RT6_LOOKUP_F_DST_NOREFWei Wang
For tx path, in most cases, we still have to take refcnt on the dst cause the caller is caching the dst somewhere. But it still is beneficial to make use of RT6_LOOKUP_F_DST_NOREF flag while doing the route lookup. It is cause this flag prevents manipulating refcnt on net->ipv6.ip6_null_entry when doing fib6_rule_lookup() to traverse each routing table. The null_entry is a shared object and constant updates on it cause false sharing. We converted the current major lookup function ip6_route_output_flags() to make use of RT6_LOOKUP_F_DST_NOREF. Together with the change in the rx path, we see noticable performance boost: I ran synflood tests between 2 hosts under the same switch. Both hosts have 20G mlx NIC, and 8 tx/rx queues. Sender sends pure SYN flood with random src IPs and ports using trafgen. Receiver has a simple TCP listener on the target port. Both hosts have multiple custom rules: - For incoming packets, only local table is traversed. - For outgoing packets, 3 tables are traversed to find the route. The packet processing rate on the receiver is as follows: - Before the fix: 3.78Mpps - After the fix: 5.50Mpps Signed-off-by: Wei Wang <weiwan@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-23ipv6: convert rx data path to not take refcnt on dstWei Wang
ip6_route_input() is the key function to do the route lookup in the rx data path. All the callers to this function are already holding rcu lock. So it is fairly easy to convert it to not take refcnt on the dst: We pass in flag RT6_LOOKUP_F_DST_NOREF and do skb_dst_set_noref(). This saves a few atomic inc or dec operations and should boost performance overall. This also makes the logic more aligned with v4. Signed-off-by: Wei Wang <weiwan@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Acked-by: Mahesh Bandewar <maheshb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-23ipv6: honor RT6_LOOKUP_F_DST_NOREF in rule lookup logicWei Wang
This patch specifically converts the rule lookup logic to honor this flag and not release refcnt when traversing each rule and calling lookup() on each routing table. Similar to previous patch, we also need some special handling of dst entries in uncached list because there is always 1 refcnt taken for them even if RT6_LOOKUP_F_DST_NOREF flag is set. Signed-off-by: Wei Wang <weiwan@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-23ipv6: initialize rt6->rt6i_uncached in all pre-allocated dst entriesWei Wang
Initialize rt6->rt6i_uncached on the following pre-allocated dsts: net->ipv6.ip6_null_entry net->ipv6.ip6_prohibit_entry net->ipv6.ip6_blk_hole_entry This is a preparation patch for later commits to be able to distinguish dst entries in uncached list by doing: !list_empty(rt6->rt6i_uncached) Signed-off-by: Wei Wang <weiwan@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Acked-by: Mahesh Bandewar <maheshb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-23ipv6: introduce RT6_LOOKUP_F_DST_NOREF flag in ip6_pol_route()Wei Wang
This new flag is to instruct the route lookup function to not take refcnt on the dst entry. The user which does route lookup with this flag must properly use rcu protection. ip6_pol_route() is the major route lookup function for both tx and rx path. In this function: Do not take refcnt on dst if RT6_LOOKUP_F_DST_NOREF flag is set, and directly return the route entry. The caller should be holding rcu lock when using this flag, and decide whether to take refcnt or not. One note on the dst cache in the uncached_list: As uncached_list does not consume refcnt, one refcnt is always returned back to the caller even if RT6_LOOKUP_F_DST_NOREF flag is set. Uncached dst is only possible in the output path. So in such call path, caller MUST check if the dst is in the uncached_list before assuming that there is no refcnt taken on the returned dst. Signed-off-by: Wei Wang <weiwan@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Acked-by: Mahesh Bandewar <maheshb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-22netns: restore ops before calling ops_exit_listLi RongQing
ops has been iterated to first element when call pre_exit, and it needs to restore from save_ops, not save ops to save_ops Fixes: d7d99872c144 ("netns: add pre_exit method to struct pernet_operations") Signed-off-by: Li RongQing <lirongqing@baidu.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-22ipv6: Error when route does not have any valid nexthopsIdo Schimmel
When user space sends invalid information in RTA_MULTIPATH, the nexthop list in ip6_route_multipath_add() is empty and 'rt_notif' is set to NULL. The code that emits the in-kernel notifications does not check for this condition, which results in a NULL pointer dereference [1]. Fix this by bailing earlier in the function if the parsed nexthop list is empty. This is consistent with the corresponding IPv4 code. v2: * Check if parsed nexthop list is empty and bail with extack set [1] kasan: CONFIG_KASAN_INLINE enabled kasan: GPF could be caused by NULL-ptr deref or user memory access general protection fault: 0000 [#1] PREEMPT SMP KASAN CPU: 0 PID: 9190 Comm: syz-executor149 Not tainted 5.2.0-rc5+ #38 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:call_fib6_multipath_entry_notifiers+0xd1/0x1a0 net/ipv6/ip6_fib.c:396 Code: 8b b5 30 ff ff ff 48 c7 85 68 ff ff ff 00 00 00 00 48 c7 85 70 ff ff ff 00 00 00 00 89 45 88 4c 89 e0 48 c1 e8 03 4c 89 65 80 <42> 80 3c 28 00 0f 85 9a 00 00 00 48 b8 00 00 00 00 00 fc ff df 4d RSP: 0018:ffff88809788f2c0 EFLAGS: 00010246 RAX: 0000000000000000 RBX: 1ffff11012f11e59 RCX: 00000000ffffffff RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 RBP: ffff88809788f390 R08: ffff88809788f8c0 R09: 000000000000000c R10: ffff88809788f5d8 R11: ffff88809788f527 R12: 0000000000000000 R13: dffffc0000000000 R14: ffff88809788f8c0 R15: ffffffff89541d80 FS: 000055555632c880(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000020000080 CR3: 000000009ba7c000 CR4: 00000000001406f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: ip6_route_multipath_add+0xc55/0x1490 net/ipv6/route.c:5094 inet6_rtm_newroute+0xed/0x180 net/ipv6/route.c:5208 rtnetlink_rcv_msg+0x463/0xb00 net/core/rtnetlink.c:5219 netlink_rcv_skb+0x177/0x450 net/netlink/af_netlink.c:2477 rtnetlink_rcv+0x1d/0x30 net/core/rtnetlink.c:5237 netlink_unicast_kernel net/netlink/af_netlink.c:1302 [inline] netlink_unicast+0x531/0x710 net/netlink/af_netlink.c:1328 netlink_sendmsg+0x8ae/0xd70 net/netlink/af_netlink.c:1917 sock_sendmsg_nosec net/socket.c:646 [inline] sock_sendmsg+0xd7/0x130 net/socket.c:665 ___sys_sendmsg+0x803/0x920 net/socket.c:2286 __sys_sendmsg+0x105/0x1d0 net/socket.c:2324 __do_sys_sendmsg net/socket.c:2333 [inline] __se_sys_sendmsg net/socket.c:2331 [inline] __x64_sys_sendmsg+0x78/0xb0 net/socket.c:2331 do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x4401f9 Code: 18 89 d0 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 fb 13 fc ff c3 66 2e 0f 1f 84 00 00 00 00 RSP: 002b:00007ffc09fd0028 EFLAGS: 00000246 ORIG_RAX: 000000000000002e RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 00000000004401f9 RDX: 0000000000000000 RSI: 0000000020000080 RDI: 0000000000000003 RBP: 00000000006ca018 R08: 0000000000000000 R09: 00000000004002c8 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000401a80 R13: 0000000000401b10 R14: 0000000000000000 R15: 0000000000000000 Reported-by: syzbot+382566d339d52cd1a204@syzkaller.appspotmail.com Fixes: ebee3cad835f ("ipv6: Add IPv6 multipath notifications for add / replace") Signed-off-by: Ido Schimmel <idosch@mellanox.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-22net: fastopen: robustness and endianness fixes for SipHashArd Biesheuvel
Some changes to the TCP fastopen code to make it more robust against future changes in the choice of key/cookie size, etc. - Instead of keeping the SipHash key in an untyped u8[] buffer and casting it to the right type upon use, use the correct type directly. This ensures that the key will appear at the correct alignment if we ever change the way these data structures are allocated. (Currently, they are only allocated via kmalloc so they always appear at the correct alignment) - Use DIV_ROUND_UP when sizing the u64[] array to hold the cookie, so it is always of sufficient size, even if TCP_FASTOPEN_COOKIE_MAX is no longer a multiple of 8. - Drop the 'len' parameter from the tcp_fastopen_reset_cipher() function, which is no longer used. - Add endian swabbing when setting the keys and calculating the hash, to ensure that cookie values are the same for a given key and source/destination address pair regardless of the endianness of the server. Note that none of these are functional changes wrt the current state of the code, with the exception of the swabbing, which only affects big endian systems. Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-22Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Minor SPDX change conflict. Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-21Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds
Pull networking fixes from David Miller: 1) Fix leak of unqueued fragments in ipv6 nf_defrag, from Guillaume Nault. 2) Don't access the DDM interface unless the transceiver implements it in bnx2x, from Mauro S. M. Rodrigues. 3) Don't double fetch 'len' from userspace in sock_getsockopt(), from JingYi Hou. 4) Sign extension overflow in lio_core, from Colin Ian King. 5) Various netem bug fixes wrt. corrupted packets from Jakub Kicinski. 6) Fix epollout hang in hvsock, from Sunil Muthuswamy. 7) Fix regression in default fib6_type, from David Ahern. 8) Handle memory limits in tcp_fragment more appropriately, from Eric Dumazet. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (24 commits) tcp: refine memory limit test in tcp_fragment() inet: clear num_timeout reqsk_alloc() net: mvpp2: debugfs: Add pmap to fs dump ipv6: Default fib6_type to RTN_UNICAST when not set net: hns3: Fix inconsistent indenting net/af_iucv: always register net_device notifier net/af_iucv: build proper skbs for HiperTransport net/af_iucv: remove GFP_DMA restriction for HiperTransport net: dsa: mv88e6xxx: fix shift of FID bits in mv88e6185_g1_vtu_loadpurge() hvsock: fix epollout hang from race condition net/udp_gso: Allow TX timestamp with UDP GSO net: netem: fix use after free and double free with packet corruption net: netem: fix backlog accounting for corrupted GSO frames net: lio_core: fix potential sign-extension overflow on large shift tipc: pass tunnel dev as NULL to udp_tunnel(6)_xmit_skb ip6_tunnel: allow not to count pkts on tstats by passing dev as NULL ip_tunnel: allow not to count pkts on tstats by setting skb's dev to NULL tun: wake up waitqueues after IFF_UP is set net: remove duplicate fetch in sock_getsockopt tipc: fix issues with early FAILOVER_MSG from peer ...
2019-06-21tcp: refine memory limit test in tcp_fragment()Eric Dumazet
tcp_fragment() might be called for skbs in the write queue. Memory limits might have been exceeded because tcp_sendmsg() only checks limits at full skb (64KB) boundaries. Therefore, we need to make sure tcp_fragment() wont punish applications that might have setup very low SO_SNDBUF values. Fixes: f070ef2ac667 ("tcp: tcp_fragment() should apply sane memory limits") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Christoph Paasch <cpaasch@apple.com> Tested-by: Christoph Paasch <cpaasch@apple.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-21Merge tag 'nfs-for-5.2-3' of git://git.linux-nfs.org/projects/anna/linux-nfsLinus Torvalds
Pull more NFS client fixes from Anna Schumaker: "These are mostly refcounting issues that people have found recently. The revert fixes a suspend recovery performance issue. - SUNRPC: Fix a credential refcount leak - Revert "SUNRPC: Declare RPC timers as TIMER_DEFERRABLE" - SUNRPC: Fix xps refcount imbalance on the error path - NFS4: Only set creation opendata if O_CREAT" * tag 'nfs-for-5.2-3' of git://git.linux-nfs.org/projects/anna/linux-nfs: SUNRPC: Fix a credential refcount leak Revert "SUNRPC: Declare RPC timers as TIMER_DEFERRABLE" net :sunrpc :clnt :Fix xps refcount imbalance on the error path NFS4: Only set creation opendata if O_CREAT
2019-06-21SUNRPC: Fix a credential refcount leakTrond Myklebust
All callers of __rpc_clone_client() pass in a value for args->cred, meaning that the credential gets assigned and referenced in the call to rpc_new_client(). Reported-by: Ido Schimmel <idosch@idosch.org> Fixes: 79caa5fad47c ("SUNRPC: Cache cred of process creating the rpc_client") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Tested-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-06-21Revert "SUNRPC: Declare RPC timers as TIMER_DEFERRABLE"Anna Schumaker
Jon Hunter reports: "I have been noticing intermittent failures with a system suspend test on some of our machines that have a NFS mounted root file-system. Bisecting this issue points to your commit 431235818bc3 ("SUNRPC: Declare RPC timers as TIMER_DEFERRABLE") and reverting this on top of v5.2-rc3 does appear to resolve the problem. The cause of the suspend failure appears to be a long delay observed sometimes when resuming from suspend, and this is causing our test to timeout." This reverts commit 431235818bc3a919ca7487500c67c3144feece80. Reported-by: Jon Hunter <jonathanh@nvidia.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-06-21net :sunrpc :clnt :Fix xps refcount imbalance on the error pathLin Yi
rpc_clnt_add_xprt take a reference to struct rpc_xprt_switch, but forget to release it before return, may lead to a memory leak. Signed-off-by: Lin Yi <teroincn@163.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-06-21Merge tag 'spdx-5.2-rc6' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/spdx Pull still more SPDX updates from Greg KH: "Another round of SPDX updates for 5.2-rc6 Here is what I am guessing is going to be the last "big" SPDX update for 5.2. It contains all of the remaining GPLv2 and GPLv2+ updates that were "easy" to determine by pattern matching. The ones after this are going to be a bit more difficult and the people on the spdx list will be discussing them on a case-by-case basis now. Another 5000+ files are fixed up, so our overall totals are: Files checked: 64545 Files with SPDX: 45529 Compared to the 5.1 kernel which was: Files checked: 63848 Files with SPDX: 22576 This is a huge improvement. Also, we deleted another 20000 lines of boilerplate license crud, always nice to see in a diffstat" * tag 'spdx-5.2-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/spdx: (65 commits) treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 507 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 506 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 505 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 504 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 503 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 502 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 501 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 500 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 499 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 498 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 497 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 496 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 495 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 491 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 490 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 489 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 488 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 487 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 486 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 485 ...
2019-06-21netfilter: nf_tables: add support for matching IPv4 optionsStephen Suryaputra
This is the kernel change for the overall changes with this description: Add capability to have rules matching IPv4 options. This is developed mainly to support dropping of IP packets with loose and/or strict source route route options. Signed-off-by: Stephen Suryaputra <ssuryaextr@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-06-21netfilter: synproxy: fix manual bump of the reference counterFernando Fernandez Mancera
This operation is handled by nf_synproxy_ipv4_init() now. Fixes: d7f9b2f18eae ("netfilter: synproxy: extract SYNPROXY infrastructure from {ipt, ip6t}_SYNPROXY") Signed-off-by: Fernando Fernandez Mancera <ffmancera@riseup.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-06-21netfilter: bridge: Fix non-untagged fragment packetwenxu
ip netns exec ns1 ip a a dev eth0 10.0.0.7/24 ip netns exec ns2 ip link a link eth0 name vlan type vlan id 200 ip netns exec ns2 ip a a dev vlan 10.0.0.8/24 ip l add dev br0 type bridge vlan_filtering 1 brctl addif br0 veth1 brctl addif br0 veth2 bridge vlan add dev veth1 vid 200 pvid untagged bridge vlan add dev veth2 vid 200 A two fragment packet sent from ns2 contains the vlan tag 200. In the bridge conntrack, this packet will defrag to one skb with fraglist. When the packet is forwarded to ns1 through veth1, the first skb vlan tag will be cleared by the "untagged" flags. But the vlan tag in the second skb is still tagged, so the second fragment ends up with tag 200 to ns1. So if the first fragment packet doesn't contain the vlan tag, all of the remain should not contain vlan tag. Fixes: 3c171f496ef5 ("netfilter: bridge: add connection tracking system") Signed-off-by: wenxu <wenxu@ucloud.cn> Acked-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-06-20netfilter: bridge: prevent UAF in brnf_exit_net()Christian Brauner
Prevent a UAF in brnf_exit_net(). When unregister_net_sysctl_table() is called the ctl_hdr pointer will obviously be freed and so accessing it righter after is invalid. Fix this by stashing a pointer to the table we want to free before we unregister the sysctl header. Note that syzkaller falsely chased this down to the drm tree so the Fixes tag that syzkaller requested would be wrong. This commit uses a different but the correct Fixes tag. /* Splat */ BUG: KASAN: use-after-free in br_netfilter_sysctl_exit_net net/bridge/br_netfilter_hooks.c:1121 [inline] BUG: KASAN: use-after-free in brnf_exit_net+0x38c/0x3a0 net/bridge/br_netfilter_hooks.c:1141 Read of size 8 at addr ffff8880a4078d60 by task kworker/u4:4/8749 CPU: 0 PID: 8749 Comm: kworker/u4:4 Not tainted 5.2.0-rc5-next-20190618 #17 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Workqueue: netns cleanup_net Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x172/0x1f0 lib/dump_stack.c:113 print_address_description.cold+0xd4/0x306 mm/kasan/report.c:351 __kasan_report.cold+0x1b/0x36 mm/kasan/report.c:482 kasan_report+0x12/0x20 mm/kasan/common.c:614 __asan_report_load8_noabort+0x14/0x20 mm/kasan/generic_report.c:132 br_netfilter_sysctl_exit_net net/bridge/br_netfilter_hooks.c:1121 [inline] brnf_exit_net+0x38c/0x3a0 net/bridge/br_netfilter_hooks.c:1141 ops_exit_list.isra.0+0xaa/0x150 net/core/net_namespace.c:154 cleanup_net+0x3fb/0x960 net/core/net_namespace.c:553 process_one_work+0x989/0x1790 kernel/workqueue.c:2269 worker_thread+0x98/0xe40 kernel/workqueue.c:2415 kthread+0x354/0x420 kernel/kthread.c:255 ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:352 Allocated by task 11374: save_stack+0x23/0x90 mm/kasan/common.c:71 set_track mm/kasan/common.c:79 [inline] __kasan_kmalloc mm/kasan/common.c:489 [inline] __kasan_kmalloc.constprop.0+0xcf/0xe0 mm/kasan/common.c:462 kasan_kmalloc+0x9/0x10 mm/kasan/common.c:503 __do_kmalloc mm/slab.c:3645 [inline] __kmalloc+0x15c/0x740 mm/slab.c:3654 kmalloc include/linux/slab.h:552 [inline] kzalloc include/linux/slab.h:743 [inline] __register_sysctl_table+0xc7/0xef0 fs/proc/proc_sysctl.c:1327 register_net_sysctl+0x29/0x30 net/sysctl_net.c:121 br_netfilter_sysctl_init_net net/bridge/br_netfilter_hooks.c:1105 [inline] brnf_init_net+0x379/0x6a0 net/bridge/br_netfilter_hooks.c:1126 ops_init+0xb3/0x410 net/core/net_namespace.c:130 setup_net+0x2d3/0x740 net/core/net_namespace.c:316 copy_net_ns+0x1df/0x340 net/core/net_namespace.c:439 create_new_namespaces+0x400/0x7b0 kernel/nsproxy.c:103 unshare_nsproxy_namespaces+0xc2/0x200 kernel/nsproxy.c:202 ksys_unshare+0x444/0x980 kernel/fork.c:2822 __do_sys_unshare kernel/fork.c:2890 [inline] __se_sys_unshare kernel/fork.c:2888 [inline] __x64_sys_unshare+0x31/0x40 kernel/fork.c:2888 do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301 entry_SYSCALL_64_after_hwframe+0x49/0xbe Freed by task 9: save_stack+0x23/0x90 mm/kasan/common.c:71 set_track mm/kasan/common.c:79 [inline] __kasan_slab_free+0x102/0x150 mm/kasan/common.c:451 kasan_slab_free+0xe/0x10 mm/kasan/common.c:459 __cache_free mm/slab.c:3417 [inline] kfree+0x10a/0x2c0 mm/slab.c:3746 __rcu_reclaim kernel/rcu/rcu.h:215 [inline] rcu_do_batch kernel/rcu/tree.c:2092 [inline] invoke_rcu_callbacks kernel/rcu/tree.c:2310 [inline] rcu_core+0xcc7/0x1500 kernel/rcu/tree.c:2291 __do_softirq+0x25c/0x94c kernel/softirq.c:292 The buggy address belongs to the object at ffff8880a4078d40 which belongs to the cache kmalloc-512 of size 512 The buggy address is located 32 bytes inside of 512-byte region [ffff8880a4078d40, ffff8880a4078f40) The buggy address belongs to the page: page:ffffea0002901e00 refcount:1 mapcount:0 mapping:ffff8880aa400a80 index:0xffff8880a40785c0 flags: 0x1fffc0000000200(slab) raw: 01fffc0000000200 ffffea0001d636c8 ffffea0001b07308 ffff8880aa400a80 raw: ffff8880a40785c0 ffff8880a40780c0 0000000100000004 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff8880a4078c00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff8880a4078c80: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc > ffff8880a4078d00: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb ^ ffff8880a4078d80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff8880a4078e00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb Reported-by: syzbot+43a3fa52c0d9c5c94f41@syzkaller.appspotmail.com Fixes: 22567590b2e6 ("netfilter: bridge: namespace bridge netfilter sysctls") Signed-off-by: Christian Brauner <christian@brauner.io> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-06-20netfilter: synproxy: use nf_cookie_v6_check() from corePablo Neira Ayuso
This helper function is never used and it is intended to avoid a direct dependency with the ipv6 module. Fixes: d7f9b2f18eae ("netfilter: synproxy: extract SYNPROXY infrastructure from {ipt, ip6t}_SYNPROXY") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-06-20netfilter: synproxy: fix building syncookie callsArnd Bergmann
When either CONFIG_IPV6 or CONFIG_SYN_COOKIES are disabled, the kernel fails to build: include/linux/netfilter_ipv6.h:180:9: error: implicit declaration of function '__cookie_v6_init_sequence' [-Werror,-Wimplicit-function-declaration] return __cookie_v6_init_sequence(iph, th, mssp); include/linux/netfilter_ipv6.h:194:9: error: implicit declaration of function '__cookie_v6_check' [-Werror,-Wimplicit-function-declaration] return __cookie_v6_check(iph, th, cookie); net/ipv6/netfilter.c:237:26: error: use of undeclared identifier '__cookie_v6_init_sequence'; did you mean 'cookie_init_sequence'? net/ipv6/netfilter.c:238:21: error: use of undeclared identifier '__cookie_v6_check'; did you mean '__cookie_v4_check'? Fix the IS_ENABLED() checks to match the function declaration and definitions for these. Fixes: 3006a5224f15 ("netfilter: synproxy: remove module dependency on IPv6 SYNPROXY") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-06-20Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller
Alexei Starovoitov says: ==================== pull-request: bpf-next 2019-06-19 The following pull-request contains BPF updates for your *net-next* tree. The main changes are: 1) new SO_REUSEPORT_DETACH_BPF setsocktopt, from Martin. 2) BTF based map definition, from Andrii. 3) support bpf_map_lookup_elem for xskmap, from Jonathan. 4) bounded loops and scalar precision logic in the verifier, from Alexei. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-19inet: clear num_timeout reqsk_alloc()Eric Dumazet
KMSAN caught uninit-value in tcp_create_openreq_child() [1] This is caused by a recent change, combined by the fact that TCP cleared num_timeout, num_retrans and sk fields only when a request socket was about to be queued. Under syncookie mode, a temporary request socket is used, and req->num_timeout could contain garbage. Lets clear these three fields sooner, there is really no point trying to defer this and risk other bugs. [1] BUG: KMSAN: uninit-value in tcp_create_openreq_child+0x157f/0x1cc0 net/ipv4/tcp_minisocks.c:526 CPU: 1 PID: 13357 Comm: syz-executor591 Not tainted 5.2.0-rc4+ #3 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: <IRQ> __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x191/0x1f0 lib/dump_stack.c:113 kmsan_report+0x162/0x2d0 mm/kmsan/kmsan.c:611 __msan_warning+0x75/0xe0 mm/kmsan/kmsan_instr.c:304 tcp_create_openreq_child+0x157f/0x1cc0 net/ipv4/tcp_minisocks.c:526 tcp_v6_syn_recv_sock+0x761/0x2d80 net/ipv6/tcp_ipv6.c:1152 tcp_get_cookie_sock+0x16e/0x6b0 net/ipv4/syncookies.c:209 cookie_v6_check+0x27e0/0x29a0 net/ipv6/syncookies.c:252 tcp_v6_cookie_check net/ipv6/tcp_ipv6.c:1039 [inline] tcp_v6_do_rcv+0xf1c/0x1ce0 net/ipv6/tcp_ipv6.c:1344 tcp_v6_rcv+0x60b7/0x6a30 net/ipv6/tcp_ipv6.c:1554 ip6_protocol_deliver_rcu+0x1433/0x22f0 net/ipv6/ip6_input.c:397 ip6_input_finish net/ipv6/ip6_input.c:438 [inline] NF_HOOK include/linux/netfilter.h:305 [inline] ip6_input+0x2af/0x340 net/ipv6/ip6_input.c:447 dst_input include/net/dst.h:439 [inline] ip6_rcv_finish net/ipv6/ip6_input.c:76 [inline] NF_HOOK include/linux/netfilter.h:305 [inline] ipv6_rcv+0x683/0x710 net/ipv6/ip6_input.c:272 __netif_receive_skb_one_core net/core/dev.c:4981 [inline] __netif_receive_skb net/core/dev.c:5095 [inline] process_backlog+0x721/0x1410 net/core/dev.c:5906 napi_poll net/core/dev.c:6329 [inline] net_rx_action+0x738/0x1940 net/core/dev.c:6395 __do_softirq+0x4ad/0x858 kernel/softirq.c:293 do_softirq_own_stack+0x49/0x80 arch/x86/entry/entry_64.S:1052 </IRQ> do_softirq kernel/softirq.c:338 [inline] __local_bh_enable_ip+0x199/0x1e0 kernel/softirq.c:190 local_bh_enable+0x36/0x40 include/linux/bottom_half.h:32 rcu_read_unlock_bh include/linux/rcupdate.h:682 [inline] ip6_finish_output2+0x213f/0x2670 net/ipv6/ip6_output.c:117 ip6_finish_output+0xae4/0xbc0 net/ipv6/ip6_output.c:150 NF_HOOK_COND include/linux/netfilter.h:294 [inline] ip6_output+0x5d3/0x720 net/ipv6/ip6_output.c:167 dst_output include/net/dst.h:433 [inline] NF_HOOK include/linux/netfilter.h:305 [inline] ip6_xmit+0x1f53/0x2650 net/ipv6/ip6_output.c:271 inet6_csk_xmit+0x3df/0x4f0 net/ipv6/inet6_connection_sock.c:135 __tcp_transmit_skb+0x4076/0x5b40 net/ipv4/tcp_output.c:1156 tcp_transmit_skb net/ipv4/tcp_output.c:1172 [inline] tcp_write_xmit+0x39a9/0xa730 net/ipv4/tcp_output.c:2397 __tcp_push_pending_frames+0x124/0x4e0 net/ipv4/tcp_output.c:2573 tcp_send_fin+0xd43/0x1540 net/ipv4/tcp_output.c:3118 tcp_close+0x16ba/0x1860 net/ipv4/tcp.c:2403 inet_release+0x1f7/0x270 net/ipv4/af_inet.c:427 inet6_release+0xaf/0x100 net/ipv6/af_inet6.c:470 __sock_release net/socket.c:601 [inline] sock_close+0x156/0x490 net/socket.c:1273 __fput+0x4c9/0xba0 fs/file_table.c:280 ____fput+0x37/0x40 fs/file_table.c:313 task_work_run+0x22e/0x2a0 kernel/task_work.c:113 tracehook_notify_resume include/linux/tracehook.h:185 [inline] exit_to_usermode_loop arch/x86/entry/common.c:168 [inline] prepare_exit_to_usermode+0x39d/0x4d0 arch/x86/entry/common.c:199 syscall_return_slowpath+0x90/0x5c0 arch/x86/entry/common.c:279 do_syscall_64+0xe2/0xf0 arch/x86/entry/common.c:305 entry_SYSCALL_64_after_hwframe+0x63/0xe7 RIP: 0033:0x401d50 Code: 01 f0 ff ff 0f 83 40 0d 00 00 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 83 3d dd 8d 2d 00 00 75 14 b8 03 00 00 00 0f 05 <48> 3d 01 f0 ff ff 0f 83 14 0d 00 00 c3 48 83 ec 08 e8 7a 02 00 00 RSP: 002b:00007fff1cf58cf8 EFLAGS: 00000246 ORIG_RAX: 0000000000000003 RAX: 0000000000000000 RBX: 0000000000000004 RCX: 0000000000401d50 RDX: 000000000000001c RSI: 0000000000000000 RDI: 0000000000000003 RBP: 00000000004a9050 R08: 0000000020000040 R09: 000000000000001c R10: 0000000020004004 R11: 0000000000000246 R12: 0000000000402ef0 R13: 0000000000402f80 R14: 0000000000000000 R15: 0000000000000000 Uninit was created at: kmsan_save_stack_with_flags mm/kmsan/kmsan.c:201 [inline] kmsan_internal_poison_shadow+0x53/0xa0 mm/kmsan/kmsan.c:160 kmsan_kmalloc+0xa4/0x130 mm/kmsan/kmsan_hooks.c:177 kmem_cache_alloc+0x534/0xb00 mm/slub.c:2781 reqsk_alloc include/net/request_sock.h:84 [inline] inet_reqsk_alloc+0xa8/0x600 net/ipv4/tcp_input.c:6384 cookie_v6_check+0xadb/0x29a0 net/ipv6/syncookies.c:173 tcp_v6_cookie_check net/ipv6/tcp_ipv6.c:1039 [inline] tcp_v6_do_rcv+0xf1c/0x1ce0 net/ipv6/tcp_ipv6.c:1344 tcp_v6_rcv+0x60b7/0x6a30 net/ipv6/tcp_ipv6.c:1554 ip6_protocol_deliver_rcu+0x1433/0x22f0 net/ipv6/ip6_input.c:397 ip6_input_finish net/ipv6/ip6_input.c:438 [inline] NF_HOOK include/linux/netfilter.h:305 [inline] ip6_input+0x2af/0x340 net/ipv6/ip6_input.c:447 dst_input include/net/dst.h:439 [inline] ip6_rcv_finish net/ipv6/ip6_input.c:76 [inline] NF_HOOK include/linux/netfilter.h:305 [inline] ipv6_rcv+0x683/0x710 net/ipv6/ip6_input.c:272 __netif_receive_skb_one_core net/core/dev.c:4981 [inline] __netif_receive_skb net/core/dev.c:5095 [inline] process_backlog+0x721/0x1410 net/core/dev.c:5906 napi_poll net/core/dev.c:6329 [inline] net_rx_action+0x738/0x1940 net/core/dev.c:6395 __do_softirq+0x4ad/0x858 kernel/softirq.c:293 do_softirq_own_stack+0x49/0x80 arch/x86/entry/entry_64.S:1052 do_softirq kernel/softirq.c:338 [inline] __local_bh_enable_ip+0x199/0x1e0 kernel/softirq.c:190 local_bh_enable+0x36/0x40 include/linux/bottom_half.h:32 rcu_read_unlock_bh include/linux/rcupdate.h:682 [inline] ip6_finish_output2+0x213f/0x2670 net/ipv6/ip6_output.c:117 ip6_finish_output+0xae4/0xbc0 net/ipv6/ip6_output.c:150 NF_HOOK_COND include/linux/netfilter.h:294 [inline] ip6_output+0x5d3/0x720 net/ipv6/ip6_output.c:167 dst_output include/net/dst.h:433 [inline] NF_HOOK include/linux/netfilter.h:305 [inline] ip6_xmit+0x1f53/0x2650 net/ipv6/ip6_output.c:271 inet6_csk_xmit+0x3df/0x4f0 net/ipv6/inet6_connection_sock.c:135 __tcp_transmit_skb+0x4076/0x5b40 net/ipv4/tcp_output.c:1156 tcp_transmit_skb net/ipv4/tcp_output.c:1172 [inline] tcp_write_xmit+0x39a9/0xa730 net/ipv4/tcp_output.c:2397 __tcp_push_pending_frames+0x124/0x4e0 net/ipv4/tcp_output.c:2573 tcp_send_fin+0xd43/0x1540 net/ipv4/tcp_output.c:3118 tcp_close+0x16ba/0x1860 net/ipv4/tcp.c:2403 inet_release+0x1f7/0x270 net/ipv4/af_inet.c:427 inet6_release+0xaf/0x100 net/ipv6/af_inet6.c:470 __sock_release net/socket.c:601 [inline] sock_close+0x156/0x490 net/socket.c:1273 __fput+0x4c9/0xba0 fs/file_table.c:280 ____fput+0x37/0x40 fs/file_table.c:313 task_work_run+0x22e/0x2a0 kernel/task_work.c:113 tracehook_notify_resume include/linux/tracehook.h:185 [inline] exit_to_usermode_loop arch/x86/entry/common.c:168 [inline] prepare_exit_to_usermode+0x39d/0x4d0 arch/x86/entry/common.c:199 syscall_return_slowpath+0x90/0x5c0 arch/x86/entry/common.c:279 do_syscall_64+0xe2/0xf0 arch/x86/entry/common.c:305 entry_SYSCALL_64_after_hwframe+0x63/0xe7 Fixes: 336c39a03151 ("tcp: undo init congestion window on false SYNACK timeout") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Soheil Hassas Yeganeh <soheil@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>