summaryrefslogtreecommitdiff
path: root/include/net
AgeCommit message (Collapse)Author
2018-08-03netfilter: use kvmalloc_array to allocate memory for hashtableLi RongQing
nf_ct_alloc_hashtable is used to allocate memory for conntrack, NAT bysrc and expectation hashtable. Assuming 64k bucket size, which means 7th order page allocation, __get_free_pages, called by nf_ct_alloc_hashtable, will trigger the direct memory reclaim and stall for a long time, when system has lots of memory stress so replace combination of __get_free_pages and vzalloc with kvmalloc_array, which provides a overflow check and a fallback if no high order memory is available, and do not retry to reclaim memory, reduce stall and remove nf_ct_free_hashtable, since it is just a kvfree Signed-off-by: Zhang Yu <zhangyu31@baidu.com> Signed-off-by: Wang Li <wangli39@baidu.com> Signed-off-by: Li RongQing <lirongqing@baidu.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-07-30netfilter: nf_tables: handle meta/lookup with direct callFlorian Westphal
Currently nft uses inlined variants for common operations such as 'ip saddr 1.2.3.4' instead of an indirect call. Also handle meta get operations and lookups without indirect call, both are builtin. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-07-24Merge ra.kernel.org:/pub/scm/linux/kernel/git/davem/netDavid S. Miller
2018-07-24Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains Netfilter fixes for net: 1) Make sure we don't go over the maximum jump stack boundary, from Taehee Yoo. 2) Missing rcu_barrier() in hash and rbtree sets, also from Taehee. 3) Missing check to nul-node in rbtree timeout routine, from Taehee. 4) Use dev->name from flowtable to fix a memleak, from Florian. 5) Oneliner to free flowtable object on removal, from Florian. 6) Memleak in chain rename transaction, again from Florian. 7) Don't allow two chains to use the same name in the same transaction, from Florian. 8) handle DCCP SYNC/SYNCACK as invalid, this triggers an uninitialized timer in conntrack reported by syzbot, from Florian. 9) Fix leak in case netlink_dump_start() fails, from Florian. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-24Merge tag 'mac80211-for-davem-2018-07-24' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211 Johannes Berg says: ==================== Only a few fixes: * always keep regulatory user hint * add missing break statement in station flags parsing * fix non-linear SKBs in port-control-over-nl80211 * reconfigure VLAN stations during HW restart ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-23net: sched: cls_flower: propagate chain teplate creation and destruction to ↵Jiri Pirko
drivers Introduce a couple of flower offload commands in order to propagate template creation/destruction events down to device drivers. Drivers may use this information to prepare HW in an optimal way for future filter insertions. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-23net: sched: introduce chain templatesJiri Pirko
Allow user to set a template for newly created chains. Template lock down the chain for particular classifier type/options combinations. The classifier needs to support templates, otherwise kernel would reply with error. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-23net: sched: introduce chain object to uapiJiri Pirko
Allow user to create, destroy, get and dump chain objects. Do that by extending rtnl commands by the chain-specific ones. User will now be able to explicitly create or destroy chains (so far this was done only automatically according the filter/act needs and refcounting). Also, the user will receive notification about any chain creation or destuction. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-23net: sched: Avoid implicit chain 0 creationJiri Pirko
Currently, chain 0 is implicitly created during block creation. However that does not align with chain object exposure, creation and destruction api introduced later on. So make the chain 0 behave the same way as any other chain and only create it when it is needed. Since chain 0 is somehow special as the qdiscs need to hold pointer to the first chain tp, this requires to move the chain head change callback infra to the block structure. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-23ipv6: use fib6_info_hold_safe() when necessaryWei Wang
In the code path where only rcu read lock is held, e.g. in the route lookup code path, it is not safe to directly call fib6_info_hold() because the fib6_info may already have been deleted but still exists in the rcu grace period. Holding reference to it could cause double free and crash the kernel. This patch adds a new function fib6_info_hold_safe() and replace fib6_info_hold() in all necessary places. Syzbot reported 3 crash traces because of this. One of them is: 8021q: adding VLAN 0 to HW filter on device team0 IPv6: ADDRCONF(NETDEV_CHANGE): team0: link becomes ready dst_release: dst:(____ptrval____) refcnt:-1 dst_release: dst:(____ptrval____) refcnt:-2 WARNING: CPU: 1 PID: 4845 at include/net/dst.h:239 dst_hold include/net/dst.h:239 [inline] WARNING: CPU: 1 PID: 4845 at include/net/dst.h:239 ip6_setup_cork+0xd66/0x1830 net/ipv6/ip6_output.c:1204 dst_release: dst:(____ptrval____) refcnt:-1 Kernel panic - not syncing: panic_on_warn set ... CPU: 1 PID: 4845 Comm: syz-executor493 Not tainted 4.18.0-rc3+ #10 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x1c9/0x2b4 lib/dump_stack.c:113 panic+0x238/0x4e7 kernel/panic.c:184 dst_release: dst:(____ptrval____) refcnt:-2 dst_release: dst:(____ptrval____) refcnt:-3 __warn.cold.8+0x163/0x1ba kernel/panic.c:536 dst_release: dst:(____ptrval____) refcnt:-4 report_bug+0x252/0x2d0 lib/bug.c:186 fixup_bug arch/x86/kernel/traps.c:178 [inline] do_error_trap+0x1fc/0x4d0 arch/x86/kernel/traps.c:296 dst_release: dst:(____ptrval____) refcnt:-5 do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:316 invalid_op+0x14/0x20 arch/x86/entry/entry_64.S:992 RIP: 0010:dst_hold include/net/dst.h:239 [inline] RIP: 0010:ip6_setup_cork+0xd66/0x1830 net/ipv6/ip6_output.c:1204 Code: c1 ed 03 89 9d 18 ff ff ff 48 b8 00 00 00 00 00 fc ff df 41 c6 44 05 00 f8 e9 2d 01 00 00 4c 8b a5 c8 fe ff ff e8 1a f6 e6 fa <0f> 0b e9 6a fc ff ff e8 0e f6 e6 fa 48 8b 85 d0 fe ff ff 48 8d 78 RSP: 0018:ffff8801a8fcf178 EFLAGS: 00010293 RAX: ffff8801a8eba5c0 RBX: 0000000000000000 RCX: ffffffff869511e6 RDX: 0000000000000000 RSI: ffffffff869515b6 RDI: 0000000000000005 RBP: ffff8801a8fcf2c8 R08: ffff8801a8eba5c0 R09: ffffed0035ac8338 R10: ffffed0035ac8338 R11: ffff8801ad6419c3 R12: ffff8801a8fcf720 R13: ffff8801a8fcf6a0 R14: ffff8801ad6419c0 R15: ffff8801ad641980 ip6_make_skb+0x2c8/0x600 net/ipv6/ip6_output.c:1768 udpv6_sendmsg+0x2c90/0x35f0 net/ipv6/udp.c:1376 inet_sendmsg+0x1a1/0x690 net/ipv4/af_inet.c:798 sock_sendmsg_nosec net/socket.c:641 [inline] sock_sendmsg+0xd5/0x120 net/socket.c:651 ___sys_sendmsg+0x51d/0x930 net/socket.c:2125 __sys_sendmmsg+0x240/0x6f0 net/socket.c:2220 __do_sys_sendmmsg net/socket.c:2249 [inline] __se_sys_sendmmsg net/socket.c:2246 [inline] __x64_sys_sendmmsg+0x9d/0x100 net/socket.c:2246 do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x446ba9 Code: e8 cc bb 02 00 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 eb 08 fc ff c3 66 2e 0f 1f 84 00 00 00 00 RSP: 002b:00007fb39a469da8 EFLAGS: 00000246 ORIG_RAX: 0000000000000133 RAX: ffffffffffffffda RBX: 00000000006dcc54 RCX: 0000000000446ba9 RDX: 00000000000000b8 RSI: 0000000020001b00 RDI: 0000000000000003 RBP: 00000000006dcc50 R08: 00007fb39a46a700 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 45c828efc7a64843 R13: e6eeb815b9d8a477 R14: 5068caf6f713c6fc R15: 0000000000000001 Dumping ftrace buffer: (ftrace buffer empty) Kernel Offset: disabled Rebooting in 86400 seconds.. Fixes: 93531c674315 ("net/ipv6: separate handling of FIB entries from dst based routes") Reported-by: syzbot+902e2a1bcd4f7808cef5@syzkaller.appspotmail.com Reported-by: syzbot+8ae62d67f647abeeceb9@syzkaller.appspotmail.com Reported-by: syzbot+3f08feb14086930677d0@syzkaller.appspotmail.com Signed-off-by: Wei Wang <weiwan@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-22nfp: bring back support for offloading shared blocksJakub Kicinski
Now that we have offload replay infrastructure added by commit 326367427cc0 ("net: sched: call reoffload op on block callback reg") and flows are guaranteed to be removed correctly, we can revert commit 951a8ee6def3 ("nfp: reject binding to shared blocks"). Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: John Hurley <john.hurley@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-21net/ipv6: Fix linklocal to global address with VRFDavid Ahern
Example setup: host: ip -6 addr add dev eth1 2001:db8:104::4 where eth1 is enslaved to a VRF switch: ip -6 ro add 2001:db8:104::4/128 dev br1 where br1 only has an LLA ping6 2001:db8:104::4 ssh 2001:db8:104::4 (NOTE: UDP works fine if the PKTINFO has the address set to the global address and ifindex is set to the index of eth1 with a destination an LLA). For ICMP, icmp6_iif needs to be updated to check if skb->dev is an L3 master. If it is then return the ifindex from rt6i_idev similar to what is done for loopback. For TCP, restore the original tcp_v6_iif definition which is needed in most places and add a new tcp_v6_iif_l3_slave that considers the l3_slave variability. This latter check is only needed for socket lookups. Fixes: 9ff74384600a ("net: vrf: Handle ipv6 multicast and link-local addresses") Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-20net: create reusable function for getting ownership info of sysfs inodesTyler Hicks
Make net_ns_get_ownership() reusable by networking code outside of core. This is useful, for example, to allow bridge related sysfs files to be owned by container root. Add a function comment since this is a potentially dangerous function to use given the way that kobject_get_ownership() works by initializing uid and gid before calling .get_ownership(). Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-20Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter/IPVS updates for net-next The following patchset contains Netfilter/IPVS updates for your net-next tree: 1) No need to set ttl from reject action for the bridge family, from Taehee Yoo. 2) Use a fixed timeout for flow that are passed up from the flowtable to conntrack, from Florian Westphal. 3) More preparation patches for tproxy support for nf_tables, from Mate Eckl. 4) Remove unnecessary indirection in core IPv6 checksum function, from Florian Westphal. 5) Use nf_ct_get_tuplepr() from openvswitch, instead of opencoding it. From Florian Westphal. 6) socket match now selects socket infrastructure, instead of depending on it. From Mate Eckl. 7) Patch series to simplify conntrack tuple building/parsing from packet path and ctnetlink, from Florian Westphal. 8) Fetch timeout policy from protocol helpers, instead of doing it from core, from Florian Westphal. 9) Merge IPv4 and IPv6 protocol trackers into conntrack core, from Florian Westphal. 10) Depend on CONFIG_NF_TABLES_IPV6 and CONFIG_IP6_NF_IPTABLES respectively, instead of IPV6. Patch from Mate Eckl. 11) Add specific function for garbage collection in conncount, from Yi-Hung Wei. 12) Catch number of elements in the connlimit list, from Yi-Hung Wei. 13) Move locking to nf_conncount, from Yi-Hung Wei. 14) Series of patches to add lockless tree traversal in nf_conncount, from Yi-Hung Wei. 15) Resolve clash in matching conntracks when race happens, from Martynas Pumputis. 16) If connection entry times out, remove template entry from the ip_vs_conn_tab table to improve behaviour under flood, from Julian Anastasov. 17) Remove useless parameter from nf_ct_helper_ext_add(), from Gao feng. 18) Call abort from 2-phase commit protocol before requesting modules, make sure this is done under the mutex, from Florian Westphal. 19) Grab module reference when starting transaction, also from Florian. 20) Dynamically allocate expression info array for pre-parsing, from Florian. 21) Add per netns mutex for nf_tables, from Florian Westphal. 22) A couple of patches to simplify and refactor nf_osf code to prepare for nft_osf support. 23) Break evaluation on missing socket, from Mate Eckl. 24) Allow to match socket mark from nft_socket, from Mate Eckl. 25) Remove dependency on nf_defrag_ipv6, now that IPv6 tracker is built-in into nf_conntrack. From Florian Westphal. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-20Merge ra.kernel.org:/pub/scm/linux/kernel/git/torvalds/linuxDavid S. Miller
All conflicts were trivial overlapping changes, so reasonably easy to resolve. Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-20tcp: do not delay ACK in DCTCP upon CE status changeYuchung Cheng
Per DCTCP RFC8257 (Section 3.2) the ACK reflecting the CE status change has to be sent immediately so the sender can respond quickly: """ When receiving packets, the CE codepoint MUST be processed as follows: 1. If the CE codepoint is set and DCTCP.CE is false, set DCTCP.CE to true and send an immediate ACK. 2. If the CE codepoint is not set and DCTCP.CE is true, set DCTCP.CE to false and send an immediate ACK. """ Previously DCTCP implementation may continue to delay the ACK. This patch fixes that to implement the RFC by forcing an immediate ACK. Tested with this packetdrill script provided by Larry Brakmo 0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3 0.000 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 0.000 setsockopt(3, SOL_TCP, TCP_CONGESTION, "dctcp", 5) = 0 0.000 bind(3, ..., ...) = 0 0.000 listen(3, 1) = 0 0.100 < [ect0] SEW 0:0(0) win 32792 <mss 1000,sackOK,nop,nop,nop,wscale 7> 0.100 > SE. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 8> 0.110 < [ect0] . 1:1(0) ack 1 win 257 0.200 accept(3, ..., ...) = 4 +0 setsockopt(4, SOL_SOCKET, SO_DEBUG, [1], 4) = 0 0.200 < [ect0] . 1:1001(1000) ack 1 win 257 0.200 > [ect01] . 1:1(0) ack 1001 0.200 write(4, ..., 1) = 1 0.200 > [ect01] P. 1:2(1) ack 1001 0.200 < [ect0] . 1001:2001(1000) ack 2 win 257 +0.005 < [ce] . 2001:3001(1000) ack 2 win 257 +0.000 > [ect01] . 2:2(0) ack 2001 // Previously the ACK below would be delayed by 40ms +0.000 > [ect01] E. 2:2(0) ack 3001 +0.500 < F. 9501:9501(0) ack 4 win 257 Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-20tcp: do not cancel delay-AcK on DCTCP special ACKYuchung Cheng
Currently when a DCTCP receiver delays an ACK and receive a data packet with a different CE mark from the previous one's, it sends two immediate ACKs acking previous and latest sequences respectly (for ECN accounting). Previously sending the first ACK may mark off the delayed ACK timer (tcp_event_ack_sent). This may subsequently prevent sending the second ACK to acknowledge the latest sequence (tcp_ack_snd_check). The culprit is that tcp_send_ack() assumes it always acknowleges the latest sequence, which is not true for the first special ACK. The fix is to not make the assumption in tcp_send_ack and check the actual ack sequence before cancelling the delayed ACK. Further it's safer to pass the ack sequence number as a local variable into tcp_send_ack routine, instead of intercepting tp->rcv_nxt to avoid future bugs like this. Reported-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-20netfilter: nf_tables: use dev->name directlyFlorian Westphal
no need to store the name in separate area. Furthermore, it uses kmalloc but not kfree and most accesses seem to treat it as char[IFNAMSIZ] not char *. Remove this and use dev->name instead. In case event zeroed dev, just omit the name in the dump. Fixes: d92191aa84e5f1 ("netfilter: nf_tables: cache device name in flowtable object") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-07-19flow_dissector: Dissect tos and ttl from the tunnel infoOr Gerlitz
Add dissection of the tos and ttl from the ip tunnel headers fields in case a match is needed on them. Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Reviewed-by: Roi Dayan <roid@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-18ipv6: fix useless rol32 call on hashColin Ian King
The rol32 call is currently rotating hash but the rol'd value is being discarded. I believe the current code is incorrect and hash should be assigned the rotated value returned from rol32. Thanks to David Lebrun for spotting this. Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-18net: dsa: Remove VLA usageSalvatore Mesoraca
We avoid 2 VLAs by using a pre-allocated field in dsa_switch. We also try to avoid dynamic allocation whenever possible (when using fewer than bits-per-long ports, which is the common case). Link: http://lkml.kernel.org/r/CA+55aFzCG-zNmZwX4A2FQpadafLfEzK6CC=qPXydAacU1RqZWA@mail.gmail.com Link: http://lkml.kernel.org/r/20180505185145.GB32630@lunn.ch Signed-off-by: Salvatore Mesoraca <s.mesoraca16@gmail.com> [kees: tweak commit subject and message slightly] Signed-off-by: Kees Cook <keescook@chromium.org> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-18ipv6: remove dependency of nf_defrag_ipv6 on ipv6 moduleFlorian Westphal
IPV6=m DEFRAG_IPV6=m CONNTRACK=y yields: net/netfilter/nf_conntrack_proto.o: In function `nf_ct_netns_do_get': net/netfilter/nf_conntrack_proto.c:802: undefined reference to `nf_defrag_ipv6_enable' net/netfilter/nf_conntrack_proto.o:(.rodata+0x640): undefined reference to `nf_conntrack_l4proto_icmpv6' Setting DEFRAG_IPV6=y causes undefined references to ip6_rhash_params ip6_frag_init and ip6_expire_frag_queue so it would be needed to force IPV6=y too. This patch gets rid of the 'followup linker error' by removing the dependency of ipv6.ko symbols from netfilter ipv6 defrag. Shared code is placed into a header, then used from both. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-07-18netfilter: nf_tables: use dedicated mutex to guard transactionsFlorian Westphal
Continue to use nftnl subsys mutex to protect (un)registration of hook types, expressions and so on, but force batch operations to do their own locking. This allows distinct net namespaces to perform transactions in parallel. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-07-18netfilter: Remove useless param helper of nf_ct_helper_ext_addGao Feng
The param helper of nf_ct_helper_ext_add is useless now, then remove it now. Signed-off-by: Gao Feng <gfree.wind@vip.163.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-07-18ipvs: add assured state for conn templatesJulian Anastasov
cp->state was not used for templates. Add support for state bits and for the first "assured" bit which indicates that some connection controlled by this template was established or assured by the real server. In a followup patch we will use it to drop templates under SYN attack. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-07-18ipvs: provide just conn to ip_vs_state_nameJulian Anastasov
In preparation for followup patches, provide just the cp ptr to ip_vs_state_name. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-07-18netfilter: nf_conncount: Add list lock and gc worker, and RCU for init tree ↵Yi-Hung Wei
search This patch is originally from Florian Westphal. This patch does the following 3 main tasks. 1) Add list lock to 'struct nf_conncount_list' so that we can alter the lists containing the individual connections without holding the main tree lock. It would be useful when we only need to add/remove to/from a list without allocate/remove a node in the tree. With this change, we update nft_connlimit accordingly since we longer need to maintain a list lock in nft_connlimit now. 2) Use RCU for the initial tree search to improve tree look up performance. 3) Add a garbage collection worker. This worker is schedule when there are excessive tree node that needed to be recycled. Moreover,the rbnode reclaim logic is moved from search tree to insert tree to avoid race condition. Signed-off-by: Yi-Hung Wei <yihung.wei@gmail.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-07-18netfilter: nf_conncount: Early exit in nf_conncount_lookup() and cleanupYi-Hung Wei
This patch is originally from Florian Westphal. This patch does the following three tasks. It applies the same early exit technique for nf_conncount_lookup(). Since now we keep the number of connections in 'struct nf_conncount_list', we no longer need to return the count in nf_conncount_lookup(). Moreover, we expose the garbage collection function nf_conncount_gc_list() for nft_connlimit. Signed-off-by: Yi-Hung Wei <yihung.wei@gmail.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-07-18netfilter: nf_conncount: Switch to plain listYi-Hung Wei
Original patch is from Florian Westphal. This patch switches from hlist to plain list to store the list of connections with the same filtering key in nf_conncount. With the plain list, we can insert new connections at the tail, so over time the beginning of list holds long-running connections and those are expired, while the newly creates ones are at the end. Later on, we could probably move checked ones to the end of the list, so the next run has higher chance to reclaim stale entries in the front. Signed-off-by: Yi-Hung Wei <yihung.wei@gmail.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-07-17netfilter: nf_tables: fix jumpstack depth validationTaehee Yoo
The level of struct nft_ctx is updated by nf_tables_check_loops(). That is used to validate jumpstack depth. But jumpstack validation routine doesn't update and validate recursively. So, in some cases, chain depth can be bigger than the NFT_JUMP_STACK_SIZE. After this patch, The jumpstack validation routine is located in the nft_chain_validate(). When new rules or new set elements are added, the nft_table_validate() is called by the nf_tables_newrule and the nf_tables_newsetelem. The nft_table_validate() calls the nft_chain_validate() that visit all their children chains recursively. So it can update depth of chain certainly. Reproducer: %cat ./test.sh #!/bin/bash nft add table ip filter nft add chain ip filter input { type filter hook input priority 0\; } for ((i=0;i<20;i++)); do nft add chain ip filter a$i done nft add rule ip filter input jump a1 for ((i=0;i<10;i++)); do nft add rule ip filter a$i jump a$((i+1)) done for ((i=11;i<19;i++)); do nft add rule ip filter a$i jump a$((i+1)) done nft add rule ip filter a10 jump a11 Result: [ 253.931782] WARNING: CPU: 1 PID: 0 at net/netfilter/nf_tables_core.c:186 nft_do_chain+0xacc/0xdf0 [nf_tables] [ 253.931915] Modules linked in: nf_tables nfnetlink ip_tables x_tables [ 253.932153] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.18.0-rc3+ #48 [ 253.932153] RIP: 0010:nft_do_chain+0xacc/0xdf0 [nf_tables] [ 253.932153] Code: 83 f8 fb 0f 84 c7 00 00 00 e9 d0 00 00 00 83 f8 fd 74 0e 83 f8 ff 0f 84 b4 00 00 00 e9 bd 00 00 00 83 bd 64 fd ff ff 0f 76 09 <0f> 0b 31 c0 e9 bc 02 00 00 44 8b ad 64 fd [ 253.933807] RSP: 0018:ffff88011b807570 EFLAGS: 00010212 [ 253.933807] RAX: 00000000fffffffd RBX: ffff88011b807660 RCX: 0000000000000000 [ 253.933807] RDX: 0000000000000010 RSI: ffff880112b39d78 RDI: ffff88011b807670 [ 253.933807] RBP: ffff88011b807850 R08: ffffed0023700ece R09: ffffed0023700ecd [ 253.933807] R10: ffff88011b80766f R11: ffffed0023700ece R12: ffff88011b807898 [ 253.933807] R13: ffff880112b39d80 R14: ffff880112b39d60 R15: dffffc0000000000 [ 253.933807] FS: 0000000000000000(0000) GS:ffff88011b800000(0000) knlGS:0000000000000000 [ 253.933807] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 253.933807] CR2: 00000000014f1008 CR3: 000000006b216000 CR4: 00000000001006e0 [ 253.933807] Call Trace: [ 253.933807] <IRQ> [ 253.933807] ? sched_clock_cpu+0x132/0x170 [ 253.933807] ? __nft_trace_packet+0x180/0x180 [nf_tables] [ 253.933807] ? sched_clock_cpu+0x132/0x170 [ 253.933807] ? debug_show_all_locks+0x290/0x290 [ 253.933807] ? __lock_acquire+0x4835/0x4af0 [ 253.933807] ? inet_ehash_locks_alloc+0x1a0/0x1a0 [ 253.933807] ? unwind_next_frame+0x159e/0x1840 [ 253.933807] ? __read_once_size_nocheck.constprop.4+0x5/0x10 [ 253.933807] ? nft_do_chain_ipv4+0x197/0x1e0 [nf_tables] [ 253.933807] ? nft_do_chain+0x5/0xdf0 [nf_tables] [ 253.933807] nft_do_chain_ipv4+0x197/0x1e0 [nf_tables] [ 253.933807] ? nft_do_chain_arp+0xb0/0xb0 [nf_tables] [ 253.933807] ? __lock_is_held+0x9d/0x130 [ 253.933807] nf_hook_slow+0xc4/0x150 [ 253.933807] ip_local_deliver+0x28b/0x380 [ 253.933807] ? ip_call_ra_chain+0x3e0/0x3e0 [ 253.933807] ? ip_rcv_finish+0x1610/0x1610 [ 253.933807] ip_rcv+0xbcc/0xcc0 [ 253.933807] ? debug_show_all_locks+0x290/0x290 [ 253.933807] ? ip_local_deliver+0x380/0x380 [ 253.933807] ? __lock_is_held+0x9d/0x130 [ 253.933807] ? ip_local_deliver+0x380/0x380 [ 253.933807] __netif_receive_skb_core+0x1c9c/0x2240 Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-07-17netfilter: conntrack: remove l3proto abstractionFlorian Westphal
This unifies ipv4 and ipv6 protocol trackers and removes the l3proto abstraction. This gets rid of all l3proto indirect calls and the need to do a lookup on the function to call for l3 demux. It increases module size by only a small amount (12kbyte), so this reduces size because nf_conntrack.ko is useless without either nf_conntrack_ipv4 or nf_conntrack_ipv6 module. before: text data bss dec hex filename 7357 1088 0 8445 20fd nf_conntrack_ipv4.ko 7405 1084 4 8493 212d nf_conntrack_ipv6.ko 72614 13689 236 86539 1520b nf_conntrack.ko 19K nf_conntrack_ipv4.ko 19K nf_conntrack_ipv6.ko 179K nf_conntrack.ko after: text data bss dec hex filename 79277 13937 236 93450 16d0a nf_conntrack.ko 191K nf_conntrack.ko Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-07-16ipv6/mcast: init as INCLUDE when join SSM INCLUDE groupHangbin Liu
This an IPv6 version patch of "ipv4/igmp: init group mode as INCLUDE when join source group". From RFC3810, part 6.1: If no per-interface state existed for that multicast address before the change (i.e., the change consisted of creating a new per-interface record), or if no state exists after the change (i.e., the change consisted of deleting a per-interface record), then the "non-existent" state is considered to have an INCLUDE filter mode and an empty source list. Which means a new multicast group should start with state IN(). Currently, for MLDv2 SSM JOIN_SOURCE_GROUP mode, we first call ipv6_sock_mc_join(), then ip6_mc_source(), which will trigger a TO_IN() message instead of ALLOW(). The issue was exposed by commit a052517a8ff65 ("net/multicast: should not send source list records when have filter mode change"). Before this change, we sent both ALLOW(A) and TO_IN(A). Now, we only send TO_IN(A). Fix it by adding a new parameter to init group mode. Also add some wrapper functions to avoid changing too much code. v1 -> v2: In the first version I only cleared the group change record. But this is not enough. Because when a new group join, it will init as EXCLUDE and trigger a filter mode change in ip/ip6_mc_add_src(), which will clear all source addresses sf_crcount. This will prevent early joined address sending state change records if multi source addressed joined at the same time. In v2 patch, I fixed it by directly initializing the mode to INCLUDE for SSM JOIN_SOURCE_GROUP. I also split the original patch into two separated patches for IPv4 and IPv6. There is also a difference between v4 and v6 version. For IPv6, when the interface goes down and up, we will send correct state change record with unspecified IPv6 address (::) with function ipv6_mc_up(). But after DAD is completed, we resend the change record TO_IN() in mld_send_initial_cr(). Fix it by sending ALLOW() for INCLUDE mode in mld_send_initial_cr(). Fixes: a052517a8ff65 ("net/multicast: should not send source list records when have filter mode change") Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-16netfilter: conntrack: remove get_timeout() indirectionFlorian Westphal
Not needed, we can have the l4trackers fetch it themselvs. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-07-16netfilter: conntrack: avoid calls to l4proto invert_tupleFlorian Westphal
Handle the common cases (tcp, udp, etc). in the core and only do the indirect call for the protocols that need it (GRE for instance). Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-07-16netfilter: conntrack: remove get_l4proto indirection from l3 protocol trackersFlorian Westphal
Handle it in the core instead. ipv6_skip_exthdr() is built-in even if ipv6 is a module, i.e. this doesn't create an ipv6 dependency. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-07-16netfilter: conntrack: remove invert_tuple indirection from l3 protocol trackersFlorian Westphal
Its simpler to just handle it directly in nf_ct_invert_tuple(). Also gets rid of need to pass l3proto pointer to resolve_conntrack(). Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-07-16netfilter: conntrack: remove pkt_to_tuple indirection from l3 protocol trackersFlorian Westphal
Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-07-16netfilter: conntrack: remove ctnetlink callbacks from l3 protocol trackersFlorian Westphal
handle everything from ctnetlink directly. After all these years we still only support ipv4 and ipv6, so it seems reasonable to remove l3 protocol tracker support and instead handle ipv4/ipv6 from a common, always builtin inet tracker. Step 1: Get rid of all the l3proto->func() calls. Start with ctnetlink, then move on to packet-path ones. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-07-16openvswitch: use nf_ct_get_tuplepr, invert_tupleprFlorian Westphal
These versions deal with the l3proto/l4proto details internally. It removes only caller of nf_ct_get_tuple, so make it static. After this, l3proto->get_l4proto() can be removed in a followup patch. Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Pravin B Shelar <pshelar@ovn.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-07-16netfilter: nft_tproxy: Move nf_tproxy_assign_sock() to nf_tproxy.hMáté Eckl
This function is also necessary to implement nft tproxy support Fixes: 45ca4e0cf273 ("netfilter: Libify xt_TPROXY") Signed-off-by: Máté Eckl <ecklm94@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-07-16tls: Add rx inline crypto offloadBoris Pismenny
This patch completes the generic infrastructure to offload TLS crypto to a network device. It enables the kernel to skip decryption and authentication of some skbs marked as decrypted by the NIC. In the fast path, all packets received are decrypted by the NIC and the performance is comparable to plain TCP. This infrastructure doesn't require a TCP offload engine. Instead, the NIC only decrypts packets that contain the expected TCP sequence number. Out-Of-Order TCP packets are provided unmodified. As a result, at the worst case a received TLS record consists of both plaintext and ciphertext packets. These partially decrypted records must be reencrypted, only to be decrypted. The notable differences between SW KTLS Rx and this offload are as follows: 1. Partial decryption - Software must handle the case of a TLS record that was only partially decrypted by HW. This can happen due to packet reordering. 2. Resynchronization - tls_read_size calls the device driver to resynchronize HW after HW lost track of TLS record framing in the TCP stream. Signed-off-by: Boris Pismenny <borisp@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-16tls: Split tls_sw_release_resources_rxBoris Pismenny
This patch splits tls_sw_release_resources_rx into two functions one which releases all inner software tls structures and another that also frees the containing structure. In TLS_DEVICE we will need to release the software structures without freeeing the containing structure, which contains other information. Signed-off-by: Boris Pismenny <borisp@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-16tls: Split decrypt_skb to two functionsBoris Pismenny
Previously, decrypt_skb also updated the TLS context. Now, decrypt_skb only decrypts the payload using the current context, while decrypt_skb_update also updates the state. Later, in the tls_device Rx flow, we will use decrypt_skb directly. Signed-off-by: Boris Pismenny <borisp@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-16tls: Refactor tls_offload variable namesBoris Pismenny
For symmetry, we rename tls_offload_context to tls_offload_context_tx before we add tls_offload_context_rx. Signed-off-by: Boris Pismenny <borisp@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-14Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller
Daniel Borkmann says: ==================== pull-request: bpf-next 2018-07-15 The following pull-request contains BPF updates for your *net-next* tree. The main changes are: 1) Various different arm32 JIT improvements in order to optimize code emission and make the JIT code itself more robust, from Russell. 2) Support simultaneous driver and offloaded XDP in order to allow for advanced use-cases where some work is offloaded to the NIC and some to the host. Also add ability for bpftool to load programs and maps beyond just the cgroup case, from Jakub. 3) Add BPF JIT support in nfp for multiplication as well as division. For the latter in particular, it uses the reciprocal algorithm to emulate it, from Jiong. 4) Add BTF pretty print functionality to bpftool in plain and JSON output format, from Okash. 5) Add build and installation to the BPF helper man page into bpftool, from Quentin. 6) Add a TCP BPF callback for listening sockets which is triggered right after the socket transitions to TCP_LISTEN state, from Andrey. 7) Add a new cgroup tree command to bpftool which iterates over the whole cgroup tree and prints all attached programs, from Roman. 8) Improve xdp_redirect_cpu sample to support parsing of double VLAN tagged packets, from Jesper. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-13tcp: remove DELAYED ACK events in DCTCPYuchung Cheng
After fixing the way DCTCP tracking delayed ACKs, the delayed-ACK related callbacks are no longer needed Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Acked-by: Lawrence Brakmo <brakmo@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-13net: sched: refactor flower walk to iterate over idrVlad Buslov
Extend struct tcf_walker with additional 'cookie' field. It is intended to be used by classifier walk implementations to continue iteration directly from particular filter, instead of iterating 'skip' number of times. Change flower walk implementation to save filter handle in 'cookie'. Each time flower walk is called, it looks up filter with saved handle directly with idr, instead of iterating over filter linked list 'skip' number of times. This change improves complexity of dumping flower classifier from quadratic to linearithmic. (assuming idr lookup has logarithmic complexity) Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Reported-by: Simon Horman <simon.horman@netronome.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-13xdp: factor out common program/flags handling from driversJakub Kicinski
Basic operations drivers perform during xdp setup and query can be moved to helpers in the core. Encapsulate program and flags into a structure and add helpers. Note that the structure is intended as the "main" program information source in the driver. Most drivers will additionally place the program pointer in their fast path or ring structures. The helpers don't have a huge impact now, but they will decrease the code duplication when programs can be installed in HW and driver at the same time. Encapsulating the basic operations in helpers will hopefully also reduce the number of changes to drivers which adopt them. Helpers could really be static inline, but they depend on definition of struct netdev_bpf which means they'd have to be placed in netdevice.h, an already 4500 line header. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-07-12devlink: Add generic parameters region_snapshotAlex Vesker
region_snapshot - When set enables capturing region snapshots Signed-off-by: Alex Vesker <valex@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Moshe Shemesh <moshe@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-12devlink: Add support for creating region snapshotsAlex Vesker
Each device address region can store multiple snapshots, each snapshot is identified using a different numerical ID. This ID is used when deleting a snapshot or showing an address region specific snapshot. This patch exposes a callback to add a new snapshot to an address region. The snapshot will be deleted using the destructor function when destroying a region or when a snapshot delete command from devlink user tool. Signed-off-by: Alex Vesker <valex@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>