summaryrefslogtreecommitdiff
path: root/include/net
AgeCommit message (Collapse)Author
2019-01-28net: tls: Save iv in tls_rec for async crypto requestsDave Watson
aead_request_set_crypt takes an iv pointer, and we change the iv soon after setting it. Some async crypto algorithms don't save the iv, so we need to save it in the tls_rec for async requests. Found by hardcoding x64 aesni to use async crypto manager (to test the async codepath), however I don't think this combination can happen in the wild. Presumably other hardware offloads will need this fix, but there have been no user reports. Fixes: a42055e8d2c30 ("Add support for async encryption of records...") Signed-off-by: Dave Watson <davejwatson@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-23ax25: fix possible use-after-freeEric Dumazet
syzbot found that ax25 routes where not properly protected against concurrent use [1]. In this particular report the bug happened while copying ax25->digipeat. Fix this problem by making sure we call ax25_get_route() while ax25_route_lock is held, so that no modification could happen while using the route. The current two ax25_get_route() callers do not sleep, so this change should be fine. Once we do that, ax25_get_route() no longer needs to grab a reference on the found route. [1] ax25_connect(): syz-executor0 uses autobind, please contact jreuter@yaina.de BUG: KASAN: use-after-free in memcpy include/linux/string.h:352 [inline] BUG: KASAN: use-after-free in kmemdup+0x42/0x60 mm/util.c:113 Read of size 66 at addr ffff888066641a80 by task syz-executor2/531 ax25_connect(): syz-executor0 uses autobind, please contact jreuter@yaina.de CPU: 1 PID: 531 Comm: syz-executor2 Not tainted 5.0.0-rc2+ #10 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x1db/0x2d0 lib/dump_stack.c:113 print_address_description.cold+0x7c/0x20d mm/kasan/report.c:187 kasan_report.cold+0x1b/0x40 mm/kasan/report.c:317 check_memory_region_inline mm/kasan/generic.c:185 [inline] check_memory_region+0x123/0x190 mm/kasan/generic.c:191 memcpy+0x24/0x50 mm/kasan/common.c:130 memcpy include/linux/string.h:352 [inline] kmemdup+0x42/0x60 mm/util.c:113 kmemdup include/linux/string.h:425 [inline] ax25_rt_autobind+0x25d/0x750 net/ax25/ax25_route.c:424 ax25_connect.cold+0x30/0xa4 net/ax25/af_ax25.c:1224 __sys_connect+0x357/0x490 net/socket.c:1664 __do_sys_connect net/socket.c:1675 [inline] __se_sys_connect net/socket.c:1672 [inline] __x64_sys_connect+0x73/0xb0 net/socket.c:1672 do_syscall_64+0x1a3/0x800 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x458099 Code: 6d b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 3b b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00 RSP: 002b:00007f870ee22c78 EFLAGS: 00000246 ORIG_RAX: 000000000000002a RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 0000000000458099 RDX: 0000000000000048 RSI: 0000000020000080 RDI: 0000000000000005 RBP: 000000000073bf00 R08: 0000000000000000 R09: 0000000000000000 ax25_connect(): syz-executor4 uses autobind, please contact jreuter@yaina.de R10: 0000000000000000 R11: 0000000000000246 R12: 00007f870ee236d4 R13: 00000000004be48e R14: 00000000004ce9a8 R15: 00000000ffffffff Allocated by task 526: save_stack+0x45/0xd0 mm/kasan/common.c:73 set_track mm/kasan/common.c:85 [inline] __kasan_kmalloc mm/kasan/common.c:496 [inline] __kasan_kmalloc.constprop.0+0xcf/0xe0 mm/kasan/common.c:469 kasan_kmalloc+0x9/0x10 mm/kasan/common.c:504 ax25_connect(): syz-executor5 uses autobind, please contact jreuter@yaina.de kmem_cache_alloc_trace+0x151/0x760 mm/slab.c:3609 kmalloc include/linux/slab.h:545 [inline] ax25_rt_add net/ax25/ax25_route.c:95 [inline] ax25_rt_ioctl+0x3b9/0x1270 net/ax25/ax25_route.c:233 ax25_ioctl+0x322/0x10b0 net/ax25/af_ax25.c:1763 sock_do_ioctl+0xe2/0x400 net/socket.c:950 sock_ioctl+0x32f/0x6c0 net/socket.c:1074 vfs_ioctl fs/ioctl.c:46 [inline] file_ioctl fs/ioctl.c:509 [inline] do_vfs_ioctl+0x107b/0x17d0 fs/ioctl.c:696 ksys_ioctl+0xab/0xd0 fs/ioctl.c:713 __do_sys_ioctl fs/ioctl.c:720 [inline] __se_sys_ioctl fs/ioctl.c:718 [inline] __x64_sys_ioctl+0x73/0xb0 fs/ioctl.c:718 do_syscall_64+0x1a3/0x800 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe ax25_connect(): syz-executor5 uses autobind, please contact jreuter@yaina.de Freed by task 550: save_stack+0x45/0xd0 mm/kasan/common.c:73 set_track mm/kasan/common.c:85 [inline] __kasan_slab_free+0x102/0x150 mm/kasan/common.c:458 kasan_slab_free+0xe/0x10 mm/kasan/common.c:466 __cache_free mm/slab.c:3487 [inline] kfree+0xcf/0x230 mm/slab.c:3806 ax25_rt_add net/ax25/ax25_route.c:92 [inline] ax25_rt_ioctl+0x304/0x1270 net/ax25/ax25_route.c:233 ax25_ioctl+0x322/0x10b0 net/ax25/af_ax25.c:1763 sock_do_ioctl+0xe2/0x400 net/socket.c:950 sock_ioctl+0x32f/0x6c0 net/socket.c:1074 vfs_ioctl fs/ioctl.c:46 [inline] file_ioctl fs/ioctl.c:509 [inline] do_vfs_ioctl+0x107b/0x17d0 fs/ioctl.c:696 ksys_ioctl+0xab/0xd0 fs/ioctl.c:713 __do_sys_ioctl fs/ioctl.c:720 [inline] __se_sys_ioctl fs/ioctl.c:718 [inline] __x64_sys_ioctl+0x73/0xb0 fs/ioctl.c:718 do_syscall_64+0x1a3/0x800 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe The buggy address belongs to the object at ffff888066641a80 which belongs to the cache kmalloc-96 of size 96 The buggy address is located 0 bytes inside of 96-byte region [ffff888066641a80, ffff888066641ae0) The buggy address belongs to the page: page:ffffea0001999040 count:1 mapcount:0 mapping:ffff88812c3f04c0 index:0x0 flags: 0x1fffc0000000200(slab) ax25_connect(): syz-executor4 uses autobind, please contact jreuter@yaina.de raw: 01fffc0000000200 ffffea0001817948 ffffea0002341dc8 ffff88812c3f04c0 raw: 0000000000000000 ffff888066641000 0000000100000020 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff888066641980: fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc ffff888066641a00: 00 00 00 00 00 00 00 00 02 fc fc fc fc fc fc fc >ffff888066641a80: fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc ^ ffff888066641b00: fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc ffff888066641b80: 00 00 00 00 00 00 00 00 00 00 00 00 fc fc fc fc Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Ralf Baechle <ralf@linux-mips.org> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-15Revert "rxrpc: Allow failed client calls to be retried"David Howells
The changes introduced to allow rxrpc calls to be retried creates an issue when it comes to refcounting afs_call structs. The problem is that when rxrpc_send_data() queues the last packet for an asynchronous call, the following sequence can occur: (1) The notify_end_tx callback is invoked which causes the state in the afs_call to be changed from AFS_CALL_CL_REQUESTING or AFS_CALL_SV_REPLYING. (2) afs_deliver_to_call() can then process event notifications from rxrpc on the async_work queue. (3) Delivery of events, such as an abort from the server, can cause the afs_call state to be changed to AFS_CALL_COMPLETE on async_work. (4) For an asynchronous call, afs_process_async_call() notes that the call is complete and tried to clean up all the refs on async_work. (5) rxrpc_send_data() might return the amount of data transferred (success) or an error - which could in turn reflect a local error or a received error. Synchronising the clean up after rxrpc_kernel_send_data() returns an error with the asynchronous cleanup is then tricky to get right. Mostly revert commit c038a58ccfd6704d4d7d60ed3d6a0fca13cf13a4. The two API functions the original commit added aren't currently used. This makes rxrpc_kernel_send_data() always return successfully if it queued the data it was given. Note that this doesn't affect synchronous calls since their Rx notification function merely pokes a wait queue and does not refcounting. The asynchronous call notification function *has* to do refcounting and pass a ref over the work item to avoid the need to sync the workqueue in call cleanup. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-15net: ipv4: Fix memory leak in network namespace dismantleIdo Schimmel
IPv4 routing tables are flushed in two cases: 1. In response to events in the netdev and inetaddr notification chains 2. When a network namespace is being dismantled In both cases only routes associated with a dead nexthop group are flushed. However, a nexthop group will only be marked as dead in case it is populated with actual nexthops using a nexthop device. This is not the case when the route in question is an error route (e.g., 'blackhole', 'unreachable'). Therefore, when a network namespace is being dismantled such routes are not flushed and leaked [1]. To reproduce: # ip netns add blue # ip -n blue route add unreachable 192.0.2.0/24 # ip netns del blue Fix this by not skipping error routes that are not marked with RTNH_F_DEAD when flushing the routing tables. To prevent the flushing of such routes in case #1, add a parameter to fib_table_flush() that indicates if the table is flushed as part of namespace dismantle or not. Note that this problem does not exist in IPv6 since error routes are associated with the loopback device. [1] unreferenced object 0xffff888066650338 (size 56): comm "ip", pid 1206, jiffies 4294786063 (age 26.235s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 b0 1c 62 61 80 88 ff ff ..........ba.... e8 8b a1 64 80 88 ff ff 00 07 00 08 fe 00 00 00 ...d............ backtrace: [<00000000856ed27d>] inet_rtm_newroute+0x129/0x220 [<00000000fcdfc00a>] rtnetlink_rcv_msg+0x397/0xa20 [<00000000cb85801a>] netlink_rcv_skb+0x132/0x380 [<00000000ebc991d2>] netlink_unicast+0x4c0/0x690 [<0000000014f62875>] netlink_sendmsg+0x929/0xe10 [<00000000bac9d967>] sock_sendmsg+0xc8/0x110 [<00000000223e6485>] ___sys_sendmsg+0x77a/0x8f0 [<000000002e94f880>] __sys_sendmsg+0xf7/0x250 [<00000000ccb1fa72>] do_syscall_64+0x14d/0x610 [<00000000ffbe3dae>] entry_SYSCALL_64_after_hwframe+0x49/0xbe [<000000003a8b605b>] 0xffffffffffffffff unreferenced object 0xffff888061621c88 (size 48): comm "ip", pid 1206, jiffies 4294786063 (age 26.235s) hex dump (first 32 bytes): 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk 6b 6b 6b 6b 6b 6b 6b 6b d8 8e 26 5f 80 88 ff ff kkkkkkkk..&_.... backtrace: [<00000000733609e3>] fib_table_insert+0x978/0x1500 [<00000000856ed27d>] inet_rtm_newroute+0x129/0x220 [<00000000fcdfc00a>] rtnetlink_rcv_msg+0x397/0xa20 [<00000000cb85801a>] netlink_rcv_skb+0x132/0x380 [<00000000ebc991d2>] netlink_unicast+0x4c0/0x690 [<0000000014f62875>] netlink_sendmsg+0x929/0xe10 [<00000000bac9d967>] sock_sendmsg+0xc8/0x110 [<00000000223e6485>] ___sys_sendmsg+0x77a/0x8f0 [<000000002e94f880>] __sys_sendmsg+0xf7/0x250 [<00000000ccb1fa72>] do_syscall_64+0x14d/0x610 [<00000000ffbe3dae>] entry_SYSCALL_64_after_hwframe+0x49/0xbe [<000000003a8b605b>] 0xffffffffffffffff Fixes: 8cced9eff1d4 ("[NETNS]: Enable routing configuration in non-initial namespace.") Signed-off-by: Ido Schimmel <idosch@mellanox.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-11netfilter: nft_flow_offload: fix interaction with vrf slave devicewenxu
In the forward chain, the iif is changed from slave device to master vrf device. Thus, flow offload does not find a match on the lower slave device. This patch uses the cached route, ie. dst->dev, to update the iif and oif fields in the flow entry. After this patch, the following example works fine: # ip addr add dev eth0 1.1.1.1/24 # ip addr add dev eth1 10.0.0.1/24 # ip link add user1 type vrf table 1 # ip l set user1 up # ip l set dev eth0 master user1 # ip l set dev eth1 master user1 # nft add table firewall # nft add flowtable f fb1 { hook ingress priority 0 \; devices = { eth0, eth1 } \; } # nft add chain f ftb-all {type filter hook forward priority 0 \; policy accept \; } # nft add rule f ftb-all ct zone 1 ip protocol tcp flow offload @fb1 # nft add rule f ftb-all ct zone 1 ip protocol udp flow offload @fb1 Signed-off-by: wenxu <wenxu@ucloud.cn> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-03Remove 'type' argument from access_ok() functionLinus Torvalds
Nobody has actually used the type (VERIFY_READ vs VERIFY_WRITE) argument of the user address range verification function since we got rid of the old racy i386-only code to walk page tables by hand. It existed because the original 80386 would not honor the write protect bit when in kernel mode, so you had to do COW by hand before doing any user access. But we haven't supported that in a long time, and these days the 'type' argument is a purely historical artifact. A discussion about extending 'user_access_begin()' to do the range checking resulted this patch, because there is no way we're going to move the old VERIFY_xyz interface to that model. And it's best done at the end of the merge window when I've done most of my merges, so let's just get this done once and for all. This patch was mostly done with a sed-script, with manual fix-ups for the cases that weren't of the trivial 'access_ok(VERIFY_xyz' form. There were a couple of notable cases: - csky still had the old "verify_area()" name as an alias. - the iter_iov code had magical hardcoded knowledge of the actual values of VERIFY_{READ,WRITE} (not that they mattered, since nothing really used it) - microblaze used the type argument for a debug printout but other than those oddities this should be a total no-op patch. I tried to fix up all architectures, did fairly extensive grepping for access_ok() uses, and the changes are trivial, but I may have missed something. Any missed conversion should be trivially fixable, though. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-01ip: validate header length on virtual device xmitWillem de Bruijn
KMSAN detected read beyond end of buffer in vti and sit devices when passing truncated packets with PF_PACKET. The issue affects additional ip tunnel devices. Extend commit 76c0ddd8c3a6 ("ip6_tunnel: be careful when accessing the inner header") and commit ccfec9e5cb2d ("ip_tunnel: be careful when accessing the inner header"). Move the check to a separate helper and call at the start of each ndo_start_xmit function in net/ipv4 and net/ipv6. Minor changes: - convert dev_kfree_skb to kfree_skb on error path, as dev_kfree_skb calls consume_skb which is not for error paths. - use pskb_network_may_pull even though that is pedantic here, as the same as pskb_may_pull for devices without llheaders. - do not cache ipv6 hdrs if used only once (unsafe across pskb_may_pull, was more relevant to earlier patch) Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-01sock: Make sock->sk_stamp thread-safeDeepa Dinamani
Al Viro mentioned (Message-ID <20170626041334.GZ10672@ZenIV.linux.org.uk>) that there is probably a race condition lurking in accesses of sk_stamp on 32-bit machines. sock->sk_stamp is of type ktime_t which is always an s64. On a 32 bit architecture, we might run into situations of unsafe access as the access to the field becomes non atomic. Use seqlocks for synchronization. This allows us to avoid using spinlocks for readers as readers do not need mutual exclusion. Another approach to solve this is to require sk_lock for all modifications of the timestamps. The current approach allows for timestamps to have their own lock: sk_stamp_lock. This allows for the patch to not compete with already existing critical sections, and side effects are limited to the paths in the patch. The addition of the new field maintains the data locality optimizations from commit 9115e8cd2a0c ("net: reorganize struct sock for better data locality") Note that all the instances of the sk_stamp accesses are either through the ioctl or the syscall recvmsg. Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-29netfilter: nf_conncount: speculative garbage collection on empty listsPablo Neira Ayuso
Instead of removing a empty list node that might be reintroduced soon thereafter, tentatively place the empty list node on the list passed to tree_nodes_free(), then re-check if the list is empty again before erasing it from the tree. [ Florian: rebase on top of pending nf_conncount fixes ] Fixes: 5c789e131cbb9 ("netfilter: nf_conncount: Add list lock and gc worker, and RCU for init tree search") Reviewed-by: Shawn Bohrer <sbohrer@cloudflare.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-12-29netfilter: nf_conncount: merge lookup and add functionsFlorian Westphal
'lookup' is always followed by 'add'. Merge both and make the list-walk part of nf_conncount_add(). This also avoids one unneeded unlock/re-lock pair. Extra care needs to be taken in count_tree, as we only hold rcu read lock, i.e. we can only insert to an existing tree node after acquiring its lock and making sure it has a nonzero count. As a zero count should be rare, just fall back to insert_tree() (which acquires tree lock). This issue and its solution were pointed out by Shawn Bohrer during patch review. Reviewed-by: Shawn Bohrer <sbohrer@cloudflare.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-12-24net: dccp: fix kernel crash on module loadPeter Oskolkov
Patch eedbbb0d98b2 "net: dccp: initialize (addr,port) ..." added calling to inet_hashinfo2_init() from dccp_init(). However, inet_hashinfo2_init() is marked as __init(), and thus the kernel panics when dccp is loaded as module. Removing __init() tag from inet_hashinfo2_init() is not feasible because it calls into __init functions in mm. This patch adds inet_hashinfo2_init_mod() function that can be called after the init phase is done; changes dccp_init() to call the new function; un-marks inet_hashinfo2_init() as exported. Fixes: eedbbb0d98b2 ("net: dccp: initialize (addr,port) ...") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Peter Oskolkov <posk@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-20Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following patchset contains Netfilter updates for net-next: 1) Support for destination MAC in ipset, from Stefano Brivio. 2) Disallow all-zeroes MAC address in ipset, also from Stefano. 3) Add IPSET_CMD_GET_BYNAME and IPSET_CMD_GET_BYINDEX commands, introduce protocol version number 7, from Jozsef Kadlecsik. A follow up patch to fix ip_set_byindex() is also included in this batch. 4) Honor CTA_MARK_MASK from ctnetlink, from Andreas Jaggi. 5) Statify nf_flow_table_iterate(), from Taehee Yoo. 6) Use nf_flow_table_iterate() to simplify garbage collection in nf_flow_table logic, also from Taehee Yoo. 7) Don't use _bh variants of call_rcu(), rcu_barrier() and synchronize_rcu_bh() in Netfilter, from Paul E. McKenney. 8) Remove NFC_* cache definition from the old caching infrastructure. 9) Remove layer 4 port rover in NAT helpers, use random port instead, from Florian Westphal. 10) Use strscpy() in ipset, from Qian Cai. 11) Remove NF_NAT_RANGE_PROTO_RANDOM_FULLY branch now that random port is allocated by default, from Xiaozhou Liu. 12) Ignore NF_NAT_RANGE_PROTO_RANDOM too, from Florian Westphal. 13) Limit port allocation selection routine in NAT to avoid softlockup splats when most ports are in use, from Florian. 14) Remove unused parameters in nf_ct_l4proto_unregister_sysctl() from Yafang Shao. 15) Direct call to nf_nat_l4proto_unique_tuple() instead of indirection, from Florian Westphal. 16) Several patches to remove all layer 4 NAT indirections, remove nf_nat_l4proto struct, from Florian Westphal. 17) Fix RTP/RTCP source port translation when SNAT is in place, from Alin Nastac. 18) Selective rule dump per chain, from Phil Sutter. 19) Revisit CLUSTERIP target, this includes a deadlock fix from netns path, sleep in atomic, remove bogus WARN_ON_ONCE() and disallow mismatching IP address and MAC address. Patchset from Taehee Yoo. 20) Update UDP timeout to stream after 2 seconds, from Florian. 21) Shrink UDP established timeout to 120 seconds like TCP timewait. 22) Sysctl knobs to set GRE timeouts, from Yafang Shao. 23) Move seq_print_acct() to conntrack core file, from Florian. 24) Add enum for conntrack sysctl knobs, also from Florian. 25) Place nf_conntrack_acct, nf_conntrack_helper, nf_conntrack_events and nf_conntrack_timestamp knobs in the core, from Florian Westphal. As a side effect, shrink netns_ct structure by removing obsolete sysctl anchors, also from Florian. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-20Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller
Daniel Borkmann says: ==================== pull-request: bpf-next 2018-12-21 The following pull-request contains BPF updates for your *net-next* tree. There is a merge conflict in test_verifier.c. Result looks as follows: [...] }, { "calls: cross frame pruning", .insns = { [...] .prog_type = BPF_PROG_TYPE_SOCKET_FILTER, .errstr_unpriv = "function calls to other bpf functions are allowed for root only", .result_unpriv = REJECT, .errstr = "!read_ok", .result = REJECT, }, { "jset: functional", .insns = { [...] { "jset: unknown const compare not taken", .insns = { BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_get_prandom_u32), BPF_JMP_IMM(BPF_JSET, BPF_REG_0, 1, 1), BPF_LDX_MEM(BPF_B, BPF_REG_8, BPF_REG_9, 0), BPF_EXIT_INSN(), }, .prog_type = BPF_PROG_TYPE_SOCKET_FILTER, .errstr_unpriv = "!read_ok", .result_unpriv = REJECT, .errstr = "!read_ok", .result = REJECT, }, [...] { "jset: range", .insns = { [...] }, .prog_type = BPF_PROG_TYPE_SOCKET_FILTER, .result_unpriv = ACCEPT, .result = ACCEPT, }, The main changes are: 1) Various BTF related improvements in order to get line info working. Meaning, verifier will now annotate the corresponding BPF C code to the error log, from Martin and Yonghong. 2) Implement support for raw BPF tracepoints in modules, from Matt. 3) Add several improvements to verifier state logic, namely speeding up stacksafe check, optimizations for stack state equivalence test and safety checks for liveness analysis, from Alexei. 4) Teach verifier to make use of BPF_JSET instruction, add several test cases to kselftests and remove nfp specific JSET optimization now that verifier has awareness, from Jakub. 5) Improve BPF verifier's slot_type marking logic in order to allow more stack slot sharing, from Jiong. 6) Add sk_msg->size member for context access and add set of fixes and improvements to make sock_map with kTLS usable with openssl based applications, from John. 7) Several cleanups and documentation updates in bpftool as well as auto-mount of tracefs for "bpftool prog tracelog" command, from Quentin. 8) Include sub-program tags from now on in bpf_prog_info in order to have a reliable way for user space to get all tags of the program e.g. needed for kallsyms correlation, from Song. 9) Add BTF annotations for cgroup_local_storage BPF maps and implement bpf fs pretty print support, from Roman. 10) Fix bpftool in order to allow for cross-compilation, from Ivan. 11) Update of bpftool license to GPLv2-only + BSD-2-Clause in order to be compatible with libbfd and allow for Debian packaging, from Jakub. 12) Remove an obsolete prog->aux sanitation in dump and get rid of version check for prog load, from Daniel. 13) Fix a memory leak in libbpf's line info handling, from Prashant. 14) Fix cpumap's frame alignment for build_skb() so that skb_shared_info does not get unaligned, from Jesper. 15) Fix test_progs kselftest to work with older compilers which are less smart in optimizing (and thus throwing build error), from Stanislav. 16) Cleanup and simplify AF_XDP socket teardown, from Björn. 17) Fix sk lookup in BPF kselftest's test_sock_addr with regards to netns_id argument, from Andrey. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-20net: seg6.h: remove an unused #includePeter Oskolkov
A minor code cleanup. Signed-off-by: Peter Oskolkov <posk@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-21netfilter: netns: shrink netns_ct structFlorian Westphal
remove the obsolete sysctl anchors and move auto_assign_helper_warned to avoid/cover a hole. Reduces size by 40 bytes on 64 bit. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-12-21netfilter: conntrack: remove empty pernet fini stubsFlorian Westphal
after moving sysctl handling into single place, the init functions can't fail anymore and some of the fini functions are empty. Remove them and change return type to void. This also simplifies error unwinding in conntrack module init path. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-12-21netfilter: conntrack: un-export seq_print_acctFlorian Westphal
Only one caller, just place it where its needed. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-12-21netfilter: conntrack: udp: only extend timeout to stream mode after 2sFlorian Westphal
Currently DNS resolvers that send both A and AAAA queries from same source port can trigger stream mode prematurely, which results in non-early-evictable conntrack entry for three minutes, even though DNS requests are done in a few milliseconds. Add a two second grace period where we continue to use the ordinary 30-second default timeout. Its enough for DNS request/response traffic, even if two request/reply packets are involved. ASSURED is still set, else conntrack (and thus a possible NAT mapping ...) gets zapped too in case conntrack table runs full. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-12-20bpf: sk_msg, sock{map|hash} redirect through ULPJohn Fastabend
A sockmap program that redirects through a kTLS ULP enabled socket will not work correctly because the ULP layer is skipped. This fixes the behavior to call through the ULP layer on redirect to ensure any operations required on the data stream at the ULP layer continue to be applied. To do this we add an internal flag MSG_SENDPAGE_NOPOLICY to avoid calling the BPF layer on a redirected message. This is required to avoid calling the BPF layer multiple times (possibly recursively) which is not the current/expected behavior without ULPs. In the future we may add a redirect flag if users _do_ want the policy applied again but this would need to work for both ULP and non-ULP sockets and be opt-in to avoid breaking existing programs. Also to avoid polluting the flag space with an internal flag we reuse the flag space overlapping MSG_SENDPAGE_NOPOLICY with MSG_WAITFORONE. Here WAITFORONE is specific to recv path and SENDPAGE_NOPOLICY is only used for sendpage hooks. The last thing to verify is user space API is masked correctly to ensure the flag can not be set by user. (Note this needs to be true regardless because we have internal flags already in-use that user space should not be able to set). But for completeness we have two UAPI paths into sendpage, sendfile and splice. In the sendfile case the function do_sendfile() zero's flags, ./fs/read_write.c: static ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos, size_t count, loff_t max) { ... fl = 0; #if 0 /* * We need to debate whether we can enable this or not. The * man page documents EAGAIN return for the output at least, * and the application is arguably buggy if it doesn't expect * EAGAIN on a non-blocking file descriptor. */ if (in.file->f_flags & O_NONBLOCK) fl = SPLICE_F_NONBLOCK; #endif file_start_write(out.file); retval = do_splice_direct(in.file, &pos, out.file, &out_pos, count, fl); } In the splice case the pipe_to_sendpage "actor" is used which masks flags with SPLICE_F_MORE. ./fs/splice.c: static int pipe_to_sendpage(struct pipe_inode_info *pipe, struct pipe_buffer *buf, struct splice_desc *sd) { ... more = (sd->flags & SPLICE_F_MORE) ? MSG_MORE : 0; ... } Confirming what we expect that internal flags are in fact internal to socket side. Fixes: d3b18ad31f93 ("tls: add bpf support to sk_msg handling") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-12-20Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Lots of conflicts, by happily all cases of overlapping changes, parallel adds, things of that nature. Thanks to Stephen Rothwell, Saeed Mahameed, and others for their guidance in these resolutions. Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-19iptunnel: make TUNNEL_FLAGS available in uapiwenxu
ip l add dev tun type gretap external ip r a 10.0.0.1 encap ip dst 192.168.152.171 id 1000 dev gretap For gretap Key example when the command set the id but don't set the TUNNEL_KEY flags. There is no key field in the send packet In the lwtunnel situation, some TUNNEL_FLAGS should can be set by userspace Signed-off-by: wenxu <wenxu@ucloud.cn> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-19neighbour: register rtnl doit handlerRoopa Prabhu
this patch registers neigh doit handler. The doit handler returns a neigh entry given dst and dev. This is similar to route and fdb doit (get) handlers. Also moves nda_policy declaration from rtnetlink.c to neighbour.c Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Reviewed-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-19net: switch secpath to use skb extension infrastructureFlorian Westphal
Remove skb->sp and allocate secpath storage via extension infrastructure. This also reduces sk_buff by 8 bytes on x86_64. Total size of allyesconfig kernel is reduced slightly, as there is less inlined code (one conditional atomic op instead of two on skb_clone). No differences in throughput in following ipsec performance tests: - transport mode with aes on 10GB link - tunnel mode between two network namespaces with aes and null cipher Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-19xfrm: use secpath_exist where applicableFlorian Westphal
Will reduce noise when skb->sp is removed later in this series. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-19net: use skb_sec_path helper in more placesFlorian Westphal
skb_sec_path gains 'const' qualifier to avoid xt_policy.c: 'skb_sec_path' discards 'const' qualifier from pointer target type same reasoning as previous conversions: Won't need to touch these spots anymore when skb->sp is removed. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-19net: move secpath_exist helper to sk_buff.hFlorian Westphal
Future patch will remove skb->sp pointer. To reduce noise in those patches, move existing helper to sk_buff and use it in more places to ease skb->sp replacement later. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-19xfrm: change secpath_set to return secpath struct, not error valueFlorian Westphal
It can only return 0 (success) or -ENOMEM. Change return value to a pointer to secpath struct. This avoids direct access to skb->sp: err = secpath_set(skb); if (!err) .. skb->sp-> ... Becomes: sp = secpath_set(skb) if (!sp) .. sp-> .. This reduces noise in followup patch which is going to remove skb->sp. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-19net: convert bridge_nf to use skb extension infrastructureFlorian Westphal
This converts the bridge netfilter (calling iptables hooks from bridge) facility to use the extension infrastructure. The bridge_nf specific hooks in skb clone and free paths are removed, they have been replaced by the skb_ext hooks that do the same as the bridge nf allocations hooks did. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-19netfilter: avoid using skb->nf_bridge directlyFlorian Westphal
This pointer is going to be removed soon, so use the existing helpers in more places to avoid noise when the removal happens. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-19Merge tag 'mac80211-next-for-davem-2018-12-19' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next Johannes Berg says: ==================== This time we have too many changes to list, highlights: * virt_wifi - wireless control simulation on top of another network interface * hwsim configurability to test capabilities similar to real hardware * various mesh improvements * various radiotap vendor data fixes in mac80211 * finally the nl_set_extack_cookie_u64() we talked about previously, used for * peer measurement APIs, right now only with FTM (flight time measurement) for location * made nl80211 radio/interface announcements more complete * various new HE (802.11ax) things: updates, TWT support, ... ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-18Merge branch 'master' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec Steffen Klassert says: ==================== pull request (net): ipsec 2018-12-18 1) Fix error return code in xfrm_output_one() when no dst_entry is attached to the skb. From Wei Yongjun. 2) The xfrm state hash bucket count reported to userspace is off by one. Fix from Benjamin Poirier. 3) Fix NULL pointer dereference in xfrm_input when skb_dst_force clears the dst_entry. 4) Fix freeing of xfrm states on acquire. We use a dedicated slab cache for the xfrm states now, so free it properly with kmem_cache_free. From Mathias Krause. Please pull or let me know if there are problems. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-18Merge branch 'master' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next Steffen Klassert says: ==================== pull request (net-next): ipsec-next 2018-12-18 1) Add xfrm policy selftest scripts. From Florian Westphal. 2) Split inexact policies into four different search list classes and use the rbtree infrastructure to store/lookup the policies. This is to improve the policy lookup performance after the flowcache removal. Patches from Florian Westphal. 3) Various coding style fixes, from Colin Ian King. 4) Fix policy lookup logic after adding the inexact policy search tree infrastructure. From Florian Westphal. 5) Remove a useless remove BUG_ON from xfrm6_dst_ifdown. From Li RongQing. 6) Use the correct policy direction for lookups on hash rebuilding. From Florian Westphal. Please pull or let me know if there are problems. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-18mac80211: propagate the support for TWT to the driverEmmanuel Grumbach
TWT is a feature that was added in 11ah and enhanced in 11ax. There are two bits that need to be set if we want to use the feature in 11ax: one in the HE Capability IE and one in the Extended Capability IE. This is because of backward compatibility between 11ah and 11ax. In order to simplify the flow for the low level driver in managed mode, aggregate the two bits and add a boolean that tells whether TWT is supported or not, but only if 11ax is supported. Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com> Signed-off-by: Luca Coelho <luciano.coelho@intel.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-12-18mac80211: document RCU requirements for ieee80211_tx_dequeue()Johannes Berg
In the iwlwifi conversion, we sometimes call this from outside of the wake_tx_queue() method, and in those cases must be in an RCU critical section. Document this requirement. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Luca Coelho <luciano.coelho@intel.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-12-18cfg80211: clarify LCI/civic location documentationJohannes Berg
The older code and current userspace assumed that this data is the content of the Measurement Report element, starting with the Measurement Token. Clarify this in the documentation. Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-12-18wireless: FTM: fix kernel-doc "cannot understand" warningsRandy Dunlap
Fix kernel-doc warnings in FTM due to missing "struct" keyword. Fixes 109 warnings from <net/cfg80211.h>: ../include/net/cfg80211.h:2838: warning: cannot understand function prototype: 'struct cfg80211_ftm_responder_stats ' and fixes 88 warnings from <net/mac80211.h>: ../include/net/mac80211.h:477: warning: cannot understand function prototype: 'struct ieee80211_ftm_responder_params ' Fixes: 81e54d08d9d8 ("cfg80211: support FTM responder configuration/statistics") Fixes: bc847970f432 ("mac80211: support FTM responder configuration/statistics") Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Pradeep Kumar Chitrapu <pradeepc@codeaurora.org> Cc: Johannes Berg <johannes.berg@intel.com> Cc: David Spinadel <david.spinadel@intel.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-12-17net: add missing SOF_TIMESTAMPING_OPT_ID supportWillem de Bruijn
SOF_TIMESTAMPING_OPT_ID is supported on TCP, UDP and RAW sockets. But it was missing on RAW with IPPROTO_IP, PF_PACKET and CAN. Add skb_setup_tx_timestamp that configures both tx_flags and tskey for these paths that do not need corking or use bytestream keys. Fixes: 09c2d251b707 ("net-timestamp: add key to disambiguate concurrent datagrams") Signed-off-by: Willem de Bruijn <willemb@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-17netfilter: nat: remove nf_nat_l4proto structFlorian Westphal
This removes the (now empty) nf_nat_l4proto struct, all its instances and all the no longer needed runtime (un)register functionality. nf_nat_need_gre() can be axed as well: the module that calls it (to load the no-longer-existing nat_gre module) also calls other nat core functions. GRE nat is now always available if kernel is built with it. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-12-17netfilter: nat: remove l4proto->manip_pktFlorian Westphal
This removes the last l4proto indirection, the two callers, the l3proto packet mangling helpers for ipv4 and ipv6, now call the nf_nat_l4proto_manip_pkt() helper. nf_nat_proto_{dccp,tcp,sctp,gre,icmp,icmpv6} are left behind, even though they contain no functionality anymore to not clutter this patch. Next patch will remove the empty files and the nf_nat_l4proto struct. nf_nat_proto_udp.c is renamed to nf_nat_proto.c, as it now contains the other nat manip functionality as well, not just udp and udplite. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-12-17netfilter: nat: remove l4proto->nlattr_to_rangeFlorian Westphal
all protocols did set this to nf_nat_l4proto_nlattr_to_range, so just call it directly. The important difference is that we'll now also call it for protocols that we don't support (i.e., nf_nat_proto_unknown did not provide .nlattr_to_range). However, there should be no harm, even icmp provided this callback. If we don't implement a specific l4nat for this, nothing would make use of this information, so adding a big switch/case construct listing all supported l4protocols seems a bit pointless. This change leaves a single function pointer in the l4proto struct. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-12-17netfilter: nat: remove l4proto->in_rangeFlorian Westphal
With exception of icmp, all of the l4 nat protocols set this to nf_nat_l4proto_in_range. Get rid of this and just check the l4proto in the caller. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-12-17netfilter: nat: fold in_range indirection into callerFlorian Westphal
No need for indirections here, we only support ipv4 and ipv6 and the called functions are very small. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-12-17netfilter: nat: remove l4proto->unique_tupleFlorian Westphal
fold remaining users (icmp, icmpv6, gre) into nf_nat_l4proto_unique_tuple. The static-save of old incarnation of resolved key in gre and icmp is removed as well, just use the prandom based offset like the others. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-12-17netfilter: nat: un-export nf_nat_l4proto_unique_tupleFlorian Westphal
almost all l4proto->unique_tuple implementations just call this helper, so make ->unique_tuple() optional and call its helper directly if the l4proto doesn't override it. This is an intermediate step to get rid of ->unique_tuple completely. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-12-17netfilter: remove NF_NAT_RANGE_PROTO_RANDOM supportFlorian Westphal
Historically this was net_random() based, and was then converted to a hash based algorithm (private boot seed + hash of endpoint addresses) due to concerns of leaking net_random() bits. RANDOM_FULLY mode was added later to avoid problems with hash based mode (see commit 34ce324019e76, "netfilter: nf_nat: add full port randomization support" for details). Just make prandom_u32() the default search starting point and get rid of ->secure_port() altogether. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-12-16net: dsa: ksz: Rename NET_DSA_TAG_KSZ to _KSZ9477Tristram Ha
Rename the tag Kconfig option and related macros in preparation for addition of new KSZ family switches with different tag formats. Signed-off-by: Tristram Ha <Tristram.Ha@microchip.com> Signed-off-by: Marek Vasut <marex@denx.de> Cc: Vivien Didelot <vivien.didelot@savoirfairelinux.com> Cc: Woojung Huh <woojung.huh@microchip.com> Cc: David S. Miller <davem@davemloft.net> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-16neighbor: Add protocol attributeDavid Ahern
Similar to routes and rules, add protocol attribute to neighbor entries for easier tracking of how each was created. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-15net: use indirect call wrappers at GRO transport layerPaolo Abeni
This avoids an indirect call in the receive path for TCP and UDP packets. TCP takes precedence on UDP, so that we have a single additional conditional in the common case. When IPV6 is build as module, all gro symbols except UDPv6 are builtin, while the latter belong to the ipv6 module, so we need some special care. v1 -> v2: - adapted to INDIRECT_CALL_ changes v2 -> v3: - fix build issue with CONFIG_IPV6=m Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-15net: use indirect call wrappers at GRO network layerPaolo Abeni
This avoids an indirect calls for L3 GRO receive path, both for ipv4 and ipv6, if the latter is not compiled as a module. Note that when IPv6 is compiled as builtin, it will be checked first, so we have a single additional compare for the more common path. v1 -> v2: - adapted to INDIRECT_CALL_ changes Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-15neighbor: Improve neighbour struct layoutDavid Ahern
Move arp_queue_len_bytes ahead of arp_queue to remove two 4-byte holes. Ensure ha element is always 8-byte aligned. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>