path: root/net/core
Age | Commit message | Author
2019-09-02Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netDavid S. Miller
The r8152 conflicts are the NAPI fixes in 'net' overlapping with some tasklet stuff in net-next. Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-31devlink: Use switch-case instead of if-elseParav Pandit
Make core more readable with switch-case for various port flavours. Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Parav Pandit <parav@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-31devlink: Make port index data type as unsigned intParav Pandit
The devlink port index attribute is returned to users as a u32 through the netlink response. Change the index data type from 'unsigned' to 'unsigned int' to avoid the following checkpatch.pl warning:

WARNING: Prefer 'unsigned int' to bare use of 'unsigned'
81: FILE: include/net/devlink.h:81:
+ unsigned index;

Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Parav Pandit <parav@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-31net/tls: use RCU protection on icsk->icsk_ulp_dataJakub Kicinski
We need to make sure context does not get freed while diag code is interrogating it. Free struct tls_context with kfree_rcu(). We add the __rcu annotation directly in icsk, and cast it away in the datapath accessor. Presumably all ULPs will do a similar thing. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
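A minimal sketch of the pattern described above, assuming struct tls_context gains a struct rcu_head member named 'rcu' (field and helper names are illustrative, not the exact patch):

    /* Datapath accessor: the pointer is __rcu-annotated in icsk, but the
     * fast path owns the socket, so it may cast the annotation away. */
    static inline struct tls_context *tls_get_ctx(const struct sock *sk)
    {
            const struct inet_connection_sock *icsk = inet_csk(sk);

            return (__force void *)icsk->icsk_ulp_data;
    }

    /* Teardown: defer the free for a grace period so a concurrent diag
     * reader under rcu_read_lock() never sees freed memory. */
    static void tls_ctx_free(struct tls_context *ctx)
    {
            kfree_rcu(ctx, rcu);
    }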
2019-08-27net: fix skb use after free in netpollFeng Sun
After commit baeababb5b85d5c4e6c917efe2a1504179438d3b ("tun: return NET_XMIT_DROP for dropped packets"), when tun_net_xmit() drops packets, it frees the skb and returns NET_XMIT_DROP, so netpoll_send_skb_on_dev() runs into the following use-after-free cases: 1. retrying netpoll_start_xmit() with the freed skb; 2. queueing the freed skb in npinfo->txq, in which case queue_process() also runs into a use after free. The first case in netpoll_send_skb_on_dev() was hit with the following kernel log:

[ 117.864773] kernel BUG at mm/slub.c:306!
[ 117.864773] invalid opcode: 0000 [#1] SMP PTI
[ 117.864774] CPU: 3 PID: 2627 Comm: loop_printmsg Kdump: loaded Tainted: P OE 5.3.0-050300rc5-generic #201908182231
[ 117.864775] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[ 117.864775] RIP: 0010:kmem_cache_free+0x28d/0x2b0
[ 117.864781] Call Trace:
[ 117.864781] ? tun_net_xmit+0x21c/0x460
[ 117.864781] kfree_skbmem+0x4e/0x60
[ 117.864782] kfree_skb+0x3a/0xa0
[ 117.864782] tun_net_xmit+0x21c/0x460
[ 117.864782] netpoll_start_xmit+0x11d/0x1b0
[ 117.864788] netpoll_send_skb_on_dev+0x1b8/0x200
[ 117.864789] __br_forward+0x1b9/0x1e0 [bridge]
[ 117.864789] ? skb_clone+0x53/0xd0
[ 117.864790] ? __skb_clone+0x2e/0x120
[ 117.864790] deliver_clone+0x37/0x50 [bridge]
[ 117.864790] maybe_deliver+0x89/0xc0 [bridge]
[ 117.864791] br_flood+0x6c/0x130 [bridge]
[ 117.864791] br_dev_xmit+0x315/0x3c0 [bridge]
[ 117.864792] netpoll_start_xmit+0x11d/0x1b0
[ 117.864792] netpoll_send_skb_on_dev+0x1b8/0x200
[ 117.864792] netpoll_send_udp+0x2c6/0x3e8
[ 117.864793] write_msg+0xd9/0xf0 [netconsole]
[ 117.864793] console_unlock+0x386/0x4e0
[ 117.864793] vprintk_emit+0x17e/0x280
[ 117.864794] vprintk_default+0x29/0x50
[ 117.864794] vprintk_func+0x4c/0xbc
[ 117.864794] printk+0x58/0x6f
[ 117.864795] loop_fun+0x24/0x41 [printmsg_loop]
[ 117.864795] kthread+0x104/0x140
[ 117.864795] ? 0xffffffffc05b1000
[ 117.864796] ? kthread_park+0x80/0x80
[ 117.864796] ret_from_fork+0x35/0x40

Signed-off-by: Feng Sun <loyou85@gmail.com> Signed-off-by: Xiaojun Zhao <xiaojunzhao141@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
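The gist of the fix, sketched below under the assumption that dev_xmit_complete() is used to classify the driver's return code (only NETDEV_TX_BUSY means the skb is still owned by netpoll):

    status = netpoll_start_xmit(np, skb, dev, txq);

    /* Any completed status (NETDEV_TX_OK, NET_XMIT_DROP, negative error)
     * means the driver consumed - and possibly already freed - the skb. */
    if (dev_xmit_complete(status))
            return;

    /* NETDEV_TX_BUSY: the skb is still ours, so queueing and retrying is safe. */
    skb_queue_tail(&npinfo->txq, skb);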
2019-08-27Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netDavid S. Miller
Minor conflict in r8169, bug fix had two versions in net and net-next, take the net-next hunks. Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-24sock: fix potential memory leak in proto_register()zhanglin
If the number of registered protocols exceeds PROTO_INUSE_NR, prot will still be added to proto_list, even though no bit is left for it in proto_inuse_idx. Changes since v2: * Propagate the error code properly Signed-off-by: zhanglin <zhang.lin16@zte.com.cn> Signed-off-by: David S. Miller <davem@davemloft.net>
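A sketch of the corrected flow (labels and exact error code are illustrative): only add prot to proto_list once a bit in proto_inuse_idx has actually been reserved, and propagate the failure otherwise.

    mutex_lock(&proto_list_mutex);
    ret = assign_proto_idx(prot);           /* fails when all PROTO_INUSE_NR bits are taken */
    if (ret) {
            mutex_unlock(&proto_list_mutex);
            goto out_free;                  /* hypothetical cleanup label */
    }
    list_add(&prot->node, &proto_list);
    mutex_unlock(&proto_list_mutex);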
2019-08-24net/core/skmsg: Delete an unnecessary check before the function call “consume_skb”Markus Elfring
The consume_skb() function also performs input parameter validation. Thus, the test around the call is not needed. This issue was detected by using the Coccinelle software. Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-23Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfDavid S. Miller
Daniel Borkmann says:

====================
pull-request: bpf 2019-08-24

The following pull-request contains BPF updates for your *net* tree.

The main changes are:

1) Fix verifier precision tracking with BPF-to-BPF calls, from Alexei.
2) Fix a use-after-free in prog symbol exposure, from Daniel.
3) Several s390x JIT fixes plus BE related fixes in BPF kselftests, from Ilya.
4) Fix memory leak by unpinning XDP umem pages in error path, from Ivan.
5) Fix a potential use-after-free on flow dissector detach, from Jakub.
6) Fix bpftool to close prog fd after showing metadata, from Quentin.
7) BPF kselftest config and TEST_PROGS_EXTENDED fixes, from Anders.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-24bpf: allow narrow loads of some sk_reuseport_md fields with offset > 0Ilya Leoshkevich
test_select_reuseport fails on s390 due to verifier rejecting test_select_reuseport_kern.o with the following message:

; data_check.eth_protocol = reuse_md->eth_protocol;
18: (69) r1 = *(u16 *)(r6 +22)
invalid bpf_context access off=22 size=2

This is because on big-endian machines casts from __u32 to __u16 are generated by referencing the respective variable as __u16 with an offset of 2 (as opposed to 0 on little-endian machines). The verifier already has all the infrastructure in place to allow such accesses, it's just that they are not explicitly enabled for eth_protocol field. Enable them for eth_protocol field by using bpf_ctx_range instead of offsetof. Ditto for ip_protocol, bind_inany and len, since they already allow narrowing, and the same problem can arise when working with them. Fixes: 2dbb9b9e6df6 ("bpf: Introduce BPF_PROG_TYPE_SK_REUSEPORT") Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
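Roughly, the change in the is_valid_access() callback looks like the sketch below (not the exact hunk); bpf_ctx_range() expands to a case range covering every byte of the field, so the verifier's existing narrow-load checks can apply:

    /* before: only a load at the exact field offset is accepted */
    case offsetof(struct sk_reuseport_md, eth_protocol):

    /* after: any byte inside the field is accepted as a narrow access */
    case bpf_ctx_range(struct sk_reuseport_md, eth_protocol):
            bpf_ctx_record_field_size(info, sizeof(__u32));
            return bpf_ctx_narrow_access_ok(off, size, sizeof(__u32));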
2019-08-24flow_dissector: Fix potential use-after-free on BPF_PROG_DETACHJakub Sitnicki
Call to bpf_prog_put(), with help of call_rcu(), queues an RCU-callback to free the program once a grace period has elapsed. The callback can run together with new RCU readers that started after the last grace period. New RCU readers can potentially see the "old" to-be-freed or already-freed pointer to the program object before the RCU update-side NULLs it. Reorder the operations so that the RCU update-side resets the protected pointer before the end of the grace period after which the program will be freed. Fixes: d58e468b1112 ("flow_dissector: implements flow dissector BPF hook") Reported-by: Lorenz Bauer <lmb@cloudflare.com> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Acked-by: Petar Penkov <ppenkov@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
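The safe ordering can be sketched as follows (pointer and lock names are illustrative): clear the RCU-protected pointer first, then drop the program reference, which frees it only after a grace period.

    attached = rcu_dereference_protected(net->flow_dissector_prog,
                                         lockdep_is_held(&flow_dissector_mutex));
    RCU_INIT_POINTER(net->flow_dissector_prog, NULL);
    bpf_prog_put(attached);         /* free deferred via call_rcu() */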
2019-08-23drop_monitor: Make timestamps y2038 safeIdo Schimmel
Timestamps are currently communicated to user space as 'struct timespec', which is not considered y2038 safe since it uses a 32-bit signed value for seconds. Fix this while the API is still not part of any official kernel release by using 64-bit nanosecond timestamps instead. Fixes: ca30707dee2b ("drop_monitor: Add packet alert mode") Fixes: 5e58109b1ea4 ("drop_monitor: Add support for packet alert mode for hardware drops") Signed-off-by: Ido Schimmel <idosch@mellanox.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
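A sketch of the y2038-safe reporting, assuming a 64-bit nanosecond attribute plus the usual padding attribute for 64-bit netlink values (attribute names are illustrative):

    u64 ts = ktime_get_real_ns();   /* 64-bit nanoseconds, y2038 safe */

    if (nla_put_u64_64bit(msg, NET_DM_ATTR_TIMESTAMP, ts, NET_DM_ATTR_PAD))
            goto nla_put_failure;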
2019-08-19tcp: make sure EPOLLOUT wont be missedEric Dumazet
As Jason Baron explained in commit 790ba4566c1a ("tcp: set SOCK_NOSPACE under memory pressure"), it is crucial that we properly set SOCK_NOSPACE when needed. However, Jason's patch had a bug, because the 'nonblocking' status as far as sk_stream_wait_memory() is concerned is governed by the MSG_DONTWAIT flag passed at sendmsg() time:

long timeo = sock_sndtimeo(sk, flags & MSG_DONTWAIT);

So it is very possible that tcp sendmsg() calls sk_stream_wait_memory(), and that sk_stream_wait_memory() returns -EAGAIN with SOCK_NOSPACE cleared, if sk->sk_sndtimeo has been set to a small (but not zero) value. This patch removes the 'noblock' variable since we must always set SOCK_NOSPACE if -EAGAIN is returned. It also renames the do_nonblock label since we might reach this code path even if we were in blocking mode. Fixes: 790ba4566c1a ("tcp: set SOCK_NOSPACE under memory pressure") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Jason Baron <jbaron@akamai.com> Reported-by: Vladimir Rutsky <rutsky@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Acked-by: Jason Baron <jbaron@akamai.com> Signed-off-by: David S. Miller <davem@davemloft.net>
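The invariant the fix enforces can be sketched like this: set SOCK_NOSPACE before waiting, so that a subsequent -EAGAIN always leaves the flag set and EPOLLOUT fires once memory frees up (a simplification of the actual tcp_sendmsg() flow):

    wait_for_memory:
            set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
            err = sk_stream_wait_memory(sk, &timeo);        /* -EAGAIN if timeo elapses */
            if (err != 0)
                    goto do_error;  /* SOCK_NOSPACE stays set, EPOLLOUT will fire later */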
2019-08-19net: flow_offload: convert block_ing_cb_list to regular list typeVlad Buslov
The RCU list block_ing_cb_list is protected by the rcu read lock in flow_block_ing_cmd() and by the flow_indr_block_ing_cb_lock mutex in all functions that use it. However, flow_block_ing_cmd() needs to call blocking functions while iterating block_ing_cb_list, which leads to the following suspicious RCU usage warning:

[ 401.510948] =============================
[ 401.510952] WARNING: suspicious RCU usage
[ 401.510993] 5.3.0-rc3+ #589 Not tainted
[ 401.510996] -----------------------------
[ 401.511001] include/linux/rcupdate.h:265 Illegal context switch in RCU read-side critical section!
[ 401.511004] other info that might help us debug this:
[ 401.511008] rcu_scheduler_active = 2, debug_locks = 1
[ 401.511012] 7 locks held by test-ecmp-add-v/7576:
[ 401.511015] #0: 00000000081d71a5 (sb_writers#4){.+.+}, at: vfs_write+0x166/0x1d0
[ 401.511037] #1: 000000002bd338c3 (&of->mutex){+.+.}, at: kernfs_fop_write+0xef/0x1b0
[ 401.511051] #2: 00000000c921c634 (kn->count#317){.+.+}, at: kernfs_fop_write+0xf7/0x1b0
[ 401.511062] #3: 00000000a19cdd56 (&dev->mutex){....}, at: sriov_numvfs_store+0x6b/0x130
[ 401.511079] #4: 000000005425fa52 (pernet_ops_rwsem){++++}, at: unregister_netdevice_notifier+0x30/0x140
[ 401.511092] #5: 00000000c5822793 (rtnl_mutex){+.+.}, at: unregister_netdevice_notifier+0x35/0x140
[ 401.511101] #6: 00000000c2f3507e (rcu_read_lock){....}, at: flow_block_ing_cmd+0x5/0x130
[ 401.511115] stack backtrace:
[ 401.511121] CPU: 21 PID: 7576 Comm: test-ecmp-add-v Not tainted 5.3.0-rc3+ #589
[ 401.511124] Hardware name: Supermicro SYS-2028TP-DECR/X10DRT-P, BIOS 2.0b 03/30/2017
[ 401.511127] Call Trace:
[ 401.511138] dump_stack+0x85/0xc0
[ 401.511146] ___might_sleep+0x100/0x180
[ 401.511154] __mutex_lock+0x5b/0x960
[ 401.511162] ? find_held_lock+0x2b/0x80
[ 401.511173] ? __tcf_get_next_chain+0x1d/0xb0
[ 401.511179] ? mark_held_locks+0x49/0x70
[ 401.511194] ? __tcf_get_next_chain+0x1d/0xb0
[ 401.511198] __tcf_get_next_chain+0x1d/0xb0
[ 401.511251] ? uplink_rep_async_event+0x70/0x70 [mlx5_core]
[ 401.511261] tcf_block_playback_offloads+0x39/0x160
[ 401.511276] tcf_block_setup+0x1b0/0x240
[ 401.511312] ? mlx5e_rep_indr_setup_tc_cb+0xca/0x290 [mlx5_core]
[ 401.511347] ? mlx5e_rep_indr_tc_block_unbind+0x50/0x50 [mlx5_core]
[ 401.511359] tc_indr_block_get_and_ing_cmd+0x11b/0x1e0
[ 401.511404] ? mlx5e_rep_indr_tc_block_unbind+0x50/0x50 [mlx5_core]
[ 401.511414] flow_block_ing_cmd+0x7e/0x130
[ 401.511453] ? mlx5e_rep_indr_tc_block_unbind+0x50/0x50 [mlx5_core]
[ 401.511462] __flow_indr_block_cb_unregister+0x7f/0xf0
[ 401.511502] mlx5e_nic_rep_netdevice_event+0x75/0xb0 [mlx5_core]
[ 401.511513] unregister_netdevice_notifier+0xe9/0x140
[ 401.511554] mlx5e_cleanup_rep_tx+0x6f/0xe0 [mlx5_core]
[ 401.511597] mlx5e_detach_netdev+0x4b/0x60 [mlx5_core]
[ 401.511637] mlx5e_vport_rep_unload+0x71/0xc0 [mlx5_core]
[ 401.511679] esw_offloads_disable+0x5b/0x90 [mlx5_core]
[ 401.511724] mlx5_eswitch_disable.cold+0xdf/0x176 [mlx5_core]
[ 401.511759] mlx5_device_disable_sriov+0xab/0xb0 [mlx5_core]
[ 401.511794] mlx5_core_sriov_configure+0xaf/0xd0 [mlx5_core]
[ 401.511805] sriov_numvfs_store+0xf8/0x130
[ 401.511817] kernfs_fop_write+0x122/0x1b0
[ 401.511826] vfs_write+0xdb/0x1d0
[ 401.511835] ksys_write+0x65/0xe0
[ 401.511847] do_syscall_64+0x5c/0xb0
[ 401.511857] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 401.511862] RIP: 0033:0x7fad892d30f8
[ 401.511868] Code: 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 8d 05 25 96 0d 00 8b 00 85 c0 75 17 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 60 c3 0f 1f 80 00 00 00 00 48 83 ec 28 48 89
[ 401.511871] RSP: 002b:00007ffca2a9fad8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 401.511875] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007fad892d30f8
[ 401.511878] RDX: 0000000000000002 RSI: 000055afeb072a90 RDI: 0000000000000001
[ 401.511881] RBP: 000055afeb072a90 R08: 00000000ffffffff R09: 000000000000000a
[ 401.511884] R10: 000055afeb058710 R11: 0000000000000246 R12: 0000000000000002
[ 401.511887] R13: 00007fad893a8780 R14: 0000000000000002 R15: 00007fad893a3740

To fix the described incorrect RCU usage, convert block_ing_cb_list from an RCU list to a regular list and protect it with the flow_indr_block_ing_cb_lock mutex in flow_block_ing_cmd(). Fixes: 1150ab0f1b33 ("flow_offload: support get multi-subsystem block") Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
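A sketch of the resulting locking scheme (entry field names are illustrative): a plain list walked under the existing mutex, so the callbacks invoked from flow_block_ing_cmd() are free to sleep.

    static DEFINE_MUTEX(flow_indr_block_ing_cb_lock);
    static LIST_HEAD(block_ing_cb_list);

    static void flow_block_ing_cmd(struct net_device *dev,
                                   flow_indr_block_bind_cb_t *cb,
                                   void *cb_priv,
                                   enum flow_block_command command)
    {
            struct flow_indr_block_ing_entry *entry;

            mutex_lock(&flow_indr_block_ing_cb_lock);
            list_for_each_entry(entry, &block_ing_cb_list, list)
                    entry->cb(dev, cb, cb_priv, command);   /* may sleep now */
            mutex_unlock(&flow_indr_block_ing_cb_lock);
    }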
2019-08-19Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netDavid S. Miller
Merge conflict of mlx5 resolved using instructions in merge commit 9566e650bf7fdf58384bb06df634f7531ca3a97e. Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-17devlink: Add generic packet traps and groupsIdo Schimmel
Add generic packet traps and groups that can report dropped packets as well as exceptions such as TTL error. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-17devlink: Add packet trap infrastructureIdo Schimmel
Add the basic packet trap infrastructure that allows device drivers to register their supported packet traps and trap groups with devlink. Each driver is expected to provide basic information about each supported trap, such as name and ID, but also the supported metadata types that will accompany each packet trapped via the trap. The currently supported metadata type is just the input port, but more will be added in the future. For example, output port and traffic class. Trap groups allow users to set the action of all member traps. In addition, users can retrieve per-group statistics in case per-trap statistics are too narrow. In the future, the trap group object can be extended with more attributes, such as policer settings which will limit the amount of traffic generated by member traps towards the CPU. Besides registering their packet traps with devlink, drivers are also expected to report trapped packets to devlink along with relevant metadata. devlink will maintain packets and bytes statistics for each packet trap and will potentially report the trapped packet with its metadata to user space via the drop monitor netlink channel. The interface towards the drivers is simple and allows devlink to set the action of the trap. Currently, only two actions are supported: 'trap' and 'drop'. When set to 'trap', the device is expected to provide the sole copy of the packet to the driver which will pass it to devlink. When set to 'drop', the device is expected to drop the packet and not send a copy to the driver. In the future, more actions can be added, such as 'mirror'. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
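From a driver's point of view, usage is expected to look roughly like the sketch below (trap names, macros and exact signatures here are assumptions based on the description, not a verified API reference):

    /* declare the supported traps, e.g. a generic L3 drop trap */
    static const struct devlink_trap my_traps[] = {
            DEVLINK_TRAP_GENERIC(DROP, DROP, BLACKHOLE_ROUTE,
                                 DEVLINK_TRAP_GROUP_GENERIC(L3_DROPS), 0),
    };

    /* register once at probe time ... */
    err = devlink_traps_register(devlink, my_traps, ARRAY_SIZE(my_traps), priv);

    /* ... then report each trapped packet together with its metadata */
    devlink_trap_report(devlink, skb, trap_ctx, in_devlink_port);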
2019-08-17drop_monitor: Allow user to start monitoring hardware dropsIdo Schimmel
Drop monitor has start and stop commands, but so far these were only used to start and stop monitoring of software drops. Now that drop monitor can also monitor hardware drops, we should allow the user to control these as well. Do that by adding SW and HW flags to these commands. If no flag is specified, then only start / stop monitoring software drops. This is done in order to maintain backward-compatibility with existing user space applications. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-17drop_monitor: Add support for summary alert mode for hardware dropsIdo Schimmel
In summary alert mode a notification is sent with a list of recent drop reasons and a count of how many packets were dropped due to this reason. To avoid expensive operations in the context in which packets are dropped, each CPU holds an array whose number of entries is the maximum number of drop reasons that can be encoded in the netlink notification. Each entry stores the drop reason and a count. When a packet is dropped the array is traversed and a new entry is created or the count of an existing entry is incremented. Later, in process context, the array is replaced with a newly allocated copy and the old array is encoded in a netlink notification. To avoid breaking user space, the notification includes the ancillary header, which is 'struct net_dm_alert_msg' with number of entries set to '0'. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
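The per-CPU bookkeeping can be pictured roughly as below (struct and constant names are illustrative): a fixed-size array of {trap name, count} entries updated in the drop path.

    struct hw_drop_entry {                          /* illustrative */
            char trap_name[TRAP_NAME_LEN];
            u32  count;
    };

    /* in the drop path: bump an existing reason or claim a free slot */
    for (i = 0; i < entries->num_entries; i++) {
            if (!strcmp(entries->arr[i].trap_name, metadata->trap_name)) {
                    entries->arr[i].count++;
                    return;
            }
    }
    if (entries->num_entries < MAX_HW_DROP_ENTRIES) {
            strlcpy(entries->arr[i].trap_name, metadata->trap_name, TRAP_NAME_LEN);
            entries->arr[i].count = 1;
            entries->num_entries++;
    }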
2019-08-17drop_monitor: Add support for packet alert mode for hardware dropsIdo Schimmel
In a similar fashion to software drops, extend drop monitor to send netlink events when packets are dropped by the underlying hardware. The main difference is that instead of encoding the program counter (PC) from which kfree_skb() was called in the netlink message, we encode the hardware trap name. The two are mostly equivalent since they should both help the user understand why the packet was dropped. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-17drop_monitor: Consider all monitoring states before performing configurationIdo Schimmel
The drop monitor configuration (e.g., alert mode) is global, but the user will be able to enable monitoring of only software or only hardware drops. Therefore, ensure that monitoring of both software and hardware drops is disabled before allowing drop monitor configuration to take place. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-17drop_monitor: Add basic infrastructure for hardware dropsIdo Schimmel
Export a function that can be invoked in order to report packets that were dropped by the underlying hardware along with metadata. Subsequent patches will add support for the different alert modes. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-17drop_monitor: Initialize hardware per-CPU dataIdo Schimmel
Like software drops, hardware drops also need the same type of per-CPU data. Therefore, initialize it during module initialization and de-initialize it during module exit. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-17drop_monitor: Move per-CPU data init/fini to separate functionsIdo Schimmel
Currently drop monitor only reports software drops to user space, but subsequent patches are going to add support for hardware drops. Like software drops, the per-CPU data of hardware drops needs to be initialized and de-initialized upon module initialization and exit. To avoid code duplication, break this code into separate functions, so that these could be re-used for hardware drops. No functional changes intended. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-15page_pool: fix logic in __page_pool_get_cachedJonathan Lemon
__page_pool_get_cached() will return NULL when the ring is empty, even if there are pages present in the lookaside cache. It is also possible to refill the cache, and then return a NULL page. Restructure the logic to eliminate both cases. Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
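The intended order of operations is roughly as sketched below (the refill helper name is an assumption): serve from the lookaside cache first and only fall back to the ring when the cache is empty, so a non-empty cache can never produce a NULL page.

    static struct page *__page_pool_get_cached(struct page_pool *pool)
    {
            /* fast path: lookaside cache, only safe for the NAPI consumer */
            if (likely(pool->alloc.count))
                    return pool->alloc.cache[--pool->alloc.count];

            /* slow path: refill the cache from the ptr_ring; may return NULL */
            return page_pool_refill_alloc_cache(pool);
    }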
2019-08-15page_pool: remove unnecessary variable initYunsheng Lin
Remove variable initializations in functions that are followed by assignments before use. Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-13net: devlink: remove redundant rtnl lock assertVlad Buslov
It is enough for caller of devlink_compat_switch_id_get() to hold the net device to guarantee that devlink port is not destroyed concurrently. Remove rtnl lock assertion and modify comment to warn user that they must hold either rtnl lock or reference to net device. This is necessary to accommodate future implementation of rtnl-unlocked TC offloads driver callbacks. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
2019-08-13Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextJakub Kicinski
Daniel Borkmann says:

====================
The following pull-request contains BPF updates for your *net-next* tree.

There is a small merge conflict in libbpf (Cc Andrii so he's in the loop as well):

        for (i = 1; i <= btf__get_nr_types(btf); i++) {
                t = (struct btf_type *)btf__type_by_id(btf, i);
                if (!has_datasec && btf_is_var(t)) {
                        /* replace VAR with INT */
                        t->info = BTF_INFO_ENC(BTF_KIND_INT, 0, 0);
        <<<<<<< HEAD
                        /*
                         * using size = 1 is the safest choice, 4 will be too
                         * big and cause kernel BTF validation failure if
                         * original variable took less than 4 bytes
                         */
                        t->size = 1;
                        *(int *)(t+1) = BTF_INT_ENC(0, 0, 8);
                } else if (!has_datasec && kind == BTF_KIND_DATASEC) {
        =======
                        t->size = sizeof(int);
                        *(int *)(t + 1) = BTF_INT_ENC(0, 0, 32);
                } else if (!has_datasec && btf_is_datasec(t)) {
        >>>>>>> 72ef80b5ee131e96172f19e74b4f98fa3404efe8
                        /* replace DATASEC with STRUCT */

Conflict is between the two commits 1d4126c4e119 ("libbpf: sanitize VAR to conservative 1-byte INT") and b03bc6853c0e ("libbpf: convert libbpf code to use new btf helpers"), so we need to pick the sanitation fixup as well as use the new btf_is_datasec() helper and the whitespace cleanup. Looks like the following:

        [...]
        if (!has_datasec && btf_is_var(t)) {
                /* replace VAR with INT */
                t->info = BTF_INFO_ENC(BTF_KIND_INT, 0, 0);
                /*
                 * using size = 1 is the safest choice, 4 will be too
                 * big and cause kernel BTF validation failure if
                 * original variable took less than 4 bytes
                 */
                t->size = 1;
                *(int *)(t + 1) = BTF_INT_ENC(0, 0, 8);
        } else if (!has_datasec && btf_is_datasec(t)) {
                /* replace DATASEC with STRUCT */
        [...]

The main changes are:

1) Addition of core parts of compile once - run everywhere (co-re) effort, that is, relocation of fields offsets in libbpf as well as exposure of kernel's own BTF via sysfs and loading through libbpf, from Andrii. More info on co-re: http://vger.kernel.org/bpfconf2019.html#session-2 and http://vger.kernel.org/lpc-bpf2018.html#session-2
2) Enable passing input flags to the BPF flow dissector to customize parsing and allowing it to stop early similar to the C based one, from Stanislav.
3) Add a BPF helper function that allows generating SYN cookies from XDP and tc BPF, from Petar.
4) Add devmap hash-based map type for more flexibility in device lookup for redirects, from Toke.
5) Improvements to XDP forwarding sample code now utilizing recently enabled devmap lookups, from Jesper.
6) Add support for reporting the effective cgroup progs in bpftool, from Jakub and Takshak.
7) Fix reading kernel config from bpftool via /proc/config.gz, from Peter.
8) Fix AF_XDP umem pages mapping for 32 bit architectures, from Ivan.
9) Follow-up to add two more BPF loop tests for the selftest suite, from Alexei.
10) Add perf event output helper also for other skb-based program types, from Allan.
11) Fix a co-re related compilation error in selftests, from Yonghong.
====================

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
2019-08-13devlink: send notifications for deleted snapshots on region destroyJiri Pirko
Currently the notifications for deleted snapshots are sent only in case user deletes a snapshot manually. Send the notifications in case region is destroyed too. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
2019-08-11drop_monitor: Expose tail drop counterIdo Schimmel
The previous patch made the length of the per-CPU skb drop list configurable. Expose a counter that shows how many packets could not be enqueued to this list. This allows users to determine the desired queue length. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-11drop_monitor: Make drop queue length configurableIdo Schimmel
In packet alert mode, each CPU holds a list of dropped skbs that need to be processed in process context and sent to user space. To avoid exhausting the system's memory the maximum length of this queue is currently set to 1000. Allow users to tune the length of this queue according to their needs. The configured length is reported to user space when drop monitor configuration is queried. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-11drop_monitor: Add a command to query current configurationIdo Schimmel
Users should be able to query the current configuration of drop monitor before they start using it. Add a command to query the existing configuration which currently consists of alert mode and packet truncation length. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-11drop_monitor: Allow truncation of dropped packetsIdo Schimmel
When sending dropped packets to user space it is not always necessary to copy the entire packet as usually only the headers are of interest. Allow user to specify the truncation length and add the original length of the packet as additional metadata to the netlink message. By default no truncation is performed. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-11drop_monitor: Add packet alert modeIdo Schimmel
So far drop monitor supported only one alert mode in which a summary of locations in which packets were recently dropped was sent to user space. This alert mode is sufficient in order to understand that packets were dropped, but lacks information to perform a more detailed analysis. Add a new alert mode in which the dropped packet itself is passed to user space along with metadata: The drop location (as program counter and resolved symbol), ingress netdevice and drop timestamp. More metadata can be added in the future. To avoid performing expensive operations in the context in which kfree_skb() is invoked (can be hard IRQ), the dropped skb is cloned and queued on per-CPU skb drop list. Then, in process context the netlink message is allocated, prepared and finally sent to user space. The per-CPU skb drop list is limited to 1000 skbs to prevent exhausting the system's memory. Subsequent patches will make this limit configurable and also add a counter that indicates how many skbs were tail dropped. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
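The probe-side half described above can be sketched as follows (simplified: names mirror the description, and the real code also stashes the drop location in the skb control block and, in later patches, counts tail drops):

    static void net_dm_packet_trace_kfree_skb_hit(void *ignore,
                                                  struct sk_buff *skb,
                                                  void *location)
    {
            struct per_cpu_dm_data *data = this_cpu_ptr(&dm_cpu_data);
            struct sk_buff *nskb;
            unsigned long flags;

            nskb = skb_clone(skb, GFP_ATOMIC);      /* cheap enough for hard IRQ */
            if (!nskb)
                    return;

            spin_lock_irqsave(&data->drop_queue.lock, flags);
            if (skb_queue_len(&data->drop_queue) < NET_DM_MAX_QUEUE_LEN)
                    __skb_queue_tail(&data->drop_queue, nskb);
            else
                    goto unlock_free;
            spin_unlock_irqrestore(&data->drop_queue.lock, flags);

            schedule_work(&data->dm_alert_work);    /* netlink msg built in process context */
            return;

    unlock_free:
            spin_unlock_irqrestore(&data->drop_queue.lock, flags);
            consume_skb(nskb);
    }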
2019-08-11drop_monitor: Add alert mode operationsIdo Schimmel
The next patch is going to add another alert mode in which the dropped packet is notified to user space, instead of only a summary of recent drops. Abstract the differences between the modes by adding alert mode operations. The operations are selected based on the currently configured mode and associated with the probes and the work item just before tracing starts. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-11drop_monitor: Require CAP_NET_ADMIN for drop monitor configurationIdo Schimmel
Currently, the configure command does not do anything but return an error. Subsequent patches will enable the command to change various configuration options such as alert mode and packet truncation. Similar to other netlink-based configuration channels, make sure only users with the CAP_NET_ADMIN capability set can execute this command. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
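For a generic netlink command this is typically a one-flag change; a sketch of the idea (the op shown is illustrative):

    static const struct genl_ops dropmon_ops[] = {
            {
                    .cmd   = NET_DM_CMD_CONFIG,
                    .flags = GENL_ADMIN_PERM,       /* require CAP_NET_ADMIN */
                    .doit  = net_dm_cmd_config,
            },
            /* ... */
    };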
2019-08-11drop_monitor: Reset per-CPU data before starting to traceIdo Schimmel
The function reset_per_cpu_data() allocates and prepares a new skb for the summary netlink alert message ('NET_DM_CMD_ALERT'). The new skb is stored in the per-CPU 'data' variable and the old is returned. The function is invoked during module initialization and from the workqueue, before an alert is sent. This means that it is possible to receive an alert with stale data, if we stopped tracing when the hysteresis timer ('data->send_timer') was pending. Instead of invoking the function during module initialization, invoke it just before we start tracing and ensure we get a fresh skb. This also allows us to remove the calls to initialize the timer and the work item from the module initialization path, since both could have been triggered by the error paths of reset_per_cpu_data(). Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-11drop_monitor: Initialize timer and work item upon tracing enableIdo Schimmel
The timer and work item are currently initialized once during module init, but subsequent patches will need to associate different functions with the work item, based on the configured alert mode. Allow subsequent patches to make that change by initializing and de-initializing these objects during tracing enable and disable. This also guarantees that once the request to disable tracing returns, no more netlink notifications will be generated. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-11drop_monitor: Split tracing enable / disable to different functionsIdo Schimmel
Subsequent patches will need to enable / disable tracing based on the configured alerting mode. Reduce the nesting level and prepare for the introduction of this functionality by splitting the tracing enable / disable operations into two different functions. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-09sock: make cookie generation global instead of per netnsDaniel Borkmann
Generating and retrieving socket cookies is a useful feature that is exposed to BPF for various program types through the bpf_get_socket_cookie() helper. The fact that the cookie counter is per netns is quite a limitation for BPF in practice, in particular for programs in the host namespace that use socket cookies as part of a map lookup key, since they will be causing socket cookie collisions e.g. when attached to BPF cgroup hooks or cls_bpf on tc egress in the host namespace handling container traffic from veth or ipvlan devices with a peer in a different netns. Change the counter to be global instead. Socket cookie consumers must assume the value is opaque in any case. Not every socket must have a cookie generated and knowledge of the counter value itself does not provide much value either way, hence conversion to global is fine. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Eric Dumazet <edumazet@google.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Willem de Bruijn <willemb@google.com> Cc: Martynas Pumputis <m@lambda.lt> Signed-off-by: David S. Miller <davem@davemloft.net>
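The counter change amounts to something like the sketch below: one file-scope atomic64 shared by all namespaces, with the cookie lazily assigned per socket.

    static atomic64_t sock_net_cookie;      /* was: a per-struct-net counter */

    u64 sock_gen_cookie(struct sock *sk)
    {
            u64 res = atomic64_read(&sk->sk_cookie);

            while (!res) {
                    u64 new = atomic64_inc_return(&sock_net_cookie);

                    /* first writer wins; losers re-read the winner's value */
                    atomic64_cmpxchg(&sk->sk_cookie, 0, new);
                    res = atomic64_read(&sk->sk_cookie);
            }
            return res;
    }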
2019-08-09devlink: remove pointless data_len arg from region snapshot createJiri Pirko
The size of the snapshot has to be the same as the size of the region, therefore no need to pass it again during snapshot creation. Remove the arg and use region->size instead. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-08net/tls: prevent skb_orphan() from leaking TLS plain text with offloadJakub Kicinski
sk_validate_xmit_skb() and drivers depend on the sk member of struct sk_buff to identify segments requiring encryption. Any operation which removes or does not preserve the original TLS socket such as skb_orphan() or skb_clone() will cause clear text leaks. Make the TCP socket underlying an offloaded TLS connection mark all skbs as decrypted, if TLS TX is in offload mode. Then in sk_validate_xmit_skb() catch skbs which have no socket (or a socket with no validation) and decrypted flag set. Note that CONFIG_SOCK_VALIDATE_XMIT, CONFIG_TLS_DEVICE and sk->sk_validate_xmit_skb are slightly interchangeable right now, they all imply TLS offload. The new checks are guarded by CONFIG_TLS_DEVICE because that's the option guarding the sk_buff->decrypted member. Second, smaller issue with orphaning is that it breaks the guarantee that packets will be delivered to device queues in-order. All TLS offload drivers depend on that scheduling property. This means skb_orphan_partial()'s trick of preserving partial socket references will cause issues in the drivers. We need a full orphan, and as a result netem delay/throttling will cause all TLS offload skbs to be dropped. Reusing the sk_buff->decrypted flag also protects from leaking clear text when incoming, decrypted skb is redirected (e.g. by TC). See commit 0608c69c9a80 ("bpf: sk_msg, sock{map|hash} redirect through ULP") for justification why the internal flag is safe. The only location which could leak the flag in is tcp_bpf_sendmsg(), which is taken care of by clearing the previously unused bit. v2: - remove superfluous decrypted mark copy (Willem); - remove the stale doc entry (Boris); - rely entirely on EOR marking to prevent coalescing (Boris); - use an internal sendpages flag instead of marking the socket (Boris). v3 (Willem): - reorganize the can_skb_orphan_partial() condition; - fix the flag leak-in through tcp_bpf_sendmsg. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Acked-by: Willem de Bruijn <willemb@google.com> Reviewed-by: Boris Pismenny <borisp@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
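The catch-all check added to the validation path can be sketched as below, guarded by CONFIG_TLS_DEVICE since that option gates the sk_buff->decrypted bit (a simplification of the described change):

    static inline struct sk_buff *
    sk_validate_xmit_skb(struct sk_buff *skb, struct net_device *dev)
    {
    #ifdef CONFIG_SOCK_VALIDATE_XMIT
            if (skb->sk && sk_fullsock(skb->sk) && skb->sk->sk_validate_xmit_skb) {
                    skb = skb->sk->sk_validate_xmit_skb(skb->sk, dev, skb);
    #ifdef CONFIG_TLS_DEVICE
            } else if (unlikely(skb->decrypted)) {
                    /* clear text marked for encryption but no socket to do it: drop */
                    pr_warn_ratelimited("unencrypted skb with no associated socket - dropping\n");
                    kfree_skb(skb);
                    skb = NULL;
    #endif
            }
    #endif
            return skb;
    }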
2019-08-08flow_offload: support get multi-subsystem blockwenxu
Provide a callback list to find the blocks of the tc and nft subsystems. Signed-off-by: wenxu <wenxu@ucloud.cn> Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-08flow_offload: move tc indirect block to flow offloadwenxu
Move the tc indirect block infrastructure to flow_offload and rename it to flow indirect block. This allows nf_tables to use the indirect block architecture. Signed-off-by: wenxu <wenxu@ucloud.cn> Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-08net: use listified RX for handling GRO_NORMAL skbsEdward Cree
When GRO decides not to coalesce a packet, in napi_frags_finish(), instead of passing it to the stack immediately, place it on a list in the napi struct. Then, at flush time (napi_complete_done(), napi_poll(), or napi_busy_loop()), call netif_receive_skb_list_internal() on the list. We'd like to do that in napi_gro_flush(), but it's not called if !napi->gro_bitmask, so we have to do it in the callers instead. (There are a handful of drivers that call napi_gro_flush() themselves, but it's not clear why, or whether this will affect them.) Because a full 64 packets is an inefficiently large batch, also consume the list whenever it exceeds gro_normal_batch, a new net/core sysctl that defaults to 8. Signed-off-by: Edward Cree <ecree@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
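The batching itself is small; below is a sketch consistent with the description (rx_list/rx_count on struct napi_struct and the gro_normal_batch sysctl):

    /* pass the accumulated GRO_NORMAL skbs up the stack as one list */
    static void gro_normal_list(struct napi_struct *napi)
    {
            if (!napi->rx_count)
                    return;
            netif_receive_skb_list_internal(&napi->rx_list);
            INIT_LIST_HEAD(&napi->rx_list);
            napi->rx_count = 0;
    }

    /* queue one skb, flushing the list once it reaches gro_normal_batch */
    static void gro_normal_one(struct napi_struct *napi, struct sk_buff *skb)
    {
            list_add_tail(&skb->list, &napi->rx_list);
            if (++napi->rx_count >= gro_normal_batch)
                    gro_normal_list(napi);
    }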
2019-08-06Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netDavid S. Miller
Just minor overlapping changes in the conflicts here. Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-06drop_monitor: Use pre_doit / post_doit hooksIdo Schimmel
Each operation from user space should be protected by the global drop monitor mutex. Use the pre_doit / post_doit hooks to take / release the lock instead of doing it explicitly in each function. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
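With generic netlink this can be expressed by pointing the family's pre_doit/post_doit hooks at lock helpers, roughly as sketched below (mutex and family names follow the commit text):

    static int net_dm_nl_pre_doit(const struct genl_ops *ops,
                                  struct sk_buff *skb, struct genl_info *info)
    {
            mutex_lock(&net_dm_mutex);      /* held across every doit() */
            return 0;
    }

    static void net_dm_nl_post_doit(const struct genl_ops *ops,
                                    struct sk_buff *skb, struct genl_info *info)
    {
            mutex_unlock(&net_dm_mutex);
    }

    static struct genl_family net_drop_monitor_family = {
            /* ... */
            .pre_doit  = net_dm_nl_pre_doit,
            .post_doit = net_dm_nl_post_doit,
    };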
2019-08-06drop_monitor: Add extack supportIdo Schimmel
Add various extack messages to make drop_monitor more user friendly. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-06drop_monitor: Avoid multiple blank linesIdo Schimmel
Remove multiple blank lines which are visually annoying and useless. This suppresses the "Please don't use multiple blank lines" checkpatch messages. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-06drop_monitor: Document scope of spinlockIdo Schimmel
While 'per_cpu_dm_data' is a per-CPU variable, its 'skb' and 'send_timer' fields can be accessed concurrently by the CPU sending the netlink notification to user space from the workqueue and the CPU tracing kfree_skb(). This spinlock is meant to protect against that. Document its scope and suppress the checkpatch message "spinlock_t definition without comment". Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>