summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2019-02-12net: sched: prevent insertion of new classifiers during chain flushVlad Buslov
Extend tcf_chain with 'flushing' flag. Use the flag to prevent insertion of new classifier instances when chain flushing is in progress in order to prevent resource leak when tcf_proto is created by unlocked users concurrently. Return EAGAIN error from tcf_chain_tp_insert_unique() to restart tc_new_tfilter() and lookup the chain/proto again. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-12net: sched: refactor tp insert/delete for concurrent executionVlad Buslov
Implement unique insertion function to atomically attach tcf_proto to chain after verifying that no other tcf proto with specified priority exists. Implement delete function that verifies that tp is actually empty before deleting it. Use these functions to refactor cls API to account for concurrent tp and rule update instead of relying on rtnl lock. Add new 'deleting' flag to tcf proto. Use it to restart search when iterating over tp's on chain to prevent accessing potentially inval tp->next pointer. Extend tcf proto with spinlock that is intended to be used to protect its data from concurrent modification instead of relying on rtnl mutex. Use it to protect 'deleting' flag. Add lockdep macros to validate that lock is held when accessing protected fields. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-12net: sched: traverse classifiers in chain with tcf_get_next_proto()Vlad Buslov
All users of chain->filters_chain rely on rtnl lock and assume that no new classifier instances are added when traversing the list. Use tcf_get_next_proto() to traverse filters list without relying on rtnl mutex. This function iterates over classifiers by taking reference to current iterator classifier only and doesn't assume external synchronization of filters list. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-12net: sched: introduce reference counting for tcf_protoVlad Buslov
In order to remove dependency on rtnl lock and allow concurrent tcf_proto modification, extend tcf_proto with reference counter. Implement helper get/put functions for tcf proto and use them to modify cls API to always take reference to tcf_proto while using it. Only release reference to parent chain after releasing last reference to tp. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-12net: sched: protect filter_chain list with filter_chain_lock mutexVlad Buslov
Extend tcf_chain with new filter_chain_lock mutex. Always lock the chain when accessing filter_chain list, instead of relying on rtnl lock. Dereference filter_chain with tcf_chain_dereference() lockdep macro to verify that all users of chain_list have the lock taken. Rearrange tp insert/remove code in tc_new_tfilter/tc_del_tfilter to execute all necessary code while holding chain lock in order to prevent invalidation of chain_info structure by potential concurrent change. This also serializes calls to tcf_chain0_head_change(), which allows head change callbacks to rely on filter_chain_lock for synchronization instead of rtnl mutex. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-12net: sched: protect chain template accesses with block lockVlad Buslov
When cls API is called without protection of rtnl lock, parallel modification of chain is possible, which means that chain template can be changed concurrently in certain circumstances. For example, when chain is 'deleted' by new user-space chain API, the chain might continue to be used if it is referenced by actions, and can be 're-created' again by user. In such case same chain structure is reused and its template is changed. To protect from described scenario, cache chain template while holding block lock. Introduce standalone tc_chain_notify_delete() function that works with cached template values, instead of chains themselves. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-12net: sched: traverse chains in block with tcf_get_next_chain()Vlad Buslov
All users of block->chain_list rely on rtnl lock and assume that no new chains are added when traversing the list. Use tcf_get_next_chain() to traverse chain list without relying on rtnl mutex. This function iterates over chains by taking reference to current iterator chain only and doesn't assume external synchronization of chain list. Don't take reference to all chains in block when flushing and use tcf_get_next_chain() to safely iterate over chain list instead. Remove tcf_block_put_all_chains() that is no longer used. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-12net: sched: protect block->chain0 with block->lockVlad Buslov
In order to remove dependency on rtnl lock, use block->lock to protect chain0 struct from concurrent modification. Rearrange code in chain0 callback add and del functions to only access chain0 when block->lock is held. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-12net: sched: refactor tc_ctl_chain() to use block->lockVlad Buslov
In order to remove dependency on rtnl lock, modify chain API to use block->lock to protect chain from concurrent modification. Rearrange tc_ctl_chain() code to call tcf_chain_hold() while holding block->lock to prevent concurrent chain removal. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-12net: sched: protect chain->explicitly_created with block->lockVlad Buslov
In order to remove dependency on rtnl lock, protect tcf_chain->explicitly_created flag with block->lock. Consolidate code that checks and resets 'explicitly_created' flag into __tcf_chain_put() to execute it atomically with rest of code that puts chain reference. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-12net: sched: protect block state with mutexVlad Buslov
Currently, tcf_block doesn't use any synchronization mechanisms to protect critical sections that manage lifetime of its chains. block->chain_list and multiple variables in tcf_chain that control its lifetime assume external synchronization provided by global rtnl lock. Converting chain reference counting to atomic reference counters is not possible because cls API uses multiple counters and flags to control chain lifetime, so all of them must be synchronized in chain get/put code. Use single per-block lock to protect block data and manage lifetime of all chains on the block. Always take block->lock when accessing chain_list. Chain get and put modify chain lifetime-management data and parent block's chain_list, so take the lock in these functions. Verify block->lock state with assertions in functions that expect to be called with the lock taken and are called from multiple places. Take block->lock when accessing filter_chain_list. In order to allow parallel update of rules on single block, move all calls to classifiers outside of critical sections protected by new block->lock. Rearrange chain get and put functions code to only access protected chain data while holding block lock: - Rearrange code to only access chain reference counter and chain action reference counter while holding block lock. - Extract code that requires block->lock from tcf_chain_destroy() into standalone tcf_chain_destroy() function that is called by __tcf_chain_put() in same critical section that changes chain reference counters. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-12net/packet: fix 4gb buffer limit due to overflow checkKal Conley
When calculating rb->frames_per_block * req->tp_block_nr the result can overflow. Check it for overflow without limiting the total buffer size to UINT_MAX. This change fixes support for packet ring buffers >= UINT_MAX. Fixes: 8f8d28e4d6d8 ("net/packet: fix overflow in check for tp_frame_nr") Signed-off-by: Kal Conley <kal.conley@dectris.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-12inet_diag: fix reporting cgroup classid and fallback to priorityKonstantin Khlebnikov
Field idiag_ext in struct inet_diag_req_v2 used as bitmap of requested extensions has only 8 bits. Thus extensions starting from DCTCPINFO cannot be requested directly. Some of them included into response unconditionally or hook into some of lower 8 bits. Extension INET_DIAG_CLASS_ID has not way to request from the beginning. This patch bundle it with INET_DIAG_TCLASS (ipv6 tos), fixes space reservation, and documents behavior for other extensions. Also this patch adds fallback to reporting socket priority. This filed is more widely used for traffic classification because ipv4 sockets automatically maps TOS to priority and default qdisc pfifo_fast knows about that. But priority could be changed via setsockopt SO_PRIORITY so INET_DIAG_TOS isn't enough for predicting class. Also cgroup2 obsoletes net_cls classid (it always zero), but we cannot reuse this field for reporting cgroup2 id because it is 64-bit (ino+gen). So, after this patch INET_DIAG_CLASS_ID will report socket priority for most common setup when net_cls isn't set and/or cgroup2 in use. Fixes: 0888e372c37f ("net: inet: diag: expose sockets cgroup classid") Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-12batman-adv: fix uninit-value in batadv_interface_tx()Eric Dumazet
KMSAN reported batadv_interface_tx() was possibly using a garbage value [1] batadv_get_vid() does have a pskb_may_pull() call but batadv_interface_tx() does not actually make sure this did not fail. [1] BUG: KMSAN: uninit-value in batadv_interface_tx+0x908/0x1e40 net/batman-adv/soft-interface.c:231 CPU: 0 PID: 10006 Comm: syz-executor469 Not tainted 4.20.0-rc7+ #5 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x173/0x1d0 lib/dump_stack.c:113 kmsan_report+0x12e/0x2a0 mm/kmsan/kmsan.c:613 __msan_warning+0x82/0xf0 mm/kmsan/kmsan_instr.c:313 batadv_interface_tx+0x908/0x1e40 net/batman-adv/soft-interface.c:231 __netdev_start_xmit include/linux/netdevice.h:4356 [inline] netdev_start_xmit include/linux/netdevice.h:4365 [inline] xmit_one net/core/dev.c:3257 [inline] dev_hard_start_xmit+0x607/0xc40 net/core/dev.c:3273 __dev_queue_xmit+0x2e42/0x3bc0 net/core/dev.c:3843 dev_queue_xmit+0x4b/0x60 net/core/dev.c:3876 packet_snd net/packet/af_packet.c:2928 [inline] packet_sendmsg+0x8306/0x8f30 net/packet/af_packet.c:2953 sock_sendmsg_nosec net/socket.c:621 [inline] sock_sendmsg net/socket.c:631 [inline] __sys_sendto+0x8c4/0xac0 net/socket.c:1788 __do_sys_sendto net/socket.c:1800 [inline] __se_sys_sendto+0x107/0x130 net/socket.c:1796 __x64_sys_sendto+0x6e/0x90 net/socket.c:1796 do_syscall_64+0xbc/0xf0 arch/x86/entry/common.c:291 entry_SYSCALL_64_after_hwframe+0x63/0xe7 RIP: 0033:0x441889 Code: 18 89 d0 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 bb 10 fc ff c3 66 2e 0f 1f 84 00 00 00 00 RSP: 002b:00007ffdda6fd468 EFLAGS: 00000216 ORIG_RAX: 000000000000002c RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 0000000000441889 RDX: 000000000000000e RSI: 00000000200000c0 RDI: 0000000000000003 RBP: 0000000000000003 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000216 R12: 00007ffdda6fd4c0 R13: 00007ffdda6fd4b0 R14: 0000000000000000 R15: 0000000000000000 Uninit was created at: kmsan_save_stack_with_flags mm/kmsan/kmsan.c:204 [inline] kmsan_internal_poison_shadow+0x92/0x150 mm/kmsan/kmsan.c:158 kmsan_kmalloc+0xa6/0x130 mm/kmsan/kmsan_hooks.c:176 kmsan_slab_alloc+0xe/0x10 mm/kmsan/kmsan_hooks.c:185 slab_post_alloc_hook mm/slab.h:446 [inline] slab_alloc_node mm/slub.c:2759 [inline] __kmalloc_node_track_caller+0xe18/0x1030 mm/slub.c:4383 __kmalloc_reserve net/core/skbuff.c:137 [inline] __alloc_skb+0x309/0xa20 net/core/skbuff.c:205 alloc_skb include/linux/skbuff.h:998 [inline] alloc_skb_with_frags+0x1c7/0xac0 net/core/skbuff.c:5220 sock_alloc_send_pskb+0xafd/0x10e0 net/core/sock.c:2083 packet_alloc_skb net/packet/af_packet.c:2781 [inline] packet_snd net/packet/af_packet.c:2872 [inline] packet_sendmsg+0x661a/0x8f30 net/packet/af_packet.c:2953 sock_sendmsg_nosec net/socket.c:621 [inline] sock_sendmsg net/socket.c:631 [inline] __sys_sendto+0x8c4/0xac0 net/socket.c:1788 __do_sys_sendto net/socket.c:1800 [inline] __se_sys_sendto+0x107/0x130 net/socket.c:1796 __x64_sys_sendto+0x6e/0x90 net/socket.c:1796 do_syscall_64+0xbc/0xf0 arch/x86/entry/common.c:291 entry_SYSCALL_64_after_hwframe+0x63/0xe7 Fixes: c6c8fea29769 ("net: Add batman-adv meshing protocol") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Cc: Marek Lindner <mareklindner@neomailbox.ch> Cc: Simon Wunderlich <sw@simonwunderlich.de> Cc: Antonio Quartulli <a@unstable.cc> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-12ipv6: propagate genlmsg_reply return codeLi RongQing
genlmsg_reply can fail, so propagate its return code Fixes: 915d7e5e593 ("ipv6: sr: add code base for control plane support of SR-IPv6") Signed-off-by: Li RongQing <lirongqing@baidu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-12net/tls: Do not use async crypto for non-data recordsVakul Garg
Addition of tls1.3 support broke tls1.2 handshake when async crypto accelerator is used. This is because the record type for non-data records is not propagated to user application. Also when async decryption happens, the decryption does not stop when two different types of records get dequeued and submitted for decryption. To address it, we decrypt tls1.2 non-data records in synchronous way. We check whether the record we just processed has same type as the previous one before checking for async condition and jumping to dequeue next record. Fixes: 130b392c6cd6b ("net: tls: Add tls 1.3 support") Signed-off-by: Vakul Garg <vakul.garg@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-12devlink: use direct return of genlmsg_replyLi RongQing
This can remove redundant check Signed-off-by: Li RongQing <lirongqing@baidu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-12Revert "devlink: Add a generic wake_on_lan port parameter"Vasundhara Volam
This reverts commit b639583f9e36d044ac1b13090ae812266992cbac. As per discussion with Jakub Kicinski and Michal Kubecek, this will be better addressed by soon-too-come ethtool netlink API with additional indication that given configuration request is supposed to be persisted. Also, remove the parameter support from bnxt_en driver. Cc: Jiri Pirko <jiri@mellanox.com> Cc: Michael Chan <michael.chan@broadcom.com> Cc: Michal Kubecek <mkubecek@suse.cz> Suggested-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-12net/smc: check port_idx of ib eventKarsten Graul
For robustness protect of higher port numbers than expected to avoid setting bits behind our port_event_mask. In case of an DEVICE_FATAL event all ports must be checked. The IB_EVENT_GID_CHANGE event is provided in the global event handler, so handle it there. And handle a QP_FATAL event instead of an DEVICE_FATAL event in the qp handler. Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-12net/smc: check connections in smc_lgr_free_workKarsten Graul
Remove the shortcut that smc_lgr_free() would skip the check for existing connections when the link group is not in the link group list. Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-12net/smc: reduce amount of status updates to peerKarsten Graul
In smc_cdc_msg_recv_action() the received cdc message is evaluated. To reduce the number of messaged triggered by this evaluation the logic is streamlined. For the write_blocked condition we do not need to send a response immediately. The remaining conditions can be put together into one if clause. Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-12net/smc: no delay for free tx buffer waitKarsten Graul
When no free transfer buffers are available then a work to call smc_tx_work() is scheduled. Set the schedule delay to zero, because for the out-of-buffers condition the work can start immediately and will block in the called function smc_wr_tx_get_free_slot(), waiting for free buffers. Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-12net/smc: move wake up of close waiterKarsten Graul
Move the call to smc_close_wake_tx_prepared() (which wakes up a possibly waiting close processing that might wait for 'all data sent') to smc_tx_sndbuf_nonempty() (which is the main function to send data). Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-12net/smc: reset cursor update required flagKarsten Graul
When an updated rx_cursor_confirmed field was sent to the peer then reset the cons_curs_upd_req flag. And remove the duplicate reset and cursor update in smc_tx_consumer_update(). Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-12Merge tag 'mac80211-for-davem-2019-02-12' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211 Johannes Berg says: ==================== Just a few fixes: * aggregation session teardown with internal TXQs was continuing to send some frames marked as aggregation, fix from Ilan * IBSS join was missed during firmware restart, should such a thing happen * speculative execution based on the return value of cfg80211_classify8021d() - which is controlled by the sender of the packet - could be problematic in some code using it, prevent it * a few peer measurement fixes ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-12ipvs: fix dependency on nf_defrag_ipv6Andrea Claudi
ipvs relies on nf_defrag_ipv6 module to manage IPv6 fragmentation, but lacks proper Kconfig dependencies and does not explicitly request defrag features. As a result, if netfilter hooks are not loaded, when IPv6 fragmented packet are handled by ipvs only the first fragment makes through. Fix it properly declaring the dependency on Kconfig and registering netfilter hooks on ip_vs_add_service() and ip_vs_new_dest(). Reported-by: Li Shuang <shuali@redhat.com> Signed-off-by: Andrea Claudi <aclaudi@redhat.com> Acked-by: Julian Anastasov <ja@ssi.bg> Acked-by: Simon Horman <horms@verge.net.au> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-02-12netfilter: conntrack: fix cloned unconfirmed skb->_nfct race in ↵Chieh-Min Wang
__nf_conntrack_confirm For bridge(br_flood) or broadcast/multicast packets, they could clone skb with unconfirmed conntrack which break the rule that unconfirmed skb->_nfct is never shared. With nfqueue running on my system, the race can be easily reproduced with following warning calltrace: [13257.707525] CPU: 0 PID: 12132 Comm: main Tainted: P W 4.4.60 #7744 [13257.707568] Hardware name: Qualcomm (Flattened Device Tree) [13257.714700] [<c021f6dc>] (unwind_backtrace) from [<c021bce8>] (show_stack+0x10/0x14) [13257.720253] [<c021bce8>] (show_stack) from [<c0449e10>] (dump_stack+0x94/0xa8) [13257.728240] [<c0449e10>] (dump_stack) from [<c022a7e0>] (warn_slowpath_common+0x94/0xb0) [13257.735268] [<c022a7e0>] (warn_slowpath_common) from [<c022a898>] (warn_slowpath_null+0x1c/0x24) [13257.743519] [<c022a898>] (warn_slowpath_null) from [<c06ee450>] (__nf_conntrack_confirm+0xa8/0x618) [13257.752284] [<c06ee450>] (__nf_conntrack_confirm) from [<c0772670>] (ipv4_confirm+0xb8/0xfc) [13257.761049] [<c0772670>] (ipv4_confirm) from [<c06e7a60>] (nf_iterate+0x48/0xa8) [13257.769725] [<c06e7a60>] (nf_iterate) from [<c06e7af0>] (nf_hook_slow+0x30/0xb0) [13257.777108] [<c06e7af0>] (nf_hook_slow) from [<c07f20b4>] (br_nf_post_routing+0x274/0x31c) [13257.784486] [<c07f20b4>] (br_nf_post_routing) from [<c06e7a60>] (nf_iterate+0x48/0xa8) [13257.792556] [<c06e7a60>] (nf_iterate) from [<c06e7af0>] (nf_hook_slow+0x30/0xb0) [13257.800458] [<c06e7af0>] (nf_hook_slow) from [<c07e5580>] (br_forward_finish+0x94/0xa4) [13257.808010] [<c07e5580>] (br_forward_finish) from [<c07f22ac>] (br_nf_forward_finish+0x150/0x1ac) [13257.815736] [<c07f22ac>] (br_nf_forward_finish) from [<c06e8df0>] (nf_reinject+0x108/0x170) [13257.824762] [<c06e8df0>] (nf_reinject) from [<c06ea854>] (nfqnl_recv_verdict+0x3d8/0x420) [13257.832924] [<c06ea854>] (nfqnl_recv_verdict) from [<c06e940c>] (nfnetlink_rcv_msg+0x158/0x248) [13257.841256] [<c06e940c>] (nfnetlink_rcv_msg) from [<c06e5564>] (netlink_rcv_skb+0x54/0xb0) [13257.849762] [<c06e5564>] (netlink_rcv_skb) from [<c06e4ec8>] (netlink_unicast+0x148/0x23c) [13257.858093] [<c06e4ec8>] (netlink_unicast) from [<c06e5364>] (netlink_sendmsg+0x2ec/0x368) [13257.866348] [<c06e5364>] (netlink_sendmsg) from [<c069fb8c>] (sock_sendmsg+0x34/0x44) [13257.874590] [<c069fb8c>] (sock_sendmsg) from [<c06a03dc>] (___sys_sendmsg+0x1ec/0x200) [13257.882489] [<c06a03dc>] (___sys_sendmsg) from [<c06a11c8>] (__sys_sendmsg+0x3c/0x64) [13257.890300] [<c06a11c8>] (__sys_sendmsg) from [<c0209b40>] (ret_fast_syscall+0x0/0x34) The original code just triggered the warning but do nothing. It will caused the shared conntrack moves to the dying list and the packet be droppped (nf_ct_resolve_clash returns NF_DROP for dying conntrack). - Reproduce steps: +----------------------------+ | br0(bridge) | | | +-+---------+---------+------+ | eth0| | eth1| | eth2| | | | | | | +--+--+ +--+--+ +---+-+ | | | | | | +--+-+ +-+--+ +--+-+ | PC1| | PC2| | PC3| +----+ +----+ +----+ iptables -A FORWARD -m mark --mark 0x1000000/0x1000000 -j NFQUEUE --queue-num 100 --queue-bypass ps: Our nfq userspace program will set mark on packets whose connection has already been processed. PC1 sends broadcast packets simulated by hping3: hping3 --rand-source --udp 192.168.1.255 -i u100 - Broadcast racing flow chart is as follow: br_handle_frame BR_HOOK(NFPROTO_BRIDGE, NF_BR_PRE_ROUTING, br_handle_frame_finish) // skb->_nfct (unconfirmed conntrack) is constructed at PRE_ROUTING stage br_handle_frame_finish // check if this packet is broadcast br_flood_forward br_flood list_for_each_entry_rcu(p, &br->port_list, list) // iterate through each port maybe_deliver deliver_clone skb = skb_clone(skb) __br_forward BR_HOOK(NFPROTO_BRIDGE, NF_BR_FORWARD,...) // queue in our nfq and received by our userspace program // goto __nf_conntrack_confirm with process context on CPU 1 br_pass_frame_up BR_HOOK(NFPROTO_BRIDGE, NF_BR_LOCAL_IN,...) // goto __nf_conntrack_confirm with softirq context on CPU 0 Because conntrack confirm can happen at both INPUT and POSTROUTING stage. So with NFQUEUE running, skb->_nfct with the same unconfirmed conntrack could race on different core. This patch fixes a repeating kernel splat, now it is only displayed once. Signed-off-by: Chieh-Min Wang <chiehminw@synology.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-02-11tipc: fix link session and re-establish issuesTuong Lien
When a link endpoint is re-created (e.g. after a node reboot or interface reset), the link session number is varied by random, the peer endpoint will be synced with this new session number before the link is re-established. However, there is a shortcoming in this mechanism that can lead to the link never re-established or faced with a failure then. It happens when the peer endpoint is ready in ESTABLISHING state, the 'peer_session' as well as the 'in_session' flag have been set, but suddenly this link endpoint leaves. When it comes back with a random session number, there are two situations possible: 1/ If the random session number is larger than (or equal to) the previous one, the peer endpoint will be updated with this new session upon receipt of a RESET_MSG from this endpoint, and the link can be re- established as normal. Otherwise, all the RESET_MSGs from this endpoint will be rejected by the peer. In turn, when this link endpoint receives one ACTIVATE_MSG from the peer, it will move to ESTABLISHED and start to send STATE_MSGs, but again these messages will be dropped by the peer due to wrong session. The peer link endpoint can still become ESTABLISHED after receiving a traffic message from this endpoint (e.g. a BCAST_PROTOCOL or NAME_DISTRIBUTOR), but since all the STATE_MSGs are invalid, the link will be forced down sooner or later! Even in case the random session number is larger than the previous one, it can be that the ACTIVATE_MSG from the peer arrives first, and this link endpoint moves quickly to ESTABLISHED without sending out any RESET_MSG yet. Consequently, the peer link will not be updated with the new session number, and the same link failure scenario as above will happen. 2/ Another situation can be that, the peer link endpoint was reset due to any reasons in the meantime, its link state was set to RESET from ESTABLISHING but still in session, i.e. the 'in_session' flag is not reset... Now, if the random session number from this endpoint is less than the previous one, all the RESET_MSGs from this endpoint will be rejected by the peer. In the other direction, when this link endpoint receives a RESET_MSG from the peer, it moves to ESTABLISHING and starts to send ACTIVATE_MSGs, but all these messages will be rejected by the peer too. As a result, the link cannot be re-established but gets stuck with this link endpoint in state ESTABLISHING and the peer in RESET! Solution: =========== This link endpoint should not go directly to ESTABLISHED when getting ACTIVATE_MSG from the peer which may belong to the old session if the link was re-created. To ensure the session to be correct before the link is re-established, the peer endpoint in ESTABLISHING state will send back the last session number in ACTIVATE_MSG for a verification at this endpoint. Then, if needed, a new and more appropriate session number will be regenerated to force a re-synch first. In addition, when a link in ESTABLISHING state is reset, its state will move to RESET according to the link FSM, along with resetting the 'in_session' flag (and the other data) as a normal link reset, it will also be deleted if requested. The solution is backward compatible. Acked-by: Jon Maloy <jon.maloy@ericsson.com> Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-11devlink: don't allocate attrs on the stackJakub Kicinski
Number of devlink attributes has grown over 128, causing the following warning: ../net/core/devlink.c: In function ‘devlink_nl_cmd_region_read_dumpit’: ../net/core/devlink.c:3740:1: warning: the frame size of 1064 bytes is larger than 1024 bytes [-Wframe-larger-than=] } ^ Since the number of attributes is only going to grow allocate the array dynamically. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-11devlink: fix condition for compat device infoJakub Kicinski
We need the port to be both ethernet and have the rigth netdev, not one or the other. Fixes: ddb6e99e2db1 ("ethtool: add compat for devlink info") Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-11net: fix IPv6 prefix route residueZhiqiang Liu
Follow those steps: # ip addr add 2001:123::1/32 dev eth0 # ip addr add 2001:123:456::2/64 dev eth0 # ip addr del 2001:123::1/32 dev eth0 # ip addr del 2001:123:456::2/64 dev eth0 and then prefix route of 2001:123::1/32 will still exist. This is because ipv6_prefix_equal in check_cleanup_prefix_route func does not check whether two IPv6 addresses have the same prefix length. If the prefix of one address starts with another shorter address prefix, even though their prefix lengths are different, the return value of ipv6_prefix_equal is true. Here I add a check of whether two addresses have the same prefix to decide whether their prefixes are equal. Fixes: 5b84efecb7d9 ("ipv6 addrconf: don't cleanup prefix route for IFA_F_NOPREFIXROUTE") Signed-off-by: Zhiqiang Liu <liuzhiqiang26@huawei.com> Reported-by: Wenhao Zhang <zhangwenhao8@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-11tipc: fix skb may be leaky in tipc_link_inputHoang Le
When we free skb at tipc_data_input, we return a 'false' boolean. Then, skb passed to subcalling tipc_link_input in tipc_link_rcv, <snip> 1303 int tipc_link_rcv: ... 1354 if (!tipc_data_input(l, skb, l->inputq)) 1355 rc |= tipc_link_input(l, skb, l->inputq); </snip> Fix it by simple changing to a 'true' boolean when skb is being free-ed. Then, tipc_link_rcv will bypassed to subcalling tipc_link_input as above condition. Acked-by: Ying Xue <ying.xue@windriver.com> Acked-by: Jon Maloy <maloy@donjonn.com> Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-12netfilter: xt_recent: Use struct_size() in kvzalloc()Gustavo A. R. Silva
One of the more common cases of allocation size calculations is finding the size of a structure that has a zero-sized array at the end, along with memory for some number of elements for that array. For example: struct foo { int stuff; void *entry[]; }; size = sizeof(struct foo) + count * sizeof(void *); instance = alloc(size, GFP_KERNEL) Instead of leaving these open-coded and prone to type mistakes, we can now use the new struct_size() helper: size = struct_size(instance, entry, count); instance = alloc(size, GFP_KERNEL) Notice that, in this case, variable sz is not necessary, hence it is removed. This code was detected with the help of Coccinelle. Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-02-12ipvs: Use struct_size() helperGustavo A. R. Silva
One of the more common cases of allocation size calculations is finding the size of a structure that has a zero-sized array at the end, along with memory for some number of elements for that array. For example: struct foo { int stuff; struct boo entry[]; }; size = sizeof(struct foo) + count * sizeof(struct boo); instance = alloc(size, GFP_KERNEL) Instead of leaving these open-coded and prone to type mistakes, we can now use the new struct_size() helper: size = struct_size(instance, entry, count); This code was detected with the help of Coccinelle. Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Acked-by: Simon Horman <horms+renesas@verge.net.au> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-02-12netfilter: conntrack: fix indentation issueColin Ian King
A statement in an if block is not indented correctly. Fix this. Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-02-12netfilter: compat: initialize all fields in xt_initFrancesco Ruggeri
If a non zero value happens to be in xt[NFPROTO_BRIDGE].cur at init time, the following panic can be caused by running % ebtables -t broute -F BROUTING from a 32-bit user level on a 64-bit kernel. This patch replaces kmalloc_array with kcalloc when allocating xt. [ 474.680846] BUG: unable to handle kernel paging request at 0000000009600920 [ 474.687869] PGD 2037006067 P4D 2037006067 PUD 2038938067 PMD 0 [ 474.693838] Oops: 0000 [#1] SMP [ 474.697055] CPU: 9 PID: 4662 Comm: ebtables Kdump: loaded Not tainted 4.19.17-11302235.AroraKernelnext.fc18.x86_64 #1 [ 474.707721] Hardware name: Supermicro X9DRT/X9DRT, BIOS 3.0 06/28/2013 [ 474.714313] RIP: 0010:xt_compat_calc_jump+0x2f/0x63 [x_tables] [ 474.720201] Code: 40 0f b6 ff 55 31 c0 48 6b ff 70 48 03 3d dc 45 00 00 48 89 e5 8b 4f 6c 4c 8b 47 60 ff c9 39 c8 7f 2f 8d 14 08 d1 fa 48 63 fa <41> 39 34 f8 4c 8d 0c fd 00 00 00 00 73 05 8d 42 01 eb e1 76 05 8d [ 474.739023] RSP: 0018:ffffc9000943fc58 EFLAGS: 00010207 [ 474.744296] RAX: 0000000000000000 RBX: ffffc90006465000 RCX: 0000000002580249 [ 474.751485] RDX: 00000000012c0124 RSI: fffffffff7be17e9 RDI: 00000000012c0124 [ 474.758670] RBP: ffffc9000943fc58 R08: 0000000000000000 R09: ffffffff8117cf8f [ 474.765855] R10: ffffc90006477000 R11: 0000000000000000 R12: 0000000000000001 [ 474.773048] R13: 0000000000000000 R14: ffffc9000943fcb8 R15: ffffc9000943fcb8 [ 474.780234] FS: 0000000000000000(0000) GS:ffff88a03f840000(0063) knlGS:00000000f7ac7700 [ 474.788612] CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033 [ 474.794632] CR2: 0000000009600920 CR3: 0000002037422006 CR4: 00000000000606e0 [ 474.802052] Call Trace: [ 474.804789] compat_do_replace+0x1fb/0x2a3 [ebtables] [ 474.810105] compat_do_ebt_set_ctl+0x69/0xe6 [ebtables] [ 474.815605] ? try_module_get+0x37/0x42 [ 474.819716] compat_nf_setsockopt+0x4f/0x6d [ 474.824172] compat_ip_setsockopt+0x7e/0x8c [ 474.828641] compat_raw_setsockopt+0x16/0x3a [ 474.833220] compat_sock_common_setsockopt+0x1d/0x24 [ 474.838458] __compat_sys_setsockopt+0x17e/0x1b1 [ 474.843343] ? __check_object_size+0x76/0x19a [ 474.847960] __ia32_compat_sys_socketcall+0x1cb/0x25b [ 474.853276] do_fast_syscall_32+0xaf/0xf6 [ 474.857548] entry_SYSENTER_compat+0x6b/0x7a Signed-off-by: Francesco Ruggeri <fruggeri@arista.com> Acked-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-02-11devlink: Add WARN_ON to catch errors of not cleaning devlink objectsParav Pandit
Add WARN_ON to make sure that all sub objects of a devlink device are cleanedup before freeing the devlink device. This helps to catch any driver bugs. Signed-off-by: Parav Pandit <parav@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-11net/x25: do not hold the cpu too long in x25_new_lci()Eric Dumazet
Due to quadratic behavior of x25_new_lci(), syzbot was able to trigger an rcu stall. Fix this by not blocking BH for the whole duration of the function, and inserting a reschedule point when possible. If we care enough, using a bitmap could get rid of the quadratic behavior. syzbot report : rcu: INFO: rcu_preempt self-detected stall on CPU rcu: 0-...!: (10500 ticks this GP) idle=4fa/1/0x4000000000000002 softirq=283376/283376 fqs=0 rcu: (t=10501 jiffies g=383105 q=136) rcu: rcu_preempt kthread starved for 10502 jiffies! g383105 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=0 rcu: RCU grace-period kthread stack dump: rcu_preempt I28928 10 2 0x80000000 Call Trace: context_switch kernel/sched/core.c:2844 [inline] __schedule+0x817/0x1cc0 kernel/sched/core.c:3485 schedule+0x92/0x180 kernel/sched/core.c:3529 schedule_timeout+0x4db/0xfd0 kernel/time/timer.c:1803 rcu_gp_fqs_loop kernel/rcu/tree.c:1948 [inline] rcu_gp_kthread+0x956/0x17a0 kernel/rcu/tree.c:2105 kthread+0x357/0x430 kernel/kthread.c:246 ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:352 NMI backtrace for cpu 0 CPU: 0 PID: 8759 Comm: syz-executor2 Not tainted 5.0.0-rc4+ #51 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: <IRQ> __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x172/0x1f0 lib/dump_stack.c:113 nmi_cpu_backtrace.cold+0x63/0xa4 lib/nmi_backtrace.c:101 nmi_trigger_cpumask_backtrace+0x1be/0x236 lib/nmi_backtrace.c:62 arch_trigger_cpumask_backtrace+0x14/0x20 arch/x86/kernel/apic/hw_nmi.c:38 trigger_single_cpu_backtrace include/linux/nmi.h:164 [inline] rcu_dump_cpu_stacks+0x183/0x1cf kernel/rcu/tree.c:1211 print_cpu_stall kernel/rcu/tree.c:1348 [inline] check_cpu_stall kernel/rcu/tree.c:1422 [inline] rcu_pending kernel/rcu/tree.c:3018 [inline] rcu_check_callbacks.cold+0x500/0xa4a kernel/rcu/tree.c:2521 update_process_times+0x32/0x80 kernel/time/timer.c:1635 tick_sched_handle+0xa2/0x190 kernel/time/tick-sched.c:161 tick_sched_timer+0x47/0x130 kernel/time/tick-sched.c:1271 __run_hrtimer kernel/time/hrtimer.c:1389 [inline] __hrtimer_run_queues+0x33e/0xde0 kernel/time/hrtimer.c:1451 hrtimer_interrupt+0x314/0x770 kernel/time/hrtimer.c:1509 local_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1035 [inline] smp_apic_timer_interrupt+0x120/0x570 arch/x86/kernel/apic/apic.c:1060 apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:807 </IRQ> RIP: 0010:__read_once_size include/linux/compiler.h:193 [inline] RIP: 0010:queued_write_lock_slowpath+0x13e/0x290 kernel/locking/qrwlock.c:86 Code: 00 00 fc ff df 4c 8d 2c 01 41 83 c7 03 41 0f b6 45 00 41 38 c7 7c 08 84 c0 0f 85 0c 01 00 00 8b 03 3d 00 01 00 00 74 1a f3 90 <41> 0f b6 55 00 41 38 d7 7c eb 84 d2 74 e7 48 89 df e8 6c 0f 4f 00 RSP: 0018:ffff88805f117bd8 EFLAGS: 00000206 ORIG_RAX: ffffffffffffff13 RAX: 0000000000000300 RBX: ffffffff89413ba0 RCX: 1ffffffff1282774 RDX: 0000000000000000 RSI: 0000000000000004 RDI: ffffffff89413ba0 RBP: ffff88805f117c70 R08: 1ffffffff1282774 R09: fffffbfff1282775 R10: fffffbfff1282774 R11: ffffffff89413ba3 R12: 00000000000000ff R13: fffffbfff1282774 R14: 1ffff1100be22f7d R15: 0000000000000003 queued_write_lock include/asm-generic/qrwlock.h:104 [inline] do_raw_write_lock+0x1d6/0x290 kernel/locking/spinlock_debug.c:203 __raw_write_lock_bh include/linux/rwlock_api_smp.h:204 [inline] _raw_write_lock_bh+0x3b/0x50 kernel/locking/spinlock.c:312 x25_insert_socket+0x21/0xe0 net/x25/af_x25.c:267 x25_bind+0x273/0x340 net/x25/af_x25.c:705 __sys_bind+0x23f/0x290 net/socket.c:1505 __do_sys_bind net/socket.c:1516 [inline] __se_sys_bind net/socket.c:1514 [inline] __x64_sys_bind+0x73/0xb0 net/socket.c:1514 do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x457e39 Code: ad b8 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 7b b8 fb ff c3 66 2e 0f 1f 84 00 00 00 00 RSP: 002b:00007fafccd0dc78 EFLAGS: 00000246 ORIG_RAX: 0000000000000031 RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 0000000000457e39 RDX: 0000000000000012 RSI: 0000000020000240 RDI: 0000000000000004 RBP: 000000000073bf00 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 00007fafccd0e6d4 R13: 00000000004bdf8b R14: 00000000004ce4b8 R15: 00000000ffffffff Sending NMI from CPU 0 to CPUs 1: NMI backtrace for cpu 1 CPU: 1 PID: 8752 Comm: syz-executor4 Not tainted 5.0.0-rc4+ #51 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:__x25_find_socket+0x78/0x120 net/x25/af_x25.c:328 Code: 89 f8 48 c1 e8 03 80 3c 18 00 0f 85 a6 00 00 00 4d 8b 64 24 68 4d 85 e4 74 7f e8 03 97 3d fb 49 83 ec 68 74 74 e8 f8 96 3d fb <49> 8d bc 24 88 04 00 00 48 89 f8 48 c1 e8 03 0f b6 04 18 84 c0 74 RSP: 0018:ffff8880639efc58 EFLAGS: 00000246 RAX: 0000000000040000 RBX: dffffc0000000000 RCX: ffffc9000e677000 RDX: 0000000000040000 RSI: ffffffff863244b8 RDI: ffff88806a764628 RBP: ffff8880639efc80 R08: ffff8880a80d05c0 R09: fffffbfff1282775 R10: fffffbfff1282774 R11: ffffffff89413ba3 R12: ffff88806a7645c0 R13: 0000000000000001 R14: ffff88809f29ac00 R15: 0000000000000000 FS: 00007fe8d0c58700(0000) GS:ffff8880ae900000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000001b32823000 CR3: 00000000672eb000 CR4: 00000000001406e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: x25_new_lci net/x25/af_x25.c:357 [inline] x25_connect+0x374/0xdf0 net/x25/af_x25.c:786 __sys_connect+0x266/0x330 net/socket.c:1686 __do_sys_connect net/socket.c:1697 [inline] __se_sys_connect net/socket.c:1694 [inline] __x64_sys_connect+0x73/0xb0 net/socket.c:1694 do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x457e39 Code: ad b8 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 7b b8 fb ff c3 66 2e 0f 1f 84 00 00 00 00 RSP: 002b:00007fe8d0c57c78 EFLAGS: 00000246 ORIG_RAX: 000000000000002a RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 0000000000457e39 RDX: 0000000000000012 RSI: 0000000020000200 RDI: 0000000000000004 RBP: 000000000073bf00 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 00007fe8d0c586d4 R13: 00000000004be378 R14: 00000000004ceb00 R15: 00000000ffffffff Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Cc: Andrew Hendry <andrew.hendry@gmail.com> Cc: linux-x25@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-11net: dsa: microchip: add switch offload forwarding supportTristram Ha
The flag offload_fwd_mark is set as the switch can forward frames by itself. This can be considered a fix to a problem introduced in commit c2e866911e254067 where the port membership are not set in sync. The flag offload_fwd_mark just needs to be set in tag_ksz.c to prevent the software bridge from forwarding duplicate multicast frames. Fixes: c2e866911e254067 ("microchip: break KSZ9477 DSA driver into two files") Signed-off-by: Tristram Ha <Tristram.Ha@microchip.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-11Documentation: bring operstate documentation up-to-dateJouke Witteveen
Netlink has moved from bitmasks to group numbers long ago. Signed-off-by: Jouke Witteveen <j.witteveen@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-11netfilter: nat: fix spurious connection timeoutsFlorian Westphal
Sander Eikelenboom bisected a NAT related regression down to the l4proto->manip_pkt indirection removal. I forgot that ICMP(v6) errors (e.g. PKTTOOBIG) can be set as related to the existing conntrack entry. Therefore, when passing the skb to nf_nat_ipv4/6_manip_pkt(), that ended up calling the wrong l4 manip function, as tuple->dst.protonum is the original flows l4 protocol (TCP, UDP, etc). Set the dst protocol field to ICMP(v6), we already have a private copy of the tuple due to the inversion of src/dst. Reported-by: Sander Eikelenboom <linux@eikelenboom.it> Tested-by: Sander Eikelenboom <linux@eikelenboom.it> Fixes: faec18dbb0405 ("netfilter: nat: remove l4proto->manip_pkt") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-02-11netfilter: nf_nat_snmp_basic: add missing length checks in ASN.1 cbsJann Horn
The generic ASN.1 decoder infrastructure doesn't guarantee that callbacks will get as much data as they expect; callbacks have to check the `datalen` parameter before looking at `data`. Make sure that snmp_version() and snmp_helper() don't read/write beyond the end of the packet data. (Also move the assignment to `pdata` down below the check to make it clear that it isn't necessarily a pointer we can use before the `datalen` check.) Fixes: cc2d58634e0f ("netfilter: nf_nat_snmp_basic: use asn1 decoder library") Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-02-11mac80211: Fix Tx aggregation session tear down with ITXQsIlan Peer
When mac80211 requests the low level driver to stop an ongoing Tx aggregation, the low level driver is expected to call ieee80211_stop_tx_ba_cb_irqsafe() to indicate that it is ready to stop the session. The callback in turn schedules a worker to complete the session tear down, which in turn also handles the relevant state for the intermediate Tx queue. However, as this flow in asynchronous, the intermediate queue should be stopped and not continue servicing frames, as in such a case frames that are dequeued would be marked as part of an aggregation, although the aggregation is already been stopped. Fix this by stopping the intermediate Tx queue, before calling the low level driver to stop the Tx aggregation. Signed-off-by: Ilan Peer <ilan.peer@intel.com> Signed-off-by: Luca Coelho <luciano.coelho@intel.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-02-11cfg80211: prevent speculation on cfg80211_classify8021d() returnJohannes Berg
It's possible that the caller of cfg80211_classify8021d() uses the value to index an array, like mac80211 in ieee80211_downgrade_queue(). Prevent speculation on the return value. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Luca Coelho <luciano.coelho@intel.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-02-11cfg80211: pmsr: record netlink port IDJohannes Berg
Without recording the netlink port ID, we cannot return the results or complete messages to userspace, nor will we be able to abort if the socket is closed, so clearly we need to fill the value. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Luca Coelho <luciano.coelho@intel.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-02-11nl80211: Fix FTM per burst maximum valueAviya Erenfeld
Fix FTM per burst maximum value from 15 to 31 (The maximal bits that represents that number in the frame is 5 hence a maximal value of 31) Signed-off-by: Aviya Erenfeld <aviya.erenfeld@intel.com> Signed-off-by: Luca Coelho <luciano.coelho@intel.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-02-11mac80211: call drv_ibss_join() on restartJohannes Berg
If a driver does any significant activity in its ibss_join method, then it will very well expect that to be called during restart, before any stations are added. Do that. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Luca Coelho <luciano.coelho@intel.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-02-10bpf: Add struct bpf_tcp_sock and BPF_FUNC_tcp_sockMartin KaFai Lau
This patch adds a helper function BPF_FUNC_tcp_sock and it is currently available for cg_skb and sched_(cls|act): struct bpf_tcp_sock *bpf_tcp_sock(struct bpf_sock *sk); int cg_skb_foo(struct __sk_buff *skb) { struct bpf_tcp_sock *tp; struct bpf_sock *sk; __u32 snd_cwnd; sk = skb->sk; if (!sk) return 1; tp = bpf_tcp_sock(sk); if (!tp) return 1; snd_cwnd = tp->snd_cwnd; /* ... */ return 1; } A 'struct bpf_tcp_sock' is also added to the uapi bpf.h to provide read-only access. bpf_tcp_sock has all the existing tcp_sock's fields that has already been exposed by the bpf_sock_ops. i.e. no new tcp_sock's fields are exposed in bpf.h. This helper returns a pointer to the tcp_sock. If it is not a tcp_sock or it cannot be traced back to a tcp_sock by sk_to_full_sk(), it returns NULL. Hence, the caller needs to check for NULL before accessing it. The current use case is to expose members from tcp_sock to allow a cg_skb_bpf_prog to provide per cgroup traffic policing/shaping. Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-02-10bpf: Refactor sock_ops_convert_ctx_accessMartin KaFai Lau
The next patch will introduce a new "struct bpf_tcp_sock" which exposes the same tcp_sock's fields already exposed in "struct bpf_sock_ops". This patch refactor the existing convert_ctx_access() codes for "struct bpf_sock_ops" to get them ready to be reused for "struct bpf_tcp_sock". The "rtt_min" is not refactored in this patch because its handling is different from other fields. The SOCK_OPS_GET_TCP_SOCK_FIELD is new. All other SOCK_OPS_XXX_FIELD changes are code move only. Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-02-10bpf: Add state, dst_ip4, dst_ip6 and dst_port to bpf_sockMartin KaFai Lau
This patch adds "state", "dst_ip4", "dst_ip6" and "dst_port" to the bpf_sock. The userspace has already been using "state", e.g. inet_diag (ss -t) and getsockopt(TCP_INFO). This patch also allows narrow load on the following existing fields: "family", "type", "protocol" and "src_port". Unlike IP address, the load offset is resticted to the first byte for them but it can be relaxed later if there is a use case. This patch also folds __sock_filter_check_size() into bpf_sock_is_valid_access() since it is not called by any where else. All bpf_sock checking is in one place. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>