summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2018-11-26ipvs: call ip_vs_dst_notifier earlier than ipv6_dev_notfXin Long
ip_vs_dst_event is supposed to clean up all dst used in ipvs' destinations when a net dev is going down. But it works only when the dst's dev is the same as the dev from the event. Now with the same priority but late registration, ip_vs_dst_notifier is always called later than ipv6_dev_notf where the dst's dev is set to lo for NETDEV_DOWN event. As the dst's dev lo is not the same as the dev from the event in ip_vs_dst_event, ip_vs_dst_notifier doesn't actually work. Also as these dst have to wait for dest_trash_timer to clean them up. It would cause some non-permanent kernel warnings: unregister_netdevice: waiting for br0 to become free. Usage count = 3 To fix it, call ip_vs_dst_notifier earlier than ipv6_dev_notf by increasing its priority to ADDRCONF_NOTIFY_PRIORITY + 5. Note that for ipv4 route fib_netdev_notifier doesn't set dst's dev to lo in NETDEV_DOWN event, so this fix is only needed when IP_VS_IPV6 is defined. Fixes: 7a4f0761fce3 ("IPVS: init and cleanup restructuring") Reported-by: Li Shuang <shuali@redhat.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Julian Anastasov <ja@ssi.bg> Acked-by: Simon Horman <horms@verge.net.au> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-11-25Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfDavid S. Miller
Daniel Borkmann says: ==================== pull-request: bpf 2018-11-25 The following pull-request contains BPF updates for your *net* tree. The main changes are: 1) Fix an off-by-one bug when adjusting subprog start offsets after patching, from Edward. 2) Fix several bugs such as overflow in size allocation in queue / stack map creation, from Alexei. 3) Fix wrong IPv6 destination port byte order in bpf_sk_lookup_udp helper, from Andrey. 4) Fix several bugs in bpftool such as preventing an infinite loop in get_fdinfo, error handling and man page references, from Quentin. 5) Fix a warning in bpf_trace_printk() that wasn't catching an invalid format string, from Martynas. 6) Fix a bug in BPF cgroup local storage where non-atomic allocation was used in atomic context, from Roman. 7) Fix a NULL pointer dereference bug in bpftool from reallocarray() error handling, from Jakub and Wen. 8) Add a copy of pkt_cls.h and tc_bpf.h uapi headers to the tools include infrastructure so that bpftool compiles on older RHEL7-like user space which does not ship these headers, from Yonghong. 9) Fix BPF kselftests for user space where to get ping test working with ping6 and ping -6, from Li. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-25net: remove unsafe skb_insert()Eric Dumazet
I do not see how one can effectively use skb_insert() without holding some kind of lock. Otherwise other cpus could have changed the list right before we have a chance of acquiring list->lock. Only existing user is in drivers/infiniband/hw/nes/nes_mgt.c and this one probably meant to use __skb_insert() since it appears nesqp->pau_list is protected by nesqp->pau_lock. This looks like nesqp->pau_lock could be removed, since nesqp->pau_list.lock could be used instead. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Faisal Latif <faisal.latif@intel.com> Cc: Doug Ledford <dledford@redhat.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: linux-rdma <linux-rdma@vger.kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-25net: bridge: remove redundant checks for null p->dev and p->brColin Ian King
A recent change added a null check on p->dev after p->dev was being dereferenced by the ns_capable check on p->dev. It turns out that neither the p->dev and p->br null checks are necessary, and can be removed, which cleans up a static analyis warning. As Nikolay Aleksandrov noted, these checks can be removed because: "My reasoning of why it shouldn't be possible: - On port add new_nbp() sets both p->dev and p->br before creating kobj/sysfs - On port del (trickier) del_nbp() calls kobject_del() before call_rcu() to destroy the port which in turn calls sysfs_remove_dir() which uses kernfs_remove() which deactivates (shouldn't be able to open new files) and calls kernfs_drain() to drain current open/mmaped files in the respective dir before continuing, thus making it impossible to open a bridge port sysfs file with p->dev and p->br equal to NULL. So I think it's safe to remove those checks altogether. It'd be nice to get a second look over my reasoning as I might be missing something in sysfs/kernfs call path." Thanks to Nikolay Aleksandrov's suggestion to remove the check and David Miller for sanity checking this. Detected by CoverityScan, CID#751490 ("Dereference before null check") Fixes: a5f3ea54f3cc ("net: bridge: add support for raw sysfs port options") Signed-off-by: Colin Ian King <colin.king@canonical.com> Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-24net: always initialize pagedlenWillem de Bruijn
In ip packet generation, pagedlen is initialized for each skb at the start of the loop in __ip(6)_append_data, before label alloc_new_skb. Depending on compiler options, code can be generated that jumps to this label, triggering use of an an uninitialized variable. In practice, at -O2, the generated code moves the initialization below the label. But the code should not rely on that for correctness. Fixes: 15e36f5b8e98 ("udp: paged allocation with gso") Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-24tcp: address problems caused by EDT misshapsEric Dumazet
When a qdisc setup including pacing FQ is dismantled and recreated, some TCP packets are sent earlier than instructed by TCP stack. TCP can be fooled when ACK comes back, because the following operation can return a negative value. tcp_time_stamp(tp) - tp->rx_opt.rcv_tsecr; Some paths in TCP stack were not dealing properly with this, this patch addresses four of them. Fixes: ab408b6dc744 ("tcp: switch tcp and sch_fq to new earliest departure time model") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-24Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
2018-11-24Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds
Pull networking fixes from David Miller: 1) Need to take mutex in ath9k_add_interface(), from Dan Carpenter. 2) Fix mt76 build without CONFIG_LEDS_CLASS, from Arnd Bergmann. 3) Fix socket wmem accounting in SCTP, from Xin Long. 4) Fix failed resume crash in ena driver, from Arthur Kiyanovski. 5) qed driver passes bytes instead of bits into second arg of bitmap_weight(). From Denis Bolotin. 6) Fix reset deadlock in ibmvnic, from Juliet Kim. 7) skb_scrube_packet() needs to scrub the fwd marks too, from Petr Machata. 8) Make sure older TCP stacks see enough dup ACKs, and avoid doing SACK compression during this period, from Eric Dumazet. 9) Add atomicity to SMC protocol cursor handling, from Ursula Braun. 10) Don't leave dangling error pointer if bpf_prog_add() fails in thunderx driver, from Lorenzo Bianconi. Also, when we unmap TSO headers, set sq->tso_hdrs to NULL. 11) Fix race condition over state variables in act_police, from Davide Caratti. 12) Disable guest csum in the presence of XDP in virtio_net, from Jason Wang. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (64 commits) net: gemini: Fix copy/paste error net: phy: mscc: fix deadlock in vsc85xx_default_config dt-bindings: dsa: Fix typo in "probed" net: thunderx: set tso_hdrs pointer to NULL in nicvf_free_snd_queue net: amd: add missing of_node_put() team: no need to do team_notify_peers or team_mcast_rejoin when disabling port virtio-net: fail XDP set if guest csum is negotiated virtio-net: disable guest csum during XDP set net/sched: act_police: add missing spinlock initialization net: don't keep lonely packets forever in the gro hash net/ipv6: re-do dad when interface has IFF_NOARP flag change packet: copy user buffers before orphan or clone ibmvnic: Update driver queues after change in ring size support ibmvnic: Fix RX queue buffer cleanup net: thunderx: set xdp_prog to NULL if bpf_prog_add fails net/dim: Update DIM start sample after each DIM iteration net: faraday: ftmac100: remove netif_running(netdev) check before disabling interrupts net/smc: use after free fix in smc_wr_tx_put_slot() net/smc: atomic SMCD cursor handling net/smc: add SMC-D shutdown signal ...
2018-11-23rocker, dsa, ethsw: Don't filter VLAN events on bridge itselfPetr Machata
Due to an explicit check in rocker_world_port_obj_vlan_add(), dsa_slave_switchdev_event() resp. port_switchdev_event(), VLAN objects that are added to a device that is not a front-panel port device are ignored. Therefore this check is immaterial. Signed-off-by: Petr Machata <petrm@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-23switchdev: Replace port obj add/del SDO with a notificationPetr Machata
Drop switchdev_ops.switchdev_port_obj_add and _del. Drop the uses of this field from all clients, which were migrated to use switchdev notification in the previous patches. Add a new function switchdev_port_obj_notify() that sends the switchdev notifications SWITCHDEV_PORT_OBJ_ADD and _DEL. Update switchdev_port_obj_del_now() to dispatch to this new function. Drop __switchdev_port_obj_add() and update switchdev_port_obj_add() likewise. Signed-off-by: Petr Machata <petrm@mellanox.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-23switchdev: Add helpers to aid traversal through lower devicesPetr Machata
After the transition from switchdev operations to notifier chain (which will take place in following patches), the onus is on the driver to find its own devices below possible layer of LAG or other uppers. The logic to do so is fairly repetitive: each driver is looking for its own devices among the lowers of the notified device. For those that it finds, it calls a handler. To indicate that the event was handled, struct switchdev_notifier_port_obj_info.handled is set. The differences lie only in what constitutes an "own" device and what handler to call. Therefore abstract this logic into two helpers, switchdev_handle_port_obj_add() and switchdev_handle_port_obj_del(). If a driver only supports physical ports under a bridge device, it will simply avoid this layer of indirection. One area where this helper diverges from the current switchdev behavior is the case of mixed lowers, some of which are switchdev ports and some of which are not. Previously, such scenario would fail with -EOPNOTSUPP. The helper could do that for lowers for which the passed-in predicate doesn't hold. That would however break the case that switchdev ports from several different drivers are stashed under one master, a scenario that switchdev currently happily supports. Therefore tolerate any and all unknown netdevices, whether they are backed by a switchdev driver or not. Signed-off-by: Petr Machata <petrm@mellanox.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-23net: dsa: slave: Handle SWITCHDEV_PORT_OBJ_ADD/_DELPetr Machata
Following patches will change the way of distributing port object changes from a switchdev operation to a switchdev notifier. The switchdev code currently recursively descends through layers of lower devices, eventually calling the op on a front-panel port device. The notifier will instead be sent referencing the bridge port device, which may be a stacking device that's one of front-panel ports uppers, or a completely unrelated device. DSA currently doesn't support any other uppers than bridge. SWITCHDEV_OBJ_ID_HOST_MDB and _PORT_MDB objects are always notified on the bridge port device. Thus the only case that a stacked device could be validly referenced by port object notifications are bridge notifications for VLAN objects added to the bridge itself. But the driver explicitly rejects such notifications in dsa_port_vlan_add(). It is therefore safe to assume that the only interesting case is that the notification is on a front-panel port netdevice. Therefore keep the filtering by dsa_slave_dev_check() in place. To handle SWITCHDEV_PORT_OBJ_ADD and _DEL, subscribe to the blocking notifier chain. Dispatch to rocker_port_obj_add() resp. _del() to maintain the behavior that the switchdev operation based code currently has. Signed-off-by: Petr Machata <petrm@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-23switchdev: Add a blocking notifier chainPetr Machata
In general one can't assume that a switchdev notifier is called in a non-atomic context, and correspondingly, the switchdev notifier chain is an atomic one. However, port object addition and deletion messages are delivered from a process context. Even the MDB addition messages, whose delivery is scheduled from atomic context, are queued and the delivery itself takes place in blocking context. For VLAN messages in particular, keeping the blocking nature is important for error reporting. Therefore introduce a blocking notifier chain and related service functions to distribute the notifications for which a blocking context can be assumed. Signed-off-by: Petr Machata <petrm@mellanox.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-23net/smc: unregister rkeys of unused bufferKarsten Graul
When an rmb is no longer in use by a connection, unregister its rkey at the remote peer with an LLC DELETE RKEY message. With this change, unused buffers held in the buffer pool are no longer registered at the remote peer. They are registered before the buffer is actually used and unregistered when they are no longer used by a connection. Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-23net/smc: add infrastructure to send delete rkey messagesKarsten Graul
Add the infrastructure to send LLC messages of type DELETE RKEY to unregister a shared memory region at the peer. Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-23net/smc: avoid a delay by waiting for nothingKarsten Graul
When a send failed then don't start to wait for a response in smc_llc_do_confirm_rkey. Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-23net/smc: cleanup listen worker mutex unlockingUrsula Braun
For easier reading move the unlock of mutex smc_create_lgr_pending into smc_listen_work(), i.e. into the function the mutex has been locked. No functional change. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-23net/smc: short wait for late smc_clc_wait_msgUrsula Braun
After sending one of the initial LLC messages CONFIRM LINK or ADD LINK, there is already a wait for the LLC response. It does not make sense to wait another long time for a CLC DECLINE. Thus this patch introduces a shorter wait time for these cases. Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-23net/smc: no link delete for a never active linkUrsula Braun
If a link is terminated that has never reached the active state, there is no need to trigger an LLC DELETE LINK. Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-23net/smc: allow fallback after clc timeoutsUrsula Braun
If connection initialization fails for the LLC CONFIRM LINK or the LLC ADD LINK step, fallback to TCP should be enabled. Thus the negative return code -EAGAIN should switch to a positive timeout reason code in these cases, and the internal CLC socket should not have a set sk_err. Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-23net/smc: remove sock_error detour in clc-functionsUrsula Braun
There is no need to store the return value in sk_err, if it is afterwards cleared again with sock_error(). This patch sets the return value directly. Just cleanup, no functional change. Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-23net/smc: make smc_lgr_free() staticUrsula Braun
smc_lgr_free() is just called inside smc_core.c. Make it static. Just cleanup, no functional change. Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-23net/smc: cleanup tcp_listen_worker initializationUrsula Braun
The tcp_listen_worker is already initialized when socket is created (in smc_sock_alloc()). Get rid of the duplicate initialization in smc_listen(). No functional change. Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-23net-gro: use ffs() to speedup napi_gro_flush()Eric Dumazet
We very often have few flows/chains to look at, and we might increase GRO_HASH_BUCKETS to 32 or 64 in the future. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-23Merge tag 'ceph-for-4.20-rc4' of https://github.com/ceph/ceph-clientLinus Torvalds
Pullk ceph fix from Ilya Dryomov: "A messenger fix, marked for stable" * tag 'ceph-for-4.20-rc4' of https://github.com/ceph/ceph-client: libceph: fall back to sendmsg for slab pages
2018-11-23net/sched: act_police: add missing spinlock initializationDavide Caratti
commit f2cbd4852820 ("net/sched: act_police: fix race condition on state variables") introduces a new spinlock, but forgets its initialization. Ensure that tcf_police_init() initializes 'tcfp_lock' every time a 'police' action is newly created, to avoid the following lockdep splat: INFO: trying to register non-static key. the code is fine but needs lockdep annotation. turning off the locking correctness validator. <...> Call Trace: dump_stack+0x85/0xcb register_lock_class+0x581/0x590 __lock_acquire+0xd4/0x1330 ? tcf_police_init+0x2fa/0x650 [act_police] ? lock_acquire+0x9e/0x1a0 lock_acquire+0x9e/0x1a0 ? tcf_police_init+0x2fa/0x650 [act_police] ? tcf_police_init+0x55a/0x650 [act_police] _raw_spin_lock_bh+0x34/0x40 ? tcf_police_init+0x2fa/0x650 [act_police] tcf_police_init+0x2fa/0x650 [act_police] tcf_action_init_1+0x384/0x4c0 tcf_action_init+0xf6/0x160 tcf_action_add+0x73/0x170 tc_ctl_action+0x122/0x160 rtnetlink_rcv_msg+0x2a4/0x490 ? netlink_deliver_tap+0x99/0x400 ? validate_linkmsg+0x370/0x370 netlink_rcv_skb+0x4d/0x130 netlink_unicast+0x196/0x230 netlink_sendmsg+0x2e5/0x3e0 sock_sendmsg+0x36/0x40 ___sys_sendmsg+0x280/0x2f0 ? _raw_spin_unlock+0x24/0x30 ? handle_pte_fault+0xafe/0xf30 ? find_held_lock+0x2d/0x90 ? syscall_trace_enter+0x1df/0x360 ? __sys_sendmsg+0x5e/0xa0 __sys_sendmsg+0x5e/0xa0 do_syscall_64+0x60/0x210 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x7f1841c7cf10 Code: c3 48 8b 05 82 6f 2c 00 f7 db 64 89 18 48 83 cb ff eb dd 0f 1f 80 00 00 00 00 83 3d 8d d0 2c 00 00 75 10 b8 2e 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 ae cc 00 00 48 89 04 24 RSP: 002b:00007ffcf9df4d68 EFLAGS: 00000246 ORIG_RAX: 000000000000002e RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f1841c7cf10 RDX: 0000000000000000 RSI: 00007ffcf9df4dc0 RDI: 0000000000000003 RBP: 000000005bf56105 R08: 0000000000000002 R09: 00007ffcf9df8edc R10: 00007ffcf9df47e0 R11: 0000000000000246 R12: 0000000000671be0 R13: 00007ffcf9df4e84 R14: 0000000000000008 R15: 0000000000000000 Fixes: f2cbd4852820 ("net/sched: act_police: fix race condition on state variables") Reported-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Davide Caratti <dcaratti@redhat.com> Acked-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-23net: don't keep lonely packets forever in the gro hashPaolo Abeni
Eric noted that with UDP GRO and NAPI timeout, we could keep a single UDP packet inside the GRO hash forever, if the related NAPI instance calls napi_gro_complete() at an higher frequency than the NAPI timeout. Willem noted that even TCP packets could be trapped there, till the next retransmission. This patch tries to address the issue, flushing the old packets - those with a NAPI_GRO_CB age before the current jiffy - before scheduling the NAPI timeout. The rationale is that such a timeout should be well below a jiffy and we are not flushing packets eligible for sane GRO. v1 -> v2: - clarified the commit message and comment RFC -> v1: - added 'Fixes tags', cleaned-up the wording. Reported-by: Eric Dumazet <eric.dumazet@gmail.com> Fixes: 3b47d30396ba ("net: gro: add a per device gro flush timer") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Willem de Bruijn <willemb@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-23net/ipv6: re-do dad when interface has IFF_NOARP flag changeHangbin Liu
When we add a new IPv6 address, we should also join corresponding solicited-node multicast address, unless the interface has IFF_NOARP flag, as function addrconf_join_solict() did. But if we remove IFF_NOARP flag later, we do not do dad and add the mcast address. So we will drop corresponding neighbour discovery message that came from other nodes. A typical example is after creating a ipvlan with mode l3, setting up an ipv6 address and changing the mode to l2. Then we will not be able to ping this address as the interface doesn't join related solicited-node mcast address. Fix it by re-doing dad when interface changed IFF_NOARP flag. Then we will add corresponding mcast group and check if there is a duplicate address on the network. Reported-by: Jianlin Shi <jishi@redhat.com> Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-23packet: copy user buffers before orphan or cloneWillem de Bruijn
tpacket_snd sends packets with user pages linked into skb frags. It notifies that pages can be reused when the skb is released by setting skb->destructor to tpacket_destruct_skb. This can cause data corruption if the skb is orphaned (e.g., on transmit through veth) or cloned (e.g., on mirror to another psock). Create a kernel-private copy of data in these cases, same as tun/tap zerocopy transmission. Reuse that infrastructure: mark the skb as SKBTX_ZEROCOPY_FRAG, which will trigger copy in skb_orphan_frags(_rx). Unlike other zerocopy packets, do not set shinfo destructor_arg to struct ubuf_info. tpacket_destruct_skb already uses that ptr to notify when the original skb is released and a timestamp is recorded. Do not change this timestamp behavior. The ubuf_info->callback is not needed anyway, as no zerocopy notification is expected. Mark destructor_arg as not-a-uarg by setting the lower bit to 1. The resulting value is not a valid ubuf_info pointer, nor a valid tpacket_snd frame address. Add skb_zcopy_.._nouarg helpers for this. The fix relies on features introduced in commit 52267790ef52 ("sock: add MSG_ZEROCOPY"), so can be backported as is only to 4.14. Tested with from `./in_netns.sh ./txring_overwrite` from http://github.com/wdebruij/kerneltools/tests Fixes: 69e3c75f4d54 ("net: TX_RING and packet mmap") Reported-by: Anand H. Krishnan <anandhkrishnan@gmail.com> Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-22bpf: add skb->tstamp r/w access from tc clsact and cg skb progsVlad Dumitrescu
This could be used to rate limit egress traffic in concert with a qdisc which supports Earliest Departure Time, such as FQ. Write access from cg skb progs only with CAP_SYS_ADMIN, since the value will be used by downstream qdiscs. It might make sense to relax this. Changes v1 -> v2: - allow access from cg skb, write only with CAP_SYS_ADMIN Signed-off-by: Vlad Dumitrescu <vladum@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-11-21bridge: Allow querying bridge port flagsIdo Schimmel
Allow querying bridge port flags so that drivers capable of performing VxLAN learning will update the bridge driver only if learning is enabled on its bridge port corresponding to the VxLAN device. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Reviewed-by: Petr Machata <petrm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-21net/smc: use after free fix in smc_wr_tx_put_slot()Ursula Braun
In smc_wr_tx_put_slot() field pend->idx is used after being cleared. That means always idx 0 is cleared in the wr_tx_mask. This results in a broken administration of available WR send payload buffers. Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-21net/smc: atomic SMCD cursor handlingUrsula Braun
Running uperf tests with SMCD on LPARs results in corrupted cursors. SMCD cursors should be treated atomically to fix cursor corruption. Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-21net/smc: add SMC-D shutdown signalHans Wippel
When a SMC-D link group is freed, a shutdown signal should be sent to the peer to indicate that the link group is invalid. This patch adds the shutdown signal to the SMC code. Signed-off-by: Hans Wippel <hwippel@linux.ibm.com> Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-21net/smc: use queue pair number when matching link groupKarsten Graul
When searching for an existing link group the queue pair number is also to be taken into consideration. When the SMC server sends a new number in a CLC packet (keeping all other values equal) then a new link group is to be created on the SMC client side. Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-21net/smc: abort CLC connection in smc_releaseHans Wippel
In case of a non-blocking SMC socket, the initial CLC handshake is performed over a blocking TCP connection in a worker. If the SMC socket is released, smc_release has to wait for the blocking CLC socket operations (e.g., kernel_connect) inside the worker. This patch aborts a CLC connection when the respective non-blocking SMC socket is released to avoid waiting on socket operations or timeouts. Signed-off-by: Hans Wippel <hwippel@linux.ibm.com> Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-21tcp: defer SACK compression after DupThreshEric Dumazet
Jean-Louis reported a TCP regression and bisected to recent SACK compression. After a loss episode (receiver not able to keep up and dropping packets because its backlog is full), linux TCP stack is sending a single SACK (DUPACK). Sender waits a full RTO timer before recovering losses. While RFC 6675 says in section 5, "Algorithm Details", (2) If DupAcks < DupThresh but IsLost (HighACK + 1) returns true -- indicating at least three segments have arrived above the current cumulative acknowledgment point, which is taken to indicate loss -- go to step (4). ... (4) Invoke fast retransmit and enter loss recovery as follows: there are old TCP stacks not implementing this strategy, and still counting the dupacks before starting fast retransmit. While these stacks probably perform poorly when receivers implement LRO/GRO, we should be a little more gentle to them. This patch makes sure we do not enable SACK compression unless 3 dupacks have been sent since last rcv_nxt update. Ideally we should even rearm the timer to send one or two more DUPACK if no more packets are coming, but that will be work aiming for linux-4.21. Many thanks to Jean-Louis for bisecting the issue, providing packet captures and testing this patch. Fixes: 5d9f4262b7ea ("tcp: add SACK compression") Reported-by: Jean-Louis Dupond <jean-louis@dupond.be> Tested-by: Jean-Louis Dupond <jean-louis@dupond.be> Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-21net: skb_scrub_packet(): Scrub offload_fwd_markPetr Machata
When a packet is trapped and the corresponding SKB marked as already-forwarded, it retains this marking even after it is forwarded across veth links into another bridge. There, since it ingresses the bridge over veth, which doesn't have offload_fwd_mark, it triggers a warning in nbp_switchdev_frame_mark(). Then nbp_switchdev_allowed_egress() decides not to allow egress from this bridge through another veth, because the SKB is already marked, and the mark (of 0) of course matches. Thus the packet is incorrectly blocked. Solve by resetting offload_fwd_mark() in skb_scrub_packet(). That function is called from tunnels and also from veth, and thus catches the cases where traffic is forwarded between bridges and transformed in a way that invalidates the marking. Fixes: 6bc506b4fb06 ("bridge: switchdev: Add forward mark support for stacked devices") Fixes: abf4bb6b63d0 ("skbuff: Add the offload_mr_fwd_mark field") Signed-off-by: Petr Machata <petrm@mellanox.com> Suggested-by: Ido Schimmel <idosch@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-20net/sched: act_police: fix race condition on state variablesDavide Caratti
after 'police' configuration parameters were converted to use RCU instead of spinlock, the state variables used to compute the traffic rate (namely 'tcfp_toks', 'tcfp_ptoks' and 'tcfp_t_c') are erroneously read/updated in the traffic path without any protection. Use a dedicated spinlock to avoid race conditions on these variables, and ensure proper cache-line alignment. In this way, 'police' is still faster than what we observed when 'tcf_lock' was used in the traffic path _ i.e. reverting commit 2d550dbad83c ("net/sched: act_police: don't use spinlock in the data path"). Moreover, we preserve the throughput improvement that was obtained after 'police' started using per-cpu counters, when 'avrate' is used instead of 'rate'. Changes since v1 (thanks to Eric Dumazet): - call ktime_get_ns() before acquiring the lock in the traffic path - use a dedicated spinlock instead of tcf_lock - improve cache-line usage Fixes: 2d550dbad83c ("net/sched: act_police: don't use spinlock in the data path") Reported-and-suggested-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Davide Caratti <dcaratti@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com>
2018-11-20tcp: Fix SOF_TIMESTAMPING_RX_HARDWARE to use the latest timestamp during TCP ↵Stephen Mallon
coalescing During tcp coalescing ensure that the skb hardware timestamp refers to the highest sequence number data. Previously only the software timestamp was updated during coalescing. Signed-off-by: Stephen Mallon <stephen.mallon@sydney.edu.au> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-20tcp: drop dst in tcp_add_backlog()Eric Dumazet
Under stress, softirq rx handler often hits a socket owned by the user, and has to queue the packet into socket backlog. When this happens, skb dst refcount is taken before we escape rcu protected region. This is done from __sk_add_backlog() calling skb_dst_force(). Consumer will have to perform the opposite costly operation. AFAIK nothing in tcp stack requests the dst after skb was stored in the backlog. If this was the case, we would have had failures already since skb_dst_force() can end up clearing skb dst anyway. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-20ipv4: Don't try to print ASCII of link level header in martian dumps.David S. Miller
This has no value whatsoever. Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-20net_sched: sch_fq: avoid calling ktime_get_ns() if not neededEric Dumazet
There are two cases were we can avoid calling ktime_get_ns() : 1) Queue is empty. 2) Internal queue is not empty. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-19net: sched: cls_u32: add res to offload informationJakub Kicinski
In case of egress offloads the class/flowid assigned by the filter may be very important for offloaded Qdisc selection. Provide this info to drivers. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: John Hurley <john.hurley@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-19net: sched: gred: support reporting stats from offloadsJakub Kicinski
Allow drivers which offload GRED to report back statistics. Since A lot of GRED stats is fairly ad hoc in nature pass to drivers the standard struct gnet_stats_basic/gnet_stats_queue pairs, and untangle the values in the core. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: John Hurley <john.hurley@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-19net: sched: gred: add basic Qdisc offloadJakub Kicinski
Add basic offload for the GRED Qdisc. Inform the drivers any time Qdisc or virtual queue configuration changes. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: John Hurley <john.hurley@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-19net: skb_scrub_packet(): Scrub offload_fwd_markPetr Machata
When a packet is trapped and the corresponding SKB marked as already-forwarded, it retains this marking even after it is forwarded across veth links into another bridge. There, since it ingresses the bridge over veth, which doesn't have offload_fwd_mark, it triggers a warning in nbp_switchdev_frame_mark(). Then nbp_switchdev_allowed_egress() decides not to allow egress from this bridge through another veth, because the SKB is already marked, and the mark (of 0) of course matches. Thus the packet is incorrectly blocked. Solve by resetting offload_fwd_mark() in skb_scrub_packet(). That function is called from tunnels and also from veth, and thus catches the cases where traffic is forwarded between bridges and transformed in a way that invalidates the marking. Signed-off-by: Petr Machata <petrm@mellanox.com> Suggested-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-19sctp: not increase stream's incnt before sending addstrm_in requestXin Long
Different from processing the addstrm_out request, The receiver handles an addstrm_in request by sending back an addstrm_out request to the sender who will increase its stream's in and incnt later. Now stream->incnt has been increased since it sent out the addstrm_in request in sctp_send_add_streams(), with the wrong stream->incnt will even cause crash when copying stream info from the old stream's in to the new one's in sctp_process_strreset_addstrm_out(). This patch is to fix it by simply removing the stream->incnt change from sctp_send_add_streams(). Fixes: 242bd2d519d7 ("sctp: implement sender-side procedures for Add Incoming/Outgoing Streams Request Parameter") Reported-by: Jianwen Ji <jiji@redhat.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-19Revert "sctp: remove sctp_transport_pmtu_check"Xin Long
This reverts commit 22d7be267eaa8114dcc28d66c1c347f667d7878a. The dst's mtu in transport can be updated by a non sctp place like in xfrm where the MTU information didn't get synced between asoc, transport and dst, so it is still needed to do the pmtu check in sctp_packet_config. Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-19sctp: not allow to set asoc prsctp_enable by sockoptXin Long
As rfc7496#section4.5 says about SCTP_PR_SUPPORTED: This socket option allows the enabling or disabling of the negotiation of PR-SCTP support for future associations. For existing associations, it allows one to query whether or not PR-SCTP support was negotiated on a particular association. It means only sctp sock's prsctp_enable can be set. Note that for the limitation of SCTP_{CURRENT|ALL}_ASSOC, we will add it when introducing SCTP_{FUTURE|CURRENT|ALL}_ASSOC for linux sctp in another patchset. v1->v2: - drop the params.assoc_id check as Neil suggested. Fixes: 28aa4c26fce2 ("sctp: add SCTP_PR_SUPPORTED on sctp sockopt") Reported-by: Ying Xu <yinxu@redhat.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>