summaryrefslogtreecommitdiff
path: root/include/net
AgeCommit message (Collapse)Author
2021-07-03Merge branch 'work.iov_iter' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull iov_iter updates from Al Viro: "iov_iter cleanups and fixes. There are followups, but this is what had sat in -next this cycle. IMO the macro forest in there became much thinner and easier to follow..." * 'work.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (37 commits) csum_and_copy_to_pipe_iter(): leave handling of csum_state to caller clean up copy_mc_pipe_to_iter() pipe_zero(): we don't need no stinkin' kmap_atomic()... iov_iter: clean csum_and_copy_...() primitives up a bit copy_page_from_iter(): don't need kmap_atomic() for kvec/bvec cases copy_page_to_iter(): don't bother with kmap_atomic() for bvec/kvec cases iterate_xarray(): only of the first iteration we might get offset != 0 pull handling of ->iov_offset into iterate_{iovec,bvec,xarray} iov_iter: make iterator callbacks use base and len instead of iovec iov_iter: make the amount already copied available to iterator callbacks iov_iter: get rid of separate bvec and xarray callbacks iov_iter: teach iterate_{bvec,xarray}() about possible short copies iterate_bvec(): expand bvec.h macro forest, massage a bit iov_iter: unify iterate_iovec and iterate_kvec iov_iter: massage iterate_iovec and iterate_kvec to logics similar to iterate_bvec iterate_and_advance(): get rid of magic in case when n is 0 csum_and_copy_to_iter(): massage into form closer to csum_and_copy_from_iter() iov_iter: replace iov_iter_copy_from_user_atomic() with iterator-advancing variant [xarray] iov_iter_npages(): just use DIV_ROUND_UP() iov_iter_npages(): don't bother with iterate_all_kinds() ...
2021-06-29Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Trivial conflict in net/netfilter/nf_tables_api.c. Duplicate fix in tools/testing/selftests/net/devlink_port_split.py - take the net-next version. skmsg, and L4 bpf - keep the bpf code but remove the flags and err params. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-06-29tcp: change ICSK_CA_PRIV_SIZE definitionEric Dumazet
Instead of a magic number (13 currently) and having to change it every other year, use sizeof_field() macro. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-29net: sock: introduce sk_error_reportAlexander Aring
This patch introduces a function wrapper to call the sk_error_report callback. That will prepare to add additional handling whenever sk_error_report is called, for example to trace socket errors. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-29net: dsa: reference count the FDB addresses at the cross-chip notifier levelVladimir Oltean
The same concerns expressed for host MDB entries are valid for host FDBs just as well: - in the case of multiple bridges spanning the same switch chip, deleting a host FDB entry that belongs to one bridge will result in breakage to the other bridge - not deleting FDB entries across DSA links means that the switch's hardware tables will eventually run out, given enough wear&tear So do the same thing and introduce reference counting for CPU ports and DSA links using the same data structures as we have for MDB entries. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-29net: dsa: reference count the MDB entries at the cross-chip notifier levelVladimir Oltean
Ever since the cross-chip notifiers were introduced, the design was meant to be simplistic and just get the job done without worrying too much about dangling resources left behind. For example, somebody installs an MDB entry on sw0p0 in this daisy chain topology. It gets installed using ds->ops->port_mdb_add() on sw0p0, sw1p4 and sw2p4. | sw0p0 sw0p1 sw0p2 sw0p3 sw0p4 [ user ] [ user ] [ user ] [ dsa ] [ cpu ] [ x ] [ ] [ ] [ ] [ ] | +---------+ | sw1p0 sw1p1 sw1p2 sw1p3 sw1p4 [ user ] [ user ] [ user ] [ dsa ] [ dsa ] [ ] [ ] [ ] [ ] [ x ] | +---------+ | sw2p0 sw2p1 sw2p2 sw2p3 sw2p4 [ user ] [ user ] [ user ] [ user ] [ dsa ] [ ] [ ] [ ] [ ] [ x ] Then the same person deletes that MDB entry. The cross-chip notifier for deletion only matches sw0p0: | sw0p0 sw0p1 sw0p2 sw0p3 sw0p4 [ user ] [ user ] [ user ] [ dsa ] [ cpu ] [ x ] [ ] [ ] [ ] [ ] | +---------+ | sw1p0 sw1p1 sw1p2 sw1p3 sw1p4 [ user ] [ user ] [ user ] [ dsa ] [ dsa ] [ ] [ ] [ ] [ ] [ ] | +---------+ | sw2p0 sw2p1 sw2p2 sw2p3 sw2p4 [ user ] [ user ] [ user ] [ user ] [ dsa ] [ ] [ ] [ ] [ ] [ ] Why? Because the DSA links are 'trunk' ports, if we just go ahead and delete the MDB from sw1p4 and sw2p4 directly, we might delete those multicast entries when they are still needed. Just consider the fact that somebody does: - add a multicast MAC address towards sw0p0 [ via the cross-chip notifiers it gets installed on the DSA links too ] - add the same multicast MAC address towards sw0p1 (another port of that same switch) - delete the same multicast MAC address from sw0p0. At this point, if we deleted the MAC address from the DSA links, it would be flooded, even though there is still an entry on switch 0 which needs it not to. So that is why deletions only match the targeted source port and nothing on DSA links. Of course, dangling resources means that the hardware tables will eventually run out given enough additions/removals, but hey, at least it's simple. But there is a bigger concern which needs to be addressed, and that is our support for SWITCHDEV_OBJ_ID_HOST_MDB. DSA simply translates such an object into a dsa_port_host_mdb_add() which ends up as ds->ops->port_mdb_add() on the upstream port, and a similar thing happens on deletion: dsa_port_host_mdb_del() will trigger ds->ops->port_mdb_del() on the upstream port. When there are 2 VLAN-unaware bridges spanning the same switch (which is a use case DSA proudly supports), each bridge will install its own SWITCHDEV_OBJ_ID_HOST_MDB entries. But upon deletion, DSA goes ahead and emits a DSA_NOTIFIER_MDB_DEL for dp->cpu_dp, which is shared between the user ports enslaved to br0 and the user ports enslaved to br1. Not good. The host-trapped multicast addresses installed by br1 will be deleted when any state changes in br0 (IGMP timers expire, or ports leave, etc). To avoid this, we could of course go the route of the zero-sum game and delete the DSA_NOTIFIER_MDB_DEL call for dp->cpu_dp. But the better design is to just admit that on shared ports like DSA links and CPU ports, we should be reference counting calls, even if this consumes some dynamic memory which DSA has traditionally avoided. On the flip side, the hardware tables of switches are limited in size, so it would be good if the OS managed them properly instead of having them eventually overflow. To address the memory usage concern, we only apply the refcounting of MDB entries on ports that are really shared (CPU ports and DSA links) and not on user ports. In a typical single-switch setup, this means only the CPU port (and the host MDB entries are not that many, really). The name of the newly introduced data structures (dsa_mac_addr) is chosen in such a way that will be reusable for host FDB entries (next patch). With this change, we can finally have the same matching logic for the MDB additions and deletions, as well as for their host-trapped variants. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-29net: dsa: introduce dsa_is_upstream_port and dsa_switch_is_upstream_ofVladimir Oltean
In preparation for the new cross-chip notifiers for host addresses, let's introduce some more topology helpers which we are going to use to discern switches that are in our path towards the dedicated CPU port from switches that aren't. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-28Merge tag 'for-net-next-2021-06-28' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next Luiz Augusto von Dentz says: ==================== bluetooth-next pull request for net-next: - Add support for QCA_ROME device (0cf3:e500) and RTL8822CE - Update management interface revision to 21 - Use of incluse language - Proper handling of HCI_LE_Advertising_Set_Terminated event - Recovery handing of HCI ncmd=0 - Various memory fixes ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-28sctp: validate from_addr_param returnMarcelo Ricardo Leitner
Ilja reported that, simply putting it, nothing was validating that from_addr_param functions were operating on initialized memory. That is, the parameter itself was being validated by sctp_walk_params, but it doesn't check for types and their specific sizes and it could be a 0-length one, causing from_addr_param to potentially work over the next parameter or even uninitialized memory. The fix here is to, in all calls to from_addr_param, check if enough space is there for the wanted IP address type. Reported-by: Ilja Van Sprundel <ivansprundel@ioactive.com> Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-28Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller
Daniel Borkmann says: ==================== pull-request: bpf-next 2021-06-28 The following pull-request contains BPF updates for your *net-next* tree. We've added 37 non-merge commits during the last 12 day(s) which contain a total of 56 files changed, 394 insertions(+), 380 deletions(-). The main changes are: 1) XDP driver RCU cleanups, from Toke Høiland-Jørgensen and Paul E. McKenney. 2) Fix bpf_skb_change_proto() IPv4/v6 GSO handling, from Maciej Żenczykowski. 3) Fix false positive kmemleak report for BPF ringbuf alloc, from Rustam Kovhaev. 4) Fix x86 JIT's extable offset calculation for PROBE_LDX NULL, from Ravi Bangoria. 5) Enable libbpf fallback probing with tracing under RHEL7, from Jonathan Edwards. 6) Clean up x86 JIT to remove unused cnt tracking from EMIT macro, from Jiri Olsa. 7) Netlink cleanups for libbpf to please Coverity, from Kumar Kartikeya Dwivedi. 8) Allow to retrieve ancestor cgroup id in tracing programs, from Namhyung Kim. 9) Fix lirc BPF program query to use user-provided prog_cnt, from Sean Young. 10) Add initial libbpf doc including generated kdoc for its API, from Grant Seltzer. 11) Make xdp_rxq_info_unreg_mem_model() more robust, from Jakub Kicinski. 12) Fix up bpfilter startup log-level to info level, from Gary Lin. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-28ipv6: ICMPV6: add response to ICMPV6 RFC 8335 PROBE messagesAndreas Roeseler
This patch builds off of commit 2b246b2569cd2ac6ff700d0dce56b8bae29b1842 and adds functionality to respond to ICMPV6 PROBE requests. Add icmp_build_probe function to construct PROBE requests for both ICMPV4 and ICMPV6. Modify icmpv6_rcv to detect ICMPV6 PROBE messages and call the icmpv6_echo_reply handler. Modify icmpv6_echo_reply to build a PROBE response message based on the queried interface. This patch has been tested using a branch of the iputils git repo which can be found here: https://github.com/Juniper-Clinic-2020/iputils/tree/probe-request Signed-off-by: Andreas Roeseler <andreas.a.roeseler@gmail.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-28flow_offload: action should not be NULL when it is referencedgushengxian
"action" should not be NULL when it is referenced. Signed-off-by: gushengxian <13145886936@163.com> Signed-off-by: gushengxian <gushengxian@yulong.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-28net: switchdev: add a context void pointer to struct switchdev_notifier_infoVladimir Oltean
In the case where the driver asks for a replay of a certain type of event (port object or attribute) for a bridge port that is a LAG, it may do so because this port has just joined the LAG. But there might already be other switchdev ports in that LAG, and it is preferable that those preexisting switchdev ports do not act upon the replayed event. The solution is to add a context to switchdev events, which is NULL most of the time (when the bridge layer initiates the call) but which can be set to a value controlled by the switchdev driver when a replay is requested. The driver can then check the context to figure out if all ports within the LAG should act upon the switchdev event, or just the ones that match the context. We have to modify all switchdev_handle_* helper functions as well as the prototypes in the drivers that use these helpers too, because these helpers hide the underlying struct switchdev_notifier_info from us and there is no way to retrieve the context otherwise. The context structure will be populated and used in later patches. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-28Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/gitDavid S. Miller
/klassert/ipsec-next Steffen Klassert says: ==================== pull request (net-next): ipsec-next 2021-06-28 1) Remove an unneeded error assignment in esp4_gro_receive(). From Yang Li. 2) Add a new byseq state hashtable to find acquire states faster. From Sabrina Dubroca. 3) Remove some unnecessary variables in pfkey_create(). From zuoqilin. 4) Remove the unused description from xfrm_type struct. From Florian Westphal. 5) Fix a spelling mistake in the comment of xfrm_state_ok(). From gushengxian. 6) Replace hdr_off indirections by a small helper function. From Florian Westphal. 7) Remove xfrm4_output_finish and xfrm6_output_finish declarations, they are not used anymore.From Antony Antony. 8) Remove xfrm replay indirections. From Florian Westphal. Please pull or let me know if there are problems. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-28Merge tag 'mac80211-next-for-net-next-2021-06-25' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next Johannes berg says: ==================== Lots of changes: * aggregation handling improvements for some drivers * hidden AP discovery on 6 GHz and other HE 6 GHz improvements * minstrel improvements for no-ack frames * deferred rate control for TXQs to improve reaction times * virtual time-based airtime scheduler * along with various little cleanups/fixups ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-28net: lwtunnel: handle MTU calculation in forwadingVadim Fedorenko
Commit 14972cbd34ff ("net: lwtunnel: Handle fragmentation") moved fragmentation logic away from lwtunnel by carry encap headroom and use it in output MTU calculation. But the forwarding part was not covered and created difference in MTU for output and forwarding and further to silent drops on ipv4 forwarding path. Fix it by taking into account lwtunnel encap headroom. The same commit also introduced difference in how to treat RTAX_MTU in IPv4 and IPv6 where latter explicitly removes lwtunnel encap headroom from route MTU. Make IPv4 version do the same. Fixes: 14972cbd34ff ("net: lwtunnel: Handle fragmentation") Suggested-by: David Ahern <dsahern@gmail.com> Signed-off-by: Vadim Fedorenko <vfedorenko@novek.ru> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-26Bluetooth: Fix Set Extended (Scan Response) DataLuiz Augusto von Dentz
These command do have variable length and the length can go up to 251, so this changes the struct to not use a fixed size and then when creating the PDU only the actual length of the data send to the controller. Fixes: a0fb3726ba551 ("Bluetooth: Use Set ext adv/scan rsp data if controller supports") Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2021-06-26Bluetooth: use inclusive language when filtering devicesArchie Pusaka
This patch replaces some non-inclusive terms based on the appropriate language mapping table compiled by the Bluetooth SIG: https://specificationrefs.bluetooth.com/language-mapping/Appropriate_Language_Mapping_Table.pdf Specifically, these terms are replaced: blacklist -> reject list whitelist -> accept list Signed-off-by: Archie Pusaka <apusaka@chromium.org> Reviewed-by: Miao-chen Chou <mcchou@chromium.org> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2021-06-26Bluetooth: use inclusive language when tracking connectionsArchie Pusaka
This patch replaces some non-inclusive terms based on the appropriate language mapping table compiled by the Bluetooth SIG: https://specificationrefs.bluetooth.com/language-mapping/Appropriate_Language_Mapping_Table.pdf Specifically, these terms are replaced: master -> central slave -> peripheral Signed-off-by: Archie Pusaka <apusaka@chromium.org> Reviewed-by: Miao-chen Chou <mcchou@chromium.org> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2021-06-26Bluetooth: use inclusive language in SMPArchie Pusaka
This patch replaces some non-inclusive terms based on the appropriate language mapping table compiled by the Bluetooth SIG: https://specificationrefs.bluetooth.com/language-mapping/Appropriate_Language_Mapping_Table.pdf Specifically, these terms are replaced: master -> initiator slave -> responder Signed-off-by: Archie Pusaka <apusaka@chromium.org> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2021-06-26Bluetooth: use inclusive language in HCI LE featuresArchie Pusaka
This patch replaces some non-inclusive terms based on the appropriate language mapping table compiled by the Bluetooth SIG: https://specificationrefs.bluetooth.com/language-mapping/Appropriate_Language_Mapping_Table.pdf Specifically, these terms are replaced: master -> central slave -> peripheral Signed-off-by: Archie Pusaka <apusaka@chromium.org> Reviewed-by: Miao-chen Chou <mcchou@chromium.org> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2021-06-26Bluetooth: use inclusive language to describe CPBArchie Pusaka
This patch replaces some non-inclusive terms based on the appropriate language mapping table compiled by the Bluetooth SIG: https://specificationrefs.bluetooth.com/language-mapping/Appropriate_Language_Mapping_Table.pdf Specifically, these terms are replaced when describing the connectionless peripheral broadcast feature: master -> central slave -> peripheral Signed-off-by: Archie Pusaka <apusaka@chromium.org> Reviewed-by: Miao-chen Chou <mcchou@chromium.org> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2021-06-26Bluetooth: use inclusive language in hci_core.hArchie Pusaka
This patch replaces some non-inclusive terms based on the appropriate language mapping table compiled by the Bluetooth SIG: https://specificationrefs.bluetooth.com/language-mapping/Appropriate_Language_Mapping_Table.pdf Specifically, these terms are replaced: master -> central slave -> peripheral These attributes are not used elsewhere in the code. Signed-off-by: Archie Pusaka <apusaka@chromium.org> Reviewed-by: Miao-chen Chou <mcchou@chromium.org> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2021-06-26Bluetooth: Add ncmd=0 recovery handlingManish Mandlik
During command status or command complete event, the controller may set ncmd=0 indicating that it is not accepting any more commands. In such a case, host holds off sending any more commands to the controller. If the controller doesn't recover from such condition, host will wait forever, until the user decides that the Bluetooth is broken and may power cycles the Bluetooth. This patch triggers the hardware error to reset the controller and driver when it gets into such state as there is no other wat out. Reviewed-by: Abhishek Pandit-Subedi <abhishekpandit@chromium.org> Signed-off-by: Manish Mandlik <mmandlik@google.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2021-06-26Bluetooth: Return whether a connection is outboundYu Liu
When an MGMT_EV_DEVICE_CONNECTED event is reported back to the user space we will set the flags to tell if the established connection is outbound or not. This is useful for the user space to log better metrics and error messages. Reviewed-by: Miao-chen Chou <mcchou@chromium.org> Reviewed-by: Alain Michaud <alainm@chromium.org> Signed-off-by: Yu Liu <yudiliu@google.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2021-06-25Merge tag 'wireless-drivers-next-2021-06-25' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers-next Kalle Valo says: ==================== wireless-drivers-next patches for v5.14 Second, and most likely the last, set of patches for v5.14. mt76 and iwlwifi have most patches in this round, but rtw88 also has some new features. Nothing special really standing out. mt76 * mt7915 MSI support * disable ASPM on mt7915 * mt7915 tx status reporting * mt7921 decap offload rtw88 * beacon filter support * path diversity support * firmware crash information via devcoredump * quirks for disabling pci capabilities mt7601u * add USB ID for a XiaoDu WiFi Dongle ath11k * enable support for QCN9074 PCI devices brcmfmac * support parse country code map from DeviceTree iwlwifi * support for new hardware * support for BIOS control of 11ax enablement in Russia * support UNII4 band enablement from BIOS ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-24sctp: do black hole detection in search complete stateXin Long
Currently the PLPMUTD probe will stop for a long period (interval * 30) after it enters search complete state. If there's a pmtu change on the route path, it takes a long time to be aware if the ICMP TooBig packet is lost or filtered. As it says in rfc8899#section-4.3: "A DPLPMTUD method MUST NOT rely solely on this method." (ICMP PTB message). This patch is to enable the other method for search complete state: "A PL can use the DPLPMTUD probing mechanism to periodically generate probe packets of the size of the current PLPMTU." With this patch, the probe will continue with the current pmtu every 'interval' until the PMTU_RAISE_TIMER 'timeout', which we implement by adding raise_count to raise the probe size when it counts to 30 and removing the SCTP_PL_COMPLETE check for PMTU_RAISE_TIMER. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-24net: macsec: fix the length used to copy the key for offloadingAntoine Tenart
The key length used when offloading macsec to Ethernet or PHY drivers was set to MACSEC_KEYID_LEN (16), which is an issue as: - This was never meant to be the key length. - The key length can be > 16. Fix this by using MACSEC_MAX_KEY_LEN to store the key (the max length accepted in uAPI) and secy->key_len to copy it. Fixes: 3cf3227a21d1 ("net: macsec: hardware offloading infrastructure") Reported-by: Lior Nahmanson <liorna@nvidia.com> Signed-off-by: Antoine Tenart <atenart@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-24xdp: Add proper __rcu annotations to redirect map entriesToke Høiland-Jørgensen
XDP_REDIRECT works by a three-step process: the bpf_redirect() and bpf_redirect_map() helpers will lookup the target of the redirect and store it (along with some other metadata) in a per-CPU struct bpf_redirect_info. Next, when the program returns the XDP_REDIRECT return code, the driver will call xdp_do_redirect() which will use the information thus stored to actually enqueue the frame into a bulk queue structure (that differs slightly by map type, but shares the same principle). Finally, before exiting its NAPI poll loop, the driver will call xdp_do_flush(), which will flush all the different bulk queues, thus completing the redirect. Pointers to the map entries will be kept around for this whole sequence of steps, protected by RCU. However, there is no top-level rcu_read_lock() in the core code; instead drivers add their own rcu_read_lock() around the XDP portions of the code, but somewhat inconsistently as Martin discovered[0]. However, things still work because everything happens inside a single NAPI poll sequence, which means it's between a pair of calls to local_bh_disable()/local_bh_enable(). So Paul suggested[1] that we could document this intention by using rcu_dereference_check() with rcu_read_lock_bh_held() as a second parameter, thus allowing sparse and lockdep to verify that everything is done correctly. This patch does just that: we add an __rcu annotation to the map entry pointers and remove the various comments explaining the NAPI poll assurance strewn through devmap.c in favour of a longer explanation in filter.c. The goal is to have one coherent documentation of the entire flow, and rely on the RCU annotations as a "standard" way of communicating the flow in the map code (which can additionally be understood by sparse and lockdep). The RCU annotation replacements result in a fairly straight-forward replacement where READ_ONCE() becomes rcu_dereference_check(), WRITE_ONCE() becomes rcu_assign_pointer() and xchg() and cmpxchg() gets wrapped in the proper constructs to cast the pointer back and forth between __rcu and __kernel address space (for the benefit of sparse). The one complication is that xskmap has a few constructions where double-pointers are passed back and forth; these simply all gain __rcu annotations, and only the final reference/dereference to the inner-most pointer gets changed. With this, everything can be run through sparse without eliciting complaints, and lockdep can verify correctness even without the use of rcu_read_lock() in the drivers. Subsequent patches will clean these up from the drivers. [0] https://lore.kernel.org/bpf/20210415173551.7ma4slcbqeyiba2r@kafai-mbp.dhcp.thefacebook.com/ [1] https://lore.kernel.org/bpf/20210419165837.GA975577@paulmck-ThinkPad-P17-Gen-1/ Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20210624160609.292325-6-toke@redhat.com
2021-06-23Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfDavid S. Miller
Daniel Borkmann says: ==================== pull-request: bpf 2021-06-23 The following pull-request contains BPF updates for your *net* tree. We've added 14 non-merge commits during the last 6 day(s) which contain a total of 13 files changed, 137 insertions(+), 64 deletions(-). Note that when you merge net into net-next, there is a small merge conflict between 9f2470fbc4cb ("skmsg: Improve udp_bpf_recvmsg() accuracy") from bpf with c49661aa6f70 ("skmsg: Remove unused parameters of sk_msg_wait_data()") from net-next. Resolution is to: i) net/ipv4/udp_bpf.c: take udp_msg_wait_data() and remove err parameter from the function, ii) net/ipv4/tcp_bpf.c: take tcp_msg_wait_data() and remove err parameter from the function, iii) for net/core/skmsg.c and include/linux/skmsg.h: remove the sk_msg_wait_data() implementation and its prototype in header. The main changes are: 1) Fix BPF poke descriptor adjustments after insn rewrite, from John Fastabend. 2) Fix regression when using BPF_OBJ_GET with non-O_RDWR flags, from Maciej Żenczykowski. 3) Various bug and error handling fixes for UDP-related sock_map, from Cong Wang. 4) Fix patching of vmlinux BTF IDs with correct endianness, from Tony Ambardar. 5) Two fixes for TX descriptor validation in AF_XDP, from Magnus Karlsson. 6) Fix overflow in size calculation for bpf_map_area_alloc(), from Bui Quang Minh. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-23net/tls: Remove the __TLS_DEC_STATS() macro.Kuniyuki Iwashima
The commit d26b698dd3cd ("net/tls: add skeleton of MIB statistics") introduced __TLS_DEC_STATS(), but it is not used and __SNMP_DEC_STATS() is not defined also. Let's remove it. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-23Merge tag 'mlx5-net-next-2021-06-22' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux Saeed Mahameed says: ==================== mlx5-net-next-2021-06-22 1) Various minor cleanups and fixes from net-next branch 2) Optimize mlx5 feature check on tx and a fix to allow Vxlan with Ipsec offloads ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-23Merge branch 'master' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec Steffen Klassert says: ==================== pull request (net): ipsec 2021-06-23 1) Don't return a mtu smaller than 1280 on IPv6 pmtu discovery. From Sabrina Dubroca 2) Fix seqcount rcu-read side in xfrm_policy_lookup_bytype for the PREEMPT_RT case. From Varad Gautam. 3) Remove a repeated declaration of xfrm_parse_spi. From Shaokun Zhang. 4) IPv4 beet mode can't handle fragments, but IPv6 does. commit 68dc022d04eb ("xfrm: BEET mode doesn't support fragments for inner packets") handled IPv4 and IPv6 the same way. Relax the check for IPv6 because fragments are possible here. From Xin Long. 5) Memory allocation failures are not reported for XFRMA_ENCAP and XFRMA_COADDR in xfrm_state_construct. Fix this by moving both cases in front of the function. 6) Fix a missing initialization in the xfrm offload fallback fail case for bonding devices. From Ayush Sawal. Please pull or let me know if there are problems. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-23Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following patchset contains Netfilter updates for net-next: 1) Skip non-SCTP packets in the new SCTP chunk support for nft_exthdr, from Phil Sutter. 2) Simplify TCP option sanity check for TCP packets, also from Phil. 3) Add a new expression to store when the rule has been used last time. 4) Pass the hook state object to log function, from Florian Westphal. 5) Document the new sysctl knobs to tune the flowtable timeouts, from Oz Shlomo. 6) Fix snprintf error check in the new nfnetlink_hook infrastructure, from Dan Carpenter. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-23net: sched: remove qdisc->empty for lockless qdiscYunsheng Lin
As MISSED and DRAINING state are used to indicate a non-empty qdisc, qdisc->empty is not longer needed, so remove it. Acked-by: Jakub Kicinski <kuba@kernel.org> Tested-by: Vladimir Oltean <vladimir.oltean@nxp.com> # flexcan Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-23net: sched: implement TCQ_F_CAN_BYPASS for lockless qdiscYunsheng Lin
Currently pfifo_fast has both TCQ_F_CAN_BYPASS and TCQ_F_NOLOCK flag set, but queue discipline by-pass does not work for lockless qdisc because skb is always enqueued to qdisc even when the qdisc is empty, see __dev_xmit_skb(). This patch calls sch_direct_xmit() to transmit the skb directly to the driver for empty lockless qdisc, which aviod enqueuing and dequeuing operation. As qdisc->empty is not reliable to indicate a empty qdisc because there is a time window between enqueuing and setting qdisc->empty. So we use the MISSED state added in commit a90c57f2cedd ("net: sched: fix packet stuck problem for lockless qdisc"), which indicate there is lock contention, suggesting that it is better not to do the qdisc bypass in order to avoid packet out of order problem. In order to make MISSED state reliable to indicate a empty qdisc, we need to ensure that testing and clearing of MISSED state is within the protection of qdisc->seqlock, only setting MISSED state can be done without the protection of qdisc->seqlock. A MISSED state testing is added without the protection of qdisc->seqlock to aviod doing unnecessary spin_trylock() for contention case. As the enqueuing is not within the protection of qdisc->seqlock, there is still a potential data race as mentioned by Jakub [1]: thread1 thread2 thread3 qdisc_run_begin() # true qdisc_run_begin(q) set(MISSED) pfifo_fast_dequeue clear(MISSED) # recheck the queue qdisc_run_end() enqueue skb1 qdisc empty # true qdisc_run_begin() # true sch_direct_xmit() # skb2 qdisc_run_begin() set(MISSED) When above happens, skb1 enqueued by thread2 is transmited after skb2 is transmited by thread3 because MISSED state setting and enqueuing is not under the qdisc->seqlock. If qdisc bypass is disabled, skb1 has better chance to be transmited quicker than skb2. This patch does not take care of the above data race, because we view this as similar as below: Even at the same time CPU1 and CPU2 write the skb to two socket which both heading to the same qdisc, there is no guarantee that which skb will hit the qdisc first, because there is a lot of factor like interrupt/softirq/cache miss/scheduling afffecting that. There are below cases that need special handling: 1. When MISSED state is cleared before another round of dequeuing in pfifo_fast_dequeue(), and __qdisc_run() might not be able to dequeue all skb in one round and call __netif_schedule(), which might result in a non-empty qdisc without MISSED set. In order to avoid this, the MISSED state is set for lockless qdisc and __netif_schedule() will be called at the end of qdisc_run_end. 2. The MISSED state also need to be set for lockless qdisc instead of calling __netif_schedule() directly when requeuing a skb for a similar reason. 3. For netdev queue stopped case, the MISSED case need clearing while the netdev queue is stopped, otherwise there may be unnecessary __netif_schedule() calling. So a new DRAINING state is added to indicate this case, which also indicate a non-empty qdisc. 4. As there is already netif_xmit_frozen_or_stopped() checking in dequeue_skb() and sch_direct_xmit(), which are both within the protection of qdisc->seqlock, but the same checking in __dev_xmit_skb() is without the protection, which might cause empty indication of a lockless qdisc to be not reliable. So remove the checking in __dev_xmit_skb(), and the checking in the protection of qdisc->seqlock seems enough to avoid the cpu consumption problem for netdev queue stopped case. 1. https://lkml.org/lkml/2021/5/29/215 Acked-by: Jakub Kicinski <kuba@kernel.org> Tested-by: Vladimir Oltean <vladimir.oltean@nxp.com> # flexcan Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-23net: sched: avoid unnecessary seqcount operation for lockless qdiscYunsheng Lin
qdisc->running seqcount operation is mainly used to do heuristic locking on q->busylock for locked qdisc, see qdisc_is_running() and __dev_xmit_skb(). So avoid doing seqcount operation for qdisc with TCQ_F_NOLOCK flag. Acked-by: Jakub Kicinski <kuba@kernel.org> Tested-by: Vladimir Oltean <vladimir.oltean@nxp.com> # flexcan Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-23cfg80211: Add wiphy_info_once()Dmitry Osipenko
Add wiphy_info_once() helper that prints info message only once. Signed-off-by: Dmitry Osipenko <digetx@gmail.com> Acked-by: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: Kalle Valo <kvalo@codeaurora.org> Link: https://lore.kernel.org/r/20210511211549.30571-1-digetx@gmail.com
2021-06-23mac80211: Switch to a virtual time-based airtime schedulerToke Høiland-Jørgensen
This switches the airtime scheduler in mac80211 to use a virtual time-based scheduler instead of the round-robin scheduler used before. This has a couple of advantages: - No need to sync up the round-robin scheduler in firmware/hardware with the round-robin airtime scheduler. - If several stations are eligible for transmission we can schedule both of them; no need to hard-block the scheduling rotation until the head of the queue has used up its quantum. - The check of whether a station is eligible for transmission becomes simpler (in ieee80211_txq_may_transmit()). The drawback is that scheduling becomes slightly more expensive, as we need to maintain an rbtree of TXQs sorted by virtual time. This means that ieee80211_register_airtime() becomes O(logN) in the number of currently scheduled TXQs because it can change the order of the scheduled stations. We mitigate this overhead by only resorting when a station changes position in the tree, and hopefully N rarely grows too big (it's only TXQs currently backlogged, not all associated stations), so it shouldn't be too big of an issue. To prevent divisions in the fast path, we maintain both station sums and pre-computed reciprocals of the sums. This turns the fast-path operation into a multiplication, with divisions only happening as the number of active stations change (to re-compute the current sum of all active station weights). To prevent this re-computation of the reciprocal from happening too frequently, we use a time-based notion of station activity, instead of updating the weight every time a station gets scheduled or de-scheduled. As queues can oscillate between empty and occupied quite frequently, this can significantly cut down on the number of re-computations. It also has the added benefit of making the station airtime calculation independent on whether the queue happened to have drained at the time an airtime value was accounted. Co-developed-by: Yibo Zhao <yiboz@codeaurora.org> Signed-off-by: Yibo Zhao <yiboz@codeaurora.org> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://lore.kernel.org/r/20210623134755.235545-1-toke@redhat.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-06-23mac80211: notify driver on mgd TX completionJohannes Berg
We have mgd_prepare_tx(), but sometimes drivers may want/need to take action when the exchange finishes, whether successfully or not. Add a notification to the driver on completion, i.e. call the new method mgd_complete_tx(). To unify the two scenarios, and to add more information, make both of them take a struct that has the duration (prepare only), subtype (both) and success (complete only). Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Luca Coelho <luciano.coelho@intel.com> Link: https://lore.kernel.org/r/iwlwifi.20210618133832.5d94e78f6230.I6dc979606b6f28701b740d7aab725f7853a5a155@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-06-23cfg80211: allow advertising vendor-specific capabilitiesJohannes Berg
There may be cases where vendor-specific elements need to be used over the air. Rather than have driver or firmware add them and possibly cause problems that way, add them to the iftype-data band capabilities. This way we can advertise to userspace first, and use them in mac80211 next. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Luca Coelho <luciano.coelho@intel.com> Link: https://lore.kernel.org/r/iwlwifi.20210618133832.e8c4f0347276.Iee5964682b3e9ec51fc1cd57a7c62383eaf6ddd7@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-06-23cfg80211: add cfg80211_any_usable_channels()Johannes Berg
This helper function checks if there are any usable channels on any of the given bands with the given properties (as expressed by disallowed channel flags). Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Luca Coelho <luciano.coelho@intel.com> Link: https://lore.kernel.org/r/iwlwifi.20210618133832.2b613addaa85.Idaf8b859089490537878a7de5c7453a873a3f638@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-06-23cfg80211: remove ieee80211_get_he_sta_cap()Johannes Berg
This function turned out to be too easy to misuse since it doesn't consider the interface type. Remove it now that we no longer use it in mac80211. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Luca Coelho <luciano.coelho@intel.com> Link: https://lore.kernel.org/r/iwlwifi.20210618133832.8c9c72f914b0.I68e9c0626dc77a0f67f238a05ae16a0b77b09895@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-06-23nl80211/cfg80211: add BSS color to NDP ranging parametersAvraham Stern
In NDP ranging, the initiator need to set the BSS color in the NDP to the BSS color of the responder. Add the BSS color as a parameter for NDP ranging. Signed-off-by: Avraham Stern <avraham.stern@intel.com> Signed-off-by: Luca Coelho <luciano.coelho@intel.com> Link: https://lore.kernel.org/r/iwlwifi.20210618133832.f097a6144b59.I27dec8b994df52e691925ea61be4dd4fa6d396c0@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-06-23mac80211: add to bss_conf if broadcast TWT is supportedShaul Triebitz
Add to struct ieee80211_bss_conf a twt_broadcast field. Set it to true if both STA and AP support broadcast TWT. Signed-off-by: Shaul Triebitz <shaul.triebitz@intel.com> Signed-off-by: Luca Coelho <luciano.coelho@intel.com> Link: https://lore.kernel.org/r/iwlwifi.20210618133832.f7c105237541.I50b302044e2b35e5ed4d3fb8bc7bd3d8bb89b1e1@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-06-23mac80211: move A-MPDU session check from minstrel_ht to mac80211Felix Fietkau
This avoids calling back into tx handlers from within the rate control module. Preparation for deferring rate control until tx dequeue Signed-off-by: Felix Fietkau <nbd@nbd.name> Link: https://lore.kernel.org/r/20210617163113.75815-1-nbd@nbd.name Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-06-23cfg80211: expose the rfkill device to the low level driverEmmanuel Grumbach
This will allow the low level driver to query the rfkill state. Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com> Link: https://lore.kernel.org/r/20210616202826.9833-1-emmanuel.grumbach@intel.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-06-23mac80211: add ieee80211_is_tx_data helper functionPhilipp Borgers
Add a helper function that checks if a frame is a data frame. Frames with hardware encapsulation enabled are data frames. Signed-off-by: Philipp Borgers <borgers@mi.fu-berlin.de> Link: https://lore.kernel.org/r/20210519122019.92359-2-borgers@mi.fu-berlin.de Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-06-23cfg80211: remove CFG80211_MAX_NUM_DIFFERENT_CHANNELSJohannes Berg
We no longer need to put any limits here, hardware will and mac80211-hwsim can do whatever it likes. The reason we had this was some accounting code (still mentioned in the comment) but that code was deleted in commit c781944b71f8 ("cfg80211: Remove unused cfg80211_can_use_iftype_chan()"). Link: https://lore.kernel.org/r/20210506221159.d1d61db1d31c.Iac4da68d54b9f1fdc18a03586bbe06aeb9515425@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-06-22net/xfrm: Add inner_ipproto into sec_pathHuy Nguyen
The inner_ipproto saves the inner IP protocol of the plain text packet. This allows vendor's IPsec feature making offload decision at skb's features_check and configuring hardware at ndo_start_xmit. For example, ConnectX6-DX IPsec device needs the plaintext's IP protocol to support partial checksum offload on VXLAN/GENEVE packet over IPsec transport mode tunnel. Signed-off-by: Raed Salem <raeds@nvidia.com> Signed-off-by: Huy Nguyen <huyn@nvidia.com> Cc: Steffen Klassert <steffen.klassert@secunet.com> Acked-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>