summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2020-09-11Revert "net: dsa: Add more convenient functions for installing port VLANs"Vladimir Oltean
This reverts commit 314f76d7a68bab0516aa52877944e6aacfa0fc3f. Citing that commit message, the call graph was: dsa_slave_vlan_rx_add_vid dsa_port_setup_8021q_tagging | | | | | +-------------+ | | v v dsa_port_vid_add dsa_slave_port_obj_add | | +-------+ +-------+ | | v v dsa_port_vlan_add Now that tag_8021q has its own ops structure, it no longer relies on dsa_port_vid_add, and therefore on the dsa_switch_ops to install its VLANs. So dsa_port_vid_add now only has one single caller. So we can simplify the call graph to what it was before, aka: dsa_slave_vlan_rx_add_vid dsa_slave_port_obj_add | | +-------+ +-------+ | | v v dsa_port_vlan_add Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-11net: dsa: tag_8021q: add a context structureVladimir Oltean
While working on another tag_8021q driver implementation, some things became apparent: - It is not mandatory for a DSA driver to offload the tag_8021q VLANs by using the VLAN table per se. For example, it can add custom TCAM rules that simply encapsulate RX traffic, and redirect & decapsulate rules for TX traffic. For such a driver, it makes no sense to receive the tag_8021q configuration through the same callback as it receives the VLAN configuration from the bridge and the 8021q modules. - Currently, sja1105 (the only tag_8021q user) sets a priv->expect_dsa_8021q variable to distinguish between the bridge calling, and tag_8021q calling. That can be improved, to say the least. - The crosschip bridging operations are, in fact, stateful already. The list of crosschip_links must be kept by the caller and passed to the relevant tag_8021q functions. So it would be nice if the tag_8021q configuration was more self-contained. This patch attempts to do that. Create a struct dsa_8021q_context which encapsulates a struct dsa_switch, and has 2 function pointers for adding and deleting a VLAN. These will replace the previous channel to the driver, which was through the .port_vlan_add and .port_vlan_del callbacks of dsa_switch_ops. Also put the list of crosschip_links into this dsa_8021q_context. Drivers that don't support cross-chip bridging can simply omit to initialize this list, as long as they dont call any cross-chip function. The sja1105_vlan_add and sja1105_vlan_del functions are refactored into a smaller sja1105_vlan_add_one, which now has 2 entry points: - sja1105_vlan_add, from struct dsa_switch_ops - sja1105_dsa_8021q_vlan_add, from the tag_8021q ops But even this change is fairly trivial. It just reflects the fact that for sja1105, the VLANs from these 2 channels end up in the same hardware table. However that is not necessarily true in the general sense (and that's the reason for making this change). The rest of the patch is mostly plain refactoring of "ds" -> "ctx". The dsa_8021q_context structure needs to be propagated because adding a VLAN is now done through the ops function pointers inside of it. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-11net: dsa: tag_8021q: setup tagging via a single function callVladimir Oltean
There is no point in calling dsa_port_setup_8021q_tagging for each individual port. Additionally, it will become more difficult to do that when we'll have a context structure to tag_8021q (next patch). So refactor this now. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-11bridge: mcast: Fix incomplete MDB dumpIdo Schimmel
Each MDB entry is encoded in a nested netlink attribute called 'MDBA_MDB_ENTRY'. In turn, this attribute contains another nested attributed called 'MDBA_MDB_ENTRY_INFO', which encodes a single port group entry within the MDB entry. The cited commit added the ability to restart a dump from a specific port group entry. However, on failure to add a port group entry to the dump the entire MDB entry (stored in 'nest2') is removed, resulting in missing port group entries. Fix this by finalizing the MDB entry with the partial list of already encoded port group entries. Fixes: 5205e919c9f0 ("net: bridge: mcast: add support for src list and filter mode dumping") Signed-off-by: Ido Schimmel <idosch@nvidia.com> Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-11ipv6: remove redundant assignment to variable errColin Ian King
The variable err is being initialized with a value that is never read and it is being updated later with a new value. The initialization is redundant and can be removed. Also re-order variable declarations in reverse Christmas tree ordering. Addresses-Coverity: ("Unused value") Signed-off-by: Colin Ian King <colin.king@canonical.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-10net/smc: use separate work queues for different worker typesKarsten Graul
There are 6 types of workers which exist per smc connection. 3 of them are used for listen and handshake processing, another 2 are used for close and abort processing and 1 is the tx worker that moves calls to sleeping functions into a worker. To prevent flooding of the system work queue when many connections are opened or closed at the same time (some pattern uperf implements), move those workers to one of 3 smc-specific work queues. Two work queues are module-global and used for handshake and close workers. The third work queue is defined per link group and used by the tx workers that may sleep waiting for resources of this link group. And in smc_llc_enqueue() queue the llc_event_work work to the system prio work queue because its critical that this work is started fast. Reviewed-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-10net/smc: use the retry mechanism for netlink messagesGuvenc Gulce
When the netlink messages to be sent to the userspace are too big for a single netlink message, send them in chunks using the netlink_dump infrastructure. Modify the smc diag dump code so that it can signal to the netlink_dump infrastructure that it needs to send more data. Signed-off-by: Guvenc Gulce <guvenc@linux.ibm.com> Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-10net/smc: immediate freeing in smc_lgr_cleanup_early()Ursula Braun
smc_lgr_cleanup_early() schedules the free worker with delay. DMB unregistering occurs in this delayed worker increasing the risk to reach the SMCD SBA limit without need. Terminate the linkgroup immediately, since termination means early DMB unregistering. For SMCD the global smc_server_lgr_pending lock is given up early. A linkgroup to be given up with smc_lgr_cleanup_early() may already contain more than one connection. Using __smc_lgr_terminate() in smc_lgr_cleanup_early() covers this. And consolidate smc_ism_put_vlan() and smc_put_device() into smc_lgr_free() only. Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-10net/smc: reduce smc_listen_decline() callsUrsula Braun
smc_listen_work() contains already an smc_listen_decline() exit. Use this exit for smc_listen_rdma_finish() problems as well. Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-10net/smc: improve server ISM device determinationUrsula Braun
Move check whether peer can be reached into smc_pnet_find_ism_by_pnetid(). Thus searching continues for another ism device, if check fails. Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-10net/smc: common routine for CLC accept and confirmUrsula Braun
smc_clc_send_accept() and smc_clc_send_confirm() are quite similar. Move common code into a separate function smc_clc_send_confirm_accept(). And introduce separate SMCD and SMCR struct definitions for CLC accept resp. confirm. No functional change. Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-10net/smc: dynamic allocation of CLC proposal bufferUrsula Braun
Reduce stack size for smc_listen_work() and smc_clc_send_proposal() by dynamic allocation of the CLC buffer to be received or sent. Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-10net/smc: introduce better field namesUrsula Braun
Field names "srv_first_contact" and "cln_first_contact" are misleading, since they apply to both, server and client. Rename them to "first_contact_peer" and "first_contact_local". Rename "ism_gid" by the more precise name "ism_peer_gid". Rename version constant "SMC_CLC_V1" into "SMC_V1". No functional change. Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-10net/smc: reduce active tcp_listen workersUrsula Braun
SMC starts a separate tcp_listen worker for every SMC socket in state SMC_LISTEN, and can accept an incoming connection request only, if this worker is really running and waiting in kernel_accept(). But the number of running workers is limited. This patch reworks the listening SMC code and starts a tcp_listen worker after the SYN-ACK handshake on the internal clc-socket only. Suggested-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Reviewed-by: Guvenc Gulce <guvenc@linux.ibm.com> Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-10tcp: reflect tos value received in SYN to the socketWei Wang
This commit adds a new TCP feature to reflect the tos value received in SYN, and send it out on the SYN-ACK, and eventually set the tos value of the established socket with this reflected tos value. This provides a way to set the traffic class/QoS level for all traffic in the same connection to be the same as the incoming SYN request. It could be useful in data centers to provide equivalent QoS according to the incoming request. This feature is guarded by /proc/sys/net/ipv4/tcp_reflect_tos, and is by default turned off. Signed-off-by: Wei Wang <weiwan@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-10ip: pass tos into ip_build_and_send_pkt()Wei Wang
This commit adds tos as a new passed in parameter to ip_build_and_send_pkt() which will be used in the later commit. This is a pure restructure and does not have any functional change. Signed-off-by: Wei Wang <weiwan@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-10tcp: record received TOS value in the request socketWei Wang
A new field is added to the request sock to record the TOS value received on the listening socket during 3WHS: When not under syn flood, it is recording the TOS value sent in SYN. When under syn flood, it is recording the TOS value sent in the ACK. This is a preparation patch in order to do TOS reflection in the later commit. Signed-off-by: Wei Wang <weiwan@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-10net: make sure napi_list is safe for RCU traversalJakub Kicinski
netpoll needs to traverse dev->napi_list under RCU, make sure it uses the right iterator and that removal from this list is handled safely. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-10net: manage napi add/del idempotence explicitlyJakub Kicinski
To RCUify napi->dev_list we need to replace list_del_init() with list_del_rcu(). There is no _init() version for RCU for obvious reasons. Up until now netif_napi_del() was idempotent so to make sure it remains such add a bit which is set when NAPI is listed, and cleared when it removed. Since we don't expect multiple calls to netif_napi_add() to be correct, add a warning on that side. Now that napi_hash_add / napi_hash_del are only called by napi_add / del we can actually steal its bit. We just need to make sure hash node is initialized correctly. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-10net: remove napi_hash_del() from driver-facing APIJakub Kicinski
We allow drivers to call napi_hash_del() before calling netif_napi_del() to batch RCU grace periods. This makes the API asymmetric and leaks internal implementation details. Soon we will want the grace period to protect more than just the NAPI hash table. Restructure the API and have drivers call a new function - __netif_napi_del() if they want to take care of RCU waits. Note that only core was checking the return status from napi_hash_del() so the new helper does not report if the NAPI was actually deleted. Some notes on driver oddness: - veth observed the grace period before calling netif_napi_del() but that should not matter - myri10ge observed normal RCU flavor - bnx2x and enic did not actually observe the grace period (unless they did so implicitly) - virtio_net and enic only unhashed Rx NAPIs The last two points seem to indicate that the calls to napi_hash_del() were a left over rather than an optimization. Regardless, it's easy enough to correct them. This patch may introduce extra synchronize_net() calls for interfaces which set NAPI_STATE_NO_BUSY_POLL and depend on free_netdev() to call netif_napi_del(). This seems inevitable since we want to use RCU for netpoll dev->napi_list traversal, and almost no drivers set IFF_DISABLE_NETPOLL. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-10devlink: don't crash if netdev is NULLJakub Kicinski
Following change will add support for a corner case where we may not have a netdev to pass to devlink_port_type_eth_set() but we still want to set port type. This is definitely a corner case, and drivers should not normally pass NULL netdev - print a warning message when this happens. Sadly for other port types (ib) switches don't have a device reference, the way we always do for Ethernet, so we can't put the warning in __devlink_port_type_set(). Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-10ipmr: Use full VIF ID in netlink cache reportsPaul Davey
Insert the full 16 bit VIF ID into ipmr Netlink cache reports. The VIF_ID attribute has 32 bits of space so can store the full VIF ID extracted from the high and low byte fields in the igmpmsg. Signed-off-by: Paul Davey <paul.davey@alliedtelesis.co.nz> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-10ipmr: Add high byte of VIF ID to igmpmsgPaul Davey
Use the unused3 byte in struct igmpmsg to hold the high 8 bits of the VIF ID. If using more than 255 IPv4 multicast interfaces it is necessary to have access to a VIF ID for cache reports that is wider than 8 bits, the VIF ID present in the igmpmsg reports sent to mroute_sk was only 8 bits wide in the igmpmsg header. Adding the high 8 bits of the 16 bit VIF ID in the unused byte allows use of more than 255 IPv4 multicast interfaces. Signed-off-by: Paul Davey <paul.davey@alliedtelesis.co.nz> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-10ipmr: Add route table ID to netlink cache reportsPaul Davey
Insert the multicast route table ID as a Netlink attribute to Netlink cache report notifications. When multiple route tables are in use it is necessary to have a way to determine which route table a given cache report belongs to when receiving the cache report. Signed-off-by: Paul Davey <paul.davey@alliedtelesis.co.nz> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-09devlink: Use controller while building phys_port_nameParav Pandit
Now that controller number attribute is available, use it when building phsy_port_name for external controller ports. An example devlink port and representor netdev name consist of controller annotation for external controller with controller number = 1, for a VF 1 of PF 0: $ devlink port show pci/0000:06:00.0/2 pci/0000:06:00.0/2: type eth netdev ens2f0c1pf0vf1 flavour pcivf controller 1 pfnum 0 vfnum 1 external true splittable false function: hw_addr 00:00:00:00:00:00 $ devlink port show pci/0000:06:00.0/2 -jp { "port": { "pci/0000:06:00.0/2": { "type": "eth", "netdev": "ens2f0c1pf0vf1", "flavour": "pcivf", "controller": 1, "pfnum": 0, "vfnum": 1, "external": true, "splittable": false, "function": { "hw_addr": "00:00:00:00:00:00" } } } } Controller number annotation is skipped for non external controllers to maintain backward compatibility. Signed-off-by: Parav Pandit <parav@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-09devlink: Introduce controller numberParav Pandit
A devlink port may be for a controller consist of PCI device. A devlink instance holds ports of two types of controllers. (1) controller discovered on same system where eswitch resides This is the case where PCI PF/VF of a controller and devlink eswitch instance both are located on a single system. (2) controller located on external host system. This is the case where a controller is located in one system and its devlink eswitch ports are located in a different system. When a devlink eswitch instance serves the devlink ports of both controllers together, PCI PF/VF numbers may overlap. Due to this a unique phys_port_name cannot be constructed. For example in below such system controller-0 and controller-1, each has PCI PF pf0 whose eswitch ports can be present in controller-0. These results in phys_port_name as "pf0" for both. Similar problem exists for VFs and upcoming Sub functions. An example view of two controller systems: --------------------------------------------------------- | | | --------- --------- ------- ------- | ----------- | | vf(s) | | sf(s) | |vf(s)| |sf(s)| | | server | | ------- ----/---- ---/----- ------- ---/--- ---/--- | | pci rc |=== | pf0 |______/________/ | pf1 |___/_______/ | | connect | | ------- ------- | ----------- | | controller_num=1 (no eswitch) | ------|-------------------------------------------------- (internal wire) | --------------------------------------------------------- | devlink eswitch ports and reps | | ----------------------------------------------------- | | |ctrl-0 | ctrl-0 | ctrl-0 | ctrl-0 | ctrl-0 |ctrl-0 | | | |pf0 | pf0vfN | pf0sfN | pf1 | pf1vfN |pf1sfN | | | ----------------------------------------------------- | | |ctrl-1 | ctrl-1 | ctrl-1 | ctrl-1 | ctrl-1 |ctrl-1 | | | |pf1 | pf1vfN | pf1sfN | pf1 | pf1vfN |pf0sfN | | | ----------------------------------------------------- | | | | | | --------- --------- ------- ------- | | | vf(s) | | sf(s) | |vf(s)| |sf(s)| | | ------- ----/---- ---/----- ------- ---/--- ---/--- | | | pf0 |______/________/ | pf1 |___/_______/ | | ------- ------- | | | | local controller_num=0 (eswitch) | --------------------------------------------------------- An example devlink port for external controller with controller number = 1 for a VF 1 of PF 0: $ devlink port show pci/0000:06:00.0/2 pci/0000:06:00.0/2: type eth netdev ens2f0pf0vf1 flavour pcivf controller 1 pfnum 0 vfnum 1 external true splittable false function: hw_addr 00:00:00:00:00:00 $ devlink port show pci/0000:06:00.0/2 -jp { "port": { "pci/0000:06:00.0/2": { "type": "eth", "netdev": "ens2f0pf0vf1", "flavour": "pcivf", "controller": 1, "pfnum": 0, "vfnum": 1, "external": true, "splittable": false, "function": { "hw_addr": "00:00:00:00:00:00" } } } } Signed-off-by: Parav Pandit <parav@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-09devlink: Introduce external controller flagParav Pandit
A devlink eswitch port may represent PCI PF/VF ports of a controller. A controller either located on same system or it can be an external controller located in host where such NIC is plugged in. Add the ability for driver to specify if a port is for external controller. Use such flag in the mlx5_core driver. An example of an external controller having VF1 of PF0 belong to controller 1. $ devlink port show pci/0000:06:00.0/2 pci/0000:06:00.0/2: type eth netdev ens2f0pf0vf1 flavour pcivf pfnum 0 vfnum 1 external true splittable false function: hw_addr 00:00:00:00:00:00 $ devlink port show pci/0000:06:00.0/2 -jp { "port": { "pci/0000:06:00.0/2": { "type": "eth", "netdev": "ens2f0pf0vf1", "flavour": "pcivf", "pfnum": 0, "vfnum": 1, "external": true, "splittable": false, "function": { "hw_addr": "00:00:00:00:00:00" } } } } Signed-off-by: Parav Pandit <parav@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-09Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following patchset contains Netfilter updates for net-next: 1) Rewrite inner header IPv6 in ICMPv6 messages in ip6t_NPT, from Michael Zhou. 2) do_ip_vs_set_ctl() dereferences uninitialized value, from Peilin Ye. 3) Support for userdata in tables, from Jose M. Guisado. 4) Do not increment ct error and invalid stats at the same time, from Florian Westphal. 5) Remove ct ignore stats, also from Florian. 6) Add ct stats for clash resolution, from Florian Westphal. 7) Bump reference counter bump on ct clash resolution only, this is safe because bucket lock is held, again from Florian. 8) Use ip_is_fragment() in xt_HMARK, from YueHaibing. 9) Add wildcard support for nft_socket, from Balazs Scheidler. 10) Remove superfluous IPVS dependency on iptables, from Yaroslav Bolyukin. 11) Remove unused definition in ebt_stp, from Wang Hai. 12) Replace CONFIG_NFT_CHAIN_NAT_{IPV4,IPV6} by CONFIG_NFT_NAT in selftests/net, from Fabian Frederick. 13) Add userdata support for nft_object, from Jose M. Guisado. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-08ipv6: add tos reflection in TCP reset and ackWei Wang
Currently, ipv6 stack does not do any TOS reflection. To make the behavior consistent with v4 stack, this commit adds TOS reflection in tcp_v6_reqsk_send_ack() and tcp_v6_send_reset(). We clear the lower 2-bit ECN value of the received TOS in compliance with RFC 3168 6.1.5 robustness principles. Signed-off-by: Wei Wang <weiwan@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-08Merge tag 'rxrpc-next-20200908' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs David Howells says: ==================== rxrpc: Allow more calls to same peer Here are some development patches for AF_RXRPC that allow more simultaneous calls to be made to the same peer with the same security parameters. The current code allows a maximum of 4 simultaneous calls, which limits the afs filesystem to that many simultaneous threads. This increases the limit to 16. To make this work, the way client connections are limited has to be changed (incoming call/connection limits are unaffected) as the current code depends on queuing calls on a connection and then pushing the connection through a queue. The limit is on the number of available connections. This is changed such that there's a limit[*] on the total number of calls systemwide across all namespaces, but the limit on the number of client connections is removed. Once a call is allowed to proceed, it finds a bundle of connections and tries to grab a call slot. If there's a spare call slot, fine, otherwise it will wait. If there's already a waiter, it will try to create another connection in the bundle, unless the limit of 4 is reached (4 calls per connection, giving 16). A number of things throttle someone trying to set up endless connections: - Calls that fail immediately have their conns deleted immediately, - Calls that don't fail immediately have to wait for a timeout, - Connections normally get automatically reaped if they haven't been used for 2m, but this is sped up to 2s if the number of connections rises over 900. This number is tunable by sysctl. [*] Technically two limits - kernel sockets and userspace rxrpc sockets are accounted separately. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-08net: bridge: mcast: fix unused br var when lockdep isn't definedNikolay Aleksandrov
Stephen reported the following warning: net/bridge/br_multicast.c: In function 'br_multicast_find_port': net/bridge/br_multicast.c:1818:21: warning: unused variable 'br' [-Wunused-variable] 1818 | struct net_bridge *br = mp->br; | ^~ It happens due to bridge's mlock_dereference() when lockdep isn't defined. Silence the warning by annotating the variable as __maybe_unused. Fixes: 0436862e417e ("net: bridge: mcast: support for IGMPv3/MLDv2 ALLOW_NEW_SOURCES report") Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-08netlabel: Fix some kernel-doc warningsWang Hai
Fixes the following W=1 kernel build warning(s): net/netlabel/netlabel_calipso.c:438: warning: Excess function parameter 'audit_secid' description in 'calipso_doi_remove' net/netlabel/netlabel_calipso.c:605: warning: Excess function parameter 'reg' description in 'calipso_req_delattr' Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Wang Hai <wanghai38@huawei.com> Acked-by: Paul Moore <paul@paul-moore.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-08cipso: fix 'audit_secid' kernel-doc warning in cipso_ipv4.cWang Hai
Fixes the following W=1 kernel build warning(s): net/ipv4/cipso_ipv4.c:510: warning: Excess function parameter 'audit_secid' description in 'cipso_v4_doi_remove' Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Wang Hai <wanghai38@huawei.com> Acked-by: Paul Moore <paul@paul-moore.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-08net: sched: skip an unnecessay checkTom Rix
Reviewing the error handling in tcf_action_init_1() most of the early handling uses err_out: if (cookie) { kfree(cookie->data); kfree(cookie); } before cookie could ever be set. So skip the unnecessay check. Signed-off-by: Tom Rix <trix@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-08rxrpc: Allow multiple client connections to the same peerDavid Howells
Allow the number of parallel connections to a machine to be expanded from a single connection to a maximum of four. This allows up to 16 calls to be in progress at the same time to any particular peer instead of 4. Signed-off-by: David Howells <dhowells@redhat.com>
2020-09-08rxrpc: Rewrite the client connection managerDavid Howells
Rewrite the rxrpc client connection manager so that it can support multiple connections for a given security key to a peer. The following changes are made: (1) For each open socket, the code currently maintains an rbtree with the connections placed into it, keyed by communications parameters. This is tricky to maintain as connections can be culled from the tree or replaced within it. Connections can require replacement for a number of reasons, e.g. their IDs span too great a range for the IDR data type to represent efficiently, the call ID numbers on that conn would overflow or the conn got aborted. This is changed so that there's now a connection bundle object placed in the tree, keyed on the same parameters. The bundle, however, does not need to be replaced. (2) An rxrpc_bundle object can now manage the available channels for a set of parallel connections. The lock that manages this is moved there from the rxrpc_connection struct (channel_lock). (3) There'a a dummy bundle for all incoming connections to share so that they have a channel_lock too. It might be better to give each incoming connection its own bundle. This bundle is not needed to manage which channels incoming calls are made on because that's the solely at whim of the client. (4) The restrictions on how many client connections are around are removed. Instead, a previous patch limits the number of client calls that can be allocated. Ordinarily, client connections are reaped after 2 minutes on the idle queue, but when more than a certain number of connections are in existence, the reaper starts reaping them after 2s of idleness instead to get the numbers back down. It could also be made such that new call allocations are forced to wait until the number of outstanding connections subsides. Signed-off-by: David Howells <dhowells@redhat.com>
2020-09-08rxrpc: Impose a maximum number of client callsDavid Howells
Impose a maximum on the number of client rxrpc calls that are allowed simultaneously. This will be in lieu of a maximum number of client connections as this is easier to administed as, unlike connections, calls aren't reusable (to be changed in a subsequent patch).. This doesn't affect the limits on service calls and connections. Signed-off-by: David Howells <dhowells@redhat.com>
2020-09-08netfilter: nf_tables: add userdata support for nft_objectJose M. Guisado Gomez
Enables storing userdata for nft_object. Initially this will store an optional comment but can be extended in the future as needed. Adds new attribute NFTA_OBJ_USERDATA to nft_object. Signed-off-by: Jose M. Guisado Gomez <guigom@riseup.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-09-08netfilter: ebt_stp: Remove unused macro BPDU_TYPE_TCNWang Hai
BPDU_TYPE_TCN is never used after it was introduced. So better to remove it. Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Wang Hai <wanghai38@huawei.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-09-07net: dsa: don't print non-fatal MTU error if not supportedVladimir Oltean
Commit 72579e14a1d3 ("net: dsa: don't fail to probe if we couldn't set the MTU") changed, for some reason, the "err && err != -EOPNOTSUPP" check into a simple "err". This causes the MTU warning to be printed even for drivers that don't have the MTU operations implemented. Fix that. Signed-off-by: Vladimir Oltean <olteanv@gmail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-09-07net: dsa: change PHY error message againVladimir Oltean
slave_dev->name is only populated at this stage if it was specified through a label in the device tree. However that is not mandatory. When it isn't, the error message looks like this: [ 5.037057] fsl_enetc 0000:00:00.2 eth2: error -19 setting up slave PHY for eth%d [ 5.044672] fsl_enetc 0000:00:00.2 eth2: error -19 setting up slave PHY for eth%d [ 5.052275] fsl_enetc 0000:00:00.2 eth2: error -19 setting up slave PHY for eth%d [ 5.059877] fsl_enetc 0000:00:00.2 eth2: error -19 setting up slave PHY for eth%d which is especially confusing since the error gets printed on behalf of the DSA master (fsl_enetc in this case). Printing an error message that contains a valid reference to the DSA port's name is difficult at this point in the initialization stage, so at least we should print some info that is more reliable, even if less user-friendly. That may be the driver name and the hardware port index. After this change, the error is printed as: [ 6.051587] mscc_felix 0000:00:00.5: error -19 setting up PHY for tree 0, switch 0, port 0 [ 6.061192] mscc_felix 0000:00:00.5: error -19 setting up PHY for tree 0, switch 0, port 1 [ 6.070765] mscc_felix 0000:00:00.5: error -19 setting up PHY for tree 0, switch 0, port 2 [ 6.080324] mscc_felix 0000:00:00.5: error -19 setting up PHY for tree 0, switch 0, port 3 Signed-off-by: Vladimir Oltean <olteanv@gmail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-09-07rxrpc: Remove unused macro rxrpc_min_rtt_wlenWang Hai
rxrpc_min_rtt_wlen is never used after it was introduced. So better to remove it. Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Wang Hai <wanghai38@huawei.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-09-07net: bridge: mcast: destroy all entries via gcNikolay Aleksandrov
Since each entry type has timers that can be running simultaneously we need to make sure that entries are not freed before their timers have finished. In order to do that generalize the src gc work to mcast gc work and use a callback to free the entries (mdb, port group or src). v3: add IPv6 support v2: force mcast gc on port del to make sure all port group timers have finished before freeing the bridge port Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-09-07net: bridge: mcast: improve IGMPv3/MLDv2 query processingNikolay Aleksandrov
When an IGMPv3/MLDv2 query is received and we're operating in such mode then we need to avoid updating group timers if the suppress flag is set. Also we should update only timers for groups in exclude mode. v3: add IPv6/MLDv2 support Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-09-07net: bridge: mcast: support for IGMPV3/MLDv2 BLOCK_OLD_SOURCES reportNikolay Aleksandrov
We already have all necessary helpers, so process IGMPV3/MLDv2 BLOCK_OLD_SOURCES as per the RFCs. v3: add IPv6/MLDv2 support v2: directly do flag bit operations Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-09-07net: bridge: mcast: support for IGMPV3/MLDv2 CHANGE_TO_INCLUDE/EXCLUDE reportNikolay Aleksandrov
In order to process IGMPV3/MLDv2 CHANGE_TO_INCLUDE/EXCLUDE report types we need new helpers which allow us to mark entries based on their timer state and to query only marked entries. v3: add IPv6/MLDv2 support, fix other_query checks v2: directly do flag bit operations Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-09-07net: bridge: mcast: support for IGMPV3/MLDv2 MODE_IS_INCLUDE/EXCLUDE reportNikolay Aleksandrov
In order to process IGMPV3/MLDv2_MODE_IS_INCLUDE/EXCLUDE report types we need some new helpers which allow us to set/clear flags for all current entries and later delete marked entries after the report sources have been processed. v3: add IPv6/MLDv2 support v2: drop flag helpers and directly do flag bit operations Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-09-07net: bridge: mcast: support for IGMPv3/MLDv2 ALLOW_NEW_SOURCES reportNikolay Aleksandrov
This patch adds handling for the ALLOW_NEW_SOURCES IGMPv3/MLDv2 report types and limits them only when multicast_igmp_version == 3 or multicast_mld_version == 2 respectively. Now that IGMPv3/MLDv2 handling functions will be managing timers we need to delay their activation, thus a new argument is added which controls if the timer should be updated. We also disable host IGMPv3/MLDv2 handling as it's not yet implemented and could cause inconsistent group state, the host can only join a group as EXCLUDE {} or leave it. v4: rename update_timer to igmpv2_mldv1 and use the passed value from br_multicast_add_group's callers v3: Add IPv6/MLDv2 support Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-09-07net: bridge: mcast: delete expired port groups without srcsNikolay Aleksandrov
If an expired port group is in EXCLUDE mode, then we have to turn it into INCLUDE mode, remove all srcs with zero timer and finally remove the group itself if there are no more srcs with an active timer. For IGMPv2 use there would be no sources, so this will reduce to just removing the group as before. Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-09-07net: bridge: mdb: use mdb and port entries in notificationsNikolay Aleksandrov
We have to use mdb and port entries when sending mdb notifications in order to fill in all group attributes properly. Before this change we would've used a fake br_mdb_entry struct to fill in only partial information about the mdb. Now we can also reuse the mdb dump fill function and thus have only a single central place which fills the mdb attributes. v3: add IPv6 support Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>