summaryrefslogtreecommitdiff
path: root/include
AgeCommit message (Collapse)Author
2021-03-24net: flow_offload: add FLOW_ACTION_PPPOE_PUSHPablo Neira Ayuso
Add an action to represent the PPPoE hardware offload support that includes the session ID. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-24netfilter: flowtable: bridge vlan hardware offload and switchdevFelix Fietkau
The switch might have already added the VLAN tag through PVID hardware offload. Keep this extra VLAN in the flowtable but skip it on egress. Signed-off-by: Felix Fietkau <nbd@nbd.name> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-24netfilter: nft_flow_offload: use direct xmit if hardware offload is enabledPablo Neira Ayuso
If there is a forward path to reach an ethernet device and hardware offload is enabled, then use the direct xmit path. Moreover, store the real device in the direct xmit path info since software datapath uses dev_hard_header() to push the layer encapsulation headers while hardware offload refers to the real device. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-24netfilter: flowtable: add vlan supportPablo Neira Ayuso
Add the vlan id and protocol to the flow tuple to uniquely identify flows from the receive path. For the transmit path, dev_hard_header() on the vlan device push the headers. This patch includes support for two vlan headers (QinQ) from the ingress path. Add a generic encap field to the flowtable entry which stores the protocol and the tag id. This allows to reuse these fields in the PPPoE support coming in a later patch. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-24netfilter: flowtable: use dev_fill_forward_path() to obtain egress devicePablo Neira Ayuso
The egress device in the tuple is obtained from route. Use dev_fill_forward_path() instead to provide the real egress device for this flow whenever this is available. The new FLOW_OFFLOAD_XMIT_DIRECT type uses dev_queue_xmit() to transmit ethernet frames. Cache the source and destination hardware address to use dev_queue_xmit() to transfer packets. The FLOW_OFFLOAD_XMIT_DIRECT replaces FLOW_OFFLOAD_XMIT_NEIGH if dev_fill_forward_path() finds a direct transmit path. In case of topology updates, if peer is moved to different bridge port, the connection will time out, reconnect will result in a new entry with the correct path. Snooping fdb updates would allow for cleaning up stale flowtable entries. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-24netfilter: flowtable: use dev_fill_forward_path() to obtain ingress devicePablo Neira Ayuso
Obtain the ingress device in the tuple from the route in the reply direction. Use dev_fill_forward_path() instead to get the real ingress device for this flow. Fall back to use the ingress device that the IP forwarding route provides if: - dev_fill_forward_path() finds no real ingress device. - the ingress device that is obtained is not part of the flowtable devices. - this route has a xfrm policy. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-24netfilter: flowtable: add xmit path typesPablo Neira Ayuso
Add the xmit_type field that defines the two supported xmit paths in the flowtable data plane, which are the neighbour and the xfrm xmit paths. This patch prepares for new flowtable xmit path types to come. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-24net: dsa: resolve forwarding path for dsa slave portsFelix Fietkau
Add .ndo_fill_forward_path for dsa slave port devices Signed-off-by: Felix Fietkau <nbd@nbd.name> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-24net: ppp: resolve forwarding path for bridge pppoe devicesFelix Fietkau
Pass on the PPPoE session ID, destination hardware address and the real device. Signed-off-by: Felix Fietkau <nbd@nbd.name> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-24net: bridge: resolve forwarding path for VLAN tag actions in bridge devicesFelix Fietkau
Depending on the VLAN settings of the bridge and the port, the bridge can either add or remove a tag. When vlan filtering is enabled, the fdb lookup also needs to know the VLAN tag/proto for the destination address To provide this, keep track of the stack of VLAN tags for the path in the lookup context Signed-off-by: Felix Fietkau <nbd@nbd.name> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-24net: bridge: resolve forwarding path for bridge devicesPablo Neira Ayuso
Add .ndo_fill_forward_path for bridge devices. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-24net: 8021q: resolve forwarding path for vlan devicesPablo Neira Ayuso
Add .ndo_fill_forward_path for vlan devices. For instance, assuming the following topology: IP forwarding / \ eth0.100 eth0 | eth0 . . . ethX ab:cd:ef:ab:cd:ef For packets going through IP forwarding to eth0.100 whose destination MAC address is ab:cd:ef:ab:cd:ef, dev_fill_forward_path() provides the following path: eth0.100 -> eth0 Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-24net: resolve forwarding path from virtual netdevice and HW destination addressPablo Neira Ayuso
This patch adds dev_fill_forward_path() which resolves the path to reach the real netdevice from the IP forwarding side. This function takes as input the netdevice and the destination hardware address and it walks down the devices calling .ndo_fill_forward_path() for each device until the real device is found. For instance, assuming the following topology: IP forwarding / \ br0 eth0 / \ eth1 eth2 . . . ethX ab:cd:ef:ab:cd:ef where eth1 and eth2 are bridge ports and eth0 provides WAN connectivity. ethX is the interface in another box which is connected to the eth1 bridge port. For packets going through IP forwarding to br0 whose destination MAC address is ab:cd:ef:ab:cd:ef, dev_fill_forward_path() provides the following path: br0 -> eth1 .ndo_fill_forward_path for br0 looks up at the FDB for the bridge port from the destination MAC address to get the bridge port eth1. This information allows to create a fast path that bypasses the classic bridge and IP forwarding paths, so packets go directly from the bridge port eth1 to eth0 (wan interface) and vice versa. fast path .------------------------. / \ | IP forwarding | | / \ \/ | br0 eth0 . / \ -> eth1 eth2 . . . ethX ab:cd:ef:ab:cd:ef Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-23net: make unregister netdev warning timeout configurableDmitry Vyukov
netdev_wait_allrefs() issues a warning if refcount does not drop to 0 after 10 seconds. While 10 second wait generally should not happen under normal workload in normal environment, it seems to fire falsely very often during fuzzing and/or in qemu emulation (~10x slower). At least it's not possible to understand if it's really a false positive or not. Automated testing generally bumps all timeouts to very high values to avoid flake failures. Add net.core.netdev_unregister_timeout_secs sysctl to make the timeout configurable for automated testing systems. Lowering the timeout may also be useful for e.g. manual bisection. The default value matches the current behavior. Signed-off-by: Dmitry Vyukov <dvyukov@google.com> Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=211877 Cc: netdev@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-23net: ocelot: replay switchdev events when joining bridgeVladimir Oltean
The premise of this change is that the switchdev port attributes and objects offloaded by ocelot might have been missed when we are joining an already existing bridge port, such as a bonding interface. The patch pulls these switchdev attributes and objects from the bridge, on behalf of the 'bridge port' net device which might be either the ocelot switch interface, or the bonding upper interface. The ocelot_net.c belongs strictly to the switchdev ocelot driver, while ocelot.c is part of a library shared with the DSA felix driver. The ocelot_port_bridge_leave function (part of the common library) used to call ocelot_port_vlan_filtering(false), something which is not necessary for DSA, since the framework deals with that already there. So we move this function to ocelot_switchdev_unsync, which is specific to the switchdev driver. The code movement described above makes ocelot_port_bridge_leave no longer return an error code, so we change its type from int to void. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-23net: bridge: add helper to replay VLANs installed on portVladimir Oltean
Currently this simple setup with DSA: ip link add br0 type bridge vlan_filtering 1 ip link add bond0 type bond ip link set bond0 master br0 ip link set swp0 master bond0 will not work because the bridge has created the PVID in br_add_if -> nbp_vlan_init, and it has notified switchdev of the existence of VLAN 1, but that was too early, since swp0 was not yet a lower of bond0, so it had no reason to act upon that notification. We need a helper in the bridge to replay the switchdev VLAN objects that were notified since the bridge port creation, because some of them may have been missed. As opposed to the br_mdb_replay function, the vg->vlan_list write side protection is offered by the rtnl_mutex which is sleepable, so we don't need to queue up the objects in atomic context, we can replay them right away. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-23net: bridge: add helper to replay port and local fdb entriesVladimir Oltean
When a switchdev port starts offloading a LAG that is already in a bridge and has an FDB entry pointing to it: ip link set bond0 master br0 bridge fdb add dev bond0 00:01:02:03:04:05 master static ip link set swp0 master bond0 the switchdev driver will have no idea that this FDB entry is there, because it missed the switchdev event emitted at its creation. Ido Schimmel pointed this out during a discussion about challenges with switchdev offloading of stacked interfaces between the physical port and the bridge, and recommended to just catch that condition and deny the CHANGEUPPER event: https://lore.kernel.org/netdev/20210210105949.GB287766@shredder.lan/ But in fact, we might need to deal with the hard thing anyway, which is to replay all FDB addresses relevant to this port, because it isn't just static FDB entries, but also local addresses (ones that are not forwarded but terminated by the bridge). There, we can't just say 'oh yeah, there was an upper already so I'm not joining that'. So, similar to the logic for replaying MDB entries, add a function that must be called by individual switchdev drivers and replays local FDB entries as well as ones pointing towards a bridge port. This time, we use the atomic switchdev notifier block, since that's what FDB entries expect for some reason. Reported-by: Ido Schimmel <idosch@idosch.org> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-23net: bridge: add helper to replay port and host-joined mdb entriesVladimir Oltean
I have a system with DSA ports, and udhcpcd is configured to bring interfaces up as soon as they are created. I create a bridge as follows: ip link add br0 type bridge As soon as I create the bridge and udhcpcd brings it up, I also have avahi which automatically starts sending IPv6 packets to advertise some local services, and because of that, the br0 bridge joins the following IPv6 groups due to the code path detailed below: 33:33:ff:6d:c1:9c vid 0 33:33:00:00:00:6a vid 0 33:33:00:00:00:fb vid 0 br_dev_xmit -> br_multicast_rcv -> br_ip6_multicast_add_group -> __br_multicast_add_group -> br_multicast_host_join -> br_mdb_notify This is all fine, but inside br_mdb_notify we have br_mdb_switchdev_host hooked up, and switchdev will attempt to offload the host joined groups to an empty list of ports. Of course nobody offloads them. Then when we add a port to br0: ip link set swp0 master br0 the bridge doesn't replay the host-joined MDB entries from br_add_if, and eventually the host joined addresses expire, and a switchdev notification for deleting it is emitted, but surprise, the original addition was already completely missed. The strategy to address this problem is to replay the MDB entries (both the port ones and the host joined ones) when the new port joins the bridge, similar to what vxlan_fdb_replay does (in that case, its FDB can be populated and only then attached to a bridge that you offload). However there are 2 possibilities: the addresses can be 'pushed' by the bridge into the port, or the port can 'pull' them from the bridge. Considering that in the general case, the new port can be really late to the party, and there may have been many other switchdev ports that already received the initial notification, we would like to avoid delivering duplicate events to them, since they might misbehave. And currently, the bridge calls the entire switchdev notifier chain, whereas for replaying it should just call the notifier block of the new guy. But the bridge doesn't know what is the new guy's notifier block, it just knows where the switchdev notifier chain is. So for simplification, we make this a driver-initiated pull for now, and the notifier block is passed as an argument. To emulate the calling context for mdb objects (deferred and put on the blocking notifier chain), we must iterate under RCU protection through the bridge's mdb entries, queue them, and only call them once we're out of the RCU read-side critical section. There was some opportunity for reuse between br_mdb_switchdev_host_port, br_mdb_notify and the newly added br_mdb_queue_one in how the switchdev mdb object is created, so a helper was created. Suggested-by: Ido Schimmel <idosch@idosch.org> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-23net: bridge: add helper to retrieve the current ageing timeVladimir Oltean
The SWITCHDEV_ATTR_ID_BRIDGE_AGEING_TIME attribute is only emitted from: sysfs/ioctl/netlink -> br_set_ageing_time -> __set_ageing_time therefore not at bridge port creation time, so: (a) switchdev drivers have to hardcode the initial value for the address ageing time, because they didn't get any notification (b) that hardcoded value can be out of sync, if the user changes the ageing time before enslaving the port to the bridge We need a helper in the bridge, such that switchdev drivers can query the current value of the bridge ageing time when they start offloading it. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Tobias Waldekranz <tobias@waldekranz.com> Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-23net: bridge: add helper for retrieving the current bridge port STP stateVladimir Oltean
It may happen that we have the following topology with DSA or any other switchdev driver with LAG offload: ip link add br0 type bridge stp_state 1 ip link add bond0 type bond ip link set bond0 master br0 ip link set swp0 master bond0 ip link set swp1 master bond0 STP decides that it should put bond0 into the BLOCKING state, and that's that. The ports that are actively listening for the switchdev port attributes emitted for the bond0 bridge port (because they are offloading it) and have the honor of seeing that switchdev port attribute can react to it, so we can program swp0 and swp1 into the BLOCKING state. But if then we do: ip link set swp2 master bond0 then as far as the bridge is concerned, nothing has changed: it still has one bridge port. But this new bridge port will not see any STP state change notification and will remain FORWARDING, which is how the standalone code leaves it in. We need a function in the bridge driver which retrieves the current STP state, such that drivers can synchronize to it when they may have missed switchdev events. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Tobias Waldekranz <tobias@waldekranz.com> Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-23net: lapb: Make "lapb_t1timer_running" able to detect an already running timerXie He
Problem: The "lapb_t1timer_running" function in "lapb_timer.c" is used in only one place: in the "lapb_kick" function in "lapb_out.c". "lapb_kick" calls "lapb_t1timer_running" to check if the timer is already pending, and if it is not, schedule it to run. However, if the timer has already fired and is running, and is waiting to get the "lapb->lock" lock, "lapb_t1timer_running" will not detect this, and "lapb_kick" will then schedule a new timer. The old timer will then abort when it sees a new timer pending. I think this is not right. The purpose of "lapb_kick" should be ensuring that the actual work of the timer function is scheduled to be done. If the timer function is already running but waiting for the lock, "lapb_kick" should not abort and reschedule it. Changes made: I added a new field "t1timer_running" in "struct lapb_cb" for "lapb_t1timer_running" to use. "t1timer_running" will accurately reflect whether the actual work of the timer is pending. If the timer has fired but is still waiting for the lock, "t1timer_running" will still correctly reflect whether the actual work is waiting to be done. The old "t1timer_stop" field, whose only responsibility is to ask a timer (that is already running but waiting for the lock) to abort, is no longer needed, because the new "t1timer_running" field can fully take over its responsibility. Therefore "t1timer_stop" is deleted. "t1timer_running" is not simply a negation of the old "t1timer_stop". At the end of the timer function, if it does not reschedule itself, "t1timer_running" is set to false to indicate that the timer is stopped. For consistency of the code, I also added "t2timer_running" and deleted "t2timer_stop". Signed-off-by: Xie He <xie.he.0141@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-22net: dsa: hellcreek: Report switch name and IDKurt Kanzenbach
Report the driver name, ASIC ID and the switch name via devlink. This is a useful information for user space tooling. Signed-off-by: Kurt Kanzenbach <kurt@kmk-computers.de> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-22Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following batch contains Netfilter updates for net-next: 1) Split flowtable workqueues per events, from Oz Shlomo. 2) fall-through warnings for clang, from Gustavo A. R. Silva 3) Remove unused declaration in conntrack, from YueHaibing. 4) Consolidate skb_try_make_writable() in flowtable datapath, simplify some of the existing codebase. 5) Call dst_check() to fall back to static classic forwarding path. 6) Update table flags from commit phase. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-22net: set initial device refcount to 1Eric Dumazet
When adding CONFIG_PCPU_DEV_REFCNT, I forgot that the initial net device refcount was 0. When CONFIG_PCPU_DEV_REFCNT is not set, this means the first dev_hold() triggers an illegal refcount operation (addition on 0) refcount_t: addition on 0; use-after-free. WARNING: CPU: 0 PID: 1 at lib/refcount.c:25 refcount_warn_saturate+0x128/0x1a4 Fix is to change initial (and final) refcount to be 1. Also add a missing kerneldoc piece, as reported by Stephen Rothwell. Fixes: 919067cc845f ("net: add CONFIG_PCPU_DEV_REFCNT") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Guenter Roeck <groeck@google.com> Tested-by: Guenter Roeck <groeck@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-22Merge branch '100GbE' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue Tony Nguyen says: ==================== 100GbE Intel Wired LAN Driver Updates 2021-03-22 This series contains updates to ice and iavf drivers. Haiyue Wang says: The Intel E810 Series supports a programmable pipeline for a domain specific protocols classification, for example GTP by Dynamic Device Personalization (DDP) profile. The E810 PF has introduced flex-bytes support by ethtool user-def option allowing for packet deeper matching based on an offset and value for DDP usage. For making VF also benefit from this flexible protocol classification, some new virtchnl messages are defined and handled by PF, so VF can query this new flow director capability, and use ethtool with extending the user-def option to configure Rx flow classification. The new user-def 0xAAAABBBBCCCCDDDD: BBBB is the 2 byte pattern while AAAA corresponds to its offset in the packet. Similarly DDDD is the 2 byte pattern with CCCC being the corresponding offset. The offset ranges from 0x0 to 0x1F7 (up to 504 bytes into the packet). The offset starts from the beginning of the packet. This feature can be used to allow customers to set flow director rules for protocols headers that are beyond standard ones supported by ethtool (e.g. PFCP or GTP-U). Like for matching GTP-U's TEID value 0x10203040: ethtool -N ens787f0v0 flow-type udp4 dst-port 2152 \ user-def 0x002e102000303040 action 13 ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-22net: move the ptype_all and ptype_base declarations to include/linux/netdevice.hVladimir Oltean
ptype_all and ptype_base are declared in net/core/dev.c as non-static, because they are used by net-procfs.c too. However, a "make W=1" build complains that there was no previous declaration of ptype_all and ptype_base in a header file, so this way of declaring things constitutes a violation of coding style. Let's move the extern declarations of ptype_all and ptype_base to the linux/netdevice.h file, which is included by net-procfs.c too. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-22linux/qed: Mundane spelling fixes throughout the fileBhaskar Chowdhury
s/unrequired/"not required"/ s/consme/consume/ .....two different places s/accros/across/ Signed-off-by: Bhaskar Chowdhury <unixbhaskar@gmail.com> Acked-by: Igor Russkikh <irusskikh@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-22netdev: add netdev_queue_set_dql_min_limit()Vincent Mailhol
Add a function to set the dynamic queue limit minimum value. Some specific drivers might have legitimate reasons to configure dql.min_limit to a given value. Typically, this is the case when the PDU of the protocol is smaller than the packet size to used to carry those frames to the device. Concrete example: a CAN (Control Area Network) device with an USB 2.0 interface. The PDU of classical CAN protocol are roughly 16 bytes but the USB packet size (which is used to carry the CAN frames to the device) might be up to 512 bytes. Wen small traffic burst occurs, BQL algorithm is not able to immediately adjust and this would result in having to send many small USB packets (i.e packet of 16 bytes for each CAN frame). Filling up the USB packet with CAN frames is relatively fast (small latency issue) but the gain of not having to send several small USB packets is huge (big throughput increase). In this case, forcing dql.min_limit to a given value that would allow to stuff the USB packet is always a win. This function is to be used by network drivers which are able to prove through a rationale and through empirical tests on several environment (with other applications, heavy context switching, virtualization...), that they constantly reach better performances with a specific predefined dql.min_limit value with no noticeable latency impact. Signed-off-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-22ice: Enable FDIR Configure for AVFQi Zhang
The virtual channel is going to be extended to support FDIR and RSS configure from AVF. New data structures and OP codes will be added, the patch enable the FDIR part. To support above advanced AVF feature, we need to figure out what kind of data structure should be passed from VF to PF to describe an FDIR rule or RSS config rule. The common part of the requirement is we need a data structure to represent the input set selection of a rule's hash key. An input set selection is a group of fields be selected from one or more network protocol layers that could be identified as a specific flow. For example, select dst IP address from an IPv4 header combined with dst port from the TCP header as the input set for an IPv4/TCP flow. The patch adds a new data structure virtchnl_proto_hdrs to abstract a network protocol headers group which is composed of layers of network protocol header(virtchnl_proto_hdr). A protocol header contains a 32 bits mask (field_selector) to describe which fields are selected as input sets, as well as a header type (enum virtchnl_proto_hdr_type). Each bit is mapped to a field in enum virtchnl_proto_hdr_field guided by its header type. +------------+-----------+------------------------------+ | | Proto Hdr | Header Type A | | | +------------------------------+ | | | BIT 31 | ... | BIT 1 | BIT 0 | | |-----------+------------------------------+ |Proto Hdrs | Proto Hdr | Header Type B | | | +------------------------------+ | | | BIT 31 | ... | BIT 1 | BIT 0 | | |-----------+------------------------------+ | | Proto Hdr | Header Type C | | | +------------------------------+ | | | BIT 31 | ... | BIT 1 | BIT 0 | | |-----------+------------------------------+ | | .... | +-------------------------------------------------------+ All fields in enum virtchnl_proto_hdr_fields are grouped with header type and the value of the first field of a header type is always 32 aligned. enum proto_hdr_type { header_type_A = 0; header_type_B = 1; .... } enum proto_hdr_field { /* header type A */ header_A_field_0 = 0, header_A_field_1 = 1, header_A_field_2 = 2, header_A_field_3 = 3, /* header type B */ header_B_field_0 = 32, // = header_type_B << 5 header_B_field_0 = 33, header_B_field_0 = 34 header_B_field_0 = 35, .... }; So we have: proto_hdr_type = proto_hdr_field / 32 bit offset = proto_hdr_field % 32 To simply the protocol header's operations, couple help macros are added. For example, to select src IP and dst port as input set for an IPv4/UDP flow. we have: struct virtchnl_proto_hdr hdr[2]; VIRTCHNL_SET_PROTO_HDR_TYPE(&hdr[0], IPV4) VIRTCHNL_ADD_PROTO_HDR_FIELD(&hdr[0], IPV4, SRC) VIRTCHNL_SET_PROTO_HDR_TYPE(&hdr[1], UDP) VIRTCHNL_ADD_PROTO_HDR_FIELD(&hdr[1], UDP, DST) The byte array is used to store the protocol header of a training package. The byte array must be network order. The patch added virtual channel support for iAVF FDIR add/validate/delete filter. iAVF FDIR is Flow Director for Intel Adaptive Virtual Function which can direct Ethernet packets to the queues of the Network Interface Card. Add/delete command is adding or deleting one rule for each virtual channel message, while validate command is just verifying if this rule is valid without any other operations. To add or delete one rule, driver needs to config TCAM and Profile, build training packets which contains the input set value, and send the training packets through FDIR Tx queue. In addition, driver needs to manage the software context to avoid adding duplicated rules, deleting non-existent rule, input set conflicts and other invalid cases. NOTE: Supported pattern/actions and their parse functions are not be included in this patch, they will be added in a separate one. Signed-off-by: Jeff Guo <jia.guo@intel.com> Signed-off-by: Yahui Cao <yahui.cao@intel.com> Signed-off-by: Simei Su <simei.su@intel.com> Signed-off-by: Beilei Xing <beilei.xing@intel.com> Signed-off-by: Qi Zhang <qi.z.zhang@intel.com> Tested-by: Chen Bo <BoX.C.Chen@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2021-03-19net: add CONFIG_PCPU_DEV_REFCNTEric Dumazet
I was working on a syzbot issue, claiming one device could not be dismantled because its refcount was -1 unregister_netdevice: waiting for sit0 to become free. Usage count = -1 It would be nice if syzbot could trigger a warning at the time this reference count became negative. This patch adds CONFIG_PCPU_DEV_REFCNT options which defaults to per cpu variables (as before this patch) on SMP builds. v2: free_dev label in alloc_netdev_mqs() is moved to avoid a compiler warning (-Wunused-label), as reported by kernel test robot <lkp@intel.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-18gro: add combined call_gro_receive() + INDIRECT_CALL_INET() helperAlexander Lobakin
call_gro_receive() is used to limit GRO recursion, but it works only with callback pointers. There's a combined version of call_gro_receive() + INDIRECT_CALL_2() in <net/inet_common.h>, but it doesn't check for IPv6 modularity. Add a similar new helper to cover both of these. It can and will be used to avoid retpoline overhead when IP header lies behind another offloaded proto. Signed-off-by: Alexander Lobakin <alobakin@pm.me> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-18gro: make net/gro.h self-containedAlexander Lobakin
If some source file includes <net/gro.h>, but doesn't include <linux/indirect_call_wrapper.h>: In file included from net/8021q/vlan_core.c:7: ./include/net/gro.h:6:1: warning: data definition has no type or storage class 6 | INDIRECT_CALLABLE_DECLARE(struct sk_buff *ipv6_gro_receive(struct list_head *, | ^~~~~~~~~~~~~~~~~~~~~~~~~ ./include/net/gro.h:6:1: error: type defaults to ‘int’ in declaration of ‘INDIRECT_CALLABLE_DECLARE’ [-Werror=implicit-int] [...] Include <linux/indirect_call_wrapper.h> directly. It's small and won't pull lots of dependencies. Also add some incomplete struct declarations to be fully stacked. Fixes: 04f00ab2275f ("net/core: move gro function declarations to separate header ") Signed-off-by: Alexander Lobakin <alobakin@pm.me> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-18net: ocelot: support multiple bridgesVladimir Oltean
The ocelot switches are a bit odd in that they do not have an STP state to put the ports into. Instead, the forwarding configuration is delayed from the typical port_bridge_join into stp_state_set, when the port enters the BR_STATE_FORWARDING state. I can only guess that the implementation of this quirk is the reason that led to the simplification of the driver such that only one bridge could be offloaded at a time. We can simplify the data structures somewhat, and introduce a per-port bridge device pointer and STP state, similar to how the LAG offload works now (there we have a per-port bonding device pointer and TX enabled state). This allows offloading multiple bridges with relative ease, while still keeping in place the quirk to delay the programming of the PGIDs. We actually need this change now because we need to remove the bogus restriction from ocelot_bridge_stp_state_set that ocelot->bridge_mask needs to contain BIT(port), otherwise that function is a no-op. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-18stmmac: intel: Add PSE and PCH PTP clock source selectionWong, Vee Khee
Intel mGbE variant implemented in EHL and TGL can be set to select different clock frequency based on GPO bits in MAC_GPIO_STATUS register. We introduce a new "void (*ptp_clk_freq_config)(void *priv)" in platform data so that if a platform is required to configure the frequency of clock source, in this case Intel mGBE does, the platform-specific configuration of the PTP clock setting is done when stmmac_ptp_register() is called. Signed-off-by: Wong, Vee Khee <vee.khee.wong@intel.com> Signed-off-by: Voon Weifeng <weifeng.voon@intel.com> Co-developed-by: Ong Boon Leong <boon.leong.ong@intel.com> Signed-off-by: Ong Boon Leong <boon.leong.ong@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-18net: dsa: Add helper to resolve bridge port from DSA portTobias Waldekranz
In order for a driver to be able to query a bridge for information about itself, e.g. reading out port flags, it has to use a netdev that is known to the bridge. In the simple case, that is just the netdev representing the port, e.g. swp0 or swp1 in this example: br0 / \ swp0 swp1 But in the case of an offloaded lag, this will be the bond or team interface, e.g. bond0 in this example: br0 / bond0 / \ swp0 swp1 Add a helper that hides some of this complexity from the drivers. Then, redefine dsa_port_offloads_bridge_port using the helper to avoid double accounting of the set of possible offloaded uppers. Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com> Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-18net: move the xps maps to an arrayAntoine Tenart
Move the xps maps (xps_cpus_map and xps_rxqs_map) to an array in net_device. That will simplify a lot the code removing the need for lots of if/else conditionals as the correct map will be available using its offset in the array. This should not modify the xps maps behaviour in any way. Suggested-by: Alexander Duyck <alexander.duyck@gmail.com> Signed-off-by: Antoine Tenart <atenart@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-18net: embed nr_ids in the xps mapsAntoine Tenart
Embed nr_ids (the number of cpu for the xps cpus map, and the number of rxqs for the xps cpus map) in dev_maps. That will help not accessing out of bound memory if those values change after dev_maps was allocated. Suggested-by: Alexander Duyck <alexander.duyck@gmail.com> Signed-off-by: Antoine Tenart <atenart@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-18net: embed num_tc in the xps mapsAntoine Tenart
The xps cpus/rxqs map is accessed using dev->num_tc, which is used when allocating the map. But later updates of dev->num_tc can lead to having a mismatch between the maps and how they're accessed. In such cases the map values do not make any sense and out of bound accesses can occur (that can be easily seen using KASAN). This patch aims at fixing this by embedding num_tc into the maps, using the value at the time the map is created. This brings two improvements: - The maps can be accessed using the embedded num_tc, so we know for sure we won't have out of bound accesses. - Checks can be made before accessing the maps so we know the values retrieved will make sense. We also update __netif_set_xps_queue to conditionally copy old maps from dev_maps in the new one only if the number of traffic classes from both maps match. Signed-off-by: Antoine Tenart <atenart@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-18netfilter: nftables: update table flags from the commit phasePablo Neira Ayuso
Do not update table flags from the preparation phase. Store the flags update into the transaction, then update the flags from the commit phase. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-03-18netfilter: flowtable: fast NAT functions never failPablo Neira Ayuso
Simplify existing fast NAT routines by returning void. After the skb_try_make_writable() call consolidation, these routines cannot ever fail. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-03-18netfilter: flowtable: move FLOW_OFFLOAD_DIR_MAX away from enumerationPablo Neira Ayuso
This allows to remove the default case which should not ever happen and that was added to avoid gcc warnings on unhandled FLOW_OFFLOAD_DIR_MAX enumeration case. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-03-18netfilter: conntrack: Remove unused variable declarationYueHaibing
commit e97c3e278e95 ("tproxy: split off ipv6 defragmentation to a separate module") left behind this. Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-03-17net: dsa: tag_brcm: add support for legacy tagsÁlvaro Fernández Rojas
Add support for legacy Broadcom tags, which are similar to DSA_TAG_PROTO_BRCM. These tags are used on BCM5325, BCM5365 and BCM63xx switches. Signed-off-by: Álvaro Fernández Rojas <noltari@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-17ethtool: Add common function for filling out stringsAlexander Duyck
Add a function to handle the common pattern of printing a string into the ethtool strings interface and incrementing the string pointer by the ETH_GSTRING_LEN. Most of the drivers end up doing this and several have implemented their own versions of this function so it would make sense to consolidate on one implementation. Signed-off-by: Alexander Duyck <alexanderduyck@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-17Merge tag 'mlx5-updates-2021-03-16' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux Saeed Mahameed says: ==================== mlx5-updates-2021-03-16 mlx5 uplink representor netdev persistence. Before this patchset we used to have separate netdevs for Native NIC mode and Switchdev mode (uplink representor netdev), meaning that if user switches modes between Native to Switchdev and vice versa, the driver would cleanup the current netdev representor and create a new one for the new mode, such behavior created an administrative nightmare for users, where users need to be aware of such loss of both data path and control path configurations, e.g. netdev attributes and arp/route tables, where the later is more painful. A simple solution for this is not to replace the netdev in first place and use a single netdev to serve the uplink/physical port whether it is in switchdev mode or native mode. We already have different HW profiles for each netdev mode, in this series we just replace the HW profile on the fly and we keep the same netdev attached. Refactoring: Some refactoring has been made to overcome some technical difficulties 1) The netdev is created with the maximum amount of tx/rx queues to serve the two profiles. 2) Some ndos are not supported in some modes, so we added a mode check for such cases, e.g legacy sriov ndos must be blocked in switchdev mode. 3) Some mlx5 netdev private attributes need to be moved out of profiles and kept in a persistent place, where the netdev is created e.g devlink port and other global HW resources 4) The netdev devlink port is now always registered with the switch id Implementation: the last three patches implement the mechanism now as the netdev can be shared. 5) Don't recreate the netdev on switchdev mode changes 6) Prevent changing switchdev mode when some netdev operations are active, mostly when TC rules are being processed. This is required since the netdev is kept registered while switchdev mode can be changed. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-16net/mlx5e: Do not reload ethernet ports when changing eswitch modeRoi Dayan
When switching modes between legacy and switchdev and back, do not reload ethernet interfaces. just change the profile from nic profile to uplink rep profile in switchdev mode. Signed-off-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-03-16net/mlx5: Move devlink port from mlx5e priv to mlx5e resourcesRoi Dayan
We re-use the native NIC port net device instance for the Uplink representor, and the devlink port. When changing profiles we reset the mlx5e priv but we should still use the devlink port so move it to mlx5e resources. Signed-off-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-03-16net/mlx5: Move mlx5e hw resources into a sub objectRoi Dayan
This is to separate between resources attributes and other attributes we will want to use. Signed-off-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-03-16net: Change dev parameter to const in netif_device_present()Roi Dayan
Not all ndos check the present bit before calling the ndo and the driver may want to check it. Sometimes the dev parameter passed as const so we pass it to netif_device_present() as const. Since netif_device_present() doesn't modify dev parameter anyway, declare it as const. Signed-off-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-03-16Revert "net: socket: use BIT() for MSG_*"David S. Miller
This reverts commit 0bb3262c0248d44aea3be31076f44beb82a7b120. Breaks things on mips64/qemu Reported-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: David S. Miller <davem@davemloft.net>