summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2017-10-18Merge branch 'bpf-cpumap-type-for-XDP_REDIRECT'David S. Miller
Jesper Dangaard Brouer says: ==================== net: New bpf cpumap type for XDP_REDIRECT Introducing a new way to redirect XDP frames. Notice how no driver changes are necessary given the design of XDP_REDIRECT. This redirect map type is called 'cpumap', as it allows redirection XDP frames to remote CPUs. The remote CPU will do the SKB allocation and start the network stack invocation on that CPU. This is a scalability and isolation mechanism, that allow separating the early driver network XDP layer, from the rest of the netstack, and assigning dedicated CPUs for this stage. The sysadm control/configure the RX-CPU to NIC-RX queue (as usual) via procfs smp_affinity and how many queues are configured via ethtool --set-channels. Benchmarks show that a single CPU can handle approx 11Mpps. Thus, only assigning two NIC RX-queues (and two CPUs) is sufficient for handling 10Gbit/s wirespeed smallest packet 14.88Mpps. Reducing the number of queues have the advantage that more packets being "bulk" available per hard interrupt[1]. [1] https://www.netdevconf.org/2.1/papers/BusyPollingNextGen.pdf Use-cases: 1. End-host based pre-filtering for DDoS mitigation. This is fast enough to allow software to see and filter all packets wirespeed. Thus, no packets getting silently dropped by hardware. 2. Given NIC HW unevenly distributes packets across RX queue, this mechanism can be used for redistribution load across CPUs. This usually happens when HW is unaware of a new protocol. This resembles RPS (Receive Packet Steering), just faster, but with more responsibility placed on the BPF program for correct steering. 3. Auto-scaling or power saving via only activating the appropriate number of remote CPUs for handling the current load. The cpumap tracepoints can function as a feedback loop for this purpose. In V7, a --stress-mode was implemented for the samples program, which between each stats update, adds + removes CPUs from the map concurrently with traffic. I did find and fix some concurrency issues in the tear-down path, details in patch desc. The stress test have now been running for 15 hours without any issues, while being bombarded with 11.6 Mpps via pktgen_sample04_many_flows.sh. See individual patches for patchset-version changes. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-18samples/bpf: add cpumap sample program xdp_redirect_cpuJesper Dangaard Brouer
This sample program show how to use cpumap and the associated tracepoints. It provides command line stats, which shows how the XDP-RX process, cpumap-enqueue and cpumap kthread dequeue is cooperating on a per CPU basis. It also utilize the xdp_exception and xdp_redirect_err transpoints to allow users quickly to identify setup issues. One issue with ixgbe driver is that the driver reset the link when loading XDP. This reset the procfs smp_affinity settings. Thus, after loading the program, these must be reconfigured. The easiest workaround it to reduce the RX-queue to e.g. two via: # ethtool --set-channels ixgbe1 combined 2 And then add CPUs above 0 and 1, like: # xdp_redirect_cpu --dev ixgbe1 --prog 2 --cpu 2 --cpu 3 --cpu 4 Another issue with ixgbe is that the page recycle mechanism is tied to the RX-ring size. And the default setting of 512 elements is too small. This is the same issue with regular devmap XDP_REDIRECT. To overcome this I've been using 1024 rx-ring size: # ethtool -G ixgbe1 rx 1024 tx 1024 V3: - whitespace cleanups - bpf tracepoint cannot access top part of struct V4: - report on kthread sched events, according to tracepoint change - report average bulk enqueue size V5: - bpf_map_lookup_elem on cpumap not allowed from bpf_prog use separate map to mark CPUs not available V6: - correct kthread sched summary output V7: - Added a --stress-mode for concurrently changing underlying cpumap Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-18bpf: cpumap add tracepointsJesper Dangaard Brouer
This adds two tracepoint to the cpumap. One for the enqueue side trace_xdp_cpumap_enqueue() and one for the kthread dequeue side trace_xdp_cpumap_kthread(). To mitigate the tracepoint overhead, these are invoked during the enqueue/dequeue bulking phases, thus amortizing the cost. The obvious use-cases are for debugging and monitoring. The non-intuitive use-case is using these as a feedback loop to know the system load. One can imagine auto-scaling by reducing, adding or activating more worker CPUs on demand. V4: tracepoint remove time_limit info, instead add sched info V8: intro struct bpf_cpu_map_entry members cpu+map_id in this patch Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-18bpf: cpumap xdp_buff to skb conversion and allocationJesper Dangaard Brouer
This patch makes cpumap functional, by adding SKB allocation and invoking the network stack on the dequeuing CPU. For constructing the SKB on the remote CPU, the xdp_buff in converted into a struct xdp_pkt, and it mapped into the top headroom of the packet, to avoid allocating separate mem. For now, struct xdp_pkt is just a cpumap internal data structure, with info carried between enqueue to dequeue. If a driver doesn't have enough headroom it is simply dropped, with return code -EOVERFLOW. This will be picked up the xdp tracepoint infrastructure, to allow users to catch this. V2: take into account xdp->data_meta V4: - Drop busypoll tricks, keeping it more simple. - Skip RPS and Generic-XDP-recursive-reinjection, suggested by Alexei V5: correct RCU read protection around __netif_receive_skb_core. V6: Setting TASK_RUNNING vs TASK_INTERRUPTIBLE based on talk with Rik van Riel Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-18bpf: XDP_REDIRECT enable use of cpumapJesper Dangaard Brouer
This patch connects cpumap to the xdp_do_redirect_map infrastructure. Still no SKB allocation are done yet. The XDP frames are transferred to the other CPU, but they are simply refcnt decremented on the remote CPU. This served as a good benchmark for measuring the overhead of remote refcnt decrement. If driver page recycle cache is not efficient then this, exposes a bottleneck in the page allocator. A shout-out to MST's ptr_ring, which is the secret behind is being so efficient to transfer memory pointers between CPUs, without constantly bouncing cache-lines between CPUs. V3: Handle !CONFIG_BPF_SYSCALL pointed out by kbuild test robot. V4: Make Generic-XDP aware of cpumap type, but don't allow redirect yet, as implementation require a separate upstream discussion. V5: - Fix a maybe-uninitialized pointed out by kbuild test robot. - Restrict bpf-prog side access to cpumap, open when use-cases appear - Implement cpu_map_enqueue() as a more simple void pointer enqueue V6: - Allow cpumap type for usage in helper bpf_redirect_map, general bpf-prog side restriction moved to earlier patch. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-18bpf: introduce new bpf cpu map type BPF_MAP_TYPE_CPUMAPJesper Dangaard Brouer
The 'cpumap' is primarily used as a backend map for XDP BPF helper call bpf_redirect_map() and XDP_REDIRECT action, like 'devmap'. This patch implement the main part of the map. It is not connected to the XDP redirect system yet, and no SKB allocation are done yet. The main concern in this patch is to ensure the datapath can run without any locking. This adds complexity to the setup and tear-down procedure, which assumptions are extra carefully documented in the code comments. V2: - make sure array isn't larger than NR_CPUS - make sure CPUs added is a valid possible CPU V3: fix nitpicks from Jakub Kicinski <kubakici@wp.pl> V5: - Restrict map allocation to root / CAP_SYS_ADMIN - WARN_ON_ONCE if queue is not empty on tear-down - Return -EPERM on memlock limit instead of -ENOMEM - Error code in __cpu_map_entry_alloc() also handle ptr_ring_cleanup() - Moved cpu_map_enqueue() to next patch V6: all notice by Daniel Borkmann - Fix err return code in cpu_map_alloc() introduced in V5 - Move cpu_possible() check after max_entries boundary check - Forbid usage initially in check_map_func_compatibility() V7: - Fix alloc error path spotted by Daniel Borkmann - Did stress test adding+removing CPUs from the map concurrently - Fixed refcnt issue on cpu_map_entry, kthread started too soon - Make sure packets are flushed during tear-down, involved use of rcu_barrier() and kthread_run only exit after queue is empty - Fix alloc error path in __cpu_map_entry_alloc() for ptr_ring V8: - Nitpicking comments and gramma by Edward Cree - Fix missing semi-colon introduced in V7 due to rebasing - Move struct bpf_cpu_map_entry members cpu+map_id to tracepoint patch Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-18net: ftgmac100: Request clock and set speedJoel Stanley
According to the ASPEED datasheet, gigabit speeds require a clock of 100MHz or higher. Other speeds require 25MHz or higher. This patch configures a 100MHz clock if the system has a direct-attached PHY, or 25MHz if the system is running NC-SI which is limited to 100MHz. There appear to be no other upstream users of the FTGMAC100 driver it is hard to know the clocking requirements of other platforms. Therefore a conservative approach was taken with enabling clocks. If the platform is not ASPEED, both requesting the clock and configuring the speed is skipped. Signed-off-by: Joel Stanley <joel@jms.id.au> Tested-by: Andrew Jeffery <andrew@aj.id.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-17net: export netdev_txq_to_tc to allow sch_mqprio to compile as moduleHenrik Austad
In commit 32302902ff09 ("mqprio: Reserve last 32 classid values for HW traffic classes and misc IDs") sch_mqprio started using netdev_txq_to_tc to find the correct tc instead of dev->tc_to_txq[] However, when mqprio is compiled as a module, it cannot resolve the symbol, leading to this error: ERROR: "netdev_txq_to_tc" [net/sched/sch_mqprio.ko] undefined! This adds an EXPORT_SYMBOL() since the other user in the kernel (netif_set_xps_queue) is also EXPORT_SYMBOL() (and not _GPL) or in a sysfs-callback. Cc: Alexander Duyck <alexander.h.duyck@intel.com> Cc: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Henrik Austad <haustad@cisco.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-16Merge branch 'mlxsw-GRE-Offload-decap-without-encap'David S. Miller
Jiri Pirko says: ==================== mlxsw: GRE: Offload decap without encap Petr says: The current code doesn't offload GRE decapsulation unless there's a corresponding encapsulation route as well. While not strictly incorrect (when encap route is absent, the decap route traps traffic to CPU and the kernel handles it), it's a missed optimization opportunity. With this patchset, IPIP entries are created as soon as offloadable tunneling netdevice is created. This then leads to offloading of decap route, if one exists, or is added afterwards, even when no encap route is present. In Linux, when there is a decap route, matching IP-in-IP packets are always decapsulated. However, with IPv4 overlays in particular, whether the inner packet is then forwarded depends on setting of net.ipv4.conf.*.rp_filter. When RP filtering is turned on, inner packets aren't forwarded unless there's a matching encap route. The mlxsw driver doesn't reflect this behavior in other router interfaces, and thus it's not implemented for tunnel types either. A better support for this will be subject of follow-up work. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-16mlxsw: spectrum: Drop refcounting of IPIP entriesPetr Machata
Formerly, IPIP entries were created lazily by next hops that referenced an offloadable IP-in-IP netdevice. However now that they are created eagerly as a reaction to events on such netdevices, the reference counting is useless. Hence drop it. The routes whose next hops reference an offloaded IP-in-IP netdevice actually linger around a bit after their device is unregistered. However, mlxsw_sp_ipip_entry_destroy() also destroys the backing loopback, and mlxsw_sp_rif_destroy() transitively (via mlxsw_sp_nexthop_rif_gone_sync()) calls mlxsw_sp_nexthop_ipip_fini(), which unlinks the IPIP entry from a next hop. Thus no dangling pointers are left behind for the brief window after netdevice is gone, but routes not yet. Signed-off-by: Petr Machata <petrm@mellanox.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-16mlxsw: spectrum: Support IPIP overlay VRF migrationPetr Machata
IPIP entries are created as soon as an offloadable device is created. That means that when such a device is later moved to a different VRF, the loopback device that backs the tunnel is wrong. Thus when an offloadable encapsulating netdevice moves from one VRF to another, make sure that the loopback is updated as necessary. Signed-off-by: Petr Machata <petrm@mellanox.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-16mlxsw: spectrum: Support decap-only IP-in-IP tunnelsPetr Machata
Current code for offloading IP-in-IP tunneling assumes that there is no decap without encap. But that's never true for IPv6 overlays, and is not true for IPv4 ones either, if net.ipv4.conf.*.rp_filter is unset. To support decap-only tunnels, an IPIP entry is now created as soon as an offloadable tunneling device is created. When that netdevice is up'd, a decap route is looked up and possibly offloaded. Thus decap is not handled implicitly as part of mlxsw_sp_ipip_entry_get() call anymore, but needs to be done explicitly after the get, if desired. Signed-off-by: Petr Machata <petrm@mellanox.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-16mlxsw: spectrum_router: Move mlxsw_sp_netdev_ipip_type()Petr Machata
Signed-off-by: Petr Machata <petrm@mellanox.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-16mlxsw: spectrum: Move netdevice NB to struct mlxsw_spPetr Machata
So far, all netdevice notifications that the driver cared about were related to its own ports, and mlxsw_sp could be retrieved from the netdevice's private data. For IP-in-IP offloading however, the driver cares about events on foreign netdevices, and getting at mlxsw_sp or router data structures from the handler is inconvenient. Therefore move the netdevice notifier blocks from global scope to struct mlxsw_sp to allow retrieval from the notifier block pointer itself. Signed-off-by: Petr Machata <petrm@mellanox.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-16tipc: fix rebasing errorJon Maloy
In commit 2f487712b893 ("tipc: guarantee that group broadcast doesn't bypass group unicast") there was introduced a last-minute rebasing error that broke non-group communication. We fix this here. Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-16Merge branch 'net-core-rcuify-rtnl-af_ops'David S. Miller
Florian Westphal says: ==================== net: core: rcuify rtnl af_ops None of the rtnl af_ops callbacks sleep, so they can be called while holding rcu read lock. Switch handling of af_ops to rcu. This would allow to later call af_ops functions without holding the rtnl mutex anymore. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-16net: core: rcu-ify rtnl af_opsFlorian Westphal
rtnl af_ops currently rely on rtnl mutex: unregister (called from module exit functions) takes the rtnl mutex and all users that do af_ops lookup also take the rtnl mutex. IOW, parallel rmmod will block until doit() callback is done. As none of the af_ops implementation sleep we can use rcu instead. doit functions that need the af_ops can now use rcu instead of the rtnl mutex provided the mutex isn't needed for other reasons. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-16rtnetlink: place link af dump into own helperFlorian Westphal
next patch will rcu-ify rtnl af_ops, i.e. allow af_ops lookup and function calls with rcu read lock held instead of rtnl mutex. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-16tcp: cdg: make struct tcp_cdg staticColin Ian King
The structure tcp_cdg is local to the source and does not need to be in global scope, so make it static. Cleans up sparse warning: symbol 'tcp_cdg' was not declared. Should it be static? Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-16net: systemport: add NET_DSA dependencyArnd Bergmann
The notifier cause a link error when NET_DSA is a loadable module: drivers/net/ethernet/broadcom/bcmsysport.o: In function `bcm_sysport_remove': bcmsysport.c:(.text+0x1582): undefined reference to `unregister_dsa_notifier' drivers/net/ethernet/broadcom/bcmsysport.o: In function `bcm_sysport_probe': bcmsysport.c:(.text+0x278d): undefined reference to `register_dsa_notifier' This adds a dependency that forces the systemport driver to be a loadable module as well when that happens, but otherwise allows it to be built normally when DSA is either built-in or completely disabled. Fixes: d156576362c0 ("net: systemport: Establish lower/upper queue mapping") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-16hamradio: baycom_par: use new parport device modelSudip Mukherjee
Modify baycom driver to use the new parallel port device model. Signed-off-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-16net: dccp: mark expected switch fall-throughsGustavo A. R. Silva
In preparation to enabling -Wimplicit-fallthrough, mark switch cases where we are expecting to fall through. Notice that for options.c file, I placed the "fall through" comment on its own line, which is what GCC is expecting to find. Signed-off-by: Gustavo A. R. Silva <garsilva@embeddedor.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-16pch_gbe: Switch to new PCI IRQ allocation APIAndy Shevchenko
This removes custom flag handling. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-16tracing: bpf: Hide bpf trace events when they are not usedSteven Rostedt (VMware)
All the trace events defined in include/trace/events/bpf.h are only used when CONFIG_BPF_SYSCALL is defined. But this file gets included by include/linux/bpf_trace.h which is included by the networking code with CREATE_TRACE_POINTS defined. If a trace event is created but not used it still has data structures and functions created for its use, even though nothing is using them. To not waste space, do not define the BPF trace events in bpf.h unless CONFIG_BPF_SYSCALL is defined. Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-16ipv6: only update __use and lastusetime once per jiffy at mostWei Wang
In order to not dirty the cacheline too often, we try to only update dst->__use and dst->lastusetime at most once per jiffy. As dst->lastusetime is only used by ipv6 garbage collector, it should be good enough time resolution. And __use is only used in ipv6_route_seq_show() to show how many times a dst has been used. And as __use is not atomic_t right now, it does not show the precise number of usage times anyway. So we think it should be OK to only update it at most once per jiffy. According to my latest syn flood test on a machine with intel Xeon 6th gen processor and 2 10G mlx nics bonded together, each with 8 rx queues on 2 NUMA nodes: With this patch, the packet process rate increases from ~3.49Mpps to ~3.75Mpps with a 7% increase rate. Note: dst_use() is being renamed to dst_hold_and_use() to better specify the purpose of the function. Signed-off-by: Wei Wang <weiwan@google.com> Acked-by: Eric Dumazet <edumazet@googl.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-16ipv6: check fn before doing FIB6_SUBTREE(fn)Wei Wang
In fib6_locate(), we need to first make sure fn is not NULL before doing FIB6_SUBTREE(fn) to avoid crash. This fixes the following static checker warning: net/ipv6/ip6_fib.c:1462 fib6_locate() warn: variable dereferenced before check 'fn' (see line 1459) net/ipv6/ip6_fib.c 1458 if (src_len) { 1459 struct fib6_node *subtree = FIB6_SUBTREE(fn); ^^^^^^^^^^^^^^^^ We shifted this dereference 1460 1461 WARN_ON(saddr == NULL); 1462 if (fn && subtree) ^^ before the check for NULL. 1463 fn = fib6_locate_1(subtree, saddr, src_len, 1464 offsetof(struct rt6_info, rt6i_src) Fixes: 66f5d6ce53e6 ("ipv6: replace rwlock with rcu and spinlock in fib6_table") Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Wei Wang <weiwan@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-16bpf: Add -target to clang switch while cross compiling.Abhijit Ayarekar
Update to llvm excludes assembly instructions. llvm git revision is below commit 65fad7c26569 ("bpf: add inline-asm support") This change will be part of llvm release 6.0 __ASM_SYSREG_H define is not required for native compile. -target switch includes appropriate target specific files while cross compiling Tested on x86 and arm64. Signed-off-by: Abhijit Ayarekar <abhijit.ayarekar@caviumnetworks.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-16Merge branch 'sched-tp_q-remove'David S. Miller
Jiri Pirko <jiri@mellanox.com> ==================== net: sched: remove some tp->q usage In order to prepare for block sharing, tcf_proto instances need to be independent on particular qdisc instances. This patchset takes care of removal of couple occurrences of tp->q usage. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-16net: sched: propagate q and parent from caller down to tcf_fill_nodeJiri Pirko
The callers have this info, they will pass it down to tcf_fill_node. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-16net: sched: use tcf_block_q helper to get q pointer for sch_tree_lockJiri Pirko
Use tcf_block_q helper to get q pointer to be used for direct call of sch_tree_lock/unlock instead of tcf_tree_lock/unlock. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-16net: sched: tcindex, fw, flow: use tcf_block_q helper to get struct QdiscJiri Pirko
Use helper to get q pointer per block. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-16net: sched: cls_u32: use block instead of q in tc_u_commonJiri Pirko
tc_u_common is now per-q. With blocks, it has to be converted to be per-block. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-16net: sched: ematch: obtain net pointer from blocksJiri Pirko
Instead of using tp->q, use block to get the net pointer. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-16net: sched: teach tcf_bind/unbind_filter to use block->qJiri Pirko
Whenever the block->q is set, it can be used instead of tp->q as it contains the same value. When it is not set, which can't happen now but it might happen with the follow-up shared blocks introduction, the class is not set in the result. That would lead to a class lookup instead of direct class pointer use for classful qdiscs. However, it is not planned to support classful qdisqs sharing filter blocks, so that may never happen. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-16net: sched: introduce tcf_block_q and tcf_block_dev helpersJiri Pirko
These helpers allows to get a q and netdev pointers for given block easily. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-16net: sched: store net pointer in block and introduce qdisc_net helperJiri Pirko
Store net pointer in the block structure. Along the way, introduce qdisc_net helper which allows to easily obtain net pointer for qdisc instance. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-16net: sched: store Qdisc pointer in struct blockJiri Pirko
Prepare for removal of tp->q and store Qdisc pointer in the block structure. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-16mqprio: Reserve last 32 classid values for HW traffic classes and misc IDsAlexander Duyck
This patch makes a slight tweak to mqprio in order to bring the classid values used back in line with what is used for mq. The general idea is to reserve values :ffe0 - :ffef to identify hardware traffic classes normally reported via dev->num_tc. By doing this we can maintain a consistent behavior with mq for classid where :1 - :ffdf will represent a physical qdisc mapped onto a Tx queue represented by classid - 1, and the traffic classes will be mapped onto a known subset of classid values reserved for our virtual qdiscs. Note I reserved the range from :fff0 - :ffff since this way we might be able to reuse these classid values with clsact and ingress which would mean that for mq, mqprio, ingress, and clsact we should be able to maintain a similar classid layout. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Tested-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-16Merge tag 'mlx5-updates-2017-10-11' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux Saeed Mahameed says: ==================== mlx5-updates-2017-10-11: IPoIB Multi Pkey support This series provides the support for IPoIB Multi Pkey. InfiniBand Pkeys are the equivalent of Ethernet vlans. Currently IPoIB device driver supports only default Pkey and IPoIB Pkey child interfaces are not supported with IPoIB offloads mode, this series will add the support for that by allowing creating mlx5 multiple IPoIB netdevices with a non-default Pkey. mlx5 IPoIB Pkey child interface is smaller version of mlx5i IPoIB interfaces and shares most of its resources with the parent IPoIB interface, namely RX steering and ring queue resources. The only mlx5 resources a child Pkey interface will be creating are the TX rings, since they should be assigned to a specific Pkey. mlx5i Pkey netdev is implemented via new mlx5e netdev profile implemented in mlx5/core/ipoib/ipoib_vlan.c. The series starts with a refactoring of mlx5e PTP and mlx5 clock implementation to move the code to be part of mlx5 core rather than mlx5e netdevice, in order to make mlx5 clock and PTP registration part of the core to be shared with mlx5e master Ethernet netdev/IPoIB parent netdev and mlx5_ib in the near future. Add the support for attaching multiple underlay QPs for the different Pkeys in mlx5 core RX steering. Add Pkey index to rdma_netdev to add the ability to set PKEY index to lower IPoIB offload netdev. Use hash-table to map between DQPN (Destination QP number) to child netdev for the IPoIB parent netdev to forward RX packets to the corresponding child Pkey netdev, since the RX rings are shared. The reset of the series adds the ipoib child Pkey: mlx5e netdev profile, netdev nods implementation and minimal set of ethtool callbacks. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-14Merge branch '40GbE' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue Jeff Kirsher says: ==================== 40GbE Intel Wired LAN Driver Updates 2017-10-13 This series contains updates to mqprio and i40e. Amritha introduces a new hardware offload mode in tc/mqprio where the TCs, the queue configurations and bandwidth rate limits are offloaded to the hardware. The existing mqprio framework is extended to configure the queue counts and layout and also added support for rate limiting. This is achieved through new netlink attributes for the 'mode' option which takes values such as 'dcb' (default) and 'channel' and a 'shaper' option for QoS attributes such as bandwidth rate limits in hw mode 1. Legacy devices can fall back to the existing setup supporting hw mode 1 without these additional options where only the TCs are offloaded and then the 'mode' and 'shaper' options defaults to DCB support. The i40e driver enables the new mqprio hardware offload mechanism factoring the TCs, queue configuration and bandwidth rates by creating HW channel VSIs. In this new mode, the priority to traffic class mapping and the user specified queue ranges are used to configure the traffic class when the 'mode' option is set to 'channel'. This is achieved by creating HW channels(VSI). A new channel is created for each of the traffic class configuration offloaded via mqprio framework except for the first TC (TC0) which is for the main VSI. TC0 for the main VSI is also reconfigured as per user provided queue parameters. Finally, bandwidth rate limits are set on these traffic classes through the shaper attribute by sending these rates in addition to the number of TCs and the queue configurations. Colin Ian King makes an array of constant values "constant". Alan fixes and issue where on some firmware versions, we were failing to actually fill out the phy_types which caused ethtool to not report any link types. Also hardened against a potentially malicious VF by not letting the VF to reset itself after requesting to change the number of queues (via ethtool), let the PF reset the VF to institute the requested changes. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-14Merge branch 'tc-testing-updates'David S. Miller
Lucas Bates says: ==================== tc-testing: Test suite updates This patch series is a roundup of changes to the tc-testing suite: - Add test cases for police and mirred modules and some coverage in already-submitted test categories - Break the test case files down into more user-friendly sizes - Bug fix to the tdc.py script's handling of the -l argument v2: fix the lack of final newlines in two new files (thanks David) ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-14tc-testing: fix the -l argument bug in tdc.pyLucas Bates
This patch fixes a bug in the tdc script, where executing tdc with the -l argument would cause the tests to start running as opposed to listing all the known test cases. Signed-off-by: Lucas Bates <lucasb@mojatatu.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-14tc-testing: Add test cases for police and skbmodLucas Bates
Add basic unit tests for police and skbmod actions in tc. Signed-off-by: Lucas Bates <lucasb@mojatatu.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-14tc-testing: Split test case files into smaller chunksLucas Bates
The original submission had the test cases stored in one monolithic file. This can be unwieldy to edit, especially as more test cases are added. This patch removes the original tests.json file in favour of individual ones broken down by category. Signed-off-by: Lucas Bates <lucasb@mojatatu.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-14tc-testing: Add test cases for flushing actionsLucas Bates
Tests for flushing gact and mirred were missing. This patch adds test cases to explicitly test the flush of any installed gact/mirred actions. Signed-off-by: Lucas Bates <lucasb@mojatatu.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-14Merge branch 'macvlan-cleanups'David S. Miller
Alexander Duyck says: ==================== net: Minor macvlan source mode cleanups So this patch series is just a few minor cleanups for macvlan source mode. The first patch addresses double receives when a packet is being routed to the macvlan destination address, and the other addresses the pkt_type being updated in cases where it most likely should not be. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-14macvlan: Only update pkt_type if destination MAC address matchesAlexander Duyck
This patch updates the pkt_type to PACKET_HOST only if the destination MAC address matches on the on the source based macvlan. It didn't make sense to be updating broadcast, multicast, and non-local destined frames with PACKET_HOST. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-14macvlan: Only deliver one copy of the frame to the macvlan interfaceAlexander Duyck
This patch intoduces a slight adjustment for macvlan to address the fact that in source mode I was seeing two copies of any packet addressed to the macvlan interface being delivered where there should have been only one. The issue appears to be that one copy was delivered based on the source MAC address and then the second copy was being delivered based on the destination MAC address. To fix it I am just treating a unicast address match as though it is not a match since source based macvlan isn't supposed to be matching based on the destination MAC anyway. Fixes: 79cf79abce71 ("macvlan: add source mode") Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-14tcp: add a tracepoint for tcp retransmissionCong Wang
We need a real-time notification for tcp retransmission for monitoring. Of course we could use ftrace to dynamically instrument this kernel function too, however we can't retrieve the connection information at the same time, for example perf-tools [1] reads /proc/net/tcp for socket details, which is slow when we have a lots of connections. Therefore, this patch adds a tracepoint for __tcp_retransmit_skb() and exposes src/dst IP addresses and ports of the connection. This also makes it easier to integrate into perf. Note, I expose both IPv4 and IPv6 addresses at the same time: for a IPv4 socket, v4 mapped address is used as IPv6 addresses, for a IPv6 socket, LOOPBACK4_IPV6 is already filled by kernel. Also, add sk and skb pointers as they are useful for BPF. 1. https://github.com/brendangregg/perf-tools/blob/master/net/tcpretrans Cc: Eric Dumazet <edumazet@google.com> Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com> Cc: Hannes Frederic Sowa <hannes@stressinduktion.org> Cc: Brendan Gregg <brendan.d.gregg@gmail.com> Cc: Neal Cardwell <ncardwell@google.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Brendan Gregg <bgregg@netflix.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-14net_sched: fix a compile warning in act_ifeCong Wang
Apparently ife_meta_id2name() is only called when CONFIG_MODULES is defined. This fixes: net/sched/act_ife.c:251:20: warning: ‘ife_meta_id2name’ defined but not used [-Wunused-function] static const char *ife_meta_id2name(u32 metaid) ^~~~~~~~~~~~~~~~ Fixes: d3f24ba895f0 ("net sched actions: fix module auto-loading") Cc: Roman Mashak <mrv@mojatatu.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>