2017-10-09  net: hns3: Consistently using GENMASK in hns3 driver (Yunsheng Lin)

This patch uses GENMASK to generate bit masks whenever possible in the hns3 driver.

Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
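As an illustration of the kind of change (not an actual hns3 hunk; the register name below is made up), GENMASK(h, l) from <linux/bits.h> builds a contiguous mask with bits h through l set:

    #include <linux/bits.h>

    /* hypothetical example field: bits 31..24 of a TX descriptor */
    #define HNS3_TXD_EXAMPLE_M_OLD  0xff000000UL    /* hand-written mask */
    #define HNS3_TXD_EXAMPLE_M      GENMASK(31, 24) /* same value, intent explicit */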
2017-10-09  net: hns3: Cleanup indentation for Kconfig in the hisilicon folder (Yunsheng Lin)

This patch fixes a few indentation issues in the Kconfig file in the hisilicon folder.

Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-09  net: hns3: Add hns3_get_handle macro in hns3 driver (Yunsheng Lin)

There are many places that need to get the handle of a netdev, so add a macro for that.

Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
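A minimal sketch of what such a macro can look like; the struct and field names (hns3_nic_priv, ae_handle) are assumptions about the driver's internals:

    #define hns3_get_handle(ndev) \
            (((struct hns3_nic_priv *)netdev_priv(ndev))->ae_handle)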
2017-10-09  net: hns3: Cleanup for shifting true in hns3 driver (Yunsheng Lin)

This patch fixes an instance of shifting the boolean true in the hclge_main module.

Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-08  qed: Delete redundant check on dcb_app priority (Christos Gkekas)

dcb_app priority is unsigned thus checking whether it is less than zero is redundant.

Signed-off-by: Christos Gkekas <chris.gekas@gmail.com>
Acked-By: Tomer Tayar <Tomer.Tayar@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
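An illustrative example of the pattern being removed (not the actual qed hunk): for an unsigned type the lower bound can never be violated, so that half of the check is dead code and compilers typically warn about it.

    struct dcb_app_entry {
            u8 priority;            /* unsigned */
    };

    static bool prio_valid(const struct dcb_app_entry *e)
    {
            /* "e->priority >= 0" is always true, so only the
             * upper bound is worth checking */
            return e->priority < 8;
    }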
2017-10-08  net: ethernet: stmmac: Clean up dead code (Christos Gkekas)

Many macros in dwmac-ipq806x are unused and should be removed. Moreover gmac->id is an unsigned variable and therefore checking whether it is less than zero is redundant.

Signed-off-by: Christos Gkekas <chris.gekas@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-08  Merge branch 'ipv6_dev_get_saddr-rcu' (David S. Miller)

Eric Dumazet says:

====================
ipv6: ipv6_dev_get_saddr() rcu works

Sending IPv6 udp packets on non connected sockets is quite slow, because ipv6_dev_get_saddr() is still using an rwlock and silly references games on ifa.

Tested:
$ ./super_netperf 16 -H 4444::555:0786 -l 2000 -t UDP_STREAM -- -m 100 &
[1] 12527

Performance is boosted from 2.02 Mpps to 4.28 Mpps

Kernel profile before patches:
 22.62%  [kernel]  [k] _raw_read_lock_bh
  7.04%  [kernel]  [k] refcount_sub_and_test
  6.56%  [kernel]  [k] ipv6_get_saddr_eval
  5.67%  [kernel]  [k] _raw_read_unlock_bh
  5.34%  [kernel]  [k] __ipv6_dev_get_saddr
  4.95%  [kernel]  [k] refcount_inc_not_zero
  4.03%  [kernel]  [k] __ip6addrlbl_match
  3.70%  [kernel]  [k] _raw_spin_lock
  3.44%  [kernel]  [k] ipv6_dev_get_saddr
  3.24%  [kernel]  [k] ip6_pol_route
  3.06%  [kernel]  [k] refcount_add_not_zero
  2.30%  [kernel]  [k] __local_bh_enable_ip
  1.81%  [kernel]  [k] mlx4_en_xmit
  1.20%  [kernel]  [k] __ip6_append_data
  1.12%  [kernel]  [k] __ip6_make_skb
  1.11%  [kernel]  [k] __dev_queue_xmit
  1.06%  [kernel]  [k] l3mdev_master_ifindex_rcu

Kernel profile after patches:
 11.36%  [kernel]  [k] ip6_pol_route
  7.65%  [kernel]  [k] _raw_spin_lock
  7.16%  [kernel]  [k] __ipv6_dev_get_saddr
  6.49%  [kernel]  [k] ipv6_get_saddr_eval
  6.04%  [kernel]  [k] refcount_add_not_zero
  3.34%  [kernel]  [k] __ip6addrlbl_match
  2.62%  [kernel]  [k] __dev_queue_xmit
  2.37%  [kernel]  [k] mlx4_en_xmit
  2.26%  [kernel]  [k] dst_release
  1.89%  [kernel]  [k] __ip6_make_skb
  1.87%  [kernel]  [k] __ip6_append_data
  1.86%  [kernel]  [k] udpv6_sendmsg
  1.86%  [kernel]  [k] ip6t_do_table
  1.64%  [kernel]  [k] ipv6_dev_get_saddr
  1.64%  [kernel]  [k] find_match
  1.51%  [kernel]  [k] l3mdev_master_ifindex_rcu
  1.24%  [kernel]  [k] ipv6_addr_label
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-08  ipv6: avoid cache line dirtying in ipv6_dev_get_saddr() (Eric Dumazet)

By extending the rcu section a bit, we can avoid these very expensive in6_ifa_put()/in6_ifa_hold() calls done in __ipv6_dev_get_saddr() and ipv6_dev_get_saddr().

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-08  ipv6: __ipv6_dev_get_saddr() rcu conversion (Eric Dumazet)

Callers hold rcu_read_lock(), so we do not need the rcu_read_lock()/rcu_read_unlock() pair.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-08  ipv6: ipv6_chk_prefix() rcu conversion (Eric Dumazet)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-08  ipv6: ipv6_chk_custom_prefix() rcu conversion (Eric Dumazet)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-08  ipv6: ipv6_count_addresses() rcu conversion (Eric Dumazet)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-08  ipv6: prepare RCU lookups for idev->addr_list (Eric Dumazet)

inet6_ifa_finish_destroy() already uses kfree_rcu() to free inet6_ifaddr structs. We need to use proper list additions/deletions in order to allow readers to use RCU instead of idev->lock rwlock.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
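A sketch of the pattern, assuming idev->addr_list is a list_head and inet6_ifaddr carries an rcu head: writers switch to the _rcu list helpers under the write side of idev->lock, freeing stays deferred via kfree_rcu(), and readers can then walk the list under rcu_read_lock() instead of read_lock_bh(&idev->lock).

    /* addition, under write_lock_bh(&idev->lock) */
    list_add_tail_rcu(&ifa->if_list, &idev->addr_list);

    /* deletion, under write_lock_bh(&idev->lock) */
    list_del_rcu(&ifa->if_list);

    /* free only after an rcu grace period has elapsed */
    kfree_rcu(ifa, rcu);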
2017-10-08  Merge branch 'bridge-neigh-msg-proxy-and-flood-suppression-support' (David S. Miller)

Roopa Prabhu says:

====================
bridge: neigh msg proxy and flood suppression support

This series implements arp and nd suppression in the bridge driver for ethernet vpns. It implements rfc7432, section 10 (https://tools.ietf.org/html/rfc7432#section-10) for ethernet VPN deployments. It is similar to the existing BR_PROXYARP* flags but has a few semantic differences to conform to the EVPN standard. Unlike the existing flags, this new flag suppresses flood of all neigh discovery packets (arp and nd) to tunnel ports. Supports both vlan filtering and non-vlan filtering bridges. In case of EVPN, it is mainly used to avoid flooding of arp and nd packets to tunnel ports like vxlan.

v2: rebase to latest + address some optimization feedback from Nikolay.
v3: fix kbuild reported build errors with CONFIG_INET off
v4: simplify port flag mask as suggested by stephen
v5: address some feedback from Toshiaki
v6: some v5 cleanups in nd suppress (keep it consistent with arp suppress)
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-08  bridge: suppress nd pkts on BR_NEIGH_SUPPRESS ports (Roopa Prabhu)

This patch avoids flooding and proxies ndisc packets for BR_NEIGH_SUPPRESS ports.

Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-08  bridge: suppress arp pkts on BR_NEIGH_SUPPRESS ports (Roopa Prabhu)

This patch avoids flooding and proxies arp packets for BR_NEIGH_SUPPRESS ports. Moves existing br_do_proxy_arp to br_do_proxy_suppress_arp to support both proxy arp and neigh suppress.

Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-08  bridge: add new BR_NEIGH_SUPPRESS port flag to suppress arp and nd flood (Roopa Prabhu)

This patch adds a new bridge port flag BR_NEIGH_SUPPRESS to suppress arp and nd flood on bridge ports. It implements rfc7432, section 10 (https://tools.ietf.org/html/rfc7432#section-10) for ethernet VPN deployments.

It is similar to the existing BR_PROXYARP* flags but has a few semantic differences to conform to the EVPN standard. Unlike the existing flags, this new flag suppresses flood of all neigh discovery packets (arp and nd) to tunnel ports. Supports both vlan filtering and non-vlan filtering bridges. In case of EVPN, it is mainly used to avoid flooding of arp and nd packets to tunnel ports like vxlan.

This patch adds netlink and sysfs support to set this bridge port flag.

Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-08  ipv6: fix a BUG in rt6_get_pcpu_route() (Eric Dumazet)

Ido reported the following splat and provided a patch.

[ 122.221814] BUG: using smp_processor_id() in preemptible [00000000] code: sshd/2672
[ 122.221845] caller is debug_smp_processor_id+0x17/0x20
[ 122.221866] CPU: 0 PID: 2672 Comm: sshd Not tainted 4.14.0-rc3-idosch-next-custom #639
[ 122.221880] Hardware name: Mellanox Technologies Ltd. MSN2100-CB2FO/SA001017, BIOS 5.6.5 06/07/2016
[ 122.221893] Call Trace:
[ 122.221919]  dump_stack+0xb1/0x10c
[ 122.221946]  ? _atomic_dec_and_lock+0x124/0x124
[ 122.221974]  ? ___ratelimit+0xfe/0x240
[ 122.222020]  check_preemption_disabled+0x173/0x1b0
[ 122.222060]  debug_smp_processor_id+0x17/0x20
[ 122.222083]  ip6_pol_route+0x1482/0x24a0
...

I believe we can simplify this code path a bit, since we no longer hold a read_lock and need to release it to avoid a deadlock. By disabling BH, we make sure we'll prevent code re-entry and that rt6_get_pcpu_route()/rt6_make_pcpu_route() run on the same cpu.

Fixes: 66f5d6ce53e6 ("ipv6: replace rwlock with rcu and spinlock in fib6_table")
Reported-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Tested-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
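A sketch of the idea behind the fix: disabling BH pins the task to one CPU and prevents re-entry, so the per-cpu lookup and any subsequent insert use the same per-cpu slot (function names as in the text above; the surrounding code is simplified).

    local_bh_disable();
    rt = rt6_get_pcpu_route(rt6);       /* smp_processor_id() is safe here */
    if (!rt)
            rt = rt6_make_pcpu_route(rt6);
    local_bh_enable();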
2017-10-08  Merge tag 'mlx5-updates-2017-10-06' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux (David S. Miller)

Saeed Mahameed says:

====================
Mellanox, mlx5 updates 2017-10-06

This series includes some shared code updates for kernel 4.15 to both net-next and rdma-next trees. The series includes mlx5 low level flow steering updates and optimizations to support firmware command parallelism for flow steering requests from Maor Gottlieb, and two other small fixes from Matan and Maor.

One fix from Matan adds error handling for when the destination list of the flow steering rule is full. Maor introduced a patch to avoid NULL pointer dereference on steering cleanup.

Then come some refactoring patches needed by the series for code sharing purposes, which split the Flow Table Entry (FTE) and Flow Group (FG) creation code into two parts:
1) Object allocation - allocate the steering node and initialize its resources.
2) The firmware command execution.

This change will give us the ability to take the write lock on the parent node (e.g. FG for FTE creation) only for the software data struct allocation and creation part of the procedure, where the synchronization is really required, and will allow us to execute multiple firmware commands simultaneously and overcome the firmware bottleneck.

Refactor the locking scheme of the mlx5 core flow steering as follows:

1) Replace the mutex lock with a readers-writers semaphore and take the write lock only when necessary (e.g. allocating a new flow table entry index or adding a node to the parent's children list). When we try to find a suitable child in the parent's children list (e.g. search for a flow group with the same match_criteria of the rule) then we only take the read lock.

2) Add a versioning mechanism - each steering entity (FT, FG, FTE, DST) will have an incremental version. The version is increased when the entity is changed (e.g. when a new FTE was added to an FG - the FG's version is increased). Versioning is used in order to determine if the last traverse of an entity's children is valid or a rescan under write lock is required.

The last patch adds an FGs and FTEs memory pool. It is useful because these objects are not small and could be allocated/deallocated many times. This support improves the insertion rate of steering rules from ~5k/sec to ~40k/sec.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-08  Merge branch 'hv_netvsc-TCP-hash-level' (David S. Miller)

Haiyang Zhang says:

====================
hv_netvsc: support changing TCP hash level

The patch set simplifies the existing hash level switching code for UDP. It also adds the support for changing TCP hash level, so users can switch between L3 and L4 hash levels for TCP and UDP.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-08  hv_netvsc: Update netvsc Document for TCP hash level setting (Haiyang Zhang)

Update Documentation/networking/netvsc.txt for TCP hash level setting and related info.

Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-08  hv_netvsc: Add ethtool handler to set and get TCP hash levels (Haiyang Zhang)

The patch supports the options to switch TCP hash level between L3 and L4 by ethtool command. TCP over IPv4 and v6 can be set differently. The default hash level is L4. We currently only allow switching TX hash level from within the guests.

For example, for TCP over IPv4 on eth0:

To include TCP port numbers in hashing:
	ethtool -N eth0 rx-flow-hash tcp4 sdfn
To exclude TCP port numbers in hashing:
	ethtool -N eth0 rx-flow-hash tcp4 sd
To show TCP hash level:
	ethtool -n eth0 rx-flow-hash tcp4

Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-08  hv_netvsc: Change the hash level variable to bit flags (Haiyang Zhang)

This simplifies the logic and makes it easier to add more options.

Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-08  Merge branch 'mlxsw-more-extack' (David S. Miller)

Jiri Pirko says:

====================
mlxsw: Add more extack error reporting

Ido says:

Add error messages to VLAN and bridge enslavements to help users understand why the enslavement failed.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-08  mlxsw: spectrum: Propagate extack further for bridge enslavements (Ido Schimmel)

The code that actually takes care of bridge offload introduces a few more non-trivial constraints with regards to bridge enslavements. Propagate extack there to indicate the reason.

$ ip link add link enp1s0np1 name enp1s0np1.10 type vlan id 10
$ ip link add link enp1s0np1 name enp1s0np1.20 type vlan id 20
$ ip link add name br0 type bridge
$ ip link set dev enp1s0np1.10 master br0
$ ip link set dev enp1s0np1.20 master br0
Error: spectrum: Can not bridge VLAN uppers of the same port.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-08  mlxsw: spectrum: Add extack for VLAN enslavements (Ido Schimmel)

Similar to physical ports, enslavement of VLAN devices can also fail. Use extack to indicate why the enslavement failed.

$ ip link add link enp1s0np1 name enp1s0np1.10 type vlan id 10
$ ip link add name bond0 type bond mode 802.3ad
$ ip link set dev enp1s0np1.10 master bond0
Error: spectrum: VLAN devices only support bridge and VRF uppers.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-07  Merge branch 'bpf-obj-name-misc' (David S. Miller)

Martin KaFai Lau says:

====================
bpf: Misc improvements and a new usage on bpf obj name

The first two patches make improvements on the bpf obj name. The last patch adds the prog name to kallsyms.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-07  bpf: Append prog->aux->name in bpf_get_prog_name() (Martin KaFai Lau)

This patch makes the bpf_prog's name available in kallsyms. The new format is bpf_prog_tag[_name].

Sample kallsyms from running selftests/bpf/test_progs:

[root@arch-fb-vm1 ~]# egrep ' bpf_prog_[0-9a-fA-F]{16}' /proc/kallsyms
ffffffffa0048000 t bpf_prog_dabf0207d1992486_test_obj_id
ffffffffa0038000 t bpf_prog_a04f5eef06a7f555__123456789ABCDE
ffffffffa0050000 t bpf_prog_a04f5eef06a7f555

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-07  bpf: Use char in prog and map name (Martin KaFai Lau)

Instead of u8, use char for prog and map name. It can avoid the userspace tool getting a compiler signedness warning. The bpf_prog_aux, bpf_map, bpf_attr, bpf_prog_info and bpf_map_info are changed.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-07  bpf: Change bpf_obj_name_cpy() to better ensure map's name is init by 0 (Martin KaFai Lau)

During get_info_by_fd, the prog/map name is memcpy-ed. It depends on prog->aux->name and map->name being zero initialized. It is easy to guarantee that bpf_prog_aux's aux->name is zero init. The name in bpf_map may be harder to guarantee in the future when new map types are added. Hence, this patch makes bpf_obj_name_cpy() always zero init the prog/map name.

Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
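A minimal sketch of the resulting behavior (not necessarily the exact kernel code): zero the destination buffer up front so the copied name is always NUL-padded, whatever state the source was in.

    static void obj_name_cpy(char *dst, const char *src, size_t size)
    {
            memset(dst, 0, size);           /* name is now zero init */
            strncpy(dst, src, size - 1);    /* always NUL terminated */
    }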
2017-10-07  ip_gre: check packet length and mtu correctly in erspan tx (William Tu)

Similarly to the earlier patch for erspan_xmit(), on an ARPHRD_ETHER device skb->len is the length of the whole Ethernet packet. So skb->len should subtract dev->hard_header_len.

Fixes: 1a66a836da63 ("gre: add collect_md mode to ERSPAN tunnel")
Fixes: 84e54fe0a5ea ("gre: introduce native tunnel support for ERSPAN")
Signed-off-by: William Tu <u9012063@gmail.com>
Cc: Xin Long <lucien.xin@gmail.com>
Cc: David Laight <David.Laight@aculab.com>
Reviewed-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
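A hedged sketch of the corrected comparison; the ERSPAN header type name below is an assumption, and the exact expression in the patch may differ.

    /* skb->len counts the Ethernet header on an ARPHRD_ETHER device,
     * so remove it before comparing the payload against the MTU */
    if (skb->len - dev->hard_header_len > dev->mtu + sizeof(struct erspanhdr))
            truncate = true;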
2017-10-07  net: phonet: mark phonet_protocol as const (Lin Zhang)

The phonet_protocol structs don't need to be written by anyone and so can be marked as const.

Signed-off-by: Lin Zhang <xiaolou4617@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-07  net: phonet: mark header_ops as const (Lin Zhang)

Signed-off-by: Lin Zhang <xiaolou4617@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-07  Merge branch 'bpf-perf-time-helpers' (David S. Miller)

Yonghong Song says:

====================
bpf: add two helpers to read perf event enabled/running time

Hardware pmu counters are limited resources. When there are more pmu based perf events opened than available counters, kernel will multiplex these events so each event gets certain percentage (but not 100%) of the pmu time. In case that multiplexing happens, the number of samples or counter value will not reflect the case compared to no multiplexing. This makes comparison between different runs difficult. Typically, the number of samples or counter value should be normalized before comparing to other experiments. The typical normalization is done like:

  normalized_num_samples = num_samples * time_enabled / time_running
  normalized_counter_value = counter_value * time_enabled / time_running

where time_enabled is the time enabled for the event and time_running is the time running for the event since the last normalization.

This patch set implements two helper functions. The helper bpf_perf_event_read_value reads counter/time_enabled/time_running for a perf event array map. The helper bpf_perf_prog_read_value reads counter/time_enabled/time_running for a bpf prog with type BPF_PROG_TYPE_PERF_EVENT.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-07  bpf: add a test case for helper bpf_perf_prog_read_value (Yonghong Song)

The bpf sample program trace_event is enhanced to use the new helper to print out enabled/running time.

Signed-off-by: Yonghong Song <yhs@fb.com>
Acked-by: Alexei Starovoitov <ast@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-07  bpf: add helper bpf_perf_prog_read_value (Yonghong Song)

This patch adds helper bpf_perf_prog_read_value for perf event based bpf programs, to read the event counter and enabled/running time. The enabled/running time is accumulated since the perf event open.

The typical use case for a perf event based bpf program is to attach itself to a single event. In such cases, if it is desirable to get the scaling factor between two bpf invocations, users can save the time values in a map, and use the value from the map and the current value to calculate the scaling factor.

Signed-off-by: Yonghong Song <yhs@fb.com>
Acked-by: Alexei Starovoitov <ast@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
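A usage sketch for a BPF_PROG_TYPE_PERF_EVENT program; the helper and struct names follow this patch's description, while SEC() and bpf_printk() are the usual bpf sample/header conveniences assumed here.

    SEC("perf_event")
    int on_sample(struct bpf_perf_event_data *ctx)
    {
            struct bpf_perf_event_value v = {};

            if (bpf_perf_prog_read_value(ctx, &v, sizeof(v)) == 0)
                    /* enabled/running accumulate since perf_event_open() */
                    bpf_printk("cnt=%llu en/run=%llu/%llu",
                               v.counter, v.enabled, v.running);
            return 0;
    }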
2017-10-07  bpf: add a test case for helper bpf_perf_event_read_value (Yonghong Song)

The bpf sample program tracex6 is enhanced to use the new helper to read enabled/running time as well.

Signed-off-by: Yonghong Song <yhs@fb.com>
Acked-by: Alexei Starovoitov <ast@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-07  bpf: add helper bpf_perf_event_read_value for perf event array map (Yonghong Song)

Hardware pmu counters are limited resources. When there are more pmu based perf events opened than available counters, kernel will multiplex these events so each event gets certain percentage (but not 100%) of the pmu time. In case that multiplexing happens, the number of samples or counter value will not reflect the case compared to no multiplexing. This makes comparison between different runs difficult. Typically, the number of samples or counter value should be normalized before comparing to other experiments. The typical normalization is done like:

  normalized_num_samples = num_samples * time_enabled / time_running
  normalized_counter_value = counter_value * time_enabled / time_running

where time_enabled is the time enabled for the event and time_running is the time running for the event since the last normalization.

This patch adds helper bpf_perf_event_read_value for kprobe based perf event array maps, to read the perf counter and enabled/running time. The enabled/running time is accumulated since the perf event open. To achieve a scaling factor between two bpf invocations, users can use cpu_id as the key (which is typical for the perf array usage model) to remember the previous value and do the calculation inside the bpf program.

Signed-off-by: Yonghong Song <yhs@fb.com>
Acked-by: Alexei Starovoitov <ast@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
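A sketch of the normalization described above, reading through a BPF_MAP_TYPE_PERF_EVENT_ARRAY indexed by cpu id; the map name "events" is hypothetical.

    struct bpf_perf_event_value v = {};
    __u32 cpu = bpf_get_smp_processor_id();
    __u64 normalized = 0;

    if (bpf_perf_event_read_value(&events, cpu, &v, sizeof(v)) == 0 && v.running)
            /* scale the raw counter by enabled/running time */
            normalized = v.counter * v.enabled / v.running;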
2017-10-07  bpf: perf event change needed for subsequent bpf helpers (Yonghong Song)

This patch does not impact existing functionalities. It contains the changes in the perf event area needed for the subsequent bpf_perf_event_read_value and bpf_perf_prog_read_value helpers.

Signed-off-by: Yonghong Song <yhs@fb.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-07  ip_tunnel: add mpls over gre support (Amine Kherbouche)

This commit introduces the MPLSoGRE support (RFC 4023), using the ip tunnel API, by simply adding ipgre_tunnel_encap_(add|del)_mpls_ops() and the new tunnel type TUNNEL_ENCAP_MPLS.

Signed-off-by: Amine Kherbouche <amine.kherbouche@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-07  Merge branch 'fib6-rcu' (David S. Miller)

Wei Wang says:

====================
ipv6: replace rwlock with rcu and spinlock in fib6 table

Currently, the fib6 table is protected by rwlock. During route lookup, the reader lock is taken, and during route insertion, deletion or modification, the writer lock is taken. This is a very inefficient implementation because the fastpath always has to do the operation to grab the reader lock.

According to my latest syn flood test on an iota Ivy Bridge machine with 2 10G mlx nics bonded together, each with 8 rx queues on 2 NUMA nodes, and with the upstream net-next kernel:
ipv4 stack can handle around 4.2Mpps
ipv6 stack can handle around 1.3Mpps

In order to close the gap of the performance number between ipv4 and ipv6 stack, this patch series tries to get rid of the usage of the rwlock and replace it with rcu and spinlock protection. This will greatly speed up the fastpath performance as it only needs to hold rcu, which is much less expensive than grabbing the reader lock. It also makes the ipv6 fib implementation more consistent with ipv4.

In order to be able to replace the current rwlock with rcu and spinlock, some preparation work is needed:
Patch 1-8 introduces a per-route hash table (protected by rcu and a different spinlock) to store all cached routes created by pmtu and ip redirect under its main route. This makes the main fib6 tree only contain static routes.
Patch 9-14 prepares all the reader paths to be ready to tolerate concurrent writers.
Patch 15 finally does the rwlock to rcu and spinlock conversion.
Patch 16 takes care of rt6_stats.

After this patch series, in the same syn flood test, ipv6 stack can now handle around 3.5Mpps compared to the previous 1.3Mpps in my test setup.

After this patch series, there are still some improvements that should be done in ipv6 stack:
1. During route lookup, dst_use() is called every time on the selected route to update dst->__use and dst->lastuse. This dirties the cacheline, causes extra cacheline misses and should be avoided.
2. When no route is found in the current table, net->ipv6.ip6_null_entry is used and refcnt is taken on it. As there is no pcpu cache for this specific route, frequent change on the refcnt for this route causes quite some cacheline misses. And to make things worse, if CONFIG_IPV6_MULTIPLE_TABLES is defined, output path route lookup always starts with the local table first and is guaranteed to hit net->ipv6.ip6_null_entry before continuing to do lookup in the main table. These operations on net->ipv6.ip6_null_entry could potentially be avoided.
3. ipv6 input path route lookup grabs refcnt on dst. This is different from ipv4. We could potentially change this behavior to let ipv6 input path route lookup not grab refcnt on dst. However, it does not give us much performance boost as we currently have pcpu route cache for the input path as well in ipv6. But this work is probably still worth doing to unify ipv6 and ipv4 route lookup behavior.

The above issues will be addressed separately after this patch series has been accepted.

This is a joint work with Martin KaFai Lau and Eric Dumazet. And many many thanks to them for their inspiring ideas and big big code review efforts.
====================

Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-07  ipv6: take care of rt6_stats (Wei Wang)

Currently, most of the rt6_stats are not hooked up correctly. As the last part of this patch series, hook up all existing rt6_stats and add one new stat fib_rt_uncache to indicate the number of routes in the uncached list. For details of the stats, please refer to the comments added in include/net/ip6_fib.h.

Note: fib_rt_alloc and fib_rt_uncache are not guaranteed to be modified under a lock. So atomic_t is used for them.

Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-07  ipv6: replace rwlock with rcu and spinlock in fib6_table (Wei Wang)

With all the preparation work before, we are now ready to replace rwlock with rcu and spinlock in fib6_table. That means all fib6_node in fib6_table are now protected by rcu, and when freeing a fib6_node, call_rcu() is used to wait for the rcu grace period before releasing the memory. When accessing fib6_node, corresponding rcu APIs need to be used. All previous critical sections protected by the write lock will now be protected by the per-table spin lock, and all previous critical sections protected by the read lock will now be protected by rcu_read_lock().

A couple of things to note here:
1. As part of the work of replacing rwlock with rcu, the linked list of fn->leaf now has to be rcu protected as well. So both fn->leaf and rt->dst.rt6_next are now __rcu tagged and corresponding rcu APIs are used when manipulating them.
2. For fn->rr_ptr, first of all, it also needs to be rcu protected now and is tagged with __rcu, and rcu APIs are used in corresponding places. Secondly, fn->rr_ptr is changed in rt6_select(), which is a reader thread. This makes the issue a bit complicated. We think a valid solution for it is to let rt6_select() grab the tb6_lock if it decides to change it. As it is not in the normal operation and only happens when there is no valid neighbor cache for the route, we think the performance impact should be low.
3. fib6_walk_continue() has to be called with tb6_lock held even in the route dumping related functions, e.g. inet6_dump_fib(), fib6_tables_dump() and ipv6_route_seq_ops. It is because fib6_walk_continue() makes modifications to the walker structure, and so do fib6_repair_tree() and fib6_del_route(). In order to do proper syncing between them, we need to let fib6_walk_continue() hold the lock. We may be able to do further improvement on the way we do the tree walk to get rid of the need for holding the spin lock. But not for now.
4. When fib6_del_route() removes a route from the tree, we no longer mark rt->dst.rt6_next to NULL, so that simultaneous readers are able to further traverse the list with rcu. However, rt->dst.rt6_next is only valid within this same rcu period. No one should access it later.
5. All the operations of atomic_inc(rt->rt6i_ref) are changed to be performed before we publish this route (either by linking it to fn->leaf or inserting it in the list pointed to by fn->leaf), just to be safe, because as soon as we publish the route, some read thread will be able to access it.

Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-07  ipv6: add key length check into rt6_select() (Wei Wang)

After rwlock is replaced with rcu and spinlock, fib6_lookup() could potentially return an intermediate node if another thread is doing fib6_del() on a route which is the only route on the node, so that fib6_repair_tree() will be called on this node and potentially assigns fn->leaf to its child's fn->leaf.

In order to detect this situation in rt6_select(), we have to check if fn->fn_bit is consistent with the key length stored in the route. And depending on whether the fn is in the subtree or not, the key is either rt->rt6i_dst or rt->rt6i_src. If any inconsistency is found, that means the node no longer holds valid routes in it, so net->ipv6.ip6_null_entry is returned.

Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-07  ipv6: check fn->leaf before it is used (Wei Wang)

If rwlock is replaced with rcu and spinlock, it is possible that the reader thread will see fn->leaf as NULL in the following scenarios:
1. fib6_add() is in progress and we have already inserted a new node but not yet inserted the route.
2. fib6_del_route() is in progress and we have already set fn->leaf to NULL but not yet freed the node because of the rcu grace period.

This patch makes sure all the reader threads check fn->leaf first before using it. And together with a later patch to grab rcu_read_lock() and rcu_dereference() fn->leaf, it makes sure reader threads are safe when accessing fn->leaf.

Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-07  ipv6: update fn_sernum after route is inserted to tree (Wei Wang)

fib6_add() logic currently calls fib6_add_1() to figure out what node should be used for the newly added route and then calls fib6_add_rt2node() to insert the route to the node. And during the call of fib6_add_1(), fn_sernum is updated for all nodes that share the same prefix as the new route. This is not an issue in the current code because the reader thread will not be able to access the tree while the writer thread is inserting a new route to it. However, it is not the case once we transition to use RCU. A reader thread could potentially see the new fn_sernum before the new route is inserted. As a result, the reader thread's route lookup will return a stale route with the new fn_sernum.

In order to solve this issue, we remove all the updates of fn_sernum in fib6_add_1(), and instead introduce a new function that updates fn_sernum for all related nodes, and call this function once the route is successfully inserted to the tree. Also, smp_wmb() is used after a route is successfully inserted into the fib tree and right before the update of fn->sernum. And smp_rmb() is used right after fn->sernum is accessed in rt6_get_cookie_safe(). This is to guarantee that when the reader thread sees the new fn->sernum, the new route is already inserted in the tree in memory.

Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
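A sketch of the publish/observe pairing described above (the fn->fn_sernum field name follows ip6_fib conventions; this is a pattern illustration, not the patch's exact code).

    /* writer, after the route is linked into the tree: */
    smp_wmb();                  /* order route insert before sernum store */
    fn->fn_sernum = sernum;

    /* reader, rt6_get_cookie_safe()-style: */
    cookie = fn->fn_sernum;
    smp_rmb();                  /* seeing the new sernum implies seeing the route */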
2017-10-07  ipv6: replace dst_hold() with dst_hold_safe() in routing code (Wei Wang)

With rwlock, it is safe to call dst_hold() in the read thread because the read thread is guaranteed to be separated from the write thread. However, after we replace rwlock with rcu, it is no longer safe to use dst_hold(). A dst might already have been deleted but is waiting for the rcu grace period to pass before freeing the memory when a read thread is trying to do dst_hold(). This could potentially cause a double free issue.

So this commit replaces all dst_hold() with dst_hold_safe() in all read threads to avoid this double free issue. And in order to make the code more compact, a new function ip6_hold_safe() is introduced. It calls dst_hold_safe() first, and if that fails, it will either fall back to hold and return net->ipv6.ip6_null_entry or set rt to NULL according to the caller's need.

Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
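A minimal sketch of the ip6_hold_safe() idea just described; the exact signature in the patch may differ.

    static bool ip6_hold_safe(struct net *net, struct rt6_info **prt,
                              bool null_fallback)
    {
            struct rt6_info *rt = *prt;

            if (dst_hold_safe(&rt->dst))
                    return true;                    /* got a reference */
            if (null_fallback) {
                    rt = net->ipv6.ip6_null_entry;
                    dst_hold(&rt->dst);             /* never freed, safe to hold */
            } else {
                    rt = NULL;
            }
            *prt = rt;
            return false;
    }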
2017-10-07  ipv6: don't release rt->rt6i_pcpu memory during rt6_release() (Wei Wang)

After rwlock is replaced with rcu and spinlock, route lookup can happen simultaneously with route deletion. This patch removes the call to free_percpu(rt->rt6i_pcpu) from rt6_release() to avoid the race condition between rt6_release() and rt6_get_pcpu_route(). And as free_percpu(rt->rt6i_pcpu) is already called in ip6_dst_destroy() after the rcu grace period, it is safe to do this change.

Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-07  ipv6: grab rt->rt6i_ref before allocating pcpu rt (Wei Wang)

After rwlock is replaced with rcu and spinlock, ip6_pol_route() will be called with only rcu held. That means rt6 route deletion could happen simultaneously with rt6_make_pcpu_rt(). This could potentially cause a memory leak if rt6_release() is called right before rt6_make_pcpu_rt() on the same route.

This patch grabs rt->rt6i_ref safely before calling rt6_make_pcpu_rt() to make sure rt6_release() will not get triggered while rt6_make_pcpu_rt() is in progress. And rt6_release() is called after rt6_make_pcpu_rt() is finished.

Note: As we are incrementing rt->rt6i_ref in ip6_pol_route(), there is a very slim chance that fib6_purge_rt() will be triggered unnecessarily when deleting a route, if ip6_pol_route() running on another thread picks this route as well and tries to make a pcpu cache for it.

Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-07  ipv6: hook up exception table to store dst cache (Wei Wang)

This commit makes use of the exception hash table implementation to store dst caches created by pmtu discovery and ip redirect into the hash table under the rt6_info, and no longer inserts these routes into the fib6 tree. This makes the fib6 tree only contain statically configured routes, which could now be protected by rcu instead of a rw lock.

With this change, in the route lookup related functions, after finding the rt6_info with the longest prefix, we also need to search the exception table before doing backtracking.

In the route delete function, if the route being deleted is not a dst cache, deletion of this route also needs to flush the whole hash table under it. If it is a dst cache, then only delete the cached dst in the hash table.

Note: for the fib6_walk_continue() function, w->root now always points to a root node, considering that fib6_prune_clones() is removed from the code. So we add a WARN_ON() msg to make sure w->root always points to a root node, and also remove the update of w->root in fib6_repair_tree(). This is a prerequisite for a later patch because we don't need to make w->root rcu protected when replacing rwlock with RCU. Also, we remove all prune related variables as they are no longer used.

Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>