2018-03-31inet: frags: break the 2GB limit for frags storageEric Dumazet
Some users are willing to provision huge amounts of memory to be able to perform reassembly reasonably well under pressure.

Current memory tracking is using one atomic_t and integers.

Switch to atomic_long_t so that 64bit arches can use more than 2GB, without any cost for 32bit arches.

Note that this patch avoids an overflow error, if high_thresh was set to ~2GB, since this test in inet_frag_alloc() was never true :

    if (... || frag_mem_limit(nf) > nf->high_thresh)

Tested:

    $ echo 16000000000 >/proc/sys/net/ipv4/ipfrag_high_thresh

    <frag DDOS>

    $ grep FRAG /proc/net/sockstat
    FRAG: inuse 14705885 memory 16000002880

    $ nstat -n ; sleep 1 ; nstat | grep Reas
    IpReasmReqds                    3317150            0.0
    IpReasmFails                    3317112            0.0

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
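The overflow note above can be illustrated with a small userspace sketch. It is not the kernel code: `over_thresh_32()` and `over_thresh_64()` are hypothetical stand-ins for the old int-based and the new long-based comparison, assuming a 32-bit `int` and a 64-bit `long` on 64bit arches.

```c
#include <stdbool.h>
#include <stdint.h>

/* Old accounting (assumed 32-bit): with high_thresh near INT32_MAX,
 * no representable counter value can exceed it, so the test never fires. */
bool over_thresh_32(int32_t mem, int32_t high_thresh)
{
    return mem > high_thresh;
}

/* New accounting (assumed 64-bit long): thresholds well above 2GB
 * are representable and the comparison behaves as intended. */
bool over_thresh_64(int64_t mem, int64_t high_thresh)
{
    return mem > high_thresh;
}
```

With the figures from the test run above, `over_thresh_64(16000002880, 16000000000)` is true, while `over_thresh_32()` cannot be true for any counter value once the threshold sits at `INT32_MAX`.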
2018-03-31inet: frags: remove inet_frag_maybe_warn_overflow()Eric Dumazet
This function is obsolete, after rhashtable addition to inet defrag. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31inet: frags: get rif of inet_frag_evicting()Eric Dumazet
This refactors ip_expire() since one indentation level is removed.

Note: in the future, we should try hard to avoid the skb_clone() since this is a serious performance cost. Under DDOS, the ICMP message won't be sent because of rate limits.

The fact that ip6_expire_frag_queue() does not use skb_clone() is disturbing too. Presumably IPv6 should have the same issue as the one we fixed in commit ec4fbd64751d ("inet: frag: release spinlock before calling icmp_send()")

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31inet: frags: remove some helpersEric Dumazet
Remove sum_frag_mem_limit(), ip_frag_mem() & ip6_frag_mem() Also since we use rhashtable we can bring back the number of fragments in "grep FRAG /proc/net/sockstat /proc/net/sockstat6" that was removed in commit 434d305405ab ("inet: frag: don't account number of fragment queues") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31inet: frags: use rhashtables for reassembly unitsEric Dumazet
Some applications still rely on IP fragmentation, and to be fair linux reassembly unit is not working under any serious load.

It uses static hash tables of 1024 buckets, and up to 128 items per bucket (!!!)

A work queue is supposed to garbage collect items when host is under memory pressure, and doing a hash rebuild, changing seed used in hash computations.

This work queue blocks softirqs for up to 25 ms when doing a hash rebuild, occurring every 5 seconds if host is under fire.

Then there is the problem of sharing this hash table for all netns.

It is time to switch to rhashtables, and allocate one of them per netns to speedup netns dismantle, since this is a critical metric these days.

Lookup is now using RCU. A followup patch will even remove the refcount hold/release left from prior implementation and save a couple of atomic operations.

Before this patch, 16 cpus (16 RX queue NIC) could not handle more than 1 Mpps frags DDOS.

After the patch, I reach 9 Mpps without any tuning, and can use up to 2GB of storage for the fragments (exact number depends on frags being evicted after timeout)

    $ grep FRAG /proc/net/sockstat
    FRAG: inuse 1966916 memory 2140004608

A followup patch will change the limits for 64bit arches.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Florian Westphal <fw@strlen.de>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Alexander Aring <alex.aring@gmail.com>
Cc: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31rhashtable: add schedule pointsEric Dumazet
Rehashing and destroying large hash table takes a lot of time, and happens in process context. It is safe to add cond_resched() in rhashtable_rehash_table() and rhashtable_free_and_destroy() Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31inet: frags: refactor ipfrag_init()Eric Dumazet
We need to call inet_frags_init() before register_pernet_subsys(), as a prereq for following patch ("inet: frags: use rhashtables for reassembly units") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31inet: frags: refactor lowpan_net_frag_init()Eric Dumazet
We want to call lowpan_net_frag_init() earlier. Similar to commit "inet: frags: refactor ipv6_frag_init()" This is a prereq to "inet: frags: use rhashtables for reassembly units" Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31inet: frags: refactor ipv6_frag_init()Eric Dumazet
We want to call inet_frags_init() earlier. This is a prereq to "inet: frags: use rhashtables for reassembly units" Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31inet: frags: add a pointer to struct netns_fragsEric Dumazet
In order to simplify the API, add a pointer to struct inet_frags. This will allow us to make things less complex.

These functions no longer have a struct inet_frags parameter :

    inet_frag_destroy(struct inet_frag_queue *q /*, struct inet_frags *f */)
    inet_frag_put(struct inet_frag_queue *q /*, struct inet_frags *f */)
    inet_frag_kill(struct inet_frag_queue *q /*, struct inet_frags *f */)
    inet_frags_exit_net(struct netns_frags *nf /*, struct inet_frags *f */)
    ip6_expire_frag_queue(struct net *net, struct frag_queue *fq)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
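The idea behind dropping the extra parameter can be sketched in plain C. The struct and function names below are simplified, hypothetical stand-ins (not the kernel definitions); the only point is that a back pointer stored in the per-namespace object lets a function reach the protocol descriptor from the queue alone.

```c
/* Stand-in for struct inet_frags: per-protocol descriptor. */
struct frags_ops {
    const char *name;
    unsigned int destroyed;   /* counts torn-down queues, for illustration */
};

/* Stand-in for struct netns_frags: now carries the back pointer. */
struct frags_netns {
    struct frags_ops *f;
};

/* Stand-in for struct inet_frag_queue: already knows its netns part. */
struct frag_queue_sketch {
    struct frags_netns *net;
};

/* Before the change this would have taken (q, f); now q is enough. */
void frag_destroy(struct frag_queue_sketch *q)
{
    struct frags_ops *f = q->net->f;  /* recovered via the back pointer */

    f->destroyed++;
}
```

A caller only passes the queue, for example `frag_destroy(&q)`, and can no longer hand in a descriptor that does not match it.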
2018-03-31inet: frags: change inet_frags_init_net() return valueEric Dumazet
We will soon initialize one rhashtable per struct netns_frags in inet_frags_init_net(). This patch changes the return value to eventually propagate an error. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31ipv6: frag: remove unused fieldEric Dumazet
csum field in struct frag_queue is not used, remove it. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31Merge branch 'bnxt_en-next'David S. Miller
Michael Chan says: ==================== bnxt_en: Update for net-next. Misc. updates including updated firmware interface, some additional port statistics, a new IRQ assignment scheme for the RDMA driver, support for VF trust, and other changes and improvements for SRIOV. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31bnxt_en: Add ULP calls to stop and restart IRQs.Michael Chan
When the driver needs to re-initialize the IRQ vectors, we make the new ulp_irq_stop() call to tell the RDMA driver to disable and free the IRQ vectors. After IRQ vectors have been re-initialized, we make the ulp_irq_restart() call to tell the RDMA driver that IRQs can be restarted. Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31bnxt_en: Reserve completion rings and MSIX for bnxt_re RDMA driver.Michael Chan
Add additional logic to reserve completion rings for the bnxt_re driver when it requests MSIX vectors. The function bnxt_cp_rings_in_use() will return the total number of completion rings used by both drivers that need to be reserved. If the network interface is up, we will close and open the NIC to reserve the new set of completion rings and re-initialize the vectors. Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31bnxt_en: Refactor bnxt_need_reserve_rings().Michael Chan
Refactor bnxt_need_reserve_rings() slightly so that __bnxt_reserve_rings() can call it and remove some duplicated code. Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31bnxt_en: Add IRQ remapping logic.Michael Chan
Add remapping logic so that bnxt_en can use any arbitrary MSIX vectors. This will allow the driver to reserve one range of MSIX vectors to be used by both bnxt_en and bnxt_re. bnxt_en can now skip over the MSIX vectors used by bnxt_re. Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31bnxt_en: Change IRQ assignment for RDMA driver.Michael Chan
In the current code, the range of MSIX vectors allocated for the RDMA driver is disjoint from the network driver. This creates a problem for the new firmware ring reservation scheme. The new scheme requires the reserved completion rings/MSIX vectors to be in a contiguous range. Change the logic to allocate RDMA MSIX vectors to be contiguous with the vectors used by bnxt_en on new firmware using the new scheme. The new function bnxt_get_num_msix() calculates the exact number of vectors needed by both drivers. Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31bnxt_en: Improve ring allocation logic.Michael Chan
Currently, the driver code makes some assumptions about the group index and the map index of rings. This makes the code more difficult to understand and less flexible. Improve it by adding the grp_idx and map_idx fields explicitly to the bnxt_ring_struct as a union. The grp_idx is initialized for each tx ring and rx agg ring during init. time. We do the same for the map_idx for each cmpl ring. The grp_idx ties the tx ring to the ring group. The map_idx is the doorbell index of the ring. With this new infrastructure, we can change the ring index mapping scheme easily in the future. Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
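As a rough illustration of the union described above, the following sketch shows two index fields sharing storage. The struct name, the extra `fw_ring_id` field, and the 16-bit widths are assumptions made for the example, not the driver's actual layout.

```c
#include <stdint.h>

/* Hypothetical, trimmed-down ring descriptor. */
struct ring_sketch {
    uint16_t fw_ring_id;      /* placeholder for unrelated ring state */
    union {
        uint16_t grp_idx;     /* tx / rx agg rings: index of the ring group */
        uint16_t map_idx;     /* cmpl rings: doorbell index of the ring */
    };
};
```

Because a given ring uses only one of the two meanings, the union documents the intent at each use site without growing the struct.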
2018-03-31bnxt_en: Improve valid bit checking in firmware response message.Michael Chan
When firmware sends a DMA response to the driver, the last byte of the message will be set to 1 to indicate that the whole response is valid. The driver waits for the message to be valid before reading the message. The firmware spec allows these response messages to increase in length by adding new fields to the end of these messages. The older spec's valid location may become a new field in a newer spec. To guarantee compatibility, the driver should zero the valid byte before interpreting the entire message so that any new fields not implemented by the older spec will be read as zero. For messages that are forwarded to VFs, we need to set the length and re-instate the valid bit so the VF will see the valid response. Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
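A minimal sketch of the valid-byte handling described above, assuming a response buffer whose last byte (at `resp_len - 1`) is set to 1 by the firmware. The function name and signature are hypothetical, and the DMA memory barriers a real driver needs are omitted.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Returns true once the response is complete. On success the valid byte
 * is cleared, so a driver built against a newer spec, where this offset
 * holds a regular field, reads that field as zero.
 */
bool resp_consume_valid(volatile uint8_t *buf, size_t resp_len)
{
    if (resp_len == 0 || buf[resp_len - 1] != 1)
        return false;         /* not complete yet, caller polls again */

    buf[resp_len - 1] = 0;    /* zero the valid byte before parsing */
    return true;
}
```

When such a response is forwarded to a VF, the forwarding side would have to set the length and write the valid byte back, as the commit message notes.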
2018-03-31bnxt_en: Improve resource accounting for SRIOV.Michael Chan
When VFs are created, the current code subtracts the maximum VF resources from the PF's pool. This under-estimates the resources remaining in the PF pool. Instead, we should subtract the minimum VF resources. The VF minimum resources are guaranteed to the VFs and only these should be subtracted from the PF's pool. Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
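The accounting change amounts to a different subtraction. The helper below is a hypothetical illustration for a single resource type, not the driver's function.

```c
/*
 * Resources left in the PF pool after guaranteeing each VF its minimum.
 * Only the guaranteed (minimum) amount is taken out of the pool; the
 * per-VF maximum is a cap, not a reservation.
 */
unsigned int pf_pool_remaining(unsigned int total, unsigned int num_vfs,
                               unsigned int per_vf_min)
{
    unsigned int reserved = num_vfs * per_vf_min;

    return reserved > total ? 0 : total - reserved;
}
```

For example, with 128 rings, 8 VFs and a minimum of 2 rings per VF, 112 rings remain for the PF. Subtracting a per-VF maximum of 16 instead would have left none.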
2018-03-31bnxt_en: Check max_tx_scheduler_inputs value from firmware.Michael Chan
When checking for the maximum pre-set TX channels for ethtool -l, we need to check the current max_tx_scheduler_inputs parameter from firmware. This parameter specifies the max input for the internal QoS nodes currently available to this function. The function's TX rings will be capped by this parameter. By adding this logic, we provide a more accurate pre-set max TX channels to the user. Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31bnxt_en: Add extended port statistics supportVasundhara Volam
Gather periodic extended port statistics, if the device is PF and link is up. Signed-off-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31bnxt_en: Include additional hardware port statistics in ethtool -S.Vasundhara Volam
Include additional hardware port statistics in ethtool -S, which are useful for debugging. Signed-off-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31bnxt_en: Add support for ndo_set_vf_trustVasundhara Volam
Trusted VFs are allowed to modify MAC address, even when PF has assigned one. Signed-off-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31bnxt_en: fix clear flags in ethtool reset handlingScott Branden
Clear flags when reset command processed successfully for components specified. Fixes: 6502ad5963a5 ("bnxt_en: Add ETH_RESET_AP support") Signed-off-by: Scott Branden <scott.branden@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31bnxt_en: Use a dedicated VNIC mode for RDMA.Michael Chan
If the RDMA driver is registered, use a new VNIC mode that allows RDMA traffic to be seen on the netdev in promiscuous mode. Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31bnxt_en: Adjust default rings for multi-port NICs.Michael Chan
Change the default ring logic to select default number of rings to be up to 8 per port if the default rings x NIC ports <= total CPUs. Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31bnxt_en: Update firmware interface to 1.9.1.15.Michael Chan
Minor changes, such as new extended port statistics. Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31vlan: vlan_hw_filter_capable() can be staticWei Yongjun
Fixes the following sparse warning: net/8021q/vlan_core.c:168:6: warning: symbol 'vlan_hw_filter_capable' was not declared. Should it be static? Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31Merge tag 'mlx5-updates-2018-03-30' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux

Saeed Mahameed says:

====================
mlx5-updates-2018-03-30

This series contains updates to mlx5 core and mlx5e netdev drivers. The main highlight of this series is the RX optimizations for striding RQ path, introduced by Tariq.

First Four patches are trivial misc cleanups.
 - Spelling mistake fix
 - Dead code removal
 - Warning messages

RX optimizations for striding RQ:

1) RX refactoring, cleanups and micro optimizations
 - MTU calculation simplifications, obsoletes some WQEs-to-packets translation functions and helps delete ~60 LOC.
 - Do not busy-wait a pending UMR completion.
 - post the new values of UMR WQE inline, instead of using a data pointer.
 - use pre-initialized structures to save calculations in datapath.

2) Use linear SKB in Striding RQ "build_skb", (Using linear SKB has many advantages):
 - Saves a memcpy of the headers.
 - No page-boundary checks in datapath.
 - No filler CQEs.
 - Significantly smaller CQ.
 - SKB data continuously resides in linear part, and not split to small amount (linear part) and large amount (fragment). This saves datapath cycles in driver and improves utilization of SKB fragments in GRO.
 - The fragments of a resulting GRO SKB follow the IP forwarding assumption of equal-size fragments.

 implementation details:
 HW writes the packets to the beginning of a stride, i.e. does not keep headroom. To overcome this we make sure we can extend backwards and use the last bytes of stride i-1. Extra care is needed for stride 0 as it has no preceding stride. We make sure headroom bytes are available by shifting the buffer pointer passed to HW by headroom bytes.

 This configuration now becomes default, whenever capable. Of course, this implies turning LRO off.

 Performance testing:
 ConnectX-5, single core, single RX ring, default MTU.

 UDP packet rate, early drop in TC layer:

 --------------------------------------------
 | pkt size | before    | after     | ratio |
 --------------------------------------------
 | 1500byte | 4.65 Mpps | 5.96 Mpps | 1.28x |
 |  500byte | 5.23 Mpps | 5.97 Mpps | 1.14x |
 |   64byte | 5.94 Mpps | 5.96 Mpps | 1.00x |
 --------------------------------------------

 TCP streams: ~20% gain

3) Support XDP over Striding RQ:
 Now that linear SKB is supported over Striding RQ, we can support XDP by setting stride size to PAGE_SIZE and headroom to XDP_PACKET_HEADROOM.

 Striding RQ is capable of a higher packet-rate than conventional RQ.

 Performance testing:
 ConnectX-5, 24 rings, default MTU.
 CQE compression ON (to reduce completions BW in PCI).

 XDP_DROP packet rate:

 --------------------------------------------------
 | pkt size | XDP rate   | 100GbE linerate | pct% |
 --------------------------------------------------
 |   64byte | 126.2 Mpps |      148.0 Mpps |  85% |
 |  128byte |  80.0 Mpps |       84.8 Mpps |  94% |
 |  256byte |  42.7 Mpps |       42.7 Mpps | 100% |
 |  512byte |  23.4 Mpps |       23.4 Mpps | 100% |
 --------------------------------------------------

4) Remove mlx5 page_ref bulking in Striding RQ and use page_ref_inc only when needed. Without this bulking, we have:
 - no atomic ops on WQE allocation or free
 - one atomic op per SKB
 - In the default MTU configuration (1500, stride size is 2K), the non-bulking method execute 2 atomic ops as before
 - For larger MTUs with stride size of 4K, non-bulking method executes only a single op.
 - For XDP (stride size of 4K, no SKBs), non-bulking have no atomic ops per packet at all.

 Performance testing:
 ConnectX-5, Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz.

 Single core packet rate (64 bytes).

 Early drop in TC: no degradation.

 XDP_DROP:
 before: 14,270,188 pps
 after:  20,503,603 pps, 43% improvement.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31Merge tag 'rxrpc-next-20180330' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

David Howells says:

====================
rxrpc: Fixes and more traces

Here are some patches that add some more tracepoints to AF_RXRPC and fix some issues therein:

 (1) Fix the use of VERSION packets to keep firewall routes open.
 (2) Fix the incorrect current time usage in a tracepoint.
 (3) Fix Tx ring annotation corruption.
 (4) Fix accidental conversion of call-level abort into connection-level abort.
 (5) Fix calculation of resend time.
 (6) Remove a couple of unused variables.
 (7) Fix a bunch of checker warnings and an error. Note that not all warnings can be quashed as checker doesn't seem to correctly handle seqlocks.
 (8) Fix a potential race between call destruction and socket/net destruction.
 (9) Add a tracepoint to track rxrpc_local refcounting.
 (10) Fix an apparent leak of rxrpc_local objects.
 (11) Add a tracepoint to track rxrpc_peer refcounting.
 (12) Fix a leak of rxrpc_peer objects.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31hv_netvsc: Clean up extra parameter from rndis_filter_receive_data()Haiyang Zhang
The variables, msg and data, have the same value. This patch removes the extra one. Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31ethernet: hisilicon: hns: hns_dsaf_mac: Use generic eth_broadcast_addrJoe Perches
Rather than use an on-stack array to copy a broadcast address, use the generic eth_broadcast_addr function to save a trivial amount of object code. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31Merge branch 'net_rwsem-fixes'David S. Miller
Kirill Tkhai says:

====================
net_rwsem fixes

There is the wext_netdev_notifier_call()->wireless_nlevent_flush() netdevice notifier, which takes net_rwsem, so we can't take net_rwsem in {,un}register_netdevice_notifier().

Since {,un}register_netdevice_notifier() is executed under pernet_ops_rwsem, net_namespace_list can't change while we are holding it, so there is no need for net_rwsem in these functions [1/2].

The same applies in [2/2]. We make callers of __rtnl_link_unregister() take pernet_ops_rwsem, and close the race with setup_net() and cleanup_net(), so __rtnl_link_unregister() does not need it. This also fixes the problem that __rtnl_link_unregister() does not see initializing and exiting nets.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31net: Do not take net_rwsem in __rtnl_link_unregister()Kirill Tkhai
This function calls call_netdevice_notifier(), which also may take net_rwsem. So, we can't use net_rwsem here.

This patch makes callers of this function take pernet_ops_rwsem, like register_netdevice_notifier() does. This will protect the modifications of net_namespace_list, and allows notifiers to take it (they won't have to care about context).

Since __rtnl_link_unregister() is used on module load and unload (which are not frequent operations), this looks better to me than making all call_netdevice_notifier() calls always execute in "protected net_namespace_list" context.

Also, this fixes the problem we dealt with in 328fbe747ad4 "Close race between {un, }register_netdevice_notifier and ...", and guarantees __rtnl_link_unregister() does not skip an exiting net.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31net: Remove net_rwsem from {, un}register_netdevice_notifier()Kirill Tkhai
These functions take net_rwsem, while wireless_nlevent_flush() also takes it. But down_read() can't be taken recursively, because of the rw_semaphore design, which prevents the semaphore from being occupied by only readers forever.

Since we take pernet_ops_rwsem in {,un}register_netdevice_notifier(), net list can't change, so these down_read()/up_read() can be removed.

Fixes: f0b07bb151b0 "net: Introduce net_rwsem to protect net_namespace_list"
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31net: hns3: remove unnecessary pci_set_drvdata() and devm_kfree()Wei Yongjun
There is no need for explicit calls of devm_kfree(), as the allocated memory will be freed during driver's detach. The driver core clears the driver data to NULL after device_release. Thus, it is not needed to manually clear the device driver data to NULL. So remove the unnecessary pci_set_drvdata() and devm_kfree(). Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31netdevsim: Change nsim_devlink_setup to return error to callerDavid Ahern
Change nsim_devlink_setup to return any error back to the caller and update nsim_init to handle it. Requested-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31Merge branch 'tipc-slim-down-name-table'David S. Miller
Jon Maloy says:

====================
tipc: slim down name table

We clean up and improve the name binding table:

 - Replace the memory consuming 'sub_sequence/service range' array with an RB tree.
 - Introduce support for overlapping service sequences/ranges

v2: #1: Fixed a missing initialization reported by David Miller
    #4: Obsoleted and replaced a few more macros to get a consistent terminology in the API.
    #5: Added new commit to fix a potential string overflow bug (it is still only in net-next) reported by Arnd Bergmann
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31tipc: avoid possible string overflowJon Maloy
gcc points out that the combined length of the fixed-length inputs to l->name is larger than the destination buffer size:

    net/tipc/link.c: In function 'tipc_link_create':
    net/tipc/link.c:465:26: error: '%s' directive writing up to 32 bytes into a region of size between 26 and 58 [-Werror=format-overflow=]
      sprintf(l->name, "%s:%s-%s:unknown", self_str, if_name, peer_str);
    net/tipc/link.c:465:2: note: 'sprintf' output 11 or more bytes (assuming 75) into a destination of size 60
      sprintf(l->name, "%s:%s-%s:unknown", self_str, if_name, peer_str);

A detailed analysis reveals that the theoretical maximum length of a link name is:

    max self_str + 1 + max if_name + 1 + max peer_str + 1 + max if_name =
    16 + 1 + 15 + 1 + 16 + 1 + 15 = 65

Since we also need space for a trailing zero we now set MAX_LINK_NAME to 68. Just to be on the safe side we also replace the sprintf() call with snprintf().

Fixes: 25b0b9c4e835 ("tipc: handle collisions of 32-bit node address hash values")
Reported-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
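The bounded formatting described above can be sketched in userspace C. `build_link_name()` is a hypothetical helper, not the function in net/tipc/link.c; only the format string and the MAX_LINK_NAME value of 68 come from the commit message.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define MAX_LINK_NAME 68

/*
 * Formats the link name into dst without ever writing past dst_len bytes.
 * Returns true if the whole name fit, false if snprintf() truncated it.
 */
bool build_link_name(char *dst, size_t dst_len, const char *self_str,
                     const char *if_name, const char *peer_str)
{
    int n = snprintf(dst, dst_len, "%s:%s-%s:unknown",
                     self_str, if_name, peer_str);

    return n >= 0 && (size_t)n < dst_len;
}
```

Unlike sprintf(), snprintf() returns the length the full string would have had, so truncation can be detected by comparing the return value against the buffer size.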
2018-03-31tipc: tipc: rename address types in user apiJon Maloy
The three address type structs in the user API have names that in reality reflect the specific, non-Linux environment where they were originally created.

We now give them more intuitive names, in accordance with how TIPC is described in the current documentation.

    struct tipc_portid   -> struct tipc_socket_addr
    struct tipc_name     -> struct tipc_service_addr
    struct tipc_name_seq -> struct tipc_service_range

To avoid confusion, we also update some comments and macro names to match the new terminology.

For compatibility, we add macros that map all old names to the new ones.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
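One way such compatibility macros can work is sketched below. The struct field layouts are assumptions made for the example, and `legacy_range_width()` is a hypothetical caller that still uses an old name.

```c
#include <stdint.h>

/* New names (field layouts assumed for illustration). */
struct tipc_socket_addr   { uint32_t ref;  uint32_t node; };
struct tipc_service_addr  { uint32_t type; uint32_t instance; };
struct tipc_service_range { uint32_t type; uint32_t lower; uint32_t upper; };

/* Old names keep compiling because they expand to the new tags. */
#define tipc_portid   tipc_socket_addr
#define tipc_name     tipc_service_addr
#define tipc_name_seq tipc_service_range

/* Legacy code written against the old name builds unchanged. */
uint32_t legacy_range_width(const struct tipc_name_seq *seq)
{
    return seq->upper - seq->lower + 1;
}
```

Because the macros rewrite the struct tag itself, old and new spellings refer to the same type, so objects can be passed between old and new code without casts.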
2018-03-31tipc: permit overlapping service ranges in name tableJon Maloy
With the new RB tree structure for service ranges it becomes possible to solve an old problem; - we can now allow overlapping service ranges in the table. When inserting a new service range to the tree, we use 'lower' as primary key, and when necessary 'upper' as secondary key. Since there may now be multiple service ranges matching an indicated 'lower' value, we must also add the 'upper' value to the functions used for removing publications, so that the correct, corresponding range item can be found. These changes guarantee that a well-formed publication/withdrawal item from a peer node never will be rejected, and make it possible to eliminate the problematic backlog functionality we currently have for handling such cases. Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
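The ordering rule described above ('lower' as primary key, 'upper' as secondary key) can be written as a comparison function. This is a standalone sketch with hypothetical names, not the tree insertion code from the patch.

```c
#include <stdint.h>

struct service_range_key {
    uint32_t lower;
    uint32_t upper;
};

/* Returns <0, 0 or >0, ordering by lower first and by upper on ties. */
int service_range_cmp(const struct service_range_key *a,
                      const struct service_range_key *b)
{
    if (a->lower != b->lower)
        return a->lower < b->lower ? -1 : 1;
    if (a->upper != b->upper)
        return a->upper < b->upper ? -1 : 1;
    return 0;
}
```

Under this ordering, overlapping ranges such as [10, 20] and [10, 30] are distinct keys, which is why removal needs both 'lower' and 'upper' to find the right item.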
2018-03-31tipc: refactor name table translate functionJon Maloy
The tipc_nametbl_translate() function is ugly and hard to follow. This can be improved somewhat by introducing a stack variable for holding the publication list to be used and re-ordering the if-clauses for selection of algorithm. Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31tipc: replace name table service range array with rb treeJon Maloy
The current design of the binding table has an unnecessarily memory-consuming and complex data structure. It aggregates the service range items into an array, which is expanded by a factor of two every time it becomes too small to hold a new item. Furthermore, the arrays never shrink when the number of ranges diminishes. We now replace this array with an RB tree that is holding the range items as tree nodes, each range directly holding a list of bindings. This, along with a few name changes, improves both readability and volume of the code, as well as reducing memory consumption and hopefully improving cache hit rate. Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31Merge branch 'bridge-mtu'David S. Miller
Nikolay Aleksandrov says:

====================
net: bridge: MTU handling changes

As previously discussed the recent changes break some setups and could lead to packet drops. Thus the first patch reverts the behaviour for the bridge to follow the minimum MTU but also keeps the ability to set the MTU to the maximum (out of all ports) if vlan filtering is enabled.

Patch 02 is the bigger change in behaviour. We've always had trouble when configuring bridges and their MTU, which is auto tuning on port events (add/del/changemtu), which means config software needs to chase it and fix it after each such event. After patch 02 we allow the user to configure any MTU (ETH_MIN/MAX limited) but once that is done the bridge stops auto tuning and relies on the user to keep the MTU correct. This should be compatible with cases that don't touch the MTU (or set it to the same value), while allowing to configure the MTU and not worry about it changing afterwards.

The patches are intentionally split like this, so that if they get accepted and there are any complaints patch 02 can be reverted.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31net: bridge: disable bridge MTU auto tuning if it was set manuallyNikolay Aleksandrov
As Roopa noted today the biggest source of problems when configuring bridge and ports is that the bridge MTU keeps changing automatically on port events (add/del/changemtu). That leads to inconsistent behaviour and network config software needs to chase the MTU and fix it on each such event. Let's improve on that situation and allow for the user to set any MTU within ETH_MIN/MAX limits, but once manually configured it is the user's responsibility to keep it correct afterwards. In case the MTU isn't manually set - the behaviour reverts to the previous and the bridge follows the minimum MTU. Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
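The behaviour described above can be modelled in a few lines. This is a simplified, hypothetical model (names, the `mtu_set_by_user` flag, and the 1500-byte fallback for a bridge without ports are assumptions), and the ETH_MIN/MAX range check is left out.

```c
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_DEFAULT_MTU 1500u  /* assumed fallback when no ports exist */

struct bridge_sketch {
    unsigned int mtu;
    bool mtu_set_by_user;         /* once true, auto tuning stops */
};

/* Called on port add/del/changemtu: follow the minimum port MTU. */
void bridge_port_event(struct bridge_sketch *br,
                       const unsigned int *port_mtus, size_t n_ports)
{
    unsigned int min = SKETCH_DEFAULT_MTU;

    if (br->mtu_set_by_user)
        return;                   /* user owns the MTU from now on */

    for (size_t i = 0; i < n_ports; i++) {
        if (i == 0 || port_mtus[i] < min)
            min = port_mtus[i];
    }
    br->mtu = min;
}

/* Manual configuration: record the value and disable auto tuning. */
void bridge_set_mtu(struct bridge_sketch *br, unsigned int mtu)
{
    br->mtu = mtu;
    br->mtu_set_by_user = true;
}
```

In this model a bridge with ports at 9000, 1400 and 1500 settles on 1400 until `bridge_set_mtu()` is called, after which later port events leave the configured value alone.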
2018-03-31net: bridge: set min MTU on port events and allow user to set maxNikolay Aleksandrov
Recently the bridge was changed to automatically set maximum MTU on port events (add/del/changemtu) when vlan filtering is enabled, but that actually changes behaviour in a way which breaks some setups and can lead to packet drops. In order to still allow that maximum to be set while being compatible, we add the ability for the user to tune the bridge MTU up to the maximum when vlan filtering is enabled, but that has to be done explicitly and all port events (add/del/changemtu) lead to resetting that MTU to the minimum as before. Suggested-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31Merge branch 'thunderx-DMAC-filtering'David S. Miller
Vadim Lomovtsev says:

====================
net: thunderx: implement DMAC filtering support

By default CN88XX BGX accepts all incoming multicast and broadcast packets and filtering is disabled. The nic driver doesn't provide an ability to change such behaviour.

This series is to implement DMAC filtering management for CN88XX nic driver allowing user to enable/disable filtering and configure specific MAC addresses to filter traffic.

Changes from v1:
build issues:
 - update code in order to address compiler warnings;
checkpatch.pl reported issues:
 - update code in order to fit 80 symbols length;
 - update commit descriptions in order to fit 80 symbols length;
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31net: thunderx: add ndo_set_rx_mode callback implementation for VFVadim Lomovtsev
The ndo_set_rx_mode() callback is called from atomic context, which causes message response timeouts during VF to PF communication via MSIx. To get rid of that, we copy the passed mc list, parse the flags, and queue handling of the kernel request to an ordered workqueue. Signed-off-by: Vadim Lomovtsev <Vadim.Lomovtsev@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>