summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2017-01-20Merge branch 'tipc-multicast-through-replication'David S. Miller
Jon Maloy says: ==================== tipc: emulate multicast through replication TIPC multicast messages are currently distributed via L2 broadcast or IP multicast to all nodes in the cluster, irrespective of the number of real destinations of the message. In this series we introduce an option to transport messages via replication ("replicast") across a selected number of unicast links, instead of relying on the underlying media. This option is used when true broadcast/multicast is not supported by the media, or when the number of true destinations is much smaller than the cluster size. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-20tipc: make replicast a user selectable optionJon Paul Maloy
If the bearer carrying multicast messages supports broadcast, those messages will be sent to all cluster nodes, irrespective of whether these nodes host any actual destinations socket or not. This is clearly wasteful if the cluster is large and there are only a few real destinations for the message being sent. In this commit we extend the eligibility of the newly introduced "replicast" transmit option. We now make it possible for a user to select which method he wants to be used, either as a mandatory setting via setsockopt(), or as a relative setting where we let the broadcast layer decide which method to use based on the ratio between cluster size and the message's actual number of destination nodes. In the latter case, a sending socket must stick to a previously selected method until it enters an idle period of at least 5 seconds. This eliminates the risk of message reordering caused by method change, i.e., when changes to cluster size or number of destinations would otherwise mandate a new method to be used. Reviewed-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com> Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-20tipc: introduce replicast as transport option for multicastJon Paul Maloy
TIPC multicast messages are currently carried over a reliable 'broadcast link', making use of the underlying media's ability to transport packets as L2 broadcast or IP multicast to all nodes in the cluster. When the used bearer is lacking that ability, we can instead emulate the broadcast service by replicating and sending the packets over as many unicast links as needed to reach all identified destinations. We now introduce a new TIPC link-level 'replicast' service that does this. Reviewed-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com> Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-20tipc: add functionality to lookup multicast destination nodesJon Paul Maloy
As a further preparation for the upcoming 'replicast' functionality, we add some necessary structs and functions for looking up and returning a list of all nodes that host destinations for a given multicast message. Reviewed-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com> Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-20tipc: add function for checking broadcast support in bearerJon Paul Maloy
As a preparation for the 'replicast' functionality we are going to introduce in the next commits, we need the broadcast base structure to store whether bearer broadcast is available at all from the currently used bearer or bearers. We do this by adding a new function tipc_bearer_bcast_support() to the bearer layer, and letting the bearer selection function in bcast.c use this to give a new boolean field, 'bcast_support' the appropriate value. Reviewed-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com> Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-20bpf: add bpf_probe_read_str helperGianluca Borello
Provide a simple helper with the same semantics of strncpy_from_unsafe(): int bpf_probe_read_str(void *dst, int size, const void *unsafe_addr) This gives more flexibility to a bpf program. A typical use case is intercepting a file name during sys_open(). The current approach is: SEC("kprobe/sys_open") void bpf_sys_open(struct pt_regs *ctx) { char buf[PATHLEN]; // PATHLEN is defined to 256 bpf_probe_read(buf, sizeof(buf), ctx->di); /* consume buf */ } This is suboptimal because the size of the string needs to be estimated at compile time, causing more memory to be copied than often necessary, and can become more problematic if further processing on buf is done, for example by pushing it to userspace via bpf_perf_event_output(), since the real length of the string is unknown and the entire buffer must be copied (and defining an unrolled strnlen() inside the bpf program is a very inefficient and unfeasible approach). With the new helper, the code can easily operate on the actual string length rather than the buffer size: SEC("kprobe/sys_open") void bpf_sys_open(struct pt_regs *ctx) { char buf[PATHLEN]; // PATHLEN is defined to 256 int res = bpf_probe_read_str(buf, sizeof(buf), ctx->di); /* consume buf, for example push it to userspace via * bpf_perf_event_output(), but this time we can use * res (the string length) as event size, after checking * its boundaries. */ } Another useful use case is when parsing individual process arguments or individual environment variables navigating current->mm->arg_start and current->mm->env_start: using this helper and the return value, one can quickly iterate at the right offset of the memory area. The code changes simply leverage the already existent strncpy_from_unsafe() kernel function, which is safe to be called from a bpf program as it is used in bpf_trace_printk(). Signed-off-by: Gianluca Borello <g.borello@gmail.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-20Merge branch 'bus-agnostic-num-vf'David S. Miller
Phil Sutter says: ==================== Retrieve number of VFs in a bus-agnostic way Previously, it was assumed that only PCI NICs would be capable of having virtual functions - with my proposed enhancement of dummy NIC driver implementing (fake) ones for testing purposes, this is no longer true. Discussion of said patch has led to the suggestion of implementing a bus-agnostic method for VF count retrieval so rtnetlink could work with both real VF-capable PCI NICs as well as my dummy modifications without introducing ugly hacks. The following series tries to achieve just that by introducing a bus type callback to retrieve a device's number of VFs, implementing this callback for PCI bus and finally adjusting rtnetlink to make use of the generalized infrastructure. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-20device: Implement a bus agnostic dev_num_vf routinePhil Sutter
Now that pci_bus_type has num_vf callback set, dev_num_vf can be implemented in a bus type independent way and the check for whether a PCI device is being handled in rtnetlink can be dropped. Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-20PCI: implement num_vf bus type callbackPhil Sutter
Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-20device: bus_type: Introduce num_vf callbackPhil Sutter
This allows for bus types to implement their own method of retrieving the number of virtual functions a NIC on that type of bus supports. Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-20sock: use hlist_entry_safeGeliang Tang
Use hlist_entry_safe() instead of open-coding it. Signed-off-by: Geliang Tang <geliangtang@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-20gre6: Clean up unused struct ipv6_tel_txoption definitionJakub Sitnicki
Commit b05229f44228 ("gre6: Cleanup GREv6 transmit path, call common GRE functions") removed the ip6gre specific transmit function, but left the struct ipv6_tel_txoption definition. Clean it up. Signed-off-by: Jakub Sitnicki <jkbs@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-20net: remove bh disabling around percpu_counter accessesEric Dumazet
Shaohua Li made percpu_counter irq safe in commit 098faf5805c8 ("percpu_counter: make APIs irq safe") We can safely remove BH disable/enable sections around various percpu_counter manipulations. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-20cxgb4: hide unused warningsArnd Bergmann
The two new variables are only used inside of an #ifdef and cause harmless warnings when that is disabled: drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c: In function 'init_one': drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c:4646:9: error: unused variable 'port_vec' [-Werror=unused-variable] drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c:4646:6: error: unused variable 'v' [-Werror=unused-variable] This adds another #ifdef around the declarations. Fixes: 96fe11f27b70 ("cxgb4: Implement ndo_get_phys_port_id for mgmt dev") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-19net: ipv6: Keep nexthop of multipath route on admin downDavid Ahern
IPv6 deletes route entries associated with multipath routes on an admin down where IPv4 does not. For example: $ ip ro ls vrf red unreachable default metric 8192 1.1.1.0/24 metric 64 nexthop via 10.100.1.254 dev eth1 weight 1 nexthop via 10.100.2.254 dev eth2 weight 1 10.100.1.0/24 dev eth1 proto kernel scope link src 10.100.1.4 10.100.2.0/24 dev eth2 proto kernel scope link src 10.100.2.4 $ ip -6 ro ls vrf red 2001:db8:1::/120 dev eth1 proto kernel metric 256 pref medium 2001:db8:2:: dev red proto none metric 0 pref medium 2001:db8:2::/120 dev eth2 proto kernel metric 256 pref medium 2001:db8:11::/120 via 2001:db8:1::16 dev eth1 metric 1024 pref medium 2001:db8:11::/120 via 2001:db8:2::17 dev eth2 metric 1024 pref medium ... Set link down: $ ip li set eth1 down IPv4 retains the multihop route but flags eth1 route as dead: $ ip ro ls vrf red unreachable default metric 8192 1.1.1.0/24 nexthop via 10.100.1.16 dev eth1 weight 1 dead linkdown nexthop via 10.100.2.16 dev eth2 weight 1 10.100.2.0/24 dev eth2 proto kernel scope link src 10.100.2.4 and IPv6 deletes the route as part of flushing all routes for the device: $ ip -6 ro ls vrf red 2001:db8:2:: dev red proto none metric 0 pref medium 2001:db8:2::/120 dev eth2 proto kernel metric 256 pref medium 2001:db8:11::/120 via 2001:db8:2::17 dev eth2 metric 1024 pref medium ... Worse, on admin up of the device the multipath route has to be deleted to get this leg of the route re-added. This patch keeps routes that are part of a multipath route if ignore_routes_with_linkdown is set with the dead and linkdown flags enabling consistency between IPv4 and IPv6: $ ip -6 ro ls vrf red 2001:db8:2:: dev red proto none metric 0 pref medium 2001:db8:2::/120 dev eth2 proto kernel metric 256 pref medium 2001:db8:11::/120 via 2001:db8:1::16 dev eth1 metric 1024 dead linkdown pref medium 2001:db8:11::/120 via 2001:db8:2::17 dev eth2 metric 1024 pref medium ... Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-19mlx4: support __GFP_MEMALLOC for rxEric Dumazet
Commit 04aeb56a1732 ("net/mlx4_en: allocate non 0-order pages for RX ring with __GFP_NOMEMALLOC") added code that appears to be not needed at that time, since mlx4 never used __GFP_MEMALLOC allocations anyway. As using memory reserves is a must in some situations (swap over NFS or iSCSI), this patch adds this flag. Note that this driver does not reuse pages (yet) so we do not have to add anything else. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Cc: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-19Revert "net: qcom/emac: configure the external phy to allow pause frames"Timur Tabi
This reverts commit 3e884493448131179a5b7cae1ddca1028ffaecc8. With commit 529ed1275263 ("net: phy: phy drivers should not set SUPPORTED_[Asym_]Pause"), phylib now handles automatically enabling pause frame support in the PHY, and the MAC driver should follow suit. Since the EMAC driver driver does this, we no longer need to force pause frames support. Signed-off-by: Timur Tabi <timur@codeaurora.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-19phy: increase size of MII_BUS_ID_SIZE and bus_idVolodymyr Bendiuga
Some bus names are pretty long and do not fit into 17 chars. Increase therefore MII_BUS_ID_SIZE and phy_fixup.bus_id to larger number. Now mii_bus.id can host larger name. Signed-off-by: Volodymyr Bendiuga <volodymyr.bendiuga@gmail.com> Signed-off-by: Magnus Öberg <magnus.oberg@westermo.se> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-19macb: Common code to enable ptp support for MACB/GEMAndrei.Pistirica@microchip.com
This patch does the following: - MACB/GEM-PTP interface - registers and bitfields for TSU - capability flags to enable PTP per platform basis Signed-off-by: Andrei Pistirica <andrei.pistirica@microchip.com> Acked-by: Nicolas Ferre <nicolas.ferre@atmel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-19net: caif: Remove unused stats member from struct chnl_netTobias Klauser
The stats member of struct chnl_net is used nowhere in the code, so it might as well be removed. Signed-off-by: Tobias Klauser <tklauser@distanz.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-19net/sched: cls_flower: reduce fl_change stack sizeArnd Bergmann
The new ARP support has pushed the stack size over the edge on ARM, as there are two large objects on the stack in this function (mask and tb) and both have now grown a bit more: net/sched/cls_flower.c: In function 'fl_change': net/sched/cls_flower.c:928:1: error: the frame size of 1072 bytes is larger than 1024 bytes [-Werror=frame-larger-than=] We can solve this by dynamically allocating one or both of them. I first tried to do it just for the mask, but that only saved 152 bytes on ARM, while this version just does it for the 'tb' array, bringing the stack size back down to 664 bytes. Fixes: 99d31326cbe6 ("net/sched: cls_flower: Support matching on ARP") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-18net: Remove usage of net_device last_rx memberTobias Klauser
The network stack no longer uses the last_rx member of struct net_device since the bonding driver switched to use its own private last_rx in commit 9f242738376d ("bonding: use last_arp_rx in slave_last_rx()"). However, some drivers still (ab)use the field for their own purposes and some driver just update it without actually using it. Previously, there was an accompanying comment for the last_rx member added in commit 4dc89133f49b ("net: add a comment on netdev->last_rx") which asked drivers not to update is, unless really needed. However, this commend was removed in commit f8ff080dacec ("bonding: remove useless updating of slave->dev->last_rx"), so some drivers added later on still did update last_rx. Remove all usage of last_rx and switch three drivers (sky2, atp and smc91c92_cs) which actually read and write it to use their own private copy in netdev_priv. Compile-tested with allyesconfig and allmodconfig on x86 and arm. Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Jay Vosburgh <j.vosburgh@gmail.com> Cc: Veaceslav Falico <vfalico@gmail.com> Cc: Andy Gospodarek <andy@greyhouse.net> Cc: Mirko Lindner <mlindner@marvell.com> Cc: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: Tobias Klauser <tklauser@distanz.ch> Acked-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jay Vosburgh <jay.vosburgh@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-18net: dsa: use cpu_switch instead of ds[0]Vivien Didelot
Now that the DSA Ethernet switches are true Linux devices, the CPU switch is not necessarily the first one. If its address is higher than the second switch on the same MDIO bus, its index will be 1, not 0. Avoid any confusion by using dst->cpu_switch instead of dst->ds[0]. Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-18net: dsa: store CPU switch structure in the treeVivien Didelot
Store a dsa_switch pointer to the CPU switch in the tree instead of only its index. This avoids the need to initialize it to -1. Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-18net: ethernet: ti: davinci_cpdma: correct check on NULL in set rateIvan Khoronzhuk
Check "ch" on NULL first, then get ctlr. Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-18Merge branch 'vhost_net-batching'David S. Miller
Jason Wang says: ==================== vhost_net tx batching This series tries to implement tx batching support for vhost. This was done by using MSG_MORE as a hint for under layer socket. The backend (e.g tap) can then batch the packets temporarily in a list and submit it all once the number of bacthed exceeds a limitation. Tests shows obvious improvement on guest pktgen over over mlx4(noqueue) on host: Mpps -+% rx-frames = 0 0.91 +0% rx-frames = 4 1.00 +9.8% rx-frames = 8 1.00 +9.8% rx-frames = 16 1.01 +10.9% rx-frames = 32 1.07 +17.5% rx-frames = 48 1.07 +17.5% rx-frames = 64 1.08 +18.6% rx-frames = 64 (no MSG_MORE) 0.91 +0% Changes from V4: - stick to NAPI_POLL_WEIGHT for rx-frames is user specify a value greater than it. Changes from V3: - use ethtool instead of module parameter to control the maximum number of batched packets - avoid overhead when MSG_MORE were not set and no packet queued Changes from V2: - remove uselss queue limitation check (and we don't drop any packet now) Changes from V1: - drop NAPI handler since we don't use NAPI now - fix the issues that may exceeds max pending of zerocopy - more improvement on available buffer detection - move the limitation of batched pacekts from vhost to tuntap ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-18tun: rx batchingJason Wang
We can only process 1 packet at one time during sendmsg(). This often lead bad cache utilization under heavy load. So this patch tries to do some batching during rx before submitting them to host network stack. This is done through accepting MSG_MORE as a hint from sendmsg() caller, if it was set, batch the packet temporarily in a linked list and submit them all once MSG_MORE were cleared. Tests were done by pktgen (burst=128) in guest over mlx4(noqueue) on host: Mpps -+% rx-frames = 0 0.91 +0% rx-frames = 4 1.00 +9.8% rx-frames = 8 1.00 +9.8% rx-frames = 16 1.01 +10.9% rx-frames = 32 1.07 +17.5% rx-frames = 48 1.07 +17.5% rx-frames = 64 1.08 +18.6% rx-frames = 64 (no MSG_MORE) 0.91 +0% User were allowed to change per device batched packets through ethtool -C rx-frames. NAPI_POLL_WEIGHT were used as upper limitation to prevent bh from being disabled too long. Signed-off-by: Jason Wang <jasowang@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-18vhost_net: tx batchingJason Wang
This patch tries to utilize tuntap rx batching by peeking the tx virtqueue during transmission, if there's more available buffers in the virtqueue, set MSG_MORE flag for a hint for backend (e.g tuntap) to batch the packets. Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-18vhost: better detection of available buffersJason Wang
This patch tries to do several tweaks on vhost_vq_avail_empty() for a better performance: - check cached avail index first which could avoid userspace memory access. - using unlikely() for the failure of userspace access - check vq->last_avail_idx instead of cached avail index as the last step. This patch is need for batching supports which needs to peek whether or not there's still available buffers in the ring. Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-18net:add one common config ARCH_WANT_RELAX_ORDER to support relax orderingMao Wenan
Relax ordering(RO) is one feature of 82599 NIC, to enable this feature can enhance the performance for some cpu architecure, such as SPARC and so on. Currently it only supports one special cpu architecture(SPARC) in 82599 driver to enable RO feature, this is not very common for other cpu architecture which really needs RO feature. This patch add one common config CONFIG_ARCH_WANT_RELAX_ORDER to set RO feature, and should define CONFIG_ARCH_WANT_RELAX_ORDER in sparc Kconfig firstly. Signed-off-by: Mao Wenan <maowenan@huawei.com> Reviewed-by: Alexander Duyck <alexander.duyck@gmail.com> Reviewed-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-18Merge branch 'ipv6-simplify-rt6_fill_node'David S. Miller
David Ahern says: ==================== net: ipv6: simplify rt6_fill_node Remove a couple of unnecessary input arguments to rt6_fill_node. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-18net: ipv6: remove prefix arg to rt6_fill_nodeDavid Ahern
The prefix arg to rt6_fill_node is non-0 in only 1 path - rt6_dump_route where a user is requesting a prefix only dump. Simplify rt6_fill_node by removing the prefix arg and moving the prefix check to rt6_dump_route. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-18net: ipv6: remove nowait arg to rt6_fill_nodeDavid Ahern
All callers of rt6_fill_node pass 0 for nowait arg. Remove the arg and simplify rt6_fill_node accordingly. rt6_fill_node passes the nowait of 0 to ip6mr_get_route. Remove the nowait arg from it as well. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-18Merge branch 'sctp-sender-side-stream-reconf-ssn-reset-request-chunk'David S. Miller
Xin Long says: ==================== sctp: add sender-side procedures for stream reconf ssn reset request chunk Patch 6/6 is to implement sender-side procedures for the Outgoing and Incoming SSN Reset Request Parameter described in rfc6525 section 5.1.2 and 5.1.3 Patches 1-5/6 are ahead of it to define some apis and asoc members for it. Note that with this patchset, asoc->reconf_enable has no chance yet to be set, until the patch "sctp: add get and set sockopt for reconf_enable" is applied in the future. As we can not just enable it when sctp is not capable of processing reconf chunk yet. v1->v2: - put these into a smaller group. - rename some temporary variables in the codes. - rename the titles of the commits and improve some changelogs. v2->v3: - re-split the patchset and make sure it has no dead codes for review. v3->v4: - move sctp_make_reconf() into patch 1/6 to avoid kbuild warning. - drop unused struct sctp_strreset_req. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-18sctp: implement sender-side procedures for SSN Reset Request ParameterXin Long
This patch is to implement sender-side procedures for the Outgoing and Incoming SSN Reset Request Parameter described in rfc6525 section 5.1.2 and 5.1.3. It is also add sockopt SCTP_RESET_STREAMS in rfc6525 section 6.3.2 for users. Note that the new asoc member strreset_outstanding is to make sure only one reconf request chunk on the fly as rfc6525 section 5.1.1 demands. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-18sctp: add sockopt SCTP_ENABLE_STREAM_RESETXin Long
This patch is to add sockopt SCTP_ENABLE_STREAM_RESET to get/set strreset_enable to indicate which reconf request type it supports, which is described in rfc6525 section 6.3.1. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-18sctp: add reconf_enable in asoc ep and netnsXin Long
This patch is to add reconf_enable field in all of asoc ep and netns to indicate if they support stream reset. When initializing, asoc reconf_enable get the default value from ep reconf_enable which is from netns netns reconf_enable by default. It is also to add reconf_capable in asoc peer part to know if peer supports reconf_enable, the value is set if ext params have reconf chunk support when processing init chunk, just as rfc6525 section 5.1.1 demands. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-18sctp: add stream reconf primitiveXin Long
This patch is to add a primitive based on sctp primitive frame for sending stream reconf request. It works as the other primitives, and create a SCTP_CMD_REPLY command to send the request chunk out. sctp_primitive_RECONF would be the api to send a reconf request chunk. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-18sctp: add stream reconf timerXin Long
This patch is to add a per transport timer based on sctp timer frame for stream reconf chunk retransmission. It would start after sending a reconf request chunk, and stop after receiving the response chunk. If the timer expires, besides retransmitting the reconf request chunk, it would also do the same thing with data RTO timer. like to increase the appropriate error counts, and perform threshold management, possibly destroying the asoc if sctp retransmission thresholds are exceeded, just as section 5.1.1 describes. This patch is also to add asoc strreset_chunk, it is used to save the reconf request chunk, so that it can be retransmitted, and to check if the response is really for this request by comparing the information inside with the response chunk as well. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-18sctp: add support for generating stream reconf ssn reset request chunkXin Long
This patch is to add asoc strreset_outseq and strreset_inseq for saving the reconf request sequence, initialize them when create assoc and process init, and also to define Incoming and Outgoing SSN Reset Request Parameter described in rfc6525 section 4.1 and 4.2, As they can be in one same chunk as section rfc6525 3.1-3 describes, it makes them in one function. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-18Merge branch 'rework-inet_csk_get_port'David S. Miller
Josef Bacik says: ==================== Rework inet_csk_get_port V3->V4: -Removed the random include of addrconf.h that is no longer needed. V2->V3: -Dropped the fastsock from the tb and instead just carry the saddrs, family, and ipv6 only flag. -Reworked the helper functions to deal with this change so I could still use them when checking the fast path. -Killed tb->num_owners as per Eric's request. -Attached a reproducer to the bottom of this email. V1->V2: -Added a new patch 'inet: collapse ipv4/v6 rcv_saddr_equal functions into one' at Hannes' suggestion. -Dropped ->bind_conflict and just use the new helper. -Fixed a compile bug from the original ->bind_conflict patch. The original description of the series follows: At some point recently the guys working on our load balancer added the ability to use SO_REUSEPORT. When they restarted their app with this option enabled they immediately hit a softlockup on what appeared to be the inet_bind_bucket->lock. Eventually what all of our debugging and discussion led us to was the fact that the application comes up without SO_REUSEPORT, shuts down which creates around 100k twsk's, and then comes up and tries to open a bunch of sockets using SO_REUSEPORT, which meant traversing the inet_bind_bucket owners list under the lock. Since this lock is needed for dealing with the twsk's and basically anything else related to connections we would softlockup, and sometimes not ever recover. To solve this problem I did what you see in Path 5/5. Once we have a SO_REUSEPORT socket on the tb->owners list we know that the socket has no conflicts with any of the other sockets on that list. So we can add a copy of the sock_common (really all we need is the recv_saddr but it seemed ugly to copy just the ipv6, ipv4, and flag to indicate if we were ipv6 only in there so I've copied the whole common) in order to check subsequent SO_REUSEPORT sockets. If they match the previous one then we can skip the expensive inet_csk_bind_conflict check. This is what eliminated the soft lockup that we were seeing. Patches 1-4 are cleanups and re-workings. For instance when we specify port == 0 we need to find an open port, but we would do two passes through inet_csk_bind_conflict every time we found a possible port. We would also keep track of the smallest_port value in order to try and use it if we found no port our first run through. This however made no sense as it would have had to fail the first pass through inet_csk_bind_conflict, so would not actually pass the second pass through either. Finally I split the function into two functions in order to make it easier to read and to distinguish between the two behaviors. I have tested this on one of our load balancing boxes during peak traffic and it hasn't fallen over. But this is not my area, so obviously feel free to point out where I'm being stupid and I'll get it fixed up and retested. Thanks, ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-18inet: reset tb->fastreuseport when adding a reuseport skJosef Bacik
If we have non reuseport sockets on a tb we will set tb->fastreuseport to 0 and never set it again. Which means that in the future if we end up adding a bunch of reuseport sk's to that tb we'll have to do the expensive scan every time. Instead add the ipv4/ipv6 saddr fields to the bind bucket, as well as the family so we know what comparison to make, and the ipv6 only setting so we can make sure to compare with new sockets appropriately. Once one sk has made it onto the list we know that there are no potential bind conflicts on the owners list that match that sk's rcv_addr. So copy the sk's information into our bind bucket and set tb->fastruseport to FASTREUSESOCK_STRICT so we know we have to do an extra check for subsequent reuseport sockets and skip the expensive bind conflict check. Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-18inet: split inet_csk_get_port into two functionsJosef Bacik
inet_csk_get_port does two different things, it either scans for an open port, or it tries to see if the specified port is available for use. Since these two operations have different rules and are basically independent lets split them into two different functions to make them both more readable. Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-18inet: don't check for bind conflicts twice when searching for a portJosef Bacik
This is just wasted time, we've already found a tb that doesn't have a bind conflict, and we don't drop the head lock so scanning again isn't going to give us a different answer. Instead move the tb->reuse setting logic outside of the found_tb path and put it in the success: path. Then make it so that we don't goto again if we find a bind conflict in the found_tb path as we won't reach this anymore when we are scanning for an ephemeral port. Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-18inet: kill smallest_size and smallest_portJosef Bacik
In inet_csk_get_port we seem to be using smallest_port to figure out where the best place to look for a SO_REUSEPORT sk that matches with an existing set of SO_REUSEPORT's. However if we get to the logic if (smallest_size != -1) { port = smallest_port; goto have_port; } we will do a useless search, because we would have already done the inet_csk_bind_conflict for that port and it would have returned 1, otherwise we would have gone to found_tb and succeeded. Since this logic makes us do yet another trip through inet_csk_bind_conflict for a port we know won't work just delete this code and save us the time. Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-18inet: drop ->bind_conflictJosef Bacik
The only difference between inet6_csk_bind_conflict and inet_csk_bind_conflict is how they check the rcv_saddr, so delete this call back and simply change inet_csk_bind_conflict to call inet_rcv_saddr_equal. Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-18inet: collapse ipv4/v6 rcv_saddr_equal functions into oneJosef Bacik
We pass these per-protocol equal functions around in various places, but we can just have one function that checks the sk->sk_family and then do the right comparison function. I've also changed the ipv4 version to not cast to inet_sock since it is unneeded. Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-18stmicro: add more information to Kconfigjpinto
This patch adds more info to stmicro' Kconfig files in order to be clearer that the driver can be used by ethernet cards based on 10/100/1000/EQOS Synopsys IP Cores. EQOS was also added stmmac/Kconfig Kconfig, since dwmac4 is in fact EQoS, one of Synopsys Ethernet IPs. More info at: https://www.synopsys.com/dw/ipdir.php?ds=dwc_ether_qos Signed-off-by: Joao Pinto <jpinto@synopsys.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-18net/mlx5e: Support bpf_xdp_adjust_head()Martin KaFai Lau
This patch adds bpf_xdp_adjust_head() support to mlx5e. 1. rx_headroom is added to struct mlx5e_rq. It uses an existing 4 byte hole in the struct. 2. The adjusted data length is checked against MLX5E_XDP_MIN_INLINE and MLX5E_SW2HW_MTU(rq->netdev->mtu). 3. The macro MLX5E_SW2HW_MTU is moved from en_main.c to en.h. MLX5E_HW2SW_MTU is also moved to en.h for symmetric reason but it is not a must. v2: - Keep the xdp specific logic in mlx5e_xdp_handle() - Update dma_len after the sanity checks in mlx5e_xmit_xdp_frame() Signed-off-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-17tcp: accept RST for rcv_nxt - 1 after receiving a FINJason Baron
Using a Mac OSX box as a client connecting to a Linux server, we have found that when certain applications (such as 'ab'), are abruptly terminated (via ^C), a FIN is sent followed by a RST packet on tcp connections. The FIN is accepted by the Linux stack but the RST is sent with the same sequence number as the FIN, and Linux responds with a challenge ACK per RFC 5961. The OSX client then sometimes (they are rate-limited) does not reply with any RST as would be expected on a closed socket. This results in sockets accumulating on the Linux server left mostly in the CLOSE_WAIT state, although LAST_ACK and CLOSING are also possible. This sequence of events can tie up a lot of resources on the Linux server since there may be a lot of data in write buffers at the time of the RST. Accepting a RST equal to rcv_nxt - 1, after we have already successfully processed a FIN, has made a significant difference for us in practice, by freeing up unneeded resources in a more expedient fashion. A packetdrill test demonstrating the behavior: // testing mac osx rst behavior // Establish a connection 0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3 0.000 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 0.000 bind(3, ..., ...) = 0 0.000 listen(3, 1) = 0 0.100 < S 0:0(0) win 32768 <mss 1460,nop,wscale 10> 0.100 > S. 0:0(0) ack 1 <mss 1460,nop,wscale 5> 0.200 < . 1:1(0) ack 1 win 32768 0.200 accept(3, ..., ...) = 4 // Client closes the connection 0.300 < F. 1:1(0) ack 1 win 32768 // now send rst with same sequence 0.300 < R. 1:1(0) ack 1 win 32768 // make sure we are in TCP_CLOSE 0.400 %{ assert tcpi_state == 7 }% Signed-off-by: Jason Baron <jbaron@akamai.com> Cc: Eric Dumazet <edumazet@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>