path: root/net/ipv4
Age | Commit message | Author
2016-02-12 | net: ip_tunnel: remove 'csum_help' argument to iptunnel_handle_offloads | Edward Cree
All users now pass false, so we can remove it, and remove the code that was conditional upon it. Signed-off-by: Edward Cree <ecree@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-12 | net: gre: Implement LCO for GRE over IPv4 | Edward Cree
Signed-off-by: Edward Cree <ecree@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-12 | fou: enable LCO in FOU and GUE | Edward Cree
Signed-off-by: Edward Cree <ecree@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-12 | net: udp: always set up for CHECKSUM_PARTIAL offload | Edward Cree
If the dst device doesn't support it, it'll get fixed up later anyway by validate_xmit_skb(). Also, this allows us to take advantage of LCO to avoid summing the payload multiple times. Signed-off-by: Edward Cree <ecree@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-12 | net: local checksum offload for encapsulation | Edward Cree
The arithmetic properties of the ones-complement checksum mean that a correctly checksummed inner packet, including its checksum, has a ones-complement sum depending only on whatever value was used to initialise the checksum field before checksumming (in the case of TCP and UDP, this is the ones-complement sum of the pseudo header, complemented). Consequently, if we are going to offload the inner checksum with CHECKSUM_PARTIAL, we can compute the outer checksum based only on the packet data not covered by the inner checksum, and the initial value of the inner checksum field. Signed-off-by: Edward Cree <ecree@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
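The property described above can be checked with a small user-space sketch. This is not the kernel's code: `oc_sum`, `finish_inner_csum`, and `lco_inner_sum` are hypothetical names, the buffer layout is invented, and the checksum field is assumed to sit at an even byte offset. The sketch emulates a device completing the inner checksum and shows that the sum over the finished inner packet equals the complement of the value that was initially in the checksum field.

```c
#include <stddef.h>
#include <stdint.h>

/* Ones-complement sum of len bytes, read as big-endian 16-bit words
 * and folded to 16 bits with end-around carry. */
uint16_t oc_sum(const uint8_t *p, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)p[i] << 8) | p[i + 1];
    if (len & 1)
        sum += (uint32_t)p[len - 1] << 8;
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)sum;
}

/* Emulate what the device does for CHECKSUM_PARTIAL: the field at byte
 * offset off (assumed even) holds the initial value; sum the packet and
 * store the complement of that sum in the field. */
void finish_inner_csum(uint8_t *pkt, size_t len, size_t off)
{
    uint16_t csum = (uint16_t)~oc_sum(pkt, len);

    pkt[off] = (uint8_t)(csum >> 8);
    pkt[off + 1] = (uint8_t)(csum & 0xff);
}

/* Predict the sum of the finished inner packet from the initial field
 * value alone, without touching the payload. Note that 0x0000 and
 * 0xffff both represent zero in ones-complement arithmetic. */
uint16_t lco_inner_sum(const uint8_t *pkt, size_t off)
{
    uint16_t init = (uint16_t)((pkt[off] << 8) | pkt[off + 1]);

    return (uint16_t)~init;
}
```

An outer checksum could then be formed from the outer headers plus the value returned by `lco_inner_sum()`, which is why the payload never needs to be summed a second time.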
2016-02-12 | tcp/dccp: better use of ephemeral ports in bind() | Eric Dumazet
Implement the strategy used in __inet_hash_connect(), but in the opposite way: try to find a candidate using odd ports, then fall back to even ports. We no longer disable BH for the whole traversal, but one bucket at a time. We also use cond_resched() to yield the CPU to other tasks if needed. I removed one indentation level and tried to mirror the loop we have in __inet_hash_connect() and its variable names to ease code maintenance. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-12 | tcp/dccp: better use of ephemeral ports in connect() | Eric Dumazet
In commit 07f4c90062f8 ("tcp/dccp: try to not exhaust ip_local_port_range in connect()"), I added a very simple heuristic, so that we had a better chance of using even ports, leaving bind() users more available slots. It gave nice results, but with more than 200,000 TCP sessions on a typical server, the ~30,000 ephemeral ports are still a rare resource. I chose to go a step further, by looking at all even ports, and if none was available, fall back to odd ports. The companion patch does the same in bind(), but in the opposite way. I've seen exec times of up to 30ms on busy servers, so I no longer disable BH for the whole traversal, but only for each hash bucket. I also call cond_resched() to be gentle to other tasks. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
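The two-pass parity search described in this commit and its bind() companion can be sketched as follows. This is a simplified user-space illustration, not the kernel loop: `pick_port`, `port_busy_fn`, and the two sample callbacks are hypothetical, and the per-bucket locking and cond_resched() calls are left out.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical callback: returns true if the port is already taken. */
typedef bool (*port_busy_fn)(uint16_t port, void *ctx);

/* Scan [low, high] (low <= high) starting at a caller-chosen offset.
 * The first pass only considers ports of the preferred parity, the
 * second pass considers the other parity. connect() would prefer even
 * ports (odd_first = false), bind() odd ports (odd_first = true).
 * Returns the chosen port, or -1 if every port is busy. */
int pick_port(uint16_t low, uint16_t high, uint32_t offset,
              bool odd_first, port_busy_fn busy, void *ctx)
{
    uint32_t span = (uint32_t)high - low + 1;
    unsigned int pass;

    offset %= span;
    for (pass = 0; pass < 2; pass++) {
        unsigned int parity = (odd_first ? 1u : 0u) ^ pass;
        uint32_t i;

        for (i = 0; i < span; i++) {
            uint32_t port = low + (offset + i) % span;

            if ((port & 1) != parity)
                continue;
            if (!busy((uint16_t)port, ctx))
                return (int)port;
        }
    }
    return -1;
}

/* Sample callbacks for experimenting with the search order. */
bool busy_if_even(uint16_t port, void *ctx)
{
    (void)ctx;
    return (port & 1) == 0;
}

bool never_busy(uint16_t port, void *ctx)
{
    (void)port;
    (void)ctx;
    return false;
}
```

With `busy_if_even`, a connect()-style search exhausts the even ports first and only then returns an odd one, which mirrors the fallback order the commit message describes.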
2016-02-11 | igmp: Namespacify igmp_qrv sysctl knob | Nikolay Borisov
Signed-off-by: Nikolay Borisov <kernel@kyup.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11 | igmp: Namespaceify igmp_llm_reports sysctl knob | Nikolay Borisov
This was initially introduced in df2cf4a78e488d26 ("IGMP: Inhibit reports for local multicast groups") by defining the sysctl in the ipv4_net_table array, however it was never implemented to be namespace aware. Fix this by changing the code accordingly. Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11 | igmp: Namespaceify igmp_max_msf sysctl knob | Nikolay Borisov
Signed-off-by: Nikolay Borisov <kernel@kyup.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11 | igmp: Namespaceify igmp_max_memberships sysctl knob | Nikolay Borisov
Signed-off-by: Nikolay Borisov <kernel@kyup.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11 | udp: Use uh->len instead of skb->len to compute checksum in segmentation | Alexander Duyck
The segmentation code was having to do a bunch of work to pull the skb->len and strip the udp header offset before the value could be used to adjust the checksum. Instead of doing all this work we can just use the value that goes into uh->len since that is the correct value with the correct byte order that we need anyway. By using this value we can save ourselves a bunch of pain as there is no need to do multiple byte swaps. Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Acked-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11 | udp: Clean up the use of flags in UDP segmentation offload | Alexander Duyck
This patch goes through and cleans up the logic related to several of the control flags used in UDP segmentation. Specifically the use of dont_encap isn't really needed as we can just check the skb for CHECKSUM_PARTIAL and if it isn't set then we don't need to update the internal headers. As such we can just drop that value. Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Acked-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11 | gre: Use inner_proto to obtain inner header protocol | Alexander Duyck
Instead of parsing headers to determine the inner protocol we can just pull the value from inner_proto. Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11 | gre: Use GSO flags to determine csum need instead of GRE flags | Alexander Duyck
This patch updates the gre checksum path to follow something much closer to the UDP checksum path. By doing this we can avoid needing to do as much header inspection and can just make use of the fields we were already reading in the sk_buff structure. Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11 | net: Move skb_has_shared_frag check out of GRE code and into segmentation | Alexander Duyck
The call skb_has_shared_frag is used in the GRE path and skb_checksum_help to verify that no frags can be modified by an external entity. This check really doesn't belong in the GRE path but in the skb_segment function itself. This way any protocol that might be segmented will be performing this check before attempting to offload a checksum to software. Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Acked-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11 | net: Store checksum result for offloaded GSO checksums | Alexander Duyck
This patch makes it so that we can offload the checksums for a packet up to a certain point and then begin computing the checksums via software. Setting this up is fairly straightforward as all we need to do is reset the values stored in csum and csum_start for the GSO context block. One complication for this is remote checksum offload. In order to allow the inner checksums to be offloaded while computing the outer checksum manually we needed to have some way of indicating that the offload wasn't real. In order to do that I replaced CHECKSUM_PARTIAL with CHECKSUM_UNNECESSARY in the case of us computing checksums for the outer header while skipping computing checksums for the inner headers. We clean up the ip_summed flag and set it to either CHECKSUM_PARTIAL or CHECKSUM_NONE once we hand the packet off to the next lower level. Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11 | net: Update remote checksum segmentation to support use of GSO checksum | Alexander Duyck
This patch addresses two main issues. First in the case of remote checksum offload we were avoiding dealing with scatter-gather issues. As a result it would be possible to assemble a series of frames that used frags instead of being linearized as they should have if remote checksum offload was enabled. Second I have updated the code so that we now let GSO take care of doing the checksum on the data itself and drop the special case that was added for remote checksum offload. Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11 | net: Drop unecessary enc_features variable from tunnel segmentation functions | Alexander Duyck
The enc_features variable isn't necessary since features isn't used anywhere after we create enc_features so instead just use a destructive AND on features itself and save ourselves the variable declaration. Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Acked-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11 | ipv4: add option to drop gratuitous ARP packets | Johannes Berg
In certain 802.11 wireless deployments, there will be ARP proxies that use knowledge of the network to correctly answer requests. To prevent gratuitous ARP frames on the shared medium from being a problem, on such deployments wireless needs to drop them. Enable this by providing an option called "drop_gratuitous_arp". Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11 | ipv4: add option to drop unicast encapsulated in L2 multicast | Johannes Berg
In order to solve a problem with 802.11, the so-called hole-196 attack, add an option (sysctl) called "drop_unicast_in_l2_multicast" which, if enabled, causes the stack to drop IPv4 unicast packets encapsulated in link-layer multi- or broadcast frames. Such frames can (as an attack) be created by any member of the same wireless network and transmitted as valid encrypted frames since the symmetric key for broadcast frames is shared between all stations. Additionally, enabling this option provides compliance with a SHOULD clause of RFC 1122. Reviewed-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11 | soreuseport: fast reuseport TCP socket selection | Craig Gallek
This change extends the fast SO_REUSEPORT socket lookup implemented for UDP to TCP. Listener sockets with SO_REUSEPORT and the same receive address are additionally added to an array for faster random access. This means that only a single socket from the group must be found in the listener list before any socket in the group can be used to receive a packet. Previously, every socket in the group needed to be considered before handing off the incoming packet. This feature also exposes the ability to use a BPF program when selecting a socket from a reuseport group. Signed-off-by: Craig Gallek <kraig@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
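The benefit of keeping a group's sockets in an array can be illustrated with a small sketch. The structure and function names below (`reuse_group`, `scale_hash`, `group_select`) are hypothetical stand-ins, sockets are represented by plain integer ids, and the optional BPF selection step is omitted; the multiply-and-shift mapping is the same idea as the kernel's reciprocal_scale() helper.

```c
#include <stdint.h>

#define GROUP_MAX 8

/* Hypothetical stand-in for a reuseport group: once any member of the
 * group is found, every member is reachable by array index. */
struct reuse_group {
    unsigned int num_socks;
    int socks[GROUP_MAX];   /* socket ids instead of struct sock pointers */
};

/* Map a 32-bit hash onto [0, n) with a multiply and a shift,
 * avoiding a division. */
uint32_t scale_hash(uint32_t hash, uint32_t n)
{
    return (uint32_t)(((uint64_t)hash * n) >> 32);
}

/* Pick one member of the group for an incoming flow in constant time.
 * Returns -1 for an empty group. */
int group_select(const struct reuse_group *g, uint32_t flow_hash)
{
    if (g->num_socks == 0)
        return -1;
    return g->socks[scale_hash(flow_hash, g->num_socks)];
}
```

Because selection is a single index computation, the cost no longer grows with the number of listeners in the group, which is the improvement over walking every socket that the message describes.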
2016-02-11 | inet: refactor inet[6]_lookup functions to take skb | Craig Gallek
This is a preliminary step to allow fast socket lookup of SO_REUSEPORT groups. Doing so with a BPF filter will require access to the skb in question. This change plumbs the skb (and offset to payload data) through the call stack to the listening socket lookup implementations where it will be used in a following patch. Signed-off-by: Craig Gallek <kraig@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11 | sock: struct proto hash function may error | Craig Gallek
In order to support fast reuseport lookups in TCP, the hash function defined in struct proto must be capable of returning an error code. This patch changes the function signature of all related hash functions to return an integer and handles or propagates this return value at all call sites. Signed-off-by: Craig Gallek <kraig@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-08 | tcp: Fix syncookies sysctl default. | David S. Miller
Unintentionally the default was changed to zero, fix that. Fixes: 12ed8244ed ("ipv4: Namespaceify tcp syncookies sysctl knob") Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-07 | ipv4: Namespaceify tcp_notsent_lowat sysctl knob | Nikolay Borisov
Signed-off-by: Nikolay Borisov <kernel@kyup.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-07 | ipv4: Namespaceify tcp_fin_timeout sysctl knob | Nikolay Borisov
Signed-off-by: Nikolay Borisov <kernel@kyup.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-07 | ipv4: Namespaceify tcp_orphan_retries sysctl knob | Nikolay Borisov
Signed-off-by: Nikolay Borisov <kernel@kyup.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-07 | ipv4: Namespaceify tcp_retries2 sysctl knob | Nikolay Borisov
Signed-off-by: Nikolay Borisov <kernel@kyup.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-07 | ipv4: Namespaceify tcp_retries1 sysctl knob | Nikolay Borisov
Signed-off-by: Nikolay Borisov <kernel@kyup.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-07 | ipv4: Namespaceify tcp reordering sysctl knob | Nikolay Borisov
Signed-off-by: Nikolay Borisov <kernel@kyup.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-07 | ipv4: Namespaceify tcp syncookies sysctl knob | Nikolay Borisov
Signed-off-by: Nikolay Borisov <kernel@kyup.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-07 | ipv4: Namespaceify tcp synack retries sysctl knob | Nikolay Borisov
Signed-off-by: Nikolay Borisov <kernel@kyup.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-07 | ipv4: Namespaceify tcp syn retries sysctl knob | Nikolay Borisov
Signed-off-by: Nikolay Borisov <kernel@kyup.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-07 | tcp: tcp_cong_control helper | Yuchung Cheng
Refactor and consolidate cwnd and rate updates into a new function tcp_cong_control(). Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-07 | tcp: make congestion control more robust against reordering | Yuchung Cheng
This change enables congestion control to update cwnd based on not only packets cumulatively acked but also packets delivered out-of-order. This makes congestion control robust against packet reordering because it may raise cwnd as long as packets are being delivered once reordering has been detected (i.e., it only cares about the number of packets delivered, not the ordering among them). Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-07 | tcp: refactor pkts acked accounting | Yuchung Cheng
A small refactoring that gets number of packets cumulatively acked from tcp_clean_rtx_queue() directly. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-07 | tcp: new delivery accounting | Yuchung Cheng
This patch changes the accounting of how many packets are newly acked or sacked when the sender receives an ACK.

The current approach basically computes

    newly_acked_sacked = (prior_packets - prior_sacked) - (tp->packets_out - tp->sacked_out)

where prior_packets and prior_sacked are snapshots taken at the beginning of the ACK processing.

The new approach tracks the delivery information via a new TCP state variable "delivered", which monotonically increases as new packets are delivered in order or out-of-order.

The reason for this change is that the current approach is brittle and produces negative or inaccurate estimates.

1) For non-SACK connections, an ACK that advances SND.UNA could reset the DUPACK counters (tp->sacked_out) in tcp_process_loss() or tcp_fastretrans_alert(). This inflates the inflight suddenly and causes an under-estimate or even a negative estimate. Here is a real example:

                    before   after (processing ACK)
    packets_out     75       73
    sacked_out      23       0
    ca state        Loss     Open

   The old approach computes (75-23) - (73 - 0) = -21 delivered while the new approach computes 1 delivered since it considers the 2nd-24th packets are delivered OOO.

2) An MSS change would re-count packets_out and sacked_out, so the estimate is inaccurate and can even become negative. E.g., the inflight is doubled when MSS is halved.

3) Spurious retransmission signaled by DSACK is not accounted for.

The new approach is simpler and more robust. For SACK connections, tp->delivered increments as packets are being acked or sacked in SACK and ACK processing.

For non-SACK connections, it's done in tcp_remove_reno_sacks() and tcp_add_reno_sack(). When an ACK advances SND.UNA, tp->delivered is incremented by the number of packets ACKed (less the current number of DUPACKs received plus one packet hole). Upon receiving a DUPACK, tp->delivered is incremented assuming one out-of-order packet is delivered.

Upon receiving a DSACK, tp->delivered is incremented assuming one retransmission is delivered in tcp_sacktag_write_queue().

Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
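The arithmetic in the example from this commit message can be reproduced with a short sketch. The types and helpers here (`ack_snapshot`, `old_newly_acked_sacked`, `delivery_state`, `note_delivered`) are hypothetical and only model the two accounting styles; they are not the kernel's structures.

```c
/* Snapshot of the two counters the old formula depends on. */
struct ack_snapshot {
    int packets_out;
    int sacked_out;
};

/* Old style: difference of two "in flight" estimates. The result can
 * go negative when sacked_out is reset while the ACK is processed. */
int old_newly_acked_sacked(struct ack_snapshot before,
                           struct ack_snapshot after)
{
    return (before.packets_out - before.sacked_out) -
           (after.packets_out - after.sacked_out);
}

/* New style: a counter that only ever increases. The amount delivered
 * by one ACK is the difference between the counter after and before
 * processing that ACK. */
struct delivery_state {
    unsigned int delivered;
};

void note_delivered(struct delivery_state *s, unsigned int pkts)
{
    s->delivered += pkts;
}
```

Feeding in the values from the commit message (75 and 23 before, 73 and 0 after) yields -21 from the old formula, while a counter bumped once for the newly delivered packet reports 1.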
2016-02-07 | tcp: move cwnd reduction after recovery state procesing | Yuchung Cheng
Currently the cwnd is reduced and increased in various different places. The reduction happens in various places in the recovery state processing (tcp_fastretrans_alert) while the increase happens afterward. A better sequence is to identify lost packets and update the congestion control state (icsk_ca_state) first. Then, based on the new state, raise or lower the cwnd in one central place. This makes cwnd changes easier to reason about. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-07 | tcp: retransmit after recovery processing and congestion control | Yuchung Cheng
The retransmission and F-RTO transmission currently happen inside recovery state processing (tcp_fastretrans_alert) but before congestion control. This refactoring moves the logic after both s.t. we can determine how much to send (cwnd) before deciding what to send. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-06 | tcp: fastopen: call tcp_fin() if FIN present in SYNACK | Eric Dumazet
When we acknowledge a FIN, it is not enough to ack the sequence number and queue the skb into receive queue. We also have to call tcp_fin() to properly update socket state and send proper poll() notifications. It seems we also had the problem if we received a SYN packet with the FIN flag set, but it does not seem an urgent issue, as no known implementation can do that. Fixes: 61d2bcae99f6 ("tcp: fastopen: accept data/FIN present in SYNACK message") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-06 | tcp: do not enqueue skb with SYN flag | Eric Dumazet
If we remove the SYN flag from the skbs that tcp_fastopen_add_skb() places in socket receive queue, then we can remove the test that tcp_recvmsg() has to perform in fast path. All we have to do is to adjust SEQ in the slow path. For the moment, we place an unlikely() and output a message if we find an skb having SYN flag set. Goal would be to get rid of the test completely. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-06 | tcp: fastopen: accept data/FIN present in SYNACK message | Eric Dumazet
RFC 7413 (TCP Fast Open) 4.2.2 states that the SYNACK message MAY include data and/or FIN. This patch adds support for the client side: if we receive a SYNACK with payload or FIN, queue the skb instead of ignoring it. Since we already support the same for SYN, we refactor the existing code and reuse it. Note we need to clone the skb, so this operation might fail under memory pressure. Sara Dickinson pointed out FreeBSD server Fast Open implementation was planned to generate such SYNACK in the future. The server side might be implemented on Linux later. Reported-by: Sara Dickinson <sara@sinodun.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-01 | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net | Linus Torvalds
Pull networking fixes from David Miller:

"This looks like a lot but it's a mixture of regression fixes as well as fixes for longer standing issues.

 1) Fix on-channel cancellation in mac80211, from Johannes Berg.
 2) Handle CHECKSUM_COMPLETE properly in xt_TCPMSS netfilter xtables module, from Eric Dumazet.
 3) Avoid infinite loop in UDP SO_REUSEPORT logic, also from Eric Dumazet.
 4) Avoid a NULL deref if we try to set SO_REUSEPORT after a socket is bound, from Craig Gallek.
 5) GRO key comparisons don't take lightweight tunnels into account, from Jesse Gross.
 6) Fix struct pid leak via SCM credentials in AF_UNIX, from Eric Dumazet.
 7) We need to set the rtnl_link_ops of ipv6 SIT tunnels before we register them, otherwise the NEWLINK netlink message is missing the proper attributes. From Thadeu Lima de Souza Cascardo.
 8) Several Spectrum chip bug fixes for mlxsw switch driver, from Ido Schimmel.
 9) Handle fragments properly in ipv4 early socket demux, from Eric Dumazet.
 10) Don't ignore the ifindex key specifier on ipv6 output route lookups, from Paolo Abeni"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (128 commits)
  tcp: avoid cwnd undo after receiving ECN
  irda: fix a potential use-after-free in ircomm_param_request
  net: tg3: avoid uninitialized variable warning
  net: nb8800: avoid uninitialized variable warning
  net: vxge: avoid unused function warnings
  net: bgmac: clarify CONFIG_BCMA dependency
  net: hp100: remove unnecessary #ifdefs
  net: davinci_cpdma: use dma_addr_t for DMA address
  ipv6/udp: use sticky pktinfo egress ifindex on connect()
  ipv6: enforce flowi6_oif usage in ip6_dst_lookup_tail()
  netlink: not trim skb for mmaped socket when dump
  vxlan: fix a out of bounds access in __vxlan_find_mac
  net: dsa: mv88e6xxx: fix port VLAN maps
  fib_trie: Fix shift by 32 in fib_table_lookup
  net: moxart: use correct accessors for DMA memory
  ipv4: ipconfig: avoid unused ic_proto_used symbol
  bnxt_en: Fix crash in bnxt_free_tx_skbs() during tx timeout.
  bnxt_en: Exclude rx_drop_pkts hw counter from the stack's rx_dropped counter.
  bnxt_en: Ring free response from close path should use completion ring
  net_sched: drr: check for NULL pointer in drr_dequeue
  ...
2016-01-29 | tcp: avoid cwnd undo after receiving ECN | Yuchung Cheng
RFC 4015 section 3.4 says the TCP sender MUST refrain from reversing the congestion control state when the ACK signals congestion through the ECN-Echo flag. Currently we may not always do that when prior_ssthresh is reset upon receiving ACKs with ECE marks. This patch fixes that. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-29 | fib_trie: Fix shift by 32 in fib_table_lookup | Alexander Duyck
The fib_table_lookup function had a shift by 32 that triggered a UBSAN warning. This was due to the fact that I had placed the shift first and then followed it with the check for the suffix length to ignore the undefined behavior. If we reorder this so that we verify the suffix is less than 32 before shifting the value we can avoid the issue. Reported-by: Toralf Förster <toralf.foerster@gmx.de> Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Signed-off-by: David S. Miller <davem@davemloft.net>
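The reordering described here follows a general pattern: test the shift count before shifting. The sketch below is an illustration of that pattern only, not the code in fib_table_lookup; `index_exceeds_suffix` is a hypothetical helper and KEYLENGTH is assumed to be 32, matching the width of an IPv4 key.

```c
#include <stdbool.h>
#include <stdint.h>

#define KEYLENGTH 32

/* Returns true if index has a bit set at or above position slen, i.e.
 * it cannot fall inside a suffix of slen bits. The length is checked
 * first because shifting a 32-bit value by 32 or more is undefined
 * behaviour in C; a suffix covering the whole key matches any index. */
bool index_exceeds_suffix(uint32_t index, unsigned int slen)
{
    if (slen >= KEYLENGTH)
        return false;
    return index >= ((uint32_t)1 << slen);
}
```

Writing the shift first and the length check second produces the same result on many machines, but the undefined shift is still evaluated, which is what a sanitizer such as UBSAN reports.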
2016-01-29 | ipv4: ipconfig: avoid unused ic_proto_used symbol | Arnd Bergmann
When CONFIG_PROC_FS, CONFIG_IP_PNP_BOOTP, CONFIG_IP_PNP_DHCP and CONFIG_IP_PNP_RARP are all disabled, we get a warning about the ic_proto_used variable being unused: net/ipv4/ipconfig.c:146:12: error: 'ic_proto_used' defined but not used [-Werror=unused-variable] This avoids the warning, by making the definition conditional on whether a dynamic IP configuration protocol is configured. If not, we know that the value is always zero, so we can optimize away the variable and all code that depends on it. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-29 | ipv4: early demux should be aware of fragments | Eric Dumazet
We should not assume a valid protocol header is present, as this is not the case for IPv4 fragments. Let's avoid extra cache line misses and potential bugs if we actually find a socket and incorrectly use its dst. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-28 | tcp: beware of alignments in tcp_get_info() | Eric Dumazet
With some combinations of user provided flags in netlink command, it is possible to call tcp_get_info() with a buffer that is not 8-byte aligned. It does matter on some arches, so we need to use put_unaligned() to store the u64 fields. Current iproute2 package does not trigger this particular issue. Fixes: 0df48c26d841 ("tcp: add tcpi_bytes_acked to tcp_info") Fixes: 977cb0ecf82e ("tcp: add pacing_rate information into tcp_info") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
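A user-space analogue of the unaligned store can be sketched with memcpy(). The helpers `store_u64_unaligned` and `load_u64_unaligned` are hypothetical names for illustration; in the kernel the put_unaligned() macro named in the commit message plays this role.

```c
#include <stdint.h>
#include <string.h>

/* Store a 64-bit value at an address with no alignment guarantee.
 * Writing through a uint64_t pointer here could fault or be slow on
 * architectures that require natural alignment; copying byte-wise is
 * safe everywhere. */
void store_u64_unaligned(void *dst, uint64_t val)
{
    memcpy(dst, &val, sizeof(val));
}

/* Matching load for a possibly unaligned address. */
uint64_t load_u64_unaligned(const void *src)
{
    uint64_t val;

    memcpy(&val, src, sizeof(val));
    return val;
}
```

Compilers typically turn these fixed-size copies into a single store or load on architectures that tolerate unaligned access, so the portable form usually costs nothing there.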
2016-01-28 | tcp: fix tcp_mark_head_lost to check skb len before fragmenting | Neal Cardwell
This commit fixes a corner case in tcp_mark_head_lost() which was causing the WARN_ON(len > skb->len) in tcp_fragment() to fire.

tcp_mark_head_lost() was assuming that if a packet has tcp_skb_pcount(skb) of N, then it's safe to fragment off a prefix of M*mss bytes, for any M < N. But with the tricky way TCP pcounts are maintained, this is not always true.

For example, suppose the sender sends four 1-byte packets and has the last 3 packets SACKed. It will merge the last 3 packets in the write queue into an skb with pcount = 3 and len = 3 bytes. If another recovery happens after a SACK reneging event, tcp_mark_head_lost() may attempt to split the skb assuming it has more than 2*MSS bytes.

This sounds very counterintuitive, but as the commit description for the related commit c0638c247f55 ("tcp: don't fragment SACKed skbs in tcp_mark_head_lost()") notes, this is because tcp_shifted_skb() coalesces adjacent regions of SACKed skbs, and when doing this it preserves the sum of their packet counts in order to reflect the real-world dynamics on the wire. The c0638c247f55 commit tried to avoid problems by not fragmenting SACKed skbs, since SACKed skbs are where the non-proportionality between pcount and skb->len/mss is known to be possible. However, that commit did not handle the case where during a reneging event one of these weird SACKed skbs becomes an un-SACKed skb, which tcp_mark_head_lost() can then try to fragment.

The fix is to simply mark the entire skb lost when this happens. This makes the recovery slightly more aggressive in such corner cases before we detect reordering. But once we detect reordering this code path is by-passed because FACK is disabled.

Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
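The length check behind this fix can be sketched in isolation. The `seg` structure and `lost_prefix_bytes` helper are hypothetical simplifications, and the MSS of 1460 used in the usage note is an assumed value; the point is only that the prefix to split off is compared against the real byte length rather than inferred from the packet count.

```c
/* Minimal stand-in for the two skb properties that matter here. */
struct seg {
    unsigned int len;     /* payload bytes actually in the buffer */
    unsigned int pcount;  /* packets the buffer is accounted as */
};

/* Decide how many leading bytes to split off and mark lost.
 * Returns 0 when the requested prefix would not fit inside the
 * buffer, meaning the caller should mark the whole buffer lost
 * instead of trying to fragment it. */
unsigned int lost_prefix_bytes(const struct seg *s, unsigned int mss,
                               unsigned int lost_pkts)
{
    unsigned int lost = lost_pkts * mss;

    return lost < s->len ? lost : 0;
}
```

For the merged buffer from the commit message (len = 3, pcount = 3) and an assumed MSS of 1460, asking for two lost packets returns 0, so nothing is fragmented; a full-sized buffer of 4380 bytes would instead be split at 2920 bytes.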