summaryrefslogtreecommitdiff
path: root/net/ipv4
AgeCommit message (Collapse)Author
2013-08-30tcp: tcp_make_synack() should use sock_wmallocPhil Oester
In commit 90ba9b19 (tcp: tcp_make_synack() can use alloc_skb()), Eric changed the call to sock_wmalloc in tcp_make_synack to alloc_skb. In doing so, the netfilter owner match lost its ability to block the SYNACK packet on outbound listening sockets. Revert the change, restoring the owner match functionality. This closes netfilter bugzilla #847. Signed-off-by: Phil Oester <kernel@linuxace.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-30tcp: do not use cached RTT for RTT estimationYuchung Cheng
RTT cached in the TCP metrics are valuable for the initial timeout because SYN RTT usually does not account for serialization delays on low BW path. However using it to seed the RTT estimator maybe disruptive because other components (e.g., pacing) require the smooth RTT to be obtained from actual connection. The solution is to use the higher cached RTT to set the first RTO conservatively like tcp_rtt_estimator(), but avoid seeding the other RTT estimator variables such as srtt. It is also a good idea to keep RTO conservative to obtain the first RTT sample, and the performance is insured by TCP loss probe if SYN RTT is available. To keep the seeding formula consistent across SYN RTT and cached RTT, the rttvar is twice the cached RTT instead of cached RTTVAR value. The reason is because cached variation may be too small (near min RTO) which defeats the purpose of being conservative on first RTO. However the metrics still keep the RTT variations as they might be useful for user applications (through ip). Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Tested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-29Merge branch 'master' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec Steffen Klassert says: ==================== This pull request fixes some issues that arise when 6in4 or 4in6 tunnels are used in combination with IPsec, all from Hannes Frederic Sowa and a null pointer dereference when queueing packets to the policy hold queue. 1) We might access the local error handler of the wrong address family if 6in4 or 4in6 tunnel is protected by ipsec. Fix this by addind a pointer to the correct local_error to xfrm_state_afinet. 2) Add a helper function to always refer to the correct interpretation of skb->sk. 3) Call skb_reset_inner_headers to record the position of the inner headers when adding a new one in various ipv6 tunnels. This is needed to identify the addresses where to send back errors in the xfrm layer. 4) Dereference inner ipv6 header if encapsulated to always call the right error handler. 5) Choose protocol family by skb protocol to not call the wrong xfrm{4,6}_local_error handler in case an ipv6 sockets is used in ipv4 mode. 6) Partly revert "xfrm: introduce helper for safe determination of mtu" because this introduced pmtu discovery problems. 7) Set skb->protocol on tcp, raw and ip6_append_data genereated skbs. We need this to get the correct mtu informations in xfrm. 8) Fix null pointer dereference in xdst_queue_output. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-29ipv4: sendto/hdrincl: don't use destination address found in headerChris Clark
ipv4: raw_sendmsg: don't use header's destination address A sendto() regression was bisected and found to start with commit f8126f1d5136be1 (ipv4: Adjust semantics of rt->rt_gateway.) The problem is that it tries to ARP-lookup the constructed packet's destination address rather than the explicitly provided address. Fix this using FLOWI_FLAG_KNOWN_NH so that given nexthop is used. cf. commit 2ad5b9e4bd314fc685086b99e90e5de3bc59e26b Reported-by: Chris Clark <chris.clark@alcatel-lucent.com> Bisected-by: Chris Clark <chris.clark@alcatel-lucent.com> Tested-by: Chris Clark <chris.clark@alcatel-lucent.com> Suggested-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Chris Clark <chris.clark@alcatel-lucent.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-29tcp: TSO packets automatic sizingEric Dumazet
After hearing many people over past years complaining against TSO being bursty or even buggy, we are proud to present automatic sizing of TSO packets. One part of the problem is that tcp_tso_should_defer() uses an heuristic relying on upcoming ACKS instead of a timer, but more generally, having big TSO packets makes little sense for low rates, as it tends to create micro bursts on the network, and general consensus is to reduce the buffering amount. This patch introduces a per socket sk_pacing_rate, that approximates the current sending rate, and allows us to size the TSO packets so that we try to send one packet every ms. This field could be set by other transports. Patch has no impact for high speed flows, where having large TSO packets makes sense to reach line rate. For other flows, this helps better packet scheduling and ACK clocking. This patch increases performance of TCP flows in lossy environments. A new sysctl (tcp_min_tso_segs) is added, to specify the minimal size of a TSO packet (default being 2). A follow-up patch will provide a new packet scheduler (FQ), using sk_pacing_rate as an input to perform optional per flow pacing. This explains why we chose to set sk_pacing_rate to twice the current rate, allowing 'slow start' ramp up. sk_pacing_rate = 2 * cwnd * mss / srtt v2: Neal Cardwell reported a suspect deferring of last two segments on initial write of 10 MSS, I had to change tcp_tso_should_defer() to take into account tp->xmit_size_goal_segs Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Van Jacobson <vanj@google.com> Cc: Tom Herbert <therbert@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-29tcp: don't apply tsoffset if rcv_tsecr is zeroAndrew Vagin
The zero value means that tsecr is not valid, so it's a special case. tsoffset is used to customize tcp_time_stamp for one socket. tsoffset is usually zero, it's used when a socket was moved from one host to another host. Currently this issue affects logic of tcp_rcv_rtt_measure_ts. Due to incorrect value of rcv_tsecr, tcp_rcv_rtt_measure_ts sets rto to TCP_RTO_MAX. Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: James Morris <jmorris@namei.org> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: Patrick McHardy <kaber@trash.net> Reported-by: Cyrill Gorcunov <gorcunov@openvz.org> Signed-off-by: Andrey Vagin <avagin@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-29tcp: initialize rcv_tstamp for restored socketsAndrew Vagin
u32 rcv_tstamp; /* timestamp of last received ACK */ Its value used in tcp_retransmit_timer, which closes socket if the last ack was received more then TCP_RTO_MAX ago. Currently rcv_tstamp is initialized to zero and if tcp_retransmit_timer is called before receiving a first ack, the connection is closed. This patch initializes rcv_tstamp to a timestamp, when a socket was restored. Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: James Morris <jmorris@namei.org> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: Patrick McHardy <kaber@trash.net> Reported-by: Cyrill Gorcunov <gorcunov@openvz.org> Signed-off-by: Andrey Vagin <avagin@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-28netfilter: add SYNPROXY core/targetPatrick McHardy
Add a SYNPROXY for netfilter. The code is split into two parts, the synproxy core with common functions and an address family specific target. The SYNPROXY receives the connection request from the client, responds with a SYN/ACK containing a SYN cookie and announcing a zero window and checks whether the final ACK from the client contains a valid cookie. It then establishes a connection to the original destination and, if successful, sends a window update to the client with the window size announced by the server. Support for timestamps, SACK, window scaling and MSS options can be statically configured as target parameters if the features of the server are known. If timestamps are used, the timestamp value sent back to the client in the SYN/ACK will be different from the real timestamp of the server. In order to now break PAWS, the timestamps are translated in the direction server->client. Signed-off-by: Patrick McHardy <kaber@trash.net> Tested-by: Martin Topholm <mph@one.com> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-08-28net: syncookies: export cookie_v4_init_sequence/cookie_v4_checkPatrick McHardy
Extract the local TCP stack independant parts of tcp_v4_init_sequence() and cookie_v4_check() and export them for use by the upcoming SYNPROXY target. Signed-off-by: Patrick McHardy <kaber@trash.net> Acked-by: David S. Miller <davem@davemloft.net> Tested-by: Martin Topholm <mph@one.com> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-08-28netfilter: nf_conntrack: make sequence number adjustments usuable without NATPatrick McHardy
Split out sequence number adjustments from NAT and move them to the conntrack core to make them usable for SYN proxying. The sequence number adjustment information is moved to a seperate extend. The extend is added to new conntracks when a NAT mapping is set up for a connection using a helper. As a side effect, this saves 24 bytes per connection with NAT in the common case that a connection does not have a helper assigned. Signed-off-by: Patrick McHardy <kaber@trash.net> Tested-by: Martin Topholm <mph@one.com> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-08-28netfilter: ip[6]t_REJECT: tcp-reset using wrong MAC source if bridgedPhil Oester
As reported by Casper Gripenberg, in a bridged setup, using ip[6]t_REJECT with the tcp-reset option sends out reset packets with the src MAC address of the local bridge interface, instead of the MAC address of the intended destination. This causes some routers/firewalls to drop the reset packet as it appears to be spoofed. Fix this by bypassing ip[6]_local_out and setting the MAC of the sender in the tcp reset packet. This closes netfilter bugzilla #531. Signed-off-by: Phil Oester <kernel@linuxace.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-08-27net: tcp_probe: allow more advanced ingress filtering by markDaniel Borkmann
Currently, the tcp_probe snooper can either filter packets by a given port (handed to the module via module parameter e.g. port=80) or lets all TCP traffic pass (port=0, default). When a port is specified, the port number is tested against the sk's source/destination port. Thus, if one of them matches, the information will be further processed for the log. As this is quite limited, allow for more advanced filtering possibilities which can facilitate debugging/analysis with the help of the tcp_probe snooper. Therefore, similarly as added to BPF machine in commit 7e75f93e ("pkt_sched: ingress socket filter by mark"), add the possibility to use skb->mark as a filter. If the mark is not being used otherwise, this allows ingress filtering by flow (e.g. in order to track updates from only a single flow, or a subset of all flows for a given port) and other things such as dynamic logging and reconfiguration without removing/re-inserting the tcp_probe module, etc. Simple example: insmod net/ipv4/tcp_probe.ko fwmark=8888 full=1 ... iptables -A INPUT -i eth4 -t mangle -p tcp --dport 22 \ --sport 60952 -j MARK --set-mark 8888 [... sampling interval ...] iptables -D INPUT -i eth4 -t mangle -p tcp --dport 22 \ --sport 60952 -j MARK --set-mark 8888 The current option to filter by a given port is still being preserved. A similar approach could be done for the sctp_probe module as a follow-up. Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-26Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Conflicts: drivers/net/wireless/iwlwifi/pcie/trans.c include/linux/inetdevice.h The inetdevice.h conflict involves moving the IPV4_DEVCONF values into a UAPI header, overlapping additions of some new entries. The iwlwifi conflict is a context overlap. Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-26xfrm: revert ipv4 mtu determination to dst_mtuHannes Frederic Sowa
In commit 0ea9d5e3e0e03a63b11392f5613378977dae7eca ("xfrm: introduce helper for safe determination of mtu") I switched the determination of ipv4 mtus from dst_mtu to ip_skb_dst_mtu. This was an error because in case of IP_PMTUDISC_PROBE we fall back to the interface mtu, which is never correct for ipv4 ipsec. This patch partly reverts 0ea9d5e3e0e03a63b11392f5613378977dae7eca ("xfrm: introduce helper for safe determination of mtu"). Cc: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2013-08-25ipip: potential race in ip_tunnel_init_net()Dan Carpenter
Eric Dumazet says that my previous fix for an ERR_PTR dereference (ea857f28ab 'ipip: dereferencing an ERR_PTR in ip_tunnel_init_net()') could be racy and suggests the following fix instead. Reported-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-22net: tcp_probe: add IPv6 supportDaniel Borkmann
The tcp_probe currently only supports analysis of IPv4 connections. Therefore, it would be nice to have IPv6 supported as well. Since we have the recently added %pISpc specifier that is IPv4/IPv6 generic, build related sockaddress structures from the flow information and pass this to our format string. Tested with SSH and HTTP sessions on IPv4 and IPv6. Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-22net: tcp_probe: kprobes: adapt jtcp_rcv_established signatureDaniel Borkmann
This patches fixes a rather unproblematic function signature mismatch as the const specifier was missing for the th variable; and next to that it adds a build-time assertion so that future function signature mismatches for kprobes will not end badly, similarly as commit 22222997 ("net: sctp: add build check for sctp_sf_eat_sack_6_2/jsctp_sf_eat_sack") did it for SCTP. Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-22net: tcp_probe: also include rcv_wnd next to snd_wndDaniel Borkmann
It is helpful to sometimes know the TCP window sizes of an established socket e.g. to confirm that window scaling is working or to tweak the window size to improve high-latency connections, etc etc. Currently the TCP snooper only exports the send window size, but not the receive window size. Therefore, also add the receive window size to the end of the output line. Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-22tcp: increase throughput when reordering is highYuchung Cheng
The stack currently detects reordering and avoid spurious retransmission very well. However the throughput is sub-optimal under high reordering because cwnd is increased only if the data is deliverd in order. I.e., FLAG_DATA_ACKED check in tcp_ack(). The more packet are reordered the worse the throughput is. Therefore when reordering is proven high, cwnd should advance whenever the data is delivered regardless of its ordering. If reordering is low, conservatively advance cwnd only on ordered deliveries in Open state, and retain cwnd in Disordered state (RFC5681). Using netperf on a qdisc setup of 20Mbps BW and random RTT from 45ms to 55ms (for reordering effect). This change increases TCP throughput by 20 - 25% to near bottleneck BW. A special case is the stretched ACK with new SACK and/or ECE mark. For example, a receiver may receive an out of order or ECN packet with unacked data buffered because of LRO or delayed ACK. The principle on such an ACK is to advance cwnd on the cummulative acked part first, then reduce cwnd in tcp_fastretrans_alert(). Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-20ipip: dereferencing an ERR_PTR in ip_tunnel_init_net()Dan Carpenter
We need to move the derefernce after the IS_ERR() check. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-20ipv4: raise IP_MAX_MTU to theoretical limitEric Dumazet
As discussed last year [1], there is no compelling reason to limit IPv4 MTU to 0xFFF0, while real limit is 0xFFFF [1] : http://marc.info/?l=linux-netdev&m=135607247609434&w=2 Willem raised this issue again because some of our internal regression tests broke after lo mtu being set to 65536. IP_MTU reports 0xFFF0, and the test attempts to send a RAW datagram of mtu + 1 bytes, expecting the send() to fail, but it does not. Alexey raised interesting points about TCP MSS, that should be addressed in follow-up patches in TCP stack if needed, as someone could also set an odd mtu anyway. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-20tcp: trivial: Remove nocache argument from tcp_v4_send_synackChristoph Paasch
The nocache-argument was used in tcp_v4_send_synack as an argument to inet_csk_route_req. However, since ba3f7f04ef2b (ipv4: Kill FLOWI_FLAG_RT_NOCACHE and associated code.) this is no more used. This patch removes the unsued argument from tcp_v4_send_synack. Signed-off-by: Christoph Paasch <christoph.paasch@uclouvain.be> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-20Merge branch 'master' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next Conflicts: net/netfilter/nf_conntrack_proto_tcp.c The conflict had to do with overlapping changes dealing with fixing the use of an "s32" to hold the value returned by NAT_OFFSET(). Pablo Neira Ayuso says: ==================== The following batch contains Netfilter/IPVS updates for your net-next tree. More specifically, they are: * Trivial typo fix in xt_addrtype, from Phil Oester. * Remove net_ratelimit in the conntrack logging for consistency with other logging subsystem, from Patrick McHardy. * Remove unneeded includes from the recently added xt_connlabel support, from Florian Westphal. * Allow to update conntracks via nfqueue, don't need NFQA_CFG_F_CONNTRACK for this, from Florian Westphal. * Remove tproxy core, now that we have socket early demux, from Florian Westphal. * A couple of patches to refactor conntrack event reporting to save a good bunch of lines, from Florian Westphal. * Fix missing locking in NAT sequence adjustment, it did not manifested in any known bug so far, from Patrick McHardy. * Change sequence number adjustment variable to 32 bits, to delay the possible early overflow in long standing connections, also from Patrick. * Comestic cleanups for IPVS, from Dragos Foianu. * Fix possible null dereference in IPVS in the SH scheduler, from Daniel Borkmann. * Allow to attach conntrack expectations via nfqueue. Before this patch, you had to use ctnetlink instead, thus, we save the conntrack lookup. * Export xt_rpfilter and xt_HMARK header files, from Nicolas Dichtel. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-20tcp: set timestamps for restored skb-sAndrey Vagin
When the repair mode is turned off, the write queue seqs are updated so that the whole queue is considered to be 'already sent. The "when" field must be set for such skb. It's used in tcp_rearm_rto for example. If the "when" field isn't set, the retransmit timeout can be calculated incorrectly and a tcp connected can stop for two minutes (TCP_RTO_MAX). Acked-by: Pavel Emelyanov <xemul@parallels.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: James Morris <jmorris@namei.org> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: Patrick McHardy <kaber@trash.net> Signed-off-by: Andrey Vagin <avagin@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-16Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
2013-08-15net: proc_fs: trivial: print UIDs as unsigned intFrancesco Fusco
UIDs are printed in the proc_fs as signed int, whereas they are unsigned int. Signed-off-by: Francesco Fusco <ffusco@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-15ipip: add x-netns supportNicolas Dichtel
This patch allows to switch the netns when packet is encapsulated or decapsulated. In other word, the encapsulated packet is received in a netns, where the lookup is done to find the tunnel. Once the tunnel is found, the packet is decapsulated and injecting into the corresponding interface which stands to another netns. When one of the two netns is removed, the tunnel is destroyed. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-15ipv4 tunnels: use net_eq() helper to check netnsNicolas Dichtel
It's better to use available helpers for these tests. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-15dev: move skb_scrub_packet() after eth_type_trans()Nicolas Dichtel
skb_scrub_packet() was called before eth_type_trans() to let eth_type_trans() set pkt_type. In fact, we should force pkt_type to PACKET_HOST, so move the call after eth_type_trans(). Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-14xfrm: introduce helper for safe determination of mtuHannes Frederic Sowa
skb->sk socket can be of AF_INET or AF_INET6 address family. Thus we always have to make sure we a referring to the correct interpretation of skb->sk. We only depend on header defines to query the mtu, so we don't introduce a new dependency to ipv6 by this change. Cc: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2013-08-14xfrm: make local error reporting more robustHannes Frederic Sowa
In xfrm4 and xfrm6 we need to take care about sockets of the other address family. This could happen because a 6in4 or 4in6 tunnel could get protected by ipsec. Because we don't want to have a run-time dependency on ipv6 when only using ipv4 xfrm we have to embed a pointer to the correct local_error function in xfrm_state_afinet and look it up when returning an error depending on the socket address family. Thanks to vi0ss for the great bug report: <https://bugzilla.kernel.org/show_bug.cgi?id=58691> v2: a) fix two more unsafe interpretations of skb->sk as ipv6 socket (xfrm6_local_dontfrag and __xfrm6_output) v3: a) add an EXPORT_SYMBOL_GPL(xfrm_local_error) to fix a link error when building ipv6 as a module (thanks to Steffen Klassert) Reported-by: <vi0oss@gmail.com> Cc: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2013-08-13ip_tunnel: Do not use inner ip-header-id for tunnel ip-header-id.Pravin B Shelar
Using inner-id for tunnel id is not safe in some rare cases. E.g. packets coming from multiple sources entering same tunnel can have same id. Therefore on tunnel packet receive we could have packets from two different stream but with same source and dst IP with same ip-id which could confuse ip packet reassembly. Following patch reverts optimization from commit 490ab08127 (IP_GRE: Fix IP-Identification.) CC: Jarno Rajahalme <jrajahalme@nicira.com> CC: Ansis Atteka <aatteka@nicira.com> Signed-off-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-13tcp: reset reordering est. selectively on timeoutYuchung Cheng
On timeout the TCP sender unconditionally resets the estimated degree of network reordering (tp->reordering). The idea behind this is that the estimate is too large to trigger fast recovery (e.g., due to a IP path change). But for example if the sender only had 2 packets outstanding, then a timeout doesn't tell much about reordering. A sender that learns about reordering on big writes and loses packets on small writes will end up falsely retransmitting again and again, especially when reordering is more likely on big writes. Therefore the sender should only suspect that tp->reordering is too high if it could have gone into fast recovery with the (lower) default estimate. Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-10tcp: add server ip to encrypt cookie in fast openYuchung Cheng
Encrypt the cookie with both server and client IPv4 addresses, such that multi-homed server will grant different cookies based on both the source and destination IPs. No client change is needed since cookie is opaque to the client. Signed-off-by: Yuchung Cheng <ycheng@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-09net: rename busy poll MIB counterEliezer Tamir
Rename mib counter from "low latency" to "busy poll" v1 also moved the counter to the ip MIB (suggested by Shawn Bohrer) Eric Dumazet suggested that the current location is better. So v2 just renames the counter to fit the new naming convention. Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-09net: igmp: Allow user-space configuration of igmp unsolicited report intervalWilliam Manley
Adds the new procfs knobs: /proc/sys/net/ipv4/conf/*/igmpv2_unsolicited_report_interval /proc/sys/net/ipv4/conf/*/igmpv3_unsolicited_report_interval Which will allow userspace configuration of the IGMP unsolicited report interval (see below) in milliseconds. The defaults are 10000ms for IGMPv2 and 1000ms for IGMPv3 in accordance with RFC2236 and RFC3376. Background: If an IGMP join packet is lost you will not receive data sent to the multicast group so if no data arrives from that multicast group in a period of time after the IGMP join a second IGMP join will be sent. The delay between joins is the "IGMP Unsolicited Report Interval". Prior to this patch this value was hard coded in the kernel to 10s for IGMPv2 and 1s for IGMPv3. 10s is unsuitable for some use-cases, such as IPTV as it can cause channel change to be slow in the presence of packet loss. This patch allows the value to be overridden from userspace for both IGMPv2 and IGMPv3 such that it can be tuned accoding to the network. Tested with Wireshark and a simple program to join a (non-existent) multicast group. The distribution of timings for the second join differ based upon setting the procfs knobs. igmpvX_unsolicited_report_interval is intended to follow the pattern established by force_igmp_version, and while a procfs entry has been added a corresponding sysctl knob has not as it is my understanding that sysctl is deprecated[1]. [1]: http://lwn.net/Articles/247243/ Signed-off-by: William Manley <william.manley@youview.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Acked-by: Benjamin LaHaise <bcrl@kvack.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-09net: igmp: Don't flush routing cache when force_igmp_version is modifiedWilliam Manley
The procfs knob /proc/sys/net/ipv4/conf/*/force_igmp_version allows the IGMP protocol version to use to be explicitly set. As a side effect this caused the routing cache to be flushed as it was declared as a DEVINET_SYSCTL_FLUSHING_ENTRY. Flushing is unnecessary and this patch makes it so flushing does not occur. Requested by Hannes Frederic Sowa as he was reviewing other patches adding procfs entries. Suggested-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: William Manley <william.manley@youview.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Acked-by: Benjamin LaHaise <bcrl@kvack.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-09net: igmp: Reduce Unsolicited report interval to 1s when using IGMPv3William Manley
If an IGMP join packet is lost you will not receive data sent to the multicast group so if no data arrives from that multicast group in a period of time after the IGMP join a second IGMP join will be sent. The delay between joins is the "IGMP Unsolicited Report Interval". Previously this value was hard coded to be chosen randomly between 0-10s. This can be too long for some use-cases, such as IPTV as it can cause channel change to be slow in the presence of packet loss. The value 10s has come from IGMPv2 RFC2236, which was reduced to 1s in IGMPv3 RFC3376. This patch makes the kernel use the 1s value from the later RFC if we are operating in IGMPv3 mode. IGMPv2 behaviour is unaffected. Tested with Wireshark and a simple program to join a (non-existent) multicast group. The distribution of timings for the second join differ based upon setting /proc/sys/net/ipv4/conf/eth0/force_igmp_version. Signed-off-by: William Manley <william.manley@youview.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Acked-by: Benjamin LaHaise <bcrl@kvack.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-09ip_gre: fix ipgre_header to return correct offsetTimo Teräs
Fix ipgre_header() (header_ops->create) to return the correct amount of bytes pushed. Most callers of dev_hard_header() seem to care only if it was success, but af_packet.c uses it as offset to the skb to copy from userspace only once. In practice this fixes packet socket sendto()/sendmsg() to gre tunnels. Regression introduced in c54419321455631079c7d6e60bc732dd0c5914c5 ("GRE: Refactor GRE tunneling code.") Cc: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: Timo Teräs <timo.teras@iki.fi> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-09netfilter: nf_conntrack: don't send destroy events from iteratorFlorian Westphal
Let nf_ct_delete handle delivery of the DESTROY event. Based on earlier patch from Pablo Neira. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-08-08net: add SNMP counters tracking incoming ECN bitsEric Dumazet
With GRO/LRO processing, there is a problem because Ip[6]InReceives SNMP counters do not count the number of frames, but number of aggregated segments. Its probably too late to change this now. This patch adds four new counters, tracking number of frames, regardless of LRO/GRO, and on a per ECN status basis, for IPv4 and IPv6. Ip[6]NoECTPkts : Number of packets received with NOECT Ip[6]ECT1Pkts : Number of packets received with ECT(1) Ip[6]ECT0Pkts : Number of packets received with ECT(0) Ip[6]CEPkts : Number of packets received with Congestion Experienced lph37:~# nstat | egrep "Pkts|InReceive" IpInReceives 1634137 0.0 Ip6InReceives 3714107 0.0 Ip6InNoECTPkts 19205 0.0 Ip6InECT0Pkts 52651828 0.0 IpExtInNoECTPkts 33630 0.0 IpExtInECT0Pkts 15581379 0.0 IpExtInCEPkts 6 0.0 Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-08cgroup: pass around cgroup_subsys_state instead of cgroup in file methodsTejun Heo
cgroup is currently in the process of transitioning to using struct cgroup_subsys_state * as the primary handle instead of struct cgroup. Please see the previous commit which converts the subsystem methods for rationale. This patch converts all cftype file operations to take @css instead of @cgroup. cftypes for the cgroup core files don't have their subsytem pointer set. These will automatically use the dummy_css added by the previous patch and can be converted the same way. Most subsystem conversions are straight forwards but there are some interesting ones. * freezer: update_if_frozen() is also converted to take @css instead of @cgroup for consistency. This will make the code look simpler too once iterators are converted to use css. * memory/vmpressure: mem_cgroup_from_css() needs to be exported to vmpressure while mem_cgroup_from_cont() can be made static. Updated accordingly. * cpu: cgroup_tg() doesn't have any user left. Removed. * cpuacct: cgroup_ca() doesn't have any user left. Removed. * hugetlb: hugetlb_cgroup_form_cgroup() doesn't have any user left. Removed. * net_cls: cgrp_cls_state() doesn't have any user left. Removed. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Vivek Goyal <vgoyal@redhat.com> Acked-by: Aristeu Rozanski <aris@redhat.com> Acked-by: Daniel Wagner <daniel.wagner@bmw-carit.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Matt Helsley <matthltc@us.ibm.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Steven Rostedt <rostedt@goodmis.org>
2013-08-07ip_tunnel: embed hash list headstephen hemminger
The IP tunnel hash heads can be embedded in the per-net structure since it is a fixed size. Reduce the size so that the total structure fits in a page size. The original size was overly large, even NETDEV_HASHBITS is only 8 bits! Also, add some white space for readability. Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Acked-by: Pravin B Shelar <pshelar@nicira.com>. Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-07tcp: cubic: fix bug in bictcp_acked()Eric Dumazet
While investigating about strange increase of retransmit rates on hosts ~24 days after boot, Van found hystart was disabled if ca->epoch_start was 0, as following condition is true when tcp_time_stamp high order bit is set. (s32)(tcp_time_stamp - ca->epoch_start) < HZ Quoting Van : At initialization & after every loss ca->epoch_start is set to zero so I believe that the above line will turn off hystart as soon as the 2^31 bit is set in tcp_time_stamp & hystart will stay off for 24 days. I think we've observed that cubic's restart is too aggressive without hystart so this might account for the higher drop rate we observe. Diagnosed-by: Van Jacobson <vanj@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-07tcp: cubic: fix overflow error in bictcp_update()Eric Dumazet
commit 17a6e9f1aa9 ("tcp_cubic: fix clock dependency") added an overflow error in bictcp_update() in following code : /* change the unit from HZ to bictcp_HZ */ t = ((tcp_time_stamp + msecs_to_jiffies(ca->delay_min>>3) - ca->epoch_start) << BICTCP_HZ) / HZ; Because msecs_to_jiffies() being unsigned long, compiler does implicit type promotion. We really want to constrain (tcp_time_stamp - ca->epoch_start) to a signed 32bit value, or else 't' has unexpected high values. This bugs triggers an increase of retransmit rates ~24 days after boot [1], as the high order bit of tcp_time_stamp flips. [1] for hosts with HZ=1000 Big thanks to Van Jacobson for spotting this problem. Diagnosed-by: Van Jacobson <vanj@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Stephen Hemminger <stephen@networkplumber.org> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-05fib_trie: remove potential out of bound accessEric Dumazet
AddressSanitizer [1] dynamic checker pointed a potential out of bound access in leaf_walk_rcu() We could allocate one more slot in tnode_new() to leave the prefetch() in-place but it looks not worth the pain. Bug added in commit 82cfbb008572b ("[IPV4] fib_trie: iterator recode") [1] : https://code.google.com/p/address-sanitizer/wiki/AddressSanitizerForKernel Reported-by: Andrey Konovalov <andreyknvl@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-05net: esp{4,6}: fix potential MTU calculation overflowsDaniel Borkmann
Commit 91657eafb ("xfrm: take net hdr len into account for esp payload size calculation") introduced a possible interger overflow in esp{4,6}_get_mtu() handlers in case of x->props.mode equals XFRM_MODE_TUNNEL. Thus, the following expression will overflow unsigned int net_adj; ... <case ipv{4,6} XFRM_MODE_TUNNEL> net_adj = 0; ... return ((mtu - x->props.header_len - crypto_aead_authsize(esp->aead) - net_adj) & ~(align - 1)) + (net_adj - 2); where (net_adj - 2) would be evaluated as <foo> + (0 - 2) in an unsigned context. Fix it by simply removing brackets as those operations here do not need to have special precedence. Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Cc: Benjamin Poirier <bpoirier@suse.de> Cc: Steffen Klassert <steffen.klassert@secunet.com> Acked-by: Benjamin Poirier <bpoirier@suse.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-03Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Merge net into net-next to setup some infrastructure Eric Dumazet needs for usbnet changes. Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-03fib_rules: fix suppressor names and default valuesStefan Tomanek
This change brings the suppressor attribute names into line; it also changes the data types to provide a more consistent interface. While -1 indicates that the suppressor is not enabled, values >= 0 for suppress_prefixlen or suppress_ifgroup reject routing decisions violating the constraint. This changes the previously presented behaviour of suppress_prefixlen, where a prefix length _less_ than the attribute value was rejected. After this change, a prefix length less than *or* equal to the value is considered a violation of the rule constraint. It also changes the default values for default and newly added rules (disabling any suppression for those). Signed-off-by: Stefan Tomanek <stefan.tomanek@wertarbyte.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-02fib_rules: add route suppression based on ifgroupStefan Tomanek
This change adds the ability to suppress a routing decision based upon the interface group the selected interface belongs to. This allows it to exclude specific devices from a routing decision. Signed-off-by: Stefan Tomanek <stefan.tomanek@wertarbyte.de> Signed-off-by: David S. Miller <davem@davemloft.net>