summaryrefslogtreecommitdiff
path: root/net/ipv4
AgeCommit message (Collapse)Author
2013-08-16Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
2013-08-15net: proc_fs: trivial: print UIDs as unsigned intFrancesco Fusco
UIDs are printed in the proc_fs as signed int, whereas they are unsigned int. Signed-off-by: Francesco Fusco <ffusco@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-15ipip: add x-netns supportNicolas Dichtel
This patch allows to switch the netns when packet is encapsulated or decapsulated. In other word, the encapsulated packet is received in a netns, where the lookup is done to find the tunnel. Once the tunnel is found, the packet is decapsulated and injecting into the corresponding interface which stands to another netns. When one of the two netns is removed, the tunnel is destroyed. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-15ipv4 tunnels: use net_eq() helper to check netnsNicolas Dichtel
It's better to use available helpers for these tests. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-15dev: move skb_scrub_packet() after eth_type_trans()Nicolas Dichtel
skb_scrub_packet() was called before eth_type_trans() to let eth_type_trans() set pkt_type. In fact, we should force pkt_type to PACKET_HOST, so move the call after eth_type_trans(). Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-14xfrm: introduce helper for safe determination of mtuHannes Frederic Sowa
skb->sk socket can be of AF_INET or AF_INET6 address family. Thus we always have to make sure we a referring to the correct interpretation of skb->sk. We only depend on header defines to query the mtu, so we don't introduce a new dependency to ipv6 by this change. Cc: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2013-08-14xfrm: make local error reporting more robustHannes Frederic Sowa
In xfrm4 and xfrm6 we need to take care about sockets of the other address family. This could happen because a 6in4 or 4in6 tunnel could get protected by ipsec. Because we don't want to have a run-time dependency on ipv6 when only using ipv4 xfrm we have to embed a pointer to the correct local_error function in xfrm_state_afinet and look it up when returning an error depending on the socket address family. Thanks to vi0ss for the great bug report: <https://bugzilla.kernel.org/show_bug.cgi?id=58691> v2: a) fix two more unsafe interpretations of skb->sk as ipv6 socket (xfrm6_local_dontfrag and __xfrm6_output) v3: a) add an EXPORT_SYMBOL_GPL(xfrm_local_error) to fix a link error when building ipv6 as a module (thanks to Steffen Klassert) Reported-by: <vi0oss@gmail.com> Cc: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2013-08-13ip_tunnel: Do not use inner ip-header-id for tunnel ip-header-id.Pravin B Shelar
Using inner-id for tunnel id is not safe in some rare cases. E.g. packets coming from multiple sources entering same tunnel can have same id. Therefore on tunnel packet receive we could have packets from two different stream but with same source and dst IP with same ip-id which could confuse ip packet reassembly. Following patch reverts optimization from commit 490ab08127 (IP_GRE: Fix IP-Identification.) CC: Jarno Rajahalme <jrajahalme@nicira.com> CC: Ansis Atteka <aatteka@nicira.com> Signed-off-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-13tcp: reset reordering est. selectively on timeoutYuchung Cheng
On timeout the TCP sender unconditionally resets the estimated degree of network reordering (tp->reordering). The idea behind this is that the estimate is too large to trigger fast recovery (e.g., due to a IP path change). But for example if the sender only had 2 packets outstanding, then a timeout doesn't tell much about reordering. A sender that learns about reordering on big writes and loses packets on small writes will end up falsely retransmitting again and again, especially when reordering is more likely on big writes. Therefore the sender should only suspect that tp->reordering is too high if it could have gone into fast recovery with the (lower) default estimate. Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-10tcp: add server ip to encrypt cookie in fast openYuchung Cheng
Encrypt the cookie with both server and client IPv4 addresses, such that multi-homed server will grant different cookies based on both the source and destination IPs. No client change is needed since cookie is opaque to the client. Signed-off-by: Yuchung Cheng <ycheng@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-09net: rename busy poll MIB counterEliezer Tamir
Rename mib counter from "low latency" to "busy poll" v1 also moved the counter to the ip MIB (suggested by Shawn Bohrer) Eric Dumazet suggested that the current location is better. So v2 just renames the counter to fit the new naming convention. Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-09net: igmp: Allow user-space configuration of igmp unsolicited report intervalWilliam Manley
Adds the new procfs knobs: /proc/sys/net/ipv4/conf/*/igmpv2_unsolicited_report_interval /proc/sys/net/ipv4/conf/*/igmpv3_unsolicited_report_interval Which will allow userspace configuration of the IGMP unsolicited report interval (see below) in milliseconds. The defaults are 10000ms for IGMPv2 and 1000ms for IGMPv3 in accordance with RFC2236 and RFC3376. Background: If an IGMP join packet is lost you will not receive data sent to the multicast group so if no data arrives from that multicast group in a period of time after the IGMP join a second IGMP join will be sent. The delay between joins is the "IGMP Unsolicited Report Interval". Prior to this patch this value was hard coded in the kernel to 10s for IGMPv2 and 1s for IGMPv3. 10s is unsuitable for some use-cases, such as IPTV as it can cause channel change to be slow in the presence of packet loss. This patch allows the value to be overridden from userspace for both IGMPv2 and IGMPv3 such that it can be tuned accoding to the network. Tested with Wireshark and a simple program to join a (non-existent) multicast group. The distribution of timings for the second join differ based upon setting the procfs knobs. igmpvX_unsolicited_report_interval is intended to follow the pattern established by force_igmp_version, and while a procfs entry has been added a corresponding sysctl knob has not as it is my understanding that sysctl is deprecated[1]. [1]: http://lwn.net/Articles/247243/ Signed-off-by: William Manley <william.manley@youview.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Acked-by: Benjamin LaHaise <bcrl@kvack.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-09net: igmp: Don't flush routing cache when force_igmp_version is modifiedWilliam Manley
The procfs knob /proc/sys/net/ipv4/conf/*/force_igmp_version allows the IGMP protocol version to use to be explicitly set. As a side effect this caused the routing cache to be flushed as it was declared as a DEVINET_SYSCTL_FLUSHING_ENTRY. Flushing is unnecessary and this patch makes it so flushing does not occur. Requested by Hannes Frederic Sowa as he was reviewing other patches adding procfs entries. Suggested-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: William Manley <william.manley@youview.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Acked-by: Benjamin LaHaise <bcrl@kvack.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-09net: igmp: Reduce Unsolicited report interval to 1s when using IGMPv3William Manley
If an IGMP join packet is lost you will not receive data sent to the multicast group so if no data arrives from that multicast group in a period of time after the IGMP join a second IGMP join will be sent. The delay between joins is the "IGMP Unsolicited Report Interval". Previously this value was hard coded to be chosen randomly between 0-10s. This can be too long for some use-cases, such as IPTV as it can cause channel change to be slow in the presence of packet loss. The value 10s has come from IGMPv2 RFC2236, which was reduced to 1s in IGMPv3 RFC3376. This patch makes the kernel use the 1s value from the later RFC if we are operating in IGMPv3 mode. IGMPv2 behaviour is unaffected. Tested with Wireshark and a simple program to join a (non-existent) multicast group. The distribution of timings for the second join differ based upon setting /proc/sys/net/ipv4/conf/eth0/force_igmp_version. Signed-off-by: William Manley <william.manley@youview.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Acked-by: Benjamin LaHaise <bcrl@kvack.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-09ip_gre: fix ipgre_header to return correct offsetTimo Teräs
Fix ipgre_header() (header_ops->create) to return the correct amount of bytes pushed. Most callers of dev_hard_header() seem to care only if it was success, but af_packet.c uses it as offset to the skb to copy from userspace only once. In practice this fixes packet socket sendto()/sendmsg() to gre tunnels. Regression introduced in c54419321455631079c7d6e60bc732dd0c5914c5 ("GRE: Refactor GRE tunneling code.") Cc: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: Timo Teräs <timo.teras@iki.fi> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-09netfilter: nf_conntrack: don't send destroy events from iteratorFlorian Westphal
Let nf_ct_delete handle delivery of the DESTROY event. Based on earlier patch from Pablo Neira. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-08-08net: add SNMP counters tracking incoming ECN bitsEric Dumazet
With GRO/LRO processing, there is a problem because Ip[6]InReceives SNMP counters do not count the number of frames, but number of aggregated segments. Its probably too late to change this now. This patch adds four new counters, tracking number of frames, regardless of LRO/GRO, and on a per ECN status basis, for IPv4 and IPv6. Ip[6]NoECTPkts : Number of packets received with NOECT Ip[6]ECT1Pkts : Number of packets received with ECT(1) Ip[6]ECT0Pkts : Number of packets received with ECT(0) Ip[6]CEPkts : Number of packets received with Congestion Experienced lph37:~# nstat | egrep "Pkts|InReceive" IpInReceives 1634137 0.0 Ip6InReceives 3714107 0.0 Ip6InNoECTPkts 19205 0.0 Ip6InECT0Pkts 52651828 0.0 IpExtInNoECTPkts 33630 0.0 IpExtInECT0Pkts 15581379 0.0 IpExtInCEPkts 6 0.0 Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-08cgroup: pass around cgroup_subsys_state instead of cgroup in file methodsTejun Heo
cgroup is currently in the process of transitioning to using struct cgroup_subsys_state * as the primary handle instead of struct cgroup. Please see the previous commit which converts the subsystem methods for rationale. This patch converts all cftype file operations to take @css instead of @cgroup. cftypes for the cgroup core files don't have their subsytem pointer set. These will automatically use the dummy_css added by the previous patch and can be converted the same way. Most subsystem conversions are straight forwards but there are some interesting ones. * freezer: update_if_frozen() is also converted to take @css instead of @cgroup for consistency. This will make the code look simpler too once iterators are converted to use css. * memory/vmpressure: mem_cgroup_from_css() needs to be exported to vmpressure while mem_cgroup_from_cont() can be made static. Updated accordingly. * cpu: cgroup_tg() doesn't have any user left. Removed. * cpuacct: cgroup_ca() doesn't have any user left. Removed. * hugetlb: hugetlb_cgroup_form_cgroup() doesn't have any user left. Removed. * net_cls: cgrp_cls_state() doesn't have any user left. Removed. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Vivek Goyal <vgoyal@redhat.com> Acked-by: Aristeu Rozanski <aris@redhat.com> Acked-by: Daniel Wagner <daniel.wagner@bmw-carit.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Matt Helsley <matthltc@us.ibm.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Steven Rostedt <rostedt@goodmis.org>
2013-08-07ip_tunnel: embed hash list headstephen hemminger
The IP tunnel hash heads can be embedded in the per-net structure since it is a fixed size. Reduce the size so that the total structure fits in a page size. The original size was overly large, even NETDEV_HASHBITS is only 8 bits! Also, add some white space for readability. Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Acked-by: Pravin B Shelar <pshelar@nicira.com>. Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-07tcp: cubic: fix bug in bictcp_acked()Eric Dumazet
While investigating about strange increase of retransmit rates on hosts ~24 days after boot, Van found hystart was disabled if ca->epoch_start was 0, as following condition is true when tcp_time_stamp high order bit is set. (s32)(tcp_time_stamp - ca->epoch_start) < HZ Quoting Van : At initialization & after every loss ca->epoch_start is set to zero so I believe that the above line will turn off hystart as soon as the 2^31 bit is set in tcp_time_stamp & hystart will stay off for 24 days. I think we've observed that cubic's restart is too aggressive without hystart so this might account for the higher drop rate we observe. Diagnosed-by: Van Jacobson <vanj@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-07tcp: cubic: fix overflow error in bictcp_update()Eric Dumazet
commit 17a6e9f1aa9 ("tcp_cubic: fix clock dependency") added an overflow error in bictcp_update() in following code : /* change the unit from HZ to bictcp_HZ */ t = ((tcp_time_stamp + msecs_to_jiffies(ca->delay_min>>3) - ca->epoch_start) << BICTCP_HZ) / HZ; Because msecs_to_jiffies() being unsigned long, compiler does implicit type promotion. We really want to constrain (tcp_time_stamp - ca->epoch_start) to a signed 32bit value, or else 't' has unexpected high values. This bugs triggers an increase of retransmit rates ~24 days after boot [1], as the high order bit of tcp_time_stamp flips. [1] for hosts with HZ=1000 Big thanks to Van Jacobson for spotting this problem. Diagnosed-by: Van Jacobson <vanj@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Stephen Hemminger <stephen@networkplumber.org> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-05fib_trie: remove potential out of bound accessEric Dumazet
AddressSanitizer [1] dynamic checker pointed a potential out of bound access in leaf_walk_rcu() We could allocate one more slot in tnode_new() to leave the prefetch() in-place but it looks not worth the pain. Bug added in commit 82cfbb008572b ("[IPV4] fib_trie: iterator recode") [1] : https://code.google.com/p/address-sanitizer/wiki/AddressSanitizerForKernel Reported-by: Andrey Konovalov <andreyknvl@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-05net: esp{4,6}: fix potential MTU calculation overflowsDaniel Borkmann
Commit 91657eafb ("xfrm: take net hdr len into account for esp payload size calculation") introduced a possible interger overflow in esp{4,6}_get_mtu() handlers in case of x->props.mode equals XFRM_MODE_TUNNEL. Thus, the following expression will overflow unsigned int net_adj; ... <case ipv{4,6} XFRM_MODE_TUNNEL> net_adj = 0; ... return ((mtu - x->props.header_len - crypto_aead_authsize(esp->aead) - net_adj) & ~(align - 1)) + (net_adj - 2); where (net_adj - 2) would be evaluated as <foo> + (0 - 2) in an unsigned context. Fix it by simply removing brackets as those operations here do not need to have special precedence. Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Cc: Benjamin Poirier <bpoirier@suse.de> Cc: Steffen Klassert <steffen.klassert@secunet.com> Acked-by: Benjamin Poirier <bpoirier@suse.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-03Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Merge net into net-next to setup some infrastructure Eric Dumazet needs for usbnet changes. Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-03fib_rules: fix suppressor names and default valuesStefan Tomanek
This change brings the suppressor attribute names into line; it also changes the data types to provide a more consistent interface. While -1 indicates that the suppressor is not enabled, values >= 0 for suppress_prefixlen or suppress_ifgroup reject routing decisions violating the constraint. This changes the previously presented behaviour of suppress_prefixlen, where a prefix length _less_ than the attribute value was rejected. After this change, a prefix length less than *or* equal to the value is considered a violation of the rule constraint. It also changes the default values for default and newly added rules (disabling any suppression for those). Signed-off-by: Stefan Tomanek <stefan.tomanek@wertarbyte.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-02fib_rules: add route suppression based on ifgroupStefan Tomanek
This change adds the ability to suppress a routing decision based upon the interface group the selected interface belongs to. This allows it to exclude specific devices from a routing decision. Signed-off-by: Stefan Tomanek <stefan.tomanek@wertarbyte.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-02net: rtm_to_ifaddr: free ifa if ifa_cacheinfo processing failsDaniel Borkmann
Commit 5c766d642 ("ipv4: introduce address lifetime") leaves the ifa resource that was allocated via inet_alloc_ifa() unfreed when returning the function with -EINVAL. Thus, free it first via inet_free_ifa(). Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Reviewed-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-31fib_rules: add .suppress operationStefan Tomanek
This change adds a new operation to the fib_rules_ops struct; it allows the suppression of routing decisions if certain criteria are not met by its results. The first implemented constraint is a minimum prefix length added to the structures of routing rules. If a rule is added with a minimum prefix length >0, only routes meeting this threshold will be considered. Any other (more general) routing table entries will be ignored. When configuring a system with multiple network uplinks and default routes, it is often convinient to reference the main routing table multiple times - but omitting the default route. Using this patch and a modified "ip" utility, this can be achieved by using the following command sequence: $ ip route add table secuplink default via 10.42.23.1 $ ip rule add pref 100 table main prefixlength 1 $ ip rule add pref 150 fwmark 0xA table secuplink With this setup, packets marked 0xA will be processed by the additional routing table "secuplink", but only if no suitable route in the main routing table can be found. By using a minimal prefixlength of 1, the default route (/0) of the table "main" is hidden to packets processed by rule 100; packets traveling to destinations with more specific routing entries are processed as usual. Signed-off-by: Stefan Tomanek <stefan.tomanek@wertarbyte.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-31net: split rt_genid for ipv4 and ipv6fan.du
Current net name space has only one genid for both IPv4 and IPv6, it has below drawbacks: - Add/delete an IPv4 address will invalidate all IPv6 routing table entries. - Insert/remove XFRM policy will also invalidate both IPv4/IPv6 routing table entries even when the policy is only applied for one address family. Thus, this patch attempt to split one genid for two to cater for IPv4 and IPv6 separately in a fine granularity. Signed-off-by: Fan Du <fan.du@windriver.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-31tcp: Remove unused tcpct declarations and commentsDmitry Popov
Remove declaration, 4 defines and confusing comment that are no longer used since 1a2c6181c4 ("tcp: Remove TCPCT"). Signed-off-by: Dmitry Popov <dp@highloadlab.com> Acked-by: Christoph Paasch <christoph.paasch@uclouvain.be> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-30net: remove an unneeded checkDan Carpenter
"ifa->ifa_label" is an array inside the in_ifaddr struct. It can never be NULL so we can remove this check. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-30tcp: add tcp_syncookies mode to allow unconditionally generation of syncookiesHannes Frederic Sowa
| If you want to test which effects syncookies have to your | network connections you can set this knob to 2 to enable | unconditionally generation of syncookies. Original idea and first implementation by Eric Dumazet. Cc: Florian Westphal <fw@strlen.de> Cc: David Miller <davem@davemloft.net> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-28ipv4, ipv6: send igmpv3/mld packets with TC_PRIO_CONTROLHannes Frederic Sowa
v2: a) Also send ipv4 igmp messages with TC_PRIO_CONTROL Cc: William Manley <william.manley@youview.com> Cc: Lukas Tribus <luky-37@hotmail.com> Acked-by: Benjamin LaHaise <bcrl@kvack.org> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-27pktgen: Add UDPCSUM flag to support UDP checksumsThomas Graf
UDP checksums are optional, hence pktgen has been omitting them in favour of performance. The optional flag UDPCSUM enables UDP checksumming. If the output device supports hardware checksumming the skb is prepared and marked CHECKSUM_PARTIAL, otherwise the checksum is generated in software. Signed-off-by: Thomas Graf <tgraf@suug.ch> Cc: Eric Dumazet <edumazet@google.com> Cc: Ben Greear <greearb@candelatech.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-24tcp: TCP_NOTSENT_LOWAT socket optionEric Dumazet
Idea of this patch is to add optional limitation of number of unsent bytes in TCP sockets, to reduce usage of kernel memory. TCP receiver might announce a big window, and TCP sender autotuning might allow a large amount of bytes in write queue, but this has little performance impact if a large part of this buffering is wasted : Write queue needs to be large only to deal with large BDP, not necessarily to cope with scheduling delays (incoming ACKS make room for the application to queue more bytes) For most workloads, using a value of 128 KB or less is OK to give applications enough time to react to POLLOUT events in time (or being awaken in a blocking sendmsg()) This patch adds two ways to set the limit : 1) Per socket option TCP_NOTSENT_LOWAT 2) A sysctl (/proc/sys/net/ipv4/tcp_notsent_lowat) for sockets not using TCP_NOTSENT_LOWAT socket option (or setting a zero value) Default value being UINT_MAX (0xFFFFFFFF), meaning this has no effect. This changes poll()/select()/epoll() to report POLLOUT only if number of unsent bytes is below tp->nosent_lowat Note this might increase number of sendmsg()/sendfile() calls when using non blocking sockets, and increase number of context switches for blocking sockets. Note this is not related to SO_SNDLOWAT (as SO_SNDLOWAT is defined as : Specify the minimum number of bytes in the buffer until the socket layer will pass the data to the protocol) Tested: netperf sessions, and watching /proc/net/protocols "memory" column for TCP With 200 concurrent netperf -t TCP_STREAM sessions, amount of kernel memory used by TCP buffers shrinks by ~55 % (20567 pages instead of 45458) lpq83:~# echo -1 >/proc/sys/net/ipv4/tcp_notsent_lowat lpq83:~# (super_netperf 200 -t TCP_STREAM -H remote -l 90 &); sleep 60 ; grep TCP /proc/net/protocols TCPv6 1880 2 45458 no 208 yes ipv6 y y y y y y y y y y y y y n y y y y y TCP 1696 508 45458 no 208 yes kernel y y y y y y y y y y y y y n y y y y y lpq83:~# echo 131072 >/proc/sys/net/ipv4/tcp_notsent_lowat lpq83:~# (super_netperf 200 -t TCP_STREAM -H remote -l 90 &); sleep 60 ; grep TCP /proc/net/protocols TCPv6 1880 2 20567 no 208 yes ipv6 y y y y y y y y y y y y y n y y y y y TCP 1696 508 20567 no 208 yes kernel y y y y y y y y y y y y y n y y y y y Using 128KB has no bad effect on the throughput or cpu usage of a single flow, although there is an increase of context switches. A bonus is that we hold socket lock for a shorter amount of time and should improve latencies of ACK processing. lpq83:~# echo -1 >/proc/sys/net/ipv4/tcp_notsent_lowat lpq83:~# perf stat -e context-switches ./netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3 OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.7.84 () port 0 AF_INET : +/-2.500% @ 99% conf. Local Remote Local Elapsed Throughput Throughput Local Local Remote Remote Local Remote Service Send Socket Recv Socket Send Time Units CPU CPU CPU CPU Service Service Demand Size Size Size (sec) Util Util Util Util Demand Demand Units Final Final % Method % Method 1651584 6291456 16384 20.00 17447.90 10^6bits/s 3.13 S -1.00 U 0.353 -1.000 usec/KB Performance counter stats for './netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3': 412,514 context-switches 200.034645535 seconds time elapsed lpq83:~# echo 131072 >/proc/sys/net/ipv4/tcp_notsent_lowat lpq83:~# perf stat -e context-switches ./netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3 OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.7.84 () port 0 AF_INET : +/-2.500% @ 99% conf. Local Remote Local Elapsed Throughput Throughput Local Local Remote Remote Local Remote Service Send Socket Recv Socket Send Time Units CPU CPU CPU CPU Service Service Demand Size Size Size (sec) Util Util Util Util Demand Demand Units Final Final % Method % Method 1593240 6291456 16384 20.00 17321.16 10^6bits/s 3.35 S -1.00 U 0.381 -1.000 usec/KB Performance counter stats for './netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3': 2,675,818 context-switches 200.029651391 seconds time elapsed Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Yuchung Cheng <ycheng@google.com> Acked-By: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-24net: add sk_stream_is_writeable() helperEric Dumazet
Several call sites use the hardcoded following condition : sk_stream_wspace(sk) >= sk_stream_min_wspace(sk) Lets use a helper because TCP_NOTSENT_LOWAT support will change this condition for TCP sockets. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-24fib_trie: potential out of bounds access in trie_show_stats()Jerry Snitselaar
With the <= max condition in the for loop, it will be always go 1 element further than needed. If the condition for the while loop is never met, then max is MAX_STAT_DEPTH, and for loop will walk off the end of nodesizes[]. Signed-off-by: Jerry Snitselaar <jerry.snitselaar@oracle.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-23vti: switch to new ip tunnel codeAmerigo Wang
GRE tunnel and IPIP tunnel already switched to the new ip tunnel code, VTI tunnel can use it too. Cc: Pravin B Shelar <pshelar@nicira.com> Cc: Stephen Hemminger <stephen@networkplumber.org> Cc: Saurabh Mohan <saurabh.mohan@vyatta.com> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Cong Wang <amwang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-23ipmr: change the prototype of ip_mr_forward().Rami Rosen
This patch changes the prototpye of the ip_mr_forward() method to return void instead of int. The ip_mr_forward() method always returns 0; moreover, the return value of this method is not checked anywhere. Signed-off-by: Rami Rosen <ramirose@gmail.com> Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-23net: convert resend IGMP to notifier eventJiri Pirko
Until now, bond_resend_igmp_join_requests() looks for vlans attached to bonding device, bridge where bonding act as port manually. It does not care of other scenarios, like stacked bonds or team device above. Make this more generic and use netdev notifier to propagate the event to upper devices and to actually call ip_mc_rejoin_groups(). Signed-off-by: Jiri Pirko <jiri@resnulli.us> Acked-by: Veaceslav Falico <vfalico@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-22tcp: use RTT from SACK for RTOYuchung Cheng
If RTT is not available because Karn's check has failed or no new packet is acked, use the RTT measured from SACK to estimate the RTO. The sender can continue to estimate the RTO during loss recovery or reordering event upon receiving non-partial ACKs. This also changes when the RTO is re-armed. Previously it is only re-armed when some data is cummulatively acknowledged (i.e., SND.UNA advances), but now it is re-armed whenever RTT estimator is updated. This feature is particularly useful to reduce spurious timeout for buffer bloat including cellular carriers [1], and RTT estimation on reordering events. [1] "An In-depth Study of LTE: Effect of Network Protocol and Application Behavior on Performance", In Proc. of SIGCOMM 2013 Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-22tcp: measure RTT from new SACKYuchung Cheng
Take RTT sample if an ACK selectively acks some sequences that have never been retransmitted. The Karn's algorithm does not apply even if that ACK (s)acks other retransmitted sequences, because it must been generated by an original but perhaps out-of-order packet. There is no ambiguity. In case when multiple blocks are newly sacked because of ACK losses the earliest block is used to measure RTT, similar to cummulative ACKs. Such RTT samples allow the sender to estimate the RTO during loss recovery and packet reordering events. It is still useful even with TCP timestamps. That's because during these events the SND.UNA may not advance preventing RTT samples from TS ECR (thus the FLAG_ACKED check before calling tcp_ack_update_rtt()). Therefore this new RTT source is complementary to existing ACK and TS RTT mechanisms. This patch does not update the RTO. It is done in the next patch. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-22tcp: prefer packet timing to TS-ECR for RTTYuchung Cheng
Prefer packet timings to TS-ecr for RTT measurements when both sources are available. That's because broken middle-boxes and remote peer can return packets with corrupted TS ECR fields. Similarly most congestion controls that require RTT signals favor timing-based sources as well. Also check for bad TS ECR values to avoid RTT blow-ups. It has happened on production Web servers. Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-22tcp: consolidate SYNACK RTT samplingYuchung Cheng
The first patch consolidates SYNACK and other RTT measurement to use a central function tcp_ack_update_rtt(). A (small) bonus is now SYNACK RTT measurement happens after PAWS check, potentially reducing the impact of RTO seeding on bad TCP timestamps values. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-19sysctl net: Keep tcp_syn_retries inside the boundaryMichal Tesar
Limit the min/max value passed to the /proc/sys/net/ipv4/tcp_syn_retries. Signed-off-by: Michal Tesar <mtesar@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-16ipv4: set transport header earlierEric Dumazet
commit 45f00f99d6e ("ipv4: tcp: clean up tcp_v4_early_demux()") added a performance regression for non GRO traffic, basically disabling IP early demux. IPv6 stack resets transport header in ip6_rcv() before calling IP early demux in ip6_rcv_finish(), while IPv4 does this only in ip_local_deliver_finish(), _after_ IP early demux. GRO traffic happened to enable IP early demux because transport header is also set in inet_gro_receive() Instead of reverting the faulty commit, we can make IPv4/IPv6 behave the same : transport_header should be set in ip_rcv() instead of ip_local_deliver_finish() ip_local_deliver_finish() can also use skb_network_header_len() which is faster than ip_hdrlen() Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-15PTR_RET is now PTR_ERR_OR_ZERO(): Replace most.Rusty Russell
Sweep of the simple cases. Cc: netdev@vger.kernel.org Cc: linuxppc-dev@lists.ozlabs.org Cc: linux-arm-kernel@lists.infradead.org Cc: Julia Lawall <julia.lawall@lip6.fr> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Acked-by: David S. Miller <davem@davemloft.net> Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-07-12tcp: account all retransmit failuresYuchung Cheng
Change snmp RETRANSFAILS stat to include timeout retransmit failures in addition to other loss recoveries. Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-11gre: Fix MTU sizing check for gretap tunnelsAlexander Duyck
This change fixes an MTU sizing issue seen with gretap tunnels when non-gso packets are sent from the interface. In my case I was able to reproduce the issue by simply sending a ping of 1421 bytes with the gretap interface created on a device with a standard 1500 mtu. This fix is based on the fact that the tunnel mtu is already adjusted by dev->hard_header_len so it would make sense that any packets being compared against that mtu should also be adjusted by hard_header_len and the tunnel header instead of just the tunnel header. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Reported-by: Cong Wang <amwang@redhat.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-11gso: Update tunnel segmentation to support Tx checksum offloadAlexander Duyck
This change makes it so that the GRE and VXLAN tunnels can make use of Tx checksum offload support provided by some drivers via the hw_enc_features. Without this fix enabling GSO means sacrificing Tx checksum offload and this actually leads to a performance regression as shown below: Utilization Send Throughput local GSO 10^6bits/s % S state 6276.51 8.39 enabled 7123.52 8.42 disabled To resolve this it was necessary to address two items. First netif_skb_features needed to be updated so that it would correctly handle the Trans Ether Bridging protocol without impacting the need to check for Q-in-Q tagging. To do this it was necessary to update harmonize_features so that it used skb_network_protocol instead of just using the outer protocol. Second it was necessary to update the GRE and UDP tunnel segmentation offloads so that they would reset the encapsulation bit and inner header offsets after the offload was complete. As a result of this change I have seen the following results on a interface with Tx checksum enabled for encapsulated frames: Utilization Send Throughput local GSO 10^6bits/s % S state 7123.52 8.42 disabled 8321.75 5.43 enabled v2: Instead of replacing refrence to skb->protocol with skb_network_protocol just replace the protocol reference in harmonize_features to allow for double VLAN tag checks. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>