summaryrefslogtreecommitdiff
path: root/net/ipv4
AgeCommit message (Collapse)Author
2011-03-10ipv4: Kill flowi arg to fib_select_multipath()David S. Miller
Completely unused. Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-10ipv4: Remove unnecessary test from ip_mkroute_input()David S. Miller
fl->oif will always be zero on the input path, so there is no reason to test for that. Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-10ipv4: Remove redundant RCU locking in ip_check_mc().David S. Miller
All callers are under rcu_read_lock() protection already. Rename to ip_check_mc_rcu() to make it even more clear. Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-10Merge branch 'master' of ↵David S. Miller
master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 Conflicts: drivers/net/bnx2x/bnx2x_cmn.c
2011-03-10Merge branch 'master' of /home/davem/src/GIT/linux-2.6/David S. Miller
2011-03-10tcp: mark tcp_congestion_ops read_mostlyStephen Hemminger
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-09ipv4: Optimize flow initialization in fib_validate_source().David S. Miller
Like in commit 44713b67db10c774f14280c129b0d5fd13c70cf2 ("ipv4: Optimize flow initialization in output route lookup." we can optimize the on-stack flow setup to only initialize the members which are actually used. Otherwise we bzero the entire structure, then initialize explicitly the first half of it. Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-09ipv4: Optimize flow initialization in input route lookup.David S. Miller
Like in commit 44713b67db10c774f14280c129b0d5fd13c70cf2 ("ipv4: Optimize flow initialization in output route lookup." we can optimize the on-stack flow setup to only initialize the members which are actually used. Otherwise we bzero the entire structure, then initialize explicitly the first half of it. Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-10net: don't allow CAP_NET_ADMIN to load non-netdev kernel modulesVasiliy Kulikov
Since a8f80e8ff94ecba629542d9b4b5f5a8ee3eb565c any process with CAP_NET_ADMIN may load any module from /lib/modules/. This doesn't mean that CAP_NET_ADMIN is a superset of CAP_SYS_MODULE as modules are limited to /lib/modules/**. However, CAP_NET_ADMIN capability shouldn't allow anybody load any module not related to networking. This patch restricts an ability of autoloading modules to netdev modules with explicit aliases. This fixes CVE-2011-1019. Arnd Bergmann suggested to leave untouched the old pre-v2.6.32 behavior of loading netdev modules by name (without any prefix) for processes with CAP_SYS_MODULE to maintain the compatibility with network scripts that use autoloading netdev modules by aliases like "eth0", "wlan0". Currently there are only three users of the feature in the upstream kernel: ipip, ip_gre and sit. root@albatros:~# capsh --drop=$(seq -s, 0 11),$(seq -s, 13 34) -- root@albatros:~# grep Cap /proc/$$/status CapInh: 0000000000000000 CapPrm: fffffff800001000 CapEff: fffffff800001000 CapBnd: fffffff800001000 root@albatros:~# modprobe xfs FATAL: Error inserting xfs (/lib/modules/2.6.38-rc6-00001-g2bf4ca3/kernel/fs/xfs/xfs.ko): Operation not permitted root@albatros:~# lsmod | grep xfs root@albatros:~# ifconfig xfs xfs: error fetching interface information: Device not found root@albatros:~# lsmod | grep xfs root@albatros:~# lsmod | grep sit root@albatros:~# ifconfig sit sit: error fetching interface information: Device not found root@albatros:~# lsmod | grep sit root@albatros:~# ifconfig sit0 sit0 Link encap:IPv6-in-IPv4 NOARP MTU:1480 Metric:1 root@albatros:~# lsmod | grep sit sit 10457 0 tunnel4 2957 1 sit For CAP_SYS_MODULE module loading is still relaxed: root@albatros:~# grep Cap /proc/$$/status CapInh: 0000000000000000 CapPrm: ffffffffffffffff CapEff: ffffffffffffffff CapBnd: ffffffffffffffff root@albatros:~# ifconfig xfs xfs: error fetching interface information: Device not found root@albatros:~# lsmod | grep xfs xfs 745319 0 Reference: https://lkml.org/lkml/2011/2/24/203 Signed-off-by: Vasiliy Kulikov <segoon@openwall.com> Signed-off-by: Michael Tokarev <mjt@tls.msk.ru> Acked-by: David S. Miller <davem@davemloft.net> Acked-by: Kees Cook <kees.cook@canonical.com> Signed-off-by: James Morris <jmorris@namei.org>
2011-03-09tcp: ioctl type SIOCOUTQNSD returns amount of data not sentMario Schuknecht
In contrast to SIOCOUTQ which returns the amount of data sent but not yet acknowledged plus data not yet sent this patch only returns the data not sent. For various methods of live streaming bitrate control it may be helpful to know how much data are in the tcp outqueue are not sent yet. Signed-off-by: Mario Schuknecht <m.schuknecht@dresearch.de> Signed-off-by: Steffen Sledz <sledz@dresearch.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-09ipv4: Lookup multicast routes by rtable using helper.David S. Miller
Create a common helper for this operation, since we do it identically in three spots. Suggested by Eric Dumazet. Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-09ipv4: Fix erroneous uses of ifa_address.David S. Miller
In usual cases ifa_address == ifa_local, but in the case where SIOCSIFDSTADDR sets the destination address on a point-to-point link, ifa_address gets set to that destination address. Therefore we should use ifa_local when we want the local interface address. There were two cases where the selection was done incorrectly: 1) When devinet_ioctl() does matching, it checks ifa_address even though gifconf correct reported ifa_local to the user 2) IN_DEV_ARP_NOTIFY handling sends a gratuitous ARP using ifa_address instead of ifa_local. Reported-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-08inetpeer: Don't disable BH for initial fast RCU lookup.David S. Miller
If modifications on other cpus are ok, then modifications to the tree during lookup done by the local cpu are ok too. Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-08ipv4: Fix scope value used in route src-address caching.David S. Miller
We have to use cfg->fc_scope not the final nh_scope value. Reported-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-07ipv4: Cache source address in nexthop entries.David S. Miller
When doing output route lookups, we have to select the source address if the user has not specified an explicit one. First, if the route has an explicit preferred source address specified, then we use that. Otherwise we search the route's outgoing interface for a suitable address. This search can be precomputed and cached at route insertion time. The only missing part is that we have to refresh this precomputed value any time addresses are added or removed from the interface, and this is accomplished by fib_update_nh_saddrs(). Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-07ipv4: Inline fib_semantic_match into check_leafDavid S. Miller
This elimiates a lot of pure overhead due to parameter passing. Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-07ipv4: Validate route entry type at insert instead of every lookup.David S. Miller
fib_semantic_match() requires that if the type doesn't signal an automatic error, it must be of type RTN_UNICAST, RTN_LOCAL, RTN_BROADCAST, RTN_ANYCAST, or RTN_MULTICAST. Checking this every route lookup is pointless work. Instead validate it during route insertion, via fib_create_info(). Also, there was nothing making sure the type value was less than RTN_MAX, so add that missing check while we're here. Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-04ipv4: Remove flowi from struct rtable.David S. Miller
The only necessary parts are the src/dst addresses, the interface indexes, the TOS, and the mark. The rest is unnecessary bloat, which amounts to nearly 50 bytes on 64-bit. Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-04ipv4: Set rt->rt_iif more sanely on output routes.David S. Miller
rt->rt_iif is only ever inspected on input routes, for example DCCP uses this to populate a route lookup flow key when generating replies to another packet. Therefore, setting it to anything other than zero on output routes makes no sense. Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-04ipv4: Get peer more cheaply in rt_init_metrics().David S. Miller
We know this is a new route object, so doing atomics and stuff makes no sense at all. Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-04ipv4: Optimize flow initialization in output route lookup.David S. Miller
We burn a lot of useless cycles, cpu store buffer traffic, and memory operations memset()'ing the on-stack flow used to perform output route lookups in __ip_route_output_key(). Only the first half of the flow object members even matter for output route lookups in this context, specifically: FIB rules matching cares about: dst, src, tos, iif, oif, mark FIB trie lookup cares about: dst FIB semantic match cares about: tos, scope, oif Therefore only initialize these specific members and elide the memset entirely. On Niagara2 this kills about ~300 cycles from the output route lookup path. Likely, we can take things further, since all callers of output route lookups essentially throw away the on-stack flow they use. So they don't care if we use it as a scratch-pad to compute the final flow key. Signed-off-by: David S. Miller <davem@davemloft.net> Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
2011-03-04inetpeer: seqlock optimizationEric Dumazet
David noticed : ------------------ Eric, I was profiling the non-routing-cache case and something that stuck out is the case of calling inet_getpeer() with create==0. If an entry is not found, we have to redo the lookup under a spinlock to make certain that a concurrent writer rebalancing the tree does not "hide" an existing entry from us. This makes the case of a create==0 lookup for a not-present entry really expensive. It is on the order of 600 cpu cycles on my Niagara2. I added a hack to not do the relookup under the lock when create==0 and it now costs less than 300 cycles. This is now a pretty common operation with the way we handle COW'd metrics, so I think it's definitely worth optimizing. ----------------- One solution is to use a seqlock instead of a spinlock to protect struct inet_peer_base. After a failed avl tree lookup, we can easily detect if a writer did some changes during our lookup. Taking the lock and redo the lookup is only necessary in this case. Note: Add one private rcu_deref_locked() macro to place in one spot the access to spinlock included in seqlock. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-03Merge branch 'master' of ↵David S. Miller
master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 Conflicts: drivers/net/bnx2x/bnx2x.h
2011-03-03ipv4: Fix __ip_dev_find() to use ifa_local instead of ifa_address.David S. Miller
Reported-by: Stephen Hemminger <shemminger@vyatta.com> Reported-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-03ipv4: Fix crash in dst_release when udp_sendmsg route lookup fails.David S. Miller
As reported by Eric: [11483.697233] IP: [<c12b0638>] dst_release+0x18/0x60 ... [11483.697741] Call Trace: [11483.697764] [<c12fc9d2>] udp_sendmsg+0x282/0x6e0 [11483.697790] [<c12a1c01>] ? memcpy_toiovec+0x51/0x70 [11483.697818] [<c12dbd90>] ? ip_generic_getfrag+0x0/0xb0 The pointer passed to dst_release() is -EINVAL, that's because we leave an error pointer in the local variable "rt" by accident. NULL it out to fix the bug. Reported-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-02ipv4: ip_route_output_key() is better as an inline.David S. Miller
This avoid a stack frame at zero cost. Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-02ipv4: Make output route lookup return rtable directly.David S. Miller
Instead of on the stack. Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-02xfrm: Return dst directly from xfrm_lookup()David S. Miller
Instead of on the stack. Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-01inet: Replace left-over references to inet->corkHerbert Xu
The patch to replace inet->cork with cork left out two spots in __ip_append_data that can result in bogus packet construction. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-01ipv4: Make icmp route lookup code a bit clearer.David S. Miller
The route lookup code in icmp_send() is slightly tricky as a result of having to handle all of the requirements of RFC 4301 host relookups. Pull the route resolution into a seperate function, so that the error handling and route reference counting is hopefully easier to see and contained wholly within this new routine. Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-01xfrm: Handle blackhole route creation via afinfo.David S. Miller
That way we don't have to potentially do this in every xfrm_lookup() caller. Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-01xfrm: Kill XFRM_LOOKUP_WAIT flag.David S. Miller
This can be determined from the flow flags instead. Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-01ipv4: Kill can_sleep arg to ip_route_output_flow()David S. Miller
This boolean state is now available in the flow flags. Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-01net: Add FLOWI_FLAG_CAN_SLEEP.David S. Miller
And set is in contexts where the route resolution can sleep. Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-01ipv4: Make final arg to ip_route_output_flow to be boolean "can_sleep"David S. Miller
Since that is what the current vague "flags" argument means. Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-01ipv4: Can final ip_route_connect() arg to boolean "can_sleep".David S. Miller
Since that's what the current vague "flags" thing means. Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-01udp: Add lockless transmit pathHerbert Xu
The UDP transmit path has been running under the socket lock for a long time because of the corking feature. This means that transmitting to the same socket in multiple threads does not scale at all. However, as most users don't actually use corking, the locking can be removed in the common case. This patch creates a lockless fast path where corking is not used. Please note that this does create a slight inaccuracy in the enforcement of socket send buffer limits. In particular, we may exceed the socket limit by up to (number of CPUs) * (packet size) because of the way the limit is computed. As the primary purpose of socket buffers is to indicate congestion, this should not be a great problem for now. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-01udp: Switch to ip_finish_skbHerbert Xu
This patch converts UDP to use the new ip_finish_skb API. This would then allows us to more easily use ip_make_skb which allows UDP to run without a socket lock. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-01inet: Add ip_make_skb and ip_finish_skbHerbert Xu
This patch adds the helper ip_make_skb which is like ip_append_data and ip_push_pending_frames all rolled into one, except that it does not send the skb produced. The sending part is carried out by ip_send_skb, which the transport protocol can call after it has tweaked the skb. It is meant to be called in cases where corking is not used should have a one-to-one correspondence to sendmsg. This patch also adds the helper ip_finish_skb which is meant to be replace ip_push_pending_frames when corking is required. Previously the protocol stack would peek at the socket write queue and add its header to the first packet. With ip_finish_skb, the protocol stack can directly operate on the final skb instead, just like the non-corking case with ip_make_skb. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-01inet: Remove explicit write references to sk/inet in ip_append_dataHerbert Xu
In order to allow simultaneous calls to ip_append_data on the same socket, it must not modify any shared state in sk or inet (other than those that are designed to allow that such as atomic counters). This patch abstracts out write references to sk and inet_sk in ip_append_data and its friends so that we may use the underlying code in parallel. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-01inet: Remove unused sk_sndmsg_* from UFOHerbert Xu
UFO doesn't really use the sk_sndmsg_* parameters so touching them is pointless. It can't use them anyway since the whole point of UFO is to use the original pages without copying. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-24ipv4: Rearrange how ip_route_newports() gets port keys.David S. Miller
ip_route_newports() is the only place in the entire kernel that cares about the port members in the routing cache entry's lookup flow key. Therefore the only reason we store an entire flow inside of the struct rtentry is for this one special case. Rewrite ip_route_newports() such that: 1) The caller passes in the original port values, so we don't need to use the rth->fl.fl_ip_{s,d}port values to remember them. 2) The lookup flow is constructed by hand instead of being copied from the routing cache entry's flow. Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-23xfrm: Const'ify address arguments to ->dst_lookup()David S. Miller
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-23xfrm: Const'ify tmpl and address arguments to ->init_temprop()David S. Miller
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-22xfrm: Mark flowi arg to ->init_tempsel() const.David S. Miller
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-22xfrm: Mark flowi arg to ->fill_dst() const.David S. Miller
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-22xfrm: Mark flowi arg to ->get_tos() const.David S. Miller
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-21tcp: undo_retrans counter fixesYuchung Cheng
Fix a bug that undo_retrans is incorrectly decremented when undo_marker is not set or undo_retrans is already 0. This happens when sender receives more DSACK ACKs than packets retransmitted during the current undo phase. This may also happen when sender receives DSACK after the undo operation is completed or cancelled. Fix another bug that undo_retrans is incorrectly incremented when sender retransmits an skb and tcp_skb_pcount(skb) > 1 (TSO). This case is rare but not impossible. Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-20tcp: Remove debug macro of TCP_CHECK_TIMERShan Wei
Now, TCP_CHECK_TIMER is not used for debuging, it does nothing. And, it has been there for several years, maybe 6 years. Remove it to keep code clearer. Signed-off-by: Shan Wei <shanwei@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-19Merge branch 'master' of ↵David S. Miller
master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 Conflicts: Documentation/feature-removal-schedule.txt drivers/net/e1000e/netdev.c net/xfrm/xfrm_policy.c