summaryrefslogtreecommitdiff
path: root/net/ipv4/xfrm4_policy.c
AgeCommit message (Collapse)Author
2011-03-04ipv4: Remove flowi from struct rtable.David S. Miller
The only necessary parts are the src/dst addresses, the interface indexes, the TOS, and the mark. The rest is unnecessary bloat, which amounts to nearly 50 bytes on 64-bit. Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-02ipv4: Make output route lookup return rtable directly.David S. Miller
Instead of on the stack. Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-01xfrm: Handle blackhole route creation via afinfo.David S. Miller
That way we don't have to potentially do this in every xfrm_lookup() caller. Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-23xfrm: Const'ify address arguments to ->dst_lookup()David S. Miller
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-22xfrm: Mark flowi arg to ->fill_dst() const.David S. Miller
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-22xfrm: Mark flowi arg to ->get_tos() const.David S. Miller
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-01-26net: Implement read-only protection and COW'ing of metrics.David S. Miller
Routing metrics are now copy-on-write. Initially a route entry points it's metrics at a read-only location. If a routing table entry exists, it will point there. Else it will point at the all zero metric place-holder called 'dst_default_metrics'. The writeability state of the metrics is stored in the low bits of the metrics pointer, we have two bits left to spare if we want to store more states. For the initial implementation, COW is implemented simply via kmalloc. However future enhancements will change this to place the writable metrics somewhere else, in order to increase sharing. Very likely this "somewhere else" will be the inetpeer cache. Note also that this means that metrics updates may transiently fail if we cannot COW the metrics successfully. But even by itself, this patch should decrease memory usage and increase cache locality especially for routing workloads. In those cases the read-only metric copies stay in place and never get written to. TCP workloads where metrics get updated, and those rare cases where PMTU triggers occur, will take a very slight performance hit. But that hit will be alleviated when the long-term writable metrics move to a more sharable location. Since the metrics storage went from a u32 array of RTAX_MAX entries to what is essentially a pointer, some retooling of the dst_entry layout was necessary. Most importantly, we need to preserve the alignment of the reference count so that it doesn't share cache lines with the read-mostly state, as per Eric Dumazet's alignment assertion checks. The only non-trivial bit here is the move of the 'flags' member into the writeable cacheline. This is OK since we are always accessing the flags around the same moment when we made a modification to the reference count. Signed-off-by: David S. Miller <davem@davemloft.net>
2010-11-17net: use the macros defined for the members of flowiChangli Gao
Use the macros defined for the members of flowi to clean the code up. Signed-off-by: Changli Gao <xiaosuo@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-11-15xfrm: use gre key as flow upper protocol infoTimo Teräs
The GRE Key field is intended to be used for identifying an individual traffic flow within a tunnel. It is useful to be able to have XFRM policy selector matches to have different policies for different GRE tunnels. Signed-off-by: Timo Teräs <timo.teras@iki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-11-11net: get rid of rtable->idevEric Dumazet
It seems idev field in struct rtable has no special purpose, but adding extra atomic ops. We hold refcounts on the device itself (using percpu data, so pretty cheap in current kernel). infiniband case is solved using dst.dev instead of idev->dev Removal of this field means routing without route cache is now using shared data, percpu data, and only potential contention is a pair of atomic ops on struct neighbour per forwarded packet. About 5% speedup on routing test. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Roland Dreier <rolandd@cisco.com> Cc: Sean Hefty <sean.hefty@intel.com> Cc: Hal Rosenstock <hal.rosenstock@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-10-11net dst: use a percpu_counter to track entriesEric Dumazet
struct dst_ops tracks number of allocated dst in an atomic_t field, subject to high cache line contention in stress workload. Switch to a percpu_counter, to reduce number of time we need to dirty a central location. Place it on a separate cache line to avoid dirtying read only fields. Stress test : (Sending 160.000.000 UDP frames, IP route cache disabled, dual E5540 @2.53GHz, 32bit kernel, FIB_TRIE, SLUB/NUMA) Before: real 0m51.179s user 0m15.329s sys 10m15.942s After: real 0m45.570s user 0m15.525s sys 9m56.669s With a small reordering of struct neighbour fields, subject of a following patch, (to separate refcnt from other read mostly fields) real 0m41.841s user 0m15.261s sys 8m45.949s Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-09-22xfrm4: strip ECN bits from tos fieldUlrich Weber
otherwise ECT(1) bit will get interpreted as RTO_ONLINK and routing will fail with XfrmOutBundleGenError. Signed-off-by: Ulrich Weber <uweber@astaro.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-07-07Merge branch 'master' of ↵David S. Miller
master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
2010-07-04xfrm: fix xfrm by MARK logicPeter Kosyh
While using xfrm by MARK feature in 2.6.34 - 2.6.35 kernels, the mark is always cleared in flowi structure via memset in _decode_session4 (net/ipv4/xfrm4_policy.c), so the policy lookup fails. IPv6 code is affected by this bug too. Signed-off-by: Peter Kosyh <p.kosyh@gmail.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-10net-next: remove useless union keywordChangli Gao
remove useless union keyword in rtable, rt6_info and dn_route. Since there is only one member in a union, the union keyword isn't useful. Signed-off-by: Changli Gao <xiaosuo@gmail.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-07xfrm: cache bundles instead of policies for outgoing flowsTimo Teräs
__xfrm_lookup() is called for each packet transmitted out of system. The xfrm_find_bundle() does a linear search which can kill system performance depending on how many bundles are required per policy. This modifies __xfrm_lookup() to store bundles directly in the flow cache. If we did not get a hit, we just create a new bundle instead of doing slow search. This means that we can now get multiple xfrm_dst's for same flow (on per-cpu basis). Signed-off-by: Timo Teras <timo.teras@iki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-03-03ipsec: Fix bogus bundle flowiHerbert Xu
When I merged the bundle creation code, I introduced a bogus flowi value in the bundle. Instead of getting from the caller, it was instead set to the flow in the route object, which is totally different. The end result is that the bundles we created never match, and we instead end up with an ever growing bundle list. Thanks to Jamal for find this problem. Reported-by: Jamal Hadi Salim <hadi@cyberus.ca> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Acked-by: Steffen Klassert <steffen.klassert@secunet.com> Acked-by: Jamal Hadi Salim <hadi@cyberus.ca> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-01-24netns xfrm: deal with dst entries in netnsAlexey Dobriyan
GC is non-existent in netns, so after you hit GC threshold, no new dst entries will be created until someone triggers cleanup in init_net. Make xfrm4_dst_ops and xfrm6_dst_ops per-netns. This is not done in a generic way, because it woule waste (AF_MAX - 2) * sizeof(struct dst_ops) bytes per-netns. Reorder GC threshold initialization so it'd be done before registering XFRM policies. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2009-11-12sysctl net: Remove unused binary sysctl codeEric W. Biederman
Now that sys_sysctl is a compatiblity wrapper around /proc/sys all sysctl strategy routines, and all ctl_name and strategy entries in the sysctl tables are unused, and can be revmoed. In addition neigh_sysctl_register has been modified to no longer take a strategy argument and it's callers have been modified not to pass one. Cc: "David Miller" <davem@davemloft.net> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: netdev@vger.kernel.org Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2009-08-04xfrm4: fix build when SYSCTLs are disabledRandy Dunlap
Fix build errors when SYSCTLs are not enabled: (.init.text+0x5154): undefined reference to `net_ipv4_ctl_path' (.init.text+0x5176): undefined reference to `register_net_sysctl_table' xfrm4_policy.c:(.exit.text+0x573): undefined reference to `unregister_net_sysctl_table Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2009-07-30xfrm: select sane defaults for xfrm[4|6] gc_threshNeil Horman
Choose saner defaults for xfrm[4|6] gc_thresh values on init Currently, the xfrm[4|6] code has hard-coded initial gc_thresh values (set to 1024). Given that the ipv4 and ipv6 routing caches are sized dynamically at boot time, the static selections can be non-sensical. This patch dynamically selects an appropriate gc threshold based on the corresponding main routing table size, using the assumption that we should in the worst case be able to handle as many connections as the routing table can. For ipv4, the maximum route cache size is 16 * the number of hash buckets in the route cache. Given that xfrm4 starts garbage collection at the gc_thresh and prevents new allocations at 2 * gc_thresh, we set gc_thresh to half the maximum route cache size. For ipv6, its a bit trickier. there is no maximum route cache size, but the ipv6 dst_ops gc_thresh is statically set to 1024. It seems sane to select a simmilar gc_thresh for the xfrm6 code that is half the number of hash buckets in the v6 route cache times 16 (like the v4 code does). Signed-off-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2009-07-27xfrm: export xfrm garbage collector thresholds via sysctlNeil Horman
Export garbage collector thresholds for xfrm[4|6]_dst_ops Had a problem reported to me recently in which a high volume of ipsec connections on a system began reporting ENOBUFS for new connections eventually. It seemed that after about 2000 connections we started being unable to create more. A quick look revealed that the xfrm code used a dst_ops structure that limited the gc_thresh value to 1024, and always dropped route cache entries after 2x the gc_thresh. It seems the most direct solution is to export the gc_thresh values in the xfrm[4|6] dst_ops as sysctls, like the main routing table does, so that higher volumes of connections can be supported. This patch has been tested and allows the reporter to increase their ipsec connection volume successfully. Reported-by: Joe Nall <joe@nall.com> Signed-off-by: Neil Horman <nhorman@tuxdriver.com> ipv4/xfrm4_policy.c | 18 ++++++++++++++++++ ipv6/xfrm6_policy.c | 18 ++++++++++++++++++ 2 files changed, 36 insertions(+) Signed-off-by: David S. Miller <davem@davemloft.net>
2009-07-03xfrm4: fix the ports decode of sctp protocolWei Yongjun
The SCTP pushed the skb data above the sctp chunk header, so the check of pskb_may_pull(skb, xprth + 4 - skb->data) in _decode_session4() will never return 0 because xprth + 4 - skb->data < 0, the ports decode of sctp will always fail. Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2009-02-01net: replace uses of __constant_{endian}Harvey Harrison
Base versions handle constant folding now. Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-25netns xfrm: ->get_saddr in netnsAlexey Dobriyan
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-25netns xfrm: ->dst_lookup in netnsAlexey Dobriyan
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-25netns xfrm: dst garbage-collecting in netnsAlexey Dobriyan
Pass netns pointer to struct xfrm_policy_afinfo::garbage_collect() [This needs more thoughts on what to do with dst_ops] [Currently stub to init_net] Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-11net: remove struct dst_entry::entry_sizeAlexey Dobriyan
Unused after kmem_cache_zalloc() conversion. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-03net: clean up net/ipv4/ipip.c raw.c tcp.c tcp_minisocks.c tcp_yeah.c ↵Jianjun Kong
xfrm4_policy.c Signed-off-by: Jianjun Kong <jianjun@zeuux.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-26[NET] NETNS: Omit net_device->nd_net without CONFIG_NET_NS.YOSHIFUJI Hideaki
Introduce per-net_device inlines: dev_net(), dev_net_set(). Without CONFIG_NET_NS, no namespace other than &init_net exists. Let's explicitly define them to help compiler optimizations. Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
2008-01-31[NET]: should explicitely initialize atomic_t field in struct dst_opsEric Dumazet
All but one struct dst_ops static initializations miss explicit initialization of entries field. As this field is atomic_t, we should use ATOMIC_INIT(0), and not rely on atomic_t implementation. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28[NETNS]: Add namespace parameter to __ip_route_output_key.Denis V. Lunev
This is only required to propagate it down to the ip_route_output_slow. Signed-off-by: Denis V. Lunev <den@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28[NETNS][DST] dst: pass the dst_ops as parameter to the gc functionsDaniel Lezcano
The garbage collection function receive the dst_ops structure as parameter. This is useful for the next incoming patchset because it will need the dst_ops (there will be several instances) and the network namespace pointer (contained in the dst_ops). The protocols which do not take care of the namespaces will not be impacted by this change (expect for the function signature), they do just ignore the parameter. Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28[XFRM] IPv6: Fix dst/routing check at transformation.Masahide NAKAMURA
IPv6 specific thing is wrongly removed from transformation at net-2.6.25. This patch recovers it with current design. o Update "path" of xfrm_dst since IPv6 transformation should care about routing changes. It is required by MIPv6 and off-link destined IPsec. o Rename nfheader_len which is for non-fragment transformation used by MIPv6 to rt6i_nfheader_len as IPv6 name space. Signed-off-by: Masahide NAKAMURA <nakam@linux-ipv6.org> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28[IPSEC]: Added xfrm_decode_session_reverse and xfrmX_policy_check_reverseHerbert Xu
RFC 4301 requires us to relookup ICMP traffic that does not match any policies using the reverse of its payload. This patch adds the functions xfrm_decode_session_reverse and xfrmX_policy_check_reverse so we can get the reverse flow to perform such a lookup. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28[NET]: Multiple namespaces in the all dst_ifdown routines.Denis V. Lunev
Move dst entries to a namespace loopback to catch refcounting leaks. Signed-off-by: Denis V. Lunev <den@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28[IPSEC]: Merge most of the output pathHerbert Xu
As part of the work on asynchrnous cryptographic operations, we need to be able to resume from the spot where they occur. As such, it helps if we isolate them to one spot. This patch moves most of the remaining family-specific processing into the common output code. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28[IPSEC]: Merge common code into xfrm_bundle_createHerbert Xu
Half of the code in xfrm4_bundle_create and xfrm6_bundle_create are common. This patch extracts that logic and puts it into xfrm_bundle_create. The rest of it are then accessed through afinfo. As a result this fixes the problem with inter-family transforms where we treat every xfrm dst in the bundle as if it belongs to the top family. This patch also fixes a long-standing error-path bug where we may free the xfrm states twice. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28[IPSEC]: Move flow construction into xfrm_dst_lookupHerbert Xu
This patch moves the flow construction from the callers of xfrm_dst_lookup into that function. It also changes xfrm_dst_lookup so that it takes an xfrm state as its argument instead of explicit addresses. This removes any address-specific logic from the callers of xfrm_dst_lookup which is needed to correctly support inter-family transforms. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28[IPSEC]: Make sure idev is consistent with dev in xfrm_dstHerbert Xu
Previously we took the device from the bottom route and idev from the top route. This is bad because idev may well point to a different device. This patch changes it so that we get the idev from the device directly. It also makes it an error if either dev or idev is NULL. This is consistent with the rest of the routing code which also treats these cases as errors. I've removed the err initialisation in xfrm6_policy.c because it achieves no purpose and hid a bug when an initial version of this patch neglected to set err to -ENODEV (fortunately the IPv4 version warned about it). Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28[IPSEC]: Set dst->input to dst_discardHerbert Xu
The input function should never be invoked on IPsec dst objects. This is because we don't apply IPsec on input until after we've made the routing decision. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28[IPSEC]: Only set neighbour on top xfrm dstHerbert Xu
The neighbour field is only used by dst_confirm which only ever happens on the top-most xfrm dst. So it's a waste to duplicate for every other xfrm dst. This patch moves its setting out of the loop so that only the top one gets set. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28[IPV6]: Move nfheader_len into rt6_infoHerbert Xu
The dst member nfheader_len is only used by IPv6. It's also currently creating a rather ugly alignment hole in struct dst. Therefore this patch moves it from there into struct rt6_info. It also reorders the fields in rt6_info to minimize holes. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-17[IPSEC]: Rename mode to outer_mode and add inner_modeHerbert Xu
This patch adds a new field to xfrm states called inner_mode. The existing mode object is renamed to outer_mode. This is the first part of an attempt to fix inter-family transforms. As it is we always use the outer family when determining which mode to use. As a result we may end up shoving IPv4 packets into netfilter6 and vice versa. What we really want is to use the inner family for the first part of outbound processing and the outer family for the second part. For inbound processing we'd use the opposite pairing. I've also added a check to prevent silly combinations such as transport mode with inter-family transforms. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-17[IPSEC]: Use the top IPv4 route's peer instead of the bottomHerbert Xu
For IPv4 we were using the bottom route's peer instead of the top one. This is wrong because the peer is only used by TCP to keep track of information about the TCP destination address which certainly does not live in the bottom route. This patch fixes that which allows us to get rid of the family check since the bottom route could be IPv6 while the top one must always be IPv4. I've also changed the other fields which are IPv4-specific to get the info from the top route instead of potentially bogus data from the bottom route. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-17[IPSEC]: Store afinfo pointer in xfrm_modeHerbert Xu
It is convenient to have a pointer from xfrm_state to address-specific functions such as the output function for a family. Currently the address-specific policy code calls out to the xfrm state code to get those pointers when we could get it in an easier way via the state itself. This patch adds an xfrm_state_afinfo to xfrm_mode (since they're address-specific) and changes the policy code to use it. I've also added an owner field to do reference counting on the module providing the afinfo even though it isn't strictly necessary today since IPv6 can't be unloaded yet. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-17[IPSEC]: Add missing BEET checksHerbert Xu
Currently BEET mode does not reinject the packet back into the stack like tunnel mode does. Since BEET should behave just like tunnel mode this is incorrect. This patch fixes this by introducing a flags field to xfrm_mode that tells the IPsec code whether it should terminate and reinject the packet back into the stack. It then sets the flag for BEET and tunnel mode. I've also added a number of missing BEET checks elsewhere where we check whether a given mode is a tunnel or not. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10[NET]: Make the loopback device per network namespace.Eric W. Biederman
This patch makes loopback_dev per network namespace. Adding code to create a different loopback device for each network namespace and adding the code to free a loopback device when a network namespace exits. This patch modifies all users the loopback_dev so they access it as init_net.loopback_dev, keeping all of the code compiling and working. A later pass will be needed to update the users to use something other than the initial network namespace. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10[NET]: Dynamically allocate the loopback device, part 1.Daniel Lezcano
This patch replaces all occurences to the static variable loopback_dev to a pointer loopback_dev. That provides the mindless, trivial, uninteressting change part for the dynamic allocation for the loopback. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com> Acked-By: Kirill Korotaev <dev@sw.ru> Acked-by: Benjamin Thery <benjamin.thery@bull.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25[NET]: cleanup extra semicolonsStephen Hemminger
Spring cleaning time... There seems to be a lot of places in the network code that have extra bogus semicolons after conditionals. Most commonly is a bogus semicolon after: switch() { } Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org> Signed-off-by: David S. Miller <davem@davemloft.net>