summaryrefslogtreecommitdiff
path: root/net/ipv6
AgeCommit message (Collapse)Author
2015-08-29vxlan: do not receive IPv4 packets on IPv6 socketJiri Benc
By default (subject to the sysctl settings), IPv6 sockets listen also for IPv4 traffic. Vxlan is not prepared for that and expects IPv6 header in packets received through an IPv6 socket. In addition, it's currently not possible to have both IPv4 and IPv6 vxlan tunnel on the same port (unless bindv6only sysctl is enabled), as it's not possible to create and bind both IPv4 and IPv6 vxlan interfaces and there's no way to specify both IPv4 and IPv6 remote/group IP addresses. Set IPV6_V6ONLY on vxlan sockets to fix both of these issues. This is not done globally in udp_tunnel, as l2tp and tipc seems to work okay when receiving IPv4 packets on IPv6 socket and people may rely on this behavior. The other tunnels (geneve and fou) do not support IPv6. Signed-off-by: Jiri Benc <jbenc@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-29ip_tunnels: convert the mode field of ip_tunnel_info to flagsJiri Benc
The mode field holds a single bit of information only (whether the ip_tunnel_info struct is for rx or tx). Change the mode field to bit flags. This allows more mode flags to be added. Signed-off-by: Jiri Benc <jbenc@redhat.com> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Acked-by: Thomas Graf <tgraf@suug.ch> Acked-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-28Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following patchset contains Netfilter/IPVS updates for your net-next tree. In sum, patches to address fallout from the previous round plus updates from the IPVS folks via Simon Horman, they are: 1) Add a new scheduler to IPVS: The weighted overflow scheduling algorithm directs network connections to the server with the highest weight that is currently available and overflows to the next when active connections exceed the node's weight. From Raducu Deaconu. 2) Fix locking ordering in IPVS, always take rtnl_lock in first place. Patch from Julian Anastasov. 3) Allow to indicate the MTU to the IPVS in-kernel state sync daemon. From Julian Anastasov. 4) Enhance multicast configuration for the IPVS state sync daemon. Also from Julian. 5) Resolve sparse warnings in the nf_dup modules. 6) Fix a linking problem when CONFIG_NF_DUP_IPV6 is not set. 7) Add ICMP codes 5 and 6 to IPv6 REJECT target, they are more informative subsets of code 1. From Andreas Herz. 8) Revert the jumpstack size calculation from mark_source_chains due to chain depth miscalculations, from Florian Westphal. 9) Calm down more sparse warning around the Netfilter tree, again from Florian Westphal. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-28netfilter: reduce sparse warningsFlorian Westphal
bridge/netfilter/ebtables.c:290:26: warning: incorrect type in assignment (different modifiers) -> remove __pure annotation. ipv6/netfilter/ip6t_SYNPROXY.c:240:27: warning: cast from restricted __be16 -> switch ntohs to htons and vice versa. netfilter/core.c:391:30: warning: symbol 'nfq_ct_nat_hook' was not declared. Should it be static? -> delete it, got removed net/netfilter/nf_synproxy_core.c:221:48: warning: cast to restricted __be32 -> Use __be32 instead of u32. Tested with objdiff that these changes do not affect generated code. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-08-28Revert "netfilter: xtables: compute exact size needed for jumpstack"Florian Westphal
This reverts commit 98d1bd802cdbc8f56868fae51edec13e86b59515. mark_source_chains will not re-visit chains, so *filter :INPUT ACCEPT [365:25776] :FORWARD ACCEPT [0:0] :OUTPUT ACCEPT [217:45832] :t1 - [0:0] :t2 - [0:0] :t3 - [0:0] :t4 - [0:0] -A t1 -i lo -j t2 -A t2 -i lo -j t3 -A t3 -i lo -j t4 # -A INPUT -j t4 # -A INPUT -j t3 # -A INPUT -j t2 -A INPUT -j t1 COMMIT Will compute a chain depth of 2 if the comments are removed. Revert back to counting the number of chains for the time being. Reported-by: Cong Wang <cwang@twopensource.com> Reported-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-08-27Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
2015-08-27ipv6: Export nf_ct_frag6_gather()Joe Stringer
Signed-off-by: Joe Stringer <joestringer@nicira.com> Acked-by: Thomas Graf <tgraf@suug.ch> Acked-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-26netfilter: ip6t_REJECT: added missing icmpv6 codesAndreas Herz
RFC 4443 added two new codes values for ICMPv6 type 1: 5 - Source address failed ingress/egress policy 6 - Reject route to destination And RFC 7084 states in L-14 that IPv6 Router MUST send ICMPv6 Destination Unreachable with code 5 for packets forwarded to it that use an address from a prefix that has been invalidated. Codes 5 and 6 are more informative subsets of code 1. Signed-off-by: Andreas Herz <andi@geekosphere.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-08-25ip6_gre: release cached dst on tunnel removalhuaibin Wang
When a tunnel is deleted, the cached dst entry should be released. This problem may prevent the removal of a netns (seen with a x-netns IPv6 gre tunnel): unregister_netdevice: waiting for lo to become free. Usage count = 3 CC: Dmitry Kozlov <xeb@mail.ru> Fixes: c12b395a4664 ("gre: Support GRE over IPv6") Signed-off-by: huaibin Wang <huaibin.wang@6wind.com> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-25ah6: fix error return codeJulia Lawall
Return a negative error code on failure. A simplified version of the semantic match that finds this problem is as follows: (http://coccinelle.lip6.fr/) // <smpl> @@ identifier ret; expression e1,e2; @@ ( if (\(ret < 0\|ret != 0\)) { ... return ret; } | ret = 0 ) ... when != ret = e1 when != &ret *if(...) { ... when != ret = e2 when forall return ret; } // </smpl> Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-25xfrm: Use VRF master index if output device is enslavedDavid Ahern
Directs route lookups to VRF table. Compiles out if NET_VRF is not enabled. With this patch able to successfully bring up ipsec tunnels in VRFs, even with duplicate network configuration. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Acked-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-24ila: Precompute checksum difference for translationsTom Herbert
In the ILA build state for LWT compute the checksum difference to apply to transport checksums that include the IPv6 pseudo header. The difference is between the route destination (from fib6_config) and the locator to write. Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-24lwt: Add cfg argument to build_stateTom Herbert
Add cfg and family arguments to lwt build state functions. cfg is a void pointer and will either be a pointer to a fib_config or fib6_config structure. The family parameter indicates which one (either AF_INET or AF_INET6). LWT encpasulation implementation may use the fib configuration to build the LWT state. Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-21netfilter: nf_dup: fix sparse warningsPablo Neira Ayuso
>> net/ipv4/netfilter/nft_dup_ipv4.c:29:37: sparse: incorrect type in initializer (different base types) net/ipv4/netfilter/nft_dup_ipv4.c:29:37: expected restricted __be32 [user type] s_addr net/ipv4/netfilter/nft_dup_ipv4.c:29:37: got unsigned int [unsigned] <noident> >> net/ipv6/netfilter/nf_dup_ipv6.c:48:23: sparse: incorrect type in assignment (different base types) net/ipv6/netfilter/nf_dup_ipv6.c:48:23: expected restricted __be32 [addressable] [assigned] [usertype] flowlabel net/ipv6/netfilter/nf_dup_ipv6.c:48:23: got int Reported-by: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-08-21Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Conflicts: drivers/net/usb/qmi_wwan.c Overlapping additions of new device IDs to qmi_wwan.c Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-20Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter updates for net-next This is second pull request includes the conflict resolution patch that resulted from the updates that we got for the conntrack template through kmalloc. No changes with regards to the previously sent 15 patches. The following patchset contains Netfilter updates for your net-next tree, they are: 1) Rework the existing nf_tables counter expression to make it per-cpu. 2) Prepare and factor out common packet duplication code from the TEE target so it can be reused from the new dup expression. 3) Add the new dup expression for the nf_tables IPv4 and IPv6 families. 4) Convert the nf_tables limit expression to use a token-based approach with 64-bits precision. 5) Enhance the nf_tables limit expression to support limiting at packet byte. This comes after several preparation patches. 6) Add a burst parameter to indicate the amount of packets or bytes that can exceed the limiting. 7) Add netns support to nfacct, from Andreas Schultz. 8) Pass the nf_conn_zone structure instead of the zone ID in nf_tables to allow accessing more zone specific information, from Daniel Borkmann. 9) Allow to define zone per-direction to support netns containers with overlapping network addressing, also from Daniel. 10) Extend the CT target to allow setting the zone based on the skb->mark as a way to support simple mappings from iptables, also from Daniel. 11) Make the nf_tables payload expression aware of the fact that VLAN offload may have removed a vlan header, from Florian Westphal. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-21Merge branch 'master' of ↵Pablo Neira Ayuso
git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next Resolve conflicts with conntrack template fixes. Conflicts: net/netfilter/nf_conntrack_core.c net/netfilter/nf_synproxy_core.c net/netfilter/xt_CT.c Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-08-20ipv6: route: extend flow representation with tunnel keyJiri Benc
Use flowi_tunnel in flowi6 similarly to what is done with IPv4. This complements commit 1b7179d3adff ("route: Extend flow representation with tunnel key"). Signed-off-by: Jiri Benc <jbenc@redhat.com> Acked-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-20ipv6: ndisc: inherit metadata dst when creating ndisc requestsJiri Benc
If output device wants to see the dst, inherit the dst of the original skb in the ndisc request. This is an IPv6 counterpart of commit 0accfc268f4d ("arp: Inherit metadata dst when creating ARP requests"). Signed-off-by: Jiri Benc <jbenc@redhat.com> Acked-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-20ipv6: drop metadata dst in ip6_route_inputJiri Benc
The fix in commit 48fb6b554501 is incomplete, as now ip6_route_input can be called with non-NULL dst if it's a metadata dst and the reference is leaked. Drop the reference. Fixes: 48fb6b554501 ("ipv6: fix crash over flow-based vxlan device") Fixes: ee122c79d422 ("vxlan: Flow based tunneling") CC: Wei-Chun Chao <weichunc@plumgrid.com> CC: Thomas Graf <tgraf@suug.ch> Signed-off-by: Jiri Benc <jbenc@redhat.com> Acked-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-20route: move lwtunnel state to dst_entryJiri Benc
Currently, the lwtunnel state resides in per-protocol data. This is a problem if we encapsulate ipv6 traffic in an ipv4 tunnel (or vice versa). The xmit function of the tunnel does not know whether the packet has been routed to it by ipv4 or ipv6, yet it needs the lwtstate data. Moving the lwtstate data to dst_entry makes such inter-protocol tunneling possible. As a bonus, this brings a nice diffstat. Signed-off-by: Jiri Benc <jbenc@redhat.com> Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com> Acked-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-17net: Identifier Locator Addressing moduleTom Herbert
Adding new module name ila. This implements ILA translation. Light weight tunnel redirection is used to perform the translation in the data path. This is configured by the "ip -6 route" command using the "encap ila <locator>" option, where <locator> is the value to set in destination locator of the packet. e.g. ip -6 route add 3333:0:0:1:5555:0:1:0/128 \ encap ila 2001:0:0:1 via 2401:db00:20:911a:face:0:25:0 Sets a route where 3333:0:0:1 will be overwritten by 2001:0:0:1 on output. Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-17net: Change pseudohdr argument of inet_proto_csum_replace* to be a boolTom Herbert
inet_proto_csum_replace4,2,16 take a pseudohdr argument which indicates the checksum field carries a pseudo header. This argument should be a boolean instead of an int. Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-17lwt: Add support to redirect dst.inputTom Herbert
This patch adds the capability to redirect dst input in the same way that dst output is redirected by LWT. Also, save the original dst.input and and dst.out when setting up lwtunnel redirection. These can be called by the client as a pass- through. Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-18netfilter: nf_conntrack: add efficient mark to zone mappingDaniel Borkmann
This work adds the possibility of deriving the zone id from the skb->mark field in a scalable manner. This allows for having only a single template serving hundreds/thousands of different zones, for example, instead of the need to have one match for each zone as an extra CT jump target. Note that we'd need to have this information attached to the template as at the time when we're trying to lookup a possible ct object, we already need to know zone information for a possible match when going into __nf_conntrack_find_get(). This work provides a minimal implementation for a possible mapping. In order to not add/expose an extra ct->status bit, the zone structure has been extended to carry a flag for deriving the mark. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-08-18netfilter: nf_conntrack: add direction support for zonesDaniel Borkmann
This work adds a direction parameter to netfilter zones, so identity separation can be performed only in original/reply or both directions (default). This basically opens up the possibility of doing NAT with conflicting IP address/port tuples from multiple, isolated tenants on a host (e.g. from a netns) without requiring each tenant to NAT twice resp. to use its own dedicated IP address to SNAT to, meaning overlapping tuples can be made unique with the zone identifier in original direction, where the NAT engine will then allocate a unique tuple in the commonly shared default zone for the reply direction. In some restricted, local DNAT cases, also port redirection could be used for making the reply traffic unique w/o requiring SNAT. The consensus we've reached and discussed at NFWS and since the initial implementation [1] was to directly integrate the direction meta data into the existing zones infrastructure, as opposed to the ct->mark approach we proposed initially. As we pass the nf_conntrack_zone object directly around, we don't have to touch all call-sites, but only those, that contain equality checks of zones. Thus, based on the current direction (original or reply), we either return the actual id, or the default NF_CT_DEFAULT_ZONE_ID. CT expectations are direction-agnostic entities when expectations are being compared among themselves, so we can only use the identifier in this case. Note that zone identifiers can not be included into the hash mix anymore as they don't contain a "stable" value that would be equal for both directions at all times, f.e. if only zone->id would unconditionally be xor'ed into the table slot hash, then replies won't find the corresponding conntracking entry anymore. If no particular direction is specified when configuring zones, the behaviour is exactly as we expect currently (both directions). Support has been added for the CT netlink interface as well as the x_tables raw CT target, which both already offer existing interfaces to user space for the configuration of zones. Below a minimal, simplified collision example (script in [2]) with netperf sessions: +--- tenant-1 ---+ mark := 1 | netperf |--+ +----------------+ | CT zone := mark [ORIGINAL] [ip,sport] := X +--------------+ +--- gateway ---+ | mark routing |--| SNAT |-- ... + +--------------+ +---------------+ | +--- tenant-2 ---+ | ~~~|~~~ | netperf |--+ +-----------+ | +----------------+ mark := 2 | netserver |------ ... + [ip,sport] := X +-----------+ [ip,port] := Y On the gateway netns, example: iptables -t raw -A PREROUTING -j CT --zone mark --zone-dir ORIGINAL iptables -t nat -A POSTROUTING -o <dev> -j SNAT --to-source <ip> --random-fully iptables -t mangle -A PREROUTING -m conntrack --ctdir ORIGINAL -j CONNMARK --save-mark iptables -t mangle -A POSTROUTING -m conntrack --ctdir REPLY -j CONNMARK --restore-mark conntrack dump from gateway netns: netperf -H 10.1.1.2 -t TCP_STREAM -l60 -p12865,5555 from each tenant netns tcp 6 431995 ESTABLISHED src=40.1.1.1 dst=10.1.1.2 sport=5555 dport=12865 zone-orig=1 src=10.1.1.2 dst=10.1.1.1 sport=12865 dport=1024 [ASSURED] mark=1 secctx=system_u:object_r:unlabeled_t:s0 use=1 tcp 6 431994 ESTABLISHED src=40.1.1.1 dst=10.1.1.2 sport=5555 dport=12865 zone-orig=2 src=10.1.1.2 dst=10.1.1.1 sport=12865 dport=5555 [ASSURED] mark=2 secctx=system_u:object_r:unlabeled_t:s0 use=1 tcp 6 299 ESTABLISHED src=40.1.1.1 dst=10.1.1.2 sport=39438 dport=33768 zone-orig=1 src=10.1.1.2 dst=10.1.1.1 sport=33768 dport=39438 [ASSURED] mark=1 secctx=system_u:object_r:unlabeled_t:s0 use=1 tcp 6 300 ESTABLISHED src=40.1.1.1 dst=10.1.1.2 sport=32889 dport=40206 zone-orig=2 src=10.1.1.2 dst=10.1.1.1 sport=40206 dport=32889 [ASSURED] mark=2 secctx=system_u:object_r:unlabeled_t:s0 use=2 Taking this further, test script in [2] creates 200 tenants and runs original-tuple colliding netperf sessions each. A conntrack -L dump in the gateway netns also confirms 200 overlapping entries, all in ESTABLISHED state as expected. I also did run various other tests with some permutations of the script, to mention some: SNAT in random/random-fully/persistent mode, no zones (no overlaps), static zones (original, reply, both directions), etc. [1] http://thread.gmane.org/gmane.comp.security.firewalls.netfilter.devel/57412/ [2] https://paste.fedoraproject.org/242835/65657871/ Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-08-17ipv6: trivial whitespace fixIan Morris
Change brace placement to be in line with coding standards Signed-off-by: Ian Morris <ipm@chirality.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-17ipv6: Fix a potential deadlock when creating pcpu rtMartin KaFai Lau
rt6_make_pcpu_route() is called under read_lock(&table->tb6_lock). rt6_make_pcpu_route() calls ip6_rt_pcpu_alloc(rt) which then calls dst_alloc(). dst_alloc() _may_ call ip6_dst_gc() which takes the write_lock(&tabl->tb6_lock). A visualized version: read_lock(&table->tb6_lock); rt6_make_pcpu_route(); => ip6_rt_pcpu_alloc(); => dst_alloc(); => ip6_dst_gc(); => write_lock(&table->tb6_lock); /* oops */ The fix is to do a read_unlock first before calling ip6_rt_pcpu_alloc(). A reported stack: [141625.537638] INFO: rcu_sched self-detected stall on CPU { 27} (t=60000 jiffies g=4159086 c=4159085 q=2139) [141625.547469] Task dump for CPU 27: [141625.550881] mtr R running task 0 22121 22081 0x00000008 [141625.558069] 0000000000000000 ffff88103f363d98 ffffffff8106e488 000000000000001b [141625.565641] ffffffff81684900 ffff88103f363db8 ffffffff810702b0 0000000008000000 [141625.573220] ffffffff81684900 ffff88103f363de8 ffffffff8108df9f ffff88103f375a00 [141625.580803] Call Trace: [141625.583345] <IRQ> [<ffffffff8106e488>] sched_show_task+0xc1/0xc6 [141625.589650] [<ffffffff810702b0>] dump_cpu_task+0x35/0x39 [141625.595144] [<ffffffff8108df9f>] rcu_dump_cpu_stacks+0x6a/0x8c [141625.601320] [<ffffffff81090606>] rcu_check_callbacks+0x1f6/0x5d4 [141625.607669] [<ffffffff810940c8>] update_process_times+0x2a/0x4f [141625.613925] [<ffffffff8109fbee>] tick_sched_handle+0x32/0x3e [141625.619923] [<ffffffff8109fc2f>] tick_sched_timer+0x35/0x5c [141625.625830] [<ffffffff81094a1f>] __hrtimer_run_queues+0x8f/0x18d [141625.632171] [<ffffffff81094c9e>] hrtimer_interrupt+0xa0/0x166 [141625.638258] [<ffffffff8102bf2a>] local_apic_timer_interrupt+0x4e/0x52 [141625.645036] [<ffffffff8102c36f>] smp_apic_timer_interrupt+0x39/0x4a [141625.651643] [<ffffffff8140b9e8>] apic_timer_interrupt+0x68/0x70 [141625.657895] <EOI> [<ffffffff81346ee8>] ? dst_destroy+0x7c/0xb5 [141625.664188] [<ffffffff813d45b5>] ? fib6_flush_trees+0x20/0x20 [141625.670272] [<ffffffff81082b45>] ? queue_write_lock_slowpath+0x60/0x6f [141625.677140] [<ffffffff8140aa33>] _raw_write_lock_bh+0x23/0x25 [141625.683218] [<ffffffff813d4553>] __fib6_clean_all+0x40/0x82 [141625.689124] [<ffffffff813d45b5>] ? fib6_flush_trees+0x20/0x20 [141625.695207] [<ffffffff813d6058>] fib6_clean_all+0xe/0x10 [141625.700854] [<ffffffff813d60d3>] fib6_run_gc+0x79/0xc8 [141625.706329] [<ffffffff813d0510>] ip6_dst_gc+0x85/0xf9 [141625.711718] [<ffffffff81346d68>] dst_alloc+0x55/0x159 [141625.717105] [<ffffffff813d09b5>] __ip6_dst_alloc.isra.32+0x19/0x63 [141625.723620] [<ffffffff813d1830>] ip6_pol_route+0x36a/0x3e8 [141625.729441] [<ffffffff813d18d6>] ip6_pol_route_output+0x11/0x13 [141625.735700] [<ffffffff813f02c8>] fib6_rule_action+0xa7/0x1bf [141625.741698] [<ffffffff813d18c5>] ? ip6_pol_route_input+0x17/0x17 [141625.748043] [<ffffffff81357c48>] fib_rules_lookup+0xb5/0x12a [141625.754050] [<ffffffff81141628>] ? poll_select_copy_remaining+0xf9/0xf9 [141625.761002] [<ffffffff813f0535>] fib6_rule_lookup+0x37/0x5c [141625.766914] [<ffffffff813d18c5>] ? ip6_pol_route_input+0x17/0x17 [141625.773260] [<ffffffff813d008c>] ip6_route_output+0x7a/0x82 [141625.779177] [<ffffffff813c44c8>] ip6_dst_lookup_tail+0x53/0x112 [141625.785437] [<ffffffff813c45c3>] ip6_dst_lookup_flow+0x2a/0x6b [141625.791604] [<ffffffff813ddaab>] rawv6_sendmsg+0x407/0x9b6 [141625.797423] [<ffffffff813d7914>] ? do_ipv6_setsockopt.isra.8+0xd87/0xde2 [141625.804464] [<ffffffff8139d4b4>] inet_sendmsg+0x57/0x8e [141625.810028] [<ffffffff81329ba3>] sock_sendmsg+0x2e/0x3c [141625.815588] [<ffffffff8132be57>] SyS_sendto+0xfe/0x143 [141625.821063] [<ffffffff813dd551>] ? rawv6_setsockopt+0x5e/0x67 [141625.827146] [<ffffffff8132c9f8>] ? sock_common_setsockopt+0xf/0x11 [141625.833660] [<ffffffff8132c08c>] ? SyS_setsockopt+0x81/0xa2 [141625.839565] [<ffffffff8140ac17>] entry_SYSCALL_64_fastpath+0x12/0x6a Fixes: d52d3997f843 ("pv6: Create percpu rt6_info") Signed-off-by: Martin KaFai Lau <kafai@fb.com> CC: Hannes Frederic Sowa <hannes@stressinduktion.org> Reported-by: Steinar H. Gunderson <sgunderson@bigfoot.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-17ipv6: Add rt6_make_pcpu_route()Martin KaFai Lau
It is a prep work for fixing a potential deadlock when creating a pcpu rt. The current rt6_get_pcpu_route() will also create a pcpu rt if one does not exist. This patch moves the pcpu rt creation logic into another function, rt6_make_pcpu_route(). Signed-off-by: Martin KaFai Lau <kafai@fb.com> CC: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-17ipv6: Remove un-used argument from ip6_dst_alloc()Martin KaFai Lau
After 4b32b5ad31a6 ("ipv6: Stop rt6_info from using inet_peer's metrics"), ip6_dst_alloc() does not need the 'table' argument. This patch cleans it up. Signed-off-by: Martin KaFai Lau <kafai@fb.com> CC: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-17Merge branch 'master' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next Steffen Klassert says: ==================== pull request (net-next): ipsec-next 2015-08-17 1) Fix IPv6 ECN decapsulation for IPsec interfamily tunnels. From Thomas Egerer. 2) Use kmemdup instead of duplicating it in xfrm_dump_sa(). From Andrzej Hajda. 3) Pass oif to the xfrm lookups so that it gets set on the flow and the resolver routines can match based on oif. From David Ahern. 4) Add documentation for the new xfrm garbage collector threshold. From Alexander Duyck. Please pull or let me know if there are problems. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-13net: addr IFLA_OPERSTATE to netlink message for ipv6 ifinfoAndy Gospodarek
This is useful information to include in ipv6 netlink messages that report interface information. IFLA_OPERSTATE is already included in ipv4 messages, but missing for ipv6. This closes that gap. Signed-off-by: Andy Gospodarek <gospo@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-13net: ipv6 sysctl option to ignore routes when nexthop link is downAndy Gospodarek
Like the ipv4 patch with a similar title, this adds a sysctl to allow the user to change routing behavior based on whether or not the interface associated with the nexthop was an up or down link. The default setting preserves the current behavior, but anyone that enables it will notice that nexthops on down interfaces will no longer be selected: net.ipv6.conf.all.ignore_routes_with_linkdown = 0 net.ipv6.conf.default.ignore_routes_with_linkdown = 0 net.ipv6.conf.lo.ignore_routes_with_linkdown = 0 ... When the above sysctls are set, not only will link status be reported to userspace, but an indication that a nexthop is dead and will not be used is also reported. 1000::/8 via 7000::2 dev p7p1 metric 1024 dead linkdown pref medium 1000::/8 via 8000::2 dev p8p1 metric 1024 pref medium 7000::/8 dev p7p1 proto kernel metric 256 dead linkdown pref medium 8000::/8 dev p8p1 proto kernel metric 256 pref medium 9000::/8 via 8000::2 dev p8p1 metric 2048 pref medium 9000::/8 via 7000::2 dev p7p1 metric 1024 dead linkdown pref medium fe80::/64 dev p7p1 proto kernel metric 256 dead linkdown pref medium fe80::/64 dev p8p1 proto kernel metric 256 pref medium This also adds devconf support and notification when sysctl values change. v2: drop use of rt6i_nhflags since it is not needed right now Signed-off-by: Andy Gospodarek <gospo@cumulusnetworks.com> Signed-off-by: Dinesh Dutt <ddutt@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-13net: track link status of ipv6 nexthopsAndy Gospodarek
Add support to track current link status of ipv6 nexthops to match recent changes that added support for ipv4 nexthops. This takes a simple approach to track linkdown status for next-hops and simply checks the dev for the dst entry and sets proper flags that to be used in the netlink message. v2: drop use of rt6i_nhflags since it is not needed right now Signed-off-by: Andy Gospodarek <gospo@cumulusnetworks.com> Signed-off-by: Dinesh Dutt <ddutt@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-13net: fix wrong skb_get() usage / crash in IGMP/MLD parsing codeLinus Lüssing
The recent refactoring of the IGMP and MLD parsing code into ipv6_mc_check_mld() / ip_mc_check_igmp() introduced a potential crash / BUG() invocation for bridges: I wrongly assumed that skb_get() could be used as a simple reference counter for an skb which is not the case. skb_get() bears additional semantics, a user count. This leads to a BUG() invocation in pskb_expand_head() / kernel panic if pskb_may_pull() is called on an skb with a user count greater than one - unfortunately the refactoring did just that. Fixing this by removing the skb_get() call and changing the API: The caller of ipv6_mc_check_mld() / ip_mc_check_igmp() now needs to additionally check whether the returned skb_trimmed is a clone. Fixes: 9afd85c9e455 ("net: Export IGMP/MLD message validation code") Reported-by: Brenden Blanco <bblanco@plumgrid.com> Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-13Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Conflicts: drivers/net/ethernet/cavium/Kconfig The cavium conflict was overlapping dependency changes. Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-11xfrm: Add oif to dst lookupsDavid Ahern
Rules can be installed that direct route lookups to specific tables based on oif. Plumb the oif through the xfrm lookups so it gets set in the flow struct and passed to the resolver routines. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2015-08-11xfrm6: Fix IPv6 ECN decapsulationThomas Egerer
Using ipv6_get_dsfield on the outer IP header implies that inner and outer header are of the the same address family. For interfamily tunnels, particularly 646, the code reading the DSCP field obtains the wrong values (IHL and the upper four bits of the DSCP field). This can cause the code to detect a congestion encoutered state in the outer header and enable the corresponding bits in the inner header, too. Since the DSCP field is stored in the xfrm mode common buffer independently from the IP version of the outer header, it's safe (and correct) to take this value from there. Signed-off-by: Thomas Egerer <thomas.egerer@secunet.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2015-08-11netfilter: nf_conntrack: push zone object into functionsDaniel Borkmann
This patch replaces the zone id which is pushed down into functions with the actual zone object. It's a bigger one-time change, but needed for later on extending zones with a direction parameter, and thus decoupling this additional information from all call-sites. No functional changes in this patch. The default zone becomes a global const object, namely nf_ct_zone_dflt and will be returned directly in various cases, one being, when there's f.e. no zoning support. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-08-10inet: fix possible request socket leakEric Dumazet
In commit b357a364c57c9 ("inet: fix possible panic in reqsk_queue_unlink()"), I missed fact that tcp_check_req() can return the listener socket in one case, and that we must release the request socket refcount or we leak it. Tested: Following packetdrill test template shows the issue 0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3 +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 +0 bind(3, ..., ...) = 0 +0 listen(3, 1) = 0 +0 < S 0:0(0) win 2920 <mss 1460,sackOK,nop,nop> +0 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK> +.002 < . 1:1(0) ack 21 win 2920 +0 > R 21:21(0) Fixes: b357a364c57c9 ("inet: fix possible panic in reqsk_queue_unlink()") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-10Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains five Netfilter fixes for your net tree, they are: 1) Silence a warning on falling back to vmalloc(). Since 88eab472ec21, we can easily hit this warning message, that gets users confused. So let's get rid of it. 2) Recently when porting the template object allocation on top of kmalloc to fix the netns dependencies between x_tables and conntrack, the error checks where left unchanged. Remove IS_ERR() and check for NULL instead. Patch from Dan Carpenter. 3) Don't ignore gfp_flags in the new nf_ct_tmpl_alloc() function, from Joe Stringer. 4) Fix a crash due to NULL pointer dereference in ip6t_SYNPROXY, patch from Phil Sutter. 5) The sequence number of the Syn+ack that is sent from SYNPROXY to clients is not adjusted through our NAT infrastructure, as a result the client may ignore this TCP packet and TCP flow hangs until the client probes us. Also from Phil Sutter. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-10ip_gre: Add support to collect tunnel metadata.Pravin B Shelar
Following patch create new tunnel flag which enable tunnel metadata collection on given device. Signed-off-by: Pravin B Shelar <pshelar@nicira.com> Acked-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-10ipv6: don't reject link-local nexthop on other interfaceFlorian Westphal
48ed7b26faa7 ("ipv6: reject locally assigned nexthop addresses") is too strict; it rejects following corner-case: ip -6 route add default via fe80::1:2:3 dev eth1 [ where fe80::1:2:3 is assigned to a local interface, but not eth1 ] Fix this by restricting search to given device if nh is linklocal. Joint work with Hannes Frederic Sowa. Fixes: 48ed7b26faa7 ("ipv6: reject locally assigned nexthop addresses") Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-10netfilter: SYNPROXY: fix sending window update to clientPhil Sutter
Upon receipt of SYNACK from the server, ipt_SYNPROXY first sends back an ACK to finish the server handshake, then calls nf_ct_seqadj_init() to initiate sequence number adjustment of forwarded packets to the client and finally sends a window update to the client to unblock it's TX queue. Since synproxy_send_client_ack() does not set synproxy_send_tcp()'s nfct parameter, no sequence number adjustment happens and the client receives the window update with incorrect sequence number. Depending on client TCP implementation, this leads to a significant delay (until a window probe is being sent). Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-08-10netfilter: ip6t_SYNPROXY: fix NULL pointer dereferencePhil Sutter
This happens when networking namespaces are enabled. Suggested-by: Patrick McHardy <kaber@trash.net> Signed-off-by: Phil Sutter <phil@nwl.cc> Acked-by: Patrick McHardy <kaber@trash.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-08-07netfilter: nf_tables: add nft_dup expressionPablo Neira Ayuso
This new expression uses the nf_dup engine to clone packets to a given gateway. Unlike xt_TEE, we use an index to indicate output interface which should be fine at this stage. Moreover, change to the preemtion-safe this_cpu_read(nf_skb_duplicated) from nf_dup_ipv{4,6} to silence a lockdep splat. Based on the original tee expression from Arturo Borrero Gonzalez, although this patch has diverted quite a bit from this initial effort due to the change to support maps. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-08-07netfilter: factor out packet duplication for IPv4/IPv6Pablo Neira Ayuso
Extracted from the xtables TEE target. This creates two new modules for IPv4 and IPv6 that are shared between the TEE target and the new nf_tables dup expressions. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-08-04Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following patchset contains Netfilter updates for net-next, they are: 1) A couple of cleanups for the netfilter core hook from Eric Biederman. 2) Net namespace hook registration, also from Eric. This adds a dependency with the rtnl_lock. This should be fine by now but we have to keep an eye on this because if we ever get the per-subsys nfnl_lock before rtnl we have may problems in the future. But we have room to remove this in the future by propagating the complexity to the clients, by registering hooks for the init netns functions. 3) Update nf_tables to use the new net namespace hook infrastructure, also from Eric. 4) Three patches to refine and to address problems from the new net namespace hook infrastructure. 5) Switch to alternate jumpstack in xtables iff the packet is reentering. This only applies to a very special case, the TEE target, but Eric Dumazet reports that this is slowing down things for everyone else. So let's only switch to the alternate jumpstack if the tee target is in used through a static key. This batch also comes with offline precalculation of the jumpstack based on the callchain depth. From Florian Westphal. 6) Minimal SCTP multihoming support for our conntrack helper, from Michal Kubecek. 7) Reduce nf_bridge_info per skbuff scratchpad area to 32 bytes, from Florian Westphal. 8) Fix several checkpatch errors in bridge netfilter, from Bernhard Thaler. 9) Get rid of useless debug message in ip6t_REJECT, from Subash Abhinov. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-04netfilter: ip6t_REJECT: Remove debug messages from reject_tg6()Subash Abhinov Kasiviswanathan
Make it similar to reject_tg() in ipt_REJECT. Suggested-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-07-31Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Conflicts: arch/s390/net/bpf_jit_comp.c drivers/net/ethernet/ti/netcp_ethss.c net/bridge/br_multicast.c net/ipv4/ip_fragment.c All four conflicts were cases of simple overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net>