summaryrefslogtreecommitdiff
path: root/net/ipv6
AgeCommit message (Collapse)Author
2016-07-16net: ipmr/ip6mr: add support for keeping an entry ageNikolay Aleksandrov
In preparation for hardware offloading of ipmr/ip6mr we need an interface that allows to check (and later update) the age of entries. Relying on stats alone can show activity but not actual age of the entry, furthermore when there're tens of thousands of entries a lot of the hardware implementations only support "hit" bits which are cleared on read to denote that the entry was active and shouldn't be aged out, these can then be naturally translated into age timestamp and will be compatible with the software forwarding age. Using a lastuse entry doesn't affect performance because the members in that cache line are written to along with the age. Since all new users are encouraged to use ipmr via netlink, this is exported via the RTA_EXPIRES attribute. Also do a minor local variable declaration style adjustment - arrange them longest to shortest. Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> CC: Roopa Prabhu <roopa@cumulusnetworks.com> CC: Shrijeet Mukherjee <shm@cumulusnetworks.com> CC: Satish Ashok <sashok@cumulusnetworks.com> CC: Donald Sharp <sharpd@cumulusnetworks.com> CC: David S. Miller <davem@davemloft.net> CC: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> CC: James Morris <jmorris@namei.org> CC: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> CC: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-09ipv6: do not abuse GFP_ATOMIC in inet6_netconf_notify_devconf()Eric Dumazet
All inet6_netconf_notify_devconf() callers are in process context, so we can use GFP_KERNEL allocations if we take care of not holding a rwlock while not needed in ip6mr (we hold RTNL there) Fixes: d67b8c616b48 ("netconf: advertise mc_forwarding status") Fixes: f3a1bfb11ccb ("rtnl/ipv6: use netconf msg to advertise forwarding status") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com> Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-09sit: support MPLS over IPv4Simon Horman
Extend the SIT driver to support MPLS over IPv4. This implementation extends existing support for IPv6 over IPv4 and IPv4 over IPv4. Signed-off-by: Simon Horman <simon.horman@netronome.com> Reviewed-by: Dinan Gunawardena <dinan.gunawardena@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-06Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Conflicts: drivers/net/ethernet/mellanox/mlx5/core/en.h drivers/net/ethernet/mellanox/mlx5/core/en_main.c drivers/net/usb/r8152.c All three conflicts were overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-06Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following patchset contains Netfilter updates for net-next, they are: 1) Don't use userspace datatypes in bridge netfilter code, from Tobin Harding. 2) Iterate only once over the expectation table when removing the helper module, instead of once per-netns, from Florian Westphal. 3) Extra sanitization in xt_hook_ops_alloc() to return error in case we ever pass zero hooks, xt_hook_ops_alloc(): 4) Handle NFPROTO_INET from the logging core infrastructure, from Liping Zhang. 5) Autoload loggers when TRACE target is used from rules, this doesn't change the behaviour in case the user already selected nfnetlink_log as preferred way to print tracing logs, also from Liping Zhang. 6) Conntrack slabs with SLAB_HWCACHE_ALIGN to allow rearranging fields by cache lines, increases the size of entries in 11% per entry. From Florian Westphal. 7) Skip zone comparison if CONFIG_NF_CONNTRACK_ZONES=n, from Florian. 8) Remove useless defensive check in nf_logger_find_get() from Shivani Bhardwaj. 9) Remove zone extension as place it in the conntrack object, this is always include in the hashing and we expect more intensive use of zones since containers are in place. Also from Florian Westphal. 10) Owner match now works from any namespace, from Eric Bierdeman. 11) Make sure we only reply with TCP reset to TCP traffic from nf_reject_ipv4, patch from Liping Zhang. 12) Introduce --nflog-size to indicate amount of network packet bytes that are copied to userspace via log message, from Vishwanath Pai. This obsoletes --nflog-range that has never worked, it was designed to achieve this but it has never worked. 13) Introduce generic macros for nf_tables object generation masks. 14) Use generation mask in table, chain and set objects in nf_tables. This allows fixes interferences with ongoing preparation phase of the commit protocol and object listings going on at the same time. This update is introduced in three patches, one per object. 15) Check if the object is active in the next generation for element deactivation in the rbtree implementation, given that deactivation happens from the commit phase path we have to observe the future status of the object. 16) Support for deletion of just added elements in the hash set type. 17) Allow to resize hashtable from /proc entry, not only from the obscure /sys entry that maps to the module parameter, from Florian Westphal. 18) Get rid of NFT_BASECHAIN_DISABLED, this code is not exercised anymore since we tear down the ruleset whenever the netdevice goes away. 19) Support for matching inverted set lookups, from Arturo Borrero. 20) Simplify the iptables_mangle_hook() by removing a superfluous extra branch. 21) Introduce ether_addr_equal_masked() and use it from the netfilter codebase, from Joe Perches. 22) Remove references to "Use netfilter MARK value as routing key" from the Netfilter Kconfig description given that this toggle doesn't exists already for 10 years, from Moritz Sichert. 23) Introduce generic NF_INVF() and use it from the xtables codebase, from Joe Perches. 24) Setting logger to NONE via /proc was not working unless explicit nul-termination was included in the string. This fixes seems to leave the former behaviour there, so we don't break backward. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-05ipv6: Fix mem leak in rt6i_pcpuMartin KaFai Lau
It was first reported and reproduced by Petr (thanks!) in https://bugzilla.kernel.org/show_bug.cgi?id=119581 free_percpu(rt->rt6i_pcpu) used to always happen in ip6_dst_destroy(). However, after fixing a deadlock bug in commit 9c7370a166b4 ("ipv6: Fix a potential deadlock when creating pcpu rt"), free_percpu() is not called before setting non_pcpu_rt->rt6i_pcpu to NULL. It is worth to note that rt6i_pcpu is protected by table->tb6_lock. kmemleak somehow did not report it. We nailed it down by observing the pcpu entries in /proc/vmallocinfo (first suggested by Hannes, thanks!). Signed-off-by: Martin KaFai Lau <kafai@fb.com> Fixes: 9c7370a166b4 ("ipv6: Fix a potential deadlock when creating pcpu rt") Reported-by: Petr Novopashenniy <pety@rusnet.ru> Tested-by: Petr Novopashenniy <pety@rusnet.ru> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Cc: Hannes Frederic Sowa <hannes@stressinduktion.org> Cc: Petr Novopashenniy <pety@rusnet.ru> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-03netfilter: Convert FWINV<[foo]> macros and uses to NF_INVFJoe Perches
netfilter uses multiple FWINV #defines with identical form that hide a specific structure variable and dereference it with a invflags member. $ git grep "#define FWINV" include/linux/netfilter_bridge/ebtables.h:#define FWINV(bool,invflg) ((bool) ^ !!(info->invflags & invflg)) net/bridge/netfilter/ebtables.c:#define FWINV2(bool, invflg) ((bool) ^ !!(e->invflags & invflg)) net/ipv4/netfilter/arp_tables.c:#define FWINV(bool, invflg) ((bool) ^ !!(arpinfo->invflags & (invflg))) net/ipv4/netfilter/ip_tables.c:#define FWINV(bool, invflg) ((bool) ^ !!(ipinfo->invflags & (invflg))) net/ipv6/netfilter/ip6_tables.c:#define FWINV(bool, invflg) ((bool) ^ !!(ip6info->invflags & (invflg))) net/netfilter/xt_tcpudp.c:#define FWINVTCP(bool, invflg) ((bool) ^ !!(tcpinfo->invflags & (invflg))) Consolidate these macros into a single NF_INVF macro. Miscellanea: o Neaten the alignment around these uses o A few lines are > 80 columns for intelligibility Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-07-01netfilter: x_tables: simplify ip{6}table_mangle_hook()Pablo Neira Ayuso
No need for a special case to handle NF_INET_POST_ROUTING, this is basically the same handling as for prerouting, input, forward. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-07-01tcp: md5: use kmalloc() backed scratch areasEric Dumazet
Some arches have virtually mapped kernel stacks, or will soon have. tcp_md5_hash_header() uses an automatic variable to copy tcp header before mangling th->check and calling crypto function, which might be problematic on such arches. David says that using percpu storage is also problematic on non SMP builds. Just use kmalloc() to allocate scratch areas. Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Andy Lutomirski <luto@amacapital.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-30Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Several cases of overlapping changes, except the packet scheduler conflicts which deal with the addition of the free list parameter to qdisc_enqueue(). Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-28ipmr/ip6mr: Initialize the last assert time of mfc entries.Tom Goff
This fixes wrong-interface signaling on 32-bit platforms for entries created when jiffies > 2^31 + MFC_ASSERT_THRESH. Signed-off-by: Tom Goff <thomas.goff@ll.mit.edu> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-27ipv6: enforce egress device match in per table nexthop lookupsPaolo Abeni
with the commit 8c14586fc320 ("net: ipv6: Use passed in table for nexthop lookups"), net hop lookup is first performed on route creation in the passed-in table. However device match is not enforced in table lookup, so the found route can be later discarded due to egress device mismatch and no global lookup will be performed. This cause the following to fail: ip link add dummy1 type dummy ip link add dummy2 type dummy ip link set dummy1 up ip link set dummy2 up ip route add 2001:db8:8086::/48 dev dummy1 metric 20 ip route add 2001:db8:d34d::/64 via 2001:db8:8086::2 dev dummy1 metric 20 ip route add 2001:db8:8086::/48 dev dummy2 metric 21 ip route add 2001:db8:d34d::/64 via 2001:db8:8086::2 dev dummy2 metric 21 RTNETLINK answers: No route to host This change fixes the issue enforcing device lookup in ip6_nh_lookup_table() v1->v2: updated commit message title Fixes: 8c14586fc320 ("net: ipv6: Use passed in table for nexthop lookups") Reported-and-tested-by: Beniamino Galvani <bgalvani@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Acked-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-18ipv6: RFC 4884 partial support for SIT/GRE tunnelsEric Dumazet
When receiving an ICMPv4 message containing extensions as defined in RFC 4884, and translating it to ICMPv6 at SIT or GRE tunnel, we need some extra manipulation in order to properly forward the extensions. This patch only takes care of Time Exceeded messages as they are the ones that typically carry information from various routers in a fabric during a traceroute session. It also avoids complex skb logic if the data_len is not a multiple of 8. RFC states : The "original datagram" field MUST contain at least 128 octets. If the original datagram did not contain 128 octets, the "original datagram" field MUST be zero padded to 128 octets. In practice routers use 128 bytes of original datagram, not more. Initial translation was added in commit ca15a078bd90 ("sit: generate icmpv6 error when receiving icmpv4 error") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Oussama Ghorbel <ghorbel@pivasoftware.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-18ipv6: translate ICMP_TIME_EXCEEDED to ICMPV6_TIME_EXCEEDEric Dumazet
For better traceroute/mtr support for SIT and GRE tunnels, we translate IPV4 ICMP ICMP_TIME_EXCEEDED to ICMPV6_TIME_EXCEED We also have to translate the IPv4 source IP address of ICMP message to IPv6 v4mapped. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-18ip6: move ipip6_err_gen_icmpv6_unreach()Eric Dumazet
We want to use this helper from GRE as well, so this is the time to move it in net/ipv6/icmp.c Also add a @nhs parameter, since SIT and GRE have different values for the header(s) to skip. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-18ipv6: icmp: add a force_saddr param to icmp6_send()Eric Dumazet
SIT or GRE tunnels might want to translate an IPV4 address into a v4mapped one when translating ICMP to ICMPv6. This patch adds the parameter to icmp6_send() but does not change icmpv6_send() signature. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-17net: ipv6: Address selection needs to consider L3 domainsDavid Ahern
IPv6 version of 3f2fb9a834cb ("net: l3mdev: address selection should only consider devices in L3 domain") and the follow up commit, a17b693cdd876 ("net: l3mdev: prefer VRF master for source address selection"). That is, if outbound device is given then the address preference order is an address from that device, an address from the master device if it is enslaved, and then an address from a device in the same L3 domain. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-17net: vrf: Implement get_saddr for IPv6David Ahern
IPv6 source address selection needs to consider the real egress route. Similar to IPv4 implement a get_saddr6 method which is called if source address has not been set. The get_saddr6 method does a full lookup which means pulling a route from the VRF FIB table and properly considering linklocal/multicast destination addresses. Lookup failures (eg., unreachable) then cause the source address selection to fail which gets propagated back to the caller. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-17net: ipv6: Move ip6_route_get_saddr to inlineDavid Ahern
VRF driver needs access to ip6_route_get_saddr code. Since it does little beyond ipv6_dev_get_saddr and ipv6_dev_get_saddr is already exported for modules move ip6_route_get_saddr to the header as an inline. Code move only; no functional change. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-16net: xfrm: fix old-style declarationArnd Bergmann
Modern C standards expect the '__inline__' keyword to come before the return type in a declaration, and we get a couple of warnings for this with "make W=1" in the xfrm{4,6}_policy.c files: net/ipv6/xfrm6_policy.c:369:1: error: 'inline' is not at beginning of declaration [-Werror=old-style-declaration] static int inline xfrm6_net_sysctl_init(struct net *net) net/ipv6/xfrm6_policy.c:374:1: error: 'inline' is not at beginning of declaration [-Werror=old-style-declaration] static void inline xfrm6_net_sysctl_exit(struct net *net) net/ipv4/xfrm4_policy.c:339:1: error: 'inline' is not at beginning of declaration [-Werror=old-style-declaration] static int inline xfrm4_net_sysctl_init(struct net *net) net/ipv4/xfrm4_policy.c:344:1: error: 'inline' is not at beginning of declaration [-Werror=old-style-declaration] static void inline xfrm4_net_sysctl_exit(struct net *net) Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-16sit: correct IP protocol used in ipip6_errSimon Horman
Since 32b8a8e59c9c ("sit: add IPv4 over IPv4 support") ipip6_err() may be called for packets whose IP protocol is IPPROTO_IPIP as well as those whose IP protocol is IPPROTO_IPV6. In the case of IPPROTO_IPIP packets the correct protocol value is not passed to ipv4_update_pmtu() or ipv4_redirect(). This patch resolves this problem by using the IP protocol of the packet rather than a hard-coded value. This appears to be consistent with the usage of the protocol of a packet by icmp_socket_deliver() the caller of ipip6_err(). I was able to exercise the redirect case by using a setup where an ICMP redirect was received for the destination of the encapsulated packet. However, it appears that although incorrect the protocol field is not used in this case and thus no problem manifests. On inspection it does not appear that a problem will manifest in the fragmentation needed/update pmtu case either. In short I believe this is a cosmetic fix. None the less, the use of IPPROTO_IPV6 seems wrong and confusing. Reviewed-by: Dinan Gunawardena <dinan.gunawardena@netronome.com> Signed-off-by: Simon Horman <simon.horman@netronome.com> Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-15gre: fix error handlerEric Dumazet
1) gre_parse_header() can be called from gre_err() At this point transport header points to ICMP header, not the inner header. 2) We can not really change transport header as ipgre_err() will later assume transport header still points to ICMP header (using icmp_hdr()) 3) pskb_may_pull() logic in gre_parse_header() really works if we are interested at zone pointed by skb->data 4) As Jiri explained in commit b7f8fe251e46 ("gre: do not pull header in ICMP error processing") we should not pull headers in error handler. So this fix : A) changes gre_parse_header() to use skb->data instead of skb_transport_header() B) Adds a nhs parameter to gre_parse_header() so that we can skip the not pulled IP header from error path. This offset is 0 for normal receive path. C) remove obsolete IPV6 includes Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Tom Herbert <tom@herbertland.com> Cc: Maciej Żenczykowski <maze@google.com> Cc: Jiri Benc <jbenc@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-15ila: Fix checksum neutral mappingTom Herbert
The algorithm for checksum neutral mapping is incorrect. This problem was being hidden since we were previously always performing checksum offload on the translated addresses and only with IPv6 HW csum. Enabling an ILA router shows the issue. Corrected algorithm: old_loc is the original locator in the packet, new_loc is the value to overwrite with and is found in the lookup table. old_flag is the old flag value (zero of CSUM_NEUTRAL_FLAG) and new_flag is then (old_flag ^ CSUM_NEUTRAL_FLAG) & CSUM_NEUTRAL_FLAG. Need SUM(new_id + new_flag + diff) == SUM(old_id + old_flag) for checksum neutral translation. Solving for diff gives: diff = (old_id - new_id) + (old_flag - new_flag) compute_csum_diff8(new_id, old_id) gives old_id - new_id If old_flag is set old_flag - new_flag = old_flag = CSUM_NEUTRAL_FLAG Else old_flag - new_flag = -new_flag = ~CSUM_NEUTRAL_FLAG Tested: - Implemented a user space program that creates random addresses and random locators to overwrite. Compares the checksum over the address before and after translation (must always be equal) - Enabled ILA router and showed proper operation. Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-15ipv6: export several functionsAlexander Aring
This patch exports some neighbour discovery functions which can be used by 6lowpan neighbour discovery ops functionality then. Cc: David S. Miller <davem@davemloft.net> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: James Morris <jmorris@namei.org> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: Patrick McHardy <kaber@trash.net> Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com> Signed-off-by: Alexander Aring <aar@pengutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-15ipv6: introduce neighbour discovery opsAlexander Aring
This patch introduces neighbour discovery ops callback structure. The idea is to separate the handling for 6LoWPAN into the 6lowpan module. These callback offers 6lowpan different handling, such as 802.15.4 short address handling or RFC6775 (Neighbor Discovery Optimization for IPv6 over 6LoWPANs). Cc: David S. Miller <davem@davemloft.net> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: James Morris <jmorris@namei.org> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: Patrick McHardy <kaber@trash.net> Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: Alexander Aring <aar@pengutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-15addrconf: put prefix address add in an own functionAlexander Aring
This patch moves the functionality to add a RA PIO prefix generated address in an own function. This move prepares to add a hook for adding a second address for a second link-layer address. E.g. short address for 802.15.4 6LoWPAN. Cc: David S. Miller <davem@davemloft.net> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: James Morris <jmorris@namei.org> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: Patrick McHardy <kaber@trash.net> Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com> Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: Alexander Aring <aar@pengutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-15ndisc: add __ndisc_fill_addr_option functionAlexander Aring
This patch adds __ndisc_fill_addr_option as low-level function for ndisc_fill_addr_option which doesn't depend on net_device parameter. Cc: David S. Miller <davem@davemloft.net> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: James Morris <jmorris@namei.org> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: Patrick McHardy <kaber@trash.net> Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com> Signed-off-by: Alexander Aring <aar@pengutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-156lowpan: add 802.15.4 short addr slaacAlexander Aring
This patch adds the autoconfiguration if a valid 802.15.4 short address is available for 802.15.4 6LoWPAN interfaces. Cc: David S. Miller <davem@davemloft.net> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: James Morris <jmorris@namei.org> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: Patrick McHardy <kaber@trash.net> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com> Signed-off-by: Alexander Aring <aar@pengutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-15net: vrf: Handle ipv6 multicast and link-local addressesDavid Ahern
IPv6 multicast and link-local addresses require special handling by the VRF driver: 1. Rather than using the VRF device index and full FIB lookups, packets to/from these addresses should use direct FIB lookups based on the VRF device table. 2. fail sends/receives on a VRF device to/from a multicast address (e.g, make ping6 ff02::1%<vrf> fail) 3. move the setting of the flow oif to the first dst lookup and revert the change in icmpv6_echo_reply made in ca254490c8dfd ("net: Add VRF support to IPv6 stack"). Linklocal/mcast addresses require use of the skb->dev. With this change connections into and out of a VRF enslaved device work for multicast and link-local addresses work (icmp, tcp, and udp) e.g., 1. packets into VM with VRF config: ping6 -c3 fe80::e0:f9ff:fe1c:b974%br1 ping6 -c3 ff02::1%br1 ssh -6 fe80::e0:f9ff:fe1c:b974%br1 2. packets going out a VRF enslaved device: ping6 -c3 fe80::18f8:83ff:fe4b:7a2e%eth1 ping6 -c3 ff02::1%eth1 ssh -6 root@fe80::18f8:83ff:fe4b:7a2e%eth1 Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-15net: ipv6: Do not add multicast route for l3 master devicesDavid Ahern
L3 master devices are virtual devices similar to the loopback device. Link local and multicast routes for these devices do not make sense. The ipv6 addrconf code already skips adding a linklocal address; do the same for the mcast route. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-14udp reuseport: fix packet of same flow hashed to different socketSu, Xuemin
There is a corner case in which udp packets belonging to a same flow are hashed to different socket when hslot->count changes from 10 to 11: 1) When hslot->count <= 10, __udp_lib_lookup() searches udp_table->hash, and always passes 'daddr' to udp_ehashfn(). 2) When hslot->count > 10, __udp_lib_lookup() searches udp_table->hash2, but may pass 'INADDR_ANY' to udp_ehashfn() if the sockets are bound to INADDR_ANY instead of some specific addr. That means when hslot->count changes from 10 to 11, the hash calculated by udp_ehashfn() is also changed, and the udp packets belonging to a same flow will be hashed to different socket. This is easily reproduced: 1) Create 10 udp sockets and bind all of them to 0.0.0.0:40000. 2) From the same host send udp packets to 127.0.0.1:40000, record the socket index which receives the packets. 3) Create 1 more udp socket and bind it to 0.0.0.0:44096. The number 44096 is 40000 + UDP_HASH_SIZE(4096), this makes the new socket put into the same hslot as the aformentioned 10 sockets, and makes the hslot->count change from 10 to 11. 4) From the same host send udp packets to 127.0.0.1:40000, and the socket index which receives the packets will be different from the one received in step 2. This should not happen as the socket bound to 0.0.0.0:44096 should not change the behavior of the sockets bound to 0.0.0.0:40000. It's the same case for IPv6, and this patch also fixes that. Signed-off-by: Su, Xuemin <suxm@chinanetcenter.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-14ipv6: fix checksum annotation in udp6_csum_initHannes Frederic Sowa
Cc: Tom Herbert <tom@herbertland.com> Fixes: 4068579e1e098fa ("net: Implmement RFC 6936 (zero RX csums for UDP/IPv6") Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-14ipv6: tcp: fix endianness annotation in tcp_v6_send_responseHannes Frederic Sowa
Cc: Florent Fourcot <florent.fourcot@enst-bretagne.fr> Fixes: 1d13a96c74fc ("ipv6: tcp: fix flowlabel value in ACK messages send from TIME_WAIT") Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-14ipv6: fix endianness error in icmpv6_errHannes Frederic Sowa
IPv6 ping socket error handler doesn't correctly convert the new 32 bit mtu to host endianness before using. Cc: Lorenzo Colitti <lorenzo@google.com> Fixes: 6d0bfe22611602f ("net: ipv6: Add IPv6 support to the ping socket.") Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Acked-by: Lorenzo Colitti <lorenzo@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-11ipv6: use TOS marks from sockets for routing decisionHannes Frederic Sowa
In IPv6 the ToS values are part of the flowlabel in flowi6 and get extracted during fib rule lookup, but we forgot to correctly initialize the flowlabel before the routing lookup. Reported-by: <liam.mcbirnie@boeing.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-10Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Conflicts: net/sched/act_police.c net/sched/sch_drr.c net/sched/sch_hfsc.c net/sched/sch_prio.c net/sched/sch_red.c net/sched/sch_tbf.c In net-next the drop methods of the packet schedulers got removed, so the bug fixes to them in 'net' are irrelevant. A packet action unload crash fix conflicts with the addition of the new firstuse timestamp. Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-09net: vrf: Fix crash when IPv6 is disabled at boot timeDavid Ahern
Frank Kellermann reported a kernel crash with 4.5.0 when IPv6 is disabled at boot using the kernel option ipv6.disable=1. Using current net-next with the boot option: $ ip link add red type vrf table 1001 Generates: [12210.919584] BUG: unable to handle kernel NULL pointer dereference at 0000000000000748 [12210.921341] IP: [<ffffffff814b30e3>] fib6_get_table+0x2c/0x5a [12210.922537] PGD b79e3067 PUD bb32b067 PMD 0 [12210.923479] Oops: 0000 [#1] SMP [12210.924001] Modules linked in: ipvlan 8021q garp mrp stp llc [12210.925130] CPU: 3 PID: 1177 Comm: ip Not tainted 4.7.0-rc1+ #235 [12210.926168] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014 [12210.928065] task: ffff8800b9ac4640 ti: ffff8800bacac000 task.ti: ffff8800bacac000 [12210.929328] RIP: 0010:[<ffffffff814b30e3>] [<ffffffff814b30e3>] fib6_get_table+0x2c/0x5a [12210.930697] RSP: 0018:ffff8800bacaf888 EFLAGS: 00010202 [12210.931563] RAX: 0000000000000748 RBX: ffffffff81a9e280 RCX: ffff8800b9ac4e28 [12210.932688] RDX: 00000000000000e9 RSI: 0000000000000002 RDI: 0000000000000286 [12210.933820] RBP: ffff8800bacaf898 R08: ffff8800b9ac4df0 R09: 000000000052001b [12210.934941] R10: 00000000657c0000 R11: 000000000000c649 R12: 00000000000003e9 [12210.936032] R13: 00000000000003e9 R14: ffff8800bace7800 R15: ffff8800bb3ec000 [12210.937103] FS: 00007faa1766c700(0000) GS:ffff88013ac00000(0000) knlGS:0000000000000000 [12210.938321] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [12210.939166] CR2: 0000000000000748 CR3: 00000000b79d6000 CR4: 00000000000406e0 [12210.940278] Stack: [12210.940603] ffff8800bb3ec000 ffffffff81a9e280 ffff8800bacaf8c8 ffffffff814b3135 [12210.941818] ffff8800bb3ec000 ffffffff81a9e280 ffffffff81a9e280 ffff8800bace7800 [12210.943040] ffff8800bacaf8f0 ffffffff81397c88 ffff8800bb3ec000 ffffffff81a9e280 [12210.944288] Call Trace: [12210.944688] [<ffffffff814b3135>] fib6_new_table+0x24/0x8a [12210.945516] [<ffffffff81397c88>] vrf_dev_init+0xd4/0x162 [12210.946328] [<ffffffff814091e1>] register_netdevice+0x100/0x396 [12210.947209] [<ffffffff8139823d>] vrf_newlink+0x40/0xb3 [12210.948001] [<ffffffff814187f0>] rtnl_newlink+0x5d3/0x6d5 ... The problem above is due to the fact that the fib hash table is not allocated when IPv6 is disabled at boot. As for the VRF driver it should not do any IPv6 initializations if IPv6 is disabled, so it needs to know if IPv6 is disabled at boot. The disable parameter is private to the IPv6 module, so provide an accessor for modules to determine if IPv6 was disabled at boot time. Fixes: 35402e3136634 ("net: Add IPv6 support to VRF device") Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-09sit: remove unnecessary protocol check in ipip6_tunnel_xmit()Simon Horman
ipip6_tunnel_xmit() is called immediately after checking that skb->protocol is htons(ETH_P_IPV6) so there is no need to check it a second time. Found by inspection. Signed-off-by: Simon Horman <simon.horman@netronome.com> Reviewed-by: Dinan Gunawardena <dinan.gunawardena@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-08ip6gre: Allow live link address changeShweta Choudaha
The ip6 GRE tap device should not be forced to down state to change the mac address and should allow live address change for tap device similar to ipv4 gre. Signed-off-by: Shweta Choudaha <schoudah@brocade.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-08ip6gre: Allow live link address changeShweta Choudaha
The ip6 GRE tap device should not be forced to down state to change the mac address and should allow live address change for tap device similar to ipv4 gre. Signed-off-by: Shweta Choudaha <schoudah@brocade.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-08net: Add l3mdev ruleDavid Ahern
Currently, VRFs require 1 oif and 1 iif rule per address family per VRF. As the number of VRF devices increases it brings scalability issues with the increasing rule list. All of the VRF rules have the same format with the exception of the specific table id to direct the lookup. Since the table id is available from the oif or iif in the loopup, the VRF rules can be consolidated to a single rule that pulls the table from the VRF device. This patch introduces a new rule attribute l3mdev. The l3mdev rule means the table id used for the lookup is pulled from the L3 master device (e.g., VRF) rather than being statically defined. With the l3mdev rule all of the basic VRF FIB rules are reduced to 1 l3mdev rule per address family (IPv4 and IPv6). If an admin wishes to insert higher priority rules for specific VRFs those rules will co-exist with the l3mdev rule. This capability means current VRF scripts will co-exist with this new simpler implementation. Currently, the rules list for both ipv4 and ipv6 look like this: $ ip ru ls 1000: from all oif vrf1 lookup 1001 1000: from all iif vrf1 lookup 1001 1000: from all oif vrf2 lookup 1002 1000: from all iif vrf2 lookup 1002 1000: from all oif vrf3 lookup 1003 1000: from all iif vrf3 lookup 1003 1000: from all oif vrf4 lookup 1004 1000: from all iif vrf4 lookup 1004 1000: from all oif vrf5 lookup 1005 1000: from all iif vrf5 lookup 1005 1000: from all oif vrf6 lookup 1006 1000: from all iif vrf6 lookup 1006 1000: from all oif vrf7 lookup 1007 1000: from all iif vrf7 lookup 1007 1000: from all oif vrf8 lookup 1008 1000: from all iif vrf8 lookup 1008 ... 32765: from all lookup local 32766: from all lookup main 32767: from all lookup default With the l3mdev rule the list is just the following regardless of the number of VRFs: $ ip ru ls 1000: from all lookup [l3mdev table] 32765: from all lookup local 32766: from all lookup main 32767: from all lookup default (Note: the above pretty print of the rule is based on an iproute2 prototype. Actual verbage may change) Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-08ipv6: Skip XFRM lookup if dst_entry in socket cache is validJakub Sitnicki
At present we perform an xfrm_lookup() for each UDPv6 message we send. The lookup involves querying the flow cache (flow_cache_lookup) and, in case of a cache miss, creating an XFRM bundle. If we miss the flow cache, we can end up creating a new bundle and deriving the path MTU (xfrm_init_pmtu) from on an already transformed dst_entry, which we pass from the socket cache (sk->sk_dst_cache) down to xfrm_lookup(). This can happen only if we're caching the dst_entry in the socket, that is when we're using a connected UDP socket. To put it another way, the path MTU shrinks each time we miss the flow cache, which later on leads to incorrectly fragmented payload. It can be observed with ESPv6 in transport mode: 1) Set up a transformation and lower the MTU to trigger fragmentation # ip xfrm policy add dir out src ::1 dst ::1 \ tmpl src ::1 dst ::1 proto esp spi 1 # ip xfrm state add src ::1 dst ::1 \ proto esp spi 1 enc 'aes' 0x0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b # ip link set dev lo mtu 1500 2) Monitor the packet flow and set up an UDP sink # tcpdump -ni lo -ttt & # socat udp6-listen:12345,fork /dev/null & 3) Send a datagram that needs fragmentation with a connected socket # perl -e 'print "@" x 1470 | socat - udp6:[::1]:12345 2016/06/07 18:52:52 socat[724] E read(3, 0x555bb3d5ba00, 8192): Protocol error 00:00:00.000000 IP6 ::1 > ::1: frag (0|1448) ESP(spi=0x00000001,seq=0x2), length 1448 00:00:00.000014 IP6 ::1 > ::1: frag (1448|32) 00:00:00.000050 IP6 ::1 > ::1: ESP(spi=0x00000001,seq=0x3), length 1272 (^ ICMPv6 Parameter Problem) 00:00:00.000022 IP6 ::1 > ::1: ESP(spi=0x00000001,seq=0x5), length 136 4) Compare it to a non-connected socket # perl -e 'print "@" x 1500' | socat - udp6-sendto:[::1]:12345 00:00:40.535488 IP6 ::1 > ::1: frag (0|1448) ESP(spi=0x00000001,seq=0x6), length 1448 00:00:00.000010 IP6 ::1 > ::1: frag (1448|64) What happens in step (3) is: 1) when connecting the socket in __ip6_datagram_connect(), we perform an XFRM lookup, miss the flow cache, create an XFRM bundle, and cache the destination, 2) afterwards, when sending the datagram, we perform an XFRM lookup, again, miss the flow cache (due to mismatch of flowi6_iif and flowi6_oif, which is an issue of its own), and recreate an XFRM bundle based on the cached (and already transformed) destination. To prevent the recreation of an XFRM bundle, avoid an XFRM lookup altogether whenever we already have a destination entry cached in the socket. This prevents the path MTU shrinkage and brings us on par with UDPv4. The fix also benefits connected PINGv6 sockets, another user of ip6_sk_dst_lookup_flow(), who also suffer messages being transformed twice. Joint work with Hannes Frederic Sowa. Reported-by: Jan Tluka <jtluka@redhat.com> Signed-off-by: Jakub Sitnicki <jkbs@redhat.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-08ila: Perform only one translation in forwarding pathTom Herbert
When setting up ILA in a router we noticed that the the encapsulation is invoked twice: once in the route input path and again upon route output. To resolve this we add a flag set_csum_neutral for the ila_update_ipv6_locator. If this flag is set and the checksum neutral bit is also set we assume that checksum-neutral translation has already been performed and take no further action. The flag is set only in ila_output path. The flag is not set for ila_input and ila_xlat. Tested: Used 3 netns to set to emulate a router and two hosts. The router translates SIR addresses between the two destinations in other two netns. Verified ping and netperf are functional. Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-08net: vrf: ipv6 support for local traffic to local addressesDavid Ahern
Add support for locally originated traffic to VRF-local IPv6 addresses. Similar to IPv4 a local dst is set on the skb and the packet is reinserted with a call to netif_rx. With this patch, ping, tcp and udp packets to a local IPv6 address are successfully routed: $ ip addr show dev eth1 4: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master red state UP group default qlen 1000 link/ether 02:e0:f9:1c:b9:74 brd ff:ff:ff:ff:ff:ff inet 10.100.1.1/24 brd 10.100.1.255 scope global eth1 valid_lft forever preferred_lft forever inet6 2100:1::1/120 scope global valid_lft forever preferred_lft forever inet6 fe80::e0:f9ff:fe1c:b974/64 scope link valid_lft forever preferred_lft forever $ ping6 -c1 -I red 2100:1::1 ping6: Warning: source address might be selected on device other than red. PING 2100:1::1(2100:1::1) from 2100:1::1 red: 56 data bytes 64 bytes from 2100:1::1: icmp_seq=1 ttl=64 time=0.098 ms ip6_input is exported so the VRF driver can use it for the dst input function. The dst_alloc function for IPv4 defaults to setting the input and output functions; IPv6's does not. VRF does not need to duplicate the Rx path so just export the ipv6 input function. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-07tcp: record TLP and ER timer stats in v6 statsYuchung Cheng
The v6 tcp stats scan do not provide TLP and ER timer information correctly like the v4 version . This patch fixes that. Fixes: 6ba8a3b19e76 ("tcp: Tail loss probe (TLP)") Fixes: eed530b6c676 ("tcp: early retransmit") Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-03skbuff: introduce skb_gso_validate_mtuMarcelo Ricardo Leitner
skb_gso_network_seglen is not enough for checking fragment sizes if skb is using GSO_BY_FRAGS as we have to check frag per frag. This patch introduces skb_gso_validate_mtu, based on the former, which will wrap the use case inside it as all calls to skb_gso_network_seglen were to validate if it fits on a given TMU, and improve the check. Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Tested-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-02Possible problem with e6afc8ac ("udp: remove headers from UDP packets before ↵Eric Dumazet
queueing") Paul Moore tracked a regression caused by a recent commit, which mistakenly assumed that sk_filter() could be avoided if socket had no current BPF filter. The intent was to avoid udp_lib_checksum_complete() overhead. But sk_filter() also checks skb_pfmemalloc() and security_sock_rcv_skb(), so better call it. Fixes: e6afc8ace6dd ("udp: remove headers from UDP packets before queueing") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Paul Moore <paul@paul-moore.com> Tested-by: Paul Moore <paul@paul-moore.com> Tested-by: Stephen Smalley <sds@tycho.nsa.gov> Cc: samanthakumar <samanthakumar@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-01Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains Netfilter fixes for your net tree, they are: 1) Fix incorrect timestamp in nfnetlink_queue introduced when addressing y2038 safe timestamp, from Florian Westphal. 2) Get rid of leftover conntrack definition from the previous merge window, oneliner from Florian. 3) Make nf_queue handler pernet to resolve race on dereferencing the hook state structure with netns removal, from Eric Biederman. 4) Ensure clean exit on unregistered helper ports, from Taehee Yoo. 5) Restore FLOWI_FLAG_KNOWN_NH in nf_dup_ipv6. This got lost while generalizing xt_TEE to add packet duplication support in nf_tables, from Paolo Abeni. 6) Insufficient netlink NFTA_SET_TABLE attribute check in nf_tables_getset(), from Phil Turnbull. 7) Reject helper registration on duplicated ports via modparams. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-31fou: fix IPv6 Kconfig optionsArnd Bergmann
The Kconfig options I added to work around broken compilation ended up screwing up things more, as I used the wrong symbol to control compilation of the file, resulting in IPv6 fou support to never be built into the kernel. Changing CONFIG_NET_FOU_IPV6_TUNNELS to CONFIG_IPV6_FOU fixes that problem, I had renamed the symbol in one location but not the other, and as the file is never being used by other kernel code, this did not lead to a build failure that I would have caught. After that fix, another issue with the same patch becomes obvious, as we 'select INET6_TUNNEL', which is related to IPV6_TUNNEL, but not the same, and this can still cause the original build failure when IPV6_TUNNEL is not built-in but IPV6_FOU is. The fix is equally trivial, we just need to select the right symbol. I have successfully build 350 randconfig kernels with this patch and verified that the driver is now being built. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reported-by: Valentin Rothberg <valentinrothberg@gmail.com> Fixes: fabb13db448e ("fou: add Kconfig options for IPv6 support") Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-30netfilter: nf_dup_ipv6: set again FLOWI_FLAG_KNOWN_NH at flowi6_flagsPaolo Abeni
With the commit 48e8aa6e3137 ("ipv6: Set FLOWI_FLAG_KNOWN_NH at flowi6_flags") ip6_pol_route() callers were asked to to set the FLOWI_FLAG_KNOWN_NH properly and xt_TEE was updated accordingly, but with the later refactor in commit bbde9fc1824a ("netfilter: factor out packet duplication for IPv4/IPv6") the flowi6_flags update was lost. This commit re-add it just before the routing decision. Fixes: bbde9fc1824a ("netfilter: factor out packet duplication for IPv4/IPv6") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>