Age | Commit message (Collapse) | Author |
|
Simplify vxlan implementation using common UDP tunnel APIs.
Signed-off-by: Andy Zhou <azhou@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This patch:
- Clarifies the specific requirements of devices returning
CHECKSUM_UNNECESSARY (comments in skbuff.h).
- Adds csum_level field to skbuff. This is used to express how
many checksums are covered by CHECKSUM_UNNECESSARY (stores n - 1).
This replaces the overloading of skb->encapsulation, that field is
is now only used to indicate inner headers are valid.
- Change __skb_checksum_validate_needed to "consume" each checksum
as indicated by csum_level as layers of the the packet are parsed.
- Remove skb_pop_rcv_encapsulation, no longer needed in the new
csum_level model.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The first initializer in the following
union vxlan_addr ipa = {
.sin.sin_addr.s_addr = tip,
.sa.sa_family = AF_INET,
};
is optimised away by the compiler, due to the second initializer,
therefore initialising .sin.sin_addr.s_addr always to 0.
This results in netlink messages indicating a L3 miss never contain the
missed IP address. This was observed with GCC 4.8 and 4.9. I do not know about previous versions.
The problem affects user space programs relying on an IP address being
sent as part of a netlink message indicating a L3 miss.
Changing
.sa.sa_family = AF_INET,
to
.sin.sin_family = AF_INET,
fixes the problem.
Signed-off-by: Gerhard Stenzel <gerhard.stenzel@de.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
ndm_type means L3 address type, in neighbour proxy and vxlan, it's RTN_UNICAST.
NDA_DST is for netlink TLV type, hence it's not right value in this context.
Signed-off-by: Jun Zhao <mypopydev@gmail.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
In vxlan driver call common function udp_sock_create to create the
listener UDP port.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Dumping a bridge fdb dumps every fdb entry
held. With this change we are going to filter
on selected bridge port.
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
In vxlan and OVS vport-vxlan call common function to get source port
for a UDP tunnel. Removed vxlan_src_port since the functionality is
now in udp_flow_src_port.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Call skb_pop_rcv_encapsulation and postpull_rcsum for the Ethernet
header to work properly with checksum complete.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
When we mirror packets from a vxlan tunnel to other device,
the mirror device should see the same packets (that is, without
outer header). Because vxlan tunnel sets dev->hard_header_len,
tcf_mirred() resets mac header back to outer mac, the mirror device
actually sees packets with outer headers
Vxlan tunnel should set dev->needed_headroom instead of
dev->hard_header_len, like what other ip tunnels do. This fixes
the above problem.
Cc: "David S. Miller" <davem@davemloft.net>
Cc: stephen hemminger <stephen@networkplumber.org>
Cc: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: Cong Wang <cwang@twopensource.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Need to gro_postpull_rcsum for GRO to work with checksum complete.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Added VXLAN link configuration for sending UDP checksums, and allowing
TX and RX of UDP6 checksums.
Also, call common iptunnel_handle_offloads and added GSO support for
checksums.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
net: get rid of SET_ETHTOOL_OPS
Dave Miller mentioned he'd like to see SET_ETHTOOL_OPS gone.
This does that.
Mostly done via coccinelle script:
@@
struct ethtool_ops *ops;
struct net_device *dev;
@@
- SET_ETHTOOL_OPS(dev, ops);
+ dev->ethtool_ops = ops;
Compile tested only, but I'd seriously wonder if this broke anything.
Suggested-by: Dave Miller <davem@davemloft.net>
Signed-off-by: Wilfried Klaebe <w-lkml@lebenslange-mailadresse.de>
Acked-by: Felipe Balbi <balbi@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This patch allows to switch the netns when packet is encapsulated or
decapsulated.
The vxlan socket is openned into the i/o netns, ie into the netns where
encapsulated packets are received. The socket lookup is done into this netns to
find the corresponding vxlan tunnel. After decapsulation, the packet is
injecting into the corresponding interface which may stand to another netns.
When one of the two netns is removed, the tunnel is destroyed.
Configuration example:
ip netns add netns1
ip netns exec netns1 ip link set lo up
ip link add vxlan10 type vxlan id 10 group 239.0.0.10 dev eth0 dstport 0
ip link set vxlan10 netns netns1
ip netns exec netns1 ip addr add 192.168.0.249/24 broadcast 192.168.0.255 dev vxlan10
ip netns exec netns1 ip link set vxlan10 up
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The goal of this patch is to fix rtnelink notification. The main problem was
about notification for fdb entry with more than one remote. Before the patch,
when a remote was added to an existing fdb entry, the kernel advertised the
first remote instead of the added one. Also when a remote was removed from a fdb
entry with several remotes, the deleted remote was not advertised.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
In the dst->output() path for ipv4, the code assumes the skb it has to
transmit is attached to an inet socket, specifically via
ip_mc_output() : The sk_mc_loop() test triggers a WARN_ON() when the
provider of the packet is an AF_PACKET socket.
The dst->output() method gets an additional 'struct sock *sk'
parameter. This needs a cascade of changes so that this parameter can
be propagated from vxlan to final consumer.
Fixes: 8f646c922d55 ("vxlan: keep original skb ownership")
Reported-by: lucien xin <lucien.xin@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
If the vxlan interface is created without explicit group definition,
there are corner cases which may cause kernel panic.
For instance, in the following scenario:
node A:
$ ip link add dev vxlan42 address 2c:c2:60:00:10:20 type vxlan id 42
$ ip addr add dev vxlan42 10.0.0.1/24
$ ip link set up dev vxlan42
$ arp -i vxlan42 -s 10.0.0.2 2c:c2:60:00:01:02
$ bridge fdb add dev vxlan42 to 2c:c2:60:00:01:02 dst <IPv4 address>
$ ping 10.0.0.2
node B:
$ ip link add dev vxlan42 address 2c:c2:60:00:01:02 type vxlan id 42
$ ip addr add dev vxlan42 10.0.0.2/24
$ ip link set up dev vxlan42
$ arp -i vxlan42 -s 10.0.0.1 2c:c2:60:00:10:20
node B crashes:
vxlan42: 2c:c2:60:00:10:20 migrated from 4011:eca4:c0a8:6466:c0a8:6415:8e09:2118 to (invalid address)
vxlan42: 2c:c2:60:00:10:20 migrated from 4011:eca4:c0a8:6466:c0a8:6415:8e09:2118 to (invalid address)
BUG: unable to handle kernel NULL pointer dereference at 0000000000000046
IP: [<ffffffff8143c459>] ip6_route_output+0x58/0x82
PGD 7bd89067 PUD 7bd4e067 PMD 0
Oops: 0000 [#1] SMP
Modules linked in:
CPU: 1 PID: 0 Comm: swapper/1 Not tainted 3.14.0-rc8-hvx-xen-00019-g97a5221-dirty #154
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
task: ffff88007c774f50 ti: ffff88007c79c000 task.ti: ffff88007c79c000
RIP: 0010:[<ffffffff8143c459>] [<ffffffff8143c459>] ip6_route_output+0x58/0x82
RSP: 0018:ffff88007fd03668 EFLAGS: 00010282
RAX: 0000000000000000 RBX: ffffffff8186a000 RCX: 0000000000000040
RDX: 0000000000000000 RSI: ffff88007b0e4a80 RDI: ffff88007fd03754
RBP: ffff88007fd03688 R08: ffff88007b0e4a80 R09: 0000000000000000
R10: 0200000a0100000a R11: 0001002200000000 R12: ffff88007fd03740
R13: ffff88007b0e4a80 R14: ffff88007b0e4a80 R15: ffff88007bba0c50
FS: 0000000000000000(0000) GS:ffff88007fd00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000046 CR3: 000000007bb60000 CR4: 00000000000006e0
Stack:
0000000000000000 ffff88007fd037a0 ffffffff8186a000 ffff88007fd03740
ffff88007fd036c8 ffffffff814320bb 0000000000006e49 ffff88007b8b7360
ffff88007bdbf200 ffff88007bcbc000 ffff88007b8b7000 ffff88007b8b7360
Call Trace:
<IRQ>
[<ffffffff814320bb>] ip6_dst_lookup_tail+0x2d/0xa4
[<ffffffff814322a5>] ip6_dst_lookup+0x10/0x12
[<ffffffff81323b4e>] vxlan_xmit_one+0x32a/0x68c
[<ffffffff814a325a>] ? _raw_spin_unlock_irqrestore+0x12/0x14
[<ffffffff8104c551>] ? lock_timer_base.isra.23+0x26/0x4b
[<ffffffff8132451a>] vxlan_xmit+0x66a/0x6a8
[<ffffffff8141a365>] ? ipt_do_table+0x35f/0x37e
[<ffffffff81204ba2>] ? selinux_ip_postroute+0x41/0x26e
[<ffffffff8139d0c1>] dev_hard_start_xmit+0x2ce/0x3ce
[<ffffffff8139d491>] __dev_queue_xmit+0x2d0/0x392
[<ffffffff813b380f>] ? eth_header+0x28/0xb5
[<ffffffff8139d569>] dev_queue_xmit+0xb/0xd
[<ffffffff813a5aa6>] neigh_resolve_output+0x134/0x152
[<ffffffff813db741>] ip_finish_output2+0x236/0x299
[<ffffffff813dc074>] ip_finish_output+0x98/0x9d
[<ffffffff813dc749>] ip_output+0x62/0x67
[<ffffffff813da9f2>] dst_output+0xf/0x11
[<ffffffff813dc11c>] ip_local_out+0x1b/0x1f
[<ffffffff813dcf1b>] ip_send_skb+0x11/0x37
[<ffffffff813dcf70>] ip_push_pending_frames+0x2f/0x33
[<ffffffff813ff732>] icmp_push_reply+0x106/0x115
[<ffffffff813ff9e4>] icmp_reply+0x142/0x164
[<ffffffff813ffb3b>] icmp_echo.part.16+0x46/0x48
[<ffffffff813c1d30>] ? nf_iterate+0x43/0x80
[<ffffffff813d8037>] ? xfrm4_policy_check.constprop.11+0x52/0x52
[<ffffffff813ffb62>] icmp_echo+0x25/0x27
[<ffffffff814005f7>] icmp_rcv+0x1d2/0x20a
[<ffffffff813d8037>] ? xfrm4_policy_check.constprop.11+0x52/0x52
[<ffffffff813d810d>] ip_local_deliver_finish+0xd6/0x14f
[<ffffffff813d8037>] ? xfrm4_policy_check.constprop.11+0x52/0x52
[<ffffffff813d7fde>] NF_HOOK.constprop.10+0x4c/0x53
[<ffffffff813d82bf>] ip_local_deliver+0x4a/0x4f
[<ffffffff813d7f7b>] ip_rcv_finish+0x253/0x26a
[<ffffffff813d7d28>] ? inet_add_protocol+0x3e/0x3e
[<ffffffff813d7fde>] NF_HOOK.constprop.10+0x4c/0x53
[<ffffffff813d856a>] ip_rcv+0x2a6/0x2ec
[<ffffffff8139a9a0>] __netif_receive_skb_core+0x43e/0x478
[<ffffffff812a346f>] ? virtqueue_poll+0x16/0x27
[<ffffffff8139aa2f>] __netif_receive_skb+0x55/0x5a
[<ffffffff8139aaaa>] process_backlog+0x76/0x12f
[<ffffffff8139add8>] net_rx_action+0xa2/0x1ab
[<ffffffff81047847>] __do_softirq+0xca/0x1d1
[<ffffffff81047ace>] irq_exit+0x3e/0x85
[<ffffffff8100b98b>] do_IRQ+0xa9/0xc4
[<ffffffff814a37ad>] common_interrupt+0x6d/0x6d
<EOI>
[<ffffffff810378db>] ? native_safe_halt+0x6/0x8
[<ffffffff810110c7>] default_idle+0x9/0xd
[<ffffffff81011694>] arch_cpu_idle+0x13/0x1c
[<ffffffff8107480d>] cpu_startup_entry+0xbc/0x137
[<ffffffff8102e741>] start_secondary+0x1a0/0x1a5
Code: 24 14 e8 f1 e5 01 00 31 d2 a8 32 0f 95 c2 49 8b 44 24 2c 49 0b 44 24 24 74 05 83 ca 04 eb 1c 4d 85 ed 74 17 49 8b 85 a8 02 00 00 <66> 8b 40 46 66 c1 e8 07 83 e0 07 c1 e0 03 09 c2 4c 89 e6 48 89
RIP [<ffffffff8143c459>] ip6_route_output+0x58/0x82
RSP <ffff88007fd03668>
CR2: 0000000000000046
---[ end trace 4612329caab37efd ]---
When vxlan interface is created without explicit group definition, the
default_dst protocol family is initialiazed to AF_UNSPEC and the driver
assumes IPv4 configuration. On the other side, the default_dst protocol
family is used to differentiate between IPv4 and IPv6 cases and, since,
AF_UNSPEC != AF_INET, the processing takes the IPv6 path.
Making the IPv4 assumption explicit by settting default_dst protocol
family to AF_INET4 and preventing mixing of IPv4 and IPv6 addresses in
snooped fdb entries fixes the corner case crashes.
Signed-off-by: Mike Rapoport <mike.rapoport@ravellosystems.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Conflicts:
Documentation/devicetree/bindings/net/micrel-ks8851.txt
net/core/netpoll.c
The net/core/netpoll.c conflict is a bug fix in 'net' happening
to code which is completely removed in 'net-next'.
In micrel-ks8851.txt we simply have overlapping changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The VXLAN neigh_reduce() code is completely non-functional since
check-in. Specific errors:
1) The original code drops all packets with a multicast destination address,
even though neighbor solicitations are sent to the solicited-node
address, a multicast address. The code after this check was never run.
2) The neighbor table lookup used the IPv6 header destination, which is the
solicited node address, rather than the target address from the
neighbor solicitation. So neighbor lookups would always fail if it
got this far. Also for L3MISSes.
3) The code calls ndisc_send_na(), which does a send on the tunnel device.
The context for neigh_reduce() is the transmit path, vxlan_xmit(),
where the host or a bridge-attached neighbor is trying to transmit
a neighbor solicitation. To respond to it, the tunnel endpoint needs
to do a *receive* of the appropriate neighbor advertisement. Doing a
send, would only try to send the advertisement, encapsulated, to the
remote destinations in the fdb -- hosts that definitely did not do the
corresponding solicitation.
4) The code uses the tunnel endpoint IPv6 forwarding flag to determine the
isrouter flag in the advertisement. This has nothing to do with whether
or not the target is a router, and generally won't be set since the
tunnel endpoint is bridging, not routing, traffic.
The patch below creates a proxy neighbor advertisement to respond to
neighbor solicitions as intended, providing proper IPv6 support for neighbor
reduction.
Signed-off-by: David L Stevens <dlstevens@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This patch fixes a NULL pointer dereference in the event of an
skb allocation failure in arp_reduce().
Signed-Off-By: David L Stevens <dlstevens@us.ibm.com>
Acked-by: Cong Wang <cwang@twopensource.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Signed-off-by: Pablo Neira Ayuso <pablo@gnumonks.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
There are many drivers calling alloc_percpu() to allocate pcpu stats
and then initializing ->syncp. So just introduce a helper function for them.
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Signed-off-by: Daniel Baluta <dbaluta@ixiacom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The loop in vxlan_gro_receive() over the current set of candidates for
coalescing was wrongly aborted once a match was found. In rare cases,
this can cause a false-positives matching in the next layer GRO checks.
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Make sure the practice set by commit 0afb166 "vxlan: Add capability
of Rx checksum offload for inner packet" is applied when the skb
goes through the portion of the RX code which is shared between
vxlan netdevices and ovs vxlan port instances.
Cc: Joseph Gasparakis <joseph.gasparakis@intel.com>
Cc: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
As per suggestion from Eric W. Biederman, vxlan should be using
{un,}register_pernet_subsys() instead of {un,}register_pernet_device()
to ensure the vxlan_net structure is initialized before and cleaned
up after all network devices in a given network namespace i.e. when
dealing with network notifiers. This is similarly handeled already in
commit 91e2ff3528ac ("net: Teach vlans to cleanup as a pernet subsystem")
and, thus, improves upon fd27e0d44a89 ("net: vxlan: do not use vxlan_net
before checking event type"). Just as in 91e2ff3528ac, we do not need
to explicitly handle deletion of vxlan devices as network namespace
exit calls dellink on all remaining virtual devices, and
rtnl_link_unregister() calls dellink on all outstanding devices in that
network namespace, so we can entirely drop the pernet exit operation
as well. Moreover, on vxlan module exit, rcu_barrier() is called by
netns since commit 3a765edadb28 ("netns: Add an explicit rcu_barrier
to unregister_pernet_{device|subsys}"), so this may be omitted. Tested
with various scenarios and works well on my side.
Suggested-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Jesse Brandeburg <jesse.brandeburg@intel.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add GRO handlers for vxlann, by using the UDP GRO infrastructure.
For single TCP session that goes through vxlan tunneling I got nice
improvement from 6.8Gbs to 11.5Gbs
--> UDP/VXLAN GRO disabled
$ netperf -H 192.168.52.147 -c -C
$ netperf -t TCP_STREAM -H 192.168.52.147 -c -C
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.52.147 () port 0 AF_INET
Recv Send Send Utilization Service Demand
Socket Socket Message Elapsed Send Recv Send Recv
Size Size Size Time Throughput local remote local remote
bytes bytes bytes secs. 10^6bits/s % S % S us/KB us/KB
87380 65536 65536 10.00 6799.75 12.54 24.79 0.604 1.195
--> UDP/VXLAN GRO enabled
$ netperf -t TCP_STREAM -H 192.168.52.147 -c -C
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.52.147 () port 0 AF_INET
Recv Send Send Utilization Service Demand
Socket Socket Message Elapsed Send Recv Send Recv
Size Size Size Time Throughput local remote local remote
bytes bytes bytes secs. 10^6bits/s % S % S us/KB us/KB
87380 65536 65536 10.00 11562.72 24.90 20.34 0.706 0.577
Signed-off-by: Shlomo Pongratz <shlomop@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add a description to the vxlan module, helping save the world
from the minions of destruction and confusion.
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
CC: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Jesse Brandeburg reported that commit acaf4e70997f caused a panic
when adding a network namespace while vxlan module was present in
the system:
[<ffffffff814d0865>] vxlan_lowerdev_event+0xf5/0x100
[<ffffffff816e9e5d>] notifier_call_chain+0x4d/0x70
[<ffffffff810912be>] __raw_notifier_call_chain+0xe/0x10
[<ffffffff810912d6>] raw_notifier_call_chain+0x16/0x20
[<ffffffff815d9610>] call_netdevice_notifiers_info+0x40/0x70
[<ffffffff815d9656>] call_netdevice_notifiers+0x16/0x20
[<ffffffff815e1bce>] register_netdevice+0x1be/0x3a0
[<ffffffff815e1dce>] register_netdev+0x1e/0x30
[<ffffffff814cb94a>] loopback_net_init+0x4a/0xb0
[<ffffffffa016ed6e>] ? lockd_init_net+0x6e/0xb0 [lockd]
[<ffffffff815d6bac>] ops_init+0x4c/0x150
[<ffffffff815d6d23>] setup_net+0x73/0x110
[<ffffffff815d725b>] copy_net_ns+0x7b/0x100
[<ffffffff81090e11>] create_new_namespaces+0x101/0x1b0
[<ffffffff81090f45>] copy_namespaces+0x85/0xb0
[<ffffffff810693d5>] copy_process.part.26+0x935/0x1500
[<ffffffff811d5186>] ? mntput+0x26/0x40
[<ffffffff8106a15c>] do_fork+0xbc/0x2e0
[<ffffffff811b7f2e>] ? ____fput+0xe/0x10
[<ffffffff81089c5c>] ? task_work_run+0xac/0xe0
[<ffffffff8106a406>] SyS_clone+0x16/0x20
[<ffffffff816ee689>] stub_clone+0x69/0x90
[<ffffffff816ee329>] ? system_call_fastpath+0x16/0x1b
Apparently loopback device is being registered first and thus we
receive an event notification when vxlan_net is not ready. Hence,
when we call net_generic() and request vxlan_net_id, we seem to
access garbage at that point in time. In setup_net() where we set
up a newly allocated network namespace, we traverse the list of
pernet ops ...
list_for_each_entry(ops, &pernet_list, list) {
error = ops_init(ops, net);
if (error < 0)
goto out_undo;
}
... and loopback_net_init() is invoked first here, so in the middle
of setup_net() we get this notification in vxlan. As currently we
only care about devices that unregister, move access through
net_generic() there. Fix is based on Cong Wang's proposal, but
only changes what is needed here. It sucks a bit as we only work
around the actual cure: right now it seems the only way to check if
a netns actually finished traversing all init ops would be to check
if it's part of net_namespace_list. But that I find quite expensive
each time we go through a notifier callback. Anyway, did a couple
of tests and it seems good for now.
Fixes: acaf4e70997f ("net: vxlan: when lower dev unregisters remove vxlan dev as well")
Reported-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Tested-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
We should use vxlan_dellink() handler in vxlan_exit_net(), since
i) we're not in fast-path and we should be consistent in dismantle
just as we would remove a device through rtnl ops, and more
importantly, ii) in case future code will kfree() memory in
vxlan_dellink(), we would leak it right here unnoticed. Therefore,
do not only half of the cleanup work, but make it properly.
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
We can create a vxlan device with an explicit underlying carrier.
In that case, when the carrier link is being deleted from the
system (e.g. due to module unload) we should also clean up all
created vxlan devices on top of it since otherwise we're in an
inconsistent state in vxlan device. In that case, the user needs
to remove all such devices, while in case of other virtual devs
that sit on top of physical ones, it is usually the case that
these devices do unregister automatically as well and do not
leave the burden on the user.
This work is not necessary when vxlan device was not created with
a real underlying device, as connections can resume in that case
when driver is plugged again. But at least for the other cases,
we should go ahead and do the cleanup on removal.
We don't register the notifier during vxlan_newlink() here since
I consider this event rather rare, and therefore we should not
bloat vxlan's core structure unecessary. Also, we can simply make
use of unregister_netdevice_many() to batch that. fdb is flushed
upon ndo_stop().
E.g. `ip -d link show vxlan13` after carrier removal before
this patch:
5: vxlan13: <BROADCAST,MULTICAST> mtu 1450 qdisc noop state DOWN mode DEFAULT group default
link/ether 1e:47:da:6d:4d:99 brd ff:ff:ff:ff:ff:ff promiscuity 0
vxlan id 13 group 239.0.0.10 dev 2 port 32768 61000 ageing 300
^^^^^
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The following call chains indicate that vxlan_fdb_parse() is
under rtnl_lock protection. So if we use __dev_get_by_index()
instead of dev_get_by_index() to find interface handler in it,
this would help us avoid to change interface reference counter.
rtnetlink_rcv()
rtnl_lock()
netlink_rcv_skb()
rtnl_fdb_add()
vxlan_fdb_add()
vxlan_fdb_parse()
rtnl_unlock()
rtnetlink_rcv()
rtnl_lock()
netlink_rcv_skb()
rtnl_fdb_del()
vxlan_fdb_del()
vxlan_fdb_parse()
rtnl_unlock()
Cc: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Conflicts:
drivers/net/ethernet/qlogic/qlcnic/qlcnic_sriov_pf.c
net/ipv6/ip6_tunnel.c
net/ipv6/ip6_vti.c
ipv6 tunnel statistic bug fixes conflicting with consolidation into
generic sw per-cpu net stats.
qlogic conflict between queue counting bug fix and the addition
of multiple MAC address support.
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Sathya Perla posted a patch trying to address following problem :
<quote>
The vxlan driver sets itself as the socket owner for all the TX flows
it encapsulates (using vxlan_set_owner()) and assigns it's own skb
destructor. This causes all tunneled traffic to land up on only one TXQ
as all encapsulated skbs refer to the vxlan socket and not the original
socket. Also, the vxlan skb destructor breaks some functionality for
tunneled traffic like wmem accounting and as TCP small queues and
FQ/pacing packet scheduler.
</quote>
I reworked Sathya patch and added some explanations.
vxlan_xmit() can avoid one skb_clone()/dev_kfree_skb() pair
and gain better drop monitor accuracy, by calling kfree_skb() when
appropriate.
The UDP socket used by vxlan to perform encapsulation of xmit packets
do not need to be alive while packets leave vxlan code. Its better
to keep original socket ownership to get proper feedback from qdisc and
NIC layers.
We use skb->sk to
A) control amount of bytes/packets queued on behalf of a socket, but
prior vxlan code did the skb->sk transfert without any limit/control
on vxlan socket sk_sndbuf.
B) security purposes (as selinux) or netfilter uses, and I do not think
anything is prepared to handle vxlan stacked case in this area.
By not changing ownership, vxlan tunnels behave like other tunnels.
As Stephen mentioned, we might do the same change in L2TP.
Reported-by: Sathya Perla <sathya.perla@emulex.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
They are same, so unify them as one, pcpu_sw_netstats.
Define pcpu_sw_netstat in netdevice.h, remove pcpu_tstats
from if_tunnel and remove br_cpu_netstats from br_private.h
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Even if user doesn't supply the physical netdev to attach vxlan dev
to, and at the same time user want to vxlan sit top of IPv6, mark
vxlan_dev flags with VXLAN_F_IPV6 to create IPv6 based socket.
Otherwise kernel crashes safely every time spitting below messages,
Steps to reproduce:
ip link add vxlan0 type vxlan id 42 group ff0e::110
ip link set vxlan0 up
[ 62.656266] BUG: unable to handle kernel NULL pointer dereference[ 62.656320] ip (3008) used greatest stack depth: 3912 bytes left
at 0000000000000046
[ 62.656423] IP: [<ffffffff816d822d>] ip6_route_output+0xbd/0xe0
[ 62.656525] PGD 2c966067 PUD 2c9a2067 PMD 0
[ 62.656674] Oops: 0000 [#1] SMP
[ 62.656781] Modules linked in: vxlan netconsole deflate zlib_deflate af_key
[ 62.657083] CPU: 1 PID: 2128 Comm: whoopsie Not tainted 3.12.0+ #182
[ 62.657083] Hardware name: innotek GmbH VirtualBox, BIOS VirtualBox 12/01/2006
[ 62.657083] task: ffff88002e2335d0 ti: ffff88002c94c000 task.ti: ffff88002c94c000
[ 62.657083] RIP: 0010:[<ffffffff816d822d>] [<ffffffff816d822d>] ip6_route_output+0xbd/0xe0
[ 62.657083] RSP: 0000:ffff88002fd038f8 EFLAGS: 00210296
[ 62.657083] RAX: 0000000000000000 RBX: ffff88002fd039e0 RCX: 0000000000000000
[ 62.657083] RDX: ffff88002fd0eb68 RSI: ffff88002fd0d278 RDI: ffff88002fd0d278
[ 62.657083] RBP: ffff88002fd03918 R08: 0000000002000000 R09: 0000000000000000
[ 62.657083] R10: 00000000000001ff R11: 0000000000000000 R12: 0000000000000001
[ 62.657083] R13: ffff88002d96b480 R14: ffffffff81c8e2c0 R15: 0000000000000001
[ 62.657083] FS: 0000000000000000(0000) GS:ffff88002fd00000(0063) knlGS:00000000f693b740
[ 62.657083] CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033
[ 62.657083] CR2: 0000000000000046 CR3: 000000002c9d2000 CR4: 00000000000006e0
[ 62.657083] Stack:
[ 62.657083] ffff88002fd03a40 ffffffff81c8e2c0 ffff88002fd039e0 ffff88002d96b480
[ 62.657083] ffff88002fd03958 ffffffff816cac8b ffff880019277cc0 ffff8800192b5d00
[ 62.657083] ffff88002d5bc000 ffff880019277cc0 0000000000001821 0000000000000001
[ 62.657083] Call Trace:
[ 62.657083] <IRQ>
[ 62.657083] [<ffffffff816cac8b>] ip6_dst_lookup_tail+0xdb/0xf0
[ 62.657083] [<ffffffff816caea0>] ip6_dst_lookup+0x10/0x20
[ 62.657083] [<ffffffffa0020c13>] vxlan_xmit_one+0x193/0x9c0 [vxlan]
[ 62.657083] [<ffffffff8137b3b7>] ? account+0xc7/0x1f0
[ 62.657083] [<ffffffffa0021513>] vxlan_xmit+0xd3/0x400 [vxlan]
[ 62.657083] [<ffffffff8161390d>] dev_hard_start_xmit+0x49d/0x5e0
[ 62.657083] [<ffffffff81613d29>] dev_queue_xmit+0x2d9/0x480
[ 62.657083] [<ffffffff817cb854>] ? _raw_write_unlock_bh+0x14/0x20
[ 62.657083] [<ffffffff81630565>] ? eth_header+0x35/0xe0
[ 62.657083] [<ffffffff8161bc5e>] neigh_resolve_output+0x11e/0x1e0
[ 62.657083] [<ffffffff816ce0e0>] ? ip6_fragment+0xad0/0xad0
[ 62.657083] [<ffffffff816cb465>] ip6_finish_output2+0x2f5/0x470
[ 62.657083] [<ffffffff816ce166>] ip6_finish_output+0x86/0xc0
[ 62.657083] [<ffffffff816ce218>] ip6_output+0x78/0xb0
[ 62.657083] [<ffffffff816eadd6>] mld_sendpack+0x256/0x2a0
[ 62.657083] [<ffffffff816ebd8c>] mld_ifc_timer_expire+0x17c/0x290
[ 62.657083] [<ffffffff816ebc10>] ? igmp6_timer_handler+0x80/0x80
[ 62.657083] [<ffffffff816ebc10>] ? igmp6_timer_handler+0x80/0x80
[ 62.657083] [<ffffffff81051065>] call_timer_fn+0x45/0x150
[ 62.657083] [<ffffffff816ebc10>] ? igmp6_timer_handler+0x80/0x80
[ 62.657083] [<ffffffff81052353>] run_timer_softirq+0x1f3/0x2a0
[ 62.657083] [<ffffffff8102dfd8>] ? lapic_next_event+0x18/0x20
[ 62.657083] [<ffffffff8109e36f>] ? clockevents_program_event+0x6f/0x110
[ 62.657083] [<ffffffff8104a2f6>] __do_softirq+0xd6/0x2b0
[ 62.657083] [<ffffffff8104a75e>] irq_exit+0x7e/0xa0
[ 62.657083] [<ffffffff8102ea15>] smp_apic_timer_interrupt+0x45/0x60
[ 62.657083] [<ffffffff817d3eca>] apic_timer_interrupt+0x6a/0x70
[ 62.657083] <EOI>
[ 62.657083] [<ffffffff817d4a35>] ? sysenter_dispatch+0x7/0x1a
[ 62.657083] Code: 4d 8b 85 a8 02 00 00 4c 89 e9 ba 03 04 00 00 48 c7 c6 c0 be 8d 81 48 c7 c7 48 35 a3 81 31 c0 e8 db 68 0e 00 49 8b 85 a8 02 00 00 <0f> b6 40 46 c0 e8 05 0f b6 c0 c1 e0 03 41 09 c4 e9 77 ff ff ff
[ 62.657083] RIP [<ffffffff816d822d>] ip6_route_output+0xbd/0xe0
[ 62.657083] RSP <ffff88002fd038f8>
[ 62.657083] CR2: 0000000000000046
[ 62.657083] ---[ end trace ba8a9583d7cd1934 ]---
[ 62.657083] Kernel panic - not syncing: Fatal exception in interrupt
Signed-off-by: Fan Du <fan.du@windriver.com>
Reported-by: Ryan Whelan <rcwhelan@gmail.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
When adding a new vxlan device to an "underlying carrier" (here:
dst->remote_ifindex), the MTU size assigned to the vxlan device
is the MTU at setup time of the carrier - needed headroom, when
adding a vxlan device w/o explicit carrier, then it defaults
to 1500.
In case of an explicit carrier that supports jumbo frames, we
currently cannot change vxlan MTU via ip(8) to > 1500 in
post-setup time, as vxlan driver uses eth_change_mtu() as default
method for manually setting MTU.
Hence, use a custom implementation that only falls back to
eth_change_mtu() in case we didn't use a dev parameter on device
setup time, and otherwise allow a max MTU setting of the carrier
incl. adjustment for headroom.
Reported-by: Shahed Shaikh <shahed.shaikh@qlogic.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Conflicts:
drivers/net/ethernet/intel/i40e/i40e_main.c
drivers/net/macvtap.c
Both minor merge hassles, simple overlapping changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Changing name of function as part of making the hash in skbuff to be
generic property, not just for receive path.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
vxlan_group_used only allows device to leave multicast group
when the remote_ip of this vxlan device is difference from
other vxlan devices' remote_ip. this will cause device not
leave multicast group untile the vn_sock of this vxlan deivce
being released.
The check in vxlan_group_used is not quite precise. since even
the remote_ip is same, but these vxlan devices may use different
lower devices, and they may use different vn_socks.
Only when some vxlan devices use the same vn_sock,same lower
device and same remote_ip, the mc_list of the vn_sock should
not be changed.
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
In vxlan_open, vxlan_group_used always returns true,
because the state of the vxlan deivces which we want
to open has alreay been running. and it has already
in vxlan_list.
Since ip_mc_join_group takes care of the reference
of struct ip_mc_list. removing vxlan_group_used here
is safe.
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Otherwise causing dst memory leakage.
Have Checked all other type tunnel device transmit implementation,
no such things happens anymore.
Signed-off-by: Fan Du <fan.du@windriver.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull core locking changes from Ingo Molnar:
"The biggest changes:
- add lockdep support for seqcount/seqlocks structures, this
unearthed both bugs and required extra annotation.
- move the various kernel locking primitives to the new
kernel/locking/ directory"
* 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (21 commits)
block: Use u64_stats_init() to initialize seqcounts
locking/lockdep: Mark __lockdep_count_forward_deps() as static
lockdep/proc: Fix lock-time avg computation
locking/doc: Update references to kernel/mutex.c
ipv6: Fix possible ipv6 seqlock deadlock
cpuset: Fix potential deadlock w/ set_mems_allowed
seqcount: Add lockdep functionality to seqcount/seqlock structures
net: Explicitly initialize u64_stats_sync structures for lockdep
locking: Move the percpu-rwsem code to kernel/locking/
locking: Move the lglocks code to kernel/locking/
locking: Move the rwsem code to kernel/locking/
locking: Move the rtmutex code to kernel/locking/
locking: Move the semaphore core to kernel/locking/
locking: Move the spinlock code to kernel/locking/
locking: Move the lockdep code to kernel/locking/
locking: Move the mutex code to kernel/locking/
hung_task debugging: Add tracepoint to report the hang
x86/locking/kconfig: Update paravirt spinlock Kconfig description
lockstat: Report avg wait and hold times
lockdep, x86/alternatives: Drop ancient lockdep fixup message
...
|
|
In order to enable lockdep on seqcount/seqlock structures, we
must explicitly initialize any locks.
The u64_stats_sync structure, uses a seqcount, and thus we need
to introduce a u64_stats_init() function and use it to initialize
the structure.
This unfortunately adds a lot of fairly trivial initialization code
to a number of drivers. But the benefit of ensuring correctness makes
this worth while.
Because these changes are required for lockdep to be enabled, and the
changes are quite trivial, I've not yet split this patch out into 30-some
separate patches, as I figured it would be better to get the various
maintainers thoughts on how to best merge this change along with
the seqcount lockdep enablement.
Feedback would be appreciated!
Signed-off-by: John Stultz <john.stultz@linaro.org>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: James Morris <jmorris@namei.org>
Cc: Jesse Gross <jesse@nicira.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Mirko Lindner <mlindner@marvell.com>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Roger Luethi <rl@hellgate.ch>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Simon Horman <horms@verge.net.au>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
Cc: Wensong Zhang <wensong@linux-vs.org>
Cc: netdev@vger.kernel.org
Link: http://lkml.kernel.org/r/1381186321-4906-2-git-send-email-john.stultz@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
trivial patch converting ERR_PTR(PTR_ERR()) into ERR_CAST().
No functional changes.
Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This patch removes the burden from the NIC drivers to check if the
vxlan driver is enabled in the kernel and also makes available
the vxlan headrooms to them.
Signed-off-by: Joseph Gasparakis <joseph.gasparakis@intel.com>
Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
|
|
drivers/net/vxlan.c: In function ‘vxlan_sock_add’:
drivers/net/vxlan.c:2298:11: warning: ‘sock’ may be used uninitialized in this function [-Wmaybe-uninitialized]
drivers/net/vxlan.c:2275:17: note: ‘sock’ was declared here
LD drivers/net/built-in.o
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Conflicts:
drivers/net/ethernet/emulex/benet/be.h
drivers/net/usb/qmi_wwan.c
drivers/net/wireless/brcm80211/brcmfmac/dhd_bus.h
include/net/netfilter/nf_conntrack_synproxy.h
include/net/secure_seq.h
The conflicts are of two varieties:
1) Conflicts with Joe Perches's 'extern' removal from header file
function declarations. Usually it's an argument signature change
or a function being added/removed. The resolutions are trivial.
2) Some overlapping changes in qmi_wwan.c and be.h, one commit adds
a new value, another changes an existing value. That sort of
thing.
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
- Move sysctl_local_ports from a global variable into struct netns_ipv4.
- Modify inet_get_local_port_range to take a struct net, and update all
of the callers.
- Move the initialization of sysctl_local_ports into
sysctl_net_ipv4.c:ipv4_sysctl_init_net from inet_connection_sock.c
v2:
- Ensure indentation used tabs
- Fixed ip.h so it applies cleanly to todays net-next
v3:
- Compile fixes of strange callers of inet_get_local_port_range.
This patch now successfully passes an allmodconfig build.
Removed manual inlining of inet_get_local_port_range in ipv4_local_port_range
Originally-by: Samya <samya@twitter.com>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|