summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2014-07-29SUNRPC: Allow svc_reserve() to notify TCP socket that space has been freedTrond Myklebust
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-29SUNRPC: svc_tcp_write_space: don't clear SOCK_NOSPACE prematurelyTrond Myklebust
If requests are queued in the socket inbuffer waiting for an svc_tcp_has_wspace() requirement to be satisfied, then we do not want to clear the SOCK_NOSPACE flag until we've satisfied that requirement. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-29SUNRPC: Reduce contention in svc_xprt_enqueue()Trond Myklebust
Ensure that all calls to svc_xprt_enqueue() except svc_xprt_received() check the value of XPT_BUSY, before attempting to grab spinlocks etc. This is to avoid situations such as the following "perf" trace, which shows heavy contention on the pool spinlock: 54.15% nfsd [kernel.kallsyms] [k] _raw_spin_lock_bh | --- _raw_spin_lock_bh | |--71.43%-- svc_xprt_enqueue | | | |--50.31%-- svc_reserve | | | |--31.35%-- svc_xprt_received | | | |--18.34%-- svc_tcp_data_ready ... Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-22svcrdma: Add zero padding if the client doesn't send itChuck Lever
See RFC 5666 section 3.7: clients don't have to send zero XDR padding. BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=246 Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-18SUNRPC: xdr_get_next_encode_buffer should be declared staticTrond Myklebust
Quell another sparse warning. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-18svcrdma: Select NFSv4.1 backchannel transport based on forward channelChuck Lever
The current code always selects XPRT_TRANSPORT_BC_TCP for the back channel, even when the forward channel was not TCP (eg, RDMA). When a 4.1 mount is attempted with RDMA, the server panics in the TCP BC code when trying to send CB_NULL. Instead, construct the transport protocol number from the forward channel transport or'd with XPRT_TRANSPORT_BC. Transports that do not support bi-directional RPC will not have registered a "BC" transport, causing create_backchannel_client() to fail immediately. Fixes: https://bugzilla.linux-nfs.org/show_bug.cgi?id=265 Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-11svcrdma: send_write() must not overflow the device's max sgeSteve Wise
Function send_write() must stop creating sges when it reaches the device max and return the amount sent in the RDMA Write to the caller. Signed-off-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-06-23SUNRPC/NFSD: Change to type of bool for rq_usedeferral and rq_splice_okKinglong Mee
rq_usedeferral and rq_splice_ok are used as 0 and 1, just defined to bool. Signed-off-by: Kinglong Mee <kinglongmee@gmail.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-06-15Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds
Pull networking fixes from David Miller: 1) Fix checksumming regressions, from Tom Herbert. 2) Undo unintentional permissions changes for SCTP rto_alpha and rto_beta sysfs knobs, from Denial Borkmann. 3) VXLAN, like other IP tunnels, should advertize it's encapsulation size using dev->needed_headroom instead of dev->hard_header_len. From Cong Wang. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: net: sctp: fix permissions for rto_alpha and rto_beta knobs vxlan: Checksum fixes net: add skb_pop_rcv_encapsulation udp: call __skb_checksum_complete when doing full checksum net: Fix save software checksum complete net: Fix GSO constants to match NETIF flags udp: ipv4: do not waste time in __udp4_lib_mcast_demux_lookup vxlan: use dev->needed_headroom instead of dev->hard_header_len MAINTAINERS: update cxgb4 maintainer
2014-06-15net: sctp: fix permissions for rto_alpha and rto_beta knobsDaniel Borkmann
Commit 3fd091e73b81 ("[SCTP]: Remove multiple levels of msecs to jiffies conversions.") has silently changed permissions for rto_alpha and rto_beta knobs from 0644 to 0444. The purpose of this was to discourage users from tweaking rto_alpha and rto_beta knobs in production environments since they are key to correctly compute rtt/srtt. RFC4960 under section 6.3.1. RTO Calculation says regarding rto_alpha and rto_beta under rule C3 and C4: [...] C3) When a new RTT measurement R' is made, set RTTVAR <- (1 - RTO.Beta) * RTTVAR + RTO.Beta * |SRTT - R'| and SRTT <- (1 - RTO.Alpha) * SRTT + RTO.Alpha * R' Note: The value of SRTT used in the update to RTTVAR is its value before updating SRTT itself using the second assignment. After the computation, update RTO <- SRTT + 4 * RTTVAR. C4) When data is in flight and when allowed by rule C5 below, a new RTT measurement MUST be made each round trip. Furthermore, new RTT measurements SHOULD be made no more than once per round trip for a given destination transport address. There are two reasons for this recommendation: First, it appears that measuring more frequently often does not in practice yield any significant benefit [ALLMAN99]; second, if measurements are made more often, then the values of RTO.Alpha and RTO.Beta in rule C3 above should be adjusted so that SRTT and RTTVAR still adjust to changes at roughly the same rate (in terms of how many round trips it takes them to reflect new values) as they would if making only one measurement per round-trip and using RTO.Alpha and RTO.Beta as given in rule C3. However, the exact nature of these adjustments remains a research issue. [...] While it is discouraged to adjust rto_alpha and rto_beta and not further specified how to adjust them, the RFC also doesn't explicitly forbid it, but rather gives a RECOMMENDED default value (rto_alpha=3, rto_beta=2). We have a couple of users relying on the old permissions before they got changed. That said, if someone really has the urge to adjust them, we could allow it with a warning in the log. Fixes: 3fd091e73b81 ("[SCTP]: Remove multiple levels of msecs to jiffies conversions.") Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Cc: Vlad Yasevich <vyasevich@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-15net: Fix save software checksum completeTom Herbert
Geert reported issues regarding checksum complete and UDP. The logic introduced in commit 7e3cead5172927732f51fde ("net: Save software checksum complete") is not correct. This patch: 1) Restores code in __skb_checksum_complete_header except for setting CHECKSUM_UNNECESSARY. This function may be calculating checksum on something less than skb->len. 2) Adds saving checksum to __skb_checksum_complete. The full packet checksum 0..skb->len is calculated without adding in pseudo header. This value is saved in skb->csum and then the pseudo header is added to that to derive the checksum for validation. 3) In both __skb_checksum_complete_header and __skb_checksum_complete, set skb->csum_valid to whether checksum of zero was computed. This allows skb_csum_unnecessary to return true without changing to CHECKSUM_UNNECESSARY which was done previously. 4) Copy new csum related bits in __copy_skb_header. Reported-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-13udp: ipv4: do not waste time in __udp4_lib_mcast_demux_lookupEric Dumazet
Its too easy to add thousand of UDP sockets on a particular bucket, and slow down an innocent multicast receiver. Early demux is supposed to be an optimization, we should avoid spending too much time in it. It is interesting to note __udp4_lib_demux_lookup() only tries to match first socket in the chain. 10 is the threshold we already have in __udp4_lib_lookup() to switch to secondary hash. Fixes: 421b3885bf6d5 ("udp: ipv4: Add udp early demux") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: David Held <drheld@google.com> Cc: Shawn Bohrer <sbohrer@rgmadvisors.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-12Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client Pull Ceph updates from Sage Weil: "This has a mix of bug fixes and cleanups. Alex's patch fixes a rare race in RBD. Ilya's patches fix an ENOENT check when a second rbd image is mapped and a couple memory leaks. Zheng fixes several issues with fragmented directories and multiple MDSs. Josh fixes a spin/sleep issue, and Josh and Guangliang's patches fix setting and unsetting RBD images read-only. Naturally there are several other cleanups mixed in for good measure" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (23 commits) rbd: only set disk to read-only once rbd: move calls that may sleep out of spin lock range rbd: add ioctl for rbd ceph: use truncate_pagecache() instead of truncate_inode_pages() ceph: include time stamp in every MDS request rbd: fix ida/idr memory leak rbd: use reference counts for image requests rbd: fix osd_request memory leak in __rbd_dev_header_watch_sync() rbd: make sure we have latest osdmap on 'rbd map' libceph: add ceph_monc_wait_osdmap() libceph: mon_get_version request infrastructure libceph: recognize poolop requests in debugfs ceph: refactor readpage_nounlock() to make the logic clearer mds: check cap ID when handling cap export message ceph: remember subtree root dirfrag's auth MDS ceph: introduce ceph_fill_fragtree() ceph: handle cap import atomically ceph: pre-allocate ceph_cap struct for ceph_add_cap() ceph: update inode fields according to issued caps rbd: replace IS_ERR and PTR_ERR with PTR_ERR_OR_ZERO ...
2014-06-12Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-nextLinus Torvalds
Pull networking updates from David Miller: 1) Seccomp BPF filters can now be JIT'd, from Alexei Starovoitov. 2) Multiqueue support in xen-netback and xen-netfront, from Andrew J Benniston. 3) Allow tweaking of aggregation settings in cdc_ncm driver, from Bjørn Mork. 4) BPF now has a "random" opcode, from Chema Gonzalez. 5) Add more BPF documentation and improve test framework, from Daniel Borkmann. 6) Support TCP fastopen over ipv6, from Daniel Lee. 7) Add software TSO helper functions and use them to support software TSO in mvneta and mv643xx_eth drivers. From Ezequiel Garcia. 8) Support software TSO in fec driver too, from Nimrod Andy. 9) Add Broadcom SYSTEMPORT driver, from Florian Fainelli. 10) Handle broadcasts more gracefully over macvlan when there are large numbers of interfaces configured, from Herbert Xu. 11) Allow more control over fwmark used for non-socket based responses, from Lorenzo Colitti. 12) Do TCP congestion window limiting based upon measurements, from Neal Cardwell. 13) Support busy polling in SCTP, from Neal Horman. 14) Allow RSS key to be configured via ethtool, from Venkata Duvvuru. 15) Bridge promisc mode handling improvements from Vlad Yasevich. 16) Don't use inetpeer entries to implement ID generation any more, it performs poorly, from Eric Dumazet. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1522 commits) rtnetlink: fix userspace API breakage for iproute2 < v3.9.0 tcp: fixing TLP's FIN recovery net: fec: Add software TSO support net: fec: Add Scatter/gather support net: fec: Increase buffer descriptor entry number net: fec: Factorize feature setting net: fec: Enable IP header hardware checksum net: fec: Factorize the .xmit transmit function bridge: fix compile error when compiling without IPv6 support bridge: fix smatch warning / potential null pointer dereference via-rhine: fix full-duplex with autoneg disable bnx2x: Enlarge the dorq threshold for VFs bnx2x: Check for UNDI in uncommon branch bnx2x: Fix 1G-baseT link bnx2x: Fix link for KR with swapped polarity lane sctp: Fix sk_ack_backlog wrap-around problem net/core: Add VF link state control policy net/fsl: xgmac_mdio is dependent on OF_MDIO net/fsl: Make xgmac_mdio read error message useful net_sched: drr: warn when qdisc is not work conserving ...
2014-06-12rtnetlink: fix userspace API breakage for iproute2 < v3.9.0Michal Schmidt
When running RHEL6 userspace on a current upstream kernel, "ip link" fails to show VF information. The reason is a kernel<->userspace API change introduced by commit 88c5b5ce5cb57 ("rtnetlink: Call nlmsg_parse() with correct header length"), after which the kernel does not see iproute2's IFLA_EXT_MASK attribute in the netlink request. iproute2 adjusted for the API change in its commit 63338dca4513 ("libnetlink: Use ifinfomsg instead of rtgenmsg in rtnl_wilddump_req_filter"). The problem has been noticed before: http://marc.info/?l=linux-netdev&m=136692296022182&w=2 (Subject: Re: getting VF link info seems to be broken in 3.9-rc8) We can do better than tell those with old userspace to upgrade. We can recognize the old iproute2 in the kernel by checking the netlink message length. Even when including the IFLA_EXT_MASK attribute, its netlink message is shorter than struct ifinfomsg. With this patch "ip link" shows VF information in both old and new iproute2 versions. Signed-off-by: Michal Schmidt <mschmidt@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-12tcp: fixing TLP's FIN recoveryPer Hurtig
Fix to a problem observed when losing a FIN segment that does not contain data. In such situations, TLP is unable to recover from *any* tail loss and instead adds at least PTO ms to the retransmission process, i.e., RTO = RTO + PTO. Signed-off-by: Per Hurtig <per.hurtig@kau.se> Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Nandita Dukkipati <nanditad@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-12bridge: fix compile error when compiling without IPv6 supportLinus Lüssing
Some fields in "struct net_bridge" aren't available when compiling the kernel without IPv6 support. Therefore adding a check/macro to skip the complaining code sections in that case. Introduced by 2cd4143192e8c60f66cb32c3a30c76d0470a372d ("bridge: memorize and export selected IGMP/MLD querier port") Reported-by: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: Linus Lüssing <linus.luessing@web.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-12bridge: fix smatch warning / potential null pointer dereferenceLinus Lüssing
"New smatch warnings: net/bridge/br_multicast.c:1368 br_ip6_multicast_query() error: we previously assumed 'group' could be null (see line 1349)" In the rare (sort of broken) case of a query having a Maximum Response Delay of zero, we could create a potential null pointer dereference. Fixing this by skipping the multicast specific MLD Query parsing again if no multicast group address is available. Introduced by dc4eb53a996a78bfb8ea07b47423ff5a3aadc362 ("bridge: adhere to querier election mechanism specified by RFCs") Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Linus Lüssing <linus.luessing@web.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-12Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs updates from Al Viro: "This the bunch that sat in -next + lock_parent() fix. This is the minimal set; there's more pending stuff. In particular, I really hope to get acct.c fixes merged this cycle - we need that to deal sanely with delayed-mntput stuff. In the next pile, hopefully - that series is fairly short and localized (kernel/acct.c, fs/super.c and fs/namespace.c). In this pile: more iov_iter work. Most of prereqs for ->splice_write with sane locking order are there and Kent's dio rewrite would also fit nicely on top of this pile" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (70 commits) lock_parent: don't step on stale ->d_parent of all-but-freed one kill generic_file_splice_write() ceph: switch to iter_file_splice_write() shmem: switch to iter_file_splice_write() nfs: switch to iter_splice_write_file() fs/splice.c: remove unneeded exports ocfs2: switch to iter_file_splice_write() ->splice_write() via ->write_iter() bio_vec-backed iov_iter optimize copy_page_{to,from}_iter() bury generic_file_aio_{read,write} lustre: get rid of messing with iovecs ceph: switch to ->write_iter() ceph_sync_direct_write: stop poking into iov_iter guts ceph_sync_read: stop poking into iov_iter guts new helper: copy_page_from_iter() fuse: switch to ->write_iter() btrfs: switch to ->write_iter() ocfs2: switch to ->write_iter() xfs: switch to ->write_iter() ...
2014-06-12sctp: Fix sk_ack_backlog wrap-around problemXufeng Zhang
Consider the scenario: For a TCP-style socket, while processing the COOKIE_ECHO chunk in sctp_sf_do_5_1D_ce(), after it has passed a series of sanity check, a new association would be created in sctp_unpack_cookie(), but afterwards, some processing maybe failed, and sctp_association_free() will be called to free the previously allocated association, in sctp_association_free(), sk_ack_backlog value is decremented for this socket, since the initial value for sk_ack_backlog is 0, after the decrement, it will be 65535, a wrap-around problem happens, and if we want to establish new associations afterward in the same socket, ABORT would be triggered since sctp deem the accept queue as full. Fix this issue by only decrementing sk_ack_backlog for associations in the endpoint's list. Fix-suggested-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: Xufeng Zhang <xufeng.zhang@windriver.com> Acked-by: Daniel Borkmann <dborkman@redhat.com> Acked-by: Vlad Yasevich <vyasevich@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-12Merge commit '9f12600fe425bc28f0ccba034a77783c09c15af4' into for-linusAl Viro
Backmerge of dcache.c changes from mainline. It's that, or complete rebase... Conflicts: fs/splice.c Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-06-11Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Conflicts: net/core/rtnetlink.c net/core/skbuff.c Both conflicts were very simple overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-11net/core: Add VF link state control policyDoug Ledford
Commit 1d8faf48c7 (net/core: Add VF link state control) added VF link state control to the netlink VF nested structure, but failed to add a proper entry for the new structure into the VF policy table. Add the missing entry so the table and the actual data copied into the netlink nested struct are in sync. Signed-off-by: Doug Ledford <dledford@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-11net_sched: drr: warn when qdisc is not work conservingFlorian Westphal
The DRR scheduler requires that items on the active list are work conserving, i.e. do not hold on to skbs for throttling purposes, etc. Attaching e.g. tbf renders DRR useless because all other classes on the active list are delayed as well. So, warn users that this configuration won't work as expected; we already do this in couple of other qdiscs, see e.g. commit b00355db3f88d96810a60011a30cfb2c3469409d ('pkt_sched: sch_hfsc: sch_htb: Add non-work-conserving warning handler') The 'const' change is needed to avoid compiler warning ("discards 'const' qualifier from pointer target type"). tested with: drr_hier() { parent=$1 classes=$2 for i in $(seq 1 $classes); do classid=$parent$(printf %x $i) tc class add dev eth0 parent $parent classid $classid drr tc qdisc add dev eth0 parent $classid tbf rate 64kbit burst 256kbit limit 64kbit done } tc qdisc add dev eth0 root handle 1: drr drr_hier 1: 32 tc filter add dev eth0 protocol all pref 1 parent 1: handle 1 flow hash keys dst perturb 1 divisor 32 Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-11net: Add skb_gro_postpull_rcsum to udp and vxlanTom Herbert
Need to gro_postpull_rcsum for GRO to work with checksum complete. Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-11net: Save software checksum completeTom Herbert
In skb_checksum complete, if we need to compute the checksum for the packet (via skb_checksum) save the result as CHECKSUM_COMPLETE. Subsequent checksum verification can use this. Also, added csum_complete_sw flag to distinguish between software and hardware generated checksum complete, we should always be able to trust the software computation. Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-11ceph: remove bogus externstephen hemminger
Sparse complained about this bogus extern on definition of a function. Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-11ipv4: fix a race in ip4_datagram_release_cb()Eric Dumazet
Alexey gave a AddressSanitizer[1] report that finally gave a good hint at where was the origin of various problems already reported by Dormando in the past [2] Problem comes from the fact that UDP can have a lockless TX path, and concurrent threads can manipulate sk_dst_cache, while another thread, is holding socket lock and calls __sk_dst_set() in ip4_datagram_release_cb() (this was added in linux-3.8) It seems that all we need to do is to use sk_dst_check() and sk_dst_set() so that all the writers hold same spinlock (sk->sk_dst_lock) to prevent corruptions. TCP stack do not need this protection, as all sk_dst_cache writers hold the socket lock. [1] https://code.google.com/p/address-sanitizer/wiki/AddressSanitizerForKernel AddressSanitizer: heap-use-after-free in ipv4_dst_check Read of size 2 by thread T15453: [<ffffffff817daa3a>] ipv4_dst_check+0x1a/0x90 ./net/ipv4/route.c:1116 [<ffffffff8175b789>] __sk_dst_check+0x89/0xe0 ./net/core/sock.c:531 [<ffffffff81830a36>] ip4_datagram_release_cb+0x46/0x390 ??:0 [<ffffffff8175eaea>] release_sock+0x17a/0x230 ./net/core/sock.c:2413 [<ffffffff81830882>] ip4_datagram_connect+0x462/0x5d0 ??:0 [<ffffffff81846d06>] inet_dgram_connect+0x76/0xd0 ./net/ipv4/af_inet.c:534 [<ffffffff817580ac>] SYSC_connect+0x15c/0x1c0 ./net/socket.c:1701 [<ffffffff817596ce>] SyS_connect+0xe/0x10 ./net/socket.c:1682 [<ffffffff818b0a29>] system_call_fastpath+0x16/0x1b ./arch/x86/kernel/entry_64.S:629 Freed by thread T15455: [<ffffffff8178d9b8>] dst_destroy+0xa8/0x160 ./net/core/dst.c:251 [<ffffffff8178de25>] dst_release+0x45/0x80 ./net/core/dst.c:280 [<ffffffff818304c1>] ip4_datagram_connect+0xa1/0x5d0 ??:0 [<ffffffff81846d06>] inet_dgram_connect+0x76/0xd0 ./net/ipv4/af_inet.c:534 [<ffffffff817580ac>] SYSC_connect+0x15c/0x1c0 ./net/socket.c:1701 [<ffffffff817596ce>] SyS_connect+0xe/0x10 ./net/socket.c:1682 [<ffffffff818b0a29>] system_call_fastpath+0x16/0x1b ./arch/x86/kernel/entry_64.S:629 Allocated by thread T15453: [<ffffffff8178d291>] dst_alloc+0x81/0x2b0 ./net/core/dst.c:171 [<ffffffff817db3b7>] rt_dst_alloc+0x47/0x50 ./net/ipv4/route.c:1406 [< inlined >] __ip_route_output_key+0x3e8/0xf70 __mkroute_output ./net/ipv4/route.c:1939 [<ffffffff817dde08>] __ip_route_output_key+0x3e8/0xf70 ./net/ipv4/route.c:2161 [<ffffffff817deb34>] ip_route_output_flow+0x14/0x30 ./net/ipv4/route.c:2249 [<ffffffff81830737>] ip4_datagram_connect+0x317/0x5d0 ??:0 [<ffffffff81846d06>] inet_dgram_connect+0x76/0xd0 ./net/ipv4/af_inet.c:534 [<ffffffff817580ac>] SYSC_connect+0x15c/0x1c0 ./net/socket.c:1701 [<ffffffff817596ce>] SyS_connect+0xe/0x10 ./net/socket.c:1682 [<ffffffff818b0a29>] system_call_fastpath+0x16/0x1b ./arch/x86/kernel/entry_64.S:629 [2] <4>[196727.311203] general protection fault: 0000 [#1] SMP <4>[196727.311224] Modules linked in: xt_TEE xt_dscp xt_DSCP macvlan bridge coretemp crc32_pclmul ghash_clmulni_intel gpio_ich microcode ipmi_watchdog ipmi_devintf sb_edac edac_core lpc_ich mfd_core tpm_tis tpm tpm_bios ipmi_si ipmi_msghandler isci igb libsas i2c_algo_bit ixgbe ptp pps_core mdio <4>[196727.311333] CPU: 17 PID: 0 Comm: swapper/17 Not tainted 3.10.26 #1 <4>[196727.311344] Hardware name: Supermicro X9DRi-LN4+/X9DR3-LN4+/X9DRi-LN4+/X9DR3-LN4+, BIOS 3.0 07/05/2013 <4>[196727.311364] task: ffff885e6f069700 ti: ffff885e6f072000 task.ti: ffff885e6f072000 <4>[196727.311377] RIP: 0010:[<ffffffff815f8c7f>] [<ffffffff815f8c7f>] ipv4_dst_destroy+0x4f/0x80 <4>[196727.311399] RSP: 0018:ffff885effd23a70 EFLAGS: 00010282 <4>[196727.311409] RAX: dead000000200200 RBX: ffff8854c398ecc0 RCX: 0000000000000040 <4>[196727.311423] RDX: dead000000100100 RSI: dead000000100100 RDI: dead000000200200 <4>[196727.311437] RBP: ffff885effd23a80 R08: ffffffff815fd9e0 R09: ffff885d5a590800 <4>[196727.311451] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 <4>[196727.311464] R13: ffffffff81c8c280 R14: 0000000000000000 R15: ffff880e85ee16ce <4>[196727.311510] FS: 0000000000000000(0000) GS:ffff885effd20000(0000) knlGS:0000000000000000 <4>[196727.311554] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 <4>[196727.311581] CR2: 00007a46751eb000 CR3: 0000005e65688000 CR4: 00000000000407e0 <4>[196727.311625] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 <4>[196727.311669] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 <4>[196727.311713] Stack: <4>[196727.311733] ffff8854c398ecc0 ffff8854c398ecc0 ffff885effd23ab0 ffffffff815b7f42 <4>[196727.311784] ffff88be6595bc00 ffff8854c398ecc0 0000000000000000 ffff8854c398ecc0 <4>[196727.311834] ffff885effd23ad0 ffffffff815b86c6 ffff885d5a590800 ffff8816827821c0 <4>[196727.311885] Call Trace: <4>[196727.311907] <IRQ> <4>[196727.311912] [<ffffffff815b7f42>] dst_destroy+0x32/0xe0 <4>[196727.311959] [<ffffffff815b86c6>] dst_release+0x56/0x80 <4>[196727.311986] [<ffffffff81620bd5>] tcp_v4_do_rcv+0x2a5/0x4a0 <4>[196727.312013] [<ffffffff81622b5a>] tcp_v4_rcv+0x7da/0x820 <4>[196727.312041] [<ffffffff815fd9e0>] ? ip_rcv_finish+0x360/0x360 <4>[196727.312070] [<ffffffff815de02d>] ? nf_hook_slow+0x7d/0x150 <4>[196727.312097] [<ffffffff815fd9e0>] ? ip_rcv_finish+0x360/0x360 <4>[196727.312125] [<ffffffff815fda92>] ip_local_deliver_finish+0xb2/0x230 <4>[196727.312154] [<ffffffff815fdd9a>] ip_local_deliver+0x4a/0x90 <4>[196727.312183] [<ffffffff815fd799>] ip_rcv_finish+0x119/0x360 <4>[196727.312212] [<ffffffff815fe00b>] ip_rcv+0x22b/0x340 <4>[196727.312242] [<ffffffffa0339680>] ? macvlan_broadcast+0x160/0x160 [macvlan] <4>[196727.312275] [<ffffffff815b0c62>] __netif_receive_skb_core+0x512/0x640 <4>[196727.312308] [<ffffffff811427fb>] ? kmem_cache_alloc+0x13b/0x150 <4>[196727.312338] [<ffffffff815b0db1>] __netif_receive_skb+0x21/0x70 <4>[196727.312368] [<ffffffff815b0fa1>] netif_receive_skb+0x31/0xa0 <4>[196727.312397] [<ffffffff815b1ae8>] napi_gro_receive+0xe8/0x140 <4>[196727.312433] [<ffffffffa00274f1>] ixgbe_poll+0x551/0x11f0 [ixgbe] <4>[196727.312463] [<ffffffff815fe00b>] ? ip_rcv+0x22b/0x340 <4>[196727.312491] [<ffffffff815b1691>] net_rx_action+0x111/0x210 <4>[196727.312521] [<ffffffff815b0db1>] ? __netif_receive_skb+0x21/0x70 <4>[196727.312552] [<ffffffff810519d0>] __do_softirq+0xd0/0x270 <4>[196727.312583] [<ffffffff816cef3c>] call_softirq+0x1c/0x30 <4>[196727.312613] [<ffffffff81004205>] do_softirq+0x55/0x90 <4>[196727.312640] [<ffffffff81051c85>] irq_exit+0x55/0x60 <4>[196727.312668] [<ffffffff816cf5c3>] do_IRQ+0x63/0xe0 <4>[196727.312696] [<ffffffff816c5aaa>] common_interrupt+0x6a/0x6a <4>[196727.312722] <EOI> <1>[196727.313071] RIP [<ffffffff815f8c7f>] ipv4_dst_destroy+0x4f/0x80 <4>[196727.313100] RSP <ffff885effd23a70> <4>[196727.313377] ---[ end trace 64b3f14fae0f2e29 ]--- <0>[196727.380908] Kernel panic - not syncing: Fatal exception in interrupt Reported-by: Alexey Preobrazhensky <preobr@google.com> Reported-by: dormando <dormando@rydia.ne> Signed-off-by: Eric Dumazet <edumazet@google.com> Fixes: 8141ed9fcedb2 ("ipv4: Add a socket release callback for datagram sockets") Cc: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-11net: add __pskb_copy_fclone and pskb_copy_for_cloneOctavian Purdila
There are several instances where a pskb_copy or __pskb_copy is immediately followed by an skb_clone. Add a couple of new functions to allow the copy skb to be allocated from the fclone cache and thus speed up subsequent skb_clone calls. Cc: Alexander Smirnov <alex.bluesman.smirnov@gmail.com> Cc: Dmitry Eremin-Solenikov <dbaryshkov@gmail.com> Cc: Marek Lindner <mareklindner@neomailbox.ch> Cc: Simon Wunderlich <sw@simonwunderlich.de> Cc: Antonio Quartulli <antonio@meshcoding.com> Cc: Marcel Holtmann <marcel@holtmann.org> Cc: Gustavo Padovan <gustavo@padovan.org> Cc: Johan Hedberg <johan.hedberg@gmail.com> Cc: Arvid Brodin <arvid.brodin@alten.se> Cc: Patrick McHardy <kaber@trash.net> Cc: Pablo Neira Ayuso <pablo@netfilter.org> Cc: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu> Cc: Lauro Ramos Venancio <lauro.venancio@openbossa.org> Cc: Aloisio Almeida Jr <aloisio.almeida@openbossa.org> Cc: Samuel Ortiz <sameo@linux.intel.com> Cc: Jon Maloy <jon.maloy@ericsson.com> Cc: Allan Stephens <allan.stephens@windriver.com> Cc: Andrew Hendry <andrew.hendry@gmail.com> Cc: Eric Dumazet <edumazet@google.com> Reviewed-by: Christoph Paasch <christoph.paasch@uclouvain.be> Signed-off-by: Octavian Purdila <octavian.purdila@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-11bridge: Support 802.1ad vlan filteringToshiaki Makita
This enables us to change the vlan protocol for vlan filtering. We come to be able to filter frames on the basis of 802.1ad vlan tags through a bridge. This also changes br->group_addr if it has not been set by user. This is needed for an 802.1ad bridge. (See IEEE 802.1Q-2011 8.13.5.) Furthermore, this sets br->group_fwd_mask_required so that an 802.1ad bridge can forward the Nearest Customer Bridge group addresses except for br->group_addr, which should be passed to higher layer. To change the vlan protocol, write a protocol in sysfs: # echo 0x88a8 > /sys/class/net/br0/bridge/vlan_protocol Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-11bridge: Prepare for forwarding another bridge group addressesToshiaki Makita
If a bridge is an 802.1ad bridge, it must forward another bridge group addresses (the Nearest Customer Bridge group addresses). (For details, see IEEE 802.1Q-2011 8.6.3.) As user might not want group_fwd_mask to be modified by enabling 802.1ad, introduce a new mask, group_fwd_mask_required, which indicates addresses the bridge wants to forward. This will be set by enabling 802.1ad. Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-11bridge: Prepare for 802.1ad vlan filtering supportToshiaki Makita
This enables a bridge to have vlan protocol informantion and allows vlan tag manipulation (retrieve, insert and remove tags) according to the vlan protocol. Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-11bridge: Add 802.1ad tx vlan accelerationToshiaki Makita
Bridge device doesn't need to embed S-tag into skb->data. Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-11net: filter: fix warning on 32-bit archAlexei Starovoitov
fix compiler warning on 32-bit architectures: net/core/filter.c: In function '__sk_run_filter': net/core/filter.c:540:22: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast] net/core/filter.c:550:22: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast] net/core/filter.c:560:22: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast] Reported-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-11tipc: fix potential bug in function tipc_backlog_rcvJon Paul Maloy
In commit 4f4482dcd9a0606a30541ff165ddaca64748299b ("tipc: compensate for double accounting in socket rcv buffer") we access 'truesize' of a received buffer after it might have been released by the function filter_rcv(). In this commit we correct this by reading the value of 'truesize' to the stack before delivering the buffer to filter_rcv(). Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-11net: sctp: fix incorrect type in gfp initializerDaniel Borkmann
This fixes the following sparse warning: net/sctp/associola.c:1556:29: warning: incorrect type in initializer (different base types) net/sctp/associola.c:1556:29: expected bool [unsigned] [usertype] preload net/sctp/associola.c:1556:29: got restricted gfp_t Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-11net: sctp: improve sctp_select_active_and_retran_path selectionDaniel Borkmann
In function sctp_select_active_and_retran_path(), we walk the transport list in order to look for the two most recently used ACTIVE transports (trans_pri, trans_sec). In case we didn't find anything ACTIVE, we currently just camp on a possibly PF or INACTIVE transport that is primary path; this behavior actually dates back to linux-history tree of the very early days of lksctp, and can yield a behavior that chooses suboptimal transport paths. Instead, be a bit more clever by reusing and extending the recently introduced sctp_trans_elect_best() handler. In case both transports are evaluated to have the same score resulting from their states, break the tie by looking at: 1) transport patch error count 2) last_time_heard value from each transport. This is analogous to Nishida's Quick Failover draft [1], section 5.1, 3: The sender SHOULD avoid data transmission to PF destinations. When all destinations are in either PF or Inactive state, the sender MAY either move the destination from PF to active state (and transmit data to the active destination) or the sender MAY transmit data to a PF destination. In the former scenario, (i) the sender MUST NOT notify the ULP about the state transition, and (ii) MUST NOT clear the destination's error counter. It is recommended that the sender picks the PF destination with least error count (fewest consecutive timeouts) for data transmission. In case of a tie (multiple PF destinations with same error count), the sender MAY choose the last active destination. Thus for sctp_select_active_and_retran_path(), we keep track of the best, if any, transport that is in PF state and in case no ACTIVE transport has been found (hence trans_{pri,sec} is NULL), we select the best out of the three: current primary_path and retran_path as well as a possible PF transport. The secondary may still camp on the original primary_path as before. The change in sctp_trans_elect_best() with a more fine grained tie selection also improves at the same time path selection for sctp_assoc_update_retran_path() in case of non-ACTIVE states. [1] http://tools.ietf.org/html/draft-nishida-tsvwg-sctp-failover-05 Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-11net: sctp: migrate most recently used transport to ktimeDaniel Borkmann
Be more precise in transport path selection and use ktime helpers instead of jiffies to compare and pick the better primary and secondary recently used transports. This also avoids any side-effects during a possible roll-over, and could lead to better path decision-making. Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-11net: sctp: refactor active path selectionDaniel Borkmann
This patch just refactors and moves the code for the active path selection into its own helper function outside of sctp_assoc_control_transport() which is already big enough. No functional changes here. Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-11ktime: add ktime_after and ktime_before helperDaniel Borkmann
Add two minimal helper functions analogous to time_before() and time_after() that will later on both be needed by SCTP code. Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-11mac802154: don't deliver packets to devices that are downPhoebe Buckheister
Only one WPAN devices can be active at any given time, so only deliver packets to that one interface that is actually up. Multiple monitors may be up at any given time, but we don't have to deliver to monitors that are down either. Signed-off-by: Phoebe Buckheister <phoebe.buckheister@itwm.fraunhofer.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-11mac802154: properly free incoming skbs on decryption failurePhoebe Buckheister
mac802154 RX did not free skbs on decryption failure, assuming that the caller would when the local rx handler returned _DROP. This was false. Signed-off-by: Phoebe Buckheister <phoebe.buckheister@itwm.fraunhofer.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-11net: fix UDP tunnel GSO of frag_list GRO packetsWei-Chun Chao
This patch fixes a kernel BUG_ON in skb_segment. It is hit when testing two VMs on openvswitch with one VM acting as VXLAN gateway. During VXLAN packet GSO, skb_segment is called with skb->data pointing to inner TCP payload. skb_segment calls skb_network_protocol to retrieve the inner protocol. skb_network_protocol actually expects skb->data to point to MAC and it calls pskb_may_pull with ETH_HLEN. This ends up pulling in ETH_HLEN data from header tail. As a result, pskb_trim logic is skipped and BUG_ON is hit later. Move skb_push in front of skb_network_protocol so that skb->data lines up properly. kernel BUG at net/core/skbuff.c:2999! Call Trace: [<ffffffff816ac412>] tcp_gso_segment+0x122/0x410 [<ffffffff816bc74c>] inet_gso_segment+0x13c/0x390 [<ffffffff8164b39b>] skb_mac_gso_segment+0x9b/0x170 [<ffffffff816b3658>] skb_udp_tunnel_segment+0xd8/0x390 [<ffffffff816b3c00>] udp4_ufo_fragment+0x120/0x140 [<ffffffff816bc74c>] inet_gso_segment+0x13c/0x390 [<ffffffff8109d742>] ? default_wake_function+0x12/0x20 [<ffffffff8164b39b>] skb_mac_gso_segment+0x9b/0x170 [<ffffffff8164b4d0>] __skb_gso_segment+0x60/0xc0 [<ffffffff8164b6b3>] dev_hard_start_xmit+0x183/0x550 [<ffffffff8166c91e>] sch_direct_xmit+0xfe/0x1d0 [<ffffffff8164bc94>] __dev_queue_xmit+0x214/0x4f0 [<ffffffff8164bf90>] dev_queue_xmit+0x10/0x20 [<ffffffff81687edb>] ip_finish_output+0x66b/0x890 [<ffffffff81688a58>] ip_output+0x58/0x90 [<ffffffff816c628f>] ? fib_table_lookup+0x29f/0x350 [<ffffffff816881c9>] ip_local_out_sk+0x39/0x50 [<ffffffff816cbfad>] iptunnel_xmit+0x10d/0x130 [<ffffffffa0212200>] vxlan_xmit_skb+0x1d0/0x330 [vxlan] [<ffffffffa02a3919>] vxlan_tnl_send+0x129/0x1a0 [openvswitch] [<ffffffffa02a2cd6>] ovs_vport_send+0x26/0xa0 [openvswitch] [<ffffffffa029931e>] do_output+0x2e/0x50 [openvswitch] Signed-off-by: Wei-Chun Chao <weichunc@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-11net: ipv6: Fixed up ipsec packet be re-routing issuehuizhang
Bug report on https://bugzilla.kernel.org/show_bug.cgi?id=75781 When a local output ipsec packet match the mangle table rule, and be set mark value, the packet will be route again in route_me_harder -> _session_decoder6 In this case, the nhoff in CB of skb was still the default value 0. So the protocal match can't success and the packet can't match correct SA rule,and then the packet be send out in plaintext. To fixed up the issue. The CB->nhoff must be set. Signed-off-by: Hui Zhang <huizhang@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-11ip_tunnel: fix i_key matching in ip_tunnel_findDmitry Popov
Some tunnels (though only vti as for now) can use i_key just for internal use: for example vti uses it for fwmark'ing incoming packets. So raw i_key value shouldn't be treated as a distinguisher for them. ip_tunnel_key_match exists for cases when we want to compare two ip_tunnel_parms' i_keys. Example bug: ip link add type vti ikey 1 local 1.0.0.1 remote 2.0.0.2 ip link add type vti ikey 2 local 1.0.0.1 remote 2.0.0.2 spawned two tunnels, although it doesn't make sense. Signed-off-by: Dmitry Popov <ixaphire@qrator.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-11ip_vti: Fix 'ip tunnel add' with 'key' parametersDmitry Popov
ip tunnel add remote 10.2.2.1 local 10.2.2.2 mode vti ikey 1 okey 2 translates to p->iflags = VTI_ISVTI|GRE_KEY and p->i_key = 1, but GRE_KEY != TUNNEL_KEY, so ip_tunnel_ioctl would set i_key to 0 (same story with o_key) making us unable to create vti tunnels with [io]key via ip tunnel. We cannot simply translate GRE_KEY to TUNNEL_KEY (as GRE module does) because vti_tunnels with same local/remote addresses but different ikeys will be treated as different then. So, imo the best option here is to move p->i_flags & *_KEY check for vti tunnels from ip_tunnel.c to ip_vti.c and to think about [io]_mark field for ip_tunnel_parm in the future. Signed-off-by: Dmitry Popov <ixaphire@qrator.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-11net: filter: cleanup A/X name usageAlexei Starovoitov
The macro 'A' used in internal BPF interpreter: #define A regs[insn->a_reg] was easily confused with the name of classic BPF register 'A', since 'A' would mean two different things depending on context. This patch is trying to clean up the naming and clarify its usage in the following way: - A and X are names of two classic BPF registers - BPF_REG_A denotes internal BPF register R0 used to map classic register A in internal BPF programs generated from classic - BPF_REG_X denotes internal BPF register R7 used to map classic register X in internal BPF programs generated from classic - internal BPF instruction format: struct sock_filter_int { __u8 code; /* opcode */ __u8 dst_reg:4; /* dest register */ __u8 src_reg:4; /* source register */ __s16 off; /* signed offset */ __s32 imm; /* signed immediate constant */ }; - BPF_X/BPF_K is 1 bit used to encode source operand of instruction In classic: BPF_X - means use register X as source operand BPF_K - means use 32-bit immediate as source operand In internal: BPF_X - means use 'src_reg' register as source operand BPF_K - means use 32-bit immediate as source operand Suggested-by: Chema Gonzalez <chema@google.com> Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Acked-by: Daniel Borkmann <dborkman@redhat.com> Acked-by: Chema Gonzalez <chema@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-11dns_resolver: assure that dns_query() result is null-terminatedManuel Schölling
dns_query() credulously assumes that keys are null-terminated and returns a copy of a memory block that is off by one. Signed-off-by: Manuel Schölling <manuel.schoelling@gmx.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-10bridge: memorize and export selected IGMP/MLD querier portLinus Lüssing
Adding bridge support to the batman-adv multicast optimization requires batman-adv knowing about the existence of bridged-in IGMP/MLD queriers to be able to reliably serve any multicast listener behind this same bridge. Signed-off-by: Linus Lüssing <linus.luessing@web.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-10bridge: add export of multicast database adjacent to net_devLinus Lüssing
With this new, exported function br_multicast_list_adjacent(net_dev) a list of IPv4/6 addresses is returned. This list contains all multicast addresses sensed by the bridge multicast snooping feature on all bridge ports of the bridge interface of net_dev, excluding addresses from the specified net_device itself. Adding bridge support to the batman-adv multicast optimization requires batman-adv knowing about the existence of bridged-in multicast listeners to be able to reliably serve them with multicast packets. Signed-off-by: Linus Lüssing <linus.luessing@web.de> Signed-off-by: David S. Miller <davem@davemloft.net>