summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2015-07-26inet: frag: don't re-use chainlist for evictorFlorian Westphal
commit 65ba1f1ec0eff ("inet: frags: fix a race between inet_evict_bucket and inet_frag_kill") describes the bug, but the fix doesn't work reliably. Problem is that ->flags member can be set on other cpu without chainlock being held by that task, i.e. the RMW-Cycle can clear INET_FRAG_EVICTED bit after we put the element on the evictor private list. We can crash when walking the 'private' evictor list since an element can be deleted from list underneath the evictor. Join work with Nikolay Alexandrov. Fixes: b13d3cbfb8e8 ("inet: frag: move eviction of queues to work queue") Reported-by: Johan Schuijt <johan@transip.nl> Tested-by: Frank Schreuder <fschreuder@transip.nl> Signed-off-by: Nikolay Alexandrov <nikolay@cumulusnetworks.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-26net: sctp: stop spamming klog with rfc6458, 5.3.2. deprecation warningsDaniel Borkmann
Back then when we added support for SCTP_SNDINFO/SCTP_RCVINFO from RFC6458 5.3.4/5.3.5, we decided to add a deprecation warning for the (as per RFC deprecated) SCTP_SNDRCV via commit bbbea41d5e53 ("net: sctp: deprecate rfc6458, 5.3.2. SCTP_SNDRCV support"), see [1]. Imho, it was not a good idea, and we should just revert that message for a couple of reasons: 1) It's uapi and therefore set in stone forever. 2) To be able to run on older and newer kernels, an SCTP application would need to probe for both, SCTP_SNDRCV, but also SCTP_SNDINFO/ SCTP_RCVINFO support, so that on older kernels, it can make use of SCTP_SNDRCV, and on newer kernels SCTP_SNDINFO/SCTP_RCVINFO. In my (limited) experience, a lot of SCTP appliances are migrating to newer kernels only ve(ee)ry slowly. 3) Some people don't have the chance to change their applications, f.e. due to proprietary legacy stuff. So, they'll hit this warning in fast path and are stuck with older kernels. But i.e. due to point 1) I really fail to see the benefit of a warning. So just revert that for now, the issue was reported up Jamal. [1] http://thread.gmane.org/gmane.linux.network/321960/ Reported-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Michael Tuexen <tuexen@fh-muenster.de> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-26bridge: netlink: fix slave_changelink/br_setport race conditionsNikolay Aleksandrov
Since slave_changelink support was added there have been a few race conditions when using br_setport() since some of the port functions it uses require the bridge lock. It is very easy to trigger a lockup due to some internal spin_lock() usage without bh disabled, also it's possible to get the bridge into an inconsistent state. Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Fixes: 3ac636b8591c ("bridge: implement rtnl_link_ops->slave_changelink") Reviewed-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-25Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter/IPVS fixes for net The following patchset contains ten Netfilter/IPVS fixes, they are: 1) Address refcount leak when creating an expectation from the ctnetlink interface. 2) Fix bug splat in the IDLETIMER target related to sysfs, from Dmitry Torokhov. 3) Resolve panic for unreachable route in IPVS with locally generated traffic in the output path, from Alex Gartrell. 4) Fix wrong source address in rare cases for tunneled traffic in IPVS, from Julian Anastasov. 5) Fix crash if scheduler is changed via ipvsadm -E, again from Julian. 6) Make sure skb->sk is unset for forwarded traffic through IPVS, again from Alex Gartrell. 7) Fix crash with IPVS sync protocol v0 and FTP, from Julian. 8) Reset sender cpu for forwarded traffic in IPVS, also from Julian. 9) Allocate template conntracks through kmalloc() to resolve netns dependency problems with the conntrack kmem_cache. 10) Fix zones with expectations that clash using the same tuple, from Joe Stringer. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-25cgroup: net_cls: fix false-positive "suspicious RCU usage"Konstantin Khlebnikov
In dev_queue_xmit() net_cls protected with rcu-bh. [ 270.730026] =============================== [ 270.730029] [ INFO: suspicious RCU usage. ] [ 270.730033] 4.2.0-rc3+ #2 Not tainted [ 270.730036] ------------------------------- [ 270.730040] include/linux/cgroup.h:353 suspicious rcu_dereference_check() usage! [ 270.730041] other info that might help us debug this: [ 270.730043] rcu_scheduler_active = 1, debug_locks = 1 [ 270.730045] 2 locks held by dhclient/748: [ 270.730046] #0: (rcu_read_lock_bh){......}, at: [<ffffffff81682b70>] __dev_queue_xmit+0x50/0x960 [ 270.730085] #1: (&qdisc_tx_lock){+.....}, at: [<ffffffff81682d60>] __dev_queue_xmit+0x240/0x960 [ 270.730090] stack backtrace: [ 270.730096] CPU: 0 PID: 748 Comm: dhclient Not tainted 4.2.0-rc3+ #2 [ 270.730098] Hardware name: OpenStack Foundation OpenStack Nova, BIOS Bochs 01/01/2011 [ 270.730100] 0000000000000001 ffff8800bafeba58 ffffffff817ad487 0000000000000007 [ 270.730103] ffff880232a0a780 ffff8800bafeba88 ffffffff810ca4f2 ffff88022fb23e00 [ 270.730105] ffff880232a0a780 ffff8800bafebb68 ffff8800bafebb68 ffff8800bafebaa8 [ 270.730108] Call Trace: [ 270.730121] [<ffffffff817ad487>] dump_stack+0x4c/0x65 [ 270.730148] [<ffffffff810ca4f2>] lockdep_rcu_suspicious+0xe2/0x120 [ 270.730153] [<ffffffff816a62d2>] task_cls_state+0x92/0xa0 [ 270.730158] [<ffffffffa00b534f>] cls_cgroup_classify+0x4f/0x120 [cls_cgroup] [ 270.730164] [<ffffffff816aac74>] tc_classify_compat+0x74/0xc0 [ 270.730166] [<ffffffff816ab573>] tc_classify+0x33/0x90 [ 270.730170] [<ffffffffa00bcb0a>] htb_enqueue+0xaa/0x4a0 [sch_htb] [ 270.730172] [<ffffffff81682e26>] __dev_queue_xmit+0x306/0x960 [ 270.730174] [<ffffffff81682b70>] ? __dev_queue_xmit+0x50/0x960 [ 270.730176] [<ffffffff816834a3>] dev_queue_xmit_sk+0x13/0x20 [ 270.730185] [<ffffffff81787770>] dev_queue_xmit+0x10/0x20 [ 270.730187] [<ffffffff8178b91c>] packet_snd.isra.62+0x54c/0x760 [ 270.730190] [<ffffffff8178be25>] packet_sendmsg+0x2f5/0x3f0 [ 270.730203] [<ffffffff81665245>] ? sock_def_readable+0x5/0x190 [ 270.730210] [<ffffffff817b64bb>] ? _raw_spin_unlock+0x2b/0x40 [ 270.730216] [<ffffffff8173bcbc>] ? unix_dgram_sendmsg+0x5cc/0x640 [ 270.730219] [<ffffffff8165f367>] sock_sendmsg+0x47/0x50 [ 270.730221] [<ffffffff8165f42f>] sock_write_iter+0x7f/0xd0 [ 270.730232] [<ffffffff811fd4c7>] __vfs_write+0xa7/0xf0 [ 270.730234] [<ffffffff811fe5b8>] vfs_write+0xb8/0x190 [ 270.730236] [<ffffffff811fe8c2>] SyS_write+0x52/0xb0 [ 270.730239] [<ffffffff817b6bae>] entry_SYSCALL_64_fastpath+0x12/0x76 Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-24sch_choke: drop all packets in queue during resetWANG Cong
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-24sch_plug: purge buffered packets during resetWANG Cong
Otherwise the skbuff related structures are not correctly refcount'ed. Cc: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-24ipv4: consider TOS in fib_select_defaultJulian Anastasov
fib_select_default considers alternative routes only when res->fi is for the first alias in res->fa_head. In the common case this can happen only when the initial lookup matches the first alias with highest TOS value. This prevents the alternative routes to require specific TOS. This patch solves the problem as follows: - routes that require specific TOS should be returned by fib_select_default only when TOS matches, as already done in fib_table_lookup. This rule implies that depending on the TOS we can have many different lists of alternative gateways and we have to keep the last used gateway (fa_default) in first alias for the TOS instead of using single tb_default value. - as the aliases are ordered by many keys (TOS desc, fib_priority asc), we restrict the possible results to routes with matching TOS and lowest metric (fib_priority) and routes that match any TOS, again with lowest metric. For example, packet with TOS 8 can not use gw3 (not lowest metric), gw4 (different TOS) and gw6 (not lowest metric), all other gateways can be used: tos 8 via gw1 metric 2 <--- res->fa_head and res->fi tos 8 via gw2 metric 2 tos 8 via gw3 metric 3 tos 4 via gw4 tos 0 via gw5 tos 0 via gw6 metric 1 Reported-by: Hagen Paul Pfeifer <hagen@jauu.net> Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-24ipv4: fib_select_default should match the prefixJulian Anastasov
fib_trie starting from 4.1 can link fib aliases from different prefixes in same list. Make sure the alternative gateways are in same table and for same prefix (0) by checking tb_id and fa_slen. Fixes: 79e5ad2ceb00 ("fib_trie: Remove leaf_info") Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-23Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhostLinus Torvalds
Pull virtio/vhost fixes from Michael Tsirkin: "Bugfixes and documentation fixes. Igor's patch that allows users to tweak memory table size is borderline, but it does fix known crashes, so I merged it" * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: vhost: add max_mem_regions module parameter vhost: extend memory regions allocation to vmalloc 9p/trans_virtio: reset virtio device on remove virtio/s390: rename drivers/s390/kvm -> drivers/s390/virtio MAINTAINERS: separate section for s390 virtio drivers virtio: define virtio_pci_cfg_cap in header. virtio: Fix typecast of pointer in vring_init() virtio scsi: fix unused variable warning vhost: use binary search instead of linear in find_region() virtio_net: document VIRTIO_NET_CTRL_GUEST_OFFLOADS
2015-07-23Bluetooth: Fix NULL pointer dereference in smp_conn_securityJohan Hedberg
The l2cap_conn->smp pointer may be NULL for various valid reasons where SMP has failed to initialize properly. One such scenario is when crypto support is missing, another when the adapter has been powered on through a legacy method. The smp_conn_security() function should have the appropriate check for this situation to avoid NULL pointer dereferences. Signed-off-by: Johan Hedberg <johan.hedberg@intel.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Cc: stable@vger.kernel.org # 4.0+
2015-07-22Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds
Pull networking fixes from David Miller: 1) Don't use shared bluetooth antenna in iwlwifi driver for management frames, from Emmanuel Grumbach. 2) Fix device ID check in ath9k driver, from Felix Fietkau. 3) Off by one in xen-netback BUG checks, from Dan Carpenter. 4) Fix IFLA_VF_PORT netlink attribute validation, from Daniel Borkmann. 5) Fix races in setting peeked bit flag in SKBs during datagram receive. If it's shared we have to clone it otherwise the value can easily be corrupted. Fix from Herbert Xu. 6) Revert fec clock handling change, causes regressions. From Fabio Estevam. 7) Fix use after free in fq_codel and sfq packet schedulers, from WANG Cong. 8) ipvlan bug fixes (memory leaks, missing rcu_dereference_bh, etc.) from WANG Cong and Konstantin Khlebnikov. 9) Memory leak in act_bpf packet action, from Alexei Starovoitov. 10) ARM bpf JIT bug fixes from Nicolas Schichan. 11) Fix backwards compat of ANY_LAYOUT in virtio_net driver, from Michael S Tsirkin. 12) Destruction of bond with different ARP header types not handled correctly, fix from Nikolay Aleksandrov. 13) Revert GRO receive support in ipv6 SIT tunnel driver, causes regressions because the GRO packets created cannot be processed properly on the GSO side if we forward the frame. From Herbert Xu. 14) TCCR update race and other fixes to ravb driver from Sergei Shtylyov. 15) Fix SKB leaks in caif_queue_rcv_skb(), from Eric Dumazet. 16) Fix panics on packet scheduler filter replace, from Daniel Borkmann. 17) Make sure AF_PACKET sees properly IP headers in defragmented frames (via PACKET_FANOUT_FLAG_DEFRAG option), from Edward Hyunkoo Jee. 18) AF_NETLINK cannot hold mutex in RCU callback, fix from Florian Westphal. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (84 commits) ravb: fix ring memory allocation net: phy: dp83867: Fix warning check for setting the internal delay openvswitch: allocate nr_node_ids flow_stats instead of num_possible_nodes netlink: don't hold mutex in rcu callback when releasing mmapd ring ARM: net: fix vlan access instructions in ARM JIT. ARM: net: handle negative offsets in BPF JIT. ARM: net: fix condition for load_order > 0 when translating load instructions. tcp: suppress a division by zero warning drivers: net: cpsw: remove tx event processing in rx napi poll inet: frags: fix defragmented packet's IP header for af_packet net: mvneta: fix refilling for Rx DMA buffers stmmac: fix setting of driver data in stmmac_dvr_probe sched: cls_flow: fix panic on filter replace sched: cls_flower: fix panic on filter replace sched: cls_bpf: fix panic on filter replace net/mdio: fix mdio_bus_match for c45 PHY net: ratelimit warnings about dst entry refcount underflow or overflow caif: fix leaks and race in caif_queue_rcv_skb() qmi_wwan: add the second QMI/network interface for Sierra Wireless MC7305/MC7355 ravb: fix race updating TCCR ...
2015-07-22SUNRPC: xprt_complete_bc_request must also decrement the free slot countTrond Myklebust
Calling xprt_complete_bc_request() effectively causes the slot to be allocated, so it needs to decrement the backchannel free slot count as well. Fixes: 0d2a970d0ae5 ("SUNRPC: Fix a backchannel race") Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-07-22SUNRPC: Fix a backchannel deadlockTrond Myklebust
xprt_alloc_bc_request() cannot call xprt_free_bc_request() without deadlocking, since it already holds the xprt->bc_pa_lock. Reported-by: Chuck Lever <chuck.lever@oracle.com> Fixes: 0d2a970d0ae55 ("SUNRPC: Fix a backchannel race") Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-07-22netfilter: nf_conntrack: Support expectations in different zonesJoe Stringer
When zones were originally introduced, the expectation functions were all extended to perform lookup using the zone. However, insertion was not modified to check the zone. This means that two expectations which are intended to apply for different connections that have the same tuple but exist in different zones cannot both be tracked. Fixes: 5d0aa2ccd4 (netfilter: nf_conntrack: add support for "conntrack zones") Signed-off-by: Joe Stringer <joestringer@nicira.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-07-21openvswitch: allocate nr_node_ids flow_stats instead of num_possible_nodesChris J Arges
Some architectures like POWER can have a NUMA node_possible_map that contains sparse entries. This causes memory corruption with openvswitch since it allocates flow_cache with a multiple of num_possible_nodes() and assumes the node variable returned by for_each_node will index into flow->stats[node]. Use nr_node_ids to allocate a maximal sparse array instead of num_possible_nodes(). The crash was noticed after 3af229f2 was applied as it changed the node_possible_map to match node_online_map on boot. Fixes: 3af229f2071f5b5cb31664be6109561fbe19c861 Signed-off-by: Chris J Arges <chris.j.arges@canonical.com> Acked-by: Pravin B Shelar <pshelar@nicira.com> Acked-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-21netlink: don't hold mutex in rcu callback when releasing mmapd ringFlorian Westphal
Kirill A. Shutemov says: This simple test-case trigers few locking asserts in kernel: int main(int argc, char **argv) { unsigned int block_size = 16 * 4096; struct nl_mmap_req req = { .nm_block_size = block_size, .nm_block_nr = 64, .nm_frame_size = 16384, .nm_frame_nr = 64 * block_size / 16384, }; unsigned int ring_size; int fd; fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_GENERIC); if (setsockopt(fd, SOL_NETLINK, NETLINK_RX_RING, &req, sizeof(req)) < 0) exit(1); if (setsockopt(fd, SOL_NETLINK, NETLINK_TX_RING, &req, sizeof(req)) < 0) exit(1); ring_size = req.nm_block_nr * req.nm_block_size; mmap(NULL, 2 * ring_size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0); return 0; } +++ exited with 0 +++ BUG: sleeping function called from invalid context at /home/kas/git/public/linux-mm/kernel/locking/mutex.c:616 in_atomic(): 1, irqs_disabled(): 0, pid: 1, name: init 3 locks held by init/1: #0: (reboot_mutex){+.+...}, at: [<ffffffff81080959>] SyS_reboot+0xa9/0x220 #1: ((reboot_notifier_list).rwsem){.+.+..}, at: [<ffffffff8107f379>] __blocking_notifier_call_chain+0x39/0x70 #2: (rcu_callback){......}, at: [<ffffffff810d32e0>] rcu_do_batch.isra.49+0x160/0x10c0 Preemption disabled at:[<ffffffff8145365f>] __delay+0xf/0x20 CPU: 1 PID: 1 Comm: init Not tainted 4.1.0-00009-gbddf4c4818e0 #253 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Debian-1.8.2-1 04/01/2014 ffff88017b3d8000 ffff88027bc03c38 ffffffff81929ceb 0000000000000102 0000000000000000 ffff88027bc03c68 ffffffff81085a9d 0000000000000002 ffffffff81ca2a20 0000000000000268 0000000000000000 ffff88027bc03c98 Call Trace: <IRQ> [<ffffffff81929ceb>] dump_stack+0x4f/0x7b [<ffffffff81085a9d>] ___might_sleep+0x16d/0x270 [<ffffffff81085bed>] __might_sleep+0x4d/0x90 [<ffffffff8192e96f>] mutex_lock_nested+0x2f/0x430 [<ffffffff81932fed>] ? _raw_spin_unlock_irqrestore+0x5d/0x80 [<ffffffff81464143>] ? __this_cpu_preempt_check+0x13/0x20 [<ffffffff8182fc3d>] netlink_set_ring+0x1ed/0x350 [<ffffffff8182e000>] ? netlink_undo_bind+0x70/0x70 [<ffffffff8182fe20>] netlink_sock_destruct+0x80/0x150 [<ffffffff817e484d>] __sk_free+0x1d/0x160 [<ffffffff817e49a9>] sk_free+0x19/0x20 [..] Cong Wang says: We can't hold mutex lock in a rcu callback, [..] Thomas Graf says: The socket should be dead at this point. It might be simpler to add a netlink_release_ring() function which doesn't require locking at all. Reported-by: "Kirill A. Shutemov" <kirill@shutemov.name> Diagnosed-by: Cong Wang <cwang@twopensource.com> Suggested-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-21tcp: suppress a division by zero warningEric Dumazet
Andrew Morton reported following warning on one ARM build with gcc-4.4 : net/ipv4/inet_hashtables.c: In function 'inet_ehash_locks_alloc': net/ipv4/inet_hashtables.c:617: warning: division by zero Even guarded with a test on sizeof(spinlock_t), compiler does not like current construct on a !CONFIG_SMP build. Remove the warning by using a temporary variable. Fixes: 095dc8e0c368 ("tcp: fix/cleanup inet_ehash_locks_alloc()") Reported-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-21inet: frags: fix defragmented packet's IP header for af_packetEdward Hyunkoo Jee
When ip_frag_queue() computes positions, it assumes that the passed sk_buff does not contain L2 headers. However, when PACKET_FANOUT_FLAG_DEFRAG is used, IP reassembly functions can be called on outgoing packets that contain L2 headers. Also, IPv4 checksum is not corrected after reassembly. Fixes: 7736d33f4262 ("packet: Add pre-defragmentation support for ipv4 fanouts.") Signed-off-by: Edward Hyunkoo Jee <edjee@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Willem de Bruijn <willemb@google.com> Cc: Jerry Chu <hkchu@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-21sched: cls_flow: fix panic on filter replaceDaniel Borkmann
The following test case causes a NULL pointer dereference in cls_flow: tc filter add dev foo parent 1: handle 0x1 flow hash keys dst action ok tc filter replace dev foo parent 1: pref 49152 handle 0x1 \ flow hash keys mark action drop To be more precise, actually two different panics are fixed, the first occurs because tcf_exts_init() is not called on the newly allocated filter when we do a replace. And the second panic uncovered after that happens since the arguments of list_replace_rcu() are swapped, the old element needs to be the first argument and the new element the second. Fixes: 70da9f0bf999 ("net: sched: cls_flow use RCU") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: John Fastabend <john.r.fastabend@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-21sched: cls_flower: fix panic on filter replaceDaniel Borkmann
The following test case causes a NULL pointer dereference in cls_flower: tc filter add dev foo parent 1: flower eth_type ipv4 action ok flowid 1:1 tc filter replace dev foo parent 1: pref 49152 handle 0x1 \ flower eth_type ipv6 action ok flowid 1:1 The problem is that commit 77b9900ef53a ("tc: introduce Flower classifier") accidentally swapped the arguments of list_replace_rcu(), the old element needs to be the first argument and the new element the second. Fixes: 77b9900ef53a ("tc: introduce Flower classifier") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-21sched: cls_bpf: fix panic on filter replaceDaniel Borkmann
The following test case causes a NULL pointer dereference in cls_bpf: FOO="1,6 0 0 4294967295," tc filter add dev foo parent 1: bpf bytecode "$FOO" flowid 1:1 action ok tc filter replace dev foo parent 1: pref 49152 handle 0x1 \ bpf bytecode "$FOO" flowid 1:1 action drop The problem is that commit 1f947bf151e9 ("net: sched: rcu'ify cls_bpf") accidentally swapped the arguments of list_replace_rcu(), the old element needs to be the first argument and the new element the second. Fixes: 1f947bf151e9 ("net: sched: rcu'ify cls_bpf") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: John Fastabend <john.r.fastabend@intel.com> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-21Merge tag 'mac80211-for-davem-2015-07-17' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211 Johannes Berg says: ==================== Some fixes for the current cycle: 1. Arik introduced an rtnl-locked regulatory API to be able to differentiate between place do/don't have the RTNL; this fixes missing locking in some of the code paths 2. Two small mesh bugfixes from Bob, one to avoid treating a certain malformed over-the-air frame and one to avoid sending a garbage field over the air. 3. A fix for powersave during WoWLAN suspend from Krishna Chaitanya. 4. A fix for a powersave vs. aggregation teardown race, from Michal. 5. Thomas reduced the loglevel of CRDA messages to avoid spamming the kernel log with mostly irrelevant information. 6. Tom fixed a dangling debugfs directory pointer that could cause crashes if subsequent addition of the same interface to debugfs failed for some reason. 7. A fix from myself for a list corruption issue in mac80211 during combined interface shutdown/removal - shut down interfaces first and only then remove them to avoid that. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-21net: ratelimit warnings about dst entry refcount underflow or overflowKonstantin Khlebnikov
Kernel generates a lot of warnings when dst entry reference counter overflows and becomes negative. That bug was seen several times at machines with outdated 3.10.y kernels. Most like it's already fixed in upstream. Anyway that flood completely kills machine and makes further debugging impossible. Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-21caif: fix leaks and race in caif_queue_rcv_skb()Eric Dumazet
1) If sk_filter() is applied, skb was leaked (not freed) 2) Testing SOCK_DEAD twice is racy : packet could be freed while already queued. 3) Remove obsolete comment about caching skb->len Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-20Revert "sit: Add gro callbacks to sit_offload"Herbert Xu
This patch reverts 19424e052fb44da2f00d1a868cbb51f3e9f4bbb5 ("sit: Add gro callbacks to sit_offload") because it generates packets that cannot be handled even by our own GSO. Reported-by: Wolfgang Walter <linux@stwm.de> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-20Merge tag 'ipvs-fixes-for-v4.2' of ↵Pablo Neira Ayuso
https://git.kernel.org/pub/scm/linux/kernel/git/horms/ipvs Simon Horman says: ==================== IPVS Fixes for v4.2 please consider this fix for v4.2. For reasons that are not clear to me it is a bumper crop. It seems to me that they are all relevant to stable. Please let me know if you need my help to get the fixes into stable. * ipvs: fix ipv6 route unreach panic This problem appears to be present since IPv6 support was added to IPVS in v2.6.28. * ipvs: skb_orphan in case of forwarding This appears to resolve a problem resulting from a side effect of 41063e9dd119 ("ipv4: Early TCP socket demux.") which was included in v3.6. * ipvs: do not use random local source address for tunnels This appears to resolve a problem introduced by 026ace060dfe ("ipvs: optimize dst usage for real server") in v3.10. * ipvs: fix crash if scheduler is changed This appears to resolve a problem introduced by ceec4c381681 ("ipvs: convert services to rcu") in v3.10. Julian has provided backports of the fix: * [PATCHv2 3.10.81] ipvs: fix crash if scheduler is changed http://www.spinics.net/lists/lvs-devel/msg04008.html * [PATCHv2 3.12.44,3.14.45,3.18.16,4.0.6] ipvs: fix crash if scheduler is changed http://www.spinics.net/lists/lvs-devel/msg04007.html Please let me know how you would like to handle guiding these backports into stable. * ipvs: fix crash with sync protocol v0 and FTP This appears to resolve a problem introduced by 749c42b620a9 ("ipvs: reduce sync rate with time thresholds") in v3.5 ==================== Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-07-20netfilter: fix netns dependencies with conntrack templatesPablo Neira Ayuso
Quoting Daniel Borkmann: "When adding connection tracking template rules to a netns, f.e. to configure netfilter zones, the kernel will endlessly busy-loop as soon as we try to delete the given netns in case there's at least one template present, which is problematic i.e. if there is such bravery that the priviledged user inside the netns is assumed untrusted. Minimal example: ip netns add foo ip netns exec foo iptables -t raw -A PREROUTING -d 1.2.3.4 -j CT --zone 1 ip netns del foo What happens is that when nf_ct_iterate_cleanup() is being called from nf_conntrack_cleanup_net_list() for a provided netns, we always end up with a net->ct.count > 0 and thus jump back to i_see_dead_people. We don't get a soft-lockup as we still have a schedule() point, but the serving CPU spins on 100% from that point onwards. Since templates are normally allocated with nf_conntrack_alloc(), we also bump net->ct.count. The issue why they are not yet nf_ct_put() is because the per netns .exit() handler from x_tables (which would eventually invoke xt_CT's xt_ct_tg_destroy() that drops reference on info->ct) is called in the dependency chain at a *later* point in time than the per netns .exit() handler for the connection tracker. This is clearly a chicken'n'egg problem: after the connection tracker .exit() handler, we've teared down all the connection tracking infrastructure already, so rightfully, xt_ct_tg_destroy() cannot be invoked at a later point in time during the netns cleanup, as that would lead to a use-after-free. At the same time, we cannot make x_tables depend on the connection tracker module, so that the xt_ct_tg_destroy() would be invoked earlier in the cleanup chain." Daniel confirms this has to do with the order in which modules are loaded or having compiled nf_conntrack as modules while x_tables built-in. So we have no guarantees regarding the order in which netns callbacks are executed. Fix this by allocating the templates through kmalloc() from the respective SYNPROXY and CT targets, so they don't depend on the conntrack kmem cache. Then, release then via nf_ct_tmpl_free() from destroy_conntrack(). This branch is marked as unlikely since conntrack templates are rarely allocated and only from the configuration plane path. Note that templates are not kept in any list to avoid further dependencies with nf_conntrack anymore, thus, the tmpl larval list is removed. Reported-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Tested-by: Daniel Borkmann <daniel@iogearbox.net>
2015-07-17cfg80211: use RTNL locked reg_can_beacon for IR-relaxationArik Nemtsov
The RTNL is required to check for IR-relaxation conditions that allow more channels to beacon. Export an RTNL locked version of reg_can_beacon and use it where possible in AP/STA interface type flows, where IR-relaxation may be applicable. Fixes: 06f207fc5418 ("cfg80211: change GO_CONCURRENT to IR_CONCURRENT for STA") Signed-off-by: Arik Nemtsov <arikx.nemtsov@intel.com> Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-07-17mac80211: add missing length check for confirm framesBob Copeland
Although mesh_rx_plink_frame() already checks that frames have enough bytes for the action code plus another two bytes for capability/reason code, it doesn't take into account that confirm frames also have an additional two-byte aid. As a result, a corrupt frame could cause a subsequent subtraction to wrap around to ill effect. Add another check for this case. Signed-off-by: Bob Copeland <me@bobcopeland.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-07-17mac80211: correct aid location in peering framesBob Copeland
According to 802.11-2012 8.5.16.3.2 AID comes directly after the capability bytes in mesh peering confirm frames. The existing code, however, was adding a 2 byte offset to this location, resulting in garbage data going out over the air. Remove the offset to fix it. Signed-off-by: Bob Copeland <me@bobcopeland.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-07-17wireless: regulatory: reduce log level of CRDA related messagesThomas Petazzoni
With a basic Linux userspace, the messages "Calling CRDA to update world regulatory domain" appears 10 times after boot every second or so, followed by a final "Exceeded CRDA call max attempts. Not calling CRDA". For those of us not having the corresponding userspace parts, having those messages repeatedly displayed at boot time is a bit annoying, so this commit reduces their log level to pr_debug(). Signed-off-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-07-17mac80211: shut down interfaces before destroying interface listJohannes Berg
If the hardware is unregistered while interfaces are up, mac80211 will unregister all interfaces, which in turns causes mac80211 to be called again to remove them all from the driver and eventually shut down the hardware. During this shutdown, however, it's currently already unsafe to iterate the list of interfaces atomically, as the list is manipulated in an unsafe manner. This puts an undue burden on the driver - it must stop all its activities before calling ieee80211_unregister_hw(), while in the normal stop path it can do all cleanup in the stop method. If, for example, it's using the iteration during RX for some reason, it would have to stop RX before unregistering to avoid crashes. Fix this problem by closing all interfaces before unregistering them. This will cause the driver stop to have completed before we manipulate the interface list, and after the driver is stopped *and* has called ieee80211_unregister_hw() it really musn't be iterating any more as the memory will be freed as well. Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-07-17mac80211: wowlan: enable powersave if suspend while ps-pollingChaitanya T K
If for any reason we're in the middle of PS-polling or awake after TX due to dynamic powersave while going to suspend, go back to save power. This might cause a response frame to get lost, but since we can't really wait for it while going to suspend that's still better than not enabling powersave which would cause higher power usage during (and possibly even after) suspend. Note that this really only affects the very few drivers that use the powersave implementation in mac80211. Signed-off-by: Chaitanya T K <chaitanya.mgit@gmail.com> [rewrite misleading commit log] Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-07-17mac80211: don't clear all tx flags when requeingMichal Kazior
When acting as AP and a PS-Poll frame is received associated station is marked as one in a Service Period. This state is kept until Tx status for released frame is reported. While a station is in Service Period PS-Poll frames are ignored. However if PS-Poll was received during A-MPDU teardown it was possible to have the to-be released frame re-queued back to pending queue. In such case the frame was stripped of 2 important flags: (a) IEEE80211_TX_CTL_NO_PS_BUFFER (b) IEEE80211_TX_STATUS_EOSP Stripping of (a) led to the frame that was to be released to be queued back to ps_tx_buf queue. If station remained to use only PS-Poll frames the re-queued frame (and new ones) was never actually transmitted because mac80211 would ignore subsequent PS-Poll frames due to station being in Service Period. There was nothing left to clear the Service Period bit (no xmit -> no tx status -> no SP end), i.e. the AP would have the station stuck in Service Period. Beacon TIM would repeatedly prompt station to poll for frames but it would get none. Once (a) is not stripped (b) becomes important because it's the main condition to clear the Service Period bit of the station when Tx status for the released frame is reported back. This problem was observed with ath9k acting as P2P GO in some testing scenarios but isn't limited to it. AP operation with mac80211 based Tx A-MPDU control combined with clients using PS-Poll frames is subject to this race. Signed-off-by: Michal Kazior <michal.kazior@tieto.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-07-17mac80211: clear subdir_stations when removing debugfsTom Hughes
If we don't do this, and we then fail to recreate the debugfs directory during a mode change, then we will fail later trying to add stations to this now bogus directory: BUG: unable to handle kernel NULL pointer dereference at 0000006c IP: [<c0a92202>] mutex_lock+0x12/0x30 Call Trace: [<c0678ab4>] start_creating+0x44/0xc0 [<c0679203>] debugfs_create_dir+0x13/0xf0 [<f8a938ae>] ieee80211_sta_debugfs_add+0x6e/0x490 [mac80211] Cc: stable@kernel.org Signed-off-by: Tom Hughes <tom@compton.nu> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-07-15tc: act_bpf: fix memory leakAlexei Starovoitov
prog->bpf_ops is populated when act_bpf is used with classic BPF and prog->bpf_name is optionally used with extended BPF. Fix memory leak when act_bpf is released. Fixes: d23b8ad8ab23 ("tc: add BPF based action") Fixes: a8cb5f556b56 ("act_bpf: add initial eBPF support for actions") Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-15fq_codel: fix return value of fq_codel_drop()WANG Cong
The ->drop() is supposed to return the number of bytes it dropped, however fq_codel_drop() returns the index of the flow where it drops a packet from. Fix this by introducing a helper to wrap fq_codel_drop(). Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Cong Wang <cwang@twopensource.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-15net_sched: fix a use-after-free in sfqWANG Cong
Fixes: 25331d6ce42b ("net: sched: implement qstat helper routines") Cc: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Cong Wang <cwang@twopensource.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-15ipv6: lock socket in ip6_datagram_connect()Eric Dumazet
ip6_datagram_connect() is doing a lot of socket changes without socket being locked. This looks wrong, at least for udp_lib_rehash() which could corrupt lists because of concurrent udp_sk(sk)->udp_portaddr_hash accesses. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-15fq_codel: fix a use-after-freeWANG Cong
Fixes: 25331d6ce42b ("net: sched: implement qstat helper routines") Cc: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Cong Wang <cwang@twopensource.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-15tcp: don't use F-RTO on non-recurring timeoutsYuchung Cheng
Currently F-RTO may repeatedly send new data packets on non-recurring timeouts in CA_Loss mode. This is a bug because F-RTO (RFC5682) should only be used on either new recovery or recurring timeouts. This exacerbates the recovery progress during frequent timeout & repair, because we prioritize sending new data packets instead of repairing the holes when the bandwidth is already scarce. Fix it by correcting the test of a new recovery episode. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-15bridge: mdb: fix double add notificationNikolay Aleksandrov
Since the mdb add/del code was introduced there have been 2 br_mdb_notify calls when doing br_mdb_add() resulting in 2 notifications on each add. Example: Command: bridge mdb add dev br0 port eth1 grp 239.0.0.1 permanent Before patch: root@debian:~# bridge monitor all [MDB]dev br0 port eth1 grp 239.0.0.1 permanent [MDB]dev br0 port eth1 grp 239.0.0.1 permanent After patch: root@debian:~# bridge monitor all [MDB]dev br0 port eth1 grp 239.0.0.1 permanent Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Fixes: cfd567543590 ("bridge: add support of adding and deleting mdb entries") Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-15bridge: multicast: treat igmpv3 report with INCLUDE and no sources as a leaveSatish Ashok
A report with INCLUDE/Change_to_include and empty source list should be treated as a leave, specified by RFC 3376, section 3.1: "If the requested filter mode is INCLUDE *and* the requested source list is empty, then the entry corresponding to the requested interface and multicast address is deleted if present. If no such entry is present, the request is ignored." Signed-off-by: Satish Ashok <sashok@cumulusnetworks.com> Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-15Merge tag 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma Pull rdma fixes from Doug Ledford: "Mainly fix-ups for the various 4.2 items" * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: (24 commits) IB/core: Destroy ocrdma_dev_id IDR on module exit IB/core: Destroy multcast_idr on module exit IB/mlx4: Optimize do_slave_init IB/mlx4: Fix memory leak in do_slave_init IB/mlx4: Optimize freeing of items on error unwind IB/mlx4: Fix use of flow-counters for process_mad IB/ipath: Convert use of __constant_<foo> to <foo> IB/ipoib: Set MTU to max allowed by mode when mode changes IB/ipoib: Scatter-Gather support in connected mode IB/ucm: Fix bitmap wrap when devnum > IB_UCM_MAX_DEVICES IB/ipoib: Prevent lockdep warning in __ipoib_ib_dev_flush IB/ucma: Fix lockdep warning in ucma_lock_files rds: rds_ib_device.refcount overflow RDMA/nes: Fix for incorrect recording of the MAC address RDMA/nes: Fix for resolving the neigh RDMA/core: Fixes for port mapper client registration IB/IPoIB: Fix bad error flow in ipoib_add_port() IB/mlx4: Do not attemp to report HCA clock offset on VFs IB/cm: Do not queue work to a device that's going away IB/srp: Avoid using uninitialized variable ...
2015-07-15net: Fix skb csum races when peekingHerbert Xu
When we calculate the checksum on the recv path, we store the result in the skb as an optimisation in case we need the checksum again down the line. This is in fact bogus for the MSG_PEEK case as this is done without any locking. So multiple threads can peek and then store the result to the same skb, potentially resulting in bogus skb states. This patch fixes this by only storing the result if the skb is not shared. This preserves the optimisations for the few cases where it can be done safely due to locking or other reasons, e.g., SIOCINQ. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-15NET: AX.25: Stop heartbeat timer on disconnect.Richard Stearn
This may result in a kernel panic. The bug has always existed but somehow we've run out of luck now and it bites. Signed-off-by: Richard Stearn <richard@rns-stearn.demon.co.uk> Cc: stable@vger.kernel.org # all branches Signed-off-by: Ralf Baechle <ralf@linux-mips.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-15net: Clone skb before setting peeked flagHerbert Xu
Shared skbs must not be modified and this is crucial for broadcast and/or multicast paths where we use it as an optimisation to avoid unnecessary cloning. The function skb_recv_datagram breaks this rule by setting peeked without cloning the skb first. This causes funky races which leads to double-free. This patch fixes this by cloning the skb and replacing the skb in the list when setting skb->peeked. Fixes: a59322be07c9 ("[UDP]: Only increment counter on first peek/recv") Reported-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-15rtnetlink: reject non-IFLA_VF_PORT attributes inside IFLA_VF_PORTSDaniel Borkmann
Similarly as in commit 4f7d2cdfdde7 ("rtnetlink: verify IFLA_VF_INFO attributes before passing them to driver"), we have a double nesting of netlink attributes, i.e. IFLA_VF_PORTS only contains IFLA_VF_PORT that is nested itself. While IFLA_VF_PORTS is a verified attribute from ifla_policy[], we only check if the IFLA_VF_PORTS container has IFLA_VF_PORT attributes and then pass the attribute's content itself via nla_parse_nested(). It would be more correct to reject inner types other than IFLA_VF_PORT instead of continuing parsing and also similarly as in commit 4f7d2cdfdde7, to check for a minimum of NLA_HDRLEN. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Roopa Prabhu <roopa@cumulusnetworks.com> Cc: Scott Feldman <sfeldma@gmail.com> Cc: Jason Gunthorpe <jgunthorpe@obsidianresearch.com> Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-14rds: rds_ib_device.refcount overflowWengang Wang
Fixes: 3e0249f9c05c ("RDS/IB: add refcount tracking to struct rds_ib_device") There lacks a dropping on rds_ib_device.refcount in case rds_ib_alloc_fmr failed(mr pool running out). this lead to the refcount overflow. A complain in line 117(see following) is seen. From vmcore: s_ib_rdma_mr_pool_depleted is 2147485544 and rds_ibdev->refcount is -2147475448. That is the evidence the mr pool is used up. so rds_ib_alloc_fmr is very likely to return ERR_PTR(-EAGAIN). 115 void rds_ib_dev_put(struct rds_ib_device *rds_ibdev) 116 { 117 BUG_ON(atomic_read(&rds_ibdev->refcount) <= 0); 118 if (atomic_dec_and_test(&rds_ibdev->refcount)) 119 queue_work(rds_wq, &rds_ibdev->free_work); 120 } fix is to drop refcount when rds_ib_alloc_fmr failed. Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com> Reviewed-by: Haggai Eran <haggaie@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>