summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2017-03-01net: solve a NAPI raceEric Dumazet
While playing with mlx4 hardware timestamping of RX packets, I found that some packets were received by TCP stack with a ~200 ms delay... Since the timestamp was provided by the NIC, and my probe was added in tcp_v4_rcv() while in BH handler, I was confident it was not a sender issue, or a drop in the network. This would happen with a very low probability, but hurting RPC workloads. A NAPI driver normally arms the IRQ after the napi_complete_done(), after NAPI_STATE_SCHED is cleared, so that the hard irq handler can grab it. Problem is that if another point in the stack grabs NAPI_STATE_SCHED bit while IRQ are not disabled, we might have later an IRQ firing and finding this bit set, right before napi_complete_done() clears it. This can happen with busy polling users, or if gro_flush_timeout is used. But some other uses of napi_schedule() in drivers can cause this as well. thread 1 thread 2 (could be on same cpu, or not) // busy polling or napi_watchdog() napi_schedule(); ... napi->poll() device polling: read 2 packets from ring buffer Additional 3rd packet is available. device hard irq // does nothing because NAPI_STATE_SCHED bit is owned by thread 1 napi_schedule(); napi_complete_done(napi, 2); rearm_irq(); Note that rearm_irq() will not force the device to send an additional IRQ for the packet it already signaled (3rd packet in my example) This patch adds a new NAPI_STATE_MISSED bit, that napi_schedule_prep() can set if it could not grab NAPI_STATE_SCHED Then napi_complete_done() properly reschedules the napi to make sure we do not miss something. Since we manipulate multiple bits at once, use cmpxchg() like in sk_busy_loop() to provide proper transactions. In v2, I changed napi_watchdog() to use a relaxed variant of napi_schedule_prep() : No need to set NAPI_STATE_MISSED from this point. In v3, I added more details in the changelog and clears NAPI_STATE_MISSED in busy_poll_stop() In v4, I added the ideas given by Alexander Duyck in v3 review Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Alexander Duyck <alexander.duyck@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-01rds: ib: add the static type to the variablesZhu Yanjun
The variables rds_ib_mr_1m_pool_size and rds_ib_mr_8k_pool_size are used only in the ib.c file. As such, the static type is added to limit them in this file. Cc: Joe Jin <joe.jin@oracle.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-01sctp: call rcu_read_lock before checking for duplicate transport nodesXin Long
Commit cd2b70875058 ("sctp: check duplicate node before inserting a new transport") called rhltable_lookup() to check for the duplicate transport node in transport rhashtable. But rhltable_lookup() doesn't call rcu_read_lock inside, it could cause a use-after-free issue if it tries to dereference the node that another cpu has freed it. Note that sock lock can not avoid this as it is per sock. This patch is to fix it by calling rcu_read_lock before checking for duplicate transport nodes. Fixes: cd2b70875058 ("sctp: check duplicate node before inserting a new transport") Reported-by: Andrey Konovalov <andreyknvl@google.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-01rxrpc: Fix deadlock between call creation and sendmsg/recvmsgDavid Howells
All the routines by which rxrpc is accessed from the outside are serialised by means of the socket lock (sendmsg, recvmsg, bind, rxrpc_kernel_begin_call(), ...) and this presents a problem: (1) If a number of calls on the same socket are in the process of connection to the same peer, a maximum of four concurrent live calls are permitted before further calls need to wait for a slot. (2) If a call is waiting for a slot, it is deep inside sendmsg() or rxrpc_kernel_begin_call() and the entry function is holding the socket lock. (3) sendmsg() and recvmsg() or the in-kernel equivalents are prevented from servicing the other calls as they need to take the socket lock to do so. (4) The socket is stuck until a call is aborted and makes its slot available to the waiter. Fix this by: (1) Provide each call with a mutex ('user_mutex') that arbitrates access by the users of rxrpc separately for each specific call. (2) Make rxrpc_sendmsg() and rxrpc_recvmsg() unlock the socket as soon as they've got a call and taken its mutex. Note that I'm returning EWOULDBLOCK from recvmsg() if MSG_DONTWAIT is set but someone else has the lock. Should I instead only return EWOULDBLOCK if there's nothing currently to be done on a socket, and sleep in this particular instance because there is something to be done, but we appear to be blocked by the interrupt handler doing its ping? (3) Make rxrpc_new_client_call() unlock the socket after allocating a new call, locking its user mutex and adding it to the socket's call tree. The call is returned locked so that sendmsg() can add data to it immediately. From the moment the call is in the socket tree, it is subject to access by sendmsg() and recvmsg() - even if it isn't connected yet. (4) Lock new service calls in the UDP data_ready handler (in rxrpc_new_incoming_call()) because they may already be in the socket's tree and the data_ready handler makes them live immediately if a user ID has already been preassigned. Note that the new call is locked before any notifications are sent that it is live, so doing mutex_trylock() *ought* to always succeed. Userspace is prevented from doing sendmsg() on calls that are in a too-early state in rxrpc_do_sendmsg(). (5) Make rxrpc_new_incoming_call() return the call with the user mutex held so that a ping can be scheduled immediately under it. Note that it might be worth moving the ping call into rxrpc_new_incoming_call() and then we can drop the mutex there. (6) Make rxrpc_accept_call() take the lock on the call it is accepting and release the socket after adding the call to the socket's tree. This is slightly tricky as we've dequeued the call by that point and have to requeue it. Note that requeuing emits a trace event. (7) Make rxrpc_kernel_send_data() and rxrpc_kernel_recv_data() take the new mutex immediately and don't bother with the socket mutex at all. This patch has the nice bonus that calls on the same socket are now to some extent parallelisable. Note that we might want to move rxrpc_service_prealloc() calls out from the socket lock and give it its own lock, so that we don't hang progress in other calls because we're waiting for the allocator. We probably also want to avoid calling rxrpc_notify_socket() from within the socket lock (rxrpc_accept_call()). Signed-off-by: David Howells <dhowells@redhat.com> Tested-by: Marc Dionne <marc.c.dionne@auristor.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-28Merge branch 'idr-4.11' of git://git.infradead.org/users/willy/linux-daxLinus Torvalds
Pull IDR rewrite from Matthew Wilcox: "The most significant part of the following is the patch to rewrite the IDR & IDA to be clients of the radix tree. But there's much more, including an enhancement of the IDA to be significantly more space efficient, an IDR & IDA test suite, some improvements to the IDR API (and driver changes to take advantage of those improvements), several improvements to the radix tree test suite and RCU annotations. The IDR & IDA rewrite had a good spin in linux-next and Andrew's tree for most of the last cycle. Coupled with the IDR test suite, I feel pretty confident that any remaining bugs are quite hard to hit. 0-day did a great job of watching my git tree and pointing out problems; as it hit them, I added new test-cases to be sure not to be caught the same way twice" Willy goes on to expand a bit on the IDR rewrite rationale: "The radix tree and the IDR use very similar data structures. Merging the two codebases lets us share the memory allocation pools, and results in a net deletion of 500 lines of code. It also opens up the possibility of exposing more of the features of the radix tree to users of the IDR (and I have some interesting patches along those lines waiting for 4.12) It also shrinks the size of the 'struct idr' from 40 bytes to 24 which will shrink a fair few data structures that embed an IDR" * 'idr-4.11' of git://git.infradead.org/users/willy/linux-dax: (32 commits) radix tree test suite: Add config option for map shift idr: Add missing __rcu annotations radix-tree: Fix __rcu annotations radix-tree: Add rcu_dereference and rcu_assign_pointer calls radix tree test suite: Run iteration tests for longer radix tree test suite: Fix split/join memory leaks radix tree test suite: Fix leaks in regression2.c radix tree test suite: Fix leaky tests radix tree test suite: Enable address sanitizer radix_tree_iter_resume: Fix out of bounds error radix-tree: Store a pointer to the root in each node radix-tree: Chain preallocated nodes through ->parent radix tree test suite: Dial down verbosity with -v radix tree test suite: Introduce kmalloc_verbose idr: Return the deleted entry from idr_remove radix tree test suite: Build separate binaries for some tests ida: Use exceptional entries for small IDAs ida: Move ida_bitmap to a percpu variable Reimplement IDR and IDA using the radix tree radix-tree: Add radix_tree_iter_delete ...
2017-02-28Merge tag 'nfsd-4.11' of git://linux-nfs.org/~bfields/linuxLinus Torvalds
Pull nfsd updates from Bruce Fields: "The nfsd update this round is mainly a lot of miscellaneous cleanups and bugfixes. A couple changes could theoretically break working setups on upgrade. I don't expect complaints in practice, but they seem worth calling out just in case: - NFS security labels are now off by default; a new security_label export flag reenables it per export. But, having them on by default is a disaster, as it generally only makes sense if all your clients and servers have similar enough selinux policies. Thanks to Jason Tibbitts for pointing this out. - NFSv4/UDP support is off. It was never really supported, and the spec explicitly forbids it. We only ever left it on out of laziness; thanks to Jeff Layton for finally fixing that" * tag 'nfsd-4.11' of git://linux-nfs.org/~bfields/linux: (34 commits) nfsd: Fix display of the version string nfsd: fix configuration of supported minor versions sunrpc: don't register UDP port with rpcbind when version needs congestion control nfs/nfsd/sunrpc: enforce transport requirements for NFSv4 sunrpc: flag transports as having congestion control sunrpc: turn bitfield flags in svc_version into bools nfsd: remove superfluous KERN_INFO nfsd: special case truncates some more nfsd: minor nfsd_setattr cleanup NFSD: Reserve adequate space for LOCKT operation NFSD: Get response size before operation for all RPCs nfsd/callback: Drop a useless data copy when comparing sessionid nfsd/callback: skip the callback tag nfsd/callback: Cleanup callback cred on shutdown nfsd/idmap: return nfserr_inval for 0-length names SUNRPC/Cache: Always treat the invalid cache as unexpired SUNRPC: Drop all entries from cache_detail when cache_purge() svcrdma: Poll CQs in "workqueue" mode svcrdma: Combine list fields in struct svc_rdma_op_ctxt svcrdma: Remove unused sc_dto_q field ...
2017-02-28Merge tag 'ceph-for-4.11-rc1' of git://github.com/ceph/ceph-clientLinus Torvalds
Pull ceph updates from Ilya Dryomov: "This time around we have: - support for rbd data-pool feature, which enables rbd images on erasure-coded pools (myself). CEPH_PG_MAX_SIZE has been bumped to allow erasure-coded profiles with k+m up to 32. - a patch for ceph_d_revalidate() performance regression introduced in 4.9, along with some cleanups in the area (Jeff Layton) - a set of fixes for unsafe ->d_parent accesses in CephFS (Jeff Layton) - buffered reads are now processed in rsize windows instead of rasize windows (Andreas Gerstmayr). The new default for rsize mount option is 64M. - ack vs commit distinction is gone, greatly simplifying ->fsync() and MOSDOpReply handling code (myself) ... also a few filesystem bug fixes from Zheng, a CRUSH sync up (CRUSH computations are still serialized though) and several minor fixes and cleanups all over" * tag 'ceph-for-4.11-rc1' of git://github.com/ceph/ceph-client: (52 commits) libceph, rbd, ceph: WRITE | ONDISK -> WRITE libceph: get rid of ack vs commit ceph: remove special ack vs commit behavior ceph: tidy some white space in get_nonsnap_parent() crush: fix dprintk compilation crush: do is_out test only if we do not collide ceph: remove req from unsafe list when unregistering it rbd: constify device_type structure rbd: kill obj_request->object_name and rbd_segment_name_cache rbd: store and use obj_request->object_no rbd: RBD_V{1,2}_DATA_FORMAT macros rbd: factor out __rbd_osd_req_create() rbd: set offset and length outside of rbd_obj_request_create() rbd: support for data-pool feature rbd: introduce rbd_init_layout() rbd: use rbd_obj_bytes() more rbd: remove now unused rbd_obj_request_wait() and helpers rbd: switch rbd_obj_method_sync() to ceph_osdc_call() libceph: pass reply buffer length through ceph_osdc_call() rbd: do away with obj_request in rbd_obj_read_sync() ...
2017-02-28Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds
Pull networking fixes from David Miller: 1) Don't save TIPC header values before the header has been validated, from Jon Paul Maloy. 2) Fix memory leak in RDS, from Zhu Yanjun. 3) We miss to initialize the UID in the flow key in some paths, from Julian Anastasov. 4) Fix latent TOS masking bug in the routing cache removal from years ago, also from Julian. 5) We forget to set the sockaddr port in sctp_copy_local_addr_list(), fix from Xin Long. 6) Missing module ref count drop in packet scheduler actions, from Roman Mashak. 7) Fix RCU annotations in rht_bucket_nested, from Herbert Xu. 8) Fix use after free which happens because L2TP's ipv4 support returns non-zero values from it's backlog_rcv function which ipv4 interprets as protocol values. Fix from Paul Hüber. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (35 commits) qed: Don't use attention PTT for configuring BW qed: Fix race with multiple VFs l2tp: avoid use-after-free caused by l2tp_ip_backlog_recv xfrm: provide correct dst in xfrm_neigh_lookup rhashtable: Fix RCU dereference annotation in rht_bucket_nested rhashtable: Fix use before NULL check in bucket_table_free net sched actions: do not overwrite status of action creation. rxrpc: Kernel calls get stuck in recvmsg net sched actions: decrement module reference count after table flush. lib: Allow compile-testing of parman ipv6: check sk sk_type and protocol early in ip_mroute_set/getsockopt sctp: set sin_port for addr param when checking duplicate address net/mlx4_en: fix overflow in mlx4_en_init_timestamp() netfilter: nft_set_bitmap: incorrect bitmap size net: s2io: fix typo argumnet argument net: vxge: fix typo argumnet argument netfilter: nf_ct_expect: Change __nf_ct_expect_check() return value. ipv4: mask tos for input route ipv4: add missing initialization for flowi4_uid lib: fix spelling mistake: "actualy" -> "actually" ...
2017-02-28netfilter: use skb_to_full_sk in ip_route_me_harderFlorian Westphal
inet_sk(skb->sk) is illegal in case skb is attached to request socket. Fixes: ca6fb0651883 ("tcp: attach SYNACK messages to request sockets instead of listener") Reported by: Daniel J Blueman <daniel@quora.org> Signed-off-by: Florian Westphal <fw@strlen.de> Tested-by: Daniel J Blueman <daniel@quora.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-02-28mac80211: use driver-indicated transmitter STA only for data framesJohannes Berg
When I originally introduced using the driver-indicated station as an optimisation to avoid the hashtable lookup/iteration, of course it wasn't intended to really functionally change anything. I neglected, however, to take into account VLAN interfaces, which have the property that management and data frames are handled differently: data frames go directly to the station and the VLAN while management frames continue to be processed over the underlying/associated AP-type interface. As a consequence, when a driver used this optimisation for management frames and the user enabled VLANs, my change broke things since any management frames, particularly disassoc/deauth, were missed by hostapd. Fix this by restoring the original code path for non-data frames, they aren't critical for performance to begin with. This fixes https://bugzilla.kernel.org/show_bug.cgi?id=194713. Big thanks goes to Jarek who bisected the issue and provided a very detailed bug report, including the crucial information that he was using VLANs in his configuration. Cc: stable@vger.kernel.org Fixes: 771e846bea9e ("mac80211: allow passing transmitter station on RX") Reported-and-tested-by: Jarek Kamiński <jarek@freeside.be> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2017-02-27lib/vsprintf.c: remove %Z supportAlexey Dobriyan
Now that %z is standartised in C99 there is no reason to support %Z. Unlike %L it doesn't even make format strings smaller. Use BUILD_BUG_ON in a couple ATM drivers. In case anyone didn't notice lib/vsprintf.o is about half of SLUB which is in my opinion is quite an achievement. Hopefully this patch inspires someone else to trim vsprintf.c more. Link: http://lkml.kernel.org/r/20170103230126.GA30170@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: Andy Shevchenko <andy.shevchenko@gmail.com> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-27scripts/spelling.txt: add "varible" pattern and fix typo instancesMasahiro Yamada
Fix typos and add the following to the scripts/spelling.txt: varible||variable While we are here, tidy up the comment blocks that fit in a single line for drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c and net/sctp/transport.c. Link: http://lkml.kernel.org/r/1481573103-11329-11-git-send-email-yamada.masahiro@socionext.com Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-27scripts/spelling.txt: add "aligment" pattern and fix typo instancesMasahiro Yamada
Fix typos and add the following to the scripts/spelling.txt: aligment||alignment I did not touch the "N_BYTE_ALIGMENT" macro in drivers/net/wireless/realtek/rtlwifi/wifi.h to avoid unpredictable impact. I fixed "_aligment_handler" in arch/openrisc/kernel/entry.S because it is surrounded by #if 0 ... #endif. It is surely safe and I confirmed "_alignment_handler" is correct. I also fixed the "controler" I found in the same hunk in arch/openrisc/kernel/head.S. Link: http://lkml.kernel.org/r/1481573103-11329-8-git-send-email-yamada.masahiro@socionext.com Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-27scripts/spelling.txt: add "an user" pattern and fix typo instancesMasahiro Yamada
Fix typos and add the following to the scripts/spelling.txt: an user||a user an userspace||a userspace I also added "userspace" to the list since it is a common word in Linux. I found some instances for "an userfaultfd", but I did not add it to the list. I felt it is endless to find words that start with "user" such as "userland" etc., so must draw a line somewhere. Link: http://lkml.kernel.org/r/1481573103-11329-4-git-send-email-yamada.masahiro@socionext.com Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-27scripts/spelling.txt: add "swith" pattern and fix typo instancesMasahiro Yamada
Fix typos and add the following to the scripts/spelling.txt: swith||switch swithable||switchable swithed||switched swithing||switching While we are here, fix the "update" to "updates" in the touched hunk in drivers/net/wireless/marvell/mwifiex/wmm.c. Link: http://lkml.kernel.org/r/1481573103-11329-2-git-send-email-yamada.masahiro@socionext.com Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-27virtio: allow drivers to request IRQ affinity when creating VQsChristoph Hellwig
Add a struct irq_affinity pointer to the find_vqs methods, which if set is used to tell the PCI layer to create the MSI-X vectors for our I/O virtqueues with the proper affinity from the start. Compared to after the fact affinity hints this gives us an instantly working setup and allows to allocate the irq descritors node-local and avoid interconnect traffic. Last but not least this will allow blk-mq queues are created based on the interrupt affinity for storage drivers. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jason Wang <jasowang@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2017-02-27Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains netfilter fixes for you net tree, they are: 1) Missing ct zone size in the nft_ct initialization path, patch from Florian Westphal. 2) Two patches for netfilter uapi headers, one to remove unnecessary sysctl.h inclusion and another to fix compilation of xt_hashlimit.h in userspace, from Dmitry V. Levin. 3) Patch to fix a sloppy change in nf_ct_expect that incorrectly simplified nf_ct_expect_related_report() in the previous nf-next batch. This also includes another patch for __nf_ct_expect_check() to report success by returning 0 to keep it consistent with other existing functions. From Jarno Rajahalme. 4) The ->walk() iterator of the new bitmap set type goes over the real bitmap size, this results in incorrect dumps when NFTA_SET_USERDATA is used. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-27mac80211: fix typo in debug printSara Sharon
Signed-off-by: Sara Sharon <sara.sharon@intel.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2017-02-27mac80211: shorten debug messageSara Sharon
Tracing is limited to 100 characters and this message passes the limit when there are a few buffered frames. Shorten it. Signed-off-by: Sara Sharon <sara.sharon@intel.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2017-02-27mac80211: fix power saving clients handling in iwlwifiEmmanuel Grumbach
iwlwifi now supports RSS and can't let mac80211 track the PS state based on the Rx frames since they can come out of order. iwlwifi is now advertising AP_LINK_PS, and uses explicit notifications to teach mac80211 about the PS state of the stations and the PS poll / uAPSD trigger frames coming our way from the peers. Because of that, the TIM stopped being maintained in mac80211. I tried to fix this in commit c68df2e7be0c ("mac80211: allow using AP_LINK_PS with mac80211-generated TIM IE") but that was later reverted by Felix in commit 6c18a6b4e799 ("Revert "mac80211: allow using AP_LINK_PS with mac80211-generated TIM IE") since it broke drivers that do not implement set_tim. Since none of the drivers that set AP_LINK_PS have the set_tim() handler set besides iwlwifi, I can bail out in __sta_info_recalc_tim if AP_LINK_PS AND .set_tim is not implemented. Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2017-02-27mac80211: don't handle filtered frames within a BA sessionFelix Fietkau
When running a BA session, the driver (or the hardware) already takes care of retransmitting failed frames, since it has to keep the receiver reorder window in sync. Adding another layer of retransmit around that does not improve anything. In fact, it can only lead to some strong reordering with huge latency. Cc: stable@vger.kernel.org Signed-off-by: Felix Fietkau <nbd@nbd.name> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2017-02-27mac80211: don't reorder frames with SN smaller than SSNSara Sharon
When RX aggregation starts, transmitter may continue send frames with SN smaller than SSN until the AddBA response is received. However, the reorder buffer is already initialized at this point, which will cause the drop of such frames as duplicates since the head SN of the reorder buffer is set to the SSN, which is bigger. Cc: stable@vger.kernel.org Signed-off-by: Sara Sharon <sara.sharon@intel.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2017-02-27mac80211: flush delayed work when entering suspendMatt Chen
The issue was found when entering suspend and resume. It triggers a warning in: mac80211/key.c: ieee80211_enable_keys() ... WARN_ON_ONCE(sdata->crypto_tx_tailroom_needed_cnt || sdata->crypto_tx_tailroom_pending_dec); ... It points out sdata->crypto_tx_tailroom_pending_dec isn't cleaned up successfully in a delayed_work during suspend. Add a flush_delayed_work to fix it. Cc: stable@vger.kernel.org Signed-off-by: Matt Chen <matt.chen@intel.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2017-02-27mac80211: fix packet statistics for fast-RXJohannes Berg
When adding per-CPU statistics, which added statistics back to mac80211 for the fast-RX path, I evidently forgot to add the "stats->packets++" line. The reason for that is likely that I didn't see it since it's done in defragmentation for the regular RX path. Add the missing line to properly count received packets in the fast-RX case. Fixes: c9c5962b56c1 ("mac80211: enable collecting station statistics per-CPU") Reported-by: Oren Givon <oren.givon@intel.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2017-02-26l2tp: avoid use-after-free caused by l2tp_ip_backlog_recvPaul Hüber
l2tp_ip_backlog_recv may not return -1 if the packet gets dropped. The return value is passed up to ip_local_deliver_finish, which treats negative values as an IP protocol number for resubmission. Signed-off-by: Paul Hüber <phueber@kernsp.in> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-26xfrm: provide correct dst in xfrm_neigh_lookupJulian Anastasov
Fix xfrm_neigh_lookup to provide dst->path to the neigh_lookup dst_ops method. When skb is provided, the IP address in packet should already match the dst->path address family. But for the non-skb case, we should consider the last tunnel address as nexthop address. Fixes: f894cbf847c9 ("net: Add optional SKB arg to dst_ops->neigh_lookup().") Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-26net sched actions: do not overwrite status of action creation.Roman Mashak
nla_memdup_cookie was overwriting err value, declared at function scope and earlier initialized with result of ->init(). At success nla_memdup_cookie() returns 0, and thus module refcnt decremented, although the action was installed. $ sudo tc actions add action pass index 1 cookie 1234 $ sudo tc actions ls action gact action order 0: gact action pass random type none pass val 0 index 1 ref 1 bind 0 $ $ lsmod Module Size Used by act_gact 16384 0 ... $ $ sudo rmmod act_gact [ 52.310283] ------------[ cut here ]------------ [ 52.312551] WARNING: CPU: 1 PID: 455 at kernel/module.c:1113 module_put+0x99/0xa0 [ 52.316278] Modules linked in: act_gact(-) crct10dif_pclmul crc32_pclmul ghash_clmulni_intel psmouse pcbc evbug aesni_intel aes_x86_64 crypto_simd serio_raw glue_helper pcspkr cryptd [ 52.322285] CPU: 1 PID: 455 Comm: rmmod Not tainted 4.10.0+ #11 [ 52.324261] Call Trace: [ 52.325132] dump_stack+0x63/0x87 [ 52.326236] __warn+0xd1/0xf0 [ 52.326260] warn_slowpath_null+0x1d/0x20 [ 52.326260] module_put+0x99/0xa0 [ 52.326260] tcf_hashinfo_destroy+0x7f/0x90 [ 52.326260] gact_exit_net+0x27/0x40 [act_gact] [ 52.326260] ops_exit_list.isra.6+0x38/0x60 [ 52.326260] unregister_pernet_operations+0x90/0xe0 [ 52.326260] unregister_pernet_subsys+0x21/0x30 [ 52.326260] tcf_unregister_action+0x68/0xa0 [ 52.326260] gact_cleanup_module+0x17/0xa0f [act_gact] [ 52.326260] SyS_delete_module+0x1ba/0x220 [ 52.326260] entry_SYSCALL_64_fastpath+0x1e/0xad [ 52.326260] RIP: 0033:0x7f527ffae367 [ 52.326260] RSP: 002b:00007ffeb402a598 EFLAGS: 00000202 ORIG_RAX: 00000000000000b0 [ 52.326260] RAX: ffffffffffffffda RBX: 0000559b069912a0 RCX: 00007f527ffae367 [ 52.326260] RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559b06991308 [ 52.326260] RBP: 0000000000000003 R08: 00007f5280264420 R09: 00007ffeb4029511 [ 52.326260] R10: 000000000000087b R11: 0000000000000202 R12: 00007ffeb4029580 [ 52.326260] R13: 0000000000000000 R14: 0000000000000000 R15: 0000559b069912a0 [ 52.354856] ---[ end trace 90d89401542b0db6 ]--- $ With the fix: $ sudo modprobe act_gact $ lsmod Module Size Used by act_gact 16384 0 ... $ sudo tc actions add action pass index 1 cookie 1234 $ sudo tc actions ls action gact action order 0: gact action pass random type none pass val 0 index 1 ref 1 bind 0 $ $ lsmod Module Size Used by act_gact 16384 1 ... $ sudo rmmod act_gact rmmod: ERROR: Module act_gact is in use $ $ sudo /home/mrv/bin/tc actions del action gact index 1 $ sudo rmmod act_gact $ lsmod Module Size Used by $ Fixes: 1045ba77a ("net sched actions: Add support for user cookies") Signed-off-by: Roman Mashak <mrv@mojatatu.com> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-26rxrpc: Kernel calls get stuck in recvmsgDavid Howells
Calls made through the in-kernel interface can end up getting stuck because of a missed variable update in a loop in rxrpc_recvmsg_data(). The problem is like this: (1) A new packet comes in and doesn't cause a notification to be given to the client as there's still another packet in the ring - the assumption being that if the client will keep drawing off data until the ring is empty. (2) The client is in rxrpc_recvmsg_data(), inside the big while loop that iterates through the packets. This copies the window pointers into variables rather than using the information in the call struct because: (a) MSG_PEEK might be in effect; (b) we need a barrier after reading call->rx_top to pair with the barrier in the softirq routine that loads the buffer. (3) The reading of call->rx_top is done outside of the loop, and top is never updated whilst we're in the loop. This means that even through there's a new packet available, we don't see it and may return -EFAULT to the caller - who will happily return to the scheduler and await the next notification. (4) No further notifications are forthcoming until there's an abort as the ring isn't empty. The fix is to move the read of call->rx_top inside the loop - but it needs to be done before the condition is checked. Reported-by: Marc Dionne <marc.dionne@auristor.com> Signed-off-by: David Howells <dhowells@redhat.com> Tested-by: Marc Dionne <marc.dionne@auristor.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-26net sched actions: decrement module reference count after table flush.Roman Mashak
When tc actions are loaded as a module and no actions have been installed, flushing them would result in actions removed from the memory, but modules reference count not being decremented, so that the modules would not be unloaded. Following is example with GACT action: % sudo modprobe act_gact % lsmod Module Size Used by act_gact 16384 0 % % sudo tc actions ls action gact % % sudo tc actions flush action gact % lsmod Module Size Used by act_gact 16384 1 % sudo tc actions flush action gact % lsmod Module Size Used by act_gact 16384 2 % sudo rmmod act_gact rmmod: ERROR: Module act_gact is in use .... After the fix: % lsmod Module Size Used by act_gact 16384 0 % % sudo tc actions add action pass index 1 % sudo tc actions add action pass index 2 % sudo tc actions add action pass index 3 % lsmod Module Size Used by act_gact 16384 3 % % sudo tc actions flush action gact % lsmod Module Size Used by act_gact 16384 0 % % sudo tc actions flush action gact % lsmod Module Size Used by act_gact 16384 0 % sudo rmmod act_gact % lsmod Module Size Used by % Fixes: f97017cdefef ("net-sched: Fix actions flushing") Signed-off-by: Roman Mashak <mrv@mojatatu.com> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Acked-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-26ipv6: check sk sk_type and protocol early in ip_mroute_set/getsockoptXin Long
Commit 5e1859fbcc3c ("ipv4: ipmr: various fixes and cleanups") fixed the issue for ipv4 ipmr: ip_mroute_setsockopt() & ip_mroute_getsockopt() should not access/set raw_sk(sk)->ipmr_table before making sure the socket is a raw socket, and protocol is IGMP The same fix should be done for ipv6 ipmr as well. This patch can fix the panic caused by overwriting the same offset as ipmr_table as in raw_sk(sk) when accessing other type's socket by ip_mroute_setsockopt(). Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-26sctp: set sin_port for addr param when checking duplicate addressXin Long
Commit b8607805dd15 ("sctp: not copying duplicate addrs to the assoc's bind address list") tried to check for duplicate address before copying to asoc's bind_addr list from global addr list. But all the addrs' sin_ports in global addr list are 0 while the addrs' sin_ports are bp->port in asoc's bind_addr list. It means even if it's a duplicate address, af->cmp_addr will still return 0 as the their sin_ports are different. This patch is to fix it by setting the sin_port for addr param with bp->port before comparing the addrs. Fixes: b8607805dd15 ("sctp: not copying duplicate addrs to the assoc's bind address list") Reported-by: Wei Chen <weichen@redhat.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-26netfilter: nft_set_bitmap: incorrect bitmap sizePablo Neira Ayuso
priv->bitmap_size stores the real bitmap size, instead of the full struct nft_bitmap object. Fixes: 665153ff5752 ("netfilter: nf_tables: add bitmap set type") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-02-26netfilter: nf_ct_expect: Change __nf_ct_expect_check() return value.Jarno Rajahalme
Commit 4dee62b1b9b4 ("netfilter: nf_ct_expect: nf_ct_expect_insert() returns void") inadvertently changed the successful return value of nf_ct_expect_related_report() from 0 to 1 due to __nf_ct_expect_check() returning 1 on success. Prevent this regression in the future by changing the return value of __nf_ct_expect_check() to 0 on success. Signed-off-by: Jarno Rajahalme <jarno@ovn.org> Acked-by: Joe Stringer <joe@ovn.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-02-26ipv4: mask tos for input routeJulian Anastasov
Restore the lost masking of TOS in input route code to allow ip rules to match it properly. Problem [1] noticed by Shmulik Ladkani <shmulik.ladkani@gmail.com> [1] http://marc.info/?t=137331755300040&r=1&w=2 Fixes: 89aef8921bfb ("ipv4: Delete routing cache.") Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-26ipv4: add missing initialization for flowi4_uidJulian Anastasov
Avoid matching of random stack value for uid when rules are looked up on input route or when RP filter is used. Problem should affect only setups that use ip rules with uid range. Fixes: 622ec2c9d524 ("net: core: add UID to flows, rules, and routes") Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-25Merge tag 'for-next-dma_ops' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma Pull rdma DMA mapping updates from Doug Ledford: "Drop IB DMA mapping code and use core DMA code instead. Bart Van Assche noted that the ib DMA mapping code was significantly similar enough to the core DMA mapping code that with a few changes it was possible to remove the IB DMA mapping code entirely and switch the RDMA stack to use the core DMA mapping code. This resulted in a nice set of cleanups, but touched the entire tree and has been kept separate for that reason." * tag 'for-next-dma_ops' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: (37 commits) IB/rxe, IB/rdmavt: Use dma_virt_ops instead of duplicating it IB/core: Remove ib_device.dma_device nvme-rdma: Switch from dma_device to dev.parent RDS: net: Switch from dma_device to dev.parent IB/srpt: Modify a debug statement IB/srp: Switch from dma_device to dev.parent IB/iser: Switch from dma_device to dev.parent IB/IPoIB: Switch from dma_device to dev.parent IB/rxe: Switch from dma_device to dev.parent IB/vmw_pvrdma: Switch from dma_device to dev.parent IB/usnic: Switch from dma_device to dev.parent IB/qib: Switch from dma_device to dev.parent IB/qedr: Switch from dma_device to dev.parent IB/ocrdma: Switch from dma_device to dev.parent IB/nes: Remove a superfluous assignment statement IB/mthca: Switch from dma_device to dev.parent IB/mlx5: Switch from dma_device to dev.parent IB/mlx4: Switch from dma_device to dev.parent IB/i40iw: Remove a superfluous assignment statement IB/hns: Switch from dma_device to dev.parent ...
2017-02-25netfilter: nf_ct_expect: nf_ct_expect_related_report(): Return zero on success.Jarno Rajahalme
Commit 4dee62b1b9b4 ("netfilter: nf_ct_expect: nf_ct_expect_insert() returns void") inadvertently changed the successful return value of nf_ct_expect_related_report() from 0 to 1, which caused openvswitch conntrack integration fail in FTP test cases. Fix this by always returning zero on the success code path. Fixes: 4dee62b1b9b4 ("netfilter: nf_ct_expect: nf_ct_expect_insert() returns void") Signed-off-by: Jarno Rajahalme <jarno@ovn.org> Acked-by: Joe Stringer <joe@ovn.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-02-24sunrpc: don't register UDP port with rpcbind when version needs congestion ↵Jeff Layton
control Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-02-24nfs/nfsd/sunrpc: enforce transport requirements for NFSv4Jeff Layton
NFSv4 requires a transport "that is specified to avoid network congestion" (RFC 7530, section 3.1, paragraph 2). In practical terms, that means that you should not run NFSv4 over UDP. The server has never enforced that requirement, however. This patchset fixes this by adding a new flag to the svc_version that states that it has these transport requirements. With that, we can check that the transport has XPT_CONG_CTRL set before processing an RPC. If it doesn't we reject it with RPC_PROG_MISMATCH. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-02-24sunrpc: flag transports as having congestion controlJeff Layton
NFSv4 requires a transport protocol with congestion control in most cases. On an IP network, that means that NFSv4 over UDP should be forbidden. The situation with RDMA is a bit more nuanced, but most RDMA transports are suitable for this. For now, we assume that all RDMA transports are suitable, but we may need to revise that at some point. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-02-24sunrpc: turn bitfield flags in svc_version into boolsJeff Layton
It's just simpler to read this way, IMO. Also, no need to explicitly set vs_hidden to false in the nfsacl ones. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-02-24rds: fix memory leak errorZhu Yanjun
When the function register_netdevice_notifier fails, the memory allocated by kmem_cache_create should be freed by the function kmem_cache_destroy. Cc: Joe Jin <joe.jin@oracle.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Acked-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-24libceph, rbd, ceph: WRITE | ONDISK -> WRITEIlya Dryomov
CEPH_OSD_FLAG_ONDISK is set in account_request(). Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Reviewed-by: Sage Weil <sage@redhat.com>
2017-02-24libceph: get rid of ack vs commitIlya Dryomov
- CEPH_OSD_FLAG_ACK shouldn't be set anymore, so assert on it - remove support for handling ack replies (OSDs will send ack replies only if clients request them) - drop the "do lingering callbacks under osd->lock" logic from handle_reply() -- lreq->lock is sufficient in all three cases Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Reviewed-by: Sage Weil <sage@redhat.com>
2017-02-24vti6: return GRE_KEY for vti6David Forster
Align vti6 with vti by returning GRE_KEY flag. This enables iproute2 to display tunnel keys on "ip -6 tunnel show" Signed-off-by: David Forster <dforster@brocade.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-24rxrpc: Fix an assertion in rxrpc_read()Marc Dionne
In the rxrpc_read() function, which allows a user to read the contents of a key, we miscalculate the expected length of an encoded rxkad token by not taking into account the key length. However, the data is stored later anyway with an ENCODE_DATA() call - and an assertion failure then ensues when the lengths are checked at the end. Fix this by including the key length in the token size estimation. The following assertion is produced: Assertion failed - 384(0x180) == 380(0x17c) is false ------------[ cut here ]------------ kernel BUG at ../net/rxrpc/key.c:1221! invalid opcode: 0000 [#1] SMP Modules linked in: CPU: 2 PID: 2957 Comm: keyctl Not tainted 4.10.0-fscache+ #483 Hardware name: ASUS All Series/H97-PLUS, BIOS 2306 10/09/2014 task: ffff8804013a8500 task.stack: ffff8804013ac000 RIP: 0010:rxrpc_read+0x10de/0x11b6 RSP: 0018:ffff8804013afe48 EFLAGS: 00010296 RAX: 000000000000003b RBX: 0000000000000003 RCX: 0000000000000000 RDX: 0000000000040001 RSI: 00000000000000f6 RDI: 0000000000000300 RBP: ffff8804013afed8 R08: 0000000000000001 R09: 0000000000000001 R10: ffff8804013afd90 R11: 0000000000000002 R12: 00005575f7c911b4 R13: 00005575f7c911b3 R14: 0000000000000157 R15: ffff880408a5d640 FS: 00007f8dfbc73700(0000) GS:ffff88041fb00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00005575f7c91008 CR3: 000000040120a000 CR4: 00000000001406e0 Call Trace: keyctl_read_key+0xb6/0xd7 SyS_keyctl+0x83/0xe7 do_syscall_64+0x80/0x191 entry_SYSCALL64_slow_path+0x25/0x25 Signed-off-by: Marc Dionne <marc.dionne@auristor.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-24tipc: move premature initilalization of stack variablesJon Paul Maloy
In the function tipc_rcv() we initialize a couple of stack variables from the message header before that same header has been validated. In rare cases when the arriving header is non-linar, the validation function itself may linearize the buffer by calling skb_may_pull(), while the wrongly initialized stack fields are not updated accordingly. We fix this in this commit. Reported-by: Matthew Wong <mwong@sonusnet.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-24RDS: IB: fix ifnullfree.cocci warningsWu Fengguang
net/rds/ib.c:115:2-7: WARNING: NULL check before freeing functions like kfree, debugfs_remove, debugfs_remove_recursive or usb_free_urb is not needed. Maybe consider reorganizing relevant code to avoid passing NULL values. NULL check before some freeing functions is not needed. Based on checkpatch warning "kfree(NULL) is safe this check is probably not required" and kfreeaddr.cocci by Julia Lawall. Generated by: scripts/coccinelle/free/ifnullfree.cocci Signed-off-by: Fengguang Wu <fengguang.wu@intel.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-24sctp: deny peeloff operation on asocs with threads sleeping on itMarcelo Ricardo Leitner
commit 2dcab5984841 ("sctp: avoid BUG_ON on sctp_wait_for_sndbuf") attempted to avoid a BUG_ON call when the association being used for a sendmsg() is blocked waiting for more sndbuf and another thread did a peeloff operation on such asoc, moving it to another socket. As Ben Hutchings noticed, then in such case it would return without locking back the socket and would cause two unlocks in a row. Further analysis also revealed that it could allow a double free if the application managed to peeloff the asoc that is created during the sendmsg call, because then sctp_sendmsg() would try to free the asoc that was created only for that call. This patch takes another approach. It will deny the peeloff operation if there is a thread sleeping on the asoc, so this situation doesn't exist anymore. This avoids the issues described above and also honors the syscalls that are already being handled (it can be multiple sendmsg calls). Joint work with Xin Long. Fixes: 2dcab5984841 ("sctp: avoid BUG_ON on sctp_wait_for_sndbuf") Cc: Alexander Popov <alex.popov@linux.com> Cc: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-23crush: fix dprintk compilationIlya Dryomov
The syntax error was not noticed because dprintk is a macro and the code is discarded by default. Reflects ceph.git commit f29b840c64a933b2cb13e3da6f3d785effd73a57. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>