summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2015-05-04net: core: Correct an over-stringent device loop detection.Vlad Yasevich
The code in __netdev_upper_dev_link() has an over-stringent loop detection logic that actually prevents valid configurations from working correctly. In particular, the logic returns an error if an upper device is already in the list of all upper devices for a given dev. This particular check seems to be a overzealous as it disallows perfectly valid configurations. For example: # ip l a link eth0 name eth0.10 type vlan id 10 # ip l a dev br0 typ bridge # ip l s eth0.10 master br0 # ip l s eth0 master br0 <--- Will fail If you switch the last two commands (add eth0 first), then both will succeed. If after that, you remove eth0 and try to re-add it, it will fail! It appears to be enough to simply check adj_list to keeps things safe. I've tried stacking multiple devices multiple times in all different combinations, and either rx_handler registration prevented the stacking of the device linking cought the error. Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com> Acked-by: Jiri Pirko <jiri@resnulli.us> Acked-by: Veaceslav Falico <vfalico@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-04svcrpc: fix potential GSSX_ACCEPT_SEC_CONTEXT decoding failuresScott Mayhew
In an environment where the KDC is running Active Directory, the exported composite name field returned in the context could be large enough to span a page boundary. Attaching a scratch buffer to the decoding xdr_stream helps deal with those cases. The case where we saw this was actually due to behavior that's been fixed in newer gss-proxy versions, but we're fixing it here too. Signed-off-by: Scott Mayhew <smayhew@redhat.com> Cc: stable@vger.kernel.org Reviewed-by: Simo Sorce <simo@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-05-04Revert "net: kernel socket should be released in init_net namespace"Herbert Xu
This reverts commit c243d7e20996254f89c28d4838b5feca735c030d. That patch is solving a non-existant problem while creating a real problem. Just because a socket is allocated in the init name space doesn't mean that it gets hashed in the init name space. When we unhash it the name space must be the same as the one we had when we hashed it. So this patch is completely bogus and causes socket leaks. Reported-by: Andrey Wagin <avagin@gmail.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-03net/rds: fix unaligned memory accessshamir rabinovitch
rdma_conn_param private data is copied using memcpy after headers such as cma_hdr (see cma_resolve_ib_udp as example). so the start of the private data is aligned to the end of the structure that come before. if this structure end with u32 the meaning is that the start of the private data will be 4 bytes aligned. structures that use u8/u16/u32/u64 are naturally aligned but in case the structure start is not 8 bytes aligned, all u64 members of this structure will not be aligned. to solve this issue we must use special macros that allow unaligned access to those unaligned members. Addresses the following kernel log seen when attempting to use RDMA: Kernel unaligned access at TPC[10507a88] rds_ib_cm_connect_complete+0x1bc/0x1e0 [rds_rdma] Acked-by: Chien Yen <chien.yen@oracle.com> Signed-off-by: shamir rabinovitch <shamir.rabinovitch@oracle.com> [Minor tweaks for top of tree by:] Signed-off-by: David Ahern <david.ahern@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-03netlink: Remove max_size settingHerbert Xu
We currently limit the hash table size to 64K which is very bad as even 10 years ago it was relatively easy to generate millions of sockets. Since the hash table is naturally limited by memory allocation failure, we don't really need an explicit limit so this patch removes it. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Acked-by: Thomas Graf <tgraf@noironetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-03codel: fix maxpacket/mtu confusionEric Dumazet
Under presence of TSO/GSO/GRO packets, codel at low rates can be quite useless. In following example, not a single packet was ever dropped, while average delay in codel queue is ~100 ms ! qdisc codel 0: parent 1:12 limit 16000p target 5.0ms interval 100.0ms Sent 134376498 bytes 88797 pkt (dropped 0, overlimits 0 requeues 0) backlog 13626b 3p requeues 0 count 0 lastcount 0 ldelay 96.9ms drop_next 0us maxpacket 9084 ecn_mark 0 drop_overlimit 0 This comes from a confusion of what should be the minimal backlog. It is pretty clear it is not 64KB or whatever max GSO packet ever reached the qdisc. codel intent was to use MTU of the device. After the fix, we finally drop some packets, and rtt/cwnd of my single TCP flow are meeting our expectations. qdisc codel 0: parent 1:12 limit 16000p target 5.0ms interval 100.0ms Sent 102798497 bytes 67912 pkt (dropped 1365, overlimits 0 requeues 0) backlog 6056b 3p requeues 0 count 1 lastcount 1 ldelay 36.3ms drop_next 0us maxpacket 10598 ecn_mark 0 drop_overlimit 0 Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Kathleen Nichols <nichols@pollere.com> Cc: Dave Taht <dave.taht@gmail.com> Cc: Van Jacobson <vanj@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-01ipv4: Missing sk_nulls_node_init() in ping_unhash().David S. Miller
If we don't do that, then the poison value is left in the ->pprev backlink. This can cause crashes if we do a disconnect, followed by a connect(). Tested-by: Linus Torvalds <torvalds@linux-foundation.org> Reported-by: Wen Xu <hotdog3645@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-30ieee802154: trace: fix endian convertionAlexander Aring
This patch fix endian convertions for extended address and short address handling when TP_printk is called. Signed-off-by: Alexander Aring <alex.aring@gmail.com> Cc: Guido Günther <agx@sigxcpu.org> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-04-30cfg802154: pass name_assign_type to rdev_add_virtual_intf()Varka Bhadram
This code is based on commit 6bab2e19c5ffd ("cfg80211: pass name_assign_type to rdev_add_virtual_intf()") This will expose in sysfs whether the ifname of a IEEE-802.15.4 device is set by userspace or generated by the kernel. We are using two types of name_assign_types o NET_NAME_ENUM: Default interface name provided by kernel o NET_NAME_USER: Interface name provided by user. Signed-off-by: Varka Bhadram <varkab@cdac.in> Signed-off-by: Alexander Aring <alex.aring@gmail.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-04-30ieee802154: Add trace events for rdev->opsGuido Günther
Enabling tracing via echo 1 > /sys/kernel/debug/tracing/events/cfg802154/enable enables event tracing like iwpan dev wpan0 set pan_id 0xbeef cat /sys/kernel/debug/tracing/trace # tracer: nop # # entries-in-buffer/entries-written: 2/2 #P:1 # # _-----=> irqs-off # / _----=> need-resched # | / _---=> hardirq/softirq # || / _--=> preempt-depth # ||| / delay # TASK-PID CPU# |||| TIMESTAMP FUNCTION # | | | |||| | | iwpan-2663 [000] .... 170.369142: 802154_rdev_set_pan_id: phy0, wpan_dev(1), pan id: 0xbeef iwpan-2663 [000] .... 170.369177: 802154_rdev_return_int: phy0, returned: 0 Signed-off-by: Guido Günther <agx@sigxcpu.org> Signed-off-by: Alexander Aring <alex.aring@gmail.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-04-30mac802154: llsec: fix return value check in llsec_key_alloc()Wei Yongjun
In case of error, the functions crypto_alloc_aead() and crypto_alloc_blkcipher() returns ERR_PTR() and never returns NULL. The NULL test in the return value check should be replaced with IS_ERR(). Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Signed-off-by: Alexander Aring <alex.aring@gmail.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-04-30mac802154: fix ieee802154_register_hw error handlingAlexander Aring
Currently if ieee802154_if_add failed, we don't unregister the wpan phy which was registered before. This patch adds a correct error handling for unregister the wpan phy when ieee802154_if_add failed. Signed-off-by: Alexander Aring <alex.aring@gmail.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-04-30Bluetooth: Skip the shutdown routine if the interface is not upGabriele Mazzotta
Most likely, the shutdown routine requires the interface to be up. This is the case for BTUSB_INTEL: the routine tries to send a command to the interface, but since this one is down, it fails and exits once HCI_INIT_TIMEOUT has expired. Signed-off-by: Gabriele Mazzotta <gabriele.mzt@gmail.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Cc: stable@vger.kernel.org # 4.0.x
2015-04-29tcp: update reordering first before detecting lossYuchung Cheng
tcp_mark_lost_retrans is not used when FACK is disabled. Since tcp_update_reordering may disable FACK, it should be called first before tcp_mark_lost_retrans. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Nandita Dukkipati <nanditad@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-29tcp: add TCP_CC_INFO socket optionEric Dumazet
Some Congestion Control modules can provide per flow information, but current way to get this information is to use netlink. Like TCP_INFO, let's add TCP_CC_INFO so that applications can issue a getsockopt() if they have a socket file descriptor, instead of playing complex netlink games. Sample usage would be : union tcp_cc_info info; socklen_t len = sizeof(info); if (getsockopt(fd, SOL_TCP, TCP_CC_INFO, &info, &len) == -1) Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Neal Cardwell <ncardwell@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-29tcp: prepare CC get_info() access from getsockopt()Eric Dumazet
We would like that optional info provided by Congestion Control modules using netlink can also be read using getsockopt() This patch changes get_info() to put this information in a buffer, instead of skb, like tcp_get_info(), so that following patch can reuse this common infrastructure. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Neal Cardwell <ncardwell@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-29tcp: add tcpi_bytes_received to tcp_infoEric Dumazet
This patch tracks total number of payload bytes received on a TCP socket. This is the sum of all changes done to tp->rcv_nxt RFC4898 named this : tcpEStatsAppHCThruOctetsReceived This is a 64bit field, and can be fetched both from TCP_INFO getsockopt() if one has a handle on a TCP socket, or from inet_diag netlink facility (iproute2/ss patch will follow) Note that tp->bytes_received was placed near tp->rcv_nxt for best data locality and minimal performance impact. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Matt Mathis <mattmathis@google.com> Cc: Eric Salo <salo@google.com> Cc: Martin Lau <kafai@fb.com> Cc: Chris Rapier <rapier@psc.edu> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-29tcp: add tcpi_bytes_acked to tcp_infoEric Dumazet
This patch tracks total number of bytes acked for a TCP socket. This is the sum of all changes done to tp->snd_una, and allows for precise tracking of delivered data. RFC4898 named this : tcpEStatsAppHCThruOctetsAcked This is a 64bit field, and can be fetched both from TCP_INFO getsockopt() if one has a handle on a TCP socket, or from inet_diag netlink facility (iproute2/ss patch will follow) Note that tp->bytes_acked was placed near tp->snd_una for best data locality and minimal performance impact. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Cc: Matt Mathis <mattmathis@google.com> Cc: Eric Salo <salo@google.com> Cc: Martin Lau <kafai@fb.com> Cc: Chris Rapier <rapier@psc.edu> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-29net: dsa: Fix scope of eeprom-length propertyGuenter Roeck
eeprom-length is a switch property, not a dsa property, and thus needs to be attached to the switch node, not to the dsa node. Reported-by: Andrew Lunn <andrew@lunn.ch> Fixes: 6793abb4e849 ("net: dsa: Add support for switch EEPROM access") Signed-off-by: Guenter Roeck <linux@roeck-us.net> Acked-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-29tipc: fix problem with parallel link synchronization mechanismJon Paul Maloy
Currently, we try to accumulate arrived packets in the links's 'deferred' queue during the parallel link syncronization phase. This entails two problems: - With an unlucky combination of arriving packets the algorithm may go into a lockstep with the out-of-sequence handling function, where the synch mechanism is adding a packet to the deferred queue, while the out-of-sequence handling is retrieving it again, thus ending up in a loop inside the node_lock scope. - Even if this is avoided, the link will very often send out unnecessary protocol messages, in the worst case leading to redundant retransmissions. We fix this by just dropping arriving packets on the upcoming link during the synchronization phase, thus relying on the retransmission protocol to resolve the situation once the two links have arrived to a synchronized state. Reviewed-by: Erik Hugne <erik.hugne@ericsson.com> Reviewed-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-29tipc: remove wrong use of NLM_F_MULTINicolas Dichtel
NLM_F_MULTI must be used only when a NLMSG_DONE message is sent. In fact, it is sent only at the end of a dump. Libraries like libnl will wait forever for NLMSG_DONE. Fixes: 35b9dd7607f0 ("tipc: add bearer get/dump to new netlink api") Fixes: 7be57fc69184 ("tipc: add link get/dump to new netlink api") Fixes: 46f15c6794fb ("tipc: add media get/dump to new netlink api") CC: Richard Alpe <richard.alpe@ericsson.com> CC: Jon Maloy <jon.maloy@ericsson.com> CC: Ying Xue <ying.xue@windriver.com> CC: tipc-discussion@lists.sourceforge.net Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-29bridge/nl: remove wrong use of NLM_F_MULTINicolas Dichtel
NLM_F_MULTI must be used only when a NLMSG_DONE message is sent. In fact, it is sent only at the end of a dump. Libraries like libnl will wait forever for NLMSG_DONE. Fixes: e5a55a898720 ("net: create generic bridge ops") Fixes: 815cccbf10b2 ("ixgbe: add setlink, getlink support to ixgbe and ixgbevf") CC: John Fastabend <john.r.fastabend@intel.com> CC: Sathya Perla <sathya.perla@emulex.com> CC: Subbu Seetharaman <subbu.seetharaman@emulex.com> CC: Ajit Khaparde <ajit.khaparde@emulex.com> CC: Jeff Kirsher <jeffrey.t.kirsher@intel.com> CC: intel-wired-lan@lists.osuosl.org CC: Jiri Pirko <jiri@resnulli.us> CC: Scott Feldman <sfeldma@gmail.com> CC: Stephen Hemminger <stephen@networkplumber.org> CC: bridge@lists.linux-foundation.org Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-29bridge/mdb: remove wrong use of NLM_F_MULTINicolas Dichtel
NLM_F_MULTI must be used only when a NLMSG_DONE message is sent. In fact, it is sent only at the end of a dump. Libraries like libnl will wait forever for NLMSG_DONE. Fixes: 37a393bc4932 ("bridge: notify mdb changes via netlink") CC: Cong Wang <amwang@redhat.com> CC: Stephen Hemminger <stephen@networkplumber.org> CC: bridge@lists.linux-foundation.org Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-29net: sched: act_connmark: don't zap skb->nfctFlorian Westphal
This action is meant to be passive, i.e. we should not alter skb->nfct: If nfct is present just leave it alone. Compile tested only. Cc: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-29route: Use ipv4_mtu instead of raw rt_pmtuHerbert Xu
The commit 3cdaa5be9e81a914e633a6be7b7d2ef75b528562 ("ipv4: Don't increase PMTU with Datagram Too Big message") broke PMTU in cases where the rt_pmtu value has expired but is smaller than the new PMTU value. This obsolete rt_pmtu then prevents the new PMTU value from being installed. Fixes: 3cdaa5be9e81 ("ipv4: Don't increase PMTU with Datagram Too Big message") Reported-by: Gerd v. Egidy <gerd.von.egidy@intra2net.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-27Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains Netfilter fixes for your net tree, they are: 1) Fix a crash in nf_tables when dictionaries are used from the ruleset, due to memory corruption, from Florian Westphal. 2) Fix another crash in nf_queue when used with br_netfilter. Also from Florian. Both fixes are related to new stuff that got in 4.0-rc. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-27Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds
Pull networking fixes from David Miller: 1) mlx4 doesn't check fully for supported valid RSS hash function, fix from Amir Vadai 2) Off by one in ibmveth_change_mtu(), from David Gibson 3) Prevent altera chip from reporting false error interrupts in some circumstances, from Chee Nouk Phoon 4) Get rid of that stupid endless loop trying to allocate a FIN packet in TCP, and in the process kill deadlocks. From Eric Dumazet 5) Fix get_rps_cpus() crash due to wrong invalid-cpu value, also from Eric Dumazet 6) Fix two bugs in async rhashtable resizing, from Thomas Graf 7) Fix topology server listener socket namespace bug in TIPC, from Ying Xue 8) Add some missing HAS_DMA kconfig dependencies, from Geert Uytterhoeven 9) bgmac driver intends to force re-polling but does so by returning the wrong value from it's ->poll() handler. Fix from Rafał Miłecki 10) When the creater of an rhashtable configures a max size for it, don't bark in the logs and drop insertions when that is exceeded. Fix from Johannes Berg 11) Recover from out of order packets in ppp mppe properly, from Sylvain Rochet * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (41 commits) bnx2x: really disable TPA if 'disable_tpa' option is set net:treewide: Fix typo in drivers/net net/mlx4_en: Prevent setting invalid RSS hash function mdio-mux-gpio: use new gpiod_get_array and gpiod_put_array functions netfilter; Add some missing default cases to switch statements in nft_reject. ppp: mppe: discard late packet in stateless mode ppp: mppe: sanity error path rework net/bonding: Make DRV macros private net: rfs: fix crash in get_rps_cpus() altera tse: add support for fixed-links. pxa168: fix double deallocation of managed resources net: fix crash in build_skb() net: eth: altera: Resolve false errors from MSGDMA to TSE ehea: Fix memory hook reference counting crashes net/tg3: Release IRQs on permanent error net: mdio-gpio: support access that may sleep inet: fix possible panic in reqsk_queue_unlink() rhashtable: don't attempt to grow when at max_size bgmac: fix requests for extra polling calls from NAPI tcp: avoid looping in tcp_send_fin() ...
2015-04-27netfilter; Add some missing default cases to switch statements in nft_reject.David S. Miller
This fixes: ==================== net/netfilter/nft_reject.c: In function ‘nft_reject_dump’: net/netfilter/nft_reject.c:61:2: warning: enumeration value ‘NFT_REJECT_TCP_RST’ not handled in switch [-Wswitch] switch (priv->type) { ^ net/netfilter/nft_reject.c:61:2: warning: enumeration value ‘NFT_REJECT_ICMPX_UNREACH’ not handled in switch [-Wswi\ tch] net/netfilter/nft_reject_inet.c: In function ‘nft_reject_inet_dump’: net/netfilter/nft_reject_inet.c:105:2: warning: enumeration value ‘NFT_REJECT_TCP_RST’ not handled in switch [-Wswi\ tch] switch (priv->type) { ^ ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-26Merge tag 'nfs-for-4.1-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds
Pull NFS client updates from Trond Myklebust: "Another set of mainly bugfixes and a couple of cleanups. No new functionality in this round. Highlights include: Stable patches: - Fix a regression in /proc/self/mountstats - Fix the pNFS flexfiles O_DIRECT support - Fix high load average due to callback thread sleeping Bugfixes: - Various patches to fix the pNFS layoutcommit support - Do not cache pNFS deviceids unless server notifications are enabled - Fix a SUNRPC transport reconnection regression - make debugfs file creation failure non-fatal in SUNRPC - Another fix for circular directory warnings on NFSv4 "junctioned" mountpoints - Fix locking around NFSv4.2 fallocate() support - Truncating NFSv4 file opens should also sync O_DIRECT writes - Prevent infinite loop in rpcrdma_ep_create() Features: - Various improvements to the RDMA transport code's handling of memory registration - Various code cleanups" * tag 'nfs-for-4.1-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (55 commits) fs/nfs: fix new compiler warning about boolean in switch nfs: Remove unneeded casts in nfs NFS: Don't attempt to decode missing directory entries Revert "nfs: replace nfs_add_stats with nfs_inc_stats when add one" NFS: Rename idmap.c to nfs4idmap.c NFS: Move nfs_idmap.h into fs/nfs/ NFS: Remove CONFIG_NFS_V4 checks from nfs_idmap.h NFS: Add a stub for GETDEVICELIST nfs: remove WARN_ON_ONCE from nfs_direct_good_bytes nfs: fix DIO good bytes calculation nfs: Fetch MOUNTED_ON_FILEID when updating an inode sunrpc: make debugfs file creation failure non-fatal nfs: fix high load average due to callback thread sleeping NFS: Reduce time spent holding the i_mutex during fallocate() NFS: Don't zap caches on fallocate() xprtrdma: Make rpcrdma_{un}map_one() into inline functions xprtrdma: Handle non-SEND completions via a callout xprtrdma: Add "open" memreg op xprtrdma: Add "destroy MRs" memreg op xprtrdma: Add "reset MRs" memreg op ...
2015-04-26Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull fourth vfs update from Al Viro: "d_inode() annotations from David Howells (sat in for-next since before the beginning of merge window) + four assorted fixes" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: RCU pathwalk breakage when running into a symlink overmounting something fix I_DIO_WAKEUP definition direct-io: only inc/dec inode->i_dio_count for file systems fs/9p: fix readdir() VFS: assorted d_backing_inode() annotations VFS: fs/inode.c helpers: d_inode() annotations VFS: fs/cachefiles: d_backing_inode() annotations VFS: fs library helpers: d_inode() annotations VFS: assorted weird filesystems: d_inode() annotations VFS: normal filesystems (and lustre): d_inode() annotations VFS: security/: d_inode() annotations VFS: security/: d_backing_inode() annotations VFS: net/: d_inode() annotations VFS: net/unix: d_backing_inode() annotations VFS: kernel/: d_inode() annotations VFS: audit: d_backing_inode() annotations VFS: Fix up some ->d_inode accesses in the chelsio driver VFS: Cachefiles should perform fs modifications on the top layer only VFS: AF_UNIX sockets should call mknod on the top layer only
2015-04-26net: rfs: fix crash in get_rps_cpus()Eric Dumazet
Commit 567e4b79731c ("net: rfs: add hash collision detection") had one mistake : RPS_NO_CPU is no longer the marker for invalid cpu in set_rps_cpu() and get_rps_cpu(), as @next_cpu was the result of an AND with rps_cpu_mask This bug showed up on a host with 72 cpus : next_cpu was 0x7f, and the code was trying to access percpu data of an non existent cpu. In a follow up patch, we might get rid of compares against nr_cpu_ids, if we init the tables with 0. This is silly to test for a very unlikely condition that exists only shortly after table initialization, as we got rid of rps_reset_sock_flow() and similar functions that were writing this RPS_NO_CPU magic value at flow dismantle : When table is old enough, it never contains this value anymore. Fixes: 567e4b79731c ("net: rfs: add hash collision detection") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Tom Herbert <tom@herbertland.com> Cc: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-25net: fix crash in build_skb()Eric Dumazet
When I added pfmemalloc support in build_skb(), I forgot netlink was using build_skb() with a vmalloc() area. In this patch I introduce __build_skb() for netlink use, and build_skb() is a wrapper handling both skb->head_frag and skb->pfmemalloc This means netlink no longer has to hack skb->head_frag [ 1567.700067] kernel BUG at arch/x86/mm/physaddr.c:26! [ 1567.700067] invalid opcode: 0000 [#1] PREEMPT SMP KASAN [ 1567.700067] Dumping ftrace buffer: [ 1567.700067] (ftrace buffer empty) [ 1567.700067] Modules linked in: [ 1567.700067] CPU: 9 PID: 16186 Comm: trinity-c182 Not tainted 4.0.0-next-20150424-sasha-00037-g4796e21 #2167 [ 1567.700067] task: ffff880127efb000 ti: ffff880246770000 task.ti: ffff880246770000 [ 1567.700067] RIP: __phys_addr (arch/x86/mm/physaddr.c:26 (discriminator 3)) [ 1567.700067] RSP: 0018:ffff8802467779d8 EFLAGS: 00010202 [ 1567.700067] RAX: 000041000ed8e000 RBX: ffffc9008ed8e000 RCX: 000000000000002c [ 1567.700067] RDX: 0000000000000004 RSI: 0000000000000000 RDI: ffffffffb3fd6049 [ 1567.700067] RBP: ffff8802467779f8 R08: 0000000000000019 R09: ffff8801d0168000 [ 1567.700067] R10: ffff8801d01680c7 R11: ffffed003a02d019 R12: ffffc9000ed8e000 [ 1567.700067] R13: 0000000000000f40 R14: 0000000000001180 R15: ffffc9000ed8e000 [ 1567.700067] FS: 00007f2a7da3f700(0000) GS:ffff8801d1000000(0000) knlGS:0000000000000000 [ 1567.700067] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 1567.700067] CR2: 0000000000738308 CR3: 000000022e329000 CR4: 00000000000007e0 [ 1567.700067] Stack: [ 1567.700067] ffffc9000ed8e000 ffff8801d0168000 ffffc9000ed8e000 ffff8801d0168000 [ 1567.700067] ffff880246777a28 ffffffffad7c0a21 0000000000001080 ffff880246777c08 [ 1567.700067] ffff88060d302e68 ffff880246777b58 ffff880246777b88 ffffffffad9a6821 [ 1567.700067] Call Trace: [ 1567.700067] build_skb (include/linux/mm.h:508 net/core/skbuff.c:316) [ 1567.700067] netlink_sendmsg (net/netlink/af_netlink.c:1633 net/netlink/af_netlink.c:2329) [ 1567.774369] ? sched_clock_cpu (kernel/sched/clock.c:311) [ 1567.774369] ? netlink_unicast (net/netlink/af_netlink.c:2273) [ 1567.774369] ? netlink_unicast (net/netlink/af_netlink.c:2273) [ 1567.774369] sock_sendmsg (net/socket.c:614 net/socket.c:623) [ 1567.774369] sock_write_iter (net/socket.c:823) [ 1567.774369] ? sock_sendmsg (net/socket.c:806) [ 1567.774369] __vfs_write (fs/read_write.c:479 fs/read_write.c:491) [ 1567.774369] ? get_lock_stats (kernel/locking/lockdep.c:249) [ 1567.774369] ? default_llseek (fs/read_write.c:487) [ 1567.774369] ? vtime_account_user (kernel/sched/cputime.c:701) [ 1567.774369] ? rw_verify_area (fs/read_write.c:406 (discriminator 4)) [ 1567.774369] vfs_write (fs/read_write.c:539) [ 1567.774369] SyS_write (fs/read_write.c:586 fs/read_write.c:577) [ 1567.774369] ? SyS_read (fs/read_write.c:577) [ 1567.774369] ? __this_cpu_preempt_check (lib/smp_processor_id.c:63) [ 1567.774369] ? trace_hardirqs_on_caller (kernel/locking/lockdep.c:2594 kernel/locking/lockdep.c:2636) [ 1567.774369] ? trace_hardirqs_on_thunk (arch/x86/lib/thunk_64.S:42) [ 1567.774369] system_call_fastpath (arch/x86/kernel/entry_64.S:261) Fixes: 79930f5892e ("net: do not deplete pfmemalloc reserve") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-24netfilter: nf_tables: fix wrong length for jump/goto verdictsFlorian Westphal
NFT_JUMP/GOTO erronously sets length to sizeof(void *). We then allocate insufficient memory when such element is added to a vmap. Suggested-by: Patrick McHardy <kaber@trash.net> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-04-24inet: fix possible panic in reqsk_queue_unlink()Eric Dumazet
[ 3897.923145] BUG: unable to handle kernel NULL pointer dereference at 0000000000000080 [ 3897.931025] IP: [<ffffffffa9f27686>] reqsk_timer_handler+0x1a6/0x243 There is a race when reqsk_timer_handler() and tcp_check_req() call inet_csk_reqsk_queue_unlink() on the same req at the same time. Before commit fa76ce7328b2 ("inet: get rid of central tcp/dccp listener timer"), listener spinlock was held and race could not happen. To solve this bug, we change reqsk_queue_unlink() to not assume req must be found, and we return a status, to conditionally release a refcount on the request sock. This also means tcp_check_req() in non fastopen case might or not consume req refcount, so tcp_v6_hnd_req() & tcp_v4_hnd_req() have to properly handle this. (Same remark for dccp_check_req() and its callers) inet_csk_reqsk_queue_drop() is now too big to be inlined, as it is called 4 times in tcp and 3 times in dccp. Fixes: fa76ce7328b2 ("inet: get rid of central tcp/dccp listener timer") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-24tcp: avoid looping in tcp_send_fin()Eric Dumazet
Presence of an unbound loop in tcp_send_fin() had always been hard to explain when analyzing crash dumps involving gigantic dying processes with millions of sockets. Lets try a different strategy : In case of memory pressure, try to add the FIN flag to last packet in write queue, even if packet was already sent. TCP stack will be able to deliver this FIN after a timeout event. Note that this FIN being delivered by a retransmit, it also carries a Push flag given our current implementation. By checking sk_under_memory_pressure(), we anticipate that cooking many FIN packets might deplete tcp memory. In the case we could not allocate a packet, even with __GFP_WAIT allocation, then not sending a FIN seems quite reasonable if it allows to get rid of this socket, free memory, and not block the process from eventually doing other useful work. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-24mac80211: enable hash table shrinkingJohannes Berg
The hashtable behaviour change was merged into the tree at about the same time as the mac80211 use of rhashtable, but of course these don't really conflict in the normal sense. Enable hash table shrinking now. Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-04-23Merge tag 'nfs-rdma-for-4.1-1' of git://git.linux-nfs.org/projects/anna/nfs-rdmaTrond Myklebust
NFS: NFSoRDMA Client Changes This patch series creates an operation vector for each of the different memory registration modes. This should make it easier to one day increase credit limit, rsize, and wsize. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-04-23Merge branch 'bugfixes'Trond Myklebust
* bugfixes: NFSv4: Return delegations synchronously in evict_inode SUNRPC: Fix a regression when reconnecting NFS: remount with security change should return EINVAL nfs: do not export discarded symbols NFSv4.1: don't export static symbol
2015-04-23sunrpc: make debugfs file creation failure non-fatalJeff Layton
v2: gracefully handle the case where some dentry pointers end up NULL and be more dilligent about zeroing out dentry pointers We currently have a problem that SELinux policy is being enforced when creating debugfs files. If a debugfs file is created as a side effect of doing some syscall, then that creation can fail if the SELinux policy for that process prevents it. This seems wrong. We don't do that for files under /proc, for instance, so Bruce has proposed a patch to fix that. While discussing that patch however, Greg K.H. stated: "No kernel code should care / fail if a debugfs function fails, so please fix up the sunrpc code first." This patch converts all of the sunrpc debugfs setup code to be void return functins, and the callers to not look for errors from those functions. This should allow rpc_clnt and rpc_xprt creation to work, even if the kernel fails to create debugfs files for some reason. Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: "J. Bruce Fields" <bfields@fieldses.org> Signed-off-by: Jeff Layton <jeff.layton@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-04-23net: unix: garbage: fixed several comment and whitespace style issuesJason Eastman
fixed several comment and whitespace style issues Signed-off-by: Jason Eastman <eastman.jason.linux@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-23tipc: fix node refcount issueErik Hugne
When link statistics is dumped over netlink, we iterate over the list of peer nodes and append each links statistics to the netlink msg. In the case where the dump is resumed after filling up a nlmsg, the node refcnt is decremented without having been incremented previously which may cause the node reference to be freed. When this happens, the following info/stacktrace will be generated, followed by a crash or undefined behavior. We fix this by removing the erroneous call to tipc_node_put inside the loop that iterates over nodes. [ 384.312303] INFO: trying to register non-static key. [ 384.313110] the code is fine but needs lockdep annotation. [ 384.313290] turning off the locking correctness validator. [ 384.313290] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.0.0+ #13 [ 384.313290] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 [ 384.313290] ffff88003c6d0290 ffff88003cc03ca8 ffffffff8170adf1 0000000000000007 [ 384.313290] ffffffff82728730 ffff88003cc03d38 ffffffff810a6a6d 00000000001d7200 [ 384.313290] ffff88003c6d0ab0 ffff88003cc03ce8 0000000000000285 0000000000000001 [ 384.313290] Call Trace: [ 384.313290] <IRQ> [<ffffffff8170adf1>] dump_stack+0x4c/0x65 [ 384.313290] [<ffffffff810a6a6d>] __lock_acquire+0xf3d/0xf50 [ 384.313290] [<ffffffff810a7375>] lock_acquire+0xd5/0x290 [ 384.313290] [<ffffffffa0043e8c>] ? link_timeout+0x1c/0x170 [tipc] [ 384.313290] [<ffffffffa0043e70>] ? link_state_event+0x4e0/0x4e0 [tipc] [ 384.313290] [<ffffffff81712890>] _raw_spin_lock_bh+0x40/0x80 [ 384.313290] [<ffffffffa0043e8c>] ? link_timeout+0x1c/0x170 [tipc] [ 384.313290] [<ffffffffa0043e8c>] link_timeout+0x1c/0x170 [tipc] [ 384.313290] [<ffffffff810c4698>] call_timer_fn+0xb8/0x490 [ 384.313290] [<ffffffff810c45e0>] ? process_timeout+0x10/0x10 [ 384.313290] [<ffffffff810c5a2c>] run_timer_softirq+0x21c/0x420 [ 384.313290] [<ffffffffa0043e70>] ? link_state_event+0x4e0/0x4e0 [tipc] [ 384.313290] [<ffffffff8105a954>] __do_softirq+0xf4/0x630 [ 384.313290] [<ffffffff8105afdd>] irq_exit+0x5d/0x60 [ 384.313290] [<ffffffff8103ade1>] smp_apic_timer_interrupt+0x41/0x50 [ 384.313290] [<ffffffff817144a0>] apic_timer_interrupt+0x70/0x80 [ 384.313290] <EOI> [<ffffffff8100db10>] ? default_idle+0x20/0x210 [ 384.313290] [<ffffffff8100db0e>] ? default_idle+0x1e/0x210 [ 384.313290] [<ffffffff8100e61a>] arch_cpu_idle+0xa/0x10 [ 384.313290] [<ffffffff81099803>] cpu_startup_entry+0x2c3/0x530 [ 384.313290] [<ffffffff810d2893>] ? clockevents_register_device+0x113/0x200 [ 384.313290] [<ffffffff81038b0f>] start_secondary+0x13f/0x170 Fixes: 8a0f6ebe8494 ("tipc: involve reference counter for node structure") Signed-off-by: Erik Hugne <erik.hugne@ericsson.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-23tipc: fix random link reset problemErik Hugne
In the function tipc_sk_rcv(), the stack variable 'err' is only initialized to TIPC_ERR_NO_PORT for the first iteration over the link input queue. If a chain of messages are received from a link, failure to lookup the socket for any but the first message will cause the message to bounce back out on a random link. We fix this by properly initializing err. Signed-off-by: Erik Hugne <erik.hugne@ericsson.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-23tipc: fix topology server broken issueYing Xue
When a new topology server is launched in a new namespace, its listening socket is inserted into the "init ns" namespace's socket hash table rather than the one owned by the new namespace. Although the socket's namespace is forcedly changed to the new namespace later, the socket is still stored in the socket hash table of "init ns" namespace. When a client created in the new namespace connects its own topology server, the connection is failed as its server's socket could not be found from its own namespace's socket table. If __sock_create() instead of original sock_create_kern() is used to create the server's socket through specifying an expected namesapce, the socket will be inserted into the specified namespace's socket table, thereby avoiding to the topology server broken issue. Fixes: 76100a8a64bc ("tipc: fix netns refcnt leak") Reported-by: Erik Hugne <erik.hugne@ericsson.com> Signed-off-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-23mac80211: fix rhashtable conversionJohannes Berg
My conversion of the mac80211 station hash table to rhashtable completely broke the lookup in sta_info_get() as it no longer took into account the virtual interface. Fix that. Fixes: 7bedd0cfad4e1 ("mac80211: use rhashtable for station table") Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-04-22net: do not deplete pfmemalloc reserveEric Dumazet
build_skb() should look at the page pfmemalloc status. If set, this means page allocator allocated this page in the expectation it would help to free other pages. Networking stack can do that only if skb->pfmemalloc is also set. Also, we must refrain using high order pages from the pfmemalloc reserve, so __page_frag_refill() must also use __GFP_NOMEMALLOC for them. Under memory pressure, using order-0 pages is probably the best strategy. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-22ip6_gre: use netdev_alloc_pcpu_stats()Johannes Berg
The code there just open-codes the same, so use the provided macro instead. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-22Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client Pull Ceph updates from Sage Weil: "This time around we have a collection of CephFS fixes from Zheng around MDS failure handling and snapshots, support for a new CRUSH straw2 algorithm (to sync up with userspace) and several RBD cleanups and fixes from Ilya, an error path leak fix from Taesoo, and then an assorted collection of cleanups from others" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (28 commits) rbd: rbd_wq comment is obsolete libceph: announce support for straw2 buckets crush: straw2 bucket type with an efficient 64-bit crush_ln() crush: ensuring at most num-rep osds are selected crush: drop unnecessary include from mapper.c ceph: fix uninline data function ceph: rename snapshot support ceph: fix null pointer dereference in send_mds_reconnect() ceph: hold on to exclusive caps on complete directories libceph: simplify our debugfs attr macro ceph: show non-default options only libceph: expose client options through debugfs libceph, ceph: split ceph_show_options() rbd: mark block queue as non-rotational libceph: don't overwrite specific con error msgs ceph: cleanup unsafe requests when reconnecting is denied ceph: don't zero i_wrbuffer_ref when reconnecting is denied ceph: don't mark dirty caps when there is no auth cap ceph: keep i_snap_realm while there are writers libceph: osdmap.h: Add missing format newlines ...
2015-04-22mpls: Prevent use of implicit NULL label as outgoing labelRobert Shearman
The reserved implicit-NULL label isn't allowed to appear in the label stack for packets, so make it an error for the control plane to specify it as an outgoing label. Suggested-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Robert Shearman <rshearma@brocade.com> Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-22mpls: Per-device enabling of packet inputRobert Shearman
An MPLS network is a single trust domain where the edges must be in control of what labels make their way into the core. The simplest way of ensuring this is for the edge device to always impose the labels, and not allow forward labeled traffic from untrusted neighbours. This is achieved by allowing a per-device configuration of whether MPLS traffic input from that interface should be processed or not. To be secure by default, the default state is changed to MPLS being disabled on all interfaces unless explicitly enabled and no global option is provided to change the default. Whilst this differs from other protocols (e.g. IPv6), network operators are used to explicitly enabling MPLS forwarding on interfaces, and with the number of links to the MPLS core typically fairly low this doesn't present too much of a burden on operators. Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Robert Shearman <rshearma@brocade.com> Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-22mpls: Per-device MPLS stateRobert Shearman
Add per-device MPLS state to supported interfaces. Use the presence of this state in mpls_route_add to determine that this is a supported interface. Use the presence of mpls_dev to drop packets that arrived on an unsupported interface - previously they were allowed through. Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Robert Shearman <rshearma@brocade.com> Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>