summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2013-11-13Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-nextLinus Torvalds
Pull networking updates from David Miller: 1) The addition of nftables. No longer will we need protocol aware firewall filtering modules, it can all live in userspace. At the core of nftables is a, for lack of a better term, virtual machine that executes byte codes to inspect packet or metadata (arriving interface index, etc.) and make verdict decisions. Besides support for loading packet contents and comparing them, the interpreter supports lookups in various datastructures as fundamental operations. For example sets are supports, and therefore one could create a set of whitelist IP address entries which have ACCEPT verdicts attached to them, and use the appropriate byte codes to do such lookups. Since the interpreted code is composed in userspace, userspace can do things like optimize things before giving it to the kernel. Another major improvement is the capability of atomically updating portions of the ruleset. In the existing netfilter implementation, one has to update the entire rule set in order to make a change and this is very expensive. Userspace tools exist to create nftables rules using existing netfilter rule sets, but both kernel implementations will need to co-exist for quite some time as we transition from the old to the new stuff. Kudos to Patrick McHardy, Pablo Neira Ayuso, and others who have worked so hard on this. 2) Daniel Borkmann and Hannes Frederic Sowa made several improvements to our pseudo-random number generator, mostly used for things like UDP port randomization and netfitler, amongst other things. In particular the taus88 generater is updated to taus113, and test cases are added. 3) Support 64-bit rates in HTB and TBF schedulers, from Eric Dumazet and Yang Yingliang. 4) Add support for new 577xx tigon3 chips to tg3 driver, from Nithin Sujir. 5) Fix two fatal flaws in TCP dynamic right sizing, from Eric Dumazet, Neal Cardwell, and Yuchung Cheng. 6) Allow IP_TOS and IP_TTL to be specified in sendmsg() ancillary control message data, much like other socket option attributes. From Francesco Fusco. 7) Allow applications to specify a cap on the rate computed automatically by the kernel for pacing flows, via a new SO_MAX_PACING_RATE socket option. From Eric Dumazet. 8) Make the initial autotuned send buffer sizing in TCP more closely reflect actual needs, from Eric Dumazet. 9) Currently early socket demux only happens for TCP sockets, but we can do it for connected UDP sockets too. Implementation from Shawn Bohrer. 10) Refactor inet socket demux with the goal of improving hash demux performance for listening sockets. With the main goals being able to use RCU lookups on even request sockets, and eliminating the listening lock contention. From Eric Dumazet. 11) The bonding layer has many demuxes in it's fast path, and an RCU conversion was started back in 3.11, several changes here extend the RCU usage to even more locations. From Ding Tianhong and Wang Yufen, based upon suggestions by Nikolay Aleksandrov and Veaceslav Falico. 12) Allow stackability of segmentation offloads to, in particular, allow segmentation offloading over tunnels. From Eric Dumazet. 13) Significantly improve the handling of secret keys we input into the various hash functions in the inet hashtables, TCP fast open, as well as syncookies. From Hannes Frederic Sowa. The key fundamental operation is "net_get_random_once()" which uses static keys. Hannes even extended this to ipv4/ipv6 fragmentation handling and our generic flow dissector. 14) The generic driver layer takes care now to set the driver data to NULL on device removal, so it's no longer necessary for drivers to explicitly set it to NULL any more. Many drivers have been cleaned up in this way, from Jingoo Han. 15) Add a BPF based packet scheduler classifier, from Daniel Borkmann. 16) Improve CRC32 interfaces and generic SKB checksum iterators so that SCTP's checksumming can more cleanly be handled. Also from Daniel Borkmann. 17) Add a new PMTU discovery mode, IP_PMTUDISC_INTERFACE, which forces using the interface MTU value. This helps avoid PMTU attacks, particularly on DNS servers. From Hannes Frederic Sowa. 18) Use generic XPS for transmit queue steering rather than internal (re-)implementation in virtio-net. From Jason Wang. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1622 commits) random32: add test cases for taus113 implementation random32: upgrade taus88 generator to taus113 from errata paper random32: move rnd_state to linux/random.h random32: add prandom_reseed_late() and call when nonblocking pool becomes initialized random32: add periodic reseeding random32: fix off-by-one in seeding requirement PHY: Add RTL8201CP phy_driver to realtek xtsonic: add missing platform_set_drvdata() in xtsonic_probe() macmace: add missing platform_set_drvdata() in mace_probe() ethernet/arc/arc_emac: add missing platform_set_drvdata() in arc_emac_probe() ipv6: protect for_each_sk_fl_rcu in mem_check with rcu_read_lock_bh vlan: Implement vlan_dev_get_egress_qos_mask as an inline. ixgbe: add warning when max_vfs is out of range. igb: Update link modes display in ethtool netfilter: push reasm skb through instead of original frag skbs ip6_output: fragment outgoing reassembled skb properly MAINTAINERS: mv643xx_eth: take over maintainership from Lennart net_sched: tbf: support of 64bit rates ixgbe: deleting dfwd stations out of order can cause null ptr deref ixgbe: fix build err, num_rx_queues is only available with CONFIG_RPS ...
2013-11-13Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs updates from Al Viro: "All kinds of stuff this time around; some more notable parts: - RCU'd vfsmounts handling - new primitives for coredump handling - files_lock is gone - Bruce's delegations handling series - exportfs fixes plus misc stuff all over the place" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (101 commits) ecryptfs: ->f_op is never NULL locks: break delegations on any attribute modification locks: break delegations on link locks: break delegations on rename locks: helper functions for delegation breaking locks: break delegations on unlink namei: minor vfs_unlink cleanup locks: implement delegations locks: introduce new FL_DELEG lock flag vfs: take i_mutex on renamed file vfs: rename I_MUTEX_QUOTA now that it's not used for quotas vfs: don't use PARENT/CHILD lock classes for non-directories vfs: pull ext4's double-i_mutex-locking into common code exportfs: fix quadratic behavior in filehandle lookup exportfs: better variable name exportfs: move most of reconnect_path to helper function exportfs: eliminate unused "noprogress" counter exportfs: stop retrying once we race with rename/remove exportfs: clear DISCONNECTED on all parents sooner exportfs: more detailed comment for path_reconnect ...
2013-11-12Merge branch 'sched-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler changes from Ingo Molnar: "The main changes in this cycle are: - (much) improved CONFIG_NUMA_BALANCING support from Mel Gorman, Rik van Riel, Peter Zijlstra et al. Yay! - optimize preemption counter handling: merge the NEED_RESCHED flag into the preempt_count variable, by Peter Zijlstra. - wait.h fixes and code reorganization from Peter Zijlstra - cfs_bandwidth fixes from Ben Segall - SMP load-balancer cleanups from Peter Zijstra - idle balancer improvements from Jason Low - other fixes and cleanups" * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (129 commits) ftrace, sched: Add TRACE_FLAG_PREEMPT_RESCHED stop_machine: Fix race between stop_two_cpus() and stop_cpus() sched: Remove unnecessary iteration over sched domains to update nr_busy_cpus sched: Fix asymmetric scheduling for POWER7 sched: Move completion code from core.c to completion.c sched: Move wait code from core.c to wait.c sched: Move wait.c into kernel/sched/ sched/wait: Fix __wait_event_interruptible_lock_irq_timeout() sched: Avoid throttle_cfs_rq() racing with period_timer stopping sched: Guarantee new group-entities always have weight sched: Fix hrtimer_cancel()/rq->lock deadlock sched: Fix cfs_bandwidth misuse of hrtimer_expires_remaining sched: Fix race on toggling cfs_bandwidth_used sched: Remove extra put_online_cpus() inside sched_setaffinity() sched/rt: Fix task_tick_rt() comment sched/wait: Fix build breakage sched/wait: Introduce prepare_to_wait_event() sched/wait: Add ___wait_cond_timeout() to wait_event*_timeout() too sched: Remove get_online_cpus() usage sched: Fix race in migrate_swap_stop() ...
2013-11-11ipv6: protect for_each_sk_fl_rcu in mem_check with rcu_read_lock_bhHannes Frederic Sowa
Fixes a suspicious rcu derference warning. Cc: Florent Fourcot <florent.fourcot@enst-bretagne.fr> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-11vlan: Implement vlan_dev_get_egress_qos_mask as an inline.David S. Miller
This is to avoid very silly Kconfig dependencies for modules using this routine. Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-11netfilter: push reasm skb through instead of original frag skbsJiri Pirko
Pushing original fragments through causes several problems. For example for matching, frags may not be matched correctly. Take following example: <example> On HOSTA do: ip6tables -I INPUT -p icmpv6 -j DROP ip6tables -I INPUT -p icmpv6 -m icmp6 --icmpv6-type 128 -j ACCEPT and on HOSTB you do: ping6 HOSTA -s2000 (MTU is 1500) Incoming echo requests will be filtered out on HOSTA. This issue does not occur with smaller packets than MTU (where fragmentation does not happen) </example> As was discussed previously, the only correct solution seems to be to use reassembled skb instead of separete frags. Doing this has positive side effects in reducing sk_buff by one pointer (nfct_reasm) and also the reams dances in ipvs and conntrack can be removed. Future plan is to remove net/ipv6/netfilter/nf_conntrack_reasm.c entirely and use code in net/ipv6/reassembly.c instead. Signed-off-by: Jiri Pirko <jiri@resnulli.us> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Marcelo Ricardo Leitner <mleitner@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-11ip6_output: fragment outgoing reassembled skb properlyJiri Pirko
If reassembled packet would fit into outdev MTU, it is not fragmented according the original frag size and it is send as single big packet. The second case is if skb is gso. In that case fragmentation does not happen according to the original frag size. This patch fixes these. Signed-off-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-09net_sched: tbf: support of 64bit ratesYang Yingliang
With psched_ratecfg_precompute(), tbf can deal with 64bit rates. Add two new attributes so that tc can use them to break the 32bit limit. Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Suggested-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-08ipv6: use rt6_get_dflt_router to get default router in rt6_route_rcvDuan Jiong
As the rfc 4191 said, the Router Preference and Lifetime values in a ::/0 Route Information Option should override the preference and lifetime values in the Router Advertisement header. But when the kernel deals with a ::/0 Route Information Option, the rt6_get_route_info() always return NULL, that means that overriding will not happen, because those default routers were added without flag RTF_ROUTEINFO in rt6_add_dflt_router(). In order to deal with that condition, we should call rt6_get_dflt_router when the prefix length is 0. Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-08nfnetlink: do not ack malformed messagesJiri Benc
Commit 0628b123c96d ("netfilter: nfnetlink: add batch support and use it from nf_tables") introduced a bug leading to various crashes in netlink_ack when netlink message with invalid nlmsg_len was sent by an unprivileged user. Signed-off-by: Jiri Benc <jbenc@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-08net: Fix "ip rule delete table 256"Andreas Henriksson
When trying to delete a table >= 256 using iproute2 the local table will be deleted. The table id is specified as a netlink attribute when it needs more then 8 bits and iproute2 then sets the table field to RT_TABLE_UNSPEC (0). Preconditions to matching the table id in the rule delete code doesn't seem to take the "table id in netlink attribute" into condition so the frh_get_table helper function never gets to do its job when matching against current rule. Use the helper function twice instead of peaking at the table value directly. Originally reported at: http://bugs.debian.org/724783 Reported-by: Nicolas HICHER <nhicher@avencall.com> Signed-off-by: Andreas Henriksson <andreas@fatal.se> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-08ipv6: protect flow label renew against GCFlorent Fourcot
Take ip6_fl_lock before to read and update a label. v2: protect only the relevant code Reported-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: Florent Fourcot <florent.fourcot@enst-bretagne.fr> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-08ipv6: increase maximum lifetime of flow labelsFlorent Fourcot
If the last RFC 6437 does not give any constraints for lifetime of flow labels, the previous RFC 3697 spoke of a minimum of 120 seconds between reattribution of a flow label. The maximum linger is currently set to 60 seconds and does not allow this configuration without CAP_NET_ADMIN right. This patch increase the maximum linger to 150 seconds, allowing more flexibility to standard users. Signed-off-by: Florent Fourcot <florent.fourcot@enst-bretagne.fr> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-08ipv6: enable IPV6_FLOWLABEL_MGR for getsockoptFlorent Fourcot
It is already possible to set/put/renew a label with IPV6_FLOWLABEL_MGR and setsockopt. This patch add the possibility to get information about this label (current value, time before expiration, etc). It helps application to take decision for a renew or a release of the label. v2: * Add spin_lock to prevent race condition * return -ENOENT if no result found * check if flr_action is GET v3: * move the spin_lock to protect only the relevant code Signed-off-by: Florent Fourcot <florent.fourcot@enst-bretagne.fr> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-08net: flow_dissector: small optimizations in IPv4 dissectEric Dumazet
By moving code around, we avoid : 1) A reload of iph->ihl (bit field, so needs a mask) 2) A conditional test (replaced by a conditional mov on x86) Fast path loads iph->protocol anyway. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-08Merge branch 'master' of ↵John W. Linville
git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-next into for-davem
2013-11-08inet: fix a UFO regressionEric Dumazet
While testing virtio_net and skb_segment() changes, Hannes reported that UFO was sending wrong frames. It appears this was introduced by a recent commit : 8c3a897bfab1 ("inet: restore gso for vxlan") The old condition to perform IP frag was : tunnel = !!skb->encapsulation; ... if (!tunnel && proto == IPPROTO_UDP) { So the new one should be : udpfrag = !skb->encapsulation && proto == IPPROTO_UDP; ... if (udpfrag) { Initialization of udpfrag must be done before call to ops->callbacks.gso_segment(skb, features), as skb_udp_tunnel_segment() clears skb->encapsulation (We want udpfrag to be true for UFO, false for VXLAN) With help from Alexei Starovoitov Reported-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-07net: skbuff - kernel-doc fixesMathias Krause
Use "@" to refer to parameters in the kernel-doc description. According to Documentation/kernel-doc-nano-HOWTO.txt "&" shall be used to refer to structures only. Signed-off-by: Mathias Krause <mathias.krause@secunet.com> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-07caif: use pskb_put() instead of reimplementing its functionalityMathias Krause
Also remove the warning for fragmented packets -- skb_cow_data() will linearize the buffer, removing all fragments. Signed-off-by: Mathias Krause <mathias.krause@secunet.com> Cc: Dmitry Tarnyagin <dmitry.tarnyagin@lockless.no> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-07net: move pskb_put() to core codeMathias Krause
This function has usage beside IPsec so move it to the core skbuff code. While doing so, give it some documentation and change its return type to 'unsigned char *' to be in line with skb_put(). Signed-off-by: Mathias Krause <mathias.krause@secunet.com> Cc: Steffen Klassert <steffen.klassert@secunet.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-07net: Add layer 2 hardware acceleration operations for macvlan devicesJohn Fastabend
Add a operations structure that allows a network interface to export the fact that it supports package forwarding in hardware between physical interfaces and other mac layer devices assigned to it (such as macvlans). This operaions structure can be used by virtual mac devices to bypass software switching so that forwarding can be done in hardware more efficiently. Signed-off-by: John Fastabend <john.r.fastabend@intel.com> Signed-off-by: Neil Horman <nhorman@tuxdriver.com> CC: Andy Gospodarek <andy@greyhouse.net> CC: "David S. Miller" <davem@davemloft.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-076lowpan: release device on error pathDan Carpenter
We recently added a new error path and it needs a dev_put(). Fixes: 7adac1ec8198 ('6lowpan: Only make 6lowpan links to IEEE802154 devices') Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-07net/vlan: Provide read access to the vlan egress mapEyal Perry
Provide a method for read-only access to the vlan device egress mapping. Do this by refactoring vlan_dev_get_egress_qos_mask() such that now it receives as an argument the skb priority instead of pointer to the skb. Such an access is needed for the IBoE stack where the control plane goes through the network stack. This is an add-on step on top of commit d4a968658c "net/route: export symbol ip_tos2prio" which allowed the RDMA-CM to use ip_tos2prio. Signed-off-by: Eyal Perry <eyalpe@mellanox.com> Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-07tipc: reassembly failures should cause link resetErik Hugne
If appending a received fragment to the pending fragment chain in a unicast link fails, the current code tries to force a retransmission of the fragment by decrementing the 'next received sequence number' field in the link. This is done under the assumption that the failure is caused by an out-of-memory situation, an assumption that does not hold true after the previous patch in this series. A failure to append a fragment can now only be caused by a protocol violation by the sending peer, and it must hence be assumed that it is either malicious or buggy. Either way, the correct behavior is now to reset the link instead of trying to revert its sequence number. So, this is what we do in this commit. Signed-off-by: Erik Hugne <erik.hugne@ericsson.com> Reviewed-by: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-07tipc: message reassembly using fragment chainErik Hugne
When the first fragment of a long data data message is received on a link, a reassembly buffer large enough to hold the data from this and all subsequent fragments of the message is allocated. The payload of each new fragment is copied into this buffer upon arrival. When the last fragment is received, the reassembled message is delivered upwards to the port/socket layer. Not only is this an inefficient approach, but it may also cause bursts of reassembly failures in low memory situations. since we may fail to allocate the necessary large buffer in the first place. Furthermore, after 100 subsequent such failures the link will be reset, something that in reality aggravates the situation. To remedy this problem, this patch introduces a different approach. Instead of allocating a big reassembly buffer, we now append the arriving fragments to a reassembly chain on the link, and deliver the whole chain up to the socket layer once the last fragment has been received. This is safe because the retransmission layer of a TIPC link always delivers packets in strict uninterrupted order, to the reassembly layer as to all other upper layers. Hence there can never be more than one fragment chain pending reassembly at any given time in a link, and we can trust (but still verify) that the fragments will be chained up in the correct order. Signed-off-by: Erik Hugne <erik.hugne@ericsson.com> Reviewed-by: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-07tipc: don't reroute message fragmentsErik Hugne
When a message fragment is received in a broadcast or unicast link, the reception code will append the fragment payload to a big reassembly buffer through a call to the function tipc_recv_fragm(). However, after the return of that call, the logics goes on and passes the fragment buffer to the function tipc_net_route_msg(), which will simply drop it. This behavior is a remnant from the now obsolete multi-cluster functionality, and has no relevance in the current code base. Although currently harmless, this unnecessary call would be fatal after applying the next patch in this series, which introduces a completely new reassembly algorithm. So we change the code to eliminate the redundant call. Signed-off-by: Erik Hugne <erik.hugne@ericsson.com> Reviewed-by: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-08Merge tag 'nfs-for-3.13-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds
Pull NFS client updates from Trond Myklebust: "Highlights include: - Changes to the RPC socket code to allow NFSv4 to turn off timeout+retry: * Detect TCP connection breakage through the "keepalive" mechanism - Add client side support for NFSv4.x migration (Chuck Lever) - Add support for multiple security flavour arguments to the "sec=" mount option (Dros Adamson) - fs-cache bugfixes from David Howells: * Fix an issue whereby caching can be enabled on a file that is open for writing - More NFSv4 open code stable bugfixes - Various Labeled NFS (selinux) bugfixes, including one stable fix - Fix buffer overflow checking in the RPCSEC_GSS upcall encoding" * tag 'nfs-for-3.13-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (68 commits) NFSv4.2: Remove redundant checks in nfs_setsecurity+nfs4_label_init_security NFSv4: Sanity check the server reply in _nfs4_server_capabilities NFSv4.2: encode_readdir - only ask for labels when doing readdirplus nfs: set security label when revalidating inode NFSv4.2: Fix a mismatch between Linux labeled NFS and the NFSv4.2 spec NFS: Fix a missing initialisation when reading the SELinux label nfs: fix oops when trying to set SELinux label nfs: fix inverted test for delegation in nfs4_reclaim_open_state SUNRPC: Cleanup xs_destroy() SUNRPC: close a rare race in xs_tcp_setup_socket. SUNRPC: remove duplicated include from clnt.c nfs: use IS_ROOT not DCACHE_DISCONNECTED SUNRPC: Fix buffer overflow checking in gss_encode_v0_msg/gss_encode_v1_msg SUNRPC: gss_alloc_msg - choose _either_ a v0 message or a v1 message SUNRPC: remove an unnecessary if statement nfs: Use PTR_ERR_OR_ZERO in 'nfs/nfs4super.c' nfs: Use PTR_ERR_OR_ZERO in 'nfs41_callback_up' function nfs: Remove useless 'error' assignment sunrpc: comment typo fix SUNRPC: Add correct rcu_dereference annotation in rpc_clnt_set_transport ...
2013-11-07Merge tag 'driver-core-3.13-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core / sysfs patches from Greg KH: "Here's the big driver core / sysfs update for 3.13-rc1. There's lots of dev_groups updates for different subsystems, as they all get slowly migrated over to the safe versions of the attribute groups (removing userspace races with the creation of the sysfs files.) Also in here are some kobject updates, devres expansions, and the first round of Tejun's sysfs reworking to enable it to be used by other subsystems as a backend for an in-kernel filesystem. All of these have been in linux-next for a while with no reported issues" * tag 'driver-core-3.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (83 commits) sysfs: rename sysfs_assoc_lock and explain what it's about sysfs: use generic_file_llseek() for sysfs_file_operations sysfs: return correct error code on unimplemented mmap() mdio_bus: convert bus code to use dev_groups device: Make dev_WARN/dev_WARN_ONCE print device as well as driver name sysfs: separate out dup filename warning into a separate function sysfs: move sysfs_hash_and_remove() to fs/sysfs/dir.c sysfs: remove unused sysfs_get_dentry() prototype sysfs: honor bin_attr.attr.ignore_lockdep sysfs: merge sysfs_elem_bin_attr into sysfs_elem_attr devres: restore zeroing behavior of devres_alloc() sysfs: fix sysfs_write_file for bin file input: gameport: convert bus code to use dev_groups input: serio: remove bus usage of dev_attrs input: serio: use DEVICE_ATTR_RO() i2o: convert bus code to use dev_groups memstick: convert bus code to use dev_groups tifm: convert bus code to use dev_groups virtio: convert bus code to use dev_groups ipack: convert bus code to use dev_groups ...
2013-11-05ipv6: drop the judgement in rt6_alloc_cow()Duan Jiong
Now rt6_alloc_cow() is only called by ip6_pol_route() when rt->rt6i_flags doesn't contain both RTF_NONEXTHOP and RTF_GATEWAY, and rt->rt6i_flags hasn't been changed in ip6_rt_copy(). So there is no neccessary to judge whether rt->rt6i_flags contains RTF_GATEWAY or not. Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-05ipv6: fix headroom calculation in udp6_ufo_fragmentHannes Frederic Sowa
Commit 1e2bd517c108816220f262d7954b697af03b5f9c ("udp6: Fix udp fragmentation for tunnel traffic.") changed the calculation if there is enough space to include a fragment header in the skb from a skb->mac_header dervived one to skb_headroom. Because we already peeled off the skb to transport_header this is wrong. Change this back to check if we have enough room before the mac_header. This fixes a panic Saran Neti reported. He used the tbf scheduler which skb_gso_segments the skb. The offsets get negative and we panic in memcpy because the skb was erroneously not expanded at the head. Reported-by: Saran Neti <Saran.Neti@telus.com> Cc: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-05ipv4: introduce new IP_MTU_DISCOVER mode IP_PMTUDISC_INTERFACEHannes Frederic Sowa
Sockets marked with IP_PMTUDISC_INTERFACE won't do path mtu discovery, their sockets won't accept and install new path mtu information and they will always use the interface mtu for outgoing packets. It is guaranteed that the packet is not fragmented locally. But we won't set the DF-Flag on the outgoing frames. Florian Weimer had the idea to use this flag to ensure DNS servers are never generating outgoing fragments. They may well be fragmented on the path, but the server never stores or usees path mtu values, which could well be forged in an attack. (The root of the problem with path MTU discovery is that there is no reliable way to authenticate ICMP Fragmentation Needed But DF Set messages because they are sent from intermediate routers with their source addresses, and the IMCP payload will not always contain sufficient information to identify a flow.) Recent research in the DNS community showed that it is possible to implement an attack where DNS cache poisoning is feasible by spoofing fragments. This work was done by Amir Herzberg and Haya Shulman: <https://sites.google.com/site/hayashulman/files/fragmentation-poisoning.pdf> This issue was previously discussed among the DNS community, e.g. <http://www.ietf.org/mail-archive/web/dnsext/current/msg01204.html>, without leading to fixes. This patch depends on the patch "ipv4: fix DO and PROBE pmtu mode regarding local fragmentation with UFO/CORK" for the enforcement of the non-fragmentable checks. If other users than ip_append_page/data should use this semantic too, we have to add a new flag to IPCB(skb)->flags to suppress local fragmentation and check for this in ip_finish_output. Many thanks to Florian Weimer for the idea and feedback while implementing this patch. Cc: David S. Miller <davem@davemloft.net> Suggested-by: Florian Weimer <fweimer@redhat.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-05Merge branch 'for-upstream' of ↵John W. Linville
git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next
2013-11-05Merge branch 'for-john' of ↵John W. Linville
git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211
2013-11-05Merge branch 'for-john' of ↵John W. Linville
git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next Conflicts: net/wireless/reg.c
2013-11-05ipv6: remove old conditions on flow label sharingFlorent Fourcot
The code of flow label in Linux Kernel follows the rules of RFC 1809 (an informational one) for conditions on flow label sharing. There rules are not in the last proposed standard for flow label (RFC 6437), or in the previous one (RFC 3697). Since this code does not follow any current or old standard, we can remove it. With this removal, the ipv6_opt_cmp function is now a dead code and it can be removed too. Changelog to v1: * add justification for the change * remove the condition on IPv6 options [ Remove ipv6_hdr_cmp and it is now unused as well. -DaveM ] Signed-off-by: Florent Fourcot <florent.fourcot@enst-bretagne.fr> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-05Merge branch 'for-davem' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-next John W. Linville says: ==================== Please accept the following pull request intended for the 3.13 tree... I had intended to pass most of these to you as much as two weeks ago. Unfortunately, I failed to account for the effects of bad Internet connections and my own fatique/laziness while traveling. On the bright side, at least these have been baking in linux-next for some time! For the mac80211 bits, Johannes says: "This time I have two fixes for P2P (which requires not using CCK rates) and a workaround for APs with broken WMM information." For the iwlwifi bits, Johannes says: "I have a few fixes for warnings/issues: one from Alex, fixing scan timings, one from Emmanuel fixing a WARN_ON in the DVM driver, one from Stanislaw removing a trigger-happy WARN_ON in the MVM driver and a change from myself to try to recover when the device isn't processing commands quickly." And: "For this round, I have a lot of changes: * power management improvements * BT coexistence improvements/updates * new device support * VHT support * IBSS support (though due to a small bug it requires new firmware) * various other fixes/improvements." For the Bluetooth bits, Gustavo says: "More patches for 3.12, busy times for Bluetooth. More than a 100 commits since the last pull. The bulk of work comes from Johan and Marcel, they are doing fixes and improvements all over the Bluetooth subsystem, as the diffstat can show." For the ath10k and ath6kl bits, Kalle says: "Bartosz added support to ath10k for our 10.x AP firmware branch, which gives us AP specific features and fixes. We still support the main firmware branch as well just like before, ath10k detects runtime what firmware is used. Unfortunately the firmware interface in 10.x branch is somewhat different so there was quite a lot of changes in ath10k for this. Michal and Sujith did some performance improvements in ath10k. Vladimir fixed a compiler warning and Fengguang removed an extra semicolon." For the NFC bits, Samuel says: "It's a fairly big one, with the following highlights: - NFC digital layer implementation: Most NFC chipsets implement the NFC digital layer in firmware, but others have more basic functionalities and expect the host to implement the digital layer. This layer sits below the NFC core. - Sony's port100 support: This is "soft" NFC USB dongle that expects the digital layer to be implemented on the host. This is the first user of our NFC digital stack implementation. - Secure element API: We now provide a netlink API for enabling, disabling and discovering NFC attached (embedded or UICC ones) secure elements. With some userspace help, this allows us to support NFC payments. Only the pn544 driver currently supports that API. - NCI SPI fixes and improvements: In order to support NCI devices over SPI, we fixed and improved our NCI/SPI implementation. The currently most deployed NFC NCI chipset, Broadcom's bcm2079x, supports that mode and we're planning to use our NCI/SPI framework to implement a driver for it. - pn533 fragmentation support in target mode: This was the only missing feature from our pn533 impementation. We now support fragmentation in both Tx and Rx modes, in target mode." On top of all that, brcmfmac and rt2x00 both get the usual flurry of updates. A few other drivers get hit here or there as well. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-04net: introduce skb_coalesce_rx_frag()Jason Wang
Sometimes we need to coalesce the rx frags to avoid frag list. One example is virtio-net driver which tries to use small frags for both MTU sized packet and GSO packet. So this patch introduce skb_coalesce_rx_frag() to do this. Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Michael Dalton <mwdalton@google.com> Cc: Eric Dumazet <edumazet@google.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-04tcp: properly handle stretch acks in slow startYuchung Cheng
Slow start now increases cwnd by 1 if an ACK acknowledges some packets, regardless the number of packets. Consequently slow start performance is highly dependent on the degree of the stretch ACKs caused by receiver or network ACK compression mechanisms (e.g., delayed-ACK, GRO, etc). But slow start algorithm is to send twice the amount of packets of packets left so it should process a stretch ACK of degree N as if N ACKs of degree 1, then exits when cwnd exceeds ssthresh. A follow up patch will use the remainder of the N (if greater than 1) to adjust cwnd in the congestion avoidance phase. In addition this patch retires the experimental limited slow start (LSS) feature. LSS has multiple drawbacks but questionable benefit. The fractional cwnd increase in LSS requires a loop in slow start even though it's rarely used. Configuring such an increase step via a global sysctl on different BDPS seems hard. Finally and most importantly the slow start overshoot concern is now better covered by the Hybrid slow start (hystart) enabled by default. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-04tcp: enable sockets to use MSG_FASTOPEN by defaultYuchung Cheng
Applications have started to use Fast Open (e.g., Chrome browser has such an optional flag) and the feature has gone through several generations of kernels since 3.7 with many real network tests. It's time to enable this flag by default for applications to test more conveniently and extensively. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-04Merge branch 'master' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nftables Pablo Neira Ayuso says: ==================== This batch contains fives nf_tables patches for your net-next tree, they are: * Fix possible use after free in the module removal path of the x_tables compatibility layer, from Dan Carpenter. * Add filter chain type for the bridge family, from myself. * Fix Kconfig dependencies of the nf_tables bridge family with the core, from myself. * Fix sparse warnings in nft_nat, from Tomasz Bursztyka. * Remove duplicated include in the IPv4 family support for nf_tables, from Wei Yongjun. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-04Merge branch 'master' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next Pablo Neira Ayuso says: ==================== This is another batch containing Netfilter/IPVS updates for your net-next tree, they are: * Six patches to make the ipt_CLUSTERIP target support netnamespace, from Gao feng. * Two cleanups for the nf_conntrack_acct infrastructure, introducing a new structure to encapsulate conntrack counters, from Holger Eitzenberger. * Fix missing verdict in SCTP support for IPVS, from Daniel Borkmann. * Skip checksum recalculation in SCTP support for IPVS, also from Daniel Borkmann. * Fix behavioural change in xt_socket after IP early demux, from Florian Westphal. * Fix bogus large memory allocation in the bitmap port set type in ipset, from Jozsef Kadlecsik. * Fix possible compilation issues in the hash netnet set type in ipset, also from Jozsef Kadlecsik. * Define constants to identify netlink callback data in ipset dumps, again from Jozsef Kadlecsik. * Use sock_gen_put() in xt_socket to replace xt_socket_put_sk, from Eric Dumazet. * Improvements for the SH scheduler in IPVS, from Alexander Frolkin. * Remove extra delay due to unneeded rcu barrier in IPVS net namespace cleanup path, from Julian Anastasov. * Save some cycles in ip6t_REJECT by skipping checksum validation in packets leaving from our stack, from Stanislav Fomichev. * Fix IPVS_CMD_ATTR_MAX definition in IPVS, larger that required, from Julian Anastasov. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-04netfilter: nft_compat: use _safe version of list_for_eachDan Carpenter
We need to use the _safe version of list_for_each_entry() here otherwise we have a use after free bug. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-11-04Merge branch 'master' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/jesse/openvswitch Jesse Gross says: ==================== Open vSwitch A set of updates for net-next/3.13. Major changes are: * Restructure flow handling code to be more logically organized and easier to read. * Rehashing of the flow table is moved from a workqueue to flow installation time. Before, heavy load could block the workqueue for excessive periods of time. * Additional debugging information is provided to help diagnose megaflows. * It's now possible to match on TCP flags. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-04net: checksum: fix warning in skb_checksumDaniel Borkmann
This patch fixes a build warning in skb_checksum() by wrapping the csum_partial() usage in skb_checksum(). The problem is that on a few architectures, csum_partial is used with prefix asmlinkage whereas on most architectures it's not. So fix this up generically as we did with csum_block_add_ext() to match the signature. Introduced by 2817a336d4d ("net: skb_checksum: allow custom update/combine for walking skb"). Reported-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-04Merge branch 'master' of ↵John W. Linville
git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-next into for-davem Conflicts: drivers/net/wireless/brcm80211/brcmfmac/sdio_host.h
2013-11-04Merge branch 'master' of ↵John W. Linville
git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless Conflicts: drivers/net/wireless/iwlwifi/pcie/drv.c
2013-11-04Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Conflicts: drivers/net/ethernet/emulex/benet/be.h drivers/net/netconsole.c net/bridge/br_private.h Three mostly trivial conflicts. The net/bridge/br_private.h conflict was a function signature (argument addition) change overlapping with the extern removals from Joe Perches. In drivers/net/netconsole.c we had one change adjusting a printk message whilst another changed "printk(KERN_INFO" into "pr_info(". Lastly, the emulex change was a new inline function addition overlapping with Joe Perches's extern removals. Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-04net: sctp: do not trigger BUG_ON in sctp_cmd_delete_tcbDaniel Borkmann
Introduced in f9e42b853523 ("net: sctp: sideeffect: throw BUG if primary_path is NULL"), we intended to find a buggy assoc that's part of the assoc hash table with a primary_path that is NULL. However, we better remove the BUG_ON for now and find a more suitable place to assert for these things as Mark reports that this also triggers the bug when duplication cookie processing happens, and the assoc is not part of the hash table (so all good in this case). Such a situation can for example easily be reproduced by: tc qdisc add dev eth0 root handle 1: prio bands 2 priomap 1 1 1 1 1 1 tc qdisc add dev eth0 parent 1:2 handle 20: netem loss 20% tc filter add dev eth0 protocol ip parent 1: prio 2 u32 match ip \ protocol 132 0xff match u8 0x0b 0xff at 32 flowid 1:2 This drops 20% of COOKIE-ACK packets. After some follow-up discussion with Vlad we came to the conclusion that for now we should still better remove this BUG_ON() assertion, and come up with two follow-ups later on, that is, i) find a more suitable place for this assertion, and possibly ii) have a special allocator/initializer for such kind of temporary assocs. Reported-by: Mark Thomas <Mark.Thomas@metaswitch.com> Signed-off-by: Vlad Yasevich <vyasevich@gmail.com> Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-03net/hsr: Add support for the High-availability Seamless Redundancy protocol ↵Arvid Brodin
(HSRv0) High-availability Seamless Redundancy ("HSR") provides instant failover redundancy for Ethernet networks. It requires a special network topology where all nodes are connected in a ring (each node having two physical network interfaces). It is suited for applications that demand high availability and very short reaction time. HSR acts on the Ethernet layer, using a registered Ethernet protocol type to send special HSR frames in both directions over the ring. The driver creates virtual network interfaces that can be used just like any ordinary Linux network interface, for IP/TCP/UDP traffic etc. All nodes in the network ring must be HSR capable. This code is a "best effort" to comply with the HSR standard as described in IEC 62439-3:2010 (HSRv0). Signed-off-by: Arvid Brodin <arvid.brodin@xdin.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-03net: extend net_device allocation to vmalloc()Eric Dumazet
Joby Poriyath provided a xen-netback patch to reduce the size of xenvif structure as some netdev allocation could fail under memory pressure/fragmentation. This patch is handling the problem at the core level, allowing any netdev structures to use vmalloc() if kmalloc() failed. As vmalloc() adds overhead on a critical network path, add __GFP_REPEAT to kzalloc() flags to do this fallback only when really needed. Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Joby Poriyath <joby.poriyath@citrix.com> Cc: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>