summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2013-10-25ipv6: ip6_dst_check needs to check for expired dst_entriesHannes Frederic Sowa
On receiving a packet too big icmp error we check if our current cached dst_entry in the socket is still valid. This validation check did not care about the expiration of the (cached) route. The error path I traced down: The socket receives a packet too big mtu notification. It still has a valid dst_entry and thus issues the ip6_rt_pmtu_update on this dst_entry, setting RTF_EXPIRE and updates the dst.expiration value (which could fail because of not up-to-date expiration values, see previous patch). In some seldom cases we race with a) the ip6_fib gc or b) another routing lookup which would result in a recreation of the cached rt6_info from its parent non-cached rt6_info. While copying the rt6_info we reinitialize the metrics store by copying it over from the parent thus invalidating the just installed pmtu update (both dsts use the same key to the inetpeer storage). The dst_entry with the just invalidated metrics data would just get its RTF_EXPIRES flag cleared and would continue to stay valid for the socket. We should have not issued the pmtu update on the already expired dst_entry in the first placed. By checking the expiration on the dst entry and doing a relookup in case it is out of date we close the race because we would install a new rt6_info into the fib before we issue the pmtu update, thus closing this race. Not reliably updating the dst.expire value was fixed by the patch "ipv6: reset dst.expires value when clearing expire flag". Reported-by: Steinar H. Gunderson <sgunderson@bigfoot.com> Reported-by: Valentijn Sessink <valentyn@blub.net> Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Tested-by: Valentijn Sessink <valentyn@blub.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-25netpoll: fix rx_hook() interface by passing the skbAntonio Quartulli
Right now skb->data is passed to rx_hook() even if the skb has not been linearised and without giving rx_hook() a way to linearise it. Change the rx_hook() interface and make it accept the skb and the offset to the UDP payload as arguments. rx_hook() is also renamed to rx_skb_hook() to ensure that out of the tree users notice the API change. In this way any rx_skb_hook() implementation can perform all the needed operations to properly (and safely) access the skb data. Signed-off-by: Antonio Quartulli <antonio@meshcoding.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-25net: fix rtnl notification in atomic contextAlexei Starovoitov
commit 991fb3f74c "dev: always advertise rx_flags changes via netlink" introduced rtnl notification from __dev_set_promiscuity(), which can be called in atomic context. Steps to reproduce: ip tuntap add dev tap1 mode tap ifconfig tap1 up tcpdump -nei tap1 & ip tuntap del dev tap1 mode tap [ 271.627994] device tap1 left promiscuous mode [ 271.639897] BUG: sleeping function called from invalid context at mm/slub.c:940 [ 271.664491] in_atomic(): 1, irqs_disabled(): 0, pid: 3394, name: ip [ 271.677525] INFO: lockdep is turned off. [ 271.690503] CPU: 0 PID: 3394 Comm: ip Tainted: G W 3.12.0-rc3+ #73 [ 271.703996] Hardware name: System manufacturer System Product Name/P8Z77 WS, BIOS 3007 07/26/2012 [ 271.731254] ffffffff81a58506 ffff8807f0d57a58 ffffffff817544e5 ffff88082fa0f428 [ 271.760261] ffff8808071f5f40 ffff8807f0d57a88 ffffffff8108bad1 ffffffff81110ff8 [ 271.790683] 0000000000000010 00000000000000d0 00000000000000d0 ffff8807f0d57af8 [ 271.822332] Call Trace: [ 271.838234] [<ffffffff817544e5>] dump_stack+0x55/0x76 [ 271.854446] [<ffffffff8108bad1>] __might_sleep+0x181/0x240 [ 271.870836] [<ffffffff81110ff8>] ? rcu_irq_exit+0x68/0xb0 [ 271.887076] [<ffffffff811a80be>] kmem_cache_alloc_node+0x4e/0x2a0 [ 271.903368] [<ffffffff810b4ddc>] ? vprintk_emit+0x1dc/0x5a0 [ 271.919716] [<ffffffff81614d67>] ? __alloc_skb+0x57/0x2a0 [ 271.936088] [<ffffffff810b4de0>] ? vprintk_emit+0x1e0/0x5a0 [ 271.952504] [<ffffffff81614d67>] __alloc_skb+0x57/0x2a0 [ 271.968902] [<ffffffff8163a0b2>] rtmsg_ifinfo+0x52/0x100 [ 271.985302] [<ffffffff8162ac6d>] __dev_notify_flags+0xad/0xc0 [ 272.001642] [<ffffffff8162ad0c>] __dev_set_promiscuity+0x8c/0x1c0 [ 272.017917] [<ffffffff81731ea5>] ? packet_notifier+0x5/0x380 [ 272.033961] [<ffffffff8162b109>] dev_set_promiscuity+0x29/0x50 [ 272.049855] [<ffffffff8172e937>] packet_dev_mc+0x87/0xc0 [ 272.065494] [<ffffffff81732052>] packet_notifier+0x1b2/0x380 [ 272.080915] [<ffffffff81731ea5>] ? packet_notifier+0x5/0x380 [ 272.096009] [<ffffffff81761c66>] notifier_call_chain+0x66/0x150 [ 272.110803] [<ffffffff8108503e>] __raw_notifier_call_chain+0xe/0x10 [ 272.125468] [<ffffffff81085056>] raw_notifier_call_chain+0x16/0x20 [ 272.139984] [<ffffffff81620190>] call_netdevice_notifiers_info+0x40/0x70 [ 272.154523] [<ffffffff816201d6>] call_netdevice_notifiers+0x16/0x20 [ 272.168552] [<ffffffff816224c5>] rollback_registered_many+0x145/0x240 [ 272.182263] [<ffffffff81622641>] rollback_registered+0x31/0x40 [ 272.195369] [<ffffffff816229c8>] unregister_netdevice_queue+0x58/0x90 [ 272.208230] [<ffffffff81547ca0>] __tun_detach+0x140/0x340 [ 272.220686] [<ffffffff81547ed6>] tun_chr_close+0x36/0x60 Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-25net: initialize hashrnd in flow_dissector with net_get_random_onceHannes Frederic Sowa
We also can defer the initialization of hashrnd in flow_dissector to its first use. Since net_get_random_once is irq safe now we don't have to audit the call paths if one of this functions get called by an interrupt handler. Cc: David S. Miller <davem@davemloft.net> Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-25net: make net_get_random_once irq safeHannes Frederic Sowa
I initial build non irq safe version of net_get_random_once because I would liked to have the freedom to defer even the extraction process of get_random_bytes until the nonblocking pool is fully seeded. I don't think this is a good idea anymore and thus this patch makes net_get_random_once irq safe. Now someone using net_get_random_once does not need to care from where it is called. Cc: David S. Miller <davem@davemloft.net> Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-25net: add missing dev_put() in __netdev_adjacent_dev_insertNikolay Aleksandrov
I think that a dev_put() is needed in the error path to preserve the proper dev refcount. CC: Veaceslav Falico <vfalico@redhat.com> Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com> Acked-by: Veaceslav Falico <vfalico@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-25netem: markov loss model transition fixHagen Paul Pfeifer
The transition from markov state "3 => lost packets within a burst period" to "1 => successfully transmitted packets within a gap period" has no *additional* loss event. The loss already happen for transition from 1 -> 3, this additional loss will make things go wild. E.g. transition probabilities: p13: 10% p31: 100% Expected: Ploss = p13 / (p13 + p31) Ploss = ~9.09% ... but it isn't. Even worse: we get a double loss - each time. So simple don't return true to indicate loss, rather break and return false. Signed-off-by: Hagen Paul Pfeifer <hagen@jauu.net> Cc: Stephen Hemminger <stephen@networkplumber.org> Cc: Eric Dumazet <edumazet@google.com> Cc: Stefano Salsano <stefano.salsano@uniroma2.it> Cc: Fabio Ludovici <fabio.ludovici@yahoo.it> Signed-off-by: Hagen Paul Pfeifer <hagen@jauu.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-24file->f_op is never NULL...Al Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-10-24sunrpc: switch to %pdAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-10-23Merge tag 'batman-adv-for-davem' of git://git.open-mesh.org/linux-mergeDavid S. Miller
Antonio Quartulli says: ==================== this is another set of changes intended for net-next/linux-3.13. (probably our last pull request for this cycle) Patches 1 and 2 reshape two of our main data structures in a way that they can easily be extended in the future to accommodate new routing protocols. Patches from 3 to 9 improve our routing protocol API and its users so that all the protocol-related code is not mixed up with the other components anymore. Patch 10 limits the local Translation Table maximum size to a value such that it can be fully transfered over the air if needed. This value depends on fragmentation being enabled or not and on the mtu values. Patch 11 makes batman-adv send a uevent in case of soft-interface destruction while a "bat-Gateway" was configured (this informs userspace about the GW not being available anymore). Patches 13 and 14 enable the TT component to detect non-mesh client flag changes at runtime (till now those flags where set upon client detection and were not changed anymore). Patch 16 is a generalisation of our user-to-kernel space communication (and viceversa) used to exchange ICMP packets to send/received to/from the mesh network. Now it can easily accommodate new ICMP packet types without breaking the existing userspace API anymore. Remaining patches are minor changes and cleanups. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-23inet: remove old fragmentation hash initializingHannes Frederic Sowa
All fragmentation hash secrets now get initialized by their corresponding hash function with net_get_random_once. Thus we can eliminate the initial seeding. Also provide a comment that hash secret seeding happens at the first call to the corresponding hashing function. Cc: David S. Miller <davem@davemloft.net> Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-23ipv6: split inet6_hash_frag for netfilter and initialize secrets with ↵Hannes Frederic Sowa
net_get_random_once Defer the fragmentation hash secret initialization for IPv6 like the previous patch did for IPv4. Because the netfilter logic reuses the hash secret we have to split it first. Thus introduce a new nf_hash_frag function which takes care to seed the hash secret. Cc: David S. Miller <davem@davemloft.net> Cc: Eric Dumazet <edumazet@google.com> Cc: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-23ipv4: initialize ip4_frags hash secret as late as possibleHannes Frederic Sowa
Defer the generation of the first hash secret for the ipv4 fragmentation cache as late as possible. ip4_frags.rnd gets initial seeded by inet_frags_init and regulary reseeded by inet_frag_secret_rebuild. Either we call ipqhashfn directly from ip_fragment.c in which case we initialize the secret directly. If we first get called by inet_frag_secret_rebuild we install a new secret by a manual call to get_random_bytes. This secret will be overwritten as soon as the first call to ipqhashfn happens. This is safe because we won't race while publishing the new secrets with anyone else. Cc: Eric Dumazet <edumazet@google.com> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-23net: sctp: fix ASCONF to allow non SCTP_ADDR_SRC addresses in ipv6Daniel Borkmann
Commit 8a07eb0a50 ("sctp: Add ASCONF operation on the single-homed host") implemented possible use of IPv4 addresses with non SCTP_ADDR_SRC state as source address when sending ASCONF (ADD) packets, but IPv6 part for that was not implemented in 8a07eb0a50. Therefore, as this is not restricted to IPv4-only, fix this up to allow the same for IPv6 addresses in SCTP. Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Cc: Michio Honda <micchie@sfc.wide.ad.jp> Acked-by: Michio Honda <micchie@sfc.wide.ad.jp> Acked-by: Vlad Yasevich <vyasevich@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-23Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller
Pablo Neira Ayuso says: ==================== The following patchset contains three netfilter fixes for your net tree, they are: * A couple of fixes to resolve info leak to userspace due to uninitialized memory area in ulogd, from Mathias Krause. * Fix instruction ordering issues that may lead to the access of uninitialized data in x_tables. The problem involves the table update (producer) and the main packet matching (consumer) routines. Detected in SMP ARMv7, from Will Deacon. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-23Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Conflicts: drivers/net/usb/qmi_wwan.c include/net/dst.h Trivial merge conflicts, both were overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-23net: always inline net_secret_initHannes Frederic Sowa
Currently net_secret_init does not get inlined, so we always have a call to net_secret_init even in the fast path. Let's specify net_secret_init as __always_inline so we have the nop in the fast-path without the call to net_secret_init and the unlikely path at the epilogue of the function. jump_labels handle the inlining correctly. Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-23batman-adv: generalize batman-adv icmp packet handlingSimon Wunderlich
Instead of handling icmp packets only up to length of icmp_packet_rr, the code should handle any icmp length size. Therefore the length truncating is moved to when the packet is actually sent to userspace (this does not support lengths longer than icmp_packet_rr yet). Longer packets are forwarded without truncating. This patch also cleans up some parts where the icmp header struct could be used instead of other icmp_packet(_rr) structs to make the code more readable. Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch> Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
2013-10-23batman-adv: Start new development cycleSimon Wunderlich
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de> Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
2013-10-23batman-adv: include the sync-flags when compute the global/local table CRCAntonio Quartulli
Flags covered by TT_SYNC_MASK are kept in sync among the nodes in the network and therefore they have to be considered while computing the global/local table CRC. In this way a generic originator is able to understand if its table contains the correct flags or not. Bits from 4 to 7 in the TT flags fields are now reserved for "synchronized" flags only. This allows future developers to add more flags of this type without breaking compatibility. It's important to note that not all the remote TT flags are synchronised. This comes from the fact that some flags are used to inject an information once only. Signed-off-by: Antonio Quartulli <antonio@open-mesh.com> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
2013-10-23batman-adv: improve the TT component to support runtime flag changesAntonio Quartulli
Some flags (i.e. the WIFI flag) may change after that the related client has already been announced. However it is useful to informa the rest of the network about this change. Add a runtime-flag-switch detection mechanism and re-announce the related TT entry to advertise the new flag value. This mechanism can be easily exploited by future flags that may need the same treatment. Signed-off-by: Antonio Quartulli <antonio@open-mesh.com> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
2013-10-23batman-adv: invoke dev_get_by_index() outside of is_wifi_iface()Antonio Quartulli
Upcoming changes need to perform other checks on the incoming net_device struct. To avoid performing dev_get_by_index() for each and every check, it is better to move it outside of is_wifi_iface() and search the netdev object once only. Signed-off-by: Antonio Quartulli <antonio@open-mesh.com> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
2013-10-23batman-adv: send GW_DEL event in case of soft-iface destructionAntonio Quartulli
In case of soft_iface destruction send a GW DEL event to userspace so that applications which are listening for GW events are informed about the lost of connectivity and can react accordingly. Signed-off-by: Antonio Quartulli <antonio@open-mesh.com> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
2013-10-23batman-adv: limit local translation table max sizeMarek Lindner
The local translation table size is limited by what can be transferred from one node to another via a full table request. The number of entries fitting into a full table request depend on whether the fragmentation is enabled or not. Therefore this patch introduces a max table size check and refuses to add more local clients when that size is reached. Moreover, if the max full table packet size changes (MTU change or fragmentation is disabled) the local table is downsized instantaneously. Signed-off-by: Marek Lindner <lindner_marek@yahoo.de> Acked-by: Antonio Quartulli <ordex@autistici.org>
2013-10-23batman-adv: adapt the TT component to use the new API functionsAntonio Quartulli
Signed-off-by: Antonio Quartulli <antonio@open-mesh.com> Signed-off-by: Marek Lindner <lindner_marek@yahoo.de>
2013-10-23batman-adv: provide orig_node routing APIAntonio Quartulli
Some operations executed on an orig_node depends on the current routing algorithm being used. To easily make this mechanism routing algorithm agnostic add a orig_node specific API that each algorithm can populate with its own routines. Such routines are then invoked by the code when needed, without knowing which routing algorithm is currently in use With this patch 3 API functions are added: - orig_free (to free routing depending internal structs) - orig_add_if (to change the inner state of an orig_node when a new hard interface is added) - orig_del_if (to change the inner state of an orig_node when an hard interface is removed) Signed-off-by: Antonio Quartulli <antonio@open-mesh.com> Signed-off-by: Marek Lindner <lindner_marek@yahoo.de>
2013-10-23batman-adv: adapt the neighbor purging routine to use the new API functionsAntonio Quartulli
Signed-off-by: Antonio Quartulli <antonio@open-mesh.com> Signed-off-by: Marek Lindner <lindner_marek@yahoo.de>
2013-10-23batman-adv: adapt bonding to use the new API functionsAntonio Quartulli
Signed-off-by: Antonio Quartulli <antonio@open-mesh.com> Signed-off-by: Marek Lindner <lindner_marek@yahoo.de>
2013-10-23batman-adv: add bat_neigh_is_equiv_or_better API functionAntonio Quartulli
Each routing protocol has its own metric semantic and therefore is the protocol itself the only component able to compare two metrics to check their "similarity". This new API allows each routing protocol to implement its own logic and make the external code protocol agnostic. Signed-off-by: Antonio Quartulli <antonio@open-mesh.com> Signed-off-by: Marek Lindner <lindner_marek@yahoo.de>
2013-10-23batman-adv: add bat_neigh_cmp API functionAntonio Quartulli
This new API allows to compare the two neighbours based on the metric avoiding the user to deal with any routing algorithm specific detail Signed-off-by: Antonio Quartulli <antonio@open-mesh.com> Signed-off-by: Marek Lindner <lindner_marek@yahoo.de>
2013-10-23batman-adv: add bat_orig_print API functionAntonio Quartulli
Each routing protocol has its own metric and private variables, therefore it is useful to introduce a new API for originator information printing. This API needs to be implemented by each protocol in order to provide its specific originator table output. Signed-off-by: Antonio Quartulli <antonio@open-mesh.com> Signed-off-by: Marek Lindner <lindner_marek@yahoo.de>
2013-10-23batman-adv: make struct batadv_orig_node algorithm agnosticAntonio Quartulli
some of the struct batadv_orig_node members are B.A.T.M.A.N. IV specific and therefore they are moved in a algorithm specific substruct in order to make batadv_orig_node routing algorithm agnostic Signed-off-by: Antonio Quartulli <antonio@open-mesh.com> Signed-off-by: Marek Lindner <lindner_marek@yahoo.de>
2013-10-23batman-adv: make struct batadv_neigh_node algorithm agnosticAntonio Quartulli
some of the fields in struct batadv_neigh_node are strictly related to the B.A.T.M.A.N. IV algorithm. In order to make the struct usable by any routing algorithm it has to be split and made more generic Signed-off-by: Antonio Quartulli <antonio@open-mesh.com> Signed-off-by: Marek Lindner <lindner_marek@yahoo.de>
2013-10-23netfilter: ip6t_REJECT: skip checksum verification for outgoing ipv6 packetsStanislav Fomichev
Don't verify checksum for outgoing packets because checksum calculation may be done by the device. Without this patch: $ ip6tables -I OUTPUT -p tcp --dport 80 -j REJECT --reject-with tcp-reset $ time telnet ipv6.google.com 80 Trying 2a00:1450:4010:c03::67... telnet: Unable to connect to remote host: Connection timed out real 0m7.201s user 0m0.000s sys 0m0.000s With the patch applied: $ ip6tables -I OUTPUT -p tcp --dport 80 -j REJECT --reject-with tcp-reset $ time telnet ipv6.google.com 80 Trying 2a00:1450:4010:c03::67... telnet: Unable to connect to remote host: Connection refused real 0m0.085s user 0m0.000s sys 0m0.000s Signed-off-by: Stanislav Fomichev <stfomichev@yandex-team.ru> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-10-22Revert "bridge: only expire the mdb entry when query is received"Linus Lüssing
While this commit was a good attempt to fix issues occuring when no multicast querier is present, this commit still has two more issues: 1) There are cases where mdb entries do not expire even if there is a querier present. The bridge will unnecessarily continue flooding multicast packets on the according ports. 2) Never removing an mdb entry could be exploited for a Denial of Service by an attacker on the local link, slowly, but steadily eating up all memory. Actually, this commit became obsolete with "bridge: disable snooping if there is no querier" (b00589af3b) which included fixes for a few more cases. Therefore reverting the following commits (the commit stated in the commit message plus three of its follow up fixes): ==================== Revert "bridge: update mdb expiration timer upon reports." This reverts commit f144febd93d5ee534fdf23505ab091b2b9088edc. Revert "bridge: do not call setup_timer() multiple times" This reverts commit 1faabf2aab1fdaa1ace4e8c829d1b9cf7bfec2f1. Revert "bridge: fix some kernel warning in multicast timer" This reverts commit c7e8e8a8f7a70b343ca1e0f90a31e35ab2d16de1. Revert "bridge: only expire the mdb entry when query is received" This reverts commit 9f00b2e7cf241fa389733d41b615efdaa2cb0f5b. ==================== CC: Cong Wang <amwang@redhat.com> Signed-off-by: Linus Lüssing <linus.luessing@web.de> Reviewed-by: Vlad Yasevich <vyasevich@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-22net: remove function sk_reset_txq()ZHAO Gang
What sk_reset_txq() does is just calls function sk_tx_queue_reset(), and sk_reset_txq() is used only in sock.h, by dst_negative_advice(). Let dst_negative_advice() calls sk_tx_queue_reset() directly so we can remove unneeded sk_reset_txq(). Signed-off-by: ZHAO Gang <gamerh2o@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-22openvswitch: collect mega flow mask statsAndy Zhou
Collect mega flow mask stats. ovs-dpctl show command can be used to display them for debugging and performance tuning. Signed-off-by: Andy Zhou <azhou@nicira.com> Signed-off-by: Jesse Gross <jesse@nicira.com>
2013-10-22netfilter: ipset: The unnamed union initialization may lead to compilation errorJozsef Kadlecsik
The unnamed union should be possible to be initialized directly, but unfortunately it's not so: /usr/src/ipset/kernel/net/netfilter/ipset/ip_set_hash_netnet.c: In function ?hash_netnet4_kadt?: /usr/src/ipset/kernel/net/netfilter/ipset/ip_set_hash_netnet.c:141: error: unknown field ?cidr? specified in initializer Reported-by: Husnu Demir <hdemir@metu.edu.tr> Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-10-22netfilter: ipset: Use netlink callback dump args onlyJozsef Kadlecsik
Instead of cb->data, use callback dump args only and introduce symbolic names instead of plain numbers at accessing the argument members. Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-10-22netfilter: x_tables: fix ordering of jumpstack allocation and table updateWill Deacon
During kernel stability testing on an SMP ARMv7 system, Yalin Wang reported the following panic from the netfilter code: 1fe0: 0000001c 5e2d3b10 4007e779 4009e110 60000010 00000032 ff565656 ff545454 [<c06c48dc>] (ipt_do_table+0x448/0x584) from [<c0655ef0>] (nf_iterate+0x48/0x7c) [<c0655ef0>] (nf_iterate+0x48/0x7c) from [<c0655f7c>] (nf_hook_slow+0x58/0x104) [<c0655f7c>] (nf_hook_slow+0x58/0x104) from [<c0683bbc>] (ip_local_deliver+0x88/0xa8) [<c0683bbc>] (ip_local_deliver+0x88/0xa8) from [<c0683718>] (ip_rcv_finish+0x418/0x43c) [<c0683718>] (ip_rcv_finish+0x418/0x43c) from [<c062b1c4>] (__netif_receive_skb+0x4cc/0x598) [<c062b1c4>] (__netif_receive_skb+0x4cc/0x598) from [<c062b314>] (process_backlog+0x84/0x158) [<c062b314>] (process_backlog+0x84/0x158) from [<c062de84>] (net_rx_action+0x70/0x1dc) [<c062de84>] (net_rx_action+0x70/0x1dc) from [<c0088230>] (__do_softirq+0x11c/0x27c) [<c0088230>] (__do_softirq+0x11c/0x27c) from [<c008857c>] (do_softirq+0x44/0x50) [<c008857c>] (do_softirq+0x44/0x50) from [<c0088614>] (local_bh_enable_ip+0x8c/0xd0) [<c0088614>] (local_bh_enable_ip+0x8c/0xd0) from [<c06b0330>] (inet_stream_connect+0x164/0x298) [<c06b0330>] (inet_stream_connect+0x164/0x298) from [<c061d68c>] (sys_connect+0x88/0xc8) [<c061d68c>] (sys_connect+0x88/0xc8) from [<c000e340>] (ret_fast_syscall+0x0/0x30) Code: 2a000021 e59d2028 e59de01c e59f011c (e7824103) ---[ end trace da227214a82491bd ]--- Kernel panic - not syncing: Fatal exception in interrupt This comes about because CPU1 is executing xt_replace_table in response to a setsockopt syscall, resulting in: ret = xt_jumpstack_alloc(newinfo); --> newinfo->jumpstack = kzalloc(size, GFP_KERNEL); [...] table->private = newinfo; newinfo->initial_entries = private->initial_entries; Meanwhile, CPU0 is handling the network receive path and ends up in ipt_do_table, resulting in: private = table->private; [...] jumpstack = (struct ipt_entry **)private->jumpstack[cpu]; On weakly ordered memory architectures, the writes to table->private and newinfo->jumpstack from CPU1 can be observed out of order by CPU0. Furthermore, on architectures which don't respect ordering of address dependencies (i.e. Alpha), the reads from CPU0 can also be re-ordered. This patch adds an smp_wmb() before the assignment to table->private (which is essentially publishing newinfo) to ensure that all writes to newinfo will be observed before plugging it into the table structure. A dependent-read barrier is also added on the consumer sides, to ensure the same ordering requirements are also respected there. Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reported-by: Wang, Yalin <Yalin.Wang@sonymobile.com> Tested-by: Wang, Yalin <Yalin.Wang@sonymobile.com> Signed-off-by: Will Deacon <will.deacon@arm.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-10-21tcp: initialize passive-side sk_pacing_rate after 3WHSNeal Cardwell
For passive TCP connections, upon receiving the ACK that completes the 3WHS, make sure we set our pacing rate after we get our first RTT sample. On passive TCP connections, when we receive the ACK completing the 3WHS we do not take an RTT sample in tcp_ack(), but rather in tcp_synack_rtt_meas(). So upon receiving the ACK that completes the 3WHS, tcp_ack() leaves sk_pacing_rate at its initial value. Originally the initial sk_pacing_rate value was 0, so passive-side connections defaulted to sysctl_tcp_min_tso_segs (2 segs) in skbuffs made in the first RTT. With a default initial cwnd of 10 packets, this happened to be correct for RTTs 5ms or bigger, so it was hard to see problems in WAN or emulated WAN testing. Since 7eec4174ff ("pkt_sched: fq: fix non TCP flows pacing"), the initial sk_pacing_rate is 0xffffffff. So after that change, passive TCP connections were keeping this value (and using large numbers of segments per skbuff) until receiving an ACK for data. Signed-off-by: Neal Cardwell <ncardwell@google.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Yuchung Cheng <ycheng@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-21ipv6: probe routes asynchronous in rt6_probeHannes Frederic Sowa
Routes need to be probed asynchronous otherwise the call stack gets exhausted when the kernel attemps to deliver another skb inline, like e.g. xt_TEE does, and we probe at the same time. We update neigh->updated still at once, otherwise we would send to many probes. Cc: Julian Anastasov <ja@ssi.bg> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-21ipv6: sit: add GSO/TSO supportEric Dumazet
Now ipv6_gso_segment() is stackable, its relatively easy to implement GSO/TSO support for SIT tunnels Performance results, when segmentation is done after tunnel device (as no NIC is yet enabled for TSO SIT support) : Before patch : lpq84:~# ./netperf -H 2002:af6:1153:: -Cc MIGRATED TCP STREAM TEST from ::0 (::) port 0 AF_INET6 to 2002:af6:1153:: () port 0 AF_INET6 Recv Send Send Utilization Service Demand Socket Socket Message Elapsed Send Recv Send Recv Size Size Size Time Throughput local remote local remote bytes bytes bytes secs. 10^6bits/s % S % S us/KB us/KB 87380 16384 16384 10.00 3168.31 4.81 4.64 2.988 2.877 After patch : lpq84:~# ./netperf -H 2002:af6:1153:: -Cc MIGRATED TCP STREAM TEST from ::0 (::) port 0 AF_INET6 to 2002:af6:1153:: () port 0 AF_INET6 Recv Send Send Utilization Service Demand Socket Socket Message Elapsed Send Recv Send Recv Size Size Size Time Throughput local remote local remote bytes bytes bytes secs. 10^6bits/s % S % S us/KB us/KB 87380 16384 16384 10.00 5525.00 7.76 5.17 2.763 1.840 Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-21ipv6: gso: make ipv6_gso_segment() stackableEric Dumazet
In order to support GSO on SIT tunnels, we need to make inet_gso_segment() stackable. It should not assume network header starts right after mac header. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-21ipv4: Allow unprivileged users to use per net sysctlsEric W. Biederman
Allow unprivileged users to use: /proc/sys/net/ipv4/icmp_echo_ignore_all /proc/sys/net/ipv4/icmp_echo_ignore_broadcasts /proc/sys/net/ipv4/icmp_ignore_bogus_error_response /proc/sys/net/ipv4/icmp_errors_use_inbound_ifaddr /proc/sys/net/ipv4/icmp_ratelimit /proc/sys/net/ipv4/icmp_ratemask /proc/sys/net/ipv4/ping_group_range /proc/sys/net/ipv4/tcp_ecn /proc/sys/net/ipv4/ip_local_ports_range These are occassionally handy and after a quick review I don't see any problems with unprivileged users using them. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-21ipv4: Use math to point per net sysctls into the appropriate struct net.Eric W. Biederman
Simplify maintenance of ipv4_net_table by using math to point the per net sysctls into the appropriate struct net, instead of manually reassinging all of the variables into hard coded table slots. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-21tcp_memcontrol: Kill struct tcp_memcontrolEric W. Biederman
Replace the pointers in struct cg_proto with actual data fields and kill struct tcp_memcontrol as it is not fully redundant. This removes a confusing, unnecessary layer of abstraction. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-21tcp_memcontrol: Remove the per netns control.Eric W. Biederman
The code that is implemented is per memory cgroup not per netns, and having per netns bits is just confusing. Remove the per netns bits to make it easier to see what is really going on. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-21tcp_memcontrol: Remove setting cgroup settings via sysctlEric W. Biederman
The code is broken and does not constrain sysctl_tcp_mem as tcp_update_limit does. With the result that it allows the cgroup tcp memory limits to be bypassed. The semantics are broken as the settings are not per netns and are in a per netns table, and instead looks at current. Since the code is broken in both design and implementation and does not implement the functionality for which it was written remove it. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-21tcp_memcontrol: Remove tcp_max_memoryEric W. Biederman
This function is never called. Remove it. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>