summaryrefslogtreecommitdiff
path: root/net/sctp
AgeCommit message (Collapse)Author
2014-05-09sctp: add a checking for sctp_sysctl_net_registerwangweidong
When register_net_sysctl failed, we should free the sysctl_table. Signed-off-by: Wang Weidong <wangweidong1@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-05-09Revert "sctp: optimize the sctp_sysctl_net_register"wangweidong
This revert commit efb842c45("sctp: optimize the sctp_sysctl_net_register"), Since it doesn't kmemdup a sysctl_table for init_net, so the init_net->sctp.sysctl_header->ctl_table_arg points to sctp_net_table which is a static array pointer. So when doing sctp_sysctl_net_unregister, it will free sctp_net_table, then we will get a NULL pointer dereference like that: [ 262.948220] BUG: unable to handle kernel NULL pointer dereference at 000000000000006c [ 262.948232] IP: [<ffffffff81144b70>] kfree+0x80/0x420 [ 262.948260] PGD db80a067 PUD dae12067 PMD 0 [ 262.948268] Oops: 0000 [#1] SMP [ 262.948273] Modules linked in: sctp(-) crc32c_generic libcrc32c ... [ 262.948338] task: ffff8800db830190 ti: ffff8800dad00000 task.ti: ffff8800dad00000 [ 262.948344] RIP: 0010:[<ffffffff81144b70>] [<ffffffff81144b70>] kfree+0x80/0x420 [ 262.948353] RSP: 0018:ffff8800dad01d88 EFLAGS: 00010046 [ 262.948358] RAX: 0100000000000000 RBX: ffffffffa0227940 RCX: ffffea0000707888 [ 262.948363] RDX: ffffea0000707888 RSI: 0000000000000001 RDI: ffffffffa0227940 [ 262.948369] RBP: ffff8800dad01de8 R08: 0000000000000000 R09: ffff8800d9e983a9 [ 262.948374] R10: 0000000000000000 R11: 0000000000000000 R12: ffffffffa0227940 [ 262.948380] R13: ffffffff8187cfc0 R14: 0000000000000000 R15: ffffffff8187da10 [ 262.948386] FS: 00007fa2a2658700(0000) GS:ffff880112800000(0000) knlGS:0000000000000000 [ 262.948394] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [ 262.948400] CR2: 000000000000006c CR3: 00000000cddc0000 CR4: 00000000000006e0 [ 262.948410] Stack: [ 262.948413] ffff8800dad01da8 0000000000000286 0000000020227940 ffffffffa0227940 [ 262.948422] ffff8800dad01dd8 ffffffff811b7fa1 ffffffffa0227940 ffffffffa0227940 [ 262.948431] ffffffff8187d960 ffffffff8187cfc0 ffffffff8187d960 ffffffff8187da10 [ 262.948440] Call Trace: [ 262.948457] [<ffffffff811b7fa1>] ? unregister_sysctl_table+0x51/0xa0 [ 262.948476] [<ffffffffa020d1a1>] sctp_sysctl_net_unregister+0x21/0x30 [sctp] [ 262.948490] [<ffffffffa020ef6d>] sctp_net_exit+0x12d/0x150 [sctp] [ 262.948512] [<ffffffff81394f49>] ops_exit_list+0x39/0x60 [ 262.948522] [<ffffffff813951ed>] unregister_pernet_operations+0x3d/0x70 [ 262.948530] [<ffffffff81395292>] unregister_pernet_subsys+0x22/0x40 [ 262.948544] [<ffffffffa020efcc>] sctp_exit+0x3c/0x12d [sctp] [ 262.948562] [<ffffffff810c5e04>] SyS_delete_module+0x194/0x210 [ 262.948577] [<ffffffff81240fde>] ? trace_hardirqs_on_thunk+0x3a/0x3f [ 262.948587] [<ffffffff815217a2>] system_call_fastpath+0x16/0x1b With this revert, it won't occur the Oops. Signed-off-by: Wang Weidong <wangweidong1@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-05-07net: clean up snmp stats codeWANG Cong
commit 8f0ea0fe3a036a47767f9c80e (snmp: reduce percpu needs by 50%) reduced snmp array size to 1, so technically it doesn't have to be an array any more. What's more, after the following commit: commit 933393f58fef9963eac61db8093689544e29a600 Date: Thu Dec 22 11:58:51 2011 -0600 percpu: Remove irqsafe_cpu_xxx variants We simply say that regular this_cpu use must be safe regardless of preemption and interrupt state. That has no material change for x86 and s390 implementations of this_cpu operations. However, arches that do not provide their own implementation for this_cpu operations will now get code generated that disables interrupts instead of preemption. probably no arch wants to have SNMP_ARRAY_SZ == 2. At least after almost 3 years, no one complains. So, just convert the array to a single pointer and remove snmp_mib_init() and snmp_mib_free() as well. Cc: Christoph Lameter <cl@linux.com> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-04-27net: sctp: Don't transition to PF state when transport has exhausted ↵Karl Heiss
'Path.Max.Retrans'. Don't transition to the PF state on every strike after 'Path.Max.Retrans'. Per draft-ietf-tsvwg-sctp-failover-03 Section 5.1.6: Additional (PMR - PFMR) consecutive timeouts on a PF destination confirm the path failure, upon which the destination transitions to the Inactive state. As described in [RFC4960], the sender (i) SHOULD notify ULP about this state transition, and (ii) transmit heartbeats to the Inactive destination at a lower frequency as described in Section 8.3 of [RFC4960]. This also prevents sending SCTP_ADDR_UNREACHABLE to the user as the state bounces between SCTP_INACTIVE and SCTP_PF for each subsequent strike. Signed-off-by: Karl Heiss <kheiss@gmail.com> Acked-by: Vlad Yasevich <vyasevich@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-04-27sctp: reset flowi4_oif parameter on route lookupXufeng Zhang
commit 813b3b5db83 (ipv4: Use caller's on-stack flowi as-is in output route lookups.) introduces another regression which is very similar to the problem of commit e6b45241c (ipv4: reset flowi parameters on route connect) wants to fix: Before we call ip_route_output_key() in sctp_v4_get_dst() to get a dst that matches a bind address as the source address, we have already called this function previously and the flowi parameters have been initialized including flowi4_oif, so when we call this function again, the process in __ip_route_output_key() will be different because of the setting of flowi4_oif, and we'll get a networking device which corresponds to the inputted flowi4_oif as the output device, this is wrong because we'll never hit this place if the previously returned source address of dst match one of the bound addresses. To reproduce this problem, a vlan setting is enough: # ifconfig eth0 up # route del default # vconfig add eth0 2 # vconfig add eth0 3 # ifconfig eth0.2 10.0.1.14 netmask 255.255.255.0 # route add default gw 10.0.1.254 dev eth0.2 # ifconfig eth0.3 10.0.0.14 netmask 255.255.255.0 # ip rule add from 10.0.0.14 table 4 # ip route add table 4 default via 10.0.0.254 src 10.0.0.14 dev eth0.3 # sctp_darn -H 10.0.0.14 -P 36422 -h 10.1.4.134 -p 36422 -s -I You'll detect that all the flow are routed to eth0.2(10.0.1.254). Signed-off-by: Xufeng Zhang <xufeng.zhang@windriver.com> Signed-off-by: Julian Anastasov <ja@ssi.bg> Acked-by: Vlad Yasevich <vyasevich@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-04-20sctp: add support for busy polling to sctp protocolNeil Horman
The busy polling socket option adds support for sockets to busy wait on data arriving on the napi queue from which they have most recently received a frame. Currently only tcp and udp support this feature, but theres no reason sctp can't do so as well. Add it in so appliations can take advantage of it Signed-off-by: Neil Horman <nhorman@tuxdriver.com> CC: Vlad Yasevich <vyasevich@gmail.com> CC: "David S. Miller" <davem@davemloft.net> CC: Daniel Borkmann <dborkman@redhat.com> CC: netdev@vger.kernel.org Acked-by: Vlad Yasevich <vyasevich@gmail.com> Acked-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-04-18net: sctp: cache auth_enable per endpointVlad Yasevich
Currently, it is possible to create an SCTP socket, then switch auth_enable via sysctl setting to 1 and crash the system on connect: Oops[#1]: CPU: 0 PID: 0 Comm: swapper Not tainted 3.14.1-mipsgit-20140415 #1 task: ffffffff8056ce80 ti: ffffffff8055c000 task.ti: ffffffff8055c000 [...] Call Trace: [<ffffffff8043c4e8>] sctp_auth_asoc_set_default_hmac+0x68/0x80 [<ffffffff8042b300>] sctp_process_init+0x5e0/0x8a4 [<ffffffff8042188c>] sctp_sf_do_5_1B_init+0x234/0x34c [<ffffffff804228c8>] sctp_do_sm+0xb4/0x1e8 [<ffffffff80425a08>] sctp_endpoint_bh_rcv+0x1c4/0x214 [<ffffffff8043af68>] sctp_rcv+0x588/0x630 [<ffffffff8043e8e8>] sctp6_rcv+0x10/0x24 [<ffffffff803acb50>] ip6_input+0x2c0/0x440 [<ffffffff8030fc00>] __netif_receive_skb_core+0x4a8/0x564 [<ffffffff80310650>] process_backlog+0xb4/0x18c [<ffffffff80313cbc>] net_rx_action+0x12c/0x210 [<ffffffff80034254>] __do_softirq+0x17c/0x2ac [<ffffffff800345e0>] irq_exit+0x54/0xb0 [<ffffffff800075a4>] ret_from_irq+0x0/0x4 [<ffffffff800090ec>] rm7k_wait_irqoff+0x24/0x48 [<ffffffff8005e388>] cpu_startup_entry+0xc0/0x148 [<ffffffff805a88b0>] start_kernel+0x37c/0x398 Code: dd0900b8 000330f8 0126302d <dcc60000> 50c0fff1 0047182a a48306a0 03e00008 00000000 ---[ end trace b530b0551467f2fd ]--- Kernel panic - not syncing: Fatal exception in interrupt What happens while auth_enable=0 in that case is, that ep->auth_hmacs is initialized to NULL in sctp_auth_init_hmacs() when endpoint is being created. After that point, if an admin switches over to auth_enable=1, the machine can crash due to NULL pointer dereference during reception of an INIT chunk. When we enter sctp_process_init() via sctp_sf_do_5_1B_init() in order to respond to an INIT chunk, the INIT verification succeeds and while we walk and process all INIT params via sctp_process_param() we find that net->sctp.auth_enable is set, therefore do not fall through, but invoke sctp_auth_asoc_set_default_hmac() instead, and thus, dereference what we have set to NULL during endpoint initialization phase. The fix is to make auth_enable immutable by caching its value during endpoint initialization, so that its original value is being carried along until destruction. The bug seems to originate from the very first days. Fix in joint work with Daniel Borkmann. Reported-by: Joshua Kinard <kumba@gentoo.org> Signed-off-by: Vlad Yasevich <vyasevic@redhat.com> Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Tested-by: Joshua Kinard <kumba@gentoo.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-04-15ipv4: add a sock pointer to ip_queue_xmit()Eric Dumazet
ip_queue_xmit() assumes the skb it has to transmit is attached to an inet socket. Commit 31c70d5956fc ("l2tp: keep original skb ownership") changed l2tp to not change skb ownership and thus broke this assumption. One fix is to add a new 'struct sock *sk' parameter to ip_queue_xmit(), so that we do not assume skb->sk points to the socket used by l2tp tunnel. Fixes: 31c70d5956fc ("l2tp: keep original skb ownership") Reported-by: Zhan Jianyu <nasa4836@gmail.com> Tested-by: Zhan Jianyu <nasa4836@gmail.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-04-14Revert "net: sctp: Fix a_rwnd/rwnd management to reflect real state of the ↵Daniel Borkmann
receiver's buffer" This reverts commit ef2820a735f7 ("net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer") as it introduced a serious performance regression on SCTP over IPv4 and IPv6, though a not as dramatic on the latter. Measurements are on 10Gbit/s with ixgbe NICs. Current state: [root@Lab200slot2 ~]# iperf3 --sctp -4 -c 192.168.241.3 -V -l 1452 -t 60 iperf version 3.0.1 (10 January 2014) Linux Lab200slot2 3.14.0 #1 SMP Thu Apr 3 23:18:29 EDT 2014 x86_64 Time: Fri, 11 Apr 2014 17:56:21 GMT Connecting to host 192.168.241.3, port 5201 Cookie: Lab200slot2.1397238981.812898.548918 [ 4] local 192.168.241.2 port 38616 connected to 192.168.241.3 port 5201 Starting Test: protocol: SCTP, 1 streams, 1452 byte blocks, omitting 0 seconds, 60 second test [ ID] Interval Transfer Bandwidth [ 4] 0.00-1.09 sec 20.8 MBytes 161 Mbits/sec [ 4] 1.09-2.13 sec 10.8 MBytes 86.8 Mbits/sec [ 4] 2.13-3.15 sec 3.57 MBytes 29.5 Mbits/sec [ 4] 3.15-4.16 sec 4.33 MBytes 35.7 Mbits/sec [ 4] 4.16-6.21 sec 10.4 MBytes 42.7 Mbits/sec [ 4] 6.21-6.21 sec 0.00 Bytes 0.00 bits/sec [ 4] 6.21-7.35 sec 34.6 MBytes 253 Mbits/sec [ 4] 7.35-11.45 sec 22.0 MBytes 45.0 Mbits/sec [ 4] 11.45-11.45 sec 0.00 Bytes 0.00 bits/sec [ 4] 11.45-11.45 sec 0.00 Bytes 0.00 bits/sec [ 4] 11.45-11.45 sec 0.00 Bytes 0.00 bits/sec [ 4] 11.45-12.51 sec 16.0 MBytes 126 Mbits/sec [ 4] 12.51-13.59 sec 20.3 MBytes 158 Mbits/sec [ 4] 13.59-14.65 sec 13.4 MBytes 107 Mbits/sec [ 4] 14.65-16.79 sec 33.3 MBytes 130 Mbits/sec [ 4] 16.79-16.79 sec 0.00 Bytes 0.00 bits/sec [ 4] 16.79-17.82 sec 5.94 MBytes 48.7 Mbits/sec (etc) [root@Lab200slot2 ~]# iperf3 --sctp -6 -c 2001:db8:0:f101::1 -V -l 1400 -t 60 iperf version 3.0.1 (10 January 2014) Linux Lab200slot2 3.14.0 #1 SMP Thu Apr 3 23:18:29 EDT 2014 x86_64 Time: Fri, 11 Apr 2014 19:08:41 GMT Connecting to host 2001:db8:0:f101::1, port 5201 Cookie: Lab200slot2.1397243321.714295.2b3f7c [ 4] local 2001:db8:0:f101::2 port 55804 connected to 2001:db8:0:f101::1 port 5201 Starting Test: protocol: SCTP, 1 streams, 1400 byte blocks, omitting 0 seconds, 60 second test [ ID] Interval Transfer Bandwidth [ 4] 0.00-1.00 sec 169 MBytes 1.42 Gbits/sec [ 4] 1.00-2.00 sec 201 MBytes 1.69 Gbits/sec [ 4] 2.00-3.00 sec 188 MBytes 1.58 Gbits/sec [ 4] 3.00-4.00 sec 174 MBytes 1.46 Gbits/sec [ 4] 4.00-5.00 sec 165 MBytes 1.39 Gbits/sec [ 4] 5.00-6.00 sec 199 MBytes 1.67 Gbits/sec [ 4] 6.00-7.00 sec 163 MBytes 1.36 Gbits/sec [ 4] 7.00-8.00 sec 174 MBytes 1.46 Gbits/sec [ 4] 8.00-9.00 sec 193 MBytes 1.62 Gbits/sec [ 4] 9.00-10.00 sec 196 MBytes 1.65 Gbits/sec [ 4] 10.00-11.00 sec 157 MBytes 1.31 Gbits/sec [ 4] 11.00-12.00 sec 175 MBytes 1.47 Gbits/sec [ 4] 12.00-13.00 sec 192 MBytes 1.61 Gbits/sec [ 4] 13.00-14.00 sec 199 MBytes 1.67 Gbits/sec (etc) After patch: [root@Lab200slot2 ~]# iperf3 --sctp -4 -c 192.168.240.3 -V -l 1452 -t 60 iperf version 3.0.1 (10 January 2014) Linux Lab200slot2 3.14.0+ #1 SMP Mon Apr 14 12:06:40 EDT 2014 x86_64 Time: Mon, 14 Apr 2014 16:40:48 GMT Connecting to host 192.168.240.3, port 5201 Cookie: Lab200slot2.1397493648.413274.65e131 [ 4] local 192.168.240.2 port 50548 connected to 192.168.240.3 port 5201 Starting Test: protocol: SCTP, 1 streams, 1452 byte blocks, omitting 0 seconds, 60 second test [ ID] Interval Transfer Bandwidth [ 4] 0.00-1.00 sec 240 MBytes 2.02 Gbits/sec [ 4] 1.00-2.00 sec 239 MBytes 2.01 Gbits/sec [ 4] 2.00-3.00 sec 240 MBytes 2.01 Gbits/sec [ 4] 3.00-4.00 sec 239 MBytes 2.00 Gbits/sec [ 4] 4.00-5.00 sec 245 MBytes 2.05 Gbits/sec [ 4] 5.00-6.00 sec 240 MBytes 2.01 Gbits/sec [ 4] 6.00-7.00 sec 240 MBytes 2.02 Gbits/sec [ 4] 7.00-8.00 sec 239 MBytes 2.01 Gbits/sec With the reverted patch applied, the SCTP/IPv4 performance is back to normal on latest upstream for IPv4 and IPv6 and has same throughput as 3.4.2 test kernel, steady and interval reports are smooth again. Fixes: ef2820a735f7 ("net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer") Reported-by: Peter Butler <pbutler@sonusnet.com> Reported-by: Dongsheng Song <dongsheng.song@gmail.com> Reported-by: Fengguang Wu <fengguang.wu@intel.com> Tested-by: Peter Butler <pbutler@sonusnet.com> Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Cc: Matija Glavinic Pecotic <matija.glavinic-pecotic.ext@nsn.com> Cc: Alexander Sverdlin <alexander.sverdlin@nsn.com> Cc: Vlad Yasevich <vyasevich@gmail.com> Acked-by: Vlad Yasevich <vyasevich@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-04-11net: Fix use after free by removing length arg from sk_data_ready callbacks.David S. Miller
Several spots in the kernel perform a sequence like: skb_queue_tail(&sk->s_receive_queue, skb); sk->sk_data_ready(sk, skb->len); But at the moment we place the SKB onto the socket receive queue it can be consumed and freed up. So this skb->len access is potentially to freed up memory. Furthermore, the skb->len can be modified by the consumer so it is possible that the value isn't accurate. And finally, no actual implementation of this callback actually uses the length argument. And since nobody actually cared about it's value, lots of call sites pass arbitrary values in such as '0' and even '1'. So just remove the length argument from the callback, that way there is no confusion whatsoever and all of these use-after-free cases get fixed as a side effect. Based upon a patch by Eric Dumazet and his suggestion to audit this issue tree-wide. Signed-off-by: David S. Miller <davem@davemloft.net>
2014-04-09net: sctp: test if association is dead in sctp_wake_up_waitersDaniel Borkmann
In function sctp_wake_up_waiters(), we need to involve a test if the association is declared dead. If so, we don't have any reference to a possible sibling association anymore and need to invoke sctp_write_space() instead, and normally walk the socket's associations and notify them of new wmem space. The reason for special casing is that otherwise, we could run into the following issue when a sctp_primitive_SEND() call from sctp_sendmsg() fails, and tries to flush an association's outq, i.e. in the following way: sctp_association_free() `-> list_del(&asoc->asocs) <-- poisons list pointer asoc->base.dead = true sctp_outq_free(&asoc->outqueue) `-> __sctp_outq_teardown() `-> sctp_chunk_free() `-> consume_skb() `-> sctp_wfree() `-> sctp_wake_up_waiters() <-- dereferences poisoned pointers if asoc->ep->sndbuf_policy=0 Therefore, only walk the list in an 'optimized' way if we find that the current association is still active. We could also use list_del_init() in addition when we call sctp_association_free(), but as Vlad suggests, we want to trap such bugs and thus leave it poisoned as is. Why is it safe to resolve the issue by testing for asoc->base.dead? Parallel calls to sctp_sendmsg() are protected under socket lock, that is lock_sock()/release_sock(). Only within that path under lock held, we're setting skb/chunk owner via sctp_set_owner_w(). Eventually, chunks are freed directly by an association still under that lock. So when traversing association list on destruction time from sctp_wake_up_waiters() via sctp_wfree(), a different CPU can't be running sctp_wfree() while another one calls sctp_association_free() as both happens under the same lock. Therefore, this can also not race with setting/testing against asoc->base.dead as we are guaranteed for this to happen in order, under lock. Further, Vlad says: the times we check asoc->base.dead is when we've cached an association pointer for later processing. In between cache and processing, the association may have been freed and is simply still around due to reference counts. We check asoc->base.dead under a lock, so it should always be safe to check and not race against sctp_association_free(). Stress-testing seems fine now, too. Fixes: cd253f9f357d ("net: sctp: wake up all assocs if sndbuf policy is per socket") Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Cc: Vlad Yasevich <vyasevic@redhat.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Acked-by: Vlad Yasevich <vyasevic@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-04-08net: sctp: wake up all assocs if sndbuf policy is per socketDaniel Borkmann
SCTP charges chunks for wmem accounting via skb->truesize in sctp_set_owner_w(), and sctp_wfree() respectively as the reverse operation. If a sender runs out of wmem, it needs to wait via sctp_wait_for_sndbuf(), and gets woken up by a call to __sctp_write_space() mostly via sctp_wfree(). __sctp_write_space() is being called per association. Although we assign sk->sk_write_space() to sctp_write_space(), which is then being done per socket, it is only used if send space is increased per socket option (SO_SNDBUF), as SOCK_USE_WRITE_QUEUE is set and therefore not invoked in sock_wfree(). Commit 4c3a5bdae293 ("sctp: Don't charge for data in sndbuf again when transmitting packet") fixed an issue where in case sctp_packet_transmit() manages to queue up more than sndbuf bytes, sctp_wait_for_sndbuf() will never be woken up again unless it is interrupted by a signal. However, a still remaining issue is that if net.sctp.sndbuf_policy=0, that is accounting per socket, and one-to-many sockets are in use, the reclaimed write space from sctp_wfree() is 'unfairly' handed back on the server to the association that is the lucky one to be woken up again via __sctp_write_space(), while the remaining associations are never be woken up again (unless by a signal). The effect disappears with net.sctp.sndbuf_policy=1, that is wmem accounting per association, as it guarantees a fair share of wmem among associations. Therefore, if we have reclaimed memory in case of per socket accounting, wake all related associations to a socket in a fair manner, that is, traverse the socket association list starting from the current neighbour of the association and issue a __sctp_write_space() to everyone until we end up waking ourselves. This guarantees that no association is preferred over another and even if more associations are taken into the one-to-many session, all receivers will get messages from the server and are not stalled forever on high load. This setting still leaves the advantage of per socket accounting in touch as an association can still use up global limits if unused by others. Fixes: 4eb701dfc618 ("[SCTP] Fix SCTP sendbuffer accouting.") Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Cc: Thomas Graf <tgraf@suug.ch> Cc: Neil Horman <nhorman@tuxdriver.com> Cc: Vlad Yasevich <vyasevic@redhat.com> Acked-by: Vlad Yasevich <vyasevic@redhat.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-14Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Conflicts: drivers/net/usb/r8152.c drivers/net/xen-netback/netback.c Both the r8152 and netback conflicts were simple overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-13net: sctp: remove NULL check in sctp_assoc_update_retran_pathDaniel Borkmann
This is basically just to let Coverity et al shut up. Remove an unneeded NULL check in sctp_assoc_update_retran_path(). It is safe to remove it, because in sctp_assoc_update_retran_path() we iterate over the list of transports, our own transport which is asoc->peer.retran_path included. In the iteration, we skip the list head element and transports in state SCTP_UNCONFIRMED. Such transports came from peer addresses received in INIT/INIT-ACK address parameters. They are not yet confirmed by a heartbeat and not available for data transfers. We know however that in the list of transports, even if it contains such elements, it at least contains our asoc->peer.retran_path as well, so even if next to that element, we only encounter SCTP_UNCONFIRMED transports, we are always going to fall back to asoc->peer.retran_path through sctp_trans_elect_best(), as that is for sure not SCTP_UNCONFIRMED as per fbdf501c9374 ("sctp: Do no select unconfirmed transports for retransmissions"). Whenever we call sctp_trans_elect_best() it will give us a non-NULL element back, and therefore when we break out of the loop, we are guaranteed to have a non-NULL transport pointer, and can remove the NULL check. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Reported-by: Dave Jones <davej@redhat.com> Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-05net: sctp: fix skb leakage in COOKIE ECHO path of chunk->auth_chunkDaniel Borkmann
While working on ec0223ec48a9 ("net: sctp: fix sctp_sf_do_5_1D_ce to verify if we/peer is AUTH capable"), we noticed that there's a skb memory leakage in the error path. Running the same reproducer as in ec0223ec48a9 and by unconditionally jumping to the error label (to simulate an error condition) in sctp_sf_do_5_1D_ce() receive path lets kmemleak detector bark about the unfreed chunk->auth_chunk skb clone: Unreferenced object 0xffff8800b8f3a000 (size 256): comm "softirq", pid 0, jiffies 4294769856 (age 110.757s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 89 ab 75 5e d4 01 58 13 00 00 00 00 00 00 00 00 ..u^..X......... backtrace: [<ffffffff816660be>] kmemleak_alloc+0x4e/0xb0 [<ffffffff8119f328>] kmem_cache_alloc+0xc8/0x210 [<ffffffff81566929>] skb_clone+0x49/0xb0 [<ffffffffa0467459>] sctp_endpoint_bh_rcv+0x1d9/0x230 [sctp] [<ffffffffa046fdbc>] sctp_inq_push+0x4c/0x70 [sctp] [<ffffffffa047e8de>] sctp_rcv+0x82e/0x9a0 [sctp] [<ffffffff815abd38>] ip_local_deliver_finish+0xa8/0x210 [<ffffffff815a64af>] nf_reinject+0xbf/0x180 [<ffffffffa04b4762>] nfqnl_recv_verdict+0x1d2/0x2b0 [nfnetlink_queue] [<ffffffffa04aa40b>] nfnetlink_rcv_msg+0x14b/0x250 [nfnetlink] [<ffffffff815a3269>] netlink_rcv_skb+0xa9/0xc0 [<ffffffffa04aa7cf>] nfnetlink_rcv+0x23f/0x408 [nfnetlink] [<ffffffff815a2bd8>] netlink_unicast+0x168/0x250 [<ffffffff815a2fa1>] netlink_sendmsg+0x2e1/0x3f0 [<ffffffff8155cc6b>] sock_sendmsg+0x8b/0xc0 [<ffffffff8155d449>] ___sys_sendmsg+0x369/0x380 What happens is that commit bbd0d59809f9 clones the skb containing the AUTH chunk in sctp_endpoint_bh_rcv() when having the edge case that an endpoint requires COOKIE-ECHO chunks to be authenticated: ---------- INIT[RANDOM; CHUNKS; HMAC-ALGO] ----------> <------- INIT-ACK[RANDOM; CHUNKS; HMAC-ALGO] --------- ------------------ AUTH; COOKIE-ECHO ----------------> <-------------------- COOKIE-ACK --------------------- When we enter sctp_sf_do_5_1D_ce() and before we actually get to the point where we process (and subsequently free) a non-NULL chunk->auth_chunk, we could hit the "goto nomem_init" path from an error condition and thus leave the cloned skb around w/o freeing it. The fix is to centrally free such clones in sctp_chunk_destroy() handler that is invoked from sctp_chunk_free() after all refs have dropped; and also move both kfree_skb(chunk->auth_chunk) there, so that chunk->auth_chunk is either NULL (since sctp_chunkify() allocs new chunks through kmem_cache_zalloc()) or non-NULL with a valid skb pointer. chunk->skb and chunk->auth_chunk are the only skbs in the sctp_chunk structure that need to be handeled. While at it, we should use consume_skb() for both. It is the same as dev_kfree_skb() but more appropriately named as we are not a device but a protocol. Also, this effectively replaces the kfree_skb() from both invocations into consume_skb(). Functions are the same only that kfree_skb() assumes that the frame was being dropped after a failure (e.g. for tools like drop monitor), usage of consume_skb() seems more appropriate in function sctp_chunk_destroy() though. Fixes: bbd0d59809f9 ("[SCTP]: Implement the receive and verification of AUTH chunk") Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Cc: Vlad Yasevich <yasevich@gmail.com> Cc: Neil Horman <nhorman@tuxdriver.com> Acked-by: Vlad Yasevich <vyasevich@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-05Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Conflicts: drivers/net/wireless/ath/ath9k/recv.c drivers/net/wireless/mwifiex/pcie.c net/ipv6/sit.c The SIT driver conflict consists of a bug fix being done by hand in 'net' (missing u64_stats_init()) whilst in 'net-next' a helper was created (netdev_alloc_pcpu_stats()) which takes care of this. The two wireless conflicts were overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-03net: sctp: fix sctp_sf_do_5_1D_ce to verify if we/peer is AUTH capableDaniel Borkmann
RFC4895 introduced AUTH chunks for SCTP; during the SCTP handshake RANDOM; CHUNKS; HMAC-ALGO are negotiated (CHUNKS being optional though): ---------- INIT[RANDOM; CHUNKS; HMAC-ALGO] ----------> <------- INIT-ACK[RANDOM; CHUNKS; HMAC-ALGO] --------- -------------------- COOKIE-ECHO --------------------> <-------------------- COOKIE-ACK --------------------- A special case is when an endpoint requires COOKIE-ECHO chunks to be authenticated: ---------- INIT[RANDOM; CHUNKS; HMAC-ALGO] ----------> <------- INIT-ACK[RANDOM; CHUNKS; HMAC-ALGO] --------- ------------------ AUTH; COOKIE-ECHO ----------------> <-------------------- COOKIE-ACK --------------------- RFC4895, section 6.3. Receiving Authenticated Chunks says: The receiver MUST use the HMAC algorithm indicated in the HMAC Identifier field. If this algorithm was not specified by the receiver in the HMAC-ALGO parameter in the INIT or INIT-ACK chunk during association setup, the AUTH chunk and all the chunks after it MUST be discarded and an ERROR chunk SHOULD be sent with the error cause defined in Section 4.1. [...] If no endpoint pair shared key has been configured for that Shared Key Identifier, all authenticated chunks MUST be silently discarded. [...] When an endpoint requires COOKIE-ECHO chunks to be authenticated, some special procedures have to be followed because the reception of a COOKIE-ECHO chunk might result in the creation of an SCTP association. If a packet arrives containing an AUTH chunk as a first chunk, a COOKIE-ECHO chunk as the second chunk, and possibly more chunks after them, and the receiver does not have an STCB for that packet, then authentication is based on the contents of the COOKIE-ECHO chunk. In this situation, the receiver MUST authenticate the chunks in the packet by using the RANDOM parameters, CHUNKS parameters and HMAC_ALGO parameters obtained from the COOKIE-ECHO chunk, and possibly a local shared secret as inputs to the authentication procedure specified in Section 6.3. If authentication fails, then the packet is discarded. If the authentication is successful, the COOKIE-ECHO and all the chunks after the COOKIE-ECHO MUST be processed. If the receiver has an STCB, it MUST process the AUTH chunk as described above using the STCB from the existing association to authenticate the COOKIE-ECHO chunk and all the chunks after it. [...] Commit bbd0d59809f9 introduced the possibility to receive and verification of AUTH chunk, including the edge case for authenticated COOKIE-ECHO. On reception of COOKIE-ECHO, the function sctp_sf_do_5_1D_ce() handles processing, unpacks and creates a new association if it passed sanity checks and also tests for authentication chunks being present. After a new association has been processed, it invokes sctp_process_init() on the new association and walks through the parameter list it received from the INIT chunk. It checks SCTP_PARAM_RANDOM, SCTP_PARAM_HMAC_ALGO and SCTP_PARAM_CHUNKS, and copies them into asoc->peer meta data (peer_random, peer_hmacs, peer_chunks) in case sysctl -w net.sctp.auth_enable=1 is set. If in INIT's SCTP_PARAM_SUPPORTED_EXT parameter SCTP_CID_AUTH is set, peer_random != NULL and peer_hmacs != NULL the peer is to be assumed asoc->peer.auth_capable=1, in any other case asoc->peer.auth_capable=0. Now, if in sctp_sf_do_5_1D_ce() chunk->auth_chunk is available, we set up a fake auth chunk and pass that on to sctp_sf_authenticate(), which at latest in sctp_auth_calculate_hmac() reliably dereferences a NULL pointer at position 0..0008 when setting up the crypto key in crypto_hash_setkey() by using asoc->asoc_shared_key that is NULL as condition key_id == asoc->active_key_id is true if the AUTH chunk was injected correctly from remote. This happens no matter what net.sctp.auth_enable sysctl says. The fix is to check for net->sctp.auth_enable and for asoc->peer.auth_capable before doing any operations like sctp_sf_authenticate() as no key is activated in sctp_auth_asoc_init_active_key() for each case. Now as RFC4895 section 6.3 states that if the used HMAC-ALGO passed from the INIT chunk was not used in the AUTH chunk, we SHOULD send an error; however in this case it would be better to just silently discard such a maliciously prepared handshake as we didn't even receive a parameter at all. Also, as our endpoint has no shared key configured, section 6.3 says that MUST silently discard, which we are doing from now onwards. Before calling sctp_sf_pdiscard(), we need not only to free the association, but also the chunk->auth_chunk skb, as commit bbd0d59809f9 created a skb clone in that case. I have tested this locally by using netfilter's nfqueue and re-injecting packets into the local stack after maliciously modifying the INIT chunk (removing RANDOM; HMAC-ALGO param) and the SCTP packet containing the COOKIE_ECHO (injecting AUTH chunk before COOKIE_ECHO). Fixed with this patch applied. Fixes: bbd0d59809f9 ("[SCTP]: Implement the receive and verification of AUTH chunk") Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Cc: Vlad Yasevich <yasevich@gmail.com> Cc: Neil Horman <nhorman@tuxdriver.com> Acked-by: Vlad Yasevich <vyasevich@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-22net: sctp: rework multihoming retransmission path selection to rfc4960Daniel Borkmann
Problem statement: 1) both paths (primary path1 and alternate path2) are up after the association has been established i.e., HB packets are normally exchanged, 2) path2 gets inactive after path_max_retrans * max_rto timed out (i.e. path2 is down completely), 3) now, if a transmission times out on the only surviving/active path1 (any ~1sec network service impact could cause this like a channel bonding failover), then the retransmitted packets are sent over the inactive path2; this happens with partial failover and without it. Besides not being optimal in the above scenario, a small failure or timeout in the only existing path has the potential to cause long delays in the retransmission (depending on RTO_MAX) until the still active path is reselected. Further, when the T3-timeout occurs, we have active_patch == retrans_path, and even though the timeout occurred on the initial transmission of data, not a retransmit, we end up updating retransmit path. RFC4960, section 6.4. "Multi-Homed SCTP Endpoints" states under 6.4.1. "Failover from an Inactive Destination Address" the following: Some of the transport addresses of a multi-homed SCTP endpoint may become inactive due to either the occurrence of certain error conditions (see Section 8.2) or adjustments from the SCTP user. When there is outbound data to send and the primary path becomes inactive (e.g., due to failures), or where the SCTP user explicitly requests to send data to an inactive destination transport address, before reporting an error to its ULP, the SCTP endpoint should try to send the data to an alternate __active__ destination transport address if one exists. When retransmitting data that timed out, if the endpoint is multihomed, it should consider each source-destination address pair in its retransmission selection policy. When retransmitting timed-out data, the endpoint should attempt to pick the most divergent source-destination pair from the original source-destination pair to which the packet was transmitted. Note: Rules for picking the most divergent source-destination pair are an implementation decision and are not specified within this document. So, we should first reconsider to take the current active retransmission transport if we cannot find an alternative active one. If all of that fails, we can still round robin through unkown, partial failover, and inactive ones in the hope to find something still suitable. Commit 4141ddc02a92 ("sctp: retran_path update bug fix") broke that behaviour by selecting the next inactive transport when no other active transport was found besides the current assoc's peer.retran_path. Before commit 4141ddc02a92, we would have traversed through the list until we reach our peer.retran_path again, and in case that is still in state SCTP_ACTIVE, we would take it and return. Only if that is not the case either, we take the next inactive transport. Besides all that, another issue is that transports in state SCTP_UNKNOWN could be preferred over transports in state SCTP_ACTIVE in case a SCTP_ACTIVE transport appears after SCTP_UNKNOWN in the transport list yielding a weaker transport state to be used in retransmission. This patch mostly reverts 4141ddc02a92, but also rewrites this function to introduce more clarity and strictness into the code. A strict priority of transport states is enforced in this patch, hence selection is active > unkown > partial failover > inactive. Fixes: 4141ddc02a92 ("sctp: retran_path update bug fix") Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Cc: Gui Jianfeng <guijianfeng@cn.fujitsu.com> Acked-by: Vlad Yasevich <yasevich@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-20net: sctp: Potentially-Failed state should not be reached from unconfirmed stateMatija Glavinic Pecotic
In current implementation it is possible to reach PF state from unconfirmed. We can interpret sctp-failover-02 in a way that PF state is meant to be reached only from active state, in the end, this is when entering PF state makes sense. Here are few quotes from sctp-failover-02, but regardless of these, same understanding can be reached from whole section 5: Section 5.1, quickfailover guide: "The PF state is an intermediate state between Active and Failed states." "Each time the T3-rtx timer expires on an active or idle destination, the error counter of that destination address will be incremented. When the value in the error counter exceeds PFMR, the endpoint should mark the destination transport address as PF." There are several concrete reasons for such interpretation. For start, rfc4960 does not take into concern quickfailover algorithm. Therefore, quickfailover must comply to 4960. Point where this compliance can be argued is following behavior: When PF is entered, association overall error counter is incremented for each missed HB. This is contradictory to rfc4960, as address, while in unconfirmed state, is subjected to probing, and while it is probed, it should not increment association overall error counter. This has as a consequence that we might end up in situation in which we drop association due path failure on unconfirmed address, in case we have wrong configuration in a way: Association.Max.Retrans == Path.Max.Retrans. Another reason is that entering PF from unconfirmed will cause a loss of address confirmed event when address is once (if) confirmed. This is fine from failover guide point of view, but it is not consistent with behavior preceding failover implementation and recommendation from 4960: 5.4. Path Verification Whenever a path is confirmed, an indication MAY be given to the upper layer. Signed-off-by: Matija Glavinic Pecotic <matija.glavinic-pecotic.ext@nsn.com> Acked-by: Vlad Yasevich <vyasevich@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-19Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Conflicts: drivers/net/bonding/bond_3ad.h drivers/net/bonding/bond_main.c Two minor conflicts in bonding, both of which were overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-18net: sctp: fix sctp_connectx abi for ia32 emulation/compat modeDaniel Borkmann
SCTP's sctp_connectx() abi breaks for 64bit kernels compiled with 32bit emulation (e.g. ia32 emulation or x86_x32). Due to internal usage of 'struct sctp_getaddrs_old' which includes a struct sockaddr pointer, sizeof(param) check will always fail in kernel as the structure in 64bit kernel space is 4bytes larger than for user binaries compiled in 32bit mode. Thus, applications making use of sctp_connectx() won't be able to run under such circumstances. Introduce a compat interface in the kernel to deal with such situations by using a 'struct compat_sctp_getaddrs_old' structure where user data is copied into it, and then sucessively transformed into a 'struct sctp_getaddrs_old' structure with the help of compat_ptr(). That fixes sctp_connectx() abi without any changes needed in user space, and lets the SCTP test suite pass when compiled in 32bit and run on 64bit kernels. Fixes: f9c67811ebc0 ("sctp: Fix regression introduced by new sctp_connectx api") Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Acked-by: Vlad Yasevich <vyasevich@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-17net: sctp: Fix a_rwnd/rwnd management to reflect real state of the ↵Matija Glavinic Pecotic
receiver's buffer Implementation of (a)rwnd calculation might lead to severe performance issues and associations completely stalling. These problems are described and solution is proposed which improves lksctp's robustness in congestion state. 1) Sudden drop of a_rwnd and incomplete window recovery afterwards Data accounted in sctp_assoc_rwnd_decrease takes only payload size (sctp data), but size of sk_buff, which is blamed against receiver buffer, is not accounted in rwnd. Theoretically, this should not be the problem as actual size of buffer is double the amount requested on the socket (SO_RECVBUF). Problem here is that this will have bad scaling for data which is less then sizeof sk_buff. E.g. in 4G (LTE) networks, link interfacing radio side will have a large portion of traffic of this size (less then 100B). An example of sudden drop and incomplete window recovery is given below. Node B exhibits problematic behavior. Node A initiates association and B is configured to advertise rwnd of 10000. A sends messages of size 43B (size of typical sctp message in 4G (LTE) network). On B data is left in buffer by not reading socket in userspace. Lets examine when we will hit pressure state and declare rwnd to be 0 for scenario with above stated parameters (rwnd == 10000, chunk size == 43, each chunk is sent in separate sctp packet) Logic is implemented in sctp_assoc_rwnd_decrease: socket_buffer (see below) is maximum size which can be held in socket buffer (sk_rcvbuf). current_alloced is amount of data currently allocated (rx_count) A simple expression is given for which it will be examined after how many packets for above stated parameters we enter pressure state: We start by condition which has to be met in order to enter pressure state: socket_buffer < currently_alloced; currently_alloced is represented as size of sctp packets received so far and not yet delivered to userspace. x is the number of chunks/packets (since there is no bundling, and each chunk is delivered in separate packet, we can observe each chunk also as sctp packet, and what is important here, having its own sk_buff): socket_buffer < x*each_sctp_packet; each_sctp_packet is sctp chunk size + sizeof(struct sk_buff). socket_buffer is twice the amount of initially requested size of socket buffer, which is in case of sctp, twice the a_rwnd requested: 2*rwnd < x*(payload+sizeof(struc sk_buff)); sizeof(struct sk_buff) is 190 (3.13.0-rc4+). Above is stated that rwnd is 10000 and each payload size is 43 20000 < x(43+190); x > 20000/233; x ~> 84; After ~84 messages, pressure state is entered and 0 rwnd is advertised while received 84*43B ~= 3612B sctp data. This is why external observer notices sudden drop from 6474 to 0, as it will be now shown in example: IP A.34340 > B.12345: sctp (1) [INIT] [init tag: 1875509148] [rwnd: 81920] [OS: 10] [MIS: 65535] [init TSN: 1096057017] IP B.12345 > A.34340: sctp (1) [INIT ACK] [init tag: 3198966556] [rwnd: 10000] [OS: 10] [MIS: 10] [init TSN: 902132839] IP A.34340 > B.12345: sctp (1) [COOKIE ECHO] IP B.12345 > A.34340: sctp (1) [COOKIE ACK] IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057017] [SID: 0] [SSEQ 0] [PPID 0x18] IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057017] [a_rwnd 9957] [#gap acks 0] [#dup tsns 0] IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057018] [SID: 0] [SSEQ 1] [PPID 0x18] IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057018] [a_rwnd 9957] [#gap acks 0] [#dup tsns 0] IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057019] [SID: 0] [SSEQ 2] [PPID 0x18] IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057019] [a_rwnd 9914] [#gap acks 0] [#dup tsns 0] <...> IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057098] [SID: 0] [SSEQ 81] [PPID 0x18] IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057098] [a_rwnd 6517] [#gap acks 0] [#dup tsns 0] IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057099] [SID: 0] [SSEQ 82] [PPID 0x18] IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057099] [a_rwnd 6474] [#gap acks 0] [#dup tsns 0] IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057100] [SID: 0] [SSEQ 83] [PPID 0x18] --> Sudden drop IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057100] [a_rwnd 0] [#gap acks 0] [#dup tsns 0] At this point, rwnd_press stores current rwnd value so it can be later restored in sctp_assoc_rwnd_increase. This however doesn't happen as condition to start slowly increasing rwnd until rwnd_press is returned to rwnd is never met. This condition is not met since rwnd, after it hit 0, must first reach rwnd_press by adding amount which is read from userspace. Let us observe values in above example. Initial a_rwnd is 10000, pressure was hit when rwnd was ~6500 and the amount of actual sctp data currently waiting to be delivered to userspace is ~3500. When userspace starts to read, sctp_assoc_rwnd_increase will be blamed only for sctp data, which is ~3500. Condition is never met, and when userspace reads all data, rwnd stays on 3569. IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057100] [a_rwnd 1505] [#gap acks 0] [#dup tsns 0] IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057100] [a_rwnd 3010] [#gap acks 0] [#dup tsns 0] IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057101] [SID: 0] [SSEQ 84] [PPID 0x18] IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057101] [a_rwnd 3569] [#gap acks 0] [#dup tsns 0] --> At this point userspace read everything, rwnd recovered only to 3569 IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057102] [SID: 0] [SSEQ 85] [PPID 0x18] IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057102] [a_rwnd 3569] [#gap acks 0] [#dup tsns 0] Reproduction is straight forward, it is enough for sender to send packets of size less then sizeof(struct sk_buff) and receiver keeping them in its buffers. 2) Minute size window for associations sharing the same socket buffer In case multiple associations share the same socket, and same socket buffer (sctp.rcvbuf_policy == 0), different scenarios exist in which congestion on one of the associations can permanently drop rwnd of other association(s). Situation will be typically observed as one association suddenly having rwnd dropped to size of last packet received and never recovering beyond that point. Different scenarios will lead to it, but all have in common that one of the associations (let it be association from 1)) nearly depleted socket buffer, and the other association blames socket buffer just for the amount enough to start the pressure. This association will enter pressure state, set rwnd_press and announce 0 rwnd. When data is read by userspace, similar situation as in 1) will occur, rwnd will increase just for the size read by userspace but rwnd_press will be high enough so that association doesn't have enough credit to reach rwnd_press and restore to previous state. This case is special case of 1), being worse as there is, in the worst case, only one packet in buffer for which size rwnd will be increased. Consequence is association which has very low maximum rwnd ('minute size', in our case down to 43B - size of packet which caused pressure) and as such unusable. Scenario happened in the field and labs frequently after congestion state (link breaks, different probabilities of packet drop, packet reordering) and with scenario 1) preceding. Here is given a deterministic scenario for reproduction: >From node A establish two associations on the same socket, with rcvbuf_policy being set to share one common buffer (sctp.rcvbuf_policy == 0). On association 1 repeat scenario from 1), that is, bring it down to 0 and restore up. Observe scenario 1). Use small payload size (here we use 43). Once rwnd is 'recovered', bring it down close to 0, as in just one more packet would close it. This has as a consequence that association number 2 is able to receive (at least) one more packet which will bring it in pressure state. E.g. if association 2 had rwnd of 10000, packet received was 43, and we enter at this point into pressure, rwnd_press will have 9957. Once payload is delivered to userspace, rwnd will increase for 43, but conditions to restore rwnd to original state, just as in 1), will never be satisfied. --> Association 1, between A.y and B.12345 IP A.55915 > B.12345: sctp (1) [INIT] [init tag: 836880897] [rwnd: 10000] [OS: 10] [MIS: 65535] [init TSN: 4032536569] IP B.12345 > A.55915: sctp (1) [INIT ACK] [init tag: 2873310749] [rwnd: 81920] [OS: 10] [MIS: 10] [init TSN: 3799315613] IP A.55915 > B.12345: sctp (1) [COOKIE ECHO] IP B.12345 > A.55915: sctp (1) [COOKIE ACK] --> Association 2, between A.z and B.12346 IP A.55915 > B.12346: sctp (1) [INIT] [init tag: 534798321] [rwnd: 10000] [OS: 10] [MIS: 65535] [init TSN: 2099285173] IP B.12346 > A.55915: sctp (1) [INIT ACK] [init tag: 516668823] [rwnd: 81920] [OS: 10] [MIS: 10] [init TSN: 3676403240] IP A.55915 > B.12346: sctp (1) [COOKIE ECHO] IP B.12346 > A.55915: sctp (1) [COOKIE ACK] --> Deplete socket buffer by sending messages of size 43B over association 1 IP B.12345 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3799315613] [SID: 0] [SSEQ 0] [PPID 0x18] IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315613] [a_rwnd 9957] [#gap acks 0] [#dup tsns 0] <...> IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315696] [a_rwnd 6388] [#gap acks 0] [#dup tsns 0] IP B.12345 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3799315697] [SID: 0] [SSEQ 84] [PPID 0x18] IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315697] [a_rwnd 6345] [#gap acks 0] [#dup tsns 0] --> Sudden drop on 1 IP B.12345 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3799315698] [SID: 0] [SSEQ 85] [PPID 0x18] IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315698] [a_rwnd 0] [#gap acks 0] [#dup tsns 0] --> Here userspace read, rwnd 'recovered' to 3698, now deplete again using association 1 so there is place in buffer for only one more packet IP B.12345 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3799315799] [SID: 0] [SSEQ 186] [PPID 0x18] IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315799] [a_rwnd 86] [#gap acks 0] [#dup tsns 0] IP B.12345 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3799315800] [SID: 0] [SSEQ 187] [PPID 0x18] IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315800] [a_rwnd 43] [#gap acks 0] [#dup tsns 0] --> Socket buffer is almost depleted, but there is space for one more packet, send them over association 2, size 43B IP B.12346 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3676403240] [SID: 0] [SSEQ 0] [PPID 0x18] IP A.55915 > B.12346: sctp (1) [SACK] [cum ack 3676403240] [a_rwnd 0] [#gap acks 0] [#dup tsns 0] --> Immediate drop IP A.60995 > B.12346: sctp (1) [SACK] [cum ack 387491510] [a_rwnd 0] [#gap acks 0] [#dup tsns 0] --> Read everything from the socket, both association recover up to maximum rwnd they are capable of reaching, note that association 1 recovered up to 3698, and association 2 recovered only to 43 IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315800] [a_rwnd 1548] [#gap acks 0] [#dup tsns 0] IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315800] [a_rwnd 3053] [#gap acks 0] [#dup tsns 0] IP B.12345 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3799315801] [SID: 0] [SSEQ 188] [PPID 0x18] IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315801] [a_rwnd 3698] [#gap acks 0] [#dup tsns 0] IP B.12346 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3676403241] [SID: 0] [SSEQ 1] [PPID 0x18] IP A.55915 > B.12346: sctp (1) [SACK] [cum ack 3676403241] [a_rwnd 43] [#gap acks 0] [#dup tsns 0] A careful reader might wonder why it is necessary to reproduce 1) prior reproduction of 2). It is simply easier to observe when to send packet over association 2 which will push association into the pressure state. Proposed solution: Both problems share the same root cause, and that is improper scaling of socket buffer with rwnd. Solution in which sizeof(sk_buff) is taken into concern while calculating rwnd is not possible due to fact that there is no linear relationship between amount of data blamed in increase/decrease with IP packet in which payload arrived. Even in case such solution would be followed, complexity of the code would increase. Due to nature of current rwnd handling, slow increase (in sctp_assoc_rwnd_increase) of rwnd after pressure state is entered is rationale, but it gives false representation to the sender of current buffer space. Furthermore, it implements additional congestion control mechanism which is defined on implementation, and not on standard basis. Proposed solution simplifies whole algorithm having on mind definition from rfc: o Receiver Window (rwnd): This gives the sender an indication of the space available in the receiver's inbound buffer. Core of the proposed solution is given with these lines: sctp_assoc_rwnd_update: if ((asoc->base.sk->sk_rcvbuf - rx_count) > 0) asoc->rwnd = (asoc->base.sk->sk_rcvbuf - rx_count) >> 1; else asoc->rwnd = 0; We advertise to sender (half of) actual space we have. Half is in the braces depending whether you would like to observe size of socket buffer as SO_RECVBUF or twice the amount, i.e. size is the one visible from userspace, that is, from kernelspace. In this way sender is given with good approximation of our buffer space, regardless of the buffer policy - we always advertise what we have. Proposed solution fixes described problems and removes necessity for rwnd restoration algorithm. Finally, as proposed solution is simplification, some lines of code, along with some bytes in struct sctp_association are saved. Version 2 of the patch addressed comments from Vlad. Name of the function is set to be more descriptive, and two parts of code are changed, in one removing the superfluous call to sctp_assoc_rwnd_update since call would not result in update of rwnd, and the other being reordering of the code in a way that call to sctp_assoc_rwnd_update updates rwnd. Version 3 corrected change introduced in v2 in a way that existing function is not reordered/copied in line, but it is correctly called. Thanks Vlad for suggesting. Signed-off-by: Matija Glavinic Pecotic <matija.glavinic-pecotic.ext@nsn.com> Reviewed-by: Alexander Sverdlin <alexander.sverdlin@nsn.com> Acked-by: Vlad Yasevich <vyasevich@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-13net: remove unnecessary return'sstephen hemminger
One of my pet coding style peeves is the practice of adding extra return; at the end of function. Kill several instances of this in network code. I suppose some coccinelle wizardy could do this automatically. Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-13sctp: optimize the sctp_sysctl_net_registerwangweidong
Here, when the net is init_net, we needn't to kmemdup the ctl_table again. So add a check for net. Also we can save some memory. Signed-off-by: Wang Weidong <wangweidong1@huawei.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-13sctp: fix a missed .data initializationwangweidong
As commit 3c68198e75111a90("sctp: Make hmac algorithm selection for cookie generation dynamic"), we miss the .data initialization. If we don't use the net_namespace, the problem that parts of the sysctl configuration won't be isolation and won't occur. In sctp_sysctl_net_register(), we register the sysctl for each net, in the for(), we use the 'table[i].data' as check condition, so when the 'i' is the index of sctp_hmac_alg, the data is NULL, then break. So add the .data initialization. Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: Wang Weidong <wangweidong1@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-06net: sctp: fix initialization of local source address on accepted ipv6 socketsMatija Glavinic Pecotic
commit efe4208f47f907b86f528788da711e8ab9dea44d: 'ipv6: make lookups simpler and faster' broke initialization of local source address on accepted ipv6 sockets. Before the mentioned commit receive address was copied along with the contents of ipv6_pinfo in sctp_v6_create_accept_sk. Now when it is moved, it has to be copied separately. This also fixes lksctp's ipv6 regression in a sense that test_getname_v6, TC5 - 'getsockname on a connected server socket' now passes. Signed-off-by: Matija Glavinic Pecotic <matija.glavinic-pecotic.ext@nsn.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-21sctp: remove macros sctp_bh_[un]lock_sockwangweidong
Redefined bh_[un]lock_sock to sctp_bh[un]lock_sock for user space friendly code which we haven't use in years, so removing them. Signed-off-by: Wang Weidong <wangweidong1@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-21sctp: remove macros sctp_{lock|release}_sockwangweidong
Redefined {lock|release}_sock to sctp_{lock|release}_sock for user space friendly code which we haven't use in years, so removing them. Signed-off-by: Wang Weidong <wangweidong1@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-21sctp: remove macros sctp_write_[un]_lockwangweidong
Redefined write_[un]lock to sctp_write_[un]lock for user space friendly code which we haven't use in years, so removing them. Signed-off-by: Wang Weidong <wangweidong1@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-21sctp: remove macros sctp_spin_[un]lockwangweidong
Redefined spin_[un]lock to sctp_spin_[un]lock for user space friendly code which we haven't use in years, so removing them. Signed-off-by: Wang Weidong <wangweidong1@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-21sctp: remove macros sctp_local_bh_{disable|enable}wangweidong
Redefined local_bh_{disable|enable} to sctp_local_bh_{disable|enable} for user space friendly code which we haven't use in years, so removing them. Signed-off-by: Wang Weidong <wangweidong1@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-16sctp: remove the unnecessary assignmentwangweidong
When go the right path, the status is 0, no need to assign it again. So just remove the assignment. Signed-off-by: Wang Weidong <wangweidong1@huawei.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-15sctp: create helper function to enable|disable sackdelaywangweidong
add sctp_spp_sackdelay_{enable|disable} helper function for avoiding code duplication. Signed-off-by: Wang Weidong <wangweidong1@huawei.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-14sctp: remove a redundant NULL checkDan Carpenter
It confuses Smatch when we check "sinit" for NULL and then non-NULL and that causes a false positive warning later. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-14net: replace macros net_random and net_srandom with direct calls to prandomAruna-Hewapathirane
This patch removes the net_random and net_srandom macros and replaces them with direct calls to the prandom ones. As new commits only seem to use prandom_u32 there is no use to keep them around. This change makes it easier to grep for users of prandom_u32. Signed-off-by: Aruna-Hewapathirane <aruna.hewapathirane@gmail.com> Suggested-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-13sctp: make sctp_addto_chunk_fixed localstephen hemminger
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-13ipv4: introduce hardened ip_no_pmtu_disc modeHannes Frederic Sowa
This new ip_no_pmtu_disc mode only allowes fragmentation-needed errors to be honored by protocols which do more stringent validation on the ICMP's packet payload. This knob is useful for people who e.g. want to run an unmodified DNS server in a namespace where they need to use pmtu for TCP connections (as they are used for zone transfers or fallback for requests) but don't want to use possibly spoofed UDP pmtu information. Currently the whitelisted protocols are TCP, SCTP and DCCP as they check if the returned packet is in the window or if the association is valid. Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: David Miller <davem@davemloft.net> Cc: John Heffner <johnwheffner@gmail.com> Suggested-by: Florian Weimer <fweimer@redhat.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-06Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Conflicts: drivers/net/ethernet/qlogic/qlcnic/qlcnic_sriov_pf.c net/ipv6/ip6_tunnel.c net/ipv6/ip6_vti.c ipv6 tunnel statistic bug fixes conflicting with consolidation into generic sw per-cpu net stats. qlogic conflict between queue counting bug fix and the addition of multiple MAC address support. Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-03sctp: Add process name and pid to deprecation warningsNeil Horman
Recently I updated the sctp socket option deprecation warnings to be both a bit more clear and ratelimited to prevent user processes from spamming the log file. Ben Hutchings suggested that I add the process name and pid to these warnings so that users can tell who is responsible for using the deprecated apis. This patch accomplishes that. Signed-off-by: Neil Horman <nhorman@tuxdriver.com> CC: Vlad Yasevich <vyasevich@gmail.com> CC: Ben Hutchings <bhutchings@solarflare.com> CC: "David S. Miller" <davem@davemloft.net> CC: netdev@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-02sctp: Remove outqueue empty stateVlad Yasevich
The SCTP outqueue structure maintains a data chunks that are pending transmission, the list of chunks that are pending a retransmission and a length of data in flight. It also tries to keep the emtpy state so that it can performe shutdown sequence or notify user. The problem is that the empy state is inconsistently tracked. It is possible to completely drain the queue without sending anything when using PR-SCTP. In this case, the empty state will not be correctly state as report by Jamal Hadi Salim <jhs@mojatatu.com>. This can cause an association to be perminantly stuck in the SHUTDOWN_PENDING state. Additionally, SCTP is incredibly inefficient when setting the empty state. Even though all the data is availaible in the outqueue structure, we ignore it and walk a list of trasnports. In the end, we can completely remove the extra empty state and figure out if the queue is empty by looking at 3 things: length of pending data, length of in-flight data, and exisiting of retransmit data. All of these are already in the strucutre. Reported-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Vlad Yasevich <vyasevich@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Tested-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-31sctp: move skb_dst_set() a bit downwards in sctp_packet_transmit()wangweidong
skb_dst_set will use dst, if dst is NULL although is not a problem, then goto the 'no_route' and free nskb, so do the skb_dst_set is pointless. so move the skb_dst_set after dst check. Remove the unnecessary initialization as well. v2: fix the subject line because it would confuse people, as pointed out by Daniel. Signed-off-by: Wang Weidong <wangweidong1@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-31SCTP: Reduce log spamming for sctp setsockoptNeil Horman
During a recent discussion regarding some sctp socket options, it was noted that we have several points at which we issue log warnings that can be flooded at an unbounded rate by any user. Fix this by converting all the pr_warns in the sctp_setsockopt path to be pr_warn_ratelimited. Note there are several debug level messages as well. I'm leaving those alone, as, if you turn on pr_debug, you likely want lots of verbosity. Signed-off-by: Neil Horman <nhorman@tuxdriver.com> CC: Vlad Yasevich <vyasevich@gmail.com> CC: David Miller <davem@davemloft.net> CC: netdev@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-26sctp: fix checkpatch errors with //commenwangweidong
Signed-off-by: Wang Weidong <wangweidong1@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-26sctp: fix checkpatch errors with open brace '{' and trailing statementswangweidong
fix checkpatch errors below: ERROR: that open brace { should be on the previous line ERROR: open brace '{' following function declarations go on the next line ERROR: trailing statements should be on next line Signed-off-by: Wang Weidong <wangweidong1@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-26sctp: fix checkpatch errors with indentwangweidong
fix checkpatch errors below: ERROR: switch and case should be at the same inden ERROR: code indent should use tabs where possible Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: Wang Weidong <wangweidong1@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-26sctp: fix checkpatch errors with (foo*)|foo * bar|foo* barwangweidong
fix checkpatch errors below: ERROR: "(foo*)" should be "(foo *)" ERROR: "foo * bar" should be "foo *bar" ERROR: "foo* bar" should be "foo *bar" Signed-off-by: Wang Weidong <wangweidong1@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-26sctp: fix checkpatch errors with space required or prohibitedwangweidong
fix checkpatch errors while the space is required or prohibited to the "=,()++..." Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: Wang Weidong <wangweidong1@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-22sctp: remove the never used 'return' and redundant 'break'wangweidong
In switch() had do return, and never use the 'return NULL'. The 'break' after return or goto has no effect. Remove it. v2: make it more readable as suggested by Neil. Signed-off-by: Wang Weidong <wangweidong1@huawei.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-22sctp: remove casting from function calls through ops structurewangweidong
remove the unnecessary cast. Signed-off-by: Wang Weidong <wangweidong1@huawei.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-21ipv6: move ip6_sk_accept_pmtu from generic pmtu update path to ipv6 oneHannes Frederic Sowa
In commit 93b36cf3425b9b ("ipv6: support IPV6_PMTU_INTERFACE on sockets") I made a horrible mistake to add ip6_sk_accept_pmtu to the generic sctp_icmp_frag_needed path. This results in build warnings if IPv6 is disabled which were luckily caught by Fengguang's kbuild bot. But it also leads to a kernel panic IPv4 frag-needed packet is received. Reported-by: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Acked-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>