summaryrefslogtreecommitdiff
path: root/net/tipc
AgeCommit message (Collapse)Author
2016-05-04Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Conflicts: net/ipv4/ip_gre.c Minor conflicts between tunnel bug fixes in net and ipv6 tunnel cleanups in net-next. Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-03tipc: redesign connection-level flow controlJon Paul Maloy
There are two flow control mechanisms in TIPC; one at link level that handles network congestion, burst control, and retransmission, and one at connection level which' only remaining task is to prevent overflow in the receiving socket buffer. In TIPC, the latter task has to be solved end-to-end because messages can not be thrown away once they have been accepted and delivered upwards from the link layer, i.e, we can never permit the receive buffer to overflow. Currently, this algorithm is message based. A counter in the receiving socket keeps track of number of consumed messages, and sends a dedicated acknowledge message back to the sender for each 256 consumed message. A counter at the sending end keeps track of the sent, not yet acknowledged messages, and blocks the sender if this number ever reaches 512 unacknowledged messages. When the missing acknowledge arrives, the socket is then woken up for renewed transmission. This works well for keeping the message flow running, as it almost never happens that a sender socket is blocked this way. A problem with the current mechanism is that it potentially is very memory consuming. Since we don't distinguish between small and large messages, we have to dimension the socket receive buffer according to a worst-case of both. I.e., the window size must be chosen large enough to sustain a reasonable throughput even for the smallest messages, while we must still consider a scenario where all messages are of maximum size. Hence, the current fix window size of 512 messages and a maximum message size of 66k results in a receive buffer of 66 MB when truesize(66k) = 131k is taken into account. It is possible to do much better. This commit introduces an algorithm where we instead use 1024-byte blocks as base unit. This unit, always rounded upwards from the actual message size, is used when we advertise windows as well as when we count and acknowledge transmitted data. The advertised window is based on the configured receive buffer size in such a way that even the worst-case truesize/msgsize ratio always is covered. Since the smallest possible message size (from a flow control viewpoint) now is 1024 bytes, we can safely assume this ratio to be less than four, which is the value we are now using. This way, we have been able to reduce the default receive buffer size from 66 MB to 2 MB with maintained performance. In order to keep this solution backwards compatible, we introduce a new capability bit in the discovery protocol, and use this throughout the message sending/reception path to always select the right unit. Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-03tipc: propagate peer node capabilities to socket layerJon Paul Maloy
During neighbor discovery, nodes advertise their capabilities as a bit map in a dedicated 16-bit field in the discovery message header. This bit map has so far only be stored in the node structure on the peer nodes, but we now see the need to keep a copy even in the socket structure. This commit adds this functionality. Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-03tipc: re-enable compensation for socket receive buffer double countingJon Paul Maloy
In the refactoring commit d570d86497ee ("tipc: enqueue arrived buffers in socket in separate function") we did by accident replace the test if (sk->sk_backlog.len == 0) atomic_set(&tsk->dupl_rcvcnt, 0); with if (sk->sk_backlog.len) atomic_set(&tsk->dupl_rcvcnt, 0); This effectively disables the compensation we have for the double receive buffer accounting that occurs temporarily when buffers are moved from the backlog to the socket receive queue. Until now, this has gone unnoticed because of the large receive buffer limits we are applying, but becomes indispensable when we reduce this buffer limit later in this series. We now fix this by inverting the mentioned condition. Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-01tipc: only process unicast on intended nodeHamish Martin
We have observed complete lock up of broadcast-link transmission due to unacknowledged packets never being removed from the 'transmq' queue. This is traced to nodes having their ack field set beyond the sequence number of packets that have actually been transmitted to them. Consider an example where node 1 has sent 10 packets to node 2 on a link and node 3 has sent 20 packets to node 2 on another link. We see examples of an ack from node 2 destined for node 3 being treated as an ack from node 2 at node 1. This leads to the ack on the node 1 to node 2 link being increased to 20 even though we have only sent 10 packets. When node 1 does get around to sending further packets, none of the packets with sequence numbers less than 21 are actually removed from the transmq. To resolve this we reinstate some code lost in commit d999297c3dbb ("tipc: reduce locking scope during packet reception") which ensures that only messages destined for the receiving node are processed by that node. This prevents the sequence numbers from getting out of sync and resolves the packet leakage, thereby resolving the broadcast-link transmission lock-ups we observed. While we are aware that this change only patches over a root problem that we still haven't identified, this is a sanity test that it is always legitimate to do. It will remain in the code even after we identify and fix the real problem. Reviewed-by: Chris Packham <chris.packham@alliedtelesis.co.nz> Reviewed-by: John Thompson <john.thompson@alliedtelesis.co.nz> Signed-off-by: Hamish Martin <hamish.martin@alliedtelesis.co.nz> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-01tipc: set 'active' state correctly for first established linkJon Paul Maloy
When we are displaying statistics for the first link established between two peers, it will always be presented as STANDBY although it in reality is ACTIVE. This happens because we forget to set the 'active' flag in the link instance at the moment it is established. Although this is a bug, it only has impact on the presentation view of the link, not on its actual functionality. Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-28tipc: remove an unnecessary NULL checkDan Carpenter
This is never called with a NULL "buf" and anyway, we dereference 's' on the lines before so it would Oops before we reach the check. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-24tipc: fix stale links after re-enabling bearerParthasarathy Bhuvaragan
Commit 42b18f605fea ("tipc: refactor function tipc_link_timeout()"), introduced a bug which prevents sending of probe messages during link synchronization phase. This leads to hanging links, if the bearer is disabled/enabled after links are up. In this commit, we send the probe messages correctly. Fixes: 42b18f605fea ("tipc: refactor function tipc_link_timeout()") Acked-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-23Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Conflicts were two cases of simple overlapping changes, nothing serious. In the UDP case, we need to add a hlist_add_tail_rcu() to linux/rculist.h, because we've moved UDP socket handling away from using nulls lists. Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-18treewide: Fix typos in printkMasanari Iida
This patch fix spelling typos found in printk within various part of the kernel sources. Signed-off-by: Masanari Iida <standby24x7@gmail.com> Acked-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2016-04-15tipc: let first message on link be a state messageJon Paul Maloy
According to the link FSM, a received traffic packet can take a link from state ESTABLISHING to ESTABLISHED, but the link can still not be fully set up in one atomic operation. This means that even if the the very first packet on the link is a traffic packet with sequence number 1 (one), it has to be dropped and retransmitted. This can be avoided if we let the mentioned packet be preceded by a LINK_PROTOCOL/STATE message, which takes up the endpoint before the arrival of the traffic. We add this small feature in this commit. This is a fully compatible change. Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-15tipc: ensure that first packets on link are sent in orderJon Paul Maloy
In some link establishment scenarios we see that packet #2 may be sent out before packet #1, forcing the receiver to demand retransmission of the missing packet. This is harmless, but may cause confusion among people tracing the packet flow. Since this is extremely easy to fix, we do so by adding en extra send call to the bearer immediately after the link has come up. Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-15tipc: refactor function tipc_link_timeout()Jon Paul Maloy
The function tipc_link_timeout() is unnecessary complex, and can easily be made more readable. We do that with this commit. The only functional change is that we remove a redundant test for whether the broadcast link is up or not. Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-15tipc: reduce transmission rate of reset messages when link is downJon Paul Maloy
When a link is down, it will continuously try to re-establish contact with the peer by sending out a RESET or an ACTIVATE message at each timeout interval. The default value for this interval is currently 375 ms. This is wasteful, and may become a problem in very large clusters with dozens or hundreds of nodes being down simultaneously. We now introduce a simple backoff algorithm for these cases. The first five messages are sent at default rate; thereafter a message is sent only each 16th timer interval. This will cover the vast majority of link recycling cases, since the endpoint starting last will transmit at the higher speed, and the link should normally be established well be before the rate needs to be reduced. The only case where we will see a degradation of link re-establishment times is when the endpoints remain intact, and a glitch in the transmission media is causing the link reset. We will then experience a worst-case re-establishing time of 6 seconds, something we deem acceptable. Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-15tipc: guarantee peer bearer id exchange after rebootJon Paul Maloy
When a link endpoint is going down locally, e.g., because its interface is being stopped, it will spontaneously send out a RESET message to its peer, informing it about this fact. This saves the peer from detecting the failure via probing, and hence gives both speedier and less resource consuming failure detection on the peer side. According to the link FSM, a receiver of a RESET message, ignoring the reason for it, must now consider the sender ready to come back up, and starts periodically sending out ACTIVATE messages to the peer in order to re-establish the link. Also, according to the FSM, the receiver of an ACTIVATE message can now go directly to state ESTABLISHED and start sending regular traffic packets. This is a well-proven and robust FSM. However, in the case of a reboot, there is a small possibilty that link endpoint on the rebooted node may have been re-created with a new bearer identity between the moment it sent its (pre-boot) RESET and the moment it receives the ACTIVATE from the peer. The new bearer identity cannot be known by the peer according to this scenario, since traffic headers don't convey such information. This is a problem, because both endpoints need to know the correct value of the peer's bearer id at any moment in time in order to be able to produce correct link events for their users. The only way to guarantee this is to enforce a full setup message exchange (RESET + ACTIVATE) even after the reboot, since those messages carry the bearer idientity in their header. In this commit we do this by introducing and setting a "stopping" bit in the header of the spontaneously generated RESET messages, informing the peer that the sender will not be immediately ready to re-establish the link. A receiver seeing this bit must act as if this were a locally detected connectivity failure, and hence has to go through a full two- way setup message exchange before any link can be re-established. Although never reported, this problem seems to have always been around. This protocol addition is fully backwards compatible. Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-14tipc: fix a race condition leading to subscriber refcnt bugParthasarathy Bhuvaragan
Until now, the requests sent to topology server are queued to a workqueue by the generic server framework. These messages are processed by worker threads and trigger the registered callbacks. To reduce latency on uniprocessor systems, explicit rescheduling is performed using cond_resched() after MAX_RECV_MSG_COUNT(25) messages. This implementation on SMP systems leads to an subscriber refcnt error as described below: When a worker thread yields by calling cond_resched() in a SMP system, a new worker is created on another CPU to process the pending workitem. Sometimes the sleeping thread wakes up before the new thread finishes execution. This breaks the assumption on ordering and being single threaded. The fault is more frequent when MAX_RECV_MSG_COUNT is lowered. If the first thread was processing subscription create and the second thread processing close(), the close request will free the subscriber and the create request oops as follows: [31.224137] WARNING: CPU: 2 PID: 266 at include/linux/kref.h:46 tipc_subscrb_rcv_cb+0x317/0x380 [tipc] [31.228143] CPU: 2 PID: 266 Comm: kworker/u8:1 Not tainted 4.5.0+ #97 [31.228377] Workqueue: tipc_rcv tipc_recv_work [tipc] [...] [31.228377] Call Trace: [31.228377] [<ffffffff812fbb6b>] dump_stack+0x4d/0x72 [31.228377] [<ffffffff8105a311>] __warn+0xd1/0xf0 [31.228377] [<ffffffff8105a3fd>] warn_slowpath_null+0x1d/0x20 [31.228377] [<ffffffffa0098067>] tipc_subscrb_rcv_cb+0x317/0x380 [tipc] [31.228377] [<ffffffffa00a4984>] tipc_receive_from_sock+0xd4/0x130 [tipc] [31.228377] [<ffffffffa00a439b>] tipc_recv_work+0x2b/0x50 [tipc] [31.228377] [<ffffffff81071925>] process_one_work+0x145/0x3d0 [31.246554] ---[ end trace c3882c9baa05a4fd ]--- [31.248327] BUG: spinlock bad magic on CPU#2, kworker/u8:1/266 [31.249119] BUG: unable to handle kernel NULL pointer dereference at 0000000000000428 [31.249323] IP: [<ffffffff81099d0c>] spin_dump+0x5c/0xe0 [31.249323] PGD 0 [31.249323] Oops: 0000 [#1] SMP In this commit, we - rename tipc_conn_shutdown() to tipc_conn_release(). - move connection release callback execution from tipc_close_conn() to a new function tipc_sock_release(), which is executed before we free the connection. Thus we release the subscriber during connection release procedure rather than connection shutdown procedure. Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com> Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-13tipc: remove remnants of old broadcast codeJon Paul Maloy
We remove a couple of leftover fields in struct tipc_bearer. Those were used by the old broadcast implementation, and are not needed any longer. There is no functional changes in this commit. Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-11tipc: purge deferred updates from dead nodesErik Hugne
If a peer node becomes unavailable, in addition to removing the nametable entries from this node we also need to purge all deferred updates associated with this node. Signed-off-by: Erik Hugne <erik.hugne@gmail.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-11tipc: make dist queue pernetErik Hugne
Nametable updates received from the network that cannot be applied immediately are placed on a defer queue. This queue is global to the TIPC module, which might cause problems when using TIPC in containers. To prevent nametable updates from escaping into the wrong namespace, we make the queue pernet instead. Signed-off-by: Erik Hugne <erik.hugne@gmail.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-07tipc: stricter filtering of packets in bearer layerJon Paul Maloy
Resetting a bearer/interface, with the consequence of resetting all its pertaining links, is not an atomic action. This becomes particularly evident in very large clusters, where a lot of traffic may happen on the remaining links while we are busy shutting them down. In extreme cases, we may even see links being re-created and re-established before we are finished with the job. To solve this, we now introduce a solution where we temporarily detach the bearer from the interface when the bearer is reset. This inhibits all packet reception, while sending still is possible. For the latter, we use the fact that the device's user pointer now is zero to filter out which packets can be sent during this situation; i.e., outgoing RESET messages only. This filtering serves to speed up the neighbors' detection of the loss event, and saves us from unnecessary probing. Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-07tipc: eliminate buffer leak in bearer layerJon Paul Maloy
When enabling a bearer we create a 'neigbor discoverer' instance by calling the function tipc_disc_create() before the bearer is actually registered in the list of enabled bearers. Because of this, the very first discovery broadcast message, created by the mentioned function, is lost, since it cannot find any valid bearer to use. Furthermore, the used send function, tipc_bearer_xmit_skb() does not free the given buffer when it cannot find a bearer, resulting in the leak of exactly one send buffer each time a bearer is enabled. This commit fixes this problem by introducing two changes: 1) Instead of attemting to send the discovery message directly, we let tipc_disc_create() return the discovery buffer to the calling function, tipc_enable_bearer(), so that the latter can send it when the enabling sequence is finished. 2) In tipc_bearer_xmit_skb(), as well as in the two other transmit functions at the bearer layer, we now free the indicated buffer or buffer chain when a valid bearer cannot be found. Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-14tipc: make sure IPv6 header fits in skb headroomRichard Alpe
Expand headroom further in order to be able to fit the larger IPv6 header. Prior to this patch this caused a skb under panic for certain tipc packets when using IPv6 UDP bearer(s). Signed-off-by: Richard Alpe <richard.alpe@ericsson.com> Acked-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-11ip_tunnel: add support for setting flow label via collect metadataDaniel Borkmann
This patch extends udp_tunnel6_xmit_skb() to pass in the IPv6 flow label from call sites. Currently, there's no such option and it's always set to zero when writing ip6_flow_hdr(). Add a label member to ip_tunnel_key, so that flow-based tunnels via collect metadata frontends can make use of it. vxlan and geneve will be converted to add flow label support separately. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-08Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Several cases of overlapping changes, as well as one instance (vxlan) of a bug fix in 'net' overlapping with code movement in 'net-next'. Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-07tipc: move netlink policies to netlink.cRichard Alpe
Make the c files less cluttered and enable netlink attributes to be shared between files. Signed-off-by: Richard Alpe <richard.alpe@ericsson.com> Reviewed-by: Jon Maloy <jon.maloy@ericsson.com> Acked-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-06tipc: remove pre-allocated message header in link structJon Paul Maloy
Until now, we have kept a pre-allocated protocol message header aggregated into struct tipc_link. Apart from adding unnecessary footprint to the link instances, this requires extra code both to initialize and re-initialize it. We now remove this sub-optimization. This change also makes it possible to clean up the function tipc_build_proto_msg() and remove a couple of small functions that were accessing the mentioned header. In particular, we can replace all occurrences of the local function call link_own_addr(link) with the generic tipc_own_addr(net). Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-06tipc: fix nullptr crash during subscription cancelParthasarathy Bhuvaragan
commit 4d5cfcba2f6e ('tipc: fix connection abort during subscription cancel'), removes the check for a valid subscription before calling tipc_nametbl_subscribe(). This will lead to a nullptr exception when we process a subscription cancel request. For a cancel request, a null subscription is passed to tipc_nametbl_subscribe() resulting in exception. In this commit, we call tipc_nametbl_subscribe() only for a valid subscription. Fixes: 4d5cfcba2f6e ('tipc: fix connection abort during subscription cancel') Reported-by: Anders Widell <anders.widell@ericsson.com> Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com> Acked-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-06tipc: make sure required IPv6 addresses are scopedRichard Alpe
Make sure the user has provided a scope for multicast and link local addresses used locally by a UDP bearer. Signed-off-by: Richard Alpe <richard.alpe@ericsson.com> Acked-by: Jon Maloy <jon.maloy@ericsson.com> Reviewed-by: Erik Hugne <erik.hugne@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-06tipc: safely copy UDP netlink data from userRichard Alpe
The netlink policy for TIPC_NLA_UDP_LOCAL and TIPC_NLA_UDP_REMOTE is of type binary with a defined length. This causes the policy framework to threat the defined length as maximum length. There is however no protection against a user sending a smaller amount of data. Prior to this patch this wasn't handled which could result in a partially incomplete sockaddr_storage struct containing uninitialized data. In this patch we use nla_memcpy() when copying the user data. This ensures a potential gap at the end is cleared out properly. This was found by Julia with Coccinelle tool. Reported-by: Daniel Borkmann <daniel@iogearbox.net> Reported-by: Julia Lawall <julia.lawall@lip6.fr> Signed-off-by: Richard Alpe <richard.alpe@ericsson.com> Acked-by: Jon Maloy <jon.maloy@ericsson.com> Reviewed-by: Erik Hugne <erik.hugne@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-06tipc: don't check link reset on non existing linkRichard Alpe
Make sure we have a link before checking if it has been reset or not. Prior to this patch tipc_link_is_reset() could be called with a non existing link, resulting in a null pointer dereference. Signed-off-by: Richard Alpe <richard.alpe@ericsson.com> Acked-by: Jon Maloy <jon.maloy@ericsson.com> Reviewed-by: Erik Hugne <erik.hugne@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-06tipc: add net device to skb before UDP xmitRichard Alpe
Prior to this patch enabling a IPv4 UDP bearer caused a null pointer dereference in iptunnel_xmit_stats(), when it tried to dereference the net device from the skb. To resolve this we now point the skb device to the net device resolved from the routing table. Fixes: 039f50629b7f (ip_tunnel: Move stats update to iptunnel_xmit()) Signed-off-by: Richard Alpe <richard.alpe@ericsson.com> Acked-by: Jon Maloy <jon.maloy@ericsson.com> Reviewed-by: Erik Hugne <erik.hugne@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-03tipc: Revert "tipc: use existing sk_write_queue for outgoing packet chain"Parthasarathy Bhuvaragan
reverts commit 94153e36e709e ("tipc: use existing sk_write_queue for outgoing packet chain") In Commit 94153e36e709e, we assume that we fill & empty the socket's sk_write_queue within the same lock_sock() session. This is not true if the link is congested. During congestion, the socket lock is released while we wait for the congestion to cease. This implementation causes a nullptr exception, if the user space program has several threads accessing the same socket descriptor. Consider two threads of the same program performing the following: Thread1 Thread2 -------------------- ---------------------- Enter tipc_sendmsg() Enter tipc_sendmsg() lock_sock() lock_sock() Enter tipc_link_xmit(), ret=ELINKCONG spin on socket lock.. sk_wait_event() : release_sock() grab socket lock : Enter tipc_link_xmit(), ret=0 : release_sock() Wakeup after congestion lock_sock() skb = skb_peek(pktchain); !! TIPC_SKB_CB(skb)->wakeup_pending = tsk->link_cong; In this case, the second thread transmits the buffers belonging to both thread1 and thread2 successfully. When the first thread wakeup after the congestion it assumes that the pktchain is intact and operates on the skb's in it, which leads to the following exception: [2102.439969] BUG: unable to handle kernel NULL pointer dereference at 00000000000000d0 [2102.440074] IP: [<ffffffffa005f330>] __tipc_link_xmit+0x2b0/0x4d0 [tipc] [2102.440074] PGD 3fa3f067 PUD 3fa6b067 PMD 0 [2102.440074] Oops: 0000 [#1] SMP [2102.440074] CPU: 2 PID: 244 Comm: sender Not tainted 3.12.28 #1 [2102.440074] RIP: 0010:[<ffffffffa005f330>] [<ffffffffa005f330>] __tipc_link_xmit+0x2b0/0x4d0 [tipc] [...] [2102.440074] Call Trace: [2102.440074] [<ffffffff8163f0b9>] ? schedule+0x29/0x70 [2102.440074] [<ffffffffa006a756>] ? tipc_node_unlock+0x46/0x170 [tipc] [2102.440074] [<ffffffffa005f761>] tipc_link_xmit+0x51/0xf0 [tipc] [2102.440074] [<ffffffffa006d8ae>] tipc_send_stream+0x11e/0x4f0 [tipc] [2102.440074] [<ffffffff8106b150>] ? __wake_up_sync+0x20/0x20 [2102.440074] [<ffffffffa006dc9c>] tipc_send_packet+0x1c/0x20 [tipc] [2102.440074] [<ffffffff81502478>] sock_sendmsg+0xa8/0xd0 [2102.440074] [<ffffffff81507895>] ? release_sock+0x145/0x170 [2102.440074] [<ffffffff815030d8>] ___sys_sendmsg+0x3d8/0x3e0 [2102.440074] [<ffffffff816426ae>] ? _raw_spin_unlock+0xe/0x10 [2102.440074] [<ffffffff81115c2a>] ? handle_mm_fault+0x6ca/0x9d0 [2102.440074] [<ffffffff8107dd65>] ? set_next_entity+0x85/0xa0 [2102.440074] [<ffffffff816426de>] ? _raw_spin_unlock_irq+0xe/0x20 [2102.440074] [<ffffffff8107463c>] ? finish_task_switch+0x5c/0xc0 [2102.440074] [<ffffffff8163ea8c>] ? __schedule+0x34c/0x950 [2102.440074] [<ffffffff81504e12>] __sys_sendmsg+0x42/0x80 [2102.440074] [<ffffffff81504e62>] SyS_sendmsg+0x12/0x20 [2102.440074] [<ffffffff8164aed2>] system_call_fastpath+0x16/0x1b In this commit, we maintain the skb list always in the stack. Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com> Acked-by: Ying Xue <ying.xue@windriver.com> Acked-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-25tipc: fix null deref crash in compat config pathFlorian Westphal
msg.dst_sk needs to be set up with a valid socket because some callbacks later derive the netns from it. Fixes: 263ea09084d172d ("Revert "genl: Add genlmsg_new_unicast() for unicast message allocation") Reported-by: Jon Maloy <maloy@donjonn.com> Bisected-by: Jon Maloy <maloy@donjonn.com> Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-25tipc: fix crash during node removalJon Paul Maloy
When the TIPC module is unloaded, we have identified a race condition that allows a node reference counter to go to zero and the node instance being freed before the node timer is finished with accessing it. This leads to occasional crashes, especially in multi-namespace environments. The scenario goes as follows: CPU0:(node_stop) CPU1:(node_timeout) // ref == 2 1: if(!mod_timer()) 2: if (del_timer()) 3: tipc_node_put() // ref -> 1 4: tipc_node_put() // ref -> 0 5: kfree_rcu(node); 6: tipc_node_get(node) 7: // BOOM! We now clean up this functionality as follows: 1) We remove the node pointer from the node lookup table before we attempt deactivating the timer. This way, we reduce the risk that tipc_node_find() may obtain a valid pointer to an instance marked for deletion; a harmless but undesirable situation. 2) We use del_timer_sync() instead of del_timer() to safely deactivate the node timer without any risk that it might be reactivated by the timeout handler. There is no risk of deadlock here, since the two functions never touch the same spinlocks. 3: We remove a pointless tipc_node_get() + tipc_node_put() from the timeout handler. Reported-by: Zhijiang Hu <huzhijiang@gmail.com> Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-25tipc: eliminate risk of finding to-be-deleted node instanceJon Paul Maloy
Although we have never seen it happen, we have identified the following problematic scenario when nodes are stopped and deleted: CPU0: CPU1: tipc_node_xxx() //ref == 1 tipc_node_put() //ref -> 0 tipc_node_find() // node still in table tipc_node_delete() list_del_rcu(n. list) tipc_node_get() //ref -> 1, bad kfree_rcu() tipc_node_put() //ref to 0 again. kfree_rcu() // BOOM! We fix this by introducing use of the conditional kref_get_if_not_zero() instead of kref_get() in the function tipc_node_find(). This eliminates any risk of post-mortem access. Reported-by: Zhijiang Hu <huzhijiang@gmail.com> Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-23Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Conflicts: drivers/net/phy/bcm7xxx.c drivers/net/phy/marvell.c drivers/net/vxlan.c All three conflicts were cases of simple overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-19tipc: unlock in error pathInsu Yun
tipc_bcast_unlock need to be unlocked in error path. Signed-off-by: Insu Yun <wuninsu@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-18Revert "genl: Add genlmsg_new_unicast() for unicast message allocation"Florian Westphal
This reverts commit bb9b18fb55b0 ("genl: Add genlmsg_new_unicast() for unicast message allocation")'. Nothing wrong with it; its no longer needed since this was only for mmapped netlink support. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-16tipc: refactor node xmit and fix memory leaksRichard Alpe
Refactor tipc_node_xmit() to fail fast and fail early. Fix several potential memory leaks in unexpected error paths. Reported-by: Dmitry Vyukov <dvyukov@google.com> Reviewed-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: Richard Alpe <richard.alpe@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-16tipc: fix premature addition of node to lookup tableJon Paul Maloy
In commit 5266698661401a ("tipc: let broadcast packet reception use new link receive function") we introduced a new per-node broadcast reception link instance. This link is created at the moment the node itself is created. Unfortunately, the allocation is done after the node instance has already been added to the node lookup hash table. This creates a potential race condition, where arriving broadcast packets are able to find and access the node before it has been fully initialized, and before the above mentioned link has been created. The result is occasional crashes in the function tipc_bcast_rcv(), which is trying to access the not-yet existing link. We fix this by deferring the addition of the node instance until after it has been fully initialized in the function tipc_node_create(). Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-06tipc: use alloc_ordered_workqueue() instead of WQ_UNBOUND w/ max_active = 1Parthasarathy Bhuvaragan
Until now, tipc_rcv and tipc_send workqueues in server are allocated with parameters WQ_UNBOUND & max_active = 1. This parameters passed to this function makes it equivalent to alloc_ordered_workqueue(). The later form is more explicit and can inherit future ordered_workqueue changes. In this commit we replace alloc_workqueue() with more readable alloc_ordered_workqueue(). Acked-by: Ying Xue <ying.xue@windriver.com> Reviewed-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-06tipc: donot create timers if subscription timeout = TIPC_WAIT_FOREVERParthasarathy Bhuvaragan
Until now, we create timers even for the subscription requests with timeout = TIPC_WAIT_FOREVER. This can be improved by avoiding timer creation when the timeout is set to TIPC_WAIT_FOREVER. In this commit, we introduce a check to creates timers only when timeout != TIPC_WAIT_FOREVER. Acked-by: Ying Xue <ying.xue@windriver.com> Reviewed-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-06tipc: protect tipc_subscrb_get() with subscriber spin lockParthasarathy Bhuvaragan
Until now, during subscription creation the mod_time() & tipc_subscrb_get() are called after releasing the subscriber spin lock. In a SMP system when performing a subscription creation, if the subscription timeout occurs simultaneously (the timer is scheduled to run on another CPU) then the timer thread might decrement the subscribers refcount before the create thread increments the refcount. This can be simulated by creating subscription with timeout=0 and sometimes the timeout occurs before the create request is complete. This leads to the following message: [30.702949] BUG: spinlock bad magic on CPU#1, kworker/u8:3/87 [30.703834] general protection fault: 0000 [#1] SMP [30.704826] CPU: 1 PID: 87 Comm: kworker/u8:3 Not tainted 4.4.0-rc8+ #18 [30.704826] Workqueue: tipc_rcv tipc_recv_work [tipc] [30.704826] task: ffff88003f878600 ti: ffff88003fae0000 task.ti: ffff88003fae0000 [30.704826] RIP: 0010:[<ffffffff8109196c>] [<ffffffff8109196c>] spin_dump+0x5c/0xe0 [...] [30.704826] Call Trace: [30.704826] [<ffffffff81091a16>] spin_bug+0x26/0x30 [30.704826] [<ffffffff81091b75>] do_raw_spin_lock+0xe5/0x120 [30.704826] [<ffffffff81684439>] _raw_spin_lock_bh+0x19/0x20 [30.704826] [<ffffffffa0096f10>] tipc_subscrb_rcv_cb+0x1d0/0x330 [tipc] [30.704826] [<ffffffffa00a37b1>] tipc_receive_from_sock+0xc1/0x150 [tipc] [30.704826] [<ffffffffa00a31df>] tipc_recv_work+0x3f/0x80 [tipc] [30.704826] [<ffffffff8106a739>] process_one_work+0x149/0x3c0 [30.704826] [<ffffffff8106aa16>] worker_thread+0x66/0x460 [30.704826] [<ffffffff8106a9b0>] ? process_one_work+0x3c0/0x3c0 [30.704826] [<ffffffff8106a9b0>] ? process_one_work+0x3c0/0x3c0 [30.704826] [<ffffffff8107029d>] kthread+0xed/0x110 [30.704826] [<ffffffff810701b0>] ? kthread_create_on_node+0x190/0x190 [30.704826] [<ffffffff81684bdf>] ret_from_fork+0x3f/0x70 In this commit, 1. we remove the check for the return code for mod_timer() 2. we protect tipc_subscrb_get() using the subscriber spin lock. We increment the subscriber's refcount as soon as we add the subscription to subscriber's subscription list. Acked-by: Ying Xue <ying.xue@windriver.com> Reviewed-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-06tipc: hold subscriber->lock for tipc_nametbl_subscribe()Parthasarathy Bhuvaragan
Until now, while creating a subscription the subscriber lock protects only the subscribers subscription list and not the nametable. The call to tipc_nametbl_subscribe() is outside the lock. However, at subscription timeout and cancel both the subscribers subscription list and the nametable are protected by the subscriber lock. This asymmetric locking mechanism leads to the following problem: In a SMP system, the timer can be fire on another core before the create request is complete. When the timer thread calls tipc_nametbl_unsubscribe() before create thread calls tipc_nametbl_subscribe(), we get a nullptr exception. This can be simulated by creating subscription with timeout=0 and sometimes the timeout occurs before the create request is complete. The following is the oops: [57.569661] BUG: unable to handle kernel NULL pointer dereference at (null) [57.577498] IP: [<ffffffffa02135aa>] tipc_nametbl_unsubscribe+0x8a/0x120 [tipc] [57.584820] PGD 0 [57.586834] Oops: 0002 [#1] SMP [57.685506] CPU: 14 PID: 10077 Comm: kworker/u40:1 Tainted: P OENX 3.12.48-52.27.1. 9688.1.PTF-default #1 [57.703637] Workqueue: tipc_rcv tipc_recv_work [tipc] [57.708697] task: ffff88064c7f00c0 ti: ffff880629ef4000 task.ti: ffff880629ef4000 [57.716181] RIP: 0010:[<ffffffffa02135aa>] [<ffffffffa02135aa>] tipc_nametbl_unsubscribe+0x8a/ 0x120 [tipc] [...] [57.812327] Call Trace: [57.814806] [<ffffffffa0211c77>] tipc_subscrp_delete+0x37/0x90 [tipc] [57.821357] [<ffffffffa0211e2f>] tipc_subscrp_timeout+0x3f/0x70 [tipc] [57.827982] [<ffffffff810618c1>] call_timer_fn+0x31/0x100 [57.833490] [<ffffffff81062709>] run_timer_softirq+0x1f9/0x2b0 [57.839414] [<ffffffff8105a795>] __do_softirq+0xe5/0x230 [57.844827] [<ffffffff81520d1c>] call_softirq+0x1c/0x30 [57.850150] [<ffffffff81004665>] do_softirq+0x55/0x90 [57.855285] [<ffffffff8105aa35>] irq_exit+0x95/0xa0 [57.860290] [<ffffffff815215b5>] smp_apic_timer_interrupt+0x45/0x60 [57.866644] [<ffffffff8152005d>] apic_timer_interrupt+0x6d/0x80 [57.872686] [<ffffffffa02121c5>] tipc_subscrb_rcv_cb+0x2a5/0x3f0 [tipc] [57.879425] [<ffffffffa021c65f>] tipc_receive_from_sock+0x9f/0x100 [tipc] [57.886324] [<ffffffffa021c826>] tipc_recv_work+0x26/0x60 [tipc] [57.892463] [<ffffffff8106fb22>] process_one_work+0x172/0x420 [57.898309] [<ffffffff8107079a>] worker_thread+0x11a/0x3c0 [57.903871] [<ffffffff81077114>] kthread+0xb4/0xc0 [57.908751] [<ffffffff8151f318>] ret_from_fork+0x58/0x90 In this commit, we do the following at subscription creation: 1. set the subscription's subscriber pointer before performing tipc_nametbl_subscribe(), as this value is required further in the call chain ex: by tipc_subscrp_send_event(). 2. move tipc_nametbl_subscribe() under the scope of subscriber lock Acked-by: Ying Xue <ying.xue@windriver.com> Reviewed-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-06tipc: fix connection abort when receiving invalid cancel requestParthasarathy Bhuvaragan
Until now, the subscribers endianness for a subscription create/cancel request is determined as: swap = !(s->filter & (TIPC_SUB_PORTS | TIPC_SUB_SERVICE)) The checks are performed only for port/service subscriptions. The swap calculation is incorrect if the filter in the subscription cancellation request is set to TIPC_SUB_CANCEL (it's a malformed cancel request, as the corresponding subscription create filter is missing). Thus, the check if the request is for cancellation fails and the request is treated as a subscription create request. The subscription creation fails as the request is illegal, which terminates this connection. In this commit we determine the endianness by including TIPC_SUB_CANCEL, which will set swap correctly and the request is processed as a cancellation request. Acked-by: Ying Xue <ying.xue@windriver.com> Reviewed-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-06tipc: fix connection abort during subscription cancellationParthasarathy Bhuvaragan
In 'commit 7fe8097cef5f ("tipc: fix nullpointer bug when subscribing to events")', we terminate the connection if the subscription creation fails. In the same commit, the subscription creation result was based on the value of subscription pointer (set in the function) instead of the return code. Unfortunately, the same function also handles subscription cancellation request. For a subscription cancellation request, the subscription pointer cannot be set. Thus the connection is terminated during cancellation request. In this commit, we move the subcription cancel check outside of tipc_subscrp_create(). Hence, - tipc_subscrp_create() will create a subscripton - tipc_subscrb_rcv_cb() will subscribe or cancel a subscription. Fixes: 'commit 7fe8097cef5f ("tipc: fix nullpointer bug when subscribing to events")' Acked-by: Ying Xue <ying.xue@windriver.com> Reviewed-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-06tipc: introduce tipc_subscrb_subscribe() routineParthasarathy Bhuvaragan
In this commit, we split tipc_subscrp_create() into two: 1. tipc_subscrp_create() creates a subscription 2. A new function tipc_subscrp_subscribe() adds the subscription to the subscriber subscription list, activates the subscription timer and subscribes to the nametable updates. In future commits, the purpose of tipc_subscrb_rcv_cb() will be to either subscribe or cancel a subscription. There is no functional change in this commit. Acked-by: Ying Xue <ying.xue@windriver.com> Reviewed-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-06tipc: remove struct tipc_name_seq from struct tipc_subscriptionParthasarathy Bhuvaragan
Until now, struct tipc_subscriber has duplicate fields for type, upper and lower (as member of struct tipc_name_seq) at: 1. as member seq in struct tipc_subscription 2. as member seq in struct tipc_subscr, which is contained in struct tipc_event The former structure contains the type, upper and lower values in network byte order and the later contains the intact copy of the request. The struct tipc_subscription contains a field swap to determine if request needs network byte order conversion. Thus by using swap, we can convert the request when required instead of duplicating it. In this commit, 1. we remove the references to these elements as members of struct tipc_subscription and replace them with elements from struct tipc_subscr. 2. provide new functions to convert the user request into network byte order. Acked-by: Ying Xue <ying.xue@windriver.com> Reviewed-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-06tipc: remove filter and timeout elements from struct tipc_subscriptionParthasarathy Bhuvaragan
Until now, struct tipc_subscription has duplicate timeout and filter attributes present: 1. directly as members of struct tipc_subscription 2. in struct tipc_subscr, which is contained in struct tipc_event In this commit, we remove the references to these elements as members of struct tipc_subscription and replace them with elements from struct tipc_subscr. Acked-by: Ying Xue <ying.xue@windriver.com> Reviewed-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-06tipc: remove incorrect check for subscription timeout valueParthasarathy Bhuvaragan
Until now, during subscription creation we set sub->timeout by converting the timeout request value in milliseconds to jiffies. This is followed by setting the timeout value in the timer if sub->timeout != TIPC_WAIT_FOREVER. For a subscription create request with a timeout value of TIPC_WAIT_FOREVER, msecs_to_jiffies(TIPC_WAIT_FOREVER) returns MAX_JIFFY_OFFSET (0xfffffffe). This is not equal to TIPC_WAIT_FOREVER (0xffffffff). In this commit, we remove this check. Acked-by: Ying Xue <ying.xue@windriver.com> Reviewed-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>