summaryrefslogtreecommitdiff
path: root/net/rds
AgeCommit message (Collapse)Author
2017-03-09net: Work around lockdep limitation in sockets that use socketsDavid Howells
Lockdep issues a circular dependency warning when AFS issues an operation through AF_RXRPC from a context in which the VFS/VM holds the mmap_sem. The theory lockdep comes up with is as follows: (1) If the pagefault handler decides it needs to read pages from AFS, it calls AFS with mmap_sem held and AFS begins an AF_RXRPC call, but creating a call requires the socket lock: mmap_sem must be taken before sk_lock-AF_RXRPC (2) afs_open_socket() opens an AF_RXRPC socket and binds it. rxrpc_bind() binds the underlying UDP socket whilst holding its socket lock. inet_bind() takes its own socket lock: sk_lock-AF_RXRPC must be taken before sk_lock-AF_INET (3) Reading from a TCP socket into a userspace buffer might cause a fault and thus cause the kernel to take the mmap_sem, but the TCP socket is locked whilst doing this: sk_lock-AF_INET must be taken before mmap_sem However, lockdep's theory is wrong in this instance because it deals only with lock classes and not individual locks. The AF_INET lock in (2) isn't really equivalent to the AF_INET lock in (3) as the former deals with a socket entirely internal to the kernel that never sees userspace. This is a limitation in the design of lockdep. Fix the general case by: (1) Double up all the locking keys used in sockets so that one set are used if the socket is created by userspace and the other set is used if the socket is created by the kernel. (2) Store the kern parameter passed to sk_alloc() in a variable in the sock struct (sk_kern_sock). This informs sock_lock_init(), sock_init_data() and sk_clone_lock() as to the lock keys to be used. Note that the child created by sk_clone_lock() inherits the parent's kern setting. (3) Add a 'kern' parameter to ->accept() that is analogous to the one passed in to ->create() that distinguishes whether kernel_accept() or sys_accept4() was the caller and can be passed to sk_alloc(). Note that a lot of accept functions merely dequeue an already allocated socket. I haven't touched these as the new socket already exists before we get the parameter. Note also that there are a couple of places where I've made the accepted socket unconditionally kernel-based: irda_accept() rds_rcp_accept_one() tcp_accept_from_sock() because they follow a sock_create_kern() and accept off of that. Whilst creating this, I noticed that lustre and ocfs don't create sockets through sock_create_kern() and thus they aren't marked as for-kernel, though they appear to be internal. I wonder if these should do that so that they use the new set of lock keys. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-09rds: ib: add error handleZhu Yanjun
In the function rds_ib_setup_qp, the error handle is missing. When some error occurs, it is possible that memory leak occurs. As such, error handle is added. Cc: Joe Jin <joe.jin@oracle.com> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Reviewed-by: Guanglei Li <guanglei.li@oracle.com> Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-07rds: tcp: Sequence teardown of listen and acceptor sockets to avoid racesSowmini Varadhan
Commit a93d01f5777e ("RDS: TCP: avoid bad page reference in rds_tcp_listen_data_ready") added the function rds_tcp_listen_sock_def_readable() to handle the case when a partially set-up acceptor socket drops into rds_tcp_listen_data_ready(). However, if the listen socket (rtn->rds_tcp_listen_sock) is itself going through a tear-down via rds_tcp_listen_stop(), the (*ready)() will be null and we would hit a panic of the form BUG: unable to handle kernel NULL pointer dereference at (null) IP: (null) : ? rds_tcp_listen_data_ready+0x59/0xb0 [rds_tcp] tcp_data_queue+0x39d/0x5b0 tcp_rcv_established+0x2e5/0x660 tcp_v4_do_rcv+0x122/0x220 tcp_v4_rcv+0x8b7/0x980 : In the above case, it is not fatal to encounter a NULL value for ready- we should just drop the packet and let the flush of the acceptor thread finish gracefully. In general, the tear-down sequence for listen() and accept() socket that is ensured by this commit is: rtn->rds_tcp_listen_sock = NULL; /* prevent any new accepts */ In rds_tcp_listen_stop(): serialize with, and prevent, further callbacks using lock_sock() flush rds_wq flush acceptor workq sock_release(listen socket) Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-07rds: tcp: Reorder initialization sequence in rds_tcp_init to avoid racesSowmini Varadhan
Order of initialization in rds_tcp_init needs to be done so that resources are set up and destroyed in the correct synchronization sequence with both the data path, as well as netns create/destroy path. Specifically, - we must call register_pernet_subsys and get the rds_tcp_netid before calling register_netdevice_notifier, otherwise we risk the sequence 1. register_netdevice_notifier sets up netdev notifier callback 2. rds_tcp_dev_event -> rds_tcp_kill_sock uses netid 0, and finds the wrong rtn, resulting in a panic with string that is of the form: BUG: unable to handle kernel NULL pointer dereference at 000000000000000d IP: rds_tcp_kill_sock+0x3a/0x1d0 [rds_tcp] : - the rds_tcp_incoming_slab kmem_cache must be initialized before the datapath starts up. The latter can happen any time after the pernet_subsys registration of rds_tcp_net_ops, whose -> init function sets up the listen socket. If the rds_tcp_incoming_slab has not been set up at that time, a panic of the form below may be encountered BUG: unable to handle kernel NULL pointer dereference at 0000000000000014 IP: kmem_cache_alloc+0x90/0x1c0 : rds_tcp_data_recv+0x1e7/0x370 [rds_tcp] tcp_read_sock+0x96/0x1c0 rds_tcp_recv_path+0x65/0x80 [rds_tcp] : Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-07rds: tcp: Take explicit refcounts on struct netSowmini Varadhan
It is incorrect for the rds_connection to piggyback on the sock_net() refcount for the netns because this gives rise to a chicken-and-egg problem during rds_conn_destroy. Instead explicitly take a ref on the net, and hold the netns down till the connection tear-down is complete. Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-04Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds
Pull networking fixes from David Miller: 1) Fix double-free in batman-adv, from Sven Eckelmann. 2) Fix packet stats for fast-RX path, from Joannes Berg. 3) Netfilter's ip_route_me_harder() doesn't handle request sockets properly, fix from Florian Westphal. 4) Fix sendmsg deadlock in rxrpc, from David Howells. 5) Add missing RCU locking to transport hashtable scan, from Xin Long. 6) Fix potential packet loss in mlxsw driver, from Ido Schimmel. 7) Fix race in NAPI handling between poll handlers and busy polling, from Eric Dumazet. 8) TX path in vxlan and geneve need proper RCU locking, from Jakub Kicinski. 9) SYN processing in DCCP and TCP need to disable BH, from Eric Dumazet. 10) Properly handle net_enable_timestamp() being invoked from IRQ context, also from Eric Dumazet. 11) Fix crash on device-tree systems in xgene driver, from Alban Bedel. 12) Do not call sk_free() on a locked socket, from Arnaldo Carvalho de Melo. 13) Fix use-after-free in netvsc driver, from Dexuan Cui. 14) Fix max MTU setting in bonding driver, from WANG Cong. 15) xen-netback hash table can be allocated from softirq context, so use GFP_ATOMIC. From Anoob Soman. 16) Fix MAC address change bug in bgmac driver, from Hari Vyas. 17) strparser needs to destroy strp_wq on module exit, from WANG Cong. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (69 commits) strparser: destroy workqueue on module exit sfc: fix IPID endianness in TSOv2 sfc: avoid max() in array size rds: remove unnecessary returned value check rxrpc: Fix potential NULL-pointer exception nfp: correct DMA direction in XDP DMA sync nfp: don't tell FW about the reserved buffer space net: ethernet: bgmac: mac address change bug net: ethernet: bgmac: init sequence bug xen-netback: don't vfree() queues under spinlock xen-netback: keep a local pointer for vif in backend_disconnect() netfilter: nf_tables: don't call nfnetlink_set_err() if nfnetlink_send() fails netfilter: nft_set_rbtree: incorrect assumption on lower interval lookups netfilter: nf_conntrack_sip: fix wrong memory initialisation can: flexcan: fix typo in comment can: usb_8dev: Fix memory leak of priv->cmd_msg_buffer can: gs_usb: fix coding style can: gs_usb: Don't use stack memory for USB transfers ixgbe: Limit use of 2K buffers on architectures with 256B or larger cache lines ixgbe: update the rss key on h/w, when ethtool ask for it ...
2017-03-03rds: remove unnecessary returned value checkZhu Yanjun
The function rds_trans_register always returns 0. As such, it is not necessary to check the returned value. Cc: Joe Jin <joe.jin@oracle.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com> Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-02Merge branch 'work.sendmsg' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs sendmsg updates from Al Viro: "More sendmsg work. This is a fairly separate isolated stuff (there's a continuation around lustre, but that one was too late to soak in -next), thus the separate pull request" * 'work.sendmsg' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: ncpfs: switch to sock_sendmsg() ncpfs: don't mess with manually advancing iovec on send ncpfs: sendmsg does *not* bugger iovec these days ceph_tcp_sendpage(): use ITER_BVEC sendmsg afs_send_pages(): use ITER_BVEC rds: remove dead code ceph: switch to sock_recvmsg() usbip_recv(): switch to sock_recvmsg() iscsi_target: deal with short writes on the tx side [nbd] pass iov_iter to nbd_xmit() [nbd] switch sock_xmit() to sock_{send,recv}msg() [drbd] use sock_sendmsg()
2017-03-01rds: ib: add the static type to the variablesZhu Yanjun
The variables rds_ib_mr_1m_pool_size and rds_ib_mr_8k_pool_size are used only in the ib.c file. As such, the static type is added to limit them in this file. Cc: Joe Jin <joe.jin@oracle.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-28Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds
Pull networking fixes from David Miller: 1) Don't save TIPC header values before the header has been validated, from Jon Paul Maloy. 2) Fix memory leak in RDS, from Zhu Yanjun. 3) We miss to initialize the UID in the flow key in some paths, from Julian Anastasov. 4) Fix latent TOS masking bug in the routing cache removal from years ago, also from Julian. 5) We forget to set the sockaddr port in sctp_copy_local_addr_list(), fix from Xin Long. 6) Missing module ref count drop in packet scheduler actions, from Roman Mashak. 7) Fix RCU annotations in rht_bucket_nested, from Herbert Xu. 8) Fix use after free which happens because L2TP's ipv4 support returns non-zero values from it's backlog_rcv function which ipv4 interprets as protocol values. Fix from Paul Hüber. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (35 commits) qed: Don't use attention PTT for configuring BW qed: Fix race with multiple VFs l2tp: avoid use-after-free caused by l2tp_ip_backlog_recv xfrm: provide correct dst in xfrm_neigh_lookup rhashtable: Fix RCU dereference annotation in rht_bucket_nested rhashtable: Fix use before NULL check in bucket_table_free net sched actions: do not overwrite status of action creation. rxrpc: Kernel calls get stuck in recvmsg net sched actions: decrement module reference count after table flush. lib: Allow compile-testing of parman ipv6: check sk sk_type and protocol early in ip_mroute_set/getsockopt sctp: set sin_port for addr param when checking duplicate address net/mlx4_en: fix overflow in mlx4_en_init_timestamp() netfilter: nft_set_bitmap: incorrect bitmap size net: s2io: fix typo argumnet argument net: vxge: fix typo argumnet argument netfilter: nf_ct_expect: Change __nf_ct_expect_check() return value. ipv4: mask tos for input route ipv4: add missing initialization for flowi4_uid lib: fix spelling mistake: "actualy" -> "actually" ...
2017-02-25Merge tag 'for-next-dma_ops' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma Pull rdma DMA mapping updates from Doug Ledford: "Drop IB DMA mapping code and use core DMA code instead. Bart Van Assche noted that the ib DMA mapping code was significantly similar enough to the core DMA mapping code that with a few changes it was possible to remove the IB DMA mapping code entirely and switch the RDMA stack to use the core DMA mapping code. This resulted in a nice set of cleanups, but touched the entire tree and has been kept separate for that reason." * tag 'for-next-dma_ops' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: (37 commits) IB/rxe, IB/rdmavt: Use dma_virt_ops instead of duplicating it IB/core: Remove ib_device.dma_device nvme-rdma: Switch from dma_device to dev.parent RDS: net: Switch from dma_device to dev.parent IB/srpt: Modify a debug statement IB/srp: Switch from dma_device to dev.parent IB/iser: Switch from dma_device to dev.parent IB/IPoIB: Switch from dma_device to dev.parent IB/rxe: Switch from dma_device to dev.parent IB/vmw_pvrdma: Switch from dma_device to dev.parent IB/usnic: Switch from dma_device to dev.parent IB/qib: Switch from dma_device to dev.parent IB/qedr: Switch from dma_device to dev.parent IB/ocrdma: Switch from dma_device to dev.parent IB/nes: Remove a superfluous assignment statement IB/mthca: Switch from dma_device to dev.parent IB/mlx5: Switch from dma_device to dev.parent IB/mlx4: Switch from dma_device to dev.parent IB/i40iw: Remove a superfluous assignment statement IB/hns: Switch from dma_device to dev.parent ...
2017-02-24rds: fix memory leak errorZhu Yanjun
When the function register_netdevice_notifier fails, the memory allocated by kmem_cache_create should be freed by the function kmem_cache_destroy. Cc: Joe Jin <joe.jin@oracle.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Acked-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-24RDS: IB: fix ifnullfree.cocci warningsWu Fengguang
net/rds/ib.c:115:2-7: WARNING: NULL check before freeing functions like kfree, debugfs_remove, debugfs_remove_recursive or usb_free_urb is not needed. Maybe consider reorganizing relevant code to avoid passing NULL values. NULL check before some freeing functions is not needed. Based on checkpatch warning "kfree(NULL) is safe this check is probably not required" and kfreeaddr.cocci by Julia Lawall. Generated by: scripts/coccinelle/free/ifnullfree.cocci Signed-off-by: Fengguang Wu <fengguang.wu@intel.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-17rds:Remove unnecessary ib_ring unallocZhu Yanjun
In the function rds_ib_xmit_atomic, ib_ring is not allocated successfully. As such, it is not necessary to unalloc it. Cc: Joe Jin <joe.jin@oracle.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-24RDS: net: Switch from dma_device to dev.parentBart Van Assche
Prepare for removal of ib_device.dma_device. Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Cc: Santosh Shilimkar <santosh.shilimkar@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2017-01-24IB/core: Change the type of an ib_dma_alloc_coherent() argumentBart Van Assche
Change the type of the dma_handle argument from u64 * to dma_addr_t *. This patch does not change any functionality. Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2017-01-24RDS: IB: Remove an unused structure memberBart Van Assche
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Cc: David S. Miller <davem@davemloft.net> Cc: linux-rdma@vger.kernel.org Cc: netdev@vger.kernel.org Cc: rds-devel@oss.oracle.com Signed-off-by: Doug Ledford <dledford@redhat.com>
2017-01-06RDS: validate the requested traces user input against max supportedsantosh.shilimkar@oracle.com
Larger than supported value can lead to array read/write overflow. Reported-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-02RDS: add receive message trace used by applicationSantosh Shilimkar
Socket option to tap receive path latency in various stages in nano seconds. It can be enabled on selective sockets using using SO_RDS_MSG_RXPATH_LATENCY socket option. RDS will return the data to application with RDS_CMSG_RXPATH_LATENCY in defined format. Scope is left to add more trace points for future without need of change in the interface. Reviewed-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
2017-01-02RDS: make message size limit compliant with specAvinash Repaka
RDS support max message size as 1M but the code doesn't check this in all cases. Patch fixes it for RDMA & non-RDMA and RDS MR size and its enforced irrespective of underlying transport. Signed-off-by: Avinash Repaka <avinash.repaka@oracle.com> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
2017-01-02RDS: add stat for socket recv memory usageVenkat Venkatsubra
Tracks the receive side memory added to scokets and removed from sockets. Signed-off-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
2017-01-02RDS: IB: fix panic due to handlers running post teardownSantosh Shilimkar
Shutdown code reaping loop takes care of emptying the CQ's before they being destroyed. And once tasklets are killed, the hanlders are not expected to run. But because of core tasklet code issues, tasklet handler could still run even after tasklet_kill, RDS IB shutdown code already reaps the CQs before freeing cq/qp resources so as such the handlers have nothing left to do post shutdown. On other hand any handler running after teardown and trying to access already freed qp/cq resources causes issues Patch fixes this race by makes sure that handlers returns without any action post teardown. Reviewed-by: Wengang <wen.gang.wang@oracle.com> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
2017-01-02RDS: RDMA: Fix the composite message user notificationSantosh Shilimkar
When application sends an RDS RDMA composite message consist of RDMA transfer to be followed up by non RDMA payload, it expect to be notified *only* when the full message gets delivered. RDS RDMA notification doesn't behave this way though. Thanks to Venkat for debug and root casuing the issue where only first part of the message(RDMA) was successfully delivered but remainder payload delivery failed. In that case, application should not be notified with a false positive of message delivery success. Fix this case by making sure the user gets notified only after the full message delivery. Reviewed-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
2017-01-02RDS: IB: Add vector spreading for cqsSantosh Shilimkar
Based on available device vectors, allocate cqs accordingly to get better spread of completion vectors which helps performace great deal.. Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
2017-01-02RDS: IB: add few useful cache stastsSantosh Shilimkar
Tracks the ib receive cache total, incoming and frag allocations. Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
2017-01-02RDS: IB: track and log active side endpoint in connectionSantosh Shilimkar
Useful to know the active and passive end points in a RDS IB connection. Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
2017-01-02RDS: RDMA: silence the use_once mr log floodSantosh Shilimkar
In absence of extension headers, message log will keep flooding the console. As such even without use_once we can clean up the MRs so its not really an error case message so make it debug message Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
2017-01-02RDS: IB: split the mr registration and invalidation pathSantosh Shilimkar
MR invalidation in RDS is done in background thread and not in data path like registration. So break the dependency between them which helps to remove the performance bottleneck. Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
2017-01-02RDS: RDMA: return appropriate error on rdma map failuresSantosh Shilimkar
The first message to a remote node should prompt a new connection even if it is RDMA operation. For RDMA operation the MR mapping can fail because connections is not yet up. Since the connection establishment is asynchronous, we make sure the map failure because of unavailable connection reach to the user by appropriate error code. Before returning to the user, lets trigger the connection so that its ready for the next retry. Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
2017-01-02RDS: RDMA: start rdma listening after initQing Huang
This prevents RDS from handling incoming rdma packets before RDS completes initializing its recv/send components. Signed-off-by: Qing Huang <qing.huang@oracle.com> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
2017-01-02RDS: RDMA: fix the ib_map_mr_sg_zbva() argumentSantosh Shilimkar
Fixes warning: Using plain integer as NULL pointer Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
2017-01-02RDS: IB: make the transport retry count smallestSantosh Shilimkar
Transport retry is not much useful since it indicate packet loss in fabric so its better to failover fast rather than longer retry. Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
2017-01-02RDS: IB: include faddr in connection logSantosh Shilimkar
Also use pr_* for it. Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
2017-01-02RDS: mark few internal functions static to make sparse build happySantosh Shilimkar
Fixes below warnings: warning: symbol 'rds_send_probe' was not declared. Should it be static? warning: symbol 'rds_send_ping' was not declared. Should it be static? warning: symbol 'rds_tcp_accept_one_path' was not declared. Should it be static? warning: symbol 'rds_walk_conn_path_info' was not declared. Should it be static? Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
2017-01-02RDS: log the address on bind failureSantosh Shilimkar
It's useful to know the IP address when RDS fails to bind a connection. Thus, adding it to the error message. Orabug: 21894138 Reviewed-by: Wei Lin Guay <wei.lin.guay@oracle.com> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
2016-12-26rds: remove dead codeAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-12-20RDS: use rb_entry()Geliang Tang
To make the code clearer, use rb_entry() instead of container_of() to deal with rbtree. Signed-off-by: Geliang Tang <geliangtang@gmail.com> Reviewed-by: Leon Romanovsky <leonro@mellanox.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-15Merge tag 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma Pull rdma updates from Doug Ledford: "This is the complete update for the rdma stack for this release cycle. Most of it is typical driver and core updates, but there is the entirely new VMWare pvrdma driver. You may have noticed that there were changes in DaveM's pull request to the bnxt Ethernet driver to support a RoCE RDMA driver. The bnxt_re driver was tentatively set to be pulled in this release cycle, but it simply wasn't ready in time and was dropped (a few review comments still to address, and some multi-arch build issues like prefetch() not working across all arches). Summary: - shared mlx5 updates with net stack (will drop out on merge if Dave's tree has already been merged) - driver updates: cxgb4, hfi1, hns-roce, i40iw, mlx4, mlx5, qedr, rxe - debug cleanups - new connection rejection helpers - SRP updates - various misc fixes - new paravirt driver from vmware" * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: (210 commits) IB: Add vmw_pvrdma driver IB/mlx4: fix improper return value IB/ocrdma: fix bad initialization infiniband: nes: return value of skb_linearize should be handled MAINTAINERS: Update Intel RDMA RNIC driver maintainers MAINTAINERS: Remove Mitesh Ahuja from emulex maintainers IB/core: fix unmap_sg argument qede: fix general protection fault may occur on probe IB/mthca: Replace pci_pool_alloc by pci_pool_zalloc mlx5, calc_sq_size(): Make a debug message more informative mlx5: Remove a set-but-not-used variable mlx5: Use { } instead of { 0 } to init struct IB/srp: Make writing the add_target sysfs attr interruptible IB/srp: Make mapping failures easier to debug IB/srp: Make login failures easier to debug IB/srp: Introduce a local variable in srp_add_one() IB/srp: Fix CONFIG_DYNAMIC_DEBUG=n build IB/multicast: Check ib_find_pkey() return value IPoIB: Avoid reading an uninitialized member variable IB/mad: Fix an array index check ...
2016-12-14rds_rdma: log the connection reject messageSteve Wise
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-12-03Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Couple conflicts resolved here: 1) In the MACB driver, a bug fix to properly initialize the RX tail pointer properly overlapped with some changes to support variable sized rings. 2) In XGBE we had a "CONFIG_PM" --> "CONFIG_PM_SLEEP" fix overlapping with a reorganization of the driver to support ACPI, OF, as well as PCI variants of the chip. 3) In 'net' we had several probe error path bug fixes to the stmmac driver, meanwhile a lot of this code was cleaned up and reorganized in 'net-next'. 4) The cls_flower classifier obtained a helper function in 'net-next' called __fl_delete() and this overlapped with Daniel Borkamann's bug fix to use RCU for object destruction in 'net'. It also overlapped with Jiri's change to guard the rhashtable_remove_fast() call with a check against tc_skip_sw(). 5) In mlx4, a revert bug fix in 'net' overlapped with some unrelated changes in 'net-next'. 6) In geneve, a stale header pointer after pskb_expand_head() bug fix in 'net' overlapped with a large reorganization of the same code in 'net-next'. Since the 'net-next' code no longer had the bug in question, there was nothing to do other than to simply take the 'net-next' hunks. Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-02RDS: TCP: unregister_netdevice_notifier() in error path of rds_tcp_init_netSowmini Varadhan
If some error is encountered in rds_tcp_init_net, make sure to unregister_netdevice_notifier(), else we could trigger a panic later on, when the modprobe from a netns fails. Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-18netns: make struct pernet_operations::id unsigned intAlexey Dobriyan
Make struct pernet_operations::id unsigned. There are 2 reasons to do so: 1) This field is really an index into an zero based array and thus is unsigned entity. Using negative value is out-of-bound access by definition. 2) On x86_64 unsigned 32-bit data which are mixed with pointers via array indexing or offsets added or subtracted to pointers are preffered to signed 32-bit data. "int" being used as an array index needs to be sign-extended to 64-bit before being used. void f(long *p, int i) { g(p[i]); } roughly translates to movsx rsi, esi mov rdi, [rsi+...] call g MOVSX is 3 byte instruction which isn't necessary if the variable is unsigned because x86_64 is zero extending by default. Now, there is net_generic() function which, you guessed it right, uses "int" as an array index: static inline void *net_generic(const struct net *net, int id) { ... ptr = ng->ptr[id - 1]; ... } And this function is used a lot, so those sign extensions add up. Patch snipes ~1730 bytes on allyesconfig kernel (without all junk messing with code generation): add/remove: 0/0 grow/shrink: 70/598 up/down: 396/-2126 (-1730) Unfortunately some functions actually grow bigger. This is a semmingly random artefact of code generation with register allocator being used differently. gcc decides that some variable needs to live in new r8+ registers and every access now requires REX prefix. Or it is shifted into r12, so [r12+0] addressing mode has to be used which is longer than [r8] However, overall balance is in negative direction: add/remove: 0/0 grow/shrink: 70/598 up/down: 396/-2126 (-1730) function old new delta nfsd4_lock 3886 3959 +73 tipc_link_build_proto_msg 1096 1140 +44 mac80211_hwsim_new_radio 2776 2808 +32 tipc_mon_rcv 1032 1058 +26 svcauth_gss_legacy_init 1413 1429 +16 tipc_bcbase_select_primary 379 392 +13 nfsd4_exchange_id 1247 1260 +13 nfsd4_setclientid_confirm 782 793 +11 ... put_client_renew_locked 494 480 -14 ip_set_sockfn_get 730 716 -14 geneve_sock_add 829 813 -16 nfsd4_sequence_done 721 703 -18 nlmclnt_lookup_host 708 686 -22 nfsd4_lockt 1085 1063 -22 nfs_get_client 1077 1050 -27 tcf_bpf_init 1106 1076 -30 nfsd4_encode_fattr 5997 5930 -67 Total: Before=154856051, After=154854321, chg -0.00% Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-17RDS: TCP: Force every connection to be initiated by numerically smaller IP ↵Sowmini Varadhan
address When 2 RDS peers initiate an RDS-TCP connection simultaneously, there is a potential for "duelling syns" on either/both sides. See commit 241b271952eb ("RDS-TCP: Reset tcp callbacks if re-using an outgoing socket in rds_tcp_accept_one()") for a description of this condition, and the arbitration logic which ensures that the numerically large IP address in the TCP connection is bound to the RDS_TCP_PORT ("canonical ordering"). The rds_connection should not be marked as RDS_CONN_UP until the arbitration logic has converged for the following reason. The sender may start transmitting RDS datagrams as soon as RDS_CONN_UP is set, and since the sender removes all datagrams from the rds_connection's cp_retrans queue based on TCP acks. If the TCP ack was sent from a tcp socket that got reset as part of duel aribitration (but before data was delivered to the receivers RDS socket layer), the sender may end up prematurely freeing the datagram, and the datagram is no longer reliably deliverable. This patch remedies that condition by making sure that, upon receipt of 3WH completion state change notification of TCP_ESTABLISHED in rds_tcp_state_change, we mark the rds_connection as RDS_CONN_UP if, and only if, the IP addresses and ports for the connection are canonically ordered. In all other cases, rds_tcp_state_change will force an rds_conn_path_drop(), and rds_queue_reconnect() on both peers will restart the connection to ensure canonical ordering. A side-effect of enforcing this condition in rds_tcp_state_change() is that rds_tcp_accept_one_path() can now be refactored for simplicity. It is also no longer possible to encounter an RDS_CONN_UP connection in the arbitration logic in rds_tcp_accept_one(). Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-17RDS: TCP: Track peer's connection generation numberSowmini Varadhan
The RDS transport has to be able to distinguish between two types of failure events: (a) when the transport fails (e.g., TCP connection reset) but the RDS socket/connection layer on both sides stays the same (b) when the peer's RDS layer itself resets (e.g., due to module reload or machine reboot at the peer) In case (a) both sides must reconnect and continue the RDS messaging without any message loss or disruption to the message sequence numbers, and this is achieved by rds_send_path_reset(). In case (b) we should reset all rds_connection state to the new incarnation of the peer. Examples of state that needs to be reset are next expected rx sequence number from, or messages to be retransmitted to, the new incarnation of the peer. To achieve this, the RDS handshake probe added as part of commit 5916e2c1554f ("RDS: TCP: Enable multipath RDS for TCP") is enhanced so that sender and receiver of the RDS ping-probe will add a generation number as part of the RDS_EXTHDR_GEN_NUM extension header. Each peer stores local and remote generation numbers as part of each rds_connection. Changes in generation number will be detected via incoming handshake probe ping request or response and will allow the receiver to reset rds_connection state. Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-17RDS: TCP: set RDS_FLAG_RETRANSMITTED in cp_retrans listSowmini Varadhan
As noted in rds_recv_incoming() sequence numbers on data packets can decreas for the failover case, and the Rx path is equipped to recover from this, if the RDS_FLAG_RETRANSMITTED is set on the rds header of an incoming message with a suspect sequence number. The RDS_FLAG_RETRANSMITTED is predicated on the RDS_FLAG_RETRANSMITTED flag in the rds_message, so make sure the flag is set on messages queued for retransmission. Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09RDS: TCP: start multipath acceptor loop at 0Sowmini Varadhan
The for() loop in rds_tcp_accept_one() assumes that the 0'th rds_tcp_conn_path is UP and starts multipath accepts at index 1. But this assumption may not always be true: if the 0'th path has failed (ERROR or DOWN state) an incoming connection request should be used to resurrect this path. Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09RDS: TCP: report addr/port info based on TCP socket in rds-infoSowmini Varadhan
The socket argument passed to rds_tcp_tc_info() is a PF_RDS socket, so it is incorrect to report the address port info based on rds_getname() as part of TCP state report. Invoke inet_getname() for the t_sock associated with the rds_tcp_connection instead. Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-30Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Mostly simple overlapping changes. For example, David Ahern's adjacency list revamp in 'net-next' conflicted with an adjacency list traversal bug fix in 'net'. Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-29rds: debug messages are enabled by defaultshamir rabinovitch
rds use Kconfig option called "RDS_DEBUG" to enable rds debug messages. This option cause the rds Makefile to add -DDEBUG to the rds gcc command line. When CONFIG_DYNAMIC_DEBUG is enabled, the "DEBUG" macro is used by include/linux/dynamic_debug.h to decide if dynamic debug prints should be sent by default to the kernel log. rds should not enable this macro for production builds. rds dynamic debug work as expected follow this fix. Signed-off-by: Shamir Rabinovitch <shamir.rabinovitch@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Reviewed-by: Wengang Wang <wen.gang.wang@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-17rds: Remove duplicate prefix from rds_conn_path_error useJoe Perches
rds_conn_path_error already prefixes "RDS:" to the output. Signed-off-by: Joe Perches <joe@perches.com> Acked-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>