summaryrefslogtreecommitdiff
path: root/net/rds/ib_rdma.c
AgeCommit message (Collapse)Author
2015-10-05RDS: IB: split mr pool to improve 8K messages performanceSantosh Shilimkar
8K message sizes are pretty important usecase for RDS current workloads so we make provison to have 8K mrs available from the pool. Based on number of SG's in the RDS message, we pick a pool to use. Also to make sure that we don't under utlise mrs when say 8k messages are dominating which could lead to 8k pull being exhausted, we fall-back to 1m pool till 8k pool recovers for use. This helps to at least push ~55 kB/s bidirectional data which is a nice improvement. Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
2015-10-05RDS: IB: mark rds_ib_fmr_wq staticSantosh Shilimkar
Fix below warning by marking rds_ib_fmr_wq static net/rds/ib_rdma.c:87:25: warning: symbol 'rds_ib_fmr_wq' was not declared. Should it be static? Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
2015-10-05RDS: IB: use already available pool handle from ibmrSantosh Shilimkar
rds_ib_mr already keeps the pool handle which it associates with. Lets use that instead of round about way of fetching it from rds_ib_device. No functional change. Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
2015-10-05RDS: IB: fix the rds_ib_fmr_wq kick callSantosh Shilimkar
RDS IB mr pool has its own workqueue 'rds_ib_fmr_wq', so we need to use queue_delayed_work() to kick the work. This was hurting the performance since pool maintenance was less often triggered from other path. Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
2015-09-30RDS: use kfree_rcu in rds_ib_remove_ipaddrSantosh Shilimkar
synchronize_rcu() slowing down un-necessarily the socket shutdown path. It is used just kfree() the ip addresses in rds_ib_remove_ipaddr() which is perfect usecase for kfree_rcu(); So lets use that to gain some speedup. Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
2015-08-25RDS: remove superfluous from rds_ib_alloc_fmr()santosh.shilimkar@oracle.com
Memory allocated for 'ibmr' uses kzalloc_node() which already initialises the memory to zero. There is no need to do memset() 0 on that memory. Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-25RDS: flush the FMR pool less oftensantosh.shilimkar@oracle.com
FMR flush is an expensive and time consuming operation. Reduce the frequency of FMR pool flush by 50% so that more FMR work gets accumulated for more efficient flushing. Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-25RDS: push FMR pool flush work to its own workersantosh.shilimkar@oracle.com
RDS FMR flush operation and also it races with connect/reconect which happes a lot with RDS. FMR flush being on common rds_wq aggrevates the problem. Lets push RDS FMR pool flush work to its own worker. Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-25RDS: fix fmr pool dirty_countWengang Wang
In rds_ib_flush_mr_pool(), dirty_count accounts the clean ones which is wrong. This can lead to a negative dirty count value. Lets fix it. Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com> Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-25RDS: Fix assertion level from fatal to warningsantosh.shilimkar@oracle.com
Fix the asserion level since its not fatal and can be hit in normal execution paths. There is no need to take the system down. We keep the WARN_ON() to detect the condition if we get here with bad pages. Reviewed-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com> Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-25RDS: don't update ip address tables if the address hasn't changedsantosh.shilimkar@oracle.com
If the ip address tables hasn't changed, there is no need to remove them only to be added back again. Lets fix it. Reviewed-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com> Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-14rds: rds_ib_device.refcount overflowWengang Wang
Fixes: 3e0249f9c05c ("RDS/IB: add refcount tracking to struct rds_ib_device") There lacks a dropping on rds_ib_device.refcount in case rds_ib_alloc_fmr failed(mr pool running out). this lead to the refcount overflow. A complain in line 117(see following) is seen. From vmcore: s_ib_rdma_mr_pool_depleted is 2147485544 and rds_ibdev->refcount is -2147475448. That is the evidence the mr pool is used up. so rds_ib_alloc_fmr is very likely to return ERR_PTR(-EAGAIN). 115 void rds_ib_dev_put(struct rds_ib_device *rds_ibdev) 116 { 117 BUG_ON(atomic_read(&rds_ibdev->refcount) <= 0); 118 if (atomic_dec_and_test(&rds_ibdev->refcount)) 119 queue_work(rds_wq, &rds_ibdev->free_work); 120 } fix is to drop refcount when rds_ib_alloc_fmr failed. Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com> Reviewed-by: Haggai Eran <haggaie@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2014-08-26net: Replace get_cpu_var through this_cpu_ptrChristoph Lameter
Replace uses of get_cpu_var for address calculation through this_cpu_ptr. Cc: netdev@vger.kernel.org Cc: Eric Dumazet <edumazet@google.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2011-09-15net, rds, Replace xlist in net/rds/xlist.h with llistHuang Ying
The functionality of xlist and llist is almost same. This patch replace xlist with llist to avoid code duplication. Known issues: don't know how to test this, need special hardware? Signed-off-by: Huang Ying <ying.huang@intel.com> Cc: Chris Mason <chris.mason@oracle.com> Cc: Andy Grover <andy.grover@oracle.com> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-01rds/ib: use system_wq instead of rds_ib_fmr_wqTejun Heo
With cmwq, there's no reason to use dedicated rds_ib_fmr_wq - it's not in the memory reclaim path and the maximum number of concurrent work items is bound by the number of devices. Drop it and use system_wq instead. This rds_ib_fmr_init/exit() noops. Both removed. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Andy Grover <andy.grover@oracle.com>
2010-10-21rds: make local functions/variables staticstephen hemminger
The RDS protocol has lots of functions that should be declared static. rds_message_get/add_version_extension is removed since it defined but never used. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-09-19rds: spin_lock_irq() is not nestableDan Carpenter
This is basically just a cleanup. IRQs were disabled on the previous line so we don't need to do it again here. In the current code IRQs would get turned on one line earlier than intended. Signed-off-by: Dan Carpenter <error27@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-09-08RDS/IB: protect the list of IB devicesZach Brown
The RDS IB device list wasn't protected by any locking. Traversal in both the get_mr and FMR flushing paths could race with additon and removal. List manipulation is done with RCU primatives and is protected by the write side of a rwsem. The list traversal in the get_mr fast path is protected by a rcu read critical section. The FMR list traversal is more problematic because it can block while traversing the list. We protect this with the read side of the rwsem. Signed-off-by: Zach Brown <zach.brown@oracle.com>
2010-09-08RDS: flush fmrs before allocating new onesChris Mason
Flushing FMRs is somewhat expensive, and is currently kicked off when the interrupt handler notices that we are getting low. The result of this is that FMR flushing only happens from the interrupt cpus. This spreads the load more effectively by triggering flushes just before we allocate a new FMR. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-09-08RDS: remove __init and __exit annotationZach Brown
The trivial amount of memory saved isn't worth the cost of dealing with section mismatches. Signed-off-by: Zach Brown <zach.brown@oracle.com>
2010-09-08RDS/IB: create a work queue for FMR flushingZach Brown
This patch moves the FMR flushing work in to its own mult-threaded work queue. This is to maintain performance in preparation for returning the main krdsd work queue back to a single threaded work queue to avoid deep-rooted concurrency bugs. This is also good because it further separates FMRs, which might be removed some day, from the rest of the code base. Signed-off-by: Zach Brown <zach.brown@oracle.com>
2010-09-08RDS/IB: destroy connections on rmmodZach Brown
IB connections were not being destroyed during rmmod. First, recently IB device removal callback was changed to disconnect connections that used the removing device rather than destroying them. So connections with devices during rmmod were not being destroyed. Second, rds_ib_destroy_nodev_conns() was being called before connections are disassociated with devices. It would almost never find connections in the nodev list. We first get rid of rds_ib_destroy_conns(), which is no longer called, and refactor the existing caller into the main body of the function and get rid of the list and lock wrappers. Then we call rds_ib_destroy_nodev_conns() *after* ib_unregister_client() has removed the IB device from all the conns and put the conns on the nodev list. The result is that IB connections are destroyed by rmmod. Signed-off-by: Zach Brown <zach.brown@oracle.com>
2010-09-08RDS: whitespaceAndy Grover
2010-09-08RDS: use delayed work for the FMR flushesChris Mason
Using a delayed work queue helps us make sure a healthy number of FMRs have queued up over the limit. It makes for a large improvement in RDMA iops. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-09-08rds: recycle FMRs through lockless listsChris Mason
FRM allocation and recycling is performance critical and fairly lock intensive. The current code has a per connection lock that all processes bang on and it becomes a major bottleneck on large systems. This changes things to use a number of cmpxchg based lists instead, allowing us to go through the whole FMR lifecycle without locking inside RDS. Zach Brown pointed out that our usage of cmpxchg for xlist removal is racey if someone manages to remove and add back an FMR struct into the list while another CPU can see the FMR's address at the head of the list. The second CPU might assume the list hasn't changed when in fact any number of operations might have happened in between the deletion and reinsertion. This commit maintains a per cpu count of CPUs that are currently in xlist removal, and establishes a grace period to make sure that nobody can see an entry we have just removed from the list. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-09-08RDS/IB: add refcount tracking to struct rds_ib_deviceZach Brown
The RDS IB client .remove callback used to free the rds_ibdev for the given device unconditionally. This could race other users of the struct. This patch adds refcounting so that we only free the rds_ibdev once all of its users are done. Many rds_ibdev users are tied to connections. We give the connection a reference and change these users to reference the device in the connection instead of looking it up in the IB client data. The only user of the IB client data remaining is the first lookup of the device as connections are built up. Incrementing the reference count of a device found in the IB client data could race with final freeing so we use an RCU grace period to make sure that freeing won't happen until those lookups are done. MRs need the rds_ibdev to get at the pool that they're freed in to. They exist outside a connection and many MRs can reference different devices from one socket, so it was natural to have each MR hold a reference. MR refs can be dropped from interrupt handlers and final device teardown can block so we push it off to a work struct. Pool teardown had to be fixed to cancel its pending work instead of deadlocking waiting for all queued work, including itself, to finish. MRs get their reference from the global device list, which gets a reference. It is left unprotected by locks and remains racy. A simple global lock would be a significant bottleneck. More scalable (complicated) locking should be done carefully in a later patch. Signed-off-by: Zach Brown <zach.brown@oracle.com>
2010-09-08rds: Use RCU for the bind lookup searchesChris Mason
The RDS bind lookups are somewhat expensive in terms of CPU time and locking overhead. This commit changes them into a faster RCU based hash tree instead of the rbtrees they were using before. On large NUMA systems it is a significant improvement. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-09-08RDS/IB: add _to_node() macros for numa and use {k,v}malloc_node()Andy Grover
Allocate send/recv rings in memory that is node-local to the HCA. This significantly helps performance. Signed-off-by: Andy Grover <andy.grover@oracle.com>
2010-09-08RDS/IB: Remove unused variable in ib_remove_addr()Andy Grover
Signed-off-by: Andy Grover <andy.grover@oracle.com>
2010-09-08rds: rcu-ize rds_ib_get_device()Chris Mason
rds_ib_get_device is called very often as we turn an ip address into a corresponding device structure. It currently take a global spinlock as it walks different lists to find active devices. This commit changes the lists over to RCU, which isn't very complex because they are not updated very often at all. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-09-08RDS: Implement atomic operationsAndy Grover
Implement a CMSG-based interface to do FADD and CSWP ops. Alter send routines to handle atomic ops. Add atomic counters to stats. Add xmit_atomic() to struct rds_transport Inline rds_ib_send_unmap_rdma into unmap_rm Signed-off-by: Andy Grover <andy.grover@oracle.com>
2010-09-08RDS: fold rdma.h into rds.hAndy Grover
RDMA is now an intrinsic part of RDS, so it's easier to just have a single header. Signed-off-by: Andy Grover <andy.grover@oracle.com>
2010-09-08RDS: Fix BUG_ONs to not fire when in a taskletAndy Grover
in_interrupt() is true in softirqs. The BUG_ONs are supposed to check for if irqs are disabled, so we should use BUG_ON(irqs_disabled()) instead, duh. Signed-off-by: Andy Grover <andy.grover@oracle.com>
2010-04-11Merge branch 'master' of ↵David S. Miller
master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 Conflicts: drivers/net/stmmac/stmmac_main.c drivers/net/wireless/wl12xx/wl1271_cmd.c drivers/net/wireless/wl12xx/wl1271_main.c drivers/net/wireless/wl12xx/wl1271_spi.c net/core/ethtool.c net/mac80211/scan.c
2010-03-30include cleanup: Update gfp.h and slab.h includes to prepare for breaking ↵Tejun Heo
implicit slab.h inclusion from percpu.h percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies. percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities include those headers directly instead of assuming availability. As this conversion needs to touch large number of source files, the following script is used as the basis of conversion. http://userweb.kernel.org/~tj/misc/slabh-sweep.py The script does the followings. * Scan files for gfp and slab usages and update includes such that only the necessary includes are there. ie. if only gfp is used, gfp.h, if slab is used, slab.h. * When the script inserts a new include, it looks at the include blocks and try to put the new include such that its order conforms to its surrounding. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order. * If the script can't find a place to put a new include (mostly because the file doesn't have fitting include block), it prints out an error message indicating which .h file needs to be added to the file. The conversion was done in the following steps. 1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files. 2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others. This step added inclusions to around 150 files. 3. The script was run again and the output was compared to the edits from #2 to make sure no file was left behind. 4. Several build tests were done and a couple of problems were fixed. e.g. lib/decompress_*.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually. 5. The script was run on all .h files but without automatically editing them as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually wildly available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary. 6. percpu.h was updated not to include slab.h. 7. Build test were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq). * x86 and x86_64 UP and SMP allmodconfig and a custom test config. * powerpc and powerpc64 SMP allmodconfig * sparc and sparc64 SMP allmodconfig * ia64 SMP allmodconfig * s390 SMP allmodconfig * alpha SMP allmodconfig * um on x86_64 SMP allmodconfig 8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as bisection point. Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers which should be easily discoverable easily on most builds of the specific arch. Signed-off-by: Tejun Heo <tj@kernel.org> Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
2010-03-16RDS: Do not call set_page_dirty() with irqs offAndy Grover
set_page_dirty() unconditionally re-enables interrupts, so if we call it with irqs off, they will be on after the call, and that's bad. This patch moves the call after we've re-enabled interrupts in send_drop_to(), so it's safe. Also, add BUG_ONs to let us know if we ever do call set_page_dirty with interrupts off. Signed-off-by: Andy Grover <andy.grover@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-03-16RDS: Workaround for in-use MRs on close causing crashAndy Grover
if a machine is shut down without closing sockets properly, and freeing all MRs, then a BUG_ON will bring it down. This patch changes these to WARN_ONs -- leaking MRs is not fatal (although not ideal, and there is more work to do here for a proper fix.) Signed-off-by: Andy Grover <andy.grover@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2009-11-29net: Move && and || to end of previous lineJoe Perches
Not including net/atm/ Compiled tested x86 allyesconfig only Added a > 80 column line or two, which I ignored. Existing checkpatch plaints willfully, cheerfully ignored. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2009-10-30RDS: Fix panic on unloadAndy Grover
Remove explicit destruction of passive connection when destroying active end of the connection. The passive end is also on the device's connection list, and will thus be cleaned up properly. Panic was caused by trying to clean it up twice. Signed-off-by: Andy Grover <andy.grover@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2009-07-20RDS/IB: Always use PAGE_SIZE for FMR page sizeAndy Grover
While FMRs allow significant flexibility in what size of pages they can use, we really just want FMR pages to match CPU page size. Roland says we can count on this always being supported, so this simplifies things. Signed-off-by: Andy Grover <andy.grover@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2009-04-02RDS: Rewrite connection cleanup, fixing oops on rmmodAndy Grover
This fixes a bug where a connection was unexpectedly not on *any* list while being destroyed. It also cleans up some code duplication and regularizes some function names. * Grab appropriate lock in conn_free() and explain in comment * Ensure via locking that a conn is never not on either a dev's list or the nodev list * Add rds_xx_remove_conn() to match rds_xx_add_conn() * Make rds_xx_add_conn() return void * Rename remove_{,nodev_}conns() to destroy_{,nodev_}conns() and unify their implementation in a helper function * Document lock ordering as nodev conn_lock before dev_conn_lock Reported-by: Yosef Etigin <yosefe@voltaire.com> Signed-off-by: Andy Grover <andy.grover@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2009-02-26RDS/IB: Implement RDMA ops using FMRsAndy Grover
Signed-off-by: Andy Grover <andy.grover@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>