Age | Commit message (Collapse) | Author |
|
There are 2 call chains:
a) xsk_bind --> xdp_umem_assign_dev
b) unregister_netdevice_queue --> xsk_notifier
with the following locking order:
a) xs->mutex --> rtnl_lock
b) rtnl_lock --> xdp.lock --> xs->mutex
Different order of taking 'xs->mutex' and 'rtnl_lock' could produce a
deadlock here. Fix that by moving the 'rtnl_lock' before 'xs->lock' in
the bind call chain (a).
Reported-by: syzbot+bf64ec93de836d7f4c2c@syzkaller.appspotmail.com
Fixes: 455302d1c9ae ("xdp: fix hang while unregistering device bound to xdp socket")
Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
|
|
Completion queue address reservation could not be undone.
In case of bad 'queue_id' or skb allocation failure, reserved entry
will be leaked reducing the total capacity of completion queue.
Fix that by moving reservation to the point where failure is not
possible. Additionally, 'queue_id' checking moved out from the loop
since there is no point to check it there.
Fixes: 35fcde7f8deb ("xsk: support for Tx")
Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Tested-by: William Tu <u9012063@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/ssantosh/linux
Santosh Shilimkar says:
====================
rds fixes
Few rds fixes which makes rds rdma transport reliably working on mainline
First two fixes are applicable to v4.11+ stable versions and last
three patches applies to only v5.1 stable and current mainline.
Patchset is re-based against 'net' and also available on below tree
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Pull SCSI scatter-gather list updates from James Bottomley:
"This topic branch covers a fundamental change in how our sg lists are
allocated to make mq more efficient by reducing the size of the
preallocated sg list.
This necessitates a large number of driver changes because the
previous guarantee that if a driver specified SG_ALL as the size of
its scatter list, it would get a non-chained list and didn't need to
bother with scatterlist iterators is now broken and every driver
*must* use scatterlist iterators.
This was broken out as a separate topic because we need to convert all
the drivers before pulling the trigger and unconverted drivers kept
being found, necessitating a rebase"
* tag 'scsi-sg' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (21 commits)
scsi: core: don't preallocate small SGL in case of NO_SG_CHAIN
scsi: lib/sg_pool.c: clear 'first_chunk' in case of no preallocation
scsi: core: avoid preallocating big SGL for data
scsi: core: avoid preallocating big SGL for protection information
scsi: lib/sg_pool.c: improve APIs for allocating sg pool
scsi: esp: use sg helper to iterate over scatterlist
scsi: NCR5380: use sg helper to iterate over scatterlist
scsi: wd33c93: use sg helper to iterate over scatterlist
scsi: ppa: use sg helper to iterate over scatterlist
scsi: pcmcia: nsp_cs: use sg helper to iterate over scatterlist
scsi: imm: use sg helper to iterate over scatterlist
scsi: aha152x: use sg helper to iterate over scatterlist
scsi: s390: zfcp_fc: use sg helper to iterate over scatterlist
scsi: staging: unisys: visorhba: use sg helper to iterate over scatterlist
scsi: usb: image: microtek: use sg helper to iterate over scatterlist
scsi: pmcraid: use sg helper to iterate over scatterlist
scsi: ipr: use sg helper to iterate over scatterlist
scsi: mvumi: use sg helper to iterate over scatterlist
scsi: lpfc: use sg helper to iterate over scatterlist
scsi: advansys: use sg helper to iterate over scatterlist
...
|
|
fl_create() should call static_branch_deferred_inc() only in
case of success.
Also we should not call fl_free() in error path, as this could
cause a static key imbalance.
jump label: negative count!
WARNING: CPU: 0 PID: 15907 at kernel/jump_label.c:221 static_key_slow_try_dec kernel/jump_label.c:221 [inline]
WARNING: CPU: 0 PID: 15907 at kernel/jump_label.c:221 static_key_slow_try_dec+0x1ab/0x1d0 kernel/jump_label.c:206
Kernel panic - not syncing: panic_on_warn set ...
CPU: 0 PID: 15907 Comm: syz-executor.2 Not tainted 5.2.0-rc6+ #62
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x172/0x1f0 lib/dump_stack.c:113
panic+0x2cb/0x744 kernel/panic.c:219
__warn.cold+0x20/0x4d kernel/panic.c:576
report_bug+0x263/0x2b0 lib/bug.c:186
fixup_bug arch/x86/kernel/traps.c:179 [inline]
fixup_bug arch/x86/kernel/traps.c:174 [inline]
do_error_trap+0x11b/0x200 arch/x86/kernel/traps.c:272
do_invalid_op+0x37/0x50 arch/x86/kernel/traps.c:291
invalid_op+0x14/0x20 arch/x86/entry/entry_64.S:986
RIP: 0010:static_key_slow_try_dec kernel/jump_label.c:221 [inline]
RIP: 0010:static_key_slow_try_dec+0x1ab/0x1d0 kernel/jump_label.c:206
Code: c0 e8 e9 3e e5 ff 83 fb 01 0f 85 32 ff ff ff e8 5b 3d e5 ff 45 31 ff eb a0 e8 51 3d e5 ff 48 c7 c7 40 99 92 87 e8 13 75 b7 ff <0f> 0b eb 8b 4c 89 e7 e8 a9 c0 1e 00 e9 de fe ff ff e8 bf 6d b7 ff
RSP: 0018:ffff88805f9c7450 EFLAGS: 00010286
RAX: 0000000000000000 RBX: 00000000ffffffff RCX: 0000000000000000
RDX: 000000000000e3e1 RSI: ffffffff815adb06 RDI: ffffed100bf38e7c
RBP: ffff88805f9c74e0 R08: ffff88806acf0700 R09: ffffed1015d060a9
R10: ffffed1015d060a8 R11: ffff8880ae830547 R12: ffffffff89832ce0
R13: ffff88805f9c74b8 R14: 1ffff1100bf38e8b R15: 00000000ffffff01
__static_key_slow_dec_deferred+0x65/0x110 kernel/jump_label.c:272
fl_free+0xa9/0xe0 net/ipv6/ip6_flowlabel.c:121
fl_create+0x6af/0x9f0 net/ipv6/ip6_flowlabel.c:457
ipv6_flowlabel_opt+0x80e/0x2730 net/ipv6/ip6_flowlabel.c:624
do_ipv6_setsockopt.isra.0+0x2119/0x4100 net/ipv6/ipv6_sockglue.c:825
ipv6_setsockopt+0xf6/0x170 net/ipv6/ipv6_sockglue.c:944
tcp_setsockopt net/ipv4/tcp.c:3131 [inline]
tcp_setsockopt+0x8f/0xe0 net/ipv4/tcp.c:3125
sock_common_setsockopt+0x94/0xd0 net/core/sock.c:3130
__sys_setsockopt+0x253/0x4b0 net/socket.c:2080
__do_sys_setsockopt net/socket.c:2096 [inline]
__se_sys_setsockopt net/socket.c:2093 [inline]
__x64_sys_setsockopt+0xbe/0x150 net/socket.c:2093
do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x4597c9
Code: fd b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 cb b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007f2670556c78 EFLAGS: 00000246 ORIG_RAX: 0000000000000036
RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00000000004597c9
RDX: 0000000000000020 RSI: 0000000000000029 RDI: 0000000000000003
RBP: 000000000075bfc8 R08: 000000000000fdf7 R09: 0000000000000000
R10: 0000000020000000 R11: 0000000000000246 R12: 00007f26705576d4
R13: 00000000004cec00 R14: 00000000004dd520 R15: 00000000ffffffff
Kernel Offset: disabled
Rebooting in 86400 seconds..
Fixes: 59c820b2317f ("ipv6: elide flowlabel check if no exclusive leases exist")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Willem forgot to change one of the calls to fl6_sock_lookup(),
which can now return an error or NULL.
syzbot reported :
kasan: CONFIG_KASAN_INLINE enabled
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] PREEMPT SMP KASAN
CPU: 1 PID: 31763 Comm: syz-executor.0 Not tainted 5.2.0-rc6+ #63
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:ip6_datagram_dst_update+0x559/0xc30 net/ipv6/datagram.c:83
Code: 00 00 e8 ea 29 3f fb 4d 85 f6 0f 84 96 04 00 00 e8 dc 29 3f fb 49 8d 7e 20 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <80> 3c 02 00 0f 85 16 06 00 00 4d 8b 6e 20 e8 b4 29 3f fb 4c 89 ee
RSP: 0018:ffff88809ba97ae0 EFLAGS: 00010207
RAX: dffffc0000000000 RBX: ffff8880a81254b0 RCX: ffffc90008118000
RDX: 0000000000000003 RSI: ffffffff86319a84 RDI: 000000000000001e
RBP: ffff88809ba97c10 R08: ffff888065e9e700 R09: ffffed1015d26c80
R10: ffffed1015d26c7f R11: ffff8880ae9363fb R12: ffff8880a8124f40
R13: 0000000000000001 R14: fffffffffffffffe R15: ffff88809ba97b40
FS: 00007f38e606a700(0000) GS:ffff8880ae900000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000202c0140 CR3: 00000000a026a000 CR4: 00000000001406e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
__ip6_datagram_connect+0x5e9/0x1390 net/ipv6/datagram.c:246
ip6_datagram_connect+0x30/0x50 net/ipv6/datagram.c:269
ip6_datagram_connect_v6_only+0x69/0x90 net/ipv6/datagram.c:281
inet_dgram_connect+0x14a/0x2d0 net/ipv4/af_inet.c:571
__sys_connect+0x264/0x330 net/socket.c:1824
__do_sys_connect net/socket.c:1835 [inline]
__se_sys_connect net/socket.c:1832 [inline]
__x64_sys_connect+0x73/0xb0 net/socket.c:1832
do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x4597c9
Code: fd b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 cb b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007f38e6069c78 EFLAGS: 00000246 ORIG_RAX: 000000000000002a
RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00000000004597c9
RDX: 000000000000001c RSI: 0000000020000040 RDI: 0000000000000003
RBP: 000000000075bf20 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00007f38e606a6d4
R13: 00000000004bfd07 R14: 00000000004d1838 R15: 00000000ffffffff
Modules linked in:
RIP: 0010:ip6_datagram_dst_update+0x559/0xc30 net/ipv6/datagram.c:83
Code: 00 00 e8 ea 29 3f fb 4d 85 f6 0f 84 96 04 00 00 e8 dc 29 3f fb 49 8d 7e 20 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <80> 3c 02 00 0f 85 16 06 00 00 4d 8b 6e 20 e8 b4 29 3f fb 4c 89 ee
Fixes: 59c820b2317f ("ipv6: elide flowlabel check if no exclusive leases exist")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
In 323a53c41292 ("ipv6: tcp: enable flowlabel reflection in some RST packets")
and 50a8accf1062 ("ipv6: tcp: send consistent flowlabel in TIME_WAIT state")
we took care of IPv6 flowlabel reflections for two cases.
This patch takes care of the remaining case, when the RST packet
is sent on behalf of a 'full' socket.
In Marek use case, this was a socket in TCP_CLOSE state.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Marek Majkowski <marek@cloudflare.com>
Tested-by: Marek Majkowski <marek@cloudflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The length of AH header is computed manually as (hp->hdrlen+2)<<2.
However, in include/linux/ipv6.h, a macro named ipv6_authlen is
already defined for exactly the same job. This commit replaces
the manual computation code with the macro.
Signed-off-by: yangxingwu <xingwu.yang@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Switching from ->priv_destructor to dellink() has an unexpected
consequence: existing RCU readers, that is, hsr_port_get_hsr()
callers, may still be able to read the port list.
Instead of checking the return value of each hsr_port_get_hsr(),
we can just move it to ->ndo_uninit() which is called after
device unregister and synchronize_net(), and we still have RTNL
lock there.
Fixes: b9a1e627405d ("hsr: implement dellink to clean up resources")
Fixes: edf070a0fb45 ("hsr: fix a NULL pointer deref in hsr_dev_xmit()")
Reported-by: syzbot+097ef84cdc95843fbaa8@syzkaller.appspotmail.com
Cc: Arvid Brodin <arvid.brodin@alten.se>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Pull networking updates from David Miller:
"Some highlights from this development cycle:
1) Big refactoring of ipv6 route and neigh handling to support
nexthop objects configurable as units from userspace. From David
Ahern.
2) Convert explored_states in BPF verifier into a hash table,
significantly decreased state held for programs with bpf2bpf
calls, from Alexei Starovoitov.
3) Implement bpf_send_signal() helper, from Yonghong Song.
4) Various classifier enhancements to mvpp2 driver, from Maxime
Chevallier.
5) Add aRFS support to hns3 driver, from Jian Shen.
6) Fix use after free in inet frags by allocating fqdirs dynamically
and reworking how rhashtable dismantle occurs, from Eric Dumazet.
7) Add act_ctinfo packet classifier action, from Kevin
Darbyshire-Bryant.
8) Add TFO key backup infrastructure, from Jason Baron.
9) Remove several old and unused ISDN drivers, from Arnd Bergmann.
10) Add devlink notifications for flash update status to mlxsw driver,
from Jiri Pirko.
11) Lots of kTLS offload infrastructure fixes, from Jakub Kicinski.
12) Add support for mv88e6250 DSA chips, from Rasmus Villemoes.
13) Various enhancements to ipv6 flow label handling, from Eric
Dumazet and Willem de Bruijn.
14) Support TLS offload in nfp driver, from Jakub Kicinski, Dirk van
der Merwe, and others.
15) Various improvements to axienet driver including converting it to
phylink, from Robert Hancock.
16) Add PTP support to sja1105 DSA driver, from Vladimir Oltean.
17) Add mqprio qdisc offload support to dpaa2-eth, from Ioana
Radulescu.
18) Add devlink health reporting to mlx5, from Moshe Shemesh.
19) Convert stmmac over to phylink, from Jose Abreu.
20) Add PTP PHC (Physical Hardware Clock) support to mlxsw, from
Shalom Toledo.
21) Add nftables SYNPROXY support, from Fernando Fernandez Mancera.
22) Convert tcp_fastopen over to use SipHash, from Ard Biesheuvel.
23) Track spill/fill of constants in BPF verifier, from Alexei
Starovoitov.
24) Support bounded loops in BPF, from Alexei Starovoitov.
25) Various page_pool API fixes and improvements, from Jesper Dangaard
Brouer.
26) Just like ipv4, support ref-countless ipv6 route handling. From
Wei Wang.
27) Support VLAN offloading in aquantia driver, from Igor Russkikh.
28) Add AF_XDP zero-copy support to mlx5, from Maxim Mikityanskiy.
29) Add flower GRE encap/decap support to nfp driver, from Pieter
Jansen van Vuuren.
30) Protect against stack overflow when using act_mirred, from John
Hurley.
31) Allow devmap map lookups from eBPF, from Toke Høiland-Jørgensen.
32) Use page_pool API in netsec driver, Ilias Apalodimas.
33) Add Google gve network driver, from Catherine Sullivan.
34) More indirect call avoidance, from Paolo Abeni.
35) Add kTLS TX HW offload support to mlx5, from Tariq Toukan.
36) Add XDP_REDIRECT support to bnxt_en, from Andy Gospodarek.
37) Add MPLS manipulation actions to TC, from John Hurley.
38) Add sending a packet to connection tracking from TC actions, and
then allow flower classifier matching on conntrack state. From
Paul Blakey.
39) Netfilter hw offload support, from Pablo Neira Ayuso"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (2080 commits)
net/mlx5e: Return in default case statement in tx_post_resync_params
mlx5: Return -EINVAL when WARN_ON_ONCE triggers in mlx5e_tls_resync().
net: dsa: add support for BRIDGE_MROUTER attribute
pkt_sched: Include const.h
net: netsec: remove static declaration for netsec_set_tx_de()
net: netsec: remove superfluous if statement
netfilter: nf_tables: add hardware offload support
net: flow_offload: rename tc_cls_flower_offload to flow_cls_offload
net: flow_offload: add flow_block_cb_is_busy() and use it
net: sched: remove tcf block API
drivers: net: use flow block API
net: sched: use flow block API
net: flow_offload: add flow_block_cb_{priv, incref, decref}()
net: flow_offload: add list handling functions
net: flow_offload: add flow_block_cb_alloc() and flow_block_cb_free()
net: flow_offload: rename TCF_BLOCK_BINDER_TYPE_* to FLOW_BLOCK_BINDER_TYPE_*
net: flow_offload: rename TC_BLOCK_{UN}BIND to FLOW_BLOCK_{UN}BIND
net: flow_offload: add flow_block_cb_setup_simple()
net: hisilicon: Add an tx_desc to adapt HI13X1_GMAC
net: hisilicon: Add an rx_desc to adapt HI13X1_GMAC
...
|
|
Pull nfsd updates from Bruce Fields:
"Highlights:
- Add a new /proc/fs/nfsd/clients/ directory which exposes some
long-requested information about NFSv4 clients (like open files)
and allows forced revocation of client state.
- Replace the global duplicate reply cache by a cache per network
namespace; previously, a request in one network namespace could
incorrectly match an entry from another, though we haven't seen
this in production. This is the last remaining container bug that
I'm aware of; at this point you should be able to run separate
nfsd's in each network namespace, each with their own set of
exports, and everything should work.
- Cleanup and modify lock code to show the pid of lockd as the owner
of NLM locks. This is the correct version of the bugfix originally
attempted in b8eee0e90f97 ("lockd: Show pid of lockd for remote
locks")"
* tag 'nfsd-5.3' of git://linux-nfs.org/~bfields/linux: (34 commits)
nfsd: Make __get_nfsdfs_client() static
nfsd: Make two functions static
nfsd: Fix misuse of strlcpy
sunrpc/cache: remove the exporting of cache_seq_next
nfsd: decode implementation id
nfsd: create xdr_netobj_dup helper
nfsd: allow forced expiration of NFSv4 clients
nfsd: create get_nfsdfs_clp helper
nfsd4: show layout stateids
nfsd: show lock and deleg stateids
nfsd4: add file to display list of client's opens
nfsd: add more information to client info file
nfsd: escape high characters in binary data
nfsd: copy client's address including port number to cl_addr
nfsd4: add a client info file
nfsd: make client/ directory names small ints
nfsd: add nfsd/clients directory
nfsd4: use reference count to free client
nfsd: rename cl_refcount
nfsd: persist nfsd filesystem across mounts
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull fsnotify updates from Jan Kara:
"This contains cleanups of the fsnotify name removal hook and also a
patch to disable fanotify permission events for 'proc' filesystem"
* tag 'fsnotify_for_v5.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
fsnotify: get rid of fsnotify_nameremove()
fsnotify: move fsnotify_nameremove() hook out of d_delete()
configfs: call fsnotify_rmdir() hook
debugfs: call fsnotify_{unlink,rmdir}() hooks
debugfs: simplify __debugfs_remove_file()
devpts: call fsnotify_unlink() hook
tracefs: call fsnotify_{unlink,rmdir}() hooks
rpc_pipefs: call fsnotify_{unlink,rmdir}() hooks
btrfs: call fsnotify_rmdir() hook
fsnotify: add empty fsnotify_{unlink,rmdir}() hooks
fanotify: Disallow permission events for proc filesystem
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs"
This reverts merge 0f75ef6a9cff49ff612f7ce0578bced9d0b38325 (and thus
effectively commits
7a1ade847596 ("keys: Provide KEYCTL_GRANT_PERMISSION")
2e12256b9a76 ("keys: Replace uid/gid/perm permissions checking with an ACL")
that the merge brought in).
It turns out that it breaks booting with an encrypted volume, and Eric
biggers reports that it also breaks the fscrypt tests [1] and loading of
in-kernel X.509 certificates [2].
The root cause of all the breakage is likely the same, but David Howells
is off email so rather than try to work it out it's getting reverted in
order to not impact the rest of the merge window.
[1] https://lore.kernel.org/lkml/20190710011559.GA7973@sol.localdomain/
[2] https://lore.kernel.org/lkml/20190710013225.GB7973@sol.localdomain/
Link: https://lore.kernel.org/lkml/CAHk-=wjxoeMJfeBahnWH=9zShKp2bsVy527vo3_y8HfOdhwAAw@mail.gmail.com/
Reported-by: Eric Biggers <ebiggers@kernel.org>
Cc: David Howells <dhowells@redhat.com>
Cc: James Morris <jmorris@namei.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Connections with legitimate tos values can get into usual connection
race. It can result in consumer reject. We don't want tos value or
protocol version to be demoted for such connections otherwise
piers would end up different tos values which can results in
no connection. Example a peer initiated connection with say
tos 8 while usual connection racing can get downgraded to tos 0
which is not desirable.
Patch fixes above issue introduced by commit
commit d021fabf525f ("rds: rdma: add consumer reject")
Reported-by: Yanjun Zhu <yanjun.zhu@oracle.com>
Tested-by: Yanjun Zhu <yanjun.zhu@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
|
|
The proper "tos" value needs to be returned
to user-space (sockopt RDS_INFO_CONNECTIONS).
Fixes: 3eb450367d08 ("rds: add type of service(tos) infrastructure")
Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com>
Reviewed-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
|
|
Prior to
commit d021fabf525ff ("rds: rdma: add consumer reject")
function "rds_rdma_cm_event_handler_cmn" would always honor a rejected
connection attempt by issuing a "rds_conn_drop".
The commit mentioned above added a "break", eliminating
the "fallthrough" case and made the "rds_conn_drop" rather conditional:
Now it only happens if a "consumer defined" reject (i.e. "rdma_reject")
carries an integer-value of "1" inside "private_data":
if (!conn)
break;
err = (int *)rdma_consumer_reject_data(cm_id, event, &len);
if (!err || (err && ((*err) == RDS_RDMA_REJ_INCOMPAT))) {
pr_warn("RDS/RDMA: conn <%pI6c, %pI6c> rejected, dropping connection\n",
&conn->c_laddr, &conn->c_faddr);
conn->c_proposed_version = RDS_PROTOCOL_COMPAT_VERSION;
rds_conn_drop(conn);
}
rdsdebug("Connection rejected: %s\n",
rdma_reject_msg(cm_id, event->status));
break;
/* FALLTHROUGH */
A number of issues are worth mentioning here:
#1) Previous versions of the RDS code simply rejected a connection
by calling "rdma_reject(cm_id, NULL, 0);"
So the value of the payload in "private_data" will not be "1",
but "0".
#2) Now the code has become dependent on host byte order and sizing.
If one peer is big-endian, the other is little-endian,
or there's a difference in sizeof(int) (e.g. ILP64 vs LP64),
the *err check does not work as intended.
#3) There is no check for "len" to see if the data behind *err is even valid.
Luckily, it appears that the "rdma_reject(cm_id, NULL, 0)" will always
carry 148 bytes of zeroized payload.
But that should probably not be relied upon here.
#4) With the added "break;",
we might as well drop the misleading "/* FALLTHROUGH */" comment.
This commit does _not_ address issue #2, as the sender would have to
agree on a byte order as well.
Here is the sequence of messages in this observed error-scenario:
Host-A is pre-QoS changes (excluding the commit mentioned above)
Host-B is post-QoS changes (including the commit mentioned above)
#1 Host-B
issues a connection request via function "rds_conn_path_transition"
connection state transitions to "RDS_CONN_CONNECTING"
#2 Host-A
rejects the incompatible connection request (from #1)
It does so by calling "rdma_reject(cm_id, NULL, 0);"
#3 Host-B
receives an "RDMA_CM_EVENT_REJECTED" event (from #2)
But since the code is changed in the way described above,
it won't drop the connection here, simply because "*err == 0".
#4 Host-A
issues a connection request
#5 Host-B
receives an "RDMA_CM_EVENT_CONNECT_REQUEST" event
and ends up calling "rds_ib_cm_handle_connect".
But since the state is already in "RDS_CONN_CONNECTING"
(as of #1) it will end up issuing a "rdma_reject" without
dropping the connection:
if (rds_conn_state(conn) == RDS_CONN_CONNECTING) {
/* Wait and see - our connect may still be succeeding */
rds_ib_stats_inc(s_ib_connect_raced);
}
goto out;
#6 Host-A
receives an "RDMA_CM_EVENT_REJECTED" event (from #5),
drops the connection and tries again (goto #4) until it gives up.
Tested-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
|
|
This reverts commit 56012459310a1dbcc55c2dbf5500a9f7571402cb.
RDS kept spinning inside function "rds_ib_post_reg_frmr", waiting for
"i_fastreg_wrs" to become incremented:
while (atomic_dec_return(&ibmr->ic->i_fastreg_wrs) <= 0) {
atomic_inc(&ibmr->ic->i_fastreg_wrs);
cpu_relax();
}
Looking at the original commit:
commit 56012459310a ("RDS: IB: split the mr registration and
invalidation path")
In there, the "rds_ib_mr_cqe_handler" was changed in the following
way:
void rds_ib_mr_cqe_handler(struct
rds_ib_connection *ic,
struct ib_wc *wc)
if (frmr->fr_inv) {
frmr->fr_state = FRMR_IS_FREE;
frmr->fr_inv = false;
atomic_inc(&ic->i_fastreg_wrs);
} else {
atomic_inc(&ic->i_fastunreg_wrs);
}
It looks like it's got it exactly backwards:
Function "rds_ib_post_reg_frmr" keeps track of the outstanding
requests via "i_fastreg_wrs".
Function "rds_ib_post_inv" keeps track of the outstanding requests
via "i_fastunreg_wrs" (post original commit). It also sets:
frmr->fr_inv = true;
However the completion handler "rds_ib_mr_cqe_handler" adjusts
"i_fastreg_wrs" when "fr_inv" had been true, and adjusts
"i_fastunreg_wrs" otherwise.
The original commit was done in the name of performance:
to remove the performance bottleneck
No performance benefit could be observed with a fixed-up version
of the original commit measured between two Oracle X7 servers,
both equipped with Mellanox Connect-X5 HCAs.
The prudent course of action is to revert this commit.
Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
|
|
RDS composite message(rdma + control) user notification needs to be
triggered once the full message is delivered and such a fix was
added as part of commit 941f8d55f6d61 ("RDS: RDMA: Fix the composite
message user notification"). But rds_send_remove_from_sock is missing
data part notify check and hence at times the user don't get
notification which isn't desirable.
One way is to fix the rds_send_remove_from_sock to check of that case
but considering the ordering complexity with completion handler and
rdma + control messages are always dispatched back to back in same send
context, just delaying the signaled completion on rmda work request also
gets the desired behaviour. i.e Notifying application only after
RDMA + control message send completes. So patch updates the earlier
fix with this approach. The delay signaling completions of rdma op
till the control message send completes fix was done by Venkat
Venkatsubra in downstream kernel.
Reviewed-and-tested-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Reviewed-by: Gerd Rausch <gerd.rausch@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
|
|
This patch adds support for enabling or disabling the flooding of
unknown multicast traffic on the CPU ports, depending on the value
of the switchdev SWITCHDEV_ATTR_ID_BRIDGE_MROUTER attribute.
The current behavior is kept unchanged but a user can now prevent
the CPU conduit to be flooded with a lot of unregistered traffic that
the network stack needs to filter in software with e.g.:
echo 0 > /sys/class/net/br0/multicast_router
Signed-off-by: Vivien Didelot <vivien.didelot@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This patch adds hardware offload support for nftables through the
existing netdev_ops->ndo_setup_tc() interface, the TC_SETUP_CLSFLOWER
classifier and the flow rule API. This hardware offload support is
available for the NFPROTO_NETDEV family and the ingress hook.
Each nftables expression has a new ->offload interface, that is used to
populate the flow rule object that is attached to the transaction
object.
There is a new per-table NFT_TABLE_F_HW flag, that is set on to offload
an entire table, including all of its chains.
This patch supports for basic metadata (layer 3 and 4 protocol numbers),
5-tuple payload matching and the accept/drop actions; this also includes
basechain hardware offload only.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
And any other existing fields in this structure that refer to tc.
Specifically:
* tc_cls_flower_offload_flow_rule() to flow_cls_offload_flow_rule().
* TC_CLSFLOWER_* to FLOW_CLS_*.
* tc_cls_common_offload to tc_cls_common_offload.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This patch adds a function to check if flow block callback is already in
use. Call this new function from flow_block_cb_setup_simple() and from
drivers.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Unused, now replaced by flow block API.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This patch updates flow_block_cb_setup_simple() to use the flow block API.
Several drivers are also adjusted to use it.
This patch introduces the per-driver list of flow blocks to account for
blocks that are already in use.
Remove tc_block_offload alias.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This patch adds tcf_block_setup() which uses the flow block API.
This infrastructure takes the flow block callbacks coming from the
driver and register/unregister to/from the cls_api core.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This patch completes the flow block API to introduce:
* flow_block_cb_priv() to access callback private data.
* flow_block_cb_incref() to bump reference counter on this flow block.
* flow_block_cb_decref() to decrement the reference counter.
These functions are taken from the existing tcf_block_cb_priv(),
tcf_block_cb_incref() and tcf_block_cb_decref().
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This patch adds the list handling functions for the flow block API:
* flow_block_cb_lookup() allows drivers to look up for existing flow blocks.
* flow_block_cb_add() adds a flow block to the per driver list to be registered
by the core.
* flow_block_cb_remove() to remove a flow block from the list of existing
flow blocks per driver and to request the core to unregister this.
The flow block API also annotates the netns this flow block belongs to.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add a new helper function to allocate flow_block_cb objects.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Rename from TCF_BLOCK_BINDER_TYPE_* to FLOW_BLOCK_BINDER_TYPE_* and
remove temporary tcf_block_binder_type alias.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Rename from TC_BLOCK_{UN}BIND to FLOW_BLOCK_{UN}BIND and remove
temporary tc_block_command alias.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Most drivers do the same thing to set up the flow block callbacks, this
patch adds a helper function to do this.
This preparation patch reduces the number of changes to adapt the
existing drivers to use the flow block callback API.
This new helper function takes a flow block list per-driver, which is
set to NULL until this driver list is used.
This patch also introduces the flow_block_command and
flow_block_binder_type enumerations, which are renamed to use
FLOW_BLOCK_* in follow up patches.
There are three definitions (aliases) in order to reduce the number of
updates in this patch, which go away once drivers are fully adapted to
use this flow block API.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This is done through IORING_OP_RECVMSG. This opcode uses the same
sqe->msg_flags that IORING_OP_SENDMSG added, and we pass in the
msghdr struct in the sqe->addr field as well.
We use MSG_DONTWAIT to force an inline fast path if recvmsg() doesn't
block, and punt to async execution if it would have.
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
This is done through IORING_OP_SENDMSG. There's a new sqe->msg_flags
for the flags argument, and the msghdr struct is passed in the
sqe->addr field.
We use MSG_DONTWAIT to force an inline fast path if sendmsg() doesn't
block, and punt to async execution if it would have.
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Pull Documentation updates from Jonathan Corbet:
"It's been a relatively busy cycle for docs:
- A fair pile of RST conversions, many from Mauro. These create more
than the usual number of simple but annoying merge conflicts with
other trees, unfortunately. He has a lot more of these waiting on
the wings that, I think, will go to you directly later on.
- A new document on how to use merges and rebases in kernel repos,
and one on Spectre vulnerabilities.
- Various improvements to the build system, including automatic
markup of function() references because some people, for reasons I
will never understand, were of the opinion that
:c:func:``function()`` is unattractive and not fun to type.
- We now recommend using sphinx 1.7, but still support back to 1.4.
- Lots of smaller improvements, warning fixes, typo fixes, etc"
* tag 'docs-5.3' of git://git.lwn.net/linux: (129 commits)
docs: automarkup.py: ignore exceptions when seeking for xrefs
docs: Move binderfs to admin-guide
Disable Sphinx SmartyPants in HTML output
doc: RCU callback locks need only _bh, not necessarily _irq
docs: format kernel-parameters -- as code
Doc : doc-guide : Fix a typo
platform: x86: get rid of a non-existent document
Add the RCU docs to the core-api manual
Documentation: RCU: Add TOC tree hooks
Documentation: RCU: Rename txt files to rst
Documentation: RCU: Convert RCU UP systems to reST
Documentation: RCU: Convert RCU linked list to reST
Documentation: RCU: Convert RCU basic concepts to reST
docs: filesystems: Remove uneeded .rst extension on toctables
scripts/sphinx-pre-install: fix out-of-tree build
docs: zh_CN: submitting-drivers.rst: Remove a duplicated Documentation/
Documentation: PGP: update for newer HW devices
Documentation: Add section about CPU vulnerabilities for Spectre
Documentation: platform: Delete x86-laptop-drivers.txt
docs: Note that :c:func: should no longer be used
...
|
|
New matches for conntrack mark, label, zone, and state.
Signed-off-by: Paul Blakey <paulb@mellanox.com>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Yossi Kuperman <yossiku@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Retreives connection tracking zone, mark, label, and state from
a SKB.
Signed-off-by: Paul Blakey <paulb@mellanox.com>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Allow sending a packet to conntrack module for connection tracking.
The packet will be marked with conntrack connection's state, and
any metadata such as conntrack mark and label. This state metadata
can later be matched against with tc classifers, for example with the
flower classifier as below.
In addition to committing new connections the user can optionally
specific a zone to track within, set a mark/label and configure nat
with an address range and port range.
Usage is as follows:
$ tc qdisc add dev ens1f0_0 ingress
$ tc qdisc add dev ens1f0_1 ingress
$ tc filter add dev ens1f0_0 ingress \
prio 1 chain 0 proto ip \
flower ip_proto tcp ct_state -trk \
action ct zone 2 pipe \
action goto chain 2
$ tc filter add dev ens1f0_0 ingress \
prio 1 chain 2 proto ip \
flower ct_state +trk+new \
action ct zone 2 commit mark 0xbb nat src addr 5.5.5.7 pipe \
action mirred egress redirect dev ens1f0_1
$ tc filter add dev ens1f0_0 ingress \
prio 1 chain 2 proto ip \
flower ct_zone 2 ct_mark 0xbb ct_state +trk+est \
action ct nat pipe \
action mirred egress redirect dev ens1f0_1
$ tc filter add dev ens1f0_1 ingress \
prio 1 chain 0 proto ip \
flower ip_proto tcp ct_state -trk \
action ct zone 2 pipe \
action goto chain 1
$ tc filter add dev ens1f0_1 ingress \
prio 1 chain 1 proto ip \
flower ct_zone 2 ct_mark 0xbb ct_state +trk+est \
action ct nat pipe \
action mirred egress redirect dev ens1f0_0
Signed-off-by: Paul Blakey <paulb@mellanox.com>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Yossi Kuperman <yossiku@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Changelog:
V5->V6:
Added CONFIG_NF_DEFRAG_IPV6 in handle fragments ipv6 case
V4->V5:
Reordered nf_conntrack_put() in tcf_ct_skb_nfct_cached()
V3->V4:
Added strict_start_type for act_ct policy
V2->V3:
Fixed david's comments: Removed extra newline after rcu in tcf_ct_params , and indent of break in act_ct.c
V1->V2:
Fixed parsing of ranges TCA_CT_NAT_IPV6_MAX as 'else' case overwritten ipv4 max
Refactored NAT_PORT_MIN_MAX range handling as well
Added ipv4/ipv6 defragmentation
Removed extra skb pull push of nw offset in exectute nat
Refactored tcf_ct_skb_network_trim after pull
Removed TCA_ACT_CT define
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
In an eswitch, PCI VF may have port which is normally represented using
a representor netdevice.
To have better visibility of eswitch port, its association with VF,
and its representor netdevice, introduce a PCI VF port flavour.
When devlink port flavour is PCI VF, fill up PCI VF attributes of
the port.
Extend port name creation using PCI PF and VF number scheme on best
effort basis, so that vendor drivers can skip defining their own scheme.
$ devlink port show
pci/0000:05:00.0/0: type eth netdev eth0 flavour pcipf pfnum 0
pci/0000:05:00.0/1: type eth netdev eth1 flavour pcivf pfnum 0 vfnum 0
pci/0000:05:00.0/2: type eth netdev eth2 flavour pcivf pfnum 0 vfnum 1
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
In an eswitch, PCI PF may have port which is normally represented
using a representor netdevice.
To have better visibility of eswitch port, its association with
PF and a representor netdevice, introduce a PCI PF port
flavour and port attriute.
When devlink port flavour is PCI PF, fill up PCI PF attributes of the
port.
Extend port name creation using PCI PF number on best effort basis.
So that vendor drivers can skip defining their own scheme.
$ devlink port show
pci/0000:05:00.0/0: type eth netdev eth0 flavour pcipf pfnum 0
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Physical port number and split group fields are applicable only to
physical port flavours such as PHYSICAL, CPU and DSA.
Hence limit returning those values in netlink response to such port
flavours.
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
To support additional devlink port flavours and to support few common
and few different port attributes, move physical port attributes to a
different structure.
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/j.anaszewski/linux-leds
Pull LED updates from Jacek Anaszewski:
- Add a new LED common module for ti-lmu driver family
- Modify MFD ti-lmu bindings
- add ti,brightness-resolution
- add the ramp up/down property
- Add regulator support for LM36274 driver to lm363x-regulator.c
- New LED class drivers with DT bindings:
- leds-spi-byte
- leds-lm36274
- leds-lm3697 (move the support from MFD to LED subsystem)
- Simplify getting the I2C adapter of a client:
- leds-tca6507
- leds-pca955x
- Convert LED documentation to ReST
* tag 'leds-for-5.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/j.anaszewski/linux-leds:
dt: leds-lm36274.txt: fix a broken reference to ti-lmu.txt
docs: leds: convert to ReST
leds: leds-tca6507: simplify getting the adapter of a client
leds: leds-pca955x: simplify getting the adapter of a client
leds: lm36274: Introduce the TI LM36274 LED driver
dt-bindings: leds: Add LED bindings for the LM36274
regulator: lm363x: Add support for LM36274
mfd: ti-lmu: Add LM36274 support to the ti-lmu
dt-bindings: mfd: Add lm36274 bindings to ti-lmu
leds: max77650: Remove set but not used variable 'parent'
leds: avoid flush_work in atomic context
leds: lm3697: Introduce the lm3697 driver
mfd: ti-lmu: Remove support for LM3697
dt-bindings: ti-lmu: Modify dt bindings for the LM3697
leds: TI LMU: Add common code for TI LMU devices
leds: spi-byte: add single byte SPI LED driver
dt-bindings: leds: Add binding for spi-byte LED.
dt-bindings: mfd: LMU: Add ti,brightness-resolution
dt-bindings: mfd: LMU: Add the ramp up/down property
|
|
Adapt and apply changes that were made to the TCP socket connect
code. See the following commits for details on the purpose of
these changes:
Commit 7196dbb02ea0 ("SUNRPC: Allow changing of the TCP timeout parameters on the fly")
Commit 3851f1cdb2b8 ("SUNRPC: Limit the reconnect backoff timer to the max RPC message timeout")
Commit 02910177aede ("SUNRPC: Fix reconnection timeouts")
Some common transport code is moved to xprt.c to satisfy the code
duplication police.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
|
Clean up.
There is only one remaining function, rpcrdma_buffer_put(), that
uses this field. Its caller can supply a pointer to the correct
rpcrdma_buffer, enabling the removal of an 8-byte pointer field
from a frequently-allocated shared data structure.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
|
Clean up.
Move the "not present" case into the individual chunk encoders. This
improves code organization and readability.
The reason for the original organization was to optimize for the
case where there there are no chunks. The optimization turned out to
be inconsequential, so let's err on the side of code readability.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
|
rb_lock is contended between rpcrdma_buffer_create,
rpcrdma_buffer_put, and rpcrdma_post_recvs.
Commit e340c2d6ef2a ("xprtrdma: Reduce the doorbell rate (Receive)")
causes rpcrdma_post_recvs to take the rb_lock repeatedly when it
determines more Receives are needed. Streamline this code path so
it takes the lock just once in most cases to build the Receive
chain that is about to be posted.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
|
Clean up.
Commit 7c8d9e7c8863 ("xprtrdma: Move Receive posting to Receive
handler") reduced the number of rpcrdma_rep_create call sites to
one. After that commit, the backchannel code no longer invokes it.
Therefore the free list logic added by commit d698c4a02ee0
("xprtrdma: Fix backchannel allocation of extra rpcrdma_reps") is
no longer necessary, and in fact adds some extra overhead that we
can do without.
Simply post any newly created reps. They will get added back to
the rb_recv_bufs list when they subsequently complete.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
|
Eliminate a context switch in the path that handles RPC wake-ups
when a Receive completion has to wait for a Send completion.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
|
Since commit ba69cd122ece ("xprtrdma: Remove support for FMR memory
registration"), FRWR is the only supported memory registration mode.
We can take advantage of the asynchronous nature of FRWR's LOCAL_INV
Work Requests to get rid of the completion wait by having the
LOCAL_INV completion handler take care of DMA unmapping MRs and
waking the upper layer RPC waiter.
This eliminates two context switches when local invalidation is
necessary. As a side benefit, we will no longer need the per-xprt
deferred completion work queue.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
|
When a marshal operation fails, any MRs that were already set up for
that request are recycled. Recycling releases MRs and creates new
ones, which is expensive.
Since commit f2877623082b ("xprtrdma: Chain Send to FastReg WRs")
was merged, recycling FRWRs is unnecessary. This is because before
that commit, frwr_map had already posted FAST_REG Work Requests,
so ownership of the MRs had already been passed to the NIC and thus
dealing with them had to be delayed until they completed.
Since that commit, however, FAST_REG WRs are posted at the same time
as the Send WR. This means that if marshaling fails, we are certain
the MRs are safe to simply unmap and place back on the free list
because neither the Send nor the FAST_REG WRs have been posted yet.
The kernel still has ownership of the MRs at this point.
This reduces the total number of MRs that the xprt has to create
under heavy workloads and makes the marshaling logic less brittle.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|