summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2020-11-30NFSD: Replace the internals of the READ_BUF() macroChuck Lever
Convert the READ_BUF macro in nfs4xdr.c from open code to instead use the new xdr_stream-style decoders already in use by the encode side (and by the in-kernel NFS client implementation). Once this conversion is done, each individual NFSv4 argument decoder can be independently cleaned up to replace these macros with C code. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30SUNRPC: Prepare for xdr_stream-style decoding on the server-sideChuck Lever
A "permanent" struct xdr_stream is allocated in struct svc_rqst so that it is usable by all server-side decoders. A per-rqst scratch buffer is also allocated to handle decoding XDR data items that cross page boundaries. To demonstrate how it will be used, add the first call site for the new svcxdr_init_decode() API. As an additional part of the overall conversion, add symbolic constants for successful and failed XDR operations. Returning "0" is overloaded. Sometimes it means something failed, but sometimes it means success. To make it more clear when XDR decoding functions succeed or fail, introduce symbolic constants. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30SUNRPC: Add xdr_set_scratch_page() and xdr_reset_scratch_buffer()Chuck Lever
Clean up: De-duplicate some frequently-used code. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30SUNRPC: Move the svc_xdr_recvfrom() tracepointChuck Lever
Commit c509f15a5801 ("SUNRPC: Split the xdr_buf event class") added display of the rqst's XID to the svc_xdr_buf_class. However, when the recvfrom tracepoint fires, rq_xid has yet to be filled in with the current XID. So it ends up recording the previous XID that was handled by that svc_rqst. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30svcrdma: support multiple Read chunks per RPCChuck Lever
An efficient way to handle multiple Read chunks is to post them all together and then take a single completion. This is also how the code is already structured: when the Read completion fires, all portions of the incoming RPC message are available to be assembled. The difficult problem is setting up the Read sink buffers so that the server pulls the client's data into place, making subsequent pull-up unnecessary. There are several cases: * No Read chunks. No-op. * One data item Read chunk. This is the fast case, where the inline part of the RPC-over-RDMA message becomes the head and tail, and the data item chunk is placed in buf->pages. * A Position-zero Read chunk. Treated like TCP: the Read chunk is pulled into contiguous pages. + A Position-zero Read chunk with data item chunks. Treated like TCP: all of the Read chunks are pulled into contiguous pages. + Multiple data item chunks. Treated like TCP: the inline part is copied and the data item chunks are pulled into contiguous pages. The "*" cases are already supported. This patch adds support for the "+" cases. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30svcrdma: Use the new parsed chunk list when pulling Read chunksChuck Lever
As a pre-requisite for handling multiple Read chunks in each Read list, convert svc_rdma_recv_read_chunk() to use the new parsed Read chunk list. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30svcrdma: Rename info::ri_chunklenChuck Lever
I'm about to change the purpose of ri_chunklen: Instead of tracking the number of bytes in one Read chunk, it will track the total number of bytes in the Read list. Rename it for clarity. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30svcrdma: Clean up chunk tracepointsChuck Lever
We already have trace_svcrdma_decode_rseg(), which records each ingress Read segment. Instead of reporting those again when they are about to be posted as RDMA Reads, let's fire one tracepoint before posting each type of chunk. So we'll get: nfsd-1998 [002] 321.666615: svcrdma_decode_rseg: cq.id=4 cid=42 segno=0 position=0 192@0x013ca9ebfae14000:0xb0010b05 nfsd-1998 [002] 321.666615: svcrdma_decode_rseg: cq.id=4 cid=42 segno=1 position=0 7688@0x013ca9ebf914e000:0xb0010a05 nfsd-1998 [002] 321.666615: svcrdma_decode_rseg: cq.id=4 cid=42 segno=2 position=0 28@0x013ca9ebfae15000:0xb0010905 nfsd-1998 [002] 321.666622: svcrdma_decode_rqst: cq.id=4 cid=42 xid=0x013ca9eb vers=1 credits=128 proc=RDMA_NOMSG hdrlen=100 nfsd-1998 [002] 321.666642: svcrdma_post_read_chunk: cq.id=3 cid=112 sqecount=3 kworker/2:1H-221 [002] 321.673949: svcrdma_wc_read: cq.id=3 cid=112 status=SUCCESS (0/0x0) Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30svcrdma: Remove chunk list pointersChuck Lever
Clean up: These pointers are no longer used. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30svcrdma: Support multiple Write chunks in svc_rdma_send_reply_chunkChuck Lever
Refactor svc_rdma_send_reply_chunk() so that it Sends only the parts of rq_res that do not contain a result payload. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30svcrdma: Support multiple Write chunks in svc_rdma_map_reply_msg()Chuck Lever
Refactor: svc_rdma_map_reply_msg() is restructured to DMA map only the parts of rq_res that do not contain a result payload. This change has been tested to confirm that it does not cause a regression in the no Write chunk and single Write chunk cases. Multiple Write chunks have not been tested. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30svcrdma: Support multiple write chunks when pulling upChuck Lever
When counting the number of SGEs needed to construct a Send request, do not count result payloads. And, when copying the Reply message into the pull-up buffer, result payloads are not to be copied to the Send buffer. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30svcrdma: Use parsed chunk lists to encode Reply transport headersChuck Lever
Refactor: Instead of re-parsing the ingress RPC Call transport header when constructing the egress RPC Reply transport header, use the new parsed Write list and Reply chunk, which are version- agnostic and already XDR decoded. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30svcrdma: Use parsed chunk lists to construct RDMA WritesChuck Lever
Refactor: Instead of re-parsing the ingress RPC Call transport header when constructing RDMA Writes, use the new parsed chunk lists for the Write list and Reply chunk, which are version-agnostic and already XDR-decoded. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30svcrdma: Use parsed chunk lists to detect reverse direction repliesChuck Lever
Refactor: Don't duplicate header decoding smarts here. Instead, use the new parsed chunk lists. Note that the XID sanity test is also removed. The XID is already looked up by the cb handler, and is rejected if it's not recognized. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30svcrdma: Use parsed chunk lists to derive the inv_rkeyChuck Lever
Refactor: Don't duplicate header decoding smarts here. Instead, use the new parsed chunk lists. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30svcrdma: Add a "parsed chunk list" data structureChuck Lever
This simple data structure binds the location of each data payload inside of an RPC message to the chunk that will be used to push it to or pull it from the client. There are several benefits to this small additional overhead: * It enables support for more than one chunk in incoming Read and Write lists. * It translates the version-specific on-the-wire format into a generic in-memory structure, enabling support for multiple versions of the RPC/RDMA transport protocol. * It enables the server to re-organize a chunk list if it needs to adjust where Read chunk data lands in server memory without altering the contents of the XDR-encoded Receive buffer. Construction of these lists is done while sanity checking each incoming RPC/RDMA header. Subsequent patches will make use of the generated data structures. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30svcrdma: Clean up svc_rdma_encode_reply_chunk()Chuck Lever
Refactor: Match the control flow of svc_rdma_encode_write_list(). Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30svcrdma: Post RDMA Writes while XDR encoding repliesChuck Lever
The only RPC/RDMA ordering requirement between RDMA Writes and RDMA Sends is that the responder must post the Writes on the Send queue before posting the Send that conveys the RPC Reply for that Write payload. The Linux NFS server implementation now has a transport method that can post result Payload Writes earlier than svc_rdma_sendto: ->xpo_result_payload() This gets RDMA Writes going earlier so they are more likely to be complete at the remote end before the Send completes. Some care must be taken with pulled-up Replies. We don't want to push the Write chunk and then send the same payload data via Send. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30NFSD: Invoke svc_encode_result_payload() in "read" NFSD encodersChuck Lever
Have the NFSD encoders annotate the boundaries of every direct-data-placement eligible result data payload. Then change svcrdma to use that annotation instead of the xdr->page_len when handling Write chunks. For NFSv4 on RDMA, that enables the ability to recognize multiple result payloads per compound. This is a pre-requisite for supporting multiple Write chunks per RPC transaction. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30SUNRPC: Rename svc_encode_read_payload()Chuck Lever
Clean up: "result payload" is a less confusing name for these payloads. "READ payload" reflects only the NFS usage. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30svcrdma: Refactor the RDMA Write pathChuck Lever
Refactor for subsequent changes. Constify the xdr_buf argument to ensure the code here does not modify it, and to enable callers to pass in a "const struct xdr_buf *". At the same time, rename the helper functions, which emit RDMA Writes, not RDMA Sends, and add documenting comments. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30svcrdma: Const-ify the xdr_buf argumentsChuck Lever
Clean up: Ensure the code in rw.c does not modify the argument, and enable callers to also use "const struct xdr_buf *". Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30SUNRPC: Adjust synopsis of xdr_buf_subsegment()Chuck Lever
Clean up: This enables xdr_buf_subsegment()'s callers to pass in a const pointer to that buffer. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30svcrdma: Catch another Reply chunk overflow caseChuck Lever
When space in the Reply chunk runs out in the middle of a segment, we end up passing a zero-length SGL to rdma_rw_ctx_init(), and it oopses. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-27Merge tag 'linux-can-fixes-for-5.10-20201127' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can Marc Kleine-Budde says: ==================== pull-request: can 2020-11-27 The first patch is by me and target the gs_usb driver and fixes the endianess problem with candleLight firmware. Another patch by me for the mcp251xfd driver add sanity checking to bail out if no IRQ is configured. The next three patches target the m_can driver. A patch by me removes the hardcoded IRQF_TRIGGER_FALLING from the request_threaded_irq() as this clashes with the trigger level specified in the DT. Further a patch by me fixes the nominal bitiming tseg2 min value for modern m_can cores. Pankaj Sharma's patch add support for cores version 3.3.x. The last patch by Oliver Hartkopp is for af_can and converts a WARN() into a pr_warn(), which is triggered by the syzkaller. It was able to create a situation where the closing of a socket runs simultaneously to the notifier call chain for removing the CAN network device in use. * tag 'linux-can-fixes-for-5.10-20201127' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can: can: af_can: can_rx_unregister(): remove WARN() statement from list operation sanity check can: m_can: m_can_dev_setup(): add support for bosch mcan version 3.3.0 can: m_can: fix nominal bitiming tseg2 min for version >= 3.1 can: m_can: m_can_open(): remove IRQF_TRIGGER_FALLING from request_threaded_irq()'s flags can: mcp251xfd: mcp251xfd_probe(): bail out if no IRQ was given can: gs_usb: fix endianess problem with candleLight firmware ==================== Link: https://lore.kernel.org/r/20201127100301.512603-1-mkl@pengutronix.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-27sock: set sk_err to ee_errno on dequeue from errqWillem de Bruijn
When setting sk_err, set it to ee_errno, not ee_origin. Commit f5f99309fa74 ("sock: do not set sk_err in sock_dequeue_err_skb") disabled updating sk_err on errq dequeue, which is correct for most error types (origins): - sk->sk_err = err; Commit 38b257938ac6 ("sock: reset sk_err when the error queue is empty") reenabled the behavior for IMCP origins, which do require it: + if (icmp_next) + sk->sk_err = SKB_EXT_ERR(skb_next)->ee.ee_origin; But read from ee_errno. Fixes: 38b257938ac6 ("sock: reset sk_err when the error queue is empty") Reported-by: Ayush Ranjan <ayushranjan@google.com> Signed-off-by: Willem de Bruijn <willemb@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Link: https://lore.kernel.org/r/20201126151220.2819322-1-willemdebruijn.kernel@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-27mptcp: fix NULL ptr dereference on bad MPJPaolo Abeni
If an msk listener receives an MPJ carrying an invalid token, it will zero the request socket msk entry. That should later cause fallback and subflow reset - as per RFC - at subflow_syn_recv_sock() time due to failing hmac validation. Since commit 4cf8b7e48a09 ("subflow: introduce and use mptcp_can_accept_new_subflow()"), we unconditionally dereference - in mptcp_can_accept_new_subflow - the subflow request msk before performing hmac validation. In the above scenario we hit a NULL ptr dereference. Address the issue doing the hmac validation earlier. Fixes: 4cf8b7e48a09 ("subflow: introduce and use mptcp_can_accept_new_subflow()") Tested-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Link: https://lore.kernel.org/r/03b2cfa3ac80d8fc18272edc6442a9ddf0b1e34e.1606400227.git.pabeni@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-27net: openvswitch: fix TTL decrement action netlink message formatEelco Chaudron
Currently, the openvswitch module is not accepting the correctly formated netlink message for the TTL decrement action. For both setting and getting the dec_ttl action, the actions should be nested in the OVS_DEC_TTL_ATTR_ACTION attribute as mentioned in the openvswitch.h uapi. When the original patch was sent, it was tested with a private OVS userspace implementation. This implementation was unfortunately not upstreamed and reviewed, hence an erroneous version of this patch was sent out. Leaving the patch as-is would cause problems as the kernel module could interpret additional attributes as actions and vice-versa, due to the actions not being encapsulated/nested within the actual attribute, but being concatinated after it. Fixes: 744676e77720 ("openvswitch: add TTL decrement action") Signed-off-by: Eelco Chaudron <echaudro@redhat.com> Link: https://lore.kernel.org/r/160622121495.27296.888010441924340582.stgit@wsfd-netdev64.ntdv.lab.eng.bos.redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-27can: af_can: can_rx_unregister(): remove WARN() statement from list ↵Oliver Hartkopp
operation sanity check To detect potential bugs in CAN protocol implementations (double removal of receiver entries) a WARN() statement has been used if no matching list item was found for removal. The fault injection issued by syzkaller was able to create a situation where the closing of a socket runs simultaneously to the notifier call chain for removing the CAN network device in use. This case is very unlikely in real life but it doesn't break anything. Therefore we just replace the WARN() statement with pr_warn() to preserve the notification for the CAN protocol development. Reported-by: syzbot+381d06e0c8eaacb8706f@syzkaller.appspotmail.com Reported-by: syzbot+d0ddd88c9a7432f041e6@syzkaller.appspotmail.com Reported-by: syzbot+76d62d3b8162883c7d11@syzkaller.appspotmail.com Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Link: https://lore.kernel.org/r/20201126192140.14350-1-socketcan@hartkopp.net Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2020-11-25net/tls: Protect from calling tls_dev_del for TLS RX twiceMaxim Mikityanskiy
tls_device_offload_cleanup_rx doesn't clear tls_ctx->netdev after calling tls_dev_del if TLX TX offload is also enabled. Clearing tls_ctx->netdev gets postponed until tls_device_gc_task. It leaves a time frame when tls_device_down may get called and call tls_dev_del for RX one extra time, confusing the driver, which may lead to a crash. This patch corrects this racy behavior by adding a flag to prevent tls_device_down from calling tls_dev_del the second time. Fixes: e8f69799810c ("net/tls: Add generic NIC offload infrastructure") Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Link: https://lore.kernel.org/r/20201125221810.69870-1-saeedm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-25devlink: Make sure devlink instance and port are in same net namespaceParav Pandit
When devlink reload operation is not used, netdev of an Ethernet port may be present in different net namespace than the net namespace of the devlink instance. Ensure that both the devlink instance and devlink port netdev are located in same net namespace. Fixes: 070c63f20f6c ("net: devlink: allow to change namespaces during reload") Signed-off-by: Parav Pandit <parav@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-25devlink: Hold rtnl lock while reading netdev attributesParav Pandit
A netdevice of a devlink port can be moved to different net namespace than its parent devlink instance. This scenario occurs when devlink reload is not used. When netdevice is undergoing migration to net namespace, its ifindex and name may change. In such use case, devlink port query may read stale netdev attributes. Fix it by reading them under rtnl lock. Fixes: bfcd3a466172 ("Introduce devlink infrastructure") Signed-off-by: Parav Pandit <parav@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-25gro_cells: reduce number of synchronize_net() callsEric Dumazet
After cited commit, gro_cells_destroy() became damn slow on hosts with a lot of cores. This is because we have one additional synchronize_net() per cpu as stated in the changelog. gro_cells_init() is setting NAPI_STATE_NO_BUSY_POLL, and this was enough to not have one synchronize_net() call per netif_napi_del() We can factorize all the synchronize_net() to a single one, right before freeing per-cpu memory. Fixes: 5198d545dba8 ("net: remove napi_hash_del() from driver-facing API") Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20201124203822.1360107-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-25ipv6: addrlabel: fix possible memory leak in ip6addrlbl_net_initWang Hai
kmemleak report a memory leak as follows: unreferenced object 0xffff8880059c6a00 (size 64): comm "ip", pid 23696, jiffies 4296590183 (age 1755.384s) hex dump (first 32 bytes): 20 01 00 10 00 00 00 00 00 00 00 00 00 00 00 00 ............... 1c 00 00 00 00 00 00 00 00 00 00 00 07 00 00 00 ................ backtrace: [<00000000aa4e7a87>] ip6addrlbl_add+0x90/0xbb0 [<0000000070b8d7f1>] ip6addrlbl_net_init+0x109/0x170 [<000000006a9ca9d4>] ops_init+0xa8/0x3c0 [<000000002da57bf2>] setup_net+0x2de/0x7e0 [<000000004e52d573>] copy_net_ns+0x27d/0x530 [<00000000b07ae2b4>] create_new_namespaces+0x382/0xa30 [<000000003b76d36f>] unshare_nsproxy_namespaces+0xa1/0x1d0 [<0000000030653721>] ksys_unshare+0x3a4/0x780 [<0000000007e82e40>] __x64_sys_unshare+0x2d/0x40 [<0000000031a10c08>] do_syscall_64+0x33/0x40 [<0000000099df30e7>] entry_SYSCALL_64_after_hwframe+0x44/0xa9 We should free all rules when we catch an error in ip6addrlbl_net_init(). otherwise a memory leak will occur. Fixes: 2a8cc6c89039 ("[IPV6] ADDRCONF: Support RFC3484 configurable address selection policy table.") Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Wang Hai <wanghai38@huawei.com> Link: https://lore.kernel.org/r/20201124071728.8385-1-wanghai38@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-24Merge tag 'batadv-net-pullrequest-20201124' of ↵Jakub Kicinski
git://git.open-mesh.org/linux-merge Simon Wunderlich says: ==================== Here is a batman-adv bugfix: - set module owner to THIS_MODULE, by Taehee Yoo * tag 'batadv-net-pullrequest-20201124' of git://git.open-mesh.org/linux-merge: batman-adv: set .owner to THIS_MODULE ==================== Link: https://lore.kernel.org/r/20201124134417.17269-1-sw@simonwunderlich.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-24tcp: Set ECT0 bit in tos/tclass for synack when BPF needs ECNAlexander Duyck
When a BPF program is used to select between a type of TCP congestion control algorithm that uses either ECN or not there is a case where the synack for the frame was coming up without the ECT0 bit set. A bit of research found that this was due to the final socket being configured to dctcp while the listener socket was staying in cubic. To reproduce it all that is needed is to monitor TCP traffic while running the sample bpf program "samples/bpf/tcp_cong_kern.c". What is observed, assuming tcp_dctcp module is loaded or compiled in and the traffic matches the rules in the sample file, is that for all frames with the exception of the synack the ECT0 bit is set. To address that it is necessary to make one additional call to tcp_bpf_ca_needs_ecn using the request socket and then use the output of that to set the ECT0 bit for the tos/tclass of the packet. Fixes: 91b5b21c7c16 ("bpf: Add support for changing congestion control") Signed-off-by: Alexander Duyck <alexanderduyck@fb.com> Link: https://lore.kernel.org/r/160593039663.2604.1374502006916871573.stgit@localhost.localdomain Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-24devlink: Fix reload stats structureMoshe Shemesh
Fix reload stats structure exposed to the user. Change stats structure hierarchy to have the reload action as a parent of the stat entry and then stat entry includes value per limit. This will also help to avoid string concatenation on iproute2 output. Reload stats structure before this fix: "stats": { "reload": { "driver_reinit": 2, "fw_activate": 1, "fw_activate_no_reset": 0 } } After this fix: "stats": { "reload": { "driver_reinit": { "unspecified": 2 }, "fw_activate": { "unspecified": 1, "no_reset": 0 } } Fixes: a254c264267e ("devlink: Add reload stats") Signed-off-by: Moshe Shemesh <moshe@mellanox.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/1606109785-25197-1-git-send-email-moshe@mellanox.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-23net/packet: fix packet receive on L3 devices without visible hard headerEyal Birger
In the patchset merged by commit b9fcf0a0d826 ("Merge branch 'support-AF_PACKET-for-layer-3-devices'") L3 devices which did not have header_ops were given one for the purpose of protocol parsing on af_packet transmit path. That change made af_packet receive path regard these devices as having a visible L3 header and therefore aligned incoming skb->data to point to the skb's mac_header. Some devices, such as ipip, xfrmi, and others, do not reset their mac_header prior to ingress and therefore their incoming packets became malformed. Ideally these devices would reset their mac headers, or af_packet would be able to rely on dev->hard_header_len being 0 for such cases, but it seems this is not the case. Fix by changing af_packet RX ll visibility criteria to include the existence of a '.create()' header operation, which is used when creating a device hard header - via dev_hard_header() - by upper layers, and does not exist in these L3 devices. As this predicate may be useful in other situations, add it as a common dev_has_header() helper in netdevice.h. Fixes: b9fcf0a0d826 ("Merge branch 'support-AF_PACKET-for-layer-3-devices'") Signed-off-by: Eyal Birger <eyal.birger@gmail.com> Acked-by: Jason A. Donenfeld <Jason@zx2c4.com> Acked-by: Willem de Bruijn <willemb@google.com> Link: https://lore.kernel.org/r/20201121062817.3178900-1-eyal.birger@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-23vsock/virtio: discard packets only when socket is really closedStefano Garzarella
Starting from commit 8692cefc433f ("virtio_vsock: Fix race condition in virtio_transport_recv_pkt"), we discard packets in virtio_transport_recv_pkt() if the socket has been released. When the socket is connected, we schedule a delayed work to wait the RST packet from the other peer, also if SHUTDOWN_MASK is set in sk->sk_shutdown. This is done to complete the virtio-vsock shutdown algorithm, releasing the port assigned to the socket definitively only when the other peer has consumed all the packets. If we discard the RST packet received, the socket will be closed only when the VSOCK_CLOSE_TIMEOUT is reached. Sergio discovered the issue while running ab(1) HTTP benchmark using libkrun [1] and observing a latency increase with that commit. To avoid this issue, we discard packet only if the socket is really closed (SOCK_DONE flag is set). We also set SOCK_DONE in virtio_transport_release() when we don't need to wait any packets from the other peer (we didn't schedule the delayed work). In this case we remove the socket from the vsock lists, releasing the port assigned. [1] https://github.com/containers/libkrun Fixes: 8692cefc433f ("virtio_vsock: Fix race condition in virtio_transport_recv_pkt") Cc: justin.he@arm.com Reported-by: Sergio Lopez <slp@redhat.com> Tested-by: Sergio Lopez <slp@redhat.com> Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Acked-by: Jia He <justin.he@arm.com> Link: https://lore.kernel.org/r/20201120104736.73749-1-sgarzare@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-23tcp: fix race condition when creating child sockets from syncookiesRicardo Dias
When the TCP stack is in SYN flood mode, the server child socket is created from the SYN cookie received in a TCP packet with the ACK flag set. The child socket is created when the server receives the first TCP packet with a valid SYN cookie from the client. Usually, this packet corresponds to the final step of the TCP 3-way handshake, the ACK packet. But is also possible to receive a valid SYN cookie from the first TCP data packet sent by the client, and thus create a child socket from that SYN cookie. Since a client socket is ready to send data as soon as it receives the SYN+ACK packet from the server, the client can send the ACK packet (sent by the TCP stack code), and the first data packet (sent by the userspace program) almost at the same time, and thus the server will equally receive the two TCP packets with valid SYN cookies almost at the same instant. When such event happens, the TCP stack code has a race condition that occurs between the momement a lookup is done to the established connections hashtable to check for the existence of a connection for the same client, and the moment that the child socket is added to the established connections hashtable. As a consequence, this race condition can lead to a situation where we add two child sockets to the established connections hashtable and deliver two sockets to the userspace program to the same client. This patch fixes the race condition by checking if an existing child socket exists for the same client when we are adding the second child socket to the established connections socket. If an existing child socket exists, we drop the packet and discard the second child socket to the same client. Signed-off-by: Ricardo Dias <rdias@singlestore.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20201120111133.GA67501@rdias-suse-pc.lan Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-21net/af_iucv: set correct sk_protocol for child socketsJulian Wiedmann
Child sockets erroneously inherit their parent's sk_type (ie. SOCK_*), instead of the PF_IUCV protocol that the parent was created with in iucv_sock_create(). We're currently not using sk->sk_protocol ourselves, so this shouldn't have much impact (except eg. getting the output in skb_dump() right). Fixes: eac3731bd04c ("[S390]: Add AF_IUCV socket support") Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com> Link: https://lore.kernel.org/r/20201120100657.34407-1-jwi@linux.ibm.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-20tcp: Set INET_ECN_xmit configuration in tcp_reinit_congestion_controlAlexander Duyck
When setting congestion control via a BPF program it is seen that the SYN/ACK for packets within a given flow will not include the ECT0 flag. A bit of simple printk debugging shows that when this is configured without BPF we will see the value INET_ECN_xmit value initialized in tcp_assign_congestion_control however when we configure this via BPF the socket is in the closed state and as such it isn't configured, and I do not see it being initialized when we transition the socket into the listen state. The result of this is that the ECT0 bit is configured based on whatever the default state is for the socket. Any easy way to reproduce this is to monitor the following with tcpdump: tools/testing/selftests/bpf/test_progs -t bpf_tcp_ca Without this patch the SYN/ACK will follow whatever the default is. If dctcp all SYN/ACK packets will have the ECT0 bit set, and if it is not then ECT0 will be cleared on all SYN/ACK packets. With this patch applied the SYN/ACK bit matches the value seen on the other packets in the given stream. Fixes: 91b5b21c7c16 ("bpf: Add support for changing congestion control") Signed-off-by: Alexander Duyck <alexanderduyck@fb.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-20tcp: Allow full IP tos/IPv6 tclass to be reflected in L3 headerAlexander Duyck
An issue was recently found where DCTCP SYN/ACK packets did not have the ECT bit set in the L3 header. A bit of code review found that the recent change referenced below had gone though and added a mask that prevented the ECN bits from being populated in the L3 header. This patch addresses that by rolling back the mask so that it is only applied to the flags coming from the incoming TCP request instead of applying it to the socket tos/tclass field. Doing this the ECT bits were restored in the SYN/ACK packets in my testing. One thing that is not addressed by this patch set is the fact that tcp_reflect_tos appears to be incompatible with ECN based congestion avoidance algorithms. At a minimum the feature should likely be documented which it currently isn't. Fixes: ac8f1710c12b ("tcp: reflect tos value received in SYN to the socket") Signed-off-by: Alexander Duyck <alexanderduyck@fb.com> Acked-by: Wei Wang <weiwan@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-20net/tls: missing received data after fast remote closeVadim Fedorenko
In case when tcp socket received FIN after some data and the parser haven't started before reading data caller will receive an empty buffer. This behavior differs from plain TCP socket and leads to special treating in user-space. The flow that triggers the race is simple. Server sends small amount of data right after the connection is configured to use TLS and closes the connection. In this case receiver sees TLS Handshake data, configures TLS socket right after Change Cipher Spec record. While the configuration is in process, TCP socket receives small Application Data record, Encrypted Alert record and FIN packet. So the TCP socket changes sk_shutdown to RCV_SHUTDOWN and sk_flag with SK_DONE bit set. The received data is not parsed upon arrival and is never sent to user-space. Patch unpauses parser directly if we have unparsed data in tcp receive queue. Fixes: fcf4793e278e ("tls: check RCV_SHUTDOWN in tls_wait_data") Signed-off-by: Vadim Fedorenko <vfedorenko@novek.ru> Link: https://lore.kernel.org/r/1605801588-12236-1-git-send-email-vfedorenko@novek.ru Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-20rose: Fix Null pointer dereference in rose_send_frame()Anmol Karn
rose_send_frame() dereferences `neigh->dev` when called from rose_transmit_clear_request(), and the first occurrence of the `neigh` is in rose_loopback_timer() as `rose_loopback_neigh`, and it is initialized in rose_add_loopback_neigh() as NULL. i.e when `rose_loopback_neigh` used in rose_loopback_timer() its `->dev` was still NULL and rose_loopback_timer() was calling rose_rx_call_request() without checking for NULL. - net/rose/rose_link.c This bug seems to get triggered in this line: rose_call = (ax25_address *)neigh->dev->dev_addr; Fix it by adding NULL checking for `rose_loopback_neigh->dev` in rose_loopback_timer(). Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Suggested-by: Jakub Kicinski <kuba@kernel.org> Reported-by: syzbot+a1c743815982d9496393@syzkaller.appspotmail.com Tested-by: syzbot+a1c743815982d9496393@syzkaller.appspotmail.com Link: https://syzkaller.appspot.com/bug?id=9d2a7ca8c7f2e4b682c97578dfa3f236258300b3 Signed-off-by: Anmol Karn <anmol.karan123@gmail.com> Link: https://lore.kernel.org/r/20201119191043.28813-1-anmol.karan123@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-19Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfJakub Kicinski
Alexei Starovoitov says: ==================== 1) libbpf should not attempt to load unused subprogs, from Andrii. 2) Make strncpy_from_user() mask out bytes after NUL terminator, from Daniel. 3) Relax return code check for subprograms in the BPF verifier, from Dmitrii. 4) Fix several sockmap issues, from John. * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: fail_function: Remove a redundant mutex unlock selftest/bpf: Test bpf_probe_read_user_str() strips trailing bytes after NUL lib/strncpy_from_user.c: Mask out bytes after NUL terminator. libbpf: Fix VERSIONED_SYM_COUNT number parsing bpf, sockmap: Avoid failures from skb_to_sgvec when skb has frag_list bpf, sockmap: Handle memory acct if skb_verdict prog redirects to self bpf, sockmap: Avoid returning unneeded EAGAIN when redirecting to self bpf, sockmap: Use truesize with sk_rmem_schedule() bpf, sockmap: Ensure SO_RCVBUF memory is observed on ingress redirect bpf, sockmap: Fix partial copy_page_to_iter so progress can still be made selftests/bpf: Fix error return code in run_getsockopt_test() bpf: Relax return code check for subprograms tools, bpftool: Add missing close before bpftool net attach exit MAINTAINERS/bpf: Update Andrii's entry. selftests/bpf: Fix unused attribute usage in subprogs_unused test bpf: Fix unsigned 'datasec_id' compared with zero in check_pseudo_btf_id bpf: Fix passing zero to PTR_ERR() in bpf_btf_printf_prepare libbpf: Don't attempt to load unused subprog as an entry-point BPF program ==================== Link: https://lore.kernel.org/r/20201119200721.288-1-alexei.starovoitov@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-19net/smc: fix direct access to ib_gid_addr->ndev in smc_ib_determine_gid()Karsten Graul
Sparse complaints 3 times about: net/smc/smc_ib.c:203:52: warning: incorrect type in argument 1 (different address spaces) net/smc/smc_ib.c:203:52: expected struct net_device const *dev net/smc/smc_ib.c:203:52: got struct net_device [noderef] __rcu *const ndev Fix that by using the existing and validated ndev variable instead of accessing attr->ndev directly. Fixes: 5102eca9039b ("net/smc: Use rdma_read_gid_l2_fields to L2 fields") Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-19net/smc: fix matching of existing link groupsKarsten Graul
With the multi-subnet support of SMC-Dv2 the match for existing link groups should not include the vlanid of the network device. Set ini->smcd_version accordingly before the call to smc_conn_create() and use this value in smc_conn_create() to skip the vlanid check. Fixes: 5c21c4ccafe8 ("net/smc: determine accepted ISM devices") Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-19ipv6: Remove dependency of ipv6_frag_thdr_truncated on ipv6 moduleGeorg Kohmann
IPV6=m NF_DEFRAG_IPV6=y ld: net/ipv6/netfilter/nf_conntrack_reasm.o: in function `nf_ct_frag6_gather': net/ipv6/netfilter/nf_conntrack_reasm.c:462: undefined reference to `ipv6_frag_thdr_truncated' Netfilter is depending on ipv6 symbol ipv6_frag_thdr_truncated. This dependency is forcing IPV6=y. Remove this dependency by moving ipv6_frag_thdr_truncated out of ipv6. This is the same solution as used with a similar issues: Referring to commit 70b095c843266 ("ipv6: remove dependency of nf_defrag_ipv6 on ipv6 module") Fixes: 9d9e937b1c8b ("ipv6/netfilter: Discard first fragment not including all headers") Reported-by: Randy Dunlap <rdunlap@infradead.org> Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Georg Kohmann <geokohma@cisco.com> Acked-by: Pablo Neira Ayuso <pablo@netfilter.org> Acked-by: Randy Dunlap <rdunlap@infradead.org> # build-tested Link: https://lore.kernel.org/r/20201119095833.8409-1-geokohma@cisco.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>