summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2018-04-02net: socket: add __sys_listen() helper; remove in-kernel call to syscallDominik Brodowski
Using the net-internal helper __sys_listen() allows us to avoid the internal calls to the sys_listen() syscall. This patch is part of a series which removes in-kernel calls to syscalls. On this basis, the syscall entry path can be streamlined. For details, see http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net Cc: David S. Miller <davem@davemloft.net> Cc: netdev@vger.kernel.org Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
2018-04-02net: socket: add __sys_connect() helper; remove in-kernel call to syscallDominik Brodowski
Using the net-internal helper __sys_connect() allows us to avoid the internal calls to the sys_connect() syscall. This patch is part of a series which removes in-kernel calls to syscalls. On this basis, the syscall entry path can be streamlined. For details, see http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net Cc: David S. Miller <davem@davemloft.net> Cc: netdev@vger.kernel.org Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
2018-04-02net: socket: add __sys_bind() helper; remove in-kernel call to syscallDominik Brodowski
Using the net-internal helper __sys_bind() allows us to avoid the internal calls to the sys_bind() syscall. This patch is part of a series which removes in-kernel calls to syscalls. On this basis, the syscall entry path can be streamlined. For details, see http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net Cc: David S. Miller <davem@davemloft.net> Cc: netdev@vger.kernel.org Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
2018-04-02net: socket: add __sys_socket() helper; remove in-kernel call to syscallDominik Brodowski
Using the net-internal helper __sys_socket() allows us to avoid the internal calls to the sys_socket() syscall. This patch is part of a series which removes in-kernel calls to syscalls. On this basis, the syscall entry path can be streamlined. For details, see http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net Cc: David S. Miller <davem@davemloft.net> Cc: netdev@vger.kernel.org Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
2018-04-02net: socket: add __sys_accept4() helper; remove in-kernel call to syscallDominik Brodowski
Using the net-internal helper __sys_accept4() allows us to avoid the internal calls to the sys_accept4() syscall. This patch is part of a series which removes in-kernel calls to syscalls. On this basis, the syscall entry path can be streamlined. For details, see http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net Cc: David S. Miller <davem@davemloft.net> Cc: netdev@vger.kernel.org Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
2018-04-02net: socket: add __sys_sendto() helper; remove in-kernel call to syscallDominik Brodowski
Using the net-internal helper __sys_sendto() allows us to avoid the internal calls to the sys_sendto() syscall. This patch is part of a series which removes in-kernel calls to syscalls. On this basis, the syscall entry path can be streamlined. For details, see http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net Cc: David S. Miller <davem@davemloft.net> Cc: netdev@vger.kernel.org Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
2018-04-02net: socket: add __sys_recvfrom() helper; remove in-kernel call to syscallDominik Brodowski
Using the net-internal helper __sys_recvfrom() allows us to avoid the internal calls to the sys_recvfrom() syscall. This patch is part of a series which removes in-kernel calls to syscalls. On this basis, the syscall entry path can be streamlined. For details, see http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net Cc: David S. Miller <davem@davemloft.net> Cc: netdev@vger.kernel.org Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
2018-04-02ipv6: frags: fix /proc/sys/net/ipv6/ip6frag_low_threshEric Dumazet
I forgot to change ip6frag_low_thresh proc_handler from proc_dointvec_minmax to proc_doulongvec_minmax Fixes: 3e67f106f619 ("inet: frags: break the 2GB limit for frags storage") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Maciej Żenczykowski <maze@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-02ceph: quota: add initial infrastructure to support cephfs quotasLuis Henriques
This patch adds the infrastructure required to support cephfs quotas as it is currently implemented in the ceph fuse client. Cephfs quotas can be set on any directory, and can restrict the number of bytes or the number of files stored beneath that point in the directory hierarchy. Quotas are set using the extended attributes 'ceph.quota.max_files' and 'ceph.quota.max_bytes', and can be removed by setting these attributes to '0'. Link: http://tracker.ceph.com/issues/22372 Signed-off-by: Luis Henriques <lhenriques@suse.com> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-04-02libceph, ceph: add __init attribution to init funcitonsChengguang Xu
Add __init attribution to the functions which are called only once during initiating/registering operations and deleting unnecessary symbol exports. Signed-off-by: Chengguang Xu <cgxu519@gmx.com> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-04-02libceph: adding missing message types to ceph_msg_type_name()Chengguang Xu
Some of message types are missing in ceph_msg_type_name(), so just adding them for better understanding of output information. Signed-off-by: Chengguang Xu <cgxu519@icloud.com> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-04-02libceph: fix misjudgement of maximum monitor numberChengguang Xu
num_mon should allow up to CEPH_MAX_MON in ceph_monmap_decode(). Signed-off-by: Chengguang Xu <cgxu519@icloud.com> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-04-02libceph, ceph: change permission for readonly debugfs entriesChengguang Xu
Remove write permission for debugfs entries which only have readonly function. Signed-off-by: Chengguang Xu <cgxu519@icloud.com> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-04-02ceph: add newline to end of debug message formatChengguang Xu
Some of dout format do not include newline in the end, fix for the files which are in fs/ceph and net/ceph directories, and changing printk to dout for printing debug info in super.c Signed-off-by: Chengguang Xu <cgxu519@icloud.com> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-04-02libceph, ceph: move ceph_calc_file_object_mapping() to striper.cIlya Dryomov
ceph_calc_file_object_mapping() has nothing to do with osdmaps. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-04-02libceph: striping framework implementationIlya Dryomov
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-04-02libceph: handle zero-length data itemsIlya Dryomov
rbd needs this for null copyups -- if copyup data is all zeroes, we want to save some I/O and network bandwidth. See rbd_obj_issue_copyup() in the next commit. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Alex Elder <elder@linaro.org>
2018-04-02libceph: introduce BVECS data typeIlya Dryomov
In preparation for rbd "fancy" striping, introduce ceph_bvec_iter for working with bio_vec array data buffers. The wrappers are trivial, but make it look similar to ceph_bio_iter. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-04-02libceph, rbd: new bio handling code (aka don't clone bios)Ilya Dryomov
The reason we clone bios is to be able to give each object request (and consequently each ceph_osd_data/ceph_msg_data item) its own pointer to a (list of) bio(s). The messenger then initializes its cursor with cloned bio's ->bi_iter, so it knows where to start reading from/writing to. That's all the cloned bios are used for: to determine each object request's starting position in the provided data buffer. Introduce ceph_bio_iter to do exactly that -- store position within bio list (i.e. pointer to bio) + position within that bio (i.e. bvec_iter). Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-04-02libceph, ceph: change ceph_calc_file_object_mapping() signatureIlya Dryomov
- make it void - xlen (object extent length) out parameter should be u32 because only a single stripe unit is mapped at a time Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Alex Elder <elder@linaro.org>
2018-04-02libceph: eliminate overflows in ceph_calc_file_object_mapping()Ilya Dryomov
bl, stripeno and objsetno should be u64 -- otherwise large enough files get corrupted. How large depends on file layout: - 4M-objects layout (default): any file over 16P - 64K-objects layout (smallest possible object size): any file over 512T Only CephFS is affected, rbd doesn't use ceph_calc_file_object_mapping() yet. Fortunately, CephFS has a max_file_size configurable, the default for which is way below both of the above numbers. Reimplement the logic from scratch with no layout validation -- it's done on the MDS side. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Alex Elder <elder@linaro.org>
2018-04-01route: check sysctl_fib_multipath_use_neigh earlier than hashXin Long
Prior to this patch, when one packet is hashed into path [1] (hash <= nh_upper_bound) and it's neigh is dead, it will try path [2]. However, if path [2]'s neigh is alive but it's hash > nh_upper_bound, it will not return this alive path. This packet will never be sent even if path [2] is alive. 3.3.3.1/24: nexthop via 1.1.1.254 dev eth1 weight 1 <--[1] (dead neigh) nexthop via 2.2.2.254 dev eth2 weight 1 <--[2] With sysctl_fib_multipath_use_neigh set is supposed to find an available path respecting to the l3/l4 hash. But if there is no available route with this hash, it should at least return an alive route even with other hash. This patch is to fix it by processing fib_multipath_use_neigh earlier than the hash check, so that it will at least return an alive route if there is when fib_multipath_use_neigh is enabled. It's also compatible with before when there are alive routes with the l3/l4 hash. Fixes: a6db4494d218 ("net: ipv4: Consider failed nexthops in multipath routes") Reported-by: Jianlin Shi <jishi@redhat.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-01net: sched: do not emit messages while holding spinlockLi RongQing
move messages emitting out of sch_tree_lock to avoid holding this lock too long. Signed-off-by: Li RongQing <lirongqing@baidu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-01vlan: also check phy_driver ts_info for vlan's real deviceHangbin Liu
Just like function ethtool_get_ts_info(), we should also consider the phy_driver ts_info call back. For example, driver dp83640. Fixes: 37dd9255b2f6 ("vlan: Pass ethtool get_ts_info queries to real device.") Acked-by: Richard Cochran <richardcochran@gmail.com> Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-01Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Minor conflicts in drivers/net/ethernet/mellanox/mlx5/core/en_rep.c, we had some overlapping changes: 1) In 'net' MLX5E_PARAMS_LOG_{SQ,RQ}_SIZE --> MLX5E_REP_PARAMS_LOG_{SQ,RQ}_SIZE 2) In 'net-next' params->log_rq_size is renamed to be params->log_rq_mtu_frames. 3) In 'net-next' params->hard_mtu is added. Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-01Merge branch 'for-upstream' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next Johan Hedberg says: ==================== pull request: bluetooth-next 2018-04-01 Here's (most likely) the last bluetooth-next pull request for the 4.17 kernel: - Remove unused btuart_cs driver (replaced by serial_cs + hci_uart) - New USB ID for Edimax EW-7611ULB controller - Cleanups & fixes to hci_bcm driver - Clenups to btmrvl driver Please let me know if there are any issues pulling. Thanks. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-01Bluetooth: Mark expected switch fall-throughsGustavo A. R. Silva
In preparation to enabling -Wimplicit-fallthrough, mark switch cases where we are expecting to fall through. Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2018-04-01ipv6: factorize sk_wmem_alloc updates done by __ip6_append_data()Eric Dumazet
While testing my inet defrag changes, I found that the senders could spend ~20% of cpu cycles in skb_set_owner_w() updating sk->sk_wmem_alloc for every fragment they cook, competing with TX completion of prior skbs possibly happening on another cpus. The solution to this problem is to use alloc_skb() instead of sock_wmalloc() and manually perform a single sk_wmem_alloc change. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-01ipv4: factorize sk_wmem_alloc updates done by __ip_append_data()Eric Dumazet
While testing my inet defrag changes, I found that the senders could spend ~20% of cpu cycles in skb_set_owner_w() updating sk->sk_wmem_alloc for every fragment they cook. The solution to this problem is to use alloc_skb() instead of sock_wmalloc() and manually perform a single sk_wmem_alloc change. Similar change for IPv6 is provided in following patch. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-01ip6_gre: remove redundant 'tunnel' setting in ip6erspan_tap_init()Alexey Kodanev
'tunnel' was already set at the start of ip6erspan_tap_init(). Fixes: 5a963eb61b7c ("ip6_gre: Add ERSPAN native tunnel support") Signed-off-by: Alexey Kodanev <alexey.kodanev@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31crypto : chtls - CPL handler definitionAtul Gupta
Exchange messages with hardware to program the TLS session CPL handlers for messages received from chip. Signed-off-by: Atul Gupta <atul.gupta@chelsio.com> Signed-off-by: Michael Werner <werner@chelsio.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31ethtool: enable Inline TLS in HWAtul Gupta
Ethtool option enables TLS record offload on HW, user configures the feature for netdev capable of Inline TLS. This allows user to define custom sk_prot for Inline TLS sock Signed-off-by: Atul Gupta <atul.gupta@chelsio.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31tls: support for Inline tls recordAtul Gupta
Facility to register Inline TLS drivers to net/tls. Setup TLS_HW_RECORD prot to listen on offload device. Cases handled - Inline TLS device exists, setup prot for TLS_HW_RECORD - Atleast one Inline TLS exists, sets TLS_HW_RECORD. - If non-inline device establish connection, move to TLS_SW_TX Signed-off-by: Atul Gupta <atul.gupta@chelsio.com> Reviewed-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller
Daniel Borkmann says: ==================== pull-request: bpf-next 2018-03-31 The following pull-request contains BPF updates for your *net-next* tree. The main changes are: 1) Add raw BPF tracepoint API in order to have a BPF program type that can access kernel internal arguments of the tracepoints in their raw form similar to kprobes based BPF programs. This infrastructure also adds a new BPF_RAW_TRACEPOINT_OPEN command to BPF syscall which returns an anon-inode backed fd for the tracepoint object that allows for automatic detach of the BPF program resp. unregistering of the tracepoint probe on fd release, from Alexei. 2) Add new BPF cgroup hooks at bind() and connect() entry in order to allow BPF programs to reject, inspect or modify user space passed struct sockaddr, and as well a hook at post bind time once the port has been allocated. They are used in FB's container management engine for implementing policy, replacing fragile LD_PRELOAD wrapper intercepting bind() and connect() calls that only works in limited scenarios like glibc based apps but not for other runtimes in containerized applications, from Andrey. 3) BPF_F_INGRESS flag support has been added to sockmap programs for their redirect helper call bringing it in line with cls_bpf based programs. Support is added for both variants of sockmap programs, meaning for tx ULP hooks as well as recv skb hooks, from John. 4) Various improvements on BPF side for the nfp driver, besides others this work adds BPF map update and delete helper call support from the datapath, JITing of 32 and 64 bit XADD instructions as well as offload support of bpf_get_prandom_u32() call. Initial implementation of nfp packet cache has been tackled that optimizes memory access (see merge commit for further details), from Jakub and Jiong. 5) Removal of struct bpf_verifier_env argument from the print_bpf_insn() API has been done in order to prepare to use print_bpf_insn() soon out of perf tool directly. This makes the print_bpf_insn() API more generic and pushes the env into private data. bpftool is adjusted as well with the print_bpf_insn() argument removal, from Jiri. 6) Couple of cleanups and prep work for the upcoming BTF (BPF Type Format). The latter will reuse the current BPF verifier log as well, thus bpf_verifier_log() is further generalized, from Martin. 7) For bpf_getsockopt() and bpf_setsockopt() helpers, IPv4 IP_TOS read and write support has been added in similar fashion to existing IPv6 IPV6_TCLASS socket option we already have, from Nikita. 8) Fixes in recent sockmap scatterlist API usage, which did not use sg_init_table() for initialization thus triggering a BUG_ON() in scatterlist API when CONFIG_DEBUG_SG was enabled. This adds and uses a small helper sg_init_marker() to properly handle the affected cases, from Prashant. 9) Let the BPF core follow IDR code convention and therefore use the idr_preload() and idr_preload_end() helpers, which would also help idr_alloc_cyclic() under GFP_ATOMIC to better succeed under memory pressure, from Shaohua. 10) Last but not least, a spelling fix in an error message for the BPF cookie UID helper under BPF sample code, from Colin. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31inet: frags: get rid of nf_ct_frag6_skb_cb/NFCT_FRAG6_CBEric Dumazet
nf_ct_frag6_queue() uses skb->cb[] to store the fragment offset, meaning that we could use two cache lines per skb when finding the insertion point, if for some reason inet6_skb_parm size is increased in the future. By using skb->ip_defrag_offset instead of skb->cb[] we pack all the fields in a single cache line, matching what we did for IPv4. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31ipv6: frags: get rid of ip6frag_skb_cb/FRAG6_CBEric Dumazet
ip6_frag_queue uses skb->cb[] to store the fragment offset, meaning that we could use two cache lines per skb when finding the insertion point, if for some reason inet6_skb_parm size is increased in the future. By using skb->ip_defrag_offset instead of skb->cb[], we pack all the fields in a single cache line, matching what we did for IPv4. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31inet: frags: get rid of ipfrag_skb_cb/FRAG_CBEric Dumazet
ip_defrag uses skb->cb[] to store the fragment offset, and unfortunately this integer is currently in a different cache line than skb->next, meaning that we use two cache lines per skb when finding the insertion point. By aliasing skb->ip_defrag_offset and skb->dev, we pack all the fields in a single cache line and save precious memory bandwidth. Note that after the fast path added by Changli Gao in commit d6bebca92c66 ("fragment: add fast path for in-order fragments") this change wont help the fast path, since we still need to access prev->len (2nd cache line), but will show great benefits when slow path is entered, since we perform a linear scan of a potentially long list. Also, note that this potential long list is an attack vector, we might consider also using an rb-tree there eventually. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31ipv6: frags: rewrite ip6_expire_frag_queue()Eric Dumazet
Make it similar to IPv4 ip_expire(), and release the lock before calling icmp functions. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31inet: frags: do not clone skb in ip_expire()Eric Dumazet
An skb_clone() was added in commit ec4fbd64751d ("inet: frag: release spinlock before calling icmp_send()") While fixing the bug at that time, it also added a very high cost for DDOS frags, as the ICMP rate limit is applied after this expensive operation (skb_clone() + consume_skb(), implying memory allocations, copy, and freeing) We can use skb_get(head) here, all we want is to make sure skb wont be freed by another cpu. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31inet: frags: break the 2GB limit for frags storageEric Dumazet
Some users are willing to provision huge amounts of memory to be able to perform reassembly reasonnably well under pressure. Current memory tracking is using one atomic_t and integers. Switch to atomic_long_t so that 64bit arches can use more than 2GB, without any cost for 32bit arches. Note that this patch avoids an overflow error, if high_thresh was set to ~2GB, since this test in inet_frag_alloc() was never true : if (... || frag_mem_limit(nf) > nf->high_thresh) Tested: $ echo 16000000000 >/proc/sys/net/ipv4/ipfrag_high_thresh <frag DDOS> $ grep FRAG /proc/net/sockstat FRAG: inuse 14705885 memory 16000002880 $ nstat -n ; sleep 1 ; nstat | grep Reas IpReasmReqds 3317150 0.0 IpReasmFails 3317112 0.0 Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31inet: frags: remove inet_frag_maybe_warn_overflow()Eric Dumazet
This function is obsolete, after rhashtable addition to inet defrag. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31inet: frags: get rif of inet_frag_evicting()Eric Dumazet
This refactors ip_expire() since one indentation level is removed. Note: in the future, we should try hard to avoid the skb_clone() since this is a serious performance cost. Under DDOS, the ICMP message wont be sent because of rate limits. Fact that ip6_expire_frag_queue() does not use skb_clone() is disturbing too. Presumably IPv6 should have the same issue than the one we fixed in commit ec4fbd64751d ("inet: frag: release spinlock before calling icmp_send()") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31inet: frags: remove some helpersEric Dumazet
Remove sum_frag_mem_limit(), ip_frag_mem() & ip6_frag_mem() Also since we use rhashtable we can bring back the number of fragments in "grep FRAG /proc/net/sockstat /proc/net/sockstat6" that was removed in commit 434d305405ab ("inet: frag: don't account number of fragment queues") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31inet: frags: use rhashtables for reassembly unitsEric Dumazet
Some applications still rely on IP fragmentation, and to be fair linux reassembly unit is not working under any serious load. It uses static hash tables of 1024 buckets, and up to 128 items per bucket (!!!) A work queue is supposed to garbage collect items when host is under memory pressure, and doing a hash rebuild, changing seed used in hash computations. This work queue blocks softirqs for up to 25 ms when doing a hash rebuild, occurring every 5 seconds if host is under fire. Then there is the problem of sharing this hash table for all netns. It is time to switch to rhashtables, and allocate one of them per netns to speedup netns dismantle, since this is a critical metric these days. Lookup is now using RCU. A followup patch will even remove the refcount hold/release left from prior implementation and save a couple of atomic operations. Before this patch, 16 cpus (16 RX queue NIC) could not handle more than 1 Mpps frags DDOS. After the patch, I reach 9 Mpps without any tuning, and can use up to 2GB of storage for the fragments (exact number depends on frags being evicted after timeout) $ grep FRAG /proc/net/sockstat FRAG: inuse 1966916 memory 2140004608 A followup patch will change the limits for 64bit arches. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Florian Westphal <fw@strlen.de> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Alexander Aring <alex.aring@gmail.com> Cc: Stefan Schmidt <stefan@osg.samsung.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31inet: frags: refactor ipfrag_init()Eric Dumazet
We need to call inet_frags_init() before register_pernet_subsys(), as a prereq for following patch ("inet: frags: use rhashtables for reassembly units") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31inet: frags: refactor lowpan_net_frag_init()Eric Dumazet
We want to call lowpan_net_frag_init() earlier. Similar to commit "inet: frags: refactor ipv6_frag_init()" This is a prereq to "inet: frags: use rhashtables for reassembly units" Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31inet: frags: refactor ipv6_frag_init()Eric Dumazet
We want to call inet_frags_init() earlier. This is a prereq to "inet: frags: use rhashtables for reassembly units" Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31inet: frags: add a pointer to struct netns_fragsEric Dumazet
In order to simplify the API, add a pointer to struct inet_frags. This will allow us to make things less complex. These functions no longer have a struct inet_frags parameter : inet_frag_destroy(struct inet_frag_queue *q /*, struct inet_frags *f */) inet_frag_put(struct inet_frag_queue *q /*, struct inet_frags *f */) inet_frag_kill(struct inet_frag_queue *q /*, struct inet_frags *f */) inet_frags_exit_net(struct netns_frags *nf /*, struct inet_frags *f */) ip6_expire_frag_queue(struct net *net, struct frag_queue *fq) Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31inet: frags: change inet_frags_init_net() return valueEric Dumazet
We will soon initialize one rhashtable per struct netns_frags in inet_frags_init_net(). This patch changes the return value to eventually propagate an error. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31vlan: vlan_hw_filter_capable() can be staticWei Yongjun
Fixes the following sparse warning: net/8021q/vlan_core.c:168:6: warning: symbol 'vlan_hw_filter_capable' was not declared. Should it be static? Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>