Age | Commit message (Collapse) | Author |
|
This patch allows XDP prog to extend/remove the packet
data at the head (like adding or removing header). It is
done by adding a new XDP helper bpf_xdp_adjust_head().
It also renames bpf_helper_changes_skb_data() to
bpf_helper_changes_pkt_data() to better reflect
that XDP prog does not work on skb.
This patch adds one "xdp_adjust_head" bit to bpf_prog for the
XDP-capable driver to check if the XDP prog requires
bpf_xdp_adjust_head() support. The driver can then decide
to error out during XDP_SETUP_PROG.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Under UDP flood, many softirq producers try to add packets to
UDP receive queue, and one user thread is burning one cpu trying
to dequeue packets as fast as possible.
Two parts of the per packet cost are :
- copying payload from kernel space to user space,
- freeing memory pieces associated with skb.
If socket is under pressure, softirq handler(s) can try to pull in
skb->head the payload of the packet if it fits.
Meaning the softirq handler(s) can free/reuse the page fragment
immediately, instead of letting udp_recvmsg() do this hundreds of usec
later, possibly from another node.
Additional gains :
- We reduce skb->truesize and thus can store more packets per SO_RCVBUF
- We avoid cache line misses at copyout() time and consume_skb() time,
and avoid one put_page() with potential alien freeing on NUMA hosts.
This comes at the cost of a copy, bounded to available tail room, which
is usually small. (We might have to fix GRO_MAX_HEAD which looks bigger
than necessary)
This patch gave me about 5 % increase in throughput in my tests.
skb_condense() helper could probably used in other contexts.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
RFS is not commonly used, so add a jump label to avoid some conditionals
in fast path.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Support matching on ICMP type and code.
Example usage:
tc qdisc add dev eth0 ingress
tc filter add dev eth0 protocol ip parent ffff: flower \
indev eth0 ip_proto icmp type 8 code 0 action drop
tc filter add dev eth0 protocol ipv6 parent ffff: flower \
indev eth0 ip_proto icmpv6 type 128 code 0 action drop
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Allow dissection of ICMP(V6) type and code. This should only occur
if a packet is ICMP(V6) and the dissector has FLOW_DISSECTOR_KEY_ICMP set.
There are currently no users of FLOW_DISSECTOR_KEY_ICMP.
A follow-up patch will allow FLOW_DISSECTOR_KEY_ICMP to be used by
the flower classifier.
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add UAPI to provide set of flags for matching, where the flags
provided from user-space are mapped to flow-dissector flags.
The 1st flag allows to match on whether the packet is an
IP fragment and corresponds to the FLOW_DIS_IS_FRAGMENT flag.
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Reviewed-by: Paul Blakey <paulb@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Currently, icmp_rcv() always return zero on a packet delivery upcall.
To make its behavior more compliant with the way this API should be
used, this patch changes this to let it return NET_RX_SUCCESS when the
packet is proper handled, and NET_RX_DROP otherwise.
Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Pablo Neira Ayuso says:
====================
Netfilter/IPVS updates for net-next
The following patchset contains a large Netfilter update for net-next,
to summarise:
1) Add support for stateful objects. This series provides a nf_tables
native alternative to the extended accounting infrastructure for
nf_tables. Two initial stateful objects are supported: counters and
quotas. Objects are identified by a user-defined name, you can fetch
and reset them anytime. You can also use a maps to allow fast lookups
using any arbitrary key combination. More info at:
http://marc.info/?l=netfilter-devel&m=148029128323837&w=2
2) On-demand registration of nf_conntrack and defrag hooks per netns.
Register nf_conntrack hooks if we have a stateful ruleset, ie.
state-based filtering or NAT. The new nf_conntrack_default_on sysctl
enables this from newly created netnamespaces. Default behaviour is not
modified. Patches from Florian Westphal.
3) Allocate 4k chunks and then use these for x_tables counter allocation
requests, this improves ruleset load time and also datapath ruleset
evaluation, patches from Florian Westphal.
4) Add support for ebpf to the existing x_tables bpf extension.
From Willem de Bruijn.
5) Update layer 4 checksum if any of the pseudoheader fields is updated.
This provides a limited form of 1:1 stateless NAT that make sense in
specific scenario, eg. load balancing.
6) Add support to flush sets in nf_tables. This series comes with a new
set->ops->deactivate_one() indirection given that we have to walk
over the list of set elements, then deactivate them one by one.
The existing set->ops->deactivate() performs an element lookup that
we don't need.
7) Two patches to avoid cloning packets, thus speed up packet forwarding
via nft_fwd from ingress. From Florian Westphal.
8) Two IPVS patches via Simon Horman: Decrement ttl in all modes to
prevent infinite loops, patch from Dwip Banerjee. And one minor
refactoring from Gao feng.
9) Revisit recent log support for nf_tables netdev families: One patch
to ensure that we correctly handle non-ethernet packets. Another
patch to add missing logger definition for netdev. Patches from
Liping Zhang.
10) Three patches for nft_fib, one to address insufficient register
initialization and another to solve incorrect (although harmless)
byteswap operation. Moreover update xt_rpfilter and nft_fib to match
lbcast packets with zeronet as source, eg. DHCP Discover packets
(0.0.0.0 -> 255.255.255.255). Also from Liping Zhang.
11) Built-in DCCP, SCTP and UDPlite conntrack and NAT support, from
Davide Caratti. While DCCP is rather hopeless lately, and UDPlite has
been broken in many-cast mode for some little time, let's give them a
chance by placing them at the same level as other existing protocols.
Thus, users don't explicitly have to modprobe support for this and
NAT rules work for them. Some people point to the lack of support in
SOHO Linux-based routers that make deployment of new protocols harder.
I guess other middleboxes outthere on the Internet are also to blame.
Anyway, let's see if this has any impact in the midrun.
12) Skip software SCTP software checksum calculation if the NIC comes
with SCTP checksum offload support. From Davide Caratti.
13) Initial core factoring to prepare conversion to hook array. Three
patches from Aaron Conole.
14) Gao Feng made a wrong conversion to switch in the xt_multiport
extension in a patch coming in the previous batch. Fix it in this
batch.
15) Get vmalloc call in sync with kmalloc flags to avoid a warning
and likely OOM killer intervention from x_tables. From Marcelo
Ricardo Leitner.
16) Update Arturo Borrero's email address in all source code headers.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Allow to restore consumed quota, this is useful to restore the quota
state across reboots.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
Add support for attaching an eBPF object by file descriptor.
The iptables binary can be called with a path to an elf object or a
pinned bpf object. Also pass the mode and path to the kernel to be
able to return it later for iptables dump and save.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
Andrey Konovalov reported that this vmalloc call is based on an
userspace request and that it's spewing traces, which may flood the logs
and cause DoS if abused.
Florian Westphal also mentioned that this call should not trigger OOM
killer.
This patch brings the vmalloc call in sync to kmalloc and disables the
warn trace on allocation failure and also disable OOM killer invocation.
Note, however, that under such stress situation, other places may
trigger OOM killer invocation.
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Cc: Florian Westphal <fw@strlen.de>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
This patch adds support for set flushing, that consists of walking over
the set elements if the NFTA_SET_ELEM_LIST_ELEMENTS attribute is set.
This patch requires the following changes:
1) Add set->ops->deactivate_one() operation: This allows us to
deactivate an element from the set element walk path, given we can
skip the lookup that happens in ->deactivate().
2) Add a new nft_trans_alloc_gfp() function since we need to allocate
transactions using GFP_ATOMIC given the set walk path happens with
held rcu_read_lock.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
This new function allows us to deactivate one single element, this is
required by the set flush command that comes in a follow up patch.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
Context is not modified by nft_trans_alloc(), so constify it.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
SCTP GSO and hardware can do CRC32c computation after netfilter processing,
so we can avoid calling sctp_compute_checksum() on skb if skb->ip_summed
is equal to CHECKSUM_PARTIAL. Moreover, set skb->ip_summed to CHECKSUM_NONE
when the NAT code computes the CRC, to prevent offloaders from computing
it again (on ixgbe this resulted in a transmission with wrong L4 checksum).
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
Otherwise, DHCP Discover packets(0.0.0.0->255.255.255.255) may be
dropped incorrectly.
Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
This patch adds the netlink code to filter out dump of stateful objects,
through the NFTA_OBJ_TYPE netlink attribute.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
This patch allows us to refer to stateful object dictionaries, the
source register indicates the key data to be used to look up for the
corresponding state object. We can refer to these maps through names or,
alternatively, the map transaction id. This allows us to refer to both
anonymous and named maps.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
This patch allows you to refer to stateful objects from set elements.
This provides the infrastructure to create maps where the right hand
side of the mapping is a stateful object.
This allows us to build dictionaries of stateful objects, that you can
use to perform fast lookups using any arbitrary key combination.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
Notify on depleted quota objects. The NFT_QUOTA_F_DEPLETED flag
indicates we have reached overquota.
Add pointer to table from nft_object, so we can use it when sending the
depletion notification to userspace.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
Introduce nf_tables_obj_notify() to notify internal state changes in
stateful objects. This is used by the quota object to report depletion
in a follow up patch.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
This patch adds a new NFT_MSG_GETOBJ_RESET command perform an atomic
dump-and-reset of the stateful object. This also comes with add support
for atomic dump and reset for counter and quota objects.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
Add a new attribute NFTA_QUOTA_CONSUMED that displays the amount of
quota that has been already consumed. This allows us to restore the
internal state of the quota object between reboots as well as to monitor
how wasted it is.
This patch changes the logic to account for the consumed bytes, instead
of the bytes that remain to be consumed.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
|
|
This new expression allows us to refer to existing stateful objects from
rules.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
Register a new quota stateful object type into the new stateful object
infrastructure.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
Register a new percpu counter stateful object type into the stateful
object infrastructure.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
This patch augments nf_tables to support stateful objects. This new
infrastructure allows you to create, dump and delete stateful objects,
that are identified by a user-defined name.
This patch adds the generic infrastructure, follow up patches add
support for two stateful objects: counters and quotas.
This patch provides a native infrastructure for nf_tables to replace
nfacct, the extended accounting infrastructure for iptables.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
... so we can use current skb instead of working with a clone.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
I lost one test case in the last commit for xt_multiport.
For example, the rule is "-m multiport --dports 22,80,443".
When first port is unmatched and the second is matched, the curent codes
could not return the right result.
It would return false directly when the first port is unmatched.
Fixes: dd2602d00f80 ("netfilter: xt_multiport: Use switch case instead
of multiple condition checks")
Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
This patch adds a new flag that signals the kernel to update layer 4
checksum if the packet field belongs to the layer 4 pseudoheader. This
implicitly provides stateless NAT 1:1 that is useful under very specific
usecases.
Since rules mangling layer 3 fields that are part of the pseudoheader
may potentially convey any layer 4 packet, we have to deal with the
layer 4 checksum adjustment using protocol specific code.
This patch adds support for TCP, UDP and ICMPv6, since they include the
pseudoheader in the layer 4 checksum calculation. ICMP doesn't, so we
can skip it.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
Otherwise, if fib lookup fail, *dest will be filled with garbage value,
so reverse path filtering will not work properly:
# nft add rule x prerouting fib saddr oif eq 0 drop
Fixes: f6d0cbcf09c5 ("netfilter: nf_tables: add fib expression")
Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
Acctually ntohl and htonl are identical, so this doesn't affect
anything, but it is conceptually wrong.
Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
instead of allocating each xt_counter individually, allocate 4k chunks
and then use these for counter allocation requests.
This should speed up rule evaluation by increasing data locality,
also speeds up ruleset loading because we reduce calls to the percpu
allocator.
As Eric points out we can't use PAGE_SIZE, page_allocator would fail on
arches with 64k page size.
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
Keeps some noise away from a followup patch.
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
On SMP we overload the packet counter (unsigned long) to contain
percpu offset. Hide this from callers and pass xt_counters address
instead.
Preparation patch to allocate the percpu counters in page-sized batch
chunks.
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
This is to facilitate converting from a singly-linked list to an array
of elements.
Signed-off-by: Aaron Conole <aconole@bytheb.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
This allows easier future refactoring.
Signed-off-by: Aaron Conole <aconole@bytheb.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
nf_defrag modules for ipv4 and ipv6 export an empty stub function.
Any module that needs the defragmentation hooks registered simply 'calls'
this empty function to create a phony module dependency -- modprobe will
then load the defrag module too.
This extends netfilter ipv4/ipv6 defragmentation modules to delay the hook
registration until the functionality is requested within a network namespace
instead of module load time for all namespaces.
Hooks are only un-registered on module unload or when a namespace that used
such defrag functionality exits.
We have to use struct net for this as the register hooks can be called
before netns initialization here from the ipv4/ipv6 conntrack module
init path.
There is no unregister functionality support, defrag will always be
active once it was requested inside a net namespace.
The reason is that defrag has impact on nft and iptables rulesets
(without defrag we might see framents).
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
Neal Cardwell says:
If I am reading the code correctly, then I would have two concerns:
1) Has that been tested? That seems like an extremely dramatic
decrease in cwnd. For example, if the cwnd is 80, and there are 40
ACKs, and half the ACKs are ECE marked, then my back-of-the-envelope
calculations seem to suggest that after just 11 ACKs the cwnd would be
down to a minimal value of 2 [..]
2) That seems to contradict another passage in the draft [..] where it
sazs:
Just as specified in [RFC3168], DCTCP does not react to congestion
indications more than once for every window of data.
Neal is right. Fortunately we don't have to complicate this by testing
vs. current rtt estimate, we can just revert the patch.
Normal stack already handles this for us: receiving ACKs with ECE
set causes a call to tcp_enter_cwr(), from there on the ssthresh gets
adjusted and prr will take care of cwnd adjustment.
Fixes: 4780566784b396 ("dctcp: update cwnd on congestion event")
Cc: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
There have been some reports lately about TCP connection stalls caused
by NIC drivers that aren't setting gso_size on aggregated packets on rx
path. This causes TCP to assume that the MSS is actually the size of the
aggregated packet, which is invalid.
Although the proper fix is to be done at each driver, it's often hard
and cumbersome for one to debug, come to such root cause and report/fix
it.
This patch amends this situation in two ways. First, it adds a warning
on when this situation occurs, so it gives a hint to those trying to
debug this. It also limit the maximum probed MSS to the adverised MSS,
as it should never be any higher than that.
The result is that the connection may not have the best performance ever
but it shouldn't stall, and the admin will have a hint on what to look
for.
Tested with virtio by forcing gso_size to 0.
v2: updated msg per David's suggestion
v3: use skb_iif to find the interface and also log its name, per Eric
Dumazet's suggestion. As the skb may be backlogged and the interface
gone by then, we need to check if the number still has a meaning.
v4: use helper tcp_gro_dev_warn() and avoid pr_warn_once inside __once, per
David's suggestion
Cc: Jonathan Maxwell <jmaxwell37@gmail.com>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
In UDP recvmsg() path we currently access 3 cache lines from an skb
while holding receive queue lock, plus another one if packet is
dequeued, since we need to change skb->next->prev
1st cache line (contains ->next/prev pointers, offsets 0x00 and 0x08)
2nd cache line (skb->len & skb->peeked, offsets 0x80 and 0x8e)
3rd cache line (skb->truesize/users, offsets 0xe0 and 0xe4)
skb->peeked is only needed to make sure 0-length packets are properly
handled while MSG_PEEK is operated.
I had first the intent to remove skb->peeked but the "MSG_PEEK at
non-zero offset" support added by Sam Kumar makes this not possible.
This patch avoids one cache line miss during the locked section, when
skb->len and skb->peeked do not have to be read.
It also avoids the skb_set_peeked() cost for non empty UDP datagrams.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
It is wrong to schedule a work from sk_destruct using the socket
as the memory reserve because the socket will be freed immediately
after the return from sk_destruct.
Instead we should do the deferral prior to sk_free.
This patch does just that.
Fixes: 707693c8a498 ("netlink: Call cb->done from a worker thread")
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
When loading a BPF program via bpf(2), calculate the digest over
the program's instruction stream and store it in struct bpf_prog's
digest member. This is done at a point in time before any instructions
are rewritten by the verifier. Any unstable map file descriptor
number part of the imm field will be zeroed for the hash.
fdinfo example output for progs:
# cat /proc/1590/fdinfo/5
pos: 0
flags: 02000002
mnt_id: 11
prog_type: 1
prog_jited: 1
prog_digest: b27e8b06da22707513aa97363dfb11c7c3675d28
memlock: 4096
When programs are pinned and retrieved by an ELF loader, the loader
can check the program's digest through fdinfo and compare it against
one that was generated over the ELF file's program section to see
if the program needs to be reloaded. Furthermore, this can also be
exposed through other means such as netlink in case of a tc cls/act
dump (or xdp in future), but also through tracepoints or other
facilities to identify the program. Other than that, the digest can
also serve as a base name for the work in progress kallsyms support
of programs. The digest doesn't depend/select the crypto layer, since
we need to keep dependencies to a minimum. iproute2 will get support
for this facility.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Commit 18cdb37ebf4c ("net: sched: do not use tcf_proto 'tp' argument from
call_rcu") removed the last usage of tp from cls_bpf_delete_prog(), so also
remove it from the function as argument to not give a wrong impression. tp
is illegal to access from this callback, since it could already have been
freed.
Refactor the deletion code a bit, so that cls_bpf_destroy() can call into
the same code for prog deletion as cls_bpf_delete() op, instead of having
it unnecessarily duplicated.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Commit d691f9e8d440 ("bpf: allow programs to write to certain skb
fields") pushed access type check outside of __is_valid_access()
to have different restrictions for socket filters and tc programs.
type is thus not used anymore within __is_valid_access() and should
be removed as a function argument. Same for __is_valid_xdp_access()
introduced by 6a773a15a1e8 ("bpf: add XDP prog type for early driver
filter").
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
1) Old code was hard to maintain, due to complex lock chains.
(We probably will be able to remove some kfree_rcu() in callers)
2) Using a single timer to update all estimators does not scale.
3) Code was buggy on 32bit kernel (WRITE_ONCE() on 64bit quantity
is not supposed to work well)
In this rewrite :
- I removed the RB tree that had to be scanned in
gen_estimator_active(). qdisc dumps should be much faster.
- Each estimator has its own timer.
- Estimations are maintained in net_rate_estimator structure,
instead of dirtying the qdisc. Minor, but part of the simplification.
- Reading the estimator uses RCU and a seqcount to provide proper
support for 32bit kernels.
- We reduce memory need when estimators are not used, since
we store a pointer, instead of the bytes/packets counters.
- xt_rateest_mt() no longer has to grab a spinlock.
(In the future, xt_rateest_tg() could be switched to per cpu counters)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Check if the returned device from tcf_exts_get_dev function supports tc
offload and in case the rule can't be offloaded, set the filter hw_dev
parameter to the original device given by the user.
The filter hw_device parameter should always be set by fl_hw_replace_filter
function, since this pointer is used by dump stats and destroy
filter for each flower rule (offloaded or not).
Fixes: 7091d8c7055d ('net/sched: cls_flower: Add offload support using egress Hardware device')
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Reported-by: Simon Horman <horms@verge.net.au>
Tested-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Made kernel accept IPv6 routes with IPv4-mapped address as next-hop.
It is possible to configure IP interfaces with IPv4-mapped addresses, and
one can add IPv6 routes for IPv4-mapped destinations/prefixes, yet prior
to this fix the kernel returned an EINVAL when attempting to add an IPv6
route with an IPv4-mapped address as a nexthop/gateway.
RFC 4798 (a proposed standard RFC) uses IPv4-mapped addresses as nexthops,
thus in order to support that type of address configuration the kernel
needs to allow IPv4-mapped addresses as nexthops.
Signed-off-by: Erik Nordmark <nordmark@arista.com>
Signed-off-by: Bob Gilligan <gilligan@arista.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The check of the return value of sock_register() is ineffective.
"if(!err)" seems to be a typo. It is better to propagate the error code
to the callers of caif_sktinit_module(). This patch removes the check
statment and directly returns the result of sock_register().
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=188751
Signed-off-by: Pan Bian <bianpan2016@163.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|