Age | Commit message (Collapse) | Author |
|
call->rx_winsize should be initialised to the sysctl setting and the sysctl
setting should be limited to the maximum we want to permit. Further, we
need to place this in the ACK info instead of the sysctl setting.
Furthermore, discard the idea of accepting the subpackets of a jumbo packet
that lie beyond the receive window when the first packet of the jumbo is
within the window. Just discard the excess subpackets instead. This
allows the receive window to be opened up right to the buffer size less one
for the dead slot.
Signed-off-by: David Howells <dhowells@redhat.com>
|
|
The preallocated call buffer holds a ref on the calls within that buffer.
The ref was being released in the wrong place - it worked okay for incoming
calls to the AFS cache manager service, but doesn't work right for incoming
calls to a userspace service.
Instead of releasing an extra ref service calls in rxrpc_release_call(),
the ref needs to be released during the acceptance/rejectance process. To
this end:
(1) The prealloc ref is now normally released during
rxrpc_new_incoming_call().
(2) For preallocated kernel API calls, the kernel API's ref needs to be
released when the call is discarded on socket close.
(3) We shouldn't take a second ref in rxrpc_accept_call().
(4) rxrpc_recvmsg_new_call() needs to get a ref of its own when it adds
the call to the to_be_accepted socket queue.
In doing (4) above, we would prefer not to put the call's refcount down to
0 as that entails doing cleanup in softirq context, but it's unlikely as
there are several refs held elsewhere, at least one of which must be put by
someone in process context calling rxrpc_release_call(). However, it's not
a problem if we do have to do that.
Signed-off-by: David Howells <dhowells@redhat.com>
|
|
Adjust the call ref tracepoint to show references held on a call by the
kernel API separately as much as possible and add an additional trace to at
the allocation point from the preallocation buffer for an incoming call.
Note that this doesn't show the allocation of a client call for the kernel
separately at the moment.
Signed-off-by: David Howells <dhowells@redhat.com>
|
|
Allow tx_winsize to grow when the ACK info packet shows a larger receive
window at the other end rather than only permitting it to shrink.
Signed-off-by: David Howells <dhowells@redhat.com>
|
|
skb->len should be used rather than skb->data_len when referring to the
amount of data in a packet. This will only cause a malfunction in the
following cases:
(1) We receive a jumbo packet (validation and splitting both are wrong).
(2) We see if there's extra ACK info in an ACK packet (we think it's not
there and just ignore it).
Signed-off-by: David Howells <dhowells@redhat.com>
|
|
Add a missing unlock in rxrpc_call_accept() in the path taken if there's no
call to wake up.
Signed-off-by: David Howells <dhowells@redhat.com>
|
|
rxrpc_recvmsg() needs to make sure that the call it has just been
processing gets requeued for further attention if the buffer has been
filled and there's more data to be consumed. The softirq producer only
queues the call and wakes the socket if it fills the first slot in the
window, so userspace might end up sleeping forever otherwise, despite there
being data available.
This is not a problem provided the userspace buffer is big enough or it
empties the buffer completely before more data comes in.
Signed-off-by: David Howells <dhowells@redhat.com>
|
|
The IDLE ACK packet should use the rxrpc_idle_ack_delay setting when the
timer is set for it.
Signed-off-by: David Howells <dhowells@redhat.com>
|
|
We need to wake up the sender when Tx window rotation due to an incoming
ACK makes space in the buffer otherwise the sender is liable to just hang
endlessly.
This problem isn't noticeable if the Tx phase transfers no more than will
fit in a single window or the Tx window rotates fast enough that it doesn't
get full.
Signed-off-by: David Howells <dhowells@redhat.com>
|
|
Peer records created for incoming connections weren't getting their hash
key set. This meant that incoming calls wouldn't see more than one DATA
packet - which is not a problem for AFS CM calls with small request data
blobs.
Signed-off-by: David Howells <dhowells@redhat.com>
|
|
'ub' is malloced in tipc_udp_enable() and should be freed before
leaving from the error handling cases, otherwise it will cause
memory leak.
Fixes: ba5aa84a2d22 ("tipc: split UDP nl address parsing")
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
If /sbin/bridge-stp is available on the system, bridge tries to execute
it instead of the kernel implementation when starting/stopping STP.
If anything goes wrong with /sbin/bridge-stp, bridge silently falls back
to kernel STP, making hard to debug userspace STP.
This patch adds a br_stp_call_user helper to start/stop userspace STP
and debug errors from the program: abnormal exit status is stored in the
lower byte and normal exit status is stored in higher byte.
Below is a simple example on a kernel with dynamic debug enabled:
# ln -s /bin/false /sbin/bridge-stp
# brctl stp br0 on
br0: failed to start userspace STP (256)
# dmesg
br0: /sbin/bridge-stp exited with code 1
br0: failed to start userspace STP (256)
br0: using kernel STP
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Conflicts:
drivers/net/ethernet/mediatek/mtk_eth_soc.c
drivers/net/ethernet/qlogic/qed/qed_dcbx.c
drivers/net/phy/Kconfig
All conflicts were cases of overlapping commits.
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Pull networking fixes from David Miller:
"Mostly small sets of driver fixes scattered all over the place.
1) Mediatek driver fixes from Sean Wang. Forward port not written
correctly during TX map, missed handling of EPROBE_DEFER, and
mistaken use of put_page() instead of skb_free_frag().
2) Fix socket double-free in KCM code, from WANG Cong.
3) QED driver fixes from Sudarsana Reddy Kalluru, including a fix for
using the dcbx buffers before initializing them.
4) Mellanox Switch driver fixes from Jiri Pirko, including a fix for
double fib removals and an error handling fix in
mlxsw_sp_module_init().
5) Fix kernel panic when enabling LLDP in i40e driver, from Dave
Ertman.
6) Fix padding of TSO packets in thunderx driver, from Sunil Goutham.
7) TCP's rcv_wup not initialized properly when using fastopen, from
Neal Cardwell.
8) Don't use uninitialized flow keys in flow dissector, from Gao
Feng.
9) Use after free in l2tp module unload, from Sabrina Dubroca.
10) Fix interrupt registry ordering issues in smsc911x driver, from
Jeremy Linton.
11) Fix crashes in bonding having to do with enslaving and rx_handler,
from Mahesh Bandewar.
12) AF_UNIX deadlock fixes from Linus.
13) In mlx5 driver, don't read skb->xmit_mode after it might have been
freed from the TX reclaim path. From Tariq Toukan.
14) Fix a bug from 2015 in TCP Yeah where the congestion window does
not increase, from Artem Germanov.
15) Don't pad frames on receive in NFP driver, from Jakub Kicinski.
16) Fix chunk fragmenting in SCTP wrt. GSO, from Marcelo Ricardo
Leitner.
17) Fix deletion of VRF routes, from Mark Tomlinson.
18) Fix device refcount leak when DAD fails in ipv6, from Wei Yongjun"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (101 commits)
net/mlx4_en: Fix panic on xmit while port is down
net/mlx4_en: Fixes for DCBX
net/mlx4_en: Fix the return value of mlx4_en_dcbnl_set_state()
net/mlx4_en: Fix the return value of mlx4_en_dcbnl_set_all()
net: ethernet: renesas: sh_eth: add POST registers for rz
drivers: net: phy: mdio-xgene: Add hardware dependency
dwc_eth_qos: do not register semi-initialized device
sctp: identify chunks that need to be fragmented at IP level
mlxsw: spectrum: Set port type before setting its address
mlxsw: spectrum_router: Fix error path in mlxsw_sp_router_init
nfp: don't pad frames on receive
nfp: drop support for old firmware ABIs
nfp: remove linux/version.h includes
tcp: cwnd does not increase in TCP YeAH
net/mlx5e: Fix parsing of vlan packets when updating lro header
net/mlx5e: Fix global PFC counters replication
net/mlx5e: Prevent casting overflow
net/mlx5e: Move an_disable_cap bit to a new position
net/mlx5e: Fix xmit_more counter race issue
tcp: fastopen: avoid negative sk_forward_alloc
...
|
|
No longer needed
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
No longer needed
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
A previous patch added l3mdev flow update making these hooks
redundant. Remove them.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Flip the IPv6 output path to use the l3mdev tx out hook. The VRF dst
is not returned on the first FIB lookup. Instead, the dst on the
skb is switched at the beginning of the IPv6 output processing to
send the packet to the VRF driver on xmit.
Link scope addresses (linklocal and multicast) need special handling:
specifically the oif the flow struct can not be changed because we
want the lookup tied to the enslaved interface. ie., the source address
and the returned route MUST point to the interface scope passed in.
Convert the existing vrf_get_rt6_dst to handle only link scope addresses.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Flip the IPv4 output path to use the l3mdev tx out hook. The VRF dst
is not returned on the first FIB lookup. Instead, the dst on the
skb is switched at the beginning of the IPv4 output processing to
send the packet to the VRF driver on xmit.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Allow an L3 master device to act as the loopback for that L3 domain.
For IPv4 the device can also have the address 127.0.0.1.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This patch adds the infrastructure to the output path to pass an skb
to an l3mdev device if it has a hook registered. This is the Tx parallel
to l3mdev_ip{6}_rcv in the receive path and is the basis for removing
the existing hook that returns the vrf dst on the fib lookup.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add l3mdev hook to set FLOWI_FLAG_SKIP_NH_OIF flag and update oif/iif
in flow struct if its oif or iif points to a device enslaved to an L3
Master device. Only 1 needs to be converted to match the l3mdev FIB
rule. This moves the flow adjustment for l3mdev to a single point
catching all lookups. It is redundant for existing hooks (those are
removed in later patches) but is needed for missed lookups such as
PMTU updates.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Willem noticed that we could avoid an rbtree lookup if the
the attempt to coalesce incoming skb to the last skb failed
for some reason.
Since most ooo additions are at the tail, this is definitely
worth adding a test and fast path.
Suggested-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yaogong Wang <wygivan@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
When userspace tries to create datapaths and the module is not loaded,
it will simply fail. With this patch, the module will be automatically
loaded.
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@redhat.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The IS_ENABLED() macro checks if a Kconfig symbol has been enabled either
built-in or as a module, use that macro instead of open coding the same.
Using the macro makes the code more readable by helping abstract away some
of the Kconfig built-in and module enable details.
Signed-off-by: Javier Martinez Canillas <javier@osg.samsung.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The IS_ENABLED() macro checks if a Kconfig symbol has been enabled either
built-in or as a module, use that macro instead of open coding the same.
Using the macro makes the code more readable by helping abstract away some
of the Kconfig built-in and module enable details.
Signed-off-by: Javier Martinez Canillas <javier@osg.samsung.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The IS_ENABLED() macro checks if a Kconfig symbol has been enabled either
built-in or as a module, use that macro instead of open coding the same.
Using the macro makes the code more readable by helping abstract away some
of the Kconfig built-in and module enable details.
Signed-off-by: Javier Martinez Canillas <javier@osg.samsung.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The IS_ENABLED() macro checks if a Kconfig symbol has been enabled either
built-in or as a module, use that macro instead of open coding the same.
Using the macro makes the code more readable by helping abstract away some
of the Kconfig built-in and module enable details.
Signed-off-by: Javier Martinez Canillas <javier@osg.samsung.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The IS_ENABLED() macro checks if a Kconfig symbol has been enabled either
built-in or as a module, use that macro instead of open coding the same.
Using the macro makes the code more readable by helping abstract away some
of the Kconfig built-in and module enable details.
Signed-off-by: Javier Martinez Canillas <javier@osg.samsung.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The IS_ENABLED() macro checks if a Kconfig symbol has been enabled either
built-in or as a module, use that macro instead of open coding the same.
Using the macro makes the code more readable by helping abstract away some
of the Kconfig built-in and module enable details.
Signed-off-by: Javier Martinez Canillas <javier@osg.samsung.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The IS_ENABLED() macro checks if a Kconfig symbol has been enabled either
built-in or as a module, use that macro instead of open coding the same.
Using the macro makes the code more readable by helping abstract away some
of the Kconfig built-in and module enable details.
Signed-off-by: Javier Martinez Canillas <javier@osg.samsung.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The IS_ENABLED() macro checks if a Kconfig symbol has been enabled either
built-in or as a module, use that macro instead of open coding the same.
Using the macro makes the code more readable by helping abstract away some
of the Kconfig built-in and module enable details.
Signed-off-by: Javier Martinez Canillas <javier@osg.samsung.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This action could be used before redirecting packets to a shared tunnel
device, or when redirecting packets arriving from a such a device.
The action will release the metadata created by the tunnel device
(decap), or set the metadata with the specified values for encap
operation.
For example, the following flower filter will forward all ICMP packets
destined to 11.11.11.2 through the shared vxlan device 'vxlan0'. Before
redirecting, a metadata for the vxlan tunnel is created using the
tunnel_key action and it's arguments:
$ tc filter add dev net0 protocol ip parent ffff: \
flower \
ip_proto 1 \
dst_ip 11.11.11.2 \
action tunnel_key set \
src_ip 11.11.0.1 \
dst_ip 11.11.0.2 \
id 11 \
action mirred egress redirect dev vxlan0
Signed-off-by: Amir Vadai <amir@vadai.me>
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Reviewed-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Introduce classifying by metadata extracted by the tunnel device.
Outer header fields - source/dest ip and tunnel id, are extracted from
the metadata when classifying.
For example, the following will add a filter on the ingress Qdisc of shared
vxlan device named 'vxlan0'. To forward packets with outer src ip
11.11.0.2, dst ip 11.11.0.1 and tunnel id 11. The packets will be
forwarded to tap device 'vnet0' (after metadata is released):
$ tc filter add dev vxlan0 protocol ip parent ffff: \
flower \
enc_src_ip 11.11.0.2 \
enc_dst_ip 11.11.0.1 \
enc_key_id 11 \
dst_ip 11.11.11.1 \
action tunnel_key release \
action mirred egress redirect dev vnet0
The action tunnel_key, will be introduced in the next patch in this
series.
Signed-off-by: Amir Vadai <amir@vadai.me>
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add utility functions to convert a 32 bits key into a 64 bits tunnel and
vice versa.
These functions will be used instead of cloning code in GRE and VXLAN,
and in tc act_iptunnel which will be introduced in a following patch in
this patchset.
Signed-off-by: Amir Vadai <amir@vadai.me>
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Reviewed-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Acked-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This work adds BPF_CALL_<n>() macros and converts all the eBPF helper functions
to use them, in a similar fashion like we do with SYSCALL_DEFINE<n>() macros
that are used today. Motivation for this is to hide all the register handling
and all necessary casts from the user, so that it is done automatically in the
background when adding a BPF_CALL_<n>() call.
This makes current helpers easier to review, eases to write future helpers,
avoids getting the casting mess wrong, and allows for extending all helpers at
once (f.e. build time checks, etc). It also helps detecting more easily in
code reviews that unused registers are not instrumented in the code by accident,
breaking compatibility with existing programs.
BPF_CALL_<n>() internals are quite similar to SYSCALL_DEFINE<n>() ones with some
fundamental differences, for example, for generating the actual helper function
that carries all u64 regs, we need to fill unused regs, so that we always end up
with 5 u64 regs as an argument.
I reviewed several 0-5 generated BPF_CALL_<n>() variants of the .i results and
they look all as expected. No sparse issue spotted. We let this also sit for a
few days with Fengguang's kbuild test robot, and there were no issues seen. On
s390, it barked on the "uses dynamic stack allocation" notice, which is an old
one from bpf_perf_event_output{,_tp}() reappearing here due to the conversion
to the call wrapper, just telling that the perf raw record/frag sits on stack
(gcc with s390's -mwarn-dynamicstack), but that's all. Did various runtime tests
and they were fine as well. All eBPF helpers are now converted to use these
macros, getting rid of a good chunk of all the raw castings.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
When fetching ifindex, we don't need to test dev for being NULL since
we're always guaranteed to have a valid dev for clsact programs. Thus,
avoid this test in fast path.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add BPF_SIZEOF() and BPF_FIELD_SIZEOF() macros to improve the code a bit
which otherwise often result in overly long bytes_to_bpf_size(sizeof())
and bytes_to_bpf_size(FIELD_SIZEOF()) lines. So place them into a macro
helper instead. Moreover, we currently have a BUILD_BUG_ON(BPF_FIELD_SIZEOF())
check in convert_bpf_extensions(), but we should rather make that generic
as well and add a BUILD_BUG_ON() test in all BPF_SIZEOF()/BPF_FIELD_SIZEOF()
users to detect any rewriter size issues at compile time. Note, there are
currently none, but we want to assert that it stays this way.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Some minor misc cleanups, f.e. use sizeof(__u32) instead of hardcoding
and in __bpf_skb_max_len(), I missed that we always have skb->dev valid
anyway, so we can drop the unneeded test for dev; also few more other
misc bits addressed here.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
If skb has a valid l4 hash, there is no point clearing hash and force
a further flow dissection when a tunnel encapsulation is added.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
David Howells says:
====================
rxrpc: Rewrite data and ack handling
This patch set constitutes the main portion of the AF_RXRPC rewrite. It
consists of five fix/helper patches:
(1) Fix ASSERTCMP's and ASSERTIFCMP's handling of signed values.
(2) Update some protocol definitions slightly.
(3) Use of an hlist for RCU purposes.
(4) Removal of per-call sk_buff accounting (not really needed when skbs
aren't being queued on the main queue).
(5) Addition of a tracepoint to log incoming packets in the data_ready
callback and to log the end of the data_ready callback.
And then there are two patches that form the main part:
(6) Preallocation of resources for incoming calls so that in patch (7) the
data_ready handler can be made to fully instantiate an incoming call
and make it live. This extends through into AFS so that AFS can
preallocate its own incoming call resources.
The preallocation size is capped at the listen() backlog setting - and
that is capped at a sysctl limit which can be set between 4 and 32.
The preallocation is (re)charged either by accepting/rejecting pending
calls or, in the case of AFS, manually. If insufficient preallocation
resources exist, a BUSY packet will be transmitted.
The advantage of using this preallocation is that once a call is set
up in the data_ready handler, DATA packets can be queued on it
immediately rather than the DATA packets being queued for a background
work item to do all the allocation and then try and sort out the DATA
packets whilst other DATA packets may still be coming in and going
either to the background thread or the new call.
(7) Rewrite the handling of DATA, ACK and ABORT packets.
In the receive phase, DATA packets are now held in per-call circular
buffers with deduplication, out of sequence detection and suchlike
being done in data_ready. Since there is only one producer and only
once consumer, no locks need be used on the receive queue.
Received ACK and ABORT packets are now parsed and discarded in
data_ready to recycle resources as fast as possible.
sk_buffs are no longer pulled, trimmed or cloned, but rather the
offset and size of the content is tracked. This particularly affects
jumbo DATA packets which need insertion into the receive buffer in
multiple places. Annotations are kept to track which bit is which.
Packets are no longer queued on the socket receive queue; rather,
calls are queued. Dummy packets to convey events therefore no longer
need to be invented and metadata packets can be discarded as soon as
parsed rather then being pushed onto the socket receive queue to
indicate terminal events.
The preallocation facility added in (6) is now used to set up incoming
calls with very little locking required and no calls to the allocator
in data_ready.
Decryption and verification is now handled in recvmsg() rather than in
a background thread. This allows for the future possibility of
decrypting directly into the user buffer.
With this patch, the code is a lot simpler and most of the mass of
call event and state wangling code in call_event.c is gone.
With this, the majority of the AF_RXRPC rewrite is complete. However,
there are still things to be done, including:
(*) Limit the number of active service calls to prevent an attacker from
filling up a server's memory.
(*) Limit the number of calls on the rebuff-with-BUSY queue.
(*) Transmit delayed/deferred ACKs from recvmsg() if possible, rather than
punting to the background thread. Ideally, the background thread
shouldn't run at all, but data_ready can't call kernel_sendmsg() and
we can't rely on recvmsg() attending to the call in a timely fashion.
(*) Prevent the call at the front of the socket queue from hogging
recvmsg()'s attention if there's a sufficiently continuous supply of
data.
(*) Distribute ICMP errors by connection rather than by call. Possibly
parse the ICMP packet to try and pin down the exact connection and
call.
(*) Encrypt/decrypt directly between user buffers and socket buffers where
possible.
(*) IPv6.
(*) Service ID upgrade. This is a facility whereby a special flag bit is
set in the DATA packet header when making a call that tells the server
that it is allowed to change the service ID to an upgraded one and
reply with an equivalent call from the upgraded service.
This is used, for example, to override certain AFS calls so that IPv6
addresses can be returned.
(*) Allow userspace to preallocate call user IDs for incoming calls.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Previously, without GSO, it was easy to identify it: if the chunk didn't
fit and there was no data chunk in the packet yet, we could fragment at
IP level. So if there was an auth chunk and we were bundling a big data
chunk, it would fragment regardless of the size of the auth chunk. This
also works for the context of PMTU reductions.
But with GSO, we cannot distinguish such PMTU events anymore, as the
packet is allowed to exceed PMTU.
So we need another check: to ensure that the chunk that we are adding,
actually fits the current PMTU. If it doesn't, trigger a flush and let
it be fragmented at IP level in the next round.
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
dtefacs.calling_ae and called_ae are both 20 element __u8 arrays and
cannot be null and hence are redundant checks. Remove these.
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This structure is defined but never used. Flagged with W=1
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Since commit 37a1d3611c12 ("ipv6: include NLM_F_REPLACE in route
replace notifications"), RTM_NEWROUTE notifications have their
NLM_F_REPLACE flag set if the new route replaced a preexisting one.
However, other flags aren't set.
This patch reports the missing NLM_F_CREATE and NLM_F_EXCL flag bits.
NLM_F_APPEND is not reported, because in ipv6 a NLM_F_CREATE request
is interpreted as an append request (contrary to ipv4, "prepend" is not
supported, so if NLM_F_EXCL is not set then NLM_F_APPEND is implicit).
As a result, the possible flag combination can now be reported
(iproute2's terminology into parentheses):
* NLM_F_CREATE | NLM_F_EXCL: route didn't exist, exclusive creation
("add").
* NLM_F_CREATE: route did already exist, new route added after
preexisting ones ("append").
* NLM_F_REPLACE: route did already exist, new route replaced the
first preexisting one ("change").
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
fib_table_insert() inconsistently fills the nlmsg_flags field in its
notification messages.
Since commit b8f558313506 ("[RTNETLINK]: Fix sending netlink message
when replace route."), the netlink message has its nlmsg_flags set to
NLM_F_REPLACE if the route replaced a preexisting one.
Then commit a2bb6d7d6f42 ("ipv4: include NLM_F_APPEND flag in append
route notifications") started setting nlmsg_flags to NLM_F_APPEND if
the route matched a preexisting one but was appended.
In other cases (exclusive creation or prepend), nlmsg_flags is 0.
This patch sets ->nlmsg_flags in all situations, preserving the
semantic of the NLM_F_* bits:
* NLM_F_CREATE: a new fib entry has been created for this route.
* NLM_F_EXCL: no other fib entry existed for this route.
* NLM_F_REPLACE: this route has overwritten a preexisting fib entry.
* NLM_F_APPEND: the new fib entry was added after other entries for
the same route.
As a result, the possible flag combination can now be reported
(iproute2's terminology into parentheses):
* NLM_F_CREATE | NLM_F_EXCL: route didn't exist, exclusive creation
("add").
* NLM_F_CREATE | NLM_F_APPEND: route did already exist, new route
added after preexisting ones ("append").
* NLM_F_CREATE: route did already exist, new route added before
preexisting ones ("prepend").
* NLM_F_REPLACE: route did already exist, new route replaced the
first preexisting one ("change").
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
In commit f02db315b8d8 ("ipv4: IP_TOS and IP_TTL can be specified as
ancillary data") Francesco added IP_TOS values specified as integer.
However, kernel sends to userspace (at recvmsg() time) an IP_TOS value
in a single byte, when IP_RECVTOS is set on the socket.
It can be very useful to reflect all ancillary options as given by the
kernel in a subsequent sendmsg(), instead of aborting the sendmsg() with
EINVAL after Francesco patch.
So this patch extends IP_TOS ancillary to accept an u8, so that an UDP
server can simply reuse same ancillary block without having to mangle
it.
Jesper can then augment
https://github.com/netoptimizer/network-testing/blob/master/src/udp_example02.c
to add TOS reflection ;)
Fixes: f02db315b8d8 ("ipv4: IP_TOS and IP_TTL can be specified as ancillary data")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Francesco Fusco <ffusco@redhat.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Over the years, TCP BDP has increased by several orders of magnitude,
and some people are considering to reach the 2 Gbytes limit.
Even with current window scale limit of 14, ~1 Gbytes maps to ~740,000
MSS.
In presence of packet losses (or reorders), TCP stores incoming packets
into an out of order queue, and number of skbs sitting there waiting for
the missing packets to be received can be in the 10^5 range.
Most packets are appended to the tail of this queue, and when
packets can finally be transferred to receive queue, we scan the queue
from its head.
However, in presence of heavy losses, we might have to find an arbitrary
point in this queue, involving a linear scan for every incoming packet,
throwing away cpu caches.
This patch converts it to a RB tree, to get bounded latencies.
Yaogong wrote a preliminary patch about 2 years ago.
Eric did the rebase, added ofo_last_skb cache, polishing and tests.
Tested with network dropping between 1 and 10 % packets, with good
success (about 30 % increase of throughput in stress tests)
Next step would be to also use an RB tree for the write queue at sender
side ;)
Signed-off-by: Yaogong Wang <wygivan@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Acked-By: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Commit 76174004a0f19785a328f40388e87e982bbf69b9
(tcp: do not slow start when cwnd equals ssthresh )
introduced regression in TCP YeAH. Using 100ms delay 1% loss virtual
ethernet link kernel 4.2 shows bandwidth ~500KB/s for single TCP
connection and kernel 4.3 and above (including 4.8-rc4) shows bandwidth
~100KB/s.
That is caused by stalled cwnd when cwnd equals ssthresh. This patch
fixes it by proper increasing cwnd in this case.
Signed-off-by: Artem Germanov <agermanov@anchorfree.com>
Acked-by: Dmitry Adamushko <d.adamushko@anchorfree.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add support for 802.1ad including the ability to push and pop double
tagged vlans. Add support for 802.1ad to netlink parsing and flow
conversion. Uses double nested encap attributes to represent double
tagged vlan. Inner TPID encoded along with ctci in nested attributes.
This is based on Thomas F Herbert's original v20 patch. I made some
small clean ups and bug fixes.
Signed-off-by: Thomas F Herbert <thomasfherbert@gmail.com>
Signed-off-by: Eric Garver <e@erig.me>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|