Age | Commit message (Collapse) | Author |
|
Add support for operation code 3 (OC3) of the
Perform-Network-Subchannel-Operations (PNSO) function
of the Channel-Subsystem-Call (CHSC) instruction.
PNSO provides 2 operation codes:
OC0 - BRIDGE_INFO
OC3 - ADDR_INFO (new)
Extend the function calls to *pnso* to pass the OC and
add new response code 0108.
Support for OC3 is indicated by a flag in the css_general_characteristics.
Signed-off-by: Alexandra Winter <wintera@linux.ibm.com>
Reviewed-by: Julian Wiedmann <jwi@linux.ibm.com>
Reviewed-by: Peter Oberparleiter <oberpar@linux.ibm.com>
Reviewed-by: Vineeth Vijayan <vneethv@linux.ibm.com>
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add CQE compression support for completions of packets that span
multiple strides in a Striding RQ, per the HW capability.
In our memory model, we use small strides (256B as of today) for the
non-linear SKB mode. This feature allows CQE compression to work also
for multiple strides packets. In this case decompressing the mini CQE
array will use stride index provided by HW as part of the mini CQE.
Before this feature, compression was possible only for single-strided
packets, i.e. for packets of size up to 256 bytes when in non-linear
mode, and the index was maintained by SW.
This feature is supported for ConnectX-5 and above.
Feature performance test:
This was whitebox-tested, we reduced the PCI speed from 125Gb/s to
62.5Gb/s to overload pci and manipulated mlx5 driver to drop incoming
packets before building the SKB to achieve low cpu utilization.
Outcome is low cpu utilization and bottleneck on pci only.
Test setup:
Server: Intel(R) Xeon(R) Silver 4108 CPU @ 1.80GHz server, 32 cores
NIC: ConnectX-6 DX.
Sender side generates 300 byte packets at full pci bandwidth.
Receiver side configuration:
Single channel, one cpu processing with one ring allocated. Cpu utilization
is ~20% while pci bandwidth is fully utilized.
For the generated traffic and interface MTU of 4500B (to activate the
non-linear SKB mode), packet rate improvement is about 19% from ~17.6Mpps
to ~21Mpps.
Without this feature, counters show no CQE compression blocks for
this setup, while with the feature, counters show ~20.7Mpps compressed CQEs
in ~500K compression blocks.
Signed-off-by: Ofer Levi <oferle@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
Add support for rewriting of IPV6 DSCP part of traffic class field.
Next commands, for example, can be used to offload rewrite action:
OVS:
$ ovs-ofctl add-flow ovs-sriov "tcpv6, in_port=REP, \
actions=mod_nw_tos:68, output:NIC"
iproute2:
$ tc filter add dev REP ingress protocol ipv6 prio 1 flower skip_sw \
ip_proto tcp \
action pedit ex munge ip6 traffic_class set 68 retain 0xfc pipe \
action mirred egress redirect dev NIC
Signed-off-by: Maor Dickman <maord@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
Support tc trap such that packets can explicitly be forwarded to slow
path if they match a specific rule.
In the example below, we want packets with src IP equals 7.7.7.8 to be
forwarded to software, in which case it will get to the appropriate
representor net device.
$ tc filter add dev eth1 protocol ip prio 1 root flower skip_sw \
src_ip 7.7.7.8 action trap
Signed-off-by: Eli Cohen <eli@mellanox.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Ariel Levkovich <lariel@nvidia.com>
Reviewed-by: Maor Dickman <maord@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
Multiple features use metadata matching such as bond vport
in live migration, multi-port RoCE mode, stacked devices;
hence, enable vport metadata matching by default.
Fixes: 1e62e222db2e ("net/mlx5: E-Switch, Use vport metadata matching only when mandatory")
Signed-off-by: Vu Pham <vuhuong@mellanox.com>
Reviewed-by: Bodong Wang <bodong@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Oz Shlomo <ozsh@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
In merged eswitch configuration, peer miss rule is setup for all
vports. If metadata is enabled, peer miss rule with metadata matching
will be configured instead of source port matching; however, some
vports that have not yet been enabled don't have default_metadata
setup and their default_metadata will be zero.
Hence, setup/cleanup default metadata for all vports when eswitch moves
in/out of offloads mode.
Fixes: 133dcfc577ea ("net/mlx5: E-Switch, Alloc and free unique metadata for match")
Signed-off-by: Vu Pham <vuhuong@mellanox.com>
Reviewed-by: Bodong Wang <bodong@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
Uplink vport must have a dedicated metadata with vhca_id
being part of the metadata.
Fixes: 133dcfc577ea ("net/mlx5: E-Switch, Alloc and free unique metadata for match")
Signed-off-by: Vu Pham <vuhuong@mellanox.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
Check E-Switch capabilities and enable metadata support flag
before using it to setup other features that need metadata.
Signed-off-by: Vu Pham <vuhuong@mellanox.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
LAG offload can't be enabled if the enslaved PF is not lag master,
which is indicated by HCA capabilities bit. It is cleared if more than
64 VFs are configured for this PF.
Previously, a data structure is created to store lag info, including
PFs to be enslaved, then a handler is registered for netdev notifier.
However, this initialization is skipped if PF is not lag master. So
PF can't handle CHANGEUPPER event from upper bond device. Even worse,
PF is enslaved silently, and LAG offload is not activated.
Fix this by registering netdev notifier for PFs which are not lag
masters. When CHANGEUPPER event is received, and both physical ports
(and only them) on the same NIC are about to be enslaved, a warning is
returned for user to know it.
Signed-off-by: Jianbo Liu <jianbol@mellanox.com>
Reviewed-by: Raed Salem <raeds@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Raed Salem <raeds@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
If bond tx type is not active-backup or hash, LAG offload can't be
enabled. When CHANGEUPPER event is received, and both PFs (and only
them) under the same lag master are about to be enslaved, a warning
is returned for user to know the offload failure, otherwise PFs are
enslaved silently without LAG offload activated.
Signed-off-by: Jianbo Liu <jianbol@mellanox.com>
Reviewed-by: Raed Salem <raeds@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Raed Salem <raeds@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
Change the return value to -ENOENT, to make it more meaningful.
Signed-off-by: Jianbo Liu <jianbol@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
Before calling timecounter_cyc2time(), clock->lock must be taken.
Use mlx5_timecounter_cyc2time instead which guarantees a safe access.
Fixes: afc98a0b46d8 ("net/mlx5: Update ptp_clock_event foreach PPS event")
Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
|
|
Holding the clock lock is not required when scheduling a PPS work.
Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Reviewed-by: Moshe Shemesh <moshe@mellanox.com>
|
|
Fix a typo in ptp_clock_info naming: mlx5_p2p -> mlx5_ptp.
Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
|
|
Clock struct is part of struct mlx5_core_dev. Code was inconsistent, on
some cases used container_of and on another used clock->mdev.
Align code to use container_of amd remove clock->mdev pointer.
While here, fix reverse xmas tree coding style.
Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Reviewed-by: Moshe Shemesh <moshe@mellanox.com>
|
|
This isn't a fall through because it was after a return statement. The
fall through annotation leads to a Smatch warning:
drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c:246
mlx5e_ethtool_get_sset_count() warn: ignoring unreachable code.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org>
|
|
Add variable initialization to eliminate the warning
"variable may be used uninitialized".
Fixes: 5f29458b77d5 ("net/mlx5e: Support dump callback in TX reporter")
Signed-off-by: Moshe Tal <moshet@mellanox.com>
Reviewed-by: Aya Levin <ayal@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
For EPOLLET, applications must call sendmsg until they get EAGAIN.
Otherwise, there is no guarantee that EPOLLOUT is sent if there was
a failure upon memory allocation.
As a result on high-speed NICs, userspace observes multiple small
sendmsgs after a partial sendmsg until EAGAIN, since TCP can send
1-2 TSOs in between two sendmsg syscalls:
// One large partial send due to memory allocation failure.
sendmsg(20MB) = 2MB
// Many small sends until EAGAIN.
sendmsg(18MB) = 64KB
sendmsg(17.9MB) = 128KB
sendmsg(17.8MB) = 64KB
...
sendmsg(...) = EAGAIN
// At this point, userspace can assume an EPOLLOUT.
To fix this, set the SOCK_NOSPACE on all partial sendmsg scenarios
to guarantee that we send EPOLLOUT after partial sendmsg.
After this commit userspace can assume that it will receive an EPOLLOUT
after the first partial sendmsg. This EPOLLOUT will benefit from
sk_stream_write_space() logic delaying the EPOLLOUT until significant
space is available in write queue.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
If there was any event available on the TCP socket, tcp_poll()
will be called to retrieve all the events. In tcp_poll(), we call
sk_stream_is_writeable() which returns true as long as we are at least
one byte below notsent_lowat. This will result in quite a few
spurious EPLLOUT and frequent tiny sendmsg() calls as a result.
Similar to sk_stream_write_space(), use __sk_stream_is_writeable
with a wake value of 1, so that we set EPOLLOUT only if half the
space is available for write.
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Clean and rebuild the debugfs info for the queues being swapped.
Fixes: a34e25ab977c ("ionic: change the descriptor ring length without full reset")
Signed-off-by: Shannon Nelson <snelson@pensando.io>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
A DSA master interface has upper network devices, each representing an
Ethernet switch port attached to it. Demultiplexing the source ports and
setting skb->dev accordingly is done through the catch-all ETH_P_XDSA
packet_type handler. Catch-all because DSA vendors have various header
implementations, which can be placed anywhere in the frame: before the
DMAC, before the EtherType, before the FCS, etc. So, the ETH_P_XDSA
handler acts like an rx_handler more than anything.
It is unlikely for the DSA master interface to have any other upper than
the DSA switch interfaces themselves. Only maybe a bridge upper*, but it
is very likely that the DSA master will have no 8021q upper. So
__netif_receive_skb_core() will try to untag the VLAN, despite the fact
that the DSA switch interface might have an 8021q upper. So the skb will
never reach that.
So far, this hasn't been a problem because most of the possible
placements of the DSA switch header mentioned in the first paragraph
will displace the VLAN header when the DSA master receives the frame, so
__netif_receive_skb_core() will not actually execute any VLAN-specific
code for it. This only becomes a problem when the DSA switch header does
not displace the VLAN header (for example with a tail tag).
What the patch does is it bypasses the untagging of the skb when there
is a DSA switch attached to this net device. So, DSA is the only
packet_type handler which requires seeing the VLAN header. Once skb->dev
will be changed, __netif_receive_skb_core() will be invoked again and
untagging, or delivery to an 8021q upper, will happen in the RX of the
DSA switch interface itself.
*see commit 9eb8eff0cf2f ("net: bridge: allow enslaving some DSA master
network devices". This is actually the reason why I prefer keeping DSA
as a packet_type handler of ETH_P_XDSA rather than converting to an
rx_handler. Currently the rx_handler code doesn't support chaining, and
this is a problem because a DSA master might be bridged.
Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Landen Chao says:
====================
net-next: dsa: mt7530: add support for MT7531
This patch series adds support for MT7531.
MT7531 is the next generation of MT7530 which could be found on Mediatek
router platforms such as MT7622 or MT7629.
It is also a 7-ports switch with 5 giga embedded phys, 2 cpu ports, and
the same MAC logic of MT7530. Cpu port 6 only supports SGMII interface.
Cpu port 5 supports either RGMII or SGMII in different HW SKU, but cannot
be muxed to PHY of port 0/4 like mt7530. Due to support for SGMII
interface, pll, and pad setting are different from MT7530.
MT7531 SGMII interface can be configured in following mode:
- 'SGMII AN mode' with in-band negotiation capability
which is compatible with PHY_INTERFACE_MODE_SGMII.
- 'SGMII force mode' without in-band negotiation
which is compatible with 10B/8B encoding of
PHY_INTERFACE_MODE_1000BASEX with fixed full-duplex and fixed pause.
- 2.5 times faster clocked 'SGMII force mode' without in-band negotiation
which is compatible with 10B/8B encoding of
PHY_INTERFACE_MODE_2500BASEX with fixed full-duplex and fixed pause.
v4 -> v5
- Add fixed-link node to dsa cpu port in dts file by suggestion of
Vladimir Oltean.
v3 -> v4
- Adjust the coding style by suggestion of Jakub Kicinski.
Remove unnecessary jumping label, merge continuous numeric 'switch
cases' into one line, and keep the variables longest to shortest
(reverse xmas tree).
v2 -> v3
- Keep the same setup logic of mt7530/mt7621 because these series of
patches is for adding mt7531 hardware.
- Do not adjust rgmii delay when vendor phy driver presents in order to
prevent double adjustment by suggestion of Andrew Lunn.
- Remove redundant 'Example 4' from dt-bindings by suggestion of
Rob Herring.
- Fix typo.
v1 -> v2
- change phylink_validate callback function to support full-duplex
gigabit only to match hardware capability.
- add description of SGMII interface.
- configure mt7531 cpu port in fastest speed by default.
- parse SGMII control word for in-band negotiation mode.
- configure RGMII delay based on phy.rst.
- Rename the definition in the header file to avoid potential conflicts.
- Add wrapper function for mdio read/write to support both C22 and C45.
- correct fixed-link speed of 2500base-x in dts.
- add MT7531 port mirror setting.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add mt7531 dsa to bananapi-bpi-r64 board for 5 giga Ethernet ports support.
Signed-off-by: Landen Chao <landen.chao@mediatek.com>
Tested-By: Frank Wunderlich <frank-w@public-files.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add mt7531 dsa to mt7622-rfb1 board for 5 giga Ethernet ports support.
mt7622 only supports 1 sgmii interface, so either gmac0 or gmac1 can be
configured as sgmii interface. In this patch, change to connect mt7622
gmac0 and mt7531 port6 through sgmii interface.
Signed-off-by: Landen Chao <landen.chao@mediatek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add new support for MT7531:
MT7531 is the next generation of MT7530. It is also a 7-ports switch with
5 giga embedded phys, 2 cpu ports, and the same MAC logic of MT7530. Cpu
port 6 only supports SGMII interface. Cpu port 5 supports either RGMII
or SGMII in different HW sku, but cannot be muxed to PHY of port 0/4 like
mt7530. Due to SGMII interface support, pll, and pad setting are different
from MT7530. This patch adds different initial setting, and SGMII phylink
handlers of MT7531.
MT7531 SGMII interface can be configured in following mode:
- 'SGMII AN mode' with in-band negotiation capability
which is compatible with PHY_INTERFACE_MODE_SGMII.
- 'SGMII force mode' without in-band negotiation
which is compatible with 10B/8B encoding of
PHY_INTERFACE_MODE_1000BASEX with fixed full-duplex and fixed pause.
- 2.5 times faster clocked 'SGMII force mode' without in-band negotiation
which is compatible with 10B/8B encoding of
PHY_INTERFACE_MODE_2500BASEX with fixed full-duplex and fixed pause.
Signed-off-by: Landen Chao <landen.chao@mediatek.com>
Signed-off-by: Sean Wang <sean.wang@mediatek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add devicetree binding to support the compatible mt7531 switch as used
in the MediaTek MT7531 switch.
Signed-off-by: Sean Wang <sean.wang@mediatek.com>
Signed-off-by: Landen Chao <landen.chao@mediatek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add a structure holding required operations for each device such as device
initialization, PHY port read or write, a checker whether PHY interface is
supported on a certain port, MAC port setup for either bus pad or a
specific PHY interface.
The patch is done for ready adding a new hardware MT7531, and keep the
same setup logic of existing hardware.
Signed-off-by: Landen Chao <landen.chao@mediatek.com>
Signed-off-by: Sean Wang <sean.wang@mediatek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Refine message in Kconfig with fixing typo and an explicit MT7621 support.
Signed-off-by: Landen Chao <landen.chao@mediatek.com>
Signed-off-by: Sean Wang <sean.wang@mediatek.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
x25_type_trans only needs to be called before we call netif_rx to pass
the skb to upper layers.
It does not need to be called before lapb_data_received. The LAPB module
does not need the fields that are set by calling it.
In the other two X.25 drivers - lapbether and hdlc_x25. x25_type_trans
is only called before netif_rx and not before lapb_data_received.
Cc: Martin Schiller <ms@dev.tdt.de>
Signed-off-by: Xie He <xie.he.0141@gmail.com>
Acked-by: Martin Schiller <ms@dev.tdt.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
flush_all_backlogs() may cause deadlock on systems
running processes with FIFO scheduling policy.
The above is critical in -RT scenarios, where user-space
specifically ensure no network activity is scheduled on
the CPU running the mentioned FIFO process, but still get
stuck.
This commit tries to address the problem checking the
backlog status on the remote CPUs before scheduling the
flush operation. If the backlog is empty, we can skip it.
v1 -> v2:
- explicitly clear flushed cpu mask - Eric
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Ido Schimmel says:
====================
mlxsw: Derive SBIB from maximum port speed & MTU
Petr says:
Internal buffer is a part of port headroom used for packets that are
mirrored due to triggers that the Spectrum ASIC considers "egress". Besides
ACL mirroring on port egresss this includes also packets mirrored due to
ECN marking.
This patchset changes the way the internal mirroring buffer is reserved.
Currently the buffer reflects port MTU and speed accurately. In the future,
mlxsw should support dcbnl_setbuffer hook to allow the users to set buffer
sizes by hand. In that case, there might not be enough space for growth of
the internal mirroring buffer due to MTU and speed changes. While vetoing
MTU changes would be merely confusing, port speed changes cannot be vetoed,
and such change would simply lead to issues in packet mirroring.
For these reasons, with these patches the internal mirroring buffer is
derived from maximum MTU and maximum speed achievable on the port.
Patches #1 and #2 introduce a new callback to determine the maximum speed a
given port can achieve.
With patches #3 and #4, the information about, respectively, maximum MTU
and maximum port speed, is kept in struct mlxsw_sp_port.
In patch #5, maximum MTU and maximum speed are used to determine the size
of the internal buffer. MTU update and speed update hooks are dropped,
because they are no longer necessary.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The SBIB register configures the size of an internal buffer that the
Spectrum ASICs use when mirroring traffic on egress. This size should be
taken into account when validating that the port headroom buffers are not
larger than the chip can handle. Up until now this was not done, which is
incidentally not a problem, because the priority group buffers that mlxsw
auto-configures are small enough that the boundary condition could not be
violated.
However when dcbnl_setbuffer is implemented, the user has control over
sizes of PG buffers, and they might overshoot the headroom capacity.
However the size of the SBIB buffer depends on port speed, and that cannot
be vetoed. Therefore SBIB size should be deduced from maximum port speed.
Additionally, once the buffers are configured by hand, the user could get
into an uncomfortable situation where their MTU change requests get vetoed,
because the SBIB does not fit anymore. Therefore derive SBIB size from
maximum permissible MTU as well.
Remove all the code that adjusted the SBIB size whenever speed or MTU
changed.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The maximum port speed depends on link modes supported by the port, and for
Ethernet ports is constant. The maximum speed will be handy when setting
SBIB, the internal buffer used for traffic mirroring. Therefore, keep it in
struct mlxsw_sp_port for easy access.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The maximum port MTU depends on port type. On Spectrum, mlxsw configures
all ports as Ethernet ports, and the maximum MTU therefore never changes.
Besides checking MTU configuration, maximum MTU will also be handy when
setting SBIB, the internal buffer used for traffic mirroring. Therefore,
keep it in struct mlxsw_sp_port for easy access.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The SBIB register configures the size of an internal buffer that the
Spectrum ASICs use when mirroring traffic on egress. This size should be
taken into account when validating that the port headroom buffers are not
larger than the chip can handle. Up until now this was not done, which is
incidentally not a problem, because the priority group buffers that mlxsw
auto-configures are small enough that the boundary condition could not be
violated.
When dcbnl_setbuffer is implemented, the user gets control over sizes of PG
buffers, and they might overshoot the headroom capacity. However the size
of the SBIB buffer depends on port speed, which cannot be vetoed. There is
obviously no way to retroactively push back on requests for overlarge PG
buffers, or reject an overlarge MTU, or cancel losslessness of a certain
PG.
Therefore, instead of taking into account the current speed when
calculating SBIB buffer size, take into account the maximum speed that a
port with given Ethernet protocol capabilities can have.
To that end, add a new ethtool callback, ptys_max_speed, which determines
this maximum speed.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
In order to allow reusing the logic, extract from
mlxsw_sp_port_get_link_ksettings() the code to obtain Ethernet protocol
attributes, mlxsw_sp_port_ptys_query().
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue
Tony Nguyen says:
====================
40GbE Intel Wired LAN Driver Updates 2020-09-14
This series contains updates to i40e driver only.
Li RongQing removes binding affinity mask to a fixed CPU and sets
prefetch of Rx buffer page to occur conditionally.
Björn provides AF_XDP performance improvements by not prefetching HW
descriptors, using 16 byte descriptors, and moving buffer allocation
out of Rx processing loop.
v2: Define prefetch_page_address in a common header for patch 2.
Dropped, previous, patch 5 as it is being reworked to be more
generalized.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
David Howells says:
====================
rxrpc: Fixes for the connection manager rewrite
Here are some fixes for the connection manager rewrite:
(1) Fix a goto to the wrong place in error handling.
(2) Fix a missing NULL pointer check.
(3) The stored allocation error needs to be stored signed.
(4) Fix a leak of connection bundle when clearing connections due to
net namespace exit.
(5) Fix an overget of the bundle when setting up a new client conn.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add NETIF_F_GSO_UDP_TUNNEL and NETIF_F_GSO_UDP_TUNNEL_CSUM features
to support vxlan segmentation and checksum offload. Ipip and ipv6
tunnel packets are regarded as non-tunnel pkt for hw and as for other
type of tunnel pkts, checksum offload is disabled.
Signed-off-by: Luo bin <luobin9@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Fixes the following W=1 kernel build warning(s):
drivers/net/ethernet/qlogic/qlcnic/qlcnic_83xx_hw.c:661:6: warning:
variable 'val' set but not used [-Wunused-but-set-variable]
661 | u32 val;
| ^~~
After commit 7f9664525f9c ("qlcnic: 83xx memory map and HW access
routines"), variable 'val' is never used in qlcnic_83xx_cam_unlock(), so
removing it to avoid build warning.
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Zhang Changzhong <zhangchangzhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Fixes the following W=1 kernel build warning(s):
drivers/net/ethernet/marvell/pxa168_eth.c:1190:6: warning:
variable 'retval' set but not used [-Wunused-but-set-variable]
1190 | int retval;
| ^~~~~~
Function pxa168_eth_change_mtu() always return zero, so variable 'retval'
is redundant, just remove it.
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Zhang Changzhong <zhangchangzhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Fixes the following W=1 kernel build warning(s):
drivers/net/ethernet/freescale/fec_ptp.c:523:6: warning:
variable 'ns' set but not used [-Wunused-but-set-variable]
523 | u64 ns;
| ^~
After commit 6605b730c061 ("FEC: Add time stamping code and a PTP
hardware clock"), variable 'ns' is never used in fec_time_keep(),
so removing it to avoid build warning.
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Zhang Changzhong <zhangchangzhong@huawei.com>
Acked-by: Fugang Duan <fugang.duan@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Fixes the following W=1 kernel build warning(s):
drivers/net/ethernet/dnet.c:510:6: warning:
variable 'tx_status' set but not used [-Wunused-but-set-variable]
u32 tx_status, irq_enable;
^~~~~~~~~
After commit 4796417417a6 ("dnet: Dave DNET ethernet controller driver
(updated)"), variable 'tx_status' is never used in dnet_start_xmit(),
so removing it to avoid build warning.
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Zhang Changzhong <zhangchangzhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
SOCK_QUEUE_SHRUNK is currently used by TCP as a temporary state
that remembers if some room has been made in the rtx queue
by an incoming ACK packet.
This is later used from tcp_check_space() before
considering to send EPOLLOUT.
Problem is: If we receive SACK packets, and no packet
is removed from RTX queue, we can send fresh packets, thus
moving them from write queue to rtx queue and eventually
empty the write queue.
This stall can happen if TCP_NOTSENT_LOWAT is used.
With this fix, we no longer risk stalling sends while holes
are repaired, and we can fully use socket sndbuf.
This also removes a cache line dirtying for typical RPC
workloads.
Fixes: c9bee3b7fdec ("tcp: TCP_NOTSENT_LOWAT socket option")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This comment is outdated and no longer reflects the actual implementation
of af_packet.c.
Reasons for the new comment:
1.
In af_packet.c, the function packet_snd first reserves a headroom of
length (dev->hard_header_len + dev->needed_headroom).
Then if the socket is a SOCK_DGRAM socket, it calls dev_hard_header,
which calls dev->header_ops->create, to create the link layer header.
If the socket is a SOCK_RAW socket, it "un-reserves" a headroom of
length (dev->hard_header_len), and checks if the user has provided a
header sized between (dev->min_header_len) and (dev->hard_header_len)
(in dev_validate_header).
This shows the developers of af_packet.c expect hard_header_len to
be consistent with header_ops.
2.
In af_packet.c, the function packet_sendmsg_spkt has a FIXME comment.
That comment states that prepending an LL header internally in a driver
is considered a bug. I believe this bug can be fixed by setting
hard_header_len to 0, making the internal header completely invisible
to af_packet.c (and requesting the headroom in needed_headroom instead).
3.
There is a commit for a WiFi driver:
commit 9454f7a895b8 ("mwifiex: set needed_headroom, not hard_header_len")
According to the discussion about it at:
https://patchwork.kernel.org/patch/11407493/
The author tried to set the WiFi driver's hard_header_len to the Ethernet
header length, and request additional header space internally needed by
setting needed_headroom.
This means this usage is already adopted by driver developers.
Cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Brian Norris <briannorris@chromium.org>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Xie He <xie.he.0141@gmail.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Paolo Abeni says:
====================
mptcp: introduce support for real multipath xmit
This series enable MPTCP socket to transmit data on multiple subflows
concurrently in a load balancing scenario.
First the receive code path is refactored to better deal with out-of-order
data (patches 1-7). An RB-tree is introduced to queue MPTCP-level out-of-order
data, closely resembling the TCP level OoO handling.
When data is sent on multiple subflows, the peer can easily see OoO - "future"
data at the MPTCP level, especially if speeds, delay, or jitter are not
symmetric.
The other major change regards the netlink PM, which is extended to allow
creating non backup subflows in patches 9-11.
There are a few smaller additions, like the introduction of OoO related mibs,
send buffer autotuning and better ack handling.
Finally a bunch of new self-tests is introduced. The new feature is tested
ensuring that the B/W used by an MPTCP socket using multiple subflows matches
the link aggregated B/W - we use low B/W virtual links, to ensure the tests
are not CPU bounded.
v1 -> v2:
- fix 32 bit build breakage
- fix a bunch of checkpatch issues
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add a bunch of test-cases for multiple subflow xmit:
create multiple subflows simulating different links
condition via netem and verify that the msk is able
to use completely the aggregated bandwidth.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
That is needed to let the subflows announce promptly when new
space is available in the receive buffer.
tcp_cleanup_rbuf() is currently a static function, drop the
scope modifier and add a declaration in the TCP header.
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Update the scheduler to less trivial heuristic: cache
the last used subflow, and try to send on it a reasonably
long burst of data.
When the burst or the subflow send space is exhausted, pick
the subflow with the lower ratio between write space and
send buffer - that is, the subflow with the greater relative
amount of free space.
v1 -> v2:
- fix 32 bit build breakage due to 64bits div
- fix checkpath issues (uint64_t -> u64)
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Currently the 'backup' attribute of local endpoint
is ignored. Let's use it for the MP_JOIN handshake
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|