mptcp_add_pending_subflow() performs a sock_hold() on the subflow,
then adds the subflow to the join list.
Without a sock_put the subflow sk won't be freed in case connect() fails.
unreferenced object 0xffff88810c03b100 (size 3000):
[..]
sk_prot_alloc.isra.0+0x2f/0x110
sk_alloc+0x5d/0xc20
inet6_create+0x2b7/0xd30
__sock_create+0x17f/0x410
mptcp_subflow_create_socket+0xff/0x9c0
__mptcp_subflow_connect+0x1da/0xaf0
mptcp_pm_nl_work+0x6e0/0x1120
mptcp_worker+0x508/0x9a0
Fixes: 5b950ff4331ddda ("mptcp: link MPC subflow into msk only after accept")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
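The leak pattern in this commit can be illustrated with a small user-space sketch. The names `struct obj`, `obj_hold()`, `obj_put()` and `connect_pending()` are hypothetical stand-ins for the subflow socket, sock_hold(), sock_put() and the connect path; this is not the kernel code, only a model of the reference counting involved.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a reference-counted socket. */
struct obj {
	int refcnt;
	bool freed;
};

static void obj_hold(struct obj *o)
{
	assert(!o->freed);
	o->refcnt++;
}

static void obj_put(struct obj *o)
{
	assert(o->refcnt > 0);
	if (--o->refcnt == 0)
		o->freed = true;	/* a real implementation would free here */
}

/*
 * Takes an extra reference before "connecting", as the pending-subflow
 * helper does. On failure the extra reference must be dropped again,
 * otherwise the object can never reach a refcount of zero.
 */
static int connect_pending(struct obj *o, bool connect_fails, bool fixed)
{
	obj_hold(o);
	if (connect_fails) {
		if (fixed)
			obj_put(o);	/* the put that was missing */
		return -1;
	}
	return 0;
}
```

With `fixed` set to false, the caller's final obj_put() leaves the count at 1 and the object is never released, which corresponds to the kmemleak report quoted above.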
|
|
Send logic caches last active subflow in the msk, so it needs to be
cleared when the cached subflow is closed.
Fixes: d5f49190def61c ("mptcp: allow picking different xmit subflows")
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/155
Reported-by: Christoph Paasch <cpaasch@apple.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
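A minimal sketch of the idea, assuming a connection object that caches a pointer to the last subflow used for sending. `struct conn`, `struct subflow` and `conn_close_subflow()` are hypothetical names chosen for illustration and do not reflect the actual mptcp structures.

```c
#include <assert.h>
#include <stddef.h>

struct subflow {
	int id;
};

/* Hypothetical connection state that caches the last subflow used to send. */
struct conn {
	struct subflow *last_snd;
};

/* Must run before the subflow is released. */
static void conn_close_subflow(struct conn *c, struct subflow *sf)
{
	assert(sf != NULL);
	if (c->last_snd == sf)
		c->last_snd = NULL;	/* drop the stale cache entry */
}
```

Clearing the cache only when it points at the subflow being closed keeps the fast path intact for every other subflow.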
|
|
This is a follow up of commit ea3274695353 ("net: sched: avoid
duplicates in qdisc dump") which has fixed the issue only for the qdisc
dump.
The duplicate printing also occurs when dumping the classes via
tc class show dev eth0
Fixes: 59cc1f61f09c ("net: sched: convert qdisc linked list to hashtable")
Signed-off-by: Maximilian Heyne <mheyne@amazon.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
There's no reason for preventing the creation and removal
of qmimux network interfaces when the underlying interface
is up.
This makes qmi_wwan mux implementation more similar to the
rmnet one, simplifying userspace management of the same
logical interfaces.
Fixes: c6adf77953bc ("net: usb: qmi_wwan: add qmap mux protocol support")
Reported-by: Aleksander Morgado <aleksander@aleksander.es>
Signed-off-by: Daniele Palmas <dnlplm@gmail.com>
Acked-by: Bjørn Mork <bjorn@mork.no>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
In the blamed patch I managed to introduce a bug while moving code
around: the same logic is applied to the ucast_egress_floods and
bcast_egress_floods variables both on the "if" and the "else" branches.
This is clearly an unintended change compared to how the code used to be
prior to that bugfix, so restore it.
Fixes: 7f7ccdea8c73 ("net: dsa: sja1105: fix leakage of flooded frames outside bridging domain")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
SPEED_10
When using MLO_AN_PHY or MLO_AN_FIXED, the MII_BMCR of the SGMII PCS is
read before resetting the switch so it can be reprogrammed afterwards.
This works for the speeds of 1Gbps and 100Mbps, but not for 10Mbps,
because the BMCR speed mask for 10Mbps (BMCR_SPEED10) is actually 0, so
AND-ing anything with it is always false and that last branch is dead code.
Do what others do (genphy_read_status_fixed, phy_mii_ioctl) and just
remove the check for SPEED_10, let it fall into the default case.
Fixes: ffe10e679cec ("net: dsa: sja1105: Add support for the SGMII port")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
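The dead branch can be reproduced in a few lines. This sketch assumes the check in question tests the `BMCR_SPEED10` mask, and it defines local copies of the BMCR speed masks that are assumed to match <linux/mii.h>; `speed_broken()` and `speed_fixed()` are hypothetical helper names, not driver functions.

```c
#include <assert.h>

/* Local copies for illustration; assumed to match <linux/mii.h>. */
#define BMCR_SPEED1000	0x0040
#define BMCR_SPEED100	0x2000
#define BMCR_SPEED10	0x0000

static_assert(BMCR_SPEED10 == 0, "10Mbps has no dedicated BMCR bit");

/* Broken decode: the 10Mbps test can never be true. */
static int speed_broken(unsigned int bmcr)
{
	if (bmcr & BMCR_SPEED1000)
		return 1000;
	else if (bmcr & BMCR_SPEED100)
		return 100;
	else if (bmcr & BMCR_SPEED10)	/* x & 0 == 0: dead branch */
		return 10;
	return -1;			/* "unknown" */
}

/* Fixed decode: 10Mbps is what remains when neither speed bit is set. */
static int speed_fixed(unsigned int bmcr)
{
	if (bmcr & BMCR_SPEED1000)
		return 1000;
	if (bmcr & BMCR_SPEED100)
		return 100;
	return 10;
}
```

Treating 10Mbps as the default case is the same approach the commit attributes to genphy_read_status_fixed and phy_mii_ioctl.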
|
|
An attempt is made to warn the user about the fact that VCAP IS1 cannot
offload keys matching on destination IP (at least given the current half
key format), but sadly that warning fails miserably in practice, due to
the fact that it operates on an uninitialized "match" variable. We must
first decode the keys from the flow rule.
Fixes: 75944fda1dfe ("net: mscc: ocelot: offload ingress skbedit and vlan actions to VCAP IS1")
Reported-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Ido Schimmel says:
====================
nexthop: Do not flush blackhole nexthops when loopback goes down
Patch #1 prevents blackhole nexthops from being flushed when the
loopback device goes down given that as far as user space is concerned,
these nexthops do not have a nexthop device.
Patch #2 adds a test case.
There are no regressions in fib_nexthops.sh with this change:
# ./fib_nexthops.sh
...
Tests passed: 165
Tests failed: 0
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Test that blackhole nexthops are not flushed when the loopback device
goes down.
Output without previous patch:
# ./fib_nexthops.sh -t basic
Basic functional tests
----------------------
TEST: List with nothing defined [ OK ]
TEST: Nexthop get on non-existent id [ OK ]
TEST: Nexthop with no device or gateway [ OK ]
TEST: Nexthop with down device [ OK ]
TEST: Nexthop with device that is linkdown [ OK ]
TEST: Nexthop with device only [ OK ]
TEST: Nexthop with duplicate id [ OK ]
TEST: Blackhole nexthop [ OK ]
TEST: Blackhole nexthop with other attributes [ OK ]
TEST: Blackhole nexthop with loopback device down [FAIL]
TEST: Create group [ OK ]
TEST: Create group with blackhole nexthop [FAIL]
TEST: Create multipath group where 1 path is a blackhole [ OK ]
TEST: Multipath group can not have a member replaced by blackhole [ OK ]
TEST: Create group with non-existent nexthop [ OK ]
TEST: Create group with same nexthop multiple times [ OK ]
TEST: Replace nexthop with nexthop group [ OK ]
TEST: Replace nexthop group with nexthop [ OK ]
TEST: Nexthop group and device [ OK ]
TEST: Test proto flush [ OK ]
TEST: Nexthop group and blackhole [ OK ]
Tests passed: 19
Tests failed: 2
Output with previous patch:
# ./fib_nexthops.sh -t basic
Basic functional tests
----------------------
TEST: List with nothing defined [ OK ]
TEST: Nexthop get on non-existent id [ OK ]
TEST: Nexthop with no device or gateway [ OK ]
TEST: Nexthop with down device [ OK ]
TEST: Nexthop with device that is linkdown [ OK ]
TEST: Nexthop with device only [ OK ]
TEST: Nexthop with duplicate id [ OK ]
TEST: Blackhole nexthop [ OK ]
TEST: Blackhole nexthop with other attributes [ OK ]
TEST: Blackhole nexthop with loopback device down [ OK ]
TEST: Create group [ OK ]
TEST: Create group with blackhole nexthop [ OK ]
TEST: Create multipath group where 1 path is a blackhole [ OK ]
TEST: Multipath group can not have a member replaced by blackhole [ OK ]
TEST: Create group with non-existent nexthop [ OK ]
TEST: Create group with same nexthop multiple times [ OK ]
TEST: Replace nexthop with nexthop group [ OK ]
TEST: Replace nexthop group with nexthop [ OK ]
TEST: Nexthop group and device [ OK ]
TEST: Test proto flush [ OK ]
TEST: Nexthop group and blackhole [ OK ]
Tests passed: 21
Tests failed: 0
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
As far as user space is concerned, blackhole nexthops do not have a
nexthop device and therefore should not be affected by the
administrative or carrier state of any netdev.
However, when the loopback netdev goes down all the blackhole nexthops
are flushed. This happens because internally the kernel associates
blackhole nexthops with the loopback netdev.
This behavior is both confusing to those not familiar with kernel
internals and also diverges from the legacy API where blackhole IPv4
routes are not flushed when the loopback netdev goes down:
# ip route add blackhole 198.51.100.0/24
# ip link set dev lo down
# ip route show 198.51.100.0/24
blackhole 198.51.100.0/24
Blackhole IPv6 routes are flushed, but at least user space knows that
they are associated with the loopback netdev:
# ip -6 route show 2001:db8:1::/64
blackhole 2001:db8:1::/64 dev lo metric 1024 pref medium
Fix this by only flushing blackhole nexthops when the loopback netdev is
unregistered.
Fixes: ab84be7e54fc ("net: Initial nexthop code")
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reported-by: Donald Sharp <sharpd@nvidia.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Fix typo of 'overflow' for comment in sctp_tsnmap_check().
Reported-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Drew Fustini <drew@beagleboard.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue
Tony Nguyen says:
====================
Intel Wired LAN Driver Updates 2021-03-03
This series contains updates to ixgbe and ixgbevf drivers.
Bartosz Golaszewski does not error on -ENODEV from ixgbe_mii_bus_init()
as this is valid for some devices with a shared bus for ixgbe.
Antony Antony adds a check to fail for non transport mode SA with
offload as this is not supported for ixgbe and ixgbevf.
Dinghao Liu fixes a memory leak on failure to program a perfect filter
for ixgbe.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
When ixgbe_fdir_write_perfect_filter_82599() fails,
input allocated by kzalloc() has not been freed,
which leads to memleak.
Signed-off-by: Dinghao Liu <dinghao.liu@zju.edu.cn>
Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
Tested-by: Tony Brelinski <tonyx.brelinski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
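The error-path pattern behind this fix can be sketched in user space. `add_filter()`, `program_filter()` and the `live_allocs` counter are hypothetical and exist only to make the leak observable; calloc()/free() stand in for kzalloc()/kfree().

```c
#include <assert.h>
#include <stdlib.h>

static int live_allocs;	/* counts outstanding allocations in this sketch */

struct filter {
	int id;
};

/* Hypothetical hardware programming step that can fail. */
static int program_filter(const struct filter *f)
{
	return f->id < 0 ? -1 : 0;
}

static int add_filter(int id, struct filter **out)
{
	struct filter *input;
	int err;

	assert(out != NULL);
	input = calloc(1, sizeof(*input));
	if (!input)
		return -1;
	live_allocs++;
	input->id = id;

	err = program_filter(input);
	if (err)
		goto err_free;	/* without this, input would leak */

	*out = input;
	return 0;

err_free:
	free(input);
	live_allocs--;
	return err;
}
```

On success, ownership of the allocation passes to the caller through `out`; on failure, nothing is left allocated.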
|
|
Based on talks and indirect references, ixgbe IPsec offload does not
support IPsec tunnel mode offload. It can only support IPsec transport
mode offload. Now explicitly fail when creating a non-transport-mode SA
with offload to avoid false performance expectations.
Fixes: 63a67fe229ea ("ixgbe: add ipsec offload add and remove SA")
Signed-off-by: Antony Antony <antony@phenome.org>
Acked-by: Shannon Nelson <snelson@pensando.io>
Tested-by: Tony Brelinski <tonyx.brelinski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
|
insn_has_def32() returns false for 32-bit BPF_FETCH insns. This makes
adjust_insn_aux_data() incorrectly set zext_dst, as can be seen in [1].
This happens because insn_no_def() does not know about the BPF_FETCH
variants of BPF_STX.
Fix in two steps.
First, replace insn_no_def() with insn_def_regno(), which returns the
register an insn defines. Normally insn_no_def() calls are followed by
insn->dst_reg uses; replace those with the insn_def_regno() return
value.
Second, adjust the BPF_STX special case in is_reg64() to deal with
queries made from opt_subreg_zext_lo32_rnd_hi32(), where the state
information is no longer available. Add a comment, since the purpose
of this special case is not clear at first glance.
[1] https://lore.kernel.org/bpf/20210223150845.1857620-1-jackmanb@google.com/
Fixes: 5ffa25502b5a ("bpf: Add instructions for atomic_[cmp]xchg")
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Brendan Jackman <jackmanb@google.com>
Link: https://lore.kernel.org/bpf/20210301154019.129110-1-iii@linux.ibm.com
|
|
xsk_lookup_bpf_maps, based on prog_fd, checks whether the current prog has
a reference to an XSKMAP. A BPF prog can include insns that work on various
BPF maps, and this is covered by iterating through map_ids.
The bpf_map_info that is passed to bpf_obj_get_info_by_fd for filling
needs to be cleared at each iteration, so that it doesn't contain any
outdated fields and that is currently missing in the function of
interest.
To fix that, zero-init map_info via memset before each
bpf_obj_get_info_by_fd call.
Also, since the area of this code is touched, in general strcmp is
considered harmful, so let's convert it to strncmp and provide the
size of the array name for current map_info.
While at it, do s/continue/break/ once we have found the xsks_map to
terminate the search.
Fixes: 5750902a6e9b ("libbpf: proper XSKMAP cleanup")
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Link: https://lore.kernel.org/bpf/20210303185636.18070-4-maciej.fijalkowski@intel.com
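The three changes described in this commit (zeroing the info struct on each iteration, a bounded name comparison, and stopping at the first match) can be sketched as follows. `struct map_info`, `get_info()` and `find_xsks_map()` are hypothetical simplifications; the real code uses struct bpf_map_info and bpf_obj_get_info_by_fd, whose signatures are not reproduced here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-in for struct bpf_map_info. */
struct map_info {
	char name[16];
	unsigned int id;
};

/* Hypothetical lookup: fills the info for a map id, returns 0 on success. */
static int get_info(unsigned int id, struct map_info *info)
{
	static const char *const names[] = { "counters", "xsks_map", "stats" };

	if (id >= sizeof(names) / sizeof(names[0]))
		return -1;
	info->id = id;
	strncpy(info->name, names[id], sizeof(info->name) - 1);
	return 0;
}

/* Returns the id of the map named "xsks_map", or -1 if there is none. */
static int find_xsks_map(const unsigned int *ids, size_t n)
{
	struct map_info info;
	int found = -1;
	size_t i;

	assert(ids != NULL || n == 0);
	for (i = 0; i < n; i++) {
		/* Clear stale fields left over from the previous iteration. */
		memset(&info, 0, sizeof(info));
		if (get_info(ids[i], &info))
			continue;
		if (!strncmp(info.name, "xsks_map", sizeof(info.name))) {
			found = (int)info.id;
			break;	/* stop searching once found */
		}
	}
	return found;
}
```

Because the struct is zeroed before each lookup, the name buffer is always NUL-terminated even though the copy is limited to one byte less than its size.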
|
|
We mmap the umem region, but we never munmap it.
Add the missing call at the end of the cleanup.
Fixes: 3945b37a975d ("samples/bpf: use hugepages in xdpsock app")
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Link: https://lore.kernel.org/bpf/20210303185636.18070-3-maciej.fijalkowski@intel.com
|
|
xdp_umem_query() has been dead for a long time; drop the declaration from
include/linux/netdevice.h.
Fixes: c9b47cc1fabc ("xsk: fix bug when trying to use both copy and zero-copy on one queue id")
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Link: https://lore.kernel.org/bpf/20210303185636.18070-2-maciej.fijalkowski@intel.com
|
|
The existing branch checks for 0 != table->nlpid which always evaluates
true for tables that have an owner.
Fixes: 6001a930ce03 ("netfilter: nftables: introduce table ownership")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
Skip hook unregistration of owner tables from the netns exit path,
nft_rcv_nl_event() unregisters the table hooks before tearing down
the table content.
Fixes: 6001a930ce03 ("netfilter: nftables: introduce table ownership")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
Signed-off-by: zhang kai <zhangkaiheb@126.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
I hit the warning below when cat-ing a small (about 80 bytes) text file
on 9pfs (msize=2097152 is passed as a 9p mount option). The reason is that
we skip iov_iter_advance() if the read count is 0 in the zerocopy case, so
we don't truncate the pipe, and iov_iter_pipe() then thinks the pipe is
full. Fix it by removing the exception for 0, so that
iov_iter_advance() is called even on an empty read in the zerocopy case.
[ 8.279568] WARNING: CPU: 0 PID: 39 at lib/iov_iter.c:1203 iov_iter_pipe+0x31/0x40
[ 8.280028] Modules linked in:
[ 8.280561] CPU: 0 PID: 39 Comm: cat Not tainted 5.11.0+ #6
[ 8.281260] RIP: 0010:iov_iter_pipe+0x31/0x40
[ 8.281974] Code: 2b 42 54 39 42 5c 76 22 c7 07 20 00 00 00 48 89 57 18 8b 42 50 48 c7 47 08 b
[ 8.283169] RSP: 0018:ffff888000cbbd80 EFLAGS: 00000246
[ 8.283512] RAX: 0000000000000010 RBX: ffff888000117d00 RCX: 0000000000000000
[ 8.283876] RDX: ffff88800031d600 RSI: 0000000000000000 RDI: ffff888000cbbd90
[ 8.284244] RBP: ffff888000cbbe38 R08: 0000000000000000 R09: ffff8880008d2058
[ 8.284605] R10: 0000000000000002 R11: ffff888000375510 R12: 0000000000000050
[ 8.284964] R13: ffff888000cbbe80 R14: 0000000000000050 R15: ffff88800031d600
[ 8.285439] FS: 00007f24fd8af600(0000) GS:ffff88803ec00000(0000) knlGS:0000000000000000
[ 8.285844] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 8.286150] CR2: 00007f24fd7d7b90 CR3: 0000000000c97000 CR4: 00000000000406b0
[ 8.286710] Call Trace:
[ 8.288279] generic_file_splice_read+0x31/0x1a0
[ 8.289273] ? do_splice_to+0x2f/0x90
[ 8.289511] splice_direct_to_actor+0xcc/0x220
[ 8.289788] ? pipe_to_sendpage+0xa0/0xa0
[ 8.290052] do_splice_direct+0x8b/0xd0
[ 8.290314] do_sendfile+0x1ad/0x470
[ 8.290576] do_syscall_64+0x2d/0x40
[ 8.290818] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 8.291409] RIP: 0033:0x7f24fd7dca0a
[ 8.292511] Code: c3 0f 1f 80 00 00 00 00 4c 89 d2 4c 89 c6 e9 bd fd ff ff 0f 1f 44 00 00 31 8
[ 8.293360] RSP: 002b:00007ffc20932818 EFLAGS: 00000206 ORIG_RAX: 0000000000000028
[ 8.293800] RAX: ffffffffffffffda RBX: 0000000001000000 RCX: 00007f24fd7dca0a
[ 8.294153] RDX: 0000000000000000 RSI: 0000000000000003 RDI: 0000000000000001
[ 8.294504] RBP: 0000000000000003 R08: 0000000000000000 R09: 0000000000000000
[ 8.294867] R10: 0000000001000000 R11: 0000000000000206 R12: 0000000000000003
[ 8.295217] R13: 0000000000000001 R14: 0000000000000001 R15: 0000000000000000
[ 8.295782] ---[ end trace 63317af81b3ca24b ]---
Signed-off-by: Jisheng Zhang <Jisheng.Zhang@synaptics.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This reverts commit 134f98bcf1b898fb9d6f2b91bc85dd2e5478b4b8.
The r8153_mac_clk_spd() is used for RTL8153A only, because the register
table of RTL8153B is different from RTL8153A. However, this function would
be called when RTL8153B calls r8153_first_init() and r8153_enter_oob().
That causes RTL8153B to become unstable when suspending and resuming. In
the worst case, the device may stop working.
Besides, revert this commit to disable MAC clock speed down for RTL8153A.
It would avoid the known issue when enabling U1. The data of the first
control transfer may be wrong when exiting U1.
Signed-off-by: Hayes Wang <hayeswang@realtek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Commit 5ee759cda51b ("l2tp: use standard API for warning log messages")
changed a number of warnings about invalid packets in the receive path
so that they are always shown, instead of only when a special L2TP debug
flag is set. Even with rate limiting these warnings can easily cause
significant log spam - potentially triggered by a malicious party
sending invalid packets on purpose.
In addition these warnings were noticed by projects like Tunneldigger [1],
which uses L2TP for its data path, but implements its own control
protocol (which is sufficiently different from L2TP data packets that it
would always be passed up to userspace even with future extensions of
L2TP).
Some of the warnings were already redundant, as l2tp_stats has a counter
for these packets. This commit adds one additional counter for invalid
packets that are passed up to userspace. Packets with unknown session are
not counted as invalid, as there is nothing wrong with the format of
these packets.
With the additional counter, all of these messages are either redundant
or benign, so we reduce them to pr_debug_ratelimited().
[1] https://github.com/wlanslovenija/tunneldigger/issues/160
Fixes: 5ee759cda51b ("l2tp: use standard API for warning log messages")
Signed-off-by: Matthias Schiffer <mschiffer@universe-factory.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
There is no usrio config defined for the default gem config, leading to
a kernel panic on devices that don't define any data. This issue can be
reproduced on the Microchip PolarFire SoC, where the compatible string
is defined as "cdns,macb".
Fixes: edac63861db7 ("add userio bits as platform configuration")
Signed-off-by: Atish Patra <atish.patra@wdc.com>
Acked-by: Nicolas Ferre <nicolas.ferre@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers
Kalle Valo says:
====================
wireless-drivers fixes for v5.12
Second set of fixes for v5.12. Only three iwlwifi fixes this time, the
crash with MVM being the most important one and reported by multiple
people.
iwlwifi
* fix kernel crash regression when using LTO with MVM devices
* fix printk format warnings
* fix potential deadlock found by lockdep
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Leave it to Greg.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
We introduce dwmac410_dma_init_channel() here for EQoS v4.10 and
above, which use different DMA_CH(n)_Interrupt_Enable bit definitions for
NIE and AIE.
Fixes: 48863ce5940f ("stmmac: add DMA support for GMAC 4.xx")
Signed-off-by: Ong Boon Leong <boon.leong.ong@intel.com>
Signed-off-by: Ramesh Babu B <ramesh.babu.b@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
GCC 7.5 reports:
../drivers/net/ethernet/ibm/ibmvnic.c: In function 'ibmvnic_reset_init':
../drivers/net/ethernet/ibm/ibmvnic.c:5373:51: warning: 'old_num_tx_queues' may be used uninitialized in this function [-Wmaybe-uninitialized]
../drivers/net/ethernet/ibm/ibmvnic.c:5373:6: warning: 'old_num_rx_queues' may be used uninitialized in this function [-Wmaybe-uninitialized]
The variables are initialized only if (reset) and used only if (reset &&
something), so this is a false positive. However, there is no reason not
to initialize the variables unconditionally, which avoids the warning.
Fixes: 635e442f4a48 ("ibmvnic: merge ibmvnic_reset_init and ibmvnic_init")
Signed-off-by: Michal Suchanek <msuchanek@suse.de>
Reviewed-by: Sukadev Bhattiprolu <sukadev@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The value of "lmac_id" can be controlled by the user, and if it is larger
than the number of bits in a long it reads outside the bitmap.
The highest valid value is less than MAX_LMAC_PER_CGX (4).
Fixes: 91c6945ea1f9 ("octeontx2-af: cn10k: Add RPM MAC support")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
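A minimal sketch of the bounds check, assuming the bitmap fits in a single unsigned long. `lmac_is_valid()` is a hypothetical helper and not the function changed by the patch; the value 4 for MAX_LMAC_PER_CGX is the one quoted in the commit message.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define MAX_LMAC_PER_CGX 4	/* value quoted in the commit message */

static_assert(MAX_LMAC_PER_CGX <= sizeof(unsigned long) * CHAR_BIT,
	      "bitmap must be wide enough for every valid index");

/* Hypothetical helper: validate the index before touching the bitmap. */
static bool lmac_is_valid(unsigned long bitmap, int lmac_id)
{
	if (lmac_id < 0 || lmac_id >= MAX_LMAC_PER_CGX)
		return false;	/* never index past the bitmap */
	return (bitmap >> lmac_id) & 1UL;
}
```

Rejecting the index first matters because shifting by more than the width of the type, or testing a bit beyond the bitmap, is exactly the out-of-bounds read the commit describes.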
|
|
warning in iwl_pcie_rx_handle())
We can't call netif_napi_add() with rxq->lock held, as there is a potential
for deadlock as spotted by lockdep (see below). rxq->lock is not
protecting anything over the netif_napi_add() codepath anyway, so let's
drop it just before calling into NAPI.
========================================================
WARNING: possible irq lock inversion dependency detected
5.12.0-rc1-00002-gbada49429032 #5 Not tainted
--------------------------------------------------------
irq/136-iwlwifi/565 just changed the state of lock:
ffff89f28433b0b0 (&rxq->lock){+.-.}-{2:2}, at: iwl_pcie_rx_handle+0x7f/0x960 [iwlwifi]
but this lock took another, SOFTIRQ-unsafe lock in the past:
(napi_hash_lock){+.+.}-{2:2}
and interrupts could create inverse lock ordering between them.
other info that might help us debug this:
Possible interrupt unsafe locking scenario:
CPU0 CPU1
---- ----
lock(napi_hash_lock);
local_irq_disable();
lock(&rxq->lock);
lock(napi_hash_lock);
<Interrupt>
lock(&rxq->lock);
*** DEADLOCK ***
1 lock held by irq/136-iwlwifi/565:
#0: ffff89f2b1440170 (sync_cmd_lockdep_map){+.+.}-{0:0}, at: iwl_pcie_irq_handler+0x5/0xb30
the shortest dependencies between 2nd lock and 1st lock:
-> (napi_hash_lock){+.+.}-{2:2} {
HARDIRQ-ON-W at:
lock_acquire+0x277/0x3d0
_raw_spin_lock+0x2c/0x40
netif_napi_add+0x14b/0x270
e1000_probe+0x2fe/0xee0 [e1000e]
local_pci_probe+0x42/0x90
pci_device_probe+0x10b/0x1c0
really_probe+0xef/0x4b0
driver_probe_device+0xde/0x150
device_driver_attach+0x4f/0x60
__driver_attach+0x9c/0x140
bus_for_each_dev+0x79/0xc0
bus_add_driver+0x18d/0x220
driver_register+0x5b/0xf0
do_one_initcall+0x5b/0x300
do_init_module+0x5b/0x21c
load_module+0x1dae/0x22c0
__do_sys_finit_module+0xad/0x110
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xae
SOFTIRQ-ON-W at:
lock_acquire+0x277/0x3d0
_raw_spin_lock+0x2c/0x40
netif_napi_add+0x14b/0x270
e1000_probe+0x2fe/0xee0 [e1000e]
local_pci_probe+0x42/0x90
pci_device_probe+0x10b/0x1c0
really_probe+0xef/0x4b0
driver_probe_device+0xde/0x150
device_driver_attach+0x4f/0x60
__driver_attach+0x9c/0x140
bus_for_each_dev+0x79/0xc0
bus_add_driver+0x18d/0x220
driver_register+0x5b/0xf0
do_one_initcall+0x5b/0x300
do_init_module+0x5b/0x21c
load_module+0x1dae/0x22c0
__do_sys_finit_module+0xad/0x110
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xae
INITIAL USE at:
lock_acquire+0x277/0x3d0
_raw_spin_lock+0x2c/0x40
netif_napi_add+0x14b/0x270
e1000_probe+0x2fe/0xee0 [e1000e]
local_pci_probe+0x42/0x90
pci_device_probe+0x10b/0x1c0
really_probe+0xef/0x4b0
driver_probe_device+0xde/0x150
device_driver_attach+0x4f/0x60
__driver_attach+0x9c/0x140
bus_for_each_dev+0x79/0xc0
bus_add_driver+0x18d/0x220
driver_register+0x5b/0xf0
do_one_initcall+0x5b/0x300
do_init_module+0x5b/0x21c
load_module+0x1dae/0x22c0
__do_sys_finit_module+0xad/0x110
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xae
}
... key at: [<ffffffffae84ef38>] napi_hash_lock+0x18/0x40
... acquired at:
_raw_spin_lock+0x2c/0x40
netif_napi_add+0x14b/0x270
_iwl_pcie_rx_init+0x1f4/0x710 [iwlwifi]
iwl_pcie_rx_init+0x1b/0x3b0 [iwlwifi]
iwl_trans_pcie_start_fw+0x2ac/0x6a0 [iwlwifi]
iwl_mvm_load_ucode_wait_alive+0x116/0x460 [iwlmvm]
iwl_run_init_mvm_ucode+0xa4/0x3a0 [iwlmvm]
iwl_op_mode_mvm_start+0x9ed/0xbf0 [iwlmvm]
_iwl_op_mode_start.isra.4+0x42/0x80 [iwlwifi]
iwl_opmode_register+0x71/0xe0 [iwlwifi]
iwl_mvm_init+0x34/0x1000 [iwlmvm]
do_one_initcall+0x5b/0x300
do_init_module+0x5b/0x21c
load_module+0x1dae/0x22c0
__do_sys_finit_module+0xad/0x110
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xae
[ ... lockdep output trimmed .... ]
Fixes: 25edc8f259c7106 ("iwlwifi: pcie: properly implement NAPI")
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Acked-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Kalle Valo <kvalo@codeaurora.org>
Link: https://lore.kernel.org/r/nycvar.YFH.7.76.2103021134060.12405@cbobk.fhfr.pm
|
|
An unsigned long variable should use the '%lu' format specifier, not '%zd'.
Fixes: a1a6a4cf49ece ("iwlwifi: pnvm: implement reading PNVM from UEFI")
Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
Acked-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Kalle Valo <kvalo@codeaurora.org>
Link: https://lore.kernel.org/r/20210302011640.1276636-1-pierre-louis.bossart@linux.intel.com
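For reference, a small user-space sketch of matching the conversion specifier to the argument type. `format_len()` is a hypothetical helper; the driver's actual log call is not shown in the commit message and is not reproduced here.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Formats an unsigned long length; "%lu" matches the argument type. */
static int format_len(char *buf, size_t size, unsigned long len)
{
	assert(buf != NULL && size > 0);
	return snprintf(buf, size, "len=%lu", len);
}
```

'%zd' is meant for the signed counterpart of size_t, so passing an unsigned long to it triggers a format warning on platforms where the two types differ.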
|
|
Make sure dmi_system_id tables are NULL terminated. This crashed when LTO was enabled:
BUG: KASAN: global-out-of-bounds in dmi_check_system+0x5a/0x70
Read of size 1 at addr ffffffffc16af750 by task NetworkManager/1913
CPU: 4 PID: 1913 Comm: NetworkManager Not tainted 5.12.0-rc1+ #10057
Hardware name: LENOVO 20THCTO1WW/20THCTO1WW, BIOS N2VET27W (1.12 ) 12/21/2020
Call Trace:
dump_stack+0x90/0xbe
print_address_description.constprop.0+0x1d/0x140
? dmi_check_system+0x5a/0x70
? dmi_check_system+0x5a/0x70
kasan_report.cold+0x7b/0xd4
? dmi_check_system+0x5a/0x70
__asan_load1+0x4d/0x50
dmi_check_system+0x5a/0x70
iwl_mvm_up+0x1360/0x1690 [iwlmvm]
? iwl_mvm_send_recovery_cmd+0x270/0x270 [iwlmvm]
? setup_object.isra.0+0x27/0xd0
? kasan_poison+0x20/0x50
? ___slab_alloc.constprop.0+0x483/0x5b0
? mempool_kmalloc+0x17/0x20
? ftrace_graph_ret_addr+0x2a/0xb0
? kasan_poison+0x3c/0x50
? cfg80211_iftype_allowed+0x2e/0x90 [cfg80211]
? __kasan_check_write+0x14/0x20
? mutex_lock+0x86/0xe0
? __mutex_lock_slowpath+0x20/0x20
__iwl_mvm_mac_start+0x49/0x290 [iwlmvm]
iwl_mvm_mac_start+0x37/0x50 [iwlmvm]
drv_start+0x73/0x1b0 [mac80211]
ieee80211_do_open+0x53e/0xf10 [mac80211]
? ieee80211_check_concurrent_iface+0x266/0x2e0 [mac80211]
ieee80211_open+0xb9/0x100 [mac80211]
__dev_open+0x1b8/0x280
Fixes: a2ac0f48a07c ("iwlwifi: mvm: implement approved list for the PPAG feature")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Reviewed-by: Nathan Chancellor <nathan@kernel.org>
Tested-by: Victor Michel <vic.michel.web@gmail.com>
Acked-by: Luca Coelho <luciano.coelho@intel.com>
[kvalo@codeaurora.org: improve commit log]
Signed-off-by: Kalle Valo <kvalo@codeaurora.org>
Link: https://lore.kernel.org/r/20210223140039.1708534-1-weiyongjun1@huawei.com
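Why the terminator matters can be shown with a simplified table walk. `struct sys_id`, the `approved` table and `count_matches()` are hypothetical stand-ins for struct dmi_system_id and a dmi_check_system()-style scan; the vendor strings are made up.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-in for struct dmi_system_id. */
struct sys_id {
	const char *vendor;	/* NULL marks the end of the table */
};

static const struct sys_id approved[] = {
	{ .vendor = "VendorA" },
	{ .vendor = "VendorB" },
	{ .vendor = NULL }	/* terminating entry; omitting it is the bug */
};

/* Walks the table until the sentinel, like a DMI table scan would. */
static int count_matches(const struct sys_id *list, const char *vendor)
{
	int count = 0;

	assert(list != NULL && vendor != NULL);
	for (; list->vendor != NULL; list++)
		if (strcmp(list->vendor, vendor) == 0)
			count++;
	return count;
}
```

The scan has no length argument, so without the terminating entry it keeps reading whatever follows the array in memory, which matches the global-out-of-bounds read in the KASAN report above.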
|
|
mtk_star_dma_unmap_rx() should unmap the dma_addr of old skb rather than
that of new skb.
Assign new_dma_addr to desc_data.dma_addr after all handling of old skb
ends to avoid unexpected receive side error.
Fixes: f96e9641e92b ("net: ethernet: mtk-star-emac: fix error path in RX handling")
Signed-off-by: Biao Huang <biao.huang@mediatek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
On Intel platforms that have two Ethernet controllers, such as
TGL-H and ADL-S, a unique MDIO bus id is required for the MDIO bus to be
registered successfully:
[ 13.076133] sysfs: cannot create duplicate filename '/class/mdio_bus/stmmac-1'
[ 13.083404] CPU: 8 PID: 1898 Comm: systemd-udevd Tainted: G U 5.11.0-net-next #106
[ 13.092410] Hardware name: Intel Corporation Alder Lake Client Platform/AlderLake-S ADP-S DRR4 CRB, BIOS ADLIFSI1.R00.1494.B00.2012031421 12/03/2020
[ 13.105709] Call Trace:
[ 13.108176] dump_stack+0x64/0x7c
[ 13.111553] sysfs_warn_dup+0x56/0x70
[ 13.115273] sysfs_do_create_link_sd.isra.2+0xbd/0xd0
[ 13.120371] device_add+0x4df/0x840
[ 13.123917] ? complete_all+0x2a/0x40
[ 13.127636] __mdiobus_register+0x98/0x310 [libphy]
[ 13.132572] stmmac_mdio_register+0x1c5/0x3f0 [stmmac]
[ 13.137771] ? stmmac_napi_add+0xa5/0xf0 [stmmac]
[ 13.142493] stmmac_dvr_probe+0x806/0xee0 [stmmac]
[ 13.147341] intel_eth_pci_probe+0x1cb/0x250 [dwmac_intel]
[ 13.152884] pci_device_probe+0xd2/0x150
[ 13.156897] really_probe+0xf7/0x4d0
[ 13.160527] driver_probe_device+0x5d/0x140
[ 13.164761] device_driver_attach+0x4f/0x60
[ 13.168996] __driver_attach+0xa2/0x140
[ 13.172891] ? device_driver_attach+0x60/0x60
[ 13.177300] bus_for_each_dev+0x76/0xc0
[ 13.181188] bus_add_driver+0x189/0x230
[ 13.185083] ? 0xffffffffc0795000
[ 13.188446] driver_register+0x5b/0xf0
[ 13.192249] ? 0xffffffffc0795000
[ 13.195577] do_one_initcall+0x4d/0x210
[ 13.199467] ? kmem_cache_alloc_trace+0x2ff/0x490
[ 13.204228] do_init_module+0x5b/0x21c
[ 13.208031] load_module+0x2a0c/0x2de0
[ 13.211838] ? __do_sys_finit_module+0xb1/0x110
[ 13.216420] __do_sys_finit_module+0xb1/0x110
[ 13.220825] do_syscall_64+0x33/0x40
[ 13.224451] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 13.229515] RIP: 0033:0x7fc2b1919ccd
[ 13.233113] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 93 31 0c 00 f7 d8 64 89 01 48
[ 13.251912] RSP: 002b:00007ffcea2e5b98 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
[ 13.259527] RAX: ffffffffffffffda RBX: 0000560558920f10 RCX: 00007fc2b1919ccd
[ 13.266706] RDX: 0000000000000000 RSI: 00007fc2b1a881e3 RDI: 0000000000000012
[ 13.273887] RBP: 0000000000020000 R08: 0000000000000000 R09: 0000000000000000
[ 13.281036] R10: 0000000000000012 R11: 0000000000000246 R12: 00007fc2b1a881e3
[ 13.288183] R13: 0000000000000000 R14: 0000000000000000 R15: 00007ffcea2e5d58
[ 13.295389] libphy: mii_bus stmmac-1 failed to register
Fixes: 88af9bd4efbd ("stmmac: intel: Add ADL-S 1Gbps PCI IDs")
Fixes: 8450e23f142f ("stmmac: intel: Add PCI IDs for TGL-H platform")
Signed-off-by: Wong Vee Khee <vee.khee.wong@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Disallow updating the ownership bit on an existing table: do not allow
grabbing ownership of an existing table, and do not allow dropping
ownership of an existing table.
Fixes: 6001a930ce03 ("netfilter: nftables: introduce table ownership")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
The verifier test labelled "valid read map access into a read-only array
2" calls the bpf_csum_diff() helper and checks its return value. However,
architecture implementations of csum_partial() (which is what the helper
uses) differ in whether they fold the return value to 16 bit or not. For
example, x86 version has ...
if (unlikely(odd)) {
result = from32to16(result);
result = ((result >> 8) & 0xff) | ((result & 0xff) << 8);
}
... while generic lib/checksum.c does:
result = from32to16(result);
if (odd)
result = ((result >> 8) & 0xff) | ((result & 0xff) << 8);
This makes the helper return different values on different architectures,
breaking the test on non-x86. To fix this, add an additional instruction
to always mask the return value to 16 bits, and update the expected return
value accordingly.
Fixes: fb2abb73e575 ("bpf, selftest: test {rd, wr}only flags and direct value access")
Signed-off-by: Yauheni Kaliuta <yauheni.kaliuta@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20210228103017.320240-1-yauheni.kaliuta@redhat.com
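The difference between a folded and an unfolded checksum, and the effect of masking, can be checked with a short sketch. `from32to16()` follows the usual end-around-carry fold; `mask16()` is a hypothetical name for the extra AND with 0xffff, and the sample value 0xffffffe3 used in the note below is chosen for illustration and is not taken from the commit.

```c
#include <assert.h>
#include <stdint.h>

/* Fold a 32-bit ones' complement sum into 16 bits with end-around carry. */
static uint32_t from32to16(uint32_t x)
{
	x = (x & 0xffff) + (x >> 16);	/* add high half into low half */
	x = (x & 0xffff) + (x >> 16);	/* add the carry back in */
	assert(x <= 0xffff);		/* two folds are always enough */
	return x;
}

/* Keep only the low 16 bits of a value. */
static uint32_t mask16(uint32_t x)
{
	return x & 0xffff;
}
```

For 0xffffffe3, both the folded value and the masked unfolded value are 0xffe3, so masking gives the same answer whether or not the architecture folded first. This does not hold for arbitrary inputs (0x00010001 folds to 2 but masks to 1), which is presumably why the expected return value in the test had to be updated alongside the mask.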
|
|
test_snprintf_btf fails on s390, because NULL points to a readable
struct lowcore there. Fix by using the last page instead.
Error message example:
printing fffffffffffff000 should generate error, got (361)
Fixes: 076a95f5aff2 ("selftests/bpf: Add bpf_snprintf_btf helper tests")
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20210227051726.121256-1-iii@linux.ibm.com
|
|
Qingyu Li reported a syzkaller bug where the repro
changes RCV SEQ _after_ restoring data in the receive queue.
mprotect(0x4aa000, 12288, PROT_READ) = 0
mmap(0x1ffff000, 4096, PROT_NONE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x1ffff000
mmap(0x20000000, 16777216, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x20000000
mmap(0x21000000, 4096, PROT_NONE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x21000000
socket(AF_INET6, SOCK_STREAM, IPPROTO_IP) = 3
setsockopt(3, SOL_TCP, TCP_REPAIR, [1], 4) = 0
connect(3, {sa_family=AF_INET6, sin6_port=htons(0), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_scope_id=0}, 28) = 0
setsockopt(3, SOL_TCP, TCP_REPAIR_QUEUE, [1], 4) = 0
sendmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="0x0000000000000003\0\0", iov_len=20}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 20
setsockopt(3, SOL_TCP, TCP_REPAIR, [0], 4) = 0
setsockopt(3, SOL_TCP, TCP_QUEUE_SEQ, [128], 4) = 0
recvfrom(3, NULL, 20, 0, NULL, NULL) = -1 ECONNRESET (Connection reset by peer)
syslog shows:
[ 111.205099] TCP recvmsg seq # bug 2: copied 80, seq 0, rcvnxt 80, fl 0
[ 111.207894] WARNING: CPU: 1 PID: 356 at net/ipv4/tcp.c:2343 tcp_recvmsg_locked+0x90e/0x29a0
This should not be allowed. TCP_QUEUE_SEQ should only be used
when queues are empty.
This patch fixes this case, and the tx path as well.
Fixes: ee9952831cfd ("tcp: Initial repair mode")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=212005
Reported-by: Qingyu Li <ieatmuttonchuan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
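The rule "TCP_QUEUE_SEQ should only be used when queues are empty" can be modelled as follows. This is a hedged userspace sketch, not the kernel change: the enum, the function name and the parameters are hypothetical, and -EPERM is an assumed error code.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-ins for the two repair queues. */
enum toy_queue { TOY_RECV_QUEUE, TOY_SEND_QUEUE };

/*
 * Returns 0 if the sequence number of the selected queue may be
 * changed, or -EPERM if that queue still holds data.
 *
 * send_queue_len: segments still sitting in the send queue
 * rcv_nxt:        next sequence number expected from the peer
 * copied_seq:     first sequence number not yet read by the application
 */
int toy_queue_seq_check(enum toy_queue q, unsigned int send_queue_len,
			uint32_t rcv_nxt, uint32_t copied_seq)
{
	if (q == TOY_SEND_QUEUE)
		return send_queue_len ? -EPERM : 0;

	/* Unread receive data exists whenever the two counters differ. */
	return rcv_nxt != copied_seq ? -EPERM : 0;
}
```

In the reproducer, 20 bytes are restored into the receive queue before TCP_QUEUE_SEQ is set, which in this model corresponds to rcv_nxt being 20 ahead of copied_seq, so the request is refused.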
|
|
Contrary to the RNDIS protocol specification, certain (pre-Fe)
implementations of Hyper-V's vSwitch did not account for the status
buffer field in the length of an RNDIS packet; the bug was fixed in
newer implementations. Validate the status buffer fields using the
length of the 'vmtransfer_page' packet (all implementations), that
is known/validated to be less than or equal to the receive section
size and not smaller than the length of the RNDIS message.
Reported-by: Dexuan Cui <decui@microsoft.com>
Suggested-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: Andrea Parri (Microsoft) <parri.andrea@gmail.com>
Fixes: 505e3f00c3f36 ("hv_netvsc: Add (more) validation for untrusted Hyper-V values")
Signed-off-by: David S. Miller <davem@davemloft.net>
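The kind of check described above, validating an offset/length pair against a length that is already trusted, can be sketched like this. The function and parameter names are hypothetical and do not reproduce the netvsc structures; pkt_len stands for the validated 'vmtransfer_page' packet length.

```c
#include <stdint.h>

/*
 * Returns 1 if the byte range [offset, offset + len) lies entirely
 * within a packet of pkt_len bytes, 0 otherwise. The subtraction form
 * avoids the integer overflow that "offset + len <= pkt_len" could hit
 * when both values come from an untrusted host.
 */
int toy_status_buf_ok(uint32_t pkt_len, uint32_t offset, uint32_t len)
{
	if (offset > pkt_len)
		return 0;
	return len <= pkt_len - offset;
}
```

Bounding against the outer packet length rather than the RNDIS message length is what lets the same check work for implementations that leave the status buffer out of the message length.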
|
|
A different TPID bit is used for 802.1ad VLAN frames.
Reported-by: Ilario Gelmetti <iochesonome@gmail.com>
Fixes: f0af34317f4b ("net: dsa: mediatek: combine MediaTek tag with VLAN tag")
Signed-off-by: DENG Qingfang <dqfext@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The referenced commit expands the skb_seq_state used by
skb_find_text with a 4B frag_off field, growing it to 48B.
This exceeds container ts_state->cb, causing a stack corruption:
[ 73.238353] Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: skb_find_text+0xc5/0xd0
[ 73.247384] CPU: 1 PID: 376 Comm: nping Not tainted 5.11.0+ #4
[ 73.252613] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
[ 73.260078] Call Trace:
[ 73.264677] dump_stack+0x57/0x6a
[ 73.267866] panic+0xf6/0x2b7
[ 73.270578] ? skb_find_text+0xc5/0xd0
[ 73.273964] __stack_chk_fail+0x10/0x10
[ 73.277491] skb_find_text+0xc5/0xd0
[ 73.280727] string_mt+0x1f/0x30
[ 73.283639] ipt_do_table+0x214/0x410
The struct is passed between skb_find_text and its callbacks
skb_prepare_seq_read, skb_seq_read and skb_abort_seq_read through
the textsearch interface using TS_SKB_CB.
I assumed that this mapped to skb->cb like other .._SKB_CB wrappers.
skb->cb is 48B. But it maps to ts_state->cb, which is only 40B.
skb->cb was increased from 40B to 48B after ts_state was introduced,
in commit 3e3850e989c5 ("[NETFILTER]: Fix xfrm lookup in
ip_route_me_harder/ip6_route_me_harder").
Increase ts_state.cb[] to 48 to fit the struct.
Also add a BUILD_BUG_ON to avoid a repeat.
The alternative is to directly add a dependency from textsearch onto
linux/skbuff.h, but I think the intent is for textsearch to have no such
dependencies on its callers.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=211911
Fixes: 97550f6fa592 ("net: compound page support in skb_seq_read")
Reported-by: Kris Karas <bugs-a17@moonlit-rail.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
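The size mismatch and the compile-time guard can be illustrated in userspace. This is a sketch with toy types: the struct layouts are stand-ins (uint64_t fields model pointers on a 64-bit build), and C11 _Static_assert is used here in the role the message gives to BUILD_BUG_ON.

```c
#include <stdint.h>

#define TOY_CB_SIZE 48	/* enlarged control block; 40 would be too small */

/* Toy stand-in for the textsearch state with its opaque scratch area. */
struct toy_ts_state {
	unsigned int offset;
	char cb[TOY_CB_SIZE];
};

/* Toy stand-in for the sequential-read state stored in that area. */
struct toy_seq_state {
	uint32_t lower_offset;
	uint32_t upper_offset;
	uint32_t frag_idx;
	uint32_t stepped_offset;
	uint64_t root_skb;	/* models a pointer */
	uint64_t cur_skb;	/* models a pointer */
	uint64_t frag_data;	/* models a pointer */
	uint32_t frag_off;	/* the added 4-byte field */
};

/* Fails the build if the state ever outgrows the scratch area again. */
_Static_assert(sizeof(struct toy_seq_state) <=
	       sizeof(((struct toy_ts_state *)0)->cb),
	       "seq state must fit in the textsearch control block");

unsigned long toy_seq_state_size(void)
{
	return sizeof(struct toy_seq_state);
}
```

With 16 bytes of offsets, 24 bytes of pointer stand-ins and the new 4-byte field, the struct no longer fits in 40 bytes, which is the overflow the stack protector caught. Changing TOY_CB_SIZE back to 40 makes the static assertion fail at compile time instead.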
|
|
This patch fixes a spelling typo in bonding.rst.
Signed-off-by: Masanari Iida <standby24x7@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can
Marc Kleine-Budde says:
====================
pull-request: can 2021-03-01
this is a pull request of 6 patches for net/master.
The first 3 patches are by Joakim Zhang for the flexcan driver and fix
the probing and starting of the chip.
The next patch is by me, for the mcp251xfd driver and reverts the BQL
support. BQL support got mainline with rc1 and assumes that CAN frames
are always echoed, which is not the case. A proper fix requires
more changes and will be rolled out via linux-can-next later.
Oleksij Rempel's patch fixes the socket ref counting if socket was
closed before setting skb ownership.
Torin Cooper-Bennun's patch for the tcan4x5x driver fixes a race
condition, where the chip is first attached to the bus and then the MRAM
is initialized, which may result in lost data.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Vladimir Oltean says:
====================
Fixes for NXP ENETC driver
This contains an assorted set of fixes collected over the past 2 weeks
on the enetc driver. Some are related to VLAN processing, some to
physical link settings, some are fixups of previous hardware workarounds,
and some are simply zero-day data path bugs that for some reason were
never caught or at least identified.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The RX rings have a producer index owned by hardware, where newly
received frame buffers are placed, and a consumer index owned by
software, where newly allocated buffers are placed, in expectation of
hardware being able to place frame data in them.
Hardware increments the producer index when a frame is received, however
it is not allowed to increment the producer index to match the consumer
index (RBCIR) since the ring can hold at most RBLENR[LENGTH]-1 received
BDs. Whenever the producer index matches the value of the consumer
index, the ring has no unprocessed received frames and all BDs in the
ring have been initialized/prepared by software, i.e. hardware owns all
BDs in the ring.
The code uses the next_to_clean variable to keep track of the producer
index, and the next_to_use variable to keep track of the consumer index.
The RX rings are seeded from enetc_refill_rx_ring, which is called from
two places:
1. initially the ring is seeded until full with enetc_bd_unused(rx_ring),
i.e. with 511 buffers. This will make next_to_clean=0 and next_to_use=511:
.ndo_open
-> enetc_open
-> enetc_setup_bdrs
-> enetc_setup_rxbdr
-> enetc_refill_rx_ring
2. then during the data path processing, it is refilled with 16 buffers
at a time:
enetc_msix
-> napi_schedule
-> enetc_poll
-> enetc_clean_rx_ring
-> enetc_refill_rx_ring
There is just one problem: the initial seeding done during .ndo_open
updates just the producer index (ENETC_RBPIR) with 0, and the software
next_to_clean and next_to_use variables. Notably, it will not update the
consumer index to make the hardware aware of the newly added buffers.
Wait, what? So how does it work?
Well, the reset values of the producer index and of the consumer index
of a ring are both zero. As per the description in the second paragraph,
it means that the ring is full of buffers waiting for hardware to put
frames in them, which by coincidence is almost true, because we have in
fact seeded 511 buffers into the ring.
But will the hardware attempt to access the 512th entry of the ring,
which has an invalid BD in it? Well, no, because in order to do that, it
would have to first populate the first 511 entries, and the NAPI
enetc_poll will kick in by then. Eventually, after 16 processed slots
have become available in the RX ring, enetc_clean_rx_ring will call
enetc_refill_rx_ring and then will [ finally ] update the consumer index
with the new software next_to_use variable. From now on, the
next_to_clean and next_to_use variables are in sync with the producer
and consumer ring indices.
So the day is saved, right? Well, not quite. Freeing the memory
allocated for the rings is done in:
enetc_close
-> enetc_clear_bdrs
-> enetc_clear_rxbdr
-> this just disables the ring
-> enetc_free_rxtx_rings
-> enetc_free_rx_ring
-> sets next_to_clean and next_to_use to 0
but again, nothing is committed to the hardware producer and consumer
indices (yay!). The assumption is that the ring is disabled, so the
indices don't matter anyway, and it's the responsibility of the "open"
code path to set those up.
.. Except that the "open" code path does not set those up properly.
While initially, things almost work, during subsequent enetc_close ->
enetc_open sequences, we have problems. To be precise, the enetc_open
that is subsequent to enetc_close will again refill the ring with 511
entries, but it will leave the consumer index untouched. Untouched
means, of course, equal to the value it had before disabling the ring
and draining the old buffers in enetc_close.
But as mentioned, enetc_setup_rxbdr will at least update the producer
index though, through this line of code:
enetc_rxbdr_wr(hw, idx, ENETC_RBPIR, 0);
so at this stage we'll have:
next_to_clean=0 (in hardware 0)
next_to_use=511 (in hardware we'll have the refill index prior to enetc_close)
Again, the next_to_clean and producer index are in sync and set to
correct values, so the driver manages to limp on. Eventually, 16 ring
entries will be consumed by enetc_poll, and the savior
enetc_clean_rx_ring will come and call enetc_refill_rx_ring, and then
update the hardware consumer ring based upon the new next_to_use.
So.. it works?
Well, by coincidence, it almost does, but there's a circumstance where
enetc_clean_rx_ring won't be there to save us. If the previous value of
the consumer index was 15, there's a problem, because the NAPI poll
sequence will only issue a refill when 16 or more buffers have been
consumed.
It's easiest to illustrate this with an example:
ip link set eno0 up
ip addr add 192.168.100.1/24 dev eno0
ping 192.168.100.1 -c 20 # ping this port from another board
ip link set eno0 down
ip link set eno0 up
ping 192.168.100.1 -c 20 # ping it again from the same other board
One by one:
1. ip link set eno0 up
-> calls enetc_setup_rxbdr:
-> calls enetc_refill_rx_ring(511 buffers)
-> next_to_clean=0 (in hw 0)
-> next_to_use=511 (in hw 0)
2. ping 192.168.100.1 -c 20 # ping this port from another board
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=1 next_to_clean 0 (in hw 1) next_to_use 511 (in hw 0)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=2 next_to_clean 1 (in hw 2) next_to_use 511 (in hw 0)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=3 next_to_clean 2 (in hw 3) next_to_use 511 (in hw 0)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=4 next_to_clean 3 (in hw 4) next_to_use 511 (in hw 0)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=5 next_to_clean 4 (in hw 5) next_to_use 511 (in hw 0)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=6 next_to_clean 5 (in hw 6) next_to_use 511 (in hw 0)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=7 next_to_clean 6 (in hw 7) next_to_use 511 (in hw 0)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=8 next_to_clean 7 (in hw 8) next_to_use 511 (in hw 0)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=9 next_to_clean 8 (in hw 9) next_to_use 511 (in hw 0)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=10 next_to_clean 9 (in hw 10) next_to_use 511 (in hw 0)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=11 next_to_clean 10 (in hw 11) next_to_use 511 (in hw 0)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=12 next_to_clean 11 (in hw 12) next_to_use 511 (in hw 0)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=13 next_to_clean 12 (in hw 13) next_to_use 511 (in hw 0)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=14 next_to_clean 13 (in hw 14) next_to_use 511 (in hw 0)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=15 next_to_clean 14 (in hw 15) next_to_use 511 (in hw 0)
enetc_clean_rx_ring: enetc_refill_rx_ring(16) increments next_to_use by 16 (mod 512) and writes it to hw
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=0 next_to_clean 15 (in hw 16) next_to_use 15 (in hw 15)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=1 next_to_clean 16 (in hw 17) next_to_use 15 (in hw 15)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=2 next_to_clean 17 (in hw 18) next_to_use 15 (in hw 15)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=3 next_to_clean 18 (in hw 19) next_to_use 15 (in hw 15)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=4 next_to_clean 19 (in hw 20) next_to_use 15 (in hw 15)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=5 next_to_clean 20 (in hw 21) next_to_use 15 (in hw 15)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=6 next_to_clean 21 (in hw 22) next_to_use 15 (in hw 15)
20 packets transmitted, 20 packets received, 0% packet loss
3. ip link set eno0 down
enetc_free_rx_ring: next_to_clean 0 (in hw 22), next_to_use 0 (in hw 15)
4. ip link set eno0 up
-> calls enetc_setup_rxbdr:
-> calls enetc_refill_rx_ring(511 buffers)
-> next_to_clean=0 (in hw 0)
-> next_to_use=511 (in hw 15)
5. ping 192.168.100.1 -c 20 # ping it again from the same other board
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=1 next_to_clean 0 (in hw 1) next_to_use 511 (in hw 15)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=2 next_to_clean 1 (in hw 2) next_to_use 511 (in hw 15)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=3 next_to_clean 2 (in hw 3) next_to_use 511 (in hw 15)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=4 next_to_clean 3 (in hw 4) next_to_use 511 (in hw 15)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=5 next_to_clean 4 (in hw 5) next_to_use 511 (in hw 15)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=6 next_to_clean 5 (in hw 6) next_to_use 511 (in hw 15)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=7 next_to_clean 6 (in hw 7) next_to_use 511 (in hw 15)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=8 next_to_clean 7 (in hw 8) next_to_use 511 (in hw 15)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=9 next_to_clean 8 (in hw 9) next_to_use 511 (in hw 15)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=10 next_to_clean 9 (in hw 10) next_to_use 511 (in hw 15)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=11 next_to_clean 10 (in hw 11) next_to_use 511 (in hw 15)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=12 next_to_clean 11 (in hw 12) next_to_use 511 (in hw 15)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=13 next_to_clean 12 (in hw 13) next_to_use 511 (in hw 15)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=14 next_to_clean 13 (in hw 14) next_to_use 511 (in hw 15)
20 packets transmitted, 12 packets received, 40% packet loss
And there it dies. No enetc_refill_rx_ring (because cleaned_cnt must be equal
to 15 for that to happen), no nothing. The hardware enters the condition where
the producer (14) + 1 is equal to the consumer (15) index, which makes it
believe it has no more free buffers to put packets in, so it starts discarding
them:
ip netns exec ns0 ethtool -S eno0 | grep -v ': 0'
NIC statistics:
Rx ring 0 discarded frames: 8
In summary, if the interface receives between 16 and 32 (mod 512) frames
and then there is a link flap, the port will eventually die with no
way to recover. If it receives fewer than 16 (mod 512) frames, then the
initial NAPI poll [ before the link flap ] will not update the consumer
index in hardware (it will remain zero) which will be ok when the buffers
are later reinitialized. If more than 32 (mod 512) frames are received,
the initial NAPI poll has the chance to refill the ring twice, updating
the consumer index to at least 32. So after the link flap, the consumer
index is still wrong, but the post-flap NAPI poll gets a chance to
refill the ring once (because it passes through cleaned_cnt=15) and
makes the consumer index be again back in sync with next_to_use.
The solution to this problem is actually simple: we just need to write
next_to_use into the hardware consumer index at enetc_open time, which
always brings it back in sync after an initial buffer seeding process.
The simpler thing would be to put the write to the consumer index into
enetc_refill_rx_ring directly, but there are issues with the MDIO
locking: in the NAPI poll code we have the enetc_lock_mdio() taken from
top-level and we use the unlocked enetc_wr_reg_hot, whereas in
enetc_open, the enetc_lock_mdio() is not taken at the top level, but
instead by each individual enetc_wr_reg, so we are forced to put an
additional enetc_wr_reg in enetc_setup_rxbdr. Better organization of
the code is left as a refactoring exercise.
Fixes: d4fd0404c1c9 ("enetc: Introduce basic PF and VF ENETC ethernet drivers")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
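The stall described above can be reproduced with a small userspace model of the ring indices. This is a sketch under stated assumptions, not driver code: all toy_* names are made up, the hardware is modelled as dropping a frame whenever producer + 1 equals the consumer index, one poll runs per frame, and a refill of 16 buffers happens once 16 descriptors are unused. It counts frames rather than ping replies, so it does not reproduce the exact packet-loss figure from the trace.

```c
#define TOY_RING_SIZE 512u	/* buffer descriptors per RX ring */
#define TOY_BUNDLE     16u	/* refill granularity in the poll path */

struct toy_rx_ring {
	unsigned int hw_pi;		/* models the hardware producer index */
	unsigned int hw_ci;		/* models the hardware consumer index */
	unsigned int next_to_clean;	/* software view of the producer index */
	unsigned int next_to_use;	/* software view of the consumer index */
};

unsigned int toy_bd_unused(const struct toy_rx_ring *r)
{
	return (r->next_to_clean + TOY_RING_SIZE - r->next_to_use - 1) %
	       TOY_RING_SIZE;
}

/* Seed the ring; write_ci selects the fixed behaviour. */
void toy_open(struct toy_rx_ring *r, int write_ci)
{
	r->next_to_use = (r->next_to_use + toy_bd_unused(r)) % TOY_RING_SIZE;
	r->hw_pi = 0;				/* always written on open */
	if (write_ci)
		r->hw_ci = r->next_to_use;	/* the missing write */
}

/* Reset the software indices only; hardware indices keep their values. */
void toy_close(struct toy_rx_ring *r)
{
	r->next_to_clean = 0;
	r->next_to_use = 0;
}

/* One frame arrives. Returns 1 if received, 0 if discarded by hardware. */
unsigned int toy_rx_frame(struct toy_rx_ring *r)
{
	if ((r->hw_pi + 1) % TOY_RING_SIZE == r->hw_ci)
		return 0;			/* no free buffer from hw's view */
	r->hw_pi = (r->hw_pi + 1) % TOY_RING_SIZE;

	/* Poll: refill first if a full bundle is unused, then clean one BD. */
	if (toy_bd_unused(r) >= TOY_BUNDLE) {
		r->next_to_use = (r->next_to_use + TOY_BUNDLE) % TOY_RING_SIZE;
		r->hw_ci = r->next_to_use;
	}
	r->next_to_clean = (r->next_to_clean + 1) % TOY_RING_SIZE;
	return 1;
}

/* up, first burst, down, up, second burst; returns frames received in burst 2. */
unsigned int toy_flap_scenario(int write_ci, unsigned int first_burst,
			       unsigned int second_burst)
{
	struct toy_rx_ring r = { 0, 0, 0, 0 };
	unsigned int i, received = 0;

	toy_open(&r, write_ci);
	for (i = 0; i < first_burst; i++)
		toy_rx_frame(&r);
	toy_close(&r);
	toy_open(&r, write_ci);
	for (i = 0; i < second_burst; i++)
		received += toy_rx_frame(&r);
	return received;
}
```

In this model, toy_flap_scenario(0, 22, 20) leaves the consumer index at 15 across the flap, so only 14 of the 20 frames in the second burst are received and no refill ever happens. With write_ci set to 1 all 20 are received. First bursts of 10 or 40 frames also recover without the fix, matching the cases listed in the summary.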
|
|
The Station Interface Receive Interrupt Detect Register (SIRXIDR)
contains a 16-bit wide mask of 'interrupt detected' events for each ring
associated with a port. Bit i is write-1-to-clear for RX ring i.
I have no explanation whatsoever how this line of code came to be
inserted in the blamed commit. I checked the downstream versions of that
patch and none of them have it.
The somewhat comical aspect of it is that we're writing a binary number
to the SIRXIDR register, which is derived from enetc_bd_unused(rx_ring).
Since the RX rings have 512 buffer descriptors, we end up writing 511 to
this register, which is 0x1ff, so we are effectively clearing the
'interrupt detected' event for rings 0-8.
This register is not what is used for interrupt handling though - it
only provides a summary for the entire SI. The hardware provides one
separate Interrupt Detect Register per RX ring, which auto-clears upon
read. So there doesn't seem to be any adverse effect caused by this
bogus write.
There is, however, one reason why this should be handled as a bugfix:
next_to_use _should_ be committed to hardware, just not to that
register, and this was obscuring the fact that it wasn't. This is fixed
in the next patch, and removing the bogus line now allows the fix patch
to be backported beyond that point.
Fixes: fd5736bf9f23 ("enetc: Workaround for MDIO register access issue")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
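The arithmetic behind "rings 0-8" can be checked with a few lines of C. The helper name is hypothetical; it simply counts how many consecutive low-order bits are set in the value written to a per-ring bitmask.

```c
/*
 * Returns the number of consecutive set bits starting at bit 0.
 * For a per-ring bitmask this is the number of low-numbered rings
 * (0, 1, 2, ...) that a write of 'value' would touch.
 */
unsigned int toy_low_rings_touched(unsigned int value)
{
	unsigned int n = 0;

	while (value & 1u) {
		n++;
		value >>= 1;
	}
	return n;
}
```

A ring of 512 descriptors has 511 unused entries after seeding, and 511 is 0x1ff, nine set bits, so the write touches the bits for rings 0 through 8.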
|
|
The ENETC port 0 MAC supports in-band status signaling coming from a PHY
when operating in RGMII mode, and this feature is enabled by default.
It has been reported that RGMII is broken in fixed-link, and that is not
surprising considering the fact that no PHY is attached to the MAC in
that case, but a switch.
This brings us to the topic of the patch: the enetc driver should not
have enabled the optional in-band status signaling for RGMII unconditionally,
but should have forced the speed and duplex to what was resolved by
phylink.
Note that phylink does not accept the RGMII modes as valid for in-band
signaling, and these operate a bit differently than 1000base-x and SGMII
(notably there is no clause 37 state machine so no ACK required from the
MAC, instead the PHY sends extra code words on RXD[3:0] whenever it is
not transmitting something else, so it should be safe to leave a PHY
with this option unconditionally enabled even if we ignore it). The spec
talks about this here:
https://e2e.ti.com/cfs-file/__key/communityserver-discussions-components-files/138/RGMIIv1_5F00_3.pdf
Fixes: 71b77a7a27a3 ("enetc: Migrate to PHYLINK and PCS_LYNX")
Cc: Florian Fainelli <f.fainelli@gmail.com>
Cc: Andrew Lunn <andrew@lunn.ch>
Cc: Russell King <rmk+kernel@armlinux.org.uk>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Acked-by: Russell King <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Quoting from the blamed commit:
In promiscuous mode, it is more intuitive that all traffic is received,
including VLAN tagged traffic. It appears that it is necessary to set
the flag in PSIPVMR for that to be the case, so VLAN promiscuous mode is
also temporarily enabled. On exit from promiscuous mode, the setting
made by ethtool is restored.
Intuitive or not, there isn't any definition issued by a standards body
which says that promiscuity has anything to do with VLAN filtering - it
only has to do with accepting packets regardless of destination MAC address.
In fact people are already trying to use this misunderstanding/bug of
the enetc driver as a justification to transform promiscuity into
something it never was about: accepting every packet (maybe that would
be the "rx-all" netdev feature?):
https://lore.kernel.org/netdev/20201110153958.ci5ekor3o2ekg3ky@ipetronik.com/
This is relevant because there are use cases in the kernel (such as
tc-flower rules with the protocol 802.1Q and a vlan_id key) which do not
(yet) use the vlan_vid_add API to be compatible with VLAN-filtering NICs
such as enetc, so for those, disabling rx-vlan-filter is currently the
only right solution to make these setups work:
https://lore.kernel.org/netdev/CA+h21hoxwRdhq4y+w8Kwgm74d4cA0xLeiHTrmT-VpSaM7obhkg@mail.gmail.com/
The blamed patch has unintentionally introduced one more way for this to
work, which is to enable IFF_PROMISC, however this is non-portable
because port promiscuity is not meant to disable VLAN filtering.
Therefore, it could invite people to write broken scripts for enetc, and
then wonder why they are broken when migrating to other drivers that
don't handle promiscuity in the same way.
Fixes: 7070eea5e95a ("enetc: permit configuration of rx-vlan-filter with ethtool")
Cc: Markus Blöchl <Markus.Bloechl@ipetronik.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
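The distinction argued above, that promiscuity relaxes only the destination MAC check while VLAN filtering is a separate decision, can be written as a two-stage predicate. This is a conceptual sketch with hypothetical names; it is not how enetc or any particular NIC implements its filters, and letting untagged frames pass the VLAN stage is an assumption of the model.

```c
#include <stdbool.h>

/*
 * Conceptual receive decision for one frame.
 *
 * promisc:     port is in promiscuous mode
 * dmac_match:  destination MAC matches one of the port's addresses
 * vlan_filter: rx-vlan-filter is enabled
 * tagged:      frame carries a VLAN tag
 * vid_known:   the frame's VLAN ID is in the filter table
 */
bool toy_rx_accept(bool promisc, bool dmac_match,
		   bool vlan_filter, bool tagged, bool vid_known)
{
	/* Stage 1: promiscuity only bypasses the MAC address check. */
	bool mac_ok = promisc || dmac_match;

	/* Stage 2: VLAN filtering is independent of promiscuity. */
	bool vlan_ok = !vlan_filter || !tagged || vid_known;

	return mac_ok && vlan_ok;
}
```

Under this model a promiscuous port with rx-vlan-filter enabled still drops a frame tagged with an unknown VLAN ID. Only disabling rx-vlan-filter lets it through, which is the portable behaviour the message argues for.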
|
|
When the enetc ports have rx-vlan-offload enabled, they report a TPID of
ETH_P_8021Q regardless of what was actually in the packet. When
rx-vlan-offload is disabled, packets have the proper TPID. Fix this
inconsistency by finishing the TODO left in the code.
Fixes: d4fd0404c1c9 ("enetc: Introduce basic PF and VF ENETC ethernet drivers")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|