Age | Commit message (Collapse) | Author |
|
Currently, this support is limited to active-backup mode, as I'm not sure
about the feasilibity of mapping an xfrm_state's offload handle to
multiple hardware devices simultaneously, and we rely on being able to
pass some hints to both the xfrm and NIC driver about whether or not
they're operating on a slave device.
I've tested this atop an Intel x520 device (ixgbe) using libreswan in
transport mode, succesfully achieving ~4.3Gbps throughput with netperf
(more or less identical to throughput on a bare NIC in this system),
as well as successful failover and recovery mid-netperf.
v2: just use CONFIG_XFRM_OFFLOAD for wrapping, isolate more code with it
CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Veaceslav Falico <vfalico@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
CC: "David S. Miller" <davem@davemloft.net>
CC: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
CC: Jakub Kicinski <kuba@kernel.org>
CC: Steffen Klassert <steffen.klassert@secunet.com>
CC: Herbert Xu <herbert@gondor.apana.org.au>
CC: netdev@vger.kernel.org
CC: intel-wired-lan@lists.osuosl.org
Signed-off-by: Jarod Wilson <jarod@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The dynamic key update for addr_list_lock still causes troubles,
for example the following race condition still exists:
CPU 0: CPU 1:
(RCU read lock) (RTNL lock)
dev_mc_seq_show() netdev_update_lockdep_key()
-> lockdep_unregister_key()
-> netif_addr_lock_bh()
because lockdep doesn't provide an API to update it atomically.
Therefore, we have to move it back to static keys and use subclass
for nest locking like before.
In commit 1a33e10e4a95 ("net: partially revert dynamic lockdep key
changes"), I already reverted most parts of commit ab92d68fc22f
("net: core: add generic lockdep keys").
This patch reverts the rest and also part of commit f3b0a18bb6cb
("net: remove unnecessary variables and callback"). After this
patch, addr_list_lock changes back to using static keys and
subclasses to satisfy lockdep. Thanks to dev->lower_level, we do
not have to change back to ->ndo_get_lock_subclass().
And hopefully this reduces some syzbot lockdep noises too.
Reported-by: syzbot+f3a0e80c34b3fc28ac5e@syzkaller.appspotmail.com
Cc: Taehee Yoo <ap420073@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux
This merge includes updates to bonding driver needed for the rdma stack,
to avoid conflicts with the RDMA branch.
Maor Gottlieb Says:
====================
Bonding: Add support to get xmit slave
The following series adds support to get the LAG master xmit slave by
introducing new .ndo - ndo_get_xmit_slave. Every LAG module can
implement it and it first implemented in the bond driver.
This is follow-up to the RFC discussion [1].
The main motivation for doing this is for drivers that offload part
of the LAG functionality. For example, Mellanox Connect-X hardware
implements RoCE LAG which selects the TX affinity when the resources
are created and port is remapped when it goes down.
The first part of this patchset introduces the new .ndo and add the
support to the bonding module.
The second part adds support to get the RoCE LAG xmit slave by building
skb of the RoCE packet based on the AH attributes and call to the new
.ndo.
The third part change the mlx5 driver driver to set the QP's affinity
port according to the slave which found by the .ndo.
====================
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
|
Currently, bonding always returns NETDEV_TX_OK to its caller.
It is worth trying to be more accurate : TCP for instance
can have different recovery strategies if it can have more
precise status, if packet was dropped by slave qdisc.
This is especially important when host is under stress.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jay Vosburgh <j.vosburgh@gmail.com>
Cc: Veaceslav Falico <vfalico@gmail.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
After commit b3e80d44f5b1
("bonding: fix lockdep warning in bond_get_stats()") the dynamic
key is no longer necessary, as we compute nest level at run-time.
So, we can just remove it to save some lockdep key entries.
Test commands:
ip link add bond0 type bond
ip link add bond1 type bond
ip link set bond0 master bond1
ip link set bond0 nomaster
ip link set bond1 master bond0
Reported-and-tested-by: syzbot+aaa6fa4949cc5d9b7b25@syzkaller.appspotmail.com
Cc: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This patch reverts the folowing commits:
commit 064ff66e2bef84f1153087612032b5b9eab005bd
"bonding: add missing netdev_update_lockdep_key()"
commit 53d374979ef147ab51f5d632dfe20b14aebeccd0
"net: avoid updating qdisc_xmit_lock_key in netdev_update_lockdep_key()"
commit 1f26c0d3d24125992ab0026b0dab16c08df947c7
"net: fix kernel-doc warning in <linux/netdevice.h>"
commit ab92d68fc22f9afab480153bd82a20f6e2533769
"net: core: add generic lockdep keys"
but keeps the addr_list_lock_key because we still lock
addr_list_lock nestedly on stack devices, unlikely xmit_lock
this is safe because we don't take addr_list_lock on any fast
path.
Reported-and-tested-by: syzbot+aaa6fa4949cc5d9b7b25@syzkaller.appspotmail.com
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add implementation of ndo_get_xmit_slave. Find the slave by using the
helper function according to the bond mode. If the caller set all_slaves
to true, then it assumes that all slaves are available to transmit.
Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Reviewed-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
|
Keep all slaves in array so it could be used to get the xmit slave
assume all the slaves are active.
The logic to add slave to the array is like the usable slaves, except
that we also add slaves that currently can't transmit - not up or active.
Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
|
Add helper function to get the xmit slave in active-backup mode.
It's only one line function that return the curr_active_slave,
but it will used both in the xmit flow and by the new .ndo to get
the xmit slave.
Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
|
Add helper function to get the xmit slave when bond is in round
robin mode. Change bond_xmit_slave_id to bond_get_slave_by_id, then
the logic for find the next slave for transmit could be used
both by the xmit flow and the .ndo to get the xmit slave.
Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
|
Both xor and 802.3ad modes use bond_xmit_hash to get the xmit slave.
Export the logic to helper function so it could be used in the
following patches by the .ndo to get the xmit slave.
Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
|
Rename slave_arr to usable_slaves, since we will have two arrays,
one for the usable slaves and the other to all slaves.
Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
|
As a preparation for following change that add array of
all slaves, extract code that skip slave to function.
Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
|
The in-kernel code has already unique version, which is based
on Linus's tag, update the bond driver to be consistent with that
version.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Conflict resolution of ice_virtchnl_pf.c based upon work by
Stephen Rothwell.
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Trivial cleanup, so that all bridge port-specific code can be found in
one go.
CC: Johannes Berg <johannes@sipsolutions.net>
CC: Roopa Prabhu <roopa@cumulusnetworks.com>
CC: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
In the "struct bonding", there is stats_lock.
This lock protects "bond_stats" in the "struct bonding".
bond_stats is updated in the bond_get_stats() and this function would be
executed concurrently. So, the lock is needed.
Bonding interfaces would be nested.
So, either stats_lock should use dynamic lockdep class key or stats_lock
should be used by spin_lock_nested(). In the current code, stats_lock is
using a dynamic lockdep class key.
But there is no updating stats_lock_key routine So, lockdep warning
will occur.
Test commands:
ip link add bond0 type bond
ip link add bond1 type bond
ip link set bond0 master bond1
ip link set bond0 nomaster
ip link set bond1 master bond0
Splat looks like:
[ 38.420603][ T957] 5.5.0+ #394 Not tainted
[ 38.421074][ T957] ------------------------------------------------------
[ 38.421837][ T957] ip/957 is trying to acquire lock:
[ 38.422399][ T957] ffff888063262cd8 (&bond->stats_lock_key#2){+.+.}, at: bond_get_stats+0x90/0x4d0 [bonding]
[ 38.423528][ T957]
[ 38.423528][ T957] but task is already holding lock:
[ 38.424526][ T957] ffff888065fd2cd8 (&bond->stats_lock_key){+.+.}, at: bond_get_stats+0x90/0x4d0 [bonding]
[ 38.426075][ T957]
[ 38.426075][ T957] which lock already depends on the new lock.
[ 38.426075][ T957]
[ 38.428536][ T957]
[ 38.428536][ T957] the existing dependency chain (in reverse order) is:
[ 38.429475][ T957]
[ 38.429475][ T957] -> #1 (&bond->stats_lock_key){+.+.}:
[ 38.430273][ T957] _raw_spin_lock+0x30/0x70
[ 38.430812][ T957] bond_get_stats+0x90/0x4d0 [bonding]
[ 38.431451][ T957] dev_get_stats+0x1ec/0x270
[ 38.432088][ T957] bond_get_stats+0x1a5/0x4d0 [bonding]
[ 38.432767][ T957] dev_get_stats+0x1ec/0x270
[ 38.433322][ T957] rtnl_fill_stats+0x44/0xbe0
[ 38.433866][ T957] rtnl_fill_ifinfo+0xeb2/0x3720
[ 38.434474][ T957] rtmsg_ifinfo_build_skb+0xca/0x170
[ 38.435081][ T957] rtmsg_ifinfo_event.part.33+0x1b/0xb0
[ 38.436848][ T957] rtnetlink_event+0xcd/0x120
[ 38.437455][ T957] notifier_call_chain+0x90/0x160
[ 38.438067][ T957] netdev_change_features+0x74/0xa0
[ 38.438708][ T957] bond_compute_features.isra.45+0x4e6/0x6f0 [bonding]
[ 38.439522][ T957] bond_enslave+0x3639/0x47b0 [bonding]
[ 38.440225][ T957] do_setlink+0xaab/0x2ef0
[ 38.440786][ T957] __rtnl_newlink+0x9c5/0x1270
[ 38.441463][ T957] rtnl_newlink+0x65/0x90
[ 38.442075][ T957] rtnetlink_rcv_msg+0x4a8/0x890
[ 38.442774][ T957] netlink_rcv_skb+0x121/0x350
[ 38.443451][ T957] netlink_unicast+0x42e/0x610
[ 38.444282][ T957] netlink_sendmsg+0x65a/0xb90
[ 38.444992][ T957] ____sys_sendmsg+0x5ce/0x7a0
[ 38.445679][ T957] ___sys_sendmsg+0x10f/0x1b0
[ 38.446365][ T957] __sys_sendmsg+0xc6/0x150
[ 38.447007][ T957] do_syscall_64+0x99/0x4f0
[ 38.447668][ T957] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 38.448538][ T957]
[ 38.448538][ T957] -> #0 (&bond->stats_lock_key#2){+.+.}:
[ 38.449554][ T957] __lock_acquire+0x2d8d/0x3de0
[ 38.450148][ T957] lock_acquire+0x164/0x3b0
[ 38.450711][ T957] _raw_spin_lock+0x30/0x70
[ 38.451292][ T957] bond_get_stats+0x90/0x4d0 [bonding]
[ 38.451950][ T957] dev_get_stats+0x1ec/0x270
[ 38.452425][ T957] bond_get_stats+0x1a5/0x4d0 [bonding]
[ 38.453362][ T957] dev_get_stats+0x1ec/0x270
[ 38.453825][ T957] rtnl_fill_stats+0x44/0xbe0
[ 38.454390][ T957] rtnl_fill_ifinfo+0xeb2/0x3720
[ 38.456257][ T957] rtmsg_ifinfo_build_skb+0xca/0x170
[ 38.456998][ T957] rtmsg_ifinfo_event.part.33+0x1b/0xb0
[ 38.459351][ T957] rtnetlink_event+0xcd/0x120
[ 38.460086][ T957] notifier_call_chain+0x90/0x160
[ 38.460829][ T957] netdev_change_features+0x74/0xa0
[ 38.461752][ T957] bond_compute_features.isra.45+0x4e6/0x6f0 [bonding]
[ 38.462705][ T957] bond_enslave+0x3639/0x47b0 [bonding]
[ 38.463476][ T957] do_setlink+0xaab/0x2ef0
[ 38.464141][ T957] __rtnl_newlink+0x9c5/0x1270
[ 38.464897][ T957] rtnl_newlink+0x65/0x90
[ 38.465522][ T957] rtnetlink_rcv_msg+0x4a8/0x890
[ 38.466215][ T957] netlink_rcv_skb+0x121/0x350
[ 38.466895][ T957] netlink_unicast+0x42e/0x610
[ 38.467583][ T957] netlink_sendmsg+0x65a/0xb90
[ 38.468285][ T957] ____sys_sendmsg+0x5ce/0x7a0
[ 38.469202][ T957] ___sys_sendmsg+0x10f/0x1b0
[ 38.469884][ T957] __sys_sendmsg+0xc6/0x150
[ 38.470587][ T957] do_syscall_64+0x99/0x4f0
[ 38.471245][ T957] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 38.472093][ T957]
[ 38.472093][ T957] other info that might help us debug this:
[ 38.472093][ T957]
[ 38.473438][ T957] Possible unsafe locking scenario:
[ 38.473438][ T957]
[ 38.474898][ T957] CPU0 CPU1
[ 38.476234][ T957] ---- ----
[ 38.480171][ T957] lock(&bond->stats_lock_key);
[ 38.480808][ T957] lock(&bond->stats_lock_key#2);
[ 38.481791][ T957] lock(&bond->stats_lock_key);
[ 38.482754][ T957] lock(&bond->stats_lock_key#2);
[ 38.483416][ T957]
[ 38.483416][ T957] *** DEADLOCK ***
[ 38.483416][ T957]
[ 38.484505][ T957] 3 locks held by ip/957:
[ 38.485048][ T957] #0: ffffffffbccf6230 (rtnl_mutex){+.+.}, at: rtnetlink_rcv_msg+0x457/0x890
[ 38.486198][ T957] #1: ffff888065fd2cd8 (&bond->stats_lock_key){+.+.}, at: bond_get_stats+0x90/0x4d0 [bonding]
[ 38.487625][ T957] #2: ffffffffbc9254c0 (rcu_read_lock){....}, at: bond_get_stats+0x5/0x4d0 [bonding]
[ 38.488897][ T957]
[ 38.488897][ T957] stack backtrace:
[ 38.489646][ T957] CPU: 1 PID: 957 Comm: ip Not tainted 5.5.0+ #394
[ 38.490497][ T957] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[ 38.492810][ T957] Call Trace:
[ 38.493219][ T957] dump_stack+0x96/0xdb
[ 38.493709][ T957] check_noncircular+0x371/0x450
[ 38.494344][ T957] ? lookup_address+0x60/0x60
[ 38.494923][ T957] ? print_circular_bug.isra.35+0x310/0x310
[ 38.495699][ T957] ? hlock_class+0x130/0x130
[ 38.496334][ T957] ? __lock_acquire+0x2d8d/0x3de0
[ 38.496979][ T957] __lock_acquire+0x2d8d/0x3de0
[ 38.497607][ T957] ? register_lock_class+0x14d0/0x14d0
[ 38.498333][ T957] ? check_chain_key+0x236/0x5d0
[ 38.499003][ T957] lock_acquire+0x164/0x3b0
[ 38.499800][ T957] ? bond_get_stats+0x90/0x4d0 [bonding]
[ 38.500706][ T957] _raw_spin_lock+0x30/0x70
[ 38.501435][ T957] ? bond_get_stats+0x90/0x4d0 [bonding]
[ 38.502311][ T957] bond_get_stats+0x90/0x4d0 [bonding]
[ ... ]
But, there is another problem.
The dynamic lockdep class key is protected by RTNL, but bond_get_stats()
would be called outside of RTNL.
So, it would use an invalid dynamic lockdep class key.
In order to fix this issue, stats_lock uses spin_lock_nested() instead of
a dynamic lockdep key.
The bond_get_stats() calls bond_get_lowest_level_rcu() to get the correct
nest level value, which will be used by spin_lock_nested().
The "dev->lower_level" indicates lower nest level value, but this value
is invalid outside of RTNL.
So, bond_get_lowest_level_rcu() returns valid lower nest level value in
the RCU critical section.
bond_get_lowest_level_rcu() will be work only when LOCKDEP is enabled.
Fixes: 089bca2caed0 ("bonding: use dynamic lockdep key instead of subclass")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
After bond_release(), netdev_update_lockdep_key() should be called.
But both ioctl path and attribute path don't call
netdev_update_lockdep_key().
This patch adds missing netdev_update_lockdep_key().
Test commands:
ip link add bond0 type bond
ip link add bond1 type bond
ifenslave bond0 bond1
ifenslave -d bond0 bond1
ifenslave bond1 bond0
Splat looks like:
[ 29.501182][ T1046] WARNING: possible circular locking dependency detected
[ 29.501945][ T1039] hardirqs last disabled at (1962): [<ffffffffac6c807f>] handle_mm_fault+0x13f/0x700
[ 29.503442][ T1046] 5.5.0+ #322 Not tainted
[ 29.503447][ T1046] ------------------------------------------------------
[ 29.504277][ T1039] softirqs last enabled at (1180): [<ffffffffade00678>] __do_softirq+0x678/0x981
[ 29.505443][ T1046] ifenslave/1046 is trying to acquire lock:
[ 29.505886][ T1039] softirqs last disabled at (1169): [<ffffffffac19c18a>] irq_exit+0x17a/0x1a0
[ 29.509997][ T1046] ffff88805d5da280 (&dev->addr_list_lock_key#3){+...}, at: dev_mc_sync_multiple+0x95/0x120
[ 29.511243][ T1046]
[ 29.511243][ T1046] but task is already holding lock:
[ 29.512192][ T1046] ffff8880460f2280 (&dev->addr_list_lock_key#4){+...}, at: bond_enslave+0x4482/0x47b0 [bonding]
[ 29.514124][ T1046]
[ 29.514124][ T1046] which lock already depends on the new lock.
[ 29.514124][ T1046]
[ 29.517297][ T1046]
[ 29.517297][ T1046] the existing dependency chain (in reverse order) is:
[ 29.518231][ T1046]
[ 29.518231][ T1046] -> #1 (&dev->addr_list_lock_key#4){+...}:
[ 29.519076][ T1046] _raw_spin_lock+0x30/0x70
[ 29.519588][ T1046] dev_mc_sync_multiple+0x95/0x120
[ 29.520208][ T1046] bond_enslave+0x448d/0x47b0 [bonding]
[ 29.520862][ T1046] bond_option_slaves_set+0x1a3/0x370 [bonding]
[ 29.521640][ T1046] __bond_opt_set+0x1ff/0xbb0 [bonding]
[ 29.522438][ T1046] __bond_opt_set_notify+0x2b/0xf0 [bonding]
[ 29.523251][ T1046] bond_opt_tryset_rtnl+0x92/0xf0 [bonding]
[ 29.524082][ T1046] bonding_sysfs_store_option+0x8a/0xf0 [bonding]
[ 29.524959][ T1046] kernfs_fop_write+0x276/0x410
[ 29.525620][ T1046] vfs_write+0x197/0x4a0
[ 29.526218][ T1046] ksys_write+0x141/0x1d0
[ 29.526818][ T1046] do_syscall_64+0x99/0x4f0
[ 29.527430][ T1046] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 29.528265][ T1046]
[ 29.528265][ T1046] -> #0 (&dev->addr_list_lock_key#3){+...}:
[ 29.529272][ T1046] __lock_acquire+0x2d8d/0x3de0
[ 29.529935][ T1046] lock_acquire+0x164/0x3b0
[ 29.530638][ T1046] _raw_spin_lock+0x30/0x70
[ 29.531187][ T1046] dev_mc_sync_multiple+0x95/0x120
[ 29.531790][ T1046] bond_enslave+0x448d/0x47b0 [bonding]
[ 29.532451][ T1046] bond_option_slaves_set+0x1a3/0x370 [bonding]
[ 29.533163][ T1046] __bond_opt_set+0x1ff/0xbb0 [bonding]
[ 29.533789][ T1046] __bond_opt_set_notify+0x2b/0xf0 [bonding]
[ 29.534595][ T1046] bond_opt_tryset_rtnl+0x92/0xf0 [bonding]
[ 29.535500][ T1046] bonding_sysfs_store_option+0x8a/0xf0 [bonding]
[ 29.536379][ T1046] kernfs_fop_write+0x276/0x410
[ 29.537057][ T1046] vfs_write+0x197/0x4a0
[ 29.537640][ T1046] ksys_write+0x141/0x1d0
[ 29.538251][ T1046] do_syscall_64+0x99/0x4f0
[ 29.538870][ T1046] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 29.539659][ T1046]
[ 29.539659][ T1046] other info that might help us debug this:
[ 29.539659][ T1046]
[ 29.540953][ T1046] Possible unsafe locking scenario:
[ 29.540953][ T1046]
[ 29.541883][ T1046] CPU0 CPU1
[ 29.542540][ T1046] ---- ----
[ 29.543209][ T1046] lock(&dev->addr_list_lock_key#4);
[ 29.543880][ T1046] lock(&dev->addr_list_lock_key#3);
[ 29.544873][ T1046] lock(&dev->addr_list_lock_key#4);
[ 29.545863][ T1046] lock(&dev->addr_list_lock_key#3);
[ 29.546525][ T1046]
[ 29.546525][ T1046] *** DEADLOCK ***
[ 29.546525][ T1046]
[ 29.547542][ T1046] 5 locks held by ifenslave/1046:
[ 29.548196][ T1046] #0: ffff88806044c478 (sb_writers#5){.+.+}, at: vfs_write+0x3bb/0x4a0
[ 29.549248][ T1046] #1: ffff88805af00890 (&of->mutex){+.+.}, at: kernfs_fop_write+0x1cf/0x410
[ 29.550343][ T1046] #2: ffff88805b8b54b0 (kn->count#157){.+.+}, at: kernfs_fop_write+0x1f2/0x410
[ 29.551575][ T1046] #3: ffffffffaecf4cf0 (rtnl_mutex){+.+.}, at: bond_opt_tryset_rtnl+0x5f/0xf0 [bonding]
[ 29.552819][ T1046] #4: ffff8880460f2280 (&dev->addr_list_lock_key#4){+...}, at: bond_enslave+0x4482/0x47b0 [bonding]
[ 29.554175][ T1046]
[ 29.554175][ T1046] stack backtrace:
[ 29.554907][ T1046] CPU: 0 PID: 1046 Comm: ifenslave Not tainted 5.5.0+ #322
[ 29.555854][ T1046] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[ 29.557064][ T1046] Call Trace:
[ 29.557504][ T1046] dump_stack+0x96/0xdb
[ 29.558054][ T1046] check_noncircular+0x371/0x450
[ 29.558723][ T1046] ? print_circular_bug.isra.35+0x310/0x310
[ 29.559486][ T1046] ? hlock_class+0x130/0x130
[ 29.560100][ T1046] ? __lock_acquire+0x2d8d/0x3de0
[ 29.560761][ T1046] __lock_acquire+0x2d8d/0x3de0
[ 29.561366][ T1046] ? register_lock_class+0x14d0/0x14d0
[ 29.562045][ T1046] ? find_held_lock+0x39/0x1d0
[ 29.562641][ T1046] lock_acquire+0x164/0x3b0
[ 29.563199][ T1046] ? dev_mc_sync_multiple+0x95/0x120
[ 29.563872][ T1046] _raw_spin_lock+0x30/0x70
[ 29.564464][ T1046] ? dev_mc_sync_multiple+0x95/0x120
[ 29.565146][ T1046] dev_mc_sync_multiple+0x95/0x120
[ 29.565793][ T1046] bond_enslave+0x448d/0x47b0 [bonding]
[ 29.566487][ T1046] ? bond_update_slave_arr+0x940/0x940 [bonding]
[ 29.567279][ T1046] ? bstr_printf+0xc20/0xc20
[ 29.567857][ T1046] ? stack_trace_consume_entry+0x160/0x160
[ 29.568614][ T1046] ? deactivate_slab.isra.77+0x2c5/0x800
[ 29.569320][ T1046] ? check_chain_key+0x236/0x5d0
[ 29.569939][ T1046] ? sscanf+0x93/0xc0
[ 29.570442][ T1046] ? vsscanf+0x1e20/0x1e20
[ 29.571003][ T1046] bond_option_slaves_set+0x1a3/0x370 [bonding]
[ ... ]
Fixes: ab92d68fc22f ("net: core: add generic lockdep keys")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
After the recent fix in commit 1899bb325149 ("bonding: fix state
transition issue in link monitoring"), the active-backup mode with
miimon initially come-up fine but after a link-failure, both members
transition into backup state.
Following steps to reproduce the scenario (eth1 and eth2 are the
slaves of the bond):
ip link set eth1 up
ip link set eth2 down
sleep 1
ip link set eth2 up
ip link set eth1 down
cat /sys/class/net/eth1/bonding_slave/state
cat /sys/class/net/eth2/bonding_slave/state
Fixes: 1899bb325149 ("bonding: fix state transition issue in link monitoring")
CC: Jay Vosburgh <jay.vosburgh@canonical.com>
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Acked-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
|
|
1) syzbot reported an uninit-value in bond_neigh_setup() [1]
bond_neigh_setup() uses a temporary on-stack 'struct neigh_parms parms',
but only clears parms.neigh_setup field.
A stacked bonding device would then enter bond_neigh_setup()
and read garbage from parms->dev.
If we get really unlucky and garbage is matching @dev, then we
could recurse and eventually crash.
Let's make sure the whole structure is cleared to avoid surprises.
2) bond_neigh_setup() can be called while another cpu manipulates
the master device, removing or adding a slave.
We need at least rcu protection to prevent use-after-free.
Note: Prior code does not support a stack of bonding devices,
this patch does not attempt to fix this, and leave a comment instead.
[1]
BUG: KMSAN: uninit-value in bond_neigh_setup+0xa4/0x110 drivers/net/bonding/bond_main.c:3655
CPU: 0 PID: 11256 Comm: syz-executor.0 Not tainted 5.4.0-rc8-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
<IRQ>
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x1c9/0x220 lib/dump_stack.c:118
kmsan_report+0x128/0x220 mm/kmsan/kmsan_report.c:108
__msan_warning+0x57/0xa0 mm/kmsan/kmsan_instr.c:245
bond_neigh_setup+0xa4/0x110 drivers/net/bonding/bond_main.c:3655
bond_neigh_init+0x216/0x4b0 drivers/net/bonding/bond_main.c:3626
___neigh_create+0x169e/0x2c40 net/core/neighbour.c:613
__neigh_create+0xbd/0xd0 net/core/neighbour.c:674
ip6_finish_output2+0x149a/0x2670 net/ipv6/ip6_output.c:113
__ip6_finish_output+0x83d/0x8f0 net/ipv6/ip6_output.c:142
ip6_finish_output+0x2db/0x420 net/ipv6/ip6_output.c:152
NF_HOOK_COND include/linux/netfilter.h:294 [inline]
ip6_output+0x5d3/0x720 net/ipv6/ip6_output.c:175
dst_output include/net/dst.h:436 [inline]
NF_HOOK include/linux/netfilter.h:305 [inline]
mld_sendpack+0xebd/0x13d0 net/ipv6/mcast.c:1682
mld_send_cr net/ipv6/mcast.c:1978 [inline]
mld_ifc_timer_expire+0x116b/0x1680 net/ipv6/mcast.c:2477
call_timer_fn+0x232/0x530 kernel/time/timer.c:1404
expire_timers kernel/time/timer.c:1449 [inline]
__run_timers+0xd60/0x1270 kernel/time/timer.c:1773
run_timer_softirq+0x2d/0x50 kernel/time/timer.c:1786
__do_softirq+0x4a1/0x83a kernel/softirq.c:293
invoke_softirq kernel/softirq.c:375 [inline]
irq_exit+0x230/0x280 kernel/softirq.c:416
exiting_irq+0xe/0x10 arch/x86/include/asm/apic.h:536
smp_apic_timer_interrupt+0x48/0x70 arch/x86/kernel/apic/apic.c:1138
apic_timer_interrupt+0x2e/0x40 arch/x86/entry/entry_64.S:835
</IRQ>
RIP: 0010:kmsan_free_page+0x18d/0x1c0 mm/kmsan/kmsan_shadow.c:439
Code: 4c 89 ff 44 89 f6 e8 82 0d ee ff 65 ff 0d 9f 26 3b 60 65 8b 05 98 26 3b 60 85 c0 75 24 e8 5b f6 35 ff 4c 89 6d d0 ff 75 d0 9d <48> 83 c4 10 5b 41 5c 41 5d 41 5e 41 5f 5d c3 0f 0b 0f 0b 0f 0b 0f
RSP: 0018:ffffb328034af818 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
RAX: 0000000000000000 RBX: ffffe2d7471f8360 RCX: 0000000000000000
RDX: ffffffffadea7000 RSI: 0000000000000004 RDI: ffff93496fcda104
RBP: ffffb328034af850 R08: ffff934a47e86d00 R09: ffff93496fc41900
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000001
R13: 0000000000000246 R14: 0000000000000000 R15: ffffe2d7472225c0
free_pages_prepare mm/page_alloc.c:1138 [inline]
free_pcp_prepare mm/page_alloc.c:1230 [inline]
free_unref_page_prepare+0x1d9/0x770 mm/page_alloc.c:3025
free_unref_page mm/page_alloc.c:3074 [inline]
free_the_page mm/page_alloc.c:4832 [inline]
__free_pages+0x154/0x230 mm/page_alloc.c:4840
__vunmap+0xdac/0xf20 mm/vmalloc.c:2277
__vfree mm/vmalloc.c:2325 [inline]
vfree+0x7c/0x170 mm/vmalloc.c:2355
copy_entries_to_user net/ipv6/netfilter/ip6_tables.c:883 [inline]
get_entries net/ipv6/netfilter/ip6_tables.c:1041 [inline]
do_ip6t_get_ctl+0xfa4/0x1030 net/ipv6/netfilter/ip6_tables.c:1709
nf_sockopt net/netfilter/nf_sockopt.c:104 [inline]
nf_getsockopt+0x481/0x4e0 net/netfilter/nf_sockopt.c:122
ipv6_getsockopt+0x264/0x510 net/ipv6/ipv6_sockglue.c:1400
tcp_getsockopt+0x1c6/0x1f0 net/ipv4/tcp.c:3688
sock_common_getsockopt+0x13f/0x180 net/core/sock.c:3110
__sys_getsockopt+0x533/0x7b0 net/socket.c:2129
__do_sys_getsockopt net/socket.c:2144 [inline]
__se_sys_getsockopt+0xe1/0x100 net/socket.c:2141
__x64_sys_getsockopt+0x62/0x80 net/socket.c:2141
do_syscall_64+0xb6/0x160 arch/x86/entry/common.c:291
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x45d20a
Code: b8 34 01 00 00 0f 05 48 3d 01 f0 ff ff 0f 83 8d 8b fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 49 89 ca b8 37 00 00 00 0f 05 <48> 3d 01 f0 ff ff 0f 83 6a 8b fb ff c3 66 0f 1f 84 00 00 00 00 00
RSP: 002b:0000000000a6f618 EFLAGS: 00000212 ORIG_RAX: 0000000000000037
RAX: ffffffffffffffda RBX: 0000000000a6f640 RCX: 000000000045d20a
RDX: 0000000000000041 RSI: 0000000000000029 RDI: 0000000000000003
RBP: 0000000000717cc0 R08: 0000000000a6f63c R09: 0000000000004000
R10: 0000000000a6f740 R11: 0000000000000212 R12: 0000000000000003
R13: 0000000000000000 R14: 0000000000000029 R15: 0000000000715b00
Local variable description: ----parms@bond_neigh_init
Variable was created at:
bond_neigh_init+0x8c/0x4b0 drivers/net/bonding/bond_main.c:3617
bond_neigh_init+0x8c/0x4b0 drivers/net/bonding/bond_main.c:3617
Fixes: 9918d5bf329d ("bonding: modify only neigh_parms owned by us")
Fixes: 234bcf8a499e ("net/bonding: correctly proxy slave neigh param setup ndo function")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Cc: Jay Vosburgh <j.vosburgh@gmail.com>
Cc: Veaceslav Falico <vfalico@gmail.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
neigh_cleanup() has not been used for seven years, and was a wrong design.
Messing with shared pointer in bond_neigh_init() without proper
memory barriers would at least trigger syzbot complains eventually.
It is time to remove this stuff.
Fixes: b63b70d87741 ("IPoIB: Use a private hash table for path lookup in xmit path")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
A bonding with layer2+3 or layer3+4 hashing uses the IP addresses and the ports
to balance packets between slaves. With some network errors, we receive an ICMP
error packet by the remote host or a router. If sent by a router, the source IP
can differ from the remote host one. Additionally the ICMP protocol has no port
numbers, so a layer3+4 bonding will get a different hash than the previous one.
These two conditions could let the packet go through a different interface than
the other packets of the same flow:
# tcpdump -qltnni veth0 |sed 's/^/0: /' &
# tcpdump -qltnni veth1 |sed 's/^/1: /' &
# hping3 -2 192.168.0.2 -p 9
0: IP 192.168.0.1.2251 > 192.168.0.2.9: UDP, length 0
1: IP 192.168.0.2 > 192.168.0.1: ICMP 192.168.0.2 udp port 9 unreachable, length 36
1: IP 192.168.0.1.2252 > 192.168.0.2.9: UDP, length 0
1: IP 192.168.0.2 > 192.168.0.1: ICMP 192.168.0.2 udp port 9 unreachable, length 36
1: IP 192.168.0.1.2253 > 192.168.0.2.9: UDP, length 0
1: IP 192.168.0.2 > 192.168.0.1: ICMP 192.168.0.2 udp port 9 unreachable, length 36
0: IP 192.168.0.1.2254 > 192.168.0.2.9: UDP, length 0
1: IP 192.168.0.2 > 192.168.0.1: ICMP 192.168.0.2 udp port 9 unreachable, length 36
An ICMP error packet contains the header of the packet which caused the network
error, so inspect it and match the flow against it, so we can send the ICMP via
the same interface of the previous packet in the flow.
Move the IP and port dissect code into a generic function bond_flow_ip() and if
we are dissecting an ICMP error packet, call it again with the adjusted offset.
# hping3 -2 192.168.0.2 -p 9
1: IP 192.168.0.1.1224 > 192.168.0.2.9: UDP, length 0
1: IP 192.168.0.2 > 192.168.0.1: ICMP 192.168.0.2 udp port 9 unreachable, length 36
1: IP 192.168.0.1.1225 > 192.168.0.2.9: UDP, length 0
1: IP 192.168.0.2 > 192.168.0.1: ICMP 192.168.0.2 udp port 9 unreachable, length 36
0: IP 192.168.0.1.1226 > 192.168.0.2.9: UDP, length 0
0: IP 192.168.0.2 > 192.168.0.1: ICMP 192.168.0.2 udp port 9 unreachable, length 36
0: IP 192.168.0.1.1227 > 192.168.0.2.9: UDP, length 0
0: IP 192.168.0.2 > 192.168.0.1: ICMP 192.168.0.2 udp port 9 unreachable, length 36
Signed-off-by: Matteo Croce <mcroce@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
One conflict in the BPF samples Makefile, some fixes in 'net' whilst
we were converting over to Makefile.target rules in 'net-next'.
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Since de77ecd4ef02 ("bonding: improve link-status update in
mii-monitoring"), the bonding driver has utilized two separate variables
to indicate the next link state a particular slave should transition to.
Each is used to communicate to a different portion of the link state
change commit logic; one to the bond_miimon_commit function itself, and
another to the state transition logic.
Unfortunately, the two variables can become unsynchronized,
resulting in incorrect link state transitions within bonding. This can
cause slaves to become stuck in an incorrect link state until a
subsequent carrier state transition.
The issue occurs when a special case in bond_slave_netdev_event
sets slave->link directly to BOND_LINK_FAIL. On the next pass through
bond_miimon_inspect after the slave goes carrier up, the BOND_LINK_FAIL
case will set the proposed next state (link_new_state) to BOND_LINK_UP,
but the new_link to BOND_LINK_DOWN. The setting of the final link state
from new_link comes after that from link_new_state, and so the slave
will end up incorrectly in _DOWN state.
Resolve this by combining the two variables into one.
Reported-by: Aleksei Zakharov <zakharov.a.g@yandex.ru>
Reported-by: Sha Zhang <zhangsha.zhang@huawei.com>
Cc: Mahesh Bandewar <maheshb@google.com>
Fixes: de77ecd4ef02 ("bonding: improve link-status update in mii-monitoring")
Signed-off-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The only slightly tricky merge conflict was the netdevsim because the
mutex locking fix overlapped a lot of driver reload reorganization.
The rest were (relatively) trivial in nature.
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The bonding uses the L4 ports to balance flows between slaves. As the ICMP
protocol has no ports, those packets are sent all to the same device:
# tcpdump -qltnni veth0 ip |sed 's/^/0: /' &
# tcpdump -qltnni veth1 ip |sed 's/^/1: /' &
# ping -qc1 192.168.0.2
1: IP 192.168.0.1 > 192.168.0.2: ICMP echo request, id 315, seq 1, length 64
1: IP 192.168.0.2 > 192.168.0.1: ICMP echo reply, id 315, seq 1, length 64
# ping -qc1 192.168.0.2
1: IP 192.168.0.1 > 192.168.0.2: ICMP echo request, id 316, seq 1, length 64
1: IP 192.168.0.2 > 192.168.0.1: ICMP echo reply, id 316, seq 1, length 64
# ping -qc1 192.168.0.2
1: IP 192.168.0.1 > 192.168.0.2: ICMP echo request, id 317, seq 1, length 64
1: IP 192.168.0.2 > 192.168.0.1: ICMP echo reply, id 317, seq 1, length 64
But some ICMP packets have an Identifier field which is
used to match packets within sessions, let's use this value in the hash
function to balance these packets between bond slaves:
# ping -qc1 192.168.0.2
0: IP 192.168.0.1 > 192.168.0.2: ICMP echo request, id 303, seq 1, length 64
0: IP 192.168.0.2 > 192.168.0.1: ICMP echo reply, id 303, seq 1, length 64
# ping -qc1 192.168.0.2
1: IP 192.168.0.1 > 192.168.0.2: ICMP echo request, id 304, seq 1, length 64
1: IP 192.168.0.2 > 192.168.0.1: ICMP echo reply, id 304, seq 1, length 64
Aso, let's use a flow_dissector_key which defines FLOW_DISSECTOR_KEY_ICMP,
so we can balance pings encapsulated in a tunnel when using mode encap3+4:
# ping -q 192.168.1.2 -c1
0: IP 192.168.0.1 > 192.168.0.2: GREv0, length 102: IP 192.168.1.1 > 192.168.1.2: ICMP echo request, id 585, seq 1, length 64
0: IP 192.168.0.2 > 192.168.0.1: GREv0, length 102: IP 192.168.1.2 > 192.168.1.1: ICMP echo reply, id 585, seq 1, length 64
# ping -q 192.168.1.2 -c1
1: IP 192.168.0.1 > 192.168.0.2: GREv0, length 102: IP 192.168.1.1 > 192.168.1.2: ICMP echo request, id 586, seq 1, length 64
1: IP 192.168.0.2 > 192.168.0.1: GREv0, length 102: IP 192.168.1.2 > 192.168.1.1: ICMP echo reply, id 586, seq 1, length 64
Signed-off-by: Matteo Croce <mcroce@redhat.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
When a bonding interface is being created, it setups its mode and options.
At that moment, it uses mode_lock so mode_lock should be initialized
before that moment.
rtnl_newlink()
rtnl_create_link()
alloc_netdev_mqs()
->setup() //bond_setup()
->newlink //bond_newlink
bond_changelink()
register_netdevice()
->ndo_init() //bond_init()
After commit 089bca2caed0 ("bonding: use dynamic lockdep key instead of
subclass"), mode_lock is initialized in bond_init().
So in the bond_changelink(), un-initialized mode_lock can be used.
mode_lock should be initialized in bond_setup().
This patch partially reverts commit 089bca2caed0 ("bonding: use dynamic
lockdep key instead of subclass")
Test command:
ip link add bond0 type bond mode 802.3ad lacp_rate 0
Splat looks like:
[ 60.615127] INFO: trying to register non-static key.
[ 60.615900] the code is fine but needs lockdep annotation.
[ 60.616697] turning off the locking correctness validator.
[ 60.617490] CPU: 1 PID: 957 Comm: ip Not tainted 5.4.0-rc3+ #109
[ 60.618350] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[ 60.619481] Call Trace:
[ 60.619918] dump_stack+0x7c/0xbb
[ 60.620453] register_lock_class+0x1215/0x14d0
[ 60.621131] ? alloc_netdev_mqs+0x7b3/0xcc0
[ 60.621771] ? is_bpf_text_address+0x86/0xf0
[ 60.622416] ? is_dynamic_key+0x230/0x230
[ 60.623032] ? unwind_get_return_address+0x5f/0xa0
[ 60.623757] ? create_prof_cpu_mask+0x20/0x20
[ 60.624408] ? arch_stack_walk+0x83/0xb0
[ 60.625023] __lock_acquire+0xd8/0x3de0
[ 60.625616] ? stack_trace_save+0x82/0xb0
[ 60.626225] ? stack_trace_consume_entry+0x160/0x160
[ 60.626957] ? deactivate_slab.isra.80+0x2c5/0x800
[ 60.627668] ? register_lock_class+0x14d0/0x14d0
[ 60.628380] ? alloc_netdev_mqs+0x7b3/0xcc0
[ 60.629020] ? save_stack+0x69/0x80
[ 60.629574] ? save_stack+0x19/0x80
[ 60.630121] ? __kasan_kmalloc.constprop.4+0xa0/0xd0
[ 60.630859] ? __kmalloc_node+0x16f/0x480
[ 60.631472] ? alloc_netdev_mqs+0x7b3/0xcc0
[ 60.632121] ? rtnl_create_link+0x2ed/0xad0
[ 60.634388] ? __rtnl_newlink+0xad4/0x11b0
[ 60.635024] lock_acquire+0x164/0x3b0
[ 60.635608] ? bond_3ad_update_lacp_rate+0x91/0x200 [bonding]
[ 60.636463] _raw_spin_lock_bh+0x38/0x70
[ 60.637084] ? bond_3ad_update_lacp_rate+0x91/0x200 [bonding]
[ 60.637930] bond_3ad_update_lacp_rate+0x91/0x200 [bonding]
[ 60.638753] ? bond_3ad_lacpdu_recv+0xb30/0xb30 [bonding]
[ 60.639552] ? bond_opt_get_val+0x180/0x180 [bonding]
[ 60.640307] ? ___slab_alloc+0x5aa/0x610
[ 60.640925] bond_option_lacp_rate_set+0x71/0x140 [bonding]
[ 60.641751] __bond_opt_set+0x1ff/0xbb0 [bonding]
[ 60.643217] ? kasan_unpoison_shadow+0x30/0x40
[ 60.643924] bond_changelink+0x9a4/0x1700 [bonding]
[ 60.644653] ? memset+0x1f/0x40
[ 60.742941] ? bond_slave_changelink+0x1a0/0x1a0 [bonding]
[ 60.752694] ? alloc_netdev_mqs+0x8ea/0xcc0
[ 60.753330] ? rtnl_create_link+0x2ed/0xad0
[ 60.753964] bond_newlink+0x1e/0x60 [bonding]
[ 60.754612] __rtnl_newlink+0xb9f/0x11b0
[ ... ]
Reported-by: syzbot+8da67f407bcba2c72e6e@syzkaller.appspotmail.com
Reported-by: syzbot+0d083911ab18b710da71@syzkaller.appspotmail.com
Fixes: 089bca2caed0 ("bonding: use dynamic lockdep key instead of subclass")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This patch removes variables and callback these are related to the nested
device structure.
devices that can be nested have their own nest_level variable that
represents the depth of nested devices.
In the previous patch, new {lower/upper}_level variables are added and
they replace old private nest_level variable.
So, this patch removes all 'nest_level' variables.
In order to avoid lockdep warning, ->ndo_get_lock_subclass() was added
to get lockdep subclass value, which is actually lower nested depth value.
But now, they use the dynamic lockdep key to avoid lockdep warning instead
of the subclass.
So, this patch removes ->ndo_get_lock_subclass() callback.
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
All bonding device has same lockdep key and subclass is initialized with
nest_level.
But actual nest_level value can be changed when a lower device is attached.
And at this moment, the subclass should be updated but it seems to be
unsafe.
So this patch makes bonding use dynamic lockdep key instead of the
subclass.
Test commands:
ip link add bond0 type bond
for i in {1..5}
do
let A=$i-1
ip link add bond$i type bond
ip link set bond$i master bond$A
done
ip link set bond5 master bond0
Splat looks like:
[ 307.992912] WARNING: possible recursive locking detected
[ 307.993656] 5.4.0-rc3+ #96 Tainted: G W
[ 307.994367] --------------------------------------------
[ 307.995092] ip/761 is trying to acquire lock:
[ 307.995710] ffff8880513aac60 (&(&bond->stats_lock)->rlock#2/2){+.+.}, at: bond_get_stats+0xb8/0x500 [bonding]
[ 307.997045]
but task is already holding lock:
[ 307.997923] ffff88805fcbac60 (&(&bond->stats_lock)->rlock#2/2){+.+.}, at: bond_get_stats+0xb8/0x500 [bonding]
[ 307.999215]
other info that might help us debug this:
[ 308.000251] Possible unsafe locking scenario:
[ 308.001137] CPU0
[ 308.001533] ----
[ 308.001915] lock(&(&bond->stats_lock)->rlock#2/2);
[ 308.002609] lock(&(&bond->stats_lock)->rlock#2/2);
[ 308.003302]
*** DEADLOCK ***
[ 308.004310] May be due to missing lock nesting notation
[ 308.005319] 3 locks held by ip/761:
[ 308.005830] #0: ffffffff9fcc42b0 (rtnl_mutex){+.+.}, at: rtnetlink_rcv_msg+0x466/0x8a0
[ 308.006894] #1: ffff88805fcbac60 (&(&bond->stats_lock)->rlock#2/2){+.+.}, at: bond_get_stats+0xb8/0x500 [bonding]
[ 308.008243] #2: ffffffff9f9219c0 (rcu_read_lock){....}, at: bond_get_stats+0x9f/0x500 [bonding]
[ 308.009422]
stack backtrace:
[ 308.010124] CPU: 0 PID: 761 Comm: ip Tainted: G W 5.4.0-rc3+ #96
[ 308.011097] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[ 308.012179] Call Trace:
[ 308.012601] dump_stack+0x7c/0xbb
[ 308.013089] __lock_acquire+0x269d/0x3de0
[ 308.013669] ? register_lock_class+0x14d0/0x14d0
[ 308.014318] lock_acquire+0x164/0x3b0
[ 308.014858] ? bond_get_stats+0xb8/0x500 [bonding]
[ 308.015520] _raw_spin_lock_nested+0x2e/0x60
[ 308.016129] ? bond_get_stats+0xb8/0x500 [bonding]
[ 308.017215] bond_get_stats+0xb8/0x500 [bonding]
[ 308.018454] ? bond_arp_rcv+0xf10/0xf10 [bonding]
[ 308.019710] ? rcu_read_lock_held+0x90/0xa0
[ 308.020605] ? rcu_read_lock_sched_held+0xc0/0xc0
[ 308.021286] ? bond_get_stats+0x9f/0x500 [bonding]
[ 308.021953] dev_get_stats+0x1ec/0x270
[ 308.022508] bond_get_stats+0x1d1/0x500 [bonding]
Fixes: d3fff6c443fe ("net: add netdev_lockdep_set_classes() helper")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The IFF_BONDING means bonding master or bonding slave device.
->ndo_add_slave() sets IFF_BONDING flag and ->ndo_del_slave() unsets
IFF_BONDING flag.
bond0<--bond1
Both bond0 and bond1 are bonding device and these should keep having
IFF_BONDING flag until they are removed.
But bond1 would lose IFF_BONDING at ->ndo_del_slave() because that routine
do not check whether the slave device is the bonding type or not.
This patch adds the interface type check routine before removing
IFF_BONDING flag.
Test commands:
ip link add bond0 type bond
ip link add bond1 type bond
ip link set bond1 master bond0
ip link set bond1 nomaster
ip link del bond1 type bond
ip link add bond1 type bond
Splat looks like:
[ 226.665555] proc_dir_entry 'bonding/bond1' already registered
[ 226.666440] WARNING: CPU: 0 PID: 737 at fs/proc/generic.c:361 proc_register+0x2a9/0x3e0
[ 226.667571] Modules linked in: bonding af_packet sch_fq_codel ip_tables x_tables unix
[ 226.668662] CPU: 0 PID: 737 Comm: ip Not tainted 5.4.0-rc3+ #96
[ 226.669508] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[ 226.670652] RIP: 0010:proc_register+0x2a9/0x3e0
[ 226.671612] Code: 89 fa 48 c1 ea 03 80 3c 02 00 0f 85 39 01 00 00 48 8b 04 24 48 89 ea 48 c7 c7 a0 0b 14 9f 48 8b b0 e
0 00 00 00 e8 07 e7 88 ff <0f> 0b 48 c7 c7 40 2d a5 9f e8 59 d6 23 01 48 8b 4c 24 10 48 b8 00
[ 226.675007] RSP: 0018:ffff888050e17078 EFLAGS: 00010282
[ 226.675761] RAX: dffffc0000000008 RBX: ffff88805fdd0f10 RCX: ffffffff9dd344e2
[ 226.676757] RDX: 0000000000000001 RSI: 0000000000000008 RDI: ffff88806c9f6b8c
[ 226.677751] RBP: ffff8880507160f3 R08: ffffed100d940019 R09: ffffed100d940019
[ 226.678761] R10: 0000000000000001 R11: ffffed100d940018 R12: ffff888050716008
[ 226.679757] R13: ffff8880507160f2 R14: dffffc0000000000 R15: ffffed100a0e2c1e
[ 226.680758] FS: 00007fdc217cc0c0(0000) GS:ffff88806c800000(0000) knlGS:0000000000000000
[ 226.681886] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 226.682719] CR2: 00007f49313424d0 CR3: 0000000050e46001 CR4: 00000000000606f0
[ 226.683727] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 226.684725] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 226.685681] Call Trace:
[ 226.687089] proc_create_seq_private+0xb3/0xf0
[ 226.687778] bond_create_proc_entry+0x1b3/0x3f0 [bonding]
[ 226.691458] bond_netdev_event+0x433/0x970 [bonding]
[ 226.692139] ? __module_text_address+0x13/0x140
[ 226.692779] notifier_call_chain+0x90/0x160
[ 226.693401] register_netdevice+0x9b3/0xd80
[ 226.694010] ? alloc_netdev_mqs+0x854/0xc10
[ 226.694629] ? netdev_change_features+0xa0/0xa0
[ 226.695278] ? rtnl_create_link+0x2ed/0xad0
[ 226.695849] bond_newlink+0x2a/0x60 [bonding]
[ 226.696422] __rtnl_newlink+0xb9f/0x11b0
[ 226.696968] ? rtnl_link_unregister+0x220/0x220
[ ... ]
Fixes: 0b680e753724 ("[PATCH] bonding: Add priv_flag to avoid event mishandling")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Some interface types could be nested.
(VLAN, BONDING, TEAM, MACSEC, MACVLAN, IPVLAN, VIRT_WIFI, VXLAN, etc..)
These interface types should set lockdep class because, without lockdep
class key, lockdep always warn about unexisting circular locking.
In the current code, these interfaces have their own lockdep class keys and
these manage itself. So that there are so many duplicate code around the
/driver/net and /net/.
This patch adds new generic lockdep keys and some helper functions for it.
This patch does below changes.
a) Add lockdep class keys in struct net_device
- qdisc_running, xmit, addr_list, qdisc_busylock
- these keys are used as dynamic lockdep key.
b) When net_device is being allocated, lockdep keys are registered.
- alloc_netdev_mqs()
c) When net_device is being free'd llockdep keys are unregistered.
- free_netdev()
d) Add generic lockdep key helper function
- netdev_register_lockdep_key()
- netdev_unregister_lockdep_key()
- netdev_update_lockdep_key()
e) Remove unnecessary generic lockdep macro and functions
f) Remove unnecessary lockdep code of each interfaces.
After this patch, each interface modules don't need to maintain
their lockdep keys.
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
syzbot got a NULL dereference in bond_update_slave_arr() [1],
happening after a failure to allocate bond->slave_arr
A workqueue (bond_slave_arr_handler) is supposed to retry
the allocation later, but if the slave is removed before
the workqueue had a chance to complete, bond->slave_arr
can still be NULL.
[1]
Failed to build slave-array.
kasan: CONFIG_KASAN_INLINE enabled
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] SMP KASAN PTI
Modules linked in:
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:bond_update_slave_arr.cold+0xc6/0x198 drivers/net/bonding/bond_main.c:4039
RSP: 0018:ffff88018fe33678 EFLAGS: 00010246
RAX: dffffc0000000000 RBX: 0000000000000000 RCX: ffffc9000290b000
RDX: 0000000000000000 RSI: ffffffff82b63037 RDI: ffff88019745ea20
RBP: ffff88018fe33760 R08: ffff880170754280 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: ffff88019745ea00 R14: 0000000000000000 R15: ffff88018fe338b0
FS: 00007febd837d700(0000) GS:ffff8801dad00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000004540a0 CR3: 00000001c242e005 CR4: 00000000001626f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
[<ffffffff82b5b45e>] __bond_release_one+0x43e/0x500 drivers/net/bonding/bond_main.c:1923
[<ffffffff82b5b966>] bond_release drivers/net/bonding/bond_main.c:2039 [inline]
[<ffffffff82b5b966>] bond_do_ioctl+0x416/0x870 drivers/net/bonding/bond_main.c:3562
[<ffffffff83ae25f4>] dev_ifsioc+0x6f4/0x940 net/core/dev_ioctl.c:328
[<ffffffff83ae2e58>] dev_ioctl+0x1b8/0xc70 net/core/dev_ioctl.c:495
[<ffffffff83995ffd>] sock_do_ioctl+0x1bd/0x300 net/socket.c:1088
[<ffffffff83996a80>] sock_ioctl+0x300/0x5d0 net/socket.c:1196
[<ffffffff81b124db>] vfs_ioctl fs/ioctl.c:47 [inline]
[<ffffffff81b124db>] file_ioctl fs/ioctl.c:501 [inline]
[<ffffffff81b124db>] do_vfs_ioctl+0xacb/0x1300 fs/ioctl.c:688
[<ffffffff81b12dc6>] SYSC_ioctl fs/ioctl.c:705 [inline]
[<ffffffff81b12dc6>] SyS_ioctl+0xb6/0xe0 fs/ioctl.c:696
[<ffffffff8101ccc8>] do_syscall_64+0x528/0x770 arch/x86/entry/common.c:305
[<ffffffff84400091>] entry_SYSCALL_64_after_hwframe+0x42/0xb7
Fixes: ee6377147409 ("bonding: Simplify the xmit function for modes that use xmit_hash")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Cc: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
|
|
As commit 30d8177e8ac7 ("bonding: Always enable vlan tx offload")
said, we should always enable bonding's vlan tx offload, pass the
vlan packets to the slave devices with vlan tci, let them to handle
vlan implementation.
Now if encapsulation protocols like VXLAN is used, skb->encapsulation
may be set, then the packet is passed to vlan device which based on
bonding device. However in netif_skb_features(), the check of
hw_enc_features:
if (skb->encapsulation)
features &= dev->hw_enc_features;
clears NETIF_F_HW_VLAN_CTAG_TX/NETIF_F_HW_VLAN_STAG_TX. This results
in same issue in commit 30d8177e8ac7 like this:
vlan_dev_hard_start_xmit
-->dev_queue_xmit
-->validate_xmit_skb
-->netif_skb_features //NETIF_F_HW_VLAN_CTAG_TX is cleared
-->validate_xmit_vlan
-->__vlan_hwaccel_push_inside //skb->tci is cleared
...
--> bond_start_xmit
--> bond_xmit_hash //BOND_XMIT_POLICY_ENCAP34
--> __skb_flow_dissect // nhoff point to IP header
--> case htons(ETH_P_8021Q)
// skb_vlan_tag_present is false, so
vlan = __skb_header_pointer(skb, nhoff, sizeof(_vlan),
//vlan point to ip header wrongly
Fixes: b2a103e6d0af ("bonding: convert to ndo_fix_features")
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Acked-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The following scenario was encountered during testing of logical
partition mobility on pseries partitions with bonded ibmvnic
adapters in LACP mode.
1. Driver receives a signal that the device has been
swapped, and it needs to reset to initialize the new
device.
2. Driver reports loss of carrier and begins initialization.
3. Bonding driver receives NETDEV_CHANGE notifier and checks
the slave's current speed and duplex settings. Because these
are unknown at the time, the bond sets its link state to
BOND_LINK_FAIL and handles the speed update, clearing
AD_PORT_LACP_ENABLE.
4. Driver finishes recovery and reports that the carrier is on.
5. Bond receives a new notification and checks the speed again.
The speeds are valid but miimon has not altered the link
state yet. AD_PORT_LACP_ENABLE remains off.
Because the slave's link state is still BOND_LINK_FAIL,
no further port checks are made when it recovers. Though
the slave devices are operational and have valid speed
and duplex settings, the bond will not send LACPDU's. The
simplest fix I can see is to force another speed check
in bond_miimon_commit. This way the bond will update
AD_PORT_LACP_ENABLE if needed when transitioning from
BOND_LINK_FAIL to BOND_LINK_UP.
CC: Jarod Wilson <jarod@redhat.com>
CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Veaceslav Falico <vfalico@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Thomas Falcon <tlfalcon@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Two cases of overlapping changes, nothing fancy.
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Currently, gratuitous ARP/ND packets are sent every `miimon'
milliseconds. This commit allows a user to specify a custom delay
through a new option, `peer_notif_delay'.
Like for `updelay' and `downdelay', this delay should be a multiple of
`miimon' to avoid managing an additional work queue. The configuration
logic is copied from `updelay' and `downdelay'. However, the default
value cannot be set using a module parameter: Netlink or sysfs should
be used to configure this feature.
When setting `miimon' to 100 and `peer_notif_delay' to 500, we can
observe the 500 ms delay is respected:
20:30:19.354693 ARP, Request who-has 203.0.113.10 tell 203.0.113.10, length 28
20:30:19.874892 ARP, Request who-has 203.0.113.10 tell 203.0.113.10, length 28
20:30:20.394919 ARP, Request who-has 203.0.113.10 tell 203.0.113.10, length 28
20:30:20.914963 ARP, Request who-has 203.0.113.10 tell 203.0.113.10, length 28
In bond_mii_monitor(), I have tried to keep the lock logic readable.
The change is due to the fact we cannot rely on a notification to
lower the value of `bond->send_peer_notif' as `NETDEV_NOTIFY_PEERS' is
only triggered once every N times, while we need to decrement the
counter each time.
iproute2 also needs to be updated to be able to specify this new
attribute through `ip link'.
Signed-off-by: Vincent Bernat <vincent@bernat.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
bond_xmit_roundrobin() checks for IGMP packets but it parses
the IP header even before checking skb->protocol.
We should validate the IP header with pskb_may_pull() before
using iph->protocol.
Reported-and-tested-by: syzbot+e5be16aa39ad6e755391@syzkaller.appspotmail.com
Fixes: a2fd940f4cff ("bonding: fix broken multicast with round-robin mode")
Cc: Jay Vosburgh <j.vosburgh@gmail.com>
Cc: Veaceslav Falico <vfalico@gmail.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
A bonding master can be up while best_slave is NULL.
[12105.636318] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
[12105.638204] mlx4_en: eth1: Linkstate event 1 -> 1
[12105.648984] IP: bond_select_active_slave+0x125/0x250
[12105.653977] PGD 0 P4D 0
[12105.656572] Oops: 0000 [#1] SMP PTI
[12105.660487] gsmi: Log Shutdown Reason 0x03
[12105.664620] Modules linked in: kvm_intel loop act_mirred uhaul vfat fat stg_standard_ftl stg_megablocks stg_idt stg_hdi stg elephant_dev_num stg_idt_eeprom w1_therm wire i2c_mux_pca954x i2c_mux mlx4_i2c i2c_usb cdc_acm ehci_pci ehci_hcd i2c_iimc mlx4_en mlx4_ib ib_uverbs ib_core mlx4_core [last unloaded: kvm_intel]
[12105.685686] mlx4_core 0000:03:00.0: dispatching link up event for port 2
[12105.685700] mlx4_en: eth2: Linkstate event 2 -> 1
[12105.685700] mlx4_en: eth2: Link Up (linkstate)
[12105.724452] Workqueue: bond0 bond_mii_monitor
[12105.728854] RIP: 0010:bond_select_active_slave+0x125/0x250
[12105.734355] RSP: 0018:ffffaf146a81fd88 EFLAGS: 00010246
[12105.739637] RAX: 0000000000000003 RBX: ffff8c62b03c6900 RCX: 0000000000000000
[12105.746838] RDX: 0000000000000000 RSI: ffffaf146a81fd08 RDI: ffff8c62b03c6000
[12105.754054] RBP: ffffaf146a81fdb8 R08: 0000000000000001 R09: ffff8c517d387600
[12105.761299] R10: 00000000001075d9 R11: ffffffffaceba92f R12: 0000000000000000
[12105.768553] R13: ffff8c8240ae4800 R14: 0000000000000000 R15: 0000000000000000
[12105.775748] FS: 0000000000000000(0000) GS:ffff8c62bfa40000(0000) knlGS:0000000000000000
[12105.783892] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[12105.789716] CR2: 0000000000000000 CR3: 0000000d0520e001 CR4: 00000000001626f0
[12105.796976] Call Trace:
[12105.799446] [<ffffffffac31d387>] bond_mii_monitor+0x497/0x6f0
[12105.805317] [<ffffffffabd42643>] process_one_work+0x143/0x370
[12105.811225] [<ffffffffabd42c7a>] worker_thread+0x4a/0x360
[12105.816761] [<ffffffffabd48bc5>] kthread+0x105/0x140
[12105.821865] [<ffffffffabd42c30>] ? rescuer_thread+0x380/0x380
[12105.827757] [<ffffffffabd48ac0>] ? kthread_associate_blkcg+0xc0/0xc0
[12105.834266] [<ffffffffac600241>] ret_from_fork+0x51/0x60
Fixes: e2a7420df2e0 ("bonding/main: convert to using slave printk macros")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: John Sperbeck <jsperbeck@google.com>
Cc: Jarod Wilson <jarod@redhat.com>
CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Veaceslav Falico <vfalico@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The new route handling in ip_mc_finish_output() from 'net' overlapped
with the new support for returning congestion notifications from BPF
programs.
In order to handle this I had to take the dev_loopback_xmit() calls
out of the switch statement.
The aquantia driver conflicts were simple overlapping changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
We build vlan on top of bonding interface, which vlan offload
is off, bond mode is 802.3ad (LACP) and xmit_hash_policy is
BOND_XMIT_POLICY_ENCAP34.
Because vlan tx offload is off, vlan tci is cleared and skb push
the vlan header in validate_xmit_vlan() while sending from vlan
devices. Then in bond_xmit_hash, __skb_flow_dissect() fails to
get information from protocol headers encapsulated within vlan,
because 'nhoff' is points to IP header, so bond hashing is based
on layer 2 info, which fails to distribute packets across slaves.
This patch always enable bonding's vlan tx offload, pass the vlan
packets to the slave devices with vlan tci, let them to handle
vlan implementation.
Fixes: 278339a42a1b ("bonding: propogate vlan_features to bonding master")
Suggested-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
All of these printk instances benefit from having both master and slave
device information included, so convert to using a standardized macro
format and remove redundant information.
Suggested-by: Joe Perches <joe@perches.com>
CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Veaceslav Falico <vfalico@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
CC: netdev@vger.kernel.org
Signed-off-by: Jarod Wilson <jarod@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Passing the bond name again to debug output when referencing slave is wrong.
We're trying to set the bond's MAC to that of the new_active slave, so adjust
the error message slightly and pass in the slave's name, not the bond's.
Then we're trying to set the MAC on the old active slave, but putting the
new active slave's name in the output. While we're at it, clarify the
error messages so you know which one actually triggered.
CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Veaceslav Falico <vfalico@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
CC: netdev@vger.kernel.org
Signed-off-by: Jarod Wilson <jarod@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Seeing bonding debug log data along the lines of "event: 5" is a bit spartan,
and often requires a lookup table if you don't remember what every event is.
Make use of netdev_cmd_to_name for an improved debugging experience, so for
the prior example, you'll see: "bond_netdev_event received NETDEV_REGISTER"
instead (both are prefixed with the device for which the event pertains).
CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Veaceslav Falico <vfalico@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
CC: netdev@vger.kernel.org
Signed-off-by: Jarod Wilson <jarod@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
When setting the bonding interface net device features,
the kernel code doesn't address the slaves' MPLS features
and doesn't inherit them.
Therefore, HW offloads that enhance performance such as
checksumming and TSO are disabled for MPLS tagged traffic
flowing via the bonding interface.
The patch add the inheritance of the MPLS features from the
slave devices with a similar logic to setting the bonding device's
VLAN and encapsulation features.
CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Veaceslav Falico <vfalico@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Ariel Levkovich <lariel@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Once in a while, with just the right timing, 802.3ad slaves will fail to
properly initialize, winding up in a weird state, with a partner system
mac address of 00:00:00:00:00:00. This started happening after a fix to
properly track link_failure_count tracking, where an 802.3ad slave that
reported itself as link up in the miimon code, but wasn't able to get a
valid speed/duplex, started getting set to BOND_LINK_FAIL instead of
BOND_LINK_DOWN. That was the proper thing to do for the general "my link
went down" case, but has created a link initialization race that can put
the interface in this odd state.
The simple fix is to instead set the slave link to BOND_LINK_DOWN again,
if the link has never been up (last_link_up == 0), so the link state
doesn't bounce from BOND_LINK_DOWN to BOND_LINK_FAIL -- it hasn't failed
in this case, it simply hasn't been up yet, and this prevents the
unnecessary state change from DOWN to FAIL and getting stuck in an init
failure w/o a partner mac.
Fixes: ea53abfab960 ("bonding/802.3ad: fix link_failure_count tracking")
CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Veaceslav Falico <vfalico@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
CC: "David S. Miller" <davem@davemloft.net>
CC: netdev@vger.kernel.org
Tested-by: Heesoon Kim <Heesoon.Kim@stratus.com>
Signed-off-by: Jarod Wilson <jarod@redhat.com>
Acked-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Conflict resolution of af_smc.c from Stephen Rothwell.
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
When a bond is enslaved to another bond, bond_netdev_event() only
handles the event as if the bond is a master, and skips treating the
bond as a slave.
This leads to a refcount leak on the slave, since we don't remove the
adjacency to its master and the master holds a reference on the slave.
Reproducer:
ip link add bondL type bond
ip link add bondU type bond
ip link set bondL master bondU
ip link del bondL
No "Fixes:" tag, this code is older than git history.
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
After the previous patch, all the callers of ndo_select_queue()
provide as a 'fallback' argument netdev_pick_tx.
The only exceptions are nested calls to ndo_select_queue(),
which pass down the 'fallback' available in the current scope
- still netdev_pick_tx.
We can drop such argument and replace fallback() invocation with
netdev_pick_tx(). This avoids an indirect call per xmit packet
in some scenarios (TCP syn, UDP unconnected, XDP generic, pktgen)
with device drivers implementing such ndo. It also clean the code
a bit.
Tested with ixgbe and CONFIG_FCOE=m
With pktgen using queue xmit:
threads vanilla patched
(kpps) (kpps)
1 2334 2428
2 4166 4278
4 7895 8100
v1 -> v2:
- rebased after helper's name change
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This is no longer necessary after eca59f691566 ("net: Remove support for bridge bypass ndos from stacked devices")
Suggested-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Andy Gospodarek <andy@greyhouse.net>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This patch fixes a subtle PACKET_ORIGDEV regression which was a side
effect of fixes introduced by:
6a9e461f6fe4 bonding: pass link-local packets to bonding master also.
... to:
b89f04c61efe bonding: deliver link-local packets with skb->dev set to link that packets arrived on
While 6a9e461f6fe4 restored pre-b89f04c61efe presence of link-local
packets on bonding masters (which is required e.g. by linux bridges
participating in spanning tree or needed for lab-like setups created
with group_fwd_mask) it also caused the originating device
information to be lost due to cloning.
Maciej Żenczykowski proposed another solution that doesn't require
packet cloning and retains original device information - instead of
returning RX_HANDLER_PASS for all link-local packets it's now limited
only to packets from inactive slaves.
At the same time, packets passed to bonding masters retain correct
information about the originating device and PACKET_ORIGDEV can be used
to determine it.
This elegantly solves all issues so far:
- link-local packets that were removed from bonding masters
- LLDP daemons being forced to explicitly bind to slave interfaces
- PACKET_ORIGDEV having no effect on bond interfaces
Fixes: 6a9e461f6fe4 (bonding: pass link-local packets to bonding master also.)
Reported-by: Vincent Bernat <vincent@bernat.ch>
Signed-off-by: Michal Soltys <soltys@ziu.info>
Signed-off-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|