summaryrefslogtreecommitdiff
path: root/drivers/net/bonding/bond_main.c
AgeCommit message (Collapse)Author
2020-06-22bonding: support hardware encryption offload to slavesJarod Wilson
Currently, this support is limited to active-backup mode, as I'm not sure about the feasilibity of mapping an xfrm_state's offload handle to multiple hardware devices simultaneously, and we rely on being able to pass some hints to both the xfrm and NIC driver about whether or not they're operating on a slave device. I've tested this atop an Intel x520 device (ixgbe) using libreswan in transport mode, succesfully achieving ~4.3Gbps throughput with netperf (more or less identical to throughput on a bare NIC in this system), as well as successful failover and recovery mid-netperf. v2: just use CONFIG_XFRM_OFFLOAD for wrapping, isolate more code with it CC: Jay Vosburgh <j.vosburgh@gmail.com> CC: Veaceslav Falico <vfalico@gmail.com> CC: Andy Gospodarek <andy@greyhouse.net> CC: "David S. Miller" <davem@davemloft.net> CC: Jeff Kirsher <jeffrey.t.kirsher@intel.com> CC: Jakub Kicinski <kuba@kernel.org> CC: Steffen Klassert <steffen.klassert@secunet.com> CC: Herbert Xu <herbert@gondor.apana.org.au> CC: netdev@vger.kernel.org CC: intel-wired-lan@lists.osuosl.org Signed-off-by: Jarod Wilson <jarod@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-09net: change addr_list_lock back to static keyCong Wang
The dynamic key update for addr_list_lock still causes troubles, for example the following race condition still exists: CPU 0: CPU 1: (RCU read lock) (RTNL lock) dev_mc_seq_show() netdev_update_lockdep_key() -> lockdep_unregister_key() -> netif_addr_lock_bh() because lockdep doesn't provide an API to update it atomically. Therefore, we have to move it back to static keys and use subclass for nest locking like before. In commit 1a33e10e4a95 ("net: partially revert dynamic lockdep key changes"), I already reverted most parts of commit ab92d68fc22f ("net: core: add generic lockdep keys"). This patch reverts the rest and also part of commit f3b0a18bb6cb ("net: remove unnecessary variables and callback"). After this patch, addr_list_lock changes back to using static keys and subclasses to satisfy lockdep. Thanks to dev->lower_level, we do not have to change back to ->ndo_get_lock_subclass(). And hopefully this reduces some syzbot lockdep noises too. Reported-by: syzbot+f3a0e80c34b3fc28ac5e@syzkaller.appspotmail.com Cc: Taehee Yoo <ap420073@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-09Merge branch 'mlx5-next' of ↵Saeed Mahameed
git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux This merge includes updates to bonding driver needed for the rdma stack, to avoid conflicts with the RDMA branch. Maor Gottlieb Says: ==================== Bonding: Add support to get xmit slave The following series adds support to get the LAG master xmit slave by introducing new .ndo - ndo_get_xmit_slave. Every LAG module can implement it and it first implemented in the bond driver. This is follow-up to the RFC discussion [1]. The main motivation for doing this is for drivers that offload part of the LAG functionality. For example, Mellanox Connect-X hardware implements RoCE LAG which selects the TX affinity when the resources are created and port is remapped when it goes down. The first part of this patchset introduces the new .ndo and add the support to the bonding module. The second part adds support to get the RoCE LAG xmit slave by building skb of the RoCE packet based on the AH attributes and call to the new .ndo. The third part change the mlx5 driver driver to set the QP's affinity port according to the slave which found by the .ndo. ==================== Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2020-05-07bonding: propagate transmit statusEric Dumazet
Currently, bonding always returns NETDEV_TX_OK to its caller. It is worth trying to be more accurate : TCP for instance can have different recovery strategies if it can have more precise status, if packet was dropped by slave qdisc. This is especially important when host is under stress. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Jay Vosburgh <j.vosburgh@gmail.com> Cc: Veaceslav Falico <vfalico@gmail.com> Cc: Andy Gospodarek <andy@greyhouse.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-04bonding: remove useless stats_lock_keyCong Wang
After commit b3e80d44f5b1 ("bonding: fix lockdep warning in bond_get_stats()") the dynamic key is no longer necessary, as we compute nest level at run-time. So, we can just remove it to save some lockdep key entries. Test commands: ip link add bond0 type bond ip link add bond1 type bond ip link set bond0 master bond1 ip link set bond0 nomaster ip link set bond1 master bond0 Reported-and-tested-by: syzbot+aaa6fa4949cc5d9b7b25@syzkaller.appspotmail.com Cc: Dmitry Vyukov <dvyukov@google.com> Acked-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-04net: partially revert dynamic lockdep key changesCong Wang
This patch reverts the folowing commits: commit 064ff66e2bef84f1153087612032b5b9eab005bd "bonding: add missing netdev_update_lockdep_key()" commit 53d374979ef147ab51f5d632dfe20b14aebeccd0 "net: avoid updating qdisc_xmit_lock_key in netdev_update_lockdep_key()" commit 1f26c0d3d24125992ab0026b0dab16c08df947c7 "net: fix kernel-doc warning in <linux/netdevice.h>" commit ab92d68fc22f9afab480153bd82a20f6e2533769 "net: core: add generic lockdep keys" but keeps the addr_list_lock_key because we still lock addr_list_lock nestedly on stack devices, unlikely xmit_lock this is safe because we don't take addr_list_lock on any fast path. Reported-and-tested-by: syzbot+aaa6fa4949cc5d9b7b25@syzkaller.appspotmail.com Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Taehee Yoo <ap420073@gmail.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-01bonding: Implement ndo_get_xmit_slaveMaor Gottlieb
Add implementation of ndo_get_xmit_slave. Find the slave by using the helper function according to the bond mode. If the caller set all_slaves to true, then it assumes that all slaves are available to transmit. Signed-off-by: Maor Gottlieb <maorg@mellanox.com> Reviewed-by: Jay Vosburgh <jay.vosburgh@canonical.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2020-05-01bonding: Add array of all slavesMaor Gottlieb
Keep all slaves in array so it could be used to get the xmit slave assume all the slaves are active. The logic to add slave to the array is like the usable slaves, except that we also add slaves that currently can't transmit - not up or active. Signed-off-by: Maor Gottlieb <maorg@mellanox.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Jay Vosburgh <jay.vosburgh@canonical.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2020-05-01bonding: Add function to get the xmit slave in active-backup modeMaor Gottlieb
Add helper function to get the xmit slave in active-backup mode. It's only one line function that return the curr_active_slave, but it will used both in the xmit flow and by the new .ndo to get the xmit slave. Signed-off-by: Maor Gottlieb <maorg@mellanox.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Jay Vosburgh <jay.vosburgh@canonical.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2020-05-01bonding: Add helper function to get the xmit slave in rr modeMaor Gottlieb
Add helper function to get the xmit slave when bond is in round robin mode. Change bond_xmit_slave_id to bond_get_slave_by_id, then the logic for find the next slave for transmit could be used both by the xmit flow and the .ndo to get the xmit slave. Signed-off-by: Maor Gottlieb <maorg@mellanox.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Jay Vosburgh <jay.vosburgh@canonical.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2020-05-01bonding: Add helper function to get the xmit slave based on hashMaor Gottlieb
Both xor and 802.3ad modes use bond_xmit_hash to get the xmit slave. Export the logic to helper function so it could be used in the following patches by the .ndo to get the xmit slave. Signed-off-by: Maor Gottlieb <maorg@mellanox.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Jay Vosburgh <jay.vosburgh@canonical.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2020-05-01bonding: Rename slave_arr to usable_slavesMaor Gottlieb
Rename slave_arr to usable_slaves, since we will have two arrays, one for the usable slaves and the other to all slaves. Signed-off-by: Maor Gottlieb <maorg@mellanox.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Jay Vosburgh <jay.vosburgh@canonical.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2020-05-01bonding: Export skip slave logic to functionMaor Gottlieb
As a preparation for following change that add array of all slaves, extract code that skip slave to function. Signed-off-by: Maor Gottlieb <maorg@mellanox.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Jay Vosburgh <jay.vosburgh@canonical.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2020-02-24net/bond: Delete driver and module versionsLeon Romanovsky
The in-kernel code has already unique version, which is based on Linus's tag, update the bond driver to be consistent with that version. Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-21Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netDavid S. Miller
Conflict resolution of ice_virtchnl_pf.c based upon work by Stephen Rothwell. Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-20net: use netif_is_bridge_port() to check for IFF_BRIDGE_PORTJulian Wiedmann
Trivial cleanup, so that all bridge port-specific code can be found in one go. CC: Johannes Berg <johannes@sipsolutions.net> CC: Roopa Prabhu <roopa@cumulusnetworks.com> CC: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com> Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-16bonding: fix lockdep warning in bond_get_stats()Taehee Yoo
In the "struct bonding", there is stats_lock. This lock protects "bond_stats" in the "struct bonding". bond_stats is updated in the bond_get_stats() and this function would be executed concurrently. So, the lock is needed. Bonding interfaces would be nested. So, either stats_lock should use dynamic lockdep class key or stats_lock should be used by spin_lock_nested(). In the current code, stats_lock is using a dynamic lockdep class key. But there is no updating stats_lock_key routine So, lockdep warning will occur. Test commands: ip link add bond0 type bond ip link add bond1 type bond ip link set bond0 master bond1 ip link set bond0 nomaster ip link set bond1 master bond0 Splat looks like: [ 38.420603][ T957] 5.5.0+ #394 Not tainted [ 38.421074][ T957] ------------------------------------------------------ [ 38.421837][ T957] ip/957 is trying to acquire lock: [ 38.422399][ T957] ffff888063262cd8 (&bond->stats_lock_key#2){+.+.}, at: bond_get_stats+0x90/0x4d0 [bonding] [ 38.423528][ T957] [ 38.423528][ T957] but task is already holding lock: [ 38.424526][ T957] ffff888065fd2cd8 (&bond->stats_lock_key){+.+.}, at: bond_get_stats+0x90/0x4d0 [bonding] [ 38.426075][ T957] [ 38.426075][ T957] which lock already depends on the new lock. [ 38.426075][ T957] [ 38.428536][ T957] [ 38.428536][ T957] the existing dependency chain (in reverse order) is: [ 38.429475][ T957] [ 38.429475][ T957] -> #1 (&bond->stats_lock_key){+.+.}: [ 38.430273][ T957] _raw_spin_lock+0x30/0x70 [ 38.430812][ T957] bond_get_stats+0x90/0x4d0 [bonding] [ 38.431451][ T957] dev_get_stats+0x1ec/0x270 [ 38.432088][ T957] bond_get_stats+0x1a5/0x4d0 [bonding] [ 38.432767][ T957] dev_get_stats+0x1ec/0x270 [ 38.433322][ T957] rtnl_fill_stats+0x44/0xbe0 [ 38.433866][ T957] rtnl_fill_ifinfo+0xeb2/0x3720 [ 38.434474][ T957] rtmsg_ifinfo_build_skb+0xca/0x170 [ 38.435081][ T957] rtmsg_ifinfo_event.part.33+0x1b/0xb0 [ 38.436848][ T957] rtnetlink_event+0xcd/0x120 [ 38.437455][ T957] notifier_call_chain+0x90/0x160 [ 38.438067][ T957] netdev_change_features+0x74/0xa0 [ 38.438708][ T957] bond_compute_features.isra.45+0x4e6/0x6f0 [bonding] [ 38.439522][ T957] bond_enslave+0x3639/0x47b0 [bonding] [ 38.440225][ T957] do_setlink+0xaab/0x2ef0 [ 38.440786][ T957] __rtnl_newlink+0x9c5/0x1270 [ 38.441463][ T957] rtnl_newlink+0x65/0x90 [ 38.442075][ T957] rtnetlink_rcv_msg+0x4a8/0x890 [ 38.442774][ T957] netlink_rcv_skb+0x121/0x350 [ 38.443451][ T957] netlink_unicast+0x42e/0x610 [ 38.444282][ T957] netlink_sendmsg+0x65a/0xb90 [ 38.444992][ T957] ____sys_sendmsg+0x5ce/0x7a0 [ 38.445679][ T957] ___sys_sendmsg+0x10f/0x1b0 [ 38.446365][ T957] __sys_sendmsg+0xc6/0x150 [ 38.447007][ T957] do_syscall_64+0x99/0x4f0 [ 38.447668][ T957] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 38.448538][ T957] [ 38.448538][ T957] -> #0 (&bond->stats_lock_key#2){+.+.}: [ 38.449554][ T957] __lock_acquire+0x2d8d/0x3de0 [ 38.450148][ T957] lock_acquire+0x164/0x3b0 [ 38.450711][ T957] _raw_spin_lock+0x30/0x70 [ 38.451292][ T957] bond_get_stats+0x90/0x4d0 [bonding] [ 38.451950][ T957] dev_get_stats+0x1ec/0x270 [ 38.452425][ T957] bond_get_stats+0x1a5/0x4d0 [bonding] [ 38.453362][ T957] dev_get_stats+0x1ec/0x270 [ 38.453825][ T957] rtnl_fill_stats+0x44/0xbe0 [ 38.454390][ T957] rtnl_fill_ifinfo+0xeb2/0x3720 [ 38.456257][ T957] rtmsg_ifinfo_build_skb+0xca/0x170 [ 38.456998][ T957] rtmsg_ifinfo_event.part.33+0x1b/0xb0 [ 38.459351][ T957] rtnetlink_event+0xcd/0x120 [ 38.460086][ T957] notifier_call_chain+0x90/0x160 [ 38.460829][ T957] netdev_change_features+0x74/0xa0 [ 38.461752][ T957] bond_compute_features.isra.45+0x4e6/0x6f0 [bonding] [ 38.462705][ T957] bond_enslave+0x3639/0x47b0 [bonding] [ 38.463476][ T957] do_setlink+0xaab/0x2ef0 [ 38.464141][ T957] __rtnl_newlink+0x9c5/0x1270 [ 38.464897][ T957] rtnl_newlink+0x65/0x90 [ 38.465522][ T957] rtnetlink_rcv_msg+0x4a8/0x890 [ 38.466215][ T957] netlink_rcv_skb+0x121/0x350 [ 38.466895][ T957] netlink_unicast+0x42e/0x610 [ 38.467583][ T957] netlink_sendmsg+0x65a/0xb90 [ 38.468285][ T957] ____sys_sendmsg+0x5ce/0x7a0 [ 38.469202][ T957] ___sys_sendmsg+0x10f/0x1b0 [ 38.469884][ T957] __sys_sendmsg+0xc6/0x150 [ 38.470587][ T957] do_syscall_64+0x99/0x4f0 [ 38.471245][ T957] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 38.472093][ T957] [ 38.472093][ T957] other info that might help us debug this: [ 38.472093][ T957] [ 38.473438][ T957] Possible unsafe locking scenario: [ 38.473438][ T957] [ 38.474898][ T957] CPU0 CPU1 [ 38.476234][ T957] ---- ---- [ 38.480171][ T957] lock(&bond->stats_lock_key); [ 38.480808][ T957] lock(&bond->stats_lock_key#2); [ 38.481791][ T957] lock(&bond->stats_lock_key); [ 38.482754][ T957] lock(&bond->stats_lock_key#2); [ 38.483416][ T957] [ 38.483416][ T957] *** DEADLOCK *** [ 38.483416][ T957] [ 38.484505][ T957] 3 locks held by ip/957: [ 38.485048][ T957] #0: ffffffffbccf6230 (rtnl_mutex){+.+.}, at: rtnetlink_rcv_msg+0x457/0x890 [ 38.486198][ T957] #1: ffff888065fd2cd8 (&bond->stats_lock_key){+.+.}, at: bond_get_stats+0x90/0x4d0 [bonding] [ 38.487625][ T957] #2: ffffffffbc9254c0 (rcu_read_lock){....}, at: bond_get_stats+0x5/0x4d0 [bonding] [ 38.488897][ T957] [ 38.488897][ T957] stack backtrace: [ 38.489646][ T957] CPU: 1 PID: 957 Comm: ip Not tainted 5.5.0+ #394 [ 38.490497][ T957] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006 [ 38.492810][ T957] Call Trace: [ 38.493219][ T957] dump_stack+0x96/0xdb [ 38.493709][ T957] check_noncircular+0x371/0x450 [ 38.494344][ T957] ? lookup_address+0x60/0x60 [ 38.494923][ T957] ? print_circular_bug.isra.35+0x310/0x310 [ 38.495699][ T957] ? hlock_class+0x130/0x130 [ 38.496334][ T957] ? __lock_acquire+0x2d8d/0x3de0 [ 38.496979][ T957] __lock_acquire+0x2d8d/0x3de0 [ 38.497607][ T957] ? register_lock_class+0x14d0/0x14d0 [ 38.498333][ T957] ? check_chain_key+0x236/0x5d0 [ 38.499003][ T957] lock_acquire+0x164/0x3b0 [ 38.499800][ T957] ? bond_get_stats+0x90/0x4d0 [bonding] [ 38.500706][ T957] _raw_spin_lock+0x30/0x70 [ 38.501435][ T957] ? bond_get_stats+0x90/0x4d0 [bonding] [ 38.502311][ T957] bond_get_stats+0x90/0x4d0 [bonding] [ ... ] But, there is another problem. The dynamic lockdep class key is protected by RTNL, but bond_get_stats() would be called outside of RTNL. So, it would use an invalid dynamic lockdep class key. In order to fix this issue, stats_lock uses spin_lock_nested() instead of a dynamic lockdep key. The bond_get_stats() calls bond_get_lowest_level_rcu() to get the correct nest level value, which will be used by spin_lock_nested(). The "dev->lower_level" indicates lower nest level value, but this value is invalid outside of RTNL. So, bond_get_lowest_level_rcu() returns valid lower nest level value in the RCU critical section. bond_get_lowest_level_rcu() will be work only when LOCKDEP is enabled. Fixes: 089bca2caed0 ("bonding: use dynamic lockdep key instead of subclass") Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-16bonding: add missing netdev_update_lockdep_key()Taehee Yoo
After bond_release(), netdev_update_lockdep_key() should be called. But both ioctl path and attribute path don't call netdev_update_lockdep_key(). This patch adds missing netdev_update_lockdep_key(). Test commands: ip link add bond0 type bond ip link add bond1 type bond ifenslave bond0 bond1 ifenslave -d bond0 bond1 ifenslave bond1 bond0 Splat looks like: [ 29.501182][ T1046] WARNING: possible circular locking dependency detected [ 29.501945][ T1039] hardirqs last disabled at (1962): [<ffffffffac6c807f>] handle_mm_fault+0x13f/0x700 [ 29.503442][ T1046] 5.5.0+ #322 Not tainted [ 29.503447][ T1046] ------------------------------------------------------ [ 29.504277][ T1039] softirqs last enabled at (1180): [<ffffffffade00678>] __do_softirq+0x678/0x981 [ 29.505443][ T1046] ifenslave/1046 is trying to acquire lock: [ 29.505886][ T1039] softirqs last disabled at (1169): [<ffffffffac19c18a>] irq_exit+0x17a/0x1a0 [ 29.509997][ T1046] ffff88805d5da280 (&dev->addr_list_lock_key#3){+...}, at: dev_mc_sync_multiple+0x95/0x120 [ 29.511243][ T1046] [ 29.511243][ T1046] but task is already holding lock: [ 29.512192][ T1046] ffff8880460f2280 (&dev->addr_list_lock_key#4){+...}, at: bond_enslave+0x4482/0x47b0 [bonding] [ 29.514124][ T1046] [ 29.514124][ T1046] which lock already depends on the new lock. [ 29.514124][ T1046] [ 29.517297][ T1046] [ 29.517297][ T1046] the existing dependency chain (in reverse order) is: [ 29.518231][ T1046] [ 29.518231][ T1046] -> #1 (&dev->addr_list_lock_key#4){+...}: [ 29.519076][ T1046] _raw_spin_lock+0x30/0x70 [ 29.519588][ T1046] dev_mc_sync_multiple+0x95/0x120 [ 29.520208][ T1046] bond_enslave+0x448d/0x47b0 [bonding] [ 29.520862][ T1046] bond_option_slaves_set+0x1a3/0x370 [bonding] [ 29.521640][ T1046] __bond_opt_set+0x1ff/0xbb0 [bonding] [ 29.522438][ T1046] __bond_opt_set_notify+0x2b/0xf0 [bonding] [ 29.523251][ T1046] bond_opt_tryset_rtnl+0x92/0xf0 [bonding] [ 29.524082][ T1046] bonding_sysfs_store_option+0x8a/0xf0 [bonding] [ 29.524959][ T1046] kernfs_fop_write+0x276/0x410 [ 29.525620][ T1046] vfs_write+0x197/0x4a0 [ 29.526218][ T1046] ksys_write+0x141/0x1d0 [ 29.526818][ T1046] do_syscall_64+0x99/0x4f0 [ 29.527430][ T1046] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 29.528265][ T1046] [ 29.528265][ T1046] -> #0 (&dev->addr_list_lock_key#3){+...}: [ 29.529272][ T1046] __lock_acquire+0x2d8d/0x3de0 [ 29.529935][ T1046] lock_acquire+0x164/0x3b0 [ 29.530638][ T1046] _raw_spin_lock+0x30/0x70 [ 29.531187][ T1046] dev_mc_sync_multiple+0x95/0x120 [ 29.531790][ T1046] bond_enslave+0x448d/0x47b0 [bonding] [ 29.532451][ T1046] bond_option_slaves_set+0x1a3/0x370 [bonding] [ 29.533163][ T1046] __bond_opt_set+0x1ff/0xbb0 [bonding] [ 29.533789][ T1046] __bond_opt_set_notify+0x2b/0xf0 [bonding] [ 29.534595][ T1046] bond_opt_tryset_rtnl+0x92/0xf0 [bonding] [ 29.535500][ T1046] bonding_sysfs_store_option+0x8a/0xf0 [bonding] [ 29.536379][ T1046] kernfs_fop_write+0x276/0x410 [ 29.537057][ T1046] vfs_write+0x197/0x4a0 [ 29.537640][ T1046] ksys_write+0x141/0x1d0 [ 29.538251][ T1046] do_syscall_64+0x99/0x4f0 [ 29.538870][ T1046] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 29.539659][ T1046] [ 29.539659][ T1046] other info that might help us debug this: [ 29.539659][ T1046] [ 29.540953][ T1046] Possible unsafe locking scenario: [ 29.540953][ T1046] [ 29.541883][ T1046] CPU0 CPU1 [ 29.542540][ T1046] ---- ---- [ 29.543209][ T1046] lock(&dev->addr_list_lock_key#4); [ 29.543880][ T1046] lock(&dev->addr_list_lock_key#3); [ 29.544873][ T1046] lock(&dev->addr_list_lock_key#4); [ 29.545863][ T1046] lock(&dev->addr_list_lock_key#3); [ 29.546525][ T1046] [ 29.546525][ T1046] *** DEADLOCK *** [ 29.546525][ T1046] [ 29.547542][ T1046] 5 locks held by ifenslave/1046: [ 29.548196][ T1046] #0: ffff88806044c478 (sb_writers#5){.+.+}, at: vfs_write+0x3bb/0x4a0 [ 29.549248][ T1046] #1: ffff88805af00890 (&of->mutex){+.+.}, at: kernfs_fop_write+0x1cf/0x410 [ 29.550343][ T1046] #2: ffff88805b8b54b0 (kn->count#157){.+.+}, at: kernfs_fop_write+0x1f2/0x410 [ 29.551575][ T1046] #3: ffffffffaecf4cf0 (rtnl_mutex){+.+.}, at: bond_opt_tryset_rtnl+0x5f/0xf0 [bonding] [ 29.552819][ T1046] #4: ffff8880460f2280 (&dev->addr_list_lock_key#4){+...}, at: bond_enslave+0x4482/0x47b0 [bonding] [ 29.554175][ T1046] [ 29.554175][ T1046] stack backtrace: [ 29.554907][ T1046] CPU: 0 PID: 1046 Comm: ifenslave Not tainted 5.5.0+ #322 [ 29.555854][ T1046] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006 [ 29.557064][ T1046] Call Trace: [ 29.557504][ T1046] dump_stack+0x96/0xdb [ 29.558054][ T1046] check_noncircular+0x371/0x450 [ 29.558723][ T1046] ? print_circular_bug.isra.35+0x310/0x310 [ 29.559486][ T1046] ? hlock_class+0x130/0x130 [ 29.560100][ T1046] ? __lock_acquire+0x2d8d/0x3de0 [ 29.560761][ T1046] __lock_acquire+0x2d8d/0x3de0 [ 29.561366][ T1046] ? register_lock_class+0x14d0/0x14d0 [ 29.562045][ T1046] ? find_held_lock+0x39/0x1d0 [ 29.562641][ T1046] lock_acquire+0x164/0x3b0 [ 29.563199][ T1046] ? dev_mc_sync_multiple+0x95/0x120 [ 29.563872][ T1046] _raw_spin_lock+0x30/0x70 [ 29.564464][ T1046] ? dev_mc_sync_multiple+0x95/0x120 [ 29.565146][ T1046] dev_mc_sync_multiple+0x95/0x120 [ 29.565793][ T1046] bond_enslave+0x448d/0x47b0 [bonding] [ 29.566487][ T1046] ? bond_update_slave_arr+0x940/0x940 [bonding] [ 29.567279][ T1046] ? bstr_printf+0xc20/0xc20 [ 29.567857][ T1046] ? stack_trace_consume_entry+0x160/0x160 [ 29.568614][ T1046] ? deactivate_slab.isra.77+0x2c5/0x800 [ 29.569320][ T1046] ? check_chain_key+0x236/0x5d0 [ 29.569939][ T1046] ? sscanf+0x93/0xc0 [ 29.570442][ T1046] ? vsscanf+0x1e20/0x1e20 [ 29.571003][ T1046] bond_option_slaves_set+0x1a3/0x370 [bonding] [ ... ] Fixes: ab92d68fc22f ("net: core: add generic lockdep keys") Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-14bonding: fix active-backup transition after link failureMahesh Bandewar
After the recent fix in commit 1899bb325149 ("bonding: fix state transition issue in link monitoring"), the active-backup mode with miimon initially come-up fine but after a link-failure, both members transition into backup state. Following steps to reproduce the scenario (eth1 and eth2 are the slaves of the bond): ip link set eth1 up ip link set eth2 down sleep 1 ip link set eth2 up ip link set eth1 down cat /sys/class/net/eth1/bonding_slave/state cat /sys/class/net/eth2/bonding_slave/state Fixes: 1899bb325149 ("bonding: fix state transition issue in link monitoring") CC: Jay Vosburgh <jay.vosburgh@canonical.com> Signed-off-by: Mahesh Bandewar <maheshb@google.com> Acked-by: Jay Vosburgh <jay.vosburgh@canonical.com> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
2019-12-09bonding: fix bond_neigh_init()Eric Dumazet
1) syzbot reported an uninit-value in bond_neigh_setup() [1] bond_neigh_setup() uses a temporary on-stack 'struct neigh_parms parms', but only clears parms.neigh_setup field. A stacked bonding device would then enter bond_neigh_setup() and read garbage from parms->dev. If we get really unlucky and garbage is matching @dev, then we could recurse and eventually crash. Let's make sure the whole structure is cleared to avoid surprises. 2) bond_neigh_setup() can be called while another cpu manipulates the master device, removing or adding a slave. We need at least rcu protection to prevent use-after-free. Note: Prior code does not support a stack of bonding devices, this patch does not attempt to fix this, and leave a comment instead. [1] BUG: KMSAN: uninit-value in bond_neigh_setup+0xa4/0x110 drivers/net/bonding/bond_main.c:3655 CPU: 0 PID: 11256 Comm: syz-executor.0 Not tainted 5.4.0-rc8-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: <IRQ> __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x1c9/0x220 lib/dump_stack.c:118 kmsan_report+0x128/0x220 mm/kmsan/kmsan_report.c:108 __msan_warning+0x57/0xa0 mm/kmsan/kmsan_instr.c:245 bond_neigh_setup+0xa4/0x110 drivers/net/bonding/bond_main.c:3655 bond_neigh_init+0x216/0x4b0 drivers/net/bonding/bond_main.c:3626 ___neigh_create+0x169e/0x2c40 net/core/neighbour.c:613 __neigh_create+0xbd/0xd0 net/core/neighbour.c:674 ip6_finish_output2+0x149a/0x2670 net/ipv6/ip6_output.c:113 __ip6_finish_output+0x83d/0x8f0 net/ipv6/ip6_output.c:142 ip6_finish_output+0x2db/0x420 net/ipv6/ip6_output.c:152 NF_HOOK_COND include/linux/netfilter.h:294 [inline] ip6_output+0x5d3/0x720 net/ipv6/ip6_output.c:175 dst_output include/net/dst.h:436 [inline] NF_HOOK include/linux/netfilter.h:305 [inline] mld_sendpack+0xebd/0x13d0 net/ipv6/mcast.c:1682 mld_send_cr net/ipv6/mcast.c:1978 [inline] mld_ifc_timer_expire+0x116b/0x1680 net/ipv6/mcast.c:2477 call_timer_fn+0x232/0x530 kernel/time/timer.c:1404 expire_timers kernel/time/timer.c:1449 [inline] __run_timers+0xd60/0x1270 kernel/time/timer.c:1773 run_timer_softirq+0x2d/0x50 kernel/time/timer.c:1786 __do_softirq+0x4a1/0x83a kernel/softirq.c:293 invoke_softirq kernel/softirq.c:375 [inline] irq_exit+0x230/0x280 kernel/softirq.c:416 exiting_irq+0xe/0x10 arch/x86/include/asm/apic.h:536 smp_apic_timer_interrupt+0x48/0x70 arch/x86/kernel/apic/apic.c:1138 apic_timer_interrupt+0x2e/0x40 arch/x86/entry/entry_64.S:835 </IRQ> RIP: 0010:kmsan_free_page+0x18d/0x1c0 mm/kmsan/kmsan_shadow.c:439 Code: 4c 89 ff 44 89 f6 e8 82 0d ee ff 65 ff 0d 9f 26 3b 60 65 8b 05 98 26 3b 60 85 c0 75 24 e8 5b f6 35 ff 4c 89 6d d0 ff 75 d0 9d <48> 83 c4 10 5b 41 5c 41 5d 41 5e 41 5f 5d c3 0f 0b 0f 0b 0f 0b 0f RSP: 0018:ffffb328034af818 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13 RAX: 0000000000000000 RBX: ffffe2d7471f8360 RCX: 0000000000000000 RDX: ffffffffadea7000 RSI: 0000000000000004 RDI: ffff93496fcda104 RBP: ffffb328034af850 R08: ffff934a47e86d00 R09: ffff93496fc41900 R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000001 R13: 0000000000000246 R14: 0000000000000000 R15: ffffe2d7472225c0 free_pages_prepare mm/page_alloc.c:1138 [inline] free_pcp_prepare mm/page_alloc.c:1230 [inline] free_unref_page_prepare+0x1d9/0x770 mm/page_alloc.c:3025 free_unref_page mm/page_alloc.c:3074 [inline] free_the_page mm/page_alloc.c:4832 [inline] __free_pages+0x154/0x230 mm/page_alloc.c:4840 __vunmap+0xdac/0xf20 mm/vmalloc.c:2277 __vfree mm/vmalloc.c:2325 [inline] vfree+0x7c/0x170 mm/vmalloc.c:2355 copy_entries_to_user net/ipv6/netfilter/ip6_tables.c:883 [inline] get_entries net/ipv6/netfilter/ip6_tables.c:1041 [inline] do_ip6t_get_ctl+0xfa4/0x1030 net/ipv6/netfilter/ip6_tables.c:1709 nf_sockopt net/netfilter/nf_sockopt.c:104 [inline] nf_getsockopt+0x481/0x4e0 net/netfilter/nf_sockopt.c:122 ipv6_getsockopt+0x264/0x510 net/ipv6/ipv6_sockglue.c:1400 tcp_getsockopt+0x1c6/0x1f0 net/ipv4/tcp.c:3688 sock_common_getsockopt+0x13f/0x180 net/core/sock.c:3110 __sys_getsockopt+0x533/0x7b0 net/socket.c:2129 __do_sys_getsockopt net/socket.c:2144 [inline] __se_sys_getsockopt+0xe1/0x100 net/socket.c:2141 __x64_sys_getsockopt+0x62/0x80 net/socket.c:2141 do_syscall_64+0xb6/0x160 arch/x86/entry/common.c:291 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x45d20a Code: b8 34 01 00 00 0f 05 48 3d 01 f0 ff ff 0f 83 8d 8b fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 49 89 ca b8 37 00 00 00 0f 05 <48> 3d 01 f0 ff ff 0f 83 6a 8b fb ff c3 66 0f 1f 84 00 00 00 00 00 RSP: 002b:0000000000a6f618 EFLAGS: 00000212 ORIG_RAX: 0000000000000037 RAX: ffffffffffffffda RBX: 0000000000a6f640 RCX: 000000000045d20a RDX: 0000000000000041 RSI: 0000000000000029 RDI: 0000000000000003 RBP: 0000000000717cc0 R08: 0000000000a6f63c R09: 0000000000004000 R10: 0000000000a6f740 R11: 0000000000000212 R12: 0000000000000003 R13: 0000000000000000 R14: 0000000000000029 R15: 0000000000715b00 Local variable description: ----parms@bond_neigh_init Variable was created at: bond_neigh_init+0x8c/0x4b0 drivers/net/bonding/bond_main.c:3617 bond_neigh_init+0x8c/0x4b0 drivers/net/bonding/bond_main.c:3617 Fixes: 9918d5bf329d ("bonding: modify only neigh_parms owned by us") Fixes: 234bcf8a499e ("net/bonding: correctly proxy slave neigh param setup ndo function") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Cc: Jay Vosburgh <j.vosburgh@gmail.com> Cc: Veaceslav Falico <vfalico@gmail.com> Cc: Andy Gospodarek <andy@greyhouse.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-09neighbour: remove neigh_cleanup() methodEric Dumazet
neigh_cleanup() has not been used for seven years, and was a wrong design. Messing with shared pointer in bond_neigh_init() without proper memory barriers would at least trigger syzbot complains eventually. It is time to remove this stuff. Fixes: b63b70d87741 ("IPoIB: Use a private hash table for path lookup in xmit path") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-16bonding: symmetric ICMP transmitMatteo Croce
A bonding with layer2+3 or layer3+4 hashing uses the IP addresses and the ports to balance packets between slaves. With some network errors, we receive an ICMP error packet by the remote host or a router. If sent by a router, the source IP can differ from the remote host one. Additionally the ICMP protocol has no port numbers, so a layer3+4 bonding will get a different hash than the previous one. These two conditions could let the packet go through a different interface than the other packets of the same flow: # tcpdump -qltnni veth0 |sed 's/^/0: /' & # tcpdump -qltnni veth1 |sed 's/^/1: /' & # hping3 -2 192.168.0.2 -p 9 0: IP 192.168.0.1.2251 > 192.168.0.2.9: UDP, length 0 1: IP 192.168.0.2 > 192.168.0.1: ICMP 192.168.0.2 udp port 9 unreachable, length 36 1: IP 192.168.0.1.2252 > 192.168.0.2.9: UDP, length 0 1: IP 192.168.0.2 > 192.168.0.1: ICMP 192.168.0.2 udp port 9 unreachable, length 36 1: IP 192.168.0.1.2253 > 192.168.0.2.9: UDP, length 0 1: IP 192.168.0.2 > 192.168.0.1: ICMP 192.168.0.2 udp port 9 unreachable, length 36 0: IP 192.168.0.1.2254 > 192.168.0.2.9: UDP, length 0 1: IP 192.168.0.2 > 192.168.0.1: ICMP 192.168.0.2 udp port 9 unreachable, length 36 An ICMP error packet contains the header of the packet which caused the network error, so inspect it and match the flow against it, so we can send the ICMP via the same interface of the previous packet in the flow. Move the IP and port dissect code into a generic function bond_flow_ip() and if we are dissecting an ICMP error packet, call it again with the adjusted offset. # hping3 -2 192.168.0.2 -p 9 1: IP 192.168.0.1.1224 > 192.168.0.2.9: UDP, length 0 1: IP 192.168.0.2 > 192.168.0.1: ICMP 192.168.0.2 udp port 9 unreachable, length 36 1: IP 192.168.0.1.1225 > 192.168.0.2.9: UDP, length 0 1: IP 192.168.0.2 > 192.168.0.1: ICMP 192.168.0.2 udp port 9 unreachable, length 36 0: IP 192.168.0.1.1226 > 192.168.0.2.9: UDP, length 0 0: IP 192.168.0.2 > 192.168.0.1: ICMP 192.168.0.2 udp port 9 unreachable, length 36 0: IP 192.168.0.1.1227 > 192.168.0.2.9: UDP, length 0 0: IP 192.168.0.2 > 192.168.0.1: ICMP 192.168.0.2 udp port 9 unreachable, length 36 Signed-off-by: Matteo Croce <mcroce@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-09Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netDavid S. Miller
One conflict in the BPF samples Makefile, some fixes in 'net' whilst we were converting over to Makefile.target rules in 'net-next'. Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-05bonding: fix state transition issue in link monitoringJay Vosburgh
Since de77ecd4ef02 ("bonding: improve link-status update in mii-monitoring"), the bonding driver has utilized two separate variables to indicate the next link state a particular slave should transition to. Each is used to communicate to a different portion of the link state change commit logic; one to the bond_miimon_commit function itself, and another to the state transition logic. Unfortunately, the two variables can become unsynchronized, resulting in incorrect link state transitions within bonding. This can cause slaves to become stuck in an incorrect link state until a subsequent carrier state transition. The issue occurs when a special case in bond_slave_netdev_event sets slave->link directly to BOND_LINK_FAIL. On the next pass through bond_miimon_inspect after the slave goes carrier up, the BOND_LINK_FAIL case will set the proposed next state (link_new_state) to BOND_LINK_UP, but the new_link to BOND_LINK_DOWN. The setting of the final link state from new_link comes after that from link_new_state, and so the slave will end up incorrectly in _DOWN state. Resolve this by combining the two variables into one. Reported-by: Aleksei Zakharov <zakharov.a.g@yandex.ru> Reported-by: Sha Zhang <zhangsha.zhang@huawei.com> Cc: Mahesh Bandewar <maheshb@google.com> Fixes: de77ecd4ef02 ("bonding: improve link-status update in mii-monitoring") Signed-off-by: Jay Vosburgh <jay.vosburgh@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-02Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netDavid S. Miller
The only slightly tricky merge conflict was the netdevsim because the mutex locking fix overlapped a lot of driver reload reorganization. The rest were (relatively) trivial in nature. Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-30bonding: balance ICMP echoes in layer3+4 modeMatteo Croce
The bonding uses the L4 ports to balance flows between slaves. As the ICMP protocol has no ports, those packets are sent all to the same device: # tcpdump -qltnni veth0 ip |sed 's/^/0: /' & # tcpdump -qltnni veth1 ip |sed 's/^/1: /' & # ping -qc1 192.168.0.2 1: IP 192.168.0.1 > 192.168.0.2: ICMP echo request, id 315, seq 1, length 64 1: IP 192.168.0.2 > 192.168.0.1: ICMP echo reply, id 315, seq 1, length 64 # ping -qc1 192.168.0.2 1: IP 192.168.0.1 > 192.168.0.2: ICMP echo request, id 316, seq 1, length 64 1: IP 192.168.0.2 > 192.168.0.1: ICMP echo reply, id 316, seq 1, length 64 # ping -qc1 192.168.0.2 1: IP 192.168.0.1 > 192.168.0.2: ICMP echo request, id 317, seq 1, length 64 1: IP 192.168.0.2 > 192.168.0.1: ICMP echo reply, id 317, seq 1, length 64 But some ICMP packets have an Identifier field which is used to match packets within sessions, let's use this value in the hash function to balance these packets between bond slaves: # ping -qc1 192.168.0.2 0: IP 192.168.0.1 > 192.168.0.2: ICMP echo request, id 303, seq 1, length 64 0: IP 192.168.0.2 > 192.168.0.1: ICMP echo reply, id 303, seq 1, length 64 # ping -qc1 192.168.0.2 1: IP 192.168.0.1 > 192.168.0.2: ICMP echo request, id 304, seq 1, length 64 1: IP 192.168.0.2 > 192.168.0.1: ICMP echo reply, id 304, seq 1, length 64 Aso, let's use a flow_dissector_key which defines FLOW_DISSECTOR_KEY_ICMP, so we can balance pings encapsulated in a tunnel when using mode encap3+4: # ping -q 192.168.1.2 -c1 0: IP 192.168.0.1 > 192.168.0.2: GREv0, length 102: IP 192.168.1.1 > 192.168.1.2: ICMP echo request, id 585, seq 1, length 64 0: IP 192.168.0.2 > 192.168.0.1: GREv0, length 102: IP 192.168.1.2 > 192.168.1.1: ICMP echo reply, id 585, seq 1, length 64 # ping -q 192.168.1.2 -c1 1: IP 192.168.0.1 > 192.168.0.2: GREv0, length 102: IP 192.168.1.1 > 192.168.1.2: ICMP echo request, id 586, seq 1, length 64 1: IP 192.168.0.2 > 192.168.0.1: GREv0, length 102: IP 192.168.1.2 > 192.168.1.1: ICMP echo reply, id 586, seq 1, length 64 Signed-off-by: Matteo Croce <mcroce@redhat.com> Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-29bonding: fix using uninitialized mode_lockTaehee Yoo
When a bonding interface is being created, it setups its mode and options. At that moment, it uses mode_lock so mode_lock should be initialized before that moment. rtnl_newlink() rtnl_create_link() alloc_netdev_mqs() ->setup() //bond_setup() ->newlink //bond_newlink bond_changelink() register_netdevice() ->ndo_init() //bond_init() After commit 089bca2caed0 ("bonding: use dynamic lockdep key instead of subclass"), mode_lock is initialized in bond_init(). So in the bond_changelink(), un-initialized mode_lock can be used. mode_lock should be initialized in bond_setup(). This patch partially reverts commit 089bca2caed0 ("bonding: use dynamic lockdep key instead of subclass") Test command: ip link add bond0 type bond mode 802.3ad lacp_rate 0 Splat looks like: [ 60.615127] INFO: trying to register non-static key. [ 60.615900] the code is fine but needs lockdep annotation. [ 60.616697] turning off the locking correctness validator. [ 60.617490] CPU: 1 PID: 957 Comm: ip Not tainted 5.4.0-rc3+ #109 [ 60.618350] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006 [ 60.619481] Call Trace: [ 60.619918] dump_stack+0x7c/0xbb [ 60.620453] register_lock_class+0x1215/0x14d0 [ 60.621131] ? alloc_netdev_mqs+0x7b3/0xcc0 [ 60.621771] ? is_bpf_text_address+0x86/0xf0 [ 60.622416] ? is_dynamic_key+0x230/0x230 [ 60.623032] ? unwind_get_return_address+0x5f/0xa0 [ 60.623757] ? create_prof_cpu_mask+0x20/0x20 [ 60.624408] ? arch_stack_walk+0x83/0xb0 [ 60.625023] __lock_acquire+0xd8/0x3de0 [ 60.625616] ? stack_trace_save+0x82/0xb0 [ 60.626225] ? stack_trace_consume_entry+0x160/0x160 [ 60.626957] ? deactivate_slab.isra.80+0x2c5/0x800 [ 60.627668] ? register_lock_class+0x14d0/0x14d0 [ 60.628380] ? alloc_netdev_mqs+0x7b3/0xcc0 [ 60.629020] ? save_stack+0x69/0x80 [ 60.629574] ? save_stack+0x19/0x80 [ 60.630121] ? __kasan_kmalloc.constprop.4+0xa0/0xd0 [ 60.630859] ? __kmalloc_node+0x16f/0x480 [ 60.631472] ? alloc_netdev_mqs+0x7b3/0xcc0 [ 60.632121] ? rtnl_create_link+0x2ed/0xad0 [ 60.634388] ? __rtnl_newlink+0xad4/0x11b0 [ 60.635024] lock_acquire+0x164/0x3b0 [ 60.635608] ? bond_3ad_update_lacp_rate+0x91/0x200 [bonding] [ 60.636463] _raw_spin_lock_bh+0x38/0x70 [ 60.637084] ? bond_3ad_update_lacp_rate+0x91/0x200 [bonding] [ 60.637930] bond_3ad_update_lacp_rate+0x91/0x200 [bonding] [ 60.638753] ? bond_3ad_lacpdu_recv+0xb30/0xb30 [bonding] [ 60.639552] ? bond_opt_get_val+0x180/0x180 [bonding] [ 60.640307] ? ___slab_alloc+0x5aa/0x610 [ 60.640925] bond_option_lacp_rate_set+0x71/0x140 [bonding] [ 60.641751] __bond_opt_set+0x1ff/0xbb0 [bonding] [ 60.643217] ? kasan_unpoison_shadow+0x30/0x40 [ 60.643924] bond_changelink+0x9a4/0x1700 [bonding] [ 60.644653] ? memset+0x1f/0x40 [ 60.742941] ? bond_slave_changelink+0x1a0/0x1a0 [bonding] [ 60.752694] ? alloc_netdev_mqs+0x8ea/0xcc0 [ 60.753330] ? rtnl_create_link+0x2ed/0xad0 [ 60.753964] bond_newlink+0x1e/0x60 [bonding] [ 60.754612] __rtnl_newlink+0xb9f/0x11b0 [ ... ] Reported-by: syzbot+8da67f407bcba2c72e6e@syzkaller.appspotmail.com Reported-by: syzbot+0d083911ab18b710da71@syzkaller.appspotmail.com Fixes: 089bca2caed0 ("bonding: use dynamic lockdep key instead of subclass") Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-24net: remove unnecessary variables and callbackTaehee Yoo
This patch removes variables and callback these are related to the nested device structure. devices that can be nested have their own nest_level variable that represents the depth of nested devices. In the previous patch, new {lower/upper}_level variables are added and they replace old private nest_level variable. So, this patch removes all 'nest_level' variables. In order to avoid lockdep warning, ->ndo_get_lock_subclass() was added to get lockdep subclass value, which is actually lower nested depth value. But now, they use the dynamic lockdep key to avoid lockdep warning instead of the subclass. So, this patch removes ->ndo_get_lock_subclass() callback. Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-24bonding: use dynamic lockdep key instead of subclassTaehee Yoo
All bonding device has same lockdep key and subclass is initialized with nest_level. But actual nest_level value can be changed when a lower device is attached. And at this moment, the subclass should be updated but it seems to be unsafe. So this patch makes bonding use dynamic lockdep key instead of the subclass. Test commands: ip link add bond0 type bond for i in {1..5} do let A=$i-1 ip link add bond$i type bond ip link set bond$i master bond$A done ip link set bond5 master bond0 Splat looks like: [ 307.992912] WARNING: possible recursive locking detected [ 307.993656] 5.4.0-rc3+ #96 Tainted: G W [ 307.994367] -------------------------------------------- [ 307.995092] ip/761 is trying to acquire lock: [ 307.995710] ffff8880513aac60 (&(&bond->stats_lock)->rlock#2/2){+.+.}, at: bond_get_stats+0xb8/0x500 [bonding] [ 307.997045] but task is already holding lock: [ 307.997923] ffff88805fcbac60 (&(&bond->stats_lock)->rlock#2/2){+.+.}, at: bond_get_stats+0xb8/0x500 [bonding] [ 307.999215] other info that might help us debug this: [ 308.000251] Possible unsafe locking scenario: [ 308.001137] CPU0 [ 308.001533] ---- [ 308.001915] lock(&(&bond->stats_lock)->rlock#2/2); [ 308.002609] lock(&(&bond->stats_lock)->rlock#2/2); [ 308.003302] *** DEADLOCK *** [ 308.004310] May be due to missing lock nesting notation [ 308.005319] 3 locks held by ip/761: [ 308.005830] #0: ffffffff9fcc42b0 (rtnl_mutex){+.+.}, at: rtnetlink_rcv_msg+0x466/0x8a0 [ 308.006894] #1: ffff88805fcbac60 (&(&bond->stats_lock)->rlock#2/2){+.+.}, at: bond_get_stats+0xb8/0x500 [bonding] [ 308.008243] #2: ffffffff9f9219c0 (rcu_read_lock){....}, at: bond_get_stats+0x9f/0x500 [bonding] [ 308.009422] stack backtrace: [ 308.010124] CPU: 0 PID: 761 Comm: ip Tainted: G W 5.4.0-rc3+ #96 [ 308.011097] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006 [ 308.012179] Call Trace: [ 308.012601] dump_stack+0x7c/0xbb [ 308.013089] __lock_acquire+0x269d/0x3de0 [ 308.013669] ? register_lock_class+0x14d0/0x14d0 [ 308.014318] lock_acquire+0x164/0x3b0 [ 308.014858] ? bond_get_stats+0xb8/0x500 [bonding] [ 308.015520] _raw_spin_lock_nested+0x2e/0x60 [ 308.016129] ? bond_get_stats+0xb8/0x500 [bonding] [ 308.017215] bond_get_stats+0xb8/0x500 [bonding] [ 308.018454] ? bond_arp_rcv+0xf10/0xf10 [bonding] [ 308.019710] ? rcu_read_lock_held+0x90/0xa0 [ 308.020605] ? rcu_read_lock_sched_held+0xc0/0xc0 [ 308.021286] ? bond_get_stats+0x9f/0x500 [bonding] [ 308.021953] dev_get_stats+0x1ec/0x270 [ 308.022508] bond_get_stats+0x1d1/0x500 [bonding] Fixes: d3fff6c443fe ("net: add netdev_lockdep_set_classes() helper") Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-24bonding: fix unexpected IFF_BONDING bit unsetTaehee Yoo
The IFF_BONDING means bonding master or bonding slave device. ->ndo_add_slave() sets IFF_BONDING flag and ->ndo_del_slave() unsets IFF_BONDING flag. bond0<--bond1 Both bond0 and bond1 are bonding device and these should keep having IFF_BONDING flag until they are removed. But bond1 would lose IFF_BONDING at ->ndo_del_slave() because that routine do not check whether the slave device is the bonding type or not. This patch adds the interface type check routine before removing IFF_BONDING flag. Test commands: ip link add bond0 type bond ip link add bond1 type bond ip link set bond1 master bond0 ip link set bond1 nomaster ip link del bond1 type bond ip link add bond1 type bond Splat looks like: [ 226.665555] proc_dir_entry 'bonding/bond1' already registered [ 226.666440] WARNING: CPU: 0 PID: 737 at fs/proc/generic.c:361 proc_register+0x2a9/0x3e0 [ 226.667571] Modules linked in: bonding af_packet sch_fq_codel ip_tables x_tables unix [ 226.668662] CPU: 0 PID: 737 Comm: ip Not tainted 5.4.0-rc3+ #96 [ 226.669508] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006 [ 226.670652] RIP: 0010:proc_register+0x2a9/0x3e0 [ 226.671612] Code: 89 fa 48 c1 ea 03 80 3c 02 00 0f 85 39 01 00 00 48 8b 04 24 48 89 ea 48 c7 c7 a0 0b 14 9f 48 8b b0 e 0 00 00 00 e8 07 e7 88 ff <0f> 0b 48 c7 c7 40 2d a5 9f e8 59 d6 23 01 48 8b 4c 24 10 48 b8 00 [ 226.675007] RSP: 0018:ffff888050e17078 EFLAGS: 00010282 [ 226.675761] RAX: dffffc0000000008 RBX: ffff88805fdd0f10 RCX: ffffffff9dd344e2 [ 226.676757] RDX: 0000000000000001 RSI: 0000000000000008 RDI: ffff88806c9f6b8c [ 226.677751] RBP: ffff8880507160f3 R08: ffffed100d940019 R09: ffffed100d940019 [ 226.678761] R10: 0000000000000001 R11: ffffed100d940018 R12: ffff888050716008 [ 226.679757] R13: ffff8880507160f2 R14: dffffc0000000000 R15: ffffed100a0e2c1e [ 226.680758] FS: 00007fdc217cc0c0(0000) GS:ffff88806c800000(0000) knlGS:0000000000000000 [ 226.681886] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 226.682719] CR2: 00007f49313424d0 CR3: 0000000050e46001 CR4: 00000000000606f0 [ 226.683727] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 226.684725] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 226.685681] Call Trace: [ 226.687089] proc_create_seq_private+0xb3/0xf0 [ 226.687778] bond_create_proc_entry+0x1b3/0x3f0 [bonding] [ 226.691458] bond_netdev_event+0x433/0x970 [bonding] [ 226.692139] ? __module_text_address+0x13/0x140 [ 226.692779] notifier_call_chain+0x90/0x160 [ 226.693401] register_netdevice+0x9b3/0xd80 [ 226.694010] ? alloc_netdev_mqs+0x854/0xc10 [ 226.694629] ? netdev_change_features+0xa0/0xa0 [ 226.695278] ? rtnl_create_link+0x2ed/0xad0 [ 226.695849] bond_newlink+0x2a/0x60 [bonding] [ 226.696422] __rtnl_newlink+0xb9f/0x11b0 [ 226.696968] ? rtnl_link_unregister+0x220/0x220 [ ... ] Fixes: 0b680e753724 ("[PATCH] bonding: Add priv_flag to avoid event mishandling") Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-24net: core: add generic lockdep keysTaehee Yoo
Some interface types could be nested. (VLAN, BONDING, TEAM, MACSEC, MACVLAN, IPVLAN, VIRT_WIFI, VXLAN, etc..) These interface types should set lockdep class because, without lockdep class key, lockdep always warn about unexisting circular locking. In the current code, these interfaces have their own lockdep class keys and these manage itself. So that there are so many duplicate code around the /driver/net and /net/. This patch adds new generic lockdep keys and some helper functions for it. This patch does below changes. a) Add lockdep class keys in struct net_device - qdisc_running, xmit, addr_list, qdisc_busylock - these keys are used as dynamic lockdep key. b) When net_device is being allocated, lockdep keys are registered. - alloc_netdev_mqs() c) When net_device is being free'd llockdep keys are unregistered. - free_netdev() d) Add generic lockdep key helper function - netdev_register_lockdep_key() - netdev_unregister_lockdep_key() - netdev_update_lockdep_key() e) Remove unnecessary generic lockdep macro and functions f) Remove unnecessary lockdep code of each interfaces. After this patch, each interface modules don't need to maintain their lockdep keys. Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-09bonding: fix potential NULL deref in bond_update_slave_arrEric Dumazet
syzbot got a NULL dereference in bond_update_slave_arr() [1], happening after a failure to allocate bond->slave_arr A workqueue (bond_slave_arr_handler) is supposed to retry the allocation later, but if the slave is removed before the workqueue had a chance to complete, bond->slave_arr can still be NULL. [1] Failed to build slave-array. kasan: CONFIG_KASAN_INLINE enabled kasan: GPF could be caused by NULL-ptr deref or user memory access general protection fault: 0000 [#1] SMP KASAN PTI Modules linked in: Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:bond_update_slave_arr.cold+0xc6/0x198 drivers/net/bonding/bond_main.c:4039 RSP: 0018:ffff88018fe33678 EFLAGS: 00010246 RAX: dffffc0000000000 RBX: 0000000000000000 RCX: ffffc9000290b000 RDX: 0000000000000000 RSI: ffffffff82b63037 RDI: ffff88019745ea20 RBP: ffff88018fe33760 R08: ffff880170754280 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 R13: ffff88019745ea00 R14: 0000000000000000 R15: ffff88018fe338b0 FS: 00007febd837d700(0000) GS:ffff8801dad00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000004540a0 CR3: 00000001c242e005 CR4: 00000000001626f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: [<ffffffff82b5b45e>] __bond_release_one+0x43e/0x500 drivers/net/bonding/bond_main.c:1923 [<ffffffff82b5b966>] bond_release drivers/net/bonding/bond_main.c:2039 [inline] [<ffffffff82b5b966>] bond_do_ioctl+0x416/0x870 drivers/net/bonding/bond_main.c:3562 [<ffffffff83ae25f4>] dev_ifsioc+0x6f4/0x940 net/core/dev_ioctl.c:328 [<ffffffff83ae2e58>] dev_ioctl+0x1b8/0xc70 net/core/dev_ioctl.c:495 [<ffffffff83995ffd>] sock_do_ioctl+0x1bd/0x300 net/socket.c:1088 [<ffffffff83996a80>] sock_ioctl+0x300/0x5d0 net/socket.c:1196 [<ffffffff81b124db>] vfs_ioctl fs/ioctl.c:47 [inline] [<ffffffff81b124db>] file_ioctl fs/ioctl.c:501 [inline] [<ffffffff81b124db>] do_vfs_ioctl+0xacb/0x1300 fs/ioctl.c:688 [<ffffffff81b12dc6>] SYSC_ioctl fs/ioctl.c:705 [inline] [<ffffffff81b12dc6>] SyS_ioctl+0xb6/0xe0 fs/ioctl.c:696 [<ffffffff8101ccc8>] do_syscall_64+0x528/0x770 arch/x86/entry/common.c:305 [<ffffffff84400091>] entry_SYSCALL_64_after_hwframe+0x42/0xb7 Fixes: ee6377147409 ("bonding: Simplify the xmit function for modes that use xmit_hash") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Cc: Mahesh Bandewar <maheshb@google.com> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
2019-08-08bonding: Add vlan tx offload to hw_enc_featuresYueHaibing
As commit 30d8177e8ac7 ("bonding: Always enable vlan tx offload") said, we should always enable bonding's vlan tx offload, pass the vlan packets to the slave devices with vlan tci, let them to handle vlan implementation. Now if encapsulation protocols like VXLAN is used, skb->encapsulation may be set, then the packet is passed to vlan device which based on bonding device. However in netif_skb_features(), the check of hw_enc_features: if (skb->encapsulation) features &= dev->hw_enc_features; clears NETIF_F_HW_VLAN_CTAG_TX/NETIF_F_HW_VLAN_STAG_TX. This results in same issue in commit 30d8177e8ac7 like this: vlan_dev_hard_start_xmit -->dev_queue_xmit -->validate_xmit_skb -->netif_skb_features //NETIF_F_HW_VLAN_CTAG_TX is cleared -->validate_xmit_vlan -->__vlan_hwaccel_push_inside //skb->tci is cleared ... --> bond_start_xmit --> bond_xmit_hash //BOND_XMIT_POLICY_ENCAP34 --> __skb_flow_dissect // nhoff point to IP header --> case htons(ETH_P_8021Q) // skb_vlan_tag_present is false, so vlan = __skb_header_pointer(skb, nhoff, sizeof(_vlan), //vlan point to ip header wrongly Fixes: b2a103e6d0af ("bonding: convert to ndo_fix_features") Signed-off-by: YueHaibing <yuehaibing@huawei.com> Acked-by: Jay Vosburgh <jay.vosburgh@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-22bonding: Force slave speed check after link state recovery for 802.3adThomas Falcon
The following scenario was encountered during testing of logical partition mobility on pseries partitions with bonded ibmvnic adapters in LACP mode. 1. Driver receives a signal that the device has been swapped, and it needs to reset to initialize the new device. 2. Driver reports loss of carrier and begins initialization. 3. Bonding driver receives NETDEV_CHANGE notifier and checks the slave's current speed and duplex settings. Because these are unknown at the time, the bond sets its link state to BOND_LINK_FAIL and handles the speed update, clearing AD_PORT_LACP_ENABLE. 4. Driver finishes recovery and reports that the carrier is on. 5. Bond receives a new notification and checks the speed again. The speeds are valid but miimon has not altered the link state yet. AD_PORT_LACP_ENABLE remains off. Because the slave's link state is still BOND_LINK_FAIL, no further port checks are made when it recovers. Though the slave devices are operational and have valid speed and duplex settings, the bond will not send LACPDU's. The simplest fix I can see is to force another speed check in bond_miimon_commit. This way the bond will update AD_PORT_LACP_ENABLE if needed when transitioning from BOND_LINK_FAIL to BOND_LINK_UP. CC: Jarod Wilson <jarod@redhat.com> CC: Jay Vosburgh <j.vosburgh@gmail.com> CC: Veaceslav Falico <vfalico@gmail.com> CC: Andy Gospodarek <andy@greyhouse.net> Signed-off-by: Thomas Falcon <tlfalcon@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-08Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Two cases of overlapping changes, nothing fancy. Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-04bonding: add an option to specify a delay between peer notificationsVincent Bernat
Currently, gratuitous ARP/ND packets are sent every `miimon' milliseconds. This commit allows a user to specify a custom delay through a new option, `peer_notif_delay'. Like for `updelay' and `downdelay', this delay should be a multiple of `miimon' to avoid managing an additional work queue. The configuration logic is copied from `updelay' and `downdelay'. However, the default value cannot be set using a module parameter: Netlink or sysfs should be used to configure this feature. When setting `miimon' to 100 and `peer_notif_delay' to 500, we can observe the 500 ms delay is respected: 20:30:19.354693 ARP, Request who-has 203.0.113.10 tell 203.0.113.10, length 28 20:30:19.874892 ARP, Request who-has 203.0.113.10 tell 203.0.113.10, length 28 20:30:20.394919 ARP, Request who-has 203.0.113.10 tell 203.0.113.10, length 28 20:30:20.914963 ARP, Request who-has 203.0.113.10 tell 203.0.113.10, length 28 In bond_mii_monitor(), I have tried to keep the lock logic readable. The change is due to the fact we cannot rely on a notification to lower the value of `bond->send_peer_notif' as `NETDEV_NOTIFY_PEERS' is only triggered once every N times, while we need to decrement the counter each time. iproute2 also needs to be updated to be able to specify this new attribute through `ip link'. Signed-off-by: Vincent Bernat <vincent@bernat.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-03bonding: validate ip header before check IPPROTO_IGMPCong Wang
bond_xmit_roundrobin() checks for IGMP packets but it parses the IP header even before checking skb->protocol. We should validate the IP header with pskb_may_pull() before using iph->protocol. Reported-and-tested-by: syzbot+e5be16aa39ad6e755391@syzkaller.appspotmail.com Fixes: a2fd940f4cff ("bonding: fix broken multicast with round-robin mode") Cc: Jay Vosburgh <j.vosburgh@gmail.com> Cc: Veaceslav Falico <vfalico@gmail.com> Cc: Andy Gospodarek <andy@greyhouse.net> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-02bonding/main: fix NULL dereference in bond_select_active_slave()Eric Dumazet
A bonding master can be up while best_slave is NULL. [12105.636318] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000 [12105.638204] mlx4_en: eth1: Linkstate event 1 -> 1 [12105.648984] IP: bond_select_active_slave+0x125/0x250 [12105.653977] PGD 0 P4D 0 [12105.656572] Oops: 0000 [#1] SMP PTI [12105.660487] gsmi: Log Shutdown Reason 0x03 [12105.664620] Modules linked in: kvm_intel loop act_mirred uhaul vfat fat stg_standard_ftl stg_megablocks stg_idt stg_hdi stg elephant_dev_num stg_idt_eeprom w1_therm wire i2c_mux_pca954x i2c_mux mlx4_i2c i2c_usb cdc_acm ehci_pci ehci_hcd i2c_iimc mlx4_en mlx4_ib ib_uverbs ib_core mlx4_core [last unloaded: kvm_intel] [12105.685686] mlx4_core 0000:03:00.0: dispatching link up event for port 2 [12105.685700] mlx4_en: eth2: Linkstate event 2 -> 1 [12105.685700] mlx4_en: eth2: Link Up (linkstate) [12105.724452] Workqueue: bond0 bond_mii_monitor [12105.728854] RIP: 0010:bond_select_active_slave+0x125/0x250 [12105.734355] RSP: 0018:ffffaf146a81fd88 EFLAGS: 00010246 [12105.739637] RAX: 0000000000000003 RBX: ffff8c62b03c6900 RCX: 0000000000000000 [12105.746838] RDX: 0000000000000000 RSI: ffffaf146a81fd08 RDI: ffff8c62b03c6000 [12105.754054] RBP: ffffaf146a81fdb8 R08: 0000000000000001 R09: ffff8c517d387600 [12105.761299] R10: 00000000001075d9 R11: ffffffffaceba92f R12: 0000000000000000 [12105.768553] R13: ffff8c8240ae4800 R14: 0000000000000000 R15: 0000000000000000 [12105.775748] FS: 0000000000000000(0000) GS:ffff8c62bfa40000(0000) knlGS:0000000000000000 [12105.783892] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [12105.789716] CR2: 0000000000000000 CR3: 0000000d0520e001 CR4: 00000000001626f0 [12105.796976] Call Trace: [12105.799446] [<ffffffffac31d387>] bond_mii_monitor+0x497/0x6f0 [12105.805317] [<ffffffffabd42643>] process_one_work+0x143/0x370 [12105.811225] [<ffffffffabd42c7a>] worker_thread+0x4a/0x360 [12105.816761] [<ffffffffabd48bc5>] kthread+0x105/0x140 [12105.821865] [<ffffffffabd42c30>] ? rescuer_thread+0x380/0x380 [12105.827757] [<ffffffffabd48ac0>] ? kthread_associate_blkcg+0xc0/0xc0 [12105.834266] [<ffffffffac600241>] ret_from_fork+0x51/0x60 Fixes: e2a7420df2e0 ("bonding/main: convert to using slave printk macros") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: John Sperbeck <jsperbeck@google.com> Cc: Jarod Wilson <jarod@redhat.com> CC: Jay Vosburgh <j.vosburgh@gmail.com> CC: Veaceslav Falico <vfalico@gmail.com> CC: Andy Gospodarek <andy@greyhouse.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-27Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
The new route handling in ip_mc_finish_output() from 'net' overlapped with the new support for returning congestion notifications from BPF programs. In order to handle this I had to take the dev_loopback_xmit() calls out of the switch statement. The aquantia driver conflicts were simple overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-26bonding: Always enable vlan tx offloadYueHaibing
We build vlan on top of bonding interface, which vlan offload is off, bond mode is 802.3ad (LACP) and xmit_hash_policy is BOND_XMIT_POLICY_ENCAP34. Because vlan tx offload is off, vlan tci is cleared and skb push the vlan header in validate_xmit_vlan() while sending from vlan devices. Then in bond_xmit_hash, __skb_flow_dissect() fails to get information from protocol headers encapsulated within vlan, because 'nhoff' is points to IP header, so bond hashing is based on layer 2 info, which fails to distribute packets across slaves. This patch always enable bonding's vlan tx offload, pass the vlan packets to the slave devices with vlan tci, let them to handle vlan implementation. Fixes: 278339a42a1b ("bonding: propogate vlan_features to bonding master") Suggested-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: YueHaibing <yuehaibing@huawei.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-09bonding/main: convert to using slave printk macrosJarod Wilson
All of these printk instances benefit from having both master and slave device information included, so convert to using a standardized macro format and remove redundant information. Suggested-by: Joe Perches <joe@perches.com> CC: Jay Vosburgh <j.vosburgh@gmail.com> CC: Veaceslav Falico <vfalico@gmail.com> CC: Andy Gospodarek <andy@greyhouse.net> CC: netdev@vger.kernel.org Signed-off-by: Jarod Wilson <jarod@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-09bonding: fix error messages in bond_do_fail_over_macJarod Wilson
Passing the bond name again to debug output when referencing slave is wrong. We're trying to set the bond's MAC to that of the new_active slave, so adjust the error message slightly and pass in the slave's name, not the bond's. Then we're trying to set the MAC on the old active slave, but putting the new active slave's name in the output. While we're at it, clarify the error messages so you know which one actually triggered. CC: Jay Vosburgh <j.vosburgh@gmail.com> CC: Veaceslav Falico <vfalico@gmail.com> CC: Andy Gospodarek <andy@greyhouse.net> CC: netdev@vger.kernel.org Signed-off-by: Jarod Wilson <jarod@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-09bonding: improve event debug usabilityJarod Wilson
Seeing bonding debug log data along the lines of "event: 5" is a bit spartan, and often requires a lookup table if you don't remember what every event is. Make use of netdev_cmd_to_name for an improved debugging experience, so for the prior example, you'll see: "bond_netdev_event received NETDEV_REGISTER" instead (both are prefixed with the device for which the event pertains). CC: Jay Vosburgh <j.vosburgh@gmail.com> CC: Veaceslav Falico <vfalico@gmail.com> CC: Andy Gospodarek <andy@greyhouse.net> CC: netdev@vger.kernel.org Signed-off-by: Jarod Wilson <jarod@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-04net: bonding: Inherit MPLS features from slave devicesAriel Levkovich
When setting the bonding interface net device features, the kernel code doesn't address the slaves' MPLS features and doesn't inherit them. Therefore, HW offloads that enhance performance such as checksumming and TSO are disabled for MPLS tagged traffic flowing via the bonding interface. The patch add the inheritance of the MPLS features from the slave devices with a similar logic to setting the bonding device's VLAN and encapsulation features. CC: Jay Vosburgh <j.vosburgh@gmail.com> CC: Veaceslav Falico <vfalico@gmail.com> CC: Andy Gospodarek <andy@greyhouse.net> Signed-off-by: Ariel Levkovich <lariel@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-26bonding/802.3ad: fix slave link initialization transition statesJarod Wilson
Once in a while, with just the right timing, 802.3ad slaves will fail to properly initialize, winding up in a weird state, with a partner system mac address of 00:00:00:00:00:00. This started happening after a fix to properly track link_failure_count tracking, where an 802.3ad slave that reported itself as link up in the miimon code, but wasn't able to get a valid speed/duplex, started getting set to BOND_LINK_FAIL instead of BOND_LINK_DOWN. That was the proper thing to do for the general "my link went down" case, but has created a link initialization race that can put the interface in this odd state. The simple fix is to instead set the slave link to BOND_LINK_DOWN again, if the link has never been up (last_link_up == 0), so the link state doesn't bounce from BOND_LINK_DOWN to BOND_LINK_FAIL -- it hasn't failed in this case, it simply hasn't been up yet, and this prevents the unnecessary state change from DOWN to FAIL and getting stuck in an init failure w/o a partner mac. Fixes: ea53abfab960 ("bonding/802.3ad: fix link_failure_count tracking") CC: Jay Vosburgh <j.vosburgh@gmail.com> CC: Veaceslav Falico <vfalico@gmail.com> CC: Andy Gospodarek <andy@greyhouse.net> CC: "David S. Miller" <davem@davemloft.net> CC: netdev@vger.kernel.org Tested-by: Heesoon Kim <Heesoon.Kim@stratus.com> Signed-off-by: Jarod Wilson <jarod@redhat.com> Acked-by: Jay Vosburgh <jay.vosburgh@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-17Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Conflict resolution of af_smc.c from Stephen Rothwell. Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-15bonding: fix event handling for stacked bondsSabrina Dubroca
When a bond is enslaved to another bond, bond_netdev_event() only handles the event as if the bond is a master, and skips treating the bond as a slave. This leads to a refcount leak on the slave, since we don't remove the adjacency to its master and the master holds a reference on the slave. Reproducer: ip link add bondL type bond ip link add bondU type bond ip link set bondL master bondU ip link del bondL No "Fixes:" tag, this code is older than git history. Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-20net: remove 'fallback' argument from dev->ndo_select_queue()Paolo Abeni
After the previous patch, all the callers of ndo_select_queue() provide as a 'fallback' argument netdev_pick_tx. The only exceptions are nested calls to ndo_select_queue(), which pass down the 'fallback' available in the current scope - still netdev_pick_tx. We can drop such argument and replace fallback() invocation with netdev_pick_tx(). This avoids an indirect call per xmit packet in some scenarios (TCP syn, UDP unconnected, XDP generic, pktgen) with device drivers implementing such ndo. It also clean the code a bit. Tested with ixgbe and CONFIG_FCOE=m With pktgen using queue xmit: threads vanilla patched (kpps) (kpps) 1 2334 2428 2 4166 4278 4 7895 8100 v1 -> v2: - rebased after helper's name change Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-24net: Remove switchdev.h inclusion from team/bond/vlanFlorian Fainelli
This is no longer necessary after eca59f691566 ("net: Remove support for bridge bypass ndos from stacked devices") Suggested-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Andy Gospodarek <andy@greyhouse.net> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-21bonding: fix PACKET_ORIGDEV regressionMichal Soltys
This patch fixes a subtle PACKET_ORIGDEV regression which was a side effect of fixes introduced by: 6a9e461f6fe4 bonding: pass link-local packets to bonding master also. ... to: b89f04c61efe bonding: deliver link-local packets with skb->dev set to link that packets arrived on While 6a9e461f6fe4 restored pre-b89f04c61efe presence of link-local packets on bonding masters (which is required e.g. by linux bridges participating in spanning tree or needed for lab-like setups created with group_fwd_mask) it also caused the originating device information to be lost due to cloning. Maciej Żenczykowski proposed another solution that doesn't require packet cloning and retains original device information - instead of returning RX_HANDLER_PASS for all link-local packets it's now limited only to packets from inactive slaves. At the same time, packets passed to bonding masters retain correct information about the originating device and PACKET_ORIGDEV can be used to determine it. This elegantly solves all issues so far: - link-local packets that were removed from bonding masters - LLDP daemons being forced to explicitly bind to slave interfaces - PACKET_ORIGDEV having no effect on bond interfaces Fixes: 6a9e461f6fe4 (bonding: pass link-local packets to bonding master also.) Reported-by: Vincent Bernat <vincent@bernat.ch> Signed-off-by: Michal Soltys <soltys@ziu.info> Signed-off-by: Maciej Żenczykowski <maze@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>