summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2016-07-20bpf: make xdp sample variable names more meaningfulBrenden Blanco
The naming choice of index is not terribly descriptive, and dropcnt is in fact incorrect for xdp2. Pick better names for these: ipproto and rxcnt. Signed-off-by: Brenden Blanco <bblanco@plumgrid.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-20rtnl: protect do_setlink from IFLA_XDP_ATTACHEDBrenden Blanco
The IFLA_XDP_ATTACHED nested attribute is meant for read-only, and while do_setlink properly ignores it, it should be more paranoid and reject commands that try to set it. Signed-off-by: Brenden Blanco <bblanco@plumgrid.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-20net/mlx4_en: use READ_ONCE when freeing xdp_progBrenden Blanco
For consistency, and in order to hint at the synchronous nature of the xdp_prog field, use READ_ONCE in the destroy path of the ring. All occurrences should now use either READ_ONCE or xchg. Signed-off-by: Brenden Blanco <bblanco@plumgrid.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-20Merge branch '100GbE' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue Jeff Kirsher says: ==================== 100GbE Intel Wired LAN Driver Updates 2016-07-20 This series contains updates to fm10k only. Ngai-Mint provides a fix to clear PCIE_GMBX bits to ensure the proper functioning of the mailbox global interrupt after a data path reset. Jake provides most of the patches in the series, starting with a early return from fm10k_down() if we are already down to prevent conflict with other threads. Fixed an issue where fm10k_update_stats() could cause a null pointer dereference, specifically if it is called when we are going down and the rings have been removed. Cleans up and fixes the data path reset flow, Tx hang routine and stop_hw(). Re-worked the fm10k_reinit() to be more maintainable and fixed several inconsistencies with the work flow. Implemented fm10k_prepare_suspend() and fm10k_handle_resume() which abstract around the now existing fm10k_prepare_for_reset and fm10k_handle_reset. The new functions also handle stopping the service task, which is something that the original re-init flow does not need. Fixed an issue where if an FLR occurs, VF devices will be knocked out of bus master mode, and the driver will be unable to recover from the reset properly, so ensure bus master is enabled after every reset. Fixed an issue where a reset will occur as if for no reason, regularly every few minutes until the switch manager software is loaded, which is caused by continuously requesting the lport map so only do the request after we have verified the switch mailbox is tx_ready. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-20Merge branch 'mv88r6xxx-eeprom-rework'David S. Miller
Vivien Didelot says: ==================== net: dsa: mv88e6xxx: rework EEPROM code Some switches can access an optional external EEPROM via its registers. The 88E6352 family of switches have 8-bit address / 16-bit data access. The new 88E6390 family has 16-bit address / 8-bit data access. This patchset cleans up the EEPROM code with 16-suffixed Global2 helpers and makes it easy to add future support for 8-bit data EEPROM access. It also removes unnecessary mutexes and a few locked access functions. Changes in v2: - add missing Signed-off-by tag ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-20net: dsa: mv88e6xxx: kill last locked reg_readVivien Didelot
Get rid of the last usage of the locked mv88e6xxx_reg_read function with a new mv88e6xxx_port_read helper, useful later for chips with different port registers base address. Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-20net: dsa: mv88e6xxx: rework EEPROM accessVivien Didelot
The 6352 family of switches and compatibles provide a 8-bit address and 16-bit data access to an optional EEPROM. Newer chip such as the 6390 family slightly changed the access to 16-bit address and 8-bit data. This commit cleans up the EEPROM access code for 16-bit access and makes it easy to eventually introduce future support for 8-bit access. Here's a list of notable changes brought by this patch: - provide Global2 unlocked helpers for EEPROM commands - remove eeprom_mutex, only reg_lock is necessary for driver functions - eeprom_len is 0 for chip without EEPROM, so return it directly - the Running bit must be 0 before r/w, so wait for Busy *and* Running - remove now unused mv88e6xxx_wait and mv88e6xxx_reg_write - other than that, the logic (in _{get,set}_eeprom16) didn't change Chips with an 8-bit EEPROM access will require to implement the 8-suffixed variant of G2 helpers and the related flag: #define MV88E6XXX_FLAGS_EEPROM8 \ (MV88E6XXX_FLAG_G2_EEPROM_CMD | \ MV88E6XXX_FLAG_G2_EEPROM_ADDR) Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-20net: dsa: mv88e6xxx: remove unused phy_mutexVivien Didelot
Only reg_lock is necessary now and phy_mutex is dead. Remove it. Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-20net/faraday: Disallow using reversed MAC address from hardwareGavin Shan
The initial MAC address is retrieved from hardware if it's not provided by device-tree. The reserved MAC address from hardware will be used if non-reserved MAC address is invalid. It will cause mismatched MAC address seen by hardware and software. This disallows using the reserved hardware MAC address to avoid the mismatched MAC address seen by hardware and software. Fixes: 113ce107afe9 ("net/faraday: Read MAC address from chip") Suggested-by: David Laight <David.Laight@ACULAB.COM> Suggested-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-20fm10k: bump version numberJacob Keller
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Krishneil Singh <Krishneil.k.singh@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2016-07-20fm10k: return proper error code when pci_enable_msix_range failsJacob Keller
The pci_enable_msix_range() function returns a positive value of the number of allocated vectors if it succeeds. On failure it returns a negative error code. Return this code properly so that the error message printed by the driver will show the actual error code instead of being masked by -ENOMEM. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Krishneil Singh <Krishneil.k.singh@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2016-07-20fm10k: force link to remain down for at least a second on resume eventsJacob Keller
When we resume from an AER recovery with many active VFs, the PF sees many spurious link up and link down events. Prevent this by delaying link down for at least one second after the resume event. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Krishneil Singh <Krishneil.k.singh@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2016-07-20fm10k: implement request_lport_map pointerJacob Keller
If the fm10k interface is brought up, but the switch manager software is not running, the driver will continuously request the lport map every few seconds in the base driver watchdog routine. Eventually after several minutes the switch mailbox Tx fifo will fill up and the mailbox will timeout, resulting in a reset. This reset will appear as if for no reason, and occurs regularly every few minutes until the switch manager software is loaded. Prevent this from happening by only requesting the lport map after we've verified the switch mailbox is tx_ready. In order to simplify code logic and reduce code duplication, implement this as a new function pointer "mac.ops.request_lport_map" which the VF will not implement. Otherwise, we have to duplicate the tx_ready check outside of fm10k_get_host_state_generic, or re-implement most of fm10k_get_host_state_generic in the pf version. The resulting code is simpler and easier to understand, and prevents the PF from continuously requesting lport map and filling the Tx fifo of a switch mailbox that isn't ready. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Krishneil Singh <Krishneil.k.singh@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2016-07-20fm10k: check if PCIe link is restoredJacob Keller
Sometimes, a VF driver will lose PCIe address access, such as due to a PF FLR event. In fm10k_detach_subtask, poll and check whether the PCIe register space is active again and restore the device when it has. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Krishneil Singh <Krishneil.k.singh@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2016-07-20fm10k: enable bus master after every resetJacob Keller
If an FLR occurs, VF devices will be knocked out of bus master mode, and the driver will be unable to recover from the reset properly, resulting in malicious driver events and an infinite reset loop. In the normal case, the bus master mode will already be enabled and this call will essentially be a no-op. Since we're doing this every reset, it is possible we could remove the other calls to pci_set_master() but it seems not harmful to just leave them in place. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Krishneil Singh <Krishneil.k.singh@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2016-07-20fm10k: use common flow for suspend and resumeJacob Keller
Continuing the effort to commonize the similar suspend/resume flows, finish up by using the new fm10k_handle_suspand and fm10k_handle_resume functions for the standard suspend/resume flow. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Krishneil Singh <Krishneil.k.singh@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2016-07-20fm10k: implement reset_notify handler for PCIe FLR eventsJacob Keller
When a function level PCI reset is triggered using sysfs, it calls the driver's .reset_notify error handler. Implement a handler based on the now split fm10k_prepare_for_reset and fm10k_handle_reset functions, so that we fully reset the driver when the PCI function level reset occurs. This also ensures the reset is handled in a clean way by first disabling all the driver bits first and then restoring them after the function reset. Previously the stack simply performed a blind function reset and our driver didn't take any part in the process. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Krishneil Singh <Krishneil.k.singh@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2016-07-20fm10k: use common reset flow when handling io errors from PCI stackJacob Keller
Now that we have extracted the necessary steps for a split suspend/resume flow, re-use these functions instead of using the current open coded flow. This ensures that we don't miss any steps. It also ensures that we have the correct driver states set. Since we'll be handling all of the reset flow ourselves, we no longer need to request a reset in the io_slot_reset() function. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Krishneil Singh <Krishneil.k.singh@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2016-07-20fm10k: implement prepare_suspend and handle_resumeJacob Keller
Implement fm10k_prepare_suspend and fm10k_handle_resume functions which abstract around the now existing fm10k_prepare_for_reset and fm10k_handle_reset. The new functions also handle stopping the service task, which is something that the original re-init flow does not need. Every other location that does a suspend/resume type flow is expected to use these functions, because otherwise they may have conflicts with the running watchdog routines. This also has the effect of preventing possible surprise remove events during handling of FLR events and PCIe errors. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Krishneil Singh <Krishneil.k.singh@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2016-07-20fm10k: split fm10k_reinit into two functionsJacob Keller
There are several flows in the driver which perform the similar function of tearing down software and restoring software to recover from certain errors or PCIe events, including: * fm10k_reinit * fm10k_suspend/resume * fm10k_io_error_detected/fm10k_io_resume In addition, we want to implement a .reset_notify() handler as well which will also perform similar function. Rework how the driver codes reset and resume flows by separating out the reinit logic into two functions "fm10k_prepare_for_reset" and "fm10k_handle_reset". This first step will allow us to re-use this functionality in the similar blocks of code instead of re-coding the same sequence of events slightly different. The end result should be more maintainable and correct, fixing several inconsistencies with the work flow. The new functions expect to take the rtnl_lock() themselves, and it does have the unfortunate side effect of having the reinit flow take then release then take the rtnl_lock. However, this minor downside is out weighted by the benefits of code reduction and reducing needless difference between these flows. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Krishneil Singh <Krishneil.k.singh@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2016-07-20fm10k: wait for queues to drain if stop_hw() fails onceJacob Keller
It turns out that sometimes during a reset the Tx queues will be temporarily stuck longer than .stop_hw() expects. Work around this issue by attempting to .stop_hw() first. If it tails, wait a number of attempts until the Tx queues appear to be drained. After this, attempt stop_hw() again. This ensures that we avoid waiting if we don't need to, such as during the first initialization of a VF, and give the proper amount of time necessary to recover from most situations. It is possible that the hardware is actually stuck. For PFs, this is usually fixed by a datapath reset. Unfortunately the VF cannot request a similar reset for itself. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Krishneil Singh <Krishneil.k.singh@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2016-07-20fm10k: only warn when stop_hw fails with FM10K_ERR_REQUESTS_PENDINGJacob Keller
When stop_hw() routine fails with FM10K_ERR_REQUESTS_PENDING, this indicates that the Tx or Rx queues did not shutdown within the time limit. Print a more suitable message at the dev_info level instead of dev_err. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Krishneil Singh <Krishneil.k.singh@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2016-07-20fm10k: use actual hardware registers when checking for pending TxJacob Keller
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Krishneil Singh <Krishneil.k.singh@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2016-07-20fm10k: perform data path reset even when switch is not readyJacob Keller
A while ago, an additional check for the switch being ready was added to reset_hw. A recent refactor accidentally made this check return an error code on failure which caused fm10k_probe to fail when the switch wasn't brought up first. The original reasoning for the check was to prevent additional data path reset when the fabric wasn't ready yet. However, there isn't a compelling reason to keep the check, as the data path reset will restore hardware to a known good state. Remove the check and perform the data path reset regardless of the switch manager state. An alternative fix is to return FM10K_SUCCESS instead, and bypass the actual data path reset. This should be fine as we will perform a reset_hw once the switch is active. However, since data path reset will reset many parts of the hardware it seems better to just perform the reset regardless of switch state. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Krishneil Singh <Krishneil.k.singh@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2016-07-20fm10k: don't stop reset due to FM10K_ERR_REQUESTS_PENDINGJacob Keller
Don't report FM10K_ERR_REQUESTS_PENDING when we fail to disable queues within the timeout. This can occur due to a hardware Tx hang, or when the switch ethernet fabric is resetting while we are transmitting traffic. It can sometimes take up to 500ms before the Tx DMA engine gives up. Instead, just skip the DMA engine check and perform a data-path reset anyways. Add a statistic counter to keep track of the number of resets occurring while we have pending DMA on the rings. In order to prevent having to re-assign err to 0, re-order the last few items of the reset_hw_pf function so that we don't perform "return err" at the end. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Krishneil Singh <Krishneil.k.singh@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2016-07-20fm10k: Reset mailbox global interruptsNgai-Mint Kwan
When a data path reset is initiated, write control to the PCIE_GMBX is yanked from the switch manager. The switch manager writes to this register to clear mailbox global interrupt bits as part of its mailbox interrupt handling routine. When the device recovers from the data path reset and these bits are not cleared, it will prevent future mailbox global interrupts from being triggered. Upon confirming that the device has exited from a data path reset, clear these bits to ensure the proper functioning of the mailbox global interrupt. Signed-off-by: Ngai-Mint Kwan <ngai-mint.kwan@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Krishneil Singh <Krishneil.k.singh@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2016-07-20fm10k: prevent multiple threads updating statisticsJacob Keller
Also prevent updating stats while the interface is down. If we're already updating stats, just return doing nothing. When we take the device down, block stat updates until we come back up. This ensures that we avoid tearing down rings when we're updating statistics, and prevents updating statistics until we're up. We can't re-use the __FM10K_DOWN for this because it wouldn't prevent multiple threads from accessing statistics. Neither does it prevent the case where we start updating stats and then start going down in another thread. The fm10k_get_stats64 is except from this, because it has a completely different flow which does not suffer from the same issues as fm10k_update_stats might. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Krishneil Singh <Krishneil.k.singh@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2016-07-20fm10k: avoid possible null pointer dereference in fm10k_update_statsJacob Keller
It's currently possible for fm10k_update_stats to be called during the window when we go down and the rings are removed. This can result in a null pointer dereference. In fm10k_get_stats64 we work around this by using ACCESS_ONCE and a null pointer check inside the loop. Use this same flow in the fm10k_update_stats to avoid the potential null pointer. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Krishneil Singh <Krishneil.k.singh@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2016-07-20fm10k: no need to continue in fm10k_down if __FM10K_DOWN already setJacob Keller
Return early from fm10k_down() when we are already down, since that means another thread is either already finished or has started going down, so shouldn't conflict with them. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Krishneil Singh <Krishneil.k.singh@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2016-07-20Merge branch 'mlxsw-per-prio-tc-counters'David S. Miller
Jiri Pirko says: ==================== mlxsw: Add per-{Prio,TC} counters Ido says: Add per-priority and per-tc counters, which are very useful for debugging purposes and fine-tuning. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-20mlxsw: spectrum: Expose per-tc counters via ethtoolIdo Schimmel
Expose the transmit queue length of each traffic class and the amount of unicast packets discarded due to insufficient room in the shared buffer. The first counter allows us to debug user priority to traffic class mapping, whereas the drop counter is useful when determining shared buffer configuration. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-20mlxsw: spectrum: Expose per-priority counters via ethtoolIdo Schimmel
Expose per-priority bytes / packets / PFC packets counters via ethtool. These counters are very useful when debugging QoS functionality and provide a better insight into the device's forwarding plane. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-20net: cpmac: fix error handling of cpmac_probe()Wei Yongjun
Add the missing free_netdev() before return from function cpmac_probe() in the error handling case. This patch revert commit 0465be8f4f1d ("net: cpmac: fix in releasing resources"), which changed to only free_netdev while register_netdev failed. Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-20net/mlx5: Use PTR_ERR_OR_ZERO() to simplify the codeWei Yongjun
Use PTR_ERR_OR_ZERO rather than if(IS_ERR(...)) + PTR_ERR. Generated by: scripts/coccinelle/api/ptr_ret.cocci Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Acked-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-20net: ethernet: nb8800: fix error handling of nb8800_probe()Wei Yongjun
In ops->reset() error handling case, clk_disable_unprepare() is missed before return from this function. Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Acked-by: Mans Rullgard <mans@mansr.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-20wan/fsl_ucc_hdlc: use module_platform_driver to simplify the codeWei Yongjun
module_platform_driver() makes the code simpler by eliminating boilerplate code. Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-20wan/fsl_ucc_hdlc: remove .owner field for driverWei Yongjun
Remove .owner field if calls are used which set it automatically. Generated by: scripts/coccinelle/api/platform_no_drv_owner.cocci Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-20net: axienet: Fix return value check in axienet_probe()Wei Yongjun
In case of error, the function of_parse_phandle() returns NULL pointer not ERR_PTR(). The IS_ERR() test in the return value check should be replaced with NULL test. Fixes: 46aa27df8853 ('net: axienet: Use devm_* calls') Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-20Merge branch 'for-upstream' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next Johan Hedberg says: ==================== pull request: bluetooth-next 2016-07-19 Here's likely the last bluetooth-next pull request for the 4.8 kernel: - Fix for L2CAP setsockopt - Fix for is_suspending flag handling in btmrvl driver - Addition of Bluetooth HW & FW info fields to debugfs - Fix to use int instead of char for callback status. The last one (from Geert Uytterhoeven) is actually not purely a Bluetooth (or 802.15.4) patch, but it was agreed with other maintainers that we take it through the bluetooth-next tree. Please let me know if there are any issues pulling. Thanks. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-20bpf, elf: add official ELF machine define for eBPFDaniel Borkmann
Add the official BPF ELF e_machine value that was assigned recently [1,2] and will be propagated to glibc, et al. LLVM is switching to it in 3.9 release. [1] https://github.com/llvm-mirror/llvm/commit/36b9c09330bfb5e771914cfe307588f30d5510d2 [2] http://lists.iovisor.org/pipermail/iovisor-dev/2016-June/000266.html Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-20bpf: fix implicit declaration of bpf_prog_addBrenden Blanco
For the ifndef case of CONFIG_BPF_SYSCALL, an inline version of bpf_prog_add needs to exist otherwise the build breaks on some configs. drivers/net/ethernet/mellanox/mlx4/en_netdev.c:2544:10: error: implicit declaration of function 'bpf_prog_add' prog = bpf_prog_add(prog, priv->rx_ring_num - 1); The function is introduced in 59d3656d5bf50 ("bpf: add bpf_prog_add api for bulk prog refcnt") and first used in 47f1afdba2b87 ("net/mlx4_en: add support for fast rx drop bpf program"). Fixes: 47f1afdba2b87 ("net/mlx4_en: add support for fast rx drop bpf program") Reported-by: kbuild test robot <fengguang.wu@intel.com> Reported-by: Tariq Toukan <ttoukan.linux@gmail.com> Signed-off-by: Brenden Blanco <bblanco@plumgrid.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-19Merge branch 'xdp'David S. Miller
Brenden Blanco says: ==================== Add driver bpf hook for early packet drop and forwarding This patch set introduces new infrastructure for programmatically processing packets in the earliest stages of rx, as part of an effort others are calling eXpress Data Path (XDP) [1]. Start this effort by introducing a new bpf program type for early packet filtering, before even an skb has been allocated. Extend on this with the ability to modify packet data and send back out on the same port. Patch 1 adds an API for bulk bpf prog refcnt incrememnt. Patch 2 introduces the new prog type and helpers for validating the bpf program. A new userspace struct is defined containing only data and data_end as fields, with others to follow in the future. In patch 3, create a new ndo to pass the fd to supported drivers. In patch 4, expose a new rtnl option to userspace. In patch 5, enable support in mlx4 driver. In patch 6, create a sample drop and count program. With single core, achieved ~20 Mpps drop rate on a 40G ConnectX3-Pro. This includes packet data access, bpf array lookup, and increment. In patch 7, add a page recycle facility to mlx4 rx, enabled when xdp is active. In patch 8, add the XDP_TX type to bpf.h In patch 9, add helper in tx patch for writing tx_desc In patch 10, add support in mlx4 for packet data write and forwarding In patch 11, turn on packet write support in the bpf verifier In patch 12, add a sample program for packet write and forwarding. With single core, achieved ~10 Mpps rewrite and forwarding. [1] https://github.com/iovisor/bpf-docs/blob/master/Express_Data_Path.pdf v10: 1/12: Add bulk refcnt api. 5/12: Move prog from priv to ring. This attribute is still only set globally, but the path to finer granularity should be clear. No lock is taken, so some rings may operate on older programs for a time (one napi loop). Looked into options such as napi_synchronize, but they were deemed too slow (calls to msleep). Rename prog to xdp_prog. Add xdp_ring_num to help with accounting, used more heavily in later patches. 7/12: Adjust to use per-ring xdp prog. Use priv->xdp_ring_num where before priv->prog was used to determine buffer allocations. 9/12: Add cpu_to_be16 to vlan_tag in mxl4_en_xmit(). Remove unused variable from mlx4_en_xmit and unused params from build_inline_wqe. v9: 4/11: Add missing newline in en_err message. 6/11: Move page_cache cleanup from mlx4_en_destroy_rx_ring to mlx4_en_deactivate_rx_ring. Move mlx4_en_moderation_update back to static. Remove calls to mlx4_en_alloc/free_resources in mlx4_xdp_set. Adopt instead the approach of mlx4_en_change_mtu to use a watchdog. 9/11: Use a per-ring function pointer in tx to separate out the code for regular and recycle paths of tx completion handling. Add a helper function to init the recycle ring and callback, called just after activating tx. Remove extra tx ring resource requirement, and instead steal from the upper rings. This helps to avoid needing mlx4_en_alloc_resources. Add some hopefully meaningful error messages for the various error cases. Reverted some of the hard-to-follow logic that was accounting for the extra tx rings. v8: 1/11: Reduce WARN_ONCE to single line. Also, change act param of that function to u32 to match return type of bpf_prog_run_xdp. 2/11: Clarify locking semantics in ndo comment. 4/11: Add en_err warning in mlx4_xdp_set on num_frags/mtu violation. v7: Addressing two of the major discussion points: return codes and ndo. The rest will be taken as todo items for separate patches. Add an XDP_ABORTED type, which explicitly falls through to DROP. The same result must be taken for the default case as well, as it is now well-defined API behavior. Merge ndo_xdp_* into a single ndo. The style is similar to ndo_setup_tc, but with less unidirectional naming convention. The IFLA parameter names are unchanged. TODOs: Add ethtool per-ring stats for aborted, default cases, maybe even drop and tx as well. Avoid duplicate dma sync operation in XDP_PASS case as mentioned by Saeed. 1/12: Add XDP_ABORTED enum, reword API comment, and update commit message. 2/12: Rewrite ndo_xdp_*() into single ndo_xdp() with type/union style calling convention. 3/12: Switch to ndo_xdp callback. 4/12: Add XDP_ABORTED case as a fall-through to XDP_DROP. Implement ndo_xdp. 12/12: Dropped, this will need some more work. v6: 2/12: drop unnecessary netif_device_present check 4/12, 6/12, 9/12: Reorder default case statement above drop case to remove some copy/paste. v5: 0/12: Rebase and remove previous 1/13 patch 1/12: Fix nits from Daniel. Left the (void *) cast as-is, to be fixed in future. Add bpf_warn_invalid_xdp_action() helper, to be used when out of bounds action is returned by the program. Add a comment to bpf.h denoting the undefined nature of out of bounds returns. 2/12: Switch to using bpf_prog_get_type(). Rename ndo_xdp_get() to ndo_xdp_attached(). 3/12: Add IFLA_XDP as a nested type, and add the associated nla_policy for the new subtypes IFLA_XDP_FD and IFLA_XDP_ATTACHED. 4/12: Fixup the use of READ_ONCE in the ndos. Add a user of bpf_warn_invalid_xdp_action helper. 5/12: Adjust to using the nested netlink options. 6/12: kbuild was complaining about overflow of u16 on tile architecture...bump frag_stride to u32. The page_offset member that is computed from this was already u32. v4: 2/12: Add inline helper for calling xdp bpf prog under rcu 3/12: Add detail to ndo comments 5/12: Remove mlx4_call_xdp and use inline helper instead. 6/12: Fix checkpatch complaints 9/12: Introduce new patch 9/12 with common helper for tx_desc write Refactor to use common tx_desc write helper 11/12: Fix checkpatch complaints v3: Rewrite from v2 trying to incorporate feedback from multiple sources. Specifically, add ability to forward packets out the same port and allow packet modification. For packet forwarding, the driver reserves a dedicated set of tx rings for exclusive use by xdp. Upon completion, the pages on this ring are recycled directly back to a small per-rx-ring page cache without being dma unmapped. Use of the percpu skb is dropped in favor of a lightweight struct xdp_buff. The direct packet access feature is leveraged to remove dependence on the skb. The mlx4 driver implementation allocates a page-per-packet and maps it in PCI_DMA_BIDIRECTIONAL mode when the bpf program is activated. Naming is converted to use "xdp" instead of "phys_dev". v2: 1/5: Drop xdp from types, instead consistently use bpf_phys_dev_ Introduce enum for return values from phys_dev hook 2/5: Move prog->type check to just before invoking ndo Change ndo to take a bpf_prog * instead of fd Add ndo_bpf_get rather than keeping a bool in the netdev struct 3/5: Use ndo_bpf_get to fetch bool 4/5: Enforce that only 1 frag is ever given to bpf prog by disallowing mtu to increase beyond FRAG_SZ0 when bpf prog is running, or conversely to set a bpf prog when priv->num_frags > 1 Rename pseudo_skb to bpf_phys_dev_md Implement ndo_bpf_get Add dma sync just before invoking prog Check for explicit bpf return code rather than nonzero Remove increment of rx_dropped 5/5: Use explicit bpf return code in example Update commit log with higher pps numbers ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-19bpf: add sample for xdp forwarding and rewriteBrenden Blanco
Add a sample that rewrites and forwards packets out on the same interface. Observed single core forwarding performance of ~10Mpps. Since the mlx4 driver under test recycles every single packet page, the perf output shows almost exclusively just the ring management and bpf program work. Slowdowns are likely occurring due to cache misses. Signed-off-by: Brenden Blanco <bblanco@plumgrid.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-19bpf: enable direct packet data write for xdp progsBrenden Blanco
For forwarding to be effective, XDP programs should be allowed to rewrite packet data. This requires that the drivers supporting XDP must all map the packet memory as TODEVICE or BIDIRECTIONAL before invoking the program. Signed-off-by: Brenden Blanco <bblanco@plumgrid.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-19net/mlx4_en: add xdp forwarding and data write supportBrenden Blanco
A user will now be able to loop packets back out of the same port using a bpf program attached to xdp hook. Updates to the packet contents from the bpf program is also supported. For the packet write feature to work, the rx buffers are now mapped as bidirectional when the page is allocated. This occurs only when the xdp hook is active. When the program returns a TX action, enqueue the packet directly to a dedicated tx ring, so as to avoid completely any locking. This requires the tx ring to be allocated 1:1 for each rx ring, as well as the tx completion running in the same softirq. Upon tx completion, this dedicated tx ring recycles pages without unmapping directly back to the original rx ring. In steady state tx/drop workload, effectively 0 page allocs/frees will occur. In order to separate out the paths between free and recycle, a free_tx_desc func pointer is introduced that is optionally updated whenever recycle_ring is activated. By default the original free function is always initialized. Signed-off-by: Brenden Blanco <bblanco@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-19net/mlx4_en: break out tx_desc write into separate functionBrenden Blanco
In preparation for writing the tx descriptor from multiple functions, create a helper for both normal and blueflame access. Signed-off-by: Brenden Blanco <bblanco@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-19bpf: add XDP_TX xdp_action for direct forwardingBrenden Blanco
XDP enabled drivers must transmit received packets back out on the same port they were received on when a program returns this action. Signed-off-by: Brenden Blanco <bblanco@plumgrid.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-19net/mlx4_en: add page recycle to prepare rx ring for tx supportBrenden Blanco
The mlx4 driver by default allocates order-3 pages for the ring to consume in multiple fragments. When the device has an xdp program, this behavior will prevent tx actions since the page must be re-mapped in TODEVICE mode, which cannot be done if the page is still shared. Start by making the allocator configurable based on whether xdp is running, such that order-0 pages are always used and never shared. Since this will stress the page allocator, add a simple page cache to each rx ring. Pages in the cache are left dma-mapped, and in drop-only stress tests the page allocator is eliminated from the perf report. Note that setting an xdp program will now require the rings to be reconfigured. Before: 26.91% ksoftirqd/0 [mlx4_en] [k] mlx4_en_process_rx_cq 17.88% ksoftirqd/0 [mlx4_en] [k] mlx4_en_alloc_frags 6.00% ksoftirqd/0 [mlx4_en] [k] mlx4_en_free_frag 4.49% ksoftirqd/0 [kernel.vmlinux] [k] get_page_from_freelist 3.21% swapper [kernel.vmlinux] [k] intel_idle 2.73% ksoftirqd/0 [kernel.vmlinux] [k] bpf_map_lookup_elem 2.57% swapper [mlx4_en] [k] mlx4_en_process_rx_cq After: 31.72% swapper [kernel.vmlinux] [k] intel_idle 8.79% swapper [mlx4_en] [k] mlx4_en_process_rx_cq 7.54% swapper [kernel.vmlinux] [k] poll_idle 6.36% swapper [mlx4_core] [k] mlx4_eq_int 4.21% swapper [kernel.vmlinux] [k] tasklet_action 4.03% swapper [kernel.vmlinux] [k] cpuidle_enter_state 3.43% swapper [mlx4_en] [k] mlx4_en_prepare_rx_desc 2.18% swapper [kernel.vmlinux] [k] native_irq_return_iret 1.37% swapper [kernel.vmlinux] [k] menu_select 1.09% swapper [kernel.vmlinux] [k] bpf_map_lookup_elem Signed-off-by: Brenden Blanco <bblanco@plumgrid.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-19Add sample for adding simple drop program to linkBrenden Blanco
Add a sample program that only drops packets at the BPF_PROG_TYPE_XDP_RX hook of a link. With the drop-only program, observed single core rate is ~20Mpps. Other tests were run, for instance without the dropcnt increment or without reading from the packet header, the packet rate was mostly unchanged. $ perf record -a samples/bpf/xdp1 $(</sys/class/net/eth0/ifindex) proto 17: 20403027 drops/s ./pktgen_sample03_burst_single_flow.sh -i $DEV -d $IP -m $MAC -t 4 Running... ctrl^C to stop Device: eth4@0 Result: OK: 11791017(c11788327+d2689) usec, 59622913 (60byte,0frags) 5056638pps 2427Mb/sec (2427186240bps) errors: 0 Device: eth4@1 Result: OK: 11791012(c11787906+d3106) usec, 60526944 (60byte,0frags) 5133311pps 2463Mb/sec (2463989280bps) errors: 0 Device: eth4@2 Result: OK: 11791019(c11788249+d2769) usec, 59868091 (60byte,0frags) 5077431pps 2437Mb/sec (2437166880bps) errors: 0 Device: eth4@3 Result: OK: 11795039(c11792403+d2636) usec, 59483181 (60byte,0frags) 5043067pps 2420Mb/sec (2420672160bps) errors: 0 perf report --no-children: 26.05% ksoftirqd/0 [mlx4_en] [k] mlx4_en_process_rx_cq 17.84% ksoftirqd/0 [mlx4_en] [k] mlx4_en_alloc_frags 5.52% ksoftirqd/0 [mlx4_en] [k] mlx4_en_free_frag 4.90% swapper [kernel.vmlinux] [k] poll_idle 4.14% ksoftirqd/0 [kernel.vmlinux] [k] get_page_from_freelist 2.78% ksoftirqd/0 [kernel.vmlinux] [k] __free_pages_ok 2.57% ksoftirqd/0 [kernel.vmlinux] [k] bpf_map_lookup_elem 2.51% swapper [mlx4_en] [k] mlx4_en_process_rx_cq 1.94% ksoftirqd/0 [kernel.vmlinux] [k] percpu_array_map_lookup_elem 1.45% swapper [mlx4_en] [k] mlx4_en_alloc_frags 1.35% ksoftirqd/0 [kernel.vmlinux] [k] free_one_page 1.33% swapper [kernel.vmlinux] [k] intel_idle 1.04% ksoftirqd/0 [mlx4_en] [k] 0x000000000001c5c5 0.96% ksoftirqd/0 [mlx4_en] [k] 0x000000000001c58d 0.93% ksoftirqd/0 [mlx4_en] [k] 0x000000000001c6ee 0.92% ksoftirqd/0 [mlx4_en] [k] 0x000000000001c6b9 0.89% ksoftirqd/0 [kernel.vmlinux] [k] __alloc_pages_nodemask 0.83% ksoftirqd/0 [mlx4_en] [k] 0x000000000001c686 0.83% ksoftirqd/0 [mlx4_en] [k] 0x000000000001c5d5 0.78% ksoftirqd/0 [mlx4_en] [k] mlx4_alloc_pages.isra.23 0.77% ksoftirqd/0 [mlx4_en] [k] 0x000000000001c5b4 0.77% ksoftirqd/0 [kernel.vmlinux] [k] net_rx_action machine specs: receiver - Intel E5-1630 v3 @ 3.70GHz sender - Intel E5645 @ 2.40GHz Mellanox ConnectX-3 @40G Signed-off-by: Brenden Blanco <bblanco@plumgrid.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-19net/mlx4_en: add support for fast rx drop bpf programBrenden Blanco
Add support for the BPF_PROG_TYPE_XDP hook in mlx4 driver. In tc/socket bpf programs, helpers linearize skb fragments as needed when the program touches the packet data. However, in the pursuit of speed, XDP programs will not be allowed to use these slower functions, especially if it involves allocating an skb. Therefore, disallow MTU settings that would produce a multi-fragment packet that XDP programs would fail to access. Future enhancements could be done to increase the allowable MTU. The xdp program is present as a per-ring data structure, but as of yet it is not possible to set at that granularity through any ndo. Signed-off-by: Brenden Blanco <bblanco@plumgrid.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>