summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2018-07-12net: ipv4: fix listify ip_rcv_finish in case of forwardingJesper Dangaard Brouer
In commit 5fa12739a53d ("net: ipv4: listify ip_rcv_finish") calling dst_input(skb) was split-out. The ip_sublist_rcv_finish() just calls dst_input(skb) in a loop. The problem is that ip_sublist_rcv_finish() forgot to remove the SKB from the list before invoking dst_input(). Further more we need to clear skb->next as other parts of the network stack use another kind of SKB lists for xmit_more (see dev_hard_start_xmit). A crash occurs if e.g. dst_input() invoke ip_forward(), which calls dst_output()/ip_output() that eventually calls __dev_queue_xmit() + sch_direct_xmit(), and a crash occurs in validate_xmit_skb_list(). This patch only fixes the crash, but there is a huge potential for a performance boost if we can pass an SKB-list through to ip_forward. Fixes: 5fa12739a53d ("net: ipv4: listify ip_rcv_finish") Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: Edward Cree <ecree@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-12nfp: avoid using getnstimeofday64()Arnd Bergmann
getnstimeofday64 is deprecated in favor of the ktime_get() family of functions. The direct replacement would be ktime_get_real_ts64(), but I'm picking the basic ktime_get() instead: - using a ktime_t simplifies the code compared to timespec64 - using monotonic time instead of real time avoids issues caused by a concurrent settimeofday() or during a leap second adjustment. Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-12liquidio: use ktime_get_real_ts64() instead of getnstimeofday64()Arnd Bergmann
The two do the same thing, but we want to have a consistent naming in the kernel. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Felix Manlunas <felix.manlunas@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-12Merge branch 'net-sched-act_skbedit-lockless-data-path'David S. Miller
Davide Caratti says: ==================== net/sched: act_skbedit: lockless data path the data path of act_skbedit can be faster if we avoid using spinlocks: - patch 1 converts act_skbedit statistics to use per-cpu counters - patch 2 lets act_skbedit use RCU to read/update its configuration test procedure (using pktgen from https://github.com/netoptimizer): # ip link add name eth1 type dummy # ip link set dev eth1 up # tc qdisc add dev eth1 clsact # tc filter add dev eth1 egress matchall action skbedit priority c1a0:c1a0 # for c in 1 2 4 ; do > ./pktgen_bench_xmit_mode_queue_xmit.sh -v -s 64 -t $c -n 5000000 -i eth1 > done test results (avg. pps/thread) $c | before patch | after patch | improvement ----+--------------+--------------+------------ 1 | 3917464 ± 3% | 4000458 ± 3% | irrelevant 2 | 3455367 ± 4% | 3953076 ± 1% | +14% 4 | 2496594 ± 2% | 3801123 ± 3% | +52% v2: rebased on latest net-next ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-12net/sched: act_skbedit: don't use spinlock in the data pathDavide Caratti
use RCU instead of spin_{,un}lock_bh, to protect concurrent read/write on act_skbedit configuration. This reduces the effects of contention in the data path, in case multiple readers are present. Signed-off-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-12net/sched: skbedit: use per-cpu countersDavide Caratti
use per-CPU counters, instead of sharing a single set of stats with all cores: this removes the need of spinlocks when stats are read/updated. Signed-off-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-12tcp: use monotonic timestamps for PAWSArnd Bergmann
Using get_seconds() for timestamps is deprecated since it can lead to overflows on 32-bit systems. While the interface generally doesn't overflow until year 2106, the specific implementation of the TCP PAWS algorithm breaks in 2038 when the intermediate signed 32-bit timestamps overflow. A related problem is that the local timestamps in CLOCK_REALTIME form lead to unexpected behavior when settimeofday is called to set the system clock backwards or forwards by more than 24 days. While the first problem could be solved by using an overflow-safe method of comparing the timestamps, a nicer solution is to use a monotonic clocksource with ktime_get_seconds() that simply doesn't overflow (at least not until 136 years after boot) and that doesn't change during settimeofday(). To make 32-bit and 64-bit architectures behave the same way here, and also save a few bytes in the tcp_options_received structure, I'm changing the type to a 32-bit integer, which is now safe on all architectures. Finally, the ts_recent_stamp field also (confusingly) gets used to store a jiffies value in tcp_synq_overflow()/tcp_synq_no_recent_overflow(). This is currently safe, but changing the type to 32-bit requires some small changes there to keep it working. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-12net/tls: Use aead_request_alloc/free for request alloc/freeVakul Garg
Instead of kzalloc/free for aead_request allocation and free, use functions aead_request_alloc(), aead_request_free(). It ensures that any sensitive crypto material held in crypto transforms is securely erased from memory. Signed-off-by: Vakul Garg <vakul.garg@nxp.com> Acked-by: Dave Watson <davejwatson@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-12tc-testing: add geneve options in tunnel_key unit testsPieter Jansen van Vuuren
Extend tc tunnel_key action unit tests with geneve options. Tests include testing single and multiple geneve options, as well as testing geneve options that are expected to fail. Signed-off-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com> Acked-by: Lucas Bates <lucasb@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-12Merge branch 'be2net-small-structures-clean-up'David S. Miller
Ivan Vecera says: ==================== be2net: small structures clean-up The series: - removes unused / unneccessary fields in several be2net structures - re-order fields in some structures to eliminate holes, cache-lines crosses - as result reduces size of main struct be_adapter by 4kB ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-12be2net: move rss_flags field in rss_info to ensure proper alignmentIvan Vecera
The current position of .rss_flags field in struct rss_info causes that fields .rsstable and .rssqueue (both 128 bytes long) crosses cache-line boundaries. Moving it at the end properly align all fields. Before patch: struct rss_info { u64 rss_flags; /* 0 8 */ u8 rsstable[128]; /* 8 128 */ /* --- cacheline 2 boundary (128 bytes) was 8 bytes ago --- */ u8 rss_queue[128]; /* 136 128 */ /* --- cacheline 4 boundary (256 bytes) was 8 bytes ago --- */ u8 rss_hkey[40]; /* 264 40 */ }; After patch: struct rss_info { u8 rsstable[128]; /* 0 128 */ /* --- cacheline 2 boundary (128 bytes) --- */ u8 rss_queue[128]; /* 128 128 */ /* --- cacheline 4 boundary (256 bytes) --- */ u8 rss_hkey[40]; /* 256 40 */ u64 rss_flags; /* 296 8 */ }; Signed-off-by: Ivan Vecera <cera@cera.cz> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-12be2net: re-order fields in be_error_recovert to avoid holeIvan Vecera
- Unionize two u8 fields where only one of them is used depending on NIC chipset. - Move recovery_supported field after that union These changes eliminate 7-bytes hole in the struct and makes it smaller by 8 bytes. Signed-off-by: Ivan Vecera <cera@cera.cz> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-12be2net: remove unused tx_jiffies field from be_tx_statsIvan Vecera
Signed-off-by: Ivan Vecera <cera@cera.cz> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-12be2net: move txcp field in be_tx_obj to eliminate holes in the structIvan Vecera
Before patch: struct be_tx_obj { u32 db_offset; /* 0 4 */ /* XXX 4 bytes hole, try to pack */ struct be_queue_info q; /* 8 56 */ /* --- cacheline 1 boundary (64 bytes) --- */ struct be_queue_info cq; /* 64 56 */ struct be_tx_compl_info txcp; /* 120 4 */ /* XXX 4 bytes hole, try to pack */ /* --- cacheline 2 boundary (128 bytes) --- */ struct sk_buff * sent_skb_list[2048]; /* 128 16384 */ ... }: After patch: struct be_tx_obj { u32 db_offset; /* 0 4 */ struct be_tx_compl_info txcp; /* 4 4 */ struct be_queue_info q; /* 8 56 */ /* --- cacheline 1 boundary (64 bytes) --- */ struct be_queue_info cq; /* 64 56 */ struct sk_buff * sent_skb_list[2048]; /* 120 16384 */ ... }; Signed-off-by: Ivan Vecera <cera@cera.cz> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-12be2net: reorder fields in be_eq_obj structureIvan Vecera
Re-order fields in struct be_eq_obj to ensure that .napi field begins at start of cache-line. Also the .adapter field is moved to the first cache-line next to .q field and 3 fields (idx,msi_idx,spurious_intr) and the 4-bytes hole to 3rd cache-line. Signed-off-by: Ivan Vecera <cera@cera.cz> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-12be2net: remove desc field from be_eq_objIvan Vecera
The event queue description (be_eq_obj.desc) field is used only to format string for IRQ name and it is not really needed to hold this value. Remove it and use local variable to format string for IRQ name. Signed-off-by: Ivan Vecera <cera@cera.cz> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-12be2net: remove unused old custom busy-poll fieldsIvan Vecera
The commit fb6113e688e0 ("be2net: get rid of custom busy poll code") replaced custom busy-poll code by the generic one but left several macros and fields in struct be_eq_obj that are currently unused. Remove this stuff. Fixes: fb6113e688e0 ("be2net: get rid of custom busy poll code") Signed-off-by: Ivan Vecera <cera@cera.cz> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-12be2net: remove unused old AIC infoIvan Vecera
The commit 2632bafd74ae ("be2net: fix adaptive interrupt coalescing") introduced a separate struct be_aic_obj to hold AIC information but unfortunately left the old stuff in be_eq_obj. So remove it. Fixes: 2632bafd74ae ("be2net: fix adaptive interrupt coalescing") Signed-off-by: Ivan Vecera <cera@cera.cz> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-12net: ethernet: ti: cpts: break cycle once late ts is matchedIvan Khoronzhuk
The late ts queue can contain a bunch of skbs while hi rate testing, no need to check all of them if timestamp is already matched. Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-11selftests: forwarding: mirror_gre_nh: Unset rp_filter on host VRFPetr Machata
The mirrored packets arrive at $h3 encapsulated in GRE/IPv4, with IP address from 192.0.2.128/28 network. However the interface is configured as a member of 192.0.2.160/28 and there's no route directing traffic from the former network through that interface. Correspondingly, the RP filter on the VRF rejects it. Therefore turn off the VRF's RP filter. Signed-off-by: Petr Machata <petrm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-11Merge branch 'mlxsw-ERSPAN-Take-LACP-state-into-consideration'David S. Miller
Ido Schimmel says: ==================== mlxsw: ERSPAN: Take LACP state into consideration Petr says: When offloading mirror-to-gretap, mlxsw needs to preroute the path that the encapsulated packet will take. That path may include a LAG device above a front panel port. So far, mlxsw resolved the path to the first up front panel slave of the LAG interface, but that only reflects administrative state of the port. It neglects to consider whether the port actually has a carrier, and what the LACP state is. This patch set aims to address these problems. Patch #1 publishes team_port_get_rcu(). Then in patch #2, a new function is introduced, mlxsw_sp_port_dev_check(). That returns, for a given netdevice that is a slave of a LAG device, whether that device is "txable", i.e. whether the LAG master would send traffic through it. Since there's no good place to put LAG-wide helpers, introduce a new header include/net/lag.h. Finally in patch #3, fix the slave selection logic to take into consideration whether a given slave has a carrier and whether it is txable. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-11mlxsw: spectrum_span: Change LAG lower selectionPetr Machata
When offloading mirror-to-gretap, mlxsw needs to preroute the path that the encapsulated packet will take. That path may include a LAG device above a front panel port. So far, mlxsw resolved the path to the first up front panel slave of the LAG interface, but that only reflects administrative state of the port. It neglects to consider whether the port actually has a carrier, and what the LACP state is. So instead of checking upness of the device, check carrier state and txability. Signed-off-by: Petr Machata <petrm@mellanox.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-11net: Add lag.h, net_lag_port_dev_txable()Petr Machata
LAG devices (team or bond) recognize for each one of their slave devices whether LAG traffic is going to be sent through that device. Bond calls such devices "active", team calls them "txable". When this state changes, a NETDEV_CHANGELOWERSTATE notification is distributed, together with a netdev_notifier_changelowerstate_info structure that for LAG devices includes a tx_enabled flag that refers to the new state. The notification thus makes it possible to react to the changes in txability in drivers. However there's no way to query txability from the outside on demand. That is problematic namely for mlxsw, which when resolving ERSPAN packet path, may encounter a LAG device, and needs to determine which of the slaves it should choose. To that end, introduce a new function, net_lag_port_dev_txable(), which determines whether a given slave device is "active" or "txable" (depending on the flavor of the LAG device). That function then dispatches to per-LAG-flavor helpers, bond_is_active_slave_dev() resp. team_port_dev_txable(). Because there currently is no good place where net_lag_port_dev_txable() should be added, introduce a new header file, lag.h, which should from now on hold any logic common to both team and bond. (But keep netif_is_lag_master() together with the rest of netif_is_*_master() functions). Signed-off-by: Petr Machata <petrm@mellanox.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-11team: Publish team_port_get_rcu()Petr Machata
A follow-up patch adds a new entry point, team_port_dev_txable(). Making it an ordinary exported function would mean that any module that may need the service in one of the supported configurations also unconditionally needs to pull in the team module, whether or not the user actually intends to create team interfaces. To prevent that, team_port_dev_txable() is defined in if_team.h, and therefore all dependencies of that function also need to be publicly-visible. Therefore move team_port_get_rcu() from team.c to if_team.h. Signed-off-by: Petr Machata <petrm@mellanox.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-11macvlan: Change status when lower device goes downTravis Brown
Today macvlan ignores the notification when a lower device goes administratively down, preventing the lack of connectivity from bubbling up. Processing NETDEV_DOWN results in a macvlan state of LOWERLAYERDOWN with NO-CARRIER which should be easy to interpret in userspace. 2: lower: <BROADCAST,MULTICAST> mtu 1500 qdisc mq state DOWN mode DEFAULT group default qlen 1000 3: macvlan@lower: <NO-CARRIER,BROADCAST,MULTICAST,UP,M-DOWN> mtu 1500 qdisc noqueue state LOWERLAYERDOWN mode DEFAULT group default qlen 1000 Signed-off-by: Suresh Krishnan <skrishnan@arista.com> Signed-off-by: Travis Brown <travisb@arista.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-11Merge branch 'tipc-make-link-protocol-more-resilient'David S. Miller
Jon Maloy says: ==================== tipc: make link protocol more resilient These two commits make the link ptotocol more resilient to infrastructures with frequent packet duplication and long delays. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-11tipc: check session number before accepting link protocol messagesJon Maloy
In some virtual environments we observe a significant higher number of packet reordering and delays than we have been used to traditionally. This makes it necessary with stricter checks on incoming link protocol messages' session number, which until now only has been validated for RESET messages. Since the other two message types, ACTIVATE and STATE messages also carry this number, it is easy to extend the validation check to those messages. We also introduce a flag indicating if a link has a valid peer session number or not. This eliminates the mixing of 32- and 16-bit arithmethics we are currently using to achieve this. Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-11tipc: add sequence number check for link STATE messagesJon Maloy
Some switch infrastructures produce huge amounts of packet duplicates. This becomes a problem if those messages are STATE/NACK protocol messages, causing unnecessary retransmissions of already accepted packets. We now introduce a unique sequence number per STATE protocol message so that duplicates can be identified and ignored. This will also be useful when tracing such cases, and to avert replay attacks when TIPC is encrypted. For compatibility reasons we have to introduce a new capability flag TIPC_LINK_PROTO_SEQNO to handle this new feature. Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-11Merge branch '10GbE' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue Jeff Kirsher says: ==================== L2 Fwd Offload & 10GbE Intel Driver Updates 2018-07-09 This patch series is meant to allow support for the L2 forward offload, aka MACVLAN offload without the need for using ndo_select_queue. The existing solution currently requires that we use ndo_select_queue in the transmit path if we want to associate specific Tx queues with a given MACVLAN interface. In order to get away from this we need to repurpose the tc_to_txq array and XPS pointer for the MACVLAN interface and use those as a means of accessing the queues on the lower device. As a result we cannot offload a device that is configured as multiqueue, however it doesn't really make sense to configure a macvlan interfaced as being multiqueue anyway since it doesn't really have a qdisc of its own in the first place. The big changes in this set are: Allow lower device to update tc_to_txq and XPS map of offloaded MACVLAN Disable XPS for single queue devices Replace accel_priv with sb_dev in ndo_select_queue Add sb_dev parameter to fallback function for ndo_select_queue Consolidated ndo_select_queue functions that appeared to be duplicates ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-11tcp: expose both send and receive intervals for rate sampleDeepti Raghavan
Congestion control algorithms, which access the rate sample through the tcp_cong_control function, only have access to the maximum of the send and receive interval, for cases where the acknowledgment rate may be inaccurate due to ACK compression or decimation. Algorithms may want to use send rates and receive rates as separate signals. Signed-off-by: Deepti Raghavan <deeptir@mit.edu> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-11net: sched: fix unprotected access to rcu cookie pointerVlad Buslov
Fix action attribute size calculation function to take rcu read lock and access act_cookie pointer with rcu dereference. Fixes: eec94fdb0480 ("net: sched: use rcu for action cookie update") Reported-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-11Merge branch 'cxgb4-move-stats-fetched-from-firmware-to-debugfs'David S. Miller
Rahul Lakkireddy says: ==================== cxgb4: move stats fetched from firmware to debugfs Some stats are fetched via slow firmware mailbox, which can cause packet drops under heavy load. So, this series removes these stats from ethtool -S and expose them via debugfs. Patch 1 removes stats fetched via firmware from ethtool -S. Patch 2 exposes stats removed in Patch 1 via debugfs. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-11cxgb4: expose stats fetched from firmware via debugfsRahul Lakkireddy
Expose stats obtained from firmware via debugfs. These stats can't be part of ethtool -S because the slow firmware mailbox can cause packet drops under heavy load. Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com> Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-11cxgb4: remove stats fetched from firmwareRahul Lakkireddy
When running ethtool -S, some stats are requested from firmware. Since getting these stats via firmware mailbox is slow, some packets get dropped under heavy load while running ethtool -S. So, remove these stats from ethtool -S. Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com> Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-11net: mvpp2: explicitly include linux/interrupt.hAntoine Tenart
The Marvell PPv2 driver uses interrupts and tasklet but does not explicitly include linux/interrupt.h, relying on implicit includes. This one particularly is included by chance after a long unlogical chain of inclusions. Fix this so we do not get future build breaks. Signed-off-by: Antoine Tenart <antoine.tenart@bootlin.com> Signed-off-by: Antoine Tenart <antoine.tenart@bootlin.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-11cnic: use kvzalloc to allocate memory for csk_tblJan Dakinevich
Size of csk_tbl is about 58K, which means 3rd order page allocation. kvzalloc provides a fallback if no high order memory is available. Signed-off-by: Jan Dakinevich <jan.dakinevich@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-11wimax/i2400m: remove redundant variables ack_status, bcf and protocolColin Ian King
Variables ack_status, bcf and protocol are being assigned but are never used hence they are redundant and can be removed. Also declare ack_type as unsigned int rather than unsigned to clean up a checkpatch warning. Cleans up clang warnings: warning: variable 'ack_status' set but not used [-Wunused-but-set-variable] warning: variable 'bcf' set but not used [-Wunused-but-set-variable] warning: variable 'protocol' set but not used [-Wunused-but-set-variable] Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-11net: sched: act_ife: fix memory leak in ife initVlad Buslov
Free params if tcf_idr_check_alloc() returned error. Fixes: 0190c1d452a9 ("net: sched: atomically check-allocate action") Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-11cxgb4: specify IQTYPE in fw_iq_cmdArjun Vynipadath
congestion argument passed to t4_sge_alloc_rxq() is used to differentiate between nic/ofld queues. Signed-off-by: Arjun Vynipadath <arjun@chelsio.com> Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-11Merge branch 'net-ipv6-addr_gen_mode-fixes'David S. Miller
Sabrina Dubroca says: ==================== net/ipv6: addr_gen_mode fixes This series fixes bugs in handling of the addr_gen_mode option, mainly related to the sysctl. A minor netlink issue was also present in the initial commit introducing the option on a per-netdevice basis. v2: add patch 4, requested by David Ahern during review of v1 add patch 5, missing documentation for the sysctl patches 1, 2, 3 are unchanged ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-11Documentation: ip-sysctl.txt: document addr_gen_modeSabrina Dubroca
addr_gen_mode was introduced in without documentation, add it now. Fixes: d35a00b8e33d ("net/ipv6: allow sysctl to change link-local address generation mode") Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-11net/ipv6: propagate net.ipv6.conf.all.addr_gen_mode to devicesSabrina Dubroca
This aligns the addr_gen_mode sysctl with the expected behavior of the "all" variant. Fixes: d35a00b8e33d ("net/ipv6: allow sysctl to change link-local address generation mode") Suggested-by: David Ahern <dsahern@gmail.com> Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-11net/ipv6: reserve room for IFLA_INET6_ADDR_GEN_MODESabrina Dubroca
inet6_ifla6_size() is called to check how much space is needed by inet6_fill_link_af() and inet6_fill_ifinfo(), both of which include the IFLA_INET6_ADDR_GEN_MODE attribute. Reserve some room for it. Fixes: bc91b0f07ada ("ipv6: addrconf: implement address generation modes") Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-11net/ipv6: don't reinitialize ndev->cnf.addr_gen_mode on new inet6_devSabrina Dubroca
The value has already been copied from this netns's devconf_dflt, it shouldn't be reset to the global kernel default. Fixes: d35a00b8e33d ("net/ipv6: allow sysctl to change link-local address generation mode") Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-11net/ipv6: fix addrconf_sysctl_addr_gen_modeSabrina Dubroca
addrconf_sysctl_addr_gen_mode() has multiple problems. First, it ignores the errors returned by proc_dointvec(). addrconf_sysctl_addr_gen_mode() calls proc_dointvec() directly, which writes the value to memory, and then checks if it's valid and may return EINVAL. If a bad value is given, the value displayed when reading net.ipv6.conf.foo.addr_gen_mode next time will be invalid. In case the value provided by the user was valid, addrconf_dev_config() won't be called since idev->cnf.addr_gen_mode has already been updated. Fix this in the usual way we deal with values that need to be checked after the proc_do*() helper has returned: define a local ctl_table and storage, call proc_dointvec() on that temporary area, then check and store. addrconf_sysctl_addr_gen_mode() also writes the new value to the global ipv6_devconf_dflt, when we're writing to some netns's default, so that new netns will inherit the value that was set by the change occuring in any netns. That doesn't make any sense, so let's drop this assignment. Finally, since addr_gen_mode is a __u32, switch to proc_douintvec(). Fixes: d35a00b8e33d ("net/ipv6: allow sysctl to change link-local address generation mode") Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-11net/sched: flower: Fix null pointer dereference when run tc vlan commandJianbo Liu
Zahari issued tc vlan command without setting vlan_ethtype, which will crash kernel. To avoid this, we must check tb[TCA_FLOWER_KEY_VLAN_ETH_TYPE] is not null before use it. Also we don't need to dump vlan_ethtype or cvlan_ethtype in this case. Fixes: d64efd0926ba ('net/sched: flower: Add supprt for matching on QinQ vlan headers') Signed-off-by: Jianbo Liu <jianbol@mellanox.com> Reported-by: Zahari Doychev <zahari.doychev@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-10selftests: forwarding: mirror_lib: Tighten up VLAN capturePetr Machata
The function do_test_span_vlan_dir_ips() is used for testing whether mirrored packets are VLAN-encapsulated. But since it only considers VLAN encapsulation, it may end up matching unmirrored ARP traffic as well. One consequence is a rare failure of mirror_gre_vlan_bridge_1q's test_gretap_untagged_egress. Decreasing ping cadence in mirror_test() makes the problem easily reproducible. Therefore tighten up the match criterion to only count those 802.1q packets where the next header is IP. Signed-off-by: Petr Machata <petrm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-10Merge branch 'cake-qdisc'David S. Miller
Toke Høiland-Jørgensen says: ==================== sched: Add Common Applications Kept Enhanced (cake) qdisc This patch series adds the CAKE qdisc, and has been split up to ease review. I have attempted to split out each configurable feature into its own patch. The first commit adds the base shaper and packet scheduler, while subsequent commits add the optional features. The full userspace API and most data structures are included in this commit, but options not understood in the base version will be ignored. The result of applying the entire series is identical to the out of tree version that have seen extensive testing in previous deployments, most notably as an out of tree patch to OpenWrt. However, note that I have only compile tested the individual patches; so the whole series should be considered as a unit. --- Changelog v19: - Rebase to current net-next. - Don't rely on the value of sch->q.qlen to break loops; fixes possible infinite loop on multi-queue devices. - Don't overwrite NAT flag when setting flow mode. v18: - Rework classification logic in the diffserv case to always hash if filter doesn't select a queue, and to run TC filters before selecting the diffserv tin (allowing filter to influence this). - Make sure we always call qdisc_watchdog_init() in cake_init(), so we don't crash in cake_destroy(). v17: - Rebase to newest net-next and move the conntrack callback to nf_ct_hook - Fix a compile error when NF_CONNTRACK is unset. v16: - Move conntrack lookup function into conntrack core and read it via RCU so it is only active when the nf_conntrack module is loaded. This avoids the module dependency on conntrack for NAT mode. Thanks to Pablo for the idea. v15: - Handle ECN flags in ACK filter v14: - Handle seqno wraps and DSACKs in ACK filter v13: - Avoid ktime_t to scalar compares - Add class dumping and basic stats - Fail with ENOTSUPP when requesting NAT mode and conntrack is not available. - Parse all TCP options in ACK filter and make sure to only drop safe ones. Also handle SACK ranges properly. v12: - Get rid of custom time typedefs. Use ktime_t for time and u64 for duration instead. v11: - Fix overhead compensation calculation for GSO packets - Change configured rate to be u64 (I ran out of bits before I ran out of CPU when testing the effects of the above) v10: - Christmas tree gardening (fix variable declarations to be in reverse line length order) v9: - Remove duplicated checks around kvfree() and just call it unconditionally. - Don't pass __GFP_NOWARN when allocating memory - Move options in cake_dump() that are related to optional features to later patches implementing the features. - Support attaching filters to the qdisc and use the classification result to select flow queue. - Support overriding diffserv priority tin from skb->priority v8: - Remove inline keyword from function definitions - Simplify ACK filter; remove the complex state handling to make the logic easier to follow. This will potentially be a bit less efficient, but I have not been able to measure a difference. v7: - Split up patch into a series to ease review. - Constify the ACK filter. v6: - Fix 6in4 encapsulation checks in ACK filter code - Checkpatch fixes v5: - Refactor ACK filter code and hopefully fix the safety issues properly this time. v4: - Only split GSO packets if shaping at speeds <= 1Gbps - Fix overhead calculation code to also work for GSO packets - Don't re-implement kvzalloc() - Remove local header include from out-of-tree build (fixes kbuild-bot complaint). - Several fixes to the ACK filter: - Check pskb_may_pull() before deref of transport headers. - Don't run ACK filter logic on split GSO packets - Fix TCP sequence number compare to deal with wraparounds v3: - Use IS_REACHABLE() macro to fix compilation when sch_cake is built-in and conntrack is a module. - Switch the stats output to use nested netlink attributes instead of a versioned struct. - Remove GPL boilerplate. - Fix array initialisation style. v2: - Fix kbuild test bot complaint - Clean up the netlink ABI - Fix checkpatch complaints - A few tweaks to the behaviour of cake based on testing carried out while writing the paper. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-10sch_cake: Conditionally split GSO segmentsToke Høiland-Jørgensen
At lower bandwidths, the transmission time of a single GSO segment can add an unacceptable amount of latency due to HOL blocking. Furthermore, with a software shaper, any tuning mechanism employed by the kernel to control the maximum size of GSO segments is thrown off by the artificial limit on bandwidth. For this reason, we split GSO segments into their individual packets iff the shaper is active and configured to a bandwidth <= 1 Gbps. Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-10sch_cake: Add overhead compensation support to the rate shaperToke Høiland-Jørgensen
This commit adds configurable overhead compensation support to the rate shaper. With this feature, userspace can configure the actual bottleneck link overhead and encapsulation mode used, which will be used by the shaper to calculate the precise duration of each packet on the wire. This feature is needed because CAKE is often deployed one or two hops upstream of the actual bottleneck (which can be, e.g., inside a DSL or cable modem). In this case, the link layer characteristics and overhead reported by the kernel does not match the actual bottleneck. Being able to set the actual values in use makes it possible to configure the shaper rate much closer to the actual bottleneck rate (our experience shows it is possible to get with 0.1% of the actual physical bottleneck rate), thus keeping latency low without sacrificing bandwidth. The overhead compensation has three tunables: A fixed per-packet overhead size (which, if set, will be accounted from the IP packet header), a minimum packet size (MPU) and a framing mode supporting either ATM or PTM framing. We include a set of common keywords in TC to help users configure the right parameters. If no overhead value is set, the value reported by the kernel is used. Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk> Signed-off-by: David S. Miller <davem@davemloft.net>