summaryrefslogtreecommitdiff
path: root/net/core
AgeCommit message (Collapse)Author
2014-11-10net: gro: add a per device gro flush timerEric Dumazet
Tuning coalescing parameters on NIC can be really hard. Servers can handle both bulk and RPC like traffic, with conflicting goals : bulk flows want as big GRO packets as possible, RPC want minimal latencies. To reach big GRO packets on 10Gbe NIC, one can use : ethtool -C eth0 rx-usecs 4 rx-frames 44 But this penalizes rpc sessions, with an increase of latencies, up to 50% in some cases, as NICs generally do not force an interrupt when a packet with TCP Push flag is received. Some NICs do not have an absolute timer, only a timer rearmed for every incoming packet. This patch uses a different strategy : Let GRO stack decides what do do, based on traffic pattern. Packets with Push flag wont be delayed. Packets without Push flag might be held in GRO engine, if we keep receiving data. This new mechanism is off by default, and shall be enabled by setting /sys/class/net/ethX/gro_flush_timeout to a value in nanosecond. To fully enable this mechanism, drivers should use napi_complete_done() instead of napi_complete(). Tested: Ran 200 netperf TCP_STREAM from A to B (10Gbe mlx4 link, 8 RX queues) Without this feature, we send back about 305,000 ACK per second. GRO aggregation ratio is low (811/305 = 2.65 segments per GRO packet) Setting a timer of 2000 nsec is enough to increase GRO packet sizes and reduce number of ACK packets. (811/19.2 = 42) Receiver performs less calls to upper stacks, less wakes up. This also reduces cpu usage on the sender, as it receives less ACK packets. Note that reducing number of wakes up increases cpu efficiency, but can decrease QPS, as applications wont have the chance to warmup cpu caches doing a partial read of RPC requests/answers if they fit in one skb. B:~# sar -n DEV 1 10 | grep eth0 | tail -1 Average: eth0 811269.80 305732.30 1199462.57 19705.72 0.00 0.00 0.50 B:~# echo 2000 >/sys/class/net/eth0/gro_flush_timeout B:~# sar -n DEV 1 10 | grep eth0 | tail -1 Average: eth0 811577.30 19230.80 1199916.51 1239.80 0.00 0.00 0.50 Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-07net: Kill skb_copy_datagram_const_iovecHerbert Xu
Now that both macvtap and tun are using skb_copy_datagram_iter, we can kill the abomination that is skb_copy_datagram_const_iovec. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-07inet: Add skb_copy_datagram_iterHerbert Xu
This patch adds skb_copy_datagram_iter, which is identical to skb_copy_datagram_iovec except that it operates on iov_iter instead of iovec. Eventually all users of skb_copy_datagram_iovec should switch over to iov_iter and then we can remove skb_copy_datagram_iovec. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-05openvswitch: Add basic MPLS support to kernelSimon Horman
Allow datapath to recognize and extract MPLS labels into flow keys and execute actions which push, pop, and set labels on packets. Based heavily on work by Leo Alterman, Ravi K, Isaku Yamahata and Joe Stringer. Cc: Ravi K <rkerur@gmail.com> Cc: Leo Alterman <lalterman@nicira.com> Cc: Isaku Yamahata <yamahata@valinux.co.jp> Cc: Joe Stringer <joe@wand.net.nz> Signed-off-by: Simon Horman <horms@verge.net.au> Signed-off-by: Jesse Gross <jesse@nicira.com> Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
2014-11-05net: Remove MPLS GSO feature.Pravin B Shelar
Device can export MPLS GSO support in dev->mpls_features same way it export vlan features in dev->vlan_features. So it is safe to remove NETIF_F_GSO_MPLS redundant flag. Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
2014-11-05net: Add and use skb_copy_datagram_msg() helper.David S. Miller
This encapsulates all of the skb_copy_datagram_iovec() callers with call argument signature "skb, offset, msghdr->msg_iov, length". When we move to iov_iters in the networking, the iov_iter object will sit in the msghdr. Having a helper like this means there will be less places to touch during that transformation. Based upon descriptions and patch from Al Viro. Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-05udp: Changes to udp_offload to support remote checksum offloadTom Herbert
Add a new GSO type, SKB_GSO_TUNNEL_REMCSUM, which indicates remote checksum offload being done (in this case inner checksum must not be offloaded to the NIC). Added logic in __skb_udp_tunnel_segment to handle remote checksum offload case. Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-03net: less interrupt masking in NAPIEric Dumazet
net_rx_action() can mask irqs a single time to transfert sd->poll_list into a private list, for a very short duration. Then, napi_complete() can avoid masking irqs again, and net_rx_action() only needs to mask irq again in slow path. This patch removes 2 couples of irq mask/unmask per typical NAPI run, more if multiple napi were triggered. Note this also allows to give control back to caller (do_softirq()) more often, so that other softirq handlers can be called a bit earlier, or ksoftirqd can be wakeup earlier under pressure. This was developed while testing an alternative to RX interrupt mitigation to reduce latencies while keeping or improving GRO aggregation on fast NIC. Idea is to test napi->gro_list at the end of a napi->poll() and reschedule one NAPI poll, but after servicing a full round of softirqs (timers, TX, rcu, ...). This will be allowed only if softirq is currently serviced by idle task or ksoftirqd, and resched not needed. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-01Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Conflicts: drivers/net/phy/marvell.c Simple overlapping changes in drivers/net/phy/marvell.c Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-31net: ethtool: Return -EOPNOTSUPP if user space tries to read EEPROM with lengh 0Guenter Roeck
If a driver supports reading EEPROM but no EEPROM is installed in the system, the driver's get_eeprom_len function returns 0. ethtool will subsequently try to read that zero-length EEPROM anyway. If the driver does not support EEPROM access at all, this operation will return -EOPNOTSUPP. If the driver does support EEPROM access but no EEPROM is installed, the operation will return -EINVAL. Return -EOPNOTSUPP in both cases for consistency. Signed-off-by: Guenter Roeck <linux@roeck-us.net> Tested-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-29neigh: optimize neigh_parms_release()Nicolas Dichtel
In neigh_parms_release() we loop over all entries to find the entry given in argument and being able to remove it from the list. By using a double linked list, we can avoid this loop. Here are some numbers with 30 000 dummy interfaces configured: Before the patch: $ time rmmod dummy real 2m0.118s user 0m0.000s sys 1m50.048s After the patch: $ time rmmod dummy real 1m9.970s user 0m0.000s sys 0m47.976s Suggested-by: Thierry Herbelot <thierry.herbelot@6wind.com> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-29net: introduce napi_schedule_irqoff()Eric Dumazet
napi_schedule() can be called from any context and has to mask hard irqs. Add a variant that can only be called from hard interrupts handlers or when irqs are already masked. Many NIC drivers can use it from their hard IRQ handler instead of generic variant. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-29net: skb_segment() should preserve backpressureToshiaki Makita
This patch generalizes commit d6a4a1041176 ("tcp: GSO should be TSQ friendly") to protocols using skb_set_owner_w() TCP uses its own destructor (tcp_wfree) and needs a more complex scheme as explained in commit 6ff50cd55545 ("tcp: gso: do not generate out of order packets") This allows UDP sockets using UFO to get proper backpressure, thus avoiding qdisc drops and excessive cpu usage. Here are performance test results (macvlan on vlan): - Before # netperf -t UDP_STREAM ... Socket Message Elapsed Messages Size Size Time Okay Errors Throughput bytes bytes secs # # 10^6bits/sec 212992 65507 60.00 144096 1224195 1258.56 212992 60.00 51 0.45 Average: CPU %user %nice %system %iowait %steal %idle Average: all 0.23 0.00 25.26 0.08 0.00 74.43 - After # netperf -t UDP_STREAM ... Socket Message Elapsed Messages Size Size Time Okay Errors Throughput bytes bytes secs # # 10^6bits/sec 212992 65507 60.00 109593 0 957.20 212992 60.00 109593 957.20 Average: CPU %user %nice %system %iowait %steal %idle Average: all 0.18 0.00 8.38 0.02 0.00 91.43 [edumazet] Rewrote patch and changelog. Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-26net: napi_reuse_skb() should check pfmemallocEric Dumazet
Do not reuse skb if it was pfmemalloc tainted, otherwise future frame might be dropped anyway. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Roman Gushchin <klamm@yandex-team.ru> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-22net: tso: fix unaligned access to crafted TCP header in helper APIKarl Beldan
The crafted header start address is from a driver supplied buffer, which one can reasonably expect to be aligned on a 4-bytes boundary. However ATM the TSO helper API is only used by ethernet drivers and the tcp header will then be aligned to a 2-bytes only boundary from the header start address. Signed-off-by: Karl Beldan <karl.beldan@rivierawaves.com> Cc: Ezequiel Garcia <ezequiel.garcia@free-electrons.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-20net: core: handle encapsulation offloads when computing segment lengthsFlorian Westphal
if ->encapsulation is set we have to use inner_tcp_hdrlen and add the size of the inner network headers too. This is 'mostly harmless'; tbf might send skb that is slightly over quota or drop skb even if it would have fit. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-18Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds
Pull networking fixes from David Miller: 1) Include fixes for netrom and dsa (Fabian Frederick and Florian Fainelli) 2) Fix FIXED_PHY support in stmmac, from Giuseppe CAVALLARO. 3) Several SKB use after free fixes (vxlan, openvswitch, vxlan, ip_tunnel, fou), from Li ROngQing. 4) fec driver PTP support fixes from Luwei Zhou and Nimrod Andy. 5) Use after free in virtio_net, from Michael S Tsirkin. 6) Fix flow mask handling for megaflows in openvswitch, from Pravin B Shelar. 7) ISDN gigaset and capi bug fixes from Tilman Schmidt. 8) Fix route leak in ip_send_unicast_reply(), from Vasily Averin. 9) Fix two eBPF JIT bugs on x86, from Alexei Starovoitov. 10) TCP_SKB_CB() reorganization caused a few regressions, fixed by Cong Wang and Eric Dumazet. 11) Don't overwrite end of SKB when parsing malformed sctp ASCONF chunks, from Daniel Borkmann. 12) Don't call sock_kfree_s() with NULL pointers, this function also has the side effect of adjusting the socket memory usage. From Cong Wang. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (90 commits) bna: fix skb->truesize underestimation net: dsa: add includes for ethtool and phy_fixed definitions openvswitch: Set flow-key members. netrom: use linux/uaccess.h dsa: Fix conversion from host device to mii bus tipc: fix bug in bundled buffer reception ipv6: introduce tcp_v6_iif() sfc: add support for skb->xmit_more r8152: return -EBUSY for runtime suspend ipv4: fix a potential use after free in fou.c ipv4: fix a potential use after free in ip_tunnel_core.c hyperv: Add handling of IP header with option field in netvsc_set_hash() openvswitch: Create right mask with disabled megaflows vxlan: fix a free after use openvswitch: fix a use after free ipv4: dst_entry leak in ip_send_unicast_reply() ipv4: clean up cookie_v4_check() ipv4: share tcp_v4_save_options() with cookie_v4_check() ipv4: call __ip_options_echo() in cookie_v4_check() atm: simplify lanai.c by using module_pci_driver ...
2014-10-15net: Add ndo_gso_checkTom Herbert
Add ndo_gso_check which a device can define to indicate whether is is capable of doing GSO on a packet. This funciton would be called from the stack to determine whether software GSO is needed to be done. A driver should populate this function if it advertises GSO types for which there are combinations that it wouldn't be able to handle. For instance a device that performs UDP tunneling might only implement support for transparent Ethernet bridging type of inner packets or might have limitations on lengths of inner headers. Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-15Merge branch 'for-3.18-consistent-ops' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu Pull percpu consistent-ops changes from Tejun Heo: "Way back, before the current percpu allocator was implemented, static and dynamic percpu memory areas were allocated and handled separately and had their own accessors. The distinction has been gone for many years now; however, the now duplicate two sets of accessors remained with the pointer based ones - this_cpu_*() - evolving various other operations over time. During the process, we also accumulated other inconsistent operations. This pull request contains Christoph's patches to clean up the duplicate accessor situation. __get_cpu_var() uses are replaced with with this_cpu_ptr() and __this_cpu_ptr() with raw_cpu_ptr(). Unfortunately, the former sometimes is tricky thanks to C being a bit messy with the distinction between lvalues and pointers, which led to a rather ugly solution for cpumask_var_t involving the introduction of this_cpu_cpumask_var_ptr(). This converts most of the uses but not all. Christoph will follow up with the remaining conversions in this merge window and hopefully remove the obsolete accessors" * 'for-3.18-consistent-ops' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (38 commits) irqchip: Properly fetch the per cpu offset percpu: Resolve ambiguities in __get_cpu_var/cpumask_var_t -fix ia64: sn_nodepda cannot be assigned to after this_cpu conversion. Use __this_cpu_write. percpu: Resolve ambiguities in __get_cpu_var/cpumask_var_t Revert "powerpc: Replace __get_cpu_var uses" percpu: Remove __this_cpu_ptr clocksource: Replace __this_cpu_ptr with raw_cpu_ptr sparc: Replace __get_cpu_var uses avr32: Replace __get_cpu_var with __this_cpu_write blackfin: Replace __get_cpu_var uses tile: Use this_cpu_ptr() for hardware counters tile: Replace __get_cpu_var uses powerpc: Replace __get_cpu_var uses alpha: Replace __get_cpu_var ia64: Replace __get_cpu_var uses s390: cio driver &__get_cpu_var replacements s390: Replace __get_cpu_var uses mips: Replace __get_cpu_var uses MIPS: Replace __get_cpu_var uses in FPU emulator. arm: Replace __this_cpu_ptr with raw_cpu_ptr ...
2014-10-14net: Trap attempts to call sock_kfree_s() with a NULL pointer.David S. Miller
Unlike normal kfree() it is never right to call sock_kfree_s() with a NULL pointer, because sock_kfree_s() also has the side effect of discharging the memory from the sockets quota. Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-10net: fix races in page->_count manipulationEric Dumazet
This is illegal to use atomic_set(&page->_count, ...) even if we 'own' the page. Other entities in the kernel need to use get_page_unless_zero() to get a reference to the page before testing page properties, so we could loose a refcount increment. The only case it is valid is when page->_count is 0 Fixes: 540eb7bf0bbed ("net: Update alloc frag to reduce get/put page usage and recycle pages") Signed-off-by: Eric Dumaze <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-10flow-dissector: Fix alignment issue in __skb_flow_get_portsAlexander Duyck
This patch addresses a kernel unaligned access bug seen on a sparc64 system with an igb adapter. Specifically the __skb_flow_get_ports was returning a be32 pointer which was then having the value directly returned. In order to prevent this it is actually easier to simply not populate the ports or address values when an skb is not present. In this case the assumption is that the data isn't needed and rather than slow down the faster aligned accesses by making them have to assume the unaligned path on architectures that don't support efficent unaligned access it makes more sense to simply switch off the bits that were copying the source and destination address/port for the case where we only care about the protocol types and lengths which are normally 16 bit fields anyway. Reported-by: David S. Miller <davem@davemloft.net> Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-10net: filter: fix the commentsLi RongQing
1. sk_run_filter has been renamed, sk_filter() is using SK_RUN_FILTER. 2. Remove wrong comments about storing intermediate value. 3. replace sk_run_filter with __bpf_prog_run for check_load_and_stores's comments Cc: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: Li RongQing <roy.qing.li@gmail.com> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-09net: Missing @ before descriptions cause make xmldocs warningMasanari Iida
This patch fix following warning. Warning(.//net/core/skbuff.c:4142): No description found for parameter 'header_len' Warning(.//net/core/skbuff.c:4142): No description found for parameter 'data_len' Warning(.//net/core/skbuff.c:4142): No description found for parameter 'max_page_order' Warning(.//net/core/skbuff.c:4142): No description found for parameter 'errcode' Warning(.//net/core/skbuff.c:4142): No description found for parameter 'gfp_mask' Acutually the descriptions exist, but missing "@" in front. This problem start to happen when following commit was merged into Linus's tree during 3.18-rc1 merge period. commit 2e4e44107176d552f8bb1bb76053e850e3809841 net: add alloc_skb_with_frags() helper Signed-off-by: Masanari Iida <standby24x7@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-08Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-nextLinus Torvalds
Pull networking updates from David Miller: "Most notable changes in here: 1) By far the biggest accomplishment, thanks to a large range of contributors, is the addition of multi-send for transmit. This is the result of discussions back in Chicago, and the hard work of several individuals. Now, when the ->ndo_start_xmit() method of a driver sees skb->xmit_more as true, it can choose to defer the doorbell telling the driver to start processing the new TX queue entires. skb->xmit_more means that the generic networking is guaranteed to call the driver immediately with another SKB to send. There is logic added to the qdisc layer to dequeue multiple packets at a time, and the handling mis-predicted offloads in software is now done with no locks held. Finally, pktgen is extended to have a "burst" parameter that can be used to test a multi-send implementation. Several drivers have xmit_more support: i40e, igb, ixgbe, mlx4, virtio_net Adding support is almost trivial, so export more drivers to support this optimization soon. I want to thank, in no particular or implied order, Jesper Dangaard Brouer, Eric Dumazet, Alexander Duyck, Tom Herbert, Jamal Hadi Salim, John Fastabend, Florian Westphal, Daniel Borkmann, David Tat, Hannes Frederic Sowa, and Rusty Russell. 2) PTP and timestamping support in bnx2x, from Michal Kalderon. 3) Allow adjusting the rx_copybreak threshold for a driver via ethtool, and add rx_copybreak support to enic driver. From Govindarajulu Varadarajan. 4) Significant enhancements to the generic PHY layer and the bcm7xxx driver in particular (EEE support, auto power down, etc.) from Florian Fainelli. 5) Allow raw buffers to be used for flow dissection, allowing drivers to determine the optimal "linear pull" size for devices that DMA into pools of pages. The objective is to get exactly the necessary amount of headers into the linear SKB area pre-pulled, but no more. The new interface drivers use is eth_get_headlen(). From WANG Cong, with driver conversions (several had their own by-hand duplicated implementations) by Alexander Duyck and Eric Dumazet. 6) Support checksumming more smoothly and efficiently for encapsulations, and add "foo over UDP" facility. From Tom Herbert. 7) Add Broadcom SF2 switch driver to DSA layer, from Florian Fainelli. 8) eBPF now can load programs via a system call and has an extensive testsuite. Alexei Starovoitov and Daniel Borkmann. 9) Major overhaul of the packet scheduler to use RCU in several major areas such as the classifiers and rate estimators. From John Fastabend. 10) Add driver for Intel FM10000 Ethernet Switch, from Alexander Duyck. 11) Rearrange TCP_SKB_CB() to reduce cache line misses, from Eric Dumazet. 12) Add Datacenter TCP congestion control algorithm support, From Florian Westphal. 13) Reorganize sk_buff so that __copy_skb_header() is significantly faster. From Eric Dumazet" * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1558 commits) netlabel: directly return netlbl_unlabel_genl_init() net: add netdev_txq_bql_{enqueue, complete}_prefetchw() helpers net: description of dma_cookie cause make xmldocs warning cxgb4: clean up a type issue cxgb4: potential shift wrapping bug i40e: skb->xmit_more support net: fs_enet: Add NAPI TX net: fs_enet: Remove non NAPI RX r8169:add support for RTL8168EP net_sched: copy exts->type in tcf_exts_change() wimax: convert printk to pr_foo() af_unix: remove 0 assignment on static ipv6: Do not warn for informational ICMP messages, regardless of type. Update Intel Ethernet Driver maintainers list bridge: Save frag_max_size between PRE_ROUTING and POST_ROUTING tipc: fix bug in multicast congestion handling net: better IFF_XMIT_DST_RELEASE support net/mlx4_en: remove NETDEV_TX_BUSY 3c59x: fix bad split of cpu_to_le32(pci_map_single()) net: bcmgenet: fix Tx ring priority programming ...
2014-10-07Merge tag 'dmaengine-3.17' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/djbw/dmaengine Pull dmaengine updates from Dan Williams: "Even though this has fixes marked for -stable, given the size and the needed conflict resolutions this is 3.18-rc1/merge-window material. These patches have been languishing in my tree for a long while. The fact that I do not have the time to do proper/prompt maintenance of this tree is a primary factor in the decision to step down as dmaengine maintainer. That and the fact that the bulk of drivers/dma/ activity is going through Vinod these days. The net_dma removal has not been in -next. It has developed simple conflicts against mainline and net-next (for-3.18). Continuing thanks to Vinod for staying on top of drivers/dma/. Summary: 1/ Step down as dmaengine maintainer see commit 08223d80df38 "dmaengine maintainer update" 2/ Removal of net_dma, as it has been marked 'broken' since 3.13 (commit 77873803363c "net_dma: mark broken"), without reports of performance regression. 3/ Miscellaneous fixes" * tag 'dmaengine-3.17' of git://git.kernel.org/pub/scm/linux/kernel/git/djbw/dmaengine: net: make tcp_cleanup_rbuf private net_dma: revert 'copied_early' net_dma: simple removal dmaengine maintainer update dmatest: prevent memory leakage on error path in thread ioat: Use time_before_jiffies() dmaengine: fix xor sources continuation dma: mv_xor: Rename __mv_xor_slot_cleanup() to mv_xor_slot_cleanup() dma: mv_xor: Remove all callers of mv_xor_slot_cleanup() dma: mv_xor: Remove unneeded mv_xor_clean_completed_slots() call ioat: Use pci_enable_msix_exact() instead of pci_enable_msix() drivers: dma: Include appropriate header file in dca.c drivers: dma: Mark functions as static in dma_v3.c dma: mv_xor: Add DMA API error checks ioat/dca: Use dev_is_pci() to check whether it is pci device
2014-10-07net: better IFF_XMIT_DST_RELEASE supportEric Dumazet
Testing xmit_more support with netperf and connected UDP sockets, I found strange dst refcount false sharing. Current handling of IFF_XMIT_DST_RELEASE is not optimal. Dropping dst in validate_xmit_skb() is certainly too late in case packet was queued by cpu X but dequeued by cpu Y The logical point to take care of drop/force is in __dev_queue_xmit() before even taking qdisc lock. As Julian Anastasov pointed out, need for skb_dst() might come from some packet schedulers or classifiers. This patch adds new helper to cleanly express needs of various drivers or qdiscs/classifiers. Drivers that need skb_dst() in their ndo_start_xmit() should call following helper in their setup instead of the prior : dev->priv_flags &= ~IFF_XMIT_DST_RELEASE; -> netif_keep_dst(dev); Instead of using a single bit, we use two bits, one being eventually rebuilt in bonding/team drivers. The other one, is permanent and blocks IFF_XMIT_DST_RELEASE being rebuilt in bonding/team. Eventually, we could add something smarter later. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-07net_sched: fix unused variables in __gnet_stats_copy_basic_cpu()WANG Cong
Probably not a big deal, but we'd better just use the one we get in retry loop. Fixes: commit 22e0f8b9322cb1a48b1357e8 ("net: sched: make bstats per cpu and estimator RCU safe") Reported-by: Joe Perches <joe@perches.com> Cc: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-06net: validate_xmit_vlan() is staticEric Dumazet
Marking this as static allows compiler to inline it. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-06net: introduce netdevice gso_min_segs attributeEric Dumazet
Some TSO engines might have a too heavy setup cost, that impacts performance on hosts sending small bursts (2 MSS per packet). This patch adds a device gso_min_segs, allowing drivers to set a minimum segment size for TSO packets, according to the NIC performance. Tested on a mlx4 NIC, this allows to get a ~110% increase of throughput when sending 2 MSS per packet. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-06ethtool: Ethtool parameter to dynamically change tx_copybreakEric Dumazet
Use new ethtool [sg]et_tunable() to set tx_copybread (inline threshold) Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Amir Vadai <amirv@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-06net: skb_segment() provides list head and tailEric Dumazet
Its unfortunate we have to walk again skb list to find the tail after segmentation, even if data is probably hot in cpu caches. skb_segment() can store the tail of the list into segs->prev, and validate_xmit_skb_list() can immediately get the tail. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-05Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter/IPVS updates for net-next The following patchset contains another batch with Netfilter/IPVS updates for net-next, they are: 1) Add abstracted ICMP codes to the nf_tables reject expression. We introduce four reasons to reject using ICMP that overlap in IPv4 and IPv6 from the semantic point of view. This should simplify the maintainance of dual stack rule-sets through the inet table. 2) Move nf_send_reset() functions from header files to per-family nf_reject modules, suggested by Patrick McHardy. 3) We have to use IS_ENABLED(CONFIG_BRIDGE_NETFILTER) everywhere in the code now that br_netfilter can be modularized. Convert remaining spots in the network stack code. 4) Use rcu_barrier() in the nf_tables module removal path to ensure that we don't leave object that are still pending to be released via call_rcu (that may likely result in a crash). 5) Remove incomplete arch 32/64 compat from nft_compat. The original (bad) idea was to probe the word size based on the xtables match/target info size, but this assumption is wrong when you have to dump the information back to userspace. 6) Allow to filter from prerouting and postrouting in the nf_tables bridge. In order to emulate the ebtables NAT chains (which are actually simple filter chains with no special semantics), we have support filtering from this hooks too. 7) Add explicit module dependency between xt_physdev and br_netfilter. This provides a way to detect if the user needs br_netfilter from the configuration path. This should reduce the breakage of the br_netfilter modularization. 8) Cleanup coding style in ip_vs.h, from Simon Horman. 9) Fix crash in the recently added nf_tables masq expression. We have to register/unregister the notifiers to clean up the conntrack table entries from the module init/exit path, not from the rule addition / deletion path. From Arturo Borrero. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-04net: Cleanup skb cloning by adding SKB_FCLONE_FREEVijay Subramanian
SKB_FCLONE_UNAVAILABLE has overloaded meaning depending on type of skb. 1: If skb is allocated from head_cache, it indicates fclone is not available. 2: If skb is a companion fclone skb (allocated from fclone_cache), it indicates it is available to be used. To avoid confusion for case 2 above, this patch replaces SKB_FCLONE_UNAVAILABLE with SKB_FCLONE_FREE where appropriate. For fclone companion skbs, this indicates it is free for use. SKB_FCLONE_UNAVAILABLE will now simply indicate skb is from head_cache and cannot / will not have a companion fclone. Signed-off-by: Vijay Subramanian <subramanian.vijay@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-03net: do not export skb_gro_receive()Eric Dumazet
skb_gro_receive() is only called from tcp_gro_receive() which is not in a module. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-03qdisc: validate skb without holding lockEric Dumazet
Validation of skb can be pretty expensive : GSO segmentation and/or checksum computations. We can do this without holding qdisc lock, so that other cpus can queue additional packets. Trick is that requeued packets were already validated, so we carry a boolean so that sch_direct_xmit() can validate a fresh skb list, or directly use an old one. Tested on 40Gb NIC (8 TX queues) and 200 concurrent flows, 48 threads host. Turning TSO on or off had no effect on throughput, only few more cpu cycles. Lock contention on qdisc lock disappeared. Same if disabling TX checksum offload. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-02Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Conflicts: drivers/net/usb/r8152.c net/netfilter/nfnetlink.c Both r8152 and nfnetlink conflicts were simple overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-02netfilter: use IS_ENABLED(CONFIG_BRIDGE_NETFILTER)Pablo Neira Ayuso
In 34666d4 ("netfilter: bridge: move br_netfilter out of the core"), the bridge netfilter code has been modularized. Use IS_ENABLED instead of ifdef to cover the module case. Fixes: 34666d4 ("netfilter: bridge: move br_netfilter out of the core") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-10-01net: pktgen: packet bursting via skb->xmit_moreAlexei Starovoitov
This patch demonstrates the effect of delaying update of HW tailptr. (based on earlier patch by Jesper) burst=1 is the default. It sends one packet with xmit_more=false burst=2 sends one packet with xmit_more=true and 2nd copy of the same packet with xmit_more=false burst=3 sends two copies of the same packet with xmit_more=true and 3rd copy with xmit_more=false Performance with ixgbe (usec 30): burst=1 tx:9.2 Mpps burst=2 tx:13.5 Mpps burst=3 tx:14.5 Mpps full 10G line rate Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-01net: avoid one atomic operation in skb_clone()Eric Dumazet
Fast clone cloning can actually avoid an atomic_inc(), if we guarantee prior clone_ref value is 1. This requires a change kfree_skbmem(), to perform the atomic_dec_and_test() on clone_ref before setting fclone to SKB_FCLONE_UNAVAILABLE. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-01net: cleanup and document skb fclone layoutEric Dumazet
Lets use a proper structure to clearly document and implement skb fast clones. Then, we might experiment more easily alternative layouts. This patch adds a new skb_fclone_busy() helper, used by tcp and xfrm, to stop leaking of implementation details. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-30net: sched: enable per cpu qstatsJohn Fastabend
After previous patches to simplify qstats the qstats can be made per cpu with a packed union in Qdisc struct. Signed-off-by: John Fastabend <john.r.fastabend@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-30net: sched: restrict use of qstats qlenJohn Fastabend
This removes the use of qstats->qlen variable from the classifiers and makes it an explicit argument to gnet_stats_copy_queue(). The qlen represents the qdisc queue length and is packed into the qstats at the last moment before passnig to user space. By handling it explicitely we avoid, in the percpu stats case, having to figure out which per_cpu variable to put it in. It would probably be best to remove it from qstats completely but qstats is a user space ABI and can't be broken. A future patch could make an internal only qstats structure that would avoid having to allocate an additional u32 variable on the Qdisc struct. This would make the qstats struct 128bits instead of 128+32. Signed-off-by: John Fastabend <john.r.fastabend@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-30net: sched: make bstats per cpu and estimator RCU safeJohn Fastabend
In order to run qdisc's without locking statistics and estimators need to be handled correctly. To resolve bstats make the statistics per cpu. And because this is only needed for qdiscs that are running without locks which is not the case for most qdiscs in the near future only create percpu stats when qdiscs set the TCQ_F_CPUSTATS flag. Next because estimators use the bstats to calculate packets per second and bytes per second the estimator code paths are updated to use the per cpu statistics. Signed-off-by: John Fastabend <john.r.fastabend@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-29gro: fix aggregation for skb using frag_listEric Dumazet
In commit 8a29111c7ca6 ("net: gro: allow to build full sized skb") I added a regression for linear skb that traditionally force GRO to use the frag_list fallback. Erez Shitrit found that at most two segments were aggregated and the "if (skb_gro_len(p) != pinfo->gso_size)" test was failing. This is because pinfo at this spot still points to the last skb in the chain, instead of the first one, where we find the correct gso_size information. Signed-off-by: Eric Dumazet <edumazet@google.com> Fixes: 8a29111c7ca6 ("net: gro: allow to build full sized skb") Reported-by: Erez Shitrit <erezsh@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-29net: reorganize sk_buff for faster __copy_skb_header()Eric Dumazet
With proliferation of bit fields in sk_buff, __copy_skb_header() became quite expensive, showing as the most expensive function in a GSO workload. __copy_skb_header() performance is also critical for non GSO TCP operations, as it is used from skb_clone() This patch carefully moves all the fields that were not copied in a separate zone : cloned, nohdr, fclone, peeked, head_frag, xmit_more Then I moved all other fields and all other copied fields in a section delimited by headers_start[0]/headers_end[0] section so that we can use a single memcpy() call, inlined by compiler using long word load/stores. I also tried to make all copies in the natural orders of sk_buff, to help hardware prefetching. I made sure sk_buff size did not change. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-28net_dma: simple removalDan Williams
Per commit "77873803363c net_dma: mark broken" net_dma is no longer used and there is no plan to fix it. This is the mechanical removal of bits in CONFIG_NET_DMA ifdef guards. Reverting the remainder of the net_dma induced changes is deferred to subsequent patches. Marked for stable due to Roman's report of a memory leak in dma_pin_iovec_pages(): https://lkml.org/lkml/2014/9/3/177 Cc: Dave Jiang <dave.jiang@intel.com> Cc: Vinod Koul <vinod.koul@intel.com> Cc: David Whipple <whipple@securedatainnovations.ch> Cc: Alexander Duyck <alexander.h.duyck@intel.com> Cc: <stable@vger.kernel.org> Reported-by: Roman Gushchin <klamm@yandex-team.ru> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2014-09-26net : optimize skb_release_data()Eric Dumazet
Cache skb_shinfo(skb) in a variable to avoid computing it multiple times. Reorganize the tests to remove one indentation level. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-26net: optimise inet_proto_csum_replace4()LEROY Christophe
csum_partial() is a generic function which is not optimised for small fixed length calculations, and its use requires to store "from" and "to" values in memory while we already have them available in registers. This also has impact, especially on RISC processors. In the same spirit as the change done by Eric Dumazet on csum_replace2(), this patch rewrites inet_proto_csum_replace4() taking into account RFC1624. I spotted during a NATted tcp transfert that csum_partial() is one of top 5 consuming functions (around 8%), and the second user of csum_partial() is inet_proto_csum_replace4(). Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-26net: introduce __skb_header_release()Eric Dumazet
While profiling TCP stack, I noticed one useless atomic operation in tcp_sendmsg(), caused by skb_header_release(). It turns out all current skb_header_release() users have a fresh skb, that no other user can see, so we can avoid one atomic operation. Introduce __skb_header_release() to clearly document this. This gave me a 1.5 % improvement on TCP_RR workload. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>