summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2015-05-13net: sched: use counter to break reclassify loopsFlorian Westphal
Seems all we want here is to avoid endless 'goto reclassify' loop. tc_classify_compat even resets this counter when something other than TC_ACT_RECLASSIFY is returned, so this skb-counter doesn't break hypothetical loops induced by something other than perpetual TC_ACT_RECLASSIFY return values. skb_act_clone is now identical to skb_clone, so just use that. Tested with following (bogus) filter: tc filter add dev eth0 parent ffff: \ protocol ip u32 match u32 0 0 police rate 10Kbit burst \ 64000 mtu 1500 action reclassify Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-13Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Four minor merge conflicts: 1) qca_spi.c renamed the local variable used for the SPI device from spi_device to spi, meanwhile the spi_set_drvdata() call got moved further up in the probe function. 2) Two changes were both adding new members to codel params structure, and thus we had overlapping changes to the initializer function. 3) 'net' was making a fix to sk_release_kernel() which is completely removed in 'net-next'. 4) In net_namespace.c, the rtnl_net_fill() call for GET operations had the command value fixed, meanwhile 'net-next' adjusted the argument signature a bit. This also matches example merge resolutions posted by Stephen Rothwell over the past two days. Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-13switchdev: don't use anonymous union on switchdev attr/obj structsScott Feldman
Older gcc versions (e.g. gcc version 4.4.6) don't like anonymous unions which was causing build issues on the newly added switchdev attr/obj structs. Fix this by using named union on structs. Signed-off-by: Scott Feldman <sfeldma@gmail.com> Reported-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-13switchdev: sparse warning: pass ipv4 fib dst as network-byte orderScott Feldman
And let driver convert it to host-byte order as needed. Signed-off-by: Scott Feldman <sfeldma@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-13switchdev: sparse warning: make __switchdev_port_obj_add staticScott Feldman
Signed-off-by: Scott Feldman <sfeldma@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-12Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds
Pull networking fixes from David Miller: 1) Handle max TX power properly wrt VIFs and the MAC in iwlwifi, from Avri Altman. 2) Use the correct FW API for scan completions in iwlwifi, from Avraham Stern. 3) FW monitor in iwlwifi accidently uses unmapped memory, fix from Liad Kaufman. 4) rhashtable conversion of mac80211 station table was buggy, the virtual interface was not taken into account. Fix from Johannes Berg. 5) Fix deadlock in rtlwifi by not using a zero timeout for usb_control_msg(), from Larry Finger. 6) Update reordering state before calculating loss detection, from Yuchung Cheng. 7) Fix off by one in bluetooth firmward parsing, from Dan Carpenter. 8) Fix extended frame handling in xiling_can driver, from Jeppe Ledet-Pedersen. 9) Fix CODEL packet scheduler behavior in the presence of TSO packets, from Eric Dumazet. 10) Fix NAPI budget testing in fm10k driver, from Alexander Duyck. 11) macvlan needs to propagate promisc settings down the the lower device, from Vlad Yasevich. 12) igb driver can oops when changing number of rings, from Toshiaki Makita. 13) Source specific default routes not handled properly in ipv6, from Markus Stenberg. 14) Use after free in tc_ctl_tfilter(), from WANG Cong. 15) Use softirq spinlocking in netxen driver, from Tony Camuso. 16) Two ARM bpf JIT fixes from Nicolas Schichan. 17) Handle MSG_DONTWAIT properly in ring based AF_PACKET sends, from Mathias Kretschmer. 18) Fix x86 bpf JIT implementation of FROM_{BE16,LE16,LE32}, from Alexei Starovoitov. 19) ll_temac driver DMA maps TX packet header with incorrect length, fix from Michal Simek. 20) We removed pm_qos bits from netdevice.h, but some indirect references remained. Kill them. From David Ahern. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (90 commits) net: Remove remaining remnants of pm_qos from netdevice.h e1000e: Add pm_qos header net: phy: micrel: Fix regression in kszphy_probe net: ll_temac: Fix DMA map size bug x86: bpf_jit: fix FROM_BE16 and FROM_LE16/32 instructions netns: return RTM_NEWNSID instead of RTM_GETNSID on a get Update be2net maintainers' email addresses net_sched: gred: use correct backlog value in WRED mode pppoe: drop pppoe device in pppoe_unbind_sock_work net: qca_spi: Fix possible race during probe net: mdio-gpio: Allow for unspecified bus id af_packet / TX_RING not fully non-blocking (w/ MSG_DONTWAIT). bnx2x: limit fw delay in kdump to 5s after boot ARM: net: delegate filter to kernel interpreter when imm_offset() return value can't fit into 12bits. ARM: net fix emit_udiv() for BPF_ALU | BPF_DIV | BPF_K intruction. mpls: Change reserved label names to be consistent with netbsd usbnet: avoid integer overflow in start_xmit netxen_nic: use spin_[un]lock_bh around tx_clean_lock (2) net: xgene_enet: Set hardware dependency net: amd-xgbe: Add hardware dependency ...
2015-05-12net: make skb_dst_pop routine staticYing Xue
As xfrm_output_one() is the only caller of skb_dst_pop(), we should make skb_dst_pop() localized. Signed-off-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-12pktgen: fix packet generationAlexei Starovoitov
pkt_gen->last_ok was not set properly, so after the first burst pktgen instead of allocating new packet, will reuse old one, advance eth_type_trans further, which would mean the stack will be seeing very short bogus packets. Fixes: 62f64aed622b ("pktgen: introduce xmit_mode '<start_xmit|netif_receive>'") Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-12net: deinline netif_tx_stop_all_queues(), remove WARN_ON in ↵Denys Vlasenko
netif_tx_stop_queue() These functions compile to 60 bytes of machine code each. With this .config: http://busybox.net/~vda/kernel_config there are 617 calls of netif_tx_stop_queue() and 49 calls of netif_tx_stop_all_queues() in vmlinux. To fix this, remove WARN_ON in netif_tx_stop_queue() as suggested by davem, and deinline netif_tx_stop_all_queues(). Change in code size is about 20k: text data bss dec hex filename 82426986 22255416 20627456 125309858 77813a2 vmlinux.before 82406248 22255416 20627456 125289120 777c2a0 vmlinux gcc-4.7.2 still creates deinlined version of netif_tx_stop_queue sometimes: $ nm --size-sort vmlinux | grep netif_tx_stop_queue | wc -l 190 ffffffff81b558a8 <netif_tx_stop_queue>: ffffffff81b558a8: 55 push %rbp ffffffff81b558a9: 48 89 e5 mov %rsp,%rbp ffffffff81b558ac: f0 80 8f e0 01 00 00 lock orb $0x1,0x1e0(%rdi) ffffffff81b558b3: 01 ffffffff81b558b4: 5d pop %rbp ffffffff81b558b5: c3 retq This needs additional fixing. Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com> CC: Alexei Starovoitov <alexei.starovoitov@gmail.com> CC: Alexander Duyck <alexander.duyck@gmail.com> CC: Joe Perches <joe@perches.com> CC: David S. Miller <davem@davemloft.net> CC: Jiri Pirko <jpirko@redhat.com> CC: linux-kernel@vger.kernel.org CC: netdev@vger.kernel.org CC: netfilter-devel@vger.kernel.org Acked-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-12netns: return RTM_NEWNSID instead of RTM_GETNSID on a getNicolas Dichtel
Usually, RTM_NEWxxx is returned on a get (same as a dump). Fixes: 0c7aecd4bde4 ("netns: add rtnl cmd to add and get peer netns ids") Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-12switchdev: remove NETIF_F_HW_SWITCH_OFFLOAD feature flagScott Feldman
Roopa said remove the feature flag for this series and she'll work on bringing it back if needed at a later date. Signed-off-by: Scott Feldman <sfeldma@gmail.com> Acked-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-12switchdev: convert fib_ipv4_add/del over to switchdev_port_obj_add/delScott Feldman
The IPv4 FIB ops convert nicely to the switchdev objs and we're left with only four switchdev ops: port get/set and port add/del. Other objs will follow, such as FDB. So go ahead and convert IPv4 FIB over to switchdev obj for consistency, anticipating more objs to come. Signed-off-by: Scott Feldman <sfeldma@gmail.com> Acked-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-12switchdev: add new switchdev_port_bridge_getlinkScott Feldman
Like bridge_setlink, add switchdev wrapper to handle bridge_getlink and call into port driver to get port attrs. For now, only BR_LEARNING and BR_LEARNING_SYNC are returned. To add more, we'll probably want to break away from ndo_dflt_bridge_getlink() and build the netlink skb directly in the switchdev code. Signed-off-by: Scott Feldman <sfeldma@gmail.com> Acked-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-12bridge: revert br_dellink change back to originalScott Feldman
This is revert of: commit 68e331c785b8 ("bridge: offload bridge port attributes to switch asic if feature flag set") Restore br_dellink back to original and don't call into SELF port driver. rtnetlink.c:bridge_dellink() already does a call into port driver for SELF. bridge vlan add/del cmd defaults to MASTER. From man page for bridge vlan add/del cmd: self the vlan is configured on the specified physical device. Required if the device is the bridge device. master the vlan is configured on the software bridge (default). Signed-off-by: Scott Feldman <sfeldma@gmail.com> Acked-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-12switchdev: remove unused switchdev_port_bridge_dellinkScott Feldman
Now we can remove old wrappers for dellink. Signed-off-by: Scott Feldman <sfeldma@gmail.com> Acked-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-12switchdev: add new switchdev_port_bridge_dellinkScott Feldman
Same change as setlink. Provide the wrapper op for SELF ndo_bridge_dellink and call into the switchdev driver to delete afspec VLANs. Signed-off-by: Scott Feldman <sfeldma@gmail.com> Acked-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-12bridge: restore br_setlink back to originalScott Feldman
This is revert of: commit 68e331c785b8 ("bridge: offload bridge port attributes to switch asic if feature flag set") Restore br_setlink back to original and don't call into SELF port driver. rtnetlink.c:bridge_setlink() already does a call into port driver for SELF. bridge set link cmd defaults to MASTER. From man page for bridge link set cmd: self link setting is configured on specified physical device master link setting is configured on the software bridge (default) The link setting has two values: the device-side value and the software bridge-side value. These are independent and settable using the bridge link set cmd by specifying some combination of [master] | [self]. Furthermore, the device-side and bridge-side settings have their own initial value, viewable from bridge -d link show cmd. Restoring br_setlink back to original makes rocker (the only in-kernel user of SELF link settings) work as first implement: two-sided values. It's true that when both MASTER and SELF are specified from the command, two netlink notifications are generated, one for each side of the settings. The user-space app can distiquish between the two notifications by observing the MASTER or SELF flag. Signed-off-by: Scott Feldman <sfeldma@gmail.com> Acked-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-12switchdev: remove old switchdev_port_bridge_setlinkScott Feldman
New attr-based bridge_setlink can recurse lower devs and recover on err, so remove old wrapper (including ndo_dflt_switchdev_port_bridge_setlink). Signed-off-by: Scott Feldman <sfeldma@gmail.com> Acked-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-12switchdev: add new switchdev bridge setlinkScott Feldman
Add new switchdev_port_bridge_setlink that can be used by drivers implementing .ndo_bridge_setlink to set switchdev bridge attributes. Basically turn the raw rtnl_bridge_setlink netlink into switchdev attr sets. Proper netlink attr policy checking is done on the protinfo part of the netlink msg. Currently, for protinfo, only bridge port attrs BR_LEARNING and BR_LEARNING_SYNC are parsed and passed to port driver. For afspec, VLAN objs are passed so switchdev driver can set VLANs assigned to SELF. To illustrate with iproute2 cmd, we have: bridge vlan add vid 10 dev sw1p1 self master To add VLAN 10 to port sw1p1 for both the bridge (master) and the device (self). Signed-off-by: Scott Feldman <sfeldma@gmail.com> Acked-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-12switchdev: introduce switchdev add/del obj opsScott Feldman
Like switchdev attr get/set, add new switchdev obj add/del. switchdev objs will be things like VLANs or FIB entries, so add/del fits better for objects than get/set used for attributes. Use same two-phase prepare-commit transaction model as in attr set. Signed-off-by: Scott Feldman <sfeldma@gmail.com> Acked-by: Sridhar Samudrala <sridhar.samudrala@intel.com> Acked-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-12switchdev: convert STP update to switchdev attr setScott Feldman
STP update is just a settable port attribute, so convert switchdev_port_stp_update to an attr set. For DSA, the prepare phase is skipped and STP updates are only done in the commit phase. This is because currently the DSA drivers don't need to allocate any memory for STP updates and the STP update will not fail to HW (unless something horrible goes wrong on the MDIO bus, in which case the prepare phase wouldn't have been able to predict anyway). Signed-off-by: Scott Feldman <sfeldma@gmail.com> Acked-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-12switchdev: convert parent_id_get to switchdev attr getScott Feldman
Switch ID is just a gettable port attribute. Convert switchdev op switchdev_parent_id_get to a switchdev attr. Note: for sysfs and netlink interfaces, SWITCHDEV_ATTR_PORT_PARENT_ID is called with SWITCHDEV_F_NO_RECUSE to limit switch ID user-visiblity to only port netdevs. So when a port is stacked under bond/bridge, the user can only query switch id via the switch ports, but not via the upper devices Signed-off-by: Scott Feldman <sfeldma@gmail.com> Acked-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-12switchdev: introduce get/set attrs opsScott Feldman
Add two new swdev ops for get/set switch port attributes. Most swdev interactions on a port are gets or sets on port attributes, so rather than adding ops for each attribute, let's define clean get/set ops for all attributes, and then we can have clear, consistent rules on how attributes propagate on stacked devs. Add the basic algorithms for get/set attr ops. Use the same recusive algo to walk lower devs we've used for STP updates, for example. For get, compare attr value for each lower dev and only return success if attr values match across all lower devs. For sets, set the same attr value for all lower devs. We'll use a two-phase prepare-commit transaction model for sets. In the first phase, the driver(s) are asked if attr set is OK. If all OK, the commit attr set in second phase. A driver would NACK the prepare phase if it can't set the attr due to lack of resources or support, within it's control. RTNL lock must be held across both phases because we'll recurse all lower devs first in prepare phase, and then recurse all lower devs again in commit phase. If any lower dev fails the prepare phase, we need to abort the transaction for all lower devs. If lower dev recusion isn't desired, allow a flag SWITCHDEV_F_NO_RECURSE to indicate get/set only work on port (lowest) device. Signed-off-by: Scott Feldman <sfeldma@gmail.com> Acked-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-12switchdev: s/swdev_/switchdev_/Jiri Pirko
Turned out that "switchdev" sticks. So just unify all related terms to use this prefix. Signed-off-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: Scott Feldman <sfeldma@gmail.com> Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com> Acked-by: Andy Gospodarek <gospo@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-12switchdev: s/netdev_switch_/switchdev_/ and s/NETDEV_SWITCH_/SWITCHDEV_/Jiri Pirko
Turned out that "switchdev" sticks. So just unify all related terms to use this prefix. Signed-off-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: Scott Feldman <sfeldma@gmail.com> Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com> Acked-by: Andy Gospodarek <gospo@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-12net_sched: gred: add TCA_GRED_LIMIT attributeDavid Ward
In a GRED qdisc, if the default "virtual queue" (VQ) does not have drop parameters configured, then packets for the default VQ are not subjected to RED and are only dropped if the queue is larger than the net_device's tx_queue_len. This behavior is useful for WRED mode, since these packets will still influence the calculated average queue length and (therefore) the drop probability for all of the other VQs. However, for some drivers tx_queue_len is zero. In other cases the user may wish to make the limit the same for all VQs (including the default VQ with no drop parameters). This change adds a TCA_GRED_LIMIT attribute to set the GRED queue limit, in bytes, during qdisc setup. (This limit is in bytes to be consistent with the drop parameters.) The default limit is the same as for a bfifo queue (tx_queue_len * psched_mtu). If the drop parameters of any VQ are configured with a smaller limit than the GRED queue limit, that VQ will still observe the smaller limit instead. Signed-off-by: David Ward <david.ward@ll.mit.edu> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-12net: Add skb_free_frag to replace use of put_page in freeing skb->headAlexander Duyck
This change adds a function called skb_free_frag which is meant to compliment the function netdev_alloc_frag. The general idea is to enable a more lightweight version of page freeing since we don't actually need all the overhead of a put_page, and we don't quite fit the model of __free_pages. Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-12mm/net: Rename and move page fragment handling from net/ to mm/Alexander Duyck
This change moves the __alloc_page_frag functionality out of the networking stack and into the page allocation portion of mm. The idea it so help make this maintainable by placing it with other page allocation functions. Since we are moving it from skbuff.c to page_alloc.c I have also renamed the basic defines and structure from netdev_alloc_cache to page_frag_cache to reflect that this is now part of a different kernel subsystem. I have also added a simple __free_page_frag function which can handle freeing the frags based on the skb->head pointer. The model for this is based off of __free_pages since we don't actually need to deal with all of the cases that put_page handles. I incorporated the virt_to_head_page call and compound_order into the function as it actually allows for a signficant size reduction by reducing code duplication. Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-12net: Store virtual address instead of page in netdev_alloc_cacheAlexander Duyck
This change makes it so that we store the virtual address of the page in the netdev_alloc_cache instead of the page pointer. The idea behind this is to avoid multiple calls to page_address since the virtual address is required for every access, but the page pointer is only needed at allocation or reset of the page. While I was at it I also reordered the netdev_alloc_cache structure a bit so that the size is always 16 bytes by dropping size in the case where PAGE_SIZE is greater than or equal to 32KB. Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-12net: Use cached copy of pfmemalloc to avoid accessing pageAlexander Duyck
While testing I found that the testing for pfmemalloc in build_skb was rather expensive. I found the issue to be two-fold. First we have to get from the virtual address to the head page and that comes at the cost of something like 11 cycles. Then there is the cost for reading pfmemalloc out of the head page which can be cache cold due to the fact that put_page_testzero is likely invalidating the cache-line on one or more CPUs as the fragments can be shared. To avoid this extra expense I have added a pfmemalloc member to the netdev_alloc_cache. I then pushed pieces of __alloc_rx_skb into __napi_alloc_skb and __netdev_alloc_skb so that I could rewrite them to make use of the cached pfmemalloc value. The result is that my perf traces show a reduction from 9.28% overhead to 3.7% for the code covered by build_skb, __alloc_rx_skb, and __napi_alloc_skb when performing a test with the packet being dropped instead of being handed to napi_gro_receive. Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-11net: sched: deprecate enqueue_root()Eric Dumazet
Only left enqueue_root() user is netem, and it looks not necessary : qdisc_skb_cb(skb)->pkt_len is preserved after one skb_clone() Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-11net_sched: gred: use correct backlog value in WRED modeDavid Ward
In WRED mode, the backlog for a single virtual queue (VQ) should not be used to determine queue behavior; instead the backlog is summed across all VQs. This sum is currently used when calculating the average queue lengths. It also needs to be used when determining if the queue's hard limit has been reached, or when reporting each VQ's backlog via netlink. q->backlog will only be used if the queue switches out of WRED mode. Signed-off-by: David Ward <david.ward@ll.mit.edu> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-11net: sched: further simplify handle_ingDaniel Borkmann
Ingress qdisc has no other purpose than calling into tc_classify() that executes attached classifier(s) and action(s). It has a 1:1 relationship to dev->ingress_queue. After having commit 087c1a601ad7 ("net: sched: run ingress qdisc without locks") removed the central ingress lock, one major contention point is gone. The extra indirection layers however, are not necessary for calling into ingress qdisc. pktgen calling locally into netif_receive_skb() with a dummy u32, single CPU result on a Supermicro X10SLM-F, Xeon E3-1240: before ~21,1 Mpps, after patch ~22,9 Mpps. We can redirect the private classifier list to the netdev directly, without changing any classifier API bits (!) and execute on that from handle_ing() side. The __QDISC_STATE_DEACTIVATE test can be removed, ingress qdisc doesn't have a queue and thus dev_deactivate_queue() is also not applicable, ingress_cl_list provides similar behaviour. In other words, ingress qdisc acts like TCQ_F_BUILTIN qdisc. One next possible step is the removal of the dev's ingress (dummy) netdev_queue, and to only have the list member in the netdevice itself. Note, the filter chain is RCU protected and individual filter elements are being kfree'd by sched subsystem after RCU grace period. RCU read lock is being held by __netif_receive_skb_core(). Joint work with Alexei Starovoitov. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-11net: sched: consolidate handle_ing and ing_filterDaniel Borkmann
Given quite some code has been removed from ing_filter(), we can just consolidate that function into handle_ing() and get rid of a few instructions at the same time. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-11net: kill sk_change_net and sk_release_kernelEric W. Biederman
These functions are no longer needed and no longer used kill them. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-11netlink: Create kernel netlink sockets in the proper network namespaceEric W. Biederman
Utilize the new functionality of sk_alloc so that nothing needs to be done to suprress the reference counting on kernel sockets. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-11net: Modify sk_alloc to not reference count the netns of kernel sockets.Eric W. Biederman
Now that sk_alloc knows when a kernel socket is being allocated modify it to not reference count the network namespace of kernel sockets. Keep track of if a socket needs reference counting by adding a flag to struct sock called sk_net_refcnt. Update all of the callers of sock_create_kern to stop using sk_change_net and sk_release_kernel as those hacks are no longer needed, to avoid reference counting a kernel socket. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-11net: Pass kern from net_proto_family.create to sk_allocEric W. Biederman
In preparation for changing how struct net is refcounted on kernel sockets pass the knowledge that we are creating a kernel socket from sock_create_kern through to sk_alloc. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-11net: Add a struct net parameter to sock_create_kernEric W. Biederman
This is long overdue, and is part of cleaning up how we allocate kernel sockets that don't reference count struct net. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-11tun: Utilize the normal socket network namespace refcounting.Eric W. Biederman
There is no need for tun to do the weird network namespace refcounting. The existing network namespace refcounting in tfile has almost exactly the same lifetime. So rewrite the code to use the struct sock network namespace refcounting and remove the unnecessary hand rolled network namespace refcounting and the unncesary tfile->net. This change allows the tun code to directly call sock_put bypassing sock_release and making SOCK_EXTERNALLY_ALLOCATED unnecessary. Remove the now unncessary tun_release so that if anything tries to use the sock_release code path the kernel will oops, and let us know about the bug. The macvtap code already uses it's internal socket this way. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-10codel: add ce_threshold attributeEric Dumazet
For DCTCP or similar ECN based deployments on fabrics with shallow buffers, hosts are responsible for a good part of the buffering. This patch adds an optional ce_threshold to codel & fq_codel qdiscs, so that DCTCP can have feedback from queuing in the host. A DCTCP enabled egress port simply have a queue occupancy threshold above which ECT packets get CE mark. In codel language this translates to a sojourn time, so that one doesn't have to worry about bytes or bandwidth but delays. This makes the host an active participant in the health of the whole network. This also helps experimenting DCTCP in a setup without DCTCP compliant fabric. On following example, ce_threshold is set to 1ms, and we can see from 'ldelay xxx us' that TCP is not trying to go around the 5ms codel target. Queue has more capacity to absorb inelastic bursts (say from UDP traffic), as queues are maintained to an optimal level. lpaa23:~# ./tc -s -d qd sh dev eth1 qdisc mq 1: dev eth1 root Sent 87910654696 bytes 58065331 pkt (dropped 0, overlimits 0 requeues 42961) backlog 3108242b 364p requeues 42961 qdisc codel 8063: dev eth1 parent 1:1 limit 1000p target 5.0ms ce_threshold 1.0ms interval 100.0ms Sent 7363778701 bytes 4863809 pkt (dropped 0, overlimits 0 requeues 5503) rate 2348Mbit 193919pps backlog 255866b 46p requeues 5503 count 0 lastcount 0 ldelay 1.0ms drop_next 0us maxpacket 68130 ecn_mark 0 drop_overlimit 0 ce_mark 72384 qdisc codel 8064: dev eth1 parent 1:2 limit 1000p target 5.0ms ce_threshold 1.0ms interval 100.0ms Sent 7636486190 bytes 5043942 pkt (dropped 0, overlimits 0 requeues 5186) rate 2319Mbit 191538pps backlog 207418b 64p requeues 5186 count 0 lastcount 0 ldelay 694us drop_next 0us maxpacket 68130 ecn_mark 0 drop_overlimit 0 ce_mark 69873 qdisc codel 8065: dev eth1 parent 1:3 limit 1000p target 5.0ms ce_threshold 1.0ms interval 100.0ms Sent 11569360142 bytes 7641602 pkt (dropped 0, overlimits 0 requeues 5554) rate 3041Mbit 251096pps backlog 210446b 59p requeues 5554 count 0 lastcount 0 ldelay 889us drop_next 0us maxpacket 68130 ecn_mark 0 drop_overlimit 0 ce_mark 37780 ... Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Florian Westphal <fw@strlen.de> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Glenn Judd <glenn.judd@morganstanley.com> Cc: Nandita Dukkipati <nanditad@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-10af_packet / TX_RING not fully non-blocking (w/ MSG_DONTWAIT).Kretschmer, Mathias
This patch fixes an issue where the send(MSG_DONTWAIT) call on a TX_RING is not fully non-blocking in cases where the device's sndBuf is full. We pass nonblock=true to sock_alloc_send_skb() and return any possibly occuring error code (most likely EGAIN) to the caller. As the fast-path stays as it is, we keep the unlikely() around skb == NULL. Signed-off-by: Mathias Kretschmer <mathias.kretschmer@fokus.fraunhofer.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-09mpls: Change reserved label names to be consistent with netbsdTom Herbert
Since these are now visible to userspace it is nice to be consistent with BSD (sys/netmpls/mpls.h in netBSD). Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-09pktgen: introduce xmit_mode '<start_xmit|netif_receive>'Alexei Starovoitov
Introduce xmit_mode 'netif_receive' for pktgen which generates the packets using familiar pktgen commands, but feeds them into netif_receive_skb() instead of ndo_start_xmit(). Default mode is called 'start_xmit'. It is designed to test netif_receive_skb and ingress qdisc performace only. Make sure to understand how it works before using it for other rx benchmarking. Sample script 'pktgen.sh': \#!/bin/bash function pgset() { local result echo $1 > $PGDEV result=`cat $PGDEV | fgrep "Result: OK:"` if [ "$result" = "" ]; then cat $PGDEV | fgrep Result: fi } [ -z "$1" ] && echo "Usage: $0 DEV" && exit 1 ETH=$1 PGDEV=/proc/net/pktgen/kpktgend_0 pgset "rem_device_all" pgset "add_device $ETH" PGDEV=/proc/net/pktgen/$ETH pgset "xmit_mode netif_receive" pgset "pkt_size 60" pgset "dst 198.18.0.1" pgset "dst_mac 90:e2:ba:ff:ff:ff" pgset "count 10000000" pgset "burst 32" PGDEV=/proc/net/pktgen/pgctrl echo "Running... ctrl^C to stop" pgset "start" echo "Done" cat /proc/net/pktgen/$ETH Usage: $ sudo ./pktgen.sh eth2 ... Result: OK: 232376(c232372+d3) usec, 10000000 (60byte,0frags) 43033682pps 20656Mb/sec (20656167360bps) errors: 10000000 Raw netif_receive_skb speed should be ~43 million packet per second on 3.7Ghz x86 and 'perf report' should look like: 37.69% kpktgend_0 [kernel.vmlinux] [k] __netif_receive_skb_core 25.81% kpktgend_0 [kernel.vmlinux] [k] kfree_skb 7.22% kpktgend_0 [kernel.vmlinux] [k] ip_rcv 5.68% kpktgend_0 [pktgen] [k] pktgen_thread_worker If fib_table_lookup is seen on top, it means skb was processed by the stack. To benchmark netif_receive_skb only make sure that 'dst_mac' of your pktgen script is different from receiving device mac and it will be dropped by ip_rcv Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-09pktgen: adjust flag NO_TIMESTAMP to be more pktgen compliantJesper Dangaard Brouer
Allow flag NO_TIMESTAMP to turn timestamping on again, like other flags, with a negation of the flag like !NO_TIMESTAMP. Also document the option flag NO_TIMESTAMP. Fixes: afb84b626184 ("pktgen: add flag NO_TIMESTAMP to disable timestamping") Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-09netlink: allow to listen "all" netnsNicolas Dichtel
More accurately, listen all netns that have a nsid assigned into the netns where the netlink socket is opened. For this purpose, a netlink socket option is added: NETLINK_LISTEN_ALL_NSID. When this option is set on a netlink socket, this socket will receive netlink notifications from all netns that have a nsid assigned into the netns where the socket has been opened. The nsid is sent to userland via an anscillary data. With this patch, a daemon needs only one socket to listen many netns. This is useful when the number of netns is high. Because 0 is a valid value for a nsid, the field nsid_is_set indicates if the field nsid is valid or not. skb->cb is initialized to 0 on skb allocation, thus we are sure that we will never send a nsid 0 by error to the userland. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Acked-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-09netlink: rename private flags and statesNicolas Dichtel
These flags and states have the same prefix (NETLINK_) that netlink socket options. To avoid confusion and to be able to name a flag like a socket option, let's use an other prefix: NETLINK_[S|F]_. Note: a comment has been fixed, it was talking about NETLINK_RECV_NO_ENOBUFS socket option instead of NETLINK_NO_ENOBUFS. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Acked-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-09netns: use a spin_lock to protect nsid managementNicolas Dichtel
Before this patch, nsid were protected by the rtnl lock. The goal of this patch is to be able to find a nsid without needing to hold the rtnl lock. The next patch will introduce a netlink socket option to listen to all netns that have a nsid assigned into the netns where the socket is opened. Thus, it's important to call rtnl_net_notifyid() outside the spinlock, to avoid a recursive lock (nsid are notified via rtnl). This was the main reason of the previous patch. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-09netns: notify new nsid outside __peernet2id()Nicolas Dichtel
There is no functional change with this patch. It will ease the refactoring of the locking system that protects nsids and the support of the netlink socket option NETLINK_LISTEN_ALL_NSID. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Acked-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-09netns: rename peernet2id() to peernet2id_alloc()Nicolas Dichtel
In a following commit, a new function will be introduced to only lookup for a nsid (no allocation if the nsid doesn't exist). To avoid confusion, the existing function is renamed. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Acked-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>