summaryrefslogtreecommitdiff
path: root/net/openvswitch/flow.c
AgeCommit message (Collapse)Author
2017-02-09openvswitch: Add original direction conntrack tuple to sw_flow_key.Jarno Rajahalme
Add the fields of the conntrack original direction 5-tuple to struct sw_flow_key. The new fields are initially marked as non-existent, and are populated whenever a conntrack action is executed and either finds or generates a conntrack entry. This means that these fields exist for all packets that were not rejected by conntrack as untrackable. The original tuple fields in the sw_flow_key are filled from the original direction tuple of the conntrack entry relating to the current packet, or from the original direction tuple of the master conntrack entry, if the current conntrack entry has a master. Generally, expected connections of connections having an assigned helper (e.g., FTP), have a master conntrack entry. The main purpose of the new conntrack original tuple fields is to allow matching on them for policy decision purposes, with the premise that the admissibility of tracked connections reply packets (as well as original direction packets), and both direction packets of any related connections may be based on ACL rules applying to the master connection's original direction 5-tuple. This also makes it easier to make policy decisions when the actual packet headers might have been transformed by NAT, as the original direction 5-tuple represents the packet headers before any such transformation. When using the original direction 5-tuple the admissibility of return and/or related packets need not be based on the mere existence of a conntrack entry, allowing separation of admission policy from the established conntrack state. While existence of a conntrack entry is required for admission of the return or related packets, policy changes can render connections that were initially admitted to be rejected or dropped afterwards. If the admission of the return and related packets was based on mere conntrack state (e.g., connection being in an established state), a policy change that would make the connection rejected or dropped would need to find and delete all conntrack entries affected by such a change. When using the original direction 5-tuple matching the affected conntrack entries can be allowed to time out instead, as the established state of the connection would not need to be the basis for packet admission any more. It should be noted that the directionality of related connections may be the same or different than that of the master connection, and neither the original direction 5-tuple nor the conntrack state bits carry this information. If needed, the directionality of the master connection can be stored in master's conntrack mark or labels, which are automatically inherited by the expected related connections. The fact that neither ARP nor ND packets are trackable by conntrack allows mutual exclusion between ARP/ND and the new conntrack original tuple fields. Hence, the IP addresses are overlaid in union with ARP and ND fields. This allows the sw_flow_key to not grow much due to this patch, but it also means that we must be careful to never use the new key fields with ARP or ND packets. ARP is easy to distinguish and keep mutually exclusive based on the ethernet type, but ND being an ICMPv6 protocol requires a bit more attention. Signed-off-by: Jarno Rajahalme <jarno@ovn.org> Acked-by: Joe Stringer <joe@ovn.org> Acked-by: Pravin B Shelar <pshelar@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-27openvswitch: upcall: Fix vlan handling.pravin shelar
Networking stack accelerate vlan tag handling by keeping topmost vlan header in skb. This works as long as packet remains in OVS datapath. But during OVS upcall vlan header is pushed on to the packet. When such packet is sent back to OVS datapath, core networking stack might not handle it correctly. Following patch avoids this issue by accelerating the vlan tag during flow key extract. This simplifies datapath by bringing uniform packet processing for packets from all code paths. Fixes: 5108bbaddc ("openvswitch: add processing of L3 packets"). CC: Jarno Rajahalme <jarno@ovn.org> CC: Jiri Benc <jbenc@redhat.com> Signed-off-by: Pravin B Shelar <pshelar@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-13openvswitch: add processing of L3 packetsJiri Benc
Support receiving, extracting flow key and sending of L3 packets (packets without an Ethernet header). Note that even after this patch, non-Ethernet interfaces are still not allowed to be added to bridges. Similarly, netlink interface for sending and receiving L3 packets to/from user space is not in place yet. Based on previous versions by Lorand Jakab and Simon Horman. Signed-off-by: Lorand Jakab <lojakab@cisco.com> Signed-off-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: Jiri Benc <jbenc@redhat.com> Acked-by: Pravin B Shelar <pshelar@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-13openvswitch: add mac_proto field to the flow keyJiri Benc
Use a hole in the structure. We support only Ethernet so far and will add a support for L2-less packets shortly. We could use a bool to indicate whether the Ethernet header is present or not but the approach with the mac_proto field is more generic and occupies the same number of bytes in the struct, while allowing later extensibility. It also makes the code in the next patches more self explaining. It would be nice to use ARPHRD_ constants but those are u16 which would be waste. Thus define our own constants. Another upside of this is that we can overload this new field to also denote whether the flow key is valid. This has the advantage that on refragmentation, we don't have to reparse the packet but can rely on the stored eth.type. This is especially important for the next patches in this series - instead of adding another branch for L2-less packets before calling ovs_fragment, we can just remove all those branches completely. Signed-off-by: Jiri Benc <jbenc@redhat.com> Acked-by: Pravin B Shelar <pshelar@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-13openvswitch: vlan: remove wrong likely statementJiri Benc
This code is called whenever flow key is being extracted from the packet. The packet may be as likely vlan tagged as not. Fixes: 018c1dda5ff1 ("openvswitch: 802.1AD Flow handling, actions, vlan parsing, netlink attributes") Signed-off-by: Jiri Benc <jbenc@redhat.com> Acked-by: Pravin B Shelar <pshelar@ovn.org> Acked-by: Eric Garver <e@erig.me> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-03openvswitch: mpls: set network header correctly on key extractJiri Benc
After the 48d2ab609b6b ("net: mpls: Fixups for GSO"), MPLS handling in openvswitch was changed to have network header pointing to the start of the MPLS headers and inner_network_header pointing after the MPLS headers. However, key_extract was missed by the mentioned commit, causing incorrect headers to be set when a MPLS packet just enters the bridge or after it is recirculated. Fixes: 48d2ab609b6b ("net: mpls: Fixups for GSO") Signed-off-by: Jiri Benc <jbenc@redhat.com> Acked-by: Pravin B Shelar <pshelar@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-20openvswitch: avoid resetting flow key while installing new flow.pravin shelar
since commit commit db74a3335e0f6 ("openvswitch: use percpu flow stats") flow alloc resets flow-key. So there is no need to reset the flow-key again if OVS is using newly allocated flow-key. Signed-off-by: Pravin B Shelar <pshelar@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-18openvswitch: use percpu flow statsThadeu Lima de Souza Cascardo
Instead of using flow stats per NUMA node, use it per CPU. When using megaflows, the stats lock can be a bottleneck in scalability. On a E5-2690 12-core system, usual throughput went from ~4Mpps to ~15Mpps when forwarding between two 40GbE ports with a single flow configured on the datapath. This has been tested on a system with possible CPUs 0-7,16-23. After module removal, there were no corruption on the slab cache. Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@redhat.com> Cc: pravin shelar <pshelar@ovn.org> Acked-by: Pravin B Shelar <pshelar@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-18openvswitch: fix flow stats accounting when node 0 is not possibleThadeu Lima de Souza Cascardo
On a system with only node 1 as possible, all statistics is going to be accounted on node 0 as it will have a single writer. However, when getting and clearing the statistics, node 0 is not going to be considered, as it's not a possible node. Tested that statistics are not zero on a system with only node 1 possible. Also compile-tested with CONFIG_NUMA off. Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@redhat.com> Acked-by: Pravin B Shelar <pshelar@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-08openvswitch: 802.1AD Flow handling, actions, vlan parsing, netlink attributesEric Garver
Add support for 802.1ad including the ability to push and pop double tagged vlans. Add support for 802.1ad to netlink parsing and flow conversion. Uses double nested encap attributes to represent double tagged vlan. Inner TPID encoded along with ctci in nested attributes. This is based on Thomas F Herbert's original v20 patch. I made some small clean ups and bug fixes. Signed-off-by: Thomas F Herbert <thomasfherbert@gmail.com> Signed-off-by: Eric Garver <e@erig.me> Acked-by: Pravin B Shelar <pshelar@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-07openvswitch: add tunnel protocol to sw_flow_keyJiri Benc
Store tunnel protocol (AF_INET or AF_INET6) in sw_flow_key. This field now also acts as an indicator whether the flow contains tunnel data (this was previously indicated by tun_key.u.ipv4.dst being set but with IPv6 addresses in an union with IPv4 ones this won't work anymore). The new field was added to a hole in sw_flow_key. Signed-off-by: Jiri Benc <jbenc@redhat.com> Acked-by: Pravin B Shelar <pshelar@nicira.com> Acked-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-31ip-tunnel: Use API to access tunnel metadata options.Pravin B Shelar
Currently tun-info options pointer is used in few cases to pass options around. But tunnel options can be accessed using ip_tunnel_info_opts() API without using the pointer. Following patch removes the redundant pointer and consistently make use of API. Signed-off-by: Pravin B Shelar <pshelar@nicira.com> Acked-by: Thomas Graf <tgraf@suug.ch> Reviewed-by: Jesse Gross <jesse@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-29openvswitch: Remove vport-netPravin B Shelar
This structure is not used anymore. Signed-off-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-29openvswitch: retain parsed IPv6 header fields in flow on error skipping ↵Simon Horman
extension headers When an error occurs skipping IPv6 extension headers retain the already parsed IP protocol and IPv6 addresses in the flow. Also assume that the packet is not a fragment in the absence of information to the contrary; that is always use the frag_off value set by ipv6_skip_exthdr(). This allows matching on the IP protocol and IPv6 addresses of packets with malformed extension headers. Signed-off-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-29ip_tunnels: record IP version in tunnel infoJiri Benc
There's currently nothing preventing directing packets with IPv6 encapsulation data to IPv4 tunnels (and vice versa). If this happens, IPv6 addresses are incorrectly interpreted as IPv4 ones. Track whether the given ip_tunnel_key contains IPv4 or IPv6 data. Store this in ip_tunnel_info. Reject packets at appropriate places if they are supposed to be encapsulated into an incompatible protocol. Signed-off-by: Jiri Benc <jbenc@redhat.com> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Acked-by: Thomas Graf <tgraf@suug.ch> Acked-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-27openvswitch: Allow matching on conntrack labelJoe Stringer
Allow matching and setting the ct_label field. As with ct_mark, this is populated by executing the CT action. The label field may be modified by specifying a label and mask nested under the CT action. It is stored as metadata attached to the connection. Label modification occurs after lookup, and will only persist when the conntrack entry is committed by providing the COMMIT flag to the CT action. Labels are currently fixed to 128 bits in size. Signed-off-by: Joe Stringer <joestringer@nicira.com> Acked-by: Thomas Graf <tgraf@suug.ch> Acked-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-27openvswitch: Add conntrack actionJoe Stringer
Expose the kernel connection tracker via OVS. Userspace components can make use of the CT action to populate the connection state (ct_state) field for a flow. This state can be subsequently matched. Exposed connection states are OVS_CS_F_*: - NEW (0x01) - Beginning of a new connection. - ESTABLISHED (0x02) - Part of an existing connection. - RELATED (0x04) - Related to an established connection. - INVALID (0x20) - Could not track the connection for this packet. - REPLY_DIR (0x40) - This packet is in the reply direction for the flow. - TRACKED (0x80) - This packet has been sent through conntrack. When the CT action is executed by itself, it will send the packet through the connection tracker and populate the ct_state field with one or more of the connection state flags above. The CT action will always set the TRACKED bit. When the COMMIT flag is passed to the conntrack action, this specifies that information about the connection should be stored. This allows subsequent packets for the same (or related) connections to be correlated with this connection. Sending subsequent packets for the connection through conntrack allows the connection tracker to consider the packets as ESTABLISHED, RELATED, and/or REPLY_DIR. The CT action may optionally take a zone to track the flow within. This allows connections with the same 5-tuple to be kept logically separate from connections in other zones. If the zone is specified, then the "ct_zone" match field will be subsequently populated with the zone id. IP fragments are handled by transparently assembling them as part of the CT action. The maximum received unit (MRU) size is tracked so that refragmentation can occur during output. IP frag handling contributed by Andy Zhou. Based on original design by Justin Pettit. Signed-off-by: Joe Stringer <joestringer@nicira.com> Signed-off-by: Justin Pettit <jpettit@nicira.com> Signed-off-by: Andy Zhou <azhou@nicira.com> Acked-by: Thomas Graf <tgraf@suug.ch> Acked-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-21ip_tunnel: Make ovs_tunnel_info and ovs_key_ipv4_tunnel genericThomas Graf
Rename the tunnel metadata data structures currently internal to OVS and make them generic for use by all IP tunnels. Both structures are kernel internal and will stay that way. Their members are exposed to user space through individual Netlink attributes by OVS. It will therefore be possible to extend/modify these structures without affecting user ABI. Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-05openvswitch: Use eth_proto_is_802_3Alexander Duyck
Replace "ntohs(proto) >= ETH_P_802_3_MIN" w/ eth_proto_is_802_3(proto). Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-14mm: remove GFP_THISNODEDavid Rientjes
NOTE: this is not about __GFP_THISNODE, this is only about GFP_THISNODE. GFP_THISNODE is a secret combination of gfp bits that have different behavior than expected. It is a combination of __GFP_THISNODE, __GFP_NORETRY, and __GFP_NOWARN and is special-cased in the page allocator slowpath to fail without trying reclaim even though it may be used in combination with __GFP_WAIT. An example of the problem this creates: commit e97ca8e5b864 ("mm: fix GFP_THISNODE callers and clarify") fixed up many users of GFP_THISNODE that really just wanted __GFP_THISNODE. The problem doesn't end there, however, because even it was a no-op for alloc_misplaced_dst_page(), which also sets __GFP_NORETRY and __GFP_NOWARN, and migrate_misplaced_transhuge_page(), where __GFP_NORETRY and __GFP_NOWAIT is set in GFP_TRANSHUGE. Converting GFP_THISNODE to __GFP_THISNODE is a no-op in these cases since the page allocator special-cases __GFP_THISNODE && __GFP_NORETRY && __GFP_NOWARN. It's time to just remove GFP_THISNODE entirely. We leave __GFP_THISNODE to restrict an allocation to a local node, but remove GFP_THISNODE and its obscurity. Instead, we require that a caller clear __GFP_WAIT if it wants to avoid reclaim. This allows the aforementioned functions to actually reclaim as they should. It also enables any future callers that want to do __GFP_THISNODE but also __GFP_NORETRY && __GFP_NOWARN to reclaim. The rule is simple: if you don't want to reclaim, then don't set __GFP_WAIT. Aside: ovs_flow_stats_update() really wants to avoid reclaim as well, so it is unchanged. Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Acked-by: Pekka Enberg <penberg@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Pravin Shelar <pshelar@nicira.com> Cc: Jarno Rajahalme <jrajahalme@nicira.com> Cc: Li Zefan <lizefan@huawei.com> Cc: Greg Thelen <gthelen@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-11openvswitch: Reset key metadata for packet execution.Pravin B Shelar
Userspace packet execute command pass down flow key for given packet. But userspace can skip some parameter with zero value. Therefore kernel needs to initialize key metadata to zero. Fixes: 0714812134 ("openvswitch: Eliminate memset() from flow_extract.") Signed-off-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-15openvswitch: Rename GENEVE_TUN_OPTS() to TUN_METADATA_OPTS()Thomas Graf
Also factors out Geneve validation code into a new separate function validate_and_copy_geneve_opts(). A subsequent patch will introduce VXLAN options. Rename the existing GENEVE_TUN_OPTS() to reflect its extended purpose of carrying generic tunnel metadata options. Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-13net: rename vlan_tx_* helpers since "tx" is misleading thereJiri Pirko
The same macros are used for rx as well. So rename it. Signed-off-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-02openvswitch: Consistently include VLAN header in flow and port stats.Ben Pfaff
Until now, when VLAN acceleration was in use, the bytes of the VLAN header were not included in port or flow byte counters. They were however included when VLAN acceleration was not used. This commit corrects the inconsistency, by always including the VLAN header in byte counters. Previous discussion at http://openvswitch.org/pipermail/dev/2014-December/049521.html Reported-by: Motonori Shindo <mshindo@vmware.com> Signed-off-by: Ben Pfaff <blp@nicira.com> Reviewed-by: Flavio Leitner <fbl@sysclose.org> Acked-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-09openvswitch: Add support for OVS_FLOW_ATTR_PROBE.Jarno Rajahalme
This new flag is useful for suppressing error logging while probing for datapath features using flow commands. For backwards compatibility reasons the commands are executed normally, but error logging is suppressed. Signed-off-by: Jarno Rajahalme <jrajahalme@nicira.com> Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
2014-11-09openvswitch: Constify various function argumentsThomas Graf
Help produce better optimized code. Signed-off-by: Thomas Graf <tgraf@noironetworks.com> Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
2014-11-05openvswitch: Add basic MPLS support to kernelSimon Horman
Allow datapath to recognize and extract MPLS labels into flow keys and execute actions which push, pop, and set labels on packets. Based heavily on work by Leo Alterman, Ravi K, Isaku Yamahata and Joe Stringer. Cc: Ravi K <rkerur@gmail.com> Cc: Leo Alterman <lalterman@nicira.com> Cc: Isaku Yamahata <yamahata@valinux.co.jp> Cc: Joe Stringer <joe@wand.net.nz> Signed-off-by: Simon Horman <horms@verge.net.au> Signed-off-by: Jesse Gross <jesse@nicira.com> Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
2014-10-17openvswitch: Set flow-key members.Pravin B Shelar
This patch adds missing memset which are required to initialize flow key member. For example for IP flow we need to initialize ip.frag for all cases. Found by inspection. This bug is introduced by commit 0714812134d7dcadeb7ecfbfeb18788aa7e1eaac ("openvswitch: Eliminate memset() from flow_extract"). Signed-off-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-17openvswitch: fix a use after freeLi RongQing
pskb_may_pull() called by arphdr_ok can change skb->data, so put the arp setting after arphdr_ok to avoid the use the freed memory Fixes: 0714812134d7d ("openvswitch: Eliminate memset() from flow_extract.") Cc: Jesse Gross <jesse@nicira.com> Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: Li RongQing <roy.qing.li@gmail.com> Acked-by: Jesse Gross <jesse@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-06openvswitch: Add support for Geneve tunneling.Jesse Gross
The Openvswitch implementation is completely agnostic to the options that are in use and can handle newly defined options without further work. It does this by simply matching on a byte array of options and allowing userspace to setup flows on this array. Signed-off-by: Jesse Gross <jesse@nicira.com> Singed-off-by: Ansis Atteka <aatteka@nicira.com> Signed-off-by: Andy Zhou <azhou@nicira.com> Acked-by: Thomas Graf <tgraf@noironetworks.com> Acked-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-06openvswitch: Wrap struct ovs_key_ipv4_tunnel in a new structure.Jesse Gross
Currently, the flow information that is matched for tunnels and the tunnel data passed around with packets is the same. However, as additional information is added this is not necessarily desirable, as in the case of pointers. This adds a new structure for tunnel metadata which currently contains only the existing struct. This change is purely internal to the kernel since the current OVS_KEY_ATTR_IPV4_TUNNEL is simply a compressed version of OVS_KEY_ATTR_TUNNEL that is translated at flow setup. Signed-off-by: Jesse Gross <jesse@nicira.com> Signed-off-by: Andy Zhou <azhou@nicira.com> Acked-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-06openvswitch: Eliminate memset() from flow_extract.Jesse Gross
As new protocols are added, the size of the flow key tends to increase although few protocols care about all of the fields. In order to optimize this for hashing and matching, OVS uses a variable length portion of the key. However, when fields are extracted from the packet we must still zero out the entire key. This is no longer necessary now that OVS implements masking. Any fields (or holes in the structure) which are not part of a given protocol will be by definition not part of the mask and zeroed out during lookup. Furthermore, since masking already uses variable length keys this zeroing operation automatically benefits as well. In principle, the only thing that needs to be done at this point is remove the memset() at the beginning of flow. However, some fields assume that they are initialized to zero, which now must be done explicitly. In addition, in the event of an error we must also zero out corresponding fields to signal that there is no valid data present. These increase the total amount of code but very little of it is executed in non-error situations. Removing the memset() reduces the profile of ovs_flow_extract() from 0.64% to 0.56% when tested with large packets on a 10G link. Suggested-by: Pravin Shelar <pshelar@nicira.com> Signed-off-by: Jesse Gross <jesse@nicira.com> Signed-off-by: Andy Zhou <azhou@nicira.com> Acked-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-15openvswitch: Add recirc and hash action.Andy Zhou
Recirc action allows a packet to reenter openvswitch processing. currently openvswitch lookup flow for packet received and execute set of actions on that packet, with help of recirc action we can process/modify the packet and recirculate it back in openvswitch for another pass. OVS hash action calculates 5-tupple hash and set hash in flow-key hash. This can be used along with recirculation for distributing packets among different ports for bond devices. For example: OVS bonding can use following actions: Match on: bond flow; Action: hash, recirc(id) Match on: recirc-id == id and hash lower bits == a; Action: output port_bond_a Signed-off-by: Andy Zhou <azhou@nicira.com> Acked-by: Jesse Gross <jesse@nicira.com> Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
2014-09-15openvswitch: Use tun_key only for egress tunnel path.Pravin B Shelar
Currently tun_key is used for passing tunnel information on ingress and egress path, this cause confusion. Following patch removes its use on ingress path make it egress only parameter. Signed-off-by: Pravin B Shelar <pshelar@nicira.com> Acked-by: Andy Zhou <azhou@nicira.com>
2014-09-15openvswitch: refactor ovs flow extract API.Pravin B Shelar
OVS flow extract is called on packet receive or packet execute code path. Following patch defines separate API for extracting flow-key in packet execute code path. Signed-off-by: Pravin B Shelar <pshelar@nicira.com> Acked-by: Andy Zhou <azhou@nicira.com>
2014-08-22net/openvswitch/flow.c: Replace rcu_dereference() with rcu_access_pointer()Andreea-Cristina Bernat
The "rcu_dereference()" call is used directly in a condition. Since its return value is never dereferenced it is recommended to use "rcu_access_pointer()" instead of "rcu_dereference()". Therefore, this patch makes the replacement. The following Coccinelle semantic patch was used: @@ @@ ( if( (<+... - rcu_dereference + rcu_access_pointer (...) ...+>)) {...} | while( (<+... - rcu_dereference + rcu_access_pointer (...) ...+>)) {...} ) Signed-off-by: Andreea-Cristina Bernat <bernat.ada@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-29openvswitch: Fix tracking of flags seen in TCP flows.Ben Pfaff
Flow statistics need to take into account the TCP flags from the packet currently being processed (in 'key'), not the TCP flags matched by the flow found in the kernel flow table (in 'flow'). This bug made the Open vSwitch userspace fin_timeout action have no effect in many cases. This bug is introduced by commit 88d73f6c411ac2f0578 (openvswitch: Use TCP flags in the flow key for stats.) Reported-by: Len Gao <leng@vmware.com> Signed-off-by: Ben Pfaff <blp@nicira.com> Acked-by: Jarno Rajahalme <jrajahalme@nicira.com> Acked-by: Jesse Gross <jesse@nicira.com> Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
2014-05-22openvswitch: Fix ovs_flow_stats_get/clear RCU dereference.Jarno Rajahalme
For ovs_flow_stats_get() using ovsl_dereference() was wrong, since flow dumps call this with RCU read lock. ovs_flow_stats_clear() is always called with ovs_mutex, so can use ovsl_dereference(). Also, make the ovs_flow_stats_get() 'flow' argument const to make later patches cleaner. Signed-off-by: Jarno Rajahalme <jrajahalme@nicira.com> Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
2014-05-22openvswitch: Clarify locking.Jarno Rajahalme
Remove unnecessary locking from functions that are always called with appropriate locking. Signed-off-by: Jarno Rajahalme <jrajahalme@nicira.com> Signed-off-by: Thomas Graf <tgraf@redhat.com> Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
2014-05-22openvswitch: Compact sw_flow_key.Jarno Rajahalme
Minimize padding in sw_flow_key and move 'tp' top the main struct. These changes simplify code when accessing the transport port numbers and the tcp flags, and makes the sw_flow_key 8 bytes smaller on 64-bit systems (128->120 bytes). These changes also make the keys for IPv4 packets to fit in one cache line. There is a valid concern for safety of packing the struct ovs_key_ipv4_tunnel, as it would be possible to take the address of the tun_id member as a __be64 * which could result in unaligned access in some systems. However: - sw_flow_key itself is 64-bit aligned, so the tun_id within is always 64-bit aligned. - We never make arrays of ovs_key_ipv4_tunnel (which would force every second tun_key to be misaligned). - We never take the address of the tun_id in to a __be64 *. - Whereever we use struct ovs_key_ipv4_tunnel outside the sw_flow_key, it is in stack (on tunnel input functions), where compiler has full control of the alignment. Signed-off-by: Jarno Rajahalme <jrajahalme@nicira.com> Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
2014-05-16openvswitch: Use TCP flags in the flow key for stats.Jarno Rajahalme
We already extract the TCP flags for the key, might as well use that for stats. Signed-off-by: Jarno Rajahalme <jrajahalme@nicira.com> Acked-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: Jesse Gross <jesse@nicira.com>
2014-05-16openvswitch: Per NUMA node flow stats.Jarno Rajahalme
Keep kernel flow stats for each NUMA node rather than each (logical) CPU. This avoids using the per-CPU allocator and removes most of the kernel-side OVS locking overhead otherwise on the top of perf reports and allows OVS to scale better with higher number of threads. With 9 handlers and 4 revalidators netperf TCP_CRR test flow setup rate doubles on a server with two hyper-threaded physical CPUs (16 logical cores each) compared to the current OVS master. Tested with non-trivial flow table with a TCP port match rule forcing all new connections with unique port numbers to OVS userspace. The IP addresses are still wildcarded, so the kernel flows are not considered as exact match 5-tuple flows. This type of flows can be expected to appear in large numbers as the result of more effective wildcarding made possible by improvements in OVS userspace flow classifier. Perf results for this test (master): Events: 305K cycles + 8.43% ovs-vswitchd [kernel.kallsyms] [k] mutex_spin_on_owner + 5.64% ovs-vswitchd [kernel.kallsyms] [k] __ticket_spin_lock + 4.75% ovs-vswitchd ovs-vswitchd [.] find_match_wc + 3.32% ovs-vswitchd libpthread-2.15.so [.] pthread_mutex_lock + 2.61% ovs-vswitchd [kernel.kallsyms] [k] pcpu_alloc_area + 2.19% ovs-vswitchd ovs-vswitchd [.] flow_hash_in_minimask_range + 2.03% swapper [kernel.kallsyms] [k] intel_idle + 1.84% ovs-vswitchd libpthread-2.15.so [.] pthread_mutex_unlock + 1.64% ovs-vswitchd ovs-vswitchd [.] classifier_lookup + 1.58% ovs-vswitchd libc-2.15.so [.] 0x7f4e6 + 1.07% ovs-vswitchd [kernel.kallsyms] [k] memset + 1.03% netperf [kernel.kallsyms] [k] __ticket_spin_lock + 0.92% swapper [kernel.kallsyms] [k] __ticket_spin_lock ... And after this patch: Events: 356K cycles + 6.85% ovs-vswitchd ovs-vswitchd [.] find_match_wc + 4.63% ovs-vswitchd libpthread-2.15.so [.] pthread_mutex_lock + 3.06% ovs-vswitchd [kernel.kallsyms] [k] __ticket_spin_lock + 2.81% ovs-vswitchd ovs-vswitchd [.] flow_hash_in_minimask_range + 2.51% ovs-vswitchd libpthread-2.15.so [.] pthread_mutex_unlock + 2.27% ovs-vswitchd ovs-vswitchd [.] classifier_lookup + 1.84% ovs-vswitchd libc-2.15.so [.] 0x15d30f + 1.74% ovs-vswitchd [kernel.kallsyms] [k] mutex_spin_on_owner + 1.47% swapper [kernel.kallsyms] [k] intel_idle + 1.34% ovs-vswitchd ovs-vswitchd [.] flow_hash_in_minimask + 1.33% ovs-vswitchd ovs-vswitchd [.] rule_actions_unref + 1.16% ovs-vswitchd ovs-vswitchd [.] hindex_node_with_hash + 1.16% ovs-vswitchd ovs-vswitchd [.] do_xlate_actions + 1.09% ovs-vswitchd ovs-vswitchd [.] ofproto_rule_ref + 1.01% netperf [kernel.kallsyms] [k] __ticket_spin_lock ... There is a small increase in kernel spinlock overhead due to the same spinlock being shared between multiple cores of the same physical CPU, but that is barely visible in the netperf TCP_CRR test performance (maybe ~1% performance drop, hard to tell exactly due to variance in the test results), when testing for kernel module throughput (with no userspace activity, handful of kernel flows). On flow setup, a single stats instance is allocated (for the NUMA node 0). As CPUs from multiple NUMA nodes start updating stats, new NUMA-node specific stats instances are allocated. This allocation on the packet processing code path is made to never block or look for emergency memory pools, minimizing the allocation latency. If the allocation fails, the existing preallocated stats instance is used. Also, if only CPUs from one NUMA-node are updating the preallocated stats instance, no additional stats instances are allocated. This eliminates the need to pre-allocate stats instances that will not be used, also relieving the stats reader from the burden of reading stats that are never used. Signed-off-by: Jarno Rajahalme <jrajahalme@nicira.com> Acked-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: Jesse Gross <jesse@nicira.com>
2014-05-16openvswitch: Remove 5-tuple optimization.Jarno Rajahalme
The 5-tuple optimization becomes unnecessary with a later per-NUMA node stats patch. Remove it first to make the changes easier to grasp. Signed-off-by: Jarno Rajahalme <jrajahalme@nicira.com> Signed-off-by: Jesse Gross <jesse@nicira.com>
2014-05-16openvswitch: Use ether_addr_copyJoe Perches
It's slightly smaller/faster for some architectures. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Jesse Gross <jesse@nicira.com>
2014-03-28openvswitch: fix a possible deadlock and lockdep warningFlavio Leitner
There are two problematic situations. A deadlock can happen when is_percpu is false because it can get interrupted while holding the spinlock. Then it executes ovs_flow_stats_update() in softirq context which tries to get the same lock. The second sitation is that when is_percpu is true, the code correctly disables BH but only for the local CPU, so the following can happen when locking the remote CPU without disabling BH: CPU#0 CPU#1 ovs_flow_stats_get() stats_read() +->spin_lock remote CPU#1 ovs_flow_stats_get() | <interrupted> stats_read() | ... +--> spin_lock remote CPU#0 | | <interrupted> | ovs_flow_stats_update() | ... | spin_lock local CPU#0 <--+ ovs_flow_stats_update() +---------------------------------- spin_lock local CPU#1 This patch disables BH for both cases fixing the deadlocks. Acked-by: Jesse Gross <jesse@nicira.com> ================================= [ INFO: inconsistent lock state ] 3.14.0-rc8-00007-g632b06a #1 Tainted: G I --------------------------------- inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage. swapper/0/0 [HC0[0]:SC1[5]:HE1:SE0] takes: (&(&cpu_stats->lock)->rlock){+.?...}, at: [<ffffffffa05dd8a1>] ovs_flow_stats_update+0x51/0xd0 [openvswitch] {SOFTIRQ-ON-W} state was registered at: [<ffffffff810f973f>] __lock_acquire+0x68f/0x1c40 [<ffffffff810fb4e2>] lock_acquire+0xa2/0x1d0 [<ffffffff817d8d9e>] _raw_spin_lock+0x3e/0x80 [<ffffffffa05dd9e4>] ovs_flow_stats_get+0xc4/0x1e0 [openvswitch] [<ffffffffa05da855>] ovs_flow_cmd_fill_info+0x185/0x360 [openvswitch] [<ffffffffa05daf05>] ovs_flow_cmd_build_info.constprop.27+0x55/0x90 [openvswitch] [<ffffffffa05db41d>] ovs_flow_cmd_new_or_set+0x4dd/0x570 [openvswitch] [<ffffffff816c245d>] genl_family_rcv_msg+0x1cd/0x3f0 [<ffffffff816c270e>] genl_rcv_msg+0x8e/0xd0 [<ffffffff816c0239>] netlink_rcv_skb+0xa9/0xc0 [<ffffffff816c0798>] genl_rcv+0x28/0x40 [<ffffffff816bf830>] netlink_unicast+0x100/0x1e0 [<ffffffff816bfc57>] netlink_sendmsg+0x347/0x770 [<ffffffff81668e9c>] sock_sendmsg+0x9c/0xe0 [<ffffffff816692d9>] ___sys_sendmsg+0x3a9/0x3c0 [<ffffffff8166a911>] __sys_sendmsg+0x51/0x90 [<ffffffff8166a962>] SyS_sendmsg+0x12/0x20 [<ffffffff817e3ce9>] system_call_fastpath+0x16/0x1b irq event stamp: 1740726 hardirqs last enabled at (1740726): [<ffffffff8175d5e0>] ip6_finish_output2+0x4f0/0x840 hardirqs last disabled at (1740725): [<ffffffff8175d59b>] ip6_finish_output2+0x4ab/0x840 softirqs last enabled at (1740674): [<ffffffff8109be12>] _local_bh_enable+0x22/0x50 softirqs last disabled at (1740675): [<ffffffff8109db05>] irq_exit+0xc5/0xd0 other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(&(&cpu_stats->lock)->rlock); <Interrupt> lock(&(&cpu_stats->lock)->rlock); *** DEADLOCK *** 5 locks held by swapper/0/0: #0: (((&ifa->dad_timer))){+.-...}, at: [<ffffffff810a7155>] call_timer_fn+0x5/0x320 #1: (rcu_read_lock){.+.+..}, at: [<ffffffff81788a55>] mld_sendpack+0x5/0x4a0 #2: (rcu_read_lock_bh){.+....}, at: [<ffffffff8175d149>] ip6_finish_output2+0x59/0x840 #3: (rcu_read_lock_bh){.+....}, at: [<ffffffff8168ba75>] __dev_queue_xmit+0x5/0x9b0 #4: (rcu_read_lock){.+.+..}, at: [<ffffffffa05e41b5>] internal_dev_xmit+0x5/0x110 [openvswitch] stack backtrace: CPU: 0 PID: 0 Comm: swapper/0 Tainted: G I 3.14.0-rc8-00007-g632b06a #1 Hardware name: /DX58SO, BIOS SOX5810J.86A.5599.2012.0529.2218 05/29/2012 0000000000000000 0fcf20709903df0c ffff88042d603808 ffffffff817cfe3c ffffffff81c134c0 ffff88042d603858 ffffffff817cb6da 0000000000000005 ffffffff00000001 ffff880400000000 0000000000000006 ffffffff81c134c0 Call Trace: <IRQ> [<ffffffff817cfe3c>] dump_stack+0x4d/0x66 [<ffffffff817cb6da>] print_usage_bug+0x1f4/0x205 [<ffffffff810f7f10>] ? check_usage_backwards+0x180/0x180 [<ffffffff810f8963>] mark_lock+0x223/0x2b0 [<ffffffff810f96d3>] __lock_acquire+0x623/0x1c40 [<ffffffff810f5707>] ? __lock_is_held+0x57/0x80 [<ffffffffa05e26c6>] ? masked_flow_lookup+0x236/0x250 [openvswitch] [<ffffffff810fb4e2>] lock_acquire+0xa2/0x1d0 [<ffffffffa05dd8a1>] ? ovs_flow_stats_update+0x51/0xd0 [openvswitch] [<ffffffff817d8d9e>] _raw_spin_lock+0x3e/0x80 [<ffffffffa05dd8a1>] ? ovs_flow_stats_update+0x51/0xd0 [openvswitch] [<ffffffffa05dd8a1>] ovs_flow_stats_update+0x51/0xd0 [openvswitch] [<ffffffffa05dcc64>] ovs_dp_process_received_packet+0x84/0x120 [openvswitch] [<ffffffff810f93f7>] ? __lock_acquire+0x347/0x1c40 [<ffffffffa05e3bea>] ovs_vport_receive+0x2a/0x30 [openvswitch] [<ffffffffa05e4218>] internal_dev_xmit+0x68/0x110 [openvswitch] [<ffffffffa05e41b5>] ? internal_dev_xmit+0x5/0x110 [openvswitch] [<ffffffff8168b4a6>] dev_hard_start_xmit+0x2e6/0x8b0 [<ffffffff8168be87>] __dev_queue_xmit+0x417/0x9b0 [<ffffffff8168ba75>] ? __dev_queue_xmit+0x5/0x9b0 [<ffffffff8175d5e0>] ? ip6_finish_output2+0x4f0/0x840 [<ffffffff8168c430>] dev_queue_xmit+0x10/0x20 [<ffffffff8175d641>] ip6_finish_output2+0x551/0x840 [<ffffffff8176128a>] ? ip6_finish_output+0x9a/0x220 [<ffffffff8176128a>] ip6_finish_output+0x9a/0x220 [<ffffffff8176145f>] ip6_output+0x4f/0x1f0 [<ffffffff81788c29>] mld_sendpack+0x1d9/0x4a0 [<ffffffff817895b8>] mld_send_initial_cr.part.32+0x88/0xa0 [<ffffffff817691b0>] ? addrconf_dad_completed+0x220/0x220 [<ffffffff8178e301>] ipv6_mc_dad_complete+0x31/0x50 [<ffffffff817690d7>] addrconf_dad_completed+0x147/0x220 [<ffffffff817691b0>] ? addrconf_dad_completed+0x220/0x220 [<ffffffff8176934f>] addrconf_dad_timer+0x19f/0x1c0 [<ffffffff810a71e9>] call_timer_fn+0x99/0x320 [<ffffffff810a7155>] ? call_timer_fn+0x5/0x320 [<ffffffff817691b0>] ? addrconf_dad_completed+0x220/0x220 [<ffffffff810a76c4>] run_timer_softirq+0x254/0x3b0 [<ffffffff8109d47d>] __do_softirq+0x12d/0x480 Signed-off-by: Flavio Leitner <fbl@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-20openvswitch: Correctly report flow used times for first 5 minutes after boot.Ben Pfaff
The kernel starts out its "jiffies" timer as 5 minutes below zero, as shown in include/linux/jiffies.h: /* * Have the 32 bit jiffies value wrap 5 minutes after boot * so jiffies wrap bugs show up earlier. */ #define INITIAL_JIFFIES ((unsigned long)(unsigned int) (-300*HZ)) The loop in ovs_flow_stats_get() starts out with 'used' set to 0, then takes any "later" time. This means that for the first five minutes after boot, flows will always be reported as never used, since 0 is greater than any time already seen. Signed-off-by: Ben Pfaff <blp@nicira.com> Acked-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: Jesse Gross <jesse@nicira.com>
2014-02-15openvswitch: Read tcp flags only then the tranport header is present.Jarno Rajahalme
Only the first IP fragment can have a TCP header, check for this. Signed-off-by: Jarno Rajahalme <jrajahalme@nicira.com> Signed-off-by: Jesse Gross <jesse@nicira.com>
2014-01-06openvswitch: Per cpu flow stats.Pravin B Shelar
With mega flow implementation ovs flow can be shared between multiple CPUs which makes stats updates highly contended operation. This patch uses per-CPU stats in cases where a flow is likely to be shared (if there is a wildcard in the 5-tuple and therefore likely to be spread by RSS). In other situations, it uses the current strategy, saving memory and allocation time. Signed-off-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: Jesse Gross <jesse@nicira.com>
2013-11-01openvswitch: TCP flags matching support.Jarno Rajahalme
tcp_flags=flags/mask Bitwise match on TCP flags. The flags and mask are 16-bit num‐ bers written in decimal or in hexadecimal prefixed by 0x. Each 1-bit in mask requires that the corresponding bit in port must match. Each 0-bit in mask causes the corresponding bit to be ignored. TCP protocol currently defines 9 flag bits, and additional 3 bits are reserved (must be transmitted as zero), see RFCs 793, 3168, and 3540. The flag bits are, numbering from the least significant bit: 0: FIN No more data from sender. 1: SYN Synchronize sequence numbers. 2: RST Reset the connection. 3: PSH Push function. 4: ACK Acknowledgement field significant. 5: URG Urgent pointer field significant. 6: ECE ECN Echo. 7: CWR Congestion Windows Reduced. 8: NS Nonce Sum. 9-11: Reserved. 12-15: Not matchable, must be zero. Signed-off-by: Jarno Rajahalme <jrajahalme@nicira.com> Signed-off-by: Jesse Gross <jesse@nicira.com>
2013-11-01openvswitch: Widen TCP flags handling.Jarno Rajahalme
Widen TCP flags handling from 7 bits (uint8_t) to 12 bits (uint16_t). The kernel interface remains at 8 bits, which makes no functional difference now, as none of the higher bits is currently of interest to the userspace. Signed-off-by: Jarno Rajahalme <jrajahalme@nicira.com> Signed-off-by: Jesse Gross <jesse@nicira.com>