summaryrefslogtreecommitdiff
path: root/include/uapi
AgeCommit message (Collapse)Author
2018-07-20Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter/IPVS updates for net-next The following patchset contains Netfilter/IPVS updates for your net-next tree: 1) No need to set ttl from reject action for the bridge family, from Taehee Yoo. 2) Use a fixed timeout for flow that are passed up from the flowtable to conntrack, from Florian Westphal. 3) More preparation patches for tproxy support for nf_tables, from Mate Eckl. 4) Remove unnecessary indirection in core IPv6 checksum function, from Florian Westphal. 5) Use nf_ct_get_tuplepr() from openvswitch, instead of opencoding it. From Florian Westphal. 6) socket match now selects socket infrastructure, instead of depending on it. From Mate Eckl. 7) Patch series to simplify conntrack tuple building/parsing from packet path and ctnetlink, from Florian Westphal. 8) Fetch timeout policy from protocol helpers, instead of doing it from core, from Florian Westphal. 9) Merge IPv4 and IPv6 protocol trackers into conntrack core, from Florian Westphal. 10) Depend on CONFIG_NF_TABLES_IPV6 and CONFIG_IP6_NF_IPTABLES respectively, instead of IPV6. Patch from Mate Eckl. 11) Add specific function for garbage collection in conncount, from Yi-Hung Wei. 12) Catch number of elements in the connlimit list, from Yi-Hung Wei. 13) Move locking to nf_conncount, from Yi-Hung Wei. 14) Series of patches to add lockless tree traversal in nf_conncount, from Yi-Hung Wei. 15) Resolve clash in matching conntracks when race happens, from Martynas Pumputis. 16) If connection entry times out, remove template entry from the ip_vs_conn_tab table to improve behaviour under flood, from Julian Anastasov. 17) Remove useless parameter from nf_ct_helper_ext_add(), from Gao feng. 18) Call abort from 2-phase commit protocol before requesting modules, make sure this is done under the mutex, from Florian Westphal. 19) Grab module reference when starting transaction, also from Florian. 20) Dynamically allocate expression info array for pre-parsing, from Florian. 21) Add per netns mutex for nf_tables, from Florian Westphal. 22) A couple of patches to simplify and refactor nf_osf code to prepare for nft_osf support. 23) Break evaluation on missing socket, from Mate Eckl. 24) Allow to match socket mark from nft_socket, from Mate Eckl. 25) Remove dependency on nf_defrag_ipv6, now that IPv6 tracker is built-in into nf_conntrack. From Florian Westphal. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-20Merge ra.kernel.org:/pub/scm/linux/kernel/git/torvalds/linuxDavid S. Miller
All conflicts were trivial overlapping changes, so reasonably easy to resolve. Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-19net/sched: cls_flower: Support matching on ip tos and ttl for tunnelsOr Gerlitz
Allow users to set rules matching on ipv4 tos and ttl or ipv6 traffic-class and hoplimit of tunnel headers. Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Reviewed-by: Roi Dayan <roid@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-19net/sched: tunnel_key: Allow to set tos and ttl for tc based ip tunnelsOr Gerlitz
Allow user-space to provide tos and ttl to be set for the tunnel headers. Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Reviewed-by: Roi Dayan <roid@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-18Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds
Pull networking fixes from David Miller: "Lots of fixes, here goes: 1) NULL deref in qtnfmac, from Gustavo A. R. Silva. 2) Kernel oops when fw download fails in rtlwifi, from Ping-Ke Shih. 3) Lost completion messages in AF_XDP, from Magnus Karlsson. 4) Correct bogus self-assignment in rhashtable, from Rishabh Bhatnagar. 5) Fix regression in ipv6 route append handling, from David Ahern. 6) Fix masking in __set_phy_supported(), from Heiner Kallweit. 7) Missing module owner set in x_tables icmp, from Florian Westphal. 8) liquidio's timeouts are HZ dependent, fix from Nicholas Mc Guire. 9) Link setting fixes for sh_eth and ravb, from Vladimir Zapolskiy. 10) Fix NULL deref when using chains in act_csum, from Davide Caratti. 11) XDP_REDIRECT needs to check if the interface is up and whether the MTU is sufficient. From Toshiaki Makita. 12) Net diag can do a double free when killing TCP_NEW_SYN_RECV connections, from Lorenzo Colitti. 13) nf_defrag in ipv6 can unnecessarily hold onto dst entries for a full minute, delaying device unregister. From Eric Dumazet. 14) Update MAC entries in the correct order in ixgbe, from Alexander Duyck. 15) Don't leave partial mangles bpf program in jit_subprogs, from Daniel Borkmann. 16) Fix pfmemalloc SKB state propagation, from Stefano Brivio. 17) Fix ACK handling in DCTCP congestion control, from Yuchung Cheng. 18) Use after free in tun XDP_TX, from Toshiaki Makita. 19) Stale ipv6 header pointer in ipv6 gre code, from Prashant Bhole. 20) Don't reuse remainder of RX page when XDP is set in mlx4, from Saeed Mahameed. 21) Fix window probe handling of TCP rapair sockets, from Stefan Baranoff. 22) Missing socket locking in smc_ioctl(), from Ursula Braun. 23) IPV6_ILA needs DST_CACHE, from Arnd Bergmann. 24) Spectre v1 fix in cxgb3, from Gustavo A. R. Silva. 25) Two spots in ipv6 do a rol32() on a hash value but ignore the result. Fixes from Colin Ian King" * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (176 commits) tcp: identify cryptic messages as TCP seq # bugs ptp: fix missing break in switch hv_netvsc: Fix napi reschedule while receive completion is busy MAINTAINERS: Drop inactive Vitaly Bordug's email net: cavium: Add fine-granular dependencies on PCI net: qca_spi: Fix log level if probe fails net: qca_spi: Make sure the QCA7000 reset is triggered net: qca_spi: Avoid packet drop during initial sync ipv6: fix useless rol32 call on hash ipv6: sr: fix useless rol32 call on hash net: sched: Using NULL instead of plain integer net: usb: asix: replace mii_nway_restart in resume path net: cxgb3_main: fix potential Spectre v1 lib/rhashtable: consider param->min_size when setting initial table size net/smc: reset recv timeout after clc handshake net/smc: add error handling for get_user() net/smc: optimize consumer cursor updates net/nfc: Avoid stalls when nfc_alloc_send_skb() returned NULL. ipv6: ila: select CONFIG_DST_CACHE net: usb: rtl8150: demote allmulti message to dev_dbg() ...
2018-07-18netfilter: nf_osf: add missing definitions to header fileFernando Fernandez Mancera
Add missing definitions from nf_osf.h in order to extract Passive OS fingerprint infrastructure from xt_osf. Signed-off-by: Fernando Fernandez Mancera <ffmancera@riseup.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-07-18netfilter: nft_socket: Expose socket markMáté Eckl
Signed-off-by: Máté Eckl <ecklm94@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-07-16tcp: Fix broken repair socket window probe patchStefan Baranoff
Correct previous bad attempt at allowing sockets to come out of TCP repair without sending window probes. To avoid changing size of the repair variable in struct tcp_sock, this lets the decision for sending probes or not to be made when coming out of repair by introducing two ways to turn it off. v2: * Remove erroneous comment; defines now make behavior clear Fixes: 70b7ff130224 ("tcp: allow user to create repair socket without window probes") Signed-off-by: Stefan Baranoff <sbaranoff@gmail.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Andrei Vagin <avagin@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-14Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller
Daniel Borkmann says: ==================== pull-request: bpf-next 2018-07-15 The following pull-request contains BPF updates for your *net-next* tree. The main changes are: 1) Various different arm32 JIT improvements in order to optimize code emission and make the JIT code itself more robust, from Russell. 2) Support simultaneous driver and offloaded XDP in order to allow for advanced use-cases where some work is offloaded to the NIC and some to the host. Also add ability for bpftool to load programs and maps beyond just the cgroup case, from Jakub. 3) Add BPF JIT support in nfp for multiplication as well as division. For the latter in particular, it uses the reciprocal algorithm to emulate it, from Jiong. 4) Add BTF pretty print functionality to bpftool in plain and JSON output format, from Okash. 5) Add build and installation to the BPF helper man page into bpftool, from Quentin. 6) Add a TCP BPF callback for listening sockets which is triggered right after the socket transitions to TCP_LISTEN state, from Andrey. 7) Add a new cgroup tree command to bpftool which iterates over the whole cgroup tree and prints all attached programs, from Roman. 8) Improve xdp_redirect_cpu sample to support parsing of double VLAN tagged packets, from Jesper. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-15bpf: Add BPF_SOCK_OPS_TCP_LISTEN_CBAndrey Ignatov
Add new TCP-BPF callback that is called on listen(2) right after socket transition to TCP_LISTEN state. It fills the gap for listening sockets in TCP-BPF. For example BPF program can set BPF_SOCK_OPS_STATE_CB_FLAG when socket becomes listening and track later transition from TCP_LISTEN to TCP_CLOSE with BPF_SOCK_OPS_STATE_CB callback. Before there was no way to do it with TCP-BPF and other options were much harder to work with. E.g. socket state tracking can be done with tracepoints (either raw or regular) but they can't be attached to cgroup and their lifetime has to be managed separately. Signed-off-by: Andrey Ignatov <rdna@fb.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-07-13net: ethtool: fix spelling mistake: "tubale" -> "tunable"Michael Heimpold
Signed-off-by: Michael Heimpold <mhei@heimpold.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-13net: ipmr: add support for passing full packet on wrong vifNikolay Aleksandrov
This patch adds support for IGMPMSG_WRVIFWHOLE which is used to pass full packet and real vif id when the incoming interface is wrong. While the RP and FHR are setting up state we need to be sending the registers encapsulated with all the data inside otherwise we lose it. The RP then decapsulates it and forwards it to the interested parties. Currently with WRONGVIF we can only be sending empty register packets and will lose that data. This behaviour can be enabled by using MRT_PIM with val == IGMPMSG_WRVIFWHOLE. This doesn't prevent IGMPMSG_WRONGVIF from happening, it happens in addition to it, also it is controlled by the same throttling parameters as WRONGVIF (i.e. 1 packet per 3 seconds currently). Both messages are generated to keep backwards compatibily and avoid breaking someone who was enabling MRT_PIM with val == 4, since any positive val is accepted and treated the same. Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-13xdp: support simultaneous driver and hw XDP attachmentJakub Kicinski
Split the query of HW-attached program from the software one. Introduce new .ndo_bpf command to query HW-attached program. This will allow drivers to install different programs in HW and SW at the same time. Netlink can now also carry multiple programs on dump (in which case mode will be set to XDP_ATTACHED_MULTI and user has to check per-attachment point attributes, IFLA_XDP_PROG_ID will not be present). We reuse IFLA_XDP_PROG_ID skb space for second mode, so rtnl_xdp_size() doesn't need to be updated. Note that the installation side is still not there, since all drivers currently reject installing more than one program at the time. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-07-13xdp: add per mode attributes for attached programsJakub Kicinski
In preparation for support of simultaneous driver and hardware XDP support add per-mode attributes. The catch-all IFLA_XDP_PROG_ID will still be reported, but user space can now also access the program ID in a new IFLA_XDP_<mode>_PROG_ID attribute. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-07-12devlink: Add support for region snapshot read commandAlex Vesker
Add support for DEVLINK_CMD_REGION_READ_GET used for both reading and dumping region data. Read allows reading from a region specific address for given length. Dump allows reading the full region. If only snapshot ID is provided a snapshot dump will be done. If snapshot ID, Address and Length are provided a snapshot read will done. This is used for both snapshot access and will be used in the same way to access current data on the region. Signed-off-by: Alex Vesker <valex@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-12devlink: Add support for region snapshot delete commandAlex Vesker
Add support for DEVLINK_CMD_REGION_DEL used for deleting a snapshot from a region. The snapshot ID is required. Also added notification support for NEW and DEL of snapshots. Signed-off-by: Alex Vesker <valex@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-12devlink: Extend the support querying for region snapshot IDsAlex Vesker
Extend the support for DEVLINK_CMD_REGION_GET command to also return the IDs of the snapshot currently present on the region. Each reply will include a nested snapshots attribute that can contain multiple snapshot attributes each with an ID. Signed-off-by: Alex Vesker <valex@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-12devlink: Add support for region get commandAlex Vesker
Add support for DEVLINK_CMD_REGION_GET command which is used for querying for the supported DEV/REGION values of devlink devices. The support is both for doit and dumpit. Reply includes: BUS_NAME, DEVICE_NAME, REGION_NAME, REGION_SIZE Signed-off-by: Alex Vesker <valex@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-12bpf: fix documentation for eBPF helpersQuentin Monnet
Minor formatting edits for eBPF helpers documentation, including blank lines removal, fix of item list for return values in bpf_fib_lookup(), and missing prefix on bpf_skb_load_bytes_relative(). Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-07-10sched: Add Common Applications Kept Enhanced (cake) qdiscToke Høiland-Jørgensen
sch_cake targets the home router use case and is intended to squeeze the most bandwidth and latency out of even the slowest ISP links and routers, while presenting an API simple enough that even an ISP can configure it. Example of use on a cable ISP uplink: tc qdisc add dev eth0 cake bandwidth 20Mbit nat docsis ack-filter To shape a cable download link (ifb and tc-mirred setup elided) tc qdisc add dev ifb0 cake bandwidth 200mbit nat docsis ingress wash CAKE is filled with: * A hybrid Codel/Blue AQM algorithm, "Cobalt", tied to an FQ_Codel derived Flow Queuing system, which autoconfigures based on the bandwidth. * A novel "triple-isolate" mode (the default) which balances per-host and per-flow FQ even through NAT. * An deficit based shaper, that can also be used in an unlimited mode. * 8 way set associative hashing to reduce flow collisions to a minimum. * A reasonable interpretation of various diffserv latency/loss tradeoffs. * Support for zeroing diffserv markings for entering and exiting traffic. * Support for interacting well with Docsis 3.0 shaper framing. * Extensive support for DSL framing types. * Support for ack filtering. * Extensive statistics for measuring, loss, ecn markings, latency variation. A paper describing the design of CAKE is available at https://arxiv.org/abs/1804.07617, and will be published at the 2018 IEEE International Symposium on Local and Metropolitan Area Networks (LANMAN). This patch adds the base shaper and packet scheduler, while subsequent commits add the optional (configurable) features. The full userspace API and most data structures are included in this commit, but options not understood in the base version will be ignored. Various versions baking have been available as an out of tree build for kernel versions going back to 3.10, as the embedded router world has been running a few years behind mainline Linux. A stable version has been generally available on lede-17.01 and later. sch_cake replaces a combination of iptables, tc filter, htb and fq_codel in the sqm-scripts, with sane defaults and vastly simpler configuration. CAKE's principal author is Jonathan Morton, with contributions from Kevin Darbyshire-Bryant, Toke Høiland-Jørgensen, Sebastian Moeller, Ryan Mounce, Tony Ambardar, Dean Scarff, Nils Andreas Svee, Dave Täht, and Loganaden Velvindron. Testing from Pete Heist, Georgios Amanakis, and the many other members of the cake@lists.bufferbloat.net mailing list. tc -s qdisc show dev eth2 qdisc cake 8017: root refcnt 2 bandwidth 1Gbit diffserv3 triple-isolate split-gso rtt 100.0ms noatm overhead 38 mpu 84 Sent 51504294511 bytes 37724591 pkt (dropped 6, overlimits 64958695 requeues 12) backlog 0b 0p requeues 12 memory used: 1053008b of 15140Kb capacity estimate: 970Mbit min/max network layer size: 28 / 1500 min/max overhead-adjusted size: 84 / 1538 average network hdr offset: 14 Bulk Best Effort Voice thresh 62500Kbit 1Gbit 250Mbit target 5.0ms 5.0ms 5.0ms interval 100.0ms 100.0ms 100.0ms pk_delay 5us 5us 6us av_delay 3us 2us 2us sp_delay 2us 1us 1us backlog 0b 0b 0b pkts 3164050 25030267 9530280 bytes 3227519915 35396974782 12879808898 way_inds 0 8 0 way_miss 21 366 25 way_cols 0 0 0 drops 5 0 1 marks 0 0 0 ack_drop 0 0 0 sp_flows 1 3 0 bk_flows 0 1 1 un_flows 0 0 0 max_len 68130 68130 68130 Tested-by: Pete Heist <peteheist@gmail.com> Tested-by: Georgios Amanakis <gamanakis@gmail.com> Signed-off-by: Dave Taht <dave.taht@gmail.com> Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-10rseq: Remove unused types_32_64.h uapi headerMathieu Desnoyers
This header was introduced in the 4.18 merge window, and rseq does not need it anymore. Nuke it before the final release. Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: linux-api@vger.kernel.org Cc: Peter Zijlstra <peterz@infradead.org> Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Dave Watson <davejwatson@fb.com> Cc: Paul Turner <pjt@google.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Russell King <linux@arm.linux.org.uk> Cc: "H . Peter Anvin" <hpa@zytor.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Chris Lameter <cl@linux.com> Cc: Ben Maurer <bmaurer@fb.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Joel Fernandes <joelaf@google.com> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Link: https://lkml.kernel.org/r/20180709195155.7654-6-mathieu.desnoyers@efficios.com
2018-07-10rseq: uapi: Declare rseq_cs field as union, update includesMathieu Desnoyers
Declaring the rseq_cs field as a union between __u64 and two __u32 allows both 32-bit and 64-bit kernels to read the full __u64, and therefore validate that a 32-bit user-space cleared the upper 32 bits, thus ensuring a consistent behavior between native 32-bit kernels and 32-bit compat tasks on 64-bit kernels. Check that the rseq_cs value read is < TASK_SIZE. The asm/byteorder.h header needs to be included by rseq.h, now that it is not using linux/types_32_64.h anymore. Considering that only __32 and __u64 types are declared in linux/rseq.h, the linux/types.h header should always be included for both kernel and user-space code: including stdint.h is just for u64 and u32, which are not used in this header at all. Use copy_from_user()/clear_user() to interact with a 64-bit field, because arm32 does not implement 64-bit __get_user, and ppc32 does not 64-bit get_user. Considering that the rseq_cs pointer does not need to be loaded/stored with single-copy atomicity from the kernel anymore, we can simply use copy_from_user()/clear_user(). Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: linux-api@vger.kernel.org Cc: Peter Zijlstra <peterz@infradead.org> Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Dave Watson <davejwatson@fb.com> Cc: Paul Turner <pjt@google.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Russell King <linux@arm.linux.org.uk> Cc: "H . Peter Anvin" <hpa@zytor.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Chris Lameter <cl@linux.com> Cc: Ben Maurer <bmaurer@fb.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Joel Fernandes <joelaf@google.com> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Link: https://lkml.kernel.org/r/20180709195155.7654-5-mathieu.desnoyers@efficios.com
2018-07-10rseq: uapi: Update uapi commentsMathieu Desnoyers
Update rseq uapi header comments to reflect that user-space need to do thread-local loads/stores from/to the struct rseq fields. As a consequence of this added requirement, the kernel does not need to perform loads/stores with single-copy atomicity. Update the comment associated to the "flags" fields to describe more accurately that it's only useful to facilitate single-stepping through rseq critical sections with debuggers. Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: linux-api@vger.kernel.org Cc: Peter Zijlstra <peterz@infradead.org> Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Dave Watson <davejwatson@fb.com> Cc: Paul Turner <pjt@google.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Russell King <linux@arm.linux.org.uk> Cc: "H . Peter Anvin" <hpa@zytor.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Chris Lameter <cl@linux.com> Cc: Ben Maurer <bmaurer@fb.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Joel Fernandes <joelaf@google.com> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Link: https://lkml.kernel.org/r/20180709195155.7654-4-mathieu.desnoyers@efficios.com
2018-07-10rseq: Use __u64 for rseq_cs fields, validate user inputsMathieu Desnoyers
Change the rseq ABI so rseq_cs start_ip, post_commit_offset and abort_ip fields are seen as 64-bit fields by both 32-bit and 64-bit kernels rather that ignoring the 32 upper bits on 32-bit kernels. This ensures we have a consistent behavior for a 32-bit binary executed on 32-bit kernels and in compat mode on 64-bit kernels. Validating the value of abort_ip field to be below TASK_SIZE ensures the kernel don't return to an invalid address when returning to userspace after an abort. I don't fully trust each architecture code to consistently deal with invalid return addresses. Validating the value of the start_ip and post_commit_offset fields prevents overflow on arithmetic performed on those values, used to check whether abort_ip is within the rseq critical section. If validation fails, the process is killed with a segmentation fault. When the signature encountered before abort_ip does not match the expected signature, return -EINVAL rather than -EPERM to be consistent with other input validation return codes from rseq_get_rseq_cs(). Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: linux-api@vger.kernel.org Cc: Peter Zijlstra <peterz@infradead.org> Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Dave Watson <davejwatson@fb.com> Cc: Paul Turner <pjt@google.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Russell King <linux@arm.linux.org.uk> Cc: "H . Peter Anvin" <hpa@zytor.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Chris Lameter <cl@linux.com> Cc: Ben Maurer <bmaurer@fb.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Joel Fernandes <joelaf@google.com> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Link: https://lkml.kernel.org/r/20180709195155.7654-2-mathieu.desnoyers@efficios.com
2018-07-09net: Use __u32 in uapi net_stamp.hJesus Sanchez-Palencia
We are not supposed to use u32 in uapi, so change the flags member of struct sock_txtime from u32 to __u32 instead. Fixes: 80b14dee2bea ("net: Add a new socket option for a future transmit time") Reported-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-08openvswitch: kernel datapath clone actionYifeng Sun
Add 'clone' action to kernel datapath by using existing functions. When actions within clone don't modify the current flow, the flow key is not cloned before executing clone actions. This is a follow up patch for this incomplete work: https://patchwork.ozlabs.org/patch/722096/ v1 -> v2: Refactor as advised by reviewer. Signed-off-by: Yifeng Sun <pkusunyifeng@gmail.com> Signed-off-by: Andy Zhou <azhou@ovn.org> Acked-by: Pravin B Shelar <pshelar@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-07net/sched: flower: Add supprt for matching on QinQ vlan headersJianbo Liu
As support dissecting of QinQ inner and outer vlan headers, user can add rules to match on QinQ vlan headers. Signed-off-by: Jianbo Liu <jianbol@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-05devlink: Add devlink notifications support for paramsMoshe Shemesh
Add devlink_param_notify() function to support devlink param notifications. Add notification call to devlink param set, register and unregister functions. Add devlink_param_value_changed() function to enable the driver notify devlink on value change. Driver should use this function after value was changed on any configuration mode part to driverinit. Signed-off-by: Moshe Shemesh <moshe@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-05devlink: Add param set commandMoshe Shemesh
Add param set command to set value for a parameter. Value can be set to any of the supported configuration modes. Signed-off-by: Moshe Shemesh <moshe@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-05devlink: Add param get commandMoshe Shemesh
Add param get command which gets data per parameter. Option to dump the parameters data per device. Signed-off-by: Moshe Shemesh <moshe@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-05devlink: Add devlink_param register and unregisterMoshe Shemesh
Define configuration parameters data structure. Add functions to register and unregister the driver supported configuration parameters table. For each parameter registered, the driver should fill all the parameter's fields. In case the only supported configuration mode is "driverinit" the parameter's get()/set() functions are not required and should be set to NULL, for any other configuration mode, these functions are required and should be set by the driver. Signed-off-by: Moshe Shemesh <moshe@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-04net/sched: Make etf report drops on error_queueJesus Sanchez-Palencia
Use the socket error queue for reporting dropped packets if the socket has enabled that feature through the SO_TXTIME API. Packets are dropped either on enqueue() if they aren't accepted by the qdisc or on dequeue() if the system misses their deadline. Those are reported as different errors so applications can react accordingly. Userspace can retrieve the errors through the socket error queue and the corresponding cmsg interfaces. A struct sock_extended_err* is used for returning the error data, and the packet's timestamp can be retrieved by adding both ee_data and ee_info fields as e.g.: ((__u64) serr->ee_data << 32) + serr->ee_info This feature is disabled by default and must be explicitly enabled by applications. Enabling it can bring some overhead for the Tx cycles of the application. Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-04net/sched: Add HW offloading capability to ETFJesus Sanchez-Palencia
Add infra so etf qdisc supports HW offload of time-based transmission. For hw offload, the time sorted list is still used, so packets are dequeued always in order of txtime. Example: $ tc qdisc replace dev enp2s0 parent root handle 100 mqprio num_tc 3 \ map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@2 hw 0 $ tc qdisc add dev enp2s0 parent 100:1 etf offload delta 100000 \ clockid CLOCK_REALTIME In this example, the Qdisc will use HW offload for the control of the transmission time through the network adapter. The hrtimer used for packets scheduling inside the qdisc will use the clockid CLOCK_REALTIME as reference and packets leave the Qdisc "delta" (100000) nanoseconds before their transmission time. Because this will be using HW offload and since dynamic clocks are not supported by the hrtimer, the system clock and the PHC clock must be synchronized for this mode to behave as expected. Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-04net/sched: Introduce the ETF QdiscVinicius Costa Gomes
The ETF (Earliest TxTime First) qdisc uses the information added earlier in this series (the socket option SO_TXTIME and the new role of sk_buff->tstamp) to schedule packets transmission based on absolute time. For some workloads, just bandwidth enforcement is not enough, and precise control of the transmission of packets is necessary. Example: $ tc qdisc replace dev enp2s0 parent root handle 100 mqprio num_tc 3 \ map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@2 hw 0 $ tc qdisc add dev enp2s0 parent 100:1 etf delta 100000 \ clockid CLOCK_TAI In this example, the Qdisc will provide SW best-effort for the control of the transmission time to the network adapter, the time stamp in the socket will be in reference to the clockid CLOCK_TAI and packets will leave the qdisc "delta" (100000) nanoseconds before its transmission time. The ETF qdisc will buffer packets sorted by their txtime. It will drop packets on enqueue() if their skbuff clockid does not match the clock reference of the Qdisc. Moreover, on dequeue(), a packet will be dropped if it expires while being enqueued. The qdisc also supports the SO_TXTIME deadline mode. For this mode, it will dequeue a packet as soon as possible and change the skb timestamp to 'now' during etf_dequeue(). Note that both the qdisc's and the SO_TXTIME ABIs allow for a clockid to be configured, but it's been decided that usage of CLOCK_TAI should be enforced until we decide to allow for other clockids to be used. The rationale here is that PTP times are usually in the TAI scale, thus no other clocks should be necessary. For now, the qdisc will return EINVAL if any clocks other than CLOCK_TAI are used. Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com> Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-04net: Add a new socket option for a future transmit time.Richard Cochran
This patch introduces SO_TXTIME. User space enables this option in order to pass a desired future transmit time in a CMSG when calling sendmsg(2). The argument to this socket option is a 8-bytes long struct provided by the uapi header net_tstamp.h defined as: struct sock_txtime { clockid_t clockid; u32 flags; }; Note that new fields were added to struct sock by filling a 2-bytes hole found in the struct. For that reason, neither the struct size or number of cachelines were altered. Signed-off-by: Richard Cochran <rcochran@linutronix.de> Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-04net:sched: add action inheritdsfield to skbeditQiaobin Fu
The new action inheritdsfield copies the field DS of IPv4 and IPv6 packets into skb->priority. This enables later classification of packets based on the DS field. v5: *Update the drop counter for TC_ACT_SHOT v4: *Not allow setting flags other than the expected ones. *Allow dumping the pure flags. v3: *Use optional flags, so that it won't break old versions of tc. *Allow users to set both SKBEDIT_F_PRIORITY and SKBEDIT_F_INHERITDSFIELD flags. v2: *Fix the style issue *Move the code from skbmod to skbedit Original idea by Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Qiaobin Fu <qiaobinf@bu.edu> Reviewed-by: Michel Machado <michel@digirati.com.br> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Acked-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-04sctp: add spp_ipv6_flowlabel and spp_dscp for sctp_paddrparamsXin Long
spp_ipv6_flowlabel and spp_dscp are added in sctp_paddrparams in this patch so that users could set sctp_sock/asoc/transport dscp and flowlabel with spp_flags SPP_IPV6_FLOWLABEL or SPP_DSCP by SCTP_PEER_ADDR_PARAMS , as described section 8.1.12 in RFC6458. As said in last patch, it uses '| 0x100000' or '|0x1' to mark flowlabel or dscp is set, so that their values could be set to 0. Note that to guarantee that an old app built with old kernel headers could work on the newer kernel, the param's check in sctp_g/setsockopt_peer_addr_params() is also improved, which follows the way that sctp_g/setsockopt_delayed_ack() or some other sockopts' process that accept two types of params does. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-03Merge ra.kernel.org:/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Simple overlapping changes in stmmac driver. Adjust skb_gro_flush_final_remcsum function signature to make GRO list changes in net-next, as per Stephen Rothwell's example merge resolution. Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-02Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds
Pull networking fixes from David Miller: 1) Verify netlink attributes properly in nf_queue, from Eric Dumazet. 2) Need to bump memory lock rlimit for test_sockmap bpf test, from Yonghong Song. 3) Fix VLAN handling in lan78xx driver, from Dave Stevenson. 4) Fix uninitialized read in nf_log, from Jann Horn. 5) Fix raw command length parsing in mlx5, from Alex Vesker. 6) Cleanup loopback RDS connections upon netns deletion, from Sowmini Varadhan. 7) Fix regressions in FIB rule matching during create, from Jason A. Donenfeld and Roopa Prabhu. 8) Fix mpls ether type detection in nfp, from Pieter Jansen van Vuuren. 9) More bpfilter build fixes/adjustments from Masahiro Yamada. 10) Fix XDP_{TX,REDIRECT} flushing in various drivers, from Jesper Dangaard Brouer. 11) fib_tests.sh file permissions were broken, from Shuah Khan. 12) Make sure BH/preemption is disabled in data path of mac80211, from Denis Kenzior. 13) Don't ignore nla_parse_nested() return values in nl80211, from Johannes berg. 14) Properly account sock objects ot kmemcg, from Shakeel Butt. 15) Adjustments to setting bpf program permissions to read-only, from Daniel Borkmann. 16) TCP Fast Open key endianness was broken, it always took on the host endiannness. Whoops. Explicitly make it little endian. From Yuching Cheng. 17) Fix prefix route setting for link local addresses in ipv6, from David Ahern. 18) Potential Spectre v1 in zatm driver, from Gustavo A. R. Silva. 19) Various bpf sockmap fixes, from John Fastabend. 20) Use after free for GRO with ESP, from Sabrina Dubroca. 21) Passing bogus flags to crypto_alloc_shash() in ipv6 SR code, from Eric Biggers. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (87 commits) qede: Adverstise software timestamp caps when PHC is not available. qed: Fix use of incorrect size in memcpy call. qed: Fix setting of incorrect eswitch mode. qed: Limit msix vectors in kdump kernel to the minimum required count. ipvlan: call dev_change_flags when ipvlan mode is reset ipv6: sr: fix passing wrong flags to crypto_alloc_shash() net: fix use-after-free in GRO with ESP tcp: prevent bogus FRTO undos with non-SACK flows bpf: sockhash, add release routine bpf: sockhash fix omitted bucket lock in sock_close bpf: sockmap, fix smap_list_map_remove when psock is in many maps bpf: sockmap, fix crash when ipv6 sock is added net: fib_rules: bring back rule_exists to match rule during add hv_netvsc: split sub-channel setup into async and sync net: use dev_change_tx_queue_len() for SIOCSIFTXQLEN atm: zatm: Fix potential Spectre v1 s390/qeth: consistently re-enable device features s390/qeth: don't clobber buffer on async TX completion s390/qeth: avoid using is_multicast_ether_addr_64bits on (u8 *)[6] s390/qeth: fix race when setting MAC address ...
2018-07-01Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfDavid S. Miller
Daniel Borkmann says: ==================== pull-request: bpf 2018-07-01 The following pull-request contains BPF updates for your *net* tree. The main changes are: 1) A bpf_fib_lookup() helper fix to change the API before freeze to return an encoding of the FIB lookup result and return the nexthop device index in the params struct (instead of device index as return code that we had before), from David. 2) Various BPF JIT fixes to address syzkaller fallout, that is, do not reject progs when set_memory_*() fails since it could still be RO. Also arm32 JIT was not using bpf_jit_binary_lock_ro() API which was an issue, and a memory leak in s390 JIT found during review, from Daniel. 3) Multiple fixes for sockmap/hash to address most of the syzkaller triggered bugs. Usage with IPv6 was crashing, a GPF in bpf_tcp_close(), a missing sock_map_release() routine to hook up to callbacks, and a fix for an omitted bucket lock in sock_close(), from John. 4) Two bpftool fixes to remove duplicated error message on program load, and another one to close the libbpf object after program load. One additional fix for nfp driver's BPF offload to avoid stopping offload completely if replace of program failed, from Jakub. 5) Couple of BPF selftest fixes that bail out in some of the test scripts if the user does not have the right privileges, from Jeffrin. 6) Fixes in test_bpf for s390 when CONFIG_BPF_JIT_ALWAYS_ON is set where we need to set the flag that some of the test cases are expected to fail, from Kleber. 7) Fix to detangle BPF_LIRC_MODE2 dependency from CONFIG_CGROUP_BPF since it has no relation to it and lirc2 users often have configs without cgroups enabled and thus would not be able to use it, from Sean. 8) Fix a selftest failure in sockmap by removing a useless setrlimit() call that would set a too low limit where at the same time we are already including bpf_rlimit.h that does the job, from Yonghong. 9) Fix BPF selftest config with missing missing NET_SCHED, from Anders. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-30Merge tag 'mac80211-next-for-davem-2018-06-29' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next Small merge conflict in net/mac80211/scan.c, I preserved the kcalloc() conversion. -DaveM Johannes Berg says: ==================== This round's updates: * finally some of the promised HE code, but it turns out to be small - but everything kept changing, so one part I did in the driver was >30 patches for what was ultimately <200 lines of code ... similar here for this code. * improved scan privacy support - can now specify scan flags for randomizing the sequence number as well as reducing the probe request element content * rfkill cleanups * a timekeeping cleanup from Arnd * various other cleanups ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-30tipc: extend sock diag for group communicationGhantaKrishnamurthy MohanKrishna
This commit extends the existing TIPC socket diagnostics framework for information related to TIPC group communication. Acked-by: Ying Xue <ying.xue@windriver.com> Acked-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: GhantaKrishnamurthy MohanKrishna <mohan.krishna.ghanta.krishnamurthy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-30net/smc: add SMC-D diag supportHans Wippel
This patch adds diag support for SMC-D. Signed-off-by: Hans Wippel <hwippel@linux.ibm.com> Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Suggested-by: Thomas Richter <tmricht@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-30tcp: add new SNMP counter for drops when try to queue in rcv queueYafang Shao
When sk_rmem_alloc is larger than the receive buffer and we can't schedule more memory for it, the skb will be dropped. In above situation, if this skb is put into the ofo queue, LINUX_MIB_TCPOFODROP is incremented to track it. While if this skb is put into the receive queue, there's no record. So a new SNMP counter is introduced to track this behavior. LINUX_MIB_TCPRCVQDROP: Number of packets meant to be queued in rcv queue but dropped because socket rcvbuf limit hit. Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-29net/sched: add tunnel option support to act_tunnel_keySimon Horman
Allow setting tunnel options using the act_tunnel_key action. Options are expressed as class:type:data and multiple options may be listed using a comma delimiter. # ip link add name geneve0 type geneve dstport 0 external # tc qdisc add dev eth0 ingress # tc filter add dev eth0 protocol ip parent ffff: \ flower indev eth0 \ ip_proto udp \ action tunnel_key \ set src_ip 10.0.99.192 \ dst_ip 10.0.99.193 \ dst_port 6081 \ id 11 \ geneve_opts 0102:80:00800022,0102:80:00800022 \ action mirred egress redirect dev geneve0 Signed-off-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-29aio: mark __aio_sigset::sigmask constAvi Kivity
io_pgetevents() will not change the signal mask. Mark it const to make it clear and to reduce the need for casts in user code. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Avi Kivity <avi@scylladb.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> [hch: reapply the patch that got incorrectly reverted] Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-29sctp: add support for SCTP_REUSE_PORT sockoptXin Long
This feature is actually already supported by sk->sk_reuse which can be set by socket level opt SO_REUSEADDR. But it's not working exactly as RFC6458 demands in section 8.1.27, like: - This option only supports one-to-one style SCTP sockets - This socket option must not be used after calling bind() or sctp_bindx(). Besides, SCTP_REUSE_PORT sockopt should be provided for user's programs. Otherwise, the programs with SCTP_REUSE_PORT from other systems will not work in linux. To separate it from the socket level version, this patch adds 'reuse' in sctp_sock and it works pretty much as sk->sk_reuse, but with some extra setup limitations that are needed when it is being enabled. "It should be noted that the behavior of the socket-level socket option to reuse ports and/or addresses for SCTP sockets is unspecified", so it leaves SO_REUSEADDR as is for the compatibility. Note that the name SCTP_REUSE_PORT is somewhat confusing, as its functionality is nearly identical to SO_REUSEADDR, but with some extra restrictions. Here it uses 'reuse' in sctp_sock instead of 'reuseport'. As for sk->sk_reuseport support for SCTP, it will be added in another patch. Thanks to Neil to make this clear. v1->v2: - add sctp_sk->reuse to separate it from the socket level version. v2->v3: - improve changelog according to Marcelo's suggestion. Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-29ila: Flush netlink command to clear xlat tableTom Herbert
Add ILA_CMD_FLUSH netlink command to clear the ILA translation table. Signed-off-by: Tom Herbert <tom@quantonium.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-29bpf: Change bpf_fib_lookup to return lookup statusDavid Ahern
For ACLs implemented using either FIB rules or FIB entries, the BPF program needs the FIB lookup status to be able to drop the packet. Since the bpf_fib_lookup API has not reached a released kernel yet, change the return code to contain an encoding of the FIB lookup result and return the nexthop device index in the params struct. In addition, inform the BPF program of any post FIB lookup reason as to why the packet needs to go up the stack. The fib result for unicast routes must have an egress device, so remove the check that it is non-NULL. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-06-28Revert changes to convert to ->poll_mask() and aio IOCB_CMD_POLLLinus Torvalds
The poll() changes were not well thought out, and completely unexplained. They also caused a huge performance regression, because "->poll()" was no longer a trivial file operation that just called down to the underlying file operations, but instead did at least two indirect calls. Indirect calls are sadly slow now with the Spectre mitigation, but the performance problem could at least be largely mitigated by changing the "->get_poll_head()" operation to just have a per-file-descriptor pointer to the poll head instead. That gets rid of one of the new indirections. But that doesn't fix the new complexity that is completely unwarranted for the regular case. The (undocumented) reason for the poll() changes was some alleged AIO poll race fixing, but we don't make the common case slower and more complex for some uncommon special case, so this all really needs way more explanations and most likely a fundamental redesign. [ This revert is a revert of about 30 different commits, not reverted individually because that would just be unnecessarily messy - Linus ] Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>