summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2016-07-27Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-nextLinus Torvalds
Pull networking updates from David Miller: 1) Unified UDP encapsulation offload methods for drivers, from Alexander Duyck. 2) Make DSA binding more sane, from Andrew Lunn. 3) Support QCA9888 chips in ath10k, from Anilkumar Kolli. 4) Several workqueue usage cleanups, from Bhaktipriya Shridhar. 5) Add XDP (eXpress Data Path), essentially running BPF programs on RX packets as soon as the device sees them, with the option to mirror the packet on TX via the same interface. From Brenden Blanco and others. 6) Allow qdisc/class stats dumps to run lockless, from Eric Dumazet. 7) Add VLAN support to b53 and bcm_sf2, from Florian Fainelli. 8) Simplify netlink conntrack entry layout, from Florian Westphal. 9) Add ipv4 forwarding support to mlxsw spectrum driver, from Ido Schimmel, Yotam Gigi, and Jiri Pirko. 10) Add SKB array infrastructure and convert tun and macvtap over to it. From Michael S Tsirkin and Jason Wang. 11) Support qdisc packet injection in pktgen, from John Fastabend. 12) Add neighbour monitoring framework to TIPC, from Jon Paul Maloy. 13) Add NV congestion control support to TCP, from Lawrence Brakmo. 14) Add GSO support to SCTP, from Marcelo Ricardo Leitner. 15) Allow GRO and RPS to function on macsec devices, from Paolo Abeni. 16) Support MPLS over IPV4, from Simon Horman. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1622 commits) xgene: Fix build warning with ACPI disabled. be2net: perform temperature query in adapter regardless of its interface state l2tp: Correctly return -EBADF from pppol2tp_getname. net/mlx5_core/health: Remove deprecated create_singlethread_workqueue net: ipmr/ip6mr: update lastuse on entry change macsec: ensure rx_sa is set when validation is disabled tipc: dump monitor attributes tipc: add a function to get the bearer name tipc: get monitor threshold for the cluster tipc: make cluster size threshold for monitoring configurable tipc: introduce constants for tipc address validation net: neigh: disallow transition to NUD_STALE if lladdr is unchanged in neigh_update() MAINTAINERS: xgene: Add driver and documentation path Documentation: dtb: xgene: Add MDIO node dtb: xgene: Add MDIO node drivers: net: xgene: ethtool: Use phy_ethtool_gset and sset drivers: net: xgene: Use exported functions drivers: net: xgene: Enable MDIO driver drivers: net: xgene: Add backward compatibility drivers: net: phy: xgene: Add MDIO driver ...
2016-07-27Merge tag 'for-linus-4.8-rc0-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip Pull xen updates from David Vrabel: "Features and fixes for 4.8-rc0: - ACPI support for guests on ARM platforms. - Generic steal time support for arm and x86. - Support cases where kernel cpu is not Xen VCPU number (e.g., if in-guest kexec is used). - Use the system workqueue instead of a custom workqueue in various places" * tag 'for-linus-4.8-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: (47 commits) xen: add static initialization of steal_clock op to xen_time_ops xen/pvhvm: run xen_vcpu_setup() for the boot CPU xen/evtchn: use xen_vcpu_id mapping xen/events: fifo: use xen_vcpu_id mapping xen/events: use xen_vcpu_id mapping in events_base x86/xen: use xen_vcpu_id mapping when pointing vcpu_info to shared_info x86/xen: use xen_vcpu_id mapping for HYPERVISOR_vcpu_op xen: introduce xen_vcpu_id mapping x86/acpi: store ACPI ids from MADT for future usage x86/xen: update cpuid.h from Xen-4.7 xen/evtchn: add IOCTL_EVTCHN_RESTRICT xen-blkback: really don't leak mode property xen-blkback: constify instance of "struct attribute_group" xen-blkfront: prefer xenbus_scanf() over xenbus_gather() xen-blkback: prefer xenbus_scanf() over xenbus_gather() xen: support runqueue steal time on xen arm/xen: add support for vm_assist hypercall xen: update xen headers xen-pciback: drop superfluous variables xen-pciback: short-circuit read path used for merging write values ...
2016-07-27stop_machine: Touch_nmi_watchdog() after MULTI_STOP_PREPAREOleg Nesterov
Suppose that stop_machine(fn) hangs because fn() hangs. In this case NMI hard-lockup can be triggered on another CPU which does nothing wrong and the trace from nmi_panic() won't help to investigate the problem. And this change "fixes" the problem we (seem to) hit in practice. - stop_two_cpus(0, 1) races with show_state_filter() running on CPU_0. - CPU_1 already spins in MULTI_STOP_PREPARE state, it detects the soft lockup and tries to report the problem. - show_state_filter() enables preemption, CPU_0 calls multi_cpu_stop() which goes to MULTI_STOP_DISABLE_IRQ state and disables interrupts. - CPU_1 spends more than 10 seconds trying to flush the log buffer to the slow serial console. - NMI interrupt on CPU_0 (which now waits for CPU_1) calls nmi_panic(). Reported-by: Wang Shu <shuwang@redhat.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Dave Anderson <anderson@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/20160726185736.GB4088@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-27module: Invalidate signatures on force-loaded modulesBen Hutchings
Signing a module should only make it trusted by the specific kernel it was built for, not anything else. Loading a signed module meant for a kernel with a different ABI could have interesting effects. Therefore, treat all signatures as invalid when a module is force-loaded. Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Cc: stable@vger.kernel.org Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2016-07-27module: Issue warnings when tainting kernelLibor Pechacek
While most of the locations where a kernel taint bit is set are accompanied with a warning message, there are two which set their bits silently. If the tainting module gets unloaded later on, it is almost impossible to tell what was the reason for setting the flag. Signed-off-by: Libor Pechacek <lpechacek@suse.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2016-07-27module: fix redundant test.Rusty Russell
[linux-4.5-rc4/kernel/module.c:1692]: (style) Redundant condition: attr.test. '!attr.test || (attr.test && attr.test(mod))' is equivalent to '!attr.test || attr.test(mod)' This code was added like this ten years ago, in c988d2b284549 "modules: add version and srcversion to sysfs". Reported-by: David Binderman <dcb314@hotmail.com> Cc: Matt Domsch <Matt_Domsch@dell.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2016-07-27module: fix noreturn attribute for __module_put_and_exit()Jiri Kosina
__module_put_and_exit() is makred noreturn in module.h declaration, but is lacking the attribute in the definition, which makes some tools (such as sparse) unhappy. Amend the definition with the attribute as well (and reformat the declaration so that it uses more common format). Signed-off-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2016-07-26Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge updates from Andrew Morton: - a few misc bits - ocfs2 - most(?) of MM * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (125 commits) thp: fix comments of __pmd_trans_huge_lock() cgroup: remove unnecessary 0 check from css_from_id() cgroup: fix idr leak for the first cgroup root mm: memcontrol: fix documentation for compound parameter mm: memcontrol: remove BUG_ON in uncharge_list mm: fix build warnings in <linux/compaction.h> mm, thp: convert from optimistic swapin collapsing to conservative mm, thp: fix comment inconsistency for swapin readahead functions thp: update Documentation/{vm/transhuge,filesystems/proc}.txt shmem: split huge pages beyond i_size under memory pressure thp: introduce CONFIG_TRANSPARENT_HUGE_PAGECACHE khugepaged: add support of collapse for tmpfs/shmem pages shmem: make shmem_inode_info::lock irq-safe khugepaged: move up_read(mmap_sem) out of khugepaged_alloc_page() thp: extract khugepaged from mm/huge_memory.c shmem, thp: respect MADV_{NO,}HUGEPAGE for file mappings shmem: add huge pages support shmem: get_unmapped_area align huge page shmem: prepare huge= mount option and sysfs knob mm, rmap: account shmem thp pages ...
2016-07-26Merge tag 'pm-4.8-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management updates from Rafael Wysocki: "Again, the majority of changes go into the cpufreq subsystem, but there are no big features this time. The cpufreq changes that stand out somewhat are the governor interface rework and improvements related to the handling of frequency tables. Apart from those, there are fixes and new device/CPU IDs in drivers, cleanups and an improvement of the new schedutil governor. Next, there are some changes in the hibernation core, including a fix for a nasty problem related to the MONITOR/MWAIT usage by CPU offline during resume from hibernation, a few core improvements related to memory management during resume, a couple of additional debug features and cleanups. Finally, we have some fixes and cleanups in the devfreq subsystem, generic power domains framework improvements related to system suspend/resume, support for some new chips in intel_idle and in the power capping RAPL driver, a new version of the AnalyzeSuspend utility and some assorted fixes and cleanups. Specifics: - Rework the cpufreq governor interface to make it more straightforward and modify the conservative governor to avoid using transition notifications (Rafael Wysocki). - Rework the handling of frequency tables by the cpufreq core to make it more efficient (Viresh Kumar). - Modify the schedutil governor to reduce the number of wakeups it causes to occur in cases when the CPU frequency doesn't need to be changed (Steve Muckle, Viresh Kumar). - Fix some minor issues and clean up code in the cpufreq core and governors (Rafael Wysocki, Viresh Kumar). - Add Intel Broxton support to the intel_pstate driver (Srinivas Pandruvada). - Fix problems related to the config TDP feature and to the validity of the MSR_HWP_INTERRUPT register in intel_pstate (Jan Kiszka, Srinivas Pandruvada). - Make intel_pstate update the cpu_frequency tracepoint even if the frequency doesn't change to avoid confusing powertop (Rafael Wysocki). - Clean up the usage of __init/__initdata in intel_pstate, mark some of its internal variables as __read_mostly and drop an unused structure element from it (Jisheng Zhang, Carsten Emde). - Clean up the usage of some duplicate MSR symbols in intel_pstate and turbostat (Srinivas Pandruvada). - Update/fix the powernv, s3c24xx and mvebu cpufreq drivers (Akshay Adiga, Viresh Kumar, Ben Dooks). - Fix a regression (introduced during the 4.5 cycle) in the pcc-cpufreq driver by reverting the problematic commit (Andreas Herrmann). - Add support for Intel Denverton to intel_idle, clean up Broxton support in it and make it explicitly non-modular (Jacob Pan, Jan Beulich, Paul Gortmaker). - Add support for Denverton and Ivy Bridge server to the Intel RAPL power capping driver and make it more careful about the handing of MSRs that may not be present (Jacob Pan, Xiaolong Wang). - Fix resume from hibernation on x86-64 by making the CPU offline during resume avoid using MONITOR/MWAIT in the "play dead" loop which may lead to an inadvertent "revival" of a "dead" CPU and a page fault leading to a kernel crash from it (Rafael Wysocki). - Make memory management during resume from hibernation more straightforward (Rafael Wysocki). - Add debug features that should help to detect problems related to hibernation and resume from it (Rafael Wysocki, Chen Yu). - Clean up hibernation core somewhat (Rafael Wysocki). - Prevent KASAN from instrumenting the hibernation core which leads to large numbers of false-positives from it (James Morse). - Prevent PM (hibernate and suspend) notifiers from being called during the cleanup phase if they have not been called during the corresponding preparation phase which is possible if one of the other notifiers returns an error at that time (Lianwei Wang). - Improve suspend-related debug printout in the tasks freezer and clean up suspend-related console handling (Roger Lu, Borislav Petkov). - Update the AnalyzeSuspend script in the kernel sources to version 4.2 (Todd Brandt). - Modify the generic power domains framework to make it handle system suspend/resume better (Ulf Hansson). - Make the runtime PM framework avoid resuming devices synchronously when user space changes the runtime PM settings for them and improve its error reporting (Rafael Wysocki, Linus Walleij). - Fix error paths in devfreq drivers (exynos, exynos-ppmu, exynos-bus) and in the core, make some devfreq code explicitly non-modular and change some of it into tristate (Bartlomiej Zolnierkiewicz, Peter Chen, Paul Gortmaker). - Add DT support to the generic PM clocks management code and make it export some more symbols (Jon Hunter, Paul Gortmaker). - Make the PCI PM core code slightly more robust against possible driver errors (Andy Shevchenko). - Make it possible to change DESTDIR and PREFIX in turbostat (Andy Shevchenko)" * tag 'pm-4.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (89 commits) Revert "cpufreq: pcc-cpufreq: update default value of cpuinfo_transition_latency" PM / hibernate: Introduce test_resume mode for hibernation cpufreq: export cpufreq_driver_resolve_freq() cpufreq: Disallow ->resolve_freq() for drivers providing ->target_index() PCI / PM: check all fields in pci_set_platform_pm() cpufreq: acpi-cpufreq: use cached frequency mapping when possible cpufreq: schedutil: map raw required frequency to driver frequency cpufreq: add cpufreq_driver_resolve_freq() cpufreq: intel_pstate: Check cpuid for MSR_HWP_INTERRUPT intel_pstate: Update cpu_frequency tracepoint every time cpufreq: intel_pstate: clean remnant struct element PM / tools: scripts: AnalyzeSuspend v4.2 x86 / hibernate: Use hlt_play_dead() when resuming from hibernation cpufreq: powernv: Replacing pstate_id with frequency table index intel_pstate: Fix MSR_CONFIG_TDP_x addressing in core_get_max_pstate() PM / hibernate: Image data protection during restoration PM / hibernate: Add missing braces in __register_nosave_region() PM / hibernate: Clean up comments in snapshot.c PM / hibernate: Clean up function headers in snapshot.c PM / hibernate: Add missing braces in hibernate_setup() ...
2016-07-26cgroup: remove unnecessary 0 check from css_from_id()Johannes Weiner
css_idr allocation starts at 1, so index 0 will never point to an item. css_from_id() currently filters that before asking idr_find(), but idr_find() would also just return NULL, so this is not needed. Link: http://lkml.kernel.org/r/20160617162427.GC19084@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: Tejun Heo <tj@kernel.org> Cc: Nikolay Borisov <kernel@kyup.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26cgroup: fix idr leak for the first cgroup rootJohannes Weiner
The valid cgroup hierarchy ID range includes 0, so we can't filter for positive numbers when freeing it, or it'll leak the first ID. No big deal, just disruptive when reading the code. The ID is freed during error handling and when the reference count hits zero, so the double-free test is not necessary; remove it. Link: http://lkml.kernel.org/r/20160617162359.GB19084@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: Tejun Heo <tj@kernel.org> Cc: Nikolay Borisov <kernel@kyup.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26mm: charge/uncharge kmemcg from generic page allocator pathsVladimir Davydov
Currently, to charge a non-slab allocation to kmemcg one has to use alloc_kmem_pages helper with __GFP_ACCOUNT flag. A page allocated with this helper should finally be freed using free_kmem_pages, otherwise it won't be uncharged. This API suits its current users fine, but it turns out to be impossible to use along with page reference counting, i.e. when an allocation is supposed to be freed with put_page, as it is the case with pipe or unix socket buffers. To overcome this limitation, this patch moves charging/uncharging to generic page allocator paths, i.e. to __alloc_pages_nodemask and free_pages_prepare, and zaps alloc/free_kmem_pages helpers. This way, one can use any of the available page allocation functions to get the allocated page charged to kmemcg - it's enough to pass __GFP_ACCOUNT, just like in case of kmalloc and friends. A charged page will be automatically uncharged on free. To make it possible, we need to mark pages charged to kmemcg somehow. To avoid introducing a new page flag, we make use of page->_mapcount for marking such pages. Since pages charged to kmemcg are not supposed to be mapped to userspace, it should work just fine. There are other (ab)users of page->_mapcount - buddy and balloon pages - but we don't conflict with them. In case kmemcg is compiled out or not used at runtime, this patch introduces no overhead to generic page allocator paths. If kmemcg is used, it will be plus one gfp flags check on alloc and plus one page->_mapcount check on free, which shouldn't hurt performance, because the data accessed are hot. Link: http://lkml.kernel.org/r/a9736d856f895bcb465d9f257b54efe32eda6f99.1464079538.git.vdavydov@virtuozzo.com Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26Merge branch 'for-4.8/drivers' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block driver updates from Jens Axboe: "This branch also contains core changes. I've come to the conclusion that from 4.9 and forward, I'll be doing just a single branch. We often have dependencies between core and drivers, and it's hard to always split them up appropriately without pulling core into drivers when that happens. That said, this contains: - separate secure erase type for the core block layer, from Christoph. - set of discard fixes, from Christoph. - bio shrinking fixes from Christoph, as a followup up to the op/flags change in the core branch. - map and append request fixes from Christoph. - NVMeF (NVMe over Fabrics) code from Christoph. This is pretty exciting! - nvme-loop fixes from Arnd. - removal of ->driverfs_dev from Dan, after providing a device_add_disk() helper. - bcache fixes from Bhaktipriya and Yijing. - cdrom subchannel read fix from Vchannaiah. - set of lightnvm updates from Wenwei, Matias, Johannes, and Javier. - set of drbd updates and fixes from Fabian, Lars, and Philipp. - mg_disk error path fix from Bart. - user notification for failed device add for loop, from Minfei. - NVMe in general: + NVMe delay quirk from Guilherme. + SR-IOV support and command retry limits from Keith. + fix for memory-less NUMA node from Masayoshi. + use UINT_MAX for discard sectors, from Minfei. + cancel IO fixes from Ming. + don't allocate unused major, from Neil. + error code fixup from Dan. + use constants for PSDT/FUSE from James. + variable init fix from Jay. + fabrics fixes from Ming, Sagi, and Wei. + various fixes" * 'for-4.8/drivers' of git://git.kernel.dk/linux-block: (115 commits) nvme/pci: Provide SR-IOV support nvme: initialize variable before logical OR'ing it block: unexport various bio mapping helpers scsi/osd: open code blk_make_request target: stop using blk_make_request block: simplify and export blk_rq_append_bio block: ensure bios return from blk_get_request are properly initialized virtio_blk: use blk_rq_map_kern memstick: don't allow REQ_TYPE_BLOCK_PC requests block: shrink bio size again block: simplify and cleanup bvec pool handling block: get rid of bio_rw and READA block: don't ignore -EOPNOTSUPP blkdev_issue_write_same block: introduce BLKDEV_DISCARD_ZERO to fix zeroout NVMe: don't allocate unused nvme_major nvme: avoid crashes when node 0 is memoryless node. nvme: Limit command retries loop: Make user notify for adding loop device failed nvme-loop: fix nvme-loop Kconfig dependencies nvmet: fix return value check in nvmet_subsys_alloc() ...
2016-07-26Merge branch 'for-4.8/core' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull core block updates from Jens Axboe: - the big change is the cleanup from Mike Christie, cleaning up our uses of command types and modified flags. This is what will throw some merge conflicts - regression fix for the above for btrfs, from Vincent - following up to the above, better packing of struct request from Christoph - a 2038 fix for blktrace from Arnd - a few trivial/spelling fixes from Bart Van Assche - a front merge check fix from Damien, which could cause issues on SMR drives - Atari partition fix from Gabriel - convert cfq to highres timers, since jiffies isn't granular enough for some devices these days. From Jan and Jeff - CFQ priority boost fix idle classes, from me - cleanup series from Ming, improving our bio/bvec iteration - a direct issue fix for blk-mq from Omar - fix for plug merging not involving the IO scheduler, like we do for other types of merges. From Tahsin - expose DAX type internally and through sysfs. From Toshi and Yigal * 'for-4.8/core' of git://git.kernel.dk/linux-block: (76 commits) block: Fix front merge check block: do not merge requests without consulting with io scheduler block: Fix spelling in a source code comment block: expose QUEUE_FLAG_DAX in sysfs block: add QUEUE_FLAG_DAX for devices to advertise their DAX support Btrfs: fix comparison in __btrfs_map_block() block: atari: Return early for unsupported sector size Doc: block: Fix a typo in queue-sysfs.txt cfq-iosched: Charge at least 1 jiffie instead of 1 ns cfq-iosched: Fix regression in bonnie++ rewrite performance cfq-iosched: Convert slice_resid from u64 to s64 block: Convert fifo_time from ulong to u64 blktrace: avoid using timespec block/blk-cgroup.c: Declare local symbols static block/bio-integrity.c: Add #include "blk.h" block/partition-generic.c: Remove a set-but-not-used variable block: bio: kill BIO_MAX_SIZE cfq-iosched: temporarily boost queue priority for idle classes block: drbd: avoid to use BIO_MAX_SIZE block: bio: remove BIO_MAX_SECTORS ...
2016-07-26Merge branch 'for-4.8' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: "Nothing too exciting. - updates to the pids controller so that pid limit breaches can be noticed and monitored from userland. - cleanups and non-critical bug fixes" * 'for-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup: remove duplicated include from cgroup.c cgroup: Use lld instead of ld when printing pids controller events_limit cgroup: Add pids controller event when fork fails because of pid limit cgroup: allow NULL return from ss->css_alloc() cgroup: remove unnecessary 0 check from css_from_id() cgroup: fix idr leak for the first cgroup root
2016-07-25Merge branch 'irq-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq updates from Thomas Gleixner: "The irq department delivers: - new core infrastructure to allow better management of multi-queue devices (interrupt spreading, node aware descriptor allocation ...) - a new interrupt flow handler to support the new fangled Intel VMD devices. - yet another new interrupt controller driver. - a series of fixes which addresses sparse warnings, missing includes, missing static declarations etc from Ben Dooks. - a fix for the error handling in the hierarchical domain allocation code. - the usual pile of small updates to core and driver code" * 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (46 commits) genirq: Fix missing irq allocation affinity hint irqdomain: Fix irq_domain_alloc_irqs_recursive() error handling irq/Documentation: Correct result of echnoing 5 to smp_affinity MAINTAINERS: Remove Jiang Liu from irq domains genirq/msi: Fix broken debug output genirq: Add a helper to spread an affinity mask for MSI/MSI-X vectors genirq/msi: Make use of affinity aware allocations genirq: Use affinity hint in irqdesc allocation genirq: Add affinity hint to irq allocation genirq: Introduce IRQD_AFFINITY_MANAGED flag genirq/msi: Remove unused MSI_FLAG_IDENTITY_MAP irqchip/s3c24xx: Fixup IO accessors for big endian irqchip/exynos-combiner: Fix usage of __raw IO irqdomain: Fix disposal of mappings for interrupt hierarchies irqchip/aspeed-vic: Add irq controller for Aspeed doc/devicetree: Add Aspeed VIC bindings x86/PCI/VMD: Use untracked irq handler genirq: Add untracked irq handler irqchip/mips-gic: Populate irq_domain names irqchip/gicv3-its: Implement two-level(indirect) device table support ...
2016-07-25Merge branch 'timers-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer updates from Thomas Gleixner: "This update provides the following changes: - The rework of the timer wheel which addresses the shortcomings of the current wheel (cascading, slow search for next expiring timer, etc). That's the first major change of the wheel in almost 20 years since Finn implemted it. - A large overhaul of the clocksource drivers init functions to consolidate the Device Tree initialization - Some more Y2038 updates - A capability fix for timerfd - Yet another clock chip driver - The usual pile of updates, comment improvements all over the place" * 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (130 commits) tick/nohz: Optimize nohz idle enter clockevents: Make clockevents_subsys static clocksource/drivers/time-armada-370-xp: Fix return value check timers: Implement optimization for same expiry time in mod_timer() timers: Split out index calculation timers: Only wake softirq if necessary timers: Forward the wheel clock whenever possible timers/nohz: Remove pointless tick_nohz_kick_tick() function timers: Optimize collect_expired_timers() for NOHZ timers: Move __run_timers() function timers: Remove set_timer_slack() leftovers timers: Switch to a non-cascading wheel timers: Reduce the CPU index space to 256k timers: Give a few structs and members proper names hlist: Add hlist_is_singular_node() helper signals: Use hrtimer for sigtimedwait() timers: Remove the deprecated mod_timer_pinned() API timers, net/ipv4/inet: Initialize connection request timers as pinned timers, drivers/tty/mips_ejtag: Initialize the poll timer as pinned timers, drivers/tty/metag_da: Initialize the poll timer as pinned ...
2016-07-25bpf: Add bpf_probe_write_user BPF helper to be called in tracersSargun Dhillon
This allows user memory to be written to during the course of a kprobe. It shouldn't be used to implement any kind of security mechanism because of TOC-TOU attacks, but rather to debug, divert, and manipulate execution of semi-cooperative processes. Although it uses probe_kernel_write, we limit the address space the probe can write into by checking the space with access_ok. We do this as opposed to calling copy_to_user directly, in order to avoid sleeping. In addition we ensure the threads's current fs / segment is USER_DS and the thread isn't exiting nor a kernel thread. Given this feature is meant for experiments, and it has a risk of crashing the system, and running programs, we print a warning on when a proglet that attempts to use this helper is installed, along with the pid and process name. Signed-off-by: Sargun Dhillon <sargun@sargun.me> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-25Merge branch 'x86-boot-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 boot updates from Ingo Molnar: "The main changes: - add initial commits to randomize kernel memory section virtual addresses, enabled via a new kernel option: RANDOMIZE_MEMORY (Thomas Garnier, Kees Cook, Baoquan He, Yinghai Lu) - enhance KASLR (RANDOMIZE_BASE) physical memory randomization (Kees Cook) - EBDA/BIOS region boot quirk cleanups (Andy Lutomirski, Ingo Molnar) - misc cleanups/fixes" * 'x86-boot-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/boot: Simplify EBDA-vs-BIOS reservation logic x86/boot: Clarify what x86_legacy_features.reserve_bios_regions does x86/boot: Reorganize and clean up the BIOS area reservation code x86/mm: Do not reference phys addr beyond kernel x86/mm: Add memory hotplug support for KASLR memory randomization x86/mm: Enable KASLR for vmalloc memory regions x86/mm: Enable KASLR for physical mapping memory regions x86/mm: Implement ASLR for kernel memory regions x86/mm: Separate variable for trampoline PGD x86/mm: Add PUD VA support for physical mapping x86/mm: Update physical mapping variable names x86/mm: Refactor KASLR entropy functions x86/KASLR: Fix boot crash with certain memory configurations x86/boot/64: Add forgotten end of function marker x86/KASLR: Allow randomization below the load address x86/KASLR: Extend kernel image physical address randomization to addresses larger than 4G x86/KASLR: Randomize virtual address separately x86/KASLR: Clarify identity map interface x86/boot: Refuse to build with data relocations x86/KASLR, x86/power: Remove x86 hibernation restrictions
2016-07-25Merge branch 'timers-nohz-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull NOHZ updates from Ingo Molnar: - fix system/idle cputime leaked on cputime accounting (all nohz configs) (Rik van Riel) - remove the messy, ad-hoc irqtime account on nohz-full and make it compatible with CONFIG_IRQ_TIME_ACCOUNTING=y instead (Rik van Riel) - cleanups (Frederic Weisbecker) - remove unecessary irq disablement in the irqtime code (Rik van Riel) * 'timers-nohz-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/cputime: Drop local_irq_save/restore from irqtime_account_irq() sched/cputime: Reorganize vtime native irqtime accounting headers sched/cputime: Clean up the old vtime gen irqtime accounting completely sched/cputime: Replace VTIME_GEN irq time code with IRQ_TIME_ACCOUNTING code sched/cputime: Count actually elapsed irq & softirq time
2016-07-25Merge branch 'sched-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: - introduce and use task_rcu_dereference()/try_get_task_struct() to fix and generalize task_struct handling (Oleg Nesterov) - do various per entity load tracking (PELT) fixes and optimizations (Peter Zijlstra) - cputime virt-steal time accounting enhancements/fixes (Wanpeng Li) - introduce consolidated cputime output file cpuacct.usage_all and related refactorings (Zhao Lei) - ... plus misc fixes and enhancements * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/core: Panic on scheduling while atomic bugs if kernel.panic_on_warn is set sched/cpuacct: Introduce cpuacct.usage_all to show all CPU stats together sched/cpuacct: Use loop to consolidate code in cpuacct_stats_show() sched/cpuacct: Merge cpuacct_usage_index and cpuacct_stat_index enums sched/fair: Rework throttle_count sync sched/core: Fix sched_getaffinity() return value kerneldoc comment sched/fair: Reorder cgroup creation code sched/fair: Apply more PELT fixes sched/fair: Fix PELT integrity for new tasks sched/cgroup: Fix cpu_cgroup_fork() handling sched/fair: Fix PELT integrity for new groups sched/fair: Fix and optimize the fork() path sched/cputime: Add steal time support to full dynticks CPU time accounting sched/cputime: Fix prev steal time accouting during CPU hotplug KVM: Fix steal clock warp during guest CPU hotplug sched/debug: Always show 'nr_migrations' sched/fair: Use task_rcu_dereference() sched/api: Introduce task_rcu_dereference() and try_get_task_struct() sched/idle: Optimize the generic idle loop sched/fair: Fix the wrong throttled clock time for cfs_rq_clock_task()
2016-07-25Merge branch 'perf-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf updates from Ingo Molnar: "With over 300 commits it's been a busy cycle - with most of the work concentrated on the tooling side (as it should). The main kernel side enhancements were: - Add per event callchain limit: Recently we introduced a sysctl to tune the max-stack for all events for which callchains were requested: $ sysctl kernel.perf_event_max_stack kernel.perf_event_max_stack = 127 Now this patch introduces a way to configure this per event, i.e. this becomes possible: $ perf record -e sched:*/max-stack=2/ -e block:*/max-stack=10/ -a allowing finer tuning of how much buffer space callchains use. This uses an u16 from the reserved space at the end, leaving another u16 for future use. There has been interest in even finer tuning, namely to control the max stack for kernel and userspace callchains separately. Further discussion is needed, we may for instance use the remaining u16 for that and when it is present, assume that the sample_max_stack introduced in this patch applies for the kernel, and the u16 left is used for limiting the userspace callchain (Arnaldo Carvalho de Melo) - Optimize AUX event (hardware assisted side-band event) delivery (Kan Liang) - Rework Intel family name macro usage (this is partially x86 arch work) (Dave Hansen) - Refine and fix Intel LBR support (David Carrillo-Cisneros) - Add support for Intel 'TopDown' events (Andi Kleen) - Intel uncore PMU driver fixes and enhancements (Kan Liang) - ... other misc changes. Here's an incomplete list of the tooling enhancements (but there's much more, see the shortlog and the git log for details): - Support cross unwinding, i.e. collecting '--call-graph dwarf' perf.data files in one machine and then doing analysis in another machine of a different hardware architecture. This enables, for instance, to do: $ perf record -a --call-graph dwarf on a x86-32 or aarch64 system and then do 'perf report' on it on a x86_64 workstation (He Kuang) - Allow reading from a backward ring buffer (one setup via sys_perf_event_open() with perf_event_attr.write_backward = 1) (Wang Nan) - Finish merging initial SDT (Statically Defined Traces) support, see cset comments for details about how it all works (Masami Hiramatsu) - Support attaching eBPF programs to tracepoints (Wang Nan) - Add demangling of symbols in programs written in the Rust language (David Tolnay) - Add support for tracepoints in the python binding, including an example, that sets up and parses sched:sched_switch events, tools/perf/python/tracepoint.py (Jiri Olsa) - Introduce --stdio-color to set up the color output mode selection in 'annotate' and 'report', allowing emit color escape sequences when redirecting the output of these tools (Arnaldo Carvalho de Melo) - Add 'callindent' option to 'perf script -F', to indent the Intel PT call stack, making this output more ftrace-like (Adrian Hunter, Andi Kleen) - Allow dumping the object files generated by llvm when processing eBPF scriptlet events (Wang Nan) - Add stackcollapse.py script to help generating flame graphs (Paolo Bonzini) - Add --ldlat option to 'perf mem' to specify load latency for loads event (e.g. cpu/mem-loads/ ) (Jiri Olsa) - Tooling support for Intel TopDown counters, recently added to the kernel (Andi Kleen)" * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (303 commits) perf tests: Add is_printable_array test perf tools: Make is_printable_array global perf script python: Fix string vs byte array resolving perf probe: Warn unmatched function filter correctly perf cpu_map: Add more helpers perf stat: Balance opening and reading events tools: Copy linux/{hash,poison}.h and check for drift perf tools: Remove include/linux/list.h from perf's MANIFEST tools: Copy the bitops files accessed from the kernel and check for drift Remove: kernel unistd*h files from perf's MANIFEST, not used perf tools: Remove tools/perf/util/include/linux/const.h perf tools: Remove tools/perf/util/include/asm/byteorder.h perf tools: Add missing linux/compiler.h include to perf-sys.h perf jit: Remove some no-op error handling perf jit: Add missing curly braces objtool: Initialize variable to silence old compiler objtool: Add -I$(srctree)/tools/arch/$(ARCH)/include/uapi perf record: Add --tail-synthesize option perf session: Don't warn about out of order event if write_backward is used perf tools: Enable overwrite settings ...
2016-07-25Merge branch 'locking-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking updates from Ingo Molnar: "The locking tree was busier in this cycle than the usual pattern - a couple of major projects happened to coincide. The main changes are: - implement the atomic_fetch_{add,sub,and,or,xor}() API natively across all SMP architectures (Peter Zijlstra) - add atomic_fetch_{inc/dec}() as well, using the generic primitives (Davidlohr Bueso) - optimize various aspects of rwsems (Jason Low, Davidlohr Bueso, Waiman Long) - optimize smp_cond_load_acquire() on arm64 and implement LSE based atomic{,64}_fetch_{add,sub,and,andnot,or,xor}{,_relaxed,_acquire,_release}() on arm64 (Will Deacon) - introduce smp_acquire__after_ctrl_dep() and fix various barrier mis-uses and bugs (Peter Zijlstra) - after discovering ancient spin_unlock_wait() barrier bugs in its implementation and usage, strengthen its semantics and update/fix usage sites (Peter Zijlstra) - optimize mutex_trylock() fastpath (Peter Zijlstra) - ... misc fixes and cleanups" * 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (67 commits) locking/atomic: Introduce inc/dec variants for the atomic_fetch_$op() API locking/barriers, arch/arm64: Implement LDXR+WFE based smp_cond_load_acquire() locking/static_keys: Fix non static symbol Sparse warning locking/qspinlock: Use __this_cpu_dec() instead of full-blown this_cpu_dec() locking/atomic, arch/tile: Fix tilepro build locking/atomic, arch/m68k: Remove comment locking/atomic, arch/arc: Fix build locking/Documentation: Clarify limited control-dependency scope locking/atomic, arch/rwsem: Employ atomic_long_fetch_add() locking/atomic, arch/qrwlock: Employ atomic_fetch_add_acquire() locking/atomic, arch/mips: Convert to _relaxed atomics locking/atomic, arch/alpha: Convert to _relaxed atomics locking/atomic: Remove the deprecated atomic_{set,clear}_mask() functions locking/atomic: Remove linux/atomic.h:atomic_fetch_or() locking/atomic: Implement atomic{,64,_long}_fetch_{add,sub,and,andnot,or,xor}{,_relaxed,_acquire,_release}() locking/atomic: Fix atomic64_relaxed() bits locking/atomic, arch/xtensa: Implement atomic_fetch_{add,sub,and,or,xor}() locking/atomic, arch/x86: Implement atomic{,64}_fetch_{add,sub,and,or,xor}() locking/atomic, arch/tile: Implement atomic{,64}_fetch_{add,sub,and,or,xor}() locking/atomic, arch/sparc: Implement atomic{,64}_fetch_{add,sub,and,or,xor}() ...
2016-07-25Merge branch 'core-rcu-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull RCU updates from Ingo Molnar: "The main changes in this cycle were: - documentation updates - miscellaneous fixes - minor reorganization of code - torture-test updates" * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (30 commits) rcu: Correctly handle sparse possible cpus rcu: sysctl: Panic on RCU Stall rcu: Fix a typo in a comment rcu: Make call_rcu_tasks() tolerate first call with irqs disabled rcu: Disable TASKS_RCU for usermode Linux rcu: No ordering for rcu_assign_pointer() of NULL rcutorture: Fix error return code in rcu_perf_init() torture: Inflict default jitter rcuperf: Don't treat gp_exp mis-setting as a WARN rcutorture: Drop "-soundhw pcspkr" from x86 boot arguments rcutorture: Don't specify the cpu type of QEMU on PPC rcutorture: Make -soundhw a x86 specific option rcutorture: Use vmlinux as the fallback kernel image rcutorture/doc: Create initrd using dracut torture: Stop onoff task if there is only one cpu torture: Add starvation events to error summary torture: Break online and offline functions out of torture_onoff() torture: Forgive lengthy trace dumps and preemption torture: Remove CONFIG_RCU_TORTURE_TEST_RUNNABLE, simplify code torture: Simplify code, eliminate RCU_PERF_TEST_RUNNABLE ...
2016-07-25bpf, events: fix offset in skb copy handlerDaniel Borkmann
This patch fixes the __output_custom() routine we currently use with bpf_skb_copy(). I missed that when len is larger than the size of the current handle, we can issue multiple invocations of copy_func, and __output_custom() advances destination but also source buffer by the written amount of bytes. When we have __output_custom(), this is actually wrong since in that case the source buffer points to a non-linear object, in our case an skb, which the copy_func helper is supposed to walk. Therefore, since this is non-linear we thus need to pass the offset into the helper, so that copy_func can use it for extracting the data from the source object. Therefore, adjust the callback signatures properly and pass offset into the skb_header_pointer() invoked from bpf_skb_copy() callback. The __DEFINE_OUTPUT_COPY_BODY() is adjusted to accommodate for two things: i) to pass in whether we should advance source buffer or not; this is a compile-time constant condition, ii) to pass in the offset for __output_custom(), which we do with help of __VA_ARGS__, so everything can stay inlined as is currently. Both changes allow for adapting the __output_* fast-path helpers w/o extra overhead. Fixes: 555c8a8623a3 ("bpf: avoid stack copy and use skb ctx for event output") Fixes: 7e3f977edd0b ("perf, events: add non-linear data support for raw records") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-25Merge branch 'pm-cpufreq'Rafael J. Wysocki
* pm-cpufreq: (41 commits) Revert "cpufreq: pcc-cpufreq: update default value of cpuinfo_transition_latency" cpufreq: export cpufreq_driver_resolve_freq() cpufreq: Disallow ->resolve_freq() for drivers providing ->target_index() cpufreq: acpi-cpufreq: use cached frequency mapping when possible cpufreq: schedutil: map raw required frequency to driver frequency cpufreq: add cpufreq_driver_resolve_freq() cpufreq: intel_pstate: Check cpuid for MSR_HWP_INTERRUPT intel_pstate: Update cpu_frequency tracepoint every time cpufreq: intel_pstate: clean remnant struct element cpufreq: powernv: Replacing pstate_id with frequency table index intel_pstate: Fix MSR_CONFIG_TDP_x addressing in core_get_max_pstate() cpufreq: Reuse new freq-table helpers cpufreq: Handle sorted frequency tables more efficiently cpufreq: Drop redundant check from cpufreq_update_current_freq() intel_pstate: Declare pid_params/pstate_funcs/hwp_active __read_mostly intel_pstate: add __init/__initdata marker to some functions/variables intel_pstate: Fix incorrect placement of __initdata cpufreq: mvebu: fix integer to pointer cast cpufreq: intel_pstate: Broxton support cpufreq: conservative: Do not use transition notifications ...
2016-07-25Merge branches 'pm-sleep' and 'pm-tools'Rafael J. Wysocki
* pm-sleep: PM / hibernate: Introduce test_resume mode for hibernation x86 / hibernate: Use hlt_play_dead() when resuming from hibernation PM / hibernate: Image data protection during restoration PM / hibernate: Add missing braces in __register_nosave_region() PM / hibernate: Clean up comments in snapshot.c PM / hibernate: Clean up function headers in snapshot.c PM / hibernate: Add missing braces in hibernate_setup() PM / hibernate: Recycle safe pages after image restoration PM / hibernate: Simplify mark_unsafe_pages() PM / hibernate: Do not free preallocated safe pages during image restore PM / suspend: show workqueue state in suspend flow PM / sleep: make PM notifiers called symmetrically PM / sleep: Make pm_prepare_console() return void PM / Hibernate: Don't let kasan instrument snapshot.c * pm-tools: PM / tools: scripts: AnalyzeSuspend v4.2 tools/turbostat: allow user to alter DESTDIR and PREFIX
2016-07-24Merge tag 'staging-4.8-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging Pull staging and IIO driver updates from Greg KH: "Here is the big Staging and IIO driver update for 4.8-rc1. We ended up adding more code than removing, again, but it's not all that bad. Lots of cleanups all over the staging tree, and new IIO drivers, full details in the shortlog. All of these have been in linux-next for a while with no reported issues" * tag 'staging-4.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging: (417 commits) drivers:iio:accel:mma8452: removed unwanted return statements drivers:iio:accel:mma8452: added cleanup provision in case of failure. iio: Add iio.git tree to MAINTAINERS iio:st_pressure: clean useless static channel initializers iio:st_pressure:lps22hb: temperature support iio:st_pressure:lps22hb: open drain support iio:st_pressure: temperature triggered buffering iio:st_pressure: document sampling gains iio:st_pressure: align storagebits on power of 2 iio:st_sensors: align on storagebits boundaries staging:iio:lis3l02dq drop separate driver iio: accel: st_accel: Add lis3l02dq support iio: adc: add missing of_node references to iio_dev iio: adc: ti-ads1015: add indio_dev->dev.of_node reference iio: potentiometer: Fix typo in Kconfig iio: potentiometer: mcp4531: Add device tree binding iio: potentiometer: mcp4531: Add device tree binding documentation iio: potentiometer: mcp4531: Add support for MCP454x, MCP456x, MCP464x and MCP466x iio:imu:mpu6050: icm20608 initial support iio: adc: max1363: Add device tree binding ...
2016-07-24Merge branch 'for-4.8/libnvdimm' into libnvdimm-for-nextDan Williams
2016-07-24Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Just several instances of overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-22PM / hibernate: Introduce test_resume mode for hibernationChen Yu
test_resume mode is to verify if the snapshot data written to swap device can be successfully restored to memory. It is useful to ease the debugging process on hibernation, since this mode can not only bypass the BIOSes/bootloader, but also the system re-initialization. To avoid the risk to break the filesystm on persistent storage, this patch resumes the image with tasks frozen. For example: echo test_resume > /sys/power/disk echo disk > /sys/power/state [ 187.306470] PM: Image saving progress: 70% [ 187.395298] PM: Image saving progress: 80% [ 187.476697] PM: Image saving progress: 90% [ 187.554641] PM: Image saving done. [ 187.558896] PM: Wrote 594600 kbytes in 0.90 seconds (660.66 MB/s) [ 187.566000] PM: S| [ 187.589742] PM: Basic memory bitmaps freed [ 187.594694] PM: Checking hibernation image [ 187.599865] PM: Image signature found, resuming [ 187.605209] PM: Loading hibernation image. [ 187.665753] PM: Basic memory bitmaps created [ 187.691397] PM: Using 3 thread(s) for decompression. [ 187.691397] PM: Loading and decompressing image data (148650 pages)... [ 187.889719] PM: Image loading progress: 0% [ 188.100452] PM: Image loading progress: 10% [ 188.244781] PM: Image loading progress: 20% [ 189.057305] PM: Image loading done. [ 189.068793] PM: Image successfully loaded Suggested-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Chen Yu <yu.c.chen@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-07-21cpufreq: schedutil: map raw required frequency to driver frequencySteve Muckle
The slow-path frequency transition path is relatively expensive as it requires waking up a thread to do work. Should support be added for remote CPU cpufreq updates that is also expensive since it requires an IPI. These activities should be avoided if they are not necessary. To that end, calculate the actual driver-supported frequency required by the new utilization value in schedutil by using the recently added cpufreq_driver_resolve_freq API. If it is the same as the previously requested driver frequency then there is no need to continue with the update assuming the cpu frequency limits have not changed. This will have additional benefits should the semantics of the rate limit be changed to apply solely to frequency transitions rather than to frequency calculations in schedutil. The last raw required frequency is cached. This allows the driver frequency lookup to be skipped in the event that the new raw required frequency matches the last one, assuming a frequency update has not been forced due to limits changing (indicated by a next_freq value of UINT_MAX, see sugov_should_update_freq). Signed-off-by: Steve Muckle <smuckle@linaro.org> Reviewed-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-07-20audit: fix a double fetch in audit_log_single_execve_arg()Paul Moore
There is a double fetch problem in audit_log_single_execve_arg() where we first check the execve(2) argumnets for any "bad" characters which would require hex encoding and then re-fetch the arguments for logging in the audit record[1]. Of course this leaves a window of opportunity for an unsavory application to munge with the data. This patch reworks things by only fetching the argument data once[2] into a buffer where it is scanned and logged into the audit records(s). In addition to fixing the double fetch, this patch improves on the original code in a few other ways: better handling of large arguments which require encoding, stricter record length checking, and some performance improvements (completely unverified, but we got rid of some strlen() calls, that's got to be a good thing). As part of the development of this patch, I've also created a basic regression test for the audit-testsuite, the test can be tracked on GitHub at the following link: * https://github.com/linux-audit/audit-testsuite/issues/25 [1] If you pay careful attention, there is actually a triple fetch problem due to a strnlen_user() call at the top of the function. [2] This is a tiny white lie, we do make a call to strnlen_user() prior to fetching the argument data. I don't like it, but due to the way the audit record is structured we really have no choice unless we copy the entire argument at once (which would require a rather wasteful allocation). The good news is that with this patch the kernel no longer relies on this strnlen_user() value for anything beyond recording it in the log, we also update it with a trustworthy value whenever possible. Reported-by: Pengfei Wang <wpengfeinudt@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Paul Moore <paul@paul-moore.com>
2016-07-19bpf: enable direct packet data write for xdp progsBrenden Blanco
For forwarding to be effective, XDP programs should be allowed to rewrite packet data. This requires that the drivers supporting XDP must all map the packet memory as TODEVICE or BIDIRECTIONAL before invoking the program. Signed-off-by: Brenden Blanco <bblanco@plumgrid.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-19bpf: add XDP prog type for early driver filterBrenden Blanco
Add a new bpf prog type that is intended to run in early stages of the packet rx path. Only minimal packet metadata will be available, hence a new context type, struct xdp_md, is exposed to userspace. So far only expose the packet start and end pointers, and only in read mode. An XDP program must return one of the well known enum values, all other return codes are reserved for future use. Unfortunately, this restriction is hard to enforce at verification time, so take the approach of warning at runtime when such programs are encountered. Out of bounds return codes should alias to XDP_ABORTED. Signed-off-by: Brenden Blanco <bblanco@plumgrid.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-19bpf: add bpf_prog_add api for bulk prog refcntBrenden Blanco
A subsystem may need to store many copies of a bpf program, each deserving its own reference. Rather than requiring the caller to loop one by one (with possible mid-loop failure), add a bulk bpf_prog_add api. Signed-off-by: Brenden Blanco <bblanco@plumgrid.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-19kernel/trace/bpf_trace.c: work around gcc-4.4.4 anon union initialization bugAndrew Morton
kernel/trace/bpf_trace.c: In function 'bpf_event_output': kernel/trace/bpf_trace.c:312: error: unknown field 'next' specified in initializer kernel/trace/bpf_trace.c:312: warning: missing braces around initializer kernel/trace/bpf_trace.c:312: warning: (near initialization for 'raw.frag.<anonymous>') Fixes: 555c8a8623a3a87 ("bpf: avoid stack copy and use skb ctx for event output") Acked-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Alexei Starovoitov <ast@kernel.org> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-19cgroup: remove duplicated include from cgroup.cWei Yongjun
Remove duplicated include. Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Signed-off-by: Tejun Heo <tj@kernel.org>
2016-07-19tick/nohz: Optimize nohz idle enterGaurav Jindal
tick_nohz_start_idle is called before checking whether the idle tick can be stopped. If the tick cannot be stopped, calling tick_nohz_start_idle() is pointless and just wasting CPU cycles. Only invoke tick_nohz_start_idle() when can_stop_idle_tick() returns true. A short one minute observation of the effect on ARM64 shows a reduction of calls by 1.5% thus optimizing the idle entry sequence. [tglx: Massaged changelog ] Co-developed-by: Sanjeev Yadav<sanjeev.yadav@spreadtrum.com> Signed-off-by: Gaurav Jindal<gaurav.jindal@spreadtrum.com> Link: http://lkml.kernel.org/r/20160714120416.GB21099@gaurav.jindal@spreadtrum.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-07-19genirq: Fix missing irq allocation affinity hintVincent Stehle
The new affinity hint argument of __irq_domain_alloc_irqs() is missing in irq_reserve_ipi(). Add it. This fixes the following compilation error: kernel/irq/ipi.c: In function ‘irq_reserve_ipi’: kernel/irq/ipi.c:85:9: error: too few arguments to function ‘__irq_domain_alloc_irqs’ virq = __irq_domain_alloc_irqs(domain, virq, nr_irqs, NUMA_NO_NODE, ^ Fixes: 06ee6d571f0e ("genirq: Add affinity hint to irq allocation") Signed-off-by: Vincent Stehlé <vincent.stehle@laposte.net> Cc: linux-pci@vger.kernel.org Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-07-19clockevents: Make clockevents_subsys staticBen Dooks
The clockevents_subsys struct is used for sysfs support and is not declared or used outside the file it is defined in. Fix the following warning by making it static: kernel/time/clockevents.c:648:17: warning: symbol 'clockevents_subsys' was not declared. Should it be static? Signed-off-by: Ben Dooks <ben.dooks@codethink.co.uk> Cc: linux-kernel@lists.codethink.co.uk Link: http://lkml.kernel.org/r/1466178974-7105-1-git-send-email-ben.dooks@codethink.co.uk Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-07-16bpf: bpf_event_entry_gen's alloc needs to be in atomic contextDaniel Borkmann
Should have been obvious, only called from bpf() syscall via map_update_elem() that calls bpf_fd_array_map_update_elem() under RCU read lock and thus this must also be in GFP_ATOMIC, of course. Fixes: 3b1efb196eee ("bpf, maps: flush own entries on perf map release") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-16Merge branch 'for-4.7-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq Pull workqueue fix from Tejun Heo: "The optimization for setting unbound worker affinity masks collided with recent scheduler changes triggering warning messages. This late pull request fixes the bug by removing the optimization" * 'for-4.7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: Fix setting affinity of unbound worker threads
2016-07-15bpf: avoid stack copy and use skb ctx for event outputDaniel Borkmann
This work addresses a couple of issues bpf_skb_event_output() helper currently has: i) We need two copies instead of just a single one for the skb data when it should be part of a sample. The data can be non-linear and thus needs to be extracted via bpf_skb_load_bytes() helper first, and then copied once again into the ring buffer slot. ii) Since bpf_skb_load_bytes() currently needs to be used first, the helper needs to see a constant size on the passed stack buffer to make sure BPF verifier can do sanity checks on it during verification time. Thus, just passing skb->len (or any other non-constant value) wouldn't work, but changing bpf_skb_load_bytes() is also not the proper solution, since the two copies are generally still needed. iii) bpf_skb_load_bytes() is just for rather small buffers like headers, since they need to sit on the limited BPF stack anyway. Instead of working around in bpf_skb_load_bytes(), this work improves the bpf_skb_event_output() helper to address all 3 at once. We can make use of the passed in skb context that we have in the helper anyway, and use some of the reserved flag bits as a length argument. The helper will use the new __output_custom() facility from perf side with bpf_skb_copy() as callback helper to walk and extract the data. It will pass the data for setup to bpf_event_output(), which generates and pushes the raw record with an additional frag part. The linear data used in the first frag of the record serves as programmatically defined meta data passed along with the appended sample. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-15bpf, perf: split bpf_perf_event_outputDaniel Borkmann
Split the bpf_perf_event_output() helper as a preparation into two parts. The new bpf_perf_event_output() will prepare the raw record itself and test for unknown flags from BPF trace context, where the __bpf_perf_event_output() does the core work. The latter will be reused later on from bpf_event_output() directly. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-15perf, events: add non-linear data support for raw recordsDaniel Borkmann
This patch adds support for non-linear data on raw records. It extends raw records to have one or multiple fragments that will be written linearly into the ring slot, where each fragment can optionally have a custom callback handler to walk and extract complex, possibly non-linear data. If a callback handler is provided for a fragment, then the new __output_custom() will be used instead of __output_copy() for the perf_output_sample() part. perf_prepare_sample() does all the size calculation only once, so perf_output_sample() doesn't need to redo the same work anymore, meaning real_size and padding will be cached in the raw record. The raw record becomes 32 bytes in size without holes; to not increase it further and to avoid doing unnecessary recalculations in fast-path, we can reuse next pointer of the last fragment, idea here is borrowed from ZERO_OR_NULL_PTR(), which should keep the perf_output_sample() path for PERF_SAMPLE_RAW minimal. This facility is needed for BPF's event output helper as a first user that will, in a follow-up, add an additional perf_raw_frag to its perf_raw_record in order to be able to more efficiently dump skb context after a linear head meta data related to it. skbs can be non-linear and thus need a custom output function to dump buffers. Currently, the skb data needs to be copied twice; with the help of __output_custom() this work only needs to be done once. Future users could be things like XDP/BPF programs that work on different context though and would thus also have a different callback function. The few users of raw records are adapted to initialize their frag data from the raw record itself, no change in behavior for them. The code is based upon a PoC diff provided by Peter Zijlstra [1]. [1] http://thread.gmane.org/gmane.linux.network/421294 Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-15x86 / hibernate: Use hlt_play_dead() when resuming from hibernationRafael J. Wysocki
On Intel hardware, native_play_dead() uses mwait_play_dead() by default and only falls back to the other methods if that fails. That also happens during resume from hibernation, when the restore (boot) kernel runs disable_nonboot_cpus() to take all of the CPUs except for the boot one offline. However, that is problematic, because the address passed to __monitor() in mwait_play_dead() is likely to be written to in the last phase of hibernate image restoration and that causes the "dead" CPU to start executing instructions again. Unfortunately, the page containing the address in that CPU's instruction pointer may not be valid any more at that point. First, that page may have been overwritten with image kernel memory contents already, so the instructions the CPU attempts to execute may simply be invalid. Second, the page tables previously used by that CPU may have been overwritten by image kernel memory contents, so the address in its instruction pointer is impossible to resolve then. A report from Varun Koyyalagunta and investigation carried out by Chen Yu show that the latter sometimes happens in practice. To prevent it from happening, temporarily change the smp_ops.play_dead pointer during resume from hibernation so that it points to a special "play dead" routine which uses hlt_play_dead() and avoids the inadvertent "revivals" of "dead" CPUs this way. A slightly unpleasant consequence of this change is that if the system is hibernated with one or more CPUs offline, it will generally draw more power after resume than it did before hibernation, because the physical state entered by CPUs via hlt_play_dead() is higher-power than the mwait_play_dead() one in the majority of cases. It is possible to work around this, but it is unclear how much of a problem that's going to be in practice, so the workaround will be implemented later if it turns out to be necessary. Link: https://bugzilla.kernel.org/show_bug.cgi?id=106371 Reported-by: Varun Koyyalagunta <cpudebug@centtech.com> Original-by: Chen Yu <yu.c.chen@intel.com> Tested-by: Chen Yu <yu.c.chen@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Ingo Molnar <mingo@kernel.org>
2016-07-15cgroupns: Only allow creation of hierarchies in the initial cgroup namespaceEric W. Biederman
Unprivileged users can't use hierarchies if they create them as they do not have privilieges to the root directory. Which means the only thing a hiearchy created by an unprivileged user is good for is expanding the number of cgroup links in every css_set, which is a DOS attack. We could allow hierarchies to be created in namespaces in the initial user namespace. Unfortunately there is only a single namespace for the names of heirarchies, so that is likely to create more confusion than not. So do the simple thing and restrict hiearchy creation to the initial cgroup namespace. Cc: stable@vger.kernel.org Fixes: a79a908fd2b0 ("cgroup: introduce cgroup namespaces") Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2016-07-15cgroupns: Close race between cgroup_post_fork and copy_cgroup_nsEric W. Biederman
In most code paths involving cgroup migration cgroup_threadgroup_rwsem is taken. There are two exceptions: - remove_tasks_in_empty_cpuset calls cgroup_transfer_tasks - vhost_attach_cgroups_work calls cgroup_attach_task_all With cgroup_threadgroup_rwsem held it is guaranteed that cgroup_post_fork and copy_cgroup_ns will reference the same css_set from the process calling fork. Without such an interlock there process after fork could reference one css_set from it's new cgroup namespace and another css_set from task->cgroups, which semantically is nonsensical. Cc: stable@vger.kernel.org Fixes: a79a908fd2b0 ("cgroup: introduce cgroup namespaces") Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2016-07-15cgroupns: Fix the locking in copy_cgroup_nsEric W. Biederman
If "clone(CLONE_NEWCGROUP...)" is called it results in a nice lockdep valid splat. In __cgroup_proc_write the lock ordering is: cgroup_mutex -- through cgroup_kn_lock_live cgroup_threadgroup_rwsem In copy_process the guts of clone the lock ordering is: cgroup_threadgroup_rwsem -- through threadgroup_change_begin cgroup_mutex -- through copy_namespaces -- copy_cgroup_ns lockdep reports some a different call chains for the first ordering of cgroup_mutex and cgroup_threadgroup_rwsem but it is harder to trace. This is most definitely deadlock potential under the right circumstances. Fix this by by skipping the cgroup_mutex and making the locking in copy_cgroup_ns mirror the locking in cgroup_post_fork which also runs during fork under the cgroup_threadgroup_rwsem. Cc: stable@vger.kernel.org Fixes: a79a908fd2b0 ("cgroup: introduce cgroup namespaces") Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Tejun Heo <tj@kernel.org>