summaryrefslogtreecommitdiff
path: root/arch/x86/kernel
AgeCommit message (Collapse)Author
2015-06-23Merge tag 'pci-v4.2-changes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci Pull PCI updates from Bjorn Helgaas: "PCI changes for the v4.2 merge window: Enumeration - Move pci_ari_enabled() to global header (Alex Williamson) - Account for ARI in _PRT lookups (Alex Williamson) - Remove unused pci_scan_bus_parented() (Yijing Wang) Resource management - Use host bridge _CRS info on systems with >32 bit addressing (Bjorn Helgaas) - Use host bridge _CRS info on Foxconn K8M890-8237A (Bjorn Helgaas) - Fix pci_address_to_pio() conversion of CPU address to I/O port (Zhichang Yuan) - Add pci_bus_addr_t (Yinghai Lu) PCI device hotplug - Wait for pciehp command completion where necessary (Alex Williamson) - Drop pointless ACPI-based "slot detection" check (Rafael J. Wysocki) - Check ignore_hotplug for all downstream devices (Rafael J. Wysocki) - Propagate the "ignore hotplug" setting to parent (Rafael J. Wysocki) - Inline pciehp "handle event" functions into the ISR (Bjorn Helgaas) - Clean up pciehp debug logging (Bjorn Helgaas) Power management - Remove redundant PCIe port type checking (Yijing Wang) - Add dev->has_secondary_link to track downstream PCIe links (Yijing Wang) - Use dev->has_secondary_link to find downstream links for ASPM (Yijing Wang) - Drop __pci_disable_link_state() useless "force" parameter (Bjorn Helgaas) - Simplify Clock Power Management setting (Bjorn Helgaas) Virtualization - Add ACS quirks for Intel 9-series PCH root ports (Alex Williamson) - Add function 1 DMA alias quirk for Marvell 9120 (Sakari Ailus) MSI - Disable MSI at enumeration even if kernel doesn't support MSI (Michael S. Tsirkin) - Remove unused pci_msi_off() (Bjorn Helgaas) - Rename msi_set_enable(), msix_clear_and_set_ctrl() (Michael S. Tsirkin) - Export pci_msi_set_enable(), pci_msix_clear_and_set_ctrl() (Michael S. Tsirkin) - Drop pci_msi_off() calls during probe (Michael S. Tsirkin) APM X-Gene host bridge driver - Add APM X-Gene v1 PCIe MSI/MSIX termination driver (Duc Dang) - Add APM X-Gene PCIe MSI DTS nodes (Duc Dang) - Disable Configuration Request Retry Status for v1 silicon (Duc Dang) - Allow config access to Root Port even when link is down (Duc Dang) Broadcom iProc host bridge driver - Allow override of device tree IRQ mapping function (Hauke Mehrtens) - Add BCMA PCIe driver (Hauke Mehrtens) - Directly add PCI resources (Hauke Mehrtens) - Free resource list after registration (Hauke Mehrtens) Freescale i.MX6 host bridge driver - Add speed change timeout message (Troy Kisky) - Rename imx6_pcie_start_link() to imx6_pcie_establish_link() (Bjorn Helgaas) Freescale Layerscape host bridge driver - Use dw_pcie_link_up() consistently (Bjorn Helgaas) - Factor out ls_pcie_establish_link() (Bjorn Helgaas) Marvell MVEBU host bridge driver - Remove mvebu_pcie_scan_bus() (Yijing Wang) NVIDIA Tegra host bridge driver - Remove tegra_pcie_scan_bus() (Yijing Wang) Synopsys DesignWare host bridge driver - Consolidate outbound iATU programming functions (Jisheng Zhang) - Use iATU0 for cfg and IO, iATU1 for MEM (Jisheng Zhang) - Add support for x8 links (Zhou Wang) - Wait for link to come up with consistent style (Bjorn Helgaas) - Use pci_scan_root_bus() for simplicity (Yijing Wang) TI DRA7xx host bridge driver - Use dw_pcie_link_up() consistently (Bjorn Helgaas) Miscellaneous - Include <linux/pci.h>, not <asm/pci.h> (Bjorn Helgaas) - Remove unnecessary #includes of <asm/pci.h> (Bjorn Helgaas) - Remove unused pcibios_select_root() (again) (Bjorn Helgaas) - Remove unused pci_dma_burst_advice() (Bjorn Helgaas) - xen/pcifront: Don't use deprecated function pci_scan_bus_parented() (Arnd Bergmann)" * tag 'pci-v4.2-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (58 commits) PCI: pciehp: Inline the "handle event" functions into the ISR PCI: pciehp: Rename queue_interrupt_event() to pciehp_queue_interrupt_event() PCI: pciehp: Make queue_interrupt_event() void PCI: xgene: Allow config access to Root Port even when link is down PCI: xgene: Disable Configuration Request Retry Status for v1 silicon PCI: pciehp: Clean up debug logging x86/PCI: Use host bridge _CRS info on systems with >32 bit addressing PCI: imx6: Add #define PCIE_RC_LCSR PCI: imx6: Use "u32", not "uint32_t" PCI: Remove unused pci_scan_bus_parented() xen/pcifront: Don't use deprecated function pci_scan_bus_parented() PCI: imx6: Add speed change timeout message PCI/ASPM: Simplify Clock Power Management setting PCI: designware: Wait for link to come up with consistent style PCI: layerscape: Factor out ls_pcie_establish_link() PCI: layerscape: Use dw_pcie_link_up() consistently PCI: dra7xx: Use dw_pcie_link_up() consistently x86/PCI: Use host bridge _CRS info on Foxconn K8M890-8237A PCI: pciehp: Wait for hotplug command completion where necessary PCI: Remove unused pci_dma_burst_advice() ...
2015-06-22Merge branch 'timers-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer updates from Thomas Gleixner: "A rather largish update for everything time and timer related: - Cache footprint optimizations for both hrtimers and timer wheel - Lower the NOHZ impact on systems which have NOHZ or timer migration disabled at runtime. - Optimize run time overhead of hrtimer interrupt by making the clock offset updates smarter - hrtimer cleanups and removal of restrictions to tackle some problems in sched/perf - Some more leap second tweaks - Another round of changes addressing the 2038 problem - First step to change the internals of clock event devices by introducing the necessary infrastructure - Allow constant folding for usecs/msecs_to_jiffies() - The usual pile of clockevent/clocksource driver updates The hrtimer changes contain updates to sched, perf and x86 as they depend on them plus changes all over the tree to cleanup API changes and redundant code, which got copied all over the place. The y2038 changes touch s390 to remove the last non 2038 safe code related to boot/persistant clock" * 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (114 commits) clocksource: Increase dependencies of timer-stm32 to limit build wreckage timer: Minimize nohz off overhead timer: Reduce timer migration overhead if disabled timer: Stats: Simplify the flags handling timer: Replace timer base by a cpu index timer: Use hlist for the timer wheel hash buckets timer: Remove FIFO "guarantee" timers: Sanitize catchup_timer_jiffies() usage hrtimer: Allow hrtimer::function() to free the timer seqcount: Introduce raw_write_seqcount_barrier() seqcount: Rename write_seqcount_barrier() hrtimer: Fix hrtimer_is_queued() hole hrtimer: Remove HRTIMER_STATE_MIGRATE selftest: Timers: Avoid signal deadlock in leap-a-day timekeeping: Copy the shadow-timekeeper over the real timekeeper last clockevents: Check state instead of mode in suspend/resume path selftests: timers: Add leap-second timer edge testing to leap-a-day.c ntp: Do leapsecond adjustment in adjtimex read path time: Prevent early expiry of hrtimers[CLOCK_REALTIME] at the leap second edge ntp: Introduce and use SECS_PER_DAY macro instead of 86400 ...
2015-06-22Merge branch 'x86-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 core updates from Ingo Molnar: "There were so many changes in the x86/asm, x86/apic and x86/mm topics in this cycle that the topical separation of -tip broke down somewhat - so the result is a more traditional architecture pull request, collected into the 'x86/core' topic. The topics were still maintained separately as far as possible, so bisectability and conceptual separation should still be pretty good - but there were a handful of merge points to avoid excessive dependencies (and conflicts) that would have been poorly tested in the end. The next cycle will hopefully be much more quiet (or at least will have fewer dependencies). The main changes in this cycle were: * x86/apic changes, with related IRQ core changes: (Jiang Liu, Thomas Gleixner) - This is the second and most intrusive part of changes to the x86 interrupt handling - full conversion to hierarchical interrupt domains: [IOAPIC domain] ----- | [MSI domain] --------[Remapping domain] ----- [ Vector domain ] | (optional) | [HPET MSI domain] ----- | | [DMAR domain] ----------------------------- | [Legacy domain] ----------------------------- This now reflects the actual hardware and allowed us to distangle the domain specific code from the underlying parent domain, which can be optional in the case of interrupt remapping. It's a clear separation of functionality and removes quite some duct tape constructs which plugged the remap code between ioapic/msi/hpet and the vector management. - Intel IOMMU IRQ remapping enhancements, to allow direct interrupt injection into guests (Feng Wu) * x86/asm changes: - Tons of cleanups and small speedups, micro-optimizations. This is in preparation to move a good chunk of the low level entry code from assembly to C code (Denys Vlasenko, Andy Lutomirski, Brian Gerst) - Moved all system entry related code to a new home under arch/x86/entry/ (Ingo Molnar) - Removal of the fragile and ugly CFI dwarf debuginfo annotations. Conversion to C will reintroduce many of them - but meanwhile they are only getting in the way, and the upstream kernel does not rely on them (Ingo Molnar) - NOP handling refinements. (Borislav Petkov) * x86/mm changes: - Big PAT and MTRR rework: making the code more robust and preparing to phase out exposing direct MTRR interfaces to drivers - in favor of using PAT driven interfaces (Toshi Kani, Luis R Rodriguez, Borislav Petkov) - New ioremap_wt()/set_memory_wt() interfaces to support Write-Through cached memory mappings. This is especially important for good performance on NVDIMM hardware (Toshi Kani) * x86/ras changes: - Add support for deferred errors on AMD (Aravind Gopalakrishnan) This is an important RAS feature which adds hardware support for poisoned data. That means roughly that the hardware marks data which it has detected as corrupted but wasn't able to correct, as poisoned data and raises an APIC interrupt to signal that in the form of a deferred error. It is the OS's responsibility then to take proper recovery action and thus prolonge system lifetime as far as possible. - Add support for Intel "Local MCE"s: upcoming CPUs will support CPU-local MCE interrupts, as opposed to the traditional system- wide broadcasted MCE interrupts (Ashok Raj) - Misc cleanups (Borislav Petkov) * x86/platform changes: - Intel Atom SoC updates ... and lots of other cleanups, fixlets and other changes - see the shortlog and the Git log for details" * 'x86-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (222 commits) x86/hpet: Use proper hpet device number for MSI allocation x86/hpet: Check for irq==0 when allocating hpet MSI interrupts x86/mm/pat, drivers/infiniband/ipath: Use arch_phys_wc_add() and require PAT disabled x86/mm/pat, drivers/media/ivtv: Use arch_phys_wc_add() and require PAT disabled x86/platform/intel/baytrail: Add comments about why we disabled HPET on Baytrail genirq: Prevent crash in irq_move_irq() genirq: Enhance irq_data_to_desc() to support hierarchy irqdomain iommu, x86: Properly handle posted interrupts for IOMMU hotplug iommu, x86: Provide irq_remapping_cap() interface iommu, x86: Setup Posted-Interrupts capability for Intel iommu iommu, x86: Add cap_pi_support() to detect VT-d PI capability iommu, x86: Avoid migrating VT-d posted interrupts iommu, x86: Save the mode (posted or remapped) of an IRTE iommu, x86: Implement irq_set_vcpu_affinity for intel_ir_chip iommu: dmar: Provide helper to copy shared irte fields iommu: dmar: Extend struct irte for VT-d Posted-Interrupts iommu: Add new member capability to struct irq_remap_ops x86/asm/entry/64: Disentangle error_entry/exit gsbase/ebx/usermode code x86/asm/entry/32: Shorten __audit_syscall_entry() args preparation x86/asm/entry/32: Explain reloading of registers after __audit_syscall_entry() ...
2015-06-22Merge branch 'x86-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 warning fixlet from Ingo Molnar: "A build fix for certain (rare) variants of binutils that did not make it into v4.1" * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/boot: Fix overflow warning with 32-bit binutils
2015-06-22Merge branch 'x86-microcode-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pul x86 microcode updates from Ingo Molnar: "x86 microcode loader updates from Borislav Petkov: - early parsing of the built-in microcode - cleanups - misc smaller fixes" * 'x86-microcode-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/microcode: Correct CPU family related variable types x86/microcode: Disable builtin microcode loading on 32-bit for now x86/microcode/intel: Rename get_matching_sig() x86/microcode/intel: Simplify get_matching_sig() x86/microcode/intel: Simplify update_match_cpu() x86/microcode/intel: Rename get_matching_microcode x86/cpu/microcode: Zap changelog x86/microcode: Parse built-in microcode early x86/microcode/intel: Remove unused @rev arg of get_matching_sig() x86/microcode/intel: Get rid of revision_is_newer()
2015-06-22Merge branch 'x86-kdump-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 kdump updates from Ingo Molnar: "Three kdump robustness related improvements (Joerg Roedel)" * 'x86-kdump-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/crash: Allocate enough low memory when crashkernel=high x86/swiotlb: Try coherent allocations with __GFP_NOWARN swiotlb: Warn on allocation failure in swiotlb_alloc_coherent()
2015-06-22Merge branch 'x86-fpu-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 FPU updates from Ingo Molnar: "This tree contains two main changes: - The big FPU code rewrite: wide reaching cleanups and reorganization that pulls all the FPU code together into a clean base in arch/x86/fpu/. The resulting code is leaner and faster, and much easier to understand. This enables future work to further simplify the FPU code (such as removing lazy FPU restores). By its nature these changes have a substantial regression risk: FPU code related bugs are long lived, because races are often subtle and bugs mask as user-space failures that are difficult to track back to kernel side backs. I'm aware of no unfixed (or even suspected) FPU related regression so far. - MPX support rework/fixes. As this is still not a released CPU feature, there were some buglets in the code - should be much more robust now (Dave Hansen)" * 'x86-fpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (250 commits) x86/fpu: Fix double-increment in setup_xstate_features() x86/mpx: Allow 32-bit binaries on 64-bit kernels again x86/mpx: Do not count MPX VMAs as neighbors when unmapping x86/mpx: Rewrite the unmap code x86/mpx: Support 32-bit binaries on 64-bit kernels x86/mpx: Use 32-bit-only cmpxchg() for 32-bit apps x86/mpx: Introduce new 'directory entry' to 'addr' helper function x86/mpx: Add temporary variable to reduce masking x86: Make is_64bit_mm() widely available x86/mpx: Trace allocation of new bounds tables x86/mpx: Trace the attempts to find bounds tables x86/mpx: Trace entry to bounds exception paths x86/mpx: Trace #BR exceptions x86/mpx: Introduce a boot-time disable flag x86/mpx: Restrict the mmap() size check to bounds tables x86/mpx: Remove redundant MPX_BNDCFG_ADDR_MASK x86/mpx: Clean up the code by not passing a task pointer around when unnecessary x86/mpx: Use the new get_xsave_field_ptr()API x86/fpu/xstate: Wrap get_xsave_addr() to make it safer x86/fpu/xstate: Fix up bad get_xsave_addr() assumptions ...
2015-06-22Merge branch 'x86-cpu-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 CPU features from Ingo Molnar: "Various CPU feature support related changes: in particular the /proc/cpuinfo model name sanitization change should be monitored, it has a chance to break stuff. (but really shouldn't and there are no regression reports)" * 'x86-cpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/cpu/amd: Give access to the number of nodes in a physical package x86/cpu: Trim model ID whitespace x86/cpu: Strip any /proc/cpuinfo model name field whitespace x86/cpu/amd: Set X86_FEATURE_EXTD_APICID for future processors x86/gart: Check for GART support before accessing GART registers
2015-06-22Merge branch 'x86-cleanups-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 cleanups from Ingo Molnar: "Misc cleanups" * 'x86-cleanups-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/mm: Clean up types in xlate_dev_mem_ptr() some more x86: Deinline dma_free_attrs() x86: Deinline dma_alloc_attrs() x86: Remove unused TI_cpu x86: Merge common 32-bit values in asm-offsets.c
2015-06-22Merge branch 'sched-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: "The main changes are: - lockless wakeup support for futexes and IPC message queues (Davidlohr Bueso, Peter Zijlstra) - Replace spinlocks with atomics in thread_group_cputimer(), to improve scalability (Jason Low) - NUMA balancing improvements (Rik van Riel) - SCHED_DEADLINE improvements (Wanpeng Li) - clean up and reorganize preemption helpers (Frederic Weisbecker) - decouple page fault disabling machinery from the preemption counter, to improve debuggability and robustness (David Hildenbrand) - SCHED_DEADLINE documentation updates (Luca Abeni) - topology CPU masks cleanups (Bartosz Golaszewski) - /proc/sched_debug improvements (Srikar Dronamraju)" * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (79 commits) sched/deadline: Remove needless parameter in dl_runtime_exceeded() sched: Remove superfluous resetting of the p->dl_throttled flag sched/deadline: Drop duplicate init_sched_dl_class() declaration sched/deadline: Reduce rq lock contention by eliminating locking of non-feasible target sched/deadline: Make init_sched_dl_class() __init sched/deadline: Optimize pull_dl_task() sched/preempt: Add static_key() to preempt_notifiers sched/preempt: Fix preempt notifiers documentation about hlist_del() within unsafe iteration sched/stop_machine: Fix deadlock between multiple stop_two_cpus() sched/debug: Add sum_sleep_runtime to /proc/<pid>/sched sched/debug: Replace vruntime with wait_sum in /proc/sched_debug sched/debug: Properly format runnable tasks in /proc/sched_debug sched/numa: Only consider less busy nodes as numa balancing destinations Revert 095bebf61a46 ("sched/numa: Do not move past the balance point if unbalanced") sched/fair: Prevent throttling in early pick_next_task_fair() preempt: Reorganize the notrace definitions a bit preempt: Use preempt_schedule_context() as the official tracing preemption point sched: Make preempt_schedule_context() function-tracing safe x86: Remove cpu_sibling_mask() and cpu_core_mask() x86: Replace cpu_**_mask() with topology_**_cpumask() ...
2015-06-22Merge branch 'perf-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Ingo Molnar: "These are the left over fixes from the v4.1 cycle" * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf tools: Fix build breakage if prefix= is specified perf/x86: Honor the architectural performance monitoring version perf/x86/intel: Fix PMI handling for Intel PT perf/x86/intel/bts: Fix DS area sharing with x86_pmu events perf/x86: Add more Broadwell model numbers perf: Fix ring_buffer_attach() RCU sync, again
2015-06-22Merge branch 'perf-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf updates from Ingo Molnar: "Kernel side changes mostly consist of work on x86 PMU drivers: - x86 Intel PT (hardware CPU tracer) improvements (Alexander Shishkin) - x86 Intel CQM (cache quality monitoring) improvements (Thomas Gleixner) - x86 Intel PEBSv3 support (Peter Zijlstra) - x86 Intel PEBS interrupt batching support for lower overhead sampling (Zheng Yan, Kan Liang) - x86 PMU scheduler fixes and improvements (Peter Zijlstra) There's too many tooling improvements to list them all - here are a few select highlights: 'perf bench': - Introduce new 'perf bench futex' benchmark: 'wake-parallel', to measure parallel waker threads generating contention for kernel locks (hb->lock). (Davidlohr Bueso) 'perf top', 'perf report': - Allow disabling/enabling events dynamicaly in 'perf top': a 'perf top' session can instantly become a 'perf report' one, i.e. going from dynamic analysis to a static one, returning to a dynamic one is possible, to toogle the modes, just press 'f' to 'freeze/unfreeze' the sampling. (Arnaldo Carvalho de Melo) - Make Ctrl-C stop processing on TUI, allowing interrupting the load of big perf.data files (Namhyung Kim) 'perf probe': (Masami Hiramatsu) - Support glob wildcards for function name - Support $params special probe argument: Collect all function arguments - Make --line checks validate C-style function name. - Add --no-inlines option to avoid searching inline functions - Greatly speed up 'perf probe --list' by caching debuginfo. - Improve --filter support for 'perf probe', allowing using its arguments on other commands, as --add, --del, etc. 'perf sched': - Add option in 'perf sched' to merge like comms to lat output (Josef Bacik) Plus tons of infrastructure work - in particular preparation for upcoming threaded perf report support, but also lots of other work - and fixes and other improvements. See (much) more details in the shortlog and in the git log" * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (305 commits) perf tools: Configurable per thread proc map processing time out perf tools: Add time out to force stop proc map processing perf report: Fix sort__sym_cmp to also compare end of symbol perf hists browser: React to unassigned hotkey pressing perf top: Tell the user how to unfreeze events after pressing 'f' perf hists browser: Honour the help line provided by builtin-{top,report}.c perf hists browser: Do not exit when 'f' is pressed in 'report' mode perf top: Replace CTRL+z with 'f' as hotkey for enable/disable events perf annotate: Rename source_line_percent to source_line_samples perf annotate: Display total number of samples with --show-total-period perf tools: Ensure thread-stack is flushed perf top: Allow disabling/enabling events dynamicly perf evlist: Add toggle_enable() method perf trace: Fix race condition at the end of started workloads perf probe: Speed up perf probe --list by caching debuginfo perf probe: Show usage even if the last event is skipped perf tools: Move libtraceevent dynamic list to separated LDFLAGS variable perf tools: Fix a problem when opening old perf.data with different byte order perf tools: Ignore .config-detected in .gitignore perf probe: Fix to return error if no probe is added ...
2015-06-22Merge branch 'locking-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking updates from Ingo Molnar: "The main changes are: - 'qspinlock' support, enabled on x86: queued spinlocks - these are now the spinlock variant used by x86 as they outperform ticket spinlocks in every category. (Waiman Long) - 'pvqspinlock' support on x86: paravirtualized variant of queued spinlocks. (Waiman Long, Peter Zijlstra) - 'qrwlock' support, enabled on x86: queued rwlocks. Similar to queued spinlocks, they are now the variant used by x86: CONFIG_ARCH_USE_QUEUED_SPINLOCKS=y CONFIG_QUEUED_SPINLOCKS=y CONFIG_ARCH_USE_QUEUED_RWLOCKS=y CONFIG_QUEUED_RWLOCKS=y - various lockdep fixlets - various locking primitives cleanups, further WRITE_ONCE() propagation" * 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits) locking/lockdep: Remove hard coded array size dependency locking/qrwlock: Don't contend with readers when setting _QW_WAITING lockdep: Do not break user-visible string locking/arch: Rename set_mb() to smp_store_mb() locking/arch: Add WRITE_ONCE() to set_mb() rtmutex: Warn if trylock is called from hard/softirq context arch: Remove __ARCH_HAVE_CMPXCHG locking/rtmutex: Drop usage of __HAVE_ARCH_CMPXCHG locking/qrwlock: Rename QUEUE_RWLOCK to QUEUED_RWLOCKS locking/pvqspinlock: Rename QUEUED_SPINLOCK to QUEUED_SPINLOCKS locking/pvqspinlock: Replace xchg() by the more descriptive set_mb() locking/pvqspinlock, x86: Enable PV qspinlock for Xen locking/pvqspinlock, x86: Enable PV qspinlock for KVM locking/pvqspinlock, x86: Implement the paravirt qspinlock call patching locking/pvqspinlock: Implement simple paravirt support for the qspinlock locking/qspinlock: Revert to test-and-set on hypervisors locking/qspinlock: Use a simple write to grab the lock locking/qspinlock: Optimize for smaller NR_CPUS locking/qspinlock: Extract out code snippets for the next patch locking/qspinlock: Add pending bit ...
2015-06-22Merge branch 'core-rcu-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull RCU updates from Ingo Molnar: - Continued initialization/Kconfig updates: hide most Kconfig options from unsuspecting users. There's now a single high level configuration option: * * RCU Subsystem * Make expert-level adjustments to RCU configuration (RCU_EXPERT) [N/y/?] (NEW) Which if answered in the negative, leaves us with a single interactive configuration option: Offload RCU callback processing from boot-selected CPUs (RCU_NOCB_CPU) [N/y/?] (NEW) All the rest of the RCU options are configured automatically. Later on we'll remove this single leftover configuration option as well. - Remove all uses of RCU-protected array indexes: replace the rcu_[access|dereference]_index_check() APIs with READ_ONCE() and rcu_lockdep_assert() - RCU CPU-hotplug cleanups - Updates to Tiny RCU: a race fix and further code shrinkage. - RCU torture-testing updates: fixes, speedups, cleanups and documentation updates. - Miscellaneous fixes - Documentation updates * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (60 commits) rcutorture: Allow repetition factors in Kconfig-fragment lists rcutorture: Display "make oldconfig" errors rcutorture: Update TREE_RCU-kconfig.txt rcutorture: Make rcutorture scripts force RCU_EXPERT rcutorture: Update configuration fragments for rcutree.rcu_fanout_exact rcutorture: TASKS_RCU set directly, so don't explicitly set it rcutorture: Test SRCU cleanup code path rcutorture: Replace barriers with smp_store_release() and smp_load_acquire() locktorture: Change longdelay_us to longdelay_ms rcutorture: Allow negative values of nreaders to oversubscribe rcutorture: Exchange TREE03 and TREE08 NR_CPUS, speed up CPU hotplug rcutorture: Exchange TREE03 and TREE04 geometries locktorture: fix deadlock in 'rw_lock_irq' type rcu: Correctly handle non-empty Tiny RCU callback list with none ready rcutorture: Test both RCU-sched and RCU-bh for Tiny RCU rcu: Further shrink Tiny RCU by making empty functions static inlines rcu: Conditionally compile RCU's eqs warnings rcu: Remove prompt for RCU implementation rcu: Make RCU able to tolerate undefined CONFIG_RCU_KTHREAD_PRIO rcu: Make RCU able to tolerate undefined CONFIG_RCU_FANOUT_LEAF ...
2015-06-22Merge branches 'x86/apic', 'x86/asm', 'x86/mm' and 'x86/platform' into ↵Ingo Molnar
x86/core, to merge last updates Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-06-21x86/hpet: Use proper hpet device number for MSI allocationThomas Gleixner
hpet_assign_irq() is called with hpet_device->num as "hardware interrupt number", but hpet_device->num is initialized after the interrupt has been assigned, so it's always 0. As a consequence only the first MSI allocation succeeds, the following ones fail because the "hardware interrupt number" already exists. Move the initialization of dev->num and other fields before the call to hpet_assign_irq(), which is the ordering before the offending commit which introduced that regression. Fixes: "3cb96f0c9733 x86/hpet: Enhance HPET IRQ to support hierarchical irqdomains" Reported-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1506211635010.4107@nanos Cc: Jiang Liu <jiang.liu@linux.intel.com> Cc: Borislav Petkov <bp@alien8.de>
2015-06-20x86/hpet: Check for irq==0 when allocating hpet MSI interruptsJiang Liu
irq == 0 is not a valid irq for a irqdomain MSI allocation, but hpet code checks only for negative return values. Reported-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Borislav Petkov <bp@alien8.de> Link: http://lkml.kernel.org/r/558447AF.30703@linux.intel.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-06-19x86/boot: Fix overflow warning with 32-bit binutilsBorislav Petkov
When building the kernel with 32-bit binutils built with support only for the i386 target, we get the following warning: arch/x86/kernel/head_32.S:66: Warning: shift count out of range (32 is not between 0 and 31) The problem is that in that case, binutils' internal type representation is 32-bit wide and the shift range overflows. In order to fix this, manipulate the shift expression which creates the 4GiB constant to not overflow the shift count. Suggested-by: Michael Matz <matz@suse.de> Reported-and-tested-by: Enrico Mioso <mrkiko.rs@gmail.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Borislav Petkov <bp@alien8.de> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-06-19perf/x86: Honor the architectural performance monitoring versionPalik, Imre
Architectural performance monitoring, version 1, doesn't support fixed counters. Currently, even if a hypervisor advertises support for architectural performance monitoring version 1, perf may still try to use the fixed counters, as the constraints are set up based on the CPU model. This patch ensures that perf honors the architectural performance monitoring version returned by CPUID, and it only uses the fixed counters for version 2 and above. (Some of the ideas in this patch came from Peter Zijlstra.) Signed-off-by: Imre Palik <imrep@amazon.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Anthony Liguori <aliguori@amazon.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1433767609-1039-1-git-send-email-imrep.amz@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-06-19perf/x86/intel: Fix PMI handling for Intel PTAlexander Shishkin
Intel PT is a separate PMU and it is not using any of the x86_pmu code paths, which means in particular that the active_events counter remains intact when new PT events are created. However, PT uses the generic x86_pmu PMI handler for its PMI handling needs. The problem here is that the latter checks active_events and in case of it being zero, exits without calling the actual x86_pmu.handle_nmi(), which results in unknown NMI errors and massive data loss for PT. The effect is not visible if there are other perf events in the system at the same time that keep active_events counter non-zero, for instance if the NMI watchdog is running, so one needs to disable it to reproduce the problem. At the same time, the active_events counter besides doing what the name suggests also implicitly serves as a PMC hardware and DS area reference counter. This patch adds a separate reference counter for the PMC hardware, leaving active_events for actually counting the events and makes sure it also counts PT and BTS events. Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: acme@infradead.org Cc: adrian.hunter@intel.com Link: http://lkml.kernel.org/r/87k2v92t0s.fsf@ashishki-desk.ger.corp.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-06-19perf/x86/intel/bts: Fix DS area sharing with x86_pmu eventsAlexander Shishkin
Currently, the intel_bts driver relies on the DS area allocated by the x86_pmu code in its event_init() path, which is a bug: creating a BTS event while no x86_pmu events are present results in a NULL pointer dereference. The same DS area is also used by PEBS sampling, which makes it quite a bit trickier to have a separate one for intel_bts' purposes. This patch makes intel_bts driver use the same DS allocation and reference counting code as x86_pmu to make sure it is always present when either intel_bts or x86_pmu need it. Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: acme@infradead.org Cc: adrian.hunter@intel.com Link: http://lkml.kernel.org/r/1434024837-9916-2-git-send-email-alexander.shishkin@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-06-19perf/x86: Add more Broadwell model numbersAndi Kleen
This patch adds additional model numbers for Broadwell to perf. Support for Broadwell with Iris Pro (Intel Core i7-57xxC) and support for Broadwell Server Xeon. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1434055942-28253-1-git-send-email-andi@firstfloor.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-06-18x86/cpu/amd: Give access to the number of nodes in a physical packageAravind Gopalakrishnan
Stash the number of nodes in a physical processor package locally and add an accessor to be called by interested parties. The first user is the MCE injection module which uses it to find the node base core in a package for injecting a certain type of errors. Signed-off-by: Aravind Gopalakrishnan <Aravind.Gopalakrishnan@amd.com> [ Rewrote the commit message, merged it with the accessor patch and unified naming. ] Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Jacob Shin <jacob.w.shin@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-edac <linux-edac@vger.kernel.org> Cc: mchehab@osg.samsung.com Link: http://lkml.kernel.org/r/1433868317-18417-2-git-send-email-Aravind.Gopalakrishnan@amd.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-06-18x86/platform/intel/baytrail: Add comments about why we disabled HPET on BaytrailFeng Tang
This question has been asked many times, and finally I found the official document which explains the problem of HPET on Baytrail, that it will halt in deep idle states. Signed-off-by: Feng Tang <feng.tang@intel.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: john.stultz@linaro.org Cc: len.brown@intel.com Cc: matthew.lee@intel.com Link: http://lkml.kernel.org/r/1434361201-31743-1-git-send-email-feng.tang@intel.com [ Prettified things a bit. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-06-12x86/fpu: Fix double-increment in setup_xstate_features()Dave Hansen
I noticed that my MPX tracepoints were producing garbage for the lower and upper bounds: mpx_bounds_register_exception: address referenced: 0x00007fffffffccb7 bounds: lower: 0x0 ~upper: 0xffffffffffffffff mpx_bounds_register_exception: address referenced: 0x00007fffffffccbf bounds: lower: 0x0 ~upper: 0xffffffffffffffff This is, of course, bogus because 0x00007fffffffccbf is *within* the bounds. I assumed that my instruction decoder was bad and went looking at it. But I eventually realized that I was getting a '0' offset back from xstate_offsets[BNDREGS]. It was being skipped in the initialization, which is obviously bogus, so remove the extra leaf++. This also goes an initializes xstate_offsets/sizes[] to -1 so so that bugs like this will oops instead of silently failing in interesting ways. This was introduced by: 39f1acd ("x86/fpu/xstate: Don't assume the first zero xfeatures zero bit means the end") Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dave@sr71.net Link: http://lkml.kernel.org/r/20150611193400.2E0B00DB@viggo.jf.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-06-11x86/crash: Allocate enough low memory when crashkernel=highJoerg Roedel
When the crash kernel is loaded above 4GiB in memory, the first kernel allocates only 72MiB of low-memory for the DMA requirements of the second kernel. On systems with many devices this is not enough and causes device driver initialization errors and failed crash dumps. Testing by SUSE and Redhat has shown that 256MiB is a good default value for now and the discussion has lead to this value as well. So set this default value to 256MiB to make sure there is enough memory available for DMA. Signed-off-by: Joerg Roedel <jroedel@suse.de> [ Reflow comment. ] Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Baoquan He <bhe@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Young <dyoung@redhat.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Jörg Rödel <joro@8bytes.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: kexec@lists.infradead.org Link: http://lkml.kernel.org/r/1433500202-25531-4-git-send-email-joro@8bytes.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-06-11x86/swiotlb: Try coherent allocations with __GFP_NOWARNJoerg Roedel
When we boot a kdump kernel in high memory, there is by default only 72MB of low memory available. The swiotlb code takes 64MB of it (by default) so that there are only 8MB left to allocate from. On systems with many devices this causes page allocator warnings from dma_generic_alloc_coherent(): systemd-udevd: page allocation failure: order:0, mode:0x280d4 CPU: 0 PID: 197 Comm: systemd-udevd Tainted: G W 3.12.28-4-default #1 Hardware name: HP ProLiant DL980 G7, BIOS P66 07/30/2012 ffff8800781335e0 ffffffff8150b1db 00000000000280d4 ffffffff8113af90 0000000000000000 0000000000000000 ffff88007efdbb00 0000000100000000 0000000000000000 0000000000000000 0000000000000000 0000000000000001 Call Trace: dump_trace+0x7d/0x2d0 show_stack_log_lvl+0x94/0x170 show_stack+0x21/0x50 dump_stack+0x41/0x51 warn_alloc_failed+0xf0/0x160 __alloc_pages_slowpath+0x72f/0x796 __alloc_pages_nodemask+0x1ea/0x210 dma_generic_alloc_coherent+0x96/0x140 x86_swiotlb_alloc_coherent+0x1c/0x50 ttm_dma_pool_alloc_new_pages+0xab/0x320 [ttm] ttm_dma_populate+0x3ce/0x640 [ttm] ttm_tt_bind+0x36/0x60 [ttm] ttm_bo_handle_move_mem+0x55f/0x5c0 [ttm] ttm_bo_move_buffer+0x105/0x130 [ttm] ttm_bo_validate+0xc1/0x130 [ttm] ttm_bo_init+0x24b/0x400 [ttm] radeon_bo_create+0x16c/0x200 [radeon] radeon_ring_init+0x11e/0x2b0 [radeon] r100_cp_init+0x123/0x5b0 [radeon] r100_startup+0x194/0x230 [radeon] r100_init+0x223/0x410 [radeon] radeon_device_init+0x6af/0x830 [radeon] radeon_driver_load_kms+0x89/0x180 [radeon] drm_get_pci_dev+0x121/0x2f0 [drm] local_pci_probe+0x39/0x60 pci_device_probe+0xa9/0x120 driver_probe_device+0x9d/0x3d0 __driver_attach+0x8b/0x90 bus_for_each_dev+0x5b/0x90 bus_add_driver+0x1f8/0x2c0 driver_register+0x5b/0xe0 do_one_initcall+0xf2/0x1a0 load_module+0x1207/0x1c70 SYSC_finit_module+0x75/0xa0 system_call_fastpath+0x16/0x1b 0x7fac533d2788 After these warnings the code enters a fall-back path and allocated directly from the swiotlb aperture in the end. So remove these warnings as this is not a fatal error. Signed-off-by: Joerg Roedel <jroedel@suse.de> [ Simplify, reflow comment. ] Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Baoquan He <bhe@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Young <dyoung@redhat.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Jörg Rödel <joro@8bytes.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: kexec@lists.infradead.org Link: http://lkml.kernel.org/r/1433500202-25531-3-git-send-email-joro@8bytes.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-06-09x86: Make is_64bit_mm() widely availableDave Hansen
The uprobes code has a nice helper, is_64bit_mm(), that consults both the runtime and compile-time flags for 32-bit support. Instead of reinventing the wheel, pull it in to an x86 header so we can use it for MPX. I prefer passing the 'mm' around to test_thread_flag(TIF_IA32) because it makes it explicit where the context is coming from. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Dave Hansen <dave@sr71.net> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20150607183704.F0209999@viggo.jf.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-06-09x86/mpx: Trace #BR exceptionsDave Hansen
This is the first in a series of MPX tracing patches. I've found these extremely useful in the process of debugging applications and the kernel code itself. This exception hooks in to the bounds (#BR) exception very early and allows capturing the key registers which would influence how the exception is handled. Note that bndcfgu/bndstatus are technically still 64-bit registers even in 32-bit mode. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Dave Hansen <dave@sr71.net> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20150607183703.5FE2619A@viggo.jf.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-06-09x86/mpx: Introduce a boot-time disable flagDave Hansen
MPX has the _potential_ to cause some issues. Say part of your init system tried to protect one of its components from buffer overflows with MPX. If there were a false positive, it's possible that MPX could keep a system from booting. MPX could also potentially cause performance issues since it is present in hot paths like the unmap path. Allow it to be disabled at boot time. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Dave Hansen <dave@sr71.net> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20150607183702.2E8B77AB@viggo.jf.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-06-09x86/mpx: Clean up the code by not passing a task pointer around when unnecessaryDave Hansen
The MPX code can only work on the current task. You can not, for instance, enable MPX management in another process or thread. You can also not handle a fault for another process or thread. Despite this, we pass a task_struct around prolifically. This patch removes all of the task struct passing for code paths where the code can not deal with another task (which turns out to be all of them). This has no functional changes. It's just a cleanup. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Dave Hansen <dave@sr71.net> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: bp@alien8.de Link: http://lkml.kernel.org/r/20150607183702.6A81DA2C@viggo.jf.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-06-09x86/mpx: Use the new get_xsave_field_ptr()APIDave Hansen
The MPX registers (bndcsr/bndcfgu/bndstatus) are not directly accessible via normal instructions. They essentially act as if they were floating point registers and are saved/restored along with those registers. There are two main paths in the MPX code where we care about the contents of these registers: 1. #BR (bounds) faults 2. the prctl() code where we are setting MPX up Both of those paths _might_ be called without the FPU having been used. That means that 'tsk->thread.fpu.state' might never be allocated. Also, fpu_save_init() is not preempt-safe. It was a bug to call it without disabling preemption. The new get_xsave_addr() calls unlazy_fpu() instead and properly disables preemption. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Dave Hansen <dave@sr71.net> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Suresh Siddha <sbsiddha@gmail.com> Cc: bp@alien8.de Link: http://lkml.kernel.org/r/20150607183701.BC0D37CF@viggo.jf.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-06-09x86/fpu/xstate: Wrap get_xsave_addr() to make it saferDave Hansen
The MPX code appears is calling a low-level FPU function (copy_fpregs_to_fpstate()). This function is not able to be called in all contexts, although it is safe to call directly in some cases. Although probably correct, the current code is ugly and potentially error-prone. So, add a wrapper that calls the (slightly) higher-level fpu__save() (which is preempt- safe) and also ensures that we even *have* an FPU context (in the case that this was called when in lazy FPU mode). Ingo had this to say about the details about when we need preemption disabled: > it's indeed generally unsafe to access/copy FPU registers with preemption enabled, > for two reasons: > > - on older systems that use FSAVE the instruction destroys FPU register > contents, which has to be handled carefully > > - even on newer systems if we copy to FPU registers (which this code doesn't) > then we don't want a context switch to occur in the middle of it, because a > context switch will write to the fpstate, potentially overwriting our new data > with old FPU state. > > But it's safe to access FPU registers with preemption enabled in a couple of > special cases: > > - potentially destructively saving FPU registers: the signal handling code does > this in copy_fpstate_to_sigframe(), because it can rely on the signal restore > side to restore the original FPU state. > > - reading FPU registers on modern systems: we don't do this anywhere at the > moment, mostly to keep symmetry with older systems where FSAVE is > destructive. > > - initializing FPU registers on modern systems: fpu__clear() does this. Here > it's safe because we don't copy from the fpstate. > > - directly writing FPU registers from user-space memory (!). We do this in > fpu__restore_sig(), and it's safe because neither context switches nor > irq-handler FPU use can corrupt the source context of the copy (which is > user-space memory). > > Note that the MPX code's current use of copy_fpregs_to_fpstate() was safe I think, > because: > > - MPX is predicated on eagerfpu, so the destructive F[N]SAVE instruction won't be > used. > > - the code was only reading FPU registers, and was doing it only in places that > guaranteed that an FPU state was already active (i.e. didn't do it in > kthreads) Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Dave Hansen <dave@sr71.net> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Suresh Siddha <sbsiddha@gmail.com> Cc: bp@alien8.de Link: http://lkml.kernel.org/r/20150607183700.AA881696@viggo.jf.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-06-09x86/fpu/xstate: Fix up bad get_xsave_addr() assumptionsDave Hansen
get_xsave_addr() assumes that if an xsave bit is present in the hardware (pcntxt_mask) that it is present in a given xsave buffer. Due to an bug in the xsave code on all of the systems that have MPX (and thus all the users of this code), that has been a true assumption. But, the bug is getting fixed, so our assumption is not going to hold any more. It's quite possible (and normal) for an enabled state to be present on 'pcntxt_mask', but *not* in 'xstate_bv'. We need to consult 'xstate_bv'. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Dave Hansen <dave@sr71.net> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20150607183700.1E739B34@viggo.jf.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-06-09Revert "perf/x86/intel/uncore: Move uncore_box_init() out of driver ↵Ingo Molnar
initialization" This reverts commit c05199e5a57a579fea1e8fa65e2b511ceb524ffc. Vince Weaver reported the following crash while perf fuzzing: [ 79.473121] kernel BUG at mm/vmalloc.c:1335! [ 79.694391] Call Trace: [ 79.696997] <IRQ> [ 79.699090] [<ffffffff811b2130>] get_vm_area_caller+0x40/0x50 [ 79.705505] [<ffffffff81039f4d>] ? snb_uncore_imc_init_box+0x6d/0x90 [ 79.712414] [<ffffffff810635e5>] __ioremap_caller+0x195/0x350 [ 79.718610] [<ffffffff81039f4d>] ? snb_uncore_imc_init_box+0x6d/0x90 [ 79.725462] [<ffffffff81427f6b>] ? debug_object_activate+0x14b/0x1e0 [ 79.732346] [<ffffffff810637b7>] ioremap_nocache+0x17/0x20 [ 79.738283] [<ffffffff81039f4d>] snb_uncore_imc_init_box+0x6d/0x90 [ 79.744945] [<ffffffff81039cf7>] snb_uncore_imc_event_start+0xb7/0x110 [ 79.752020] [<ffffffff81039d97>] snb_uncore_imc_event_add+0x47/0x60 [ 79.758832] [<ffffffff81162cbb>] event_sched_in.isra.85+0xfb/0x330 [ 79.765519] [<ffffffff81162f5f>] group_sched_in+0x6f/0x1e0 [ 79.771481] [<ffffffff8101df1a>] ? native_sched_clock+0x2a/0x90 [ 79.777858] [<ffffffff811637bc>] __perf_event_enable+0x25c/0x2a0 [ 79.784418] [<ffffffff810f3e69>] ? tick_nohz_irq_exit+0x29/0x30 [ 79.790820] [<ffffffff8115ef30>] ? cpu_clock_event_start+0x40/0x40 [ 79.797546] [<ffffffff8115ef80>] remote_function+0x50/0x60 [ 79.803535] [<ffffffff810f8cd1>] flush_smp_call_function_queue+0x81/0x180 [ 79.810840] [<ffffffff810f9763>] generic_smp_call_function_single_interrupt+0x13/0x60 [ 79.819328] [<ffffffff8104b5e8>] smp_trace_call_function_single_interrupt+0x38/0xc0 [ 79.827614] [<ffffffff816de9be>] trace_call_function_single_interrupt+0x6e/0x80 [ 79.835465] <EOI> [ 79.837543] [<ffffffff8156e8b5>] ? cpuidle_enter_state+0x65/0x160 [ 79.844377] [<ffffffff8156e8a1>] ? cpuidle_enter_state+0x51/0x160 [ 79.851015] [<ffffffff8156e9e7>] cpuidle_enter+0x17/0x20 [ 79.856791] [<ffffffff810b6e39>] cpu_startup_entry+0x399/0x440 [ 79.863165] [<ffffffff816c9ddb>] rest_init+0xbb/0xd0 The offending commit is clearly confused as it moves heavy initialization work into IPI context. Revert it. Reported-by: Vince Weaver <vincent.weaver@maine.edu> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Kan Liang <kan.liang@intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Stephane Eranian <eranian@google.com> Cc: Yan, Zheng <zheng.z.yan@intel.com> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-06-08x86/asm/entry: (Re-)rename __NR_entry_INT80_compat_max to ↵Ingo Molnar
__NR_syscall_compat_max Brian Gerst noticed that I did a weird rename in the following commit: b2502b418e63 ("x86/asm/entry: Untangle 'system_call' into two entry points: entry_SYSCALL_64 and entry_INT80_32") which renamed __NR_ia32_syscall_max to __NR_entry_INT80_compat_max. Now the original name was a misnomer, but the new one is a misnomer as well, as all the 32-bit compat syscall entry points (sysenter, syscall) share the system call table, not just the INT80 based one. Rename it to __NR_syscall_compat_max. Reported-by: Brian Gerst <brgerst@gmail.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-06-08Merge branch 'x86/asm' into x86/core, to prepare for new patchIngo Molnar
Collect all changes to arch/x86/entry/entry_64.S, before applying patch that changes most of the file. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-06-08PCI: Remove unnecessary #includes of <asm/pci.h>Bjorn Helgaas
In include/linux/pci.h, we already #include <asm/pci.h>, so we don't need to include <asm/pci.h> directly. Remove the unnecessary includes. All the files here already include <linux/pci.h>. Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Acked-by: Simon Horman <horms+renesas@verge.net.au> # sh Acked-by: Ralf Baechle <ralf@linux-mips.org>
2015-06-08x86/asm/entry: Untangle 'system_call' into two entry points: ↵Ingo Molnar
entry_SYSCALL_64 and entry_INT80_32 The 'system_call' entry points differ starkly between native 32-bit and 64-bit kernels: on 32-bit kernels it defines the INT 0x80 entry point, while on 64-bit it's the SYSCALL entry point. This is pretty confusing when looking at generic code, and it also obscures the nature of the entry point at the assembly level. So unangle this by splitting the name into its two uses: system_call (32) -> entry_INT80_32 system_call (64) -> entry_SYSCALL_64 As per the generic naming scheme for x86 system call entry points: entry_MNEMONIC_qualifier where 'qualifier' is one of _32, _64 or _compat. Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-06-08x86/asm/entry: Untangle 'ia32_sysenter_target' into two entry points: ↵Ingo Molnar
entry_SYSENTER_32 and entry_SYSENTER_compat So the SYSENTER instruction is pretty quirky and it has different behavior depending on bitness and CPU maker. Yet we create a false sense of coherency by naming it 'ia32_sysenter_target' in both of the cases. Split the name into its two uses: ia32_sysenter_target (32) -> entry_SYSENTER_32 ia32_sysenter_target (64) -> entry_SYSENTER_compat As per the generic naming scheme for x86 system call entry points: entry_MNEMONIC_qualifier where 'qualifier' is one of _32, _64 or _compat. Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-06-08x86/asm/entry: Rename compat syscall entry pointsIngo Molnar
Rename the following system call entry points: ia32_cstar_target -> entry_SYSCALL_compat ia32_syscall -> entry_INT80_compat The generic naming scheme for x86 system call entry points is: entry_MNEMONIC_qualifier where 'qualifier' is one of _32, _64 or _compat. Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-06-07perf/x86/intel/pebs: Add PEBSv3 decodingPeter Zijlstra
PEBSv3 as present on Skylake fixed the long standing issue of the status bits. They now really reflect the events that generated the record. Tested-by: Andi Kleen <ak@linux.intel.com> Tested-by: Kan Liang <kan.liang@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-06-07perf/x86/intel: Introduce PERF_RECORD_LOST_SAMPLESKan Liang
After enlarging the PEBS interrupt threshold, there may be some mixed up PEBS samples which are discarded by the kernel. This patch makes the kernel emit a PERF_RECORD_LOST_SAMPLES record with the number of possible discarded records when it is impossible to demux the samples. It makes sure the user is not left in the dark about such discards. Signed-off-by: Kan Liang <kan.liang@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: acme@infradead.org Cc: eranian@google.com Link: http://lkml.kernel.org/r/1431285195-14269-8-git-send-email-kan.liang@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-06-07perf/intel/x86: Enlarge the PEBS bufferYan, Zheng
Currently the PEBS buffer size is 4k, it can only hold about 21 PEBS records. This patch enlarges the PEBS buffer size to 64k (the same as the BTS buffer). 64k memory can hold about 330 PEBS records. This will significantly reduce the number of PMIs when batched PEBS interrupts are enabled. Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com> Signed-off-by: Kan Liang <kan.liang@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: acme@infradead.org Cc: eranian@google.com Link: http://lkml.kernel.org/r/1430940834-8964-7-git-send-email-kan.liang@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-06-07perf/x86/intel: Drain the PEBS buffer during context switchesYan, Zheng
Flush the PEBS buffer during context switches if PEBS interrupt threshold is larger than one. This allows perf to supply TID for sample outputs. Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com> Signed-off-by: Kan Liang <kan.liang@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: acme@infradead.org Cc: eranian@google.com Link: http://lkml.kernel.org/r/1430940834-8964-6-git-send-email-kan.liang@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-06-07perf/x86/intel: Implement batched PEBS interrupt handling (large PEBS ↵Yan, Zheng
interrupt threshold) PEBS always had the capability to log samples to its buffers without an interrupt. Traditionally perf has not used this but always set the PEBS threshold to one. For frequently occurring events (like cycles or branches or load/store) this in term requires using a relatively high sampling period to avoid overloading the system, by only processing PMIs. This in term increases sampling error. For the common cases we still need to use the PMI because the PEBS hardware has various limitations. The biggest one is that it can not supply a callgraph. It also requires setting a fixed period, as the hardware does not support adaptive period. Another issue is that it cannot supply a time stamp and some other options. To supply a TID it requires flushing on context switch. It can however supply the IP, the load/store address, TSX information, registers, and some other things. So we can make PEBS work for some specific cases, basically as long as you can do without a callgraph and can set the period you can use this new PEBS mode. The main benefit is the ability to support much lower sampling period (down to -c 1000) without extensive overhead. One use cases is for example to increase the resolution of the c2c tool. Another is double checking when you suspect the standard sampling has too much sampling error. Some numbers on the overhead, using cycle soak, comparing the elapsed time from "kernbench -M -H" between plain (threshold set to one) and multi (large threshold). The test command for plain: "perf record --time -e cycles:p -c $period -- kernbench -M -H" The test command for multi: "perf record --no-time -e cycles:p -c $period -- kernbench -M -H" ( The only difference of test command between multi and plain is time stamp options. Since time stamp is not supported by large PEBS threshold, it can be used as a flag to indicate if large threshold is enabled during the test. ) period plain(Sec) multi(Sec) Delta 10003 32.7 16.5 16.2 20003 30.2 16.2 14.0 40003 18.6 14.1 4.5 80003 16.8 14.6 2.2 100003 16.9 14.1 2.8 800003 15.4 15.7 -0.3 1000003 15.3 15.2 0.2 2000003 15.3 15.1 0.1 With periods below 100003, plain (threshold one) cause much more overhead. With 10003 sampling period, the Elapsed Time for multi is even 2X faster than plain. Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com> Signed-off-by: Kan Liang <kan.liang@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: acme@infradead.org Cc: eranian@google.com Link: http://lkml.kernel.org/r/1430940834-8964-5-git-send-email-kan.liang@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-06-07perf/x86/intel: Handle multiple records in the PEBS bufferYan, Zheng
When the PEBS interrupt threshold is larger than one record and the machine supports multiple PEBS events, the records of these events are mixed up and we need to demultiplex them. Demuxing the records is hard because the hardware is deficient. The hardware has two issues that, when combined, create impossible scenarios to demux. The first issue is that the 'status' field of the PEBS record is a copy of the GLOBAL_STATUS MSR at PEBS assist time. To see why this is a problem let us first describe the regular PEBS cycle: A) the CTRn value reaches 0: - the corresponding bit in GLOBAL_STATUS gets set - we start arming the hardware assist < some unspecified amount of time later -- this could cover multiple events of interest > B) the hardware assist is armed, any next event will trigger it C) a matching event happens: - the hardware assist triggers and generates a PEBS record this includes a copy of GLOBAL_STATUS at this moment - if we auto-reload we (re)set CTRn - we clear the relevant bit in GLOBAL_STATUS Now consider the following chain of events: A0, B0, A1, C0 The event generated for counter 0 will include a status with counter 1 set, even though its not at all related to the record. A similar thing can happen with a !PEBS event if it just happens to overflow at the right moment. The second issue is that the hardware will only emit one record for two or more counters if the event that triggers the assist is 'close'. The 'close' can be several cycles. In some cases even the complete assist, if the event is something that doesn't need retirement. For instance, consider this chain of events: A0, B0, A1, B1, C01 Where C01 is an event that triggers both hardware assists, we will generate but a single record, but again with both counters listed in the status field. This time the record pertains to both events. Note that these two cases are different but undistinguishable with the data as generated. Therefore demuxing records with multiple PEBS bits (we can safely ignore status bits for !PEBS counters) is impossible. Furthermore we cannot emit the record to both events because that might cause a data leak -- the events might not have the same privileges -- so what this patch does is discard such events. The assumption/hope is that such discards will be rare. Here lists some possible ways you may get high discard rate. - when you count the same thing multiple times. But it is not a useful configuration. - you can be unfortunate if you measure with a userspace only PEBS event along with either a kernel or unrestricted PEBS event. Imagine the event triggering and setting the overflow flag right before entering the kernel. Then all kernel side events will end up with multiple bits set. Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com> Signed-off-by: Kan Liang <kan.liang@intel.com> [ Changelog improvements. ] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: acme@infradead.org Cc: eranian@google.com Link: http://lkml.kernel.org/r/1430940834-8964-4-git-send-email-kan.liang@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-06-07perf/x86/intel: Introduce setup_pebs_sample_data()Yan, Zheng
Move code that sets up the PEBS sample data to a separate function. Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com> Signed-off-by: Kan Liang <kan.liang@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: acme@infradead.org Cc: eranian@google.com Link: http://lkml.kernel.org/r/1430940834-8964-3-git-send-email-kan.liang@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-06-07perf/x86/intel: Use the PEBS auto reload mechanism when possibleYan, Zheng
When a fixed period is specified, this patch makes perf use the PEBS auto reload mechanism. This makes normal profiling faster, because it avoids one costly MSR write in the PMI handler. However, the reset value will be loaded by hardware assist. There is a small delay compared to the previous non-auto-reload mechanism. The delay time is arbitrary, but very small. The assist cost is 400-800 cycles, assuming common cases with everything cached. The minimum period the patch currently uses is 10000. In that extreme case it can be ~10% if cycles are used. Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com> Signed-off-by: Kan Liang <kan.liang@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: acme@infradead.org Cc: eranian@google.com Link: http://lkml.kernel.org/r/1430940834-8964-2-git-send-email-kan.liang@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-06-07perf/x86/intel: add support for PERF_SAMPLE_BRANCH_IND_JUMPStephane Eranian
This patch enables support for branch sampling filter for indirect jumps (IND_JUMP). It enables LBR IND_JMP filtering where available. There is also software filtering support. Signed-off-by: Stephane Eranian <eranian@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Andi Kleen <ak@linux.intel.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: acme@redhat.com Cc: dsahern@gmail.com Cc: jolsa@redhat.com Cc: kan.liang@intel.com Cc: namhyung@kernel.org Link: http://lkml.kernel.org/r/1431637800-31061-3-git-send-email-eranian@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org>