summaryrefslogtreecommitdiff
path: root/arch
AgeCommit message (Collapse)Author
2011-01-15ARM: fix missing branch in __error_aRussell King
When DEBUG_LL is not set, we don't want __error_a re-entering __lookup_machine_type - we want it to go to the error function. This used to be the case before we reorganized the layout for hotplug cpu, as we used to fall through to __error. With the changed layout, we need an explicit branch here instead. Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
2011-01-15ARM: pxa: fix building issue of missing physmap.hEric Miao
Signed-off-by: Eric Miao <eric.y.miao@gmail.com>
2011-01-15ARM: mmp: PXA910 drive strength FAST using wrong valuePhilip Rakity
Drive strength for PXA910 is a 2 bit value but because of the mapping in plat-pxa/mfp.h needs to be shifted up one bit to handle real location in mfp registers. (MMP2 and PXA910 drive strength start at bit 11 while PXA168 starts at bit 10). Values 0, 1, 2, and 3 effectively need to be 0, 2, 4, and 6 to fit into register. 8 does not work. Signed-off-by: Philip Rakity <prakity@marvell.com> Signed-off-by: Eric Miao <eric.y.miao@gmail.com>
2011-01-15ARM: mmp: MMP2 drive strength FAST using wrong valuePhilip Rakity
Drive strength for MMP2 is a 2 bit value but because of the mapping in plat-pxa/mfp.h needs to be shifted up one bit to handle real location in mfp registers. (MMP2 and PXA910 drive strength start at bit 11 while PXA168 starts at bit 10). Values 0, 1, 2, and 3 effectively need to be 0, 2, 4, and 6 to fit into register. 8 does not work. Signed-off-by: Philip Rakity <prakity@marvell.com> Tested-by: John Watlington <wad@laptop.org> Signed-off-by: Eric Miao <eric.y.miao@gmail.com>
2011-01-15ARM: pxa: fix recursive calls in pxa_low_gpio_chipEric Miao
Signed-off-by: Eric Miao <eric.y.miao@gmail.com> Tested-by: Marek Vasut <marek.vasut@gmail.com>
2011-01-15ARM: fix /proc/$PID/stack on SMPRussell King
Rabin Vincent reports: | On SMP, this BUG() in save_stack_trace_tsk() can be easily triggered | from user space by reading /proc/$PID/stack, where $PID is any pid but | the current process: | | if (tsk != current) { | #ifdef CONFIG_SMP | /* | * What guarantees do we have here that 'tsk' | * is not running on another CPU? | */ | BUG(); | #else Fix this by replacing the BUG() with an entry to terminate the stack trace, returning an empty trace - I'd rather not expose the dwarf unwinder to a volatile stack of a running thread. Reported-by: Rabin Vincent <rabin@rab.in> Tested-by: Rabin Vincent <rabin@rab.in> Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
2011-01-15ARM: Fix build regression on SA11x0, PXA, and H720x targetsRussell King
Build errors similar this appeared in todays kautobuild for the above targets: In file included from arch/arm/include/asm/pgtable.h:461, from arch/arm/mach-pxa/generic.c:26: include/asm-generic/pgtable.h: In function 'ptep_test_and_clear_young': include/asm-generic/pgtable.h:29: error: dereferencing pointer to incomplete type None of the .c files including asm/pgtable.h with this error is using this header, so simply remove the include. Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
2011-01-14xen: export arbitrary_virt_to_machineStephen Rothwell
Fixes this build error: ERROR: "arbitrary_virt_to_machine" [drivers/xen/xen-gntdev.ko] undefined! Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-14Merge branch 'for_rmk' of git://github.com/at91linux/linux-2.6-at91 into ↵Russell King
devel-stable
2011-01-14ARM: 6625/1: use memblock memory regions for "System RAM" I/O resourcesDima Zavin
Do not use memory bank info to request the "system ram" resources as they do not track holes created by memblock_remove inside machine's reserve callback. If the removed memory is passed as platform_device's ioresource, then drivers that call request_mem_region would fail due to a conflict with the incorrectly configured system ram resource. Instead, iterate through the regions of memblock.memory and add those as "System RAM" resources. Signed-off-by: Dima Zavin <dima@android.com> Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
2011-01-14Merge branch 'omap-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tmlind/linux-omap-2.6 * 'omap-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tmlind/linux-omap-2.6: (27 commits) omap4: Fix ULPI PHY init for ES1.0 SDP omap3: beaglexm: fix power on of DVI omap3: igep3: Add omap_reserve functionality omap3: beaglexm: fix DVI reset GPIO omap3: beaglexm: fix EHCI power up GPIO dir omap3: igep2: Add keypad support omap3: igep3: Fix IGEP module second MMC channel power supply omap3: igep3: Add USB EHCI support for IGEP module omap3: clocks: Fix build error 'CK_3430ES2' undeclared here arm: omap4: pandaboard: turn on PHY reference clock at init omap2plus: prm: Trvial build break fix for undefined reference to 'omap2_prm_read_mod_reg' omap2plus: voltage: Trivial linking fix for 'EINVAL' undeclared omap2plus: voltage: Trivial linking fix 'undefined reference' omap2plus: voltage: Trivial warning fix 'no return statement' omap2plus: clockdomain: Trivial fix for build break because of clktrctrl_mask arm: omap: gpio: don't access irq_desc array directly omap2+: pm_bus: make functions used as pointers as static OMAP: GPIO: fix _set_gpio_triggering() for OMAP2+ OMAP2+: TWL: include pm header for init protos OMAP2+: TWL: make conversion routines static ... Fix up conflicts in arch/arm/mach-omap2/board-omap3beagle.c ("DVI reset GPIO" vs "use generic DPI panel driver")
2011-01-14Merge branch 'release' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux-2.6 * 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux-2.6: [IA64] fix ia64 build failure in pmdp_get_and_clear
2011-01-14x86, olpc: Add missing Kconfig dependenciesH. Peter Anvin
OLPC uses select for OLPC_OPENFIRMWARE, which means OLPC has to enforce the dependencies for OLPC_OPENFIRMWARE. Make sure it does so. Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Cc: Daniel Drake <dsd@laptop.org> Cc: Andres Salomon <dilinger@queued.net> Cc: Grant Likely <grant.likely@secretlab.ca> LKML-Reference: <20100923162846.D8D409D401B@zog.reactivated.net> Cc: <stable@kernel.org> 2.6.37
2011-01-14x86, mrst: Set correct APB timer IRQ affinity for secondary cpuJacob Pan
Offlining the secondary CPU causes the timer irq affinity to be set to CPU 0. When the secondary CPU is back online again, the wrong irq affinity will be used. This patch ensures secondary per CPU timer always has the correct IRQ affinity when enabled. Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com> LKML-Reference: <1294963604-18111-1-git-send-email-jacob.jun.pan@linux.intel.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Cc: <stable@kernel.org> 2.6.37
2011-01-14[IA64] fix ia64 build failure in pmdp_get_and_clearAndrea Arcangeli
Implement __pmd macro for ia64 too. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Tony Luck <tony.luck@intel.com>
2011-01-14AT91: Support for gsia18s boardIgor Plyatov
The GS_IA18_S (GMS) is a carrier board from GeoSIG Ltd used with the Stamp9G20 SoM from Taskit company. It operate as an internet accelerometer. Signed-off-by: Igor Plyatov <plyatov@gmail.com> [nicolas.ferre@atmel.com: rm Kconfig, whitespace fixes, change machine name] Signed-off-by: Nicolas Ferre <nicolas.ferre@atmel.com>
2011-01-14AT91: Acme Systems FOX Board G20 board filesSergio Tanzilli
Signed-off-by: Sergio Tanzilli <tanzilli@acmesystems.it> [nicolas.ferre@atmel.com: whitespace fixes, change machine name] Signed-off-by: Nicolas Ferre <nicolas.ferre@atmel.com>
2011-01-14AT91: board-sam9m10g45ek.c: Remove duplicate inclusion of mach/hardware.hJesper Juhl
Signed-off-by: Jesper Juhl <jj@chaosbits.net> Signed-off-by: Nicolas Ferre <nicolas.ferre@atmel.com>
2011-01-14Merge branch 'linux-next' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci-2.6 * 'linux-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci-2.6: PCI/PM: Report wakeup events before resuming devices PCI/PM: Use pm_wakeup_event() directly for reporting wakeup events PCI: sysfs: Update ROM to include default owner write access x86/PCI: make Broadcom CNB20LE driver EMBEDDED and EXPERIMENTAL x86/PCI: don't use native Broadcom CNB20LE driver when ACPI is available PCI/ACPI: Request _OSC control once for each root bridge (v3) PCI: enable pci=bfsort by default on future Dell systems PCI/PCIe: Clear Root PME Status bits early during system resume PCI: pci-stub: ignore zero-length id parameters x86/PCI: irq and pci_ids patch for Intel Patsburg PCI: Skip id checking if no id is passed PCI: fix __pci_device_probe kernel-doc warning PCI: make pci_restore_state return void PCI: Disable ASPM if BIOS asks us to PCI: Add mask bit definition for MSI-X table PCI: MSI: Move MSI-X entry definition to pci_regs.h Fix up trivial conflicts in drivers/net/{skge.c,sky2.c} that had in the meantime been converted to not use legacy PCI power management, and thus no longer use pci_restore_state() at all (and that caused trivial conflicts with the "make pci_restore_state return void" patch)
2011-01-14x86: tsc: Fix calibration refinement conditionals to avoid divide by zeroJohn Stultz
Konrad Wilk reported that the new delayed calibration crashes with a divide by zero on Xen. The reason is that Xen sets the pmtimer address, but reading from it returns 0xffffff. That results in the ref_start and ref_stop value being the same, so the delta is zero which causes the divide by zero later in the calculation. The conditional (!hpet && !ref_start && !ref_stop) which sanity checks the calibration reference values doesn't really make sense. If the refs are null, but hpet is on, we still want to break out. The div by zero would be possible to trigger by chance if both reads from the hardware provided the exact same value (due to hardware wrapping). So checking if both the ref values are the same should handle if we don't have hardware (both null) or if they are the same value (either by invalid hardware, or by chance), avoiding the div by zero issue. [ tglx: Applied the same fix to native_calibrate_tsc() where this check was copied from ] Reported-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: John Stultz <johnstul@us.ibm.com> LKML-Reference: <1295024788-15619-1-git-send-email-johnstul@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2011-01-14Merge branch 'for-next' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/sameo/mfd-2.6 * 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/sameo/mfd-2.6: (59 commits) mfd: ab8500-core chip version cut 2.0 support mfd: Flag WM831x /IRQ as a wake source mfd: Convert WM831x away from legacy I2C PM operations regulator: Support MAX8998/LP3974 DVS-GPIO mfd: Support LP3974 RTC i2c: Convert SCx200 driver from using raw PCI to platform device x86: OLPC: convert olpc-xo1 driver from pci device to platform device mfd: MAX8998/LP3974 hibernation support mfd/ab8500: remove spi support mfd: Remove ARCH_U8500 dependency from AB8500 misc: Make AB8500_PWM driver depend on U8500 due to PWM breakage mfd: Add __devexit annotation for vx855_remove mfd: twl6030 irq_data conversion. gpio: Fix cs5535 printk warnings misc: Fix cs5535 printk warnings mfd: Convert Wolfson MFD drivers to use irq_data accessor function mfd: Convert TWL4030 to new irq_ APIs mfd: Convert tps6586x driver to new irq_ API mfd: Convert tc6393xb driver to new irq_ APIs mfd: Convert t7166xb driver to new irq_ API ...
2011-01-14x86/PCI: make Broadcom CNB20LE driver EMBEDDED and EXPERIMENTALBjorn Helgaas
This functionality is known to be incomplete, so discourage its use in general-purpose kernels. The only reason to use this driver is to support PCI hotplug on CNB20LE- based machines that don't have ACPI, and there are very few such systems. Reference: https://bugzilla.redhat.com/show_bug.cgi?id=665109 Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com> Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
2011-01-14x86/PCI: don't use native Broadcom CNB20LE driver when ACPI is availableBjorn Helgaas
The broadcom_bus.c quirk was written (without benefit of documentation) to support PCI hotplug on an old system that doesn't have ACPI. As such, we should only use it when the system doesn't have ACPI. If the system does have ACPI and we need the host bridge description, we should get it from the ACPI _CRS method. On machines older than 2008, we currently ignore _CRS, but that doesn't mean we should use broadcom_bus.c. It means we should either (a) do what we've done in the past and assume everything in the PCI gap is routed to bus 0 (so hotplug may not work), or (b) arrange to use _CRS. This patch does (a). Reference: https://bugzilla.redhat.com/show_bug.cgi?id=665109 Acked-by: Ira W. Snyder <iws@ovro.caltech.edu> Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com> Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
2011-01-14PCI: enable pci=bfsort by default on future Dell systemsNarendra_K@Dell.com
This patch enables pci=bfsort by default on future Dell systems. It reads SMBIOS type 0xB1 vendor specific record and sets pci=bfsort accordingly. Offset Name Length Value Description 04 Flags0 Word Varies Bits 9-10 - 10:9 = 00 Unknown - 10:9 = 01 Breadth First - 10:9 = 10 Depth First - 10:9 = 11 Reserved 1. Any time pci=bfsort has to be enabled on a system, we need to add the model number of the system to the white list. With this patch, that is not required. 2. Typically, model number has to be added to the white list when the system is under development. With this change, that is not required. Signed-off-by: Jordan Hargrave <jordan_hargrave@dell.com> Signed-off-by: Narendra K <narendra_k@dell.com> Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
2011-01-14Merge branch 'for-linus' of git://git390.marist.edu/pub/scm/linux-2.6Linus Torvalds
* 'for-linus' of git://git390.marist.edu/pub/scm/linux-2.6: [S390] MAINTAINERS: Update zcrypt driver entry [S390] Randomize PIEs [S390] Randomise the brk region [S390] Add is_32bit_task() helper function [S390] Randomize lower bits of stack address [S390] Randomize mmap start address [S390] Rearrange mmap.c [S390] Enable flexible mmap layout for 64 bit processes [S390] vdso: dont map at mmap_base [S390] reduce miminum gap between stack and mmap_base [S390] mmap: consider stack address randomization [S390] Update default configuration [S390] cio: path_event overindication after resume
2011-01-14ARM: pxa: fix suspend/resume array index miscalculationMarek Vasut
Signed-off-by: Marek Vasut <marek.vasut@gmail.com> Signed-off-by: Eric Miao <eric.y.miao@gmail.com>
2011-01-14ARM: pxa: use cpu_has_ipr() consistently in irq.cMarek Vasut
Signed-off-by: Marek Vasut <marek.vasut@gmail.com> Signed-off-by: Eric Miao <eric.y.miao@gmail.com>
2011-01-14ARM: pxa: remove unused variable in clock-pxa3xx.cMarek Vasut
Signed-off-by: Marek Vasut <marek.vasut@gmail.com> Signed-off-by: Eric Miao <eric.y.miao@gmail.com>
2011-01-14ARM: pxa: fix warning in zeus.cMarek Vasut
Signed-off-by: Marek Vasut <marek.vasut@gmail.com> Signed-off-by: Eric Miao <eric.y.miao@gmail.com>
2011-01-14ARM: sa1111: fix typo in sa1111_retrigger_lowirq()Pavel Machek
Signed-off-by: Pavel Machek <pma@sysgo.com> Signed-off-by: Eric Miao <eric.y.miao@gmail.com>
2011-01-14Merge branch 'for-rmk' of git://git.pengutronix.de/git/imx/linux-2.6 into ↵Russell King
devel-stable
2011-01-14ARM: fix wrongly patched constantsRussell King
e3d9c625 (ARM: CPU hotplug: fix hard-coded control register constants) changed the wrong constants in the hotplug assembly code. Fix this. Reported-by: viresh kumar <viresh.kumar@st.com> Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
2011-01-14x86: OLPC: convert olpc-xo1 driver from pci device to platform deviceAndres Salomon
The cs5535-mfd driver now takes care of the PCI BAR handling; this means the olpc-xo1 driver shouldn't be touching the PCI device at all. This patch uses both cs5535-acpi and cs5535-pms platform devices rather than a single platform device because the cs5535-mfd driver may be used by other CS5535 platform-specific drivers; OLPC doesn't get to dictate that ACPI and PMS will always be used together. Signed-off-by: Andres Salomon <dilinger@queued.net> Acked-by: H. Peter Anvin <hpa@zytor.com> Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2011-01-14ARM mxs: clkdev related compile fixesSascha Hauer
Since commit 6d803ba (ARM: 6483/1: arm & sh: factorised duplicated clkdev.c) platforms need to select CLKDEV_LOOKUP instead of COMMON_CLKDEV and need to include <linux/clkdev.h>. Cc: Shawn Guo <shawn.guo@freescale.com> Cc: Lothar Waßmann <LW@KARO-electronics.de> Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de> Acked-by: Jean-Christophe PLAGNIOL-VILLARD <plagnioj@jcrosoft.com> Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
2011-01-14ARM: 6624/1: fix dependency for CONFIG_SMP_ON_UPNicolas Pitre
This depends on !XIP_KERNEL and not !XIP. Signed-off-by: Nicolas Pitre <nicolas.pitre@linaro.org> Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
2011-01-14ARM: 6623/1: Thumb-2: Fix out-of-range offset for Thumb-2 in proc-v7.SDave Martin
Commit d30e45e (ARM: pgtable: switch order of Linux vs hardware page tables) introduced a pre-increment addressing offset which is out of range for Thumb-2. Thumb-2 only permits offsets <256. So split the intruction in two for Thumb-2. Signed-off-by: Dave Martin <dave.martin@linaro.org> Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
2011-01-14ARM i.MX mx31_3ds: Fix MC13783 regulator namesSascha Hauer
Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
2011-01-13Merge branch 'release' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6 * 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6: (59 commits) ACPI / PM: Fix build problems for !CONFIG_ACPI related to NVS rework ACPI: fix resource check message ACPI / Battery: Update information on info notification and resume ACPI: Drop device flag wake_capable ACPI: Always check if _PRW is present before trying to evaluate it ACPI / PM: Check status of power resources under mutexes ACPI / PM: Rename acpi_power_off_device() ACPI / PM: Drop acpi_power_nocheck ACPI / PM: Drop acpi_bus_get_power() Platform / x86: Make fujitsu_laptop use acpi_bus_update_power() ACPI / Fan: Rework the handling of power resources ACPI / PM: Register power resource devices as soon as they are needed ACPI / PM: Register acpi_power_driver early ACPI / PM: Add function for updating device power state consistently ACPI / PM: Add function for device power state initialization ACPI / PM: Introduce __acpi_bus_get_power() ACPI / PM: Introduce function for refcounting device power resources ACPI / PM: Add functions for manipulating lists of power resources ACPI / PM: Prevent acpi_power_get_inferred_state() from making changes ACPICA: Update version to 20101209 ...
2011-01-13Merge branch 'idle-release' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-idle-2.6 * 'idle-release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-idle-2.6: cpuidle/x86/perf: fix power:cpu_idle double end events and throw cpu_idle events from the cpuidle layer intel_idle: open broadcast clock event cpuidle: CPUIDLE_FLAG_CHECK_BM is omap3_idle specific cpuidle: CPUIDLE_FLAG_TLB_FLUSHED is specific to intel_idle cpuidle: delete unused CPUIDLE_FLAG_SHALLOW, BALANCED, DEEP definitions SH, cpuidle: delete use of NOP CPUIDLE_FLAGS_SHALLOW cpuidle: delete NOP CPUIDLE_FLAG_POLL ACPI: processor_idle: delete use of NOP CPUIDLE_FLAGs cpuidle: Rename X86 specific idle poll state[0] from C0 to POLL ACPI, intel_idle: Cleanup idle= internal variables cpuidle: Make cpuidle_enable_device() call poll_idle_init() intel_idle: update Sandy Bridge core C-state residency targets
2011-01-13Merge branch 'stable/gntdev' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen * 'stable/gntdev' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen: xen/p2m: Fix module linking error. xen p2m: clear the old pte when adding a page to m2p_override xen gntdev: use gnttab_map_refs and gnttab_unmap_refs xen: introduce gnttab_map_refs and gnttab_unmap_refs xen p2m: transparently change the p2m mappings in the m2p override xen/gntdev: Fix circular locking dependency xen/gntdev: stop using "token" argument xen: gntdev: move use of GNTMAP_contains_pte next to the map_op xen: add m2p override mechanism xen: move p2m handling to separate file xen/gntdev: add VM_PFNMAP to vma xen/gntdev: allow usermode to map granted pages xen: define gnttab_set_map_op/unmap_op Fix up trivial conflict in drivers/xen/Kconfig
2011-01-13thp: mm: define MADV_NOHUGEPAGEAndrea Arcangeli
Define MADV_NOHUGEPAGE. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13thp: mmu_notifier_test_youngAndrea Arcangeli
For GRU and EPT, we need gup-fast to set referenced bit too (this is why it's correct to return 0 when shadow_access_mask is zero, it requires gup-fast to set the referenced bit). qemu-kvm access already sets the young bit in the pte if it isn't zero-copy, if it's zero copy or a shadow paging EPT minor fault we relay on gup-fast to signal the page is in use... We also need to check the young bits on the secondary pagetables for NPT and not nested shadow mmu as the data may never get accessed again by the primary pte. Without this closer accuracy, we'd have to remove the heuristic that avoids collapsing hugepages in hugepage virtual regions that have not even a single subpage in use. ->test_young is full backwards compatible with GRU and other usages that don't have young bits in pagetables set by the hardware and that should nuke the secondary mmu mappings when ->clear_flush_young runs just like EPT does. Removing the heuristic that checks the young bit in khugepaged/collapse_huge_page completely isn't so bad either probably but I thought it was worth it and this makes it reliable. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13thp: don't allow transparent hugepage support without PSEAndrea Arcangeli
Archs implementing Transparent Hugepage Support must implement a function called has_transparent_hugepage to be sure the virtual or physical CPU supports Transparent Hugepages. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13thp: add pmd_modifyJohannes Weiner
Add pmd_modify() for use with mprotect() on huge pmds. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Reviewed-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13thp: add x86 32bit supportJohannes Weiner
Add support for transparent hugepages to x86 32bit. Share the same VM_ bitflag for VM_MAPPED_COPY. mm/nommu.c will never support transparent hugepages. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Reviewed-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13thp: transparent hugepage coreAndrea Arcangeli
Lately I've been working to make KVM use hugepages transparently without the usual restrictions of hugetlbfs. Some of the restrictions I'd like to see removed: 1) hugepages have to be swappable or the guest physical memory remains locked in RAM and can't be paged out to swap 2) if a hugepage allocation fails, regular pages should be allocated instead and mixed in the same vma without any failure and without userland noticing 3) if some task quits and more hugepages become available in the buddy, guest physical memory backed by regular pages should be relocated on hugepages automatically in regions under madvise(MADV_HUGEPAGE) (ideally event driven by waking up the kernel deamon if the order=HPAGE_PMD_SHIFT-PAGE_SHIFT list becomes not null) 4) avoidance of reservation and maximization of use of hugepages whenever possible. Reservation (needed to avoid runtime fatal faliures) may be ok for 1 machine with 1 database with 1 database cache with 1 database cache size known at boot time. It's definitely not feasible with a virtualization hypervisor usage like RHEV-H that runs an unknown number of virtual machines with an unknown size of each virtual machine with an unknown amount of pagecache that could be potentially useful in the host for guest not using O_DIRECT (aka cache=off). hugepages in the virtualization hypervisor (and also in the guest!) are much more important than in a regular host not using virtualization, becasue with NPT/EPT they decrease the tlb-miss cacheline accesses from 24 to 19 in case only the hypervisor uses transparent hugepages, and they decrease the tlb-miss cacheline accesses from 19 to 15 in case both the linux hypervisor and the linux guest both uses this patch (though the guest will limit the addition speedup to anonymous regions only for now...). Even more important is that the tlb miss handler is much slower on a NPT/EPT guest than for a regular shadow paging or no-virtualization scenario. So maximizing the amount of virtual memory cached by the TLB pays off significantly more with NPT/EPT than without (even if there would be no significant speedup in the tlb-miss runtime). The first (and more tedious) part of this work requires allowing the VM to handle anonymous hugepages mixed with regular pages transparently on regular anonymous vmas. This is what this patch tries to achieve in the least intrusive possible way. We want hugepages and hugetlb to be used in a way so that all applications can benefit without changes (as usual we leverage the KVM virtualization design: by improving the Linux VM at large, KVM gets the performance boost too). The most important design choice is: always fallback to 4k allocation if the hugepage allocation fails! This is the _very_ opposite of some large pagecache patches that failed with -EIO back then if a 64k (or similar) allocation failed... Second important decision (to reduce the impact of the feature on the existing pagetable handling code) is that at any time we can split an hugepage into 512 regular pages and it has to be done with an operation that can't fail. This way the reliability of the swapping isn't decreased (no need to allocate memory when we are short on memory to swap) and it's trivial to plug a split_huge_page* one-liner where needed without polluting the VM. Over time we can teach mprotect, mremap and friends to handle pmd_trans_huge natively without calling split_huge_page*. The fact it can't fail isn't just for swap: if split_huge_page would return -ENOMEM (instead of the current void) we'd need to rollback the mprotect from the middle of it (ideally including undoing the split_vma) which would be a big change and in the very wrong direction (it'd likely be simpler not to call split_huge_page at all and to teach mprotect and friends to handle hugepages instead of rolling them back from the middle). In short the very value of split_huge_page is that it can't fail. The collapsing and madvise(MADV_HUGEPAGE) part will remain separated and incremental and it'll just be an "harmless" addition later if this initial part is agreed upon. It also should be noted that locking-wise replacing regular pages with hugepages is going to be very easy if compared to what I'm doing below in split_huge_page, as it will only happen when page_count(page) matches page_mapcount(page) if we can take the PG_lock and mmap_sem in write mode. collapse_huge_page will be a "best effort" that (unlike split_huge_page) can fail at the minimal sign of trouble and we can try again later. collapse_huge_page will be similar to how KSM works and the madvise(MADV_HUGEPAGE) will work similar to madvise(MADV_MERGEABLE). The default I like is that transparent hugepages are used at page fault time. This can be changed with /sys/kernel/mm/transparent_hugepage/enabled. The control knob can be set to three values "always", "madvise", "never" which mean respectively that hugepages are always used, or only inside madvise(MADV_HUGEPAGE) regions, or never used. /sys/kernel/mm/transparent_hugepage/defrag instead controls if the hugepage allocation should defrag memory aggressively "always", only inside "madvise" regions, or "never". The pmd_trans_splitting/pmd_trans_huge locking is very solid. The put_page (from get_user_page users that can't use mmu notifier like O_DIRECT) that runs against a __split_huge_page_refcount instead was a pain to serialize in a way that would result always in a coherent page count for both tail and head. I think my locking solution with a compound_lock taken only after the page_first is valid and is still a PageHead should be safe but it surely needs review from SMP race point of view. In short there is no current existing way to serialize the O_DIRECT final put_page against split_huge_page_refcount so I had to invent a new one (O_DIRECT loses knowledge on the mapping status by the time gup_fast returns so...). And I didn't want to impact all gup/gup_fast users for now, maybe if we change the gup interface substantially we can avoid this locking, I admit I didn't think too much about it because changing the gup unpinning interface would be invasive. If we ignored O_DIRECT we could stick to the existing compound refcounting code, by simply adding a get_user_pages_fast_flags(foll_flags) where KVM (and any other mmu notifier user) would call it without FOLL_GET (and if FOLL_GET isn't set we'd just BUG_ON if nobody registered itself in the current task mmu notifier list yet). But O_DIRECT is fundamental for decent performance of virtualized I/O on fast storage so we can't avoid it to solve the race of put_page against split_huge_page_refcount to achieve a complete hugepage feature for KVM. Swap and oom works fine (well just like with regular pages ;). MMU notifier is handled transparently too, with the exception of the young bit on the pmd, that didn't have a range check but I think KVM will be fine because the whole point of hugepages is that EPT/NPT will also use a huge pmd when they notice gup returns pages with PageCompound set, so they won't care of a range and there's just the pmd young bit to check in that case. NOTE: in some cases if the L2 cache is small, this may slowdown and waste memory during COWs because 4M of memory are accessed in a single fault instead of 8k (the payoff is that after COW the program can run faster). So we might want to switch the copy_huge_page (and clear_huge_page too) to not temporal stores. I also extensively researched ways to avoid this cache trashing with a full prefault logic that would cow in 8k/16k/32k/64k up to 1M (I can send those patches that fully implemented prefault) but I concluded they're not worth it and they add an huge additional complexity and they remove all tlb benefits until the full hugepage has been faulted in, to save a little bit of memory and some cache during app startup, but they still don't improve substantially the cache-trashing during startup if the prefault happens in >4k chunks. One reason is that those 4k pte entries copied are still mapped on a perfectly cache-colored hugepage, so the trashing is the worst one can generate in those copies (cow of 4k page copies aren't so well colored so they trashes less, but again this results in software running faster after the page fault). Those prefault patches allowed things like a pte where post-cow pages were local 4k regular anon pages and the not-yet-cowed pte entries were pointing in the middle of some hugepage mapped read-only. If it doesn't payoff substantially with todays hardware it will payoff even less in the future with larger l2 caches, and the prefault logic would blot the VM a lot. If one is emebdded transparent_hugepage can be disabled during boot with sysfs or with the boot commandline parameter transparent_hugepage=0 (or transparent_hugepage=2 to restrict hugepages inside madvise regions) that will ensure not a single hugepage is allocated at boot time. It is simple enough to just disable transparent hugepage globally and let transparent hugepages be allocated selectively by applications in the MADV_HUGEPAGE region (both at page fault time, and if enabled with the collapse_huge_page too through the kernel daemon). This patch supports only hugepages mapped in the pmd, archs that have smaller hugepages will not fit in this patch alone. Also some archs like power have certain tlb limits that prevents mixing different page size in the same regions so they will not fit in this framework that requires "graceful fallback" to basic PAGE_SIZE in case of physical memory fragmentation. hugetlbfs remains a perfect fit for those because its software limits happen to match the hardware limits. hugetlbfs also remains a perfect fit for hugepage sizes like 1GByte that cannot be hoped to be found not fragmented after a certain system uptime and that would be very expensive to defragment with relocation, so requiring reservation. hugetlbfs is the "reservation way", the point of transparent hugepages is not to have any reservation at all and maximizing the use of cache and hugepages at all times automatically. Some performance result: vmx andrea # LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./largep ages3 memset page fault 1566023 memset tlb miss 453854 memset second tlb miss 453321 random access tlb miss 41635 random access second tlb miss 41658 vmx andrea # LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./largepages3 memset page fault 1566471 memset tlb miss 453375 memset second tlb miss 453320 random access tlb miss 41636 random access second tlb miss 41637 vmx andrea # ./largepages3 memset page fault 1566642 memset tlb miss 453417 memset second tlb miss 453313 random access tlb miss 41630 random access second tlb miss 41647 vmx andrea # ./largepages3 memset page fault 1566872 memset tlb miss 453418 memset second tlb miss 453315 random access tlb miss 41618 random access second tlb miss 41659 vmx andrea # echo 0 > /proc/sys/vm/transparent_hugepage vmx andrea # ./largepages3 memset page fault 2182476 memset tlb miss 460305 memset second tlb miss 460179 random access tlb miss 44483 random access second tlb miss 44186 vmx andrea # ./largepages3 memset page fault 2182791 memset tlb miss 460742 memset second tlb miss 459962 random access tlb miss 43981 random access second tlb miss 43988 ============ #include <stdio.h> #include <stdlib.h> #include <string.h> #include <sys/time.h> #define SIZE (3UL*1024*1024*1024) int main() { char *p = malloc(SIZE), *p2; struct timeval before, after; gettimeofday(&before, NULL); memset(p, 0, SIZE); gettimeofday(&after, NULL); printf("memset page fault %Lu\n", (after.tv_sec-before.tv_sec)*1000000UL + after.tv_usec-before.tv_usec); gettimeofday(&before, NULL); memset(p, 0, SIZE); gettimeofday(&after, NULL); printf("memset tlb miss %Lu\n", (after.tv_sec-before.tv_sec)*1000000UL + after.tv_usec-before.tv_usec); gettimeofday(&before, NULL); memset(p, 0, SIZE); gettimeofday(&after, NULL); printf("memset second tlb miss %Lu\n", (after.tv_sec-before.tv_sec)*1000000UL + after.tv_usec-before.tv_usec); gettimeofday(&before, NULL); for (p2 = p; p2 < p+SIZE; p2 += 4096) *p2 = 0; gettimeofday(&after, NULL); printf("random access tlb miss %Lu\n", (after.tv_sec-before.tv_sec)*1000000UL + after.tv_usec-before.tv_usec); gettimeofday(&before, NULL); for (p2 = p; p2 < p+SIZE; p2 += 4096) *p2 = 0; gettimeofday(&after, NULL); printf("random access second tlb miss %Lu\n", (after.tv_sec-before.tv_sec)*1000000UL + after.tv_usec-before.tv_usec); return 0; } ============ Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Rik van Riel <riel@redhat.com> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13thp: kvm mmu transparent hugepage supportAndrea Arcangeli
This should work for both hugetlbfs and transparent hugepages. [akpm@linux-foundation.org: bring forward PageTransCompound() addition for bisectability] Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Avi Kivity <avi@redhat.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13thp: split_huge_page_mm/vmaAndrea Arcangeli
split_huge_page_pmd compat code. Each one of those would need to be expanded to hundred of lines of complex code without a fully reliable split_huge_page_pmd design. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13thp: pte alloc trans splittingAndrea Arcangeli
pte alloc routines must wait for split_huge_page if the pmd is not present and not null (i.e. pmd_trans_splitting). The additional branches are optimized away at compile time by pmd_trans_splitting if the config option is off. However we must pass the vma down in order to know the anon_vma lock to wait for. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13thp: bail out gup_fast on splitting pmdAndrea Arcangeli
Force gup_fast to take the slow path and block if the pmd is splitting, not only if it's none. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>