summaryrefslogtreecommitdiff
path: root/mm
AgeCommit message (Collapse)Author
2016-05-27mm/cma: silence warnings due to max() usageStephen Rothwell
pageblock_order can be (at least) an unsigned int or an unsigned long depending on the kernel config and architecture, so use max_t(unsigned long, ...) when comparing it. fixes these warnings: In file included from include/asm-generic/bug.h:13:0, from arch/powerpc/include/asm/bug.h:127, from include/linux/bug.h:4, from include/linux/mmdebug.h:4, from include/linux/mm.h:8, from include/linux/memblock.h:18, from mm/cma.c:28: mm/cma.c: In function 'cma_init_reserved_mem': include/linux/kernel.h:748:17: warning: comparison of distinct pointer types lacks a cast (void) (&_max1 == &_max2); ^ mm/cma.c:186:27: note: in expansion of macro 'max' alignment = PAGE_SIZE << max(MAX_ORDER - 1, pageblock_order); ^ mm/cma.c: In function 'cma_declare_contiguous': include/linux/kernel.h:748:17: warning: comparison of distinct pointer types lacks a cast (void) (&_max1 == &_max2); ^ include/linux/kernel.h:747:9: note: in definition of macro 'max' typeof(y) _max2 = (y); ^ mm/cma.c:270:29: note: in expansion of macro 'max' (phys_addr_t)PAGE_SIZE << max(MAX_ORDER - 1, pageblock_order)); ^ include/linux/kernel.h:748:17: warning: comparison of distinct pointer types lacks a cast (void) (&_max1 == &_max2); ^ include/linux/kernel.h:747:21: note: in definition of macro 'max' typeof(y) _max2 = (y); ^ mm/cma.c:270:29: note: in expansion of macro 'max' (phys_addr_t)PAGE_SIZE << max(MAX_ORDER - 1, pageblock_order)); ^ [akpm@linux-foundation.org: coding-style fixes] Link: http://lkml.kernel.org/r/20160526150748.5be38a4f@canb.auug.org.au Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-27mm: thp: avoid false positive VM_BUG_ON_PAGE in page_move_anon_rmap()Kirill A. Shutemov
If page_move_anon_rmap() is refiling a pmd-splitted THP mapped in a tail page from a pte, the "address" must be THP aligned in order for the page->index bugcheck to pass in the CONFIG_DEBUG_VM=y builds. Link: http://lkml.kernel.org/r/1464253620-106404-1-git-send-email-kirill.shutemov@linux.intel.com Fixes: 6d0a07edd17c ("mm: thp: calculate the mapcount correctly for THP pages during WP faults") Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reported-by: Mika Westerberg <mika.westerberg@linux.intel.com> Tested-by: Mika Westerberg <mika.westerberg@linux.intel.com> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Cc: <stable@vger.kernel.org> [4.5] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-27oom_reaper: close race with exiting taskMichal Hocko
Tetsuo has reported: Out of memory: Kill process 443 (oleg's-test) score 855 or sacrifice child Killed process 443 (oleg's-test) total-vm:493248kB, anon-rss:423880kB, file-rss:4kB, shmem-rss:0kB sh invoked oom-killer: gfp_mask=0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), order=0, oom_score_adj=0 sh cpuset=/ mems_allowed=0 CPU: 2 PID: 1 Comm: sh Not tainted 4.6.0-rc7+ #51 Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013 Call Trace: dump_stack+0x85/0xc8 dump_header+0x5b/0x394 oom_reaper: reaped process 443 (oleg's-test), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB In other words: __oom_reap_task exit_mm atomic_inc_not_zero tsk->mm = NULL mmput atomic_dec_and_test # > 0 exit_oom_victim # New victim will be # selected <OOM killer invoked> # no TIF_MEMDIE task so we can select a new one unmap_page_range # to release the memory The race exists even without the oom_reaper because anybody who pins the address space and gets preempted might race with exit_mm but oom_reaper made this race more probable. We can address the oom_reaper part by using oom_lock for __oom_reap_task because this would guarantee that a new oom victim will not be selected if the oom reaper might race with the exit path. This doesn't solve the original issue, though, because somebody else still might be pinning mm_users and so __mmput won't be called to release the memory but that is not really realiably solvable because the task will get away from the oom sight as soon as it is unhashed from the task_list and so we cannot guarantee a new victim won't be selected. [akpm@linux-foundation.org: fix use of unused `mm', Per Stephen] [akpm@linux-foundation.org: coding-style fixes] Fixes: aac453635549 ("mm, oom: introduce oom reaper") Link: http://lkml.kernel.org/r/1464271493-20008-1-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-27mm: use early_pfn_to_nid in register_page_bootmem_info_nodeYang Shi
register_page_bootmem_info_node() is invoked in mem_init(), so it will be called before page_alloc_init_late() if DEFERRED_STRUCT_PAGE_INIT is enabled. But, pfn_to_nid() depends on memmap which won't be fully setup until page_alloc_init_late() is done, so replace pfn_to_nid() by early_pfn_to_nid(). Link: http://lkml.kernel.org/r/1464210007-30930-1-git-send-email-yang.shi@linaro.org Signed-off-by: Yang Shi <yang.shi@linaro.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-27mm: use early_pfn_to_nid in page_ext_initYang Shi
page_ext_init() checks suitable pages with pfn_to_nid(), but pfn_to_nid() depends on memmap which will not be setup fully until page_alloc_init_late() is done. Use early_pfn_to_nid() instead of pfn_to_nid() so that page extension could be still used early even though CONFIG_ DEFERRED_STRUCT_PAGE_INIT is enabled and catch early page allocation call sites. Suggested by Joonsoo Kim [1], this fix basically undoes the change introduced by commit b8f1a75d61d840 ("mm: call page_ext_init() after all struct pages are initialized") and fixes the same problem with a better approach. [1] http://lkml.kernel.org/r/CAAmzW4OUmyPwQjvd7QUfc6W1Aic__TyAuH80MLRZNMxKy0-wPQ@mail.gmail.com Link: http://lkml.kernel.org/r/1464198689-23458-1-git-send-email-yang.shi@linaro.org Signed-off-by: Yang Shi <yang.shi@linaro.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-27mm: oom: do not reap task if there are live threads in threadgroupVladimir Davydov
If the current process is exiting, we don't invoke oom killer, instead we give it access to memory reserves and try to reap its mm in case nobody is going to use it. There's a mistake in the code performing this check - we just ignore any process of the same thread group no matter if it is exiting or not - see try_oom_reaper. Fix it. Link: http://lkml.kernel.org/r/1464087628-7318-1-git-send-email-vdavydov@virtuozzo.com Fixes: 3ef22dfff239 ("oom, oom_reaper: try to reap tasks which skip regular OOM killer path")Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-27switch xattr_handler->set() to passing dentry and inode separatelyAl Viro
preparation for similar switch in ->setxattr() (see the next commit for rationale). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-26Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge fixes from Andrew Morton: "10 fixes" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: drivers/pinctrl/intel/pinctrl-baytrail.c: fix build with gcc-4.4 update "mm/zsmalloc: don't fail if can't create debugfs info" dma-debug: avoid spinlock recursion when disabling dma-debug mm: oom_reaper: remove some bloat memcg: fix mem_cgroup_out_of_memory() return value. ocfs2: fix improper handling of return errno mm: slub: remove unused virt_to_obj() mm: kasan: remove unused 'reserved' field from struct kasan_alloc_meta mm: make CONFIG_DEFERRED_STRUCT_PAGE_INIT depends on !FLATMEM explicitly seqlock: fix raw_read_seqcount_latch()
2016-05-26Merge tag 'dax-locking-for-4.7' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm Pull DAX locking updates from Ross Zwisler: "Filesystem DAX locking for 4.7 - We use a bit in an exceptional radix tree entry as a lock bit and use it similarly to how page lock is used for normal faults. This fixes races between hole instantiation and read faults of the same index. - Filesystem DAX PMD faults are disabled, and will be re-enabled when PMD locking is implemented" * tag 'dax-locking-for-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: dax: Remove i_mmap_lock protection dax: Use radix tree entry lock to protect cow faults dax: New fault locking dax: Allow DAX code to replace exceptional entries dax: Define DAX lock bit for radix tree exceptional entry dax: Make huge page handling depend of CONFIG_BROKEN dax: Fix condition for filling of PMD holes
2016-05-26update "mm/zsmalloc: don't fail if can't create debugfs info"Dan Streetman
Some updates to commit d34f615720d1 ("mm/zsmalloc: don't fail if can't create debugfs info"): - add pr_warn to all stat failure cases - do not prevent module loading on stat failure Link: http://lkml.kernel.org/r/1463671123-5479-1-git-send-email-ddstreet@ieee.org Signed-off-by: Dan Streetman <ddstreet@ieee.org> Reviewed-by: Ganesh Mahendran <opensource.ganesh@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Dan Streetman <dan.streetman@canonical.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-26memcg: fix mem_cgroup_out_of_memory() return value.Tetsuo Handa
mem_cgroup_out_of_memory() is returning "true" if it finds a TIF_MEMDIE task after an eligible task was found, "false" if it found a TIF_MEMDIE task before an eligible task is found. This difference confuses memory_max_write() which checks the return value of mem_cgroup_out_of_memory(). Since memory_max_write() wants to continue looping, mem_cgroup_out_of_memory() should return "true" in this case. This patch sets a dummy pointer in order to return "true". Link: http://lkml.kernel.org/r/1463753327-5170-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-26mm: kasan: remove unused 'reserved' field from struct kasan_alloc_metaAndrey Ryabinin
Commit cd11016e5f52 ("mm, kasan: stackdepot implementation. Enable stackdepot for SLAB") added 'reserved' field, but never used it. Link: http://lkml.kernel.org/r/1464021054-2307-1-git-send-email-aryabinin@virtuozzo.com Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-26mm: make CONFIG_DEFERRED_STRUCT_PAGE_INIT depends on !FLATMEM explicitlyYang Shi
Per the suggestion from Michal Hocko [1], DEFERRED_STRUCT_PAGE_INIT requires some ordering wrt other initialization operations, e.g. page_ext_init has to happen after the whole memmap is initialized properly. For SPARSEMEM this requires to wait for page_alloc_init_late. Other memory models (e.g. flatmem) might have different initialization layouts (page_ext_init_flatmem). Currently DEFERRED_STRUCT_PAGE_INIT depends on MEMORY_HOTPLUG which in turn depends on SPARSEMEM || X86_64_ACPI_NUMA depends on ARCH_ENABLE_MEMORY_HOTPLUG and X86_64_ACPI_NUMA depends on NUMA which in turn disable FLATMEM memory model: config ARCH_FLATMEM_ENABLE def_bool y depends on X86_32 && !NUMA so FLATMEM is ruled out via dependency maze. Be explicit and disable FLATMEM for DEFERRED_STRUCT_PAGE_INIT so that we do not reintroduce subtle initialization bugs [1] http://lkml.kernel.org/r/20160523073157.GD2278@dhcp22.suse.cz Link: http://lkml.kernel.org/r/1464027356-32282-1-git-send-email-yang.shi@linaro.org Signed-off-by: Yang Shi <yang.shi@linaro.org> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-25percpu: fix synchronization between synchronous map extension and chunk ↵Tejun Heo
destruction For non-atomic allocations, pcpu_alloc() can try to extend the area map synchronously after dropping pcpu_lock; however, the extension wasn't synchronized against chunk destruction and the chunk might get freed while extension is in progress. This patch fixes the bug by putting most of non-atomic allocations under pcpu_alloc_mutex to synchronize against pcpu_balance_work which is responsible for async chunk management including destruction. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-and-tested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com> Reported-by: Vlastimil Babka <vbabka@suse.cz> Reported-by: Sasha Levin <sasha.levin@oracle.com> Cc: stable@vger.kernel.org # v3.18+ Fixes: 1a4d76076cda ("percpu: implement asynchronous chunk population")
2016-05-25percpu: fix synchronization between chunk->map_extend_work and chunk destructionTejun Heo
Atomic allocations can trigger async map extensions which is serviced by chunk->map_extend_work. pcpu_balance_work which is responsible for destroying idle chunks wasn't synchronizing properly against chunk->map_extend_work and may end up freeing the chunk while the work item is still in flight. This patch fixes the bug by rolling async map extension operations into pcpu_balance_work. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-and-tested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com> Reported-by: Vlastimil Babka <vbabka@suse.cz> Reported-by: Sasha Levin <sasha.levin@oracle.com> Cc: stable@vger.kernel.org # v3.18+ Fixes: 9c824b6a172c ("percpu: make sure chunk->map array has available space")
2016-05-23Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge yet more updates from Andrew Morton: - Oleg's "wait/ptrace: assume __WALL if the child is traced". It's a kernel-based workaround for existing userspace issues. - A few hotfixes - befs cleanups - nilfs2 updates - sys_wait() changes - kexec updates - kdump - scripts/gdb updates - the last of the MM queue - a few other misc things * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (84 commits) kgdb: depends on VT drm/amdgpu: make amdgpu_mn_get wait for mmap_sem killable drm/radeon: make radeon_mn_get wait for mmap_sem killable drm/i915: make i915_gem_mmap_ioctl wait for mmap_sem killable uprobes: wait for mmap_sem for write killable prctl: make PR_SET_THP_DISABLE wait for mmap_sem killable exec: make exec path waiting for mmap_sem killable aio: make aio_setup_ring killable coredump: make coredump_wait wait for mmap_sem for write killable vdso: make arch_setup_additional_pages wait for mmap_sem for write killable ipc, shm: make shmem attach/detach wait for mmap_sem killable mm, fork: make dup_mmap wait for mmap_sem for write killable mm, proc: make clear_refs killable mm: make vm_brk killable mm, elf: handle vm_brk error mm, aout: handle vm_brk failures mm: make vm_munmap killable mm: make vm_mmap killable mm: make mmap_sem for write waits killable for mm syscalls MAINTAINERS: add co-maintainer for scripts/gdb ...
2016-05-23mm: make vm_brk killableMichal Hocko
Now that all the callers handle vm_brk failure we can change it wait for mmap_sem killable to help oom_reaper to not get blocked just because vm_brk gets blocked behind mmap_sem readers. Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-23mm: make vm_munmap killableMichal Hocko
Almost all current users of vm_munmap are ignoring the return value and so they do not handle potential error. This means that some VMAs might stay behind. This patch doesn't try to solve those potential problems. Quite contrary it adds a new failure mode by using down_write_killable in vm_munmap. This should be safer than other failure modes, though, because the process is guaranteed to die as soon as it leaves the kernel and exit_mmap will clean the whole address space. This will help in the OOM conditions when the oom victim might be stuck waiting for the mmap_sem for write which in turn can block oom_reaper which relies on the mmap_sem for read to make a forward progress and reclaim the address space of the victim. Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-23mm: make vm_mmap killableMichal Hocko
All the callers of vm_mmap seem to check for the failure already and bail out in one way or another on the error which means that we can change it to use killable version of vm_mmap_pgoff and return -EINTR if the current task gets killed while waiting for mmap_sem. This also means that vm_mmap_pgoff can be killable by default and drop the additional parameter. This will help in the OOM conditions when the oom victim might be stuck waiting for the mmap_sem for write which in turn can block oom_reaper which relies on the mmap_sem for read to make a forward progress and reclaim the address space of the victim. Please note that load_elf_binary is ignoring vm_mmap error for current->personality & MMAP_PAGE_ZERO case but that shouldn't be a problem because the address is not used anywhere and we never return to the userspace if we got killed. Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-23mm: make mmap_sem for write waits killable for mm syscallsMichal Hocko
This is a follow up work for oom_reaper [1]. As the async OOM killing depends on oom_sem for read we would really appreciate if a holder for write didn't stood in the way. This patchset is changing many of down_write calls to be killable to help those cases when the writer is blocked and waiting for readers to release the lock and so help __oom_reap_task to process the oom victim. Most of the patches are really trivial because the lock is help from a shallow syscall paths where we can return EINTR trivially and allow the current task to die (note that EINTR will never get to the userspace as the task has fatal signal pending). Others seem to be easy as well as the callers are already handling fatal errors and bail and return to userspace which should be sufficient to handle the failure gracefully. I am not familiar with all those code paths so a deeper review is really appreciated. As this work is touching more areas which are not directly connected I have tried to keep the CC list as small as possible and people who I believed would be familiar are CCed only to the specific patches (all should have received the cover though). This patchset is based on linux-next and it depends on down_write_killable for rw_semaphores which got merged into tip locking/rwsem branch and it is merged into this next tree. I guess it would be easiest to route these patches via mmotm because of the dependency on the tip tree but if respective maintainers prefer other way I have no objections. I haven't covered all the mmap_write(mm->mmap_sem) instances here $ git grep "down_write(.*\<mmap_sem\>)" next/master | wc -l 98 $ git grep "down_write(.*\<mmap_sem\>)" | wc -l 62 I have tried to cover those which should be relatively easy to review in this series because this alone should be a nice improvement. Other places can be changed on top. [0] http://lkml.kernel.org/r/1456752417-9626-1-git-send-email-mhocko@kernel.org [1] http://lkml.kernel.org/r/1452094975-551-1-git-send-email-mhocko@kernel.org [2] http://lkml.kernel.org/r/1456750705-7141-1-git-send-email-mhocko@kernel.org This patch (of 18): This is the first step in making mmap_sem write waiters killable. It focuses on the trivial ones which are taking the lock early after entering the syscall and they are not changing state before. Therefore it is very easy to change them to use down_write_killable and immediately return with -EINTR. This will allow the waiter to pass away without blocking the mmap_sem which might be required to make a forward progress. E.g. the oom reaper will need the lock for reading to dismantle the OOM victim address space. The only tricky function in this patch is vm_mmap_pgoff which has many call sites via vm_mmap. To reduce the risk keep vm_mmap with the original non-killable semantic for now. vm_munmap callers do not bother checking the return value so open code it into the munmap syscall path for now for simplicity. Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@suse.de> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Hugh Dickins <hughd@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-23mm: memcontrol: fix possible css ref leak on oomVladimir Davydov
mem_cgroup_oom may be invoked multiple times while a process is handling a page fault, in which case current->memcg_in_oom will be overwritten leaking the previously taken css reference. Link: http://lkml.kernel.org/r/1464019330-7579-1-git-send-email-vdavydov@virtuozzo.com Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-23Merge branch 'drm-next' of git://people.freedesktop.org/~airlied/linuxLinus Torvalds
Pull drm updates from Dave Airlie: "Here's the main drm pull request for 4.7, it's been a busy one, and I've been a bit more distracted in real life this merge window. Lots more ARM drivers, not sure if it'll ever end. I think I've at least one more coming the next merge window. But changes are all over the place, support for AMD Polaris GPUs is in here, some missing GM108 support for nouveau (found in some Lenovos), a bunch of MST and skylake fixes. I've also noticed a few fixes from Arnd in my inbox, that I'll try and get in asap, but I didn't think they should hold this up. New drivers: - Hisilicon kirin display driver - Mediatek MT8173 display driver - ARC PGU - bitstreamer on Synopsys ARC SDP boards - Allwinner A13 initial RGB output driver - Analogix driver for DisplayPort IP found in exynos and rockchip DRM Core: - UAPI headers fixes and C++ safety - DRM connector reference counting - DisplayID mode parsing for Dell 5K monitors - Removal of struct_mutex from drivers - Connector registration cleanups - MST robustness fixes - MAINTAINERS updates - Lockless GEM object freeing - Generic fbdev deferred IO support panel: - Support for a bunch of new panels i915: - VBT refactoring - PLL computation cleanups - DSI support for BXT - Color manager support - More atomic patches - GEM improvements - GuC fw loading fixes - DP detection fixes - SKL GPU hang fixes - Lots of BXT fixes radeon/amdgpu: - Initial Polaris support - GPUVM/Scheduler/Clock/Power improvements - ASYNC pageflip support - New mesa feature support nouveau: - GM108 support - Power sensor support improvements - GR init + ucode fixes. - Use GPU provided topology information vmwgfx: - Add host messaging support gma500: - Some cleanups and fixes atmel: - Bridge support - Async atomic commit support fsl-dcu: - Timing controller for LCD support - Pixel clock polarity support rcar-du: - Misc fixes exynos: - Pipeline clock support - Exynoss4533 SoC support - HW trigger mode support - export HDMI_PHY clock - DECON5433 fixes - Use generic prime functions - use DMA mapping APIs rockchip: - Lots of little fixes vc4: - Render node support - Gamma ramp support - DPI output support msm: - Mostly cleanups and fixes - Conversion to generic struct fence etnaviv: - Fix for prime buffer handling - Allow hangcheck to be coalesced with other wakeups tegra: - Gamme table size fix" * 'drm-next' of git://people.freedesktop.org/~airlied/linux: (1050 commits) drm/edid: add displayid detailed 1 timings to the modelist. (v1.1) drm/edid: move displayid validation to it's own function. drm/displayid: Iterate over all DisplayID blocks drm/edid: move displayid tiled block parsing into separate function. drm: Nuke ->vblank_disable_allowed drm/vmwgfx: Report vmwgfx version to vmware.log drm/vmwgfx: Add VMWare host messaging capability drm/vmwgfx: Kill some lockdep warnings drm/nouveau/gr/gf100-: fix race condition in fecs/gpccs ucode drm/nouveau/core: recognise GM108 chipsets drm/nouveau/gr/gm107-: fix touching non-existent ppcs in attrib cb setup drm/nouveau/gr/gk104-: share implementation of ppc exception init drm/nouveau/gr/gk104-: move rop_active_fbps init to nonctx drm/nouveau/bios/pll: check BIT table version before trying to parse it drm/nouveau/bios/pll: prevent oops when limits table can't be parsed drm/nouveau/volt/gk104: round up in gk104_volt_set drm/nouveau/fb/gm200: setup mmu debug buffer registers at init() drm/nouveau/fb/gk20a,gm20b: setup mmu debug buffer registers at init() drm/nouveau/fb/gf100-: allocate mmu debug buffers drm/nouveau/fb: allow chipset-specific actions for oneinit() ...
2016-05-23Merge tag 'libnvdimm-for-4.7' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm Pull libnvdimm updates from Dan Williams: "The bulk of this update was stabilized before the merge window and appeared in -next. The "device dax" implementation was revised this week in response to review feedback, and to address failures detected by the recently expanded ndctl unit test suite. Not included in this pull request are two dax topic branches (dax error handling, and dax radix-tree locking). These topics were deferred to get a few more days of -next integration testing, and to coordinate a branch baseline with Ted and the ext4 tree. Vishal and Ross will send the error handling and locking topics respectively in the next few days. This branch has received a positive build result from the kbuild robot across 226 configs. Summary: - Device DAX for persistent memory: Device DAX is the device-centric analogue of Filesystem DAX (CONFIG_FS_DAX). It allows memory ranges to be allocated and mapped without need of an intervening file system. Device DAX is strict, precise and predictable. Specifically this interface: a) Guarantees fault granularity with respect to a given page size (pte, pmd, or pud) set at configuration time. b) Enforces deterministic behavior by being strict about what fault scenarios are supported. Persistent memory is the first target, but the mechanism is also targeted for exclusive allocations of performance/feature differentiated memory ranges. - Support for the HPE DSM (device specific method) command formats. This enables management of these first generation devices until a unified DSM specification materializes. - Further ACPI 6.1 compliance with support for the common dimm identifier format. - Various fixes and cleanups across the subsystem" * tag 'libnvdimm-for-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (40 commits) libnvdimm, dax: fix deletion libnvdimm, dax: fix alignment validation libnvdimm, dax: autodetect support libnvdimm: release ida resources Revert "block: enable dax for raw block devices" /dev/dax, core: file operations and dax-mmap /dev/dax, pmem: direct access to persistent memory libnvdimm: stop requiring a driver ->remove() method libnvdimm, dax: record the specified alignment of a dax-device instance libnvdimm, dax: reserve space to store labels for device-dax libnvdimm, dax: introduce device-dax infrastructure nfit: add sysfs dimm 'family' and 'dsm_mask' attributes tools/testing/nvdimm: ND_CMD_CALL support nfit: disable vendor specific commands nfit: export subsystem ids as attributes nfit: fix format interface code byte order per ACPI6.1 nfit, libnvdimm: limited/whitelisted dimm command marshaling mechanism nfit, libnvdimm: clarify "commands" vs "_DSMs" libnvdimm: increase max envelope size for ioctl acpi/nfit: Add sysfs "id" for NVDIMM ID ...
2016-05-22x86: remove more uaccess_32.h complexityLinus Torvalds
I'm looking at trying to possibly merge the 32-bit and 64-bit versions of the x86 uaccess.h implementation, but first this needs to be cleaned up. For example, the 32-bit version of "__copy_from_user_inatomic()" is mostly the special cases for the constant size, and it's actually almost never relevant. Most users aren't actually using a constant size anyway, and the few cases that do small constant copies are better off just using __get_user() instead. So get rid of the unnecessary complexity. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20/dev/dax, core: file operations and dax-mmapDan Williams
The "Device DAX" core enables dax mappings of performance / feature differentiated memory. An open mapping or file handle keeps the backing struct device live, but new mappings are only possible while the device is enabled. Faults are handled under rcu_read_lock to synchronize with the enabled state of the device. Similar to the filesystem-dax case the backing memory may optionally have struct page entries. However, unlike fs-dax there is no support for private mappings, or mappings that are not backed by media (see use of zero-page in fs-dax). Mappings are always guaranteed to match the alignment of the dax_region. If the dax_region is configured to have a 2MB alignment, all mappings are guaranteed to be backed by a pmd entry. Contrast this determinism with the fs-dax case where pmd mappings are opportunistic. If userspace attempts to force a misaligned mapping, the driver will fail the mmap attempt. See dax_dev_check_vma() for other scenarios that are rejected, like MAP_PRIVATE mappings. Cc: Hannes Reinecke <hare@suse.de> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Acked-by: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-05-20radix-tree: introduce radix_tree_replace_clear_tags()Matthew Wilcox
In addition to replacing the entry, we also clear all associated tags. This is really a one-off special for page_cache_tree_delete() which had far too much detailed knowledge about how the radix tree works. For efficiency, factor node_tag_clear() out of radix_tree_tag_clear() It can be used by radix_tree_delete_item() as well as radix_tree_replace_clear_tags(). Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com> Cc: Jan Kara <jack@suse.com> Cc: Neil Brown <neilb@suse.de> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20raxix-tree: introduce CONFIG_RADIX_TREE_MULTIORDERMatthew Wilcox
I've been receiving increasingly concerned notes from 0day about how much my recent changes have been bloating the radix tree. Make it happier by only including multiorder support if CONFIG_TRANSPARENT_HUGEPAGES is set. This is an independent Kconfig option, so other radix tree users can also set it if they have a need. Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com> Cc: Jan Kara <jack@suse.com> Cc: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20mm/zsmalloc: don't fail if can't create debugfs infoDan Streetman
Change the return type of zs_pool_stat_create() to void, and remove the logic to abort pool creation if the stat debugfs dir/file could not be created. The debugfs stat file is for debugging/information only, and doesn't affect operation of zsmalloc; there is no reason to abort creating the pool if the stat file can't be created. This was seen with zswap, which used the same name for all pool creations, which caused zsmalloc to fail to create a second pool for zswap if CONFIG_ZSMALLOC_STAT was enabled. Signed-off-by: Dan Streetman <ddstreet@ieee.org> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Dan Streetman <dan.streetman@canonical.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20mm/zswap: use workqueue to destroy poolDan Streetman
Add a work_struct to struct zswap_pool, and change __zswap_pool_empty to use the workqueue instead of using call_rcu(). When zswap destroys a pool no longer in use, it uses call_rcu() to perform the destruction/freeing. Since that executes in softirq context, it must not sleep. However, actually destroying the pool involves freeing the per-cpu compressors (which requires locking the cpu_add_remove_lock mutex) and freeing the zpool, for which the implementation may sleep (e.g. zsmalloc calls kmem_cache_destroy, which locks the slab_mutex). So if either mutex is currently taken, or any other part of the compressor or zpool implementation sleeps, it will result in a BUG(). It's not easy to reproduce this when changing zswap's params normally. In testing with a loaded system, this does not fail: $ cd /sys/module/zswap/parameters $ echo lz4 > compressor ; echo zsmalloc > zpool nor does this: $ while true ; do > echo lzo > compressor ; echo zbud > zpool > sleep 1 > echo lz4 > compressor ; echo zsmalloc > zpool > sleep 1 > done although it's still possible either of those might fail, depending on whether anything else besides zswap has locked the mutexes. However, changing a parameter with no delay immediately causes the schedule while atomic BUG: $ while true ; do > echo lzo > compressor ; echo lz4 > compressor > done This is essentially the same as Yu Zhao's proposed patch to zsmalloc, but moved to zswap, to cover compressor and zpool freeing. Fixes: f1c54846ee45 ("zswap: dynamic pool creation") Signed-off-by: Dan Streetman <ddstreet@ieee.org> Reported-by: Yu Zhao <yuzhao@google.com> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Dan Streetman <dan.streetman@canonical.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20zsmalloc: require GFP in zs_malloc()Sergey Senozhatsky
Pass GFP flags to zs_malloc() instead of using a fixed mask supplied to zs_create_pool(), so we can be more flexible, but, more importantly, we need this to switch zram to per-cpu compression streams -- zram will try to allocate handle with preemption disabled in a fast path and switch to a slow path (using different gfp mask) if the fast one has failed. Apart from that, this also align zs_malloc() interface with zspool/zbud. [sergey.senozhatsky@gmail.com: pass GFP flags to zs_malloc() instead of using a fixed mask] Link: http://lkml.kernel.org/r/20160429150942.GA637@swordfish Link: http://lkml.kernel.org/r/20160429150942.GA637@swordfish Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20zsmalloc: remove unused pool param in obj_freeMinchan Kim
Let's remove unused pool param in obj_free Signed-off-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20zsmalloc: reorder function parametersMinchan Kim
Clean up function parameter ordering to order higher data structure first. Signed-off-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20zsmalloc: clean up many BUG_ONMinchan Kim
There are many BUG_ON in zsmalloc.c which is not recommened so change them as alternatives. Normal rule is as follows: 1. avoid BUG_ON if possible. Instead, use VM_BUG_ON or VM_BUG_ON_PAGE 2. use VM_BUG_ON_PAGE if we need to see struct page's fields 3. use those assertion in primitive functions so higher functions can rely on the assertion in the primitive function. 4. Don't use assertion if following instruction can trigger Oops Signed-off-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20zsmalloc: use first_page rather than pageMinchan Kim
Clean up function parameter "struct page". Many functions of zsmalloc expect that page paramter is "first_page" so use "first_page" rather than "page" for code readability. Signed-off-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20mm/kasan: add API to check memory regionsAndrey Ryabinin
Memory access coded in an assembly won't be seen by KASAN as a compiler can instrument only C code. Add kasan_check_[read,write]() API which is going to be used to check a certain memory range. Link: http://lkml.kernel.org/r/1462538722-1574-3-git-send-email-aryabinin@virtuozzo.com Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Acked-by: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20mm/kasan: print name of mem[set,cpy,move]() caller in reportAndrey Ryabinin
When bogus memory access happens in mem[set,cpy,move]() it's usually caller's fault. So don't blame mem[set,cpy,move]() in bug report, blame the caller instead. Before: BUG: KASAN: out-of-bounds access in memset+0x23/0x40 at <address> After: BUG: KASAN: out-of-bounds access in <memset_caller> at <address> Link: http://lkml.kernel.org/r/1462538722-1574-2-git-send-email-aryabinin@virtuozzo.com Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Acked-by: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20mm, kasan: don't call kasan_krealloc() from ksize().Alexander Potapenko
Instead of calling kasan_krealloc(), which replaces the memory allocation stack ID (if stack depot is used), just unpoison the whole memory chunk. Signed-off-by: Alexander Potapenko <glider@google.com> Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Andrey Konovalov <adech.fo@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Konstantin Serebryany <kcc@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20mm: kasan: initial memory quarantine implementationAlexander Potapenko
Quarantine isolates freed objects in a separate queue. The objects are returned to the allocator later, which helps to detect use-after-free errors. When the object is freed, its state changes from KASAN_STATE_ALLOC to KASAN_STATE_QUARANTINE. The object is poisoned and put into quarantine instead of being returned to the allocator, therefore every subsequent access to that object triggers a KASAN error, and the error handler is able to say where the object has been allocated and deallocated. When it's time for the object to leave quarantine, its state becomes KASAN_STATE_FREE and it's returned to the allocator. From now on the allocator may reuse it for another allocation. Before that happens, it's still possible to detect a use-after free on that object (it retains the allocation/deallocation stacks). When the allocator reuses this object, the shadow is unpoisoned and old allocation/deallocation stacks are wiped. Therefore a use of this object, even an incorrect one, won't trigger ASan warning. Without the quarantine, it's not guaranteed that the objects aren't reused immediately, that's why the probability of catching a use-after-free is lower than with quarantine in place. Quarantine isolates freed objects in a separate queue. The objects are returned to the allocator later, which helps to detect use-after-free errors. Freed objects are first added to per-cpu quarantine queues. When a cache is destroyed or memory shrinking is requested, the objects are moved into the global quarantine queue. Whenever a kmalloc call allows memory reclaiming, the oldest objects are popped out of the global queue until the total size of objects in quarantine is less than 3/4 of the maximum quarantine size (which is a fraction of installed physical memory). As long as an object remains in the quarantine, KASAN is able to report accesses to it, so the chance of reporting a use-after-free is increased. Once the object leaves quarantine, the allocator may reuse it, in which case the object is unpoisoned and KASAN can't detect incorrect accesses to it. Right now quarantine support is only enabled in SLAB allocator. Unification of KASAN features in SLAB and SLUB will be done later. This patch is based on the "mm: kasan: quarantine" patch originally prepared by Dmitry Chernenkov. A number of improvements have been suggested by Andrey Ryabinin. [glider@google.com: v9] Link: http://lkml.kernel.org/r/1462987130-144092-1-git-send-email-glider@google.com Signed-off-by: Alexander Potapenko <glider@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Andrey Konovalov <adech.fo@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Konstantin Serebryany <kcc@google.com> Cc: Dmitry Chernenkov <dmitryc@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20mm, migrate: increment fail count on ENOMEMDavid Rientjes
If page migration fails due to -ENOMEM, nr_failed should still be incremented for proper statistics. This was encountered recently when all page migration vmstats showed 0, and inferred that migrate_pages() was never called, although in reality the first page migration failed because compaction_alloc() failed to find a migration target. This patch increments nr_failed so the vmstat is properly accounted on ENOMEM. Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1605191510230.32658@chino.kir.corp.google.com Signed-off-by: David Rientjes <rientjes@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20mm/compaction.c: fix zoneindex in kcompactd()Chen Feng
While testing the kcompactd in my platform 3G MEM only DMA ZONE. I found the kcompactd never wakeup. It seems the zoneindex has already minus 1 before. So the traverse here should be <=. It fixes a regression where kswapd could previously compact, but kcompactd not. Not a crash fix though. [akpm@linux-foundation.org: fix kcompactd_do_work() as well, per Hugh] Link: http://lkml.kernel.org/r/1463659121-84124-1-git-send-email-puck.chen@hisilicon.com Fixes: accf62422b3a ("mm, kswapd: replace kswapd compaction with waking up kcompactd") Signed-off-by: Chen Feng <puck.chen@hisilicon.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Tejun Heo <tj@kernel.org> Cc: Zhuangluan Su <suzhuangluan@hisilicon.com> Cc: Yiping Xu <xuyiping@hisilicon.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20mm, thp: khugepaged should scan when sleep value is writtenDavid Rientjes
If a large value is written to scan_sleep_millisecs, for example, that period must lapse before khugepaged will wake up for periodic collapsing. If this value is tuned to 1 day, for example, and then re-tuned to its default 10s, khugepaged will still wait for a day before scanning again. This patch causes khugepaged to wakeup immediately when the value is changed and then sleep until that value is rewritten or the new value lapses. Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1605181453200.4786@chino.kir.corp.google.com Signed-off-by: David Rientjes <rientjes@google.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20MM: increase safety margin provided by PF_LESS_THROTTLENeilBrown
When nfsd is exporting a filesystem over NFS which is then NFS-mounted on the local machine there is a risk of deadlock. This happens when there are lots of dirty pages in the NFS filesystem and they cause NFSD to be throttled, either in throttle_vm_writeout() or in balance_dirty_pages(). To avoid this problem the PF_LESS_THROTTLE flag is set for NFSD threads and it provides a 25% increase to the limits that affect NFSD. Any process writing to an NFS filesystem will be throttled well before the number of dirty NFS pages reaches the limit imposed on NFSD, so NFSD will not deadlock on pages that it needs to write out. At least it shouldn't. All processes are allowed a small excess margin to avoid performing too many calculations: ratelimit_pages. ratelimit_pages is set so that if a thread on every CPU uses the entire margin, the total will only go 3% over the limit, and this is much less than the 25% bonus that PF_LESS_THROTTLE provides, so this margin shouldn't be a problem. But it is. The "total memory" that these 3% and 25% are calculated against are not really total memory but are "global_dirtyable_memory()" which doesn't include anonymous memory, just free memory and page-cache memory. The "ratelimit_pages" number is based on whatever the global_dirtyable_memory was on the last CPU hot-plug, which might not be what you expect, but is probably close to the total freeable memory. The throttle threshold uses the global_dirtable_memory at the moment when the throttling happens, which could be much less than at the last CPU hotplug. So if lots of anonymous memory has been allocated, thus pushing out lots of page-cache pages, then NFSD might end up being throttled due to dirty NFS pages because the "25%" bonus it gets is calculated against a rather small amount of dirtyable memory, while the "3%" margin that other processes are allowed to dirty without penalty is calculated against a much larger number. To remove this possibility of deadlock we need to make sure that the margin granted to PF_LESS_THROTTLE exceeds that rate-limit margin. Simply adding ratelimit_pages isn't enough as that should be multiplied by the number of cpus. So add "global_wb_domain.dirty_limit / 32" as that more accurately reflects the current total over-shoot margin. This ensures that the number of dirty NFS pages never gets so high that nfsd will be throttled waiting for them to be written. Link: http://lkml.kernel.org/r/87futgowwv.fsf@notabene.neil.brown.name Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20mm: check_new_page_bad() directly returns in __PG_HWPOISON caseNaoya Horiguchi
Currently we check page->flags twice for "HWPoisoned" case of check_new_page_bad(), which can cause a race with unpoisoning. This race unnecessarily taints kernel with "BUG: Bad page state". check_new_page_bad() is the only caller of bad_page() which is interested in __PG_HWPOISON, so let's move the hwpoison related code in bad_page() to it. Link: http://lkml.kernel.org/r/20160518100949.GA17299@hori1.linux.bs1.fc.nec.co.jp Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20mm, kasan: fix to call kasan_free_pages() after poisoning pageseokhoon.yoon
When CONFIG_PAGE_POISONING and CONFIG_KASAN is enabled, free_pages_prepare()'s codeflow is below. 1)kmemcheck_free_shadow() 2)kasan_free_pages() - set shadow byte of page is freed 3)kernel_poison_pages() 3.1) check access to page is valid or not using kasan ---> error occur, kasan think it is invalid access 3.2) poison page 4)kernel_map_pages() So kasan_free_pages() should be called after poisoning the page. Link: http://lkml.kernel.org/r/1463220405-7455-1-git-send-email-iamyooon@gmail.com Signed-off-by: seokhoon.yoon <iamyooon@gmail.com> Cc: Andrey Ryabinin <a.ryabinin@samsung.com> Cc: Laura Abbott <labbott@fedoraproject.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20mm: disable fault around on emulated access bit architectureMinchan Kim
fault_around aims to reduce minor faults of file-backed pages via speculative ahead pte mapping and relying on readahead logic. However, on non-HW access bit architecture the benefit is highly limited because they should emulate the young bit with minor faults for reclaim's page aging algorithm. IOW, we cannot reduce minor faults on those architectures. I did quick a test on my ARM machine. 512M file mmap sequential every word read on eSATA drive 4 times. stddev is stable. = fault_around 4096 = elapsed time(usec): 6747645 = fault_around 65536 = elapsed time(usec): 6709263 0.5% gain. Even when I tested it with eMMC there is no gain because I guess with slow storage the major fault is the dominant factor. Also, fault_around has the side effect of shrinking slab more aggressively and causes higher vmpressure, so if such speculation fails, it can evict slab more which can result in page I/O (e.g., inode cache). In the end, it would make void any benefit of fault_around. So let's make the default "disabled" on those architectures. Link: http://lkml.kernel.org/r/20160518014229.GB21538@bbox Signed-off-by: Minchan Kim <minchan@kernel.org> Cc: Kirill A. Shutemov <kirill@shutemov.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20mm: make faultaround produce old ptesKirill A. Shutemov
Currently, faultaround code produces young pte. This can screw up vmscan behaviour[1], as it makes vmscan think that these pages are hot and not push them out on first round. During sparse file access faultaround gets more pages mapped and all of them are young. Under memory pressure, this makes vmscan swap out anon pages instead, or to drop other page cache pages which otherwise stay resident. Modify faultaround to produce old ptes, so they can easily be reclaimed under memory pressure. This can to some extend defeat the purpose of faultaround on machines without hardware accessed bit as it will not help us with reducing the number of minor page faults. We may want to disable faultaround on such machines altogether, but that's subject for separate patchset. Minchan: "I tested 512M mmap sequential word read test on non-HW access bit system (i.e., ARM) and confirmed it doesn't increase minor fault any more. old: 4096 fault_around minor fault: 131291 elapsed time: 6747645 usec new: 65536 fault_around minor fault: 131291 elapsed time: 6709263 usec 0.56% benefit" [1] https://lkml.kernel.org/r/1460992636-711-1-git-send-email-vinmenon@codeaurora.org Link: http://lkml.kernel.org/r/1463488366-47723-1-git-send-email-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Minchan Kim <minchan@kernel.org> Tested-by: Minchan Kim <minchan@kernel.org> Acked-by: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vinayak Menon <vinmenon@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20mm: use phys_addr_t for reserve_bootmem_region() argumentsStefan Bader
Since commit 92923ca3aace ("mm: meminit: only set page reserved in the memblock region") the reserved bit is set on reserved memblock regions. However start and end address are passed as unsigned long. This is only 32bit on i386, so it can end up marking the wrong pages reserved for ranges at 4GB and above. This was observed on a 32bit Xen dom0 which was booted with initial memory set to a value below 4G but allowing to balloon in memory (dom0_mem=1024M for example). This would define a reserved bootmem region for the additional memory (for example on a 8GB system there was a reverved region covering the 4GB-8GB range). But since the addresses were passed on as unsigned long, this was actually marking all pages from 0 to 4GB as reserved. Fixes: 92923ca3aacef63 ("mm: meminit: only set page reserved in the memblock region") Link: http://lkml.kernel.org/r/1463491221-10573-1-git-send-email-stefan.bader@canonical.com Signed-off-by: Stefan Bader <stefan.bader@canonical.com> Cc: <stable@vger.kernel.org> [4.2+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20mm/memblock.c: remove unnecessary always-true comparisonRichard Leitner
Comparing an u64 variable to >= 0 returns always true and can therefore be removed. This issue was detected using the -Wtype-limits gcc flag. This patch fixes following type-limits warning: mm/memblock.c: In function `__next_reserved_mem_region': mm/memblock.c:843:11: warning: comparison of unsigned expression >= 0 is always true [-Wtype-limits] if (*idx >= 0 && *idx < type->cnt) { Link: http://lkml.kernel.org/r/20160510103625.3a7f8f32@g0hl1n.net Signed-off-by: Richard Leitner <dev@g0hl1n.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20z3fold: the 3-fold allocator for compressed pagesVitaly Wool
This patch introduces z3fold, a special purpose allocator for storing compressed pages. It is designed to store up to three compressed pages per physical page. It is a ZBUD derivative which allows for higher compression ratio keeping the simplicity and determinism of its predecessor. This patch comes as a follow-up to the discussions at the Embedded Linux Conference in San-Diego related to the talk [1]. The outcome of these discussions was that it would be good to have a compressed page allocator as stable and deterministic as zbud with with higher compression ratio. To keep the determinism and simplicity, z3fold, just like zbud, always stores an integral number of compressed pages per page, but it can store up to 3 pages unlike zbud which can store at most 2. Therefore the compression ratio goes to around 2.6x while zbud's one is around 1.7x. The patch is based on the latest linux.git tree. This version has been updated after testing on various simulators (e.g. ARM Versatile Express, MIPS Malta, x86_64/Haswell) and basing on comments from Dan Streetman [3]. [1] https://openiotelc2016.sched.org/event/6DAC/swapping-and-embedded-compression-relieves-the-pressure-vitaly-wool-softprise-consulting-ou [2] https://lkml.org/lkml/2016/4/21/799 [3] https://lkml.org/lkml/2016/5/4/852 Link: http://lkml.kernel.org/r/20160509151753.ec3f9fda3c9898d31ff52a32@gmail.com Signed-off-by: Vitaly Wool <vitalywool@gmail.com> Cc: Seth Jennings <sjenning@redhat.com> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20mm: thp: split_huge_pmd_address() comment improvementAndrea Arcangeli
Comment is partly wrong, this improves it by including the case of split_huge_pmd_address() called by try_to_unmap_one if TTU_SPLIT_HUGE_PMD is set. Link: http://lkml.kernel.org/r/1462547040-1737-4-git-send-email-aarcange@redhat.com Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>