summaryrefslogtreecommitdiff
path: root/mm
AgeCommit message (Collapse)Author
2014-12-13memcg: only check memcg_kmem_skip_account in __memcg_kmem_get_cacheVladimir Davydov
__memcg_kmem_get_cache can recurse if it calls kmalloc (which it does if the cgroup's kmem cache doesn't exist), because kmalloc may call __memcg_kmem_get_cache internally again. To avoid the recursion, we use the task_struct->memcg_kmem_skip_account flag. However, there's no need checking the flag in memcg_kmem_newpage_charge, because there's no way how this function could result in recursion, if called from memcg_kmem_get_cache. So let's remove the redundant code. Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13memcg: zap kmem_account_flagsVladimir Davydov
The only such flag is KMEM_ACCOUNTED_ACTIVE, but it's set iff mem_cgroup->kmemcg_id is initialized, so we can check kmemcg_id instead of having a separate flags field. Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13mm: mincore: add hwpoison page handleWeijie Yang
When the encountered pte is a swap entry, the current code handles two cases: migration and normal swapentry, but we have a third case: hwpoison page. This patch adds hwpoison page handle, consider hwpoison page incore as same as migration. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Weijie Yang <weijie.yang@samsung.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.com> Cc: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13mm/rmap: calculate page offset when neededDavidlohr Bueso
Call page_to_pgoff() to get the page offset once we are sure we actually need it, and any very obvious initial function checks have passed. Trivial micro-optimization, and potentially save some cycles. Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13mm/debug-pagealloc: cleanup page guard codeJoonsoo Kim
Page guard is used by debug-pagealloc feature. Currently, it is open-coded, but, I think that more abstraction of it makes core page allocator code more readable. There is no functional difference. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Gioh Kim <gioh.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13mm/memblock.c: refactor functions to set/clear MEMBLOCK_HOTPLUGTony Luck
There is a lot of duplication in the rubric around actually setting or clearing a mem region flag. Create a new helper function to do this and reduce each of memblock_mark_hotplug() and memblock_clear_hotplug() to a single line. This will be useful if someone were to add a new mem region flag - which I hope to be doing some day soon. But it looks like a plausible cleanup even without that - so I'd like to get it out of the way now. Signed-off-by: Tony Luck <tony.luck@intel.com> Cc: Santosh Shilimkar <santosh.shilimkar@ti.com> Cc: Tang Chen <tangchen@cn.fujitsu.com> Cc: Grygorii Strashko <grygorii.strashko@ti.com> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: Philipp Hachtmann <phacht@linux.vnet.ibm.com> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Emil Medve <Emilian.Medve@freescale.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13memcg: do not abuse memcg_kmem_skip_accountVladimir Davydov
task_struct->memcg_kmem_skip_account was initially introduced to avoid recursion during kmem cache creation: memcg_kmem_get_cache, which is called by kmem_cache_alloc to determine the per-memcg cache to account allocation to, may issue lazy cache creation if the needed cache doesn't exist, which means issuing yet another kmem_cache_alloc. We can't just pass a flag to the nested kmem_cache_alloc disabling kmem accounting, because there are hidden allocations, e.g. in INIT_WORK. So we introduced a flag on the task_struct, memcg_kmem_skip_account, making memcg_kmem_get_cache return immediately. By its nature, the flag may also be used to disable accounting for allocations shared among different cgroups, and currently it is used this way in memcg_activate_kmem. Using it like this looks like abusing it to me. If we want to disable accounting for some allocations (which we will definitely want one day), we should either add GFP_NO_MEMCG or GFP_MEMCG flag in order to blacklist/whitelist some allocations. For now, let's simply remove memcg_stop/resume_kmem_account from memcg_activate_kmem. Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13memcg: don't check mm in __memcg_kmem_{get_cache,newpage_charge}Vladimir Davydov
We already assured the current task has mm in memcg_kmem_should_charge, no need to double check. Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13memcg: __mem_cgroup_free: remove stale disarm_static_keys commentVladimir Davydov
cpuset code stopped using cgroup_lock in favor of cpuset_mutex long ago. Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13mm: cma: align to physical address, not CMA region positionGregory Fong
The alignment in cma_alloc() was done w.r.t. the bitmap. This is a problem when, for example: - a device requires 16M (order 12) alignment - the CMA region is not 16 M aligned In such a case, can result with the CMA region starting at, say, 0x2f800000 but any allocation you make from there will be aligned from there. Requesting an allocation of 32 M with 16 M alignment will result in an allocation from 0x2f800000 to 0x31800000, which doesn't work very well if your strange device requires 16M alignment. Change to use bitmap_find_next_zero_area_off() to account for the difference in alignment at reserve-time and alloc-time. Signed-off-by: Gregory Fong <gregory.0xf0@gmail.com> Acked-by: Michal Nazarewicz <mina86@mina86.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Kukjin Kim <kgene.kim@samsung.com> Cc: Laurent Pinchart <laurent.pinchart@ideasonboard.com> Cc: Laura Abbott <lauraa@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13mm/memory.c: share the i_mmap_rwsemDavidlohr Bueso
The unmap_mapping_range family of functions do the unmapping of user pages (ultimately via zap_page_range_single) without touching the actual interval tree, thus share the lock. Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Acked-by: Hugh Dickins <hughd@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Acked-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13mm/nommu: share the i_mmap_rwsemDavidlohr Bueso
Shrinking/truncate logic can call nommu_shrink_inode_mappings() to verify that any shared mappings of the inode in question aren't broken (dead zone). afaict the only user being ramfs to handle the size change attribute. Pretty much a no-brainer to share the lock. Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Acked-by: "Kirill A. Shutemov" <kirill@shutemov.name> Acked-by: Hugh Dickins <hughd@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Acked-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13mm/memory-failure: share the i_mmap_rwsemDavidlohr Bueso
No brainer conversion: collect_procs_file() only schedules a process for later kill, share the lock, similarly to the anon vma variant. Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Acked-by: "Kirill A. Shutemov" <kirill@shutemov.name> Acked-by: Hugh Dickins <hughd@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Acked-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13mm/xip: share the i_mmap_rwsemDavidlohr Bueso
__xip_unmap() will remove the xip sparse page from the cache and take down pte mapping, without altering the interval tree, thus share the i_mmap_rwsem when searching for the ptes to unmap. Additionally, tidy up the function a bit and make variables only local to the interval tree walk loop. Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Acked-by: "Kirill A. Shutemov" <kirill@shutemov.name> Acked-by: Hugh Dickins <hughd@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Acked-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13mm/rmap: share the i_mmap_rwsemDavidlohr Bueso
Similarly to the anon memory counterpart, we can share the mapping's lock ownership as the interval tree is not modified when doing doing the walk, only the file page. Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: "Kirill A. Shutemov" <kirill@shutemov.name> Acked-by: Hugh Dickins <hughd@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Acked-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13mm: convert i_mmap_mutex to rwsemDavidlohr Bueso
The i_mmap_mutex is a close cousin of the anon vma lock, both protecting similar data, one for file backed pages and the other for anon memory. To this end, this lock can also be a rwsem. In addition, there are some important opportunities to share the lock when there are no tree modifications. This conversion is straightforward. For now, all users take the write lock. [sfr@canb.auug.org.au: update fremap.c] Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Acked-by: "Kirill A. Shutemov" <kirill@shutemov.name> Acked-by: Hugh Dickins <hughd@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Acked-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13mm: use new helper functions around the i_mmap_mutexDavidlohr Bueso
Convert all open coded mutex_lock/unlock calls to the i_mmap_[lock/unlock]_write() helpers. Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: "Kirill A. Shutemov" <kirill@shutemov.name> Acked-by: Hugh Dickins <hughd@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Acked-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-12Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial Pull trivial tree update from Jiri Kosina: "Usual stuff: documentation updates, printk() fixes, etc" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (24 commits) intel_ips: fix a type in error message cpufreq: cpufreq-dt: Move newline to end of error message ps3rom: fix error return code treewide: fix typo in printk and Kconfig ARM: dts: bcm63138: change "interupts" to "interrupts" Replace mentions of "list_struct" to "list_head" kernel: trace: fix printk message scsi: mpt2sas: fix ioctl in comment zbud, zswap: change module author email clocksource: Fix 'clcoksource' typo in comment arm: fix wording of "Crotex" in CONFIG_ARCH_EXYNOS3 help gpio: msm-v1: make boolean argument more obvious usb: Fix typo in usb-serial-simple.c PCI: Fix comment typo 'COMFIG_PM_OPS' powerpc: Fix comment typo 'CONIFG_8xx' powerpc: Fix comment typos 'CONFiG_ALTIVEC' clk: st: Spelling s/stucture/structure/ isci: Spelling s/stucture/structure/ usb: gadget: zero: Spelling s/infrastucture/infrastructure/ treewide: Fix company name in module descriptions ...
2014-12-11Merge branch 'for-3.19' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup update from Tejun Heo: "cpuset got simplified a bit. cgroup core got a fix on unified hierarchy and grew some effective css related interfaces which will be used for blkio support for writeback IO traffic which is currently being worked on" * 'for-3.19' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup: implement cgroup_get_e_css() cgroup: add cgroup_subsys->css_e_css_changed() cgroup: add cgroup_subsys->css_released() cgroup: fix the async css offline wait logic in cgroup_subtree_control_write() cgroup: restructure child_subsys_mask handling in cgroup_subtree_control_write() cgroup: separate out cgroup_calc_child_subsys_mask() from cgroup_refresh_child_subsys_mask() cpuset: lock vs unlock typo cpuset: simplify cpuset_node_allowed API cpuset: convert callback_mutex to a spinlock
2014-12-11Merge branch 'for-3.19' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu Pull percpu updates from Tejun Heo: "Nothing interesting. A patch to convert the remaining __get_cpu_var() users, another to fix non-critical off-by-one in an assertion and a cosmetic conversion to lockless_dereference() in percpu-ref. The back-merge from mainline is to receive lockless_dereference()" * 'for-3.19' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: percpu: Replace smp_read_barrier_depends() with lockless_dereference() percpu: Convert remaining __get_cpu_var uses in 3.18-rcX percpu: off by one in BUG_ON()
2014-12-11Merge tag 'powerpc-3.19-1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux Pull powerpc updates from Michael Ellerman: "Some nice cleanups like removing bootmem, and removal of __get_cpu_var(). There is one patch to mm/gup.c. This is the generic GUP implementation, but is only used by us and arm(64). We have an ack from Steve Capper, and although we didn't get an ack from Andrew he told us to take the patch through the powerpc tree. There's one cxl patch. This is in drivers/misc, but Greg said he was happy for us to manage fixes for it. There is an infrastructure patch to support an IPMI driver for OPAL. There is also an RTC driver for OPAL. We weren't able to get any response from the RTC maintainer, Alessandro Zummo, so in the end we just merged the driver. The usual batch of Freescale updates from Scott" * tag 'powerpc-3.19-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux: (101 commits) powerpc/powernv: Return to cpu offline loop when finished in KVM guest powerpc/book3s: Fix partial invalidation of TLBs in MCE code. powerpc/mm: don't do tlbie for updatepp request with NO HPTE fault powerpc/xmon: Cleanup the breakpoint flags powerpc/xmon: Enable HW instruction breakpoint on POWER8 powerpc/mm/thp: Use tlbiel if possible powerpc/mm/thp: Remove code duplication powerpc/mm/hugetlb: Sanity check gigantic hugepage count powerpc/oprofile: Disable pagefaults during user stack read powerpc/mm: Check for matching hpte without taking hpte lock powerpc: Drop useless warning in eeh_init() powerpc/powernv: Cleanup unused MCE definitions/declarations. powerpc/eeh: Dump PHB diag-data early powerpc/eeh: Recover EEH error on ownership change for BCM5719 powerpc/eeh: Set EEH_PE_RESET on PE reset powerpc/eeh: Refactor eeh_reset_pe() powerpc: Remove more traces of bootmem powerpc/pseries: Initialise nvram_pstore_info's buf_lock cxl: Name interrupts in /proc/interrupt cxl: Return error to PSL if IRQ demultiplexing fails & print clearer warning ...
2014-12-11Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux Pull s390 updates from Martin Schwidefsky: "The most notable change for this pull request is the ftrace rework from Heiko. It brings a small performance improvement and the ground work to support a new gcc option to replace the mcount blocks with a single nop. Two new s390 specific system calls are added to emulate user space mmio for PCI, an artifact of the how PCI memory is accessed. Two patches for the memory management with changes to common code. For KVM mm_forbids_zeropage is added which disables the empty zero page for an mm that is used by a KVM process. And an optimization, pmdp_get_and_clear_full is added analog to ptep_get_and_clear_full. Some micro optimization for the cmpxchg and the spinlock code. And as usual bug fixes and cleanups" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (46 commits) s390/cputime: fix 31-bit compile s390/scm_block: make the number of reqs per HW req configurable s390/scm_block: handle multiple requests in one HW request s390/scm_block: allocate aidaw pages only when necessary s390/scm_block: use mempool to manage aidaw requests s390/eadm: change timeout value s390/mm: fix memory leak of ptlock in pmd_free_tlb s390: use local symbol names in entry[64].S s390/ptrace: always include vector registers in core files s390/simd: clear vector register pointer on fork/clone s390: translate cputime magic constants to macros s390/idle: convert open coded idle time seqcount s390/idle: add missing irq off lockdep annotation s390/debug: avoid function call for debug_sprintf_* s390/kprobes: fix instruction copy for out of line execution s390: remove diag 44 calls from cpu_relax() s390/dasd: retry partition detection s390/dasd: fix list corruption for sleep_on requests s390/dasd: fix infinite term I/O loop s390/dasd: remove unused code ...
2014-12-10Merge branch 'akpm' (patchbomb from Andrew)Linus Torvalds
Merge first patchbomb from Andrew Morton: - a few minor cifs fixes - dma-debug upadtes - ocfs2 - slab - about half of MM - procfs - kernel/exit.c - panic.c tweaks - printk upates - lib/ updates - checkpatch updates - fs/binfmt updates - the drivers/rtc tree - nilfs - kmod fixes - more kernel/exit.c - various other misc tweaks and fixes * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (190 commits) exit: pidns: fix/update the comments in zap_pid_ns_processes() exit: pidns: alloc_pid() leaks pid_namespace if child_reaper is exiting exit: exit_notify: re-use "dead" list to autoreap current exit: reparent: call forget_original_parent() under tasklist_lock exit: reparent: avoid find_new_reaper() if no children exit: reparent: introduce find_alive_thread() exit: reparent: introduce find_child_reaper() exit: reparent: document the ->has_child_subreaper checks exit: reparent: s/while_each_thread/for_each_thread/ in find_new_reaper() exit: reparent: fix the cross-namespace PR_SET_CHILD_SUBREAPER reparenting exit: reparent: fix the dead-parent PR_SET_CHILD_SUBREAPER reparenting exit: proc: don't try to flush /proc/tgid/task/tgid exit: release_task: fix the comment about group leader accounting exit: wait: drop tasklist_lock before psig->c* accounting exit: wait: don't use zombie->real_parent exit: wait: cleanup the ptrace_reparented() checks usermodehelper: kill the kmod_thread_locker logic usermodehelper: don't use CLONE_VFORK for ____call_usermodehelper() fs/hfs/catalog.c: fix comparison bug in hfs_cat_keycmp nilfs2: fix the nilfs_iget() vs. nilfs_new_inode() races ...
2014-12-10mm: move page->mem_cgroup bad page handling into generic codeJohannes Weiner
Now that the external page_cgroup data structure and its lookup is gone, let the generic bad_page() check for page->mem_cgroup sanity. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: David S. Miller <davem@davemloft.net> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Tejun Heo <tj@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10mm: page_cgroup: rename file to mm/swap_cgroup.cJohannes Weiner
Now that the external page_cgroup data structure and its lookup is gone, the only code remaining in there is swap slot accounting. Rename it and move the conditional compilation into mm/Makefile. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: David S. Miller <davem@davemloft.net> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Tejun Heo <tj@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10mm: embed the memcg pointer directly into struct pageJohannes Weiner
Memory cgroups used to have 5 per-page pointers. To allow users to disable that amount of overhead during runtime, those pointers were allocated in a separate array, with a translation layer between them and struct page. There is now only one page pointer remaining: the memcg pointer, that indicates which cgroup the page is associated with when charged. The complexity of runtime allocation and the runtime translation overhead is no longer justified to save that *potential* 0.19% of memory. With CONFIG_SLUB, page->mem_cgroup actually sits in the doubleword padding after the page->private member and doesn't even increase struct page, and then this patch actually saves space. Remaining users that care can still compile their kernels without CONFIG_MEMCG. text data bss dec hex filename 8828345 1725264 983040 11536649 b00909 vmlinux.old 8827425 1725264 966656 11519345 afc571 vmlinux.new [mhocko@suse.cz: update Documentation/cgroups/memory.txt] Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: David S. Miller <davem@davemloft.net> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Michal Hocko <mhocko@suse.cz> Cc: Vladimir Davydov <vdavydov@parallels.com> Cc: Tejun Heo <tj@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Konstantin Khlebnikov <koct9i@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10mm: memcontrol: remove stale page_cgroup_lock commentJohannes Weiner
There is no cgroup-specific page lock anymore. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10mm/frontswap.c: fix the condition in BUG_ONLi Haifeng
The largest index of swap device is MAX_SWAPFILES-1. So the type should be less than MAX_SWAPFILES. Signed-off-by: Haifeng Li <omycle@gmail.com> Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10mm: fix a spelling mistakeWei Yuan
Signed-off-by Wei Yuan <weiyuan.wei@huawei.com> Acked-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10mm: hugetlb: fix __unmap_hugepage_range()Hillf Danton
First, after flushing TLB, we have no need to scan pte from start again. Second, before bail out loop, the address is forwarded one step. Signed-off-by: Hillf Danton <hillf.zj@alibaba-inc.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10mm, memcg: fix potential undefined behaviour in page stat accountingMichal Hocko
Since commit d7365e783edb ("mm: memcontrol: fix missed end-writeback page accounting") mem_cgroup_end_page_stat consumes locked and flags variables directly rather than via pointers which might trigger C undefined behavior as those variables are initialized only in the slow path of mem_cgroup_begin_page_stat. Although mem_cgroup_end_page_stat handles parameters correctly and touches them only when they hold a sensible value it is caller which loads a potentially uninitialized value which then might allow compiler to do crazy things. I haven't seen any warning from gcc and it seems that the current version (4.9) doesn't exploit this type undefined behavior but Sasha has reported the following: UBSan: Undefined behaviour in mm/rmap.c:1084:2 load of value 255 is not a valid value for type '_Bool' CPU: 4 PID: 8304 Comm: rngd Not tainted 3.18.0-rc2-next-20141029-sasha-00039-g77ed13d-dirty #1427 Call Trace: dump_stack (lib/dump_stack.c:52) ubsan_epilogue (lib/ubsan.c:159) __ubsan_handle_load_invalid_value (lib/ubsan.c:482) page_remove_rmap (mm/rmap.c:1084 mm/rmap.c:1096) unmap_page_range (./arch/x86/include/asm/atomic.h:27 include/linux/mm.h:463 mm/memory.c:1146 mm/memory.c:1258 mm/memory.c:1279 mm/memory.c:1303) unmap_single_vma (mm/memory.c:1348) unmap_vmas (mm/memory.c:1377 (discriminator 3)) exit_mmap (mm/mmap.c:2837) mmput (kernel/fork.c:659) do_exit (./arch/x86/include/asm/thread_info.h:168 kernel/exit.c:462 kernel/exit.c:747) do_group_exit (include/linux/sched.h:775 kernel/exit.c:873) SyS_exit_group (kernel/exit.c:901) tracesys_phase2 (arch/x86/kernel/entry_64.S:529) Fix this by using pointer parameters for both locked and flags and be more robust for future compiler changes even though the current code is implemented correctly. Signed-off-by: Michal Hocko <mhocko@suse.cz> Reported-by: Sasha Levin <sasha.levin@oracle.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10mm: memcontrol: drop bogus RCU locking from mem_cgroup_same_or_subtree()Johannes Weiner
None of the mem_cgroup_same_or_subtree() callers actually require it to take the RCU lock, either because they hold it themselves or they have css references. Remove it. To make the API change clear, rename the leftover helper to mem_cgroup_is_descendant() to match cgroup_is_descendant(). Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10mm: memcontrol: pull the NULL check from __mem_cgroup_same_or_subtree()Johannes Weiner
The NULL in mm_match_cgroup() comes from a possibly exiting mm->owner. It makes a lot more sense to check where it's looked up, rather than check for it in __mem_cgroup_same_or_subtree() where it's unexpected. No other callsite passes NULL to __mem_cgroup_same_or_subtree(). Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10mm: memcontrol: remove bogus NULL check after mem_cgroup_from_task()Johannes Weiner
That function acts like a typecast - unless NULL is passed in, no NULL can come out. task_in_mem_cgroup() callers don't pass NULL tasks. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: Michal Hocko <mhocko@suse.cz Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10mm: memcontrol: shorten the page statistics update slowpathJohannes Weiner
While moving charges from one memcg to another, page stat updates must acquire the old memcg's move_lock to prevent double accounting. That situation is denoted by an increased memcg->move_accounting. However, the charge moving code declares this way too early for now, even before summing up the RSS and pre-allocating destination charges. Shorten this slowpath mode by increasing memcg->move_accounting only right before walking the task's address space with the intention of actually moving the pages. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10thp: do not mark zero-page pmd write-protected explicitlyKirill A. Shutemov
Zero pages can be used only in anonymous mappings, which never have writable vma->vm_page_prot: see protection_map in mm/mmap.c and __PX1X definitions. Let's drop redundant pmd_wrprotect() in set_huge_zero_page(). Signed-off-by: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10memcg: use generic slab iterators for showing slabinfoVladimir Davydov
Let's use generic slab_start/next/stop for showing memcg caches info. In contrast to the current implementation, this will work even if all memcg caches' info doesn't fit into a seq buffer (a page), plus it simply looks neater. Actually, the main reason I do this isn't mere cleanup. I'm going to zap the memcg_slab_caches list, because I find it useless provided we have the slab_caches list, and this patch is a step in this direction. It should be noted that before this patch an attempt to read memory.kmem.slabinfo of a cgroup that doesn't have kmem limit set resulted in -EIO, while after this patch it will silently show nothing except the header, but I don't think it will frustrate anyone. Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10memcg: remove mem_cgroup_reclaimable check from soft reclaimVladimir Davydov
mem_cgroup_reclaimable() checks whether a cgroup has reclaimable pages on *any* NUMA node. However, the only place where it's called is mem_cgroup_soft_reclaim(), which tries to reclaim memory from a *specific* zone. So the way it is used is incorrect - it will return true even if the cgroup doesn't have pages on the zone we're scanning. I think we can get rid of this check completely, because mem_cgroup_shrink_node_zone(), which is called by mem_cgroup_soft_reclaim() if mem_cgroup_reclaimable() returns true, is equivalent to shrink_lruvec(), which exits almost immediately if the lruvec passed to it is empty. So there's no need to optimize anything here. Besides, we don't have such a check in the general scan path (shrink_zone) either. Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10mm: memcontrol: fold mem_cgroup_start_move()/mem_cgroup_end_move()Johannes Weiner
Having these functions and their documentation split out and somewhere makes it harder, not easier, to follow what's going on. Inline them directly where charge moving is prepared and finished, and put an explanation right next to it. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10mm: memcontrol: don't pass a NULL memcg to mem_cgroup_end_move()Johannes Weiner
mem_cgroup_end_move() checks if the passed memcg is NULL, along with a lengthy comment to explain why this seemingly non-sensical situation is even possible. Check in cancel_attach() itself whether can_attach() set up the move context or not, it's a lot more obvious from there. Then remove the check and comment in mem_cgroup_end_move(). Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10mm: memcontrol: inline memcg->move_lock lockingJohannes Weiner
The wrappers around taking and dropping the memcg->move_lock spinlock add nothing of value. Inline the spinlock calls into the callsites. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10mm: memcontrol: remove unnecessary PCG_USED pc->mem_cgroup valid flagJohannes Weiner
pc->mem_cgroup had to be left intact after uncharge for the final LRU removal, and !PCG_USED indicated whether the page was uncharged. But since commit 0a31bc97c80c ("mm: memcontrol: rewrite uncharge API") pages are uncharged after the final LRU removal. Uncharge can simply clear the pointer and the PCG_USED/PageCgroupUsed sites can test that instead. Because this is the last page_cgroup flag, this patch reduces the memcg per-page overhead to a single pointer. [akpm@linux-foundation.org: remove unneeded initialization of `memcg', per Michal] Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Hugh Dickins <hughd@google.com> Acked-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10mm: memcontrol: remove unnecessary PCG_MEM memory charge flagJohannes Weiner
PCG_MEM is a remnant from an earlier version of 0a31bc97c80c ("mm: memcontrol: rewrite uncharge API"), used to tell whether migration cleared a charge while leaving pc->mem_cgroup valid and PCG_USED set. But in the final version, mem_cgroup_migrate() directly uncharges the source page, rendering this distinction unnecessary. Remove it. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Hugh Dickins <hughd@google.com> Acked-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10mm: memcontrol: remove unnecessary PCG_MEMSW memory+swap charge flagJohannes Weiner
Now that mem_cgroup_swapout() fully uncharges the page, every page that is still in use when reaching mem_cgroup_uncharge() is known to carry both the memory and the memory+swap charge. Simplify the uncharge path and remove the PCG_MEMSW page flag accordingly. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Hugh Dickins <hughd@google.com> Acked-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10mm: memcontrol: uncharge pages on swapoutJohannes Weiner
This series gets rid of the remaining page_cgroup flags, thus cutting the memcg per-page overhead down to one pointer. This patch (of 4): mem_cgroup_swapout() is called with exclusive access to the page at the end of the page's lifetime. Instead of clearing the PCG_MEMSW flag and deferring the uncharge, just do it right away. This allows follow-up patches to simplify the uncharge code. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Hugh Dickins <hughd@google.com> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Vladimir Davydov <vdavydov@parallels.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10mm: memcontrol: micro-optimize mem_cgroup_split_huge_fixup()Michal Hocko
Don't call lookup_page_cgroup() when memcg is disabled. Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10memcg: remove activate_kmem_mutexVladimir Davydov
The activate_kmem_mutex is used to serialize memcg.kmem.limit updates, but we already serialize them with memcg_limit_mutex so let's remove the former. Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10mm: memcontrol: clarify migration where old page is unchargedJohannes Weiner
Better explain re-entrant migration when compaction races with reclaim, and also mention swapcache readahead pages as possible uncharged migration sources. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Hugh Dickins <hughd@google.com> Acked-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10mm: memcontrol: update mem_cgroup_page_lruvec() documentationJohannes Weiner
Commit 7512102cf64d ("memcg: fix GPF when cgroup removal races with last exit") added a pc->mem_cgroup reset into mem_cgroup_page_lruvec() to prevent a crash where an anon page gets uncharged on unmap, the memcg is released, and then the final LRU isolation on free dereferences the stale pc->mem_cgroup pointer. But since commit 0a31bc97c80c ("mm: memcontrol: rewrite uncharge API"), pages are only uncharged AFTER that final LRU isolation, which guarantees the memcg's lifetime until then. pc->mem_cgroup now only needs to be reset for swapcache readahead pages. Update the comment and callsite requirements accordingly. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Hugh Dickins <hughd@google.com> Acked-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10memcg: simplify unreclaimable groups handling in soft limit reclaimVladimir Davydov
If we fail to reclaim anything from a cgroup during a soft reclaim pass we want to get the next largest cgroup exceeding its soft limit. To achieve this, we should obviously remove the current group from the tree and then pick the largest group. Currently we have a weird loop instead. Let's simplify it. Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>