summaryrefslogtreecommitdiff
path: root/mm
AgeCommit message (Collapse)Author
2020-08-07mm/vmalloc: simplify merge_or_add_vmap_area()Uladzislau Rezki (Sony)
Currently when a VA is deallocated and is about to be placed back to the tree, it can be either: merged with next/prev neighbors or inserted if not coalesced. On those steps the tree can be populated several times. For example when both neighbors are merged. It can be avoided and simplified in fact. Therefore do it only once when VA points to final merged area, after all manipulations: merging/removing/inserting. Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20200527205054.1696-1-urezki@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07vmalloc: convert to XArrayMatthew Wilcox (Oracle)
The radix tree of vmap blocks is simpler to express as an XArray. Reduces both the text and data sizes of the object file and eliminates a user of the radix tree preload API. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Link: http://lkml.kernel.org/r/20200603171448.5894-1-willy@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07mm/sparse: cleanup the code surrounding memory_present()Mike Rapoport
After removal of CONFIG_HAVE_MEMBLOCK_NODE_MAP we have two equivalent functions that call memory_present() for each region in memblock.memory: sparse_memory_present_with_active_regions() and membocks_present(). Moreover, all architectures have a call to either of these functions preceding the call to sparse_init() and in the most cases they are called one after the other. Mark the regions from memblock.memory as present during sparce_init() by making sparse_init() call memblocks_present(), make memblocks_present() and memory_present() functions static and remove redundant sparse_memory_present_with_active_regions() function. Also remove no longer required HAVE_MEMORY_PRESENT configuration option. Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20200712083130.22919-1-rppt@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07mm/sparse: only sub-section aligned range would be populatedWei Yang
There are two code path which invoke __populate_section_memmap() * sparse_init_nid() * sparse_add_section() For both case, we are sure the memory range is sub-section aligned. * we pass PAGES_PER_SECTION to sparse_init_nid() * we check range by check_pfn_span() before calling sparse_add_section() Also, the counterpart of __populate_section_memmap(), we don't do such calculation and check since the range is checked by check_pfn_span() in __remove_pages(). Clear the calculation and check to keep it simple and comply with its counterpart. Signed-off-by: Wei Yang <richard.weiyang@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: David Hildenbrand <david@redhat.com> Link: http://lkml.kernel.org/r/20200703031828.14645-1-richard.weiyang@linux.alibaba.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07mm/sparse: never partially remove memmap for early sectionWei Yang
For early sections, its memmap is handled specially even sub-section is enabled. The memmap could only be populated as a whole. Quoted from the comment of section_activate(): * The early init code does not consider partially populated * initial sections, it simply assumes that memory will never be * referenced. If we hot-add memory into such a section then we * do not need to populate the memmap and can simply reuse what * is already there. While current section_deactivate() breaks this rule. When hot-remove a sub-section, section_deactivate() would depopulate its memmap. The consequence is if we hot-add this subsection again, its memmap never get proper populated. We can reproduce the case by following steps: 1. Hacking qemu to allow sub-section early section : diff --git a/hw/i386/pc.c b/hw/i386/pc.c : index 51b3050d01..c6a78d83c0 100644 : --- a/hw/i386/pc.c : +++ b/hw/i386/pc.c : @@ -1010,7 +1010,7 @@ void pc_memory_init(PCMachineState *pcms, : } : : machine->device_memory->base = : - ROUND_UP(0x100000000ULL + x86ms->above_4g_mem_size, 1 * GiB); : + 0x100000000ULL + x86ms->above_4g_mem_size; : : if (pcmc->enforce_aligned_dimm) { : /* size device region assuming 1G page max alignment per slot */ 2. Bootup qemu with PSE disabled and a sub-section aligned memory size Part of the qemu command would look like this: sudo x86_64-softmmu/qemu-system-x86_64 \ --enable-kvm -cpu host,pse=off \ -m 4160M,maxmem=20G,slots=1 \ -smp sockets=2,cores=16 \ -numa node,nodeid=0,cpus=0-1 -numa node,nodeid=1,cpus=2-3 \ -machine pc,nvdimm \ -nographic \ -object memory-backend-ram,id=mem0,size=8G \ -device nvdimm,id=vm0,memdev=mem0,node=0,addr=0x144000000,label-size=128k 3. Re-config a pmem device with sub-section size in guest ndctl create-namespace --force --reconfig=namespace0.0 --mode=devdax --size=16M Then you would see the following call trace: pmem0: detected capacity change from 0 to 16777216 BUG: unable to handle page fault for address: ffffec73c51000b4 #PF: supervisor write access in kernel mode #PF: error_code(0x0002) - not-present page PGD 81ff8067 P4D 81ff8067 PUD 81ff7067 PMD 1437cb067 PTE 0 Oops: 0002 [#1] SMP NOPTI CPU: 16 PID: 1348 Comm: ndctl Kdump: loaded Tainted: G W 5.8.0-rc2+ #24 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.4 RIP: 0010:memmap_init_zone+0x154/0x1c2 Code: 77 16 f6 40 10 02 74 10 48 03 48 08 48 89 cb 48 c1 eb 0c e9 3a ff ff ff 48 89 df 48 c1 e7 06 48f RSP: 0018:ffffbdc7011a39b0 EFLAGS: 00010282 RAX: ffffec73c5100088 RBX: 0000000000144002 RCX: 0000000000144000 RDX: 0000000000000004 RSI: 007ffe0000000000 RDI: ffffec73c5100080 RBP: 027ffe0000000000 R08: 0000000000000001 R09: ffff9f8d38f6d708 R10: ffffec73c0000000 R11: 0000000000000000 R12: 0000000000000004 R13: 0000000000000001 R14: 0000000000144200 R15: 0000000000000000 FS: 00007efe6b65d780(0000) GS:ffff9f8d3f780000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffec73c51000b4 CR3: 000000007d718000 CR4: 0000000000340ee0 Call Trace: move_pfn_range_to_zone+0x128/0x150 memremap_pages+0x4e4/0x5a0 devm_memremap_pages+0x1e/0x60 dev_dax_probe+0x69/0x160 [device_dax] really_probe+0x298/0x3c0 driver_probe_device+0xe1/0x150 ? driver_allows_async_probing+0x50/0x50 bus_for_each_drv+0x7e/0xc0 __device_attach+0xdf/0x160 bus_probe_device+0x8e/0xa0 device_add+0x3b9/0x740 __devm_create_dev_dax+0x127/0x1c0 __dax_pmem_probe+0x1f2/0x219 [dax_pmem_core] dax_pmem_probe+0xc/0x1b [dax_pmem] nvdimm_bus_probe+0x69/0x1c0 [libnvdimm] really_probe+0x147/0x3c0 driver_probe_device+0xe1/0x150 device_driver_attach+0x53/0x60 bind_store+0xd1/0x110 kernfs_fop_write+0xce/0x1b0 vfs_write+0xb6/0x1a0 ksys_write+0x5f/0xe0 do_syscall_64+0x4d/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Fixes: ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug") Signed-off-by: Wei Yang <richard.weiyang@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: David Hildenbrand <david@redhat.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Dan Williams <dan.j.williams@intel.com> Link: http://lkml.kernel.org/r/20200625223534.18024-1-richard.weiyang@linux.alibaba.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07mm/mremap: start addresses are properly alignedWei Yang
After previous cleanup, extent is the minimal step for both source and destination. This means when extent is HPAGE_PMD_SIZE or PMD_SIZE, old_addr and new_addr are properly aligned too. Since these two functions are only invoked in move_page_tables, it is safe to remove the check now. Signed-off-by: Wei Yang <richard.weiyang@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Dmitry Osipenko <digetx@gmail.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Peter Xu <peterx@redhat.com> Cc: Sean Christopherson <sean.j.christopherson@intel.com> Cc: Thomas Hellstrom <thellstrom@vmware.com> Cc: Thomas Hellstrom (VMware) <thomas_os@shipmail.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yang Shi <yang.shi@linux.alibaba.com> Link: http://lkml.kernel.org/r/20200708095028.41706-4-richard.weiyang@linux.alibaba.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07mm/mremap: calculate extent in one placeWei Yang
Page tables is moved on the base of PMD. This requires both source and destination range should meet the requirement. Current code works well since move_huge_pmd() and move_normal_pmd() would check old_addr and new_addr again. And then return to move_ptes() if the either of them is not aligned. Instead of calculating the extent separately, it is better to calculate in one place, so we know it is not necessary to try move pmd. By doing so, the logic seems a little clear. Signed-off-by: Wei Yang <richard.weiyang@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Dmitry Osipenko <digetx@gmail.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Peter Xu <peterx@redhat.com> Cc: Sean Christopherson <sean.j.christopherson@intel.com> Cc: Thomas Hellstrom <thellstrom@vmware.com> Cc: Thomas Hellstrom (VMware) <thomas_os@shipmail.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yang Shi <yang.shi@linux.alibaba.com> Link: http://lkml.kernel.org/r/20200708095028.41706-3-richard.weiyang@linux.alibaba.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07mm/mremap: it is sure to have enough space when extent meets requirementWei Yang
Patch series "mm/mremap: cleanup move_page_tables() a little", v5. move_page_tables() tries to move page table by PMD or PTE. The root reason is if it tries to move PMD, both old and new range should be PMD aligned. But current code calculate old range and new range separately. This leads to some redundant check and calculation. This cleanup tries to consolidate the range check in one place to reduce some extra range handling. This patch (of 3): old_end is passed to these two functions to check whether there is enough space to do the move, while this check is done before invoking these functions. These two functions only would be invoked when extent meets the requirement and there is one check before invoking these functions: if (extent > old_end - old_addr) extent = old_end - old_addr; This implies (old_end - old_addr) won't fail the check in these two functions. Signed-off-by: Wei Yang <richard.weiyang@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Dmitry Osipenko <digetx@gmail.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yang Shi <yang.shi@linux.alibaba.com> Cc: Thomas Hellstrom (VMware) <thomas_os@shipmail.org> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Sean Christopherson <sean.j.christopherson@intel.com> Cc: Wei Yang <richard.weiyang@linux.alibaba.com> Cc: Peter Xu <peterx@redhat.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Thomas Hellstrom <thellstrom@vmware.com> Link: http://lkml.kernel.org/r/20200710092835.56368-1-richard.weiyang@linux.alibaba.com Link: http://lkml.kernel.org/r/20200710092835.56368-2-richard.weiyang@linux.alibaba.com Link: http://lkml.kernel.org/r/20200708095028.41706-1-richard.weiyang@linux.alibaba.com Link: http://lkml.kernel.org/r/20200708095028.41706-2-richard.weiyang@linux.alibaba.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07mm: remove unnecessary wrapper function do_mmap_pgoff()Peter Collingbourne
The current split between do_mmap() and do_mmap_pgoff() was introduced in commit 1fcfd8db7f82 ("mm, mpx: add "vm_flags_t vm_flags" arg to do_mmap_pgoff()") to support MPX. The wrapper function do_mmap_pgoff() always passed 0 as the value of the vm_flags argument to do_mmap(). However, MPX support has subsequently been removed from the kernel and there were no more direct callers of do_mmap(); all calls were going via do_mmap_pgoff(). Simplify the code by removing do_mmap_pgoff() and changing all callers to directly call do_mmap(), which now no longer takes a vm_flags argument. Signed-off-by: Peter Collingbourne <pcc@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: David Hildenbrand <david@redhat.com> Link: http://lkml.kernel.org/r/20200727194109.1371462-1-pcc@google.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07mm: mmap: merge vma after call_mmap() if possibleMiaohe Lin
The vm_flags may be changed after call_mmap() because drivers may set some flags for their own purpose. As a result, we failed to merge the adjacent vma due to the different vm_flags as userspace can't pass in the same one. Try to merge vma after call_mmap() to fix this issue. Signed-off-by: Hongxiang Lou <louhongxiang@huawei.com> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/1594954065-23733-1-git-send-email-linmiaohe@huawei.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07mm/sparsemem: enable vmem_altmap support in vmemmap_alloc_block_buf()Anshuman Khandual
There are many instances where vmemap allocation is often switched between regular memory and device memory just based on whether altmap is available or not. vmemmap_alloc_block_buf() is used in various platforms to allocate vmemmap mappings. Lets also enable it to handle altmap based device memory allocation along with existing regular memory allocations. This will help in avoiding the altmap based allocation switch in many places. To summarize there are two different methods to call vmemmap_alloc_block_buf(). vmemmap_alloc_block_buf(size, node, NULL) /* Allocate from system RAM */ vmemmap_alloc_block_buf(size, node, altmap) /* Allocate from altmap */ This converts altmap_alloc_block_buf() into a static function, drops it's entry from the header and updates Documentation/vm/memory-model.rst. Suggested-by: Robin Murphy <robin.murphy@arm.com> Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Jia He <justin.he@arm.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Will Deacon <will@kernel.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Hsin-Yi Wang <hsinyi@chromium.org> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Steve Capper <steve.capper@arm.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Yu Zhao <yuzhao@google.com> Link: http://lkml.kernel.org/r/1594004178-8861-3-git-send-email-anshuman.khandual@arm.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07mm/sparsemem: enable vmem_altmap support in vmemmap_populate_basepages()Anshuman Khandual
Patch series "arm64: Enable vmemmap mapping from device memory", v4. This series enables vmemmap backing memory allocation from device memory ranges on arm64. But before that, it enables vmemmap_populate_basepages() and vmemmap_alloc_block_buf() to accommodate struct vmem_altmap based alocation requests. This patch (of 3): vmemmap_populate_basepages() is used across platforms to allocate backing memory for vmemmap mapping. This is used as a standard default choice or as a fallback when intended huge pages allocation fails. This just creates entire vmemmap mapping with base pages (PAGE_SIZE). On arm64 platforms, vmemmap_populate_basepages() is called instead of the platform specific vmemmap_populate() when ARM64_SWAPPER_USES_SECTION_MAPS is not enabled as in case for ARM64_16K_PAGES and ARM64_64K_PAGES configs. At present vmemmap_populate_basepages() does not support allocating from driver defined struct vmem_altmap while trying to create vmemmap mapping for a device memory range. It prevents ARM64_16K_PAGES and ARM64_64K_PAGES configs on arm64 from supporting device memory with vmemap_altmap request. This enables vmem_altmap support in vmemmap_populate_basepages() unlocking device memory allocation for vmemap mapping on arm64 platforms with 16K or 64K base page configs. Each architecture should evaluate and decide on subscribing device memory based base page allocation through vmemmap_populate_basepages(). Hence lets keep it disabled on all archs in order to preserve the existing semantics. A subsequent patch enables it on arm64. Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Jia He <justin.he@arm.com> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: Will Deacon <will@kernel.org> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Michal Hocko <mhocko@suse.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Borislav Petkov <bp@alien8.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Hsin-Yi Wang <hsinyi@chromium.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Mackerras <paulus@samba.org> Cc: Robin Murphy <robin.murphy@arm.com> Cc: Steve Capper <steve.capper@arm.com> Cc: Yu Zhao <yuzhao@google.com> Link: http://lkml.kernel.org/r/1594004178-8861-1-git-send-email-anshuman.khandual@arm.com Link: http://lkml.kernel.org/r/1594004178-8861-2-git-send-email-anshuman.khandual@arm.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07mm: adjust vm_committed_as_batch according to vm overcommit policyFeng Tang
When checking a performance change for will-it-scale scalability mmap test [1], we found very high lock contention for spinlock of percpu counter 'vm_committed_as': 94.14% 0.35% [kernel.kallsyms] [k] _raw_spin_lock_irqsave 48.21% _raw_spin_lock_irqsave;percpu_counter_add_batch;__vm_enough_memory;mmap_region;do_mmap; 45.91% _raw_spin_lock_irqsave;percpu_counter_add_batch;__do_munmap; Actually this heavy lock contention is not always necessary. The 'vm_committed_as' needs to be very precise when the strict OVERCOMMIT_NEVER policy is set, which requires a rather small batch number for the percpu counter. So keep 'batch' number unchanged for strict OVERCOMMIT_NEVER policy, and lift it to 64X for OVERCOMMIT_ALWAYS and OVERCOMMIT_GUESS policies. Also add a sysctl handler to adjust it when the policy is reconfigured. Benchmark with the same testcase in [1] shows 53% improvement on a 8C/16T desktop, and 2097%(20X) on a 4S/72C/144T server. We tested with test platforms in 0day (server, desktop and laptop), and 80%+ platforms shows improvements with that test. And whether it shows improvements depends on if the test mmap size is bigger than the batch number computed. And if the lift is 16X, 1/3 of the platforms will show improvements, though it should help the mmap/unmap usage generally, as Michal Hocko mentioned: : I believe that there are non-synthetic worklaods which would benefit from : a larger batch. E.g. large in memory databases which do large mmaps : during startups from multiple threads. [1] https://lore.kernel.org/lkml/20200305062138.GI5972@shao2-debian/ Signed-off-by: Feng Tang <feng.tang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Qian Cai <cai@lca.pw> Cc: Kees Cook <keescook@chromium.org> Cc: Andi Kleen <andi.kleen@intel.com> Cc: Tim Chen <tim.c.chen@intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Christoph Lameter <cl@linux.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: kernel test robot <rong.a.chen@intel.com> Cc: "K. Y. Srinivasan" <kys@microsoft.com> Cc: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/1589611660-89854-4-git-send-email-feng.tang@intel.com Link: http://lkml.kernel.org/r/1592725000-73486-4-git-send-email-feng.tang@intel.com Link: http://lkml.kernel.org/r/1594389708-60781-5-git-send-email-feng.tang@intel.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07mm/util.c: make vm_memory_committed() more accurateFeng Tang
percpu_counter_sum_positive() will provide more accurate info. As with percpu_counter_read_positive(), in worst case the deviation could be 'batch * nr_cpus', which is totalram_pages/256 for now, and will be more when the batch gets enlarged. Its time cost is about 800 nanoseconds on a 2C/4T platform and 2~3 microseconds on a 2S/36C/72T Skylake server in normal case, and in worst case where vm_committed_as's spinlock is under severe contention, it costs 30~40 microseconds for the 2S/36C/72T Skylake sever, which should be fine for its only two users: /proc/meminfo and HyperV balloon driver's status trace per second. Signed-off-by: Feng Tang <feng.tang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Michal Hocko <mhocko@suse.com> # for /proc/meminfo Cc: "K. Y. Srinivasan" <kys@microsoft.com> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Qian Cai <cai@lca.pw> Cc: Andi Kleen <andi.kleen@intel.com> Cc: Tim Chen <tim.c.chen@intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Christoph Lameter <cl@linux.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: kernel test robot <rong.a.chen@intel.com> Cc: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/1592725000-73486-3-git-send-email-feng.tang@intel.com Link: http://lkml.kernel.org/r/1594389708-60781-3-git-send-email-feng.tang@intel.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07mm/mmap: optimize a branch judgment in ksys_mmap_pgoff()Zhen Lei
Look at the pseudo code below. It's very clear that, the judgement "!is_file_hugepages(file)" at 3) is duplicated to the one at 1), we can use "else if" to avoid it. And the assignment "retval = -EINVAL" at 2) is only needed by the branch 3), because "retval" will be overwritten at 4). No functional change, but it can reduce the code size. Maybe more clearer? Before: text data bss dec hex filename 28733 1590 1 30324 7674 mm/mmap.o After: text data bss dec hex filename 28701 1590 1 30292 7654 mm/mmap.o ====pseudo code====: if (!(flags & MAP_ANONYMOUS)) { ... 1) if (is_file_hugepages(file)) len = ALIGN(len, huge_page_size(hstate_file(file))); 2) retval = -EINVAL; 3) if (unlikely(flags & MAP_HUGETLB && !is_file_hugepages(file))) goto out_fput; } else if (flags & MAP_HUGETLB) { ... } ... 4) retval = vm_mmap_pgoff(file, addr, len, prot, flags, pgoff); out_fput: ... return retval; Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20200705080112.1405-1-thunder.leizhen@huawei.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07mm: move p?d_alloc_track to separate header fileJoerg Roedel
The functions are only used in two source files, so there is no need for them to be in the global <linux/mm.h> header. Move them to the new <linux/pgalloc-track.h> header and include it only where needed. Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Pekka Enberg <penberg@kernel.org> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com> Cc: Satheesh Rajendran <sathnaga@linux.vnet.ibm.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Steven Rostedt (VMware) <rostedt@goodmis.org> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Stafford Horne <shorne@gmail.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Matthew Wilcox <willy@infradead.org> Link: http://lkml.kernel.org/r/20200609120533.25867-1-joro@8bytes.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07mm: move lib/ioremap.c to mm/Mike Rapoport
The functionality in lib/ioremap.c deals with pagetables, vmalloc and caches, so it naturally belongs to mm/ Moving it there will also allow declaring p?d_alloc_track functions in an header file inside mm/ rather than having those declarations in include/linux/mm.h Suggested-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Pekka Enberg <penberg@kernel.org> Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Joerg Roedel <joro@8bytes.org> Cc: Joerg Roedel <jroedel@suse.de> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Satheesh Rajendran <sathnaga@linux.vnet.ibm.com> Cc: Stafford Horne <shorne@gmail.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Matthew Wilcox <willy@infradead.org> Link: http://lkml.kernel.org/r/20200627143453.31835-8-rppt@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07mm: remove unneeded includes of <asm/pgalloc.h>Mike Rapoport
Patch series "mm: cleanup usage of <asm/pgalloc.h>" Most architectures have very similar versions of pXd_alloc_one() and pXd_free_one() for intermediate levels of page table. These patches add generic versions of these functions in <asm-generic/pgalloc.h> and enable use of the generic functions where appropriate. In addition, functions declared and defined in <asm/pgalloc.h> headers are used mostly by core mm and early mm initialization in arch and there is no actual reason to have the <asm/pgalloc.h> included all over the place. The first patch in this series removes unneeded includes of <asm/pgalloc.h> In the end it didn't work out as neatly as I hoped and moving pXd_alloc_track() definitions to <asm-generic/pgalloc.h> would require unnecessary changes to arches that have custom page table allocations, so I've decided to move lib/ioremap.c to mm/ and make pgalloc-track.h local to mm/. This patch (of 8): In most cases <asm/pgalloc.h> header is required only for allocations of page table memory. Most of the .c files that include that header do not use symbols declared in <asm/pgalloc.h> and do not require that header. As for the other header files that used to include <asm/pgalloc.h>, it is possible to move that include into the .c file that actually uses symbols from <asm/pgalloc.h> and drop the include from the header file. The process was somewhat automated using sed -i -E '/[<"]asm\/pgalloc\.h/d' \ $(grep -L -w -f /tmp/xx \ $(git grep -E -l '[<"]asm/pgalloc\.h')) where /tmp/xx contains all the symbols defined in arch/*/include/asm/pgalloc.h. [rppt@linux.ibm.com: fix powerpc warning] Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Pekka Enberg <penberg@kernel.org> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> [m68k] Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Joerg Roedel <joro@8bytes.org> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Satheesh Rajendran <sathnaga@linux.vnet.ibm.com> Cc: Stafford Horne <shorne@gmail.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Joerg Roedel <jroedel@suse.de> Cc: Matthew Wilcox <willy@infradead.org> Link: http://lkml.kernel.org/r/20200627143453.31835-1-rppt@kernel.org Link: http://lkml.kernel.org/r/20200627143453.31835-2-rppt@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07mm/memory.c: make remap_pfn_range() reject unaligned addrAlex Zhang
This function implicitly assumes that the addr passed in is page aligned. A non page aligned addr could ultimately cause a kernel bug in remap_pte_range as the exit condition in the logic loop may never be satisfied. This patch documents the need for the requirement, as well as explicitly adds a check for it. Signed-off-by: Alex Zhang <zhangalex@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20200617233512.177519-1-zhangalex@google.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07mm: remove redundant check non_swap_entry()Ralph Campbell
In zap_pte_range(), the check for non_swap_entry() and is_device_private_entry() is unnecessary since the latter is sufficient to determine if the page is a device private page. Remove the test for non_swap_entry() to simplify the code and for clarity. Signed-off-by: Ralph Campbell <rcampbell@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Jason Gunthorpe <jgg@mellanox.com> Acked-by: David Hildenbrand <david@redhat.com> Link: http://lkml.kernel.org/r/20200615175405.4613-1-rcampbell@nvidia.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07mm/page_counter.c: fix protection usage propagationMichal Koutný
When workload runs in cgroups that aren't directly below root cgroup and their parent specifies reclaim protection, it may end up ineffective. The reason is that propagate_protected_usage() is not called in all hierarchy up. All the protected usage is incorrectly accumulated in the workload's parent. This means that siblings_low_usage is overestimated and effective protection underestimated. Even though it is transitional phenomenon (uncharge path does correct propagation and fixes the wrong children_low_usage), it can undermine the intended protection unexpectedly. We have noticed this problem while seeing a swap out in a descendant of a protected memcg (intermediate node) while the parent was conveniently under its protection limit and the memory pressure was external to that hierarchy. Michal has pinpointed this down to the wrong siblings_low_usage which led to the unwanted reclaim. The fix is simply updating children_low_usage in respective ancestors also in the charging path. Fixes: 230671533d64 ("mm: memory.low hierarchical behavior") Signed-off-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Roman Gushchin <guro@fb.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Tejun Heo <tj@kernel.org> Cc: <stable@vger.kernel.org> [4.18+] Link: http://lkml.kernel.org/r/20200803153231.15477-1-mhocko@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07mm: memcontrol: don't count limit-setting reclaim as memory pressureJohannes Weiner
When an outside process lowers one of the memory limits of a cgroup (or uses the force_empty knob in cgroup1), direct reclaim is performed in the context of the write(), in order to directly enforce the new limit and have it being met by the time the write() returns. Currently, this reclaim activity is accounted as memory pressure in the cgroup that the writer(!) belongs to. This is unexpected. It specifically causes problems for senpai (https://github.com/facebookincubator/senpai), which is an agent that routinely adjusts the memory limits and performs associated reclaim work in tens or even hundreds of cgroups running on the host. The cgroup that senpai is running in itself will report elevated levels of memory pressure, even though it itself is under no memory shortage or any sort of distress. Move the psi annotation from the central cgroup reclaim function to callsites in the allocation context, and thereby no longer count any limit-setting reclaim as memory pressure. If the newly set limit causes the workload inside the cgroup into direct reclaim, that of course will continue to count as memory pressure. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Roman Gushchin <guro@fb.com> Acked-by: Chris Down <chris@chrisdown.name> Acked-by: Michal Hocko <mhocko@suse.com> Link: http://lkml.kernel.org/r/20200728135210.379885-2-hannes@cmpxchg.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07mm: memcontrol: restore proper dirty throttling when memory.high changesJohannes Weiner
Commit 8c8c383c04f6 ("mm: memcontrol: try harder to set a new memory.high") inadvertently removed a callback to recalculate the writeback cache size in light of a newly configured memory.high limit. Without letting the writeback cache know about a potentially heavily reduced limit, it may permit too many dirty pages, which can cause unnecessary reclaim latencies or even avoidable OOM situations. This was spotted while reading the code, it hasn't knowingly caused any problems in practice so far. Fixes: 8c8c383c04f6 ("mm: memcontrol: try harder to set a new memory.high") Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Chris Down <chris@chrisdown.name> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Roman Gushchin <guro@fb.com> Link: http://lkml.kernel.org/r/20200728135210.379885-1-hannes@cmpxchg.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07memcg, oom: check memcg margin for parallel oomYafang Shao
Memcg oom killer invocation is synchronized by the global oom_lock and tasks are sleeping on the lock while somebody is selecting the victim or potentially race with the oom_reaper is releasing the victim's memory. This can result in a pointless oom killer invocation because a waiter might be racing with the oom_reaper P1 oom_reaper P2 oom_reap_task mutex_lock(oom_lock) out_of_memory # no victim because we have one already __oom_reap_task_mm mute_unlock(oom_lock) mutex_lock(oom_lock) set MMF_OOM_SKIP select_bad_process # finds a new victim The page allocator prevents from this race by trying to allocate after the lock can be acquired (in __alloc_pages_may_oom) which acts as a last minute check. Moreover page allocator simply doesn't block on the oom_lock and simply retries the whole reclaim process. Memcg oom killer should do the last minute check as well. Call mem_cgroup_margin to do that. Trylock on the oom_lock could be done as well but this doesn't seem to be necessary at this stage. [mhocko@kernel.org: commit log] Suggested-by: Michal Hocko <mhocko@kernel.org> Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Chris Down <chris@chrisdown.name> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Link: http://lkml.kernel.org/r/1594735034-19190-1-git-send-email-laoar.shao@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07mm, memcg: decouple e{low,min} state mutations from protection checksChris Down
mem_cgroup_protected currently is both used to set effective low and min and return a mem_cgroup_protection based on the result. As a user, this can be a little unexpected: it appears to be a simple predicate function, if not for the big warning in the comment above about the order in which it must be executed. This change makes it so that we separate the state mutations from the actual protection checks, which makes it more obvious where we need to be careful mutating internal state, and where we are simply checking and don't need to worry about that. [mhocko@suse.com - don't check protection on root memcgs] Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Chris Down <chris@chrisdown.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Roman Gushchin <guro@fb.com> Cc: Yafang Shao <laoar.shao@gmail.com> Link: http://lkml.kernel.org/r/ff3f915097fcee9f6d7041c084ef92d16aaeb56a.1594638158.git.chris@chrisdown.name Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07mm, memcg: avoid stale protection values when cgroup is above protectionYafang Shao
Patch series "mm, memcg: memory.{low,min} reclaim fix & cleanup", v4. This series contains a fix for a edge case in my earlier protection calculation patches, and a patch to make the area overall a little more robust to hopefully help avoid this in future. This patch (of 2): A cgroup can have both memory protection and a memory limit to isolate it from its siblings in both directions - for example, to prevent it from being shrunk below 2G under high pressure from outside, but also from growing beyond 4G under low pressure. Commit 9783aa9917f8 ("mm, memcg: proportional memory.{low,min} reclaim") implemented proportional scan pressure so that multiple siblings in excess of their protection settings don't get reclaimed equally but instead in accordance to their unprotected portion. During limit reclaim, this proportionality shouldn't apply of course: there is no competition, all pressure is from within the cgroup and should be applied as such. Reclaim should operate at full efficiency. However, mem_cgroup_protected() never expected anybody to look at the effective protection values when it indicated that the cgroup is above its protection. As a result, a query during limit reclaim may return stale protection values that were calculated by a previous reclaim cycle in which the cgroup did have siblings. When this happens, reclaim is unnecessarily hesitant and potentially slow to meet the desired limit. In theory this could lead to premature OOM kills, although it's not obvious this has occurred in practice. Workaround the problem by special casing reclaim roots in mem_cgroup_protection. These memcgs are never participating in the reclaim protection because the reclaim is internal. We have to ignore effective protection values for reclaim roots because mem_cgroup_protected might be called from racing reclaim contexts with different roots. Calculation is relying on root -> leaf tree traversal therefore top-down reclaim protection invariants should hold. The only exception is the reclaim root which should have effective protection set to 0 but that would be problematic for the following setup: Let's have global and A's reclaim in parallel: | A (low=2G, usage = 3G, max = 3G, children_low_usage = 1.5G) |\ | C (low = 1G, usage = 2.5G) B (low = 1G, usage = 0.5G) for A reclaim we have B.elow = B.low C.elow = C.low For the global reclaim A.elow = A.low B.elow = min(B.usage, B.low) because children_low_usage <= A.elow C.elow = min(C.usage, C.low) With the effective values resetting we have A reclaim A.elow = 0 B.elow = B.low C.elow = C.low and global reclaim could see the above and then B.elow = C.elow = 0 because children_low_usage > A.elow Which means that protected memcgs would get reclaimed. In future we would like to make mem_cgroup_protected more robust against racing reclaim contexts but that is likely more complex solution than this simple workaround. [hannes@cmpxchg.org - large part of the changelog] [mhocko@suse.com - workaround explanation] [chris@chrisdown.name - retitle] Fixes: 9783aa9917f8 ("mm, memcg: proportional memory.{low,min} reclaim") Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Chris Down <chris@chrisdown.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Chris Down <chris@chrisdown.name> Acked-by: Roman Gushchin <guro@fb.com> Link: http://lkml.kernel.org/r/cover.1594638158.git.chris@chrisdown.name Link: http://lkml.kernel.org/r/044fb8ecffd001c7905d27c0c2ad998069fdc396.1594638158.git.chris@chrisdown.name Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07mm, memcg: unify reclaim retry limits with page allocatorChris Down
Reclaim retries have been set to 5 since the beginning of time in commit 66e1707bc346 ("Memory controller: add per cgroup LRU and reclaim"). However, we now have a generally agreed-upon standard for page reclaim: MAX_RECLAIM_RETRIES (currently 16), added many years later in commit 0a0337e0d1d1 ("mm, oom: rework oom detection"). In the absence of a compelling reason to declare an OOM earlier in memcg context than page allocator context, it seems reasonable to supplant MEM_CGROUP_RECLAIM_RETRIES with MAX_RECLAIM_RETRIES, making the page allocator and memcg internals more similar in semantics when reclaim fails to produce results, avoiding premature OOMs or throttling. Signed-off-by: Chris Down <chris@chrisdown.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Tejun Heo <tj@kernel.org> Cc: Roman Gushchin <guro@fb.com> Link: http://lkml.kernel.org/r/da557856c9c7654308eaff4eedc1952a95e8df5f.1594640214.git.chris@chrisdown.name Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07mm, memcg: reclaim more aggressively before high allocator throttlingChris Down
Patch series "mm, memcg: reclaim harder before high throttling", v2. This patch (of 2): In Facebook production, we've seen cases where cgroups have been put into allocator throttling even when they appear to have a lot of slack file caches which should be trivially reclaimable. Looking more closely, the problem is that we only try a single cgroup reclaim walk for each return to usermode before calculating whether or not we should throttle. This single attempt doesn't produce enough pressure to shrink for cgroups with a rapidly growing amount of file caches prior to entering allocator throttling. As an example, we see that threads in an affected cgroup are stuck in allocator throttling: # for i in $(cat cgroup.threads); do > grep over_high "/proc/$i/stack" > done [<0>] mem_cgroup_handle_over_high+0x10b/0x150 [<0>] mem_cgroup_handle_over_high+0x10b/0x150 [<0>] mem_cgroup_handle_over_high+0x10b/0x150 ...however, there is no I/O pressure reported by PSI, despite a lot of slack file pages: # cat memory.pressure some avg10=78.50 avg60=84.99 avg300=84.53 total=5702440903 full avg10=78.50 avg60=84.99 avg300=84.53 total=5702116959 # cat io.pressure some avg10=0.00 avg60=0.00 avg300=0.00 total=78051391 full avg10=0.00 avg60=0.00 avg300=0.00 total=78049640 # grep _file memory.stat inactive_file 1370939392 active_file 661635072 This patch changes the behaviour to retry reclaim either until the current task goes below the 10ms grace period, or we are making no reclaim progress at all. In the latter case, we enter reclaim throttling as before. To a user, there's no intuitive reason for the reclaim behaviour to differ from hitting memory.high as part of a new allocation, as opposed to hitting memory.high because someone lowered its value. As such this also brings an added benefit: it unifies the reclaim behaviour between the two. There's precedent for this behaviour: we already do reclaim retries when writing to memory.{high,max}, in max reclaim, and in the page allocator itself. Signed-off-by: Chris Down <chris@chrisdown.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Tejun Heo <tj@kernel.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Roman Gushchin <guro@fb.com> Link: http://lkml.kernel.org/r/cover.1594640214.git.chris@chrisdown.name Link: http://lkml.kernel.org/r/a4e23b59e9ef499b575ae73a8120ee089b7d3373.1594640214.git.chris@chrisdown.name Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07mm: memcontrol: avoid workload stalls when lowering memory.highRoman Gushchin
Memory.high limit is implemented in a way such that the kernel penalizes all threads which are allocating a memory over the limit. Forcing all threads into the synchronous reclaim and adding some artificial delays allows to slow down the memory consumption and potentially give some time for userspace oom handlers/resource control agents to react. It works nicely if the memory usage is hitting the limit from below, however it works sub-optimal if a user adjusts memory.high to a value way below the current memory usage. It basically forces all workload threads (doing any memory allocations) into the synchronous reclaim and sleep. This makes the workload completely unresponsive for a long period of time and can also lead to a system-wide contention on lru locks. It can happen even if the workload is not actually tight on memory and has, for example, a ton of cold pagecache. In the current implementation writing to memory.high causes an atomic update of page counter's high value followed by an attempt to reclaim enough memory to fit into the new limit. To fix the problem described above, all we need is to change the order of execution: try to push the memory usage under the limit first, and only then set the new high limit. Reported-by: Domas Mituzas <domas@fb.com> Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Tejun Heo <tj@kernel.org> Cc: Chris Down <chris@chrisdown.name> Link: http://lkml.kernel.org/r/20200709194718.189231-1-guro@fb.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07mm: slab: rename (un)charge_slab_page() to (un)account_slab_page()Roman Gushchin
charge_slab_page() and uncharge_slab_page() are not related anymore to memcg charging and uncharging. In order to make their names less confusing, let's rename them to account_slab_page() and unaccount_slab_page() respectively. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Pekka Enberg <penberg@kernel.org> Link: http://lkml.kernel.org/r/20200707173612.124425-2-guro@fb.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07mm: memcg/slab: remove unused argument by charge_slab_page()Roman Gushchin
charge_slab_page() is not using the gfp argument anymore, remove it. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Link: http://lkml.kernel.org/r/20200707173612.124425-1-guro@fb.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07mm: memcontrol: account kernel stack per nodeShakeel Butt
Currently the kernel stack is being accounted per-zone. There is no need to do that. In addition due to being per-zone, memcg has to keep a separate MEMCG_KERNEL_STACK_KB. Make the stat per-node and deprecate MEMCG_KERNEL_STACK_KB as memcg_stat_item is an extension of node_stat_item. In addition localize the kernel stack stats updates to account_kernel_stack(). Signed-off-by: Shakeel Butt <shakeelb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Roman Gushchin <guro@fb.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Link: http://lkml.kernel.org/r/20200630161539.1759185-1-shakeelb@google.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07mm: memcg/slab: use a single set of kmem_caches for all allocationsRoman Gushchin
Instead of having two sets of kmem_caches: one for system-wide and non-accounted allocations and the second one shared by all accounted allocations, we can use just one. The idea is simple: space for obj_cgroup metadata can be allocated on demand and filled only for accounted allocations. It allows to remove a bunch of code which is required to handle kmem_cache clones for accounted allocations. There is no more need to create them, accumulate statistics, propagate attributes, etc. It's a quite significant simplification. Also, because the total number of slab_caches is reduced almost twice (not all kmem_caches have a memcg clone), some additional memory savings are expected. On my devvm it additionally saves about 3.5% of slab memory. [guro@fb.com: fix build on MIPS] Link: http://lkml.kernel.org/r/20200717214810.3733082-1-guro@fb.com Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Shakeel Butt <shakeelb@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Naresh Kamboju <naresh.kamboju@linaro.org> Link: http://lkml.kernel.org/r/20200623174037.3951353-18-guro@fb.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07mm: memcg/slab: remove redundant check in memcg_accumulate_slabinfo()Roman Gushchin
memcg_accumulate_slabinfo() is never called with a non-root kmem_cache as a first argument, so the is_root_cache(s) check is redundant and can be removed without any functional change. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Shakeel Butt <shakeelb@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/20200623174037.3951353-17-guro@fb.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07mm: memcg/slab: deprecate slab_root_cachesRoman Gushchin
Currently there are two lists of kmem_caches: 1) slab_caches, which contains all kmem_caches, 2) slab_root_caches, which contains only root kmem_caches. And there is some preprocessor magic to have a single list if CONFIG_MEMCG_KMEM isn't enabled. It was required earlier because the number of non-root kmem_caches was proportional to the number of memory cgroups and could reach really big values. Now, when it cannot exceed the number of root kmem_caches, there is really no reason to maintain two lists. We never iterate over the slab_root_caches list on any hot paths, so it's perfectly fine to iterate over slab_caches and filter out non-root kmem_caches. It allows to remove a lot of config-dependent code and two pointers from the kmem_cache structure. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Shakeel Butt <shakeelb@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/20200623174037.3951353-16-guro@fb.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07mm: memcg/slab: remove memcg_kmem_get_cache()Roman Gushchin
The memcg_kmem_get_cache() function became really trivial, so let's just inline it into the single call point: memcg_slab_pre_alloc_hook(). It will make the code less bulky and can also help the compiler to generate a better code. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Shakeel Butt <shakeelb@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/20200623174037.3951353-15-guro@fb.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07mm: memcg/slab: simplify memcg cache creationRoman Gushchin
Because the number of non-root kmem_caches doesn't depend on the number of memory cgroups anymore and is generally not very big, there is no more need for a dedicated workqueue. Also, as there is no more need to pass any arguments to the memcg_create_kmem_cache() except the root kmem_cache, it's possible to just embed the work structure into the kmem_cache and avoid the dynamic allocation of the work structure. This will also simplify the synchronization: for each root kmem_cache there is only one work. So there will be no more concurrent attempts to create a non-root kmem_cache for a root kmem_cache: the second and all following attempts to queue the work will fail. On the kmem_cache destruction path there is no more need to call the expensive flush_workqueue() and wait for all pending works to be finished. Instead, cancel_work_sync() can be used to cancel/wait for only one work. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Shakeel Butt <shakeelb@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/20200623174037.3951353-14-guro@fb.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07mm: memcg/slab: use a single set of kmem_caches for all accounted allocationsRoman Gushchin
This is fairly big but mostly red patch, which makes all accounted slab allocations use a single set of kmem_caches instead of creating a separate set for each memory cgroup. Because the number of non-root kmem_caches is now capped by the number of root kmem_caches, there is no need to shrink or destroy them prematurely. They can be perfectly destroyed together with their root counterparts. This allows to dramatically simplify the management of non-root kmem_caches and delete a ton of code. This patch performs the following changes: 1) introduces memcg_params.memcg_cache pointer to represent the kmem_cache which will be used for all non-root allocations 2) reuses the existing memcg kmem_cache creation mechanism to create memcg kmem_cache on the first allocation attempt 3) memcg kmem_caches are named <kmemcache_name>-memcg, e.g. dentry-memcg 4) simplifies memcg_kmem_get_cache() to just return memcg kmem_cache or schedule it's creation and return the root cache 5) removes almost all non-root kmem_cache management code (separate refcounter, reparenting, shrinking, etc) 6) makes slab debugfs to display root_mem_cgroup css id and never show :dead and :deact flags in the memcg_slabinfo attribute. Following patches in the series will simplify the kmem_cache creation. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Shakeel Butt <shakeelb@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/20200623174037.3951353-13-guro@fb.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07mm: memcg/slab: move memcg_kmem_bypass() to memcontrol.hRoman Gushchin
To make the memcg_kmem_bypass() function available outside of the memcontrol.c, let's move it to memcontrol.h. The function is small and nicely fits into static inline sort of functions. It will be used from the slab code. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Shakeel Butt <shakeelb@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/20200623174037.3951353-12-guro@fb.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07mm: memcg/slab: deprecate memory.kmem.slabinfoRoman Gushchin
Deprecate memory.kmem.slabinfo. An empty file will be presented if corresponding config options are enabled. The interface is implementation dependent, isn't present in cgroup v2, and is generally useful only for core mm debugging purposes. In other words, it doesn't provide any value for the absolute majority of users. A drgn-based replacement can be found in tools/cgroup/memcg_slabinfo.py. It does support cgroup v1 and v2, mimics memory.kmem.slabinfo output and also allows to get any additional information without a need to recompile the kernel. If a drgn-based solution is too slow for a task, a bpf-based tracing tool can be used, which can easily keep track of all slab allocations belonging to a memory cgroup. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Lameter <cl@linux.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/20200623174037.3951353-11-guro@fb.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07mm: memcg/slab: charge individual slab objects instead of pagesRoman Gushchin
Switch to per-object accounting of non-root slab objects. Charging is performed using obj_cgroup API in the pre_alloc hook. Obj_cgroup is charged with the size of the object and the size of metadata: as now it's the size of an obj_cgroup pointer. If the amount of memory has been charged successfully, the actual allocation code is executed. Otherwise, -ENOMEM is returned. In the post_alloc hook if the actual allocation succeeded, corresponding vmstats are bumped and the obj_cgroup pointer is saved. Otherwise, the charge is canceled. On the free path obj_cgroup pointer is obtained and used to uncharge the size of the releasing object. Memcg and lruvec counters are now representing only memory used by active slab objects and do not include the free space. The free space is shared and doesn't belong to any specific cgroup. Global per-node slab vmstats are still modified from (un)charge_slab_page() functions. The idea is to keep all slab pages accounted as slab pages on system level. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Shakeel Butt <shakeelb@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/20200623174037.3951353-10-guro@fb.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07mm: memcg/slab: save obj_cgroup for non-root slab objectsRoman Gushchin
Store the obj_cgroup pointer in the corresponding place of page->obj_cgroups for each allocated non-root slab object. Make sure that each allocated object holds a reference to obj_cgroup. Objcg pointer is obtained from the memcg->objcg dereferencing in memcg_kmem_get_cache() and passed from pre_alloc_hook to post_alloc_hook. Then in case of successful allocation(s) it's getting stored in the page->obj_cgroups vector. The objcg obtaining part look a bit bulky now, but it will be simplified by next commits in the series. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Shakeel Butt <shakeelb@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/20200623174037.3951353-9-guro@fb.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07mm: memcg/slab: allocate obj_cgroups for non-root slab pagesRoman Gushchin
Allocate and release memory to store obj_cgroup pointers for each non-root slab page. Reuse page->mem_cgroup pointer to store a pointer to the allocated space. This commit temporarily increases the memory footprint of the kernel memory accounting. To store obj_cgroup pointers we'll need a place for an objcg_pointer for each allocated object. However, the following patches in the series will enable sharing of slab pages between memory cgroups, which will dramatically increase the total slab utilization. And the final memory footprint will be significantly smaller than before. To distinguish between obj_cgroups and memcg pointers in case when it's not obvious which one is used (as in page_cgroup_ino()), let's always set the lowest bit in the obj_cgroup case. The original obj_cgroups pointer is marked to be ignored by kmemleak, which otherwise would report a memory leak for each allocated vector. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Shakeel Butt <shakeelb@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/20200623174037.3951353-8-guro@fb.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07mm: memcg/slab: obj_cgroup APIRoman Gushchin
Obj_cgroup API provides an ability to account sub-page sized kernel objects, which potentially outlive the original memory cgroup. The top-level API consists of the following functions: bool obj_cgroup_tryget(struct obj_cgroup *objcg); void obj_cgroup_get(struct obj_cgroup *objcg); void obj_cgroup_put(struct obj_cgroup *objcg); int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size); void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size); struct mem_cgroup *obj_cgroup_memcg(struct obj_cgroup *objcg); struct obj_cgroup *get_obj_cgroup_from_current(void); Object cgroup is basically a pointer to a memory cgroup with a per-cpu reference counter. It substitutes a memory cgroup in places where it's necessary to charge a custom amount of bytes instead of pages. All charged memory rounded down to pages is charged to the corresponding memory cgroup using __memcg_kmem_charge(). It implements reparenting: on memcg offlining it's getting reattached to the parent memory cgroup. Each online memory cgroup has an associated active object cgroup to handle new allocations and the list of all attached object cgroups. On offlining of a cgroup this list is reparented and for each object cgroup in the list the memcg pointer is swapped to the parent memory cgroup. It prevents long-living objects from pinning the original memory cgroup in the memory. The implementation is based on byte-sized per-cpu stocks. A sub-page sized leftover is stored in an atomic field, which is a part of obj_cgroup object. So on cgroup offlining the leftover is automatically reparented. memcg->objcg is rcu protected. objcg->memcg is a raw pointer, which is always pointing at a memory cgroup, but can be atomically swapped to the parent memory cgroup. So a user must ensure the lifetime of the cgroup, e.g. grab rcu_read_lock or css_set_lock. Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Link: http://lkml.kernel.org/r/20200623174037.3951353-7-guro@fb.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07mm: memcontrol: decouple reference counting from page accountingJohannes Weiner
The reference counting of a memcg is currently coupled directly to how many 4k pages are charged to it. This doesn't work well with Roman's new slab controller, which maintains pools of objects and doesn't want to keep an extra balance sheet for the pages backing those objects. This unusual refcounting design (reference counts usually track pointers to an object) is only for historical reasons: memcg used to not take any css references and simply stalled offlining until all charges had been reparented and the page counters had dropped to zero. When we got rid of the reparenting requirement, the simple mechanical translation was to take a reference for every charge. More historical context can be found in commit e8ea14cc6ead ("mm: memcontrol: take a css reference for each charged page"), commit 64f219938941 ("mm: memcontrol: remove obsolete kmemcg pinning tricks") and commit b2052564e66d ("mm: memcontrol: continue cache reclaim from offlined groups"). The new slab controller exposes the limitations in this scheme, so let's switch it to a more idiomatic reference counting model based on actual kernel pointers to the memcg: - The per-cpu stock holds a reference to the memcg its caching - User pages hold a reference for their page->mem_cgroup. Transparent huge pages will no longer acquire tail references in advance, we'll get them if needed during the split. - Kernel pages hold a reference for their page->mem_cgroup - Pages allocated in the root cgroup will acquire and release css references for simplicity. css_get() and css_put() optimize that. - The current memcg_charge_slab() already hacked around the per-charge references; this change gets rid of that as well. - tcp accounting will handle reference in mem_cgroup_sk_{alloc,free} Roman: 1) Rebased on top of the current mm tree: added css_get() in mem_cgroup_charge(), dropped mem_cgroup_try_charge() part 2) I've reformatted commit references in the commit log to make checkpatch.pl happy. [hughd@google.com: remove css_put_many() from __mem_cgroup_clear_mc()] Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2007302011450.2347@eggly.anvils Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Christoph Lameter <cl@linux.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Link: http://lkml.kernel.org/r/20200623174037.3951353-6-guro@fb.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07mm: slub: implement SLUB version of obj_to_index()Roman Gushchin
This commit implements SLUB version of the obj_to_index() function, which will be required to calculate the offset of obj_cgroup in the obj_cgroups vector to store/obtain the objcg ownership data. To make it faster, let's repeat the SLAB's trick introduced by commit 6a2d7a955d8d ("SLAB: use a multiply instead of a divide in obj_to_index()") and avoid an expensive division. Vlastimil Babka noticed, that SLUB does have already a similar function called slab_index(), which is defined only if SLUB_DEBUG is enabled. The function does a similar math, but with a division, and it also takes a page address instead of a page pointer. Let's remove slab_index() and replace it with the new helper __obj_to_index(), which takes a page address. obj_to_index() will be a simple wrapper taking a page pointer and passing page_address(page) into __obj_to_index(). Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Lameter <cl@linux.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/20200623174037.3951353-5-guro@fb.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07mm: memcg: convert vmstat slab counters to bytesRoman Gushchin
In order to prepare for per-object slab memory accounting, convert NR_SLAB_RECLAIMABLE and NR_SLAB_UNRECLAIMABLE vmstat items to bytes. To make it obvious, rename them to NR_SLAB_RECLAIMABLE_B and NR_SLAB_UNRECLAIMABLE_B (similar to NR_KERNEL_STACK_KB). Internally global and per-node counters are stored in pages, however memcg and lruvec counters are stored in bytes. This scheme may look weird, but only for now. As soon as slab pages will be shared between multiple cgroups, global and node counters will reflect the total number of slab pages. However memcg and lruvec counters will be used for per-memcg slab memory tracking, which will take separate kernel objects in the account. Keeping global and node counters in pages helps to avoid additional overhead. The size of slab memory shouldn't exceed 4Gb on 32-bit machines, so it will fit into atomic_long_t we use for vmstats. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/20200623174037.3951353-4-guro@fb.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07mm: memcg: prepare for byte-sized vmstat itemsRoman Gushchin
To implement per-object slab memory accounting, we need to convert slab vmstat counters to bytes. Actually, out of 4 levels of counters: global, per-node, per-memcg and per-lruvec only two last levels will require byte-sized counters. It's because global and per-node counters will be counting the number of slab pages, and per-memcg and per-lruvec will be counting the amount of memory taken by charged slab objects. Converting all vmstat counters to bytes or even all slab counters to bytes would introduce an additional overhead. So instead let's store global and per-node counters in pages, and memcg and lruvec counters in bytes. To make the API clean all access helpers (both on the read and write sides) are dealing with bytes. To avoid back-and-forth conversions a new flavor of read-side helpers is introduced, which always returns values in pages: node_page_state_pages() and global_node_page_state_pages(). Actually new helpers are just reading raw values. Old helpers are simple wrappers, which will complain on an attempt to read byte value, because at the moment no one actually needs bytes. Thanks to Johannes Weiner for the idea of having the byte-sized API on top of the page-sized internal storage. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Lameter <cl@linux.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/20200623174037.3951353-3-guro@fb.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07mm: memcg: factor out memcg- and lruvec-level changes out of ↵Roman Gushchin
__mod_lruvec_state() Patch series "The new cgroup slab memory controller", v7. The patchset moves the accounting from the page level to the object level. It allows to share slab pages between memory cgroups. This leads to a significant win in the slab utilization (up to 45%) and the corresponding drop in the total kernel memory footprint. The reduced number of unmovable slab pages should also have a positive effect on the memory fragmentation. The patchset makes the slab accounting code simpler: there is no more need in the complicated dynamic creation and destruction of per-cgroup slab caches, all memory cgroups use a global set of shared slab caches. The lifetime of slab caches is not more connected to the lifetime of memory cgroups. The more precise accounting does require more CPU, however in practice the difference seems to be negligible. We've been using the new slab controller in Facebook production for several months with different workloads and haven't seen any noticeable regressions. What we've seen were memory savings in order of 1 GB per host (it varied heavily depending on the actual workload, size of RAM, number of CPUs, memory pressure, etc). The third version of the patchset added yet another step towards the simplification of the code: sharing of slab caches between accounted and non-accounted allocations. It comes with significant upsides (most noticeable, a complete elimination of dynamic slab caches creation) but not without some regression risks, so this change sits on top of the patchset and is not completely merged in. So in the unlikely event of a noticeable performance regression it can be reverted separately. The slab memory accounting works in exactly the same way for SLAB and SLUB. With both allocators the new controller shows significant memory savings, with SLUB the difference is bigger. On my 16-core desktop machine running Fedora 32 the size of the slab memory measured after the start of the system was lower by 58% and 38% with SLUB and SLAB correspondingly. As an estimation of a potential CPU overhead, below are results of slab_bulk_test01 test, kindly provided by Jesper D. Brouer. He also helped with the evaluation of results. The test can be found here: https://github.com/netoptimizer/prototype-kernel/ The smallest number in each row should be picked for a comparison. SLUB-patched - bulk-API - SLUB-patched : bulk_quick_reuse objects=1 : 187 - 90 - 224 cycles(tsc) - SLUB-patched : bulk_quick_reuse objects=2 : 110 - 53 - 133 cycles(tsc) - SLUB-patched : bulk_quick_reuse objects=3 : 88 - 95 - 42 cycles(tsc) - SLUB-patched : bulk_quick_reuse objects=4 : 91 - 85 - 36 cycles(tsc) - SLUB-patched : bulk_quick_reuse objects=8 : 32 - 66 - 32 cycles(tsc) SLUB-original - bulk-API - SLUB-original: bulk_quick_reuse objects=1 : 87 - 87 - 142 cycles(tsc) - SLUB-original: bulk_quick_reuse objects=2 : 52 - 53 - 53 cycles(tsc) - SLUB-original: bulk_quick_reuse objects=3 : 42 - 42 - 91 cycles(tsc) - SLUB-original: bulk_quick_reuse objects=4 : 91 - 37 - 37 cycles(tsc) - SLUB-original: bulk_quick_reuse objects=8 : 31 - 79 - 76 cycles(tsc) SLAB-patched - bulk-API - SLAB-patched : bulk_quick_reuse objects=1 : 67 - 67 - 140 cycles(tsc) - SLAB-patched : bulk_quick_reuse objects=2 : 55 - 46 - 46 cycles(tsc) - SLAB-patched : bulk_quick_reuse objects=3 : 93 - 94 - 39 cycles(tsc) - SLAB-patched : bulk_quick_reuse objects=4 : 35 - 88 - 85 cycles(tsc) - SLAB-patched : bulk_quick_reuse objects=8 : 30 - 30 - 30 cycles(tsc) SLAB-original- bulk-API - SLAB-original: bulk_quick_reuse objects=1 : 143 - 136 - 67 cycles(tsc) - SLAB-original: bulk_quick_reuse objects=2 : 45 - 46 - 46 cycles(tsc) - SLAB-original: bulk_quick_reuse objects=3 : 38 - 39 - 39 cycles(tsc) - SLAB-original: bulk_quick_reuse objects=4 : 35 - 87 - 87 cycles(tsc) - SLAB-original: bulk_quick_reuse objects=8 : 29 - 66 - 30 cycles(tsc) This patch (of 19): To convert memcg and lruvec slab counters to bytes there must be a way to change these counters without touching node counters. Factor out __mod_memcg_lruvec_state() out of __mod_lruvec_state(). Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Lameter <cl@linux.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/20200623174037.3951353-1-guro@fb.com Link: http://lkml.kernel.org/r/20200623174037.3951353-2-guro@fb.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07mm: kmem: make memcg_kmem_enabled() irreversibleRoman Gushchin
Historically the kernel memory accounting was an opt-in feature, which could be enabled for individual cgroups. But now it's not true, and it's on by default both on cgroup v1 and cgroup v2. And as long as a user has at least one non-root memory cgroup, the kernel memory accounting is on. So in most setups it's either always on (if memory cgroups are in use and kmem accounting is not disabled), either always off (otherwise). memcg_kmem_enabled() is used in many places to guard the kernel memory accounting code. If memcg_kmem_enabled() can reverse from returning true to returning false (as now), we can't rely on it on release paths and have to check if it was on before. If we'll make memcg_kmem_enabled() irreversible (always returning true after returning it for the first time), it'll make the general logic more simple and robust. It also will allow to guard some checks which otherwise would stay unguarded. Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org> Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Link: http://lkml.kernel.org/r/20200702180926.1330769-1-guro@fb.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>