summaryrefslogtreecommitdiff
path: root/mm
AgeCommit message (Collapse)Author
2018-10-26mm, slab: combine kmalloc_caches and kmalloc_dma_cachesVlastimil Babka
Patch series "kmalloc-reclaimable caches", v4. As discussed at LSF/MM [1] here's a patchset that introduces kmalloc-reclaimable caches (more details in the second patch) and uses them for dcache external names. That allows us to repurpose the NR_INDIRECTLY_RECLAIMABLE_BYTES counter later in the series. With patch 3/6, dcache external names are allocated from kmalloc-rcl-* caches, eliminating the need for manual accounting. More importantly, it also ensures the reclaimable kmalloc allocations are grouped in pages separate from the regular kmalloc allocations. The need for proper accounting of dcache external names has shown it's easy for misbehaving process to allocate lots of them, causing premature OOMs. Without the added grouping, it's likely that a similar workload can interleave the dcache external names allocations with regular kmalloc allocations (note: I haven't searched myself for an example of such regular kmalloc allocation, but I would be very surprised if there wasn't some). A pathological case would be e.g. one 64byte regular allocations with 63 external dcache names in a page (64x64=4096), which means the page is not freed even after reclaiming after all dcache names, and the process can thus "steal" the whole page with single 64byte allocation. If other kmalloc users similar to dcache external names become identified, they can also benefit from the new functionality simply by adding __GFP_RECLAIMABLE to the kmalloc calls. Side benefits of the patchset (that could be also merged separately) include removed branch for detecting __GFP_DMA kmalloc(), and shortening kmalloc cache names in /proc/slabinfo output. The latter is potentially an ABI break in case there are tools parsing the names and expecting the values to be in bytes. This is how /proc/slabinfo looks like after booting in virtme: ... kmalloc-rcl-4M 0 0 4194304 1 1024 : tunables 1 1 0 : slabdata 0 0 0 ... kmalloc-rcl-96 7 32 128 32 1 : tunables 120 60 8 : slabdata 1 1 0 kmalloc-rcl-64 25 128 64 64 1 : tunables 120 60 8 : slabdata 2 2 0 kmalloc-rcl-32 0 0 32 124 1 : tunables 120 60 8 : slabdata 0 0 0 kmalloc-4M 0 0 4194304 1 1024 : tunables 1 1 0 : slabdata 0 0 0 kmalloc-2M 0 0 2097152 1 512 : tunables 1 1 0 : slabdata 0 0 0 kmalloc-1M 0 0 1048576 1 256 : tunables 1 1 0 : slabdata 0 0 0 ... /proc/vmstat with renamed nr_indirectly_reclaimable_bytes counter: ... nr_slab_reclaimable 2817 nr_slab_unreclaimable 1781 ... nr_kernel_misc_reclaimable 0 ... /proc/meminfo with new KReclaimable counter: ... Shmem: 564 kB KReclaimable: 11260 kB Slab: 18368 kB SReclaimable: 11260 kB SUnreclaim: 7108 kB KernelStack: 1248 kB ... This patch (of 6): The kmalloc caches currently mainain separate (optional) array kmalloc_dma_caches for __GFP_DMA allocations. There are tests for __GFP_DMA in the allocation hotpaths. We can avoid the branches by combining kmalloc_caches and kmalloc_dma_caches into a single two-dimensional array where the outer dimension is cache "type". This will also allow to add kmalloc-reclaimable caches as a third type. Link: http://lkml.kernel.org/r/20180731090649.16028-2-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Christoph Lameter <cl@linux.com> Acked-by: Roman Gushchin <guro@fb.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Laura Abbott <labbott@redhat.com> Cc: Sumit Semwal <sumit.semwal@linaro.org> Cc: Vijayanand Jitta <vjitta@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26userfaultfd: allow get_mempolicy(MPOL_F_NODE|MPOL_F_ADDR) to trigger userfaultsAndrea Arcangeli
get_mempolicy(MPOL_F_NODE|MPOL_F_ADDR) called a get_user_pages that would not be waiting for userfaults before failing and it would hit on a SIGBUS instead. Using get_user_pages_locked/unlocked instead will allow get_mempolicy to allow userfaults to resolve the fault and fill the hole, before grabbing the node id of the page. If the user calls get_mempolicy() with MPOL_F_ADDR | MPOL_F_NODE for an address inside an area managed by uffd and there is no page at that address, the page allocation from within get_mempolicy() will fail because get_user_pages() does not allow for page fault retry required for uffd; the user will get SIGBUS. With this patch, the page fault will be resolved by the uffd and the get_mempolicy() will continue normally. Background: Via code review, previously the syscall would have returned -EFAULT (vm_fault_to_errno), now it will block and wait for an userfault (if it's waken before the fault is resolved it'll still -EFAULT). This way get_mempolicy will give a chance to an "unaware" app to be compliant with userfaults. The reason this visible change is that becoming "userfault compliant" cannot regress anything: all other syscalls including read(2)/write(2) had to become "userfault compliant" long time ago (that's one of the things userfaultfd can do that PROT_NONE and trapping segfaults can't). So this is just one more syscall that become "userfault compliant" like all other major ones already were. This has been happening on virtio-bridge dpdk process which just called get_mempolicy on the guest space post live migration, but before the memory had a chance to be migrated to destination. I didn't run an strace to be able to show the -EFAULT going away, but I've the confirmation of the below debug aid information (only visible with CONFIG_DEBUG_VM=y) going away with the patch: [20116.371461] FAULT_FLAG_ALLOW_RETRY missing 0 [20116.371464] CPU: 1 PID: 13381 Comm: vhost-events Not tainted 4.17.12-200.fc28.x86_64 #1 [20116.371465] Hardware name: LENOVO 20FAS2BN0A/20FAS2BN0A, BIOS N1CET54W (1.22 ) 02/10/2017 [20116.371466] Call Trace: [20116.371473] dump_stack+0x5c/0x80 [20116.371476] handle_userfault.cold.37+0x1b/0x22 [20116.371479] ? remove_wait_queue+0x20/0x60 [20116.371481] ? poll_freewait+0x45/0xa0 [20116.371483] ? do_sys_poll+0x31c/0x520 [20116.371485] ? radix_tree_lookup_slot+0x1e/0x50 [20116.371488] shmem_getpage_gfp+0xce7/0xe50 [20116.371491] ? page_add_file_rmap+0x1a/0x2c0 [20116.371493] shmem_fault+0x78/0x1e0 [20116.371495] ? filemap_map_pages+0x3a1/0x450 [20116.371498] __do_fault+0x1f/0xc0 [20116.371500] __handle_mm_fault+0xe2e/0x12f0 [20116.371502] handle_mm_fault+0xda/0x200 [20116.371504] __get_user_pages+0x238/0x790 [20116.371506] get_user_pages+0x3e/0x50 [20116.371510] kernel_get_mempolicy+0x40b/0x700 [20116.371512] ? vfs_write+0x170/0x1a0 [20116.371515] __x64_sys_get_mempolicy+0x21/0x30 [20116.371517] do_syscall_64+0x5b/0x160 [20116.371520] entry_SYSCALL_64_after_hwframe+0x44/0xa9 The above harmless debug message (not a kernel crash, just a dump_stack()) is shown with CONFIG_DEBUG_VM=y to more quickly identify and improve kernel spots that may have to become "userfaultfd compliant" like this one (without having to run an strace and search for syscall misbehavior). Spots like the above are more closer to a kernel bug for the non-cooperative usages that Mike focuses on, than for for dpdk qemu-cooperative usages that reproduced it, but it's still nicer to get this fixed for dpdk too. The part of the patch that caused me to think is only the implementation issue of mpol_get, but it looks like it should work safe no matter the kind of mempolicy structure that is (the default static policy also starts at 1 so it'll go to 2 and back to 1 without crashing everything at 0). [rppt@linux.vnet.ibm.com: changelog addition] http://lkml.kernel.org/r/20180904073718.GA26916@rapoport-lnx Link: http://lkml.kernel.org/r/20180831214848.23676-1-aarcange@redhat.com Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Reported-by: Maxime Coquelin <maxime.coquelin@redhat.com> Tested-by: Dr. David Alan Gilbert <dgilbert@redhat.com> Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26mm: convert insert_pfn() to vm_fault_tMatthew Wilcox
All callers convert its errno into a vm_fault_t, so convert it to return a vm_fault_t directly. Link: http://lkml.kernel.org/r/20180828145728.11873-11-willy@infradead.org Signed-off-by: Matthew Wilcox <willy@infradead.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Nicolas Pitre <nicolas.pitre@linaro.org> Cc: Souptick Joarder <jrdr.linux@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26mm: convert __vm_insert_mixed() to vm_fault_tMatthew Wilcox
Both of its callers currently convert its errno return into a vm_fault_t, so move the conversion into __vm_insert_mixed(). Link: http://lkml.kernel.org/r/20180828145728.11873-10-willy@infradead.org Signed-off-by: Matthew Wilcox <willy@infradead.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Nicolas Pitre <nicolas.pitre@linaro.org> Cc: Souptick Joarder <jrdr.linux@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26mm: inline vm_insert_pfn_prot() into callerMatthew Wilcox
vm_insert_pfn_prot() is only called from vmf_insert_pfn_prot(), so inline it and convert some of the errnos into vm_fault codes earlier. Link: http://lkml.kernel.org/r/20180828145728.11873-9-willy@infradead.org Signed-off-by: Matthew Wilcox <willy@infradead.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Nicolas Pitre <nicolas.pitre@linaro.org> Cc: Souptick Joarder <jrdr.linux@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26mm: remove vm_insert_pfn()Matthew Wilcox
All callers are now converted to vmf_insert_pfn() so convert vmf_insert_pfn() from being a compatibility wrapper around vm_insert_pfn() to being a compatibility wrapper around vmf_insert_pfn_prot(). Link: http://lkml.kernel.org/r/20180828145728.11873-8-willy@infradead.org Signed-off-by: Matthew Wilcox <willy@infradead.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Nicolas Pitre <nicolas.pitre@linaro.org> Cc: Souptick Joarder <jrdr.linux@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26mm: make vm_insert_pfn_prot() staticMatthew Wilcox
Now this is no longer used outside mm/memory.c, make it static. Link: http://lkml.kernel.org/r/20180828145728.11873-6-willy@infradead.org Signed-off-by: Matthew Wilcox <willy@infradead.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Nicolas Pitre <nicolas.pitre@linaro.org> Cc: Souptick Joarder <jrdr.linux@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26mm: introduce vmf_insert_pfn_prot()Matthew Wilcox
Like vm_insert_pfn_prot(), but returns a vm_fault_t instead of an errno. Also unexport vm_insert_pfn_prot as it has no modular users. Link: http://lkml.kernel.org/r/20180828145728.11873-4-willy@infradead.org Signed-off-by: Matthew Wilcox <willy@infradead.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Nicolas Pitre <nicolas.pitre@linaro.org> Cc: Souptick Joarder <jrdr.linux@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26mm: remove vm_insert_mixed()Matthew Wilcox
All callers are now converted to vmf_insert_mixed() so convert vmf_insert_mixed() from being a compatibility wrapper into the real function. Link: http://lkml.kernel.org/r/20180828145728.11873-3-willy@infradead.org Signed-off-by: Matthew Wilcox <willy@infradead.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Nicolas Pitre <nicolas.pitre@linaro.org> Cc: Souptick Joarder <jrdr.linux@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26mm: convert to use vm_fault_tSouptick Joarder
As part of vm_fault_t conversion filemap_page_mkwrite() for the NOMMU case was missed. Now converted. Link: http://lkml.kernel.org/r/20180828174952.GA29229@jordon-HP-15-Notebook-PC Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26mm/page_alloc.c: clean up check_for_memory()Oscar Salvador
check_for_memory() looks a bit confusing. First of all, we have this: if (N_MEMORY == N_NORMAL_MEMORY) return; Checking the ENUM declaration, looks like N_MEMORY canot be equal to N_NORMAL_MEMORY. I could not find where N_MEMORY is set to N_NORMAL_MEMORY, or the other way around either, so unless I am missing something, this condition will never evaluate to true. It makes sense to get rid of it. Moving forward, the operations within the loop look a bit confusing as well. We set N_HIGH_MEMORY unconditionally, and then we set N_NORMAL_MEMORY in case we have CONFIG_HIGHMEM (N_NORMAL_MEMORY != N_HIGH_MEMORY) and zone <= ZONE_NORMAL. (N_HIGH_MEMORY falls back to N_NORMAL_MEMORY on !CONFIG_HIGHMEM systems, and that is why we can just go ahead and set N_HIGH_MEMORY unconditionally) Although this works, it is a bit subtle. I think that this could be easier to follow: First, we should only set N_HIGH_MEMORY in case we have CONFIG_HIGHMEM. And then we should set N_NORMAL_MEMORY in case zone <= ZONE_NORMAL, without further checking whether we have CONFIG_HIGHMEM or not. Link: http://lkml.kernel.org/r/20180828210158.4617-1-osalvador@techadventures.net Signed-off-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Michael Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Pavel Tatashin <pavel.tatashin@microsoft.com Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26mm/swapfile.c: clear si->swap_map[] in swap_free_cluster()Huang Ying
si->swap_map[] of the swap entries in cluster needs to be cleared during freeing. Previously, this is done in the caller of swap_free_cluster(). This may cause code duplication (one user now, will add more users later) and lock/unlock cluster unnecessarily. In this patch, the clearing code is moved to swap_free_cluster() to avoid the downside. Link: http://lkml.kernel.org/r/20180827075535.17406-4-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Shaohua Li <shli@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26mm/swapfile.c: call free_swap_slot() in __swap_entry_free()Huang Ying
This is a code cleanup patch without functionality change. Originally, when __swap_entry_free() is called, and its return value is 0, free_swap_slot() will always be called to free the swap entry to the per-CPU pool. So move the call to free_swap_slot() to __swap_entry_free() to simplify the code. Link: http://lkml.kernel.org/r/20180827075535.17406-3-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Shaohua Li <shli@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26mm/swapfile.c: use __try_to_reclaim_swap() in free_swap_and_cache()Huang Ying
The code path to reclaim the swap entry in free_swap_and_cache() is almost same as that of __try_to_reclaim_swap(). The largest difference is just coding style. So the support to the additional requirement of free_swap_and_cache() is added into __try_to_reclaim_swap(). free_swap_and_cache() is changed to call __try_to_reclaim_swap(), and delete the duplicated code. This will improve code readability and reduce the potential bugs. There are 2 functionality differences between __try_to_reclaim_swap() and swap entry reclaim code of free_swap_and_cache(). - free_swap_and_cache() only reclaims the swap entry if the page is unmapped or swap is getting full. The support has been added into __try_to_reclaim_swap(). - try_to_free_swap() (called by __try_to_reclaim_swap()) checks pm_suspended_storage(), while free_swap_and_cache() not. I think this is OK. Because the page and the swap entry can be reclaimed later eventually. Link: http://lkml.kernel.org/r/20180827075535.17406-2-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Shaohua Li <shli@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26kmemleak: add module param to print warnings to dmesgVincent Whitchurch
Currently, kmemleak only prints the number of suspected leaks to dmesg but requires the user to read a debugfs file to get the actual stack traces of the objects' allocation points. Add a module option to print the full object information to dmesg too. It can be enabled with kmemleak.verbose=1 on the kernel command line, or "echo 1 > /sys/module/kmemleak/parameters/verbose": This allows easier integration of kmemleak into test systems: We have automated test infrastructure to test our Linux systems. With this option, running our tests with kmemleak is as simple as enabling kmemleak and passing this command line option; the test infrastructure knows how to save kernel logs, which will now include kmemleak reports. Without this option, the test infrastructure needs to be specifically taught to read out the kmemleak debugfs file. Removing this need for special handling makes kmemleak more similar to other kernel debug options (slab debugging, debug objects, etc). Link: http://lkml.kernel.org/r/20180903144046.21023-1-vincent.whitchurch@axis.com Signed-off-by: Vincent Whitchurch <vincent.whitchurch@axis.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26Revert "mm, mmu_notifier: annotate mmu notifiers with blockable invalidate ↵Michal Hocko
callbacks" Revert 5ff7091f5a2ca ("mm, mmu_notifier: annotate mmu notifiers with blockable invalidate callbacks"). MMU_INVALIDATE_DOES_NOT_BLOCK flags was the only one used and it is no longer needed since 93065ac753e4 ("mm, oom: distinguish blockable mode for mmu notifiers"). We now have a full support for per range !blocking behavior so we can drop the stop gap workaround which the per notifier flag was used for. Link: http://lkml.kernel.org/r/20180827112623.8992-4-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26mm,page_alloc: PF_WQ_WORKER threads must sleep at should_reclaim_retry()Michal Hocko
Tetsuo Handa has reported that it is possible to bypass the short sleep for PF_WQ_WORKER threads which was introduced by commit 373ccbe5927034b5 ("mm, vmstat: allow WQ concurrency to discover memory reclaim doesn't make any progress") and lock up the system if OOM. The primary reason is that WQ_MEM_RECLAIM WQs are not guaranteed to run even when they have a rescuer available. Those workers might be essential for reclaim to make a forward progress, however. If we are too unlucky all the allocations requests can get stuck waiting for a WQ_MEM_RECLAIM work item and the system is essentially stuck in an OOM condition without much hope to move on. Tetsuo has seen the reclaim stuck on drain_local_pages_wq or xlog_cil_push_work (xfs). There might be others. Since should_reclaim_retry() should be a natural reschedule point, let's do the short sleep for PF_WQ_WORKER threads unconditionally in order to guarantee that other pending work items are started. This will workaround this problem and it is less fragile than hunting down when the sleep is missed. Having a single sleeping point is more robust. [akpm@linux-foundation.org: reflow comment to 80 cols to save a couple of lines] Link: http://lkml.kernel.org/r/20180827135101.15700-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Debugged-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Roman Gushchin <guro@fb.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: David Rientjes <rientjes@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26mm: don't miss the last page because of round-off errorRoman Gushchin
I've noticed, that dying memory cgroups are often pinned in memory by a single pagecache page. Even under moderate memory pressure they sometimes stayed in such state for a long time. That looked strange. My investigation showed that the problem is caused by applying the LRU pressure balancing math: scan = div64_u64(scan * fraction[lru], denominator), where denominator = fraction[anon] + fraction[file] + 1. Because fraction[lru] is always less than denominator, if the initial scan size is 1, the result is always 0. This means the last page is not scanned and has no chances to be reclaimed. Fix this by rounding up the result of the division. In practice this change significantly improves the speed of dying cgroups reclaim. [guro@fb.com: prevent double calculation of DIV64_U64_ROUND_UP() arguments] Link: http://lkml.kernel.org/r/20180829213311.GA13501@castle Link: http://lkml.kernel.org/r/20180827162621.30187-3-guro@fb.com Signed-off-by: Roman Gushchin <guro@fb.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Rik van Riel <riel@surriel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26mm: drain memcg stocks on css offliningRoman Gushchin
Memcg charge is batched using per-cpu stocks, so an offline memcg can be pinned by a cached charge up to a moment, when a process belonging to some other cgroup will charge some memory on the same cpu. In other words, cached charges can prevent a memory cgroup from being reclaimed for some time, without any clear need. Let's optimize it by explicit draining of all stocks on css offlining. As draining is performed asynchronously, and is skipped if any parallel draining is happening, it's cheap. Link: http://lkml.kernel.org/r/20180827162621.30187-2-guro@fb.com Signed-off-by: Roman Gushchin <guro@fb.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Michal Hocko <mhocko@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26slub: extend slub debug to handle multiple slabsAaron Tomlin
Extend the slub_debug syntax to "slub_debug=<flags>[,<slub>]*", where <slub> may contain an asterisk at the end. For example, the following would poison all kmalloc slabs: slub_debug=P,kmalloc* and the following would apply the default flags to all kmalloc and all block IO slabs: slub_debug=,bio*,kmalloc* Please note that a similar patch was posted by Iliyan Malchev some time ago but was never merged: https://marc.info/?l=linux-mm&m=131283905330474&w=2 Link: http://lkml.kernel.org/r/20180928111139.27962-1-atomlin@redhat.com Signed-off-by: Aaron Tomlin <atomlin@redhat.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Iliyan Malchev <malchev@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26mm: don't warn about large allocations for slabDmitry Vyukov
Slub does not call kmalloc_slab() for sizes > KMALLOC_MAX_CACHE_SIZE, instead it falls back to kmalloc_large(). For slab KMALLOC_MAX_CACHE_SIZE == KMALLOC_MAX_SIZE and it calls kmalloc_slab() for all allocations relying on NULL return value for over-sized allocations. This inconsistency leads to unwanted warnings from kmalloc_slab() for over-sized allocations for slab. Returning NULL for failed allocations is the expected behavior. Make slub and slab code consistent by checking size > KMALLOC_MAX_CACHE_SIZE in slab before calling kmalloc_slab(). While we are here also fix the check in kmalloc_slab(). We should check against KMALLOC_MAX_CACHE_SIZE rather than KMALLOC_MAX_SIZE. It all kinda worked because for slab the constants are the same, and slub always checks the size against KMALLOC_MAX_CACHE_SIZE before kmalloc_slab(). But if we get there with size > KMALLOC_MAX_CACHE_SIZE anyhow bad things will happen. For example, in case of a newly introduced bug in slub code. Also move the check in kmalloc_slab() from function entry to the size > 192 case. This partially compensates for the additional check in slab code and makes slub code a bit faster (at least theoretically). Also drop __GFP_NOWARN in the warning check. This warning means a bug in slab code itself, user-passed flags have nothing to do with it. Nothing of this affects slob. Link: http://lkml.kernel.org/r/20180927171502.226522-1-dvyukov@gmail.com Signed-off-by: Dmitry Vyukov <dvyukov@google.com> Reported-by: syzbot+87829a10073277282ad1@syzkaller.appspotmail.com Reported-by: syzbot+ef4e8fc3a06e9019bb40@syzkaller.appspotmail.com Reported-by: syzbot+6e438f4036df52cbb863@syzkaller.appspotmail.com Reported-by: syzbot+8574471d8734457d98aa@syzkaller.appspotmail.com Reported-by: syzbot+af1504df0807a083dbd9@syzkaller.appspotmail.com Acked-by: Christoph Lameter <cl@linux.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26mm/slub.c: switch to bitmap_zalloc()Andy Shevchenko
Switch to bitmap_zalloc() to show clearly what we are allocating. Besides that it returns pointer of bitmap type instead of opaque void *. Link: http://lkml.kernel.org/r/20180830104301.61649-1-andriy.shevchenko@linux.intel.com Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Acked-by: Christoph Lameter <cl@linux.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: David Rientjes <rientjes@google.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26Merge tag 'driver-core-4.20-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core updates from Greg KH: "Here is a small number of driver core patches for 4.20-rc1. Not much happened here this merge window, only a very tiny number of patches that do: - add BUS_ATTR_WO() for use by drivers - component error path fixes - kernfs range check fix - other tiny error path fixes and const changes All of these have been in linux-next with no reported issues for a while" * tag 'driver-core-4.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: devres: provide devm_kstrdup_const() mm: move is_kernel_rodata() to asm-generic/sections.h devres: constify p in devm_kfree() driver core: add BUS_ATTR_WO() macro kernfs: Fix range checks in kernfs_get_target_path component: fix loop condition to call unbind() if bind() fails drivers/base/devtmpfs.c: don't pretend path is const in delete_path kernfs: update comment about kernfs_path() return value
2018-10-24mm/page_poison: expose page_poisoning_enabled to kernel modulesWei Wang
In some usages, e.g. virtio-balloon, a kernel module needs to know if page poisoning is in use. This patch exposes the page_poisoning_enabled function to kernel modules. Signed-off-by: Wei Wang <wei.w.wang@intel.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michael S. Tsirkin <mst@redhat.com> Acked-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2018-10-24Merge branch 'siginfo-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull siginfo updates from Eric Biederman: "I have been slowly sorting out siginfo and this is the culmination of that work. The primary result is in several ways the signal infrastructure has been made less error prone. The code has been updated so that manually specifying SEND_SIG_FORCED is never necessary. The conversion to the new siginfo sending functions is now complete, which makes it difficult to send a signal without filling in the proper siginfo fields. At the tail end of the patchset comes the optimization of decreasing the size of struct siginfo in the kernel from 128 bytes to about 48 bytes on 64bit. The fundamental observation that enables this is by definition none of the known ways to use struct siginfo uses the extra bytes. This comes at the cost of a small user space observable difference. For the rare case of siginfo being injected into the kernel only what can be copied into kernel_siginfo is delivered to the destination, the rest of the bytes are set to 0. For cases where the signal and the si_code are known this is safe, because we know those bytes are not used. For cases where the signal and si_code combination is unknown the bits that won't fit into struct kernel_siginfo are tested to verify they are zero, and the send fails if they are not. I made an extensive search through userspace code and I could not find anything that would break because of the above change. If it turns out I did break something it will take just the revert of a single change to restore kernel_siginfo to the same size as userspace siginfo. Testing did reveal dependencies on preferring the signo passed to sigqueueinfo over si->signo, so bit the bullet and added the complexity necessary to handle that case. Testing also revealed bad things can happen if a negative signal number is passed into the system calls. Something no sane application will do but something a malicious program or a fuzzer might do. So I have fixed the code that performs the bounds checks to ensure negative signal numbers are handled" * 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (80 commits) signal: Guard against negative signal numbers in copy_siginfo_from_user32 signal: Guard against negative signal numbers in copy_siginfo_from_user signal: In sigqueueinfo prefer sig not si_signo signal: Use a smaller struct siginfo in the kernel signal: Distinguish between kernel_siginfo and siginfo signal: Introduce copy_siginfo_from_user and use it's return value signal: Remove the need for __ARCH_SI_PREABLE_SIZE and SI_PAD_SIZE signal: Fail sigqueueinfo if si_signo != sig signal/sparc: Move EMT_TAGOVF into the generic siginfo.h signal/unicore32: Use force_sig_fault where appropriate signal/unicore32: Generate siginfo in ucs32_notify_die signal/unicore32: Use send_sig_fault where appropriate signal/arc: Use force_sig_fault where appropriate signal/arc: Push siginfo generation into unhandled_exception signal/ia64: Use force_sig_fault where appropriate signal/ia64: Use the force_sig(SIGSEGV,...) in ia64_rt_sigreturn signal/ia64: Use the generic force_sigsegv in setup_frame signal/arm/kvm: Use send_sig_mceerr signal/arm: Use send_sig_fault where appropriate signal/arm: Use force_sig_fault where appropriate ...
2018-10-24iov_iter: Separate type from direction and use accessor functionsDavid Howells
In the iov_iter struct, separate the iterator type from the iterator direction and use accessor functions to access them in most places. Convert a bunch of places to use switch-statements to access them rather then chains of bitwise-AND statements. This makes it easier to add further iterator types. Also, this can be more efficient as to implement a switch of small contiguous integers, the compiler can use ~50% fewer compare instructions than it has to use bitwise-and instructions. Further, cease passing the iterator type into the iterator setup function. The iterator function can set that itself. Only the direction is required. Signed-off-by: David Howells <dhowells@redhat.com>
2018-10-24iov_iter: Use accessor functionDavid Howells
Use accessor functions to access an iterator's type and direction. This allows for the possibility of using some other method of determining the type of iterator than if-chains with bitwise-AND conditions. Signed-off-by: David Howells <dhowells@redhat.com>
2018-10-23Merge branch 'x86-mm-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 mm updates from Ingo Molnar: "Lots of changes in this cycle: - Lots of CPA (change page attribute) optimizations and related cleanups (Thomas Gleixner, Peter Zijstra) - Make lazy TLB mode even lazier (Rik van Riel) - Fault handler cleanups and improvements (Dave Hansen) - kdump, vmcore: Enable kdumping encrypted memory with AMD SME enabled (Lianbo Jiang) - Clean up VM layout documentation (Baoquan He, Ingo Molnar) - ... plus misc other fixes and enhancements" * 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (51 commits) x86/stackprotector: Remove the call to boot_init_stack_canary() from cpu_startup_entry() x86/mm: Kill stray kernel fault handling comment x86/mm: Do not warn about PCI BIOS W+X mappings resource: Clean it up a bit resource: Fix find_next_iomem_res() iteration issue resource: Include resource end in walk_*() interfaces x86/kexec: Correct KEXEC_BACKUP_SRC_END off-by-one error x86/mm: Remove spurious fault pkey check x86/mm/vsyscall: Consider vsyscall page part of user address space x86/mm: Add vsyscall address helper x86/mm: Fix exception table comments x86/mm: Add clarifying comments for user addr space x86/mm: Break out user address space handling x86/mm: Break out kernel address space handling x86/mm: Clarify hardware vs. software "error_code" x86/mm/tlb: Make lazy TLB mode lazier x86/mm/tlb: Add freed_tables element to flush_tlb_info x86/mm/tlb: Add freed_tables argument to flush_tlb_mm_range smp,cpumask: introduce on_each_cpu_cond_mask smp: use __cpumask_set_cpu in on_each_cpu_cond ...
2018-10-23Merge branch 'locking-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking and misc x86 updates from Ingo Molnar: "Lots of changes in this cycle - in part because locking/core attracted a number of related x86 low level work which was easier to handle in a single tree: - Linux Kernel Memory Consistency Model updates (Alan Stern, Paul E. McKenney, Andrea Parri) - lockdep scalability improvements and micro-optimizations (Waiman Long) - rwsem improvements (Waiman Long) - spinlock micro-optimization (Matthew Wilcox) - qspinlocks: Provide a liveness guarantee (more fairness) on x86. (Peter Zijlstra) - Add support for relative references in jump tables on arm64, x86 and s390 to optimize jump labels (Ard Biesheuvel, Heiko Carstens) - Be a lot less permissive on weird (kernel address) uaccess faults on x86: BUG() when uaccess helpers fault on kernel addresses (Jann Horn) - macrofy x86 asm statements to un-confuse the GCC inliner. (Nadav Amit) - ... and a handful of other smaller changes as well" * 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (57 commits) locking/lockdep: Make global debug_locks* variables read-mostly locking/lockdep: Fix debug_locks off performance problem locking/pvqspinlock: Extend node size when pvqspinlock is configured locking/qspinlock_stat: Count instances of nested lock slowpaths locking/qspinlock, x86: Provide liveness guarantee x86/asm: 'Simplify' GEN_*_RMWcc() macros locking/qspinlock: Rework some comments locking/qspinlock: Re-order code locking/lockdep: Remove duplicated 'lock_class_ops' percpu array x86/defconfig: Enable CONFIG_USB_XHCI_HCD=y futex: Replace spin_is_locked() with lockdep locking/lockdep: Make class->ops a percpu counter and move it under CONFIG_DEBUG_LOCKDEP=y x86/jump-labels: Macrofy inline assembly code to work around GCC inlining bugs x86/cpufeature: Macrofy inline assembly code to work around GCC inlining bugs x86/extable: Macrofy inline assembly code to work around GCC inlining bugs x86/paravirt: Work around GCC inlining bugs when compiling paravirt ops x86/bug: Macrofy the BUG table section handling, to work around GCC inlining bugs x86/alternatives: Macrofy lock prefixes to work around GCC inlining bugs x86/refcount: Work around GCC inlining bug x86/objtool: Use asm macros to work around GCC inlining bugs ...
2018-10-22Merge tag 'for-4.20/block-20181021' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block layer updates from Jens Axboe: "This is the main pull request for block changes for 4.20. This contains: - Series enabling runtime PM for blk-mq (Bart). - Two pull requests from Christoph for NVMe, with items such as; - Better AEN tracking - Multipath improvements - RDMA fixes - Rework of FC for target removal - Fixes for issues identified by static checkers - Fabric cleanups, as prep for TCP transport - Various cleanups and bug fixes - Block merging cleanups (Christoph) - Conversion of drivers to generic DMA mapping API (Christoph) - Series fixing ref count issues with blkcg (Dennis) - Series improving BFQ heuristics (Paolo, et al) - Series improving heuristics for the Kyber IO scheduler (Omar) - Removal of dangerous bio_rewind_iter() API (Ming) - Apply single queue IPI redirection logic to blk-mq (Ming) - Set of fixes and improvements for bcache (Coly et al) - Series closing a hotplug race with sysfs group attributes (Hannes) - Set of patches for lightnvm: - pblk trace support (Hans) - SPDX license header update (Javier) - Tons of refactoring patches to cleanly abstract the 1.2 and 2.0 specs behind a common core interface. (Javier, Matias) - Enable pblk to use a common interface to retrieve chunk metadata (Matias) - Bug fixes (Various) - Set of fixes and updates to the blk IO latency target (Josef) - blk-mq queue number updates fixes (Jianchao) - Convert a bunch of drivers from the old legacy IO interface to blk-mq. This will conclude with the removal of the legacy IO interface itself in 4.21, with the rest of the drivers (me, Omar) - Removal of the DAC960 driver. The SCSI tree will introduce two replacement drivers for this (Hannes)" * tag 'for-4.20/block-20181021' of git://git.kernel.dk/linux-block: (204 commits) block: setup bounce bio_sets properly blkcg: reassociate bios when make_request() is called recursively blkcg: fix edge case for blk_get_rl() under memory pressure nvme-fabrics: move controller options matching to fabrics nvme-rdma: always have a valid trsvcid mtip32xx: fully switch to the generic DMA API rsxx: switch to the generic DMA API umem: switch to the generic DMA API sx8: switch to the generic DMA API sx8: remove dead IF_64BIT_DMA_IS_POSSIBLE code skd: switch to the generic DMA API ubd: remove use of blk_rq_map_sg nvme-pci: remove duplicate check drivers/block: Remove DAC960 driver nvme-pci: fix hot removal during error handling nvmet-fcloop: suppress a compiler warning nvme-core: make implicit seed truncation explicit nvmet-fc: fix kernel-doc headers nvme-fc: rework the request initialization code nvme-fc: introduce struct nvme_fcp_op_w_sgl ...
2018-10-22Merge tag 'arm64-upstream' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux Pull arm64 updates from Catalin Marinas: "Apart from some new arm64 features and clean-ups, this also contains the core mmu_gather changes for tracking the levels of the page table being cleared and a minor update to the generic compat_sys_sigaltstack() introducing COMPAT_SIGMINSKSZ. Summary: - Core mmu_gather changes which allow tracking the levels of page-table being cleared together with the arm64 low-level flushing routines - Support for the new ARMv8.5 PSTATE.SSBS bit which can be used to mitigate Spectre-v4 dynamically without trapping to EL3 firmware - Introduce COMPAT_SIGMINSTKSZ for use in compat_sys_sigaltstack - Optimise emulation of MRS instructions to ID_* registers on ARMv8.4 - Support for Common Not Private (CnP) translations allowing threads of the same CPU to share the TLB entries - Accelerated crc32 routines - Move swapper_pg_dir to the rodata section - Trap WFI instruction executed in user space - ARM erratum 1188874 workaround (arch_timer) - Miscellaneous fixes and clean-ups" * tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (78 commits) arm64: KVM: Guests can skip __install_bp_hardening_cb()s HYP work arm64: cpufeature: Trap CTR_EL0 access only where it is necessary arm64: cpufeature: Fix handling of CTR_EL0.IDC field arm64: cpufeature: ctr: Fix cpu capability check for late CPUs Documentation/arm64: HugeTLB page implementation arm64: mm: Use __pa_symbol() for set_swapper_pgd() arm64: Add silicon-errata.txt entry for ARM erratum 1188873 Revert "arm64: uaccess: implement unsafe accessors" arm64: mm: Drop the unused cpu parameter MAINTAINERS: fix bad sdei paths arm64: mm: Use #ifdef for the __PAGETABLE_P?D_FOLDED defines arm64: Fix typo in a comment in arch/arm64/mm/kasan_init.c arm64: xen: Use existing helper to check interrupt status arm64: Use daifflag_restore after bp_hardening arm64: daifflags: Use irqflags functions for daifflags arm64: arch_timer: avoid unused function warning arm64: Trap WFI executed in userspace arm64: docs: Document SSBS HWCAP arm64: docs: Fix typos in ELF hwcaps arm64/kprobes: remove an extra semicolon in arch_prepare_kprobe ...
2018-10-21radix tree: Remove multiorder supportMatthew Wilcox
All users have now been converted to the XArray. Removing the support reduces code size and ensures new users will use the XArray instead. Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-10-21page cache: Finish XArray conversionMatthew Wilcox
With no more radix tree API users left, we can drop the GFP flags and use xa_init() instead of INIT_RADIX_TREE(). Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-10-21shmem: Comment fixupsMatthew Wilcox
Remove the last mentions of radix tree from various comments. Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-10-21memfd: Convert memfd_tag_pins to XArrayMatthew Wilcox
Switch to a batch-processing model like memfd_wait_for_pins() and use the xa_state previously set up by memfd_wait_for_pins(). Signed-off-by: Matthew Wilcox <willy@infradead.org> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
2018-10-21memfd: Convert memfd_wait_for_pins to XArrayMatthew Wilcox
Simplify the locking by taking the spinlock while we walk the tree on the assumption that many acquires and releases of the lock will be worse than holding the lock while we process an entire batch of pages. Signed-off-by: Matthew Wilcox <willy@infradead.org> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
2018-10-21shmem: Convert shmem_partial_swap_usage to XArrayMatthew Wilcox
Simpler code because the xarray takes care of things like the limit and dereferencing the slot. Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-10-21shmem: Convert shmem_free_swap to XArrayMatthew Wilcox
Since we are conditionally storing NULL in the XArray, we do not need to allocate memory and the GFP flags will be unused. Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-10-21shmem: Convert shmem_alloc_hugepage to XArrayMatthew Wilcox
xa_find() is a slightly easier API to use than radix_tree_gang_lookup_slot() because it contains its own RCU locking. This commit removes the last user of radix_tree_gang_lookup_slot() so remove the function too. Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-10-21shmem: Convert shmem_add_to_page_cache to XArrayMatthew Wilcox
We can use xas_find_conflict() instead of radix_tree_gang_lookup_slot() to find any conflicting entry and combine the three paths through this function into one. Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-10-21shmem: Convert find_swap_entry to XArrayMatthew Wilcox
This is a 1:1 conversion. The major part of this patch is converting the test framework from userspace to kernel space and mirroring the algorithm now used in find_swap_entry(). Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-10-21shmem: Convert shmem_confirm_swap to XArrayMatthew Wilcox
xa_load has its own RCU locking, so we can eliminate it here. Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-10-21shmem: Convert shmem_radix_tree_replace to XArrayMatthew Wilcox
Rename shmem_radix_tree_replace() to shmem_replace_entry() and convert it to use the XArray API. Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-10-21pagevec: Use xa_mark_tMatthew Wilcox
Removes sparse warnings. Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-10-21mm: Convert is_page_cache_freeable to XArrayMatthew Wilcox
This is just a variable rename and comment change. Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-10-21mm: Convert khugepaged_scan_shmem to XArrayMatthew Wilcox
Slightly shorter and easier to read code. Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-10-21mm: Convert collapse_shmem to XArrayMatthew Wilcox
I found another victim of the radix tree being hard to use. Because there was no call to radix_tree_preload(), khugepaged was allocating radix_tree_nodes using GFP_ATOMIC. I also converted a local_irq_save()/restore() pair to disable()/enable(). Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-10-21mm: Convert huge_memory to XArrayMatthew Wilcox
Quite a straightforward conversion. Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-10-21mm: Convert page migration to XArrayMatthew Wilcox
Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-10-21mm: Convert __do_page_cache_readahead to XArrayMatthew Wilcox
This one is trivial. Signed-off-by: Matthew Wilcox <willy@infradead.org>