summaryrefslogtreecommitdiff
path: root/mm
AgeCommit message (Collapse)Author
2016-12-12mm, mempolicy: clean up __GFP_THISNODE confusion in policy_zonelistMichal Hocko
__GFP_THISNODE is documented to enforce the allocation to be satisified from the requested node with no fallbacks or placement policy enforcements. policy_zonelist seemingly breaks this semantic if the current policy is MPOL_MBIND and instead of taking the node it will fallback to the first node in the mask if the requested one is not in the mask. This is confusing to say the least because it fact we shouldn't ever go that path. First tasks shouldn't be scheduled on CPUs with nodes outside of their mempolicy binding. And secondly policy_zonelist is called only from 3 places: - huge_zonelist - never should do __GFP_THISNODE when going this path - alloc_pages_vma - which shouldn't depend on __GFP_THISNODE either - alloc_pages_current - which uses default_policy id __GFP_THISNODE is used So we shouldn't even need to care about this possibility and can drop the confusing code. Let's keep a WARN_ON_ONCE in place to catch potential users and fix them up properly (aka use a different allocation function which ignores mempolicy). [akpm@linux-foundation.org: coding-style fixes] Link: http://lkml.kernel.org/r/20161013125958.32155-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-12mm, thp: avoid unlikely branches for split_huge_pmdDavid Rientjes
While doing MADV_DONTNEED on a large area of thp memory, I noticed we encountered many unlikely() branches in profiles for each backing hugepage. This is because zap_pmd_range() would call split_huge_pmd(), which rechecked the conditions that were already validated, but as part of an unlikely() branch. Avoid the unlikely() branch when in a context where pmd is known to be good for __split_huge_pmd() directly. Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1610181600300.84525@chino.kir.corp.google.com Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-12mm/vmalloc.c: simplify /proc/vmallocinfo implementationzijun_hu
Many seq_file helpers exist for simplifying implementation of virtual files especially, for /proc nodes. however, the helpers for iteration over list_head are available but aren't adopted to implement /proc/vmallocinfo currently. Simplify /proc/vmallocinfo implementation by using existing seq_file helpers. Link: http://lkml.kernel.org/r/57FDF2E5.1000201@zoho.com Signed-off-by: zijun_hu <zijun_hu@htc.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-12mm: make unreserve highatomic functions reliableMinchan Kim
Currently, unreserve_highatomic_pageblock bails out if it found highatomic pageblock regardless of really moving free pages from the one so that it could mitigate unreserve logic's goal which saves OOM of a process. This patch makes unreserve functions bail out only if it moves some pages out of !highatomic free list to avoid such false positive. Another potential problem is that by race between page freeing and reserve highatomic function, pages could be in highatomic free list even though the pageblock is !high atomic migratetype. In that case, unreserve_highatomic_pageblock can be void if count of highatomic reserve is less than pageblock_nr_pages. We could solve it simply via draining all of reserved pages before the OOM. It would have a safeguard role to exhuast reserved pages before converging to OOM. Link: http://lkml.kernel.org/r/1476259429-18279-5-git-send-email-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Sangseok Lee <sangseok.lee@lge.com> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-12mm: try to exhaust highatomic reserve before the OOMMinchan Kim
I got OOM report from production team with v4.4 kernel. It had enough free memory but failed to allocate GFP_KERNEL order-0 page and finally encountered OOM kill. It occured during QA process which launches several apps, switching and so on. It happned rarely. IOW, In normal situation, it was not a problem but if we are unluck so that several apps uses peak memory at the same time, it can happen. If we manage to pass the phase, the system can go working well. I could reproduce it with my test(memory spike easily. Look at below. The reason is free pages(19M) of DMA32 zone are reserved for HIGHORDERATOMIC and doesn't unreserved before the OOM. balloon invoked oom-killer: gfp_mask=0x24280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), order=0, oom_score_adj=0 balloon cpuset=/ mems_allowed=0 CPU: 1 PID: 8473 Comm: balloon Tainted: G W OE 4.8.0-rc7-00219-g3f74c9559583-dirty #3161 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014 Call Trace: dump_stack+0x63/0x90 dump_header+0x5c/0x1ce oom_kill_process+0x22e/0x400 out_of_memory+0x1ac/0x210 __alloc_pages_nodemask+0x101e/0x1040 handle_mm_fault+0xa0a/0xbf0 __do_page_fault+0x1dd/0x4d0 trace_do_page_fault+0x43/0x130 do_async_page_fault+0x1a/0xa0 async_page_fault+0x28/0x30 Mem-Info: active_anon:383949 inactive_anon:106724 isolated_anon:0 active_file:15 inactive_file:44 isolated_file:0 unevictable:0 dirty:0 writeback:24 unstable:0 slab_reclaimable:2483 slab_unreclaimable:3326 mapped:0 shmem:0 pagetables:1906 bounce:0 free:6898 free_pcp:291 free_cma:0 Node 0 active_anon:1535796kB inactive_anon:426896kB active_file:60kB inactive_file:176kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:0kB dirty:0kB writeback:96kB shmem:0kB writeback_tmp:0kB unstable:0kB pages_scanned:1418 all_unreclaimable? no DMA free:8188kB min:44kB low:56kB high:68kB active_anon:7648kB inactive_anon:0kB active_file:0kB inactive_file:4kB unevictable:0kB writepending:0kB present:15992kB managed:15908kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:20kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB lowmem_reserve[]: 0 1952 1952 1952 DMA32 free:19404kB min:5628kB low:7624kB high:9620kB active_anon:1528148kB inactive_anon:426896kB active_file:60kB inactive_file:420kB unevictable:0kB writepending:96kB present:2080640kB managed:2030092kB mlocked:0kB slab_reclaimable:9932kB slab_unreclaimable:13284kB kernel_stack:2496kB pagetables:7624kB bounce:0kB free_pcp:900kB local_pcp:112kB free_cma:0kB lowmem_reserve[]: 0 0 0 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 2*4096kB (H) = 8192kB DMA32: 7*4kB (H) 8*8kB (H) 30*16kB (H) 31*32kB (H) 14*64kB (H) 9*128kB (H) 2*256kB (H) 2*512kB (H) 4*1024kB (H) 5*2048kB (H) 0*4096kB = 19484kB 51131 total pagecache pages 50795 pages in swap cache Swap cache stats: add 3532405601, delete 3532354806, find 124289150/1822712228 Free swap = 8kB Total swap = 255996kB 524158 pages RAM 0 pages HighMem/MovableOnly 12658 pages reserved 0 pages cma reserved 0 pages hwpoisoned Another example exceeded the limit by the race is in:imklog: page allocation failure: order:0, mode:0x2280020(GFP_ATOMIC|__GFP_NOTRACK) CPU: 0 PID: 476 Comm: in:imklog Tainted: G E 4.8.0-rc7-00217-g266ef83c51e5-dirty #3135 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014 Call Trace: dump_stack+0x63/0x90 warn_alloc_failed+0xdb/0x130 __alloc_pages_nodemask+0x4d6/0xdb0 new_slab+0x339/0x490 ___slab_alloc.constprop.74+0x367/0x480 __slab_alloc.constprop.73+0x20/0x40 __kmalloc+0x1a4/0x1e0 alloc_indirect.isra.14+0x1d/0x50 virtqueue_add_sgs+0x1c4/0x470 __virtblk_add_req+0xae/0x1f0 virtio_queue_rq+0x12d/0x290 __blk_mq_run_hw_queue+0x239/0x370 blk_mq_run_hw_queue+0x8f/0xb0 blk_mq_insert_requests+0x18c/0x1a0 blk_mq_flush_plug_list+0x125/0x140 blk_flush_plug_list+0xc7/0x220 blk_finish_plug+0x2c/0x40 __do_page_cache_readahead+0x196/0x230 filemap_fault+0x448/0x4f0 ext4_filemap_fault+0x36/0x50 __do_fault+0x75/0x140 handle_mm_fault+0x84d/0xbe0 __do_page_fault+0x1dd/0x4d0 trace_do_page_fault+0x43/0x130 do_async_page_fault+0x1a/0xa0 async_page_fault+0x28/0x30 Mem-Info: active_anon:363826 inactive_anon:121283 isolated_anon:32 active_file:65 inactive_file:152 isolated_file:0 unevictable:0 dirty:0 writeback:46 unstable:0 slab_reclaimable:2778 slab_unreclaimable:3070 mapped:112 shmem:0 pagetables:1822 bounce:0 free:9469 free_pcp:231 free_cma:0 Node 0 active_anon:1455304kB inactive_anon:485132kB active_file:260kB inactive_file:608kB unevictable:0kB isolated(anon):128kB isolated(file):0kB mapped:448kB dirty:0kB writeback:184kB shmem:0kB writeback_tmp:0kB unstable:0kB pages_scanned:13641 all_unreclaimable? no DMA free:7748kB min:44kB low:56kB high:68kB active_anon:7944kB inactive_anon:104kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15908kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:108kB kernel_stack:0kB pagetables:4kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB lowmem_reserve[]: 0 1952 1952 1952 DMA32 free:30128kB min:5628kB low:7624kB high:9620kB active_anon:1447360kB inactive_anon:485028kB active_file:260kB inactive_file:608kB unevictable:0kB writepending:184kB present:2080640kB managed:2030132kB mlocked:0kB slab_reclaimable:11112kB slab_unreclaimable:12172kB kernel_stack:2400kB pagetables:7284kB bounce:0kB free_pcp:924kB local_pcp:72kB free_cma:0kB lowmem_reserve[]: 0 0 0 0 DMA: 7*4kB (UE) 3*8kB (UH) 1*16kB (M) 0*32kB 2*64kB (U) 1*128kB (M) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (U) 1*4096kB (H) = 7748kB DMA32: 10*4kB (H) 3*8kB (H) 47*16kB (H) 38*32kB (H) 5*64kB (H) 1*128kB (H) 2*256kB (H) 3*512kB (H) 3*1024kB (H) 3*2048kB (H) 4*4096kB (H) = 30128kB 2775 total pagecache pages 2536 pages in swap cache Swap cache stats: add 206786828, delete 206784292, find 7323106/106686077 Free swap = 108744kB Total swap = 255996kB 524158 pages RAM 0 pages HighMem/MovableOnly 12648 pages reserved 0 pages cma reserved 0 pages hwpoisoned It's weird to show that zone has enough free memory above min watermark but OOMed with 4K GFP_KERNEL allocation due to reserved highatomic pages. As last resort, try to unreserve highatomic pages again and if it has moved pages to non-highatmoc free list, retry reclaim once more. Link: http://lkml.kernel.org/r/1476259429-18279-4-git-send-email-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Sangseok Lee <sangseok.lee@lge.com> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-12mm: prevent double decrease of nr_reserved_highatomicMinchan Kim
There is race between page freeing and unreserved highatomic. CPU 0 CPU 1 free_hot_cold_page mt = get_pfnblock_migratetype set_pcppage_migratetype(page, mt) unreserve_highatomic_pageblock spin_lock_irqsave(&zone->lock) move_freepages_block set_pageblock_migratetype(page) spin_unlock_irqrestore(&zone->lock) free_pcppages_bulk __free_one_page(mt) <- mt is stale By above race, a page on CPU 0 could go non-highorderatomic free list since the pageblock's type is changed. By that, unreserve logic of highorderatomic can decrease reserved count on a same pageblock severak times and then it will make mismatch between nr_reserved_highatomic and the number of reserved pageblock. So, this patch verifies whether the pageblock is highatomic or not and decrease the count only if the pageblock is highatomic. Link: http://lkml.kernel.org/r/1476259429-18279-3-git-send-email-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Sangseok Lee <sangseok.lee@lge.com> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-12mm: don't steal highatomic pageblockMinchan Kim
Patch series "use up highorder free pages before OOM", v3. I got OOM report from production team with v4.4 kernel. It had enough free memory but failed to allocate GFP_KERNEL order-0 page and finally encountered OOM kill. It occured during QA process which launches several apps, switching and so on. It happned rarely. IOW, In normal situation, it was not a problem but if we are unluck so that several apps uses peak memory at the same time, it can happen. If we manage to pass the phase, the system can go working well. I could reproduce it with my test(memory spike easily. Look at below. The reason is free pages(19M) of DMA32 zone are reserved for HIGHORDERATOMIC and doesn't unreserved before the OOM. balloon invoked oom-killer: gfp_mask=0x24280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), order=0, oom_score_adj=0 balloon cpuset=/ mems_allowed=0 CPU: 1 PID: 8473 Comm: balloon Tainted: G W OE 4.8.0-rc7-00219-g3f74c9559583-dirty #3161 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014 Call Trace: dump_stack+0x63/0x90 dump_header+0x5c/0x1ce oom_kill_process+0x22e/0x400 out_of_memory+0x1ac/0x210 __alloc_pages_nodemask+0x101e/0x1040 handle_mm_fault+0xa0a/0xbf0 __do_page_fault+0x1dd/0x4d0 trace_do_page_fault+0x43/0x130 do_async_page_fault+0x1a/0xa0 async_page_fault+0x28/0x30 Mem-Info: active_anon:383949 inactive_anon:106724 isolated_anon:0 active_file:15 inactive_file:44 isolated_file:0 unevictable:0 dirty:0 writeback:24 unstable:0 slab_reclaimable:2483 slab_unreclaimable:3326 mapped:0 shmem:0 pagetables:1906 bounce:0 free:6898 free_pcp:291 free_cma:0 Node 0 active_anon:1535796kB inactive_anon:426896kB active_file:60kB inactive_file:176kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:0kB dirty:0kB writeback:96kB shmem:0kB writeback_tmp:0kB unstable:0kB pages_scanned:1418 all_unreclaimable? no DMA free:8188kB min:44kB low:56kB high:68kB active_anon:7648kB inactive_anon:0kB active_file:0kB inactive_file:4kB unevictable:0kB writepending:0kB present:15992kB managed:15908kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:20kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB lowmem_reserve[]: 0 1952 1952 1952 DMA32 free:19404kB min:5628kB low:7624kB high:9620kB active_anon:1528148kB inactive_anon:426896kB active_file:60kB inactive_file:420kB unevictable:0kB writepending:96kB present:2080640kB managed:2030092kB mlocked:0kB slab_reclaimable:9932kB slab_unreclaimable:13284kB kernel_stack:2496kB pagetables:7624kB bounce:0kB free_pcp:900kB local_pcp:112kB free_cma:0kB lowmem_reserve[]: 0 0 0 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 2*4096kB (H) = 8192kB DMA32: 7*4kB (H) 8*8kB (H) 30*16kB (H) 31*32kB (H) 14*64kB (H) 9*128kB (H) 2*256kB (H) 2*512kB (H) 4*1024kB (H) 5*2048kB (H) 0*4096kB = 19484kB 51131 total pagecache pages 50795 pages in swap cache Swap cache stats: add 3532405601, delete 3532354806, find 124289150/1822712228 Free swap = 8kB Total swap = 255996kB 524158 pages RAM 0 pages HighMem/MovableOnly 12658 pages reserved 0 pages cma reserved 0 pages hwpoisoned Another example exceeded the limit by the race is in:imklog: page allocation failure: order:0, mode:0x2280020(GFP_ATOMIC|__GFP_NOTRACK) CPU: 0 PID: 476 Comm: in:imklog Tainted: G E 4.8.0-rc7-00217-g266ef83c51e5-dirty #3135 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014 Call Trace: dump_stack+0x63/0x90 warn_alloc_failed+0xdb/0x130 __alloc_pages_nodemask+0x4d6/0xdb0 new_slab+0x339/0x490 ___slab_alloc.constprop.74+0x367/0x480 __slab_alloc.constprop.73+0x20/0x40 __kmalloc+0x1a4/0x1e0 alloc_indirect.isra.14+0x1d/0x50 virtqueue_add_sgs+0x1c4/0x470 __virtblk_add_req+0xae/0x1f0 virtio_queue_rq+0x12d/0x290 __blk_mq_run_hw_queue+0x239/0x370 blk_mq_run_hw_queue+0x8f/0xb0 blk_mq_insert_requests+0x18c/0x1a0 blk_mq_flush_plug_list+0x125/0x140 blk_flush_plug_list+0xc7/0x220 blk_finish_plug+0x2c/0x40 __do_page_cache_readahead+0x196/0x230 filemap_fault+0x448/0x4f0 ext4_filemap_fault+0x36/0x50 __do_fault+0x75/0x140 handle_mm_fault+0x84d/0xbe0 __do_page_fault+0x1dd/0x4d0 trace_do_page_fault+0x43/0x130 do_async_page_fault+0x1a/0xa0 async_page_fault+0x28/0x30 Mem-Info: active_anon:363826 inactive_anon:121283 isolated_anon:32 active_file:65 inactive_file:152 isolated_file:0 unevictable:0 dirty:0 writeback:46 unstable:0 slab_reclaimable:2778 slab_unreclaimable:3070 mapped:112 shmem:0 pagetables:1822 bounce:0 free:9469 free_pcp:231 free_cma:0 Node 0 active_anon:1455304kB inactive_anon:485132kB active_file:260kB inactive_file:608kB unevictable:0kB isolated(anon):128kB isolated(file):0kB mapped:448kB dirty:0kB writeback:184kB shmem:0kB writeback_tmp:0kB unstable:0kB pages_scanned:13641 all_unreclaimable? no DMA free:7748kB min:44kB low:56kB high:68kB active_anon:7944kB inactive_anon:104kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15908kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:108kB kernel_stack:0kB pagetables:4kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB lowmem_reserve[]: 0 1952 1952 1952 DMA32 free:30128kB min:5628kB low:7624kB high:9620kB active_anon:1447360kB inactive_anon:485028kB active_file:260kB inactive_file:608kB unevictable:0kB writepending:184kB present:2080640kB managed:2030132kB mlocked:0kB slab_reclaimable:11112kB slab_unreclaimable:12172kB kernel_stack:2400kB pagetables:7284kB bounce:0kB free_pcp:924kB local_pcp:72kB free_cma:0kB lowmem_reserve[]: 0 0 0 0 DMA: 7*4kB (UE) 3*8kB (UH) 1*16kB (M) 0*32kB 2*64kB (U) 1*128kB (M) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (U) 1*4096kB (H) = 7748kB DMA32: 10*4kB (H) 3*8kB (H) 47*16kB (H) 38*32kB (H) 5*64kB (H) 1*128kB (H) 2*256kB (H) 3*512kB (H) 3*1024kB (H) 3*2048kB (H) 4*4096kB (H) = 30128kB 2775 total pagecache pages 2536 pages in swap cache Swap cache stats: add 206786828, delete 206784292, find 7323106/106686077 Free swap = 108744kB Total swap = 255996kB 524158 pages RAM 0 pages HighMem/MovableOnly 12648 pages reserved 0 pages cma reserved 0 pages hwpoisoned During the investigation, I found some problems with highatomic so this patch aims to solve the problems and the final goal is to unreserve every highatomic free pages before the OOM kill. This patch (of 4): In page freeing path, migratetype is racy so that a highorderatomic page could free into non-highorderatomic free list. If that page is allocated, VM can change the pageblock from higorderatomic to something. In that case, highatomic pageblock accounting is broken so it doesn't work(e.g., VM cannot reserve highorderatomic pageblocks any more although it doesn't reach 1% limit). So, this patch prohibits the changing from highatomic to other type. It's no problem because MIGRATE_HIGHATOMIC is not listed in fallback array so stealing will only happen due to unexpected races which is really rare. Also, such prohibiting keeps highatomic pageblock more longer so it would be better for highorderatomic page allocation. Link: http://lkml.kernel.org/r/1476259429-18279-2-git-send-email-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Sangseok Lee <sangseok.lee@lge.com> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-12kmemleak: fix reference to DocumentationAndreas Platschek
Documentation/kmemleak.txt was moved to Documentation/dev-tools/kmemleak.rst, this fixes the reference to the new location. Link: http://lkml.kernel.org/r/1476544946-18804-1-git-send-email-andreas.platschek@opentech.at Signed-off-by: Andreas Platschek <andreas.platschek@opentech.at> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-12mm/hugetlb.c: use huge_pte_lock instead of opencoding the lockAneesh Kumar K.V
No functional change by this patch. Link: http://lkml.kernel.org/r/20161018090234.22574-1-aneesh.kumar@linux.vnet.ibm.com Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-12mm/hugetlb.c: use the right pte val for compare in hugetlb_cowAneesh Kumar K.V
We cannot use the pte value used in set_pte_at for pte_same comparison, because archs like ppc64, filter/add new pte flag in set_pte_at. Instead fetch the pte value inside hugetlb_cow. We are comparing pte value to make sure the pte didn't change since we dropped the page table lock. hugetlb_cow get called with page table lock held, and we can take a copy of the pte value before we drop the page table lock. With hugetlbfs, we optimize the MAP_PRIVATE write fault path with no previous mapping (huge_pte_none entries), by forcing a cow in the fault path. This avoid take an addition fault to covert a read-only mapping to read/write. Here we were comparing a recently instantiated pte (via set_pte_at) to the pte values from linux page table. As explained above on ppc64 such pte_same check returned wrong result, resulting in us taking an additional fault on ppc64. Fixes: 6a119eae942c ("powerpc/mm: Add a _PAGE_PTE bit") Link: http://lkml.kernel.org/r/20161018154245.18023-1-aneesh.kumar@linux.vnet.ibm.com Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Reported-by: Jan Stancek <jstancek@redhat.com> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Scott Wood <scottwood@freescale.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-12mm/gup.c: make unnecessarily global vma_permits_fault() staticTobias Klauser
Make vma_permits_fault() static as it is only used in mm/gup.c This fixes a sparse warning. Link: http://lkml.kernel.org/r/20161017122353.31598-1-tklauser@distanz.ch Signed-off-by: Tobias Klauser <tklauser@distanz.ch> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-12mm/vmscan.c: set correct defer count for shrinkerShaohua Li
Our system uses significantly more slab memory with memcg enabled with the latest kernel. With 3.10 kernel, slab uses 2G memory, while with 4.6 kernel, 6G memory is used. The shrinker has problem. Let's see we have two memcg for one shrinker. In do_shrink_slab: 1. Check cg1. nr_deferred = 0, assume total_scan = 700. batch size is 1024, then no memory is freed. nr_deferred = 700 2. Check cg2. nr_deferred = 700. Assume freeable = 20, then total_scan = 10 or 40. Let's assume it's 10. No memory is freed. nr_deferred = 10. The deferred share of cg1 is lost in this case. kswapd will free no memory even run above steps again and again. The fix makes sure one memcg's deferred share isn't lost. Link: http://lkml.kernel.org/r/2414be961b5d25892060315fbb56bb19d81d0c07.1476227351.git.shli@fb.com Signed-off-by: Shaohua Li <shli@fb.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov@parallels.com> Cc: <stable@vger.kernel.org> [4.0+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-12mm/mprotect.c: don't touch single threaded PTEs which are on the right nodeAndi Kleen
We had some problems with pages getting unmapped in single threaded affinitized processes. It was tracked down to NUMA scanning. In this case it doesn't make any sense to unmap pages if the process is single threaded and the page is already on the node the process is running on. Add a check for this case into the numa protection code, and skip unmapping if true. In theory the process could be migrated later, but we will eventually rescan and unmap and migrate then. In theory this could be made more fancy: remembering this state per process or even whole mm. However that would need extra tracking and be more complicated, and the simple check seems to work fine so far. [ak@linux.intel.com: v3: Minor updates from Mel. Change code layout] Link: http://lkml.kernel.org/r/1476382117-5440-1-git-send-email-andi@firstfloor.org Link: http://lkml.kernel.org/r/1476288949-20970-1-git-send-email-andi@firstfloor.org Signed-off-by: Andi Kleen <ak@linux.intel.com> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-12mm, slab: maintain total slab count instead of active countDavid Rientjes
Rather than tracking the number of active slabs for each node, track the total number of slabs. This is a minor improvement that avoids active slab tracking when a slab goes from free to partial or partial to free. For slab debugging, this also removes an explicit free count since it can easily be inferred by the difference in number of total objects and number of active objects. Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1612042020110.115755@chino.kir.corp.google.com Signed-off-by: David Rientjes <rientjes@google.com> Suggested-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Greg Thelen <gthelen@google.com> Cc: Aruna Ramakrishna <aruna.ramakrishna@oracle.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-12mm, slab: faster active and free statsGreg Thelen
Reading /proc/slabinfo or monitoring slabtop(1) can become very expensive if there are many slab caches and if there are very lengthy per-node partial and/or free lists. Commit 07a63c41fa1f ("mm/slab: improve performance of gathering slabinfo stats") addressed the per-node full lists which showed a significant improvement when no objects were freed. This patch has the same motivation and optimizes the remainder of the usecases where there are very lengthy partial and free lists. This patch maintains per-node active_slabs (full and partial) and free_slabs rather than iterating the lists at runtime when reading /proc/slabinfo. When allocating 100GB of slab from a test cache where every slab page is on the partial list, reading /proc/slabinfo (includes all other slab caches on the system) takes ~247ms on average with 48 samples. As a result of this patch, the same read takes ~0.856ms on average. [rientjes@google.com: changelog] Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1611081505240.13403@chino.kir.corp.google.com Signed-off-by: Greg Thelen <gthelen@google.com> Signed-off-by: David Rientjes <rientjes@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-12mm/slab_common.c: check kmem_create_cache flags are commonThomas Garnier
Verify that kmem_create_cache flags are not allocator specific. It is done before removing flags that are not available with the current configuration. The current kmem_cache_create removes incorrect flags but do not validate the callers are using them right. This change will ensure that callers are not trying to create caches with flags that won't be used because allocator specific. Link: http://lkml.kernel.org/r/1478553075-120242-2-git-send-email-thgarnie@google.com Signed-off-by: Thomas Garnier <thgarnie@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-12slub: avoid false-postive warningArnd Bergmann
The slub allocator gives us some incorrect warnings when CONFIG_PROFILE_ANNOTATED_BRANCHES is set, as the unlikely() macro prevents it from seeing that the return code matches what it was before: mm/slub.c: In function `kmem_cache_free_bulk': mm/slub.c:262:23: error: `df.s' may be used uninitialized in this function [-Werror=maybe-uninitialized] mm/slub.c:2943:3: error: `df.cnt' may be used uninitialized in this function [-Werror=maybe-uninitialized] mm/slub.c:2933:4470: error: `df.freelist' may be used uninitialized in this function [-Werror=maybe-uninitialized] mm/slub.c:2943:3: error: `df.tail' may be used uninitialized in this function [-Werror=maybe-uninitialized] I have not been able to come up with a perfect way for dealing with this, the three options I see are: - add a bogus initialization, which would increase the runtime overhead - replace unlikely() with unlikely_notrace() - remove the unlikely() annotation completely I checked the object code for a typical x86 configuration and the last two cases produce the same result, so I went for the last one, which is the simplest. Link: http://lkml.kernel.org/r/20161024155704.3114445-1-arnd@arndb.de Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Laura Abbott <labbott@fedoraproject.org> Cc: Alexander Potapenko <glider@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-12slub: move synchronize_sched out of slab_mutex on shrinkVladimir Davydov
synchronize_sched() is a heavy operation and calling it per each cache owned by a memory cgroup being destroyed may take quite some time. What is worse, it's currently called under the slab_mutex, stalling all works doing cache creation/destruction. Actually, there isn't much point in calling synchronize_sched() for each cache - it's enough to call it just once - after setting cpu_partial for all caches and before shrinking them. This way, we can also move it out of the slab_mutex, which we have to hold for iterating over the slab cache list. Link: https://bugzilla.kernel.org/show_bug.cgi?id=172991 Link: http://lkml.kernel.org/r/0a10d71ecae3db00fb4421bcd3f82bcc911f4be4.1475329751.git.vdavydov.dev@gmail.com Signed-off-by: Vladimir Davydov <vdavydov.dev@gmail.com> Reported-by: Doug Smythies <dsmythies@telus.net> Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-12mm: memcontrol: use special workqueue for creating per-memcg cachesVladimir Davydov
Creating a lot of cgroups at the same time might stall all worker threads with kmem cache creation works, because kmem cache creation is done with the slab_mutex held. The problem was amplified by commits 801faf0db894 ("mm/slab: lockless decision to grow cache") in case of SLAB and 81ae6d03952c ("mm/slub.c: replace kick_all_cpus_sync() with synchronize_sched() in kmem_cache_shrink()") in case of SLUB, which increased the maximal time the slab_mutex can be held. To prevent that from happening, let's use a special ordered single threaded workqueue for kmem cache creation. This shouldn't introduce any functional changes regarding how kmem caches are created, as the work function holds the global slab_mutex during its whole runtime anyway, making it impossible to run more than one work at a time. By using a single threaded workqueue, we just avoid creating a thread per each work. Ordering is required to avoid a situation when a cgroup's work is put off indefinitely because there are other cgroups to serve, in other words to guarantee fairness. Link: https://bugzilla.kernel.org/show_bug.cgi?id=172981 Link: http://lkml.kernel.org/r/20161004131417.GC1862@esperanza Signed-off-by: Vladimir Davydov <vdavydov.dev@gmail.com> Reported-by: Doug Smythies <dsmythies@telus.net> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-12Merge branch 'x86-asm-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 asm updates from Ingo Molnar: "The main changes in this development cycle were: - a large number of call stack dumping/printing improvements: higher robustness, better cross-context dumping, improved output, etc. (Josh Poimboeuf) - vDSO getcpu() performance improvement for future Intel CPUs with the RDPID instruction (Andy Lutomirski) - add two new Intel AVX512 features and the CPUID support infrastructure for it: AVX512IFMA and AVX512VBMI. (Gayatri Kammela, He Chen) - more copy-user unification (Borislav Petkov) - entry code assembly macro simplifications (Alexander Kuleshov) - vDSO C/R support improvements (Dmitry Safonov) - misc fixes and cleanups (Borislav Petkov, Paul Bolle)" * 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (40 commits) scripts/decode_stacktrace.sh: Fix address line detection on x86 x86/boot/64: Use defines for page size x86/dumpstack: Make stack name tags more comprehensible selftests/x86: Add test_vdso to test getcpu() x86/vdso: Use RDPID in preference to LSL when available x86/dumpstack: Handle NULL stack pointer in show_trace_log_lvl() x86/cpufeatures: Enable new AVX512 cpu features x86/cpuid: Provide get_scattered_cpuid_leaf() x86/cpuid: Cleanup cpuid_regs definitions x86/copy_user: Unify the code by removing the 64-bit asm _copy_*_user() variants x86/unwind: Ensure stack grows down x86/vdso: Set vDSO pointer only after success x86/prctl/uapi: Remove #ifdef for CHECKPOINT_RESTORE x86/unwind: Detect bad stack return address x86/dumpstack: Warn on stack recursion x86/unwind: Warn on bad frame pointer x86/decoder: Use stderr if insn sanity test fails x86/decoder: Use stdout if insn decoder test is successful mm/page_alloc: Remove kernel address exposure in free_reserved_area() x86/dumpstack: Remove raw stack dump ...
2016-12-12Merge branches 'pm-sleep' and 'powercap'Rafael J. Wysocki
* pm-sleep: PM / sleep: Print active wakeup sources when blocking on wakeup_count reads x86/suspend: fix false positive KASAN warning on suspend/resume PM / sleep / ACPI: Use the ACPI_FADT_LOW_POWER_S0 flag PM / sleep: System sleep state selection interface rework PM / hibernate: Verify the consistent of e820 memory map by md5 digest * powercap: powercap / RAPL: Add Knights Mill CPUID powercap/intel_rapl: fix and tidy up error handling powercap/intel_rapl: Track active CPUs internally powercap/intel_rapl: Cleanup duplicated init code powercap/intel rapl: Convert to hotplug state machine powercap/intel_rapl: Propagate error code when registration fails powercap/intel_rapl: Add missing domain data update on hotplug
2016-12-12Merge branch 'mm-pat-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull mm/PAT cleanup from Ingo Molnar: "A single cleanup for a generic interface that was originally introduced for PAT" * 'mm-pat-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/pat, mm: Make track_pfn_insert() return void
2016-12-06shmem: fix shm fallocate() list corruptionLinus Torvalds
The shmem hole punching with fallocate(FALLOC_FL_PUNCH_HOLE) does not want to race with generating new pages by faulting them in. However, the wait-queue used to delay the page faulting has a serious problem: the wait queue head (in shmem_fallocate()) is allocated on the stack, and the code expects that "wake_up_all()" will make sure that all the queue entries are gone before the stack frame is de-allocated. And that is not at all necessarily the case. Yes, a normal wake-up sequence will remove the wait-queue entry that caused the wakeup (see "autoremove_wake_function()"), but the key wording there is "that caused the wakeup". When there are multiple possible wakeup sources, the wait queue entry may well stay around. And _particularly_ in a page fault path, we may be faulting in new pages from user space while we also have other things going on, and there may well be other pending wakeups. So despite the "wake_up_all()", it's not at all guaranteed that all list entries are removed from the wait queue head on the stack. Fix this by introducing a new wakeup function that removes the list entry unconditionally, even if the target process had already woken up for other reasons. Use that "synchronous" function to set up the waiters in shmem_fault(). This problem has never been seen in the wild afaik, but Dave Jones has reported it on and off while running trinity. We thought we fixed the stack corruption with the blk-mq rq_list locking fix (commit 7fe311302f7d: "blk-mq: update hardware and software queues for sleeping alloc"), but it turns out there was _another_ stack corruptor hiding in the trinity runs. Vegard Nossum (also running trinity) was able to trigger this one fairly consistently, and made us look once again at the shmem code due to the faults often being in that area. Reported-and-tested-by: Vegard Nossum <vegard.nossum@oracle.com>. Reported-by: Dave Jones <davej@codemonkey.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-06x86/suspend: fix false positive KASAN warning on suspend/resumeJosh Poimboeuf
Resuming from a suspend operation is showing a KASAN false positive warning: BUG: KASAN: stack-out-of-bounds in unwind_get_return_address+0x11d/0x130 at addr ffff8803867d7878 Read of size 8 by task pm-suspend/7774 page:ffffea000e19f5c0 count:0 mapcount:0 mapping: (null) index:0x0 flags: 0x2ffff0000000000() page dumped because: kasan: bad access detected CPU: 0 PID: 7774 Comm: pm-suspend Tainted: G B 4.9.0-rc7+ #8 Hardware name: Gigabyte Technology Co., Ltd. Z170X-UD5/Z170X-UD5-CF, BIOS F5 03/07/2016 Call Trace: dump_stack+0x63/0x82 kasan_report_error+0x4b4/0x4e0 ? acpi_hw_read_port+0xd0/0x1ea ? kfree_const+0x22/0x30 ? acpi_hw_validate_io_request+0x1a6/0x1a6 __asan_report_load8_noabort+0x61/0x70 ? unwind_get_return_address+0x11d/0x130 unwind_get_return_address+0x11d/0x130 ? unwind_next_frame+0x97/0xf0 __save_stack_trace+0x92/0x100 save_stack_trace+0x1b/0x20 save_stack+0x46/0xd0 ? save_stack_trace+0x1b/0x20 ? save_stack+0x46/0xd0 ? kasan_kmalloc+0xad/0xe0 ? kasan_slab_alloc+0x12/0x20 ? acpi_hw_read+0x2b6/0x3aa ? acpi_hw_validate_register+0x20b/0x20b ? acpi_hw_write_port+0x72/0xc7 ? acpi_hw_write+0x11f/0x15f ? acpi_hw_read_multiple+0x19f/0x19f ? memcpy+0x45/0x50 ? acpi_hw_write_port+0x72/0xc7 ? acpi_hw_write+0x11f/0x15f ? acpi_hw_read_multiple+0x19f/0x19f ? kasan_unpoison_shadow+0x36/0x50 kasan_kmalloc+0xad/0xe0 kasan_slab_alloc+0x12/0x20 kmem_cache_alloc_trace+0xbc/0x1e0 ? acpi_get_sleep_type_data+0x9a/0x578 acpi_get_sleep_type_data+0x9a/0x578 acpi_hw_legacy_wake_prep+0x88/0x22c ? acpi_hw_legacy_sleep+0x3c7/0x3c7 ? acpi_write_bit_register+0x28d/0x2d3 ? acpi_read_bit_register+0x19b/0x19b acpi_hw_sleep_dispatch+0xb5/0xba acpi_leave_sleep_state_prep+0x17/0x19 acpi_suspend_enter+0x154/0x1e0 ? trace_suspend_resume+0xe8/0xe8 suspend_devices_and_enter+0xb09/0xdb0 ? printk+0xa8/0xd8 ? arch_suspend_enable_irqs+0x20/0x20 ? try_to_freeze_tasks+0x295/0x600 pm_suspend+0x6c9/0x780 ? finish_wait+0x1f0/0x1f0 ? suspend_devices_and_enter+0xdb0/0xdb0 state_store+0xa2/0x120 ? kobj_attr_show+0x60/0x60 kobj_attr_store+0x36/0x70 sysfs_kf_write+0x131/0x200 kernfs_fop_write+0x295/0x3f0 __vfs_write+0xef/0x760 ? handle_mm_fault+0x1346/0x35e0 ? do_iter_readv_writev+0x660/0x660 ? __pmd_alloc+0x310/0x310 ? do_lock_file_wait+0x1e0/0x1e0 ? apparmor_file_permission+0x18/0x20 ? security_file_permission+0x73/0x1c0 ? rw_verify_area+0xbd/0x2b0 vfs_write+0x149/0x4a0 SyS_write+0xd9/0x1c0 ? SyS_read+0x1c0/0x1c0 entry_SYSCALL_64_fastpath+0x1e/0xad Memory state around the buggy address: ffff8803867d7700: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ffff8803867d7780: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >ffff8803867d7800: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 f4 ^ ffff8803867d7880: f3 f3 f3 f3 00 00 00 00 00 00 00 00 00 00 00 00 ffff8803867d7900: 00 00 00 f1 f1 f1 f1 04 f4 f4 f4 f3 f3 f3 f3 00 KASAN instrumentation poisons the stack when entering a function and unpoisons it when exiting the function. However, in the suspend path, some functions never return, so their stack never gets unpoisoned, resulting in stale KASAN shadow data which can cause later false positive warnings like the one above. Reported-by: Scott Bauer <scott.bauer@intel.com> Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Acked-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-12-02mm, vmscan: add cond_resched() into shrink_node_memcg()Michal Hocko
Boris Zhmurov has reported RCU stalls during the kswapd reclaim: INFO: rcu_sched detected stalls on CPUs/tasks: 23-...: (22 ticks this GP) idle=92f/140000000000000/0 softirq=2638404/2638404 fqs=23 (detected by 4, t=6389 jiffies, g=786259, c=786258, q=42115) Task dump for CPU 23: kswapd1 R running task 0 148 2 0x00000008 Call Trace: shrink_node+0xd2/0x2f0 kswapd+0x2cb/0x6a0 mem_cgroup_shrink_node+0x160/0x160 kthread+0xbd/0xe0 __switch_to+0x1fa/0x5c0 ret_from_fork+0x1f/0x40 kthread_create_on_node+0x180/0x180 a closer code inspection has shown that we might indeed miss all the scheduling points in the reclaim path if no pages can be isolated from the LRU list. This is a pathological case but other reports from Donald Buczek have shown that we might indeed hit such a path: clusterd-989 [009] .... 118023.654491: mm_vmscan_direct_reclaim_end: nr_reclaimed=193 kswapd1-86 [001] dN.. 118023.987475: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239830 nr_taken=0 file=1 kswapd1-86 [001] dN.. 118024.320968: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239844 nr_taken=0 file=1 kswapd1-86 [001] dN.. 118024.654375: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239858 nr_taken=0 file=1 kswapd1-86 [001] dN.. 118024.987036: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239872 nr_taken=0 file=1 kswapd1-86 [001] dN.. 118025.319651: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239886 nr_taken=0 file=1 kswapd1-86 [001] dN.. 118025.652248: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239900 nr_taken=0 file=1 kswapd1-86 [001] dN.. 118025.984870: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239914 nr_taken=0 file=1 [...] kswapd1-86 [001] dN.. 118084.274403: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4241133 nr_taken=0 file=1 this is minute long snapshot which didn't take a single page from the LRU. It is not entirely clear why only 1303 pages have been scanned during that time (maybe there was a heavy IRQ activity interfering). In any case it looks like we can really hit long periods without scheduling on non preemptive kernels so an explicit cond_resched() in shrink_node_memcg which is independent on the reclaim operation is due. Link: http://lkml.kernel.org/r/20161202095841.16648-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Boris Zhmurov <bb@kernelpanic.ru> Tested-by: Boris Zhmurov <bb@kernelpanic.ru> Reported-by: Donald Buczek <buczek@molgen.mpg.de> Reported-by: "Christopher S. Aker" <caker@theshore.net> Reported-by: Paul Menzel <pmenzel@molgen.mpg.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-02mm: workingset: fix NULL ptr in count_shadow_nodesMichal Hocko
Commit 0a6b76dd23fa ("mm: workingset: make shadow node shrinker memcg aware") has made the workingset shadow nodes shrinker memcg aware. The implementation is not correct though because memcg_kmem_enabled() might become true while we are doing a global reclaim when the sc->memcg might be NULL which is exactly what Marek has seen: BUG: unable to handle kernel NULL pointer dereference at 0000000000000400 IP: [<ffffffff8122d520>] mem_cgroup_node_nr_lru_pages+0x20/0x40 PGD 0 Oops: 0000 [#1] SMP CPU: 0 PID: 60 Comm: kswapd0 Tainted: G O 4.8.10-12.pvops.qubes.x86_64 #1 task: ffff880011863b00 task.stack: ffff880011868000 RIP: mem_cgroup_node_nr_lru_pages+0x20/0x40 RSP: e02b:ffff88001186bc70 EFLAGS: 00010293 RAX: 0000000000000000 RBX: ffff88001186bd20 RCX: 0000000000000002 RDX: 000000000000000c RSI: 0000000000000000 RDI: 0000000000000000 RBP: ffff88001186bc70 R08: 28f5c28f5c28f5c3 R09: 0000000000000000 R10: 0000000000006c34 R11: 0000000000000333 R12: 00000000000001f6 R13: ffffffff81c6f6a0 R14: 0000000000000000 R15: 0000000000000000 FS: 0000000000000000(0000) GS:ffff880013c00000(0000) knlGS:ffff880013d00000 CS: e033 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000400 CR3: 00000000122f2000 CR4: 0000000000042660 Call Trace: count_shadow_nodes+0x9a/0xa0 shrink_slab.part.42+0x119/0x3e0 shrink_node+0x22c/0x320 kswapd+0x32c/0x700 kthread+0xd8/0xf0 ret_from_fork+0x1f/0x40 Code: 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 3b 35 dd eb b1 00 55 48 89 e5 73 2c 89 d2 31 c9 31 c0 4c 63 ce 48 0f a3 ca 73 13 <4a> 8b b4 cf 00 04 00 00 41 89 c8 4a 03 84 c6 80 00 00 00 83 c1 RIP mem_cgroup_node_nr_lru_pages+0x20/0x40 RSP <ffff88001186bc70> CR2: 0000000000000400 ---[ end trace 100494b9edbdfc4d ]--- This patch fixes the issue by checking sc->memcg rather than memcg_kmem_enabled() which is sufficient because shrink_slab makes sure that only memcg aware shrinkers will get non-NULL memcgs and only if memcg_kmem_enabled is true. Fixes: 0a6b76dd23fa ("mm: workingset: make shadow node shrinker memcg aware") Link: http://lkml.kernel.org/r/20161201132156.21450-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Marek Marczykowski-Górecki <marmarek@mimuw.edu.pl> Tested-by: Marek Marczykowski-Górecki <marmarek@mimuw.edu.pl> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Balbir Singh <bsingharora@gmail.com> Cc: <stable@vger.kernel.org> [4.6+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-02mm/compaction: Convert to hotplug state machineAnna-Maria Gleixner
Install the callbacks via the state machine. Should the hotplug init fail then no threads are spawned. Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: linux-mm@kvack.org Cc: rt@linutronix.de Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Vlastimil Babka <vbabka@suse.cz> Link: http://lkml.kernel.org/r/20161126231350.10321-15-bigeasy@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-12-02mm/zswap: Convert pool to hotplug state machineSebastian Andrzej Siewior
Install the callbacks via the state machine. Multi state is used to address the per-pool notifier. Uppon adding of the intance the callback is invoked for all online CPUs so the manual init can go. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: linux-mm@kvack.org Cc: Seth Jennings <sjenning@redhat.com> Cc: rt@linutronix.de Link: http://lkml.kernel.org/r/20161126231350.10321-13-bigeasy@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-12-02mm/zswap: Convert dst-mem to hotplug state machineSebastian Andrzej Siewior
Install the callbacks via the state machine and let the core invoke the callbacks on the already online CPUs. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: linux-mm@kvack.org Cc: Seth Jennings <sjenning@redhat.com> Cc: rt@linutronix.de Link: http://lkml.kernel.org/r/20161126231350.10321-12-bigeasy@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-12-02mm/zsmalloc: Convert to hotplug state machineSebastian Andrzej Siewior
Install the callbacks via the state machine and let the core invoke the callbacks on the already online CPUs. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Cc: linux-mm@kvack.org Cc: Minchan Kim <minchan@kernel.org> Cc: rt@linutronix.de Cc: Nitin Gupta <ngupta@vflare.org> Link: http://lkml.kernel.org/r/20161126231350.10321-11-bigeasy@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-12-02mm/vmstat: Convert to hotplug state machineSebastian Andrzej Siewior
Install the callbacks via the state machine, but do not invoke them as we can initialize the node state without calling the callbacks on all online CPUs. start_shepherd_timer() is now called outside the get_online_cpus() block which is safe as it only operates on cpu possible mask. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: linux-mm@kvack.org Cc: rt@linutronix.de Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Vlastimil Babka <vbabka@suse.cz> Link: http://lkml.kernel.org/r/20161129145221.ffc3kg3hd7lxiwj6@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-12-02mm/vmstat: Avoid on each online CPU loopsSebastian Andrzej Siewior
Both iterations over online cpus can be replaced by the proper node specific functions. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Acked-by: Michal Hocko <mhocko@suse.com> Cc: linux-mm@kvack.org Cc: rt@linutronix.de Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Vlastimil Babka <vbabka@suse.cz> Link: http://lkml.kernel.org/r/20161129145113.fn3lw5aazjjvdrr3@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-12-02mm/vmstat: Drop get_online_cpus() from init_cpu_node_state/vmstat_cpu_dead()Sebastian Andrzej Siewior
Both functions are called with protection against cpu hotplug already so *_online_cpus() could be dropped. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Acked-by: Michal Hocko <mhocko@suse.com> Cc: linux-mm@kvack.org Cc: rt@linutronix.de Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Vlastimil Babka <vbabka@suse.cz> Link: http://lkml.kernel.org/r/20161126231350.10321-8-bigeasy@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-11-30mm: fix false-positive WARN_ON() in truncate/invalidate for hugetlbKirill A. Shutemov
Hugetlb pages have ->index in size of the huge pages (PMD_SIZE or PUD_SIZE), not in PAGE_SIZE as other types of pages. This means we cannot user page_to_pgoff() to check whether we've got the right page for the radix-tree index. Let's introduce page_to_index() which would return radix-tree index for given page. We will be able to get rid of this once hugetlb will be switched to multi-order entries. Fixes: fc127da085c2 ("truncate: handle file thp") Link: http://lkml.kernel.org/r/20161123093053.mjbnvn5zwxw5e6lk@black.fi.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reported-by: Doug Nelson <doug.nelson@intel.com> Tested-by: Doug Nelson <doug.nelson@intel.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: <stable@vger.kernel.org> [4.8+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-11-30kasan: support use-after-scope detectionDmitry Vyukov
Gcc revision 241896 implements use-after-scope detection. Will be available in gcc 7. Support it in KASAN. Gcc emits 2 new callbacks to poison/unpoison large stack objects when they go in/out of scope. Implement the callbacks and add a test. [dvyukov@google.com: v3] Link: http://lkml.kernel.org/r/1479998292-144502-1-git-send-email-dvyukov@google.com Link: http://lkml.kernel.org/r/1479226045-145148-1-git-send-email-dvyukov@google.com Signed-off-by: Dmitry Vyukov <dvyukov@google.com> Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Alexander Potapenko <glider@google.com> Cc: <stable@vger.kernel.org> [4.0+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-11-30kasan: update kasan_global for gcc 7Dmitry Vyukov
kasan_global struct is part of compiler/runtime ABI. gcc revision 241983 has added a new field to kasan_global struct. Update kernel definition of kasan_global struct to include the new field. Without this patch KASAN is broken with gcc 7. Link: http://lkml.kernel.org/r/1479219743-28682-1-git-send-email-dvyukov@google.com Signed-off-by: Dmitry Vyukov <dvyukov@google.com> Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Alexander Potapenko <glider@google.com> Cc: <stable@vger.kernel.org> [4.0+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-11-30thp: fix corner case of munlock() of PTE-mapped THPsKirill A. Shutemov
The following program triggers BUG() in munlock_vma_pages_range(): // autogenerated by syzkaller (http://github.com/google/syzkaller) #include <sys/mman.h> int main() { mmap((void*)0x20105000ul, 0xc00000ul, 0x2ul, 0x2172ul, -1, 0); mremap((void*)0x201fd000ul, 0x4000ul, 0xc00000ul, 0x3ul, 0x203f0000ul); return 0; } The test-case constructs the situation when munlock_vma_pages_range() finds PTE-mapped THP-head in the middle of page table and, by mistake, skips HPAGE_PMD_NR pages after that. As result, on the next iteration it hits the middle of PMD-mapped THP and gets upset seeing mlocked tail page. The solution is only skip HPAGE_PMD_NR pages if the THP was mlocked during munlock_vma_page(). It would guarantee that the page is PMD-mapped as we never mlock PTE-mapeed THPs. Fixes: e90309c9f772 ("thp: allow mlocked THP again") Link: http://lkml.kernel.org/r/20161115132703.7s7rrgmwttegcdh4@black.fi.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reported-by: Dmitry Vyukov <dvyukov@google.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: syzkaller <syzkaller@googlegroups.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: <stable@vger.kernel.org> [4.5+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-11-30mm, thp: propagation of conditional compilation in khugepaged.cJérémy Lefaure
Commit b46e756f5e47 ("thp: extract khugepaged from mm/huge_memory.c") moved code from huge_memory.c to khugepaged.c. Some of this code should be compiled only when CONFIG_SYSFS is enabled but the condition around this code was not moved into khugepaged.c. The result is a compilation error when CONFIG_SYSFS is disabled: mm/built-in.o: In function `khugepaged_defrag_store': khugepaged.c:(.text+0x2d095): undefined reference to `single_hugepage_flag_store' mm/built-in.o: In function `khugepaged_defrag_show': khugepaged.c:(.text+0x2d0ab): undefined reference to `single_hugepage_flag_show' This commit adds the #ifdef CONFIG_SYSFS around the code related to sysfs. Link: http://lkml.kernel.org/r/20161114203448.24197-1-jeremy.lefaure@lse.epita.fr Signed-off-by: Jérémy Lefaure <jeremy.lefaure@lse.epita.fr> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-11-29mremap: move_ptes: check pte dirty after its removalAaron Lu
Linus found there still is a race in mremap after commit 5d1904204c99 ("mremap: fix race between mremap() and page cleanning"). As described by Linus: "the issue is that another thread might make the pte be dirty (in the hardware walker, so no locking of ours will make any difference) *after* we checked whether it was dirty, but *before* we removed it from the page tables" Fix it by moving the check after we removed it from the page table. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Aaron Lu <aaron.lu@intel.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-11-17mremap: fix race between mremap() and page cleanningAaron Lu
Prior to 3.15, there was a race between zap_pte_range() and page_mkclean() where writes to a page could be lost. Dave Hansen discovered by inspection that there is a similar race between move_ptes() and page_mkclean(). We've been able to reproduce the issue by enlarging the race window with a msleep(), but have not been able to hit it without modifying the code. So, we think it's a real issue, but is difficult or impossible to hit in practice. The zap_pte_range() issue is fixed by commit 1cf35d47712d("mm: split 'tlb_flush_mmu()' into tlb flushing and memory freeing parts"). And this patch is to fix the race between page_mkclean() and mremap(). Here is one possible way to hit the race: suppose a process mmapped a file with READ | WRITE and SHARED, it has two threads and they are bound to 2 different CPUs, e.g. CPU1 and CPU2. mmap returned X, then thread 1 did a write to addr X so that CPU1 now has a writable TLB for addr X on it. Thread 2 starts mremaping from addr X to Y while thread 1 cleaned the page and then did another write to the old addr X again. The 2nd write from thread 1 could succeed but the value will get lost. thread 1 thread 2 (bound to CPU1) (bound to CPU2) 1: write 1 to addr X to get a writeable TLB on this CPU 2: mremap starts 3: move_ptes emptied PTE for addr X and setup new PTE for addr Y and then dropped PTL for X and Y 4: page laundering for N by doing fadvise FADV_DONTNEED. When done, pageframe N is deemed clean. 5: *write 2 to addr X 6: tlb flush for addr X 7: munmap (Y, pagesize) to make the page unmapped 8: fadvise with FADV_DONTNEED again to kick the page off the pagecache 9: pread the page from file to verify the value. If 1 is there, it means we have lost the written 2. *the write may or may not cause segmentation fault, it depends on if the TLB is still on the CPU. Please note that this is only one specific way of how the race could occur, it didn't mean that the race could only occur in exact the above config, e.g. more than 2 threads could be involved and fadvise() could be done in another thread, etc. For anonymous pages, they could race between mremap() and page reclaim: THP: a huge PMD is moved by mremap to a new huge PMD, then the new huge PMD gets unmapped/splitted/pagedout before the flush tlb happened for the old huge PMD in move_page_tables() and we could still write data to it. The normal anonymous page has similar situation. To fix this, check for any dirty PTE in move_ptes()/move_huge_pmd() and if any, did the flush before dropping the PTL. If we did the flush for every move_ptes()/move_huge_pmd() call then we do not need to do the flush in move_pages_tables() for the whole range. But if we didn't, we still need to do the whole range flush. Alternatively, we can track which part of the range is flushed in move_ptes()/move_huge_pmd() and which didn't to avoid flushing the whole range in move_page_tables(). But that would require multiple tlb flushes for the different sub-ranges and should be less efficient than the single whole range flush. KBuild test on my Sandybridge desktop doesn't show any noticeable change. v4.9-rc4: real 5m14.048s user 32m19.800s sys 4m50.320s With this commit: real 5m13.888s user 32m19.330s sys 4m51.200s Reported-by: Dave Hansen <dave.hansen@intel.com> Signed-off-by: Aaron Lu <aaron.lu@intel.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-11-17Merge branch 'x86/cpufeature' into x86/asm, to pick up dependencyIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-14Merge 4.9-rc5 into char-misc-nextGreg Kroah-Hartman
We want those fixes in here as well. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-11-13Merge branch 'dax-4.10-iomap-pmd' into originTheodore Ts'o
2016-11-11mm: kmemleak: scan .data.ro_after_initJakub Kicinski
Limit the number of kmemleak false positives by including .data.ro_after_init in memory scanning. To achieve this we need to add symbols for start and end of the section to the linker scripts. The problem was been uncovered by commit 56989f6d8568 ("genetlink: mark families as __ro_after_init"). Link: http://lkml.kernel.org/r/1478274173-15218-1-git-send-email-jakub.kicinski@netronome.com Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-11-11memcg: prevent memcg caches to be both OFF_SLAB & OBJFREELIST_SLABGreg Thelen
While testing OBJFREELIST_SLAB integration with pagealloc, we found a bug where kmem_cache(sys) would be created with both CFLGS_OFF_SLAB & CFLGS_OBJFREELIST_SLAB. When it happened, critical allocations needed for loading drivers or creating new caches will fail. The original kmem_cache is created early making OFF_SLAB not possible. When kmem_cache(sys) is created, OFF_SLAB is possible and if pagealloc is enabled it will try to enable it first under certain conditions. Given kmem_cache(sys) reuses the original flag, you can have both flags at the same time resulting in allocation failures and odd behaviors. This fix discards allocator specific flags from memcg before calling create_cache. The bug exists since 4.6-rc1 and affects testing debug pagealloc configurations. Fixes: b03a017bebc4 ("mm/slab: introduce new slab management type, OBJFREELIST_SLAB") Link: http://lkml.kernel.org/r/1478553075-120242-1-git-send-email-thgarnie@google.com Signed-off-by: Greg Thelen <gthelen@google.com> Signed-off-by: Thomas Garnier <thgarnie@google.com> Tested-by: Thomas Garnier <thgarnie@google.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-11-11mm/filemap: don't allow partially uptodate page for pipesEryu Guan
Starting from 4.9-rc1 kernel, I started noticing some test failures of sendfile(2) and splice(2) (sendfile0N and splice01 from LTP) when testing on sub-page block size filesystems (tested both XFS and ext4), these syscalls start to return EIO in the tests. e.g. sendfile02 1 TFAIL : sendfile02.c:133: sendfile(2) failed to return expected value, expected: 26, got: -1 sendfile02 2 TFAIL : sendfile02.c:133: sendfile(2) failed to return expected value, expected: 24, got: -1 sendfile02 3 TFAIL : sendfile02.c:133: sendfile(2) failed to return expected value, expected: 22, got: -1 sendfile02 4 TFAIL : sendfile02.c:133: sendfile(2) failed to return expected value, expected: 20, got: -1 This is because that in sub-page block size cases, we don't need the whole page to be uptodate, only the part we care about is uptodate is OK (if fs has ->is_partially_uptodate defined). But page_cache_pipe_buf_confirm() doesn't have the ability to check the partially-uptodate case, it needs the whole page to be uptodate. So it returns EIO in this case. This is a regression introduced by commit 82c156f85384 ("switch generic_file_splice_read() to use of ->read_iter()"). Prior to the change, generic_file_splice_read() doesn't allow partially-uptodate page either, so it worked fine. Fix it by skipping the partially-uptodate check if we're working on a pipe in do_generic_file_read(), so we read the whole page from disk as long as the page is not uptodate. I think the other way to fix it is to add the ability to check & allow partially-uptodate page to page_cache_pipe_buf_confirm(), but that is much harder to do and seems gain little. Link: http://lkml.kernel.org/r/1477986187-12717-1-git-send-email-guaneryu@gmail.com Signed-off-by: Eryu Guan <guaneryu@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-11-11mm/hugetlb: fix huge page reservation leak in private mapping error pathsMike Kravetz
Error paths in hugetlb_cow() and hugetlb_no_page() may free a newly allocated huge page. If a reservation was associated with the huge page, alloc_huge_page() consumed the reservation while allocating. When the newly allocated page is freed in free_huge_page(), it will increment the global reservation count. However, the reservation entry in the reserve map will remain. This is not an issue for shared mappings as the entry in the reserve map indicates a reservation exists. But, an entry in a private mapping reserve map indicates the reservation was consumed and no longer exists. This results in an inconsistency between the reserve map and the global reservation count. This 'leaks' a reserved huge page. Create a new routine restore_reserve_on_error() to restore the reserve entry in these specific error paths. This routine makes use of a new function vma_add_reservation() which will add a reserve entry for a specific address/page. In general, these error paths were rarely (if ever) taken on most architectures. However, powerpc contained arch specific code that that resulted in an extra fault and execution of these error paths on all private mappings. Fixes: 67961f9db8c4 ("mm/hugetlb: fix huge page reserve accounting for private mappings) Link: http://lkml.kernel.org/r/1476933077-23091-2-git-send-email-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reported-by: Jan Stancek <jstancek@redhat.com> Tested-by: Jan Stancek <jstancek@redhat.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Kirill A . Shutemov <kirill.shutemov@linux.intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-11-11mm: hwpoison: fix thp split handling in memory_failure()Naoya Horiguchi
When memory_failure() runs on a thp tail page after pmd is split, we trigger the following VM_BUG_ON_PAGE(): page:ffffd7cd819b0040 count:0 mapcount:0 mapping: (null) index:0x1 flags: 0x1fffc000400000(hwpoison) page dumped because: VM_BUG_ON_PAGE(!page_count(p)) ------------[ cut here ]------------ kernel BUG at /src/linux-dev/mm/memory-failure.c:1132! memory_failure() passed refcount and page lock from tail page to head page, which is not needed because we can pass any subpage to split_huge_page(). Fixes: 61f5d698cc97 ("mm: re-enable THP") Link: http://lkml.kernel.org/r/1477961577-7183-1-git-send-email-n-horiguchi@ah.jp.nec.com Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: <stable@vger.kernel.org> [4.5+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-11-11swapfile: fix memory corruption via malformed swapfileJann Horn
When root activates a swap partition whose header has the wrong endianness, nr_badpages elements of badpages are swabbed before nr_badpages has been checked, leading to a buffer overrun of up to 8GB. This normally is not a security issue because it can only be exploited by root (more specifically, a process with CAP_SYS_ADMIN or the ability to modify a swap file/partition), and such a process can already e.g. modify swapped-out memory of any other userspace process on the system. Link: http://lkml.kernel.org/r/1477949533-2509-1-git-send-email-jann@thejh.net Signed-off-by: Jann Horn <jann@thejh.net> Acked-by: Kees Cook <keescook@chromium.org> Acked-by: Jerome Marchand <jmarchan@redhat.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-11-11mm/cma.c: check the max limit for cma allocationShiraz Hashim
CMA allocation request size is represented by size_t that gets truncated when same is passed as int to bitmap_find_next_zero_area_off. We observe that during fuzz testing when cma allocation request is too high, bitmap_find_next_zero_area_off still returns success due to the truncation. This leads to kernel crash, as subsequent code assumes that requested memory is available. Fail cma allocation in case the request breaches the corresponding cma region size. Link: http://lkml.kernel.org/r/1478189211-3467-1-git-send-email-shashim@codeaurora.org Signed-off-by: Shiraz Hashim <shashim@codeaurora.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>