summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2017-07-10mm, memcg: fix potential undefined behavior in mem_cgroup_event_ratelimit()Michal Hocko
Alice has reported the following UBSAN splat: UBSAN: Undefined behaviour in mm/memcontrol.c:661:17 signed integer overflow: -2147483644 - 2147483525 cannot be represented in type 'long int' CPU: 1 PID: 11758 Comm: mybibtex2filena Tainted: P O 4.9.25-gentoo #4 Hardware name: XXXXXX, BIOS YYYYYY Call Trace: dump_stack+0x59/0x87 ubsan_epilogue+0xe/0x40 handle_overflow+0xbb/0xf0 __ubsan_handle_sub_overflow+0x12/0x20 memcg_check_events.isra.36+0x223/0x360 mem_cgroup_commit_charge+0x55/0x140 wp_page_copy+0x34e/0xb80 do_wp_page+0x1e6/0x1300 handle_mm_fault+0x88b/0x1990 __do_page_fault+0x2de/0x8a0 do_page_fault+0x1a/0x20 error_code+0x67/0x6c The reason is that we subtract two signed types. Let's fix this by truly mimicing time_after and cast the result of the subtraction. Link: http://lkml.kernel.org/r/20170616150057.GQ30580@dhcp22.suse.cz Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Alice Ferrazzi <alicef@gentoo.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10mm, hugetlb: schedule when potentially allocating many hugepagesDavid Rientjes
A few hugetlb allocators loop while calling the page allocator and can potentially prevent rescheduling if the page allocator slowpath is not utilized. Conditionally schedule when large numbers of hugepages can be allocated. Anshuman: "Fixes a task which was getting hung while writing like 10000 hugepages (16MB on POWER8) into /proc/sys/vm/nr_hugepages." Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1706091535300.66176@chino.kir.corp.google.com Signed-off-by: David Rientjes <rientjes@google.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Tested-by: Anshuman Khandual <khandual@linux.vnet.ibm.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10mm: unify new_node_page and alloc_migrate_targetMichal Hocko
Commit 394e31d2ceb4 ("mem-hotplug: alloc new page from a nearest neighbor node when mem-offline") has duplicated a large part of alloc_migrate_target with some hotplug specific special casing. To be more precise it tried to enfore the allocation from a different node than the original page. As a result the two function diverged in their shared logic, e.g. the hugetlb allocation strategy. Let's unify the two and express different NUMA requirements by the given nodemask. new_node_page will simply exclude the node it doesn't care about and alloc_migrate_target will use all the available nodes. alloc_migrate_target will then learn to migrate hugetlb pages more sanely and use preallocated pool when possible. Please note that alloc_migrate_target used to call alloc_page resp. alloc_pages_current so the memory policy of the current context which is quite strange when we consider that it is used in the context of alloc_contig_range which just tries to migrate pages which stand in the way. Link: http://lkml.kernel.org/r/20170608074553.22152-4-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Xishi Qiu <qiuxishi@huawei.com> Cc: zhong jiang <zhongjiang@huawei.com> Cc: Joonsoo Kim <js1304@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10hugetlb, memory_hotplug: prefer to use reserved pages for migrationMichal Hocko
new_node_page will try to use the origin's next NUMA node as the migration destination for hugetlb pages. If such a node doesn't have any preallocated pool it falls back to __alloc_buddy_huge_page_no_mpol to allocate a surplus page instead. This is quite subotpimal for any configuration when hugetlb pages are no distributed to all NUMA nodes evenly. Say we have a hotplugable node 4 and spare hugetlb pages are node 0 /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages:10000 /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages:0 /sys/devices/system/node/node2/hugepages/hugepages-2048kB/nr_hugepages:0 /sys/devices/system/node/node3/hugepages/hugepages-2048kB/nr_hugepages:0 /sys/devices/system/node/node4/hugepages/hugepages-2048kB/nr_hugepages:10000 /sys/devices/system/node/node5/hugepages/hugepages-2048kB/nr_hugepages:0 /sys/devices/system/node/node6/hugepages/hugepages-2048kB/nr_hugepages:0 /sys/devices/system/node/node7/hugepages/hugepages-2048kB/nr_hugepages:0 Now we consume the whole pool on node 4 and try to offline this node. All the allocated pages should be moved to node0 which has enough preallocated pages to hold them. With the current implementation offlining very likely fails because hugetlb allocations during runtime are much less reliable. Fix this by reusing the nodemask which excludes migration source and try to find a first node which has a page in the preallocated pool first and fall back to __alloc_buddy_huge_page_no_mpol only when the whole pool is consumed. [akpm@linux-foundation.org: remove bogus arg from alloc_huge_page_nodemask() stub] Link: http://lkml.kernel.org/r/20170608074553.22152-3-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Xishi Qiu <qiuxishi@huawei.com> Cc: zhong jiang <zhongjiang@huawei.com> Cc: Joonsoo Kim <js1304@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10mm, memory_hotplug: simplify empty node mask handling in new_node_pageMichal Hocko
new_node_page tries to allocate the target page on a different NUMA node than the source page. This makes sense in most cases during the hotplug because we are likely to offline the whole numa node. But there are cases where there are no other nodes to fallback (e.g. when offlining parts of the only existing node) and we have to fallback to allocating from the source node. The current code does that but it can be simplified by checking the nmask and updating it before we even try to allocate rather than special casing it. This patch shouldn't introduce any functional change. Link: http://lkml.kernel.org/r/20170608074553.22152-2-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Xishi Qiu <qiuxishi@huawei.com> Cc: zhong jiang <zhongjiang@huawei.com> Cc: Joonsoo Kim <js1304@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10mm, memory_hotplug: support movable_node for hotpluggable nodesMichal Hocko
movable_node kernel parameter allows making hotpluggable NUMA nodes to put all the hotplugable memory into movable zone which allows more or less reliable memory hotremove. At least this is the case for the NUMA nodes present during the boot (see find_zone_movable_pfns_for_nodes). This is not the case for the memory hotplug, though. echo online > /sys/devices/system/memory/memoryXYZ/state will default to a kernel zone (usually ZONE_NORMAL) unless the particular memblock is already in the movable zone range which is not the case normally when onlining the memory from the udev rule context for a freshly hotadded NUMA node. The only option currently is to have a special udev rule to echo online_movable to all memblocks belonging to such a node which is rather clumsy. Not to mention this is inconsistent as well because what ended up in the movable zone during the boot will end up in a kernel zone after hotremove & hotadd without special care. It would be nice to reuse memblock_is_hotpluggable but the runtime hotplug doesn't have that information available because the boot and hotplug paths are not shared and it would be really non trivial to make them use the same code path because the runtime hotplug doesn't play with the memblock allocator at all. Teach move_pfn_range that MMOP_ONLINE_KEEP can use the movable zone if movable_node is enabled and the range doesn't overlap with the existing normal zone. This should provide a reasonable default onlining strategy. Strictly speaking the semantic is not identical with the boot time initialization because find_zone_movable_pfns_for_nodes covers only the hotplugable range as described by the BIOS/FW. From my experience this is usually a full node though (except for Node0 which is special and never goes away completely). If this turns out to be a problem in the real life we can tweak the code to store hotplug flag into memblocks but let's keep this simple now. Link: http://lkml.kernel.org/r/20170612111227.GI7476@dhcp22.suse.cz Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Reza Arbab <arbab@linux.vnet.ibm.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Yasuaki Ishimatsu <yasu.isimatu@gmail.com> Cc: <qiuxishi@huawei.com> Cc: Kani Toshimitsu <toshi.kani@hpe.com> Cc: <slaoub@gmail.com> Cc: Joonsoo Kim <js1304@gmail.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Daniel Kiper <daniel.kiper@oracle.com> Cc: Igor Mammedov <imammedo@redhat.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10zram: use __sysfs_match_string() helperAndy Shevchenko
Use __sysfs_match_string() helper instead of open coded variant. Link: http://lkml.kernel.org/r/20170609120835.22156-1-andriy.shevchenko@linux.intel.com Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10mm/migrate.c: stabilise page count when migrating transparent hugepagesWill Deacon
When migrating a transparent hugepage, migrate_misplaced_transhuge_page guards itself against a concurrent fastgup of the page by checking that the page count is equal to 2 before and after installing the new pmd. If the page count changes, then the pmd is reverted back to the original entry, however there is a small window where the new (possibly writable) pmd is installed and the underlying page could be written by userspace. Restoring the old pmd could therefore result in loss of data. This patch fixes the problem by freezing the page count whilst updating the page tables, which protects against a concurrent fastgup without the need to restore the old pmd in the failure case (since the page count can no longer change under our feet). Link: http://lkml.kernel.org/r/1497349722-6731-4-git-send-email-will.deacon@arm.com Signed-off-by: Will Deacon <will.deacon@arm.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Steve Capper <steve.capper@arm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10include/linux/page_ref.h: ensure page_ref_unfreeze is ordered against prior ↵Will Deacon
accesses page_ref_freeze and page_ref_unfreeze are designed to be used as a pair, wrapping a critical section where struct pages can be modified without having to worry about consistency for a concurrent fast-GUP. Whilst page_ref_freeze has full barrier semantics due to its use of atomic_cmpxchg, page_ref_unfreeze is implemented using atomic_set, which doesn't provide any barrier semantics and allows the operation to be reordered with respect to page modifications in the critical section. This patch ensures that page_ref_unfreeze is ordered after any critical section updates, by invoking smp_mb() prior to the atomic_set. Link: http://lkml.kernel.org/r/1497349722-6731-3-git-send-email-will.deacon@arm.com Signed-off-by: Will Deacon <will.deacon@arm.com> Acked-by: Steve Capper <steve.capper@arm.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10mm: always enable thp for dax mappingsDan Williams
The madvise policy for transparent huge pages is meant to avoid unwanted allocations of transparent huge pages. It allows a policy of disabling the extra memory pressure and effort to arrange for a huge page when it is not needed. DAX by definition never incurs this overhead since it is statically allocated. The policy choice makes even less sense for device-dax which tries to guarantee a given tlb-fault size. Specifically, the following setting: echo never > /sys/kernel/mm/transparent_hugepage/enabled ...violates that guarantee and silently disables all device-dax instances with a 2M or 1G alignment. So, let's avoid that non-obvious side effect by force enabling thp for dax mappings in all cases. It is worth noting that the reason this uses vma_is_dax(), and the resulting header include changes, is that previous attempts to add a VM_DAX flag were NAKd. Link: http://lkml.kernel.org/r/149739531127.20686.15813586620597484283.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Christoph Hellwig <hch@lst.de> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10mm: improve readability of transparent_hugepage_enabled()Dan Williams
Turn the macro into a static inline and rewrite the condition checks for better readability in preparation for adding another condition. [ross.zwisler@linux.intel.com: fix logic to make conversion equivalent] [akpm@linux-foundation.org: resolve vs mm-make-pr_set_thp_disable-immediately-active.patch] [akpm@linux-foundation.org: include coredump.h for MMF_DISABLE_THP] Link: http://lkml.kernel.org/r/149739530612.20686.14760671150202647861.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Acked-by: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10oom, trace: remove ENUM evaluation of COMPACTION_FEEDBACKSteven Rostedt (VMware)
After enabling CONFIG_TRACE_ENUM_MAP_FILE (which will soon be renamed to CONFIG_TRACE_EVAL_MAP_FILE), I am able to examine the enums that have been evaluated: # cat /sys/kernel/debug/tracing/enum_map (which will soon be renamed to eval_map) And it showed some interesting results: [..] ZONE_MOVABLE 3 (oom) ZONE_NORMAL 2 (oom) ZONE_DMA32 1 (oom) ZONE_DMA 0 (oom) 3 3 (oom) 2 2 (oom) 1 1 (oom) COMPACT_PRIO_ASYNC 2 (oom) COMPACT_PRIO_SYNC_LIGHT 1 (oom) COMPACT_PRIO_SYNC_FULL 0 (oom) [..] ZONE_DMA 0 (vmscan) 3 3 (vmscan) 2 2 (vmscan) 1 1 (vmscan) COMPACT_PRIO_ASYNC 2 (vmscan) [..] ZONE_DMA 0 (kmem) 3 3 (kmem) 2 2 (kmem) 1 1 (kmem) COMPACT_PRIO_ASYNC 2 (kmem) [..] ZONE_DMA 0 (compaction) 3 3 (compaction) 2 2 (compaction) 1 1 (compaction) COMPACT_PRIO_ASYNC 2 (compaction) [..] The name within the parenthesis are the trace systems that the enum/eval maps are associated with. When there's a number evaluated to another number, that tells me that the TRACE_DEFINE_ENUM() was used on a #define and not an enum. As #defines get converted normally, they are not needed to be evaluated. Each of the above trace systems with the number to number evaluation included the file include/trace/events/mmflags.h which has: /* High-level compaction status feedback */ #define COMPACTION_FAILED 1 #define COMPACTION_WITHDRAWN 2 #define COMPACTION_PROGRESS 3 [..] #define COMPACTION_FEEDBACK \ EM(COMPACTION_FAILED, "failed") \ EM(COMPACTION_WITHDRAWN, "withdrawn") \ EMe(COMPACTION_PROGRESS, "progress") Which is still needed for the __print_symbolic() usage in the trace_event. But it is not needed to be evaluated. Removing the evaluation part removes the unnecessary evaluations of numbers to numbers. Link: http://lkml.kernel.org/r/20170615074944.7be9a647@gandalf.local.home Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10mm/hugetlb.c: warn the user when issues arise on boot due to hugepagesLiam R. Howlett
When the user specifies too many hugepages or an invalid default_hugepagesz the communication to the user is implicit in the allocation message. This patch adds a warning when the desired page count is not allocated and prints an error when the default_hugepagesz is invalid on boot. During boot hugepages will allocate until there is a fraction of the hugepage size left. That is, we allocate until either the request is satisfied or memory for the pages is exhausted. When memory for the pages is exhausted, it will most likely lead to the system failing with the OOM manager not finding enough (or anything) to kill (unless you're using really big hugepages in the order of 100s of MB or in the GBs). The user will most likely see the OOM messages much later in the boot sequence than the implicitly stated message. Worse yet, you may even get an OOM for each processor which causes many pages of OOMs on modern systems. Although these messages will be printed earlier than the OOM messages, at least giving the user errors and warnings will highlight the configuration as an issue. I'm trying to point the user in the right direction by providing a more robust statement of what is failing. During the sysctl or echo command, the user can check the results much easier than if the system hangs during boot and the scenario of having nothing to OOM for kernel memory is highly unlikely. Mike said: "Before sending out this patch, I asked Liam off list why he was doing it. Was it something he just thought would be useful? Or, was there some type of user situation/need. He said that he had been called in to assist on several occasions when a system OOMed during boot. In almost all of these situations, the user had grossly misconfigured huge pages. DB users want to pre-allocate just the right amount of huge pages, but sometimes they can be really off. In such situations, the huge page init code just allocates as many huge pages as it can and reports the number allocated. There is no indication that it quit allocating because it ran out of memory. Of course, a user could compare the number in the message to what they requested on the command line to determine if they got all the huge pages they requested. The thought was that it would be useful to at least flag this situation. That way, the user might be able to better relate the huge page allocation failure to the OOM. I'm not sure if the e-mail discussion made it obvious that this is something he has seen on several occasions. I see Michal's point that this will only flag the situation where someone configures huge pages very badly. And, a more extensive look at the situation of misconfiguring huge pages might be in order. But, this has happened on several occasions which led to the creation of this patch" [akpm@linux-foundation.org: reposition memfmt() to avoid forward declaration] Link: http://lkml.kernel.org/r/20170603005413.10380-1-Liam.Howlett@Oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com> Cc: zhongjiang <zhongjiang@huawei.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10mm/cma.c: warn if the CMA area could not be activatedAnshuman Khandual
While activating a CMA area we check to make sure that all the PFNs in the range are inside the same zone. This is a requirement for alloc_contig_range() to work. Any CMA area failing the check is disabled for good. This happens silently right now making all future cma_alloc() allocations failure inevitable. Here we add an error message stating that the CMA area could not be activated which makes it easier to explain any future cma_alloc() failures on it. While in there, change the bail out goto label from 'err' to 'not_in_zone' which makes more sense. Link: http://lkml.kernel.org/r/20170605023729.26303-1-khandual@linux.vnet.ibm.com Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10vmalloc: show lazy-purged vma info in vmallocinfoYisheng Xie
When ioremap a 67112960 bytes vm_area with the vmallocinfo: [..] 0xec79b000-0xec7fa000 389120 ftl_add_mtd+0x4d0/0x754 pages=94 vmalloc 0xec800000-0xecbe1000 4067328 kbox_proc_mem_write+0x104/0x1c4 phys=8b520000 ioremap we get the result: 0xf1000000-0xf5001000 67112960 devm_ioremap+0x38/0x7c phys=40000000 ioremap For the align for ioremap must be less than '1 << IOREMAP_MAX_ORDER': if (flags & VM_IOREMAP) align = 1ul << clamp_t(int, get_count_order_long(size), PAGE_SHIFT, IOREMAP_MAX_ORDER); So it makes idiot like me a litte puzzled why this was a jump the vm_area from 0xec800000-0xecbe1000 to 0xf1000000-0xf5001000, and leaving 0xed000000-0xf1000000 as a big hole. This patch is to show all of vm_area, including vmas which are freeing but still in the vmap_area_list, to make it more clear about why we will get 0xf1000000-0xf5001000 in the above case. And we will get a vmallocinfo like: [..] 0xec79b000-0xec7fa000 389120 ftl_add_mtd+0x4d0/0x754 pages=94 vmalloc 0xec800000-0xecbe1000 4067328 kbox_proc_mem_write+0x104/0x1c4 phys=8b520000 ioremap [..] 0xece7c000-0xece7e000 8192 unpurged vm_area 0xece7e000-0xece83000 20480 vm_map_ram 0xf0099000-0xf00aa000 69632 vm_map_ram after this patch. Link: http://lkml.kernel.org/r/1496649682-20710-1-git-send-email-xieyisheng1@huawei.com Signed-off-by: Yisheng Xie <xieyisheng1@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: zijun_hu <zijun_hu@htc.com> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Hanjun Guo <guohanjun@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10mm/memcontrol: exclude @root from checks in mem_cgroup_lowSean Christopherson
Make @root exclusive in mem_cgroup_low; it is never considered low when looked at directly and is not checked when traversing the tree. In effect, @root is handled identically to how root_mem_cgroup was previously handled by mem_cgroup_low. If @root is not excluded from the checks, a cgroup underneath @root will never be considered low during targeted reclaim of @root, e.g. due to memory.current > memory.high, unless @root is misconfigured to have memory.low > memory.high. Excluding @root enables using memory.low to prioritize memory usage between cgroups within a subtree of the hierarchy that is limited by memory.high or memory.max, e.g. when ROOT owns @root's controls but delegates the @root directory to a USER so that USER can create and administer children of @root. For example, given cgroup A with children B and C: A / \ B C and 1. A/memory.current > A/memory.high 2. A/B/memory.current < A/B/memory.low 3. A/C/memory.current >= A/C/memory.low As 'A' is high, i.e. triggers reclaim from 'A', and 'B' is low, we should reclaim from 'C' until 'A' is no longer high or until we can no longer reclaim from 'C'. If 'A', i.e. @root, isn't excluded by mem_cgroup_low when reclaming from 'A', then 'B' won't be considered low and we will reclaim indiscriminately from both 'B' and 'C'. Here is the test I used to confirm the bug and the patch. 20:00:55@sjchrist-vm ? ~ $ cat ~/.bin/memcg_low_test #!/bin/bash x62mb=$((62<<20)) x66mb=$((66<<20)) x94mb=$((94<<20)) x98mb=$((98<<20)) setup() { set -e if [[ -n $DEBUG ]]; then set -x fi trap teardown EXIT HUP INT TERM if [[ ! -e /mnt/1gb.swap ]]; then sudo fallocate -l 1G /mnt/1gb.swap > /dev/null sudo mkswap /mnt/1gb.swap > /dev/null fi if ! swapon --show=NAME | grep -q "/mnt/1gb.swap"; then sudo swapon /mnt/1gb.swap fi if [[ ! -e /cgroup/cgroup.controllers ]]; then sudo mount -t cgroup2 none /cgroup fi grep -q memory /cgroup/cgroup.controllers sudo sh -c "echo '+memory' > /cgroup/cgroup.subtree_control" sudo mkdir /cgroup/A && sudo chown $USER:$USER /cgroup/A sudo sh -c "echo '+memory' > /cgroup/A/cgroup.subtree_control" sudo sh -c "echo '96m' > /cgroup/A/memory.high" mkdir /cgroup/A/0 mkdir /cgroup/A/1 echo 64m > /cgroup/A/0/memory.low } teardown() { set +e trap - EXIT HUP INT TERM if [[ -z $1 ]]; then printf "\n" printf "%0.s*" {1..35} printf "\nFAILED!\n\n" tail /cgroup/A/**/memory.current printf "%0.s*" {1..35} printf "\n\n" fi ps | grep stress | tr -s ' ' | cut -f 2 -d ' ' | xargs -I % kill % sleep 2 if [[ -e /cgroup/A/0 ]]; then rmdir /cgroup/A/0 fi if [[ -e /cgroup/A/1 ]]; then rmdir /cgroup/A/1 fi if [[ -e /cgroup/A ]]; then sudo rmdir /cgroup/A fi } stress_test() { sudo sh -c "echo $$ > /cgroup/A/$1/cgroup.procs" stress --vm 1 --vm-bytes 64M --vm-keep > /dev/null & sudo sh -c "echo $$ > /cgroup/A/$2/cgroup.procs" stress --vm 1 --vm-bytes 64M --vm-keep > /dev/null & sudo sh -c "echo $$ > /cgroup/cgroup.procs" sleep 1 # A/0 should be consuming more memory than A/1 [[ $(cat /cgroup/A/0/memory.current) -ge $(cat /cgroup/A/1/memory.current) ]] # A/0 should be consuming ~64mb [[ $(cat /cgroup/A/0/memory.current) -ge $x62mb ]] && [[ $(cat /cgroup/A/0/memory.current) -le $x66mb ]] # A should cumulatively be consuming ~96mb [[ $(cat /cgroup/A/memory.current) -ge $x94mb ]] && [[ $(cat /cgroup/A/memory.current) -le $x98mb ]] # Stop the stressors ps | grep stress | tr -s ' ' | cut -f 2 -d ' ' | xargs -I % kill % } teardown 1 setup for ((i=1;i<=$1;i++)); do printf "ITERATION $i of $1 - stress_test 0 1" stress_test 0 1 printf "\x1b[2K\r" printf "ITERATION $i of $1 - stress_test 1 0" stress_test 1 0 printf "\x1b[2K\r" printf "ITERATION $i of $1 - PASSED\n" done teardown 1 echo PASSED! 20:11:26@sjchrist-vm ? ~ $ memcg_low_test 10 Link: http://lkml.kernel.org/r/1496434412-21005-1-git-send-email-sean.j.christopherson@intel.com Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com> Acked-by: Balbir Singh <bsingharora@gmail.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10mm: make PR_SET_THP_DISABLE immediately activeMichal Hocko
PR_SET_THP_DISABLE has a rather subtle semantic. It doesn't affect any existing mapping because it only updated mm->def_flags which is a template for new mappings. The mappings created after prctl(PR_SET_THP_DISABLE) have VM_NOHUGEPAGE flag set. This can be quite surprising for all those applications which do not do prctl(); fork() & exec() and want to control their own THP behavior. Another usecase when the immediate semantic of the prctl might be useful is a combination of pre- and post-copy migration of containers with CRIU. In this case CRIU populates a part of a memory region with data that was saved during the pre-copy stage. Afterwards, the region is registered with userfaultfd and CRIU expects to get page faults for the parts of the region that were not yet populated. However, khugepaged collapses the pages and the expected page faults do not occur. In more general case, the prctl(PR_SET_THP_DISABLE) could be used as a temporary mechanism for enabling/disabling THP process wide. Implementation wise, a new MMF_DISABLE_THP flag is added. This flag is tested when decision whether to use huge pages is taken either during page fault of at the time of THP collapse. It should be noted, that the new implementation makes PR_SET_THP_DISABLE master override to any per-VMA setting, which was not the case previously. Fixes: a0715cc22601 ("mm, thp: add VM_INIT_DEF_MASK and PRCTL_THP_DISABLE") Link: http://lkml.kernel.org/r/1496415802-30944-1-git-send-email-rppt@linux.vnet.ibm.com Signed-off-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Pavel Emelyanov <xemul@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10mm, vmpressure: pass-through notification supportDavid Rientjes
By default, vmpressure events are not pass-through, i.e. they propagate up through the memcg hierarchy until an event notifier is found for any threshold level. This presents a difficulty when a thread waiting on a read(2) for a vmpressure event cannot distinguish between local memory pressure and memory pressure in a descendant memcg, especially when that thread may not control the memcg hierarchy. Consider a user-controlled child memcg with a smaller limit than a top-level memcg controlled by the "Activity Manager" specified in Documentation/cgroup-v1/memory.txt. It may register for memory pressure notification for descendant memcgs to make a policy decision: oom kill a low priority job, increase the limit, decrease other limits, etc. If it registers for memory pressure notification on the top-level memcg, it currently cannot distinguish between memory pressure in its own memcg or a descendant memcg, which is user-controlled. Conversely, if a user registers for memory pressure notification on their own descendant memcg, the Activity Manager does not receive any pressure notification for that child memcg hierarchy. Vmpressure events are not received for ancestor memcgs if the memcg experiencing pressure have notifiers registered, perhaps outside the knowledge of the thread waiting on read(2) at the top level. Both of these are consequences of vmpressure notification not being pass-through. This implements a pass-through behavior for vmpressure events. When writing to control.event_control, vmpressure event handlers may optionally specify a mode. There are two new modes: - "hierarchy": always propagate memory pressure events up the hierarchy regardless if descendant memcgs have their own notifiers registered, and - "local": only receive notifications when the memcg for which the event is registered experiences memory pressure. Of course, processes may register for one notification of "low,local", for example, and another for "low". If no mode is specified, the current behavior is maintained for backwards compatibility. See the change to Documentation/cgroup-v1/memory.txt for full specification. [dan.carpenter@oracle.com: free the same pointer we allocated] Link: http://lkml.kernel.org/r/20170613191820.GA20003@elgon.mountain Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1705311421320.8946@chino.kir.corp.google.com Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Minchan Kim <minchan@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Anton Vorontsov <anton@enomsg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10mm: hwpoison: introduce idenfity_page_stateNaoya Horiguchi
Factoring duplicate code into a function. Link: http://lkml.kernel.org/r/1496305019-5493-10-git-send-email-n-horiguchi@ah.jp.nec.com Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10mm: hugetlb: delete dequeue_hwpoisoned_huge_page()Naoya Horiguchi
dequeue_hwpoisoned_huge_page() is no longer used, so let's remove it. Link: http://lkml.kernel.org/r/1496305019-5493-9-git-send-email-n-horiguchi@ah.jp.nec.com Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10mm: hwpoison: dissolve in-use hugepage in unrecoverable memory errorNaoya Horiguchi
Currently me_huge_page() relies on dequeue_hwpoisoned_huge_page() to keep the error hugepage away from the system, which is OK but not good enough because the hugepage still has a refcount and unpoison doesn't work on the error hugepage (PageHWPoison flags are cleared but pages are still leaked.) And there's "wasting health subpages" issue too. This patch reworks on me_huge_page() to solve these issues. For hugetlb file, recently we have truncating code so let's use it in hugetlbfs specific ->error_remove_page(). For anonymous hugepage, it's helpful to dissolve the error page after freeing it into free hugepage list. Migration entry and PageHWPoison in the head page prevent the access to it. TODO: dissolve_free_huge_page() can fail but we don't considered it yet. It's not critical (and at least no worse that now) because in such case the error hugepage just stays in free hugepage list without being dissolved. By virtue of PageHWPoison in head page, it's never allocated to processes. [akpm@linux-foundation.org: fix unused var warnings] Fixes: 23a003bfd23ea9ea0b7756b920e51f64b284b468 ("mm/madvise: pass return code of memory_failure() to userspace") Link: http://lkml.kernel.org/r/20170417055948.GM31394@yexl-desktop Link: http://lkml.kernel.org/r/1496305019-5493-8-git-send-email-n-horiguchi@ah.jp.nec.com Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10mm: hwpoison: introduce memory_failure_hugetlb()Naoya Horiguchi
memory_failure() is a big function and hard to maintain. Handling hugetlb- and non-hugetlb- case in a single function is not good, so this patch separates PageHuge() branch into a new function, which saves many PageHuge() check. Link: http://lkml.kernel.org/r/1496305019-5493-7-git-send-email-n-horiguchi@ah.jp.nec.com Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10mm: soft-offline: dissolve free hugepage if soft-offlinedNaoya Horiguchi
Now we have code to rescue most of healthy pages from a hwpoisoned hugepage. So let's apply it to soft_offline_free_page too. Link: http://lkml.kernel.org/r/1496305019-5493-6-git-send-email-n-horiguchi@ah.jp.nec.com Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10mm: hugetlb: soft-offline: dissolve source hugepage after successful migrationAnshuman Khandual
Currently hugepage migrated by soft-offline (i.e. due to correctable memory errors) is contained as a hugepage, which means many non-error pages in it are unreusable, i.e. wasted. This patch solves this issue by dissolving source hugepages into buddy. As done in previous patch, PageHWPoison is set only on a head page of the error hugepage. Then in dissoliving we move the PageHWPoison flag to the raw error page so that all healthy subpages return back to buddy. [arnd@arndb.de: fix warnings: replace some macros with inline functions] Link: http://lkml.kernel.org/r/20170609102544.2947326-1-arnd@arndb.de Link: http://lkml.kernel.org/r/1496305019-5493-5-git-send-email-n-horiguchi@ah.jp.nec.com Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10mm: hwpoison: change PageHWPoison behavior on hugetlb pagesNaoya Horiguchi
We'd like to narrow down the error region in memory error on hugetlb pages. However, currently we set PageHWPoison flags on all subpages in the error hugepage and add # of subpages to num_hwpoison_pages, which doesn't fit our purpose. So this patch changes the behavior and we only set PageHWPoison on the head page then increase num_hwpoison_pages only by 1. This is a preparation for narrow-down part which comes in later patches. Link: http://lkml.kernel.org/r/1496305019-5493-4-git-send-email-n-horiguchi@ah.jp.nec.com Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10mm: hugetlb: return immediately for hugetlb page in __delete_from_page_cache()Naoya Horiguchi
We avoid calling __mod_node_page_state(NR_FILE_PAGES) for hugetlb page now, but it's not enough because later code doesn't handle hugetlb properly. Actually in our testing, WARN_ON_ONCE(PageDirty(page)) at the end of this function fires for hugetlb, which makes no sense. So we should return immediately for hugetlb pages. Link: http://lkml.kernel.org/r/1496305019-5493-3-git-send-email-n-horiguchi@ah.jp.nec.com Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10mm: hugetlb: prevent reuse of hwpoisoned free hugepagesNaoya Horiguchi
Patch series "mm: hwpoison: fixlet for hugetlb migration". This patchset updates the hwpoison/hugetlb code to address 2 reported issues. One is madvise(MADV_HWPOISON) failure reported by Intel's lkp robot (see http://lkml.kernel.org/r/20170417055948.GM31394@yexl-desktop.) First half was already fixed in mainline, and another half about hugetlb cases are solved in this series. Another issue is "narrow-down error affected region into a single 4kB page instead of a whole hugetlb page" issue, which was tried by Anshuman (http://lkml.kernel.org/r/20170420110627.12307-1-khandual@linux.vnet.ibm.com) and I updated it to apply it more widely. This patch (of 9): We no longer use MIGRATE_ISOLATE to prevent reuse of hwpoison hugepages as we did before. So current dequeue_huge_page_node() doesn't work as intended because it still uses is_migrate_isolate_page() for this check. This patch fixes it with PageHWPoison flag. Link: http://lkml.kernel.org/r/1496305019-5493-2-git-send-email-n-horiguchi@ah.jp.nec.com Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10fs/buffer.c: make bh_lru_install() more efficientEric Biggers
To install a buffer_head into the cpu's LRU queue, bh_lru_install() would construct a new copy of the queue and then memcpy it over the real queue. But it's easily possible to do the update in-place, which is faster and simpler. Some work can also be skipped if the buffer_head was already in the queue. As a microbenchmark I timed how long it takes to run sb_getblk() 10,000,000 times alternating between BH_LRU_SIZE + 1 blocks. Effectively, this benchmarks looking up buffer_heads that are in the page cache but not in the LRU: Before this patch: 1.758s After this patch: 1.653s This patch also removes about 350 bytes of compiled code (on x86_64), partly due to removal of the memcpy() which was being inlined+unrolled. Link: http://lkml.kernel.org/r/20161229193445.1913-1-ebiggers3@gmail.com Signed-off-by: Eric Biggers <ebiggers@google.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10mm/zsmalloc.c: fix -Wunneeded-internal-declaration warningNick Desaulniers
is_first_page() is only called from the macro VM_BUG_ON_PAGE() which is only compiled in as a runtime check when CONFIG_DEBUG_VM is set, otherwise is checked at compile time and not actually compiled in. Fixes the following warning, found with Clang: mm/zsmalloc.c:472:12: warning: function 'is_first_page' is not needed and will not be emitted [-Wunneeded-internal-declaration] static int is_first_page(struct page *page) ^ Link: http://lkml.kernel.org/r/20170524053859.29059-1-nick.desaulniers@gmail.com Signed-off-by: Nick Desaulniers <nick.desaulniers@gmail.com> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10mm/memory_hotplug.c: add NULL check to avoid potential NULL pointer dereferenceGustavo A. R. Silva
The NULL check at line 1226: if (!pgdat), implies that pointer pgdat might be NULL. rollback_node_hotadd() dereferences this pointer. Add NULL check to avoid a potential NULL pointer dereference. Addresses-Coverity-ID: 1369133 Link: http://lkml.kernel.org/r/20170530212436.GA6195@embeddedgus Signed-off-by: Gustavo A. R. Silva <garsilva@embeddedor.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10mm, vmscan: avoid thrashing anon lru when free + file is lowDavid Rientjes
The purpose of the code that commit 623762517e23 ("revert 'mm: vmscan: do not swap anon pages just because free+file is low'") reintroduces is to prefer swapping anonymous memory rather than trashing the file lru. If the anonymous inactive lru for the set of eligible zones is considered low, however, or the length of the list for the given reclaim priority does not allow for effective anonymous-only reclaiming, then avoid forcing SCAN_ANON. Forcing SCAN_ANON will end up thrashing the small list and leave unreclaimed memory on the file lrus. If the inactive list is insufficient, fallback to balanced reclaim so the file lru doesn't remain untouched. [akpm@linux-foundation.org: fix build] Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1705011432220.137835@chino.kir.corp.google.com Signed-off-by: David Rientjes <rientjes@google.com> Suggested-by: Minchan Kim <minchan@kernel.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10mm/memory.c: convert to DEFINE_DEBUGFS_ATTRIBUTEYevgen Pronenko
The preferred strategy to define debugfs attributes is to use the DEFINE_DEBUGFS_ATTRIBUTE() macro and to use debugfs_create_file_unsafe(). Link: http://lkml.kernel.org/r/20170528145948.32127-1-y.pronenko@gmail.com Signed-off-by: Yevgen Pronenko <y.pronenko@gmail.com> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10mm, page_alloc: fallback to smallest page when not stealing whole pageblockVlastimil Babka
Since commit 3bc48f96cf11 ("mm, page_alloc: split smallest stolen page in fallback") we pick the smallest (but sufficient) page of all that have been stolen from a pageblock of different migratetype. However, there are cases when we decide not to steal the whole pageblock. Practically in the current implementation it means that we are trying to fallback for a MIGRATE_MOVABLE allocation of order X, go through the freelists from MAX_ORDER-1 down to X, and find free page of order Y. If Y is less than pageblock_order / 2, we decide not to steal all pages from the pageblock. When Y > X, it means we are potentially splitting a larger page than we need, as there might be other pages of order Z, where X <= Z < Y. Since Y is already too small to steal whole pageblock, picking smallest available Z will result in the same decision and we avoid splitting a higher-order page in a MIGRATE_UNMOVABLE or MIGRATE_RECLAIMABLE pageblock. This patch therefore changes the fallback algorithm so that in the situation described above, we switch the fallback search strategy to go from order X upwards to find the smallest suitable fallback. In theory there shouldn't be a downside of this change wrt fragmentation. This has been tested with mmtests' stress-highalloc performing GFP_KERNEL order-4 allocations, here is the relevant extfrag tracepoint statistics: 4.12.0-rc2 4.12.0-rc2 1-kernel4 2-kernel4 Page alloc extfrag event 25640976 69680977 Extfrag fragmenting 25621086 69661364 Extfrag fragmenting for unmovable 74409 73204 Extfrag fragmenting unmovable placed with movable 69003 67684 Extfrag fragmenting unmovable placed with reclaim. 5406 5520 Extfrag fragmenting for reclaimable 6398 8467 Extfrag fragmenting reclaimable placed with movable 869 884 Extfrag fragmenting reclaimable placed with unmov. 5529 7583 Extfrag fragmenting for movable 25540279 69579693 Since we force movable allocations to steal the smallest available page (which we then practially always split), we steal less per fallback, so the number of fallbacks increases and steals potentially happen from different pageblocks. This is however not an issue for movable pages that can be compacted. Importantly, the "unmovable placed with movable" statistics is lower, which is the result of less fragmentation in the unmovable pageblocks. The effect on reclaimable allocation is a bit unclear. Link: http://lkml.kernel.org/r/20170529093947.22618-1-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10swap: add block io poll in swapin pathShaohua Li
For fast flash disk, async IO could introduce overhead because of context switch. block-mq now supports IO poll, which improves performance and latency a lot. swapin is a good place to use this technique, because the task is waiting for the swapin page to continue execution. In my virtual machine, directly read 4k data from a NVMe with iopoll is about 60% better than that without poll. With iopoll support in swapin patch, my microbenchmark (a task does random memory write) is about 10%~25% faster. CPU utilization increases a lot though, 2x and even 3x CPU utilization. This will depend on disk speed. While iopoll in swapin isn't intended for all usage cases, it's a win for latency sensistive workloads with high speed swap disk. block layer has knob to control poll in runtime. If poll isn't enabled in block layer, there should be no noticeable change in swapin. I got a chance to run the same test in a NVMe with DRAM as the media. In simple fio IO test, blkpoll boosts 50% performance in single thread test and ~20% in 8 threads test. So this is the base line. In above swap test, blkpoll boosts ~27% performance in single thread test. blkpoll uses 2x CPU time though. If we enable hybid polling, the performance gain has very slight drop but CPU time is only 50% worse than that without blkpoll. Also we can adjust parameter of hybid poll, with it, the CPU time penality is reduced further. In 8 threads test, blkpoll doesn't help though. The performance is similar to that without blkpoll, but cpu utilization is similar too. There is lock contention in swap path. The cpu time spending on blkpoll isn't high. So overall, blkpoll swapin isn't worse than that without it. The swapin readahead might read several pages in in the same time and form a big IO request. Since the IO will take longer time, it doesn't make sense to do poll, so the patch only does iopoll for single page swapin. [akpm@linux-foundation.org: coding-style fixes] Link: http://lkml.kernel.org/r/070c3c3e40b711e7b1390002c991e86a-b5408f0@7511894063d3764ff01ea8111f5a004d7dd700ed078797c204a24e620ddb965c Signed-off-by: Shaohua Li <shli@fb.com> Cc: Tim Chen <tim.c.chen@intel.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Jens Axboe <axboe@fb.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10Merge tag 'for-linus-4.13-v2' of git://github.com/cminyard/linux-ipmiLinus Torvalds
Pull IPMI updates from Corey Minyard: "Some small fixes for IPMI, and one medium sized changed. The medium sized change is adding a platform device for IPMI entries in the DMI table. Otherwise there is no auto loading for IPMI devices if they are only in the DMI table" * tag 'for-linus-4.13-v2' of git://github.com/cminyard/linux-ipmi: ipmi:ssif: Add missing unlock in error branch char: ipmi: constify bmc_dev_attr_group and bmc_device_type ipmi:ssif: Check dev before setting drvdata ipmi: Convert DMI handling over to a platform device ipmi: Create a platform device for a DMI-specified IPMI interface ipmi: use rcu lock around call to intf->handlers->sender() ipmi:ssif: Use i2c_adapter_id instead of adapter->nr ipmi: Use the proper default value for register size in ACPI ipmi_ssif: remove redundant null check on array client->adapter->name ipmi/watchdog: fix watchdog timeout set on reboot ipmi_ssif: unlock on allocation failure
2017-07-10Merge tag 'xfs-4.13-merge-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds
Pull XFS updates from Darrick Wong: "Here are some changes for you for 4.13. For the most part it's fixes for bugs and deadlock problems, and preparation for online fsck in some future merge window. - Avoid quotacheck deadlocks - Fix transaction overflows when bunmapping fragmented files - Refactor directory readahead - Allow admin to configure if ASSERT is fatal - Improve transaction usage detail logging during overflows - Minor cleanups - Don't leak log items when the log shuts down - Remove double-underscore typedefs - Various preparation for online scrubbing - Introduce new error injection configuration sysfs knobs - Refactor dq_get_next to use extent map directly - Fix problems with iterating the page cache for unwritten data - Implement SEEK_{HOLE,DATA} via iomap - Refactor XFS to use iomap SEEK_HOLE and SEEK_DATA - Don't use MAXPATHLEN to check on-disk symlink target lengths" * tag 'xfs-4.13-merge-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (48 commits) xfs: don't crash on unexpected holes in dir/attr btrees xfs: rename MAXPATHLEN to XFS_SYMLINK_MAXLEN xfs: fix contiguous dquot chunk iteration livelock xfs: Switch to iomap for SEEK_HOLE / SEEK_DATA vfs: Add iomap_seek_hole and iomap_seek_data helpers vfs: Add page_cache_seek_hole_data helper xfs: remove a whitespace-only line from xfs_fs_get_nextdqblk xfs: rewrite xfs_dq_get_next_id using xfs_iext_lookup_extent xfs: Check for m_errortag initialization in xfs_errortag_test xfs: grab dquots without taking the ilock xfs: fix semicolon.cocci warnings xfs: Don't clear SGID when inheriting ACLs xfs: free cowblocks and retry on buffered write ENOSPC xfs: replace log_badcrc_factor knob with error injection tag xfs: convert drop_writes to use the errortag mechanism xfs: remove unneeded parameter from XFS_TEST_ERROR xfs: expose errortag knobs via sysfs xfs: make errortag a per-mountpoint structure xfs: free uncommitted transactions during log recovery xfs: don't allow bmap on rt files ...
2017-07-10Merge branch 'nowait-aio-btrfs-fixup' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fix from David Sterba: "This fixes a user-visible bug introduced by the nowait-aio patches merged in this cycle" * 'nowait-aio-btrfs-fixup' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: nowait aio: Correct assignment of pos
2017-07-10Merge branch 'fix-uio' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds
Pull copy*_iter fix from Al Viro. [ Al used entirely the wrong return value. Oopsie. ] * 'fix-uio' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: fix brown paperbag bug in inlined copy_..._iter()
2017-07-10Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid Pull HID updates from Jiri Kosina: - open/close tracking improvements from Dmitry Torokhov - battery support improvements in Wacom driver from Jason Gerecke - Win8 support fixes from Benjamin Tissories and Hans de Geode - misc fixes to Intel-ISH driver from Arnd Bergmann - support for quite a few new devices and small assorted fixes here and there * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid: (35 commits) HID: intel-ish-hid: Enable Gemini Lake ish driver HID: intel-ish-hid: Enable Cannon Lake ish driver HID: wacom: fix mistake in printk HID: multitouch: optimize the sticky fingers timer HID: multitouch: fix rare Win 8 cases when the touch up event gets missing HID: multitouch: use BIT macro HID: Add driver for Retrode2 joypad adapter HID: multitouch: Add support for Google Rose Touchpad HID: multitouch: Support PTP Stick and Touchpad device HID: core: don't use negative operands when shift HID: apple: Use country code to detect ISO keyboards HID: remove no longer used hid->open field greybus: hid: remove custom locking from gb_hid_open/close HID: usbhid: remove custom locking from usbhid_open/close HID: i2c-hid: remove custom locking from i2c_hid_open/close HID: serialize hid_hw_open and hid_hw_close HID: usbhid: do not rely on hid->open when deciding to do IO HID: hiddev: use hid_hw_power instead of usbhid_get/put_power HID: hiddev: use hid_hw_open/close instead of usbhid_open/close HID: asus: Add support for Zen AiO MD-5110 keyboard ...
2017-07-10btrfs: nowait aio: Correct assignment of posGoldwyn Rodrigues
Assigning pos for usage early messes up in append mode, where the pos is re-assigned in generic_write_checks(). Assign pos later to get the correct position to write from iocb->ki_pos. Since check_can_nocow also uses the value of pos, we shift generic_write_checks() before check_can_nocow(). Checks with IOCB_DIRECT are present in generic_write_checks(), so checking for IOCB_NOWAIT is enough. Also, put locking sequence in the fast path. This fixes a user visible bug, as reported: "apparently breaks several shell related features on my system. In zsh history stopped working, because no new entries are added anymore. I fist noticed the issue when I tried to build mplayer. It uses a shell script to generate a help_mp.h file: [...] Here is a simple testcase: % echo "foo" >> test % echo "foo" >> test % cat test foo % " Fixes: edf064e7c6fe ("btrfs: nowait aio support") CC: Jens Axboe <axboe@kernel.dk> Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de> Link: https://lkml.kernel.org/r/20170704042306.GA274@x4 Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-07-10fix brown paperbag bug in inlined copy_..._iter()Al Viro
"copied nothing" == "return 0", not "return full size". Fixes: aa28de275a24 "iov_iter/hardening: move object size checks to inlined part" Spotted-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-07-10Merge branches 'for-4.13/multitouch', 'for-4.13/retrode', ↵Jiri Kosina
'for-4.13/transport-open-close-consolidation', 'for-4.13/upstream' and 'for-4.13/wacom' into for-linus
2017-07-10Merge branches 'for-4.13/ish' and 'for-4.13/ite' into for-linusJiri Kosina
Conflicts: drivers/hid/hid-core.c
2017-07-10Merge branches 'for-4.13/apple' and 'for-4.13/asus' into for-linusJiri Kosina
Conflicts: drivers/hid/hid-core.c
2017-07-09Merge tag 'drm-for-v4.13' of git://people.freedesktop.org/~airlied/linuxLinus Torvalds
Pull drm updates from Dave Airlie: "This is the main pull request for the drm, I think I've got one later driver pull for mediatek SoC driver, I'm undecided on if it needs to go to you yet. Otherwise summary below: Core drm: - Atomic add driver private objects - Deprecate preclose hook in modern drivers - MST bandwidth tracking - Use kvmalloc in more places - Add mode_valid hook for crtc/encoder/bridge - Reduce sync_file construction time - Documentation updates - New DRM synchronisation object support New drivers: - pl111 - pl111 CLCD display controller Panel: - Innolux P079ZCA panel driver - Add NL12880B20-05, NL192108AC18-02D, P320HVN03 panels - panel-samsung-s6e3ha2: Add s6e3hf2 panel support i915: - SKL+ watermark fixes - G4x/G33 reset improvements - DP AUX backlight improvements - Buffer based GuC/host communication - New getparam for (sub)slice infomation - Cannonlake and Coffeelake initial patches - Execbuf optimisations radeon/amdgpu: - Lots of Vega10 bug fixes - Preliminary raven support - KIQ support for compute rings - MEC queue management rework - DCE6 Audio support - SR-IOV improvements - Better radeon/amdgpu selection support nouveau: - HDMI stereoscopic support - Display code rework for >= GM20x GPUs msm: - GEM rework for fine-grained locking - Per-process pagetable work - HDMI fixes for Snapdragon 820. vc4: - Remove 256MB CMA limit from vc4 - Add out-fence support - Add support for cygnus - Get/set tiling ioctls support - Add T-format tiling support for scanout zte: - add VGA support. etnaviv: - Thermal throttle support for newer GPUs - Restore userspace buffer cache performance - dma-buf sync fix stm: - add stm32f429 display support exynos: - Rework vblank handling - Fixup sw-trigger code sun4i: - V3s display engine support - HDMI support for older SoCs - Preliminary work on dual-pipeline SoCs. rcar-du: - VSP work imx-drm: - Remove counter load enable from PRE - Double read/write reduction flag support tegra: - Documentation for the host1x and drm driver. - Lots of staging ioctl fixes due to grate project work. omapdrm: - dma-buf fence support - TILER rotation fixes" * tag 'drm-for-v4.13' of git://people.freedesktop.org/~airlied/linux: (1270 commits) drm: Remove unused drm_file parameter to drm_syncobj_replace_fence() drm/amd/powerplay: fix bug fail to remove sysfs when rmmod amdgpu. amdgpu: Set cik/si_support to 1 by default if radeon isn't built drm/amdgpu/gfx9: fix driver reload with KIQ drm/amdgpu/gfx8: fix driver reload with KIQ drm/amdgpu: Don't call amd_powerplay_destroy() if we don't have powerplay drm/ttm: Fix use-after-free in ttm_bo_clean_mm drm/amd/amdgpu: move get memory type function from early init to sw init drm/amdgpu/cgs: always set reference clock in mode_info drm/amdgpu: fix vblank_time when displays are off drm/amd/powerplay: power value format change for Vega10 drm/amdgpu/gfx9: support the amdgpu.disable_cu option drm/amd/powerplay: change PPSMC_MSG_GetCurrPkgPwr for Vega10 drm/amdgpu: Make amdgpu_cs_parser_init static (v2) drm/amdgpu/cs: fix a typo in a comment drm/amdgpu: Fix the exported always on CU bitmap drm/amdgpu/gfx9: gfx_v9_0_enable_gfx_static_mg_power_gating() can be static drm/amdgpu/psp: upper_32_bits/lower_32_bits for address setup drm/amd/powerplay/cz: print message if smc message fails drm/amdgpu: fix typo in amdgpu_debugfs_test_ib_init ...
2017-07-09afs: Add metadata xattrsDavid Howells
Add xattrs to allow the user to get/set metadata in lieu of having pioctl() available. The following xattrs are now available: - "afs.cell" The name of the cell in which the vnode's volume resides. - "afs.fid" The volume ID, vnode ID and vnode uniquifier of the file as three hex numbers separated by colons. - "afs.volume" The name of the volume in which the vnode resides. For example: # getfattr -d -m ".*" /mnt/scratch getfattr: Removing leading '/' from absolute path names # file: mnt/scratch afs.cell="mycell.myorg.org" afs.fid="10000b:1:1" afs.volume="scratch" Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-09afs: Ignore AFS_ACE_READ and AFS_ACE_WRITE for directoriesMarc Dionne
The AFS_ACE_READ and AFS_ACE_WRITE permission bits should not be used to make access decisions for the directory itself. They are meant to control access for the objects contained in that directory. Reading a directory is allowed if the AFS_ACE_LOOKUP bit is set. This would cause an incorrect access denied error for a directory with AFS_ACE_LOOKUP but not AFS_ACE_READ. The AFS_ACE_WRITE bit does not allow operations that modify the directory. For a directory with AFS_ACE_WRITE but neither AFS_ACE_INSERT nor AFS_ACE_DELETE, this would result in trying operations that would ultimately be denied by the server. Signed-off-by: Marc Dionne <marc.dionne@auristor.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-09mqueue: fix a use-after-free in sys_mq_notify()Cong Wang
The retry logic for netlink_attachskb() inside sys_mq_notify() is nasty and vulnerable: 1) The sock refcnt is already released when retry is needed 2) The fd is controllable by user-space because we already release the file refcnt so we when retry but the fd has been just closed by user-space during this small window, we end up calling netlink_detachskb() on the error path which releases the sock again, later when the user-space closes this socket a use-after-free could be triggered. Setting 'sock' to NULL here should be sufficient to fix it. Reported-by: GeneBlue <geneblue.mail@gmail.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-09Merge branch 'x86-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Thomas Gleixner: "The x86 updates contain: - A fix for a longstanding PAT bug, where PAT was reported on CPUs that do not support it, which leads to wrong caching attributes and missing MTRR updates - Prevent overwriting of the e820 firmware table, which causes kexec kernels to lose the fake mptable which is stored there. - Cleanup of the UV/BAU code, removing unused code and making local functions static" * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/boot/e820: Introduce the bootloader provided e820_table_firmware[] table x86/boot/e820: Rename the e820_table_firmware to e820_table_kexec x86/boot/e820: Avoid overwriting e820_table_firmware x86/mm/pat: Don't report PAT on CPUs that don't support it x86/platform/uv/BAU: Minor cleanup, make some local functions static
2017-07-09Merge branch 'timers-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timers fixlet from Thomas Gleixner: "Add Frederic Weisbecker as NOHZ/dyntick maintainer" [ And an unmentioned and unrelated typo fix in the same commit? Hmm.. ] * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: MAINTAINERS: Add Frederic Weisbecker as nohz/dyntics maintainer