path: root/mm
Age  Commit message  Author
2017-07-10  mm/hugetlb.c: replace memfmt with string_get_size  (Matthew Wilcox)
The hugetlb code has its own function to report human-readable sizes. Convert it to use the shared string_get_size() function. This will lead to a minor difference in user visible output (MiB/GiB instead of MB/GB), but some would argue that's desirable anyway. Link: http://lkml.kernel.org/r/20170606190350.GA20010@bombadil.infradead.org Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com> Cc: zhong jiang <zhongjiang@huawei.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
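For reference, a minimal kernel-style sketch of how the shared helper is called (the surrounding function and message wording are illustrative assumptions, not code from the patch):

    #include <linux/string_helpers.h>

    /* Sketch: format a hugepage size for a boot message. STRING_UNITS_2
     * selects binary units (KiB/MiB/GiB), which is where the MiB/GiB
     * change in user-visible output comes from. */
    static void __init report_hugepage_size(unsigned long hpage_bytes)
    {
            char buf[32];

            string_get_size(hpage_bytes, 1, STRING_UNITS_2, buf, sizeof(buf));
            pr_info("HugeTLB registered %s page size\n", buf);
    }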
2017-07-10  mm, memcg: fix potential undefined behavior in mem_cgroup_event_ratelimit()  (Michal Hocko)
Alice has reported the following UBSAN splat:

  UBSAN: Undefined behaviour in mm/memcontrol.c:661:17
  signed integer overflow:
  -2147483644 - 2147483525 cannot be represented in type 'long int'
  CPU: 1 PID: 11758 Comm: mybibtex2filena Tainted: P O 4.9.25-gentoo #4
  Hardware name: XXXXXX, BIOS YYYYYY
  Call Trace:
   dump_stack+0x59/0x87
   ubsan_epilogue+0xe/0x40
   handle_overflow+0xbb/0xf0
   __ubsan_handle_sub_overflow+0x12/0x20
   memcg_check_events.isra.36+0x223/0x360
   mem_cgroup_commit_charge+0x55/0x140
   wp_page_copy+0x34e/0xb80
   do_wp_page+0x1e6/0x1300
   handle_mm_fault+0x88b/0x1990
   __do_page_fault+0x2de/0x8a0
   do_page_fault+0x1a/0x20
   error_code+0x67/0x6c

The reason is that we subtract two signed types. Let's fix this by truly mimicking time_after and casting the result of the subtraction. Link: http://lkml.kernel.org/r/20170616150057.GQ30580@dhcp22.suse.cz Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Alice Ferrazzi <alicef@gentoo.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
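A self-contained model of the idiom the fix adopts, using the two values from the splat above (the helper name is illustrative):

    #include <stdio.h>

    /* time_after()-style comparison: subtract in unsigned arithmetic,
     * where wrap-around is well defined, and only then reinterpret the
     * distance as signed. Subtracting the raw signed values is UB. */
    static int event_due(unsigned long next, unsigned long usage)
    {
            return (long)(usage - next) >= 0;
    }

    int main(void)
    {
            unsigned long next = (unsigned long)-2147483644L;
            unsigned long usage = 2147483525UL;

            printf("event due: %d\n", event_due(next, usage));
            return 0;
    }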
2017-07-10  mm, hugetlb: schedule when potentially allocating many hugepages  (David Rientjes)
A few hugetlb allocators loop while calling the page allocator and can potentially prevent rescheduling if the page allocator slowpath is not utilized. Conditionally schedule when large numbers of hugepages can be allocated. Anshuman: "Fixes a task which was getting hung while writing like 10000 hugepages (16MB on POWER8) into /proc/sys/vm/nr_hugepages." Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1706091535300.66176@chino.kir.corp.google.com Signed-off-by: David Rientjes <rientjes@google.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Tested-by: Anshuman Khandual <khandual@linux.vnet.ibm.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
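The pattern looks roughly like this kernel-style sketch (not standalone, and the loop body is illustrative rather than lifted from the patch):

    /* Allocating a large number of hugepages, e.g. from a write to
     * /proc/sys/vm/nr_hugepages. */
    while (count--) {
            page = alloc_fresh_huge_page(h, nodes_allowed); /* assumed helper */
            if (!page)
                    break;
            /* Successful allocations may never enter the sleeping
             * slowpath, so yield explicitly to avoid hogging the CPU. */
            cond_resched();
    }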
2017-07-10  mm: unify new_node_page and alloc_migrate_target  (Michal Hocko)
Commit 394e31d2ceb4 ("mem-hotplug: alloc new page from a nearest neighbor node when mem-offline") has duplicated a large part of alloc_migrate_target with some hotplug specific special casing. To be more precise, it tried to enforce the allocation from a different node than the original page. As a result the two functions diverged in their shared logic, e.g. the hugetlb allocation strategy. Let's unify the two and express different NUMA requirements by the given nodemask. new_node_page will simply exclude the node it doesn't care about and alloc_migrate_target will use all the available nodes. alloc_migrate_target will then learn to migrate hugetlb pages more sanely and use the preallocated pool when possible. Please note that alloc_migrate_target used to call alloc_page resp. alloc_pages_current, and so obeyed the memory policy of the current context, which is quite strange when we consider that it is used in the context of alloc_contig_range, which just tries to migrate pages that stand in the way. Link: http://lkml.kernel.org/r/20170608074553.22152-4-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Xishi Qiu <qiuxishi@huawei.com> Cc: zhong jiang <zhongjiang@huawei.com> Cc: Joonsoo Kim <js1304@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10  hugetlb, memory_hotplug: prefer to use reserved pages for migration  (Michal Hocko)
new_node_page will try to use the origin's next NUMA node as the migration destination for hugetlb pages. If such a node doesn't have any preallocated pool it falls back to __alloc_buddy_huge_page_no_mpol to allocate a surplus page instead. This is quite suboptimal for any configuration where hugetlb pages are not distributed to all NUMA nodes evenly. Say we have a hotpluggable node 4 and the spare hugetlb pages are on node 0:

  /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages:10000
  /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages:0
  /sys/devices/system/node/node2/hugepages/hugepages-2048kB/nr_hugepages:0
  /sys/devices/system/node/node3/hugepages/hugepages-2048kB/nr_hugepages:0
  /sys/devices/system/node/node4/hugepages/hugepages-2048kB/nr_hugepages:10000
  /sys/devices/system/node/node5/hugepages/hugepages-2048kB/nr_hugepages:0
  /sys/devices/system/node/node6/hugepages/hugepages-2048kB/nr_hugepages:0
  /sys/devices/system/node/node7/hugepages/hugepages-2048kB/nr_hugepages:0

Now we consume the whole pool on node 4 and try to offline this node. All the allocated pages should be moved to node0, which has enough preallocated pages to hold them. With the current implementation offlining very likely fails because hugetlb allocations during runtime are much less reliable. Fix this by reusing the nodemask which excludes the migration source, try to find a node which has a page in the preallocated pool first, and fall back to __alloc_buddy_huge_page_no_mpol only when the whole pool is consumed. [akpm@linux-foundation.org: remove bogus arg from alloc_huge_page_nodemask() stub] Link: http://lkml.kernel.org/r/20170608074553.22152-3-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Xishi Qiu <qiuxishi@huawei.com> Cc: zhong jiang <zhongjiang@huawei.com> Cc: Joonsoo Kim <js1304@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10  mm, memory_hotplug: simplify empty node mask handling in new_node_page  (Michal Hocko)
new_node_page tries to allocate the target page on a different NUMA node than the source page. This makes sense in most cases during the hotplug because we are likely to offline the whole numa node. But there are cases where there are no other nodes to fallback (e.g. when offlining parts of the only existing node) and we have to fallback to allocating from the source node. The current code does that but it can be simplified by checking the nmask and updating it before we even try to allocate rather than special casing it. This patch shouldn't introduce any functional change. Link: http://lkml.kernel.org/r/20170608074553.22152-2-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Xishi Qiu <qiuxishi@huawei.com> Cc: zhong jiang <zhongjiang@huawei.com> Cc: Joonsoo Kim <js1304@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
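A small self-contained model of that simplification, with node masks reduced to a plain bitmask (names are illustrative):

    #include <stdio.h>

    #define ALL_NODES(n)    ((1UL << (n)) - 1)

    /* Drop the source node from the allowed mask up front; if nothing is
     * left, allow every node instead of special-casing the allocation. */
    static unsigned long pick_target_mask(unsigned long online, int src)
    {
            unsigned long nmask = online & ~(1UL << src);

            return nmask ? nmask : online;
    }

    int main(void)
    {
            /* Two online nodes: offlining node 1 targets node 0 only. */
            printf("mask=%#lx\n", pick_target_mask(ALL_NODES(2), 1));
            /* A single online node must fall back to itself. */
            printf("mask=%#lx\n", pick_target_mask(ALL_NODES(1), 0));
            return 0;
    }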
2017-07-10  mm, memory_hotplug: support movable_node for hotpluggable nodes  (Michal Hocko)
The movable_node kernel parameter makes hotpluggable NUMA nodes put all their hotpluggable memory into the movable zone, which allows more or less reliable memory hotremove. At least this is the case for the NUMA nodes present during boot (see find_zone_movable_pfns_for_nodes). This is not the case for memory hotplug, though. echo online > /sys/devices/system/memory/memoryXYZ/state will default to a kernel zone (usually ZONE_NORMAL) unless the particular memblock is already in the movable zone range, which is normally not the case when onlining the memory from the udev rule context for a freshly hotadded NUMA node. The only option currently is to have a special udev rule to echo online_movable to all memblocks belonging to such a node, which is rather clumsy. Not to mention this is inconsistent as well, because what ended up in the movable zone during boot will end up in a kernel zone after hotremove & hotadd without special care. It would be nice to reuse memblock_is_hotpluggable, but the runtime hotplug doesn't have that information available, because the boot and hotplug paths are not shared and it would be really non-trivial to make them use the same code path, since the runtime hotplug doesn't play with the memblock allocator at all. Teach move_pfn_range that MMOP_ONLINE_KEEP can use the movable zone if movable_node is enabled and the range doesn't overlap with the existing normal zone. This should provide a reasonable default onlining strategy. Strictly speaking the semantic is not identical with the boot time initialization, because find_zone_movable_pfns_for_nodes covers only the hotpluggable range as described by the BIOS/FW. From my experience this is usually a full node though (except for Node0, which is special and never goes away completely). If this turns out to be a problem in real life we can tweak the code to store a hotplug flag in memblocks, but let's keep this simple for now. Link: http://lkml.kernel.org/r/20170612111227.GI7476@dhcp22.suse.cz Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Reza Arbab <arbab@linux.vnet.ibm.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Yasuaki Ishimatsu <yasu.isimatu@gmail.com> Cc: <qiuxishi@huawei.com> Cc: Kani Toshimitsu <toshi.kani@hpe.com> Cc: <slaoub@gmail.com> Cc: Joonsoo Kim <js1304@gmail.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Daniel Kiper <daniel.kiper@oracle.com> Cc: Igor Mammedov <imammedo@redhat.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10  mm/migrate.c: stabilise page count when migrating transparent hugepages  (Will Deacon)
When migrating a transparent hugepage, migrate_misplaced_transhuge_page guards itself against a concurrent fastgup of the page by checking that the page count is equal to 2 before and after installing the new pmd. If the page count changes, then the pmd is reverted back to the original entry, however there is a small window where the new (possibly writable) pmd is installed and the underlying page could be written by userspace. Restoring the old pmd could therefore result in loss of data. This patch fixes the problem by freezing the page count whilst updating the page tables, which protects against a concurrent fastgup without the need to restore the old pmd in the failure case (since the page count can no longer change under our feet). Link: http://lkml.kernel.org/r/1497349722-6731-4-git-send-email-will.deacon@arm.com Signed-off-by: Will Deacon <will.deacon@arm.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Steve Capper <steve.capper@arm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
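A self-contained model of the freeze/unfreeze idea (the kernel uses page_ref_freeze()/page_ref_unfreeze() on page->_refcount; the userspace atomics below are a stand-in):

    #include <stdatomic.h>
    #include <stdio.h>

    /* Swap the count to 0 only if it still has the expected value, so a
     * concurrent fast-GUP reference makes the freeze fail instead of
     * racing with the page-table update. */
    static int ref_freeze(atomic_int *ref, int expected)
    {
            int old = expected;

            return atomic_compare_exchange_strong(ref, &old, 0);
    }

    static void ref_unfreeze(atomic_int *ref, int count)
    {
            atomic_store(ref, count);
    }

    int main(void)
    {
            atomic_int ref = 2;     /* the expected count from the commit */

            if (ref_freeze(&ref, 2)) {
                    /* page tables can be rewritten here; new refs fail */
                    ref_unfreeze(&ref, 2);
                    puts("count was stable; migration may proceed");
            }
            return 0;
    }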
2017-07-10  mm/hugetlb.c: warn the user when issues arise on boot due to hugepages  (Liam R. Howlett)
When the user specifies too many hugepages or an invalid default_hugepagesz the communication to the user is implicit in the allocation message. This patch adds a warning when the desired page count is not allocated and prints an error when the default_hugepagesz is invalid on boot. During boot hugepages will allocate until there is a fraction of the hugepage size left. That is, we allocate until either the request is satisfied or memory for the pages is exhausted. When memory for the pages is exhausted, it will most likely lead to the system failing with the OOM manager not finding enough (or anything) to kill (unless you're using really big hugepages in the order of 100s of MB or in the GBs). The user will most likely see the OOM messages much later in the boot sequence than the implicitly stated message. Worse yet, you may even get an OOM for each processor which causes many pages of OOMs on modern systems. Although these messages will be printed earlier than the OOM messages, at least giving the user errors and warnings will highlight the configuration as an issue. I'm trying to point the user in the right direction by providing a more robust statement of what is failing. During the sysctl or echo command, the user can check the results much easier than if the system hangs during boot and the scenario of having nothing to OOM for kernel memory is highly unlikely. Mike said: "Before sending out this patch, I asked Liam off list why he was doing it. Was it something he just thought would be useful? Or, was there some type of user situation/need. He said that he had been called in to assist on several occasions when a system OOMed during boot. In almost all of these situations, the user had grossly misconfigured huge pages. DB users want to pre-allocate just the right amount of huge pages, but sometimes they can be really off. In such situations, the huge page init code just allocates as many huge pages as it can and reports the number allocated. There is no indication that it quit allocating because it ran out of memory. Of course, a user could compare the number in the message to what they requested on the command line to determine if they got all the huge pages they requested. The thought was that it would be useful to at least flag this situation. That way, the user might be able to better relate the huge page allocation failure to the OOM. I'm not sure if the e-mail discussion made it obvious that this is something he has seen on several occasions. I see Michal's point that this will only flag the situation where someone configures huge pages very badly. And, a more extensive look at the situation of misconfiguring huge pages might be in order. But, this has happened on several occasions which led to the creation of this patch" [akpm@linux-foundation.org: reposition memfmt() to avoid forward declaration] Link: http://lkml.kernel.org/r/20170603005413.10380-1-Liam.Howlett@Oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com> Cc: zhongjiang <zhongjiang@huawei.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10  mm/cma.c: warn if the CMA area could not be activated  (Anshuman Khandual)
While activating a CMA area we check to make sure that all the PFNs in the range are inside the same zone. This is a requirement for alloc_contig_range() to work. Any CMA area failing the check is disabled for good. This happens silently right now making all future cma_alloc() allocations failure inevitable. Here we add an error message stating that the CMA area could not be activated which makes it easier to explain any future cma_alloc() failures on it. While in there, change the bail out goto label from 'err' to 'not_in_zone' which makes more sense. Link: http://lkml.kernel.org/r/20170605023729.26303-1-khandual@linux.vnet.ibm.com Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10  vmalloc: show lazy-purged vma info in vmallocinfo  (Yisheng Xie)
When ioremapping a 67112960-byte vm_area, with vmallocinfo showing:

  [..]
  0xec79b000-0xec7fa000  389120 ftl_add_mtd+0x4d0/0x754 pages=94 vmalloc
  0xec800000-0xecbe1000 4067328 kbox_proc_mem_write+0x104/0x1c4 phys=8b520000 ioremap

we get the result:

  0xf1000000-0xf5001000 67112960 devm_ioremap+0x38/0x7c phys=40000000 ioremap

because the alignment for ioremap must be less than '1 << IOREMAP_MAX_ORDER':

  if (flags & VM_IOREMAP)
          align = 1ul << clamp_t(int, get_count_order_long(size),
                                 PAGE_SHIFT, IOREMAP_MAX_ORDER);

This leaves an idiot like me a little puzzled about why the vm_area jumped from 0xec800000-0xecbe1000 to 0xf1000000-0xf5001000, leaving 0xed000000-0xf1000000 as a big hole. This patch shows all vm_areas, including vmas which are being freed but are still on the vmap_area_list, to make it clearer why we get 0xf1000000-0xf5001000 in the above case. With it we get a vmallocinfo like:

  [..]
  0xec79b000-0xec7fa000  389120 ftl_add_mtd+0x4d0/0x754 pages=94 vmalloc
  0xec800000-0xecbe1000 4067328 kbox_proc_mem_write+0x104/0x1c4 phys=8b520000 ioremap
  [..]
  0xece7c000-0xece7e000    8192 unpurged vm_area
  0xece7e000-0xece83000   20480 vm_map_ram
  0xf0099000-0xf00aa000   69632 vm_map_ram

Link: http://lkml.kernel.org/r/1496649682-20710-1-git-send-email-xieyisheng1@huawei.com Signed-off-by: Yisheng Xie <xieyisheng1@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: zijun_hu <zijun_hu@htc.com> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Hanjun Guo <guohanjun@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10  mm/memcontrol: exclude @root from checks in mem_cgroup_low  (Sean Christopherson)
Make @root exclusive in mem_cgroup_low; it is never considered low when looked at directly and is not checked when traversing the tree. In effect, @root is handled identically to how root_mem_cgroup was previously handled by mem_cgroup_low. If @root is not excluded from the checks, a cgroup underneath @root will never be considered low during targeted reclaim of @root, e.g. due to memory.current > memory.high, unless @root is misconfigured to have memory.low > memory.high. Excluding @root enables using memory.low to prioritize memory usage between cgroups within a subtree of the hierarchy that is limited by memory.high or memory.max, e.g. when ROOT owns @root's controls but delegates the @root directory to a USER so that USER can create and administer children of @root. For example, given cgroup A with children B and C:

        A
       / \
      B   C

and

  1. A/memory.current > A/memory.high
  2. A/B/memory.current < A/B/memory.low
  3. A/C/memory.current >= A/C/memory.low

As 'A' is high, i.e. triggers reclaim from 'A', and 'B' is low, we should reclaim from 'C' until 'A' is no longer high or until we can no longer reclaim from 'C'. If 'A', i.e. @root, isn't excluded by mem_cgroup_low when reclaiming from 'A', then 'B' won't be considered low and we will reclaim indiscriminately from both 'B' and 'C'. Here is the test I used to confirm the bug and the patch:

  20:00:55@sjchrist-vm ? ~ $ cat ~/.bin/memcg_low_test
  #!/bin/bash

  x62mb=$((62<<20))
  x66mb=$((66<<20))
  x94mb=$((94<<20))
  x98mb=$((98<<20))

  setup() {
      set -e

      if [[ -n $DEBUG ]]; then
          set -x
      fi

      trap teardown EXIT HUP INT TERM

      if [[ ! -e /mnt/1gb.swap ]]; then
          sudo fallocate -l 1G /mnt/1gb.swap > /dev/null
          sudo mkswap /mnt/1gb.swap > /dev/null
      fi
      if ! swapon --show=NAME | grep -q "/mnt/1gb.swap"; then
          sudo swapon /mnt/1gb.swap
      fi

      if [[ ! -e /cgroup/cgroup.controllers ]]; then
          sudo mount -t cgroup2 none /cgroup
      fi

      grep -q memory /cgroup/cgroup.controllers

      sudo sh -c "echo '+memory' > /cgroup/cgroup.subtree_control"

      sudo mkdir /cgroup/A && sudo chown $USER:$USER /cgroup/A
      sudo sh -c "echo '+memory' > /cgroup/A/cgroup.subtree_control"
      sudo sh -c "echo '96m' > /cgroup/A/memory.high"

      mkdir /cgroup/A/0
      mkdir /cgroup/A/1

      echo 64m > /cgroup/A/0/memory.low
  }

  teardown() {
      set +e

      trap - EXIT HUP INT TERM

      if [[ -z $1 ]]; then
          printf "\n"
          printf "%0.s*" {1..35}
          printf "\nFAILED!\n\n"
          tail /cgroup/A/**/memory.current
          printf "%0.s*" {1..35}
          printf "\n\n"
      fi

      ps | grep stress | tr -s ' ' | cut -f 2 -d ' ' | xargs -I % kill %

      sleep 2

      if [[ -e /cgroup/A/0 ]]; then
          rmdir /cgroup/A/0
      fi
      if [[ -e /cgroup/A/1 ]]; then
          rmdir /cgroup/A/1
      fi
      if [[ -e /cgroup/A ]]; then
          sudo rmdir /cgroup/A
      fi
  }

  stress_test() {
      sudo sh -c "echo $$ > /cgroup/A/$1/cgroup.procs"
      stress --vm 1 --vm-bytes 64M --vm-keep > /dev/null &

      sudo sh -c "echo $$ > /cgroup/A/$2/cgroup.procs"
      stress --vm 1 --vm-bytes 64M --vm-keep > /dev/null &

      sudo sh -c "echo $$ > /cgroup/cgroup.procs"

      sleep 1

      # A/0 should be consuming more memory than A/1
      [[ $(cat /cgroup/A/0/memory.current) -ge $(cat /cgroup/A/1/memory.current) ]]

      # A/0 should be consuming ~64mb
      [[ $(cat /cgroup/A/0/memory.current) -ge $x62mb ]] && [[ $(cat /cgroup/A/0/memory.current) -le $x66mb ]]

      # A should cumulatively be consuming ~96mb
      [[ $(cat /cgroup/A/memory.current) -ge $x94mb ]] && [[ $(cat /cgroup/A/memory.current) -le $x98mb ]]

      # Stop the stressors
      ps | grep stress | tr -s ' ' | cut -f 2 -d ' ' | xargs -I % kill %
  }

  teardown 1
  setup

  for ((i=1;i<=$1;i++)); do
      printf "ITERATION $i of $1 - stress_test 0 1"
      stress_test 0 1
      printf "\x1b[2K\r"

      printf "ITERATION $i of $1 - stress_test 1 0"
      stress_test 1 0
      printf "\x1b[2K\r"

      printf "ITERATION $i of $1 - PASSED\n"
  done

  teardown 1

  echo PASSED!

  20:11:26@sjchrist-vm ? ~ $ memcg_low_test 10

Link: http://lkml.kernel.org/r/1496434412-21005-1-git-send-email-sean.j.christopherson@intel.com Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com> Acked-by: Balbir Singh <bsingharora@gmail.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
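A hedged sketch of the check after this change, simplified from the memcg code of this era (treat it as illustrative, not the exact implementation):

    /* @root itself is never low, and the upward walk stops before
     * examining @root, so only the subtree below the reclaim target is
     * compared against memory.low. */
    static bool mem_cgroup_low_sketch(struct mem_cgroup *root,
                                      struct mem_cgroup *memcg)
    {
            if (memcg == root)
                    return false;

            for (; memcg != root; memcg = parent_mem_cgroup(memcg)) {
                    if (page_counter_read(&memcg->memory) >= memcg->low)
                            return false;
            }
            return true;
    }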
2017-07-10  mm: make PR_SET_THP_DISABLE immediately active  (Michal Hocko)
PR_SET_THP_DISABLE has a rather subtle semantic. It doesn't affect any existing mapping because it only updates mm->def_flags, which is a template for new mappings. The mappings created after prctl(PR_SET_THP_DISABLE) have the VM_NOHUGEPAGE flag set. This can be quite surprising for all those applications which do not do prctl(); fork() & exec() and want to control their own THP behavior. Another use case where the immediate semantic of the prctl might be useful is a combination of pre- and post-copy migration of containers with CRIU. In this case CRIU populates a part of a memory region with data that was saved during the pre-copy stage. Afterwards, the region is registered with userfaultfd and CRIU expects to get page faults for the parts of the region that were not yet populated. However, khugepaged collapses the pages and the expected page faults do not occur. In the more general case, prctl(PR_SET_THP_DISABLE) could be used as a temporary mechanism for enabling/disabling THP process-wide. Implementation-wise, a new MMF_DISABLE_THP flag is added. This flag is tested when the decision whether to use huge pages is made, either during page fault or at the time of THP collapse. It should be noted that the new implementation makes PR_SET_THP_DISABLE a master override to any per-VMA setting, which was not the case previously. Fixes: a0715cc22601 ("mm, thp: add VM_INIT_DEF_MASK and PRCTL_THP_DISABLE") Link: http://lkml.kernel.org/r/1496415802-30944-1-git-send-email-rppt@linux.vnet.ibm.com Signed-off-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Pavel Emelyanov <xemul@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
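A small self-contained model of the new decision (the flag names mirror the commit, but the bit positions and helper are illustrative):

    #include <stdbool.h>
    #include <stdio.h>

    #define MMF_DISABLE_THP (1UL << 0)      /* per-process, set by prctl */
    #define VM_NOHUGEPAGE   (1UL << 1)      /* per-VMA */

    /* The process-wide bit now overrides any per-VMA setting at once,
     * instead of only affecting mappings created after the prctl. */
    static bool thp_allowed(unsigned long mm_flags, unsigned long vm_flags)
    {
            return !(mm_flags & MMF_DISABLE_THP) && !(vm_flags & VM_NOHUGEPAGE);
    }

    int main(void)
    {
            printf("before prctl: %d\n", thp_allowed(0, 0));
            printf("after prctl:  %d\n", thp_allowed(MMF_DISABLE_THP, 0));
            return 0;
    }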
2017-07-10  mm, vmpressure: pass-through notification support  (David Rientjes)
By default, vmpressure events are not pass-through, i.e. they propagate up through the memcg hierarchy until an event notifier is found for any threshold level. This presents a difficulty when a thread waiting on a read(2) for a vmpressure event cannot distinguish between local memory pressure and memory pressure in a descendant memcg, especially when that thread may not control the memcg hierarchy. Consider a user-controlled child memcg with a smaller limit than a top-level memcg controlled by the "Activity Manager" specified in Documentation/cgroup-v1/memory.txt. It may register for memory pressure notification for descendant memcgs to make a policy decision: oom kill a low priority job, increase the limit, decrease other limits, etc. If it registers for memory pressure notification on the top-level memcg, it currently cannot distinguish between memory pressure in its own memcg or a descendant memcg, which is user-controlled. Conversely, if a user registers for memory pressure notification on their own descendant memcg, the Activity Manager does not receive any pressure notification for that child memcg hierarchy. Vmpressure events are not received for ancestor memcgs if the memcg experiencing pressure has notifiers registered, perhaps outside the knowledge of the thread waiting on read(2) at the top level. Both of these are consequences of vmpressure notification not being pass-through. This implements a pass-through behavior for vmpressure events. When writing to cgroup.event_control, vmpressure event handlers may optionally specify a mode. There are two new modes:

 - "hierarchy": always propagate memory pressure events up the hierarchy regardless of whether descendant memcgs have their own notifiers registered, and

 - "local": only receive notifications when the memcg for which the event is registered experiences memory pressure.

Of course, processes may register for one notification of "low,local", for example, and another for "low". If no mode is specified, the current behavior is maintained for backwards compatibility. See the change to Documentation/cgroup-v1/memory.txt for the full specification. [dan.carpenter@oracle.com: free the same pointer we allocated] Link: http://lkml.kernel.org/r/20170613191820.GA20003@elgon.mountain Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1705311421320.8946@chino.kir.corp.google.com Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Minchan Kim <minchan@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Anton Vorontsov <anton@enomsg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10  mm: hwpoison: introduce identify_page_state  (Naoya Horiguchi)
Factoring duplicate code into a function. Link: http://lkml.kernel.org/r/1496305019-5493-10-git-send-email-n-horiguchi@ah.jp.nec.com Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10  mm: hugetlb: delete dequeue_hwpoisoned_huge_page()  (Naoya Horiguchi)
dequeue_hwpoisoned_huge_page() is no longer used, so let's remove it. Link: http://lkml.kernel.org/r/1496305019-5493-9-git-send-email-n-horiguchi@ah.jp.nec.com Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10  mm: hwpoison: dissolve in-use hugepage in unrecoverable memory error  (Naoya Horiguchi)
Currently me_huge_page() relies on dequeue_hwpoisoned_huge_page() to keep the error hugepage away from the system, which is OK but not good enough because the hugepage still has a refcount and unpoison doesn't work on the error hugepage (PageHWPoison flags are cleared but pages are still leaked.) And there's a "wasting healthy subpages" issue too. This patch reworks me_huge_page() to solve these issues. For hugetlb files, we recently gained truncating code, so let's use it in a hugetlbfs-specific ->error_remove_page(). For anonymous hugepages, it's helpful to dissolve the error page after freeing it into the free hugepage list. The migration entry and PageHWPoison in the head page prevent access to it. TODO: dissolve_free_huge_page() can fail but we haven't considered that yet. It's not critical (and at least no worse than now) because in such a case the error hugepage just stays in the free hugepage list without being dissolved. By virtue of PageHWPoison in the head page, it's never allocated to processes. [akpm@linux-foundation.org: fix unused var warnings] Fixes: 23a003bfd23ea9ea0b7756b920e51f64b284b468 ("mm/madvise: pass return code of memory_failure() to userspace") Link: http://lkml.kernel.org/r/20170417055948.GM31394@yexl-desktop Link: http://lkml.kernel.org/r/1496305019-5493-8-git-send-email-n-horiguchi@ah.jp.nec.com Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10  mm: hwpoison: introduce memory_failure_hugetlb()  (Naoya Horiguchi)
memory_failure() is a big function and hard to maintain. Handling the hugetlb and non-hugetlb cases in a single function is not good, so this patch separates the PageHuge() branch into a new function, which saves many PageHuge() checks. Link: http://lkml.kernel.org/r/1496305019-5493-7-git-send-email-n-horiguchi@ah.jp.nec.com Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10  mm: soft-offline: dissolve free hugepage if soft-offlined  (Naoya Horiguchi)
Now we have code to rescue most of the healthy pages from a hwpoisoned hugepage. So let's apply it to soft_offline_free_page too. Link: http://lkml.kernel.org/r/1496305019-5493-6-git-send-email-n-horiguchi@ah.jp.nec.com Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10  mm: hugetlb: soft-offline: dissolve source hugepage after successful migration  (Anshuman Khandual)
Currently a hugepage migrated by soft-offline (i.e. due to correctable memory errors) is contained as a hugepage, which means many non-error pages in it are unreusable, i.e. wasted. This patch solves this issue by dissolving source hugepages into buddy. As done in the previous patch, PageHWPoison is set only on the head page of the error hugepage. Then in dissolving we move the PageHWPoison flag to the raw error page so that all healthy subpages return back to buddy. [arnd@arndb.de: fix warnings: replace some macros with inline functions] Link: http://lkml.kernel.org/r/20170609102544.2947326-1-arnd@arndb.de Link: http://lkml.kernel.org/r/1496305019-5493-5-git-send-email-n-horiguchi@ah.jp.nec.com Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10  mm: hwpoison: change PageHWPoison behavior on hugetlb pages  (Naoya Horiguchi)
We'd like to narrow down the error region of a memory error on hugetlb pages. However, currently we set PageHWPoison flags on all subpages in the error hugepage and add the number of subpages to num_hwpoison_pages, which doesn't fit our purpose. So this patch changes the behavior: we only set PageHWPoison on the head page and increase num_hwpoison_pages by only 1. This is a preparation for the narrow-down part, which comes in later patches. Link: http://lkml.kernel.org/r/1496305019-5493-4-git-send-email-n-horiguchi@ah.jp.nec.com Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10  mm: hugetlb: return immediately for hugetlb page in __delete_from_page_cache()  (Naoya Horiguchi)
We avoid calling __mod_node_page_state(NR_FILE_PAGES) for hugetlb pages now, but it's not enough because later code doesn't handle hugetlb properly. Actually in our testing, WARN_ON_ONCE(PageDirty(page)) at the end of this function fires for hugetlb, which makes no sense. So we should return immediately for hugetlb pages. Link: http://lkml.kernel.org/r/1496305019-5493-3-git-send-email-n-horiguchi@ah.jp.nec.com Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10  mm: hugetlb: prevent reuse of hwpoisoned free hugepages  (Naoya Horiguchi)
Patch series "mm: hwpoison: fixlet for hugetlb migration". This patchset updates the hwpoison/hugetlb code to address 2 reported issues. One is madvise(MADV_HWPOISON) failure reported by Intel's lkp robot (see http://lkml.kernel.org/r/20170417055948.GM31394@yexl-desktop.) First half was already fixed in mainline, and another half about hugetlb cases are solved in this series. Another issue is "narrow-down error affected region into a single 4kB page instead of a whole hugetlb page" issue, which was tried by Anshuman (http://lkml.kernel.org/r/20170420110627.12307-1-khandual@linux.vnet.ibm.com) and I updated it to apply it more widely. This patch (of 9): We no longer use MIGRATE_ISOLATE to prevent reuse of hwpoison hugepages as we did before. So current dequeue_huge_page_node() doesn't work as intended because it still uses is_migrate_isolate_page() for this check. This patch fixes it with PageHWPoison flag. Link: http://lkml.kernel.org/r/1496305019-5493-2-git-send-email-n-horiguchi@ah.jp.nec.com Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10  mm/zsmalloc.c: fix -Wunneeded-internal-declaration warning  (Nick Desaulniers)
is_first_page() is only called from the macro VM_BUG_ON_PAGE(), which is only compiled in as a runtime check when CONFIG_DEBUG_VM is set; otherwise it is checked at compile time and not actually compiled in. This fixes the following warning, found with Clang:

  mm/zsmalloc.c:472:12: warning: function 'is_first_page' is not needed and will not be emitted [-Wunneeded-internal-declaration]
  static int is_first_page(struct page *page)
             ^

Link: http://lkml.kernel.org/r/20170524053859.29059-1-nick.desaulniers@gmail.com Signed-off-by: Nick Desaulniers <nick.desaulniers@gmail.com> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10  mm/memory_hotplug.c: add NULL check to avoid potential NULL pointer dereference  (Gustavo A. R. Silva)
The NULL check at line 1226: if (!pgdat), implies that pointer pgdat might be NULL. rollback_node_hotadd() dereferences this pointer. Add NULL check to avoid a potential NULL pointer dereference. Addresses-Coverity-ID: 1369133 Link: http://lkml.kernel.org/r/20170530212436.GA6195@embeddedgus Signed-off-by: Gustavo A. R. Silva <garsilva@embeddedor.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10  mm, vmscan: avoid thrashing anon lru when free + file is low  (David Rientjes)
The purpose of the code that commit 623762517e23 ("revert 'mm: vmscan: do not swap anon pages just because free+file is low'") reintroduces is to prefer swapping anonymous memory rather than thrashing the file lru. If the anonymous inactive lru for the set of eligible zones is considered low, however, or the length of the list for the given reclaim priority does not allow for effective anonymous-only reclaiming, then avoid forcing SCAN_ANON. Forcing SCAN_ANON will end up thrashing the small list and leave unreclaimed memory on the file lrus. If the inactive list is insufficient, fall back to balanced reclaim so the file lru doesn't remain untouched. [akpm@linux-foundation.org: fix build] Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1705011432220.137835@chino.kir.corp.google.com Signed-off-by: David Rientjes <rientjes@google.com> Suggested-by: Minchan Kim <minchan@kernel.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10  mm/memory.c: convert to DEFINE_DEBUGFS_ATTRIBUTE  (Yevgen Pronenko)
The preferred strategy to define debugfs attributes is to use the DEFINE_DEBUGFS_ATTRIBUTE() macro and to use debugfs_create_file_unsafe(). Link: http://lkml.kernel.org/r/20170528145948.32127-1-y.pronenko@gmail.com Signed-off-by: Yevgen Pronenko <y.pronenko@gmail.com> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
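The converted pattern looks roughly like this kernel-style sketch (not standalone; the getter/setter bodies are illustrative):

    #include <linux/debugfs.h>

    static int fault_around_bytes_get(void *data, u64 *val)
    {
            *val = *(u64 *)data;
            return 0;
    }

    static int fault_around_bytes_set(void *data, u64 val)
    {
            *(u64 *)data = val;
            return 0;
    }

    /* Replaces DEFINE_SIMPLE_ATTRIBUTE() and avoids its file-lifetime
     * issues when paired with the _unsafe creator below. */
    DEFINE_DEBUGFS_ATTRIBUTE(fault_around_bytes_fops, fault_around_bytes_get,
                             fault_around_bytes_set, "%llu\n");

    /* ...and in an init function:
     * debugfs_create_file_unsafe("fault_around_bytes", 0644, parent,
     *                            &fault_around_bytes, &fault_around_bytes_fops);
     */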
2017-07-10  mm, page_alloc: fallback to smallest page when not stealing whole pageblock  (Vlastimil Babka)
Since commit 3bc48f96cf11 ("mm, page_alloc: split smallest stolen page in fallback") we pick the smallest (but sufficient) page of all that have been stolen from a pageblock of different migratetype. However, there are cases when we decide not to steal the whole pageblock. Practically in the current implementation it means that we are trying to fallback for a MIGRATE_MOVABLE allocation of order X, go through the freelists from MAX_ORDER-1 down to X, and find a free page of order Y. If Y is less than pageblock_order / 2, we decide not to steal all pages from the pageblock. When Y > X, it means we are potentially splitting a larger page than we need, as there might be other pages of order Z, where X <= Z < Y. Since Y is already too small to steal the whole pageblock, picking the smallest available Z will result in the same decision and we avoid splitting a higher-order page in a MIGRATE_UNMOVABLE or MIGRATE_RECLAIMABLE pageblock. This patch therefore changes the fallback algorithm so that in the situation described above, we switch the fallback search strategy to go from order X upwards to find the smallest suitable fallback. In theory there shouldn't be a downside of this change wrt fragmentation. This has been tested with mmtests' stress-highalloc performing GFP_KERNEL order-4 allocations; here are the relevant extfrag tracepoint statistics:

                                                      4.12.0-rc2    4.12.0-rc2
                                                       1-kernel4     2-kernel4
  Page alloc extfrag event                              25640976      69680977
  Extfrag fragmenting                                   25621086      69661364
  Extfrag fragmenting for unmovable                        74409         73204
  Extfrag fragmenting unmovable placed with movable        69003         67684
  Extfrag fragmenting unmovable placed with reclaim.        5406          5520
  Extfrag fragmenting for reclaimable                       6398          8467
  Extfrag fragmenting reclaimable placed with movable        869           884
  Extfrag fragmenting reclaimable placed with unmov.        5529          7583
  Extfrag fragmenting for movable                       25540279      69579693

Since we force movable allocations to steal the smallest available page (which we then practically always split), we steal less per fallback, so the number of fallbacks increases and steals potentially happen from different pageblocks. This is however not an issue for movable pages that can be compacted. Importantly, the "unmovable placed with movable" statistic is lower, which is the result of less fragmentation in the unmovable pageblocks. The effect on reclaimable allocation is a bit unclear. Link: http://lkml.kernel.org/r/20170529093947.22618-1-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
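A self-contained model of the changed search direction (freelist contents and the MAX_ORDER value are illustrative):

    #include <stdio.h>

    #define MAX_ORDER 11

    /* When we already know the whole pageblock will not be stolen, scan
     * from the requested order upwards for the smallest sufficient page,
     * instead of splitting the largest page found by the old downward
     * MAX_ORDER-1 -> X scan. */
    static int find_smallest_fallback(const int free_count[], int order)
    {
            for (; order < MAX_ORDER; order++)
                    if (free_count[order] > 0)
                            return order;
            return -1;
    }

    int main(void)
    {
            /* Free pages at orders 6 and 9: an order-4 request now steals
             * from order 6 rather than splitting the order-9 page. */
            int free_count[MAX_ORDER] = { 0, 0, 0, 0, 0, 0, 3, 0, 0, 1, 0 };

            printf("fallback order: %d\n", find_smallest_fallback(free_count, 4));
            return 0;
    }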
2017-07-10  swap: add block io poll in swapin path  (Shaohua Li)
For a fast flash disk, async IO can introduce overhead because of context switches. block-mq now supports IO polling, which improves performance and latency a lot. swapin is a good place to use this technique, because the task is waiting for the swapin page to continue execution. In my virtual machine, directly reading 4k data from an NVMe with iopoll is about 60% better than without polling. With iopoll support in the swapin path, my microbenchmark (a task doing random memory writes) is about 10%~25% faster. CPU utilization increases a lot though, 2x and even 3x CPU utilization. This will depend on disk speed. While iopoll in swapin isn't intended for all usage cases, it's a win for latency-sensitive workloads with a high speed swap disk. The block layer has a knob to control polling at runtime. If polling isn't enabled in the block layer, there should be no noticeable change in swapin. I got a chance to run the same test on an NVMe with DRAM as the media. In a simple fio IO test, blkpoll boosts performance by 50% in the single thread test and ~20% in the 8 threads test. So this is the baseline. In the above swap test, blkpoll boosts performance by ~27% in the single thread test. blkpoll uses 2x CPU time though. If we enable hybrid polling, the performance gain drops very slightly but the CPU time is only 50% worse than without blkpoll. We can also adjust the parameters of hybrid polling; with that, the CPU time penalty is reduced further. In the 8 threads test, blkpoll doesn't help though. The performance is similar to that without blkpoll, and cpu utilization is similar too. There is lock contention in the swap path. The cpu time spent on blkpoll isn't high. So overall, blkpoll swapin isn't worse than without it. The swapin readahead might read several pages at the same time and form a big IO request. Since that IO will take a longer time, it doesn't make sense to poll, so the patch only does iopoll for single page swapin. [akpm@linux-foundation.org: coding-style fixes] Link: http://lkml.kernel.org/r/070c3c3e40b711e7b1390002c991e86a-b5408f0@7511894063d3764ff01ea8111f5a004d7dd700ed078797c204a24e620ddb965c Signed-off-by: Shaohua Li <shli@fb.com> Cc: Tim Chen <tim.c.chen@intel.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Jens Axboe <axboe@fb.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-08  Merge branch 'for-linus' of git://git.armlinux.org.uk/~rmk/linux-arm  (Linus Torvalds)
Pull ARM updates from Russell King:

 - add support for ftrace-with-registers, which is needed for kgraft and other ftrace tools
 - support for mremap() for the sigpage/vDSO so that checkpoint/restore can work
 - add timestamps to each line of the register dump output
 - remove the unused KTHREAD_SIZE from nommu
 - align the ARM bitops APIs with the generic API (using unsigned long pointers rather than void pointers)
 - make the configuration of userspace Thumb support an expert option so that we can default it on, and avoid some hard to debug userspace crashes

* 'for-linus' of git://git.armlinux.org.uk/~rmk/linux-arm:
  ARM: 8684/1: NOMMU: Remove unused KTHREAD_SIZE definition
  ARM: 8683/1: ARM32: Support mremap() for sigpage/vDSO
  ARM: 8679/1: bitops: Align prototypes to generic API
  ARM: 8678/1: ftrace: Adds support for CONFIG_DYNAMIC_FTRACE_WITH_REGS
  ARM: make configuration of userspace Thumb support an expert option
  ARM: 8673/1: Fix __show_regs output timestamps
2017-07-07  Merge tag 'for-linus-v4.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux  (Linus Torvalds)
Pull Writeback error handling updates from Jeff Layton:

 "This pile represents the bulk of the writeback error handling fixes that I have for this cycle. Some of the earlier patches in this pile may look trivial but they are prerequisites for later patches in the series.

  The aim of this set is to improve how we track and report writeback errors to userland. Most applications that care about data integrity will periodically call fsync/fdatasync/msync to ensure that their writes have made it to the backing store.

  For a very long time, we have tracked writeback errors using two flags in the address_space: AS_EIO and AS_ENOSPC. Those flags are set when a writeback error occurs (via mapping_set_error) and are cleared as a side-effect of filemap_check_errors (as you noted yesterday).

  This model really sucks for userland. Only the first task to call fsync (or msync or fdatasync) will see the error. Any subsequent task calling fsync on a file will get back 0 (unless another writeback error occurs in the interim). If I have several tasks writing to a file and calling fsync to ensure that their writes got stored, then I need to have them coordinate with one another. That's difficult enough, but in a world of containerized setups that coordination may even not be possible.

  But wait...it gets worse!

  The calls to filemap_check_errors can be buried pretty far down in the call stack, and there are internal callers of filemap_write_and_wait and the like that also end up clearing those errors. Many of those callers ignore the error return from that function or return it to userland at nonsensical times (e.g. truncate() or stat()). If I get back -EIO on a truncate, there is no reason to think that it was because some previous writeback failed, and a subsequent fsync() will (incorrectly) return 0.

  This pile aims to do three things:

   1) ensure that when a writeback error occurs that that error will be reported to userland on a subsequent fsync/fdatasync/msync call, regardless of what internal callers are doing

   2) report writeback errors on all file descriptions that were open at the time that the error occurred. This is a user-visible change, but I think most applications are written to assume this behavior anyway. Those that aren't are unlikely to be hurt by it.

   3) document what filesystems should do when there is a writeback error. Today, there is very little consistency between them, and a lot of cargo-cult copying. We need to make it very clear what filesystems should do in this situation.

  To achieve this, the set adds a new data type (errseq_t) and then builds new writeback error tracking infrastructure around that. Once all of that is in place, we change the filesystems to use the new infrastructure for reporting wb errors to userland.

  Note that this is just the initial foray into cleaning up this mess. There is a lot of work remaining here:

   1) convert the rest of the filesystems in a similar fashion. Once the initial set is in, then I think most other fs' will be fairly simple to convert. Hopefully most of those can go in via individual filesystem trees.

   2) convert internal waiters on writeback to use errseq_t for detecting errors instead of relying on the AS_* flags. I have some draft patches for this for ext4, but they are not quite ready for prime time yet. This was a discussion topic this year at LSF/MM too.

  If you're interested in the gory details, LWN has some good articles about this:

      https://lwn.net/Articles/718734/
      https://lwn.net/Articles/724307/"

* tag 'for-linus-v4.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
  btrfs: minimal conversion to errseq_t writeback error reporting on fsync
  xfs: minimal conversion to errseq_t writeback error reporting
  ext4: use errseq_t based error handling for reporting data writeback errors
  fs: convert __generic_file_fsync to use errseq_t based reporting
  block: convert to errseq_t based writeback error tracking
  dax: set errors in mapping when writeback fails
  Documentation: flesh out the section in vfs.txt on storing and reporting writeback errors
  mm: set both AS_EIO/AS_ENOSPC and errseq_t in mapping_set_error
  fs: new infrastructure for writeback error handling and reporting
  lib: add errseq_t type and infrastructure for handling it
  mm: don't TestClearPageError in __filemap_fdatawait_range
  mm: clear AS_EIO/AS_ENOSPC when writeback initiation fails
  jbd2: don't clear and reset errors after waiting on writeback
  buffer: set errors in mapping at the time that the error occurs
  fs: check for writeback errors after syncing out buffers in generic_file_fsync
  buffer: use mapping_set_error instead of setting the flag
  mm: fix mapping_set_error call in me_pagecache_dirty
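A hedged sketch of the errseq_t model described above (the wb_err fields and errseq helpers come from this series; the surrounding code is illustrative, not lifted from a specific patch):

    /* Writeback marks the error once in the mapping: */
    errseq_set(&mapping->wb_err, -EIO);

    /* Each fsync caller keeps its own cursor (file->f_wb_err) and sees
     * the error exactly once, regardless of how many other openers of
     * the file checked in between: */
    err = errseq_check_and_advance(&mapping->wb_err, &file->f_wb_err);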
2017-07-07  Merge tag 'for-linus-v4.13-1' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux  (Linus Torvalds)
Pull Writeback error handling fixes from Jeff Layton:

 "The main rationale for all of these changes is to tighten up writeback error reporting to userland. There are many ways now that writeback errors can be lost, such that fsync/fdatasync/msync return 0 when writeback actually failed.

  This pile contains a small set of cleanups and writeback error handling fixes that I was able to break off from the main pile (#2).

  Two of the patches in this pile are trivial. The exceptions are the patch to fix up error handling in write_one_page, and the patch to make JFS pay attention to write_one_page errors"

* tag 'for-linus-v4.13-1' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
  fs: remove call_fsync helper function
  mm: clean up error handling in write_one_page
  JFS: do not ignore return code from write_one_page()
  mm: drop "wait" parameter from write_one_page()
2017-07-07  Merge tag 'powerpc-4.13-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux  (Linus Torvalds)
Pull powerpc updates from Michael Ellerman:

 "Highlights include:

   - Support for STRICT_KERNEL_RWX on 64-bit server CPUs.
   - Platform support for FSP2 (476fpe) board
   - Enable ZONE_DEVICE on 64-bit server CPUs.
   - Generic & powerpc spin loop primitives to optimise busy waiting
   - Convert VDSO update function to use new update_vsyscall() interface
   - Optimisations to hypercall/syscall/context-switch paths
   - Improvements to the CPU idle code on Power8 and Power9.

  As well as many other fixes and improvements.

  Thanks to: Akshay Adiga, Andrew Donnellan, Andrew Jeffery, Anshuman Khandual, Anton Blanchard, Balbir Singh, Benjamin Herrenschmidt, Christophe Leroy, Christophe Lombard, Colin Ian King, Dan Carpenter, Gautham R. Shenoy, Hari Bathini, Ian Munsie, Ivan Mikhaylov, Javier Martinez Canillas, Madhavan Srinivasan, Masahiro Yamada, Matt Brown, Michael Neuling, Michal Suchanek, Murilo Opsfelder Araujo, Naveen N. Rao, Nicholas Piggin, Oliver O'Halloran, Paul Mackerras, Pavel Machek, Russell Currey, Santosh Sivaraj, Stephen Rothwell, Thiago Jung Bauermann, Yang Li"

* tag 'powerpc-4.13-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (158 commits)
  powerpc/Kconfig: Enable STRICT_KERNEL_RWX for some configs
  powerpc/mm/radix: Implement STRICT_RWX/mark_rodata_ro() for Radix
  powerpc/mm/hash: Implement mark_rodata_ro() for hash
  powerpc/vmlinux.lds: Align __init_begin to 16M
  powerpc/lib/code-patching: Use alternate map for patch_instruction()
  powerpc/xmon: Add patch_instruction() support for xmon
  powerpc/kprobes/optprobes: Use patch_instruction()
  powerpc/kprobes: Move kprobes over to patch_instruction()
  powerpc/mm/radix: Fix execute permissions for interrupt_vectors
  powerpc/pseries: Fix passing of pp0 in updatepp() and updateboltedpp()
  powerpc/64s: Blacklist rtas entry/exit from kprobes
  powerpc/64s: Blacklist functions invoked on a trap
  powerpc/64s: Un-blacklist system_call() from kprobes
  powerpc/64s: Move system_call() symbol to just after setting MSR_EE
  powerpc/64s: Blacklist system_call() and system_call_common() from kprobes
  powerpc/64s: Convert .L__replay_interrupt_return to a local label
  powerpc64/elfv1: Only dereference function descriptor for non-text symbols
  cxl: Export library to support IBM XSL
  powerpc/dts: Use #include "..." to include local DT
  powerpc/perf/hv-24x7: Aggregate result elements on POWER9 SMT8
  ...
2017-07-06  mm, memory_hotplug: move movable_node to the hotplug proper  (Michal Hocko)
movable_node_is_enabled is defined in memblock proper while it is initialized from the memory hotplug proper. This is quite messy and it makes a dependency between the two so move movable_node along with the helper functions to memory_hotplug. To make it more entertaining the kernel parameter is ignored unless CONFIG_HAVE_MEMBLOCK_NODE_MAP=y because we do not have the node information for each memblock otherwise. So let's warn when the option is disabled. Link: http://lkml.kernel.org/r/20170529114141.536-4-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Reza Arbab <arbab@linux.vnet.ibm.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@suse.de> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Yasuaki Ishimatsu <yasu.isimatu@gmail.com> Cc: Xishi Qiu <qiuxishi@huawei.com> Cc: Kani Toshimitsu <toshi.kani@hpe.com> Cc: Chen Yucong <slaoub@gmail.com> Cc: Joonsoo Kim <js1304@gmail.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Daniel Kiper <daniel.kiper@oracle.com> Cc: Igor Mammedov <imammedo@redhat.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-06  mm, memory_hotplug: drop CONFIG_MOVABLE_NODE  (Michal Hocko)
Commit 20b2f52b73fe ("numa: add CONFIG_MOVABLE_NODE for movable-dedicated node") has introduced CONFIG_MOVABLE_NODE without a good explanation on why it is actually useful. It makes a lot of sense to make movable node semantic opt in but we already have that because the feature has to be explicitly enabled on the kernel command line. A config option on top only makes the configuration space larger without a good reason. It also adds an additional ifdefery that pollutes the code. Just drop the config option and make it de-facto always enabled. This shouldn't introduce any change to the semantic. Link: http://lkml.kernel.org/r/20170529114141.536-3-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Reza Arbab <arbab@linux.vnet.ibm.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@suse.de> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Yasuaki Ishimatsu <yasu.isimatu@gmail.com> Cc: Xishi Qiu <qiuxishi@huawei.com> Cc: Kani Toshimitsu <toshi.kani@hpe.com> Cc: Chen Yucong <slaoub@gmail.com> Cc: Joonsoo Kim <js1304@gmail.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Daniel Kiper <daniel.kiper@oracle.com> Cc: Igor Mammedov <imammedo@redhat.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-06  mm, memory_hotplug: drop artificial restriction on online/offline  (Michal Hocko)
Patch series "remove CONFIG_MOVABLE_NODE". I am continuing to clean up the memory hotplug code and CONFIG_MOVABLE_NODE seems dubious at best. The following two patches simply removes the flag and make it de-facto always enabled. The current semantic of the config option is twofold 1) it automatically binds hotplugable nodes to have memory in zone_movable by default when movable_node is enabled 2) forbids memory hotplug to online all the memory as movable when !CONFIG_MOVABLE_NODE. The later restriction is quite dubious because there is no clear cut of how much normal memory do we need for a reasonable system operation. A single memory block which is sufficient to allow further movable onlines is far from sufficient (e.g a node with >2GB and memblocks 128MB will fill up this zone with struct pages leaving nothing for other allocations). Removing the config option will not only reduce the configuration space it also removes quite some code. The semantic of the movable_node command line parameter is preserved. The first patch removes the restriction mentioned above and the second one simply removes all the CONFIG_MOVABLE_NODE related stuff. The last patch moves movable_node flag handling to memory_hotplug proper where it belongs. [1] http://lkml.kernel.org/r/20170524122411.25212-1-mhocko@kernel.org This patch (of 3): Commit 74d42d8fe146 ("memory_hotplug: ensure every online node has NORMAL memory") has introduced a restriction that every numa node has to have at least some memory in !movable zones before a first movable memory can be onlined if !CONFIG_MOVABLE_NODE. Likewise can_offline_normal checks the amount of normal memory in !movable zones and it disallows to offline memory if there is no normal memory left with a justification that "memory-management acts bad when we have nodes which is online but don't have any normal memory". While it is true that not having _any_ memory for kernel allocations on a NUMA node is far from great and such a node would be quite subotimal because all kernel allocations will have to fallback to another NUMA node but there is no reason to disallow such a configuration in principle. Besides that there is not really a big difference to have one memblock for ZONE_NORMAL available or none. With 128MB size memblocks the system might trash on the kernel allocations requests anyway. It is really hard to draw a line on how much normal memory is really sufficient so we have to rely on administrator to configure system sanely therefore drop the artificial restriction and remove can_offline_normal and can_online_high_movable altogether. Link: http://lkml.kernel.org/r/20170529114141.536-2-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Reza Arbab <arbab@linux.vnet.ibm.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Yasuaki Ishimatsu <yasu.isimatu@gmail.com> Cc: Xishi Qiu <qiuxishi@huawei.com> Cc: Kani Toshimitsu <toshi.kani@hpe.com> Cc: Chen Yucong <slaoub@gmail.com> Cc: Joonsoo Kim <js1304@gmail.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Daniel Kiper <daniel.kiper@oracle.com> Cc: Igor Mammedov <imammedo@redhat.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-06mm: memcontrol: account slab stats per lruvecJohannes Weiner
Josef's redesign of the balancing between slab caches and the page cache requires slab cache statistics at the lruvec level. Link: http://lkml.kernel.org/r/20170530181724.27197-7-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-06mm: memcontrol: per-lruvec stats infrastructureJohannes Weiner
lruvecs are at the intersection of the NUMA node and memcg, which is the scope for most paging activity. Introduce a convenient accounting infrastructure that maintains statistics per node, per memcg, and the lruvec itself. Then convert over accounting sites for statistics that are already tracked in both nodes and memcgs and can be easily switched. [hannes@cmpxchg.org: fix crash in the new cgroup stat keeping code] Link: http://lkml.kernel.org/r/20170531171450.GA10481@cmpxchg.org [hannes@cmpxchg.org: don't track uncharged pages at all] Link: http://lkml.kernel.org/r/20170605175254.GA8547@cmpxchg.org [hannes@cmpxchg.org: add missing free_percpu()] Link: http://lkml.kernel.org/r/20170605175354.GB8547@cmpxchg.org [linux@roeck-us.net: hexagon: fix build error caused by include file order] Link: http://lkml.kernel.org/r/20170617153721.GA4382@roeck-us.net Link: http://lkml.kernel.org/r/20170530181724.27197-6-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Guenter Roeck <linux@roeck-us.net> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
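A minimal sketch of what such a cgroup-aware update helper can look like, assuming a per-node memcg structure with a memcg back-pointer and a per-CPU counter array (simplified from the description, not verbatim kernel code):

    static inline void mod_lruvec_state(struct lruvec *lruvec,
                                        enum node_stat_item idx, int val)
    {
            struct mem_cgroup_per_node *pn;

            /* The node-level counter is always updated... */
            mod_node_page_state(lruvec_pgdat(lruvec), idx, val);
            if (mem_cgroup_disabled())
                    return;

            /* ...and the memcg and lruvec levels follow when memcg is on. */
            pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
            mod_memcg_state(pn->memcg, idx, val);
            this_cpu_add(pn->lruvec_stat->count[idx], val);
    }

Callers that already hold a lruvec can then account once and have the statistic visible at all three scopes.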
2017-07-06mm: memcontrol: use generic mod_memcg_page_state for kmem pagesJohannes Weiner
The kmem-specific functions do the same thing. Switch and drop. Link: http://lkml.kernel.org/r/20170530181724.27197-5-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-06mm: memcontrol: use the node-native slab memory countersJohannes Weiner
Now that the slab counters are moved from the zone to the node level we can drop the private memcg node stats and use the official ones. Link: http://lkml.kernel.org/r/20170530181724.27197-4-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-06mm: vmstat: move slab statistics from zone to node countersJohannes Weiner
Patch series "mm: per-lruvec slab stats" Josef is working on a new approach to balancing slab caches and the page cache. For this to work, he needs slab cache statistics on the lruvec level. These patches implement that by adding infrastructure that allows updating and reading generic VM stat items per lruvec, then switches some existing VM accounting sites, including the slab accounting ones, to this new cgroup-aware API. I'll follow up with more patches on this, because there is actually substantial simplification that can be done to the memory controller when we replace private memcg accounting with making the existing VM accounting sites cgroup-aware. But this is enough for Josef to base his slab reclaim work on, so here goes. This patch (of 5): To re-implement slab cache vs. page cache balancing, we'll need the slab counters at the lruvec level, which, ever since lru reclaim was moved from the zone to the node, is the intersection of the node, not the zone, and the memcg. We could retain the per-zone counters for when the page allocator dumps its memory information on failures, and have counters on both levels - which on all but NUMA node 0 is usually redundant. But let's keep it simple for now and just move them. If anybody complains we can restore the per-zone counters. [hannes@cmpxchg.org: fix oops] Link: http://lkml.kernel.org/r/20170605183511.GA8915@cmpxchg.org Link: http://lkml.kernel.org/r/20170530181724.27197-3-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-06mm/zswap.c: delete an error message for a failed memory allocation in zswap_dstmem_prepare()Markus Elfring
Omit an extra message for a memory allocation failure in this function. This issue was detected by using the Coccinelle software. Link: http://events.linuxfoundation.org/sites/events/files/slides/LCJ16-Refactor_Strings-WSang_0.pdf Link: http://lkml.kernel.org/r/bae25b04-2ce2-7137-a71c-50d7b4f06431@users.sourceforge.net Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Seth Jennings <sjenning@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
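The shape of the cleanup, as a hedged sketch (the function body is simplified, not the exact zswap code): the page allocator already dumps a warning and backtrace on failure, so a private error message adds nothing.

    static int zswap_dstmem_prepare(unsigned int cpu)
    {
            u8 *dst;

            dst = kmalloc_node(PAGE_SIZE * 2, GFP_KERNEL, cpu_to_node(cpu));
            if (!dst)
                    return -ENOMEM;    /* pr_err() dropped: allocator warns */

            per_cpu(zswap_dstmem, cpu) = dst;
            return 0;
    }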
2017-07-06mm/zswap.c: improve a size determination in zswap_frontswap_init()Markus Elfring
Pass a pointer dereference instead of the data structure type as the parameter of the "sizeof" operator, making the size determination a bit safer, in line with the Linux coding style convention. Link: http://lkml.kernel.org/r/19f9da22-092b-f867-bdf6-f4dbad7ccf1f@users.sourceforge.net Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Seth Jennings <sjenning@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
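For example (structure name illustrative), the preferred form keeps the allocation size correct even if the pointer's type is later changed:

    struct zswap_tree *tree;

    tree = kzalloc(sizeof(*tree), GFP_KERNEL);    /* preferred */
    /* rather than: tree = kzalloc(sizeof(struct zswap_tree), GFP_KERNEL); */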
2017-07-06mm/zswap.c: delete an error message for a failed memory allocation in zswap_pool_create()Markus Elfring
Omit an extra message for a memory allocation failure in this function. This issue was detected by using the Coccinelle software. Link: http://events.linuxfoundation.org/sites/events/files/slides/LCJ16-Refactor_Strings-WSang_0.pdf Link: http://lkml.kernel.org/r/2345aabc-ae98-1d31-afba-40a02c5baf3d@users.sourceforge.net Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Seth Jennings <sjenning@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-06mm/swapfile.c: sort swap entries before freeHuang Ying
Reduce the contention on swap_info_struct->lock when freeing swap entries. Freed swap entries are first collected in a per-CPU buffer and really freed later in batches. During the batch freeing, if consecutive swap entries in the per-CPU buffer belong to the same swap device, swap_info_struct->lock needs to be acquired/released only once, which greatly reduces the lock contention. But if there are multiple swap devices, the lock may still be unnecessarily released/acquired because swap entries belonging to the same swap device can be non-consecutive in the per-CPU buffer. To solve this, the per-CPU buffer is sorted by swap device before the swap entries are freed. With the patch, the time to free memory (some of it swapped out) was reduced by 11.6% (from 2.65s to 2.35s) in the vm-scalability swap-w-rand test case with 16 processes. The test was done on a Xeon E5 v3 system; the swap device used was a RAM-simulated PMEM (persistent memory) device. To exercise swapping, the test case creates 16 processes which allocate and write to anonymous pages until the RAM and part of the swap device are used up; finally the memory (some of it swapped out) is freed before exit. [akpm@linux-foundation.org: tweak comment] Link: http://lkml.kernel.org/r/20170525005916.25249-1-ying.huang@intel.com Signed-off-by: Huang Ying <ying.huang@intel.com> Acked-by: Tim Chen <tim.c.chen@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Shaohua Li <shli@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
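A hedged sketch of the batched free along the lines described (helper names such as swap_info_get_cont() and swap_entry_free() are assumptions, not necessarily the exact kernel symbols): sorting by swap device first means each device's lock is taken and released at most once per batch.

    static int swp_entry_cmp(const void *ent1, const void *ent2)
    {
            const swp_entry_t *e1 = ent1, *e2 = ent2;

            return (int)swp_type(*e1) - (int)swp_type(*e2);
    }

    void swapcache_free_entries(swp_entry_t *entries, int n)
    {
            struct swap_info_struct *p = NULL, *prev = NULL;
            int i;

            if (n <= 0)
                    return;

            /* Group entries of the same device so its lock is taken once. */
            sort(entries, n, sizeof(entries[0]), swp_entry_cmp, NULL);
            for (i = 0; i < n; i++) {
                    /* Re-takes p->lock only when the device changes. */
                    p = swap_info_get_cont(entries[i], prev);
                    if (p)
                            swap_entry_free(p, entries[i]);
                    prev = p;
            }
            if (p)
                    spin_unlock(&p->lock);
    }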
2017-07-06mm/oom_kill: count global and memory cgroup oom killsKonstantin Khlebnikov
Show the count of OOM killer invocations in /proc/vmstat and the count of processes killed in a memory cgroup in the "memory.events" knob (in memory.oom_control for the v1 cgroup). Also describe the difference between "oom" and "oom_kill" in the memory cgroup documentation. Currently, OOM in a memory cgroup kills tasks only if the shortage has happened inside a page fault. These counters help in monitoring OOM kills - until now the only way was grepping for magic words in the kernel log. [akpm@linux-foundation.org: fix for mem_cgroup_count_vm_event() rename] [akpm@linux-foundation.org: fix comment, per Konstantin] Link: http://lkml.kernel.org/r/149570810989.203600.9492483715840752937.stgit@buzz Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Cc: Michal Hocko <mhocko@kernel.org> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Roman Guschin <guroan@gmail.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
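Roughly where the accounting lands, as a sketch (the wrapper function and the memcg helper name are assumptions taken from the rename note below, not verbatim code):

    /* At kill time, bump the global counter (visible in /proc/vmstat)
     * and, if the victim has an mm, its memcg's oom_kill event too
     * (visible in memory.events). */
    static void note_oom_kill(struct task_struct *victim)
    {
            count_vm_event(OOM_KILL);
            if (victim->mm)
                    count_memcg_event_mm(victim->mm, OOM_KILL);
    }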
2017-07-06mm: per-cgroup memory reclaim statsRoman Gushchin
Track the following reclaim counters for every memory cgroup: PGREFILL, PGSCAN, PGSTEAL, PGACTIVATE, PGDEACTIVATE, PGLAZYFREE and PGLAZYFREED. These values are exposed via the memory.stat interface of cgroup v2. The meaning of each value is the same as for the global counters, available via /proc/vmstat. Also, for consistency, rename mem_cgroup_count_vm_event() to count_memcg_event_mm(). Link: http://lkml.kernel.org/r/1494530183-30808-1-git-send-email-guro@fb.com Signed-off-by: Roman Gushchin <guro@fb.com> Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Tejun Heo <tj@kernel.org> Cc: Li Zefan <lizefan@huawei.com> Cc: Balbir Singh <bsingharora@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
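Usage at a reclaim site then looks roughly like this (a sketch; the count_memcg_events() and lruvec_memcg() names are assumptions of the description):

    /* Count activations both globally and against the owning memcg. */
    static void account_activations(struct lruvec *lruvec,
                                    unsigned long nr_activate)
    {
            count_vm_events(PGACTIVATE, nr_activate);
            count_memcg_events(lruvec_memcg(lruvec), PGACTIVATE, nr_activate);
    }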
2017-07-06mm: kmemleak: treat vm_struct as alternative reference to vmalloc'ed objectsCatalin Marinas
Kmemleak requires that vmalloc'ed objects have a minimum reference count of 2: one in the corresponding vm_struct object and the other owned by the vmalloc() caller. There are cases, however, where the pointer originally returned by vmalloc() is lost and, instead, a pointer to the vm_struct is stored (see free_thread_stack()). Kmemleak currently reports such objects as leaks. This patch adds support for treating any surplus references to an object as additional references to a specified object. It introduces the kmemleak_vmalloc() API function, which takes a vm_struct pointer and passes its surplus references on to the actual pointer returned by vmalloc(). The __vmalloc_node_range() calling site has been modified accordingly. Link: http://lkml.kernel.org/r/1495726937-23557-4-git-send-email-catalin.marinas@arm.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Reported-by: "Luis R. Rodriguez" <mcgrof@kernel.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: "Luis R. Rodriguez" <mcgrof@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
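In __vmalloc_node_range() the change boils down to something like this hedged before/after sketch (argument names assumed from context):

    /* before: track the returned pointer with a fixed min_count of 2 */
    kmemleak_alloc(addr, real_size, 2, gfp_mask);

    /* after: hand over the vm_struct, so a surviving pointer to either
     * the vm_struct or the vmalloc'ed address keeps the object referenced */
    kmemleak_vmalloc(area, size, gfp_mask);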
2017-07-06mm: kmemleak: factor object reference updating out of scan_block()Catalin Marinas
scan_block() updates the number of references (pointers) to objects, adding them to the gray_list when object->min_count is reached. The patch factors out this functionality into a separate update_refs() function. Link: http://lkml.kernel.org/r/1495726937-23557-3-git-send-email-catalin.marinas@arm.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: "Luis R. Rodriguez" <mcgrof@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
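The factored-out helper is conceptually this small; a sketch along the lines of the description (kmemleak's internal coloring helpers shown, not guaranteed verbatim):

    static void update_refs(struct kmemleak_object *object)
    {
            /* Only white (not-yet-sufficiently-referenced) objects count. */
            if (!color_white(object))
                    return;

            /*
             * One more pointer to this object was found; once min_count
             * is reached the object turns gray and joins the gray_list
             * for further scanning.
             */
            object->count++;
            if (color_gray(object)) {
                    WARN_ON(!get_object(object));
                    list_add_tail(&object->gray_list, &gray_list);
            }
    }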
2017-07-06mm: kmemleak: slightly reduce the size of some structures on 64-bit architecturesCatalin Marinas
Change the kmemleak_object.flags type to unsigned int and move early_log.min_count (int) next to early_log.op_type (int) to slightly reduce the size of these structures on 64-bit architectures. Link: http://lkml.kernel.org/r/1495726937-23557-2-git-send-email-catalin.marinas@arm.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: "Luis R. Rodriguez" <mcgrof@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
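The saving comes from ordinary struct padding rules; an illustrative (not kmemleak-verbatim) layout shows why grouping the two ints helps:

    /* On 64-bit, two adjacent ints pack into one 8-byte slot; separating
     * them with a pointer-sized member would leave 4-byte padding holes. */
    struct early_log_example {
            int op_type;          /* 4 bytes                    */
            int min_count;        /* 4 bytes, fills the padding */
            const void *ptr;      /* 8 bytes, 8-byte aligned    */
    };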