summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2015-11-06printk: prevent userland from spoofing kernel messagesMathias Krause
The following statement of ABI/testing/dev-kmsg is not quite right: It is not possible to inject messages from userspace with the facility number LOG_KERN (0), to make sure that the origin of the messages can always be reliably determined. Userland actually can inject messages with a facility of 0 by abusing the fact that the facility is stored in a u8 data type. By using a facility which is a multiple of 256 the assignment of msg->facility in log_store() implicitly truncates it to 0, i.e. LOG_KERN, allowing users of /dev/kmsg to spoof kernel messages as shown below: The following call... # printf '<%d>Kernel panic - not syncing: beer empty\n' 0 >/dev/kmsg ...leads to the following log entry (dmesg -x | tail -n 1): user :emerg : [ 66.137758] Kernel panic - not syncing: beer empty However, this call... # printf '<%d>Kernel panic - not syncing: beer empty\n' 0x800 >/dev/kmsg ...leads to the slightly different log entry (note the kernel facility): kern :emerg : [ 74.177343] Kernel panic - not syncing: beer empty Fix that by limiting the user provided facility to 8 bit right from the beginning and catch the truncation early. Fixes: 7ff9554bb578 ("printk: convert byte-buffer to variable-length...") Signed-off-by: Mathias Krause <minipli@googlemail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Petr Mladek <pmladek@suse.cz> Cc: Alex Elder <elder@linaro.org> Cc: Joe Perches <joe@perches.com> Cc: Kay Sievers <kay@vrfy.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06lib/vsprintf.c: update documentationRasmus Villemoes
%n is no longer just ignored; it results in early return from vsnprintf. Also add a request to add test cases for future %p extensions. Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Reviewed-by: Martin Kletzander <mkletzan@redhat.com> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06selftests: run lib/test_printf moduleKees Cook
This runs the lib/test_printf module to make sure printf is operating sanely. Signed-off-by: Kees Cook <keescook@chromium.org> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Shuah Khan <shuahkh@osg.samsung.com> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06test_printf: test printf family at runtimeRasmus Villemoes
This adds a simple module for testing the kernel's printf facilities. Previously, some %p extensions have caused a wrong return value in case the entire output didn't fit and/or been unusable in kasprintf(). This should help catch such issues. Also, it should help ensure that changes to the formatting algorithms don't break anything. I'm not sure if we have a struct dentry or struct file lying around at boot time or if we can fake one, but most %p extensions should be testable, as should the ordinary number and string formatting. The nature of vararg functions means we can't use a more conventional table-driven approach. For now, this is mostly a skeleton; contributions are very welcome. Some tests are/will be slightly annoying to write, since the expected output depends on stuff like CONFIG_*, sizeof(long), runtime values etc. Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Reviewed-by: Kees Cook <keescook@chromium.org> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Martin Kletzander <mkletzan@redhat.com> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06lib/vsprintf.c: remove SPECIAL handling in pointer()Rasmus Villemoes
As a quick git grep -E '%[ +0#-]*#[ +0#-]*(\*|[0-9]+)?(\.(\*|[0-9]+)?)?p' shows, nobody uses the # flag with %p. Should one try to do so, one will be met with warning: `#' flag used with `%p' gnu_printf format [-Wformat] (POSIX and C99 both say "... For other conversion specifiers, the behavior is undefined.". Obviously, the kernel can choose to define the behaviour however it wants, but as long as gcc issues that warning, users are unlikely to show up.) Since default_width is effectively always 2*sizeof(void*), we can simplify the prologue of pointer() and save a few instructions. Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Acked-by: Kees Cook <keescook@chromium.org> Cc: Martin Kletzander <mkletzan@redhat.com> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06lib/vsprintf.c: also improve sanity check in bstr_printf()Rasmus Villemoes
Quoting from 2aa2f9e21e4e ("lib/vsprintf.c: improve sanity check in vsnprintf()"): On 64 bit, size may very well be huge even if bit 31 happens to be 0. Somehow it doesn't feel right that one can pass a 5 GiB buffer but not a 3 GiB one. So cap at INT_MAX as was probably the intention all along. This is also the made-up value passed by sprintf and vsprintf. I should have seen this copy-pasted instance back then, but let's just do it now. Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Acked-by: Kees Cook <keescook@chromium.org> Cc: Martin Kletzander <mkletzan@redhat.com> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06lib/vsprintf.c: handle invalid format specifiers more robustlyRasmus Villemoes
If we meet any invalid or unsupported format specifier, 'handling' it by just printing it as a literal string is not safe: Presumably the format string and the arguments passed gcc's type checking, but that means something like sprintf(buf, "%n %pd", &intvar, dentry) would end up interpreting &intvar as a struct dentry*. When the offending specifier was %n it used to be at the end of the format string, but we can't rely on that always being the case. Also, gcc doesn't complain about some more or less exotic qualifiers (or 'length modifiers' in posix-speak) such as 'j' or 'q', but being unrecognized by the kernel's printf implementation, they'd be interpreted as unknown specifiers, and the rest of arguments would be interpreted wrongly. So let's complain about anything we don't understand, not just %n, and stop pretending that we'd be able to make sense of the rest of the format/arguments. If the offending specifier is in a printk() call we unfortunately only get a "BUG: recent printk recursion!", but at least direct users of the sprintf family will be caught. Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Acked-by: Kees Cook <keescook@chromium.org> Cc: Martin Kletzander <mkletzan@redhat.com> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06printk: synchronize %p formatting documentationMartin Kletzander
Move all pointer-formatting documentation to one place in the code and one place in the documentation instead of keeping it in three places with different level of completeness. Documentation/printk-formats.txt has detailed information about each modifier, docstring above pointer() has short descriptions of them (as that is the function dealing with %p) and docstring above vsprintf() is removed as redundant. Both docstrings in the code that were modified are updated with a reminder of updating the documentation upon any further change. [akpm@linux-foundation.org: fix comment] Signed-off-by: Martin Kletzander <mkletzan@redhat.com> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06lib/dynamic_debug.c: use kstrdup_constRasmus Villemoes
Using kstrdup_const, thus reusing .rodata when possible, saves around 2 kB of runtime memory on my laptop/.config combination. Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Jason Baron <jbaron@akamai.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06fs/jffs2/wbuf.c: remove stray semicolonAndrew Morton
Reported-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: David Woodhouse <dwmw2@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06include/linux/compiler-gcc.h: improve __visible documentationAndrew Morton
Cc: Andi Kleen <andi@firstfloor.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06proc: actually make proc_fd_permission() thread-friendlyOleg Nesterov
The commit 96d0df79f264 ("proc: make proc_fd_permission() thread-friendly") fixed the access to /proc/self/fd from sub-threads, but introduced another problem: a sub-thread can't access /proc/<tid>/fd/ or /proc/thread-self/fd if generic_permission() fails. Change proc_fd_permission() to check same_thread_group(pid_task(), current). Fixes: 96d0df79f264 ("proc: make proc_fd_permission() thread-friendly") Reported-by: "Jin, Yihua" <yihua.jin@intel.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06fs/proc/array.c: set overflow flag in case of errorAndy Shevchenko
For now in task_name() we ignore the return code of string_escape_str() call. This is not good if buffer suddenly becomes not big enough. Do the proper error handling there. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06mm: use 'unsigned int' for compound_dtor/compound_order on 64BITKirill A. Shutemov
On 64 bit system we have enough space in struct page to encode compound_dtor and compound_order with unsigned int. On x86-64 it leads to slightly smaller code size due usesage of plain MOV instead of MOVZX (zero-extended move) or similar effect. allyesconfig: text data bss dec hex filename 159520446 48146736 72196096 279863278 10ae5fee vmlinux.pre 159520382 48146736 72196096 279863214 10ae5fae vmlinux.post On other architectures without native support of 16-bit data types the Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06mm: use 'unsigned int' for page orderKirill A. Shutemov
Let's try to be consistent about data type of page order. [sfr@canb.auug.org.au: fix build (type of pageblock_order)] [hughd@google.com: some configs end up with MAX_ORDER and pageblock_order having different types] Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06mm: make compound_head() robustKirill A. Shutemov
Hugh has pointed that compound_head() call can be unsafe in some context. There's one example: CPU0 CPU1 isolate_migratepages_block() page_count() compound_head() !!PageTail() == true put_page() tail->first_page = NULL head = tail->first_page alloc_pages(__GFP_COMP) prep_compound_page() tail->first_page = head __SetPageTail(p); !!PageTail() == true <head == NULL dereferencing> The race is pure theoretical. I don't it's possible to trigger it in practice. But who knows. We can fix the race by changing how encode PageTail() and compound_head() within struct page to be able to update them in one shot. The patch introduces page->compound_head into third double word block in front of compound_dtor and compound_order. Bit 0 encodes PageTail() and the rest bits are pointer to head page if bit zero is set. The patch moves page->pmd_huge_pte out of word, just in case if an architecture defines pgtable_t into something what can have the bit 0 set. hugetlb_cgroup uses page->lru.next in the second tail page to store pointer struct hugetlb_cgroup. The patch switch it to use page->private in the second tail page instead. The space is free since ->first_page is removed from the union. The patch also opens possibility to remove HUGETLB_CGROUP_MIN_ORDER limitation, since there's now space in first tail page to store struct hugetlb_cgroup pointer. But that's out of scope of the patch. That means page->compound_head shares storage space with: - page->lru.next; - page->next; - page->rcu_head.next; That's too long list to be absolutely sure, but looks like nobody uses bit 0 of the word. page->rcu_head.next guaranteed[1] to have bit 0 clean as long as we use call_rcu(), call_rcu_bh(), call_rcu_sched(), or call_srcu(). But future call_rcu_lazy() is not allowed as it makes use of the bit and we can get false positive PageTail(). [1] http://lkml.kernel.org/g/20150827163634.GD4029@linux.vnet.ibm.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Christoph Lameter <cl@linux.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06mm: pack compound_dtor and compound_order into one word in struct pageKirill A. Shutemov
The patch halves space occupied by compound_dtor and compound_order in struct page. For compound_order, it's trivial long -> short conversion. For get_compound_page_dtor(), we now use hardcoded table for destructor lookup and store its index in the struct page instead of direct pointer to destructor. It shouldn't be a big trouble to maintain the table: we have only two destructor and NULL currently. This patch free up one word in tail pages for reuse. This is preparation for the next patch. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06zsmalloc: use page->private instead of page->first_pageKirill A. Shutemov
We are going to rework how compound_head() work. It will not use page->first_page as we have it now. The only other user of page->first_page beyond compound pages is zsmalloc. Let's use page->private instead of page->first_page here. It occupies the same storage space. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06slab, slub: use page->rcu_head instead of page->lru plus castKirill A. Shutemov
We have properly typed page->rcu_head, no need to cast page->lru. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06mm: drop page->slab_pageKirill A. Shutemov
Since 8456a648cf44 ("slab: use struct page for slab management") nobody uses slab_page field in struct page. Let's drop it. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Christoph Lameter <cl@linux.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Hugh Dickins <hughd@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06zsmalloc: reduce size_class memory usageSergey Senozhatsky
Each `struct size_class' contains `struct zs_size_stat': an array of NR_ZS_STAT_TYPE `unsigned long'. For zsmalloc built with no CONFIG_ZSMALLOC_STAT this results in a waste of `2 * sizeof(unsigned long)' per-class. The patch removes unneeded `struct zs_size_stat' members by redefining NR_ZS_STAT_TYPE (max stat idx in array). Since both NR_ZS_STAT_TYPE and zs_stat_type are compile time constants, GCC can eliminate zs_stat_inc()/zs_stat_dec() calls that use zs_stat_type larger than NR_ZS_STAT_TYPE: CLASS_ALMOST_EMPTY and CLASS_ALMOST_FULL at the moment. ./scripts/bloat-o-meter mm/zsmalloc.o.old mm/zsmalloc.o.new add/remove: 0/0 grow/shrink: 0/3 up/down: 0/-39 (-39) function old new delta fix_fullness_group 97 94 -3 insert_zspage 100 86 -14 remove_zspage 141 119 -22 To summarize: a) each class now uses less memory b) we avoid a number of dec/inc stats (a minor optimization, but still). The gain will increase once we introduce additional stats. A simple IO test. iozone -t 4 -R -r 32K -s 60M -I +Z patched base " Initial write " 4145599.06 4127509.75 " Rewrite " 4146225.94 4223618.50 " Read " 17157606.00 17211329.50 " Re-read " 17380428.00 17267650.50 " Reverse Read " 16742768.00 16162732.75 " Stride read " 16586245.75 16073934.25 " Random read " 16349587.50 15799401.75 " Mixed workload " 10344230.62 9775551.50 " Random write " 4277700.62 4260019.69 " Pwrite " 4302049.12 4313703.88 " Pread " 6164463.16 6126536.72 " Fwrite " 7131195.00 6952586.00 " Fread " 12682602.25 12619207.50 Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06mm/zsmalloc.c: remove useless line in obj_free()Hui Zhu
Signed-off-by: Hui Zhu <zhuhui@xiaomi.com> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06zsmalloc: don't test shrinker_enabled in zs_shrinker_count()Sergey Senozhatsky
We don't let user to disable shrinker in zsmalloc (once it's been enabled), so no need to check ->shrinker_enabled in zs_shrinker_count(), at the moment at least. Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06zsmalloc: use preempt.h for in_interrupt()Sergey Senozhatsky
A cosmetic change. Commit c60369f01125 ("staging: zsmalloc: prevent mappping in interrupt context") added in_interrupt() check to zs_map_object() and 'hardirq.h' include; but in_interrupt() macro is defined in 'preempt.h' not in 'hardirq.h', so include it instead. Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06zsmalloc: fix obj_to_head use page_private(page) as value but not pointerHui Zhu
In obj_malloc(): if (!class->huge) /* record handle in the header of allocated chunk */ link->handle = handle; else /* record handle in first_page->private */ set_page_private(first_page, handle); In the hugepage we save handle to private directly. But in obj_to_head(): if (class->huge) { VM_BUG_ON(!is_first_page(page)); return *(unsigned long *)page_private(page); } else return *(unsigned long *)obj; It is used as a pointer. The reason why there is no problem until now is huge-class page is born with ZS_FULL so it can't be migrated. However, we need this patch for future work: "VM-aware zsmalloced page migration" to reduce external fragmentation. Signed-off-by: Hui Zhu <zhuhui@xiaomi.com> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06zsmalloc: add comments for ->inuse to zspageHui Zhu
[akpm@linux-foundation.org: fix grammar] Signed-off-by: Hui Zhu <zhuhui@xiaomi.com> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06mm: zsmalloc: constify struct zs_pool nameSergey SENOZHATSKY
Constify `struct zs_pool' ->name. [akpm@inux-foundation.org: constify zpool_create_pool()'s `type' arg also] Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Dan Streetman <ddstreet@ieee.org> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06zpool: remove redundant zpool->type string, const-ify zpool_get_typeDan Streetman
Make the return type of zpool_get_type const; the string belongs to the zpool driver and should not be modified. Remove the redundant type field in the struct zpool; it is private to zpool.c and isn't needed since ->driver->type can be used directly. Add comments indicating strings must be null-terminated. Signed-off-by: Dan Streetman <ddstreet@ieee.org> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Seth Jennings <sjennings@variantweb.net> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06zswap: use charp for zswap param stringsDan Streetman
Instead of using a fixed-length string for the zswap params, use charp. This simplifies the code and uses less memory, as most zswap param strings will be less than the current maximum length. Signed-off-by: Dan Streetman <ddstreet@ieee.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Seth Jennings <sjennings@variantweb.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06module: export param_free_charp()Dan Streetman
Change the param_free_charp() function from static to exported. It is used by zswap in the next patch ("zswap: use charp for zswap param strings"). Signed-off-by: Dan Streetman <ddstreet@ieee.org> Acked-by: Rusty Russell <rusty@rustcorp.com.au> Cc: Seth Jennings <sjennings@variantweb.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06mm/zswap.c: remove unneeded initialization to NULL in zswap_entry_find_get()Alexey Klimov
On the next line entry variable will be re-initialized so no need to init it with NULL. Signed-off-by: Alexey Klimov <alexey.klimov@linaro.org> Cc: Seth Jennings <sjennings@variantweb.net> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06zram: make is_partial_io/valid_io_request/page_zero_filled return booleanGeliang Tang
Make is_partial_io()/valid_io_request()/page_zero_filled() return boolean, since each function only uses either one or zero as its return value. Signed-off-by: Geliang Tang <geliangtang@163.com> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06zram: keep the exact overcommited value in mem_used_maxSergey SENOZHATSKY
`mem_used_max' is designed to store the max amount of memory zram consumed to store the data. However, it does not represent the actual 'overcommited' (max) value. The existing code goes to -ENOMEM overcommited case before it updates `->stats.max_used_pages', which hides the reason we went to -ENOMEM in the first place -- we actually used more memory than `->limit_pages': alloced_pages = zs_get_total_pages(meta->mem_pool); if (zram->limit_pages && alloced_pages > zram->limit_pages) { zs_free(meta->mem_pool, handle); ret = -ENOMEM; goto out; } update_used_max(zram, alloced_pages); Which is misleading. User will see -ENOMEM, check `->limit_pages', check `->stats.max_used_pages', which will keep the value BEFORE zram passed `->limit_pages', and see: `->stats.max_used_pages' < `->limit_pages' Move update_used_max() before we do `->limit_pages' check, so that user will see: `->stats.max_used_pages' > `->limit_pages' should the overcommit and -ENOMEM happen. Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06zram: introduce comp algorithm fallback functionalityLuis Henriques
When the user supplies an unsupported compression algorithm, keep the previously selected one (knowingly supported) or the default one (if the compression algorithm hasn't been changed yet). Note that previously this operation (i.e. setting an invalid algorithm) would result in no algorithm being selected, which means that this represents a small change in the default behaviour. Minchan said: For initializing zram, we need to set up 3 optional parameters in advance. 1. the number of compression streams 2. memory limitation 3. compression algorithm Although user pass completely wrong value to set up for 1 and 2 parameters, it's okay because they have default value so zram will be initialized with the default value (of course, when user passes a wrong value via *echo*, sysfs returns -EINVAL so the user can notice it). But 3 is not consistent with other optional parameters. IOW, if the user passes a wrong value to set up 3 parameter, zram's initialization would fail unlike other optional parameters. So this patch makes them consistent. Signed-off-by: Luis Henriques <luis.henriques@canonical.com> Acked-by: Minchan Kim <minchan@kernel.org> Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06mm/memcontrol.c: uninline mem_cgroup_usageAndrew Morton
gcc version 5.2.1 20151010 (Debian 5.2.1-22) $ size mm/memcontrol.o mm/memcontrol.o.before text data bss dec hex filename 35535 7908 64 43507 a9f3 mm/memcontrol.o 35762 7908 64 43734 aad6 mm/memcontrol.o.before Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06fs/sync.c: make sync_file_range(2) use WB_SYNC_NONE writebackJan Kara
sync_file_range(2) is documented to issue writeback only for pages that are not currently being written. After all the system call has been created for userspace to be able to issue background writeout and so waiting for in-flight IO is undesirable there. However commit ee53a891f474 ("mm: do_sync_mapping_range integrity fix") switched do_sync_mapping_range() and thus sync_file_range() to issue writeback in WB_SYNC_ALL mode since do_sync_mapping_range() was used by other code relying on WB_SYNC_ALL semantics. These days do_sync_mapping_range() went away and we can switch sync_file_range(2) back to issuing WB_SYNC_NONE writeback. That should help PostgreSQL avoid large latency spikes when flushing data in the background. Andres measured a 20% increase in transactions per second on an SSD disk. Signed-off-by: Jan Kara <jack@suse.com> Reported-by: Andres Freund <andres@anarazel.de> Tested-By: Andres Freund <andres@anarazel.de> Cc: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06thp: remove unused vma parameter from khugepaged_alloc_pageAaron Tomlin
The "vma" parameter to khugepaged_alloc_page() is unused. It has to remain unused or the drop read lock 'map_sem' optimisation introduce by commit 8b1645685acf ("mm, THP: don't hold mmap_sem in khugepaged when allocating THP") wouldn't be safe. So let's remove it. Signed-off-by: Aaron Tomlin <atomlin@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06mm, fs: introduce mapping_gfp_constraint()Michal Hocko
There are many places which use mapping_gfp_mask to restrict a more generic gfp mask which would be used for allocations which are not directly related to the page cache but they are performed in the same context. Let's introduce a helper function which makes the restriction explicit and easier to track. This patch doesn't introduce any functional changes. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Michal Hocko <mhocko@suse.com> Suggested-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06include/linux/mmzone.h: reflow commentAndrew Morton
Someone has an 86 column display. Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06mm: page_alloc: hide some GFP internals and document the bits and flag ↵Mel Gorman
combinations Andrew stated the following We have quite a history of remote parts of the kernel using weird/wrong/inexplicable combinations of __GFP_ flags. I tend to think that this is because we didn't adequately explain the interface. And I don't think that gfp.h really improved much in this area as a result of this patchset. Could you go through it some time and decide if we've adequately documented all this stuff? This patches first moves some GFP flag combinations that are part of the MM internals to mm/internal.h. The rest of the patch documents the __GFP_FOO bits under various headings and then documents the flag combinations. It will not help callers that are brain damaged but the clarity might motivate some fixes and avoid future mistakes. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Rik van Riel <riel@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vitaly Wool <vitalywool@gmail.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06mm, page_alloc: only enforce watermarks for order-0 allocationsMel Gorman
The primary purpose of watermarks is to ensure that reclaim can always make forward progress in PF_MEMALLOC context (kswapd and direct reclaim). These assume that order-0 allocations are all that is necessary for forward progress. High-order watermarks serve a different purpose. Kswapd had no high-order awareness before they were introduced (https://lkml.kernel.org/r/413AA7B2.4000907@yahoo.com.au). This was particularly important when there were high-order atomic requests. The watermarks both gave kswapd awareness and made a reserve for those atomic requests. There are two important side-effects of this. The most important is that a non-atomic high-order request can fail even though free pages are available and the order-0 watermarks are ok. The second is that high-order watermark checks are expensive as the free list counts up to the requested order must be examined. With the introduction of MIGRATE_HIGHATOMIC it is no longer necessary to have high-order watermarks. Kswapd and compaction still need high-order awareness which is handled by checking that at least one suitable high-order page is free. With the patch applied, there was little difference in the allocation failure rates as the atomic reserves are small relative to the number of allocation attempts. The expected impact is that there will never be an allocation failure report that shows suitable pages on the free lists. The one potential side-effect of this is that in a vanilla kernel, the watermark checks may have kept a free page for an atomic allocation. Now, we are 100% relying on the HighAtomic reserves and an early allocation to have allocated them. If the first high-order atomic allocation is after the system is already heavily fragmented then it'll fail. [akpm@linux-foundation.org: simplify __zone_watermark_ok(), per Vlastimil] Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Vitaly Wool <vitalywool@gmail.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06mm, page_alloc: reserve pageblocks for high-order atomic allocations on demandMel Gorman
High-order watermark checking exists for two reasons -- kswapd high-order awareness and protection for high-order atomic requests. Historically the kernel depended on MIGRATE_RESERVE to preserve min_free_kbytes as high-order free pages for as long as possible. This patch introduces MIGRATE_HIGHATOMIC that reserves pageblocks for high-order atomic allocations on demand and avoids using those blocks for order-0 allocations. This is more flexible and reliable than MIGRATE_RESERVE was. A MIGRATE_HIGHORDER pageblock is created when an atomic high-order allocation request steals a pageblock but limits the total number to 1% of the zone. Callers that speculatively abuse atomic allocations for long-lived high-order allocations to access the reserve will quickly fail. Note that SLUB is currently not such an abuser as it reclaims at least once. It is possible that the pageblock stolen has few suitable high-order pages and will need to steal again in the near future but there would need to be strong justification to search all pageblocks for an ideal candidate. The pageblocks are unreserved if an allocation fails after a direct reclaim attempt. The watermark checks account for the reserved pageblocks when the allocation request is not a high-order atomic allocation. The reserved pageblocks can not be used for order-0 allocations. This may allow temporary wastage until a failed reclaim reassigns the pageblock. This is deliberate as the intent of the reservation is to satisfy a limited number of atomic high-order short-lived requests if the system requires them. The stutter benchmark was used to evaluate this but while it was running there was a systemtap script that randomly allocated between 1 high-order page and 12.5% of memory's worth of order-3 pages using GFP_ATOMIC. This is much larger than the potential reserve and it does not attempt to be realistic. It is intended to stress random high-order allocations from an unknown source, show that there is a reduction in failures without introducing an anomaly where atomic allocations are more reliable than regular allocations. The amount of memory reserved varied throughout the workload as reserves were created and reclaimed under memory pressure. The allocation failures once the workload warmed up were as follows; 4.2-rc5-vanilla 70% 4.2-rc5-atomic-reserve 56% The failure rate was also measured while building multiple kernels. The failure rate was 14% but is 6% with this patch applied. Overall, this is a small reduction but the reserves are small relative to the number of allocation requests. In early versions of the patch, the failure rate reduced by a much larger amount but that required much larger reserves and perversely made atomic allocations seem more reliable than regular allocations. [yalin.wang2010@gmail.com: fix redundant check and a memory leak] Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Vitaly Wool <vitalywool@gmail.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: yalin wang <yalin.wang2010@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06mm, page_alloc: remove MIGRATE_RESERVEMel Gorman
MIGRATE_RESERVE preserves an old property of the buddy allocator that existed prior to fragmentation avoidance -- min_free_kbytes worth of pages tended to remain contiguous until the only alternative was to fail the allocation. At the time it was discovered that high-order atomic allocations relied on this property so MIGRATE_RESERVE was introduced. A later patch will introduce an alternative MIGRATE_HIGHATOMIC so this patch deletes MIGRATE_RESERVE and supporting code so it'll be easier to review. Note that this patch in isolation may look like a false regression if someone was bisecting high-order atomic allocation failures. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Vitaly Wool <vitalywool@gmail.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06mm, page_alloc: delete the zonelist_cacheMel Gorman
The zonelist cache (zlc) was introduced to skip over zones that were recently known to be full. This avoided expensive operations such as the cpuset checks, watermark calculations and zone_reclaim. The situation today is different and the complexity of zlc is harder to justify. 1) The cpuset checks are no-ops unless a cpuset is active and in general are a lot cheaper. 2) zone_reclaim is now disabled by default and I suspect that was a large source of the cost that zlc wanted to avoid. When it is enabled, it's known to be a major source of stalling when nodes fill up and it's unwise to hit every other user with the overhead. 3) Watermark checks are expensive to calculate for high-order allocation requests. Later patches in this series will reduce the cost of the watermark checking. 4) The most important issue is that in the current implementation it is possible for a failed THP allocation to mark a zone full for order-0 allocations and cause a fallback to remote nodes. The last issue could be addressed with additional complexity but as the benefit of zlc is questionable, it is better to remove it. If stalls due to zone_reclaim are ever reported then an alternative would be to introduce deferring logic based on a timeout inside zone_reclaim itself and leave the page allocator fast paths alone. The impact on page-allocator microbenchmarks is negligible as they don't hit the paths where the zlc comes into play. Most page-reclaim related workloads showed no noticeable difference as a result of the removal. The impact was noticeable in a workload called "stutter". One part uses a lot of anonymous memory, a second measures mmap latency and a third copies a large file. In an ideal world the latency application would not notice the mmap latency. On a 2-node machine the results of this patch are stutter 4.3.0-rc1 4.3.0-rc1 baseline nozlc-v4 Min mmap 20.9243 ( 0.00%) 20.7716 ( 0.73%) 1st-qrtle mmap 22.0612 ( 0.00%) 22.0680 ( -0.03%) 2nd-qrtle mmap 22.3291 ( 0.00%) 22.3809 ( -0.23%) 3rd-qrtle mmap 25.2244 ( 0.00%) 25.2396 ( -0.06%) Max-90% mmap 48.0995 ( 0.00%) 28.3713 ( 41.02%) Max-93% mmap 52.5557 ( 0.00%) 36.0170 ( 31.47%) Max-95% mmap 55.8173 ( 0.00%) 47.3163 ( 15.23%) Max-99% mmap 67.3781 ( 0.00%) 70.1140 ( -4.06%) Max mmap 24447.6375 ( 0.00%) 12915.1356 ( 47.17%) Mean mmap 33.7883 ( 0.00%) 27.7944 ( 17.74%) Best99%Mean mmap 27.7825 ( 0.00%) 25.2767 ( 9.02%) Best95%Mean mmap 26.3912 ( 0.00%) 23.7994 ( 9.82%) Best90%Mean mmap 24.9886 ( 0.00%) 23.2251 ( 7.06%) Best50%Mean mmap 22.0157 ( 0.00%) 22.0261 ( -0.05%) Best10%Mean mmap 21.6705 ( 0.00%) 21.6083 ( 0.29%) Best5%Mean mmap 21.5581 ( 0.00%) 21.4611 ( 0.45%) Best1%Mean mmap 21.3079 ( 0.00%) 21.1631 ( 0.68%) Note that the maximum stall latency went from 24 seconds to 12 which is still bad but an improvement. The milage varies considerably 2-node machine on an earlier test went from 494 seconds to 47 seconds and a 4-node machine that tested an earlier version of this patch went from a worst case stall time of 6 seconds to 67ms. The nature of the benchmark is inherently unpredictable as it is hammering the system and the milage will vary between machines. There is a secondary impact with potentially more direct reclaim because zones are now being considered instead of being skipped by zlc. In this particular test run it did not occur so will not be described. However, in at least one test the following was observed 1. Direct reclaim rates were higher. This was likely due to direct reclaim being entered instead of the zlc disabling a zone and busy looping. Busy looping may have the effect of allowing kswapd to make more progress and in some cases may be better overall. If this is found then the correct action is to put direct reclaimers to sleep on a waitqueue and allow kswapd make forward progress. Busy looping on the zlc is even worse than when the allocator used to blindly call congestion_wait(). 2. There was higher swap activity as direct reclaim was active. 3. Direct reclaim efficiency was lower. This is related to 1 as more scanning activity also encountered more pages that could not be immediately reclaimed In that case, the direct page scan and reclaim rates are noticeable but it is not considered a problem for a few reasons 1. The test is primarily concerned with latency. The mmap attempts are also faulted which means there are THP allocation requests. The ZLC could cause zones to be disabled causing the process to busy loop instead of reclaiming. This looks like elevated direct reclaim activity but it's the correct action to take based on what processes requested. 2. The test hammers reclaim and compaction heavily. The number of successful THP faults is highly variable but affects the reclaim stats. It's not a realistic or reasonable measure of page reclaim activity. 3. No other page-reclaim intensive workload that was tested showed a problem. 4. If a workload is identified that benefitted from the busy looping then it should be fixed by having direct reclaimers sleep on a wait queue until woken by kswapd instead of busy looping. We had this class of problem before when congestion_waits() with a fixed timeout was a brain damaged decision but happened to benefit some workloads. If a workload is identified that relied on the zlc to busy loop then it should be fixed correctly and have a direct reclaimer sleep on a waitqueue until woken by kswapd. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Christoph Lameter <cl@linux.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Vitaly Wool <vitalywool@gmail.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06mm, page_alloc: rename __GFP_WAIT to __GFP_RECLAIMMel Gorman
__GFP_WAIT was used to signal that the caller was in atomic context and could not sleep. Now it is possible to distinguish between true atomic context and callers that are not willing to sleep. The latter should clear __GFP_DIRECT_RECLAIM so kswapd will still wake. As clearing __GFP_WAIT behaves differently, there is a risk that people will clear the wrong flags. This patch renames __GFP_WAIT to __GFP_RECLAIM to clearly indicate what it does -- setting it allows all reclaim activity, clearing them prevents it. [akpm@linux-foundation.org: fix build] [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Lameter <cl@linux.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Vitaly Wool <vitalywool@gmail.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06mm: page_alloc: remove GFP_IOFSMel Gorman
GFP_IOFS was intended to be shorthand for clearing two flags, not a set of allocation flags. There is only one user of this flag combination now and there appears to be no reason why Lustre had to be protected from reclaim stalls. As none of the sites appear to be atomic, this patch simply deletes GFP_IOFS and converts Lustre to using GFP_KERNEL, GFP_NOFS or GFP_NOIO as appropriate. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Cc: Oleg Drokin <oleg.drokin@intel.com> Cc: Andreas Dilger <andreas.dilger@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06mm, page_alloc: distinguish between being unable to sleep, unwilling to ↵Mel Gorman
sleep and avoiding waking kswapd __GFP_WAIT has been used to identify atomic context in callers that hold spinlocks or are in interrupts. They are expected to be high priority and have access one of two watermarks lower than "min" which can be referred to as the "atomic reserve". __GFP_HIGH users get access to the first lower watermark and can be called the "high priority reserve". Over time, callers had a requirement to not block when fallback options were available. Some have abused __GFP_WAIT leading to a situation where an optimisitic allocation with a fallback option can access atomic reserves. This patch uses __GFP_ATOMIC to identify callers that are truely atomic, cannot sleep and have no alternative. High priority users continue to use __GFP_HIGH. __GFP_DIRECT_RECLAIM identifies callers that can sleep and are willing to enter direct reclaim. __GFP_KSWAPD_RECLAIM to identify callers that want to wake kswapd for background reclaim. __GFP_WAIT is redefined as a caller that is willing to enter direct reclaim and wake kswapd for background reclaim. This patch then converts a number of sites o __GFP_ATOMIC is used by callers that are high priority and have memory pools for those requests. GFP_ATOMIC uses this flag. o Callers that have a limited mempool to guarantee forward progress clear __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall into this category where kswapd will still be woken but atomic reserves are not used as there is a one-entry mempool to guarantee progress. o Callers that are checking if they are non-blocking should use the helper gfpflags_allow_blocking() where possible. This is because checking for __GFP_WAIT as was done historically now can trigger false positives. Some exceptions like dm-crypt.c exist where the code intent is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to flag manipulations. o Callers that built their own GFP flags instead of starting with GFP_KERNEL and friends now also need to specify __GFP_KSWAPD_RECLAIM. The first key hazard to watch out for is callers that removed __GFP_WAIT and was depending on access to atomic reserves for inconspicuous reasons. In some cases it may be appropriate for them to use __GFP_HIGH. The second key hazard is callers that assembled their own combination of GFP flags instead of starting with something like GFP_KERNEL. They may now wish to specify __GFP_KSWAPD_RECLAIM. It's almost certainly harmless if it's missed in most cases as other activity will wake kswapd. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Vitaly Wool <vitalywool@gmail.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06mm, page_alloc: use masks and shifts when converting GFP flags to migrate typesMel Gorman
This patch redefines which GFP bits are used for specifying mobility and the order of the migrate types. Once redefined it's possible to convert GFP flags to a migrate type with a simple mask and shift. The only downside is that readers of OOM kill messages and allocation failures may have been used to the existing values but scripts/gfp-translate will help. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Vitaly Wool <vitalywool@gmail.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06mm, page_alloc: remove unnecessary taking of a seqlock when cpusets are disabledMel Gorman
There is a seqcounter that protects against spurious allocation failures when a task is changing the allowed nodes in a cpuset. There is no need to check the seqcounter until a cpuset exists. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Christoph Lameter <cl@linux.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Vitaly Wool <vitalywool@gmail.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06mm, page_alloc: remove unnecessary recalculations for dirty zone balancingMel Gorman
File-backed pages that will be immediately written are balanced between zones. This heuristic tries to avoid having a single zone filled with recently dirtied pages but the checks are unnecessarily expensive. Move consider_zone_balanced into the alloc_context instead of checking bitmaps multiple times. The patch also gives the parameter a more meaningful name. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Lameter <cl@linux.com> Cc: Vitaly Wool <vitalywool@gmail.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>