Age | Commit message | Author
2016-05-23 | ipc, shm: make shmem attach/detach wait for mmap_sem killable | Michal Hocko
shmat and shmdt rely on mmap_sem for write. If the waiting task gets killed by the oom killer it would block oom_reaper from asynchronous address space reclaim and reduce the chances of timely OOM resolving. Wait for the lock in the killable mode and return with EINTR if the task got killed while waiting. Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Davidlohr Bueso <dave@stgolabs.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-23 | mm, fork: make dup_mmap wait for mmap_sem for write killable | Michal Hocko
dup_mmap needs to lock current's mm mmap_sem for write. If the waiting task gets killed by the oom killer it would block oom_reaper from asynchronous address space reclaim and reduce the chances of timely OOM resolving. Wait for the lock in the killable mode and return with EINTR if the task got killed while waiting. Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-23 | mm, proc: make clear_refs killable | Michal Hocko
CLEAR_REFS_MM_HIWATER_RSS and CLEAR_REFS_SOFT_DIRTY are relying on mmap_sem for write. If the waiting task gets killed by the oom killer and it would operate on the current's mm it would block oom_reaper from asynchronous address space reclaim and reduce the chances of timely OOM resolving. Wait for the lock in the killable mode and return with EINTR if the task got killed while waiting. This will also expedite the return to the userspace and do_exit even if the mm is remote. Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Petr Cermak <petrcermak@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-23 | mm: make vm_brk killable | Michal Hocko
Now that all the callers handle vm_brk failure, we can change it to wait for mmap_sem in killable mode, so that oom_reaper does not get blocked just because vm_brk is stuck behind mmap_sem readers. Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-23 | mm, elf: handle vm_brk error | Michal Hocko
load_elf_library doesn't handle vm_brk failure, although nothing really indicates it cannot do that, because the function is already allowed to fail due to vm_mmap failures. This might not be a problem now, but a later patch will make vm_brk killable (resp. mmap_sem for write waiting will become killable), and so the failure will be more probable. Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-23 | mm, aout: handle vm_brk failures | Michal Hocko
vm_brk is allowed to fail but load_aout_binary simply ignores the error and happily continues. I haven't noticed any problem from that in real life, but later patches will make the failure more likely because vm_brk will become killable (resp. mmap_sem for write waiting will become killable), so we should be more careful now. The error handling should be quite straightforward because there are calls to vm_mmap which check the error properly already. The only notable exception is set_brk, which is called after the beyond_if label. But nothing indicates that we cannot move it above set_binfmt, as the two do not depend on each other, and fail before we do set_binfmt and alter reference counting. Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-23 | mm: make vm_munmap killable | Michal Hocko
Almost all current users of vm_munmap ignore the return value and so do not handle a potential error. This means that some VMAs might stay behind. This patch doesn't try to solve those potential problems. Quite the contrary, it adds a new failure mode by using down_write_killable in vm_munmap. This should be safer than other failure modes, though, because the process is guaranteed to die as soon as it leaves the kernel and exit_mmap will clean the whole address space. This will help in OOM conditions when the oom victim might be stuck waiting for the mmap_sem for write, which in turn can block oom_reaper, which relies on the mmap_sem for read to make forward progress and reclaim the address space of the victim. Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-23 | mm: make vm_mmap killable | Michal Hocko
All the callers of vm_mmap seem to check for the failure already and bail out in one way or another on the error, which means that we can change it to use the killable version of vm_mmap_pgoff and return -EINTR if the current task gets killed while waiting for mmap_sem. This also means that vm_mmap_pgoff can be killable by default and drop the additional parameter. This will help in OOM conditions when the oom victim might be stuck waiting for the mmap_sem for write, which in turn can block oom_reaper, which relies on the mmap_sem for read to make forward progress and reclaim the address space of the victim. Please note that load_elf_binary ignores the vm_mmap error for the current->personality & MMAP_PAGE_ZERO case, but that shouldn't be a problem because the address is not used anywhere and we never return to userspace if we got killed. Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-23 | mm: make mmap_sem for write waits killable for mm syscalls | Michal Hocko
This is follow-up work for oom_reaper [1]. As the async OOM killing depends on mmap_sem for read, we would really appreciate it if a holder for write didn't stand in the way. This patchset changes many of the down_write calls to be killable, to help those cases when the writer is blocked waiting for readers to release the lock, and so help __oom_reap_task to process the oom victim. Most of the patches are really trivial because the lock is held from shallow syscall paths where we can return EINTR trivially and allow the current task to die (note that EINTR will never get to userspace as the task has a fatal signal pending). Others seem to be easy as well, as the callers already handle fatal errors, bail and return to userspace, which should be sufficient to handle the failure gracefully. I am not familiar with all those code paths, so a deeper review is really appreciated. As this work touches several areas which are not directly connected, I have tried to keep the CC list as small as possible, and people who I believed would be familiar are CCed only on the specific patches (all should have received the cover though). This patchset is based on linux-next and it depends on down_write_killable for rw_semaphores, which got merged into the tip locking/rwsem branch and is merged into this next tree. I guess it would be easiest to route these patches via mmotm because of the dependency on the tip tree, but if the respective maintainers prefer another way I have no objections. I haven't covered all the down_write(mm->mmap_sem) instances here:
    $ git grep "down_write(.*\<mmap_sem\>)" next/master | wc -l
    98
    $ git grep "down_write(.*\<mmap_sem\>)" | wc -l
    62
I have tried to cover those which should be relatively easy to review in this series, because this alone should be a nice improvement. Other places can be changed on top.
[0] http://lkml.kernel.org/r/1456752417-9626-1-git-send-email-mhocko@kernel.org
[1] http://lkml.kernel.org/r/1452094975-551-1-git-send-email-mhocko@kernel.org
[2] http://lkml.kernel.org/r/1456750705-7141-1-git-send-email-mhocko@kernel.org
This patch (of 18): This is the first step in making mmap_sem write waiters killable. It focuses on the trivial ones, which take the lock early after entering the syscall and do not change any state before that. Therefore it is very easy to change them to use down_write_killable and immediately return with -EINTR. This will allow the waiter to pass away without blocking the mmap_sem, which might be required to make forward progress. E.g. the oom reaper will need the lock for reading to dismantle the OOM victim's address space. The only tricky function in this patch is vm_mmap_pgoff, which has many call sites via vm_mmap. To reduce the risk, keep vm_mmap with the original non-killable semantic for now. vm_munmap callers do not bother checking the return value, so open code it into the munmap syscall path for now for simplicity. Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@suse.de> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Hugh Dickins <hughd@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
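The killable-wait pattern used throughout this series can be modelled in userspace to make the control flow concrete. The following is a minimal Python sketch, not kernel code: it assumes a plain `threading.Lock` in place of the rw_semaphore (so only the writer side is modelled), and `FatalSignal`, `down_write_killable` and `do_syscall` are hypothetical stand-ins for `fatal_signal_pending()`, the kernel primitive of the same name, and a shallow syscall path.

```python
import threading

EINTR = 4  # errno value of EINTR on Linux


class FatalSignal:
    """Hypothetical stand-in for fatal_signal_pending(current)."""

    def __init__(self):
        self._event = threading.Event()

    def kill(self):
        self._event.set()

    def pending(self):
        return self._event.is_set()


def down_write_killable(lock, fatal, poll_interval=0.01):
    """Model: take the lock, or give up with -EINTR once the task is killed."""
    # Fast path: an uncontended lock is taken without looking at signals.
    if lock.acquire(blocking=False):
        return 0
    # Slow path: sleep in short slices, re-checking for a fatal signal.
    while not fatal.pending():
        if lock.acquire(timeout=poll_interval):
            return 0
    return -EINTR


def do_syscall(lock, fatal):
    # As in the trivial cases above: take the lock before changing any state,
    # so bailing out with -EINTR needs no cleanup.
    ret = down_write_killable(lock, fatal)
    if ret:
        return ret
    try:
        return 0  # ... the address space would be modified here ...
    finally:
        lock.release()
```

A killed waiter returns instead of sleeping on the lock forever, which is what lets the real lock's readers (such as the oom reaper) make progress. Polling with a timeout is only a modelling convenience; the kernel wakes the sleeper when the signal arrives.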
2016-05-23 | MAINTAINERS: add co-maintainer for scripts/gdb | Kieran Bingham
Add myself as a co-maintainer for scripts/gdb, supporting Jan Kiszka. Link: http://lkml.kernel.org/r/fb5d34ce563f33d2f324f26f592b24ded30032ee.1462865983.git.jan.kiszka@siemens.com Signed-off-by: Kieran Bingham <kieran@bingham.xyz> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-23 | scripts/gdb: decode bytestream on dmesg for Python3 | Kieran Bingham
The recent fixes to lx-dmesg now allow the command to print successfully on Python3; however, the Python interpreter wraps the bytes for each line in a b'<text>' marker. To remove this, we need to decode the line, where .decode() will default to 'UTF-8'. Link: http://lkml.kernel.org/r/d67ccf93f2479c94cb3399262b9b796e0dbefcf2.1462865983.git.jan.kiszka@siemens.com Signed-off-by: Kieran Bingham <kieran@bingham.xyz> Acked-by: Dom Cote <buzdelabuz2@gmail.com> Tested-by: Dom Cote <buzdelabuz2@gmail.com> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
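The effect of decoding can be seen outside gdb with plain Python 3 objects. This is a minimal sketch; `dmesg_text` is a hypothetical helper, not the function in scripts/gdb, and the `errors="replace"` argument is an extra assumption added to tolerate stray non-UTF-8 bytes.

```python
def dmesg_text(raw):
    """Return printable text for one log line read from target memory."""
    if isinstance(raw, bytes):
        # On Python 3, bytes.decode() defaults to UTF-8.
        return raw.decode("utf-8", errors="replace")
    return raw


line = b"[    0.000000] Booting Linux"
print(str(line))         # b'[    0.000000] Booting Linux'  (the unwanted marker)
print(dmesg_text(line))  # [    0.000000] Booting Linux
```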
2016-05-23 | scripts/gdb: fix issue with dmesg.py and python 3.X | Dom Cote
When built against Python 3, GDB differs in the return type for its read_memory function, causing the lx-dmesg command to fail. Now that we have an improved read_u16() we can use the new read_memoryview() abstraction to make lx-dmesg return valid data on both current Python APIs. Tested with python 3.4 and 2.7. Tested with gdb 7.7. Link: http://lkml.kernel.org/r/28477b727ff7fe3101fd4e426060e8a68317a639.1462865983.git.jan.kiszka@siemens.com Signed-off-by: Dom Cote <buzdelabuz2+git@gmail.com> [kieran@bingham.xyz: Adjusted commit log to better reflect code changes] Tested-by: Kieran Bingham <kieran@bingham.xyz> (Py2.7,Py3.4,GDB10) Signed-off-by: Kieran Bingham <kieran@bingham.xyz> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-23 | scripts/gdb: improve types abstraction for gdb python scripts | Dom Cote
Change the read_u16 function so it accepts both 'str' and 'bytes' as the type of its argument. When calling read_memory() from the gdb API, depending on whether gdb was built with 2.7 or 3.X, the format used to return the data will differ ('str' for 2.7, and 'bytes' for 3.X). Add a function read_memoryview() to be able to get a 'memoryview' object back from read_memory() with both python 2.7 and 3.X. Tested with python 3.4 and 2.7. Tested with gdb 7.7. Link: http://lkml.kernel.org/r/73621f564503137a002a639d174e4fb35f73f462.1462865983.git.jan.kiszka@siemens.com Signed-off-by: Dom Cote <buzdelabuz2+git@gmail.com> Tested-by: Kieran Bingham <kieran@bingham.xyz> (Py2.7,Py3.4,GDB10) Signed-off-by: Kieran Bingham <kieran@bingham.xyz> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
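A minimal sketch of a read_u16() that accepts both element types, runnable outside gdb. It is an illustration, not the scripts/gdb source: the explicit `endianness` parameter is an assumption standing in for asking gdb about the target's byte order.

```python
LITTLE_ENDIAN = 0
BIG_ENDIAN = 1


def read_u16(buffer, endianness=LITTLE_ENDIAN):
    """Read an unsigned 16-bit value from the first two elements of buffer.

    Python 2's gdb hands back 'str' (elements are 1-character strings);
    Python 3's hands back 'bytes' (elements are ints). Normalise to ints.
    """
    value = [buffer[0], buffer[1]]
    if isinstance(value[0], str):
        value = [ord(v) for v in value]
    if endianness == LITTLE_ENDIAN:
        return value[0] + (value[1] << 8)
    return value[1] + (value[0] << 8)
```

Because a `memoryview` over bytes also yields ints on Python 3, the same helper works on the object a read_memoryview()-style wrapper would return.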
2016-05-23 | scripts/gdb: add lx_thread_info_by_pid helper | Kieran Bingham
The tasks module already provides helpers to find the task struct by pid, and the thread_info by task struct; however, these are cumbersome to use on the gdb command line. Wrap the two lookups together in a single extra helper that allows exploring the thread info from a PID value. Link: http://lkml.kernel.org/r/dadc5667f053ec811eb3e3033d99d937fedbc93b.1462865983.git.jan.kiszka@siemens.com Signed-off-by: Kieran Bingham <kieran.bingham@linaro.org> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-23 | scripts/gdb: add documentation example for radix tree | Kieran Bingham
Provide a worked example for utilising the lx_radix_tree_lookup function Link: http://lkml.kernel.org/r/e786008ac5aec4b84198812805b326d718bdeb4b.1462865983.git.jan.kiszka@siemens.com Signed-off-by: Kieran Bingham <kieran.bingham@linaro.org> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-23 | scripts/gdb: add a Radix Tree Parser | Kieran Bingham
Linux makes use of the Radix Tree data structure to store pointers indexed by integer values. This structure is utilised across many structures in the kernel, including the IRQ descriptor tables and several filesystems. This module provides a method to look up values from a structure given its head node. Usage: The function lx_radix_tree_lookup must be given a symbol of type struct radix_tree_root, and an index into that tree. The object returned is a generic integer value, and must be cast correctly to the type based on the storage in the data structure. For example, to print the irq descriptor in the sparse irq_desc_tree at index 18, try the following:
(gdb) print (struct irq_desc)$lx_radix_tree_lookup(irq_desc_tree, 18)
Link: http://lkml.kernel.org/r/d2028c55e50cf95a9b7f8ca0d11885174b0cc709.1462865983.git.jan.kiszka@siemens.com Signed-off-by: Kieran Bingham <kieran.bingham@linaro.org> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
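To show how a lookup walks the tree, here is a simplified model using plain Python objects instead of gdb values. It is a sketch under stated assumptions: the `Node` class, its `shift` field and `radix_tree_insert` are hypothetical (the 2016 kernel encoded the height differently), and `RADIX_TREE_MAP_SHIFT = 6` is assumed as the common default. Each level consumes `RADIX_TREE_MAP_SHIFT` bits of the index, most significant first.

```python
RADIX_TREE_MAP_SHIFT = 6  # assumed; 6 is the usual kernel default
RADIX_TREE_MAP_SIZE = 1 << RADIX_TREE_MAP_SHIFT
RADIX_TREE_MAP_MASK = RADIX_TREE_MAP_SIZE - 1


class Node:
    """Hypothetical internal node: 'shift' index bits lie below this level."""

    def __init__(self, shift):
        self.shift = shift
        self.slots = [None] * RADIX_TREE_MAP_SIZE


def radix_tree_insert(root, index, item):
    # Helper for building a test tree; assumes index fits the tree height.
    node = root
    while node.shift > 0:
        offset = (index >> node.shift) & RADIX_TREE_MAP_MASK
        if node.slots[offset] is None:
            node.slots[offset] = Node(node.shift - RADIX_TREE_MAP_SHIFT)
        node = node.slots[offset]
    node.slots[index & RADIX_TREE_MAP_MASK] = item


def radix_tree_lookup(root, index):
    # Indices beyond what this tree height can address are simply absent.
    if index >> (root.shift + RADIX_TREE_MAP_SHIFT):
        return None
    node = root
    while isinstance(node, Node):
        offset = (index >> node.shift) & RADIX_TREE_MAP_MASK
        node = node.slots[offset]
    return node


root = Node(RADIX_TREE_MAP_SHIFT)  # two levels: indices 0..4095
radix_tree_insert(root, 18, "irq_desc 18")
print(radix_tree_lookup(root, 18))  # irq_desc 18
```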
2016-05-23 | scripts/gdb: cast CPU numbers to integer | Jan Kiszka
We won't see more than 2 billion CPUs any time soon, and having cpu_list return long makes the output of lx-cpus a bit ugly. Link: http://lkml.kernel.org/r/dcb45c3b0a59e0fd321fa56ff7aa398458c689b3.1462865983.git.jan.kiszka@siemens.com Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-23 | scripts/gdb: add cpu iterators | Kieran Bingham
The Linux kernel provides macros for iterating over values from the cpu_list masks. By providing some commonly used masks, we can mirror the kernel's helper macros with easy-to-use generators. Link: http://lkml.kernel.org/r/d045c6599771ada1999d49612ee30fd2f9acf17f.1462865983.git.jan.kiszka@siemens.com Signed-off-by: Kieran Bingham <kieran.bingham@linaro.org> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
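A generator that mirrors the kernel's for_each_cpu() style iteration can be sketched over a list of integers standing in for a cpumask's array of unsigned longs. This is an illustration only: `each_cpu` is a hypothetical name, and `BITS_PER_LONG = 64` assumes a 64-bit target; the real helpers read the mask words from gdb.

```python
BITS_PER_LONG = 64  # assumption: 64-bit target


def each_cpu(mask_words):
    """Yield the CPU numbers whose bits are set in a cpumask.

    mask_words models the cpumask's bits[] array of unsigned longs.
    """
    for word_index, word in enumerate(mask_words):
        bit = 0
        while word:
            if word & 1:
                yield word_index * BITS_PER_LONG + bit
            word >>= 1
            bit += 1


print(list(each_cpu([0b1011])))  # [0, 1, 3]
```

Using a generator keeps the gdb command line lazy: a caller can stop after the first CPU without decoding the whole mask.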
2016-05-23 | scripts/gdb: add mount point list command | Kieran Bingham
lx-mounts will identify current mount points based on the 'init_task' namespace by default, as we do not yet have a kernel thread list implementation to select the current running thread. Optionally, a user can specify a PID to list from that process' namespace Link: http://lkml.kernel.org/r/e614c7bc32d2350b4ff1627ec761a7148e65bfe6.1462865983.git.jan.kiszka@siemens.com Signed-off-by: Kieran Bingham <kieran.bingham@linaro.org> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-23 | scripts/gdb: add io resource readers | Kieran Bingham
Provide iomem_resource and ioports_resource printers and command hooks. It can be quite interesting to halt the kernel as it's booting and inspect this list as it is being populated. It should also be useful when a kernel is not booting, as you can identify which memory resources have been registered. Link: http://lkml.kernel.org/r/f0a6b9fa9c92af4d7ed2e7343ccc84150e9c6fc5.1462865983.git.jan.kiszka@siemens.com Signed-off-by: Kieran Bingham <kieran.bingham@linaro.org> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
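Kernel resources form a tree linked through `child` and `sibling` pointers, and a printer walks it depth-first, indenting each level in the style of /proc/iomem. The sketch below models that walk with plain Python objects; the `Resource` class and `format_resources` are hypothetical stand-ins, and the exact output format is an assumption rather than the command's verified output.

```python
class Resource:
    """Hypothetical stand-in for struct resource (name, start, end, links)."""

    def __init__(self, name, start, end, children=()):
        self.name, self.start, self.end = name, start, end
        self.child = None
        self.sibling = None
        prev = None
        for c in children:
            if prev is None:
                self.child = c
            else:
                prev.sibling = c
            prev = c


def format_resources(res, depth=0):
    """Walk child/sibling links depth-first, one indented line per entry."""
    lines = []
    node = res.child
    while node is not None:
        lines.append("%s%08x-%08x : %s"
                     % ("  " * depth, node.start, node.end, node.name))
        lines.extend(format_resources(node, depth + 1))
        node = node.sibling
    return lines


root = Resource("PCI mem", 0, 0xffffffff, [
    Resource("System RAM", 0x1000, 0x9ffff),
    Resource("PCI Bus 0000:00", 0xa0000, 0xbffff,
             [Resource("Video RAM area", 0xa0000, 0xbffff)]),
])
print("\n".join(format_resources(root)))
```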
2016-05-23 | scripts/gdb: provide a dentry_name VFS path helper | Kieran Bingham
Walk the VFS entries, pre-pending the iname strings to generate a full VFS path name from a dentry. Link: http://lkml.kernel.org/r/4328fdb2d15ba7f1b21ad21c2eecc38d9cfc4d13.1462865983.git.jan.kiszka@siemens.com Signed-off-by: Kieran Bingham <kieran.bingham@linaro.org> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
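The path construction can be sketched with a recursive walk up `d_parent`, stopping at the root dentry, which is its own parent. This is a model with hypothetical Python objects rather than gdb values; in this sketch the root itself yields an empty string.

```python
class Dentry:
    """Hypothetical stand-in for struct dentry: a name and a parent link."""

    def __init__(self, name, parent=None):
        self.d_iname = name
        # The root dentry is its own parent, as in the kernel.
        self.d_parent = parent if parent is not None else self


def dentry_name(d):
    """Prepend each ancestor's name to build the full path."""
    parent = d.d_parent
    if parent is d:
        return ""
    return dentry_name(parent) + "/" + d.d_iname


root = Dentry("/")
passwd = Dentry("passwd", Dentry("etc", root))
print(dentry_name(passwd))  # /etc/passwd
```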
2016-05-23 | scripts/gdb: support !CONFIG_MODULES gracefully | Kieran Bingham
If CONFIG_MODULES is not enabled, lx-lsmod tries to find a non-existent symbol and generates an unfriendly traceback:
(gdb) lx-lsmod
Address Module Size Used by
Traceback (most recent call last):
  File "scripts/gdb/linux/modules.py", line 75, in invoke
    for module in module_list():
  File "scripts/gdb/linux/modules.py", line 24, in module_list
    module_ptr_type = module_type.get_type().pointer()
  File "scripts/gdb/linux/utils.py", line 28, in get_type
    self._type = gdb.lookup_type(self._name)
gdb.error: No struct type named module.
Error occurred in Python command: No struct type named module.
Catch the error and return an empty module_list() for a clean command output as follows:
(gdb) lx-lsmod
Address Module Size Used by
(gdb)
Link: http://lkml.kernel.org/r/94d533819437408b85ae5864f939dd7ca6fbfcd6.1462865983.git.jan.kiszka@siemens.com Signed-off-by: Kieran Bingham <kieran.bingham@linaro.org> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
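The fix amounts to catching the lookup error inside the generator and yielding nothing. The sketch below shows that shape outside gdb; `GdbError` stands in for `gdb.error`, and `lookup_type`, `module_list` and `lsmod` are hypothetical simplifications that take a dict of known debug types and a list of module names instead of reading the target.

```python
class GdbError(Exception):
    """Stand-in for gdb.error so the sketch runs outside gdb."""


def lookup_type(name, debug_types):
    # Hypothetical helper that behaves like gdb.lookup_type() over a dict.
    if name not in debug_types:
        raise GdbError("No struct type named %s." % name)
    return debug_types[name]


def module_list(debug_types, loaded_modules):
    """Yield loaded modules, or nothing if struct module is unknown."""
    try:
        lookup_type("module", debug_types)
    except GdbError:
        # Kernel built without CONFIG_MODULES: there is nothing to list.
        return
    for module in loaded_modules:
        yield module


def lsmod(debug_types, loaded_modules):
    lines = ["Address Module Size Used by"]
    lines.extend(module_list(debug_types, loaded_modules))
    return lines


print(lsmod({}, ["ext4"]))  # ['Address Module Size Used by']
```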
2016-05-23 | scripts/gdb: provide exception catching parser | Kieran Bingham
If we attempt to read a value that is not available to GDB, an exception is raised. Most of the time, this is a good thing; however on occasion we will want to be able to determine if a symbol is available. By catching the exception to simply return None, we can determine if we tried to read an invalid value, without the exception taking our execution context away from us Link: http://lkml.kernel.org/r/c72b25c06fc66e1d68371154097e2cbb112555d8.1462865983.git.jan.kiszka@siemens.com Signed-off-by: Kieran Bingham <kieran.bingham@linaro.org> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
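A wrapper that turns a failed read into `None` lets callers probe for optional symbols without losing control flow. This is a minimal sketch: `GdbError` stands in for `gdb.error`, and `read_or_none`, `SYMBOLS` and `fake_parse_and_eval` are hypothetical names used only to make the example runnable outside gdb.

```python
class GdbError(Exception):
    """Stand-in for gdb.error, raised when a value cannot be read."""


def read_or_none(read, *args):
    """Return read(*args), or None if reading raises GdbError."""
    try:
        return read(*args)
    except GdbError:
        return None


# Hypothetical symbol table and reader used only for this demonstration.
SYMBOLS = {"jiffies_64": 4294937296}


def fake_parse_and_eval(name):
    if name not in SYMBOLS:
        raise GdbError('No symbol "%s" in current context.' % name)
    return SYMBOLS[name]


print(read_or_none(fake_parse_and_eval, "modules"))  # None
```

Catching only the gdb error type, rather than every exception, keeps genuine bugs in the calling script visible.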
2016-05-23 | scripts/gdb: convert modules usage to lists functions | Kieran Bingham
Simplify the module list functions with the new list_for_each_entry abstractions Link: http://lkml.kernel.org/r/ad0101c9391088608166fcec26af179868973d86.1462865983.git.jan.kiszka@siemens.com Signed-off-by: Kieran Bingham <kieran.bingham@linaro.org> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-23 | scripts/gdb: provide kernel list item generators | Kieran Bingham
Facilitate iterating over linked-list items by providing a generator that returns the dereferenced and type-cast objects from a kernel linked list. Link: http://lkml.kernel.org/r/2b0998564e6e5abe53585d466f87e491331fd2a4.1462865983.git.jan.kiszka@siemens.com Signed-off-by: Kieran Bingham <kieran.bingham@linaro.org> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Cc: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
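Kernel lists are circular and doubly linked, with an embedded `list_head` in each entry; iteration stops when it comes back to the head. The sketch below models that in plain Python. It is illustrative only: the `owner` field is an assumption that stands in for `container_of()`, which the gdb helpers compute from the member offset instead.

```python
class ListHead:
    """Model of struct list_head; 'owner' stands in for container_of()."""

    def __init__(self, owner=None):
        self.next = self
        self.prev = self
        self.owner = owner


def list_add_tail(new, head):
    prev = head.prev
    new.prev, new.next = prev, head
    prev.next = new
    head.prev = new


def list_for_each(head):
    """Yield each list_head after head, stopping when the circle closes."""
    node = head.next
    while node is not head:
        yield node
        node = node.next


def list_for_each_entry(head):
    """Yield the containing objects rather than the raw list_heads."""
    for node in list_for_each(head):
        yield node.owner


class Module:
    def __init__(self, name):
        self.name = name
        self.list = ListHead(self)


modules = ListHead()
for name in ("ext4", "xfs", "btrfs"):
    list_add_tail(Module(name).list, modules)
print([m.name for m in list_for_each_entry(modules)])  # ['ext4', 'xfs', 'btrfs']
```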
2016-05-23 | scripts/gdb: provide linux constants | Kieran Bingham
Some macros and defines are needed when parsing memory, and unless the kernel is compiled with -g3 they are not available in the debug symbols. We use the preprocessor here to extract constants into a dedicated module for the Linux debugger extensions. The top-level Kbuild is used to call in and generate the constants file, while maintaining dependencies on autogenerated files in include/generated. Link: http://lkml.kernel.org/r/bc3df9c25f57ea72177c066a51a446fc19e2c27f.1462865983.git.jan.kiszka@siemens.com Signed-off-by: Kieran Bingham <kieran.bingham@linaro.org> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Cc: Michal Marek <mmarek@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-23 | scripts/gdb: Adjust module reference counter reported by lx-lsmod | Jan Kiszka
This takes the MODULE_REF_BASE into account. Link: http://lkml.kernel.org/r/d926d2d54caa034adb964b52215090cbdb875249.1462865983.git.jan.kiszka@siemens.com Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-23 | arch/defconfig: remove CONFIG_RESOURCE_COUNTERS | Konstantin Khlebnikov
This option was replaced by PAGE_COUNTER which is selected by MEMCG. Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Balbir Singh <bsingharora@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-23 | drivers/memstick/core/mspro_block: use kmemdup | Muhammad Falak R Wani
Use kmemdup when some other buffer is immediately copied into the allocated region. It replaces a call to an allocation followed by memcpy with a single call to kmemdup. [akpm@linux-foundation.org: remove unneeded cast to void*] Link: http://lkml.kernel.org/r/1463665743-16269-1-git-send-email-falakreyaz@gmail.com Signed-off-by: Muhammad Falak R Wani <falakreyaz@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-23 | rtsx_usb_ms: use schedule_timeout_idle() in polling loop | Oleksandr Natalenko
The first version of this patch was already posted to LKML by Ben Hutchings ~6 months ago, but no further action was taken. Ben's original message:
: rtsx_usb_ms creates a task that mostly sleeps, but tasks in
: uninterruptible sleep still contribute to the load average (for
: bug-compatibility with Unix). A load average of ~1 on a system that
: should be idle is somewhat alarming.
:
: Change the sleep to be interruptible, but still ignore signals.
References: https://bugs.debian.org/765717 Link: http://lkml.kernel.org/r/b49f95ae83057efa5d96f532803cba47@natalenko.name Signed-off-by: Oleksandr Natalenko <oleksandr@natalenko.name> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Ben Hutchings <ben@decadent.org.uk> Cc: Lee Jones <lee.jones@linaro.org> Cc: Wolfram Sang <wsa@the-dreams.de> Cc: Roger Tseng <rogerable@realtek.com> Cc: Greg KH <greg@kroah.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-23 | kdump: fix gdb macros to work with newer and 64-bit kernels | Corey Minyard
Lots of little changes needed to be made to clean these up, remove the four byte pointer assumption and traverse the pid queue properly. Also consolidate the traceback code into a single function instead of having three copies of it. Link: http://lkml.kernel.org/r/1462926655-9390-1-git-send-email-minyard@acm.org Signed-off-by: Corey Minyard <cminyard@mvista.com> Acked-by: Baoquan He <bhe@redhat.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Haren Myneni <hbabu@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-23 | s390/kexec: consolidate crash_map/unmap_reserved_pages() and arch_kexec_protect(unprotect)_crashkres() | Xunlei Pang
Commit 3f625002581b ("kexec: introduce a protection mechanism for the crashkernel reserved memory") is a mechanism for protecting the crash kernel reserved memory that is similar to the previous crash_map/unmap_reserved_pages() implementation; the new one is more generic in name and cleaner in code (besides, some archs may not be allowed to unmap the pgtable). Therefore, this patch consolidates them, and uses the new arch_kexec_protect(unprotect)_crashkres() to replace the former crash_map/unmap_reserved_pages(), which by now has only been used by S390. The consolidation work needs the crash memory to be mapped initially; this is done in machine_kdump_pm_init(), which runs after reserve_crashkernel(). Once the kdump kernel is loaded, the new arch_kexec_protect_crashkres() implemented for S390 will actually unmap the pgtable like before. Signed-off-by: Xunlei Pang <xlpang@redhat.com> Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com> Acked-by: Michael Holzheu <holzheu@linux.vnet.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Minfei Huang <mhuang@redhat.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Dave Young <dyoung@redhat.com> Cc: Baoquan He <bhe@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-23 | kexec: do a cleanup for function kexec_load | Minfei Huang
There is a lot of work to be done in function kexec_load, not only allocating structs and loading initram, but also some miscellaneous tasks. To make it clearer, wrap a new function do_kexec_load, which is used to allocate structs and load initram, and leave the pre-work in kexec_load. Signed-off-by: Minfei Huang <mnfhuang@gmail.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Xunlei Pang <xlpang@redhat.com> Cc: Baoquan He <bhe@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-23 | kexec: make a pair of map/unmap reserved pages in error path | Minfei Huang
For some archs, kexec maps the reserved pages and then uses them when we try to start the kdump service. kexec may return directly, without unmapping the reserved pages, if it fails while starting the service. To fix it, we pair the map/unmap of the reserved pages in both the normal path and the error path. This patch only affects s390. Other architectures don't implement the crash_unmap_reserved_pages and crash_map_reserved_pages interface. It isn't an urgent patch. The kernel can work well without any risk, although the reserved pages are not unmapped before returning in the error path. Signed-off-by: Minfei Huang <mnfhuang@gmail.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Xunlei Pang <xlpang@redhat.com> Cc: Baoquan He <bhe@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-23 | kexec: provide arch_kexec_protect(unprotect)_crashkres() | Xunlei Pang
Implement the protection method for the crash kernel memory reservation for the 64-bit x86 kdump. Signed-off-by: Xunlei Pang <xlpang@redhat.com> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Dave Young <dyoung@redhat.com> Cc: Minfei Huang <mhuang@redhat.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Baoquan He <bhe@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-23 | kexec: introduce a protection mechanism for the crashkernel reserved memory | Xunlei Pang
If some kernel (module) path stamps on the crash reserved memory (which is already mapped by the kernel) after the second kernel's data has been loaded there, the kdump kernel will probably fail to boot when a panic happens (or the panic may not happen at all), leaving the culprit at large; this is unacceptable. The patch introduces a mechanism for detecting such cases:
1) After each crash kexec load, it simply marks the reserved memory regions read-only, since we no longer access them after that. When someone stamps on the region, the first kernel will panic and trigger the kdump. The weak arch_kexec_protect_crashkres() is introduced to do the actual protection.
2) To allow multiple loads, once 1) is done we also need to re-mark the reserved memory read-write each time a kdump-related system call is made. The weak arch_kexec_unprotect_crashkres() is introduced to do the actual unprotection.
The architecture can provide its specific implementation by overriding arch_kexec_protect_crashkres() and arch_kexec_unprotect_crashkres(). Signed-off-by: Xunlei Pang <xlpang@redhat.com> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Dave Young <dyoung@redhat.com> Cc: Minfei Huang <mhuang@redhat.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Baoquan He <bhe@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-23 | exec: remove the no longer needed remove_arg_zero()->free_arg_page() | Oleg Nesterov
remove_arg_zero() does free_arg_page() for no reason. This was needed before, and only if CONFIG_MMU=y: see commit 4fc75ff4816c ("exec: fix remove_arg_zero"), install_arg_page() was called for every page != NULL in the bprm->page[] array. Today install_arg_page() is already gone and free_arg_page() is a nop after another commit b6a2fea39318 ("mm: variable length argument support"). CONFIG_MMU=n does free_arg_pages() in free_bprm() and thus it doesn't need remove_arg_zero()->free_arg_page() either; apart from get_arg_page() it never checks whether the page in bprm->page[] was allocated or not, so the "extra" non-freed page is fine. OTOH, this free_arg_page() can add a minor pessimization: the caller is going to do copy_strings_kernel() right after remove_arg_zero(), which will likely need to re-allocate the same page again. And as Hujunjie pointed out, the "offset == PAGE_SIZE" check is wrong because we are going to increment bprm->p once again before return, so CONFIG_MMU=n "leaks" the page anyway if '0' is the final byte in this page. NOTE: remove_arg_zero() assumes that argv[0] is null-terminated but this is not necessarily true. copy_strings() does "len = strnlen_user(...)", then copy_from_user(len), but another thread or a debugger can overwrite the trailing '0' in between. Afaics nothing really bad can happen because we must always have the null-terminated bprm->filename copied by the 1st copy_strings_kernel(), but perhaps we should change this code to check "bprm->p < bprm->exec" anyway, and/or change copy_strings() to ensure that the last byte in the string is always zero. Link: http://lkml.kernel.org/r/20160517155335.GA31435@redhat.com Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reported-by: hujunjie <jj.net@163.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-23 | kernel/fork.c: allocate idle task for a CPU always on its local node | Andi Kleen
Linux preallocates the task structs of the idle tasks for all possible CPUs. This currently means they all end up on node 0. This also implies that the cache lines used by MWAIT, which are around the flags field in the task struct, are all located on node 0.

We see a noticeable performance improvement on Knights Landing CPUs when the cache lines used for MWAIT are located on the local nodes of the CPUs using them. I would expect this to give a (likely slight) improvement on other systems too.

The patch places the idle task on the node of its CPU by passing the right target node to copy_process().

[akpm@linux-foundation.org: use NUMA_NO_NODE, not a bare -1]
Link: http://lkml.kernel.org/r/1463492694-15833-1-git-send-email-andi@firstfloor.org
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-23 | signal: move the "sig < SIGRTMIN" check into siginmask(sig) | Oleg Nesterov
All the users of siginmask() must ensure that sig < SIGRTMIN. sig_fatal() doesn't, and this is wrong:

    UBSAN: Undefined behaviour in kernel/signal.c:911:6
    shift exponent 32 is too large for 32-bit type 'long unsigned int'

This patch doesn't add the necessary check to sig_fatal(); instead it moves the check into siginmask() and updates the other callers.

Link: http://lkml.kernel.org/r/20160517195052.GA15187@redhat.com
Reported-by: Meelis Roos <mroos@linux.ee>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
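The point of moving the range check inside the helper is that && short-circuits, so the shift is never evaluated for an out-of-range signal. A minimal user-space sketch, assuming the kernel's SIGRTMIN of 32 and x86 numbers for SIGKILL (9) and SIGSTOP (19); the model_* names are hypothetical, and the sig > 0 test is an extra guard this sketch adds against a negative shift.

```c
#include <stdbool.h>

/* Hypothetical stand-ins; the kernel's SIGRTMIN is 32. */
#define MODEL_SIGRTMIN 32
#define model_sigmask(sig) (1UL << ((sig) - 1))

/* Range check first: the shift below is only reached for 0 < sig < 32. */
#define model_siginmask(sig, mask) \
	((sig) > 0 && (sig) < MODEL_SIGRTMIN && (model_sigmask(sig) & (mask)))

/* Hypothetical kernel-only mask: SIGKILL (9) and SIGSTOP (19) on x86. */
#define MODEL_SIG_KERNEL_ONLY_MASK (model_sigmask(9) | model_sigmask(19))

bool model_sig_kernel_only(int sig)
{
	/* Safe for any int sig, including real-time signals >= 32. */
	return model_siginmask(sig, MODEL_SIG_KERNEL_ONLY_MASK);
}
```

A caller like sig_fatal() can then pass any signal number, including one from the real-time range, without first checking it against SIGRTMIN.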
2016-05-23 | kernel/signal.c: convert printk(KERN_<LEVEL> ...) to pr_<level>(...) | Wang Xiaoqiang
Use pr_<level> instead of printk(KERN_<LEVEL> ...).

Signed-off-by: Wang Xiaoqiang <wangxq10@lzu.edu.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
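The conversion is mechanical because pr_<level>() is a thin macro that pastes the level prefix in front of the format. The sketch below models that expansion in user space: the KERN_* strings and the pr_err()/pr_info() shapes follow the kernel's convention (KERN_SOH "\001" followed by a level digit), while printk() here is a hypothetical stand-in that formats into a buffer, and last_msg_is() is an invented helper for inspecting the result. ##__VA_ARGS__ is a GNU C extension, as in the kernel.

```c
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Model of the kernel's log-level prefix: KERN_SOH plus a level digit. */
#define KERN_SOH  "\001"
#define KERN_ERR  KERN_SOH "3"
#define KERN_INFO KERN_SOH "6"

/* Captures the last message so the expansion can be inspected. */
static char last_msg[256];

/* Hypothetical stand-in for printk(): formats into last_msg. */
int printk(const char *fmt, ...)
{
	va_list ap;
	int n;

	va_start(ap, fmt);
	n = vsnprintf(last_msg, sizeof(last_msg), fmt, ap);
	va_end(ap);
	return n;
}

/* Hypothetical helper: does the last message equal s? */
int last_msg_is(const char *s)
{
	return strcmp(last_msg, s) == 0;
}

/* Default pr_fmt() leaves the format untouched; a file may override it. */
#ifndef pr_fmt
#define pr_fmt(fmt) fmt
#endif

/* pr_<level>() prepends the level prefix (and any pr_fmt() prefix). */
#define pr_err(fmt, ...)  printk(KERN_ERR pr_fmt(fmt), ##__VA_ARGS__)
#define pr_info(fmt, ...) printk(KERN_INFO pr_fmt(fmt), ##__VA_ARGS__)
```

So printk(KERN_INFO "x=%d\n", x) and pr_info("x=%d\n", x) produce the same string; the pr_ form additionally picks up a per-file pr_fmt() prefix if one is defined.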
2016-05-23 | signal: make oom_flags a bool | Tetsuo Handa
Currently the "struct signal_struct"->oom_flags member is sizeof(unsigned) bytes, but only one flag, OOM_FLAG_ORIGIN, is defined, and it is updated only by the current thread. We can convert OOM_FLAG_ORIGIN into a bool and reuse the saved bytes for updates from the OOM killer and/or the OOM reaper thread.

By the way, do we care about a race window between run_store() and swapoff()? It would theoretically be possible for two threads sharing the same "struct signal_struct" to call these functions concurrently. If we care, we can make oom_flags an atomic_t.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-23 | wait: allow sys_waitid() to accept __WNOTHREAD/__WCLONE/__WALL | Oleg Nesterov
I see no reason why waitid() can't support the other Linux-specific flags allowed in sys_wait4().

In particular, this change can help if we reconsider the previous change ("wait/ptrace: assume __WALL if the child is traced"), which adds the "automagical" __WALL for the debugger.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Jan Kratochvil <jan.kratochvil@redhat.com>
Cc: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>
Cc: Pedro Alves <palves@redhat.com>
Cc: Roland McGrath <roland@hack.frob.com>
Cc: <syzkaller@googlegroups.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
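With this change, user space can pass __WALL straight to waitid(2), which is what an init using sys_waitid() would need. A minimal sketch, assuming glibc, which exposes __WALL through <sys/wait.h>; reap_with_waitid_wall() is a hypothetical helper. Kernels without this change reject __WALL in waitid() with EINVAL, in which case the sketch falls back to waitpid() only to avoid leaving a zombie.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child that exits with 'code' and reap it with waitid() using
 * the Linux-specific __WALL flag. Returns the child's exit status, or
 * -1 on error.
 */
int reap_with_waitid_wall(int code)
{
	siginfo_t info;
	pid_t pid = fork();

	if (pid < 0)
		return -1;
	if (pid == 0)
		_exit(code);

	memset(&info, 0, sizeof(info));
	if (waitid(P_PID, pid, &info, WEXITED | __WALL) != 0) {
		/* e.g. EINVAL on older kernels: still reap the child. */
		waitpid(pid, NULL, 0);
		return -1;
	}
	if (info.si_code != CLD_EXITED)
		return -1;
	return info.si_status;
}
```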
2016-05-23 | wait/ptrace: assume __WALL if the child is traced | Oleg Nesterov
The following program (a simplified version of one generated by syzkaller)

    #include <pthread.h>
    #include <unistd.h>
    #include <sys/ptrace.h>
    #include <stdio.h>
    #include <signal.h>

    void *thread_func(void *arg)
    {
        /* The sub-thread asks to be traced by its parent (init). */
        ptrace(PTRACE_TRACEME, 0,0,0);
        return 0;
    }

    int main(void)
    {
        pthread_t thread;

        /* The parent exits at once, so the child is reparented. */
        if (fork())
            return 0;

        /* Spin until init has become our parent. */
        while (getppid() != 1)
            ;

        pthread_create(&thread, NULL, thread_func, NULL);
        pthread_join(thread, NULL);
        return 0;
    }

creates an unreapable zombie if /sbin/init doesn't use __WALL.

This is not a kernel bug, at least in the sense that everything works as expected: the debugger should reap a traced sub-thread before it can reap the leader, but without __WALL/__WCLONE do_wait() ignores sub-threads.

Unfortunately, it seems that /sbin/init in most (all?) distributions doesn't use it, so we have to change the kernel to avoid the problem. Note also that most inits use sys_waitid(), which doesn't allow __WALL, so the necessary user-space fix is not that trivial.

This patch just adds the "ptrace" check into eligible_child(). To some degree this matches the "tsk->ptrace" check in exit_notify(): ->exit_signal is mostly ignored when the tracee reports to the debugger. The same goes for WSTOPPED: the tracer doesn't need to set this flag to wait for the stopped tracee.

This obviously means a user-visible change: __WCLONE and __WALL no longer have any meaning for the debugger. I can only hope that this won't break anything, but at least strace/gdb won't suffer.

We could make a more conservative change. Say, we could take __WCLONE into account, or !thread_group_leader(). But it would be nice not to complicate these historical/confusing checks.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Jan Kratochvil <jan.kratochvil@redhat.com>
Cc: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>
Cc: Pedro Alves <palves@redhat.com>
Cc: Roland McGrath <roland@hack.frob.com>
Cc: <syzkaller@googlegroups.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
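The eligibility rule described above can be modelled as a pure function. This is a sketch of the logic as the message describes it, not the kernel's eligible_child(): the MODEL_* constants copy the Linux values of __WALL and __WCLONE and the x86 value of SIGCHLD, and a traced sub-thread is represented by an exit_signal other than SIGCHLD.

```c
#include <stdbool.h>

/* Hypothetical copies of the Linux flag/signal values (SIGCHLD is x86). */
#define MODEL_WALL    0x40000000u
#define MODEL_WCLONE  0x80000000u
#define MODEL_SIGCHLD 17

/*
 * A child traced by the waiter, or any child when __WALL is set, is
 * always eligible. Otherwise __WCLONE selects "clone" children
 * (exit_signal != SIGCHLD) and its absence selects ordinary ones.
 */
bool model_eligible_child(bool traced_by_us, unsigned int wo_flags,
			  int exit_signal)
{
	if (traced_by_us || (wo_flags & MODEL_WALL))
		return true;
	return (exit_signal != MODEL_SIGCHLD) == !!(wo_flags & MODEL_WCLONE);
}
```

In the syzkaller case, init waits with no flags for a traced sub-thread: without the traced_by_us test that child is ineligible and its leader can never be reaped; with it, the sub-thread is reaped first.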
2016-05-23 | nilfs2: fix block comments | Ryusuke Konishi
This fixes block comments with proper formatting to eliminate the following checkpatch.pl warnings:

    "WARNING: Block comments use * on subsequent lines"
    "WARNING: Block comments use a trailing */ on a separate line"

Link: http://lkml.kernel.org/r/1462886671-3521-8-git-send-email-konishi.ryusuke@lab.ntt.co.jp
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-23 | nilfs2: remove loops of single statement macros | Ryusuke Konishi
This fixes the checkpatch.pl warning "WARNING: Single statement macros should not use a do {} while (0) loop".

Link: http://lkml.kernel.org/r/1462886671-3521-7-git-send-email-konishi.ryusuke@lab.ntt.co.jp
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
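The rationale behind this warning can be shown with two hypothetical macros (not the actual nilfs2 ones). A do {} while (0) wrapper exists to make a multi-statement macro behave as a single statement; a single-statement macro written as a plain expression already does, and it stays usable where a value is expected.

```c
/* Single statement: a plain expression, no do {} while (0) needed. */
#define model_set_flag(word, bit) ((word) |= (1u << (bit)))

/* Several statements: the do {} while (0) wrapper is still needed. */
#define model_swap(a, b) \
	do { unsigned int tmp_ = (a); (a) = (b); (b) = tmp_; } while (0)

unsigned int model_demo(int cond)
{
	unsigned int x = 0, y = 5;

	/* Both macros work in an unbraced if/else. */
	if (cond)
		model_swap(x, y);
	else
		model_set_flag(x, 3);

	/* Only the expression form can be used for its value. */
	return model_set_flag(x, 0);
}
```

Here model_demo(1) swaps x to 5 and returns 5 (bit 0 is already set), while model_demo(0) sets bits 3 and 0 and returns 9.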
2016-05-23 | nilfs2: remove unnecessary else after return or break | Ryusuke Konishi
This fixes the checkpatch.pl warning that else is not generally useful after a break or return.

Link: http://lkml.kernel.org/r/1462886671-3521-6-git-send-email-konishi.ryusuke@lab.ntt.co.jp
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-23 | nilfs2: avoid bare use of 'unsigned' | Ryusuke Konishi
This fixes the checkpatch.pl warning "WARNING: Prefer 'unsigned int' to bare use of 'unsigned'".

Link: http://lkml.kernel.org/r/1462886671-3521-5-git-send-email-konishi.ryusuke@lab.ntt.co.jp
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-23 | nilfs2: fix code indent coding style issue | Ryusuke Konishi
This fixes the checkpatch.pl warning "WARNING: suspect code indent for conditional statements".

Link: http://lkml.kernel.org/r/1462886671-3521-4-git-send-email-konishi.ryusuke@lab.ntt.co.jp
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-23 | nilfs2: remove space before semicolon | Ryusuke Konishi
This fixes the checkpatch.pl warning "WARNING: space prohibited before semicolon" in nilfs_store_magic_and_option().

Link: http://lkml.kernel.org/r/1462886671-3521-3-git-send-email-konishi.ryusuke@lab.ntt.co.jp
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-23 | nilfs2: do not emit extra newline on nilfs_warning() and nilfs_error() | Ryusuke Konishi
This updates call sites of nilfs_warning() and nilfs_error() so that they don't add a duplicate newline. These output functions are already designed to add a trailing newline to the message.

Link: http://lkml.kernel.org/r/1462886671-3521-2-git-send-email-konishi.ryusuke@lab.ntt.co.jp
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
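Why a trailing "\n" at a call site is a bug follows from the helper appending its own newline. The sketch below uses a hypothetical model_warning() that formats into a caller-supplied buffer rather than the real nilfs_warning(), which logs via printk; only the "helper adds the newline" behaviour is taken from the message above.

```c
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper in the spirit of nilfs_warning(): formats the
 * caller's message and always appends '\n' itself, so callers must
 * not end their format strings with "\n".
 * Returns the resulting length, or -1 on error or lack of space.
 */
int model_warning(char *buf, size_t size, const char *fmt, ...)
{
	va_list ap;
	int n;

	va_start(ap, fmt);
	n = vsnprintf(buf, size, fmt, ap);
	va_end(ap);
	if (n < 0 || (size_t)n + 1 >= size)
		return -1;	/* error, or no room for the newline */
	buf[n] = '\n';
	buf[n + 1] = '\0';
	return n + 1;
}
```

A call such as model_warning(buf, sizeof(buf), "bad block %d\n", 7) ends up with two newlines, which is exactly the duplicate this commit removes from the real call sites.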