summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2016-12-14radix tree test suite: use rcu_barrierMatthew Wilcox
Calling rcu_barrier() allows all of the rcu-freed memory to be actually returned to the pool, and allows nr_allocated to return to 0. As well as allowing diffs between runs to be more useful, it also lets us pinpoint leaks more effectively. Link: http://lkml.kernel.org/r/1480369871-5271-44-git-send-email-mawilcox@linuxonhyperv.com Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com> Tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14radix tree test suite: benchmark for iteratorKonstantin Khlebnikov
This adds simple benchmark for iterator similar to one I've used for commit 78c1d78488a3 ("radix-tree: introduce bit-optimized iterator") Building with make BENCHMARK=1 set radix tree order to 6, this allows to get performance comparable to in kernel performance. Link: http://lkml.kernel.org/r/1480369871-5271-43-git-send-email-mawilcox@linuxonhyperv.com Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com> Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com> Tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14radix tree test suite: iteration test misuses RCUMatthew Wilcox
Each thread needs to register itself with RCU, otherwise the reading thread's read lock has no effect and the freeing thread will free the memory in the tree without waiting for the read lock to be dropped. Link: http://lkml.kernel.org/r/1480369871-5271-42-git-send-email-mawilcox@linuxonhyperv.com Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com> Tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14radix tree test suite: make runs more reproducibleMatthew Wilcox
Instead of reseeding the random number generator every time around the loop in big_gang_check(), seed it at the beginning of execution. Use rand_r() and an independent base seed for each thread in iteration_test() so they don't stomp all over each others state. Since this particular test depends on the kernel scheduler, the iteration test can't be reproduced based purely on the random seed, but at least it won't pollute the other tests. Print the seed, and allow the seed to be specified so that a run which hits a problem can be reproduced. Link: http://lkml.kernel.org/r/1480369871-5271-41-git-send-email-mawilcox@linuxonhyperv.com Signed-off-by: Matthew Wilcox <willy@infradead.org> Tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14radix tree test suite: free preallocated nodesMatthew Wilcox
It can be a source of mild concern when the test suite shows that we're leaking nodes. While poring over the source code looking for leaks can lead to some fascinating bugs being discovered, sometimes the leak is simply that these nodes were preallocated and are sitting on the per-CPU list. Free them by calling the CPU dead callback. Link: http://lkml.kernel.org/r/1480369871-5271-40-git-send-email-mawilcox@linuxonhyperv.com Signed-off-by: Matthew Wilcox <willy@infradead.org> Tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14radix tree test suite: track preempt_countMatthew Wilcox
Rather than simply NOP out preempt_enable() and preempt_disable(), keep track of preempt_count and display it regularly in case either the test suite or the code under test is forgetting to balance the enables & disables. Only found a test-case that was forgetting to re-enable preemption, but it's a possibility worth checking. Link: http://lkml.kernel.org/r/1480369871-5271-39-git-send-email-mawilcox@linuxonhyperv.com Signed-off-by: Matthew Wilcox <willy@infradead.org> Tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14radix tree test suite: allow GFP_ATOMIC allocations to failMatthew Wilcox
In order to test the preload code, it is necessary to fail GFP_ATOMIC allocations, which requires defining GFP_KERNEL and GFP_ATOMIC properly. Remove the obsolete __GFP_WAIT and copy the definitions of the __GFP flags which are used from the kernel include files. We also need the real definition of gfpflags_allow_blocking() to persuade the radix tree to actually use its preallocated nodes. Link: http://lkml.kernel.org/r/1480369871-5271-38-git-send-email-mawilcox@linuxonhyperv.com Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14tools: add WARN_ON_ONCEMatthew Wilcox
Patch series "Radix tree patches for 4.10", v3. Mostly these are improvements; the only bug fixes in here relate to multiorder entries (which are unused in the 4.9 tree). This patch (of 32): The radix tree uses its own buggy WARN_ON_ONCE. Replace it with the definition from asm-generic/bug.h Link: http://lkml.kernel.org/r/1480369871-5271-37-git-send-email-mawilcox@linuxonhyperv.com Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14dax: clear dirty entry tags on cache flushJan Kara
Currently we never clear dirty tags in DAX mappings and thus address ranges to flush accumulate. Now that we have locking of radix tree entries, we have all the locking necessary to reliably clear the radix tree dirty tag when flushing caches for corresponding address range. Similarly to page_mkclean() we also have to write-protect pages to get a page fault when the page is next written to so that we can mark the entry dirty again. Link: http://lkml.kernel.org/r/1479460644-25076-21-git-send-email-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14dax: protect PTE modification on WP fault by radix tree entry lockJan Kara
Currently PTE gets updated in wp_pfn_shared() after dax_pfn_mkwrite() has released corresponding radix tree entry lock. When we want to writeprotect PTE on cache flush, we need PTE modification to happen under radix tree entry lock to ensure consistent updates of PTE and radix tree (standard faults use page lock to ensure this consistency). So move update of PTE bit into dax_pfn_mkwrite(). Link: http://lkml.kernel.org/r/1479460644-25076-20-git-send-email-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14dax: make cache flushing protected by entry lockJan Kara
Currently, flushing of caches for DAX mappings was ignoring entry lock. So far this was ok (modulo a bug that a difference in entry lock could cause cache flushing to be mistakenly skipped) but in the following patches we will write-protect PTEs on cache flushing and clear dirty tags. For that we will need more exclusion. So do cache flushing under an entry lock. This allows us to remove one lock-unlock pair of mapping->tree_lock as a bonus. Link: http://lkml.kernel.org/r/1479460644-25076-19-git-send-email-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14mm: export follow_pte()Jan Kara
DAX will need to implement its own version of page_check_address(). To avoid duplicating page table walking code, export follow_pte() which does what we need. Link: http://lkml.kernel.org/r/1479460644-25076-18-git-send-email-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14mm: change return values of finish_mkwrite_fault()Jan Kara
Currently finish_mkwrite_fault() returns 0 when PTE got changed before we acquired PTE lock and VM_FAULT_WRITE when we succeeded in modifying the PTE. This is somewhat confusing since 0 generally means success, it is also inconsistent with finish_fault() which returns 0 on success. Change finish_mkwrite_fault() to return 0 on success and VM_FAULT_NOPAGE when PTE changed. Practically, there should be no behavioral difference since we bail out from the fault the same way regardless whether we return 0, VM_FAULT_NOPAGE, or VM_FAULT_WRITE. Also note that VM_FAULT_WRITE has no effect for shared mappings since the only two places that check it - KSM and GUP - care about private mappings only. Generally the meaning of VM_FAULT_WRITE for shared mappings is not well defined and we should probably clean that up. Link: http://lkml.kernel.org/r/1479460644-25076-17-git-send-email-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14mm: provide helper for finishing mkwrite faultsJan Kara
Provide a helper function for finishing write faults due to PTE being read-only. The helper will be used by DAX to avoid the need of complicating generic MM code with DAX locking specifics. Link: http://lkml.kernel.org/r/1479460644-25076-16-git-send-email-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14mm: move part of wp_page_reuse() into the single call siteJan Kara
wp_page_reuse() handles write shared faults which is needed only in wp_page_shared(). Move the handling only into that location to make wp_page_reuse() simpler and avoid a strange situation when we sometimes pass in locked page, sometimes unlocked etc. Link: http://lkml.kernel.org/r/1479460644-25076-15-git-send-email-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14mm: use vmf->page during WP faultsJan Kara
So far we set vmf->page during WP faults only when we needed to pass it to the ->page_mkwrite handler. Set it in all the cases now and use that instead of passing page pointer explicitly around. Link: http://lkml.kernel.org/r/1479460644-25076-14-git-send-email-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14mm: pass vm_fault structure into do_page_mkwrite()Jan Kara
We will need more information in the ->page_mkwrite() helper for DAX to be able to fully finish faults there. Pass vm_fault structure to do_page_mkwrite() and use it there so that information propagates properly from upper layers. Link: http://lkml.kernel.org/r/1479460644-25076-13-git-send-email-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14mm: factor out common parts of write fault handlingJan Kara
Currently we duplicate handling of shared write faults in wp_page_reuse() and do_shared_fault(). Factor them out into a common function. Link: http://lkml.kernel.org/r/1479460644-25076-12-git-send-email-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14mm: move handling of COW faults into DAX codeJan Kara
Move final handling of COW faults from generic code into DAX fault handler. That way generic code doesn't have to be aware of peculiarities of DAX locking so remove that knowledge and make locking functions private to fs/dax.c. Link: http://lkml.kernel.org/r/1479460644-25076-11-git-send-email-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14mm: factor out functionality to finish page faultsJan Kara
Introduce finish_fault() as a helper function for finishing page faults. It is rather thin wrapper around alloc_set_pte() but since we'd want to call this from DAX code or filesystems, it is still useful to avoid some boilerplate code. Link: http://lkml.kernel.org/r/1479460644-25076-10-git-send-email-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14mm: allow full handling of COW faults in ->fault handlersJan Kara
Patch series "dax: Clear dirty bits after flushing caches", v5. Patchset to clear dirty bits from radix tree of DAX inodes when caches for corresponding pfns have been flushed. In principle, these patches enable handlers to easily update PTEs and do other work necessary to finish the fault without duplicating the functionality present in the generic code. I'd like to thank Kirill and Ross for reviews of the series! This patch (of 20): To allow full handling of COW faults add memcg field to struct vm_fault and a return value of ->fault() handler meaning that COW fault is fully handled and memcg charge must not be canceled. This will allow us to remove knowledge about special DAX locking from the generic fault code. Link: http://lkml.kernel.org/r/1479460644-25076-9-git-send-email-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14mm: add orig_pte field into vm_faultJan Kara
Add orig_pte field to vm_fault structure to allow ->page_mkwrite handlers to fully handle the fault. This also allows us to save some passing of extra arguments around. Link: http://lkml.kernel.org/r/1479460644-25076-8-git-send-email-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14mm: use passed vm_fault structure for in wp_pfn_shared()Jan Kara
Instead of creating another vm_fault structure, use the one passed to wp_pfn_shared() for passing arguments into pfn_mkwrite handler. Link: http://lkml.kernel.org/r/1479460644-25076-7-git-send-email-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14mm: trim __do_fault() argumentsJan Kara
Use vm_fault structure to pass cow_page, page, and entry in and out of the function. That reduces number of __do_fault() arguments from 4 to 1. Link: http://lkml.kernel.org/r/1479460644-25076-6-git-send-email-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14mm: use passed vm_fault structure in __do_fault()Jan Kara
Instead of creating another vm_fault structure, use the one passed to __do_fault() for passing arguments into fault handler. Link: http://lkml.kernel.org/r/1479460644-25076-5-git-send-email-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14mm: use pgoff in struct vm_fault instead of passing it separatelyJan Kara
struct vm_fault has already pgoff entry. Use it instead of passing pgoff as a separate argument and then assigning it later. Link: http://lkml.kernel.org/r/1479460644-25076-4-git-send-email-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14mm: use vmf->address instead of of vmf->virtual_addressJan Kara
Every single user of vmf->virtual_address typed that entry to unsigned long before doing anything with it so the type of virtual_address does not really provide us any additional safety. Just use masked vmf->address which already has the appropriate type. Link: http://lkml.kernel.org/r/1479460644-25076-3-git-send-email-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14mm: join struct fault_env and vm_faultJan Kara
Currently we have two different structures for passing fault information around - struct vm_fault and struct fault_env. DAX will need more information in struct vm_fault to handle its faults so the content of that structure would become event closer to fault_env. Furthermore it would need to generate struct fault_env to be able to call some of the generic functions. So at this point I don't think there's much use in keeping these two structures separate. Just embed into struct vm_fault all that is needed to use it for both purposes. Link: http://lkml.kernel.org/r/1479460644-25076-2-git-send-email-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14mm: unexport __get_user_pages_unlocked()Lorenzo Stoakes
Unexport the low-level __get_user_pages_unlocked() function and replaces invocations with calls to more appropriate higher-level functions. In hva_to_pfn_slow() we are able to replace __get_user_pages_unlocked() with get_user_pages_unlocked() since we can now pass gup_flags. In async_pf_execute() and process_vm_rw_single_vec() we need to pass different tsk, mm arguments so get_user_pages_remote() is the sane replacement in these cases (having added manual acquisition and release of mmap_sem.) Additionally get_user_pages_remote() reintroduces use of the FOLL_TOUCH flag. However, this flag was originally silently dropped by commit 1e9877902dc7 ("mm/gup: Introduce get_user_pages_remote()"), so this appears to have been unintentional and reintroducing it is therefore not an issue. [akpm@linux-foundation.org: coding-style fixes] Link: http://lkml.kernel.org/r/20161027095141.2569-3-lstoakes@gmail.com Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Jan Kara <jack@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krcmar <rkrcmar@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14mm: add locked parameter to get_user_pages_remote()Lorenzo Stoakes
Patch series "mm: unexport __get_user_pages_unlocked()". This patch series continues the cleanup of get_user_pages*() functions taking advantage of the fact we can now pass gup_flags as we please. It firstly adds an additional 'locked' parameter to get_user_pages_remote() to allow for its callers to utilise VM_FAULT_RETRY functionality. This is necessary as the invocation of __get_user_pages_unlocked() in process_vm_rw_single_vec() makes use of this and no other existing higher level function would allow it to do so. Secondly existing callers of __get_user_pages_unlocked() are replaced with the appropriate higher-level replacement - get_user_pages_unlocked() if the current task and memory descriptor are referenced, or get_user_pages_remote() if other task/memory descriptors are referenced (having acquiring mmap_sem.) This patch (of 2): Add a int *locked parameter to get_user_pages_remote() to allow VM_FAULT_RETRY faulting behaviour similar to get_user_pages_[un]locked(). Taking into account the previous adjustments to get_user_pages*() functions allowing for the passing of gup_flags, we are now in a position where __get_user_pages_unlocked() need only be exported for his ability to allow VM_FAULT_RETRY behaviour, this adjustment allows us to subsequently unexport __get_user_pages_unlocked() as well as allowing for future flexibility in the use of get_user_pages_remote(). [sfr@canb.auug.org.au: merge fix for get_user_pages_remote API change] Link: http://lkml.kernel.org/r/20161122210511.024ec341@canb.auug.org.au Link: http://lkml.kernel.org/r/20161027095141.2569-2-lstoakes@gmail.com Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Jan Kara <jack@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krcmar <rkrcmar@redhat.com> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14ipc/sem: avoid idr tree lookup for interrupted semopDavidlohr Bueso
We can avoid the idr tree lookup (albeit possibly avoiding idr_find_fast()) when being awoken in EINTR, as the semid will not change in this context while blocked. Use the sma pointer directly and take the sem_lock, then re-check for RMID races. We continue to re-check the queue.status with the lock held such that we can detect situations where we where are dealing with a spurious wakeup but another task that holds the sem_lock updated the queue.status while we were spinning for it. Once we take the lock it obviously won't change again. Being the only caller, get rid of sem_obtain_lock() altogether. Link: http://lkml.kernel.org/r/1478708774-28826-3-git-send-email-dave@stgolabs.net Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14ipc/sem: simplify wait-wake loopDavidlohr Bueso
Instead of using the reverse goto, we can simplify the flow and make it more language natural by just doing do-while instead. One would hope this is the standard way (or obviously just with a while bucle) that we do wait/wakeup handling in the kernel. The exact same logic is kept, just more indented. [akpm@linux-foundation.org: coding-style fixes] Link: http://lkml.kernel.org/r/1478708774-28826-2-git-send-email-dave@stgolabs.net Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14ipc/sem: use proper list api for pending_list wakeupsDavidlohr Bueso
... saves some LoC and looks cleaner than re-implementing the calls. Link: http://lkml.kernel.org/r/1474225896-10066-6-git-send-email-dave@stgolabs.net Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Acked-by: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14ipc/sem: explicitly inline check_restartDavidlohr Bueso
The compiler already does this, but make it explicit. This helper is really small and also used in update_queue's main loop, which is O(N^2) scanning. Inline and avoid the function overhead. Link: http://lkml.kernel.org/r/1474225896-10066-5-git-send-email-dave@stgolabs.net Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14ipc/sem: optimize perform_atomic_semop()Davidlohr Bueso
This is the main workhorse that deals with semop user calls such that the waitforzero or semval update operations, on the set, can complete on not as the sma currently stands. Currently, the set is iterated twice (setting semval, then backwards for the sempid value). Slowpaths, and particularly SEM_UNDO calls, must undo any altered sem when it is detected that the caller must block or has errored-out. With larger sets, there can occur situations where this involves a lot of cycles and can obviously be a suboptimal use of cached resources in shared memory. Ie, discarding CPU caches that are also calling semop and have the sembuf cached (and can complete), while the current lock holder doing the semop will block, error, or does a waitforzero operation. This patch proposes still iterating the set twice, but the first scan is read-only, and we perform the actual updates afterward, once we know that the call will succeed. In order to not suffer from the overhead of dealing with sops that act on the same sem_num, such (rare) cases use perform_atomic_semop_slow(), which is exactly what we have now. Duplicates are detected before grabbing sem_lock, and uses simple a 32/64-bit hash array variable to based on the sem_num we are working on. In addition add some comments to when we expect to the caller to block. [akpm@linux-foundation.org: coding-style fixes] [colin.king@canonical.com: ensure we left shift a ULL rather than a 32 bit integer] Link: http://lkml.kernel.org/r/20161028181129.7311-1-colin.king@canonical.com Link: http://lkml.kernel.org/r/20160921194603.GB21438@linux-80c1.suse Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14ipc/sem: rework task wakeupsDavidlohr Bueso
Our sysv sems have been using the notion of lockless wakeups for a while, ever since commit 0a2b9d4c7967 ("ipc/sem.c: move wake_up_process out of the spinlock section"), in order to reduce the sem_lock hold times. This in-house pending queue can be replaced by wake_q (just like all the rest of ipc now), in that it provides the following advantages: o Simplifies and gets rid of unnecessary code. o We get rid of the IN_WAKEUP complexities. Given that wake_q_add() grabs reference to the task, if awoken due to an unrelated event, between the wake_q_add() and wake_up_q() window, we cannot race with sys_exit and the imminent call to wake_up_process(). o By not spinning IN_WAKEUP, we no longer need to disable preemption. In consequence, the wakeup paths (after schedule(), that is) must acknowledge an external signal/event, as well spurious wakeup occurring during the pending wakeup window. Obviously no changes in semantics that could be visible to the user. The fastpath is _only_ for when we know for sure that we were awoken due to a the waker's successful semop call (queue.status is not -EINTR). On a 48-core Haswell, running the ipcscale 'waitforzero' test, the following is seen with increasing thread counts: v4.8-rc5 v4.8-rc5 semopv2 Hmean sembench-sem-2 574733.00 ( 0.00%) 578322.00 ( 0.62%) Hmean sembench-sem-8 811708.00 ( 0.00%) 824689.00 ( 1.59%) Hmean sembench-sem-12 842448.00 ( 0.00%) 845409.00 ( 0.35%) Hmean sembench-sem-21 933003.00 ( 0.00%) 977748.00 ( 4.80%) Hmean sembench-sem-48 935910.00 ( 0.00%) 1004759.00 ( 7.36%) Hmean sembench-sem-79 937186.00 ( 0.00%) 983976.00 ( 4.99%) Hmean sembench-sem-234 974256.00 ( 0.00%) 1060294.00 ( 8.83%) Hmean sembench-sem-265 975468.00 ( 0.00%) 1016243.00 ( 4.18%) Hmean sembench-sem-296 991280.00 ( 0.00%) 1042659.00 ( 5.18%) Hmean sembench-sem-327 975415.00 ( 0.00%) 1029977.00 ( 5.59%) Hmean sembench-sem-358 1014286.00 ( 0.00%) 1049624.00 ( 3.48%) Hmean sembench-sem-389 972939.00 ( 0.00%) 1043127.00 ( 7.21%) Hmean sembench-sem-420 981909.00 ( 0.00%) 1056747.00 ( 7.62%) Hmean sembench-sem-451 990139.00 ( 0.00%) 1051609.00 ( 6.21%) Hmean sembench-sem-482 965735.00 ( 0.00%) 1040313.00 ( 7.72%) [akpm@linux-foundation.org: coding-style fixes] [sfr@canb.auug.org.au: merge fix for WAKE_Q to DEFINE_WAKE_Q rename] Link: http://lkml.kernel.org/r/20161122210410.5eca9fc2@canb.auug.org.au Link: http://lkml.kernel.org/r/1474225896-10066-3-git-send-email-dave@stgolabs.net Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Acked-by: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14ipc/sem: do not call wake_sem_queue_do() prematurely ... as this call should ↵Davidlohr Bueso
obviously be paired with its _prepare() counterpart. At least whenever possible, as there is no harm in calling it bogusly as we do now in a few places. Immediate error semop(2) paths that are far from ever having the task block can be simplified and avoid a few unnecessary loads on their way out of the call as it is not deeply nested. Link: http://lkml.kernel.org/r/1474225896-10066-2-git-send-email-dave@stgolabs.net Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14sparc: implement watchdog_nmi_enable and watchdog_nmi_disableBabu Moger
Implement functions watchdog_nmi_enable and watchdog_nmi_disable to enable/disable nmi watchdog. Sparc uses arch specific nmi watchdog handler. Currently, we do not have a way to enable/disable nmi watchdog dynamically. With these patches we can enable or disable arch specific nmi watchdogs using proc or sysctl interface. Example commands. To enable: echo 1 > /proc/sys/kernel/nmi_watchdog To disable: echo 0 > /proc/sys/kernel/nmi_watchdog It can also achieved using the sysctl parameter kernel.nmi_watchdog Link: http://lkml.kernel.org/r/1478034826-43888-4-git-send-email-babu.moger@oracle.com Signed-off-by: Babu Moger <babu.moger@oracle.com> Acked-by: Don Zickus <dzickus@redhat.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Andi Kleen <andi@firstfloor.org> Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com> Cc: Aaron Tomlin <atomlin@redhat.com> Cc: Ulrich Obergfell <uobergfe@redhat.com> Cc: Tejun Heo <tj@kernel.org> Cc: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com> Cc: Josh Hunt <johunt@akamai.com> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14kernel/watchdog.c: move hardlockup detector to separate fileBabu Moger
Separate hardlockup code from watchdog.c and move it to watchdog_hld.c. It is mostly straight forward. Remove everything inside CONFIG_HARDLOCKUP_DETECTORS. This code will go to file watchdog_hld.c. Also update the makefile accordigly. Link: http://lkml.kernel.org/r/1478034826-43888-3-git-send-email-babu.moger@oracle.com Signed-off-by: Babu Moger <babu.moger@oracle.com> Acked-by: Don Zickus <dzickus@redhat.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Andi Kleen <andi@firstfloor.org> Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com> Cc: Aaron Tomlin <atomlin@redhat.com> Cc: Ulrich Obergfell <uobergfe@redhat.com> Cc: Tejun Heo <tj@kernel.org> Cc: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com> Cc: Josh Hunt <johunt@akamai.com> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14kernel/watchdog.c: move shared definitions to nmi.hBabu Moger
Patch series "Clean up watchdog handlers", v2. This is an attempt to cleanup watchdog handlers. Right now, kernel/watchdog.c implements both softlockup and hardlockup detectors. Softlockup code is generic. Hardlockup code is arch specific. Some architectures don't use hardlockup detectors. They use their own watchdog detectors. To make both these combination work, we have numerous #ifdefs in kernel/watchdog.c. We are trying here to make these handlers independent of each other. Also provide an interface for architectures to implement their own handlers. watchdog_nmi_enable and watchdog_nmi_disable will be defined as weak such that architectures can override its definitions. Thanks to Don Zickus for his suggestions. Here are our previous discussions http://www.spinics.net/lists/sparclinux/msg16543.html http://www.spinics.net/lists/sparclinux/msg16441.html This patch (of 3): Move shared macros and definitions to nmi.h so that watchdog.c, new file watchdog_hld.c or any other architecture specific handler can use those definitions. Link: http://lkml.kernel.org/r/1478034826-43888-2-git-send-email-babu.moger@oracle.com Signed-off-by: Babu Moger <babu.moger@oracle.com> Acked-by: Don Zickus <dzickus@redhat.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Andi Kleen <andi@firstfloor.org> Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com> Cc: Aaron Tomlin <atomlin@redhat.com> Cc: Ulrich Obergfell <uobergfe@redhat.com> Cc: Tejun Heo <tj@kernel.org> Cc: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com> Cc: Josh Hunt <johunt@akamai.com> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14ktest.pl: fix englishPavel Machek
Ajdust spelling to more common "mandatory". Variant "mandidory" is certainly wrong. Link: http://lkml.kernel.org/r/20161011073003.GA19476@amd Signed-off-by: Pavel Machek <pavel@ucw.cz> Acked-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14drivers/net/wireless/intel/iwlwifi/dvm/calib.c: simplfy min() expressionAndrew Morton
This cast is no longer needed. Cc: Johannes Berg <johannes.berg@intel.com> Cc: Emmanuel Grumbach <emmanuel.grumbach@intel.com> Cc: Intel Linux Wireless <linuxwifi@intel.com> Cc: Kalle Valo <kvalo@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14posix-timers: give lazy compilers some help optimizing code awayNicolas Pitre
The OpenRISC compiler (so far) fails to optimize away a large portion of code containing a reference to posix_timer_event in alarmtimer.c when CONFIG_POSIX_TIMERS is unset. Let's give it a direct clue to let the build succeed. This fixes [linux-next:master 6682/7183] alarmtimer.c:undefined reference to `posix_timer_event' reported by kbuild test robot. Signed-off-by: Nicolas Pitre <nico@linaro.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Josh Triplett <josh@joshtriplett.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14ipc/shm.c: coding style fixesShailesh Pandey
This patch fixes below warnings: WARNING: Missing a blank line after declarations WARNING: Block comments use a trailing */ on a separate line ERROR: spaces required around that '=' (ctx:WxV) Above warnings were reported by checkpatch.pl Link: http://lkml.kernel.org/r/1478604980-18062-1-git-send-email-p.shailesh@samsung.com Signed-off-by: Shailesh Pandey <p.shailesh@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14ipc: msg, make msgrcv work with LONG_MINJiri Slaby
When LONG_MIN is passed to msgrcv, one would expect to recieve any message. But convert_mode does *msgtyp = -*msgtyp and -LONG_MIN is undefined. In particular, with my gcc -LONG_MIN produces -LONG_MIN again. So handle this case properly by assigning LONG_MAX to *msgtyp if LONG_MIN was specified as msgtyp to msgrcv. This code: long msg[] = { 100, 200 }; int m = msgget(IPC_PRIVATE, IPC_CREAT | 0644); msgsnd(m, &msg, sizeof(msg), 0); msgrcv(m, &msg, sizeof(msg), LONG_MIN, 0); produces currently nothing: msgget(IPC_PRIVATE, IPC_CREAT|0644) = 65538 msgsnd(65538, {100, "\310\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"}, 16, 0) = 0 msgrcv(65538, ... Except a UBSAN warning: UBSAN: Undefined behaviour in ipc/msg.c:745:13 negation of -9223372036854775808 cannot be represented in type 'long int': With the patch, I see what I expect: msgget(IPC_PRIVATE, IPC_CREAT|0644) = 0 msgsnd(0, {100, "\310\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"}, 16, 0) = 0 msgrcv(0, {100, "\310\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"}, 16, -9223372036854775808, 0) = 16 Link: http://lkml.kernel.org/r/20161024082633.10148-1-jslaby@suse.cz Signed-off-by: Jiri Slaby <jslaby@suse.cz> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14initramfs: allow again choice of the embedded initram compression algorithmFrancisco Blas Izquierdo Riera (klondike)
Choosing the appropriate compression option when using an embedded initramfs can result in significant size differences in the resulting data. This is caused by avoiding double compression of the initramfs contents. For example on my tests, choosing CONFIG_INITRAMFS_COMPRESSION_NONE when compressing the kernel using XZ) results in up to 500KiB differences (9MiB to 8.5MiB) in the kernel size as the dictionary will not get polluted with uncomprensible data and may reuse kernel data too. Despite embedding an uncompressed initramfs, a user may want to allow for a compressed extra initramfs to be passed using the rd system, for example to boot a recovery system. 9ba4bcb645898d ("initramfs: read CONFIG_RD_ variables for initramfs compression") broke that behavior by making the choice based on CONFIG_RD_* instead of adding CONFIG_INITRAMFS_COMPRESSION_LZ4. Saddly, CONFIG_RD_* is also used to choose the supported RD compression algorithms by the kernel and a user may want to support more than one. This patch also reverts commit 3e4e0f0a875 ("initramfs: remove "compression mode" choice") restoring back the "compression mode" choice and includes the CONFIG_INITRAMFS_COMPRESSION_LZ4 option which was never added. As a result the following options are added or readed affecting the embedded initramfs compression: INITRAMFS_COMPRESSION_NONE Do no compression INITRAMFS_COMPRESSION_GZIP Compress using gzip INITRAMFS_COMPRESSION_BZIP2 Compress using bzip2 INITRAMFS_COMPRESSION_LZMA Compress using lzma INITRAMFS_COMPRESSION_XZ Compress using xz INITRAMFS_COMPRESSION_LZO Compress using lzo INITRAMFS_COMPRESSION_LZ4 Compress using lz4 These depend on the corresponding CONFIG_RD_* option being set (except NONE which has no dependencies). This patch depends on the previous one (the previous version didn't) to simplify the way in which the algorithm is chosen and keep backwards compatibility with the behaviour introduced by 9ba4bcb645898 ("initramfs: read CONFIG_RD_ variables for initramfs compression"). Link: http://lkml.kernel.org/r/57EAD77B.7090607@klondike.es Signed-off-by: Francisco Blas Izquierdo Riera (klondike) <klondike@klondike.es> Cc: P J P <ppandit@redhat.com> Cc: Paul Bolle <pebolle@tiscali.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14initramfs: select builtin initram compression algorithm on KConfig instead ↵Francisco Blas Izquierdo Riera (klondike)
of Makefile Move the current builtin initram compression algorithm selection from the Makefile into the INITRAMFS_COMPRESSION variable. This makes deciding algorithm precedence easier and would allow for overrides if new algorithms want to be tested. Link: http://lkml.kernel.org/r/57EAD769.1090401@klondike.es Signed-off-by: Francisco Blas Izquierdo Riera (klondike) <klondike@klondike.es> Cc: P J P <ppandit@redhat.com> Cc: Paul Bolle <pebolle@tiscali.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14kdb: call vkdb_printf() from vprintk_default() only when wantedPetr Mladek
kdb_trap_printk allows to pass normal printk() messages to kdb via vkdb_printk(). For example, it is used to get backtrace using the classic show_stack(), see kdb_show_stack(). vkdb_printf() tries to avoid a potential infinite loop by disabling the trap. But this approach is racy, for example: CPU1 CPU2 vkdb_printf() // assume that kdb_trap_printk == 0 saved_trap_printk = kdb_trap_printk; kdb_trap_printk = 0; kdb_show_stack() kdb_trap_printk++; Problem1: Now, a nested printk() on CPU0 calls vkdb_printf() even when it should have been disabled. It will not cause a deadlock but... // using the outdated saved value: 0 kdb_trap_printk = saved_trap_printk; kdb_trap_printk--; Problem2: Now, kdb_trap_printk == -1 and will stay like this. It means that all messages will get passed to kdb from now on. This patch removes the racy saved_trap_printk handling. Instead, the recursion is prevented by a check for the locked CPU. The solution is still kind of racy. A non-related printk(), from another process, might get trapped by vkdb_printf(). And the wanted printk() might not get trapped because kdb_printf_cpu is assigned. But this problem existed even with the original code. A proper solution would be to get_cpu() before setting kdb_trap_printk and trap messages only from this CPU. I am not sure if it is worth the effort, though. In fact, the race is very theoretical. When kdb is running any of the commands that use kdb_trap_printk there is a single active CPU and the other CPUs should be in a holding pen inside kgdb_cpu_enter(). The only time this is violated is when there is a timeout waiting for the other CPUs to report to the holding pen. Finally, note that the situation is a bit schizophrenic. vkdb_printf() explicitly allows recursion but only from KDB code that calls kdb_printf() directly. On the other hand, the generic printk() recursion is not allowed because it might cause an infinite loop. This is why we could not hide the decision inside vkdb_printf() easily. Link: http://lkml.kernel.org/r/1480412276-16690-4-git-send-email-pmladek@suse.com Signed-off-by: Petr Mladek <pmladek@suse.com> Cc: Daniel Thompson <daniel.thompson@linaro.org> Cc: Jason Wessel <jason.wessel@windriver.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14kdb: properly synchronize vkdb_printf() calls with other CPUsPetr Mladek
kdb_printf_lock does not prevent other CPUs from entering the critical section because it is ignored when KDB_STATE_PRINTF_LOCK is set. The problematic situation might look like: CPU0 CPU1 vkdb_printf() if (!KDB_STATE(PRINTF_LOCK)) KDB_STATE_SET(PRINTF_LOCK); spin_lock_irqsave(&kdb_printf_lock, flags); vkdb_printf() if (!KDB_STATE(PRINTF_LOCK)) BANG: The PRINTF_LOCK state is set and CPU1 is entering the critical section without spinning on the lock. The problem is that the code tries to implement locking using two state variables that are not handled atomically. Well, we need a custom locking because we want to allow reentering the critical section on the very same CPU. Let's use solution from Petr Zijlstra that was proposed for a similar scenario, see https://lkml.kernel.org/r/20161018171513.734367391@infradead.org This patch uses the same trick with cmpxchg(). The only difference is that we want to handle only recursion from the same context and therefore we disable interrupts. In addition, KDB_STATE_PRINTF_LOCK is removed. In fact, we are not able to set it a non-racy way. Link: http://lkml.kernel.org/r/1480412276-16690-3-git-send-email-pmladek@suse.com Signed-off-by: Petr Mladek <pmladek@suse.com> Reviewed-by: Daniel Thompson <daniel.thompson@linaro.org> Cc: Jason Wessel <jason.wessel@windriver.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14kdb: remove unused kdb_event handlingPetr Mladek
kdb_event state variable is only set but never checked in the kernel code. http://www.spinics.net/lists/kdb/msg01733.html suggests that this variable affected WARN_CONSOLE_UNLOCKED() in the original implementation. But this check never went upstream. The semantic is unclear and racy. The value is updated after the kdb_printf_lock is acquired and after it is released. It should be symmetric at minimum. The value should be manipulated either inside or outside the locked area. Fortunately, it seems that the original function is gone and we could simply remove the state variable. Link: http://lkml.kernel.org/r/1480412276-16690-2-git-send-email-pmladek@suse.com Signed-off-by: Petr Mladek <pmladek@suse.com> Suggested-by: Daniel Thompson <daniel.thompson@linaro.org> Cc: Jason Wessel <jason.wessel@windriver.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>