Age | Commit message (Collapse) | Author |
|
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
On 64 bit system we have enough space in struct page to encode
compound_dtor and compound_order with unsigned int.
On x86-64 it leads to slightly smaller code size due usesage of plain
MOV instead of MOVZX (zero-extended move) or similar effect.
allyesconfig:
text data bss dec hex filename
159520446 48146736 72196096 279863278 10ae5fee vmlinux.pre
159520382 48146736 72196096 279863214 10ae5fae vmlinux.post
On other architectures without native support of 16-bit data types the
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Let's try to be consistent about data type of page order.
[sfr@canb.auug.org.au: fix build (type of pageblock_order)]
[hughd@google.com: some configs end up with MAX_ORDER and pageblock_order having different types]
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Hugh has pointed that compound_head() call can be unsafe in some
context. There's one example:
CPU0 CPU1
isolate_migratepages_block()
page_count()
compound_head()
!!PageTail() == true
put_page()
tail->first_page = NULL
head = tail->first_page
alloc_pages(__GFP_COMP)
prep_compound_page()
tail->first_page = head
__SetPageTail(p);
!!PageTail() == true
<head == NULL dereferencing>
The race is pure theoretical. I don't it's possible to trigger it in
practice. But who knows.
We can fix the race by changing how encode PageTail() and compound_head()
within struct page to be able to update them in one shot.
The patch introduces page->compound_head into third double word block in
front of compound_dtor and compound_order. Bit 0 encodes PageTail() and
the rest bits are pointer to head page if bit zero is set.
The patch moves page->pmd_huge_pte out of word, just in case if an
architecture defines pgtable_t into something what can have the bit 0
set.
hugetlb_cgroup uses page->lru.next in the second tail page to store
pointer struct hugetlb_cgroup. The patch switch it to use page->private
in the second tail page instead. The space is free since ->first_page is
removed from the union.
The patch also opens possibility to remove HUGETLB_CGROUP_MIN_ORDER
limitation, since there's now space in first tail page to store struct
hugetlb_cgroup pointer. But that's out of scope of the patch.
That means page->compound_head shares storage space with:
- page->lru.next;
- page->next;
- page->rcu_head.next;
That's too long list to be absolutely sure, but looks like nobody uses
bit 0 of the word.
page->rcu_head.next guaranteed[1] to have bit 0 clean as long as we use
call_rcu(), call_rcu_bh(), call_rcu_sched(), or call_srcu(). But future
call_rcu_lazy() is not allowed as it makes use of the bit and we can
get false positive PageTail().
[1] http://lkml.kernel.org/g/20150827163634.GD4029@linux.vnet.ibm.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The patch halves space occupied by compound_dtor and compound_order in
struct page.
For compound_order, it's trivial long -> short conversion.
For get_compound_page_dtor(), we now use hardcoded table for destructor
lookup and store its index in the struct page instead of direct pointer
to destructor. It shouldn't be a big trouble to maintain the table: we
have only two destructor and NULL currently.
This patch free up one word in tail pages for reuse. This is preparation
for the next patch.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Since 8456a648cf44 ("slab: use struct page for slab management") nobody
uses slab_page field in struct page.
Let's drop it.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Constify `struct zs_pool' ->name.
[akpm@inux-foundation.org: constify zpool_create_pool()'s `type' arg also]
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Dan Streetman <ddstreet@ieee.org>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Make the return type of zpool_get_type const; the string belongs to the
zpool driver and should not be modified. Remove the redundant type field
in the struct zpool; it is private to zpool.c and isn't needed since
->driver->type can be used directly. Add comments indicating strings must
be null-terminated.
Signed-off-by: Dan Streetman <ddstreet@ieee.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Seth Jennings <sjennings@variantweb.net>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Change the param_free_charp() function from static to exported.
It is used by zswap in the next patch ("zswap: use charp for zswap param
strings").
Signed-off-by: Dan Streetman <ddstreet@ieee.org>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Seth Jennings <sjennings@variantweb.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
There are many places which use mapping_gfp_mask to restrict a more
generic gfp mask which would be used for allocations which are not
directly related to the page cache but they are performed in the same
context.
Let's introduce a helper function which makes the restriction explicit and
easier to track. This patch doesn't introduce any functional changes.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Michal Hocko <mhocko@suse.com>
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Someone has an 86 column display.
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
combinations
Andrew stated the following
We have quite a history of remote parts of the kernel using
weird/wrong/inexplicable combinations of __GFP_ flags. I tend
to think that this is because we didn't adequately explain the
interface.
And I don't think that gfp.h really improved much in this area as
a result of this patchset. Could you go through it some time and
decide if we've adequately documented all this stuff?
This patches first moves some GFP flag combinations that are part of the MM
internals to mm/internal.h. The rest of the patch documents the __GFP_FOO
bits under various headings and then documents the flag combinations. It
will not help callers that are brain damaged but the clarity might motivate
some fixes and avoid future mistakes.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vitaly Wool <vitalywool@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
High-order watermark checking exists for two reasons -- kswapd high-order
awareness and protection for high-order atomic requests. Historically the
kernel depended on MIGRATE_RESERVE to preserve min_free_kbytes as
high-order free pages for as long as possible. This patch introduces
MIGRATE_HIGHATOMIC that reserves pageblocks for high-order atomic
allocations on demand and avoids using those blocks for order-0
allocations. This is more flexible and reliable than MIGRATE_RESERVE was.
A MIGRATE_HIGHORDER pageblock is created when an atomic high-order
allocation request steals a pageblock but limits the total number to 1% of
the zone. Callers that speculatively abuse atomic allocations for
long-lived high-order allocations to access the reserve will quickly fail.
Note that SLUB is currently not such an abuser as it reclaims at least
once. It is possible that the pageblock stolen has few suitable
high-order pages and will need to steal again in the near future but there
would need to be strong justification to search all pageblocks for an
ideal candidate.
The pageblocks are unreserved if an allocation fails after a direct
reclaim attempt.
The watermark checks account for the reserved pageblocks when the
allocation request is not a high-order atomic allocation.
The reserved pageblocks can not be used for order-0 allocations. This may
allow temporary wastage until a failed reclaim reassigns the pageblock.
This is deliberate as the intent of the reservation is to satisfy a
limited number of atomic high-order short-lived requests if the system
requires them.
The stutter benchmark was used to evaluate this but while it was running
there was a systemtap script that randomly allocated between 1 high-order
page and 12.5% of memory's worth of order-3 pages using GFP_ATOMIC. This
is much larger than the potential reserve and it does not attempt to be
realistic. It is intended to stress random high-order allocations from an
unknown source, show that there is a reduction in failures without
introducing an anomaly where atomic allocations are more reliable than
regular allocations. The amount of memory reserved varied throughout the
workload as reserves were created and reclaimed under memory pressure.
The allocation failures once the workload warmed up were as follows;
4.2-rc5-vanilla 70%
4.2-rc5-atomic-reserve 56%
The failure rate was also measured while building multiple kernels. The
failure rate was 14% but is 6% with this patch applied.
Overall, this is a small reduction but the reserves are small relative to
the number of allocation requests. In early versions of the patch, the
failure rate reduced by a much larger amount but that required much larger
reserves and perversely made atomic allocations seem more reliable than
regular allocations.
[yalin.wang2010@gmail.com: fix redundant check and a memory leak]
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Vitaly Wool <vitalywool@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: yalin wang <yalin.wang2010@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
MIGRATE_RESERVE preserves an old property of the buddy allocator that
existed prior to fragmentation avoidance -- min_free_kbytes worth of pages
tended to remain contiguous until the only alternative was to fail the
allocation. At the time it was discovered that high-order atomic
allocations relied on this property so MIGRATE_RESERVE was introduced. A
later patch will introduce an alternative MIGRATE_HIGHATOMIC so this patch
deletes MIGRATE_RESERVE and supporting code so it'll be easier to review.
Note that this patch in isolation may look like a false regression if
someone was bisecting high-order atomic allocation failures.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vitaly Wool <vitalywool@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The zonelist cache (zlc) was introduced to skip over zones that were
recently known to be full. This avoided expensive operations such as the
cpuset checks, watermark calculations and zone_reclaim. The situation
today is different and the complexity of zlc is harder to justify.
1) The cpuset checks are no-ops unless a cpuset is active and in general
are a lot cheaper.
2) zone_reclaim is now disabled by default and I suspect that was a large
source of the cost that zlc wanted to avoid. When it is enabled, it's
known to be a major source of stalling when nodes fill up and it's
unwise to hit every other user with the overhead.
3) Watermark checks are expensive to calculate for high-order
allocation requests. Later patches in this series will reduce the cost
of the watermark checking.
4) The most important issue is that in the current implementation it
is possible for a failed THP allocation to mark a zone full for order-0
allocations and cause a fallback to remote nodes.
The last issue could be addressed with additional complexity but as the
benefit of zlc is questionable, it is better to remove it. If stalls due
to zone_reclaim are ever reported then an alternative would be to
introduce deferring logic based on a timeout inside zone_reclaim itself
and leave the page allocator fast paths alone.
The impact on page-allocator microbenchmarks is negligible as they don't
hit the paths where the zlc comes into play. Most page-reclaim related
workloads showed no noticeable difference as a result of the removal.
The impact was noticeable in a workload called "stutter". One part uses a
lot of anonymous memory, a second measures mmap latency and a third copies
a large file. In an ideal world the latency application would not notice
the mmap latency. On a 2-node machine the results of this patch are
stutter
4.3.0-rc1 4.3.0-rc1
baseline nozlc-v4
Min mmap 20.9243 ( 0.00%) 20.7716 ( 0.73%)
1st-qrtle mmap 22.0612 ( 0.00%) 22.0680 ( -0.03%)
2nd-qrtle mmap 22.3291 ( 0.00%) 22.3809 ( -0.23%)
3rd-qrtle mmap 25.2244 ( 0.00%) 25.2396 ( -0.06%)
Max-90% mmap 48.0995 ( 0.00%) 28.3713 ( 41.02%)
Max-93% mmap 52.5557 ( 0.00%) 36.0170 ( 31.47%)
Max-95% mmap 55.8173 ( 0.00%) 47.3163 ( 15.23%)
Max-99% mmap 67.3781 ( 0.00%) 70.1140 ( -4.06%)
Max mmap 24447.6375 ( 0.00%) 12915.1356 ( 47.17%)
Mean mmap 33.7883 ( 0.00%) 27.7944 ( 17.74%)
Best99%Mean mmap 27.7825 ( 0.00%) 25.2767 ( 9.02%)
Best95%Mean mmap 26.3912 ( 0.00%) 23.7994 ( 9.82%)
Best90%Mean mmap 24.9886 ( 0.00%) 23.2251 ( 7.06%)
Best50%Mean mmap 22.0157 ( 0.00%) 22.0261 ( -0.05%)
Best10%Mean mmap 21.6705 ( 0.00%) 21.6083 ( 0.29%)
Best5%Mean mmap 21.5581 ( 0.00%) 21.4611 ( 0.45%)
Best1%Mean mmap 21.3079 ( 0.00%) 21.1631 ( 0.68%)
Note that the maximum stall latency went from 24 seconds to 12 which is
still bad but an improvement. The milage varies considerably 2-node
machine on an earlier test went from 494 seconds to 47 seconds and a
4-node machine that tested an earlier version of this patch went from a
worst case stall time of 6 seconds to 67ms. The nature of the benchmark
is inherently unpredictable as it is hammering the system and the milage
will vary between machines.
There is a secondary impact with potentially more direct reclaim because
zones are now being considered instead of being skipped by zlc. In this
particular test run it did not occur so will not be described. However,
in at least one test the following was observed
1. Direct reclaim rates were higher. This was likely due to direct reclaim
being entered instead of the zlc disabling a zone and busy looping.
Busy looping may have the effect of allowing kswapd to make more
progress and in some cases may be better overall. If this is found then
the correct action is to put direct reclaimers to sleep on a waitqueue
and allow kswapd make forward progress. Busy looping on the zlc is even
worse than when the allocator used to blindly call congestion_wait().
2. There was higher swap activity as direct reclaim was active.
3. Direct reclaim efficiency was lower. This is related to 1 as more
scanning activity also encountered more pages that could not be
immediately reclaimed
In that case, the direct page scan and reclaim rates are noticeable but
it is not considered a problem for a few reasons
1. The test is primarily concerned with latency. The mmap attempts are also
faulted which means there are THP allocation requests. The ZLC could
cause zones to be disabled causing the process to busy loop instead
of reclaiming. This looks like elevated direct reclaim activity but
it's the correct action to take based on what processes requested.
2. The test hammers reclaim and compaction heavily. The number of successful
THP faults is highly variable but affects the reclaim stats. It's not a
realistic or reasonable measure of page reclaim activity.
3. No other page-reclaim intensive workload that was tested showed a problem.
4. If a workload is identified that benefitted from the busy looping then it
should be fixed by having direct reclaimers sleep on a wait queue until
woken by kswapd instead of busy looping. We had this class of problem before
when congestion_waits() with a fixed timeout was a brain damaged decision
but happened to benefit some workloads.
If a workload is identified that relied on the zlc to busy loop then it
should be fixed correctly and have a direct reclaimer sleep on a waitqueue
until woken by kswapd.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vitaly Wool <vitalywool@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
__GFP_WAIT was used to signal that the caller was in atomic context and
could not sleep. Now it is possible to distinguish between true atomic
context and callers that are not willing to sleep. The latter should
clear __GFP_DIRECT_RECLAIM so kswapd will still wake. As clearing
__GFP_WAIT behaves differently, there is a risk that people will clear the
wrong flags. This patch renames __GFP_WAIT to __GFP_RECLAIM to clearly
indicate what it does -- setting it allows all reclaim activity, clearing
them prevents it.
[akpm@linux-foundation.org: fix build]
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christoph Lameter <cl@linux.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Vitaly Wool <vitalywool@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
GFP_IOFS was intended to be shorthand for clearing two flags, not a set of
allocation flags. There is only one user of this flag combination now and
there appears to be no reason why Lustre had to be protected from reclaim
stalls. As none of the sites appear to be atomic, this patch simply
deletes GFP_IOFS and converts Lustre to using GFP_KERNEL, GFP_NOFS or
GFP_NOIO as appropriate.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Oleg Drokin <oleg.drokin@intel.com>
Cc: Andreas Dilger <andreas.dilger@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
sleep and avoiding waking kswapd
__GFP_WAIT has been used to identify atomic context in callers that hold
spinlocks or are in interrupts. They are expected to be high priority and
have access one of two watermarks lower than "min" which can be referred
to as the "atomic reserve". __GFP_HIGH users get access to the first
lower watermark and can be called the "high priority reserve".
Over time, callers had a requirement to not block when fallback options
were available. Some have abused __GFP_WAIT leading to a situation where
an optimisitic allocation with a fallback option can access atomic
reserves.
This patch uses __GFP_ATOMIC to identify callers that are truely atomic,
cannot sleep and have no alternative. High priority users continue to use
__GFP_HIGH. __GFP_DIRECT_RECLAIM identifies callers that can sleep and
are willing to enter direct reclaim. __GFP_KSWAPD_RECLAIM to identify
callers that want to wake kswapd for background reclaim. __GFP_WAIT is
redefined as a caller that is willing to enter direct reclaim and wake
kswapd for background reclaim.
This patch then converts a number of sites
o __GFP_ATOMIC is used by callers that are high priority and have memory
pools for those requests. GFP_ATOMIC uses this flag.
o Callers that have a limited mempool to guarantee forward progress clear
__GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall
into this category where kswapd will still be woken but atomic reserves
are not used as there is a one-entry mempool to guarantee progress.
o Callers that are checking if they are non-blocking should use the
helper gfpflags_allow_blocking() where possible. This is because
checking for __GFP_WAIT as was done historically now can trigger false
positives. Some exceptions like dm-crypt.c exist where the code intent
is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to
flag manipulations.
o Callers that built their own GFP flags instead of starting with GFP_KERNEL
and friends now also need to specify __GFP_KSWAPD_RECLAIM.
The first key hazard to watch out for is callers that removed __GFP_WAIT
and was depending on access to atomic reserves for inconspicuous reasons.
In some cases it may be appropriate for them to use __GFP_HIGH.
The second key hazard is callers that assembled their own combination of
GFP flags instead of starting with something like GFP_KERNEL. They may
now wish to specify __GFP_KSWAPD_RECLAIM. It's almost certainly harmless
if it's missed in most cases as other activity will wake kswapd.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Vitaly Wool <vitalywool@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This patch redefines which GFP bits are used for specifying mobility and
the order of the migrate types. Once redefined it's possible to convert
GFP flags to a migrate type with a simple mask and shift. The only
downside is that readers of OOM kill messages and allocation failures may
have been used to the existing values but scripts/gfp-translate will help.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vitaly Wool <vitalywool@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
There is a seqcounter that protects against spurious allocation failures
when a task is changing the allowed nodes in a cpuset. There is no need
to check the seqcounter until a cpuset exists.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vitaly Wool <vitalywool@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Overall, the intent of this series is to remove the zonelist cache which
was introduced to avoid high overhead in the page allocator. Once this is
done, it is necessary to reduce the cost of watermark checks.
The series starts with minor micro-optimisations.
Next it notes that GFP flags that affect watermark checks are abused.
__GFP_WAIT historically identified callers that could not sleep and could
access reserves. This was later abused to identify callers that simply
prefer to avoid sleeping and have other options. A patch distinguishes
between atomic callers, high-priority callers and those that simply wish
to avoid sleep.
The zonelist cache has been around for a long time but it is of dubious
merit with a lot of complexity and some issues that are explained. The
most important issue is that a failed THP allocation can cause a zone to
be treated as "full". This potentially causes unnecessary stalls, reclaim
activity or remote fallbacks. The issues could be fixed but it's not
worth it. The series places a small number of other micro-optimisations
on top before examining GFP flags watermarks.
High-order watermarks enforcement can cause high-order allocations to fail
even though pages are free. The watermark checks both protect high-order
atomic allocations and make kswapd aware of high-order pages but there is
a much better way that can be handled using migrate types. This series
uses page grouping by mobility to reserve pageblocks for high-order
allocations with the size of the reservation depending on demand. kswapd
awareness is maintained by examining the free lists. By patch 12 in this
series, there are no high-order watermark checks while preserving the
properties that motivated the introduction of the watermark checks.
This patch (of 10):
No user of zone_watermark_ok_safe() specifies alloc_flags. This patch
removes the unnecessary parameter.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Christoph Lameter <cl@linux.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs updates from Chris Mason:
"We have a lot of subvolume quota improvements in here, along with big
piles of cleanups from Dave Sterba and Anand Jain and others.
Josef pitched in a batch of allocator fixes based on production use
here at FB. We found that mount -o ssd_spread greatly improved our
performance on hardware raid5/6, but it exposed some CPU bottlenecks
in the allocator. These patches make a huge difference"
* 'for-linus-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (100 commits)
Btrfs: fix hole punching when using the no-holes feature
Btrfs: find_free_extent: Do not erroneously skip LOOP_CACHING_WAIT state
btrfs: Fix a data space underflow warning
btrfs: qgroup: Fix a rebase bug which will cause qgroup double free
btrfs: qgroup: Fix a race in delayed_ref which leads to abort trans
btrfs: clear PF_NOFREEZE in cleaner_kthread()
btrfs: qgroup: Don't copy extent buffer to do qgroup rescan
btrfs: add balance filters limits, stripes and usage to supported mask
btrfs: extend balance filter usage to take minimum and maximum
btrfs: add balance filter for stripes
btrfs: extend balance filter limit to take minimum and maximum
btrfs: fix use after free iterating extrefs
btrfs: check unsupported filters in balance arguments
Btrfs: fix regression running delayed references when using qgroups
Btrfs: fix regression when running delayed references
Btrfs: don't do extra bitmap search in one bit case
Btrfs: keep track of largest extent in bitmaps
Btrfs: don't keep trying to build clusters if we are fragmented
Btrfs: cut down on loops through the allocator
Btrfs: don't continue setting up space cache when enospc
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 updates from Ted Ts'o:
"Add support for the CSUM_SEED feature which will allow future
userspace utilities to change the file system's UUID without rewriting
all of the file system metadata.
A number of miscellaneous fixes, the most significant of which are in
the ext4 encryption support. Anyone wishing to use the encryption
feature should backport all of the ext4 crypto patches up to 4.4 to
get fixes to a memory leak and file system corruption bug.
There are also cleanups in ext4's feature test macros and in ext4's
sysfs support code"
* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (26 commits)
fs/ext4: remove unnecessary new_valid_dev check
ext4: fix abs() usage in ext4_mb_check_group_pa
ext4: do not allow journal_opts for fs w/o journal
ext4: explicit mount options parsing cleanup
ext4, jbd2: ensure entering into panic after recording an error in superblock
[PATCH] fix calculation of meta_bg descriptor backups
ext4: fix potential use after free in __ext4_journal_stop
jbd2: fix checkpoint list cleanup
ext4: fix xfstest generic/269 double revoked buffer bug with bigalloc
ext4: make the bitmap read routines return real error codes
jbd2: clean up feature test macros with predicate functions
ext4: clean up feature test macros with predicate functions
ext4: call out CRC and corruption errors with specific error codes
ext4: store checksum seed in superblock
ext4: reserve code points for the project quota feature
ext4: promote ext4 over ext2 in the default probe order
jbd2: gate checksum calculations on crc driver presence, not sb flags
ext4: use private version of page_zero_new_buffers() for data=journal mode
ext4 crypto: fix bugs in ext4_encrypted_zeroout()
ext4 crypto: replace some BUG_ON()'s with error checks
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic
Pull asm-generic cleanups from Arnd Bergmann:
"The asm-generic changes for 4.4 are mostly a series from Christoph
Hellwig to clean up various abuses of headers in there. The patch to
rename the io-64-nonatomic-*.h headers caused some conflicts with new
users, so I added a workaround that we can remove in the next merge
window.
The only other patch is a warning fix from Marek Vasut"
* tag 'asm-generic-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic:
asm-generic: temporarily add back asm-generic/io-64-nonatomic*.h
asm-generic: cmpxchg: avoid warnings from macro-ized cmpxchg() implementations
gpio-mxc: stop including <asm-generic/bug>
n_tracesink: stop including <asm-generic/bug>
n_tracerouter: stop including <asm-generic/bug>
mlx5: stop including <asm-generic/kmap_types.h>
hifn_795x: stop including <asm-generic/kmap_types.h>
drbd: stop including <asm-generic/kmap_types.h>
move count_zeroes.h out of asm-generic
move io-64-nonatomic*.h out of asm-generic
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracking updates from Steven Rostedt:
"Most of the changes are clean ups and small fixes. Some of them have
stable tags to them. I searched through my INBOX just as the merge
window opened and found lots of patches to pull. I ran them through
all my tests and they were in linux-next for a few days.
Features added this release:
----------------------------
- Module globbing. You can now filter function tracing to several
modules. # echo '*:mod:*snd*' > set_ftrace_filter (Dmitry Safonov)
- Tracer specific options are now visible even when the tracer is not
active. It was rather annoying that you can only see and modify
tracer options after enabling the tracer. Now they are in the
options/ directory even when the tracer is not active. Although
they are still only visible when the tracer is active in the
trace_options file.
- Trace options are now per instance (although some of the tracer
specific options are global)
- New tracefs file: set_event_pid. If any pid is added to this file,
then all events in the instance will filter out events that are not
part of this pid. sched_switch and sched_wakeup events handle next
and the wakee pids"
* tag 'trace-v4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (68 commits)
tracefs: Fix refcount imbalance in start_creating()
tracing: Put back comma for empty fields in boot string parsing
tracing: Apply tracer specific options from kernel command line.
tracing: Add some documentation about set_event_pid
ring_buffer: Remove unneeded smp_wmb() before wakeup of reader benchmark
tracing: Allow dumping traces without tracking trace started cpus
ring_buffer: Fix more races when terminating the producer in the benchmark
ring_buffer: Do no not complete benchmark reader too early
tracing: Remove redundant TP_ARGS redefining
tracing: Rename max_stack_lock to stack_trace_max_lock
tracing: Allow arch-specific stack tracer
recordmcount: arm64: Replace the ignored mcount call into nop
recordmcount: Fix endianness handling bug for nop_mcount
tracepoints: Fix documentation of RCU lockdep checks
tracing: ftrace_event_is_function() can return boolean
tracing: is_legal_op() can return boolean
ring-buffer: rb_event_is_commit() can return boolean
ring-buffer: rb_per_cpu_empty() can return boolean
ring_buffer: ring_buffer_empty{cpu}() can return boolean
ring-buffer: rb_is_reader_page() can return boolean
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux
Pull DeviceTree updates from Rob Herring:
"A fairly large (by DT standards) pull request this time with the
majority being some overdue moving DT binding docs around to
consolidate similar bindings.
- DT binding doc consolidation moving similar bindings to common
locations. The majority of these are display related which were
scattered in video/, fb/, drm/, gpu/, and panel/ directories.
- Add new config option, CONFIG_OF_ALL_DTBS, to enable building all
dtbs in the tree for most arches with dts files (except powerpc for
now).
- OF_IRQ=n fixes for user enabled CONFIG_OF.
- of_node_put ref counting fixes from Julia Lawall.
- Common DT binding for wakeup-source and deprecation of all similar
bindings.
- DT binding for PXA LCD controller.
- Allow ignoring failed PCI resource translations in order to ignore
64-bit addresses on non-LPAE 32-bit kernels.
- Support setting the NUMA node from DT instead of only from parent
device.
- Couple of earlycon DT parsing fixes for address and options"
* tag 'devicetree-for-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux: (45 commits)
MAINTAINERS: update DT binding doc locations
devicetree: add Sigma Designs vendor prefix
of: simplify arch_find_n_match_cpu_physical_id() function
Documentation: arm: Fixed typo in socfpga fpga mgr example
Documentation: devicetree: fix reference to legacy wakeup properties
Documentation: devicetree: standardize/consolidate on "wakeup-source" property
drivers: of: removing assignment of 0 to static variable
xtensa: enable building of all dtbs
mips: enable building of all dtbs
metag: enable building of all dtbs
metag: use common make variables for dtb builds
h8300: enable building of all dtbs
arm64: enable building of all dtbs
arm: enable building of all dtbs
arc: enable building of all dtbs
arc: use common make variables for dtb builds
of: add config option to enable building of all dtbs
of/fdt: fix error checking for earlycon address
of/overlay: add missing of_node_put
of/platform: add missing of_node_put
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input
Pull input updates from Dmitry Torokhov:
"Items of note:
- evdev users can now limit or mask the kind of events they will
receive. This will allow applications such as power manager or
network manager to only be woken when user presses special keys
such as KEY_POWER or KEY_WIFI and not be bothered with ordinary
key presses coming from keyboard
- support for FocalTech FT6236 touchscreen controller
- support for ROHM BU21023/24 touchscreen controller
- edt-ft5x06 touchscreen driver got a face lift and can now be used
with FT5506
- support for Google Fiber TV Box remote controls
- improvements in xpad driver (with more to come)
- several parport-based drivers have been switched to the new device
model
- other miscellaneous driver improvements"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input: (70 commits)
HID: hid-gfrm: avoid warning for input_configured API change
HID: hid-input: allow input_configured callback return errors
Input: evdev - fix bug in checking duplicate clock change request
Input: add userio module
Input: evdev - add event-mask API
Input: snvs_pwrkey - remove duplicated semicolon
HID: hid-gfrm: Google Fiber TV Box remote controls
Input: e3x0-button - update Kconfig description
Input: tegra-kbc - drop use of IRQF_NO_SUSPEND flag
Input: tegra-kbc - enable support for the standard "wakeup-source" property
Input: xen - check return value of xenbus_printf
Input: hp_sdc_rtc - fix y2038 problem in proc_show
Input: nomadik-ske-keypad - fix a trivial typo
Input: xpad - fix clash of presence handling with LED setting
Input: edt-ft5x06 - work around FT5506 firmware bug
Input: edt-ft5x06 - add support for FT5506
Input: edt-ft5x06 - add support for different max support points
Input: edt-ft5x06 - use max support points to determine how much to read
Input: rotary-encoder - add support for quarter-period mode
Input: rotary-encoder - use of_property_read_bool
...
|
|
Pull MTD updates from Brian Norris:
"Core:
- WARN (in some cases) when a struct mtd_info is registered multiple
times; in the past this was "supported", but it's still error prone
for future development. There's only one ugly case of this left in
the tree (that we're aware of) and the owners are aware of the
problems there.
- fix potential deadlock in the blkdev removal path NOTE: the
(potential) deadlock was introduced in a for-stable patch. This
one is also marked for -stable.
- ioctl(BLKPG) compat_ioctl support; resolves issues with 32-bit user
space vs 64-bit kernel space
- Set MTD parent device correctly throughout the tree, so the tree
structure appears correctly in sysfs; many drivers were missing
this (soft) requirement
- Move device tree partitions (ofpart) into a dedicated 'partitions'
subnode; this helps to disambiguate whether a node is a partition
or some other auxiliary data
- Improve error handling for partitioning failures
NAND:
- General: Increase timeout period, for corner-case systems with
less-than-accurate jiffies
- Fix OF-based autoloading of several NAND drivers when built as
modules
- pxa3xx_nand:
- Rework timing configuration to be more dynamic
- Refactor PM support
- brcmnand: prepare for NorthStar 2 support (ARM64, 16-bit NAND
chips)
- sunxi_nand: refactoring and a few bug fixes
- vf610: new NAND driver
- FSMC: add SW BCH support; support common NAND DT bindings
- lpc32xx_slc: refactor and improve timing calculations logic
- denali: support for rev 5.1
SPI NOR:
- Layering improvements
- Added Winbond lock/unlock support
- Added mtd_is_locked() (i.e., ioctl(MEMISLOCKED)) support
- Increase full-chip-erase timeout linearly with flash size
- fsl-quadspi: fix compile for non-ARM architectures
- New flash support"
* tag 'for-linus-20151106' of git://git.infradead.org/linux-mtd: (169 commits)
mtd: don't WARN about overloaded users of mtd->reboot_notifier.notifier_call
mtd: nand: sunxi: avoid retrieving data before ECC pass
mtd: nand: sunxi: fix sunxi_nfc_hw_ecc_read/write_chunk()
mtd: blkdevs: fix potential deadlock + lockdep warnings
mtd: ofpart: move ofpart partitions to a dedicated dt node
doc: dt: mtd: support partitions in a special 'partitions' subnode
mtd: brcmnand: Force 8bit mode before doing nand_scan_ident()
mtd: brcmnand: factor out CFG and CFG_EXT bitfields
mtd: mtdpart: Do not fail mtd probe when parsing partitions fails
mtd: fsl-quadspi: fix macro collision problems with READ/WRITE
mtd: warn when registering the same master many times
mtd: fixup corner case error handling in mtd_device_parse_register()
mtd: tests: Replace timeval with ktime_t
mtd: fsmc_nand: Add BCH4 SW ECC support for SPEAr600
mtd: nand: vf610_nfc: use nand_check_erased_ecc_chunk() helper
mtd: nand: increase ready wait timeout and report timeouts
mtd: docg3: off by one in doc_register_sysfs()
mtd: pxa3xx_nand: clean up the pxa3xx timings
mtd: pxa3xx_nand: rework flash detection and timing setup
mtd: pxa3xx_nand: add helpers to setup the timings
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci
Pull PCI updates from Bjorn Helgaas:
"Resource management:
- Add support for Enhanced Allocation devices (Sean O. Stalley)
- Add Enhanced Allocation register entries (Sean O. Stalley)
- Handle IORESOURCE_PCI_FIXED when sizing resources (David Daney)
- Handle IORESOURCE_PCI_FIXED when assigning resources (David Daney)
- Handle Enhanced Allocation capability for SR-IOV devices (David Daney)
- Clear IORESOURCE_UNSET when reverting to firmware-assigned address (Bjorn Helgaas)
- Make Enhanced Allocation bitmasks more obvious (Bjorn Helgaas)
- Expand Enhanced Allocation BAR output (Bjorn Helgaas)
- Add of_pci_check_probe_only to parse "linux,pci-probe-only" (Marc Zyngier)
- Fix lookup of linux,pci-probe-only property (Marc Zyngier)
- Add sparc mem64 resource parsing for root bus (Yinghai Lu)
PCI device hotplug:
- pciehp: Queue power work requests in dedicated function (Guenter Roeck)
Driver binding:
- Add builtin_pci_driver() to avoid registration boilerplate (Paul Gortmaker)
Virtualization:
- Set SR-IOV NumVFs to zero after enumeration (Alexander Duyck)
- Remove redundant validation of SR-IOV offset/stride registers (Alexander Duyck)
- Remove VFs in reverse order if virtfn_add() fails (Alexander Duyck)
- Reorder pcibios_sriov_disable() (Alexander Duyck)
- Wait 1 second between disabling VFs and clearing NumVFs (Alexander Duyck)
- Fix sriov_enable() error path for pcibios_enable_sriov() failures (Alexander Duyck)
- Enable SR-IOV ARI Capable Hierarchy before reading TotalVFs (Ben Shelton)
- Don't try to restore VF BARs (Wei Yang)
MSI:
- Don't alloc pcibios-irq when MSI is enabled (Joerg Roedel)
- Add msi_controller setup_irqs() method for special multivector setup (Lucas Stach)
- Export all remapped MSIs to sysfs attributes (Romain Bezut)
- Disable MSI on SiS 761 (Ondrej Zary)
AER:
- Clear error status registers during enumeration and restore (Taku Izumi)
Generic host bridge driver:
- Fix lookup of linux,pci-probe-only property (Marc Zyngier)
- Allow multiple hosts with different map_bus() methods (David Daney)
- Pass starting bus number to pci_scan_root_bus() (David Daney)
- Fix address window calculation for non-zero starting bus (David Daney)
Altera host bridge driver:
- Add msi.h to ARM Kbuild (Ley Foon Tan)
- Add Altera PCIe host controller driver (Ley Foon Tan)
- Add Altera PCIe MSI driver (Ley Foon Tan)
APM X-Gene host bridge driver:
- Remove msi_controller assignment (Duc Dang)
Broadcom iProc host bridge driver:
- Fix header comment "Corporation" misspelling (Florian Fainelli)
- Fix code comment to match code (Ray Jui)
- Remove unused struct iproc_pcie.irqs[] (Ray Jui)
- Call pci_fixup_irqs() for ARM64 as well as ARM (Ray Jui)
- Fix PCIe reset logic (Ray Jui)
- Improve link detection logic (Ray Jui)
- Update PCIe device tree bindings (Ray Jui)
- Add outbound mapping support (Ray Jui)
Freescale i.MX6 host bridge driver:
- Return real error code from imx6_add_pcie_port() (Fabio Estevam)
- Add PCIE_PHY_RX_ASIC_OUT_VALID definition (Fabio Estevam)
Freescale Layerscape host bridge driver:
- Remove ls_pcie_establish_link() (Minghuan Lian)
- Ignore PCIe controllers in Endpoint mode (Minghuan Lian)
- Factor out SCFG related function (Minghuan Lian)
- Update ls_add_pcie_port() (Minghuan Lian)
- Remove unused fields from struct ls_pcie (Minghuan Lian)
- Add support for LS1043a and LS2080a (Minghuan Lian)
- Add ls_pcie_msi_host_init() (Minghuan Lian)
HiSilicon host bridge driver:
- Add HiSilicon SoC Hip05 PCIe driver (Zhou Wang)
Marvell MVEBU host bridge driver:
- Return zero for reserved or unimplemented config space (Russell King)
- Use exact config access size; don't read/modify/write (Russell King)
- Use of_get_available_child_count() (Russell King)
- Use for_each_available_child_of_node() to walk child nodes (Russell King)
- Report full node name when reporting a DT error (Russell King)
- Use port->name rather than "PCIe%d.%d" (Russell King)
- Move port parsing and resource claiming to separate function (Russell King)
- Fix memory leaks and refcount leaks (Russell King)
- Split port parsing and resource claiming from port setup (Russell King)
- Use gpio_set_value_cansleep() (Russell King)
- Use devm_kcalloc() to allocate an array (Russell King)
- Use gpio_desc to carry around gpio (Russell King)
- Improve clock/reset handling (Russell King)
- Add PCI Express root complex capability block (Russell King)
- Remove code restricting accesses to slot 0 (Russell King)
NVIDIA Tegra host bridge driver:
- Wrap static pgprot_t initializer with __pgprot() (Ard Biesheuvel)
Renesas R-Car host bridge driver:
- Build pci-rcar-gen2.c only on ARM (Geert Uytterhoeven)
- Build pcie-rcar.c only on ARM (Geert Uytterhoeven)
- Make PCI aware of the I/O resources (Phil Edworthy)
- Remove dependency on ARM-specific struct hw_pci (Phil Edworthy)
- Set root bus nr to that provided in DT (Phil Edworthy)
- Fix I/O offset for multiple host bridges (Phil Edworthy)
ST Microelectronics SPEAr13xx host bridge driver:
- Fix dw_pcie_cfg_read/write() usage (Gabriele Paoloni)
Synopsys DesignWare host bridge driver:
- Make "clocks" and "clock-names" optional DT properties (Bhupesh Sharma)
- Use exact access size in dw_pcie_cfg_read() (Gabriele Paoloni)
- Simplify dw_pcie_cfg_read/write() interfaces (Gabriele Paoloni)
- Require config accesses to be naturally aligned (Gabriele Paoloni)
- Make "num-lanes" an optional DT property (Gabriele Paoloni)
- Move calculation of bus addresses to DRA7xx (Gabriele Paoloni)
- Replace ARM pci_sys_data->align_resource with global function pointer (Gabriele Paoloni)
- Factor out MSI msg setup (Lucas Stach)
- Implement multivector MSI IRQ setup (Lucas Stach)
- Make get_msi_addr() return phys_addr_t, not u32 (Lucas Stach)
- Set up high part of MSI target address (Lucas Stach)
- Fix PORT_LOGIC_LINK_WIDTH_MASK (Zhou Wang)
- Revert "PCI: designware: Program ATU with untranslated address" (Zhou Wang)
- Use of_pci_get_host_bridge_resources() to parse DT (Zhou Wang)
- Make driver arch-agnostic (Zhou Wang)
Miscellaneous:
- Make x86 pci_subsys_init() static (Alexander Kuleshov)
- Turn off Request Attributes to avoid Chelsio T5 Completion erratum (Hariprasad Shenai)"
* tag 'pci-v4.4-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (94 commits)
PCI: altera: Add Altera PCIe MSI driver
PCI: hisi: Add HiSilicon SoC Hip05 PCIe driver
PCI: layerscape: Add ls_pcie_msi_host_init()
PCI: layerscape: Add support for LS1043a and LS2080a
PCI: layerscape: Remove unused fields from struct ls_pcie
PCI: layerscape: Update ls_add_pcie_port()
PCI: layerscape: Factor out SCFG related function
PCI: layerscape: Ignore PCIe controllers in Endpoint mode
PCI: layerscape: Remove ls_pcie_establish_link()
PCI: designware: Make "clocks" and "clock-names" optional DT properties
PCI: designware: Make driver arch-agnostic
ARM/PCI: Replace pci_sys_data->align_resource with global function pointer
PCI: designware: Use of_pci_get_host_bridge_resources() to parse DT
Revert "PCI: designware: Program ATU with untranslated address"
PCI: designware: Move calculation of bus addresses to DRA7xx
PCI: designware: Make "num-lanes" an optional DT property
PCI: designware: Require config accesses to be naturally aligned
PCI: designware: Simplify dw_pcie_cfg_read/write() interfaces
PCI: designware: Use exact access size in dw_pcie_cfg_read()
PCI: spear: Fix dw_pcie_cfg_read/write() usage
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
Pull sound updates from Takashi Iwai:
"Here is the first batch of updates for sound system on 4.4-rc1.
Again at this time, the update looks fairly calm; no big changes in
either ALSA core or ASoC infrastructures, rather all small cleanups,
in addition to the new stuff as usual.
The biggest changes are about Firewire sound devices. It gained lots
of new device support, and MIDI functionality. Also there are updates
for a few still working-in-progress stuff (topology API and ASoC
skylake), too. But overall, this update should give no big surprise.
Some highlights are below:
Core:
- A few more Kconfig items for tinification; it's marked as EXPERT,
so normal user should't be bothered :)
- Refactoring with a new PCM hw_constraint helper
- Removal of unused transfer_ack_{begin,end} PCM callbacks
Firewire:
- Restructuring of code subtree, lots of refactoring
- Support AMDTP variants
- New driver for Digidesign 002/003 family
- Adds support for TASCAM FireOne to ALSA OXFW driver
- Add MIDI support to TASCAM and Digi00x devices
HD-Audio:
- Automated modalias generation for codec drivers, finally
- Improvement on heuristics for setting mixer name
- A few fixes for longstanding bugs on Creative CA0132 cards
- Addition of audio rate callback with i915 communication
- Fix suspend issue on recent Dell XPS
- Intel Lewisburg controller support
ASoC:
- Updates to the topology userspace interface
- Big updates to the Renesas support (rcar)
- More updates for supporting Intel Sky Lake systems
- New drivers for Asahi Kasei Microdevices AK4613, Allwinnner A10,
Cirrus Logic WM8998, Dialog DA7219, Nuvoton NAU8825, Rockchip
S/PDIF, and Atmel class D amplifier
USB-Audio:
- A fix for newer Roland MIDI devices
- Quirks and workarounds for Zoom R16/24 device
Misc:
- A few fixes for some old Cirrus CS46xx PCI sound boards
- Yet another fixes for some old ESS Maestro3 PCI sound boards"
* tag 'sound-4.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: (330 commits)
ALSA: hda - Add Intel Lewisburg device IDs Audio
ALSA: hda - Apply pin fixup for HP ProBook 6550b
ALSA: hda - Fix lost 4k BDL boundary workaround
ALSA: maestro3: Fix Allegro mute until master volume/mute is touched
ALSA: maestro3: Enable docking support for Dell Latitude C810
ALSA: firewire-digi00x: add another rawmidi character device for MIDI control ports
ALSA: firewire-digi00x: add MIDI operations for MIDI control port
ALSA: firewire-digi00x: rename identifiers of MIDI operation for physical ports
ALSA: cs46xx: Fix suspend for all channels
ALSA: cs46xx: Fix Duplicate front for CS4294 and CS4298 codecs
ALSA: DocBook: Add soc-ops.c and soc-compress.c
ALSA: hda - Add / fix kernel doc comments
ALSA: Constify ratden/ratnum constraints
ALSA: hda - Disable 64bit address for Creative HDA controllers
ALSA: hda/realtek - Dell XPS one ALC3260 speaker no sound after resume back
ALSA: hda/ca0132 - Convert leftover pr_info() and pr_err()
ASoC: fsl: Use #ifdef instead of #if for CONFIG_PM_SLEEP
ASoC: rt5645: Sort the order for register bit defines
ASoC: dwc: add check for master/slave format
ASoC: rt5645: Add the HWEQ for the speaker output
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/lee/mfd
Pull MFD updates from Lee Jones:
"New Device Support:
- Add support for 88pm860; 88pm80x
- Add support for 24c08 EEPROM; at24
- Add support for Broxton Whiskey Cove; intel*
- Add support for RTS522A; rts5227
- Add support for I2C devices; intel_quark_i2c_gpio
New Functionality:
- Add microphone support; arizona
- Add general purpose switch support; arizona
- Add fuel-gauge support; da9150-core
- Add shutdown support; sec-core
- Add charger support; tps65217
- Add flexible serial communication unit support; atmel-flexcom
- Add power button support; axp20x
- Add led-flash support; rt5033
Core Frameworks:
- Supply a generic macro for defining Regmap IRQs
- Rework ACPI child device matching
Fix-ups:
- Use Regmap to access registers; tps6105x
- Use DEFINE_RES_IRQ_NAMED() macro; da9150
- Re-arrange device registration order; intel_quark_i2c_gpio
- Allow OF matching; cros_ec_i2c, atmel-hlcdc, hi6421-pmic, max8997, sm501
- Handle deferred probe; twl6040
- Improve accuracy of headphone detect; arizona
- Unnecessary MODULE_ALIAS() removal; bcm590xx, rt5033
- Remove unused code; htc-i2cpld, arizona, pcf50633-irq, sec-core
- Simplify code; kempld, rts5209, da903x, lm3533, da9052, arizona
- Remove #iffery; arizona
- DT binding adaptions; many
Bug Fixes:
- Fix possible NULL pointer dereference; wm831x, tps6105x
- Fix 64bit bug; intel_soc_pmic_bxtwc
- Fix signedness issue; arizona"
* tag 'mfd-for-linus-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/mfd: (73 commits)
bindings: mfd: s2mps11: Add documentation for s2mps15 PMIC
mfd: sec-core: Remove unused s2mpu02-rtc and s2mpu02-clk children
extcon: arizona: Add extcon specific device tree binding document
MAINTAINERS: Add binding docs for Cirrus Logic/Wolfson Arizona devices
mfd: arizona: Remove bindings covered in new subsystem specific docs
mfd: rt5033: Add RT5033 Flash led sub device
mfd: lpss: Add Intel Broxton PCI IDs
mfd: lpss: Add Broxton ACPI IDs
mfd: arizona: Signedness bug in arizona_runtime_suspend()
mfd: axp20x: Add a cell for the power button part of the, axp288 PMICs
mfd: dt-bindings: Document pulled down WRSTBI pin on S2MPS1X
mfd: sec-core: Disable buck voltage reset on watchdog falling edge
mfd: sec-core: Dump PMIC revision to find out the HW
mfd: arizona: Use correct type ID for device tree config
mfd: arizona: Remove use of codec build config #ifdefs
mfd: arizona: Simplify adding subdevices
mfd: arizona: Downgrade type mismatch messages to dev_warn
mfd: arizona: Factor out checking of jack detection state
mfd: arizona: Factor out DCVDD isolation control
mfd: Make TPS6105X select REGMAP_I2C
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc updates from Michael Ellerman:
- Kconfig: remove BE-only platforms from LE kernel build from Boqun
Feng
- Refresh ps3_defconfig from Geoff Levand
- Emit GNU & SysV hashes for the vdso from Michael Ellerman
- Define an enum for the bolted SLB indexes from Anshuman Khandual
- Use a local to avoid multiple calls to get_slb_shadow() from Michael
Ellerman
- Add gettimeofday() benchmark from Michael Neuling
- Avoid link stack corruption in __get_datapage() from Michael Neuling
- Add virt_to_pfn and use this instead of opencoding from Aneesh Kumar
K.V
- Add ppc64le_defconfig from Michael Ellerman
- pseries: extract of_helpers module from Andy Shevchenko
- Correct string length in pseries_of_derive_parent() from Nathan
Fontenot
- Free the MSI bitmap if it was slab allocated from Denis Kirjanov
- Shorten irq_chip name for the SIU from Christophe Leroy
- Wait 1s for secondaries to enter OPAL during kexec from Samuel
Mendoza-Jonas
- Fix _ALIGN_* errors due to type difference, from Aneesh Kumar K.V
- powerpc/pseries/hvcserver: don't memset pi_buff if it is null from
Colin Ian King
- Disable hugepd for 64K page size, from Aneesh Kumar K.V
- Differentiate between hugetlb and THP during page walk from Aneesh
Kumar K.V
- Make PCI non-optional for pseries from Michael Ellerman
- Individual System V IPC system calls from Sam bobroff
- Add selftest of unmuxed IPC calls from Michael Ellerman
- discard .exit.data at runtime from Stephen Rothwell
- Delete old orphaned PrPMC 280/2800 DTS and boot file, from Paul
Gortmaker
- Use of_get_next_parent to simplify code from Christophe Jaillet
- Paginate some xmon output from Sam bobroff
- Add some more elements to the xmon PACA dump from Michael Ellerman
- Allow the tm-syscall selftest to build with old headers from Michael
Ellerman
- Run EBB selftests only on POWER8 from Denis Kirjanov
- Drop CONFIG_TUNE_CELL in favour of CONFIG_CELL_CPU from Michael
Ellerman
- Avoid reference to potentially freed memory in prom.c from Christophe
Jaillet
- Quieten boot wrapper output with run_cmd from Geoff Levand
- EEH fixes and cleanups from Gavin Shan
- Fix recursive fenced PHB on Broadcom shiner adapter from Gavin Shan
- Use of_get_next_parent() in of_get_ibm_chip_id() from Michael
Ellerman
- Fix section mismatch warning in msi_bitmap_alloc() from Denis
Kirjanov
- Fix ps3-lpm white space from Rudhresh Kumar J
- Fix ps3-vuart null dereference from Colin King
- nvram: Add missing kfree in error path from Christophe Jaillet
- nvram: Fix function name in some errors messages, from Christophe
Jaillet
- drivers/macintosh: adb: fix misleading Kconfig help text from Aaro
Koskinen
- agp/uninorth: fix a memleak in create_gatt_table from Denis Kirjanov
- cxl: Free virtual PHB when removing from Andrew Donnellan
- scripts/kconfig/Makefile: Allow KBUILD_DEFCONFIG to be a target from
Michael Ellerman
- scripts/kconfig/Makefile: Fix KBUILD_DEFCONFIG check when building
with O= from Michael Ellerman
- Freescale updates from Scott: Highlights include 64-bit book3e
kexec/kdump support, a rework of the qoriq clock driver, device tree
changes including qoriq fman nodes, support for a new 85xx board, and
some fixes.
- MPC5xxx updates from Anatolij: Highlights include a driver for
MPC512x LocalPlus Bus FIFO with its device tree binding
documentation, mpc512x device tree updates and some minor fixes.
* tag 'powerpc-4.4-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (106 commits)
powerpc/msi: Fix section mismatch warning in msi_bitmap_alloc()
powerpc/prom: Use of_get_next_parent() in of_get_ibm_chip_id()
powerpc/pseries: Correct string length in pseries_of_derive_parent()
powerpc/e6500: hw tablewalk: make sure we invalidate and write to the same tlb entry
powerpc/mpc85xx: Add FSL QorIQ DPAA FMan support to the SoC device tree(s)
powerpc/mpc85xx: Create dts components for the FSL QorIQ DPAA FMan
powerpc/fsl: Add #clock-cells and clockgen label to clockgen nodes
powerpc: handle error case in cpm_muram_alloc()
powerpc: mpic: use IRQCHIP_SKIP_SET_WAKE instead of redundant mpic_irq_set_wake
powerpc/book3e-64: Enable kexec
powerpc/book3e-64/kexec: Set "r4 = 0" when entering spinloop
powerpc/booke: Only use VIRT_PHYS_OFFSET on booke32
powerpc/book3e-64/kexec: Enable SMP release
powerpc/book3e-64/kexec: create an identity TLB mapping
powerpc/book3e-64: Don't limit paca to 256 MiB
powerpc/book3e/kdump: Enable crash_kexec_wait_realmode
powerpc/book3e: support CONFIG_RELOCATABLE
powerpc/booke64: Fix args to copy_and_flush
powerpc/book3e-64: rename interrupt_end_book3e with __end_interrupts
powerpc/e6500: kexec: Handle hardware threads
...
|
|
Merge patch-bomb from Andrew Morton:
- inotify tweaks
- some ocfs2 updates (many more are awaiting review)
- various misc bits
- kernel/watchdog.c updates
- Some of mm. I have a huge number of MM patches this time and quite a
lot of it is quite difficult and much will be held over to next time.
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (162 commits)
selftests: vm: add tests for lock on fault
mm: mlock: add mlock flags to enable VM_LOCKONFAULT usage
mm: introduce VM_LOCKONFAULT
mm: mlock: add new mlock system call
mm: mlock: refactor mlock, munlock, and munlockall code
kasan: always taint kernel on report
mm, slub, kasan: enable user tracking by default with KASAN=y
kasan: use IS_ALIGNED in memory_is_poisoned_8()
kasan: Fix a type conversion error
lib: test_kasan: add some testcases
kasan: update reference to kasan prototype repo
kasan: move KASAN_SANITIZE in arch/x86/boot/Makefile
kasan: various fixes in documentation
kasan: update log messages
kasan: accurately determine the type of the bad access
kasan: update reported bug types for kernel memory accesses
kasan: update reported bug types for not user nor kernel memory accesses
mm/kasan: prevent deadlock in kasan reporting
mm/kasan: don't use kasan shadow pointer in generic functions
mm/kasan: MODULE_VADDR is not available on all archs
...
|
|
The previous patch introduced a flag that specified pages in a VMA should
be placed on the unevictable LRU, but they should not be made present when
the area is created. This patch adds the ability to set this state via
the new mlock system calls.
We add MLOCK_ONFAULT for mlock2 and MCL_ONFAULT for mlockall.
MLOCK_ONFAULT will set the VM_LOCKONFAULT modifier for VM_LOCKED.
MCL_ONFAULT should be used as a modifier to the two other mlockall flags.
When used with MCL_CURRENT, all current mappings will be marked with
VM_LOCKED | VM_LOCKONFAULT. When used with MCL_FUTURE, the mm->def_flags
will be marked with VM_LOCKED | VM_LOCKONFAULT. When used with both
MCL_CURRENT and MCL_FUTURE, all current mappings and mm->def_flags will be
marked with VM_LOCKED | VM_LOCKONFAULT.
Prior to this patch, mlockall() will unconditionally clear the
mm->def_flags any time it is called without MCL_FUTURE. This behavior is
maintained after adding MCL_ONFAULT. If a call to mlockall(MCL_FUTURE) is
followed by mlockall(MCL_CURRENT), the mm->def_flags will be cleared and
new VMAs will be unlocked. This remains true with or without MCL_ONFAULT
in either mlockall() invocation.
munlock() will unconditionally clear both vma flags. munlockall()
unconditionally clears for VMA flags on all VMAs and in the mm->def_flags
field.
Signed-off-by: Eric B Munson <emunson@akamai.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Shuah Khan <shuahkh@osg.samsung.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The cost of faulting in all memory to be locked can be very high when
working with large mappings. If only portions of the mapping will be used
this can incur a high penalty for locking.
For the example of a large file, this is the usage pattern for a large
statical language model (probably applies to other statical or graphical
models as well). For the security example, any application transacting in
data that cannot be swapped out (credit card data, medical records, etc).
This patch introduces the ability to request that pages are not
pre-faulted, but are placed on the unevictable LRU when they are finally
faulted in. The VM_LOCKONFAULT flag will be used together with VM_LOCKED
and has no effect when set without VM_LOCKED. Setting the VM_LOCKONFAULT
flag for a VMA will cause pages faulted into that VMA to be added to the
unevictable LRU when they are faulted or if they are already present, but
will not cause any missing pages to be faulted in.
Exposing this new lock state means that we cannot overload the meaning of
the FOLL_POPULATE flag any longer. Prior to this patch it was used to
mean that the VMA for a fault was locked. This means we need the new
FOLL_MLOCK flag to communicate the locked state of a VMA. FOLL_POPULATE
will now only control if the VMA should be populated and in the case of
VM_LOCKONFAULT, it will not be set.
Signed-off-by: Eric B Munson <emunson@akamai.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Shuah Khan <shuahkh@osg.samsung.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
With the refactored mlock code, introduce a new system call for mlock.
The new call will allow the user to specify what lock states are being
added. mlock2 is trivial at the moment, but a follow on patch will add a
new mlock state making it useful.
Signed-off-by: Eric B Munson <emunson@akamai.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Shuah Khan <shuahkh@osg.samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
refresh_cpu_vm_stats(int cpu) is no longer referenced by !SMP kernel
since Linux 3.12.
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
page_counter_try_charge() currently returns 0 on success and -ENOMEM on
failure, which is surprising behavior given the function name.
Make it follow the expected pattern of try_stuff() functions that return a
boolean true to indicate success, or false for failure.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
After v4.3's commit 0610c25daa3e ("memcg: fix dirty page migration")
mem_cgroup_migrate() doesn't have much to offer in page migration: convert
migrate_misplaced_transhuge_page() to set_page_memcg() instead.
Then rename mem_cgroup_migrate() to mem_cgroup_replace_page(), since its
remaining callers are replace_page_cache_page() and shmem_replace_page():
both of whom passed lrucare true, so just eliminate that argument.
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
If ALLOC_SPLIT_PTLOCKS is defined, ptlock_init may fail, in which case we
shouldn't increment NR_PAGETABLE.
Since small allocations, such as ptlock, normally do not fail (currently
they can fail if kmemcg is used though), this patch does not really fix
anything and should be considered as a code cleanup.
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Before the previous patch ("memcg: unify slab and other kmem pages
charging"), __mem_cgroup_from_kmem had to handle two types of kmem - slab
pages and pages allocated with alloc_kmem_pages - memcg in the page
struct. Now we can unify it. Since after it, this function becomes tiny
we can fold it into mem_cgroup_from_kmem.
[hughd@google.com: move mem_cgroup_from_kmem into list_lru.c]
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
We have memcg_kmem_charge and memcg_kmem_uncharge methods for charging and
uncharging kmem pages to memcg, but currently they are not used for
charging slab pages (i.e. they are only used for charging pages allocated
with alloc_kmem_pages). The only reason why the slab subsystem uses
special helpers, memcg_charge_slab and memcg_uncharge_slab, is that it
needs to charge to the memcg of kmem cache while memcg_charge_kmem charges
to the memcg that the current task belongs to.
To remove this diversity, this patch adds an extra argument to
__memcg_kmem_charge that can be a pointer to a memcg or NULL. If it is
not NULL, the function tries to charge to the memcg it points to,
otherwise it charge to the current context. Next, it makes the slab
subsystem use this function to charge slab pages.
Since memcg_charge_kmem and memcg_uncharge_kmem helpers are now used only
in __memcg_kmem_charge and __memcg_kmem_uncharge, they are inlined. Since
__memcg_kmem_charge stores a pointer to the memcg in the page struct, we
don't need memcg_uncharge_slab anymore and can use free_kmem_pages.
Besides, one can now detect which memcg a slab page belongs to by reading
/proc/kpagecgroup.
Note, this patch switches slab to charge-after-alloc design. Since this
design is already used for all other memcg charges, it should not make any
difference.
[hannes@cmpxchg.org: better to have an outer function than a magic parameter for the memcg lookup]
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Charging kmem pages proceeds in two steps. First, we try to charge the
allocation size to the memcg the current task belongs to, then we allocate
a page and "commit" the charge storing the pointer to the memcg in the
page struct.
Such a design looks overcomplicated, because there is not much sense in
trying charging the allocation before actually allocating a page: we won't
be able to consume much memory over the limit even if we charge after
doing the actual allocation, besides we already charge user pages post
factum, so being pedantic with kmem pages just looks pointless.
So this patch simplifies the design by merging the "charge" and the
"commit" steps into the same function, which takes the allocated page.
Also, rename the charge and uncharge methods to memcg_kmem_charge and
memcg_kmem_uncharge and make the charge method return error code instead
of bool to conform to mem_cgroup_try_charge.
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Change HIGHMEM_ZONE to be the same as the DMA_ZONE macro.
Signed-off-by: yalin wang <yalin.wang2010@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
With x86_64 (config http://ozlabs.org/~akpm/config-akpm2.txt) and old gcc
(4.4.4), drivers/base/node.c:node_read_meminfo() is using 2344 bytes of
stack. Uninlining node_page_state() reduces this to 440 bytes.
The stack consumption issue is fixed by newer gcc (4.8.4) however with
that compiler this patch reduces the node.o text size from 7314 bytes to
4578.
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This came up when implementing HIHGMEM/PAE40 for ARC. The kmap() /
kmap_atomic() generated code seemed needlessly bloated due to the way
PageHighMem() macro is implemented. It derives the exact zone for page
and then does pointer subtraction with first zone to infer the zone_type.
The pointer arithmatic in turn generates the code bloat.
PageHighMem(page)
is_highmem(page_zone(page))
zone_off = (char *)zone - (char *)zone->zone_pgdat->node_zones
Instead use is_highmem_idx() to work on zone_type available in page flags
----- Before -----
80756348: mov_s r13,r0
8075634a: ld_s r2,[r13,0]
8075634c: lsr_s r2,r2,30
8075634e: mpy r2,r2,0x2a4
80756352: add_s r2,r2,0x80aef880
80756358: ld_s r3,[r2,28]
8075635a: sub_s r2,r2,r3
8075635c: breq r2,0x2a4,80756378 <kmap+0x48>
80756364: breq r2,0x548,80756378 <kmap+0x48>
----- After -----
80756330: mov_s r13,r0
80756332: ld_s r2,[r13,0]
80756334: lsr_s r2,r2,30
80756336: sub_s r2,r2,1
80756338: brlo r2,2,80756348 <kmap+0x30>
For x86 defconfig build (32 bit only) it saves around 900 bytes.
For ARC defconfig with HIGHMEM, it saved around 2K bytes.
---->8-------
./scripts/bloat-o-meter x86/vmlinux-defconfig-pre x86/vmlinux-defconfig-post
add/remove: 0/0 grow/shrink: 0/36 up/down: 0/-934 (-934)
function old new delta
saveable_page 162 154 -8
saveable_highmem_page 154 146 -8
skb_gro_reset_offset 147 131 -16
...
...
__change_page_attr_set_clr 1715 1678 -37
setup_data_read 434 394 -40
mon_bin_event 1967 1927 -40
swsusp_save 1148 1105 -43
_set_pages_array 549 493 -56
---->8-------
e.g. For ARC kmap()
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Jennifer Herbert <jennifer.herbert@citrix.com>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The oom killer takes task_lock() in a couple of places solely to protect
printing the task's comm.
A process's comm, including current's comm, may change due to
/proc/pid/comm or PR_SET_NAME.
The comm will always be NULL-terminated, so the worst race scenario would
only be during update. We can tolerate a comm being printed that is in
the middle of an update to avoid taking the lock.
Other locations in the kernel have already dropped task_lock() when
printing comm, so this is consistent.
Signed-off-by: David Rientjes <rientjes@google.com>
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Compaction returns prematurely with COMPACT_PARTIAL when contended or has
fatal signal pending. This is ok for the callers, but might be misleading
in the traces, as the usual reason to return COMPACT_PARTIAL is that we
think the allocation should succeed. After this patch we distinguish the
premature ending condition in the mm_compaction_finished and
mm_compaction_end tracepoints.
The contended status covers the following reasons:
- lock contention or need_resched() detected in async compaction
- fatal signal pending
- too many pages isolated in the zone (only for async compaction)
Further distinguishing the exact reason seems unnecessary for now.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Some compaction tracepoints use zone->name to print which zone is being
compacted. This works for in-kernel printing, but not userspace trace
printing of raw captured trace such as via trace-cmd report.
This patch uses zone_idx() instead of zone->name as the raw value, and
when printing, converts the zone_type to string using the appropriate EM()
macros and some ugly tricks to overcome the problem that half the values
depend on CONFIG_ options and one does not simply use #ifdef inside of
#define.
trace-cmd output before:
transhuge-stres-4235 [000] 453.149280: mm_compaction_finished: node=0
zone=ffffffff81815d7a order=9 ret=partial
after:
transhuge-stres-4235 [000] 453.149280: mm_compaction_finished: node=0
zone=Normal order=9 ret=partial
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Valentin Rothberg <valentinrothberg@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Some compaction tracepoints convert the integer return values to strings
using the compaction_status_string array. This works for in-kernel
printing, but not userspace trace printing of raw captured trace such as
via trace-cmd report.
This patch converts the private array to appropriate tracepoint macros
that result in proper userspace support.
trace-cmd output before:
transhuge-stres-4235 [000] 453.149280: mm_compaction_finished: node=0
zone=ffffffff81815d7a order=9 ret=
after:
transhuge-stres-4235 [000] 453.149280: mm_compaction_finished: node=0
zone=ffffffff81815d7a order=9 ret=partial
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|