Age | Commit message (Collapse) | Author |
|
git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci
Pull PCI changes from Bjorn Helgaas:
"Here are the PCI changes intended for v3.19. I don't think there's
anything very exciting here, but there was a lot of MSI-related stuff
coming via Thomas.
Details:
NUMA
- Allow numa_node override via sysfs (Prarit Bhargava)
Resource management
- Restore detection of read-only BARs (Myron Stowe)
- Shrink decoding-disabled window while sizing BARs (Myron Stowe)
- Add informational printk for invalid BARs (Myron Stowe)
- Remove fixed parameter in pci_iov_resource_bar() (Myron Stowe)
MSI
- Add pci_msi_ignore_mask to prevent writes to MSI/MSI-X Mask Bits (Yijing Wang)
- Revert "PCI: Add x86_msi.msi_mask_irq() and msix_mask_irq()" (Yijing Wang)
- s390/MSI: Use __msi_mask_irq() instead of default_msi_mask_irq() (Yijing Wang)
Virtualization
- xen: Process failure for pcifront_(re)scan_root() (Chen Gang)
- Make FLR and AF FLR reset warning messages different (Gavin Shan)
Generic host bridge driver
- Allocate config space windows after limiting bus number range (Lorenzo Pieralisi)
- Convert to DT resource parsing API (Lorenzo Pieralisi)
Freescale Layerscape
- Add Freescale Layerscape PCIe driver (Minghuan Lian)
NVIDIA Tegra
- Do not build on 64-bit ARM (Thierry Reding)
- Add Kconfig help text (Thierry Reding)
Renesas R-Car
- Make rcar_pci static (Jingoo Han)
Samsung Exynos
- Add exynos prefix to add_pcie_port(), pcie_init() (Jingoo Han)
ST Microelectronics SPEAr13xx
- Add spear prefix to add_pcie_port(), pcie_init() (Jingoo Han)
- Make spear13xx_add_pcie_port() __init (Jingoo Han)
- Remove unnecessary OOM message (Jingoo Han)
TI DRA7xx
- Add dra7xx prefix to add_pcie_port() (Jingoo Han)
- Make dra7xx_add_pcie_port() __init (Jingoo Han)
TI Keystone
- Make ks_dw_pcie_msi_domain_ops static (Jingoo Han)
- Remove unnecessary OOM message (Jingoo Han)
Miscellaneous
- Delete unnecessary NULL pointer checks (Markus Elfring)
- Remove unused to_hotplug_slot() (Gavin Shan)
- Whitespace cleanup (Jingoo Han)
- Simplify if-return sequences (Quentin Lambert)"
* tag 'pci-v3.19-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (28 commits)
PCI: Remove fixed parameter in pci_iov_resource_bar()
PCI: Add informational printk for invalid BARs
PCI: tegra: Add Kconfig help text
PCI: tegra: Do not build on 64-bit ARM
PCI: spear: Remove unnecessary OOM message
PCI: mvebu: Add a blank line after declarations
PCI: designware: Add a blank line after declarations
PCI: exynos: Remove unnecessary return statement
PCI: imx6: Use tabs for indentation
PCI: keystone: Remove unnecessary OOM message
PCI: Remove unused and broken to_hotplug_slot()
PCI: Make FLR and AF FLR reset warning messages different
PCI: dra7xx: Add __init annotation to dra7xx_add_pcie_port()
PCI: spear: Add __init annotation to spear13xx_add_pcie_port()
PCI: spear: Rename add_pcie_port(), pcie_init() to spear13xx_add_pcie_port(), etc.
PCI: dra7xx: Rename add_pcie_port() to dra7xx_add_pcie_port()
PCI: layerscape: Add Freescale Layerscape PCIe driver
PCI: Simplify if-return sequences
PCI: Delete unnecessary NULL pointer checks
PCI: Shrink decoding-disabled window while sizing BARs
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull nmi-safe seq_buf printk update from Steven Rostedt:
"This code is a fork from the trace-3.19 pull as it needed the
trace_seq clean ups from that branch.
This code solves the issue of performing stack dumps from NMI context.
The issue is that printk() is not safe from NMI context as if the NMI
were to trigger when a printk() was being performed, the NMI could
deadlock from the printk() internal locks. This has been seen in
practice.
With lots of review from Petr Mladek, this code went through several
iterations, and we feel that it is now at a point of quality to be
accepted into mainline.
Here's what is contained in this patch set:
- Creates a "seq_buf" generic buffer utility that allows a descriptor
to be passed around where functions can write their own "printk()"
formatted strings into it. The generic version was pulled out of
the trace_seq() code that was made specifically for tracing.
- The seq_buf code was change to model the seq_file code. I have a
patch (not included for 3.19) that converts the seq_file.c code
over to use seq_buf.c like the trace_seq.c code does. This was
done to make sure that seq_buf.c is compatible with seq_file.c. I
may try to get that patch in for 3.20.
- The seq_buf.c file was moved to lib/ to remove it from being
dependent on CONFIG_TRACING.
- The printk() was updated to allow for a per_cpu "override" of the
internal calls. That is, instead of writing to the console, a call
to printk() may do something else. This made it easier to allow
the NMI to change what printk() does in order to call dump_stack()
without needing to update that code as well.
- Finally, the dump_stack from all CPUs via NMI code was converted to
use the seq_buf code. The caller to trigger the NMI code would
wait till all the NMIs finished, and then it would print the
seq_buf data to the console safely from a non NMI context
One added bonus is that this code also makes the NMI dump stack work
on PREEMPT_RT kernels. As printk() includes sleeping locks on
PREEMPT_RT, printk() only writes to console if the console does not
use any rt_mutex converted spin locks. Which a lot do"
* tag 'trace-seq-buf-3.19' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
x86/nmi: Fix use of unallocated cpumask_var_t
printk/percpu: Define printk_func when printk is not defined
x86/nmi: Perform a safe NMI stack trace on all CPUs
printk: Add per_cpu printk func to allow printk to be diverted
seq_buf: Move the seq_buf code to lib/
seq-buf: Make seq_buf_bprintf() conditional on CONFIG_BINARY_PRINTF
tracing: Add seq_buf_get_buf() and seq_buf_commit() helper functions
tracing: Have seq_buf use full buffer
seq_buf: Add seq_buf_can_fit() helper function
tracing: Add paranoid size check in trace_printk_seq()
tracing: Use trace_seq_used() and seq_buf_used() instead of len
tracing: Clean up tracing_fill_pipe_page()
seq_buf: Create seq_buf_used() to find out how much was written
tracing: Add a seq_buf_clear() helper and clear len and readpos in init
tracing: Convert seq_buf fields to be like seq_file fields
tracing: Convert seq_buf_path() to be like seq_path()
tracing: Create seq_buf layer in trace_seq
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
"There was a lot of clean ups and minor fixes. One of those clean ups
was to the trace_seq code. It also removed the return values to the
trace_seq_*() functions and use trace_seq_has_overflowed() to see if
the buffer filled up or not. This is similar to work being done to
the seq_file code as well in another tree.
Some of the other goodies include:
- Added some "!" (NOT) logic to the tracing filter.
- Fixed the frame pointer logic to the x86_64 mcount trampolines
- Added the logic for dynamic trampolines on !CONFIG_PREEMPT systems.
That is, the ftrace trampoline can be dynamically allocated and be
called directly by functions that only have a single hook to them"
* tag 'trace-3.19' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (55 commits)
tracing: Truncated output is better than nothing
tracing: Add additional marks to signal very large time deltas
Documentation: describe trace_buf_size parameter more accurately
tracing: Allow NOT to filter AND and OR clauses
tracing: Add NOT to filtering logic
ftrace/fgraph/x86: Have prepare_ftrace_return() take ip as first parameter
ftrace/x86: Get rid of ftrace_caller_setup
ftrace/x86: Have save_mcount_regs macro also save stack frames if needed
ftrace/x86: Add macro MCOUNT_REG_SIZE for amount of stack used to save mcount regs
ftrace/x86: Simplify save_mcount_regs on getting RIP
ftrace/x86: Have save_mcount_regs store RIP in %rdi for first parameter
ftrace/x86: Rename MCOUNT_SAVE_FRAME and add more detailed comments
ftrace/x86: Move MCOUNT_SAVE_FRAME out of header file
ftrace/x86: Have static tracing also use ftrace_caller_setup
ftrace/x86: Have static function tracing always test for function graph
kprobes: Add IPMODIFY flag to kprobe_ftrace_ops
ftrace, kprobes: Support IPMODIFY flag to find IP modify conflict
kprobes/ftrace: Recover original IP if pre_handler doesn't change it
tracing/trivial: Fix typos and make an int into a bool
tracing: Deletion of an unnecessary check before iput()
...
|
|
Introduce helper macro for_each_cmsghdr as a wrapper of the enumerating
cmsghdr from msghdr, just cleanup.
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Merge first patchbomb from Andrew Morton:
- a few minor cifs fixes
- dma-debug upadtes
- ocfs2
- slab
- about half of MM
- procfs
- kernel/exit.c
- panic.c tweaks
- printk upates
- lib/ updates
- checkpatch updates
- fs/binfmt updates
- the drivers/rtc tree
- nilfs
- kmod fixes
- more kernel/exit.c
- various other misc tweaks and fixes
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (190 commits)
exit: pidns: fix/update the comments in zap_pid_ns_processes()
exit: pidns: alloc_pid() leaks pid_namespace if child_reaper is exiting
exit: exit_notify: re-use "dead" list to autoreap current
exit: reparent: call forget_original_parent() under tasklist_lock
exit: reparent: avoid find_new_reaper() if no children
exit: reparent: introduce find_alive_thread()
exit: reparent: introduce find_child_reaper()
exit: reparent: document the ->has_child_subreaper checks
exit: reparent: s/while_each_thread/for_each_thread/ in find_new_reaper()
exit: reparent: fix the cross-namespace PR_SET_CHILD_SUBREAPER reparenting
exit: reparent: fix the dead-parent PR_SET_CHILD_SUBREAPER reparenting
exit: proc: don't try to flush /proc/tgid/task/tgid
exit: release_task: fix the comment about group leader accounting
exit: wait: drop tasklist_lock before psig->c* accounting
exit: wait: don't use zombie->real_parent
exit: wait: cleanup the ptrace_reparented() checks
usermodehelper: kill the kmod_thread_locker logic
usermodehelper: don't use CLONE_VFORK for ____call_usermodehelper()
fs/hfs/catalog.c: fix comparison bug in hfs_cat_keycmp
nilfs2: fix the nilfs_iget() vs. nilfs_new_inode() races
...
|
|
Use #defines instead of magic values.
Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jason Baron <jbaron@akamai.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Eliminate the unlikely possibility of message interleaving for
early_printk/early_vprintk use.
early_vprintk can be done via the %pV extension so remove this
unnecessary function and change early_printk to have the equivalent
vprintk code.
All uses of early_printk already end with a newline so also remove the
unnecessary newline from the early_printk function.
Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: Chris Metcalf <cmetcalf@tilera.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
There have been several times where I have had to rebuild a kernel to
cause a panic when hitting a WARN() in the code in order to get a crash
dump from a system. Sometimes this is easy to do, other times (such as
in the case of a remote admin) it is not trivial to send new images to
the user.
A much easier method would be a switch to change the WARN() over to a
panic. This makes debugging easier in that I can now test the actual
image the WARN() was seen on and I do not have to engage in remote
debugging.
This patch adds a panic_on_warn kernel parameter and
/proc/sys/kernel/panic_on_warn calls panic() in the
warn_slowpath_common() path. The function will still print out the
location of the warning.
An example of the panic_on_warn output:
The first line below is from the WARN_ON() to output the WARN_ON()'s
location. After that the panic() output is displayed.
WARNING: CPU: 30 PID: 11698 at /home/prarit/dummy_module/dummy-module.c:25 init_dummy+0x1f/0x30 [dummy_module]()
Kernel panic - not syncing: panic_on_warn set ...
CPU: 30 PID: 11698 Comm: insmod Tainted: G W OE 3.17.0+ #57
Hardware name: Intel Corporation S2600CP/S2600CP, BIOS RMLSDP.86I.00.29.D696.1311111329 11/11/2013
0000000000000000 000000008e3f87df ffff88080f093c38 ffffffff81665190
0000000000000000 ffffffff818aea3d ffff88080f093cb8 ffffffff8165e2ec
ffffffff00000008 ffff88080f093cc8 ffff88080f093c68 000000008e3f87df
Call Trace:
[<ffffffff81665190>] dump_stack+0x46/0x58
[<ffffffff8165e2ec>] panic+0xd0/0x204
[<ffffffffa038e05f>] ? init_dummy+0x1f/0x30 [dummy_module]
[<ffffffff81076b90>] warn_slowpath_common+0xd0/0xd0
[<ffffffffa038e040>] ? dummy_greetings+0x40/0x40 [dummy_module]
[<ffffffff81076c8a>] warn_slowpath_null+0x1a/0x20
[<ffffffffa038e05f>] init_dummy+0x1f/0x30 [dummy_module]
[<ffffffff81002144>] do_one_initcall+0xd4/0x210
[<ffffffff811b52c2>] ? __vunmap+0xc2/0x110
[<ffffffff810f8889>] load_module+0x16a9/0x1b30
[<ffffffff810f3d30>] ? store_uevent+0x70/0x70
[<ffffffff810f49b9>] ? copy_module_from_fd.isra.44+0x129/0x180
[<ffffffff810f8ec6>] SyS_finit_module+0xa6/0xd0
[<ffffffff8166cf29>] system_call_fastpath+0x12/0x17
Successfully tested by me.
hpa said: There is another very valid use for this: many operators would
rather a machine shuts down than being potentially compromised either
functionally or security-wise.
Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Acked-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Macro get_unused_fd() is used to allocate a file descriptor with default
flags. Those default flags (0) don't enable close-on-exec.
This can be seen as an unsafe default: in most case close-on-exec should
be enabled to not leak file descriptor across exec().
It would be better to have a "safer" default set of flags, eg. O_CLOEXEC
must be used to enable close-on-exec.
Instead this patch removes get_unused_fd() so that out of tree modules
won't be affect by a runtime behavor change which might introduce other
kind of bugs: it's better to catch the change at build time, making it
easier to fix.
Removing the macro will also promote use of get_unused_fd_flags() (or
anon_inode_getfd()) with flags provided by userspace. Or, if flags cannot
be given by userspace, with flags set to O_CLOEXEC by default.
Signed-off-by: Yann Droneaud <ydroneaud@opteya.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
forget_original_parent()
Now that forget_original_parent() uses ->ptrace_entry for EXIT_DEAD tasks,
we can simply pass "dead_children" list to exit_ptrace() and remove
another release_task() loop. Plus this way we do not need to drop and
reacquire tasklist_lock.
Also shift the list_empty(ptraced) check, if we want this optimization it
makes sense to eliminate the function call altogether.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Aaron Tomlin <atomlin@redhat.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>,
Cc: Sterling Alexander <stalexan@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Roland McGrath <roland@hack.frob.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Now that the external page_cgroup data structure and its lookup is
gone, let the generic bad_page() check for page->mem_cgroup sanity.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: David S. Miller <davem@davemloft.net>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Tejun Heo <tj@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Now that the external page_cgroup data structure and its lookup is gone,
the only code remaining in there is swap slot accounting.
Rename it and move the conditional compilation into mm/Makefile.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Tejun Heo <tj@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Memory cgroups used to have 5 per-page pointers. To allow users to
disable that amount of overhead during runtime, those pointers were
allocated in a separate array, with a translation layer between them and
struct page.
There is now only one page pointer remaining: the memcg pointer, that
indicates which cgroup the page is associated with when charged. The
complexity of runtime allocation and the runtime translation overhead is
no longer justified to save that *potential* 0.19% of memory. With
CONFIG_SLUB, page->mem_cgroup actually sits in the doubleword padding
after the page->private member and doesn't even increase struct page,
and then this patch actually saves space. Remaining users that care can
still compile their kernels without CONFIG_MEMCG.
text data bss dec hex filename
8828345 1725264 983040 11536649 b00909 vmlinux.old
8827425 1725264 966656 11519345 afc571 vmlinux.new
[mhocko@suse.cz: update Documentation/cgroups/memory.txt]
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Konstantin Khlebnikov <koct9i@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Since commit d7365e783edb ("mm: memcontrol: fix missed end-writeback
page accounting") mem_cgroup_end_page_stat consumes locked and flags
variables directly rather than via pointers which might trigger C
undefined behavior as those variables are initialized only in the slow
path of mem_cgroup_begin_page_stat.
Although mem_cgroup_end_page_stat handles parameters correctly and
touches them only when they hold a sensible value it is caller which
loads a potentially uninitialized value which then might allow compiler
to do crazy things.
I haven't seen any warning from gcc and it seems that the current
version (4.9) doesn't exploit this type undefined behavior but Sasha has
reported the following:
UBSan: Undefined behaviour in mm/rmap.c:1084:2
load of value 255 is not a valid value for type '_Bool'
CPU: 4 PID: 8304 Comm: rngd Not tainted 3.18.0-rc2-next-20141029-sasha-00039-g77ed13d-dirty #1427
Call Trace:
dump_stack (lib/dump_stack.c:52)
ubsan_epilogue (lib/ubsan.c:159)
__ubsan_handle_load_invalid_value (lib/ubsan.c:482)
page_remove_rmap (mm/rmap.c:1084 mm/rmap.c:1096)
unmap_page_range (./arch/x86/include/asm/atomic.h:27 include/linux/mm.h:463 mm/memory.c:1146 mm/memory.c:1258 mm/memory.c:1279 mm/memory.c:1303)
unmap_single_vma (mm/memory.c:1348)
unmap_vmas (mm/memory.c:1377 (discriminator 3))
exit_mmap (mm/mmap.c:2837)
mmput (kernel/fork.c:659)
do_exit (./arch/x86/include/asm/thread_info.h:168 kernel/exit.c:462 kernel/exit.c:747)
do_group_exit (include/linux/sched.h:775 kernel/exit.c:873)
SyS_exit_group (kernel/exit.c:901)
tracesys_phase2 (arch/x86/kernel/entry_64.S:529)
Fix this by using pointer parameters for both locked and flags and be
more robust for future compiler changes even though the current code is
implemented correctly.
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
None of the mem_cgroup_same_or_subtree() callers actually require it to
take the RCU lock, either because they hold it themselves or they have css
references. Remove it.
To make the API change clear, rename the leftover helper to
mem_cgroup_is_descendant() to match cgroup_is_descendant().
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The NULL in mm_match_cgroup() comes from a possibly exiting mm->owner. It
makes a lot more sense to check where it's looked up, rather than check
for it in __mem_cgroup_same_or_subtree() where it's unexpected.
No other callsite passes NULL to __mem_cgroup_same_or_subtree().
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Let's use generic slab_start/next/stop for showing memcg caches info. In
contrast to the current implementation, this will work even if all memcg
caches' info doesn't fit into a seq buffer (a page), plus it simply looks
neater.
Actually, the main reason I do this isn't mere cleanup. I'm going to zap
the memcg_slab_caches list, because I find it useless provided we have the
slab_caches list, and this patch is a step in this direction.
It should be noted that before this patch an attempt to read
memory.kmem.slabinfo of a cgroup that doesn't have kmem limit set resulted
in -EIO, while after this patch it will silently show nothing except the
header, but I don't think it will frustrate anyone.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
hstate_sizelog() would shift left an int rather than long, triggering
undefined behaviour and passing an incorrect value when the requested
page size was more than 4GB, thus breaking >4GB pages.
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andrey Ryabinin <a.ryabinin@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
pc->mem_cgroup had to be left intact after uncharge for the final LRU
removal, and !PCG_USED indicated whether the page was uncharged. But
since commit 0a31bc97c80c ("mm: memcontrol: rewrite uncharge API") pages
are uncharged after the final LRU removal. Uncharge can simply clear
the pointer and the PCG_USED/PageCgroupUsed sites can test that instead.
Because this is the last page_cgroup flag, this patch reduces the memcg
per-page overhead to a single pointer.
[akpm@linux-foundation.org: remove unneeded initialization of `memcg', per Michal]
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hugh Dickins <hughd@google.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
PCG_MEM is a remnant from an earlier version of 0a31bc97c80c ("mm:
memcontrol: rewrite uncharge API"), used to tell whether migration cleared
a charge while leaving pc->mem_cgroup valid and PCG_USED set. But in the
final version, mem_cgroup_migrate() directly uncharges the source page,
rendering this distinction unnecessary. Remove it.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hugh Dickins <hughd@google.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Now that mem_cgroup_swapout() fully uncharges the page, every page that is
still in use when reaching mem_cgroup_uncharge() is known to carry both
the memory and the memory+swap charge. Simplify the uncharge path and
remove the PCG_MEMSW page flag accordingly.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hugh Dickins <hughd@google.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Since commit 53853e2d2bfb ("mm, compaction: defer each zone individually
instead of preferred zone"), compaction is deferred for each zone where
sync direct compaction fails, and reset where it succeeds. However, it
was observed that for DMA zone compaction often appeared to succeed
while subsequent allocation attempt would not, due to different outcome
of watermark check.
In order to properly defer compaction in this zone, the candidate zone
has to be passed back to __alloc_pages_direct_compact() and compaction
deferred in the zone after the allocation attempt fails.
The large source of mismatch between watermark check in compaction and
allocation was the lack of alloc_flags and classzone_idx values in
compaction, which has been fixed in the previous patch. So with this
problem fixed, we can simplify the code by removing the candidate_zone
parameter and deferring in __alloc_pages_direct_compact().
After this patch, the compaction activity during stress-highalloc
benchmark is still somewhat increased, but it's negligible compared to the
increase that occurred without the better watermark checking. This
suggests that it is still possible to apparently succeed in compaction but
fail to allocate, possibly due to parallel allocation activity.
[akpm@linux-foundation.org: fix build]
Suggested-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Compaction relies on zone watermark checks for decisions such as if it's
worth to start compacting in compaction_suitable() or whether compaction
should stop in compact_finished(). The watermark checks take
classzone_idx and alloc_flags parameters, which are related to the memory
allocation request. But from the context of compaction they are currently
passed as 0, including the direct compaction which is invoked to satisfy
the allocation request, and could therefore know the proper values.
The lack of proper values can lead to mismatch between decisions taken
during compaction and decisions related to the allocation request. Lack
of proper classzone_idx value means that lowmem_reserve is not taken into
account. This has manifested (during recent changes to deferred
compaction) when DMA zone was used as fallback for preferred Normal zone.
compaction_suitable() without proper classzone_idx would think that the
watermarks are already satisfied, but watermark check in
get_page_from_freelist() would fail. Because of this problem, deferring
compaction has extra complexity that can be removed in the following
patch.
The issue (not confirmed in practice) with missing alloc_flags is opposite
in nature. For allocations that include ALLOC_HIGH, ALLOC_HIGHER or
ALLOC_CMA in alloc_flags (the last includes all MOVABLE allocations on
CMA-enabled systems) the watermark checking in compaction with 0 passed
will be stricter than in get_page_from_freelist(). In these cases
compaction might be running for a longer time than is really needed.
Another issue compaction_suitable() is that the check for "does the zone
need compaction at all?" comes only after the check "does the zone have
enough free free pages to succeed compaction". The latter considers extra
pages for migration and can therefore in some situations fail and return
COMPACT_SKIPPED, although the high-order allocation would succeed and we
should return COMPACT_PARTIAL.
This patch fixes these problems by adding alloc_flags and classzone_idx to
struct compact_control and related functions involved in direct compaction
and watermark checking. Where possible, all other callers of
compaction_suitable() pass proper values where those are known. This is
currently limited to classzone_idx, which is sometimes known in kswapd
context. However, the direct reclaim callers should_continue_reclaim()
and compaction_ready() do not currently know the proper values, so the
coordination between reclaim and compaction may still not be as accurate
as it could. This can be fixed later, if it's shown to be an issue.
Additionaly the checks in compact_suitable() are reordered to address the
second issue described above.
The effect of this patch should be slightly better high-order allocation
success rates and/or less compaction overhead, depending on the type of
allocations and presence of CMA. It allows simplifying deferred
compaction code in a followup patch.
When testing with stress-highalloc, there was some slight improvement
(which might be just due to variance) in success rates of non-THP-like
allocations.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Christoph Lameter <cl@linux.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The functions for draining per-cpu pages back to buddy allocators
currently always operate on all zones. There are however several cases
where the drain is only needed in the context of a single zone, and
spilling other pcplists is a waste of time both due to the extra
spilling and later refilling.
This patch introduces new zone pointer parameter to drain_all_pages()
and changes the dummy parameter of drain_local_pages() to be also a zone
pointer. When NULL is passed, the functions operate on all zones as
usual. Passing a specific zone pointer reduces the work to the single
zone.
All callers are updated to pass the NULL pointer in this patch.
Conversion to single zone (where appropriate) is done in further
patches.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
As charges now pin the css explicitely, there is no more need for kmemcg
to acquire a proxy reference for outstanding pages during offlining, or
maintain state to identify such "dead" groups.
This was the last user of the uncharge functions' return values, so remove
them as well.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Charges currently pin the css indirectly by playing tricks during
css_offline(): user pages stall the offlining process until all of them
have been reparented, whereas kmemcg acquires a keep-alive reference if
outstanding kernel pages are detected at that point.
In preparation for removing all this complexity, make the pinning explicit
and acquire a css references for every charged page.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
All memory accounting and limiting has been switched over to the
lockless page counters. Bye, res_counter!
[akpm@linux-foundation.org: update Documentation/cgroups/memory.txt]
[mhocko@suse.cz: ditch the last remainings of res_counter]
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Tejun Heo <tj@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Paul Bolle <pebolle@tiscali.nl>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Abandon the spinlock-protected byte counters in favor of the unlocked
page counters in the hugetlb controller as well.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Tejun Heo <tj@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Memory is internally accounted in bytes, using spinlock-protected 64-bit
counters, even though the smallest accounting delta is a page. The
counter interface is also convoluted and does too many things.
Introduce a new lockless word-sized page counter API, then change all
memory accounting over to it. The translation from and to bytes then only
happens when interfacing with userspace.
The removed locking overhead is noticable when scaling beyond the per-cpu
charge caches - on a 4-socket machine with 144-threads, the following test
shows the performance differences of 288 memcgs concurrently running a
page fault benchmark:
vanilla:
18631648.500498 task-clock (msec) # 140.643 CPUs utilized ( +- 0.33% )
1,380,638 context-switches # 0.074 K/sec ( +- 0.75% )
24,390 cpu-migrations # 0.001 K/sec ( +- 8.44% )
1,843,305,768 page-faults # 0.099 M/sec ( +- 0.00% )
50,134,994,088,218 cycles # 2.691 GHz ( +- 0.33% )
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
8,049,712,224,651 instructions # 0.16 insns per cycle ( +- 0.04% )
1,586,970,584,979 branches # 85.176 M/sec ( +- 0.05% )
1,724,989,949 branch-misses # 0.11% of all branches ( +- 0.48% )
132.474343877 seconds time elapsed ( +- 0.21% )
lockless:
12195979.037525 task-clock (msec) # 133.480 CPUs utilized ( +- 0.18% )
832,850 context-switches # 0.068 K/sec ( +- 0.54% )
15,624 cpu-migrations # 0.001 K/sec ( +- 10.17% )
1,843,304,774 page-faults # 0.151 M/sec ( +- 0.00% )
32,811,216,801,141 cycles # 2.690 GHz ( +- 0.18% )
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
9,999,265,091,727 instructions # 0.30 insns per cycle ( +- 0.10% )
2,076,759,325,203 branches # 170.282 M/sec ( +- 0.12% )
1,656,917,214 branch-misses # 0.08% of all branches ( +- 0.55% )
91.369330729 seconds time elapsed ( +- 0.45% )
On top of improved scalability, this also gets rid of the icky long long
types in the very heart of memcg, which is great for 32 bit and also makes
the code a lot more readable.
Notable differences between the old and new API:
- res_counter_charge() and res_counter_charge_nofail() become
page_counter_try_charge() and page_counter_charge() resp. to match
the more common kernel naming scheme of try_do()/do()
- res_counter_uncharge_until() is only ever used to cancel a local
counter and never to uncharge bigger segments of a hierarchy, so
it's replaced by the simpler page_counter_cancel()
- res_counter_set_limit() is replaced by page_counter_limit(), which
expects its callers to serialize against themselves
- res_counter_memparse_write_strategy() is replaced by
page_counter_limit(), which rounds down to the nearest page size -
rather than up. This is more reasonable for explicitely requested
hard upper limits.
- to keep charging light-weight, page_counter_try_charge() charges
speculatively, only to roll back if the result exceeds the limit.
Because of this, a failing bigger charge can temporarily lock out
smaller charges that would otherwise succeed. The error is bounded
to the difference between the smallest and the biggest possible
charge size, so for memcg, this means that a failing THP charge can
send base page charges into reclaim upto 2MB (4MB) before the limit
would have been reached. This should be acceptable.
[akpm@linux-foundation.org: add includes for WARN_ON_ONCE and memparse]
[akpm@linux-foundation.org: add includes for WARN_ON_ONCE, memparse, strncmp, and PAGE_SIZE]
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull VFS changes from Al Viro:
"First pile out of several (there _definitely_ will be more). Stuff in
this one:
- unification of d_splice_alias()/d_materialize_unique()
- iov_iter rewrite
- killing a bunch of ->f_path.dentry users (and f_dentry macro).
Getting that completed will make life much simpler for
unionmount/overlayfs, since then we'll be able to limit the places
sensitive to file _dentry_ to reasonably few. Which allows to have
file_inode(file) pointing to inode in a covered layer, with dentry
pointing to (negative) dentry in union one.
Still not complete, but much closer now.
- crapectomy in lustre (dead code removal, mostly)
- "let's make seq_printf return nothing" preparations
- assorted cleanups and fixes
There _definitely_ will be more piles"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
copy_from_iter_nocache()
new helper: iov_iter_kvec()
csum_and_copy_..._iter()
iov_iter.c: handle ITER_KVEC directly
iov_iter.c: convert copy_to_iter() to iterate_and_advance
iov_iter.c: convert copy_from_iter() to iterate_and_advance
iov_iter.c: get rid of bvec_copy_page_{to,from}_iter()
iov_iter.c: convert iov_iter_zero() to iterate_and_advance
iov_iter.c: convert iov_iter_get_pages_alloc() to iterate_all_kinds
iov_iter.c: convert iov_iter_get_pages() to iterate_all_kinds
iov_iter.c: convert iov_iter_npages() to iterate_all_kinds
iov_iter.c: iterate_and_advance
iov_iter.c: macros for iterating over iov_iter
kill f_dentry macro
dcache: fix kmemcheck warning in switch_names
new helper: audit_file()
nfsd_vfs_write(): use file_inode()
ncpfs: use file_inode()
kill f_dentry uses
lockd: get rid of ->f_path.dentry->d_sb
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm
Pull dlm update from David Teigland:
"This set includes one feature, which allows locks that have been
orphaned to be reacquired"
* tag 'dlm-3.19' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm:
dlm: adopt orphan locks
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull quota updates from Jan Kara:
"Quota improvements and some minor cleanups.
The main portion in the pull request are changes which move i_dquot
array from struct inode into fs-private part of an inode which saves
memory for filesystems which don't use VFS quotas"
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
udf: One function call less in udf_fill_super() after error detection
udf: Deletion of unnecessary checks before the function call "iput"
jbd: Deletion of an unnecessary check before the function call "iput"
vfs: Remove i_dquot field from inode
jfs: Convert to private i_dquot field
reiserfs: Convert to private i_dquot field
ocfs2: Convert to private i_dquot field
ext4: Convert to private i_dquot field
ext3: Convert to private i_dquot field
ext2: Convert to private i_dquot field
quota: Use function to provide i_dquot pointers
xfs: Set allowed quota types
gfs2: Set allowed quota types
quota: Allow each filesystem to specify which quota types it supports
quota: Remove const from function declarations
quota: Add log level to printk
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
Pull f2fs updates from Jaegeuk Kim:
"This patch-set includes lots of bug fixes based on clean-ups and
refactored codes. And inline_dir was introduced and two minor mount
options were added. Details from signed tag:
This series includes the following enhancement with refactored flows.
- fix inmemory page operations
- fix wrong inline_data & inline_dir logics
- enhance memory and IO control under memory pressure
- consider preemption on radix_tree operation
- fix memory leaks and deadlocks
But also, there are a couple of new features:
- support inline_dir to store dentries inside inode page
- add -o fastboot to reduce booting time
- implement -o dirsync
And a lot of clean-ups and minor bug fixes as well"
* tag 'for-f2fs-3.19' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (88 commits)
f2fs: avoid to ra unneeded blocks in recover flow
f2fs: introduce is_valid_blkaddr to cleanup codes in ra_meta_pages
f2fs: fix to enable readahead for SSA/CP blocks
f2fs: use atomic for counting inode with inline_{dir,inode} flag
f2fs: cleanup path to need cp at fsync
f2fs: check if inode state is dirty at fsync
f2fs: count the number of inmemory pages
f2fs: release inmemory pages when the file was closed
f2fs: set page private for inmemory pages for truncation
f2fs: count inline_xx in do_read_inode
f2fs: do retry operations with cond_resched
f2fs: call radix_tree_preload before radix_tree_insert
f2fs: use rw_semaphore for nat entry lock
f2fs: fix missing kmem_cache_free
f2fs: more fast lookup for gc_inode list
f2fs: cleanup redundant macro
f2fs: fix to return correct error number in f2fs_write_begin
f2fs: cleanup if-statement of phase in gc_data_segment
f2fs: fix to recover converted inline_data
f2fs: make clean the page before writing
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-nmw
Pull GFS2 update from Steven Whitehouse:
"In contrast to recent merge windows, there are a number of interesting
features this time:
There is a set of patches to improve performance in relation to block
reservations. Some correctness fixes for fallocate, and an update to
the freeze/thaw code which greatly simplyfies this code path. In
addition there is a set of clean ups from Al Viro too"
* tag 'gfs2-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-nmw:
GFS2: gfs2_atomic_open(): simplify the use of finish_no_open()
GFS2: gfs2_dir_get_hash_table(): avoiding deferred vfree() is easy here...
GFS2: use kvfree() instead of open-coding it
GFS2: gfs2_create_inode(): don't bother with d_splice_alias()
GFS2: bugger off early if O_CREAT open finds a directory
GFS2: Deletion of unnecessary checks before two function calls
GFS2: update freeze code to use freeze/thaw_super on all nodes
fs: add freeze_super/thaw_super fs hooks
GFS2: Update timestamps on fallocate
GFS2: Update i_size properly on fallocate
GFS2: Use inode_newsize_ok and get_write_access in fallocate
GFS2: If we use up our block reservation, request more next time
GFS2: Only increase rs_sizehint
GFS2: Set of distributed preferences for rgrps
GFS2: directly return gfs2_dir_check()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux
Pull pstore fixes from Tony Luck:
"On a system that restricts access to dmesg, don't let people side-step
that by reading copies that pstore saved"
* tag 'please-pull-pstore' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux:
syslog: Provide stub check_syslog_permissions
pstore: Honor dmesg_restrict sysctl on dmesg dumps
pstore/ram: Strip ramoops header for correct decompression
|
|
Pull NFS client updates from Trond Myklebust:
"Highlights include:
Features:
- NFSv4.2 client support for hole punching and preallocation.
- Further RPC/RDMA client improvements.
- Add more RPC transport debugging tracepoints.
- Add RPC debugging tools in debugfs.
Bugfixes:
- Stable fix for layoutget error handling
- Fix a change in COMMIT behaviour resulting from the recent io code
updates"
* tag 'nfs-for-3.19-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (28 commits)
sunrpc: add a debugfs rpc_xprt directory with an info file in it
sunrpc: add debugfs file for displaying client rpc_task queue
nfs: Add DEALLOCATE support
nfs: Add ALLOCATE support
NFS: Clean up nfs4_init_callback()
NFS: SETCLIENTID XDR buffer sizes are incorrect
SUNRPC: serialize iostats updates
xprtrdma: Display async errors
xprtrdma: Enable pad optimization
xprtrdma: Re-write rpcrdma_flush_cqs()
xprtrdma: Refactor tasklet scheduling
xprtrdma: unmap all FMRs during transport disconnect
xprtrdma: Cap req_cqinit
xprtrdma: Return an errno from rpcrdma_register_external()
nfs: define nfs_inc_fscache_stats and using it as possible
nfs: replace nfs_add_stats with nfs_inc_stats when add one
NFS: Deletion of unnecessary checks before the function call "nfs_put_client"
sunrpc: eliminate RPC_TRACEPOINTS
sunrpc: eliminate RPC_DEBUG
lockd: eliminate LOCKD_DEBUG
...
|
|
Conflicts:
drivers/net/ethernet/amd/xgbe/xgbe-desc.c
drivers/net/ethernet/renesas/sh_eth.c
Overlapping changes in both conflict cases.
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Making things const is a good thing.
(x86-64 defconfig with all irda)
$ size net/irda/built-in.o*
text data bss dec hex filename
109276 1868 244 111388 1b31c net/irda/built-in.o.new
108828 2316 244 111388 1b31c net/irda/built-in.o.old
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
It's better when function pointer arrays aren't modifiable.
Net change:
$ size net/llc/built-in.o.*
text data bss dec hex filename
61193 12758 1344 75295 1261f net/llc/built-in.o.new
47113 27030 1344 75487 126df net/llc/built-in.o.old
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
It's better when function pointer arrays aren't modifiable.
Net change from original:
$ size net/llc/built-in.o.*
text data bss dec hex filename
61065 12886 1344 75295 1261f net/llc/built-in.o.new
47113 27030 1344 75487 126df net/llc/built-in.o.old
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
It's better when function pointer arrays aren't modifiable.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
As there are now no remaining users of arch_fast_hash(), lets kill
it entirely.
This basically reverts commit 71ae8aac3e19 ("lib: introduce arch
optimized hash library") and follow-up work, that is f.e., commit
237217546d44 ("lib: hash: follow-up fixups for arch hash"),
commit e3fec2f74f7f ("lib: Add missing arch generic-y entries for
asm-generic/hash.h") and last but not least commit 6a02652df511
("perf tools: Fix include for non x86 architectures").
Cc: Francesco Fusco <fusco@ntop.org>
Cc: Thomas Graf <tgraf@suug.ch>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This change pulls the core functionality out of __netdev_alloc_skb and
places them in a new function named __alloc_rx_skb. The reason for doing
this is to make these bits accessible to a new function __napi_alloc_skb.
In addition __alloc_rx_skb now has a new flags value that is used to
determine which page frag pool to allocate from. If the SKB_ALLOC_NAPI
flag is set then the NAPI pool is used. The advantage of this is that we
do not have to use local_irq_save/restore when accessing the NAPI pool from
NAPI context.
In my test setup I saw at least 11ns of savings using the napi_alloc_skb
function versus the netdev_alloc_skb function, most of this being due to
the fact that we didn't have to call local_irq_save/restore.
The main use case for napi_alloc_skb would be for things such as copybreak
or page fragment based receive paths where an skb is allocated after the
data has been received instead of before.
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This patch splits the netdev_alloc_frag function up so that it can be used
on one of two page frag pools instead of being fixed on the
netdev_alloc_cache. By doing this we can add a NAPI specific function
__napi_alloc_frag that accesses a pool that is only used from softirq
context. The advantage to this is that we do not need to call
local_irq_save/restore which can be a significant savings.
I also took the opportunity to refactor the core bits that were placed in
__alloc_page_frag. First I updated the allocation to do either a 32K
allocation or an order 0 page. This is based on the changes in commmit
d9b2938aa where it was found that latencies could be reduced in case of
failures. Then I also rewrote the logic to work from the end of the page to
the start. By doing this the size value doesn't have to be used unless we
have run out of space for page fragments. Finally I cleaned up the atomic
bits so that we just do an atomic_sub_and_test and if that returns true then
we set the page->_count via an atomic_set. This way we can remove the extra
conditional for the atomic_read since it would have led to an atomic_inc in
the case of success anyway.
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
More iov_iter work for the networking from Al Viro.
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull more 2038 timer work from Thomas Gleixner:
"Two more patches for the ongoing 2038 work:
- New accessors to clock MONOTONIC and REALTIME seconds
This is a seperate branch as Arnd has follow up work depending on
this"
* 'timers-2038-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
timekeeping: Provide y2038 safe accessor to the seconds portion of CLOCK_REALTIME
timekeeping: Provide fast accessor to the seconds part of CLOCK_MONOTONIC
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 MPX support from Thomas Gleixner:
"This enables support for x86 MPX.
MPX is a new debug feature for bound checking in user space. It
requires kernel support to handle the bound tables and decode the
bound violating instruction in the trap handler"
* 'x86-mpx-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
asm-generic: Remove asm-generic arch_bprm_mm_init()
mm: Make arch_unmap()/bprm_mm_init() available to all architectures
x86: Cleanly separate use of asm-generic/mm_hooks.h
x86 mpx: Change return type of get_reg_offset()
fs: Do not include mpx.h in exec.c
x86, mpx: Add documentation on Intel MPX
x86, mpx: Cleanup unused bound tables
x86, mpx: On-demand kernel allocation of bounds tables
x86, mpx: Decode MPX instruction to get bound violation information
x86, mpx: Add MPX-specific mmap interface
x86, mpx: Introduce VM_MPX to indicate that a VMA is MPX specific
x86, mpx: Add MPX to disabled features
ia64: Sync struct siginfo with general version
mips: Sync struct siginfo with general version
mpx: Extend siginfo structure to include bound violation information
x86, mpx: Rename cfg_reg_u and status_reg
x86: mpx: Give bndX registers actual names
x86: Remove arbitrary instruction size limit in instruction decoder
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq domain updates from Thomas Gleixner:
"The real interesting irq updates:
- Support for hierarchical irq domains:
For complex interrupt routing scenarios where more than one
interrupt related chip is involved we had no proper representation
in the generic interrupt infrastructure so far. That made people
implement rather ugly constructs in their nested irq chip
implementations. The main offenders are x86 and arm/gic.
To distangle that mess we have now hierarchical irqdomains which
seperate the various interrupt chips and connect them via the
hierarchical domains. That keeps the domain specific details
internal to the particular hierarchy level and removes the
criss/cross referencing of chip internals. The resulting hierarchy
for a complex x86 system will look like this:
vector mapped: 74
msi-0 mapped: 2
dmar-ir-1 mapped: 69
ioapic-1 mapped: 4
ioapic-0 mapped: 20
pci-msi-2 mapped: 45
dmar-ir-0 mapped: 3
ioapic-2 mapped: 1
pci-msi-1 mapped: 2
htirq mapped: 0
Neither ioapic nor pci-msi know about the dmar interrupt remapping
between themself and the vector domain. If interrupt remapping is
disabled ioapic and pci-msi become direct childs of the vector
domain.
In hindsight we should have done that years ago, but in hindsight
we always know better :)
- Support for generic MSI interrupt domain handling
We have more and more non PCI related MSI interrupts, so providing
a generic infrastructure for this is better than having all
affected architectures implementing their own private hacks.
- Support for PCI-MSI interrupt domain handling, based on the generic
MSI support.
This part carries the pci/msi branch from Bjorn Helgaas pci tree to
avoid a massive conflict. The PCI/MSI parts are acked by Bjorn.
I have two more branches on top of this. The full conversion of x86
to hierarchical domains and a partial conversion of arm/gic"
* 'irq-irqdomain-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (41 commits)
genirq: Move irq_chip_write_msi_msg() helper to core
PCI/MSI: Allow an msi_controller to be associated to an irq domain
PCI/MSI: Provide mechanism to alloc/free MSI/MSIX interrupt from irqdomain
PCI/MSI: Enhance core to support hierarchy irqdomain
PCI/MSI: Move cached entry functions to irq core
genirq: Provide default callbacks for msi_domain_ops
genirq: Introduce msi_domain_alloc/free_irqs()
asm-generic: Add msi.h
genirq: Add generic msi irq domain support
genirq: Introduce callback irq_chip.irq_write_msi_msg
genirq: Work around __irq_set_handler vs stacked domains ordering issues
irqdomain: Introduce helper function irq_domain_add_hierarchy()
irqdomain: Implement a method to automatically call parent domains alloc/free
genirq: Introduce helper irq_domain_set_info() to reduce duplicated code
genirq: Split out flow handler typedefs into seperate header file
genirq: Add IRQ_SET_MASK_OK_DONE to support stacked irqchip
genirq: Introduce irq_chip.irq_compose_msi_msg() to support stacked irqchip
genirq: Add more helper functions to support stacked irq_chip
genirq: Introduce helper functions to support stacked irq_chip
irqdomain: Do irq_find_mapping and set_type for hierarchy irqdomain in case OF
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq core updates from Thomas Gleixner:
"This is the first (boring) part of irq updates:
- support for big endian I/O accessors in the generic irq chip
- cleanup of brcmstb/bcm7120 drivers so they can be reused for non
ARM SoCs
- the usual pile of fixes and updates for the various ARM irq chips"
* 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (25 commits)
irqchip: dw-apb-ictl: Add PM support
irqchip: dw-apb-ictl: Enable IRQ_GC_MASK_CACHE_PER_TYPE
irqchip: dw-apb-ictl: Always use use {readl|writel}_relaxed
ARM: orion: convert the irq_reg_{readl,writel} calls to the new API
irqchip: atmel-aic: Add missing entry for rm9200 irq fixups
irqchip: atmel-aic: Rename at91sam9_aic_irq_fixup for naming consistency
irqchip: atmel-aic: Add specific irq fixup function for sam9g45 and sam9rl
irqchip: atmel-aic: Add irq fixups for at91sam926x SoCs
irqchip: atmel-aic: Add irq fixup for RTT block
irqchip: brcmstb-l2: Convert driver to use irq_reg_{readl,writel}
irqchip: bcm7120-l2: Convert driver to use irq_reg_{readl,writel}
irqchip: bcm7120-l2: Decouple driver from brcmstb-l2
irqchip: bcm7120-l2: Extend driver to support 64+ bit controllers
irqchip: bcm7120-l2: Use gc->mask_cache to simplify suspend/resume functions
irqchip: bcm7120-l2: Fix missing nibble in gc->unused mask
irqchip: bcm7120-l2: Make sure all register accesses use base+offset
irqchip: bcm7120-l2, brcmstb-l2: Remove ARM Kconfig dependency
irqchip: bcm7120-l2: Eliminate bad IRQ check
irqchip: brcmstb-l2: Eliminate dependency on ARM code
genirq: Generic chip: Add big endian I/O accessors
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer core updates from Thomas Gleixner:
"The time(r) departement provides:
- more infrastructure work on the year 2038 issue
- a few fixes in the Armada SoC timers
- the usual pile of fixlets and improvements"
* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
clocksource: armada-370-xp: Use the reference clock on A375 SoC
watchdog: orion: Use the reference clock on Armada 375 SoC
clocksource: armada-370-xp: Add missing clock enable
time: Fix sign bug in NTP mult overflow warning
time: Remove timekeeping_inject_sleeptime()
rtc: Update suspend/resume timing to use 64bit time
rtc/lib: Provide y2038 safe rtc_tm_to_time()/rtc_time_to_tm() replacement
time: Fixup comments to reflect usage of timespec64
time: Expose get_monotonic_coarse64() for in-kernel uses
time: Expose getrawmonotonic64 for in-kernel uses
time: Provide y2038 safe mktime() replacement
time: Provide y2038 safe timekeeping_inject_sleeptime() replacement
time: Provide y2038 safe do_settimeofday() replacement
time: Complete NTP adjustment threshold judging conditions
time: Avoid possible NTP adjustment mult overflow.
time: Rename udelay_test.c to test_udelay.c
clocksource: sirf: Remove hard-coded clock rate
|