mm/memcg: automatically penalize tasks with high swap use

Add a memory.swap.high knob, which can be used to protect the system
from SWAP exhaustion. The mechanism used for penalizing is similar to
memory.high penalty (sleep on return to user space).

That is not to say that the knob itself is equivalent to memory.high.
The objective is more to protect the system from potentially buggy tasks
consuming a lot of swap and impacting other tasks, or even bringing the
whole system to stand still with complete SWAP exhaustion. Hopefully
without the need to find per-task hard limits.

Slowing misbehaving tasks down gradually allows user space oom killers
or other protection mechanisms to react. oomd and earlyoom already do
killing based on swap exhaustion, and memory.swap.high protection will
help implement such userspace oom policies more reliably.

We can use one counter for number of pages allocated under pressure to
save struct task space and avoid two separate hierarchy walks on the hot
path. The exact overage is calculated on return to user space, anyway.

Take the new high limit into account when determining if swap is "full".
Borrowing the explanation from Johannes:

  The idea behind "swap full" is that as long as the workload has plenty
  of swap space available and it's not changing its memory contents, it
  makes sense to generously hold on to copies of data in the swap device,
  even after the swapin. A later reclaim cycle can drop the page without
  any IO. Trading disk space for IO.

  But the only two ways to reclaim a swap slot is when they're faulted
  in and the references go away, or by scanning the virtual address space
  like swapoff does - which is very expensive (one could argue it's too
  expensive even for swapoff, it's often more practical to just reboot).

  So at some point in the fill level, we have to start freeing up swap
  slots on fault/swapin. Otherwise we could eventually run out of swap
  slots while they're filled with copies of data that is also in RAM.

  We don't want to OOM a workload because its available swap space is
  filled with redundant cache.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Chris Down <chris@chrisdown.name>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Link: http://lkml.kernel.org/r/20200527195846.102707-5-kuba@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
author: Jakub Kicinski <kuba@kernel.org> 2020-06-01 21:49:52 -0700
committer: Linus Torvalds <torvalds@linux-foundation.org> 2020-06-02 10:59:09 -0700
commit: 4b82ab4f28836646eca12cb37f408568d3cdc5c3 (patch)
tree: bf2cda64795d153f2512914de9f4f7d79b05aed2 /include/linux
parent: d1663a907bd348f912b7f7088e83ca1b6fd3309f (diff)
1 files changed, 1 insertions, 0 deletions
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 95a09a7ec412..bfe9533bb67e 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -45,6 +45,7 @@ enum memcg_memory_event {
 	MEMCG_MAX,
 	MEMCG_OOM,
 	MEMCG_OOM_KILL,
+	MEMCG_SWAP_HIGH,
 	MEMCG_SWAP_MAX,
 	MEMCG_SWAP_FAIL,
 	MEMCG_NR_MEMORY_EVENTS,
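
The hunk above adds MEMCG_SWAP_HIGH ahead of the MEMCG_NR_MEMORY_EVENTS sentinel. A minimal user-space sketch of why that placement matters follows: any array sized by the sentinel grows automatically when an entry is added before it. Only the enum entries visible in the hunk are reproduced (so the numeric values differ from the kernel's), and struct event_counters and record_event() are hypothetical names used for illustration.

```c
/*
 * Only the entries visible in the hunk are listed here; the real enum
 * has earlier entries, so these numeric values are not the kernel's.
 */
enum memcg_memory_event {
	MEMCG_MAX,
	MEMCG_OOM,
	MEMCG_OOM_KILL,
	MEMCG_SWAP_HIGH,	/* the entry added by this patch */
	MEMCG_SWAP_MAX,
	MEMCG_SWAP_FAIL,
	MEMCG_NR_MEMORY_EVENTS,	/* sentinel: number of events, keep last */
};

/* Hypothetical per-group counters, one slot per event. */
struct event_counters {
	unsigned long count[MEMCG_NR_MEMORY_EVENTS];
};

/* Bump the counter for one event, ignoring out-of-range values. */
static void record_event(struct event_counters *c, enum memcg_memory_event ev)
{
	if ((unsigned int)ev < (unsigned int)MEMCG_NR_MEMORY_EVENTS)
		c->count[ev]++;
}
```

Because the array is sized by the sentinel, adding the new event needs no change to the counter storage in this sketch; only code that reports or triggers the event has to know about it.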