summaryrefslogtreecommitdiff
path: root/kernel/time
AgeCommit message (Collapse)Author
2020-09-24timers: Mask invalid flags in do_init_timer()Qianli Zhao
do_init_timer() accepts any combination of timer flags handed in by the caller without a sanity check, but only TIMER_DEFFERABLE, TIMER_PINNED and TIMER_IRQSAFE are valid. If the supplied flags have other bits set, this could result in malfunction. If bits are set in TIMER_CPUMASK the first timer usage could deference a cpu base which is outside the range of possible CPUs. If TIMER_MIGRATION is set, then the switch_timer_base() will live lock. Prevent that with a sanity check which warns when invalid flags are supplied and masks them out. [ tglx: Made it WARN_ON_ONCE() and added context to the changelog ] Signed-off-by: Qianli Zhao <zhaoqianli@xiaomi.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/9d79a8aa4eb56713af7379f99f062dedabcde140.1597326756.git.zhaoqianli@xiaomi.com
2020-09-24treewide: Make all debug_obj_descriptors constStephen Boyd
This should make it harder for the kernel to corrupt the debug object descriptor, used to call functions to fixup state and track debug objects, by moving the structure to read-only memory. Signed-off-by: Stephen Boyd <swboyd@chromium.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20200815004027.2046113-3-swboyd@chromium.org
2020-09-10timekeeping: Use seqcount_latch_tAhmed S. Darwish
Latch sequence counters are a multiversion concurrency control mechanism where the seqcount_t counter even/odd value is used to switch between two data storage copies. This allows the seqcount_t read path to safely interrupt its write side critical section (e.g. from NMIs). Initially, latch sequence counters were implemented as a single write function, raw_write_seqcount_latch(), above plain seqcount_t. The read path was expected to use plain seqcount_t raw_read_seqcount(). A specialized read function was later added, raw_read_seqcount_latch(), and became the standardized way for latch read paths. Having unique read and write APIs meant that latch sequence counters are basically a data type of their own -- just inappropriately overloading plain seqcount_t. The seqcount_latch_t data type was thus introduced at seqlock.h. Use that new data type instead of seqcount_raw_spinlock_t. This ensures that only latch-safe APIs are to be used with the sequence counter. Note that the use of seqcount_raw_spinlock_t was not very useful in the first place. Only the "raw_" subset of seqcount_t APIs were used at timekeeping.c. This subset was created for contexts where lockdep cannot be used. seqcount_LOCKTYPE_t's raison d'être -- verifying that the seqcount_t writer serialization lock is held -- cannot thus be done. References: 0c3351d451ae ("seqlock: Use raw_ prefix instead of _no_lockdep") References: 55f3560df975 ("seqlock: Extend seqcount API with associated locks") Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200827114044.11173-6-a.darwish@linutronix.de
2020-09-10time/sched_clock: Use seqcount_latch_tAhmed S. Darwish
Latch sequence counters have unique read and write APIs, and thus seqcount_latch_t was recently introduced at seqlock.h. Use that new data type instead of plain seqcount_t. This adds the necessary type-safety and ensures only latching-safe seqcount APIs are to be used. Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200827114044.11173-5-a.darwish@linutronix.de
2020-09-10time/sched_clock: Use raw_read_seqcount_latch() during suspendAhmed S. Darwish
sched_clock uses seqcount_t latching to switch between two storage places protected by the sequence counter. This allows it to have interruptible, NMI-safe, seqcount_t write side critical sections. Since 7fc26327b756 ("seqlock: Introduce raw_read_seqcount_latch()"), raw_read_seqcount_latch() became the standardized way for seqcount_t latch read paths. Due to the dependent load, it has one read memory barrier less than the currently used raw_read_seqcount() API. Use raw_read_seqcount_latch() for the suspend path. Commit aadd6e5caaac ("time/sched_clock: Use raw_read_seqcount_latch()") missed changing that instance of raw_read_seqcount(). References: 1809bfa44e10 ("timers, sched/clock: Avoid deadlock during read from NMI") Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200715092345.GA231464@debian-buster-darwi.lab.linutronix.de
2020-08-25alarmtimer: Convert comma to semicolonXu Wang
Replace a comma between expression statements by a semicolon. Signed-off-by: Xu Wang <vulab@iscas.ac.cn> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Stephen Boyd <sboyd@kernel.org> Link: https://lore.kernel.org/r/20200818062651.21680-1-vulab@iscas.ac.cn
2020-08-24tick-sched: Clarify "NOHZ: local_softirq_pending" warningPaul E. McKenney
Currently, can_stop_idle_tick() prints "NOHZ: local_softirq_pending HH" (where "HH" is the hexadecimal softirq vector number) when one or more non-RCU softirq handlers are still enabled when checking to stop the scheduler-tick interrupt. This message is not as enlightening as one might hope, so this commit changes it to "NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #HH". Reported-by: Andy Lutomirski <luto@kernel.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-08-23treewide: Use fallthrough pseudo-keywordGustavo A. R. Silva
Replace the existing /* fall through */ comments and its variants with the new pseudo-keyword macro fallthrough[1]. Also, remove unnecessary fall-through markings when it is the case. [1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
2020-08-23timekeeping: Provide multi-timestamp accessor to NMI safe timekeeperThomas Gleixner
printk wants to store various timestamps (MONOTONIC, REALTIME, BOOTTIME) to make correlation of dmesg from several systems easier. Provide an interface to retrieve all three timestamps in one go. There are some caveats: 1) Boot time and late sleep time injection Boot time is a racy access on 32bit systems if the sleep time injection happens late during resume and not in timekeeping_resume(). That could be avoided by expanding struct tk_read_base with boot offset for 32bit and adding more overhead to the update. As this is a hard to observe once per resume event which can be filtered with reasonable effort using the accurate mono/real timestamps, it's probably not worth the trouble. Aside of that it might be possible on 32 and 64 bit to observe the following when the sleep time injection happens late: CPU 0 CPU 1 timekeeping_resume() ktime_get_fast_timestamps() mono, real = __ktime_get_real_fast() inject_sleep_time() update boot offset boot = mono + bootoffset; That means that boot time already has the sleep time adjustment, but real time does not. On the next readout both are in sync again. Preventing this for 64bit is not really feasible without destroying the careful cache layout of the timekeeper because the sequence count and struct tk_read_base would then need two cache lines instead of one. 2) Suspend/resume timestamps Access to the time keeper clock source is disabled accross the innermost steps of suspend/resume. The accessors still work, but the timestamps are frozen until time keeping is resumed which happens very early. For regular suspend/resume there is no observable difference vs. sched clock, but it might affect some of the nasty low level debug printks. OTOH, access to sched clock is not guaranteed accross suspend/resume on all systems either so it depends on the hardware in use. If that turns out to be a real problem then this could be mitigated by using sched clock in a similar way as during early boot. But it's not as trivial as on early boot because it needs some careful protection against the clock monotonic timestamp jumping backwards on resume. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20200814115512.159981360@linutronix.de
2020-08-23timekeeping: Utilize local_clock() for NMI safe timekeeper during early bootThomas Gleixner
During early boot the NMI safe timekeeper returns 0 until the first clocksource becomes available. This prevents it from being used for printk or other facilities which today use sched clock. sched clock can be available way before timekeeping is initialized. The obvious workaround for this is to utilize the early sched clock in the default dummy clock read function until a clocksource becomes available. After switching to the clocksource clock MONOTONIC and BOOTTIME will not jump because the timekeeping_init() bases clock MONOTONIC on sched clock and the offset between clock MONOTONIC and BOOTTIME is zero during boot. Clock REALTIME cannot provide useful timestamps during early boot up to the point where a persistent clock becomes available, which is either in timekeeping_init() or later when the RTC driver which might depend on I2C or other subsystems is initialized. There is a minor difference to sched_clock() vs. suspend/resume. As the timekeeper clock source might not be accessible during suspend, after timekeeping_suspend() timestamps freeze up to the point where timekeeping_resume() is invoked. OTOH this is true for some sched clock implementations as well. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20200814115512.041422402@linutronix.de
2020-08-19time: Use generic ns_common::countKirill Tkhai
Switch over time namespaces to use the newly introduced common lifetime counter. Currently every namespace type has its own lifetime counter which is stored in the specific namespace struct. The lifetime counters are used identically for all namespaces types. Namespaces may of course have additional unrelated counters and these are not altered. This introduces a common lifetime counter into struct ns_common. The ns_common struct encompasses information that all namespaces share. That should include the lifetime counter since its common for all of them. It also allows us to unify the type of the counters across all namespaces. Most of them use refcount_t but one uses atomic_t and at least one uses kref. Especially the last one doesn't make much sense since it's just a wrapper around refcount_t since 2016 and actually complicates cleanup operations by having to use container_of() to cast the correct namespace struct out of struct ns_common. Having the lifetime counter for the namespaces in one place reduces maintenance cost. Not just because after switching all namespaces over we will have removed more code than we added but also because the logic is more easily understandable and we indicate to the user that the basic lifetime requirements for all namespaces are currently identical. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Reviewed-by: Kees Cook <keescook@chromium.org> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Link: https://lore.kernel.org/r/159644982033.604812.9406853013011123238.stgit@localhost.localdomain Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2020-08-14Merge tag 'timers-urgent-2020-08-14' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timekeeping updates from Thomas Gleixner: "A set of timekeeping/VDSO updates: - Preparatory work to allow S390 to switch over to the generic VDSO implementation. S390 requires that the VDSO data pointer is handed in to the counter read function when time namespace support is enabled. Adding the pointer is a NOOP for all other architectures because the compiler is supposed to optimize that out when it is unused in the architecture specific inline. The change also solved a similar problem for MIPS which fortunately has time namespaces not yet enabled. S390 needs to update clock related VDSO data independent of the timekeeping updates. This was solved so far with yet another sequence counter in the S390 implementation. A better solution is to utilize the already existing VDSO sequence count for this. The core code now exposes helper functions which allow to serialize against the timekeeper code and against concurrent readers. S390 needs extra data for their clock readout function. The initial common VDSO data structure did not provide a way to add that. It now has an embedded architecture specific struct embedded which defaults to an empty struct. Doing this now avoids tree dependencies and conflicts post rc1 and allows all other architectures which work on generic VDSO support to work from a common upstream base. - A trivial comment fix" * tag 'timers-urgent-2020-08-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: time: Delete repeated words in comments lib/vdso: Allow to add architecture-specific vdso data timekeeping/vsyscall: Provide vdso_update_begin/end() vdso/treewide: Add vdso_data pointer argument to __arch_get_hw_counter()
2020-08-14Merge tag 'timers-core-2020-08-14' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull more timer updates from Thomas Gleixner: "A set of posix CPU timer changes which allows to defer the heavy work of posix CPU timers into task work context. The tick interrupt is reduced to a quick check which queues the work which is doing the heavy lifting before returning to user space or going back to guest mode. Moving this out is deferring the signal delivery slightly but posix CPU timers are inaccurate by nature as they depend on the tick so there is no real damage. The relevant test cases all passed. This lifts the last offender for RT out of the hard interrupt context tick handler, but it also has the general benefit that the actual heavy work is accounted to the task/process and not to the tick interrupt itself. Further optimizations are possible to break long sighand lock hold and interrupt disabled (on !RT kernels) times when a massive amount of posix CPU timers (which are unpriviledged) is armed for a task/process. This is currently only enabled for x86 because the architecture has to ensure that task work is handled in KVM before entering a guest, which was just established for x86 with the new common entry/exit code which got merged post 5.8 and is not the case for other KVM architectures" * tag 'timers-core-2020-08-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86: Select POSIX_CPU_TIMERS_TASK_WORK posix-cpu-timers: Provide mechanisms to defer timer handling to task_work posix-cpu-timers: Split run_posix_cpu_timers()
2020-08-10Merge tag 'locking-urgent-2020-08-10' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking updates from Thomas Gleixner: "A set of locking fixes and updates: - Untangle the header spaghetti which causes build failures in various situations caused by the lockdep additions to seqcount to validate that the write side critical sections are non-preemptible. - The seqcount associated lock debug addons which were blocked by the above fallout. seqcount writers contrary to seqlock writers must be externally serialized, which usually happens via locking - except for strict per CPU seqcounts. As the lock is not part of the seqcount, lockdep cannot validate that the lock is held. This new debug mechanism adds the concept of associated locks. sequence count has now lock type variants and corresponding initializers which take a pointer to the associated lock used for writer serialization. If lockdep is enabled the pointer is stored and write_seqcount_begin() has a lockdep assertion to validate that the lock is held. Aside of the type and the initializer no other code changes are required at the seqcount usage sites. The rest of the seqcount API is unchanged and determines the type at compile time with the help of _Generic which is possible now that the minimal GCC version has been moved up. Adding this lockdep coverage unearthed a handful of seqcount bugs which have been addressed already independent of this. While generally useful this comes with a Trojan Horse twist: On RT kernels the write side critical section can become preemtible if the writers are serialized by an associated lock, which leads to the well known reader preempts writer livelock. RT prevents this by storing the associated lock pointer independent of lockdep in the seqcount and changing the reader side to block on the lock when a reader detects that a writer is in the write side critical section. - Conversion of seqcount usage sites to associated types and initializers" * tag 'locking-urgent-2020-08-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (25 commits) locking/seqlock, headers: Untangle the spaghetti monster locking, arch/ia64: Reduce <asm/smp.h> header dependencies by moving XTP bits into the new <asm/xtp.h> header x86/headers: Remove APIC headers from <asm/smp.h> seqcount: More consistent seqprop names seqcount: Compress SEQCNT_LOCKNAME_ZERO() seqlock: Fold seqcount_LOCKNAME_init() definition seqlock: Fold seqcount_LOCKNAME_t definition seqlock: s/__SEQ_LOCKDEP/__SEQ_LOCK/g hrtimer: Use sequence counter with associated raw spinlock kvm/eventfd: Use sequence counter with associated spinlock userfaultfd: Use sequence counter with associated spinlock NFSv4: Use sequence counter with associated spinlock iocost: Use sequence counter with associated spinlock raid5: Use sequence counter with associated spinlock vfs: Use sequence counter with associated spinlock timekeeping: Use sequence counter with associated raw spinlock xfrm: policy: Use sequence counters with associated lock netfilter: nft_set_rbtree: Use sequence counter with associated rwlock netfilter: conntrack: Use sequence counter with associated spinlock sched: tasks: Use sequence counter with associated spinlock ...
2020-08-10time: Delete repeated words in commentsRandy Dunlap
Drop repeated words in kernel/time/. {when, one, into} Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: John Stultz <john.stultz@linaro.org> Link: https://lore.kernel.org/r/20200807033248.8452-1-rdunlap@infradead.org
2020-08-06posix-cpu-timers: Provide mechanisms to defer timer handling to task_workThomas Gleixner
Running posix CPU timers in hard interrupt context has a few downsides: - For PREEMPT_RT it cannot work as the expiry code needs to take sighand lock, which is a 'sleeping spinlock' in RT. The original RT approach of offloading the posix CPU timer handling into a high priority thread was clumsy and provided no real benefit in general. - For fine grained accounting it's just wrong to run this in context of the timer interrupt because that way a process specific CPU time is accounted to the timer interrupt. - Long running timer interrupts caused by a large amount of expiring timers which can be created and armed by unpriviledged user space. There is no hard requirement to expire them in interrupt context. If the signal is targeted at the task itself then it won't be delivered before the task returns to user space anyway. If the signal is targeted at a supervisor process then it might be slightly delayed, but posix CPU timers are inaccurate anyway due to the fact that they are tied to the tick. Provide infrastructure to schedule task work which allows splitting the posix CPU timer code into a quick check in interrupt context and a thread context expiry and signal delivery function. This has to be enabled by architectures as it requires that the architecture specific KVM implementation handles pending task work before exiting to guest mode. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20200730102337.783470146@linutronix.de
2020-08-06posix-cpu-timers: Split run_posix_cpu_timers()Thomas Gleixner
Split it up as a preparatory step to move the heavy lifting out of interrupt context. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20200730102337.677439437@linutronix.de
2020-08-06timekeeping/vsyscall: Provide vdso_update_begin/end()Thomas Gleixner
Architectures can have the requirement to add additional architecture specific data to the VDSO data page which needs to be updated independent of the timekeeper updates. To protect these updates vs. concurrent readers and a conflicting update through timekeeping, provide helper functions to make such updates safe. vdso_update_begin() takes the timekeeper_lock to protect against a potential update from timekeeper code and increments the VDSO sequence count to signal data inconsistency to concurrent readers. vdso_update_end() makes the sequence count even again to signal data consistency and drops the timekeeper lock. [ Sven: Add interrupt disable handling to the functions ] Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sven Schnelle <svens@linux.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20200804150124.41692-3-svens@linux.ibm.com
2020-08-04Merge tag 'timers-core-2020-08-04' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer updates from Thomas Gleixner: "Time, timers and related driver updates: - Prevent unnecessary timer softirq invocations by extending the tracking of the next expiring timer in the timer wheel beyond the existing NOHZ functionality. The tracking overhead at enqueue time is within the noise, but on sensitive workloads the avoidance of the soft interrupt invocation is a measurable improvement. - The obligatory new clocksource driver for Ingenic X100 OST - The usual fixes, improvements, cleanups and extensions for newer chip variants all over the driver space" * tag 'timers-core-2020-08-04' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (28 commits) timers: Recalculate next timer interrupt only when necessary clocksource/drivers/ingenic: Add support for the Ingenic X1000 OST. dt-bindings: timer: Add Ingenic X1000 OST bindings. clocksource/drivers: Replace HTTP links with HTTPS ones clocksource/drivers/nomadik-mtu: Handle 32kHz clock clocksource/drivers/sh_cmt: Use "kHz" for kilohertz clocksource/drivers/imx: Add support for i.MX TPM driver with ARM64 clocksource/drivers/ingenic: Add high resolution timer support for SMP/SMT. timers: Lower base clock forwarding threshold timers: Remove must_forward_clk timers: Spare timer softirq until next expiry timers: Expand clk forward logic beyond nohz timers: Reuse next expiry cache after nohz exit timers: Always keep track of next expiry timers: Optimize _next_timer_interrupt() level iteration timers: Add comments about calc_index() ceiling work timers: Move trigger_dyntick_cpu() to enqueue_timer() timers: Use only bucket expiry for base->next_expiry value timers: Preserve higher bits of expiration on index calculation clocksource/drivers/timer-atmel-tcb: Add sama5d2 support ...
2020-08-04Merge tag 'threads-v5.9' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux Pull thread updates from Christian Brauner: "This contains the changes to add the missing support for attaching to time namespaces via pidfds. Last cycle setns() was changed to support attaching to multiple namespaces atomically. This requires all namespaces to have a point of no return where they can't fail anymore. Specifically, <namespace-type>_install() is allowed to perform permission checks and install the namespace into the new struct nsset that it has been given but it is not allowed to make visible changes to the affected task. Once <namespace-type>_install() returns, anything that the given namespace type additionally requires to be setup needs to ideally be done in a function that can't fail or if it fails the failure must be non-fatal. For time namespaces the relevant functions that fell into this category were timens_set_vvar_page() and vdso_join_timens(). The latter could still fail although it didn't need to. This function is only implemented for vdso_join_timens() in current mainline. As discussed on-list (cf. [1]), in order to make setns() support time namespaces when attaching to multiple namespaces at once properly we changed vdso_join_timens() to always succeed. So vdso_join_timens() replaces the mmap_write_lock_killable() with mmap_read_lock(). Please note that arm is about to grow vdso support for time namespaces (possibly this merge window). We've synced on this change and arm64 also uses mmap_read_lock(), i.e. makes vdso_join_timens() a function that can't fail. Once the changes here and the arm64 changes have landed, vdso_join_timens() should be turned into a void function so it's obvious to callers and implementers on other architectures that the expectation is that it can't fail. We didn't do this right away because it would've introduced unnecessary merge conflicts between the two trees for no major gain. As always, tests included" [1]: https://lore.kernel.org/lkml/20200611110221.pgd3r5qkjrjmfqa2@wittgenstein * tag 'threads-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux: tests: add CLONE_NEWTIME setns tests nsproxy: support CLONE_NEWTIME with setns() timens: add timens_commit() helper timens: make vdso_join_timens() always succeed
2020-08-03Merge tag 'sched-core-2020-08-03' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: - Improve uclamp performance by using a static key for the fast path - Add the "sched_util_clamp_min_rt_default" sysctl, to optimize for better power efficiency of RT tasks on battery powered devices. (The default is to maximize performance & reduce RT latencies.) - Improve utime and stime tracking accuracy, which had a fixed boundary of error, which created larger and larger relative errors as the values become larger. This is now replaced with more precise arithmetics, using the new mul_u64_u64_div_u64() helper in math64.h. - Improve the deadline scheduler, such as making it capacity aware - Improve frequency-invariant scheduling - Misc cleanups in energy/power aware scheduling - Add sched_update_nr_running tracepoint to track changes to nr_running - Documentation additions and updates - Misc cleanups and smaller fixes * tag 'sched-core-2020-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (54 commits) sched/doc: Factorize bits between sched-energy.rst & sched-capacity.rst sched/doc: Document capacity aware scheduling sched: Document arch_scale_*_capacity() arm, arm64: Fix selection of CONFIG_SCHED_THERMAL_PRESSURE Documentation/sysctl: Document uclamp sysctl knobs sched/uclamp: Add a new sysctl to control RT default boost value sched/uclamp: Fix a deadlock when enabling uclamp static key sched: Remove duplicated tick_nohz_full_enabled() check sched: Fix a typo in a comment sched/uclamp: Remove unnecessary mutex_init() arm, arm64: Select CONFIG_SCHED_THERMAL_PRESSURE sched: Cleanup SCHED_THERMAL_PRESSURE kconfig entry arch_topology, sched/core: Cleanup thermal pressure definition trace/events/sched.h: fix duplicated word linux/sched/mm.h: drop duplicated words in comments smp: Fix a potential usage of stale nr_cpus sched/fair: update_pick_idlest() Select group with lowest group_util when idle_cpus are equal sched: nohz: stop passing around unused "ticks" parameter. sched: Better document ttwu() sched: Add a tracepoint to track rq->nr_running ...
2020-08-03Merge tag 'core-rcu-2020-08-03' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull RCU updates from Ingo Molnar: - kfree_rcu updates - RCU tasks updates - Read-side scalability tests - SRCU updates - Torture-test updates - Documentation updates - Miscellaneous fixes * tag 'core-rcu-2020-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (109 commits) torture: Remove obsolete "cd $KVM" torture: Avoid duplicate specification of qemu command torture: Dump ftrace at shutdown only if requested torture: Add kvm-tranform.sh script for qemu-cmd files torture: Add more tracing crib notes to kvm.sh torture: Improve diagnostic for KCSAN-incapable compilers torture: Correctly summarize build-only runs torture: Pass --kmake-arg to all make invocations rcutorture: Check for unwatched readers torture: Abstract out console-log error detection torture: Add a stop-run capability torture: Create qemu-cmd in --buildonly runs rcu/rcutorture: Replace 0 with false torture: Add --allcpus argument to the kvm.sh script torture: Remove whitespace from identify_qemu_vcpus output rcutorture: NULL rcu_torture_current earlier in cleanup code rcutorture: Handle non-statistic bang-string error messages torture: Set configfile variable to current scenario rcutorture: Add races with task-exit processing locktorture: Use true and false to assign to bool variables ...
2020-08-03Merge tag 'arm64-upstream' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux Pull arm64 and cross-arch updates from Catalin Marinas: "Here's a slightly wider-spread set of updates for 5.9. Going outside the usual arch/arm64/ area is the removal of read_barrier_depends() series from Will and the MSI/IOMMU ID translation series from Lorenzo. The notable arm64 updates include ARMv8.4 TLBI range operations and translation level hint, time namespace support, and perf. Summary: - Removal of the tremendously unpopular read_barrier_depends() barrier, which is a NOP on all architectures apart from Alpha, in favour of allowing architectures to override READ_ONCE() and do whatever dance they need to do to ensure address dependencies provide LOAD -> LOAD/STORE ordering. This work also offers a potential solution if compilers are shown to convert LOAD -> LOAD address dependencies into control dependencies (e.g. under LTO), as weakly ordered architectures will effectively be able to upgrade READ_ONCE() to smp_load_acquire(). The latter case is not used yet, but will be discussed further at LPC. - Make the MSI/IOMMU input/output ID translation PCI agnostic, augment the MSI/IOMMU ACPI/OF ID mapping APIs to accept an input ID bus-specific parameter and apply the resulting changes to the device ID space provided by the Freescale FSL bus. - arm64 support for TLBI range operations and translation table level hints (part of the ARMv8.4 architecture version). - Time namespace support for arm64. - Export the virtual and physical address sizes in vmcoreinfo for makedumpfile and crash utilities. - CPU feature handling cleanups and checks for programmer errors (overlapping bit-fields). - ACPI updates for arm64: disallow AML accesses to EFI code regions and kernel memory. - perf updates for arm64. - Miscellaneous fixes and cleanups, most notably PLT counting optimisation for module loading, recordmcount fix to ignore relocations other than R_AARCH64_CALL26, CMA areas reserved for gigantic pages on 16K and 64K configurations. - Trivial typos, duplicate words" Link: http://lkml.kernel.org/r/20200710165203.31284-1-will@kernel.org Link: http://lkml.kernel.org/r/20200619082013.13661-1-lorenzo.pieralisi@arm.com * tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (82 commits) arm64: use IRQ_STACK_SIZE instead of THREAD_SIZE for irq stack arm64/mm: save memory access in check_and_switch_context() fast switch path arm64: sigcontext.h: delete duplicated word arm64: ptrace.h: delete duplicated word arm64: pgtable-hwdef.h: delete duplicated words bus: fsl-mc: Add ACPI support for fsl-mc bus/fsl-mc: Refactor the MSI domain creation in the DPRC driver of/irq: Make of_msi_map_rid() PCI bus agnostic of/irq: make of_msi_map_get_device_domain() bus agnostic dt-bindings: arm: fsl: Add msi-map device-tree binding for fsl-mc bus of/device: Add input id to of_dma_configure() of/iommu: Make of_map_rid() PCI agnostic ACPI/IORT: Add an input ID to acpi_dma_configure() ACPI/IORT: Remove useless PCI bus walk ACPI/IORT: Make iort_msi_map_rid() PCI agnostic ACPI/IORT: Make iort_get_device_domain IRQ domain agnostic ACPI/IORT: Make iort_match_node_callback walk the ACPI namespace for NC arm64: enable time namespace support arm64/vdso: Restrict splitting VVAR VMA arm64/vdso: Handle faults on timens page ...
2020-07-31Merge branch 'for-mingo' of ↵Ingo Molnar
git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu Pull the v5.9 RCU bits from Paul E. McKenney: - Documentation updates - Miscellaneous fixes - kfree_rcu updates - RCU tasks updates - Read-side scalability tests - SRCU updates - Torture-test updates Signed-off-by: Ingo Molnar <mingo@kernel.org>
2020-07-29random32: update the net random state on interrupt and activityWilly Tarreau
This modifies the first 32 bits out of the 128 bits of a random CPU's net_rand_state on interrupt or CPU activity to complicate remote observations that could lead to guessing the network RNG's internal state. Note that depending on some network devices' interrupt rate moderation or binding, this re-seeding might happen on every packet or even almost never. In addition, with NOHZ some CPUs might not even get timer interrupts, leaving their local state rarely updated, while they are running networked processes making use of the random state. For this reason, we also perform this update in update_process_times() in order to at least update the state when there is user or system activity, since it's the only case we care about. Reported-by: Amit Klein <aksecurity@gmail.com> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Eric Dumazet <edumazet@google.com> Cc: "Jason A. Donenfeld" <Jason@zx2c4.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: <stable@vger.kernel.org> Signed-off-by: Willy Tarreau <w@1wt.eu> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-07-29hrtimer: Use sequence counter with associated raw spinlockAhmed S. Darwish
A sequence counter write side critical section must be protected by some form of locking to serialize writers. A plain seqcount_t does not contain the information of which lock must be held when entering a write side critical section. Use the new seqcount_raw_spinlock_t data type, which allows to associate a raw spinlock with the sequence counter. This enables lockdep to verify that the raw spinlock used for writer serialization is held when the write side critical section is entered. If lockdep is disabled this lock association is compiled out and has neither storage size nor runtime overhead. Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200720155530.1173732-25-a.darwish@linutronix.de
2020-07-29timekeeping: Use sequence counter with associated raw spinlockAhmed S. Darwish
A sequence counter write side critical section must be protected by some form of locking to serialize writers. A plain seqcount_t does not contain the information of which lock must be held when entering a write side critical section. Use the new seqcount_raw_spinlock_t data type, which allows to associate a raw spinlock with the sequence counter. This enables lockdep to verify that the raw spinlock used for writer serialization is held when the write side critical section is entered. If lockdep is disabled this lock association is compiled out and has neither storage size nor runtime overhead. Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200720155530.1173732-18-a.darwish@linutronix.de
2020-07-24timers: Recalculate next timer interrupt only when necessaryFrederic Weisbecker
The nohz tick code recalculates the timer wheel's next expiry on each idle loop iteration. On the other hand, the base next expiry is now always cached and updated upon timer enqueue and execution. Only timer dequeue may leave base->next_expiry out of date (but then its stale value won't ever go past the actual next expiry to be recalculated). Since recalculating the next_expiry isn't a free operation, especially when the last wheel level is reached to find out that no timer has been enqueued at all, reuse the next expiry cache when it is known to be reliable, which it is most of the time. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20200723151641.12236-1-frederic@kernel.org
2020-07-22sched: nohz: stop passing around unused "ticks" parameter.Paul Gortmaker
The "ticks" parameter was added in commit 0f004f5a696a ("sched: Cure more NO_HZ load average woes") since calc_global_nohz() was called and needed the "ticks" argument. But in commit c308b56b5398 ("sched: Fix nohz load accounting -- again!") it became unused as the function calc_global_nohz() dropped using "ticks". Fixes: c308b56b5398 ("sched: Fix nohz load accounting -- again!") Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/1593628458-32290-1-git-send-email-paul.gortmaker@windriver.com
2020-07-20time/sched_clock: Use raw_read_seqcount_latch()Ahmed S. Darwish
sched_clock uses seqcount_t latching to switch between two storage places protected by the sequence counter. This allows it to have interruptible, NMI-safe, seqcount_t write side critical sections. Since 7fc26327b756 ("seqlock: Introduce raw_read_seqcount_latch()"), raw_read_seqcount_latch() became the standardized way for seqcount_t latch read paths. Due to the dependent load, it also has one read memory barrier less than the currently used raw_read_seqcount() API. Use raw_read_seqcount_latch() for the seqcount_t latch read path. Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de> Signed-off-by: Leo Yan <leo.yan@linaro.org> Link: https://lkml.kernel.org/r/20200625085745.GD117543@hirez.programming.kicks-ass.net Link: https://lkml.kernel.org/r/20200715092345.GA231464@debian-buster-darwi.lab.linutronix.de Link: https://lore.kernel.org/r/20200716051130.4359-3-leo.yan@linaro.org References: 1809bfa44e10 ("timers, sched/clock: Avoid deadlock during read from NMI") Signed-off-by: Will Deacon <will@kernel.org>
2020-07-20sched_clock: Expose struct clock_read_dataPeter Zijlstra
In order to support perf_event_mmap_page::cap_time features, an architecture needs, aside from a userspace readable counter register, to expose the exact clock data so that userspace can convert the counter register into a correct timestamp. Provide struct clock_read_data and two (seqcount) helpers so that architectures (arm64 in specific) can expose the numbers to userspace. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Leo Yan <leo.yan@linaro.org> Link: https://lore.kernel.org/r/20200716051130.4359-2-leo.yan@linaro.org Signed-off-by: Will Deacon <will@kernel.org>
2020-07-17timers: Lower base clock forwarding thresholdFrederic Weisbecker
There is nothing that prevents from forwarding the base clock if it's one jiffy off. The reason for this arbitrary limit of two jiffies is historical and does not longer exist. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Juri Lelli <juri.lelli@redhat.com> Link: https://lkml.kernel.org/r/20200717140551.29076-13-frederic@kernel.org
2020-07-17timers: Remove must_forward_clkFrederic Weisbecker
There is no reason to keep this guard around. The code makes sure that base->clk remains sane and won't be forwarded beyond jiffies nor set backward. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Juri Lelli <juri.lelli@redhat.com> Link: https://lkml.kernel.org/r/20200717140551.29076-12-frederic@kernel.org
2020-07-17timers: Spare timer softirq until next expiryFrederic Weisbecker
Now that the core timer infrastructure doesn't depend anymore on periodic base->clk increments, even when the CPU is not in NO_HZ mode, timer softirqs can be skipped until there are timers to expire. Some spurious softirqs can still remain since base->next_expiry doesn't keep track of canceled timers but this still reduces the number of softirqs significantly: ~15 times less for HZ=1000 and ~5 times less for HZ=100. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Juri Lelli <juri.lelli@redhat.com> Link: https://lkml.kernel.org/r/20200717140551.29076-11-frederic@kernel.org
2020-07-17timers: Expand clk forward logic beyond nohzFrederic Weisbecker
As for next_expiry, the base->clk catch up logic will be expanded beyond NOHZ in order to avoid triggering useless softirqs. If softirqs should only fire to expire pending timers, periodic base->clk increments must be skippable for random amounts of time. Therefore prepare to catch-up with missing updates whenever an up-to-date base clock is needed. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Juri Lelli <juri.lelli@redhat.com> Link: https://lkml.kernel.org/r/20200717140551.29076-10-frederic@kernel.org
2020-07-17timers: Reuse next expiry cache after nohz exitFrederic Weisbecker
Now that the next expiry it tracked unconditionally when a timer is added, this information can be reused on a tick firing after exiting nohz. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Juri Lelli <juri.lelli@redhat.com> Link: https://lkml.kernel.org/r/20200717140551.29076-9-frederic@kernel.org
2020-07-17timers: Always keep track of next expiryFrederic Weisbecker
So far next expiry was only tracked while the CPU was in nohz_idle mode in order to cope with missing ticks that can't increment the base->clk periodically anymore. This logic is going to be expanded beyond nohz in order to spare timer softirqs so do it unconditionally. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Juri Lelli <juri.lelli@redhat.com> Link: https://lkml.kernel.org/r/20200717140551.29076-8-frederic@kernel.org
2020-07-17timers: Optimize _next_timer_interrupt() level iterationFrederic Weisbecker
If a level has a timer that expires before reaching the next level, there is no need to iterate further. The next level is reached when the 3 lower bits of the current level are cleared. If the next event happens before/during that, the next levels won't provide an earlier expiration. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Juri Lelli <juri.lelli@redhat.com> Link: https://lkml.kernel.org/r/20200717140551.29076-7-frederic@kernel.org
2020-07-17timers: Add comments about calc_index() ceiling workFrederic Weisbecker
calc_index() adds 1 unit of the level granularity to the expiry passed in parameter to ensure that the timer doesn't expire too early. Add a comment to explain that and the resulting layout in the wheel. Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Juri Lelli <juri.lelli@redhat.com> Link: https://lkml.kernel.org/r/20200717140551.29076-6-frederic@kernel.org
2020-07-17timers: Move trigger_dyntick_cpu() to enqueue_timer()Frederic Weisbecker
Consolidate the code by calling trigger_dyntick_cpu() from enqueue_timer() instead of calling it from all its callers. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Juri Lelli <juri.lelli@redhat.com> Link: https://lkml.kernel.org/r/20200717140551.29076-5-frederic@kernel.org
2020-07-17timers: Use only bucket expiry for base->next_expiry valueAnna-Maria Behnsen
The bucket expiry time is the effective expriy time of timers and is greater than or equal to the requested timer expiry time. This is due to the guarantee that timers never expire early and the reduced expiry granularity in the secondary wheel levels. When a timer is enqueued, trigger_dyntick_cpu() checks whether the timer is the new first timer. This check compares next_expiry with the requested timer expiry value and not with the effective expiry value of the bucket into which the timer was queued. Storing the requested timer expiry value in base->next_expiry can lead to base->clk going backwards if the requested timer expiry value is smaller than base->clk. Commit 30c66fc30ee7 ("timer: Prevent base->clk from moving backward") worked around this by preventing the store when timer->expiry is before base->clk, but did not fix the underlying problem. Use the expiry value of the bucket into which the timer is queued to do the new first timer check. This fixes the base->clk going backward problem. The workaround of commit 30c66fc30ee7 ("timer: Prevent base->clk from moving backward") in trigger_dyntick_cpu() is not longer necessary as the timers bucket expiry is guaranteed to be greater than or equal base->clk. Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20200717140551.29076-4-frederic@kernel.org
2020-07-17timers: Preserve higher bits of expiration on index calculationFrederic Weisbecker
The higher bits of the timer expiration are cropped while calling calc_index() due to the implicit cast from unsigned long to unsigned int. This loss shouldn't have consequences on the current code since all the computation to calculate the index is done on the lower 32 bits. However to prepare for returning the actual bucket expiration from calc_index() in order to properly fix base->next_expiry updates, the higher bits need to be preserved. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20200717140551.29076-3-frederic@kernel.org
2020-07-17timer: Fix wheel index calculation on last levelFrederic Weisbecker
When an expiration delta falls into the last level of the wheel, that delta has be compared against the maximum possible delay and reduced to fit in if necessary. However instead of comparing the delta against the maximum, the code compares the actual expiry against the maximum. Then instead of fixing the delta to fit in, it sets the maximum delta as the expiry value. This can result in various undesired outcomes, the worst possible one being a timer expiring 15 days ahead to fire immediately. Fixes: 500462a9de65 ("timers: Switch to a non-cascading wheel") Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20200717140551.29076-2-frederic@kernel.org
2020-07-09timer: Prevent base->clk from moving backwardFrederic Weisbecker
When a timer is enqueued with a negative delta (ie: expiry is below base->clk), it gets added to the wheel as expiring now (base->clk). Yet the value that gets stored in base->next_expiry, while calling trigger_dyntick_cpu(), is the initial timer->expires value. The resulting state becomes: base->next_expiry < base->clk On the next timer enqueue, forward_timer_base() may accidentally rewind base->clk. As a possible outcome, timers may expire way too early, the worst case being that the highest wheel levels get spuriously processed again. To prevent from that, make sure that base->next_expiry doesn't get below base->clk. Fixes: a683f390b93f ("timers: Forward the wheel clock whenever possible") Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Tested-by: Juri Lelli <juri.lelli@redhat.com> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20200703010657.2302-1-frederic@kernel.org
2020-07-08nsproxy: support CLONE_NEWTIME with setns()Christian Brauner
So far setns() was missing time namespace support. This was partially due to it simply not being implemented but also because vdso_join_timens() could still fail which made switching to multiple namespaces atomically problematic. This is now fixed so support CLONE_NEWTIME with setns() Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Reviewed-by: Andrei Vagin <avagin@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Serge Hallyn <serge@hallyn.com> Cc: Dmitry Safonov <dima@arista.com> Link: https://lore.kernel.org/r/20200706154912.3248030-4-christian.brauner@ubuntu.com
2020-07-08timens: add timens_commit() helperChristian Brauner
Wrap the calls to timens_set_vvar_page() and vdso_join_timens() in timens_on_fork() and timens_install() in a new timens_commit() helper. We'll use this helper in a follow-up patch in nsproxy too. Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Reviewed-by: Andrei Vagin <avagin@gmail.com> Cc: Will Deacon <will@kernel.org> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Dmitry Safonov <dima@arista.com> Cc: linux-arm-kernel@lists.infradead.org Link: https://lore.kernel.org/r/20200706154912.3248030-3-christian.brauner@ubuntu.com
2020-07-08timens: make vdso_join_timens() always succeedChristian Brauner
As discussed on-list (cf. [1]), in order to make setns() support time namespaces when attaching to multiple namespaces at once properly we need to tweak vdso_join_timens() to always succeed. So switch vdso_join_timens() to using a read lock and replacing mmap_write_lock_killable() to mmap_read_lock() as we discussed. Last cycle setns() was changed to support attaching to multiple namespaces atomically. This requires all namespaces to have a point of no return where they can't fail anymore. Specifically, <namespace-type>_install() is allowed to perform permission checks and install the namespace into the new struct nsset that it has been given but it is not allowed to make visible changes to the affected task. Once <namespace-type>_install() returns anything that the given namespace type requires to be setup in addition needs to ideally be done in a function that can't fail or if it fails the failure is not fatal. For time namespaces the relevant functions that fall into this category are timens_set_vvar_page() and vdso_join_timens(). Currently the latter can fail but doesn't need to. With this we can go on to implement a timens_commit() helper in a follow up patch to be used by setns(). [1]: https://lore.kernel.org/lkml/20200611110221.pgd3r5qkjrjmfqa2@wittgenstein Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Reviewed-by: Andrei Vagin <avagin@gmail.com> Cc: Will Deacon <will@kernel.org> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Andy Lutomirski <luto@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Dmitry Safonov <dima@arista.com> Cc: linux-arm-kernel@lists.infradead.org Link: https://lore.kernel.org/r/20200706154912.3248030-2-christian.brauner@ubuntu.com
2020-06-29tick/nohz: Narrow down noise while setting current task's tick dependencyFrederic Weisbecker
Setting a tick dependency on any task, including the case where a task sets that dependency on itself, triggers an IPI to all CPUs. That is of course suboptimal but it had previously not been an issue because it was only used by POSIX CPU timers on nohz_full, which apparently never occurs in latency-sensitive workloads in production. (Or users of such systems are suffering in silence on the one hand or venting their ire on the wrong people on the other.) But RCU now sets a task tick dependency on the current task in order to fix stall issues that can occur during RCU callback processing. Thus, RCU callback processing triggers frequent system-wide IPIs from nohz_full CPUs. This is quite counter-productive, after all, avoiding IPIs is what nohz_full is supposed to be all about. This commit therefore optimizes tasks' self-setting of a task tick dependency by using tick_nohz_full_kick() to avoid the system-wide IPI. Instead, only the execution of the one task is disturbed, which is acceptable given that this disturbance is well down into the noise compared to the degree to which the RCU callback processing itself disturbs execution. Fixes: 6a949b7af82d (rcu: Force on tick when invoking lots of callbacks) Reported-by: Matt Fleming <matt@codeblueprint.co.uk> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Cc: stable@kernel.org Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-06-13Merge tag 'x86-entry-2020-06-12' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 entry updates from Thomas Gleixner: "The x86 entry, exception and interrupt code rework This all started about 6 month ago with the attempt to move the Posix CPU timer heavy lifting out of the timer interrupt code and just have lockless quick checks in that code path. Trivial 5 patches. This unearthed an inconsistency in the KVM handling of task work and the review requested to move all of this into generic code so other architectures can share. Valid request and solved with another 25 patches but those unearthed inconsistencies vs. RCU and instrumentation. Digging into this made it obvious that there are quite some inconsistencies vs. instrumentation in general. The int3 text poke handling in particular was completely unprotected and with the batched update of trace events even more likely to expose to endless int3 recursion. In parallel the RCU implications of instrumenting fragile entry code came up in several discussions. The conclusion of the x86 maintainer team was to go all the way and make the protection against any form of instrumentation of fragile and dangerous code pathes enforcable and verifiable by tooling. A first batch of preparatory work hit mainline with commit d5f744f9a2ac ("Pull x86 entry code updates from Thomas Gleixner") That (almost) full solution introduced a new code section '.noinstr.text' into which all code which needs to be protected from instrumentation of all sorts goes into. Any call into instrumentable code out of this section has to be annotated. objtool has support to validate this. Kprobes now excludes this section fully which also prevents BPF from fiddling with it and all 'noinstr' annotated functions also keep ftrace off. The section, kprobes and objtool changes are already merged. The major changes coming with this are: - Preparatory cleanups - Annotating of relevant functions to move them into the noinstr.text section or enforcing inlining by marking them __always_inline so the compiler cannot misplace or instrument them. - Splitting and simplifying the idtentry macro maze so that it is now clearly separated into simple exception entries and the more interesting ones which use interrupt stacks and have the paranoid handling vs. CR3 and GS. - Move quite some of the low level ASM functionality into C code: - enter_from and exit to user space handling. The ASM code now calls into C after doing the really necessary ASM handling and the return path goes back out without bells and whistels in ASM. - exception entry/exit got the equivivalent treatment - move all IRQ tracepoints from ASM to C so they can be placed as appropriate which is especially important for the int3 recursion issue. - Consolidate the declaration and definition of entry points between 32 and 64 bit. They share a common header and macros now. - Remove the extra device interrupt entry maze and just use the regular exception entry code. - All ASM entry points except NMI are now generated from the shared header file and the corresponding macros in the 32 and 64 bit entry ASM. - The C code entry points are consolidated as well with the help of DEFINE_IDTENTRY*() macros. This allows to ensure at one central point that all corresponding entry points share the same semantics. The actual function body for most entry points is in an instrumentable and sane state. There are special macros for the more sensitive entry points, e.g. INT3 and of course the nasty paranoid #NMI, #MCE, #DB and #DF. They allow to put the whole entry instrumentation and RCU handling into safe places instead of the previous pray that it is correct approach. - The INT3 text poke handling is now completely isolated and the recursion issue banned. Aside of the entry rework this required other isolation work, e.g. the ability to force inline bsearch. - Prevent #DB on fragile entry code, entry relevant memory and disable it on NMI, #MC entry, which allowed to get rid of the nested #DB IST stack shifting hackery. - A few other cleanups and enhancements which have been made possible through this and already merged changes, e.g. consolidating and further restricting the IDT code so the IDT table becomes RO after init which removes yet another popular attack vector - About 680 lines of ASM maze are gone. There are a few open issues: - An escape out of the noinstr section in the MCE handler which needs some more thought but under the aspect that MCE is a complete trainwreck by design and the propability to survive it is low, this was not high on the priority list. - Paravirtualization When PV is enabled then objtool complains about a bunch of indirect calls out of the noinstr section. There are a few straight forward ways to fix this, but the other issues vs. general correctness were more pressing than parawitz. - KVM KVM is inconsistent as well. Patches have been posted, but they have not yet been commented on or picked up by the KVM folks. - IDLE Pretty much the same problems can be found in the low level idle code especially the parts where RCU stopped watching. This was beyond the scope of the more obvious and exposable problems and is on the todo list. The lesson learned from this brain melting exercise to morph the evolved code base into something which can be validated and understood is that once again the violation of the most important engineering principle "correctness first" has caused quite a few people to spend valuable time on problems which could have been avoided in the first place. The "features first" tinkering mindset really has to stop. With that I want to say thanks to everyone involved in contributing to this effort. Special thanks go to the following people (alphabetical order): Alexandre Chartre, Andy Lutomirski, Borislav Petkov, Brian Gerst, Frederic Weisbecker, Josh Poimboeuf, Juergen Gross, Lai Jiangshan, Macro Elver, Paolo Bonzin,i Paul McKenney, Peter Zijlstra, Vitaly Kuznetsov, and Will Deacon" * tag 'x86-entry-2020-06-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (142 commits) x86/entry: Force rcu_irq_enter() when in idle task x86/entry: Make NMI use IDTENTRY_RAW x86/entry: Treat BUG/WARN as NMI-like entries x86/entry: Unbreak __irqentry_text_start/end magic x86/entry: __always_inline CR2 for noinstr lockdep: __always_inline more for noinstr x86/entry: Re-order #DB handler to avoid *SAN instrumentation x86/entry: __always_inline arch_atomic_* for noinstr x86/entry: __always_inline irqflags for noinstr x86/entry: __always_inline debugreg for noinstr x86/idt: Consolidate idt functionality x86/idt: Cleanup trap_init() x86/idt: Use proper constants for table size x86/idt: Add comments about early #PF handling x86/idt: Mark init only functions __init x86/entry: Rename trace_hardirqs_off_prepare() x86/entry: Clarify irq_{enter,exit}_rcu() x86/entry: Remove DBn stacks x86/entry: Remove debug IDT frobbing x86/entry: Optimize local_db_save() for virt ...
2020-06-11Merge tag 'x86-urgent-2020-06-11' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull more x86 updates from Thomas Gleixner: "A set of fixes and updates for x86: - Unbreak paravirt VDSO clocks. While the VDSO code was moved into lib for sharing a subtle check for the validity of paravirt clocks got replaced. While the replacement works perfectly fine for bare metal as the update of the VDSO clock mode is synchronous, it fails for paravirt clocks because the hypervisor can invalidate them asynchronously. Bring it back as an optional function so it does not inflict this on architectures which are free of PV damage. - Fix the jiffies to jiffies64 mapping on 64bit so it does not trigger an ODR violation on newer compilers - Three fixes for the SSBD and *IB* speculation mitigation maze to ensure consistency, not disabling of some *IB* variants wrongly and to prevent a rogue cross process shutdown of SSBD. All marked for stable. - Add yet more CPU models to the splitlock detection capable list !@#%$! - Bring the pr_info() back which tells that TSC deadline timer is enabled. - Reboot quirk for MacBook6,1" * tag 'x86-urgent-2020-06-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/vdso: Unbreak paravirt VDSO clocks lib/vdso: Provide sanity check for cycles (again) clocksource: Remove obsolete ifdef x86_64: Fix jiffies ODR violation x86/speculation: PR_SPEC_FORCE_DISABLE enforcement for indirect branches. x86/speculation: Prevent rogue cross-process SSBD shutdown x86/speculation: Avoid force-disabling IBPB based on STIBP and enhanced IBRS. x86/cpu: Add Sapphire Rapids CPU model number x86/split_lock: Add Icelake microserver and Tigerlake CPU models x86/apic: Make TSC deadline timer detection message visible x86/reboot/quirks: Add MacBook6,1 reboot quirk