summaryrefslogtreecommitdiff
path: root/kernel/time/tick-sched.c
AgeCommit message (Collapse)Author
2019-11-26Merge branch 'core-rcu-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull RCU updates from Ingo Molnar: "The main changes in this cycle were: - Dynamic tick (nohz) updates, perhaps most notably changes to force the tick on when needed due to lengthy in-kernel execution on CPUs on which RCU is waiting. - Linux-kernel memory consistency model updates. - Replace rcu_swap_protected() with rcu_prepace_pointer(). - Torture-test updates. - Documentation updates. - Miscellaneous fixes" * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (51 commits) security/safesetid: Replace rcu_swap_protected() with rcu_replace_pointer() net/sched: Replace rcu_swap_protected() with rcu_replace_pointer() net/netfilter: Replace rcu_swap_protected() with rcu_replace_pointer() net/core: Replace rcu_swap_protected() with rcu_replace_pointer() bpf/cgroup: Replace rcu_swap_protected() with rcu_replace_pointer() fs/afs: Replace rcu_swap_protected() with rcu_replace_pointer() drivers/scsi: Replace rcu_swap_protected() with rcu_replace_pointer() drm/i915: Replace rcu_swap_protected() with rcu_replace_pointer() x86/kvm/pmu: Replace rcu_swap_protected() with rcu_replace_pointer() rcu: Upgrade rcu_swap_protected() to rcu_replace_pointer() rcu: Suppress levelspread uninitialized messages rcu: Fix uninitialized variable in nocb_gp_wait() rcu: Update descriptions for rcu_future_grace_period tracepoint rcu: Update descriptions for rcu_nocb_wake tracepoint rcu: Remove obsolete descriptions for rcu_barrier tracepoint rcu: Ensure that ->rcu_urgent_qs is set before resched IPI workqueue: Convert for_each_wq to use built-in list check rcu: Several rcu_segcblist functions can be static rcu: Remove unused function hlist_bl_del_init_rcu() Documentation: Rename rcu_node_context_switch() to rcu_note_context_switch() ...
2019-10-29sched/vtime: Rename vtime_accounting_cpu_enabled() to ↵Frederic Weisbecker
vtime_accounting_enabled_this_cpu() Standardize the naming on top of the vtime_accounting_enabled_*() base. Also make it clear we are checking the vtime state of the *current* CPU with this function. We'll need to add an API to check that state on remote CPUs as well, so we must disambiguate the naming. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Jacek Anaszewski <jacek.anaszewski@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Pavel Machek <pavel@ucw.cz> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rafael J . Wysocki <rjw@rjwysocki.net> Cc: Rik van Riel <riel@surriel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Viresh Kumar <viresh.kumar@linaro.org> Cc: Wanpeng Li <wanpengli@tencent.com> Cc: Yauheni Kaliuta <yauheni.kaliuta@redhat.com> Link: https://lkml.kernel.org/r/20191016025700.31277-9-frederic@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-05time: Export tick start/stop functions for rcutorturePaul E. McKenney
It turns out that rcutorture needs to ensure that the scheduling-clock interrupt is enabled in CONFIG_NO_HZ_FULL kernels before starting on CPU-bound in-kernel processing. This commit therefore exports tick_nohz_dep_set_task(), tick_nohz_dep_clear_task(), and tick_nohz_full_setup() to GPL kernel modules. Reported-by: kbuild test robot <lkp@intel.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2019-10-05nohz: Add TICK_DEP_BIT_RCUFrederic Weisbecker
If a nohz_full CPU is looping in the kernel, the scheduling-clock tick might nevertheless remain disabled. In !PREEMPT kernels, this can prevent RCU's attempts to enlist the aid of that CPU's executions of cond_resched(), which can in turn result in an arbitrarily delayed grace period and thus an OOM. RCU therefore needs a way to enable a holdout nohz_full CPU's scheduler-clock interrupt. This commit therefore provides a new TICK_DEP_BIT_RCU value which RCU can pass to tick_dep_set_cpu() and friends to force on the scheduler-clock interrupt for a specified CPU or task. In some cases, rcutorture needs to turn on the scheduler-clock tick, so this commit also exports the relevant symbols to GPL-licensed modules. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2019-08-28tick: Mark sched_timer to expire in hard interrupt contextSebastian Andrzej Siewior
sched_timer must be initialized with the _HARD mode suffix to ensure expiry in hard interrupt context on RT. The previous conversion to HARD expiry mode missed on one instance in tick_nohz_switch_to_nohz(). Fix it up. Fixes: 902a9f9c50905 ("tick: Mark tick related hrtimers to expiry in hard interrupt context") Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20190823113845.12125-3-bigeasy@linutronix.de
2019-08-01tick: Mark tick related hrtimers to expiry in hard interrupt contextSebastian Andrzej Siewior
The tick related hrtimers, which drive the scheduler tick and hrtimer based broadcasting are required to expire in hard interrupt context for obvious reasons. Mark them so PREEMPT_RT kernels wont move them to soft interrupt expiry. Make the horribly formatted RCU_NONIDLE bracket maze readable while at it. No functional change, [ tglx: Split out from larger combo patch. Add changelog ] Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20190726185753.459144407@linutronix.de
2019-06-03sched/fair: Remove the rq->cpu_load[] update codeDietmar Eggemann
With LB_BIAS disabled, there is no need to update the rq->cpu_load[idx] any more. Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Rik van Riel <riel@surriel.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Morten Rasmussen <morten.rasmussen@arm.com> Cc: Patrick Bellasi <patrick.bellasi@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Quentin Perret <quentin.perret@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Valentin Schneider <valentin.schneider@arm.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20190527062116.11512-2-dietmar.eggemann@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-05-06Merge tag 'pm-5.2-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management updates from Rafael Wysocki: "These fix the (Intel-specific) Performance and Energy Bias Hint (EPB) handling and expose it to user space via sysfs, fix and clean up several cpufreq drivers, add support for two new chips to the qoriq cpufreq driver, fix, simplify and clean up the cpufreq core and the schedutil governor, add support for "CPU" domains to the generic power domains (genpd) framework and provide low-level PSCI firmware support for that feature, fix the exynos cpuidle driver and fix a couple of issues in the devfreq subsystem and clean it up. Specifics: - Fix the handling of Performance and Energy Bias Hint (EPB) on Intel processors and expose it to user space via sysfs to avoid having to access it through the generic MSR I/F (Rafael Wysocki). - Improve the handling of global turbo changes made by the platform firmware in the intel_pstate driver (Rafael Wysocki). - Convert some slow-path static_cpu_has() callers to boot_cpu_has() in cpufreq (Borislav Petkov). - Fix the frequency calculation loop in the armada-37xx cpufreq driver (Gregory CLEMENT). - Fix possible object reference leaks in multuple cpufreq drivers (Wen Yang). - Fix kerneldoc comment in the centrino cpufreq driver (dongjian). - Clean up the ACPI and maple cpufreq drivers (Viresh Kumar, Mohan Kumar). - Add support for lx2160a and ls1028a to the qoriq cpufreq driver (Vabhav Sharma, Yuantian Tang). - Fix kobject memory leak in the cpufreq core (Viresh Kumar). - Simplify the IOwait boosting in the schedutil cpufreq governor and rework the TSC cpufreq notifier on x86 (Rafael Wysocki). - Clean up the cpufreq core and statistics code (Yue Hu, Kyle Lin). - Improve the cpufreq documentation, add SPDX license tags to some PM documentation files and unify copyright notices in them (Rafael Wysocki). - Add support for "CPU" domains to the generic power domains (genpd) framework and provide low-level PSCI firmware support for that feature (Ulf Hansson). - Rearrange the PSCI firmware support code and add support for SYSTEM_RESET2 to it (Ulf Hansson, Sudeep Holla). - Improve genpd support for devices in multiple power domains (Ulf Hansson). - Unify target residency for the AFTR and coupled AFTR states in the exynos cpuidle driver (Marek Szyprowski). - Introduce new helper routine in the operating performance points (OPP) framework (Andrew-sh.Cheng). - Add support for passing on-die termination (ODT) and auto power down parameters from the kernel to Trusted Firmware-A (TF-A) to the rk3399_dmc devfreq driver (Enric Balletbo i Serra). - Add tracing to devfreq (Lukasz Luba). - Make the exynos-bus devfreq driver suspend all devices on system shutdown (Marek Szyprowski). - Fix a few minor issues in the devfreq subsystem and clean it up somewhat (Enric Balletbo i Serra, MyungJoo Ham, Rob Herring, Saravana Kannan, Yangtao Li). - Improve system wakeup diagnostics (Stephen Boyd). - Rework filesystem sync messages emitted during system suspend and hibernation (Harry Pan)" * tag 'pm-5.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (72 commits) cpufreq: Fix kobject memleak cpufreq: armada-37xx: fix frequency calculation for opp cpufreq: centrino: Fix centrino_setpolicy() kerneldoc comment cpufreq: qoriq: add support for lx2160a x86: tsc: Rework time_cpufreq_notifier() PM / Domains: Allow to attach a CPU via genpd_dev_pm_attach_by_id|name() PM / Domains: Search for the CPU device outside the genpd lock PM / Domains: Drop unused in-parameter to some genpd functions PM / Domains: Use the base device for driver_deferred_probe_check_state() cpufreq: qoriq: Add ls1028a chip support PM / Domains: Enable genpd_dev_pm_attach_by_id|name() for single PM domain PM / Domains: Allow OF lookup for multi PM domain case from ->attach_dev() PM / Domains: Don't kfree() the virtual device in the error path cpufreq: Move ->get callback check outside of __cpufreq_get() PM / Domains: remove unnecessary unlikely() cpufreq: Remove needless bios_limit check in show_bios_limit() drivers/cpufreq/acpi-cpufreq.c: This fixes the following checkpatch warning firmware/psci: add support for SYSTEM_RESET2 PM / devfreq: add tracing for scheduling work trace: events: add devfreq trace event file ...
2019-05-06Merge branch 'timers-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer updates from Ingo Molnar: "This cycle had the following changes: - Timer tracing improvements (Anna-Maria Gleixner) - Continued tasklet reduction work: remove the hrtimer_tasklet (Thomas Gleixner) - Fix CPU hotplug remove race in the tick-broadcast mask handling code (Thomas Gleixner) - Force upper bound for setting CLOCK_REALTIME, to fix ABI inconsistencies with handling values that are close to the maximum supported and the vagueness of when uptime related wraparound might occur. Make the consistent maximum the year 2232 across all relevant ABIs and APIs. (Thomas Gleixner) - various cleanups and smaller fixes" * 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: tick: Fix typos in comments tick/broadcast: Fix warning about undefined tick_broadcast_oneshot_offline() timekeeping: Force upper bound for setting CLOCK_REALTIME timer/trace: Improve timer tracing timer/trace: Replace deprecated vsprintf pointer extension %pf by %ps timer: Move trace point to get proper index tick/sched: Update tick_sched struct documentation tick: Remove outgoing CPU from broadcast masks timekeeping: Consistently use unsigned int for seqcount snapshot softirq: Remove tasklet_hrtimer xfrm: Replace hrtimer tasklet with softirq hrtimer mac80211_hwsim: Replace hrtimer tasklet with softirq hrtimer
2019-05-03nohz_full: Allow the boot CPU to be nohz_fullNicholas Piggin
Allow the boot CPU/CPU0 to be nohz_full. Have the boot CPU take the do_timer duty during boot until a housekeeping CPU can take over. This is supported when CONFIG_PM_SLEEP_SMP is not configured, or when it is configured and the arch allows suspend on non-zero CPUs. nohz_full has been trialed at a large supercomputer site and found to significantly reduce jitter. In order to deploy it in production, they need CPU0 to be nohz_full because their job control system requires the application CPUs to start from 0, and the housekeeping CPUs are placed higher. An equivalent job scheduling that uses CPU0 for housekeeping could be achieved by modifying their system, but it is preferable if nohz_full can support their environment without modification. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linuxppc-dev@lists.ozlabs.org Link: https://lkml.kernel.org/r/20190411033448.20842-6-npiggin@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-04-10cpuidle: Export the next timer expiration for CPUsUlf Hansson
To be able to predict the sleep duration for a CPU entering idle, it is essential to know the expiration time of the next timer. Both the teo and the menu cpuidle governors already use this information for CPU idle state selection. Moving forward, a similar prediction needs to be made for a group of idle CPUs rather than for a single one and the following changes implement a new genpd governor for that purpose. In order to support that feature, add a new function called tick_nohz_get_next_hrtimer() that will return the next hrtimer expiration time of a given CPU to be invoked after deciding whether or not to stop the scheduler tick on that CPU. Make the cpuidle core call tick_nohz_get_next_hrtimer() right before invoking the ->enter() callback provided by the cpuidle driver for the given state and store its return value in the per-CPU struct cpuidle_device, so as to make it available to code outside of cpuidle. Note that at the point when cpuidle calls tick_nohz_get_next_hrtimer(), the governor's ->select() callback has already returned and indicated whether or not the tick should be stopped, so in fact the value returned by tick_nohz_get_next_hrtimer() always is the next hrtimer expiration time for the given CPU, possibly including the tick (if it hasn't been stopped). Co-developed-by: Lina Iyer <lina.iyer@linaro.org> Co-developed-by: Daniel Lezcano <daniel.lezcano@linaro.org> Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org> Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org> [ rjw: Subject & changelog ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-03-23timekeeping: Consistently use unsigned int for seqcount snapshotRasmus Villemoes
The timekeeping code uses a random mix of "unsigned long" and "unsigned int" for the seqcount snapshots (ratio 14:12). Since the seqlock.h API is entirely based on unsigned int, use that throughout. Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: John Stultz <john.stultz@linaro.org> Cc: Stephen Boyd <sboyd@kernel.org> Link: https://lkml.kernel.org/r/20190318195557.20773-1-linux@rasmusvillemoes.dk
2018-11-23hrtimers/tick/clockevents: Remove sloppy license referencesThomas Gleixner
"For licencing details see kernel-base/COPYING" and similar license references have no value over the SPDX identifier. Remove them. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Kees Cook <keescook@chromium.org> Acked-by: Ingo Molnar <mingo@kernel.org> Acked-by: John Stultz <john.stultz@linaro.org> Acked-by: Corey Minyard <cminyard@mvista.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Kate Stewart <kstewart@linuxfoundation.org> Cc: Philippe Ombredanne <pombredanne@nexb.com> Cc: Peter Anvin <hpa@zytor.com> Cc: Russell King <rmk+kernel@armlinux.org.uk> Cc: Richard Cochran <richardcochran@gmail.com> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Nicolas Pitre <nicolas.pitre@linaro.org> Cc: David Riley <davidriley@chromium.org> Cc: Colin Cross <ccross@android.com> Cc: Mark Brown <broonie@kernel.org> Link: https://lkml.kernel.org/r/20181031182252.963632760@linutronix.de
2018-11-23time: Add SPDX license identifiersThomas Gleixner
Update the time(r) core files files with the correct SPDX license identifier based on the license text in the file itself. The SPDX identifier is a legally binding shorthand, which can be used instead of the full boiler plate text. This work is based on a script and data from Philippe Ombredanne, Kate Stewart and myself. The data has been created with two independent license scanners and manual inspection. The following files do not contain any direct license information and have been omitted from the big initial SPDX changes: timeconst.bc: The .bc files were not touched time.c, timer.c, timekeeping.c: Licence was deduced from EXPORT_SYMBOL_GPL As those files do not contain direct license references they fall under the project license, i.e. GPL V2 only. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Kees Cook <keescook@chromium.org> Acked-by: Ingo Molnar <mingo@kernel.org> Acked-by: John Stultz <john.stultz@linaro.org> Acked-by: Corey Minyard <cminyard@mvista.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Kate Stewart <kstewart@linuxfoundation.org> Cc: Philippe Ombredanne <pombredanne@nexb.com> Cc: Russell King <rmk+kernel@armlinux.org.uk> Cc: Richard Cochran <richardcochran@gmail.com> Cc: Nicolas Pitre <nicolas.pitre@linaro.org> Cc: David Riley <davidriley@chromium.org> Cc: Colin Cross <ccross@android.com> Cc: Mark Brown <broonie@kernel.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Link: https://lkml.kernel.org/r/20181031182252.879109557@linutronix.de
2018-11-23time: Remove useless filenames in top level commentsThomas Gleixner
Remove the pointless filenames in the top level comments. They have no value at all and just occupy space. While at it tidy up some of the comments and remove a stale one. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Nicolas Pitre <nico@linaro.org> Acked-by: Kees Cook <keescook@chromium.org> Acked-by: Ingo Molnar <mingo@kernel.org> Acked-by: John Stultz <john.stultz@linaro.org> Acked-by: Corey Minyard <cminyard@mvista.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Kate Stewart <kstewart@linuxfoundation.org> Cc: Philippe Ombredanne <pombredanne@nexb.com> Cc: Peter Anvin <hpa@zytor.com> Cc: Russell King <rmk+kernel@armlinux.org.uk> Cc: Richard Cochran <richardcochran@gmail.com> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: David Riley <davidriley@chromium.org> Cc: Colin Cross <ccross@android.com> Cc: Mark Brown <broonie@kernel.org> Link: https://lkml.kernel.org/r/20181031182252.794898238@linutronix.de
2018-10-10tick/sched : Remove redundant cpu_online() checkPeng Hao
can_stop_idle_tick() checks cpu_online() twice. The first check leaves the function when the CPU is not online, so the second check it redundant. Remove it. Signed-off-by: Peng Hao <peng.hao2@zte.com.cn> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: fweisbec@gmail.com Link: https://lkml.kernel.org/r/1539099815-2943-1-git-send-email-penghao122@sina.com.cn
2018-07-31nohz: Fix local_timer_softirq_pending()Anna-Maria Gleixner
local_timer_softirq_pending() checks whether the timer softirq is pending with: local_softirq_pending() & TIMER_SOFTIRQ. This is wrong because TIMER_SOFTIRQ is the softirq number and not a bitmask. So the test checks for the wrong bit. Use BIT(TIMER_SOFTIRQ) instead. Fixes: 5d62c183f9e9 ("nohz: Prevent a timer interrupt storm in tick_nohz_stop_sched_tick()") Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com> Acked-by: Frederic Weisbecker <frederic@kernel.org> Cc: bigeasy@linutronix.de Cc: peterz@infradead.org Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20180731161358.29472-1-anna-maria@linutronix.de
2018-04-26Revert: Unify CLOCK_MONOTONIC and CLOCK_BOOTTIMEThomas Gleixner
Revert commits 92af4dcb4e1c ("tracing: Unify the "boot" and "mono" tracing clocks") 127bfa5f4342 ("hrtimer: Unify MONOTONIC and BOOTTIME clock behavior") 7250a4047aa6 ("posix-timers: Unify MONOTONIC and BOOTTIME clock behavior") d6c7270e913d ("timekeeping: Remove boot time specific code") f2d6fdbfd238 ("Input: Evdev - unify MONOTONIC and BOOTTIME clock behavior") d6ed449afdb3 ("timekeeping: Make the MONOTONIC clock behave like the BOOTTIME clock") 72199320d49d ("timekeeping: Add the new CLOCK_MONOTONIC_ACTIVE clock") As stated in the pull request for the unification of CLOCK_MONOTONIC and CLOCK_BOOTTIME, it was clear that we might have to revert the change. As reported by several folks systemd and other applications rely on the documented behaviour of CLOCK_MONOTONIC on Linux and break with the above changes. After resume daemons time out and other timeout related issues are observed. Rafael compiled this list: * systemd kills daemons on resume, after >WatchdogSec seconds of suspending (Genki Sky). [Verified that that's because systemd uses CLOCK_MONOTONIC and expects it to not include the suspend time.] * systemd-journald misbehaves after resume: systemd-journald[7266]: File /var/log/journal/016627c3c4784cd4812d4b7e96a34226/system.journal corrupted or uncleanly shut down, renaming and replacing. (Mike Galbraith). * NetworkManager reports "networking disabled" and networking is broken after resume 50% of the time (Pavel). [May be because of systemd.] * MATE desktop dims the display and starts the screensaver right after system resume (Pavel). * Full system hang during resume (me). [May be due to systemd or NM or both.] That happens on debian and open suse systems. It's sad, that these problems were neither catched in -next nor by those folks who expressed interest in this change. Reported-by: Rafael J. Wysocki <rjw@rjwysocki.net> Reported-by: Genki Sky <sky@genki.is>, Reported-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com> Cc: John Stultz <john.stultz@linaro.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kevin Easton <kevin@guarana.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mark Salyzyn <salyzyn@android.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Petr Mladek <pmladek@suse.com> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org>
2018-04-26tick/sched: Do not mess with an enqueued hrtimerThomas Gleixner
Kaike reported that in tests rdma hrtimers occasionaly stopped working. He did great debugging, which provided enough context to decode the problem. CPU 3 CPU 2 idle start sched_timer expires = 712171000000 queue->next = sched_timer start rdmavt timer. expires = 712172915662 lock(baseof(CPU3)) tick_nohz_stop_tick() tick = 716767000000 timerqueue_add(tmr) hrtimer_set_expires(sched_timer, tick); sched_timer->expires = 716767000000 <---- FAIL if (tmr->expires < queue->next->expires) hrtimer_start(sched_timer) queue->next = tmr; lock(baseof(CPU3)) unlock(baseof(CPU3)) timerqueue_remove() timerqueue_add() ts->sched_timer is queued and queue->next is pointing to it, but then ts->sched_timer.expires is modified. This not only corrupts the ordering of the timerqueue RB tree, it also makes CPU2 see the new expiry time of timerqueue->next->expires when checking whether timerqueue->next needs to be updated. So CPU2 sees that the rdma timer is earlier than timerqueue->next and sets the rdma timer as new next. Depending on whether it had also seen the new time at RB tree enqueue, it might have queued the rdma timer at the wrong place and then after removing the sched_timer the RB tree is completely hosed. The problem was introduced with a commit which tried to solve inconsistency between the hrtimer in the tick_sched data and the underlying hardware clockevent. It split out hrtimer_set_expires() to store the new tick time in both the NOHZ and the NOHZ + HIGHRES case, but missed the fact that in the NOHZ + HIGHRES case the hrtimer might still be queued. Use hrtimer_start(timer, tick...) for the NOHZ + HIGHRES case which sets timer->expires after canceling the timer and move the hrtimer_set_expires() invocation into the NOHZ only code path which is not affected as it merily uses the hrtimer as next event storage so code pathes can be shared with the NOHZ + HIGHRES case. Fixes: d4af6d933ccf ("nohz: Fix spurious warning when hrtimer and clockevent get out of sync") Reported-by: "Wan Kaike" <kaike.wan@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Frederic Weisbecker <frederic@kernel.org> Cc: "Marciniszyn Mike" <mike.marciniszyn@intel.com> Cc: Anna-Maria Gleixner <anna-maria@linutronix.de> Cc: linux-rdma@vger.kernel.org Cc: "Dalessandro Dennis" <dennis.dalessandro@intel.com> Cc: "Fleck John" <john.fleck@intel.com> Cc: stable@vger.kernel.org Cc: Peter Zijlstra <peterz@infradead.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: "Weiny Ira" <ira.weiny@intel.com> Cc: "linux-rdma@vger.kernel.org" Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1804241637390.1679@nanos.tec.linutronix.de Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1804242119210.1597@nanos.tec.linutronix.de
2018-04-11Merge branches 'pm-cpuidle' and 'pm-qos'Rafael J. Wysocki
* pm-cpuidle: tick-sched: avoid a maybe-uninitialized warning cpuidle: Add definition of residency to sysfs documentation time: hrtimer: Use timerqueue_iterate_next() to get to the next timer nohz: Avoid duplication of code related to got_idle_tick nohz: Gather tick_sched booleans under a common flag field cpuidle: menu: Avoid selecting shallow states with stopped tick cpuidle: menu: Refine idle state selection for running tick sched: idle: Select idle state before stopping the tick time: hrtimer: Introduce hrtimer_next_event_without() time: tick-sched: Split tick_nohz_stop_sched_tick() cpuidle: Return nohz hint from cpuidle_select() jiffies: Introduce USER_TICK_USEC and redefine TICK_USEC sched: idle: Do not stop the tick before cpuidle_idle_call() sched: idle: Do not stop the tick upfront in the idle loop time: tick-sched: Reorganize idle tick management code * pm-qos: PM / QoS: mark expected switch fall-throughs
2018-04-10tick-sched: avoid a maybe-uninitialized warningArnd Bergmann
The use of bitfields seems to confuse gcc, leading to a false-positive warning in all compiler versions: kernel/time/tick-sched.c: In function 'tick_nohz_idle_exit': kernel/time/tick-sched.c:538:2: error: 'now' may be used uninitialized in this function [-Werror=maybe-uninitialized] This introduces a temporary variable to track the flags so gcc doesn't have to evaluate twice, eliminating the code path that leads to the warning. Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85301 Fixes: 1cae544d42d2 ("nohz: Gather tick_sched booleans under a common flag field") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2018-04-09nohz: Avoid duplication of code related to got_idle_tickRafael J. Wysocki
Move the code setting ts->got_idle_tick into tick_sched_do_timer() to avoid code duplication. No intentional changes in functionality. Suggested-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
2018-04-09nohz: Gather tick_sched booleans under a common flag fieldFrederic Weisbecker
Optimize the space and leave plenty of room for further flags. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> [ rjw: Do not use __this_cpu_read() to access tick_stopped and add got_idle_tick to avoid overloading inidle ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2018-04-09cpuidle: menu: Refine idle state selection for running tickRafael J. Wysocki
If the tick isn't stopped, the target residency of the state selected by the menu governor may be greater than the actual time to the next tick and that means lost energy. To avoid that, make tick_nohz_get_sleep_length() return the current time to the next event (before stopping the tick) in addition to the estimated one via an extra pointer argument and make menu_select() use that value to refine the state selection when necessary. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2018-04-09sched: idle: Select idle state before stopping the tickRafael J. Wysocki
In order to address the issue with short idle duration predictions by the idle governor after the scheduler tick has been stopped, reorder the code in cpuidle_idle_call() so that the governor idle state selection runs before tick_nohz_idle_go_idle() and use the "nohz" hint returned by cpuidle_select() to decide whether or not to stop the tick. This isn't straightforward, because menu_select() invokes tick_nohz_get_sleep_length() to get the time to the next timer event and the number returned by the latter comes from __tick_nohz_idle_stop_tick(). Fortunately, however, it is possible to compute that number without actually stopping the tick and with the help of the existing code. Namely, tick_nohz_get_sleep_length() can be made call tick_nohz_next_event(), introduced earlier, to get the time to the next non-highres timer event. If that happens, tick_nohz_next_event() need not be called by __tick_nohz_idle_stop_tick() again. If it turns out that the scheduler tick cannot be stopped going forward or the next timer event is too close for the tick to be stopped, tick_nohz_get_sleep_length() can simply return the time to the next event currently programmed into the corresponding clock event device. In addition to knowing the return value of tick_nohz_next_event(), however, tick_nohz_get_sleep_length() needs to know the time to the next highres timer event, but with the scheduler tick timer excluded, which can be computed with the help of hrtimer_get_next_event(). That minimum of that number and the tick_nohz_next_event() return value is the total time to the next timer event with the assumption that the tick will be stopped. It can be returned to the idle governor which can use it for predicting idle duration (under the assumption that the tick will be stopped) and deciding whether or not it makes sense to stop the tick before putting the CPU into the selected idle state. With the above, the sleep_length field in struct tick_sched is not necessary any more, so drop it. Link: https://bugzilla.kernel.org/show_bug.cgi?id=199227 Reported-by: Doug Smythies <dsmythies@telus.net> Reported-by: Thomas Ilsche <thomas.ilsche@tu-dresden.de> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
2018-04-07time: tick-sched: Split tick_nohz_stop_sched_tick()Rafael J. Wysocki
In order to address the issue with short idle duration predictions by the idle governor after the scheduler tick has been stopped, split tick_nohz_stop_sched_tick() into two separate routines, one computing the time to the next timer event and the other simply stopping the tick when the time to the next timer event is known. Prepare these two routines to be called separately, as one of them will be called by the idle governor in the cpuidle_select() code path after subsequent changes. Update the former callers of tick_nohz_stop_sched_tick() to use the new routines, tick_nohz_next_event() and tick_nohz_stop_tick(), instead of it and move the updates of the sleep_length field in struct tick_sched into __tick_nohz_idle_stop_tick() as it doesn't need to be updated anywhere else. There should be no intentional visible changes in functionality resulting from this change. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
2018-04-06cpuidle: Return nohz hint from cpuidle_select()Rafael J. Wysocki
Add a new pointer argument to cpuidle_select() and to the ->select cpuidle governor callback to allow a boolean value indicating whether or not the tick should be stopped before entering the selected state to be returned from there. Make the ladder governor ignore that pointer (to preserve its current behavior) and make the menu governor return 'false" through it if: (1) the idle exit latency is constrained at 0, or (2) the selected state is a polling one, or (3) the expected idle period duration is within the tick period range. In addition to that, the correction factor computations in the menu governor need to take the possibility that the tick may not be stopped into account to avoid artificially small correction factor values. To that end, add a mechanism to record tick wakeups, as suggested by Peter Zijlstra, and use it to modify the menu_update() behavior when tick wakeup occurs. Namely, if the CPU is woken up by the tick and the return value of tick_nohz_get_sleep_length() is not within the tick boundary, the predicted idle duration is likely too short, so make menu_update() try to compensate for that by updating the governor statistics as though the CPU was idle for a long time. Since the value returned through the new argument pointer of cpuidle_select() is not used by its caller yet, this change by itself is not expected to alter the functionality of the code. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2018-04-05sched: idle: Do not stop the tick upfront in the idle loopRafael J. Wysocki
Push the decision whether or not to stop the tick somewhat deeper into the idle loop. Stopping the tick upfront leads to unpleasant outcomes in case the idle governor doesn't agree with the nohz code on the duration of the upcoming idle period. Specifically, if the tick has been stopped and the idle governor predicts short idle, the situation is bad regardless of whether or not the prediction is accurate. If it is accurate, the tick has been stopped unnecessarily which means excessive overhead. If it is not accurate, the CPU is likely to spend too much time in the (shallow, because short idle has been predicted) idle state selected by the governor [1]. As the first step towards addressing this problem, change the code to make the tick stopping decision inside of the loop in do_idle(). In particular, do not stop the tick in the cpu_idle_poll() code path. Also don't do that in tick_nohz_irq_exit() which doesn't really have enough information on whether or not to stop the tick. Link: https://marc.info/?l=linux-pm&m=150116085925208&w=2 # [1] Link: https://tu-dresden.de/zih/forschung/ressourcen/dateien/projekte/haec/powernightmares.pdf Suggested-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2018-04-05time: tick-sched: Reorganize idle tick management codeRafael J. Wysocki
Prepare the scheduler tick code for reworking the idle loop to avoid stopping the tick in some cases. The idea is to split the nohz idle entry call to decouple the idle time stats accounting and preparatory work from the actual tick stop code, in order to later be able to delay the tick stop once we reach more power-knowledgeable callers. Move away the tick_nohz_start_idle() invocation from __tick_nohz_idle_enter(), rename the latter to __tick_nohz_idle_stop_tick() and define tick_nohz_idle_stop_tick() as a wrapper around it for calling it from the outside. Make tick_nohz_idle_enter() only call tick_nohz_start_idle() instead of calling the entire __tick_nohz_idle_enter(), add another wrapper disabling and enabling interrupts around tick_nohz_idle_stop_tick() and make the current callers of tick_nohz_idle_enter() call it too to retain their current functionality. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2018-04-04Merge branch 'timers-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull time(r) updates from Thomas Gleixner: "A small set of updates for timers and timekeeping: - The most interesting change is the consolidation of clock MONOTONIC and clock BOOTTIME. Clock MONOTONIC behaves now exactly like clock BOOTTIME and does not longer ignore the time spent in suspend. A new clock MONOTONIC_ACTIVE is provived which behaves like clock MONOTONIC in kernels before this change. This allows applications to programmatically check for the clock MONOTONIC behaviour. As discussed in the review thread, this has the potential of breaking user space and we might have to revert this. Knock on wood that we can avoid that exercise. - Updates to the NTP mechanism to improve accuracy - A new kernel internal data structure to aid the ongoing Y2038 work. - Cleanups and simplifications of the clocksource code. - Make the alarmtimer code play nicely with debugobjects" * 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: alarmtimer: Init nanosleep alarm timer on stack y2038: Introduce struct __kernel_old_timeval tracing: Unify the "boot" and "mono" tracing clocks hrtimer: Unify MONOTONIC and BOOTTIME clock behavior posix-timers: Unify MONOTONIC and BOOTTIME clock behavior timekeeping: Remove boot time specific code Input: Evdev - unify MONOTONIC and BOOTTIME clock behavior timekeeping: Make the MONOTONIC clock behave like the BOOTTIME clock timekeeping: Add the new CLOCK_MONOTONIC_ACTIVE clock timekeeping/ntp: Determine the multiplier directly from NTP tick length timekeeping/ntp: Don't align NTP frequency adjustments to ticks clocksource: Use ATTRIBUTE_GROUPS clocksource: Use DEVICE_ATTR_RW/RO/WO to define device attributes clocksource: Don't walk the clocksource list for empty override
2018-04-02Merge branch 'sched-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: "The main scheduler changes in this cycle were: - NUMA balancing improvements (Mel Gorman) - Further load tracking improvements (Patrick Bellasi) - Various NOHZ balancing cleanups and optimizations (Peter Zijlstra) - Improve blocked load handling, in particular we can now reduce and eventually stop periodic load updates on 'very idle' CPUs. (Vincent Guittot) - On isolated CPUs offload the final 1Hz scheduler tick as well, plus related cleanups and reorganization. (Frederic Weisbecker) - Core scheduler code cleanups (Ingo Molnar)" * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (45 commits) sched/core: Update preempt_notifier_key to modern API sched/cpufreq: Rate limits for SCHED_DEADLINE sched/fair: Update util_est only on util_avg updates sched/cpufreq/schedutil: Use util_est for OPP selection sched/fair: Use util_est in LB and WU paths sched/fair: Add util_est on top of PELT sched/core: Remove TASK_ALL sched/completions: Use bool in try_wait_for_completion() sched/fair: Update blocked load when newly idle sched/fair: Move idle_balance() sched/nohz: Merge CONFIG_NO_HZ_COMMON blocks sched/fair: Move rebalance_domains() sched/nohz: Optimize nohz_idle_balance() sched/fair: Reduce the periodic update duration sched/nohz: Stop NOHZ stats when decayed sched/cpufreq: Provide migration hint sched/nohz: Clean up nohz enter/exit sched/fair: Update blocked load from NEWIDLE sched/fair: Add NOHZ stats balancing sched/fair: Restructure nohz_balance_kick() ...
2018-03-13timekeeping: Make the MONOTONIC clock behave like the BOOTTIME clockThomas Gleixner
The MONOTONIC clock is not fast forwarded by the time spent in suspend on resume. This is only done for the BOOTTIME clock. The reason why the MONOTONIC clock is not forwarded is historical: the original Linux implementation was using jiffies as a base for the MONOTONIC clock and jiffies have never been advanced after resume. At some point when timekeeping was unified in the core code, the MONONOTIC clock was advanced after resume which also advanced jiffies causing interesting side effects. As a consequence the the MONOTONIC clock forwarding was disabled again and the BOOTTIME clock was introduced, which allows to read time since boot. Back then it was not possible to completely distangle the MONOTONIC clock and jiffies because there were still interfaces which exposed the MONOTONIC clock behaviour based on the timer wheel and therefore jiffies. As of today none of the MONOTONIC clock facilities depends on jiffies anymore so the forwarding can be done seperately. This is achieved by forwarding the variables which are used for the jiffies update after resume before the tick is restarted, In timekeeping resume, the change is rather simple. Instead of updating the offset between the MONOTONIC clock and the REALTIME/BOOTTIME clocks, advance the time keeper base for the MONOTONIC and the MONOTONIC_RAW clocks by the time spent in suspend. The MONOTONIC clock is now the same as the BOOTTIME clock and the offset between the REALTIME and the MONOTONIC clocks is the same as before suspend. There might be side effects in applications, which rely on the (unfortunately) well documented behaviour of the MONOTONIC clock, but the downsides of the existing behaviour are probably worse. There is one obvious issue. Up to now it was possible to retrieve the time spent in suspend by observing the delta between the MONOTONIC clock and the BOOTTIME clock. This is not longer available, but the previously introduced mechanism to read the active non-suspended monotonic time can mitigate that in a detectable fashion. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com> Cc: John Stultz <john.stultz@linaro.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kevin Easton <kevin@guarana.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mark Salyzyn <salyzyn@android.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Petr Mladek <pmladek@suse.com> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Link: http://lkml.kernel.org/r/20180301165150.062975504@linutronix.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-09sched/nohz: Clean up nohz enter/exitPeter Zijlstra
The primary observation is that nohz enter/exit is always from the current CPU, therefore NOHZ_TICK_STOPPED does not in fact need to be an atomic. Secondary is that we appear to have 2 nearly identical hooks in the nohz enter code, set_cpu_sd_state_idle() and nohz_balance_enter_idle(). Fold the whole set_cpu_sd_state thing into nohz_balance_{enter,exit}_idle. Removes an atomic op from both enter and exit paths. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-02-21sched/nohz: Remove the 1 Hz tick codeFrederic Weisbecker
Now that the 1Hz tick is offloaded to workqueues, we can safely remove the residual code that used to handle it locally. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Chris Metcalf <cmetcalf@mellanox.com> Cc: Christoph Lameter <cl@linux.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Luiz Capitulino <lcapitulino@redhat.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Rik van Riel <riel@redhat.com> Cc: Wanpeng Li <kernellwp@gmail.com> Link: http://lkml.kernel.org/r/1519186649-3242-7-git-send-email-frederic@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-02-21nohz: Allow to check if remote CPU tick is stoppedFrederic Weisbecker
This check is racy but provides a good heuristic to determine whether a CPU may need a remote tick or not. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Chris Metcalf <cmetcalf@mellanox.com> Cc: Christoph Lameter <cl@linux.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Luiz Capitulino <lcapitulino@redhat.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Rik van Riel <riel@redhat.com> Cc: Wanpeng Li <kernellwp@gmail.com> Link: http://lkml.kernel.org/r/1519186649-3242-4-git-send-email-frederic@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-02-21nohz: Convert tick_nohz_tick_stopped() to boolFrederic Weisbecker
It makes this function more self-explanatory about what it does and how to use it. Reported-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Chris Metcalf <cmetcalf@mellanox.com> Cc: Christoph Lameter <cl@linux.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Luiz Capitulino <lcapitulino@redhat.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Rik van Riel <riel@redhat.com> Cc: Wanpeng Li <kernellwp@gmail.com> Link: http://lkml.kernel.org/r/1519186649-3242-3-git-send-email-frederic@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-02-15sched/isolation: Eliminate NO_HZ_FULL_ALLPaul E. McKenney
Commit 6f1982fedd59 ("sched/isolation: Handle the nohz_full= parameter") broke CONFIG_NO_HZ_FULL_ALL=y kernels. This breakage is due to the code under CONFIG_NO_HZ_FULL_ALL failing to invoke the shiny new housekeeping functions. This means that rcutorture scenario TREE04 now emits RCU CPU stall warnings due to the RCU grace-period kthreads not being awakened at a time of their choosing, or perhaps even not at all: [ 27.731422] rcu_bh kthread starved for 21001 jiffies! g18446744073709551369 c18446744073709551368 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x402 ->cpu=3 [ 27.731423] rcu_bh I14936 9 2 0x80080000 [ 27.731435] Call Trace: [ 27.731440] __schedule+0x31a/0x6d0 [ 27.731442] schedule+0x31/0x80 [ 27.731446] schedule_timeout+0x15a/0x320 [ 27.731453] ? call_timer_fn+0x130/0x130 [ 27.731457] rcu_gp_kthread+0x66c/0xea0 [ 27.731458] ? rcu_gp_kthread+0x66c/0xea0 Because no one has complained about CONFIG_NO_HZ_FULL_ALL=y being broken, I hypothesize that no one is in fact using it, other than rcutorture. This commit therefore eliminates CONFIG_NO_HZ_FULL_ALL and updates rcutorture's config files to instead use the nohz_full= kernel parameter to put the desired CPUs into nohz_full mode. Fixes: 6f1982fedd59 ("sched/isolation: Handle the nohz_full= parameter") Reported-by: kernel test robot <xiaolong.ye@intel.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Frederic Weisbecker <frederic@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Chris Metcalf <cmetcalf@mellanox.com> Cc: Christoph Lameter <cl@linux.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Luiz Capitulino <lcapitulino@redhat.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Wanpeng Li <kernellwp@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: John Stultz <john.stultz@linaro.org> Cc: Jonathan Corbet <corbet@lwn.net>
2018-01-16hrtimer: Optimize the hrtimer code by using static keys for ↵Thomas Gleixner
migration_enable/nohz_active The hrtimer_cpu_base::migration_enable and ::nohz_active fields were originally introduced to avoid accessing global variables for these decisions. Still that results in a (cache hot) load and conditional branch, which can be avoided by using static keys. Implement it with static keys and optimize for the most critical case of high performance networking which tends to disable the timer migration functionality. No change in functionality. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Anna-Maria Gleixner <anna-maria@linutronix.de> Cc: Christoph Hellwig <hch@lst.de> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: John Stultz <john.stultz@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: keescook@chromium.org Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1801142327490.2371@nanos Link: https://lkml.kernel.org/r/20171221104205.7269-2-anna-maria@linutronix.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-12-29nohz: Prevent a timer interrupt storm in tick_nohz_stop_sched_tick()Thomas Gleixner
The conditions in irq_exit() to invoke tick_nohz_irq_exit() which subsequently invokes tick_nohz_stop_sched_tick() are: if ((idle_cpu(cpu) && !need_resched()) || tick_nohz_full_cpu(cpu)) If need_resched() is not set, but a timer softirq is pending then this is an indication that the softirq code punted and delegated the execution to softirqd. need_resched() is not true because the current interrupted task takes precedence over softirqd. Invoking tick_nohz_irq_exit() in this case can cause an endless loop of timer interrupts because the timer wheel contains an expired timer, but softirqs are not yet executed. So it returns an immediate expiry request, which causes the timer to fire immediately again. Lather, rinse and repeat.... Prevent that by adding a check for a pending timer soft interrupt to the conditions in tick_nohz_stop_sched_tick() which avoid calling get_next_timer_interrupt(). That keeps the tick sched timer on the tick and prevents a repetitive programming of an already expired timer. Reported-by: Sebastian Siewior <bigeasy@linutronix.d> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Paul McKenney <paulmck@linux.vnet.ibm.com> Cc: Anna-Maria Gleixner <anna-maria@linutronix.de> Cc: Sebastian Siewior <bigeasy@linutronix.de> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/alpine.DEB.2.20.1712272156050.2431@nanos
2017-12-28cpufreq: schedutil: Use idle_calls counter of the remote CPUJoel Fernandes
Since the recent remote cpufreq callback work, its possible that a cpufreq update is triggered from a remote CPU. For single policies however, the current code uses the local CPU when trying to determine if the remote sg_cpu entered idle or is busy. This is incorrect. To remedy this, compare with the nohz tick idle_calls counter of the remote CPU. Fixes: 674e75411fc2 (sched: cpufreq: Allow remote cpufreq callbacks) Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Joel Fernandes <joelaf@google.com> Cc: 4.14+ <stable@vger.kernel.org> # 4.14+ Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-11-13Merge branch 'sched-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: "The main updates in this cycle were: - Group balancing enhancements and cleanups (Brendan Jackman) - Move CPU isolation related functionality into its separate kernel/sched/isolation.c file, with related 'housekeeping_*()' namespace and nomenclature et al. (Frederic Weisbecker) - Improve the interactive/cpu-intense fairness calculation (Josef Bacik) - Improve the PELT code and related cleanups (Peter Zijlstra) - Improve the logic of pick_next_task_fair() (Uladzislau Rezki) - Improve the RT IPI based balancing logic (Steven Rostedt) - Various micro-optimizations: - better !CONFIG_SCHED_DEBUG optimizations (Patrick Bellasi) - better idle loop (Cheng Jian) - ... plus misc fixes, cleanups and updates" * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (54 commits) sched/core: Optimize sched_feat() for !CONFIG_SCHED_DEBUG builds sched/sysctl: Fix attributes of some extern declarations sched/isolation: Document isolcpus= boot parameter flags, mark it deprecated sched/isolation: Add basic isolcpus flags sched/isolation: Move isolcpus= handling to the housekeeping code sched/isolation: Handle the nohz_full= parameter sched/isolation: Introduce housekeeping flags sched/isolation: Split out new CONFIG_CPU_ISOLATION=y config from CONFIG_NO_HZ_FULL sched/isolation: Rename is_housekeeping_cpu() to housekeeping_cpu() sched/isolation: Use its own static key sched/isolation: Make the housekeeping cpumask private sched/isolation: Provide a dynamic off-case to housekeeping_any_cpu() sched/isolation, watchdog: Use housekeeping_cpumask() instead of ad-hoc version sched/isolation: Move housekeeping related code to its own file sched/idle: Micro-optimize the idle loop sched/isolcpus: Fix "isolcpus=" boot parameter handling when !CONFIG_CPUMASK_OFFSTACK x86/tsc: Append the 'tsc=' description for the 'tsc=unstable' boot parameter sched/rt: Simplify the IPI based RT balancing logic block/ioprio: Use a helper to check for RT prio sched/rt: Add a helper to test for a RT task ...
2017-11-08timers/nohz: Use lockdep to assert IRQs are disabled/enabledFrederic Weisbecker
Use lockdep to check that IRQs are enabled or disabled as expected. This way the sanity check only shows overhead when concurrency correctness debug code is enabled. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Acked-by: Thomas Gleixner <tglx@linutronix.de> Cc: David S . Miller <davem@davemloft.net> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/1509980490-4285-5-git-send-email-frederic@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-27sched/isolation: Handle the nohz_full= parameterFrederic Weisbecker
We want to centralize the isolation management, done by the housekeeping subsystem. Therefore we need to handle the nohz_full= parameter from there. Since nohz_full= so far has involved unbound timers, watchdog, RCU and tilegx NAPI isolation, we keep that default behaviour. nohz_full= will be deprecated in the future. We want to control the isolation features from the isolcpus= parameter. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Acked-by: Thomas Gleixner <tglx@linutronix.de> Cc: Chris Metcalf <cmetcalf@mellanox.com> Cc: Christoph Lameter <cl@linux.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Luiz Capitulino <lcapitulino@redhat.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Wanpeng Li <kernellwp@gmail.com> Link: http://lkml.kernel.org/r/1509072159-31808-10-git-send-email-frederic@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-27sched/isolation: Move housekeeping related code to its own fileFrederic Weisbecker
The housekeeping code is currently tied to the NOHZ code. As we are planning to make housekeeping independent from it, start with moving the relevant code to its own file. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Acked-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Chris Metcalf <cmetcalf@mellanox.com> Cc: Christoph Lameter <cl@linux.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Luiz Capitulino <lcapitulino@redhat.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Wanpeng Li <kernellwp@gmail.com> Link: http://lkml.kernel.org/r/1509072159-31808-2-git-send-email-frederic@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-10sched/idle: Move quiet_vmstate() into the NOHZ codePeter Zijlstra
quiet_vmstat() is an expensive function that only makes sense when we go into NOHZ. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: aubrey.li@linux.intel.com Cc: cl@linux.com Cc: fweisbec@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-07-03Merge branch 'timers-nohz-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull nohz updates from Ingo Molnar: "The main changes in this cycle relate to fixing another bad (but sporadic and hard to detect) interaction between the dynticks scheduler tick and hrtimers, plus related improvements to better detection and handling of similar problems - by Frédéric Weisbecker" * 'timers-nohz-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: nohz: Fix spurious warning when hrtimer and clockevent get out of sync nohz: Fix buggy tick delay on IRQ storms nohz: Reset next_tick cache even when the timer has no regs nohz: Fix collision between tick and other hrtimers, again nohz: Add hrtimer sanity check
2017-06-22nohz: Move idle balancer registration to the idle pathFrederic Weisbecker
The idle load balancing registration path assumes that we only stop the tick when the CPU is idle, ignoring the nohz full case. As a result, a nohz full CPU that is running a task may be chosen to perform idle load balancing. Lets make sure that only CPUs in dynticks idle mode can be picked as idle load balancers. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Acked-by: Rik van Riel <riel@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1497838322-10913-3-git-send-email-fweisbec@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-22sched/loadavg: Generalize "_idle" naming to "_nohz"Frederic Weisbecker
The loadavg naming code still assumes that nohz == idle whereas its code is actually handling well both nohz idle and nohz full. So lets fix the naming according to what the code actually does, to unconfuse the reader. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Acked-by: Rik van Riel <riel@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1497838322-10913-2-git-send-email-fweisbec@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-13nohz: Fix spurious warning when hrtimer and clockevent get out of syncFrederic Weisbecker
The sanity check ensuring that the tick expiry cache (ts->next_tick) is actually in sync with the hardware clock (dev->next_event) makes the wrong assumption that the clock can't be programmed later than the hrtimer deadline. In fact the clock hardware can be programmed later on some conditions such as: * The hrtimer deadline is already in the past. * The hrtimer deadline is earlier than the minimum delay supported by the hardware. Such conditions can be met when we program the tick, for example if the last jiffies update hasn't been seen by the current CPU yet, we may program the hrtimer to a deadline that is earlier than ktime_get() because last_jiffies_update is our timestamp base to compute the next tick. As a result, we can randomly observe such warning: WARNING: CPU: 5 PID: 0 at kernel/time/tick-sched.c:794 tick_nohz_stop_sched_tick kernel/time/tick-sched.c:791 [inline] Call Trace: tick_nohz_irq_exit tick_irq_exit irq_exit exiting_irq smp_call_function_interrupt smp_call_function_single_interrupt call_function_single_interrupt Therefore, let's rather make sure that the tick expiry cache is sync'ed with the tick hrtimer deadline, against which it is not supposed to drift away. The clock hardware instead has its own will and can't be used as a reliable comparison point. Reported-and-tested-by: Sasha Levin <alexander.levin@verizon.com> Reported-and-tested-by: Abdul Haleem <abdhalee@linux.vnet.ibm.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: James Hartsock <hartsjc@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tim Wright <tim@binbash.co.uk> Link: http://lkml.kernel.org/r/1497326654-14122-1-git-send-email-fweisbec@gmail.com [ Minor readability edit. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-05nohz: Fix buggy tick delay on IRQ stormsFrederic Weisbecker
When the tick is stopped and we reach the dynticks evaluation code on IRQ exit, we perform a soft tick restart if we observe an expired timer from there. It means we program the nearest possible tick but we stay in dynticks mode (ts->tick_stopped = 1) because we may need to stop the tick again after that expired timer is handled. Now this solution works most of the time but if we suffer an IRQ storm and those interrupts trigger faster than the hardware clockevents min delay, our tick won't fire until that IRQ storm is finished. Here is the problem: on IRQ exit we reprog the timer to at least NOW() + min_clockevents_delay. Another IRQ fires before the tick so we reschedule again to NOW() + min_clockevents_delay, etc... The tick is eternally rescheduled min_clockevents_delay ahead. A solution is to simply remove this soft tick restart. After all the normal dynticks evaluation path can handle 0 delay just fine. And by doing that we benefit from the optimization branch which avoids clock reprogramming if the clockevents deadline hasn't changed since the last reprog. This fixes our issue because we don't do repetitive clock reprog that always add hardware min delay. As a side effect it should even optimize the 0 delay path in general. Reported-and-tested-by: Octavian Purdila <octavian.purdila@nxp.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1496328429-13317-1-git-send-email-fweisbec@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>