summaryrefslogtreecommitdiff
path: root/drivers/cpuidle/governors/menu.c
AgeCommit message (Collapse)Author
2025-04-09cpuidle: menu: Optimize bucket assignment when next_timer_ns equals KTIME_MAXZhongqiu Han
Directly assign the last bucket value instead of calling which_bucket() when next_timer_ns equals KTIME_MAX, the largest possible value that always falls into the last bucket. This avoids unnecessary calculations and enhances performance. Reviewed-by: Christian Loehle <christian.loehle@arm.com> Signed-off-by: Zhongqiu Han <quic_zhonhan@quicinc.com> Link: https://patch.msgid.link/20250405135308.1854342-1-quic_zhonhan@quicinc.com Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-02-25cpuidle: menu: Update documentation after get_typical_interval() changesRafael J. Wysocki
The documentation of the menu cpuidle governor needs to be updated to match the code behavior after some changes made recently. No functional impact. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Christian Loehle <christian.loehle@arm.com> Link: https://patch.msgid.link/4998484.31r3eYUQgx@rjwysocki.net [ rjw: More specific subject, two typos fixed in the changelog ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-02-25cpuidle: menu: Avoid discarding useful informationRafael J. Wysocki
When giving up on making a high-confidence prediction, get_typical_interval() always returns UINT_MAX which means that the next idle interval prediction will be based entirely on the time till the next timer. However, the information represented by the most recent intervals may not be completely useless in those cases. Namely, the largest recent idle interval is an upper bound on the recently observed idle duration, so it is reasonable to assume that the next idle duration is unlikely to exceed it. Moreover, this is still true after eliminating the suspected outliers if the sample set still under consideration is at least as large as 50% of the maximum sample set size. Accordingly, make get_typical_interval() return the current maximum recent interval value in that case instead of UINT_MAX. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reported-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Tested-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Reviewed-by: Christian Loehle <christian.loehle@arm.com> Tested-by: Christian Loehle <christian.loehle@arm.com> Tested-by: Aboorva Devarajan <aboorvad@linux.ibm.com> Link: https://patch.msgid.link/7770672.EvYhyI6sBW@rjwysocki.net
2025-02-25cpuidle: menu: Eliminate outliers on both ends of the sample setRafael J. Wysocki
Currently, get_typical_interval() attempts to eliminate outliers at the high end of the sample set only (probably in order to bias the prediction toward lower values), but this it problematic because if the outliers are present at the low end of the sample set, discarding the highest values will not help to reduce the variance. Since the presence of outliers at the low end of the sample set is generally as likely as their presence at the high end of the sample set, modify get_typical_interval() to treat samples at the largest distances from the average (on both ends of the sample set) as outliers. This should increase the likelihood of making a meaningful prediction in some cases. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reported-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Tested-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Reviewed-by: Christian Loehle <christian.loehle@arm.com> Tested-by: Christian Loehle <christian.loehle@arm.com> Tested-by: Aboorva Devarajan <aboorvad@linux.ibm.com> Link: https://patch.msgid.link/2301940.iZASKD2KPV@rjwysocki.net
2025-02-25cpuidle: menu: Tweak threshold use in get_typical_interval()Rafael J. Wysocki
To prepare get_typical_interval() for subsequent changes, rearrange the use of the data point threshold in it a bit and initialize that threshold to UINT_MAX which is more consistent with its data type. No intentional functional impact. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Tested-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Reviewed-by: Christian Loehle <christian.loehle@arm.com> Tested-by: Christian Loehle <christian.loehle@arm.com> Tested-by: Aboorva Devarajan <aboorvad@linux.ibm.com> Link: https://patch.msgid.link/8490144.T7Z3S40VBb@rjwysocki.net
2025-02-25cpuidle: menu: Use one loop for average and variance computationsRafael J. Wysocki
Use the observation that one loop is sufficient to compute the average of an array of values and their variance to eliminate one of the loops from get_typical_interval(). While at it, make get_typical_interval() consistently use u64 as the 64-bit unsigned integer data type and rearrange some white space and the declarations of local variables in it (to make them follow the reverse X-mas tree pattern). No intentional functional impact. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Tested-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Reviewed-by: Christian Loehle <christian.loehle@arm.com> Tested-by: Christian Loehle <christian.loehle@arm.com> Tested-by: Aboorva Devarajan <aboorvad@linux.ibm.com> Link: https://patch.msgid.link/3339073.aeNJFYEL58@rjwysocki.net
2025-02-25cpuidle: menu: Drop a redundant local variableRafael J. Wysocki
Local variable min in get_typical_interval() is updated, but never accessed later, so drop it. No functional impact. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Christian Loehle <christian.loehle@arm.com> Tested-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Tested-by: Christian Loehle <christian.loehle@arm.com> Tested-by: Aboorva Devarajan <aboorvad@linux.ibm.com> Link: https://patch.msgid.link/13699686.uLZWGnKmhe@rjwysocki.net
2024-09-30cpuidle: menu: Remove iowait influenceChristian Loehle
Remove CPU iowaiters influence on idle state selection. Remove the menu notion of performance multiplier which increased with the number of tasks that went to iowait sleep on this CPU and haven't woken up yet. Relying on iowait for cpuidle is problematic for a few reasons: 1. There is no guarantee that an iowaiting task will wake up on the same CPU. 2. The task being in iowait says nothing about the idle duration, we could be selecting shallower states for a long time. 3. The task being in iowait doesn't always imply a performance hit with increased latency. 4. If there is such a performance hit, the number of iowaiting tasks doesn't directly correlate. 5. The definition of iowait altogether is vague at best, it is sprinkled across kernel code. Signed-off-by: Christian Loehle <christian.loehle@arm.com> Link: https://patch.msgid.link/20240905092645.2885200-2-christian.loehle@arm.com [ rjw: Minor edits in the changelog ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2024-06-07cpuidle: menu: Cleanup after loadavg removalChristian Loehle
The performance impact of loadavg was removed with commit a7fe5190c03f ("cpuidle: menu: Remove get_loadavg() from the performance multiplier") With only iowait remaining the description can be simplified, remove also the no longer needed includes. Signed-off-by: Christian Loehle <christian.loehle@arm.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2023-08-17cpuidle: menu: Skip tick_nohz_get_sleep_length() call in some casesRafael J. Wysocki
Because the cost of calling tick_nohz_get_sleep_length() may increase in the future, reorder the code in menu_select() so it first uses the statistics to determine the expected idle duration. If that value is higher than RESIDENCY_THRESHOLD_NS, tick_nohz_get_sleep_length() will be called to obtain the time till the closest timer and refine the idle duration prediction if necessary. This causes the governor to always take the full overhead of get_typical_interval() with the assumption that the cost will be amortized by skipping the tick_nohz_get_sleep_length() call in the cases when the predicted idle duration is relatively very small. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Tested-by: Doug Smythies <dsmythies@telus.net>
2021-11-24cpuidle: menu: Fix typo in a commentJason Wang
The word `these' in a comment is repeated, so drop one. Signed-off-by: Jason Wang <wangborong@cdjrlc.com> [ rjw: Changelog edits ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2021-05-12sched: Make nr_iowait_cpu() return 32-bit valueAlexey Dobriyan
Runqueue ->nr_iowait counters are 32-bit anyway. Propagate 32-bitness into other code, but don't try too hard. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20210422200228.1423391-3-adobriyan@gmail.com
2021-04-07cpuidle: menu: Take negative "sleep length" values into accountRafael J. Wysocki
Make the menu governor check the tick_nohz_get_next_hrtimer() return value so as to avoid dealing with negative "sleep length" values and make it use that value directly when the tick is stopped. While at it, rename local variable delta_next in menu_select() to delta_tick which better reflects its purpose. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-11-11cpuidle: Use nanoseconds as the unit of timeRafael J. Wysocki
Currently, the cpuidle subsystem uses microseconds as the unit of time which (among other things) causes the idle loop to incur some integer division overhead for no clear benefit. In order to allow cpuidle to measure time in nanoseconds, add two new fields, exit_latency_ns and target_residency_ns, to represent the exit latency and target residency of an idle state in nanoseconds, respectively, to struct cpuidle_state and initialize them with the help of the corresponding values in microseconds provided by drivers. Additionally, change cpuidle_governor_latency_req() to return the idle state exit latency constraint in nanoseconds. Also meeasure idle state residency (last_residency_ns in struct cpuidle_device and time_ns in struct cpuidle_driver) in nanoseconds and update the cpuidle core and governors accordingly. However, the menu governor still computes typical intervals in microseconds to avoid integer overflows. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Doug Smythies <dsmythies@telus.net> Tested-by: Doug Smythies <dsmythies@telus.net>
2019-11-06cpuidle: Consolidate disabled state checksRafael J. Wysocki
There are two reasons why CPU idle states may be disabled: either because the driver has disabled them or because they have been disabled by user space via sysfs. In the former case, the state's "disabled" flag is set once during the initialization of the driver and it is never cleared later (it is read-only effectively). In the latter case, the "disable" field of the given state's cpuidle_state_usage struct is set and it may be changed via sysfs. Thus checking whether or not an idle state has been disabled involves reading these two flags every time. In order to avoid the additional check of the state's "disabled" flag (which is effectively read-only anyway), use the value of it at the init time to set a (new) flag in the "disable" field of that state's cpuidle_state_usage structure and use the sysfs interface to manipulate another (new) flag in it. This way the state is disabled whenever the "disable" field of its cpuidle_state_usage structure is nonzero, whatever the reason, and it is the only place to look into to check whether or not the state has been disabled. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2019-08-05cpuidle: menu: Allow tick to be stopped if PM QoS is usedRafael J. Wysocki
After commit 554c8aa8ecad ("sched: idle: Select idle state before stopping the tick") the menu governor prevents the scheduler tick from being stopped (unless stopped already) if there is a PM QoS latency constraint for the given CPU and the target residency of the deepest idle state matching that constraint is below the tick boundary. However, that is problematic if CPUs with PM QoS latency constraints are idle for long times, because it effectively causes the tick to run on them all the time which is wasteful. [It is also confusing and questionable if they are full dynticks CPUs.] To address that issue, make the menu governor allow the tick to be stopped only if the idle duration predicted by it is beyond the tick boundary, except when the shallowest idle state is selected upfront and it is not a "polling" one. Fixes: 554c8aa8ecad ("sched: idle: Select idle state before stopping the tick") Link: https://lore.kernel.org/lkml/79b247b3-e056-610e-9a07-e685dfdaa6c9@gmail.com/ Reported-by: Thomas Lindroth <thomas.lindroth@gmail.com> Tested-by: Thomas Lindroth <thomas.lindroth@gmail.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-07-30governors: unify last_state_idxMarcelo Tosatti
Since this field is shared by all governors, move it to cpuidle device structure. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-05-30treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 215Thomas Gleixner
Based on 1 normalized pattern(s): this code is licenced under the gpl version 2 as described in the copying file that acompanies the linux kernel extracted by the scancode license scanner the SPDX license identifier GPL-2.0-only has been chosen to replace the boilerplate/reference in 1 file(s). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Alexios Zavras <alexios.zavras@intel.com> Reviewed-by: Allison Randal <allison@lohutok.net> Reviewed-by: Steve Winslow <swinslow@gmail.com> Cc: linux-spdx@vger.kernel.org Link: https://lkml.kernel.org/r/20190528171439.466585205@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-03-07cpuidle: menu: Avoid overflows when computing varianceRafael J. Wysocki
The variance computation in get_typical_interval() may overflow if the square of the value of diff exceeds the maximum for the int64_t data type value which basically is the case when it is of the order of UINT_MAX. However, data points so far in the future don't matter for idle state selection anyway, so change the initial threshold value in get_typical_interval() to INT_MAX which will cause more "outlying" data points to be discarded without affecting the selection result. Reported-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2018-10-30Merge tag 'pm-4.20-rc1-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull more power management updates from Rafael Wysocki: "These remove a questionable heuristic from the menu cpuidle governor, fix a recent build regression in the intel_pstate driver, clean up ARM big-Little support in cpufreq and fix up hung task watchdog's interaction with system-wide power management transitions. Specifics: - Fix build regression in the intel_pstate driver that doesn't build without CONFIG_ACPI after recent changes (Dominik Brodowski). - One of the heuristics in the menu cpuidle governor is based on a function returning 0 most of the time, so drop it and clean up the scheduler code related to it (Daniel Lezcano). - Prevent the arm_big_little cpufreq driver from being used on ARM64 which is not suitable for it and drop the arm_big_little_dt driver that is not used any more (Sudeep Holla). - Prevent the hung task watchdog from triggering during resume from system-wide sleep states by disabling it before freezing tasks and enabling it again after they have been thawed (Vitaly Kuznetsov)" * tag 'pm-4.20-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: kernel: hung_task.c: disable on suspend cpufreq: remove unused arm_big_little_dt driver cpufreq: drop ARM_BIG_LITTLE_CPUFREQ support for ARM64 cpufreq: intel_pstate: Fix compilation for !CONFIG_ACPI cpuidle: menu: Remove get_loadavg() from the performance multiplier sched: Factor out nr_iowait and nr_iowait_cpu
2018-10-26sched: loadavg: consolidate LOAD_INT, LOAD_FRAC, CALC_LOADJohannes Weiner
There are several definitions of those functions/macros in places that mess with fixed-point load averages. Provide an official version. [akpm@linux-foundation.org: fix missed conversion in block/blk-iolatency.c] Link: http://lkml.kernel.org/r/20180828172258.3185-5-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Suren Baghdasaryan <surenb@google.com> Tested-by: Daniel Drake <drake@endlessm.com> Cc: Christopher Lameter <cl@linux.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Weiner <jweiner@fb.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Enderborg <peter.enderborg@sony.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vinayak Menon <vinmenon@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-25cpuidle: menu: Remove get_loadavg() from the performance multiplierDaniel Lezcano
The function get_loadavg() returns almost always zero. To be more precise, statistically speaking for a total of 1023379 times passing in the function, the load is equal to zero 1020728 times, greater than 100, 610 times, the remaining is between 0 and 5. In 2011, the get_loadavg() was removed from the Android tree because of the above [1]. At this time, the load was: unsigned long this_cpu_load(void) { struct rq *this = this_rq(); return this->cpu_load[0]; } In 2014, the code was changed by commit 372ba8cb46b2 (cpuidle: menu: Lookup CPU runqueues less) and the load is: void get_iowait_load(unsigned long *nr_waiters, unsigned long *load) { struct rq *rq = this_rq(); *nr_waiters = atomic_read(&rq->nr_iowait); *load = rq->load.weight; } with the same result. Both measurements show using the load in this code path does no matter anymore. Removing it. [1] https://android.googlesource.com/kernel/common/+/4dedd9f124703207895777ac6e91dacde0f7cc17 Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2018-10-18cpuidle: menu: Avoid computations when result will be discardedRafael J. Wysocki
If the minimum interval taken into account in the average computation loop in get_typical_interval() is less than the expected idle duration determined so far, the resultant average cannot be greater than that value as well and the entire return result of the function is going to be discarded anyway going forward. In that case, it is a waste of time to carry out the remaining computations in get_typical_interval(), so avoid that by returning early if the minimum interval is not below the expected idle duration. No intentional changes of behavior. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2018-10-18cpuidle: menu: Drop redundant comparisonRafael J. Wysocki
Since the correction factor cannot be greater than RESOLUTION * DECAY, the result of the predicted_us computation in menu_select() cannot be greater than data->next_timer_us, so it is not necessary to compare the "typical interval" value coming from get_typical_interval() with data->next_timer_us separately. It is sufficient to copmare predicted_us with the return value of get_typical_interval() directly, so do that and drop the now redundant expected_interval variable. No intentional changes of behavior. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2018-10-12cpuidle: menu: Simplify checks related to the polling stateRafael J. Wysocki
After some recent menu governor changes, the promotion of the "polling" state to a physical one is mostly controlled by the latency limit (resulting from the "interactivity" factor) and not by the time to the closest timer event, so it should be sufficient to check the exit latency of that state for this purpose (of course, its target residency still needs to be within the next timer event range for energy-efficiency). Also, the physical state the "polling" one is promoted to need not be the next one in principle (in case the next state is disabled, for example). For these reasons, simplify the checks made to decide whether or not to promote the "polling" state to a physical one and update the target idle duration when it is promoted in case the residency of the new state turns out to be above the tick boundary (in which case there is no reason to stop the tick). Tested-by: Doug Smythies <dsmythies@telus.net> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2018-10-04cpuidle: menu: Move the latency_req == 0 special case checkRafael J. Wysocki
It is better to always update data->bucket before returning from menu_select() to avoid updating the correction factor for a stale bucket, so combine the latency_req == 0 special check with the more general check below. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2018-10-04cpuidle: menu: Avoid computations for very close timersRafael J. Wysocki
If the next timer event (not including the tick) is closer than the target residency of the second state or the PM QoS latency constraint is below its exit latency, state[0] will be used regardless of any other factors, so skip the computations in menu_select() then and return 0 straight away from it. Still, do that after the bucket has been determined to avoid updating the correction factor for a stale bucket. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2018-10-04cpuidle: menu: Do not update last_state_idx in menu_select()Rafael J. Wysocki
It is not necessary to update data->last_state_idx in menu_select() as it only is used in menu_update() which only runs when data->needs_update is set and that is set only when updating data->last_state_idx in menu_reflect(). Accordingly, drop the update of data->last_state_idx from menu_select() and get rid of the (now redundant) "out" label from it. No intentional behavior changes. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Daniel Lezcano <daniel.lezcano@linaro.org>
2018-10-04cpuidle: menu: Get rid of first_idx from menu_select()Rafael J. Wysocki
Rearrange the code in menu_select() so that the loop over idle states always starts from 0 and get rid of the first_idx variable. While at it, add two empty lines to separate conditional statements from one another. No intentional behavior changes. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Daniel Lezcano <daniel.lezcano@linaro.org>
2018-10-04cpuidle: menu: Compute first_idx when latency_req is knownRafael J. Wysocki
Since menu_select() can only set first_idx to 1 if the exit latency of the second state is not greater than the latency limit, it should first determine that limit. Thus first_idx should be computed after the "interactivity" factor has been taken into account. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewedy-by: Daniel Lezcano <daniel.lezcano@linaro.org>
2018-10-04cpuidle: menu: Fix wakeup statistics updates for polling stateRafael J. Wysocki
If the CPU exits the "polling" state due to the time limit in the loop in poll_idle(), this is not a real wakeup and it just means that the "polling" state selection was not adequate. The governor mispredicted short idle duration, but had a more suitable state been selected, the CPU might have spent more time in it. In fact, there is no reason to expect that there would have been a wakeup event earlier than the next timer in that case. Handling such cases as regular wakeups in menu_update() may cause the menu governor to make suboptimal decisions going forward, but ignoring them altogether would not be correct either, because every time menu_select() is invoked, it makes a separate new attempt to predict the idle duration taking distinct time to the closest timer event as input and the outcomes of all those attempts should be recorded. For this reason, make menu_update() always assume that if the "polling" state was exited due to the time limit, the next proper wakeup event for the CPU would be the next timer event (not including the tick). Fixes: a37b969a61c1 "cpuidle: poll_state: Add time limit to poll_idle()" Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Daniel Lezcano <daniel.lezcano@linaro.org>
2018-10-03cpuidle: menu: Replace data->predicted_us with local variableRafael J. Wysocki
The predicted_us field in struct menu_device is only accessed in menu_select(), so replace it with a local variable in that function. With that, stop using expected_interval instead of predicted_us to store the new predicted idle duration value if it is set to the selected state's target residency which is quite confusing. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org>
2018-09-18cpuidle: Remove unnecessary wrapper cpuidle_get_last_residency()Fieah Lim
cpuidle_get_last_residency() is just a wrapper for retrieving the last_residency member of struct cpuidle_device. It's also weirdly the only wrapper function for accessing cpuidle_* struct member (by my best guess is it could be a leftover from v2.x). Anyhow, since the only two users (the ladder and menu governors) can access dev->last_residency directly, and it's more intuitive to do it that way, let's just get rid of the wrapper. This patch tidies up CPU idle code a bit without functional changes. Signed-off-by: Fieah Lim <kw@fieahl.im> [ rjw: Changelog cleanup ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2018-08-25cpuidle: menu: Retain tick when shallow state is selectedRafael J. Wysocki
The case addressed by commit 5ef499cd571c (cpuidle: menu: Handle stopped tick more aggressively) in the stopped tick case is present when the tick has not been stopped yet too. Namely, if only two CPU idle states, shallow state A with target residency significantly below the tick boundary and deep state B with target residency significantly above it, are available and the predicted idle duration is above the tick boundary, but below the target residency of state B, state A will be selected and the CPU may spend indefinite amount of time in it, which is not quite energy-efficient. However, if the tick has not been stopped yet and the governor is about to select a shallow idle state for the CPU even though the idle duration predicted by it is above the tick boundary, it should be fine to wake up the CPU early, so the tick can be retained then and the governor will have a chance to select a deeper state when it runs next time. [Note that when this really happens, it will make the idle duration predictor believe that the CPU might be idle longer than predicted, which will make it more likely to predict longer idle durations going forward, but that will also cause deeper idle states to be selected going forward, on average, which is what's needed here.] Fixes: 87c9fe6ee495 (cpuidle: menu: Avoid selecting shallow states with stopped tick) Reported-by: Leo Yan <leo.yan@linaro.org> Cc: 4.17+ <stable@vger.kernel.org> # 4.17+: 5ef499cd571c (cpuidle: menu: Handle ...) Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2018-08-20cpuidle: menu: Handle stopped tick more aggressivelyRafael J. Wysocki
Commit 87c9fe6ee495 (cpuidle: menu: Avoid selecting shallow states with stopped tick) missed the case when the target residencies of deep idle states of CPUs are above the tick boundary which may cause the CPU to get stuck in a shallow idle state for a long time. Say there are two CPU idle states available: one shallow, with the target residency much below the tick boundary and one deep, with the target residency significantly above the tick boundary. In that case, if the tick has been stopped already and the expected next timer event is relatively far in the future, the governor will assume the idle duration to be equal to TICK_USEC and it will select the idle state for the CPU accordingly. However, that will cause the shallow state to be selected even though it would have been more energy-efficient to select the deep one. To address this issue, modify the governor to always use the time till the closest timer event instead of the predicted idle duration if the latter is less than the tick period length and the tick has been stopped already. Also make it extend the search for a matching idle state if the tick is stopped to avoid settling on a shallow state if deep states with target residencies above the tick period length are available. In addition, make it always indicate that the tick should be stopped if it has been stopped already for consistency. Fixes: 87c9fe6ee495 (cpuidle: menu: Avoid selecting shallow states with stopped tick) Reported-by: Leo Yan <leo.yan@linaro.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: 4.17+ <stable@vger.kernel.org> # 4.17+ Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2018-08-16cpuidle: menu: Update stale polling override commentRafael J. Wysocki
The comment to explain why the menu governor uses idle state 1 instead of idle state 0 as the first one sometimes is stale (among other things it mentions a user setting not present any more), so update it. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2018-08-15cpuidle: menu: Fix white spaceRafael J. Wysocki
Fix some damaged white space in menu_select(). Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2018-05-30cpuidle: governors: Consolidate PM QoS handlingRafael J. Wysocki
There is some code duplication related to the PM QoS handling between the existing cpuidle governors, so move that code to a common helper function and call that from the governors. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2018-05-30cpuidle: governors: Drop redundant checks related to PM QoSRafael J. Wysocki
PM_QOS_RESUME_LATENCY_NO_CONSTRAINT is defined as the 32-bit integer maximum, so it is not necessary to test the return value of dev_pm_qos_raw_read_value() against it directly in the menu and ladder cpuidle governors. Drop these redundant checks. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2018-04-09cpuidle: menu: Avoid selecting shallow states with stopped tickRafael J. Wysocki
If the scheduler tick has been stopped already and the governor selects a shallow idle state, the CPU can spend a long time in that state if the selection is based on an inaccurate prediction of idle time. That effect turns out to be relevant, so it needs to be mitigated. To that end, modify the menu governor to discard the result of the idle time prediction if the tick is stopped and the predicted idle time is less than the tick period length, unless the tick timer is going to expire soon. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2018-04-09cpuidle: menu: Refine idle state selection for running tickRafael J. Wysocki
If the tick isn't stopped, the target residency of the state selected by the menu governor may be greater than the actual time to the next tick and that means lost energy. To avoid that, make tick_nohz_get_sleep_length() return the current time to the next event (before stopping the tick) in addition to the estimated one via an extra pointer argument and make menu_select() use that value to refine the state selection when necessary. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2018-04-06cpuidle: Return nohz hint from cpuidle_select()Rafael J. Wysocki
Add a new pointer argument to cpuidle_select() and to the ->select cpuidle governor callback to allow a boolean value indicating whether or not the tick should be stopped before entering the selected state to be returned from there. Make the ladder governor ignore that pointer (to preserve its current behavior) and make the menu governor return 'false" through it if: (1) the idle exit latency is constrained at 0, or (2) the selected state is a polling one, or (3) the expected idle period duration is within the tick period range. In addition to that, the correction factor computations in the menu governor need to take the possibility that the tick may not be stopped into account to avoid artificially small correction factor values. To that end, add a mechanism to record tick wakeups, as suggested by Peter Zijlstra, and use it to modify the menu_update() behavior when tick wakeup occurs. Namely, if the CPU is woken up by the tick and the return value of tick_nohz_get_sleep_length() is not within the tick boundary, the predicted idle duration is likely too short, so make menu_update() try to compensate for that by updating the governor statistics as though the CPU was idle for a long time. Since the value returned through the new argument pointer of cpuidle_select() is not used by its caller yet, this change by itself is not expected to alter the functionality of the code. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2017-11-08PM / QoS: Fix device resume latency frameworkRafael J. Wysocki
The special value of 0 for device resume latency PM QoS means "no restriction", but there are two problems with that. First, device resume latency PM QoS requests with 0 as the value are always put in front of requests with positive values in the priority lists used internally by the PM QoS framework, causing 0 to be chosen as an effective constraint value. However, that 0 is then interpreted as "no restriction" effectively overriding the other requests with specific restrictions which is incorrect. Second, the users of device resume latency PM QoS have no way to specify that *any* resume latency at all should be avoided, which is an artificial limitation in general. To address these issues, modify device resume latency PM QoS to use S32_MAX as the "no constraint" value and 0 as the "no latency at all" one and rework its users (the cpuidle menu governor, the genpd QoS governor and the runtime PM framework) to follow these changes. Also add a special "n/a" value to the corresponding user space I/F to allow user space to indicate that it cannot accept any resume latencies at all for the given device. Fixes: 85dc0b8a4019 (PM / QoS: Make it possible to expose PM QoS latency constraints) Link: https://bugzilla.kernel.org/show_bug.cgi?id=197323 Reported-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Tested-by: Reinette Chatre <reinette.chatre@intel.com> Tested-by: Geert Uytterhoeven <geert+renesas@glider.be> Tested-by: Tero Kristo <t-kristo@ti.com> Reviewed-by: Ramesh Thomas <ramesh.thomas@intel.com>
2017-08-30cpuidle: Eliminate the CPUIDLE_DRIVER_STATE_START symbolRafael J. Wysocki
On some architectures the first (index 0) idle state is a polling one and it doesn't really save energy, so there is the CPUIDLE_DRIVER_STATE_START symbol allowing some pieces of cpuidle code to avoid using that state. However, this makes the code rather hard to follow. It is better to explicitly avoid the polling state, so add a new cpuidle state flag CPUIDLE_FLAG_POLLING to mark it and make the relevant code check that flag for the first state instead of using the CPUIDLE_DRIVER_STATE_START symbol. In the ACPI processor driver that cannot always rely on the state flags (like before the states table has been set up) define a new internal symbol ACPI_IDLE_STATE_START equivalent to the CPUIDLE_DRIVER_STATE_START one and drop the latter. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Tested-by: Sudeep Holla <sudeep.holla@arm.com> Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org>
2017-06-29cpuidle: menu: allow state 0 to be disabledNicholas Piggin
The menu driver does not allow state0 to be disabled completely. If it is disabled but other enabled states don't meet latency requirements, it is still used. Fix this by starting with the first enabled idle state. Fall back to state 0 if no idle states are enabled (arguably this should be -EINVAL if it is attempted, but this is the minimal fix). Acked-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com> Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-03-03Merge branch 'WIP.sched-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull sched.h split-up from Ingo Molnar: "The point of these changes is to significantly reduce the <linux/sched.h> header footprint, to speed up the kernel build and to have a cleaner header structure. After these changes the new <linux/sched.h>'s typical preprocessed size goes down from a previous ~0.68 MB (~22K lines) to ~0.45 MB (~15K lines), which is around 40% faster to build on typical configs. Not much changed from the last version (-v2) posted three weeks ago: I eliminated quirks, backmerged fixes plus I rebased it to an upstream SHA1 from yesterday that includes most changes queued up in -next plus all sched.h changes that were pending from Andrew. I've re-tested the series both on x86 and on cross-arch defconfigs, and did a bisectability test at a number of random points. I tried to test as many build configurations as possible, but some build breakage is probably still left - but it should be mostly limited to architectures that have no cross-compiler binaries available on kernel.org, and non-default configurations" * 'WIP.sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (146 commits) sched/headers: Clean up <linux/sched.h> sched/headers: Remove #ifdefs from <linux/sched.h> sched/headers: Remove the <linux/topology.h> include from <linux/sched.h> sched/headers, hrtimer: Remove the <linux/wait.h> include from <linux/hrtimer.h> sched/headers, x86/apic: Remove the <linux/pm.h> header inclusion from <asm/apic.h> sched/headers, timers: Remove the <linux/sysctl.h> include from <linux/timer.h> sched/headers: Remove <linux/magic.h> from <linux/sched/task_stack.h> sched/headers: Remove <linux/sched.h> from <linux/sched/init.h> sched/core: Remove unused prefetch_stack() sched/headers: Remove <linux/rculist.h> from <linux/sched.h> sched/headers: Remove the 'init_pid_ns' prototype from <linux/sched.h> sched/headers: Remove <linux/signal.h> from <linux/sched.h> sched/headers: Remove <linux/rwsem.h> from <linux/sched.h> sched/headers: Remove the runqueue_is_locked() prototype sched/headers: Remove <linux/sched.h> from <linux/sched/hotplug.h> sched/headers: Remove <linux/sched.h> from <linux/sched/debug.h> sched/headers: Remove <linux/sched.h> from <linux/sched/nohz.h> sched/headers: Remove <linux/sched.h> from <linux/sched/stat.h> sched/headers: Remove the <linux/gfp.h> include from <linux/sched.h> sched/headers: Remove <linux/rtmutex.h> from <linux/sched.h> ...
2017-03-02sched/headers: Prepare for new header dependencies before moving code to ↵Ingo Molnar
<linux/sched/stat.h> We are going to split <linux/sched/stat.h> out of <linux/sched.h>, which will have to be picked up from other headers and a couple of .c files. Create a trivial placeholder <linux/sched/stat.h> file that just maps to <linux/sched.h> to make this patch obviously correct and bisectable. Include the new header in the files that are going to need it. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02sched/headers: Prepare for new header dependencies before moving code to ↵Ingo Molnar
<linux/sched/loadavg.h> We are going to split <linux/sched/loadavg.h> out of <linux/sched.h>, which will have to be picked up from a couple of .c files. Create a trivial placeholder <linux/sched/topology.h> file that just maps to <linux/sched.h> to make this patch obviously correct and bisectable. Include the new header in the files that are going to need it. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-27cpuidle: menu: Avoid taking spinlock for accessing QoS valuesRafael J. Wysocki
After commit 9908859acaa9 (cpuidle/menu: add per CPU PM QoS resume latency consideration) the cpuidle menu governor calls dev_pm_qos_read_value() on CPU devices to read the current resume latency QoS constraint values for them. That function takes a spinlock to prevent the device's power.qos pointer from becoming NULL during the access which is a problem for the RT patchset where spinlocks are converted into mutexes and the idle loop stops working. However, it is not even necessary for the menu governor to take that spinlock, because the power.qos pointer accessed under it cannot be modified during the access anyway. For this reason, introduce a "raw" routine for accessing device QoS resume latency constraints without locking and use it in the menu governor. Fixes: 9908859acaa9 (cpuidle/menu: add per CPU PM QoS resume latency consideration) Acked-by: Alex Shi <alex.shi@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-01-30cpuidle/menu: add per CPU PM QoS resume latency considerationAlex Shi
There may be special requirements on CPU response time, like if a interrupt is pinned to a CPU, that CPU should not go into excessively deep idle states. For this reason, add a mechanism for adding PM QoS resume latency constraints for individual CPUs and modify the menu governor to take them into account. To that end, extend the device PM QoS pm_qos_resume_latency attribute to CPUs, which is possible, because the exit latency for CPUs is effectively equivalent to the resume latency for devices. Signed-off-by: Alex Shi <alex.shi@linaro.org> Acked-by: Rik van Riel <riel@redhat.com> [ rjw : Subject & changelog ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>