summaryrefslogtreecommitdiff
path: root/drivers/cpuidle/governors
AgeCommit message (Collapse)Author
2016-03-21cpuidle: menu: Fall back to polling if next timer event is nearRafael J. Wysocki
Commit a9ceb78bc75c (cpuidle,menu: use interactivity_req to disable polling) changed the behavior of the fallback state selection part of menu_select() so it looks at interactivity_req instead of data->next_timer_us when it makes its decision. That effectively caused polling to be used more often as fallback idle which led to significant increases of energy consumption in some cases. Commit e132b9b3bc7f (cpuidle: menu: use high confidence factors only when considering polling) changed that logic again to be more predictable, but that didn't help with the increased energy consumption problem. For this reason, go back to making decisions on which state to fall back to based on data->next_timer_us which is the time we know for sure something will happen rather than a prediction (which may be inaccurate and turns out to be so often enough to be problematic). However, take the target residency of the first proper idle state (C1) into account, so that state is not used as the fallback one if its target residency is greater than data->next_timer_us. Fixes: a9ceb78bc75c (cpuidle,menu: use interactivity_req to disable polling) Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reported-and-tested-by: Doug Smythies <dsmythies@telus.net>
2016-03-17cpuidle: menu: use high confidence factors only when considering pollingRik van Riel
The menu governor uses five different factors to pick the idle state: - the user configured latency_req - the time until the next timer (next_timer_us) - the typical sleep interval, as measured recently - an estimate of sleep time by dividing next_timer_us by an observed factor - a load corrected version of the above, divided again by load Only the first three items are known with enough confidence that we can use them to consider polling, instead of an actual CPU idle state, because the cost of being wrong about polling can be excessive power use. The latter two are used in the menu governor's main selection loop, and can result in choosing a shallower idle state when the system is expected to be busy again soon. This pushes a busy system in the "performance" direction of the performance<>power tradeoff, when choosing between idle states, but stays more strictly on the "power" state when deciding between polling and C1. Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-02-17cpuidle: menu: help gcc generate slightly better codeRasmus Villemoes
We know that the avg variable actually ends up holding a 32 bit quantity, since it's an average of such numbers. It is only a u64 because it is temporarily used to hold the sum. Making it an actual u32 allows gcc to generate slightly better code, e.g. when computing the square, it can do a 32x32->64 multiply. Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-02-17cpuidle: menu: avoid expensive square root computationRasmus Villemoes
Computing the integer square root is a rather expensive operation, at least compared to doing a 64x64 -> 64 multiply (avg*avg) and, on 64 bit platforms, doing an extra comparison to a constant (variance <= U64_MAX/36). On 64 bit platforms, this does mean that we add a restriction on the range of the variance where we end up using the estimate (since previously the stddev <= ULONG_MAX was a tautology), but on the other hand, we extend the range quite substantially on 32 bit platforms - in both cases, we now allow standard deviations up to 715 seconds, which is for example guaranteed if all observations are less than 1430 seconds. Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-01-19cpuidle: menu: Avoid pointless checks in menu_select()Rafael J. Wysocki
If menu_select() cannot find a suitable state to return, it will return the state index stored in data->last_state_idx. This means that it is pointless to look at the states whose indices are less than or equal to data->last_state_idx in the main loop, so don't do that. Given that those checks are done on every idle state selection, this change can save quite a bit of completely unnecessary overhead. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Ingo Molnar <mingo@kernel.org> Tested-by: Sudeep Holla <sudeep.holla@arm.com>
2016-01-15cpuidle: Default to ladder governor on ticking systemsJean Delvare
The menu governor is currently the default on all systems. However the documentation claims that the ladder governor is preferred on ticking systems. So bump the rating of the ladder governor when NO_HZ is disabled, or when booting with nohz=off. This fixes the first half of kernel BZ #65531. Link: https://bugzilla.kernel.org/show_bug.cgi?id=65531 Signed-off-by: Jean Delvare <jdelvare@suse.de> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-01-14cpuidle: menu: Fix menu_select() for CPUIDLE_DRIVER_STATE_START == 0Rafael J. Wysocki
Commit a9ceb78bc75c (cpuidle,menu: use interactivity_req to disable polling) exposed a bug in menu_select() causing it to return -1 on systems with CPUIDLE_DRIVER_STATE_START equal to zero, although it should have returned 0. As a result, idle states are not entered by CPUs on those systems. Namely, on the systems in question data->last_state_idx is initially equal to -1 and the above commit modified the condition that would have caused it to be changed to 0 to be less likely to trigger which exposed the problem. However, setting data->last_state_idx initially to -1 doesn't make sense at all and on the affected systems it should always be set to CPUIDLE_DRIVER_STATE_START (ie. 0) unconditionally, so make that happen. Fixes: a9ceb78bc75c (cpuidle,menu: use interactivity_req to disable polling) Reported-and-tested-by: Sudeep Holla <sudeep.holla@arm.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2015-11-17cpuidle,menu: smooth out measured_us calculationRik van Riel
The cpuidle state tables contain the maximum exit latency for each cpuidle state. On x86, that is the exit latency for when the entire package goes into that same idle state. However, a lot of the time we only go into the core idle state, not the package idle state. This means we see a much smaller exit latency. We have no way to detect whether we went into the core or package idle state while idle, and that is ok. However, the current menu_update logic does have the potential to trip up the repeating pattern detection in get_typical_interval. If the system is experiencing an exit latency near the idle state's exit latency, some of the samples will have exit_us subtracted, while others will not. This turns a repeating pattern into mush, potentially breaking get_typical_interval. Furthermore, for smaller sleep intervals, we know the chance that all the cores in the package went to the same idle state are fairly small. Dividing the measured_us by two, instead of subtracting the full exit latency when hitting a small measured_us, will reduce the error. Signed-off-by: Rik van Riel <riel@redhat.com> Acked-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2015-11-17cpuidle,menu: use interactivity_req to disable pollingRik van Riel
The menu governor carefully figures out how much time we typically sleep for an estimated sleep interval, or whether there is a repeating pattern going on, and corrects that estimate for the CPU load. Then it proceeds to ignore that information when determining whether or not to consider polling. This is not a big deal on most x86 CPUs, which have very low C1 latencies, and the patch should not have any effect on those CPUs. However, certain CPUs (eg. Atom) have much higher C1 latencies, and it would be good to not waste performance and power on those CPUs if we are expecting a very low wakeup latency. Disable polling based on the estimated interactivity requirement, not on the time to the next timer interrupt. Signed-off-by: Rik van Riel <riel@redhat.com> Acked-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2015-11-17cpuidle,x86: increase forced cut-off for polling to 20usRik van Riel
The cpuidle menu governor has a forced cut-off for polling at 5us, in order to deal with firmware that gives the OS bad information on cpuidle states, leading to the system spending way too much time in polling. However, at least one x86 CPU family (Atom) has chips that have a 20us break-even point for C1. Forcing the polling cut-off to less than that wastes performance and power. Increase the polling cut-off to 20us. Systems with a lower C1 latency will be found in the states table by the menu governor, which will pick those states as appropriate. Signed-off-by: Rik van Riel <riel@redhat.com> Acked-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2015-05-04cpuidle: Check the sign of index in cpuidle_reflect()Rafael J. Wysocki
Avoid calling the governor's ->reflect method if the state index passed to cpuidle_reflect() is negative. This allows the analogous check to be dropped from menu_reflect(), so do that too, and ensures that arbitrary error codes can be passed to cpuidle_reflect() as the index with no adverse consequences. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Daniel Lezcano <daniel.lezcano@linaro.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2015-04-17cpuidle: menu: use DIV_ROUND_CLOSEST_ULL()Javi Merino
Now that the kernel provides DIV_ROUND_CLOSEST_ULL(), drop the internal implementation and use the kernel one. Signed-off-by: Javi Merino <javi.merino@arm.com> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Stephen Hemminger <shemminger@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-17cpuidle: ladder: Better idle duration measurement without using ↵Len Brown
CPUIDLE_FLAG_TIME_INVALID When the ladder governor sees the CPUIDLE_FLAG_TIME_INVALID flag, it unconditionally causes a state promotion by setting last_residency to a number higher than the state's promotion_time: last_residency = last_state->threshold.promotion_time + 1 It does this for fear that cpuidle_get_last_residency() will be in-accurate, because cpuidle_enter_state() invoked a state with CPUIDLE_FLAG_TIME_INVALID. But the only state with CPUIDLE_FLAG_TIME_INVALID is acpi_safe_halt(), which may return well after its actual idle duration because it enables interrupts, so cpuidle_enter_state() also measures interrupt service time. So what? In ladder, a huge invalid last_residency has exactly the same effect as the current code -- it unconditionally causes a state promotion. In the case where the idle residency plus measured interrupt handling time is less than the state's demotion_time -- we should use that timestamp to give ladder a chance to demote, rather than unconditionally promoting. This can be done by simply ignoring the CPUIDLE_FLAG_TIME_INVALID, and using the "invalid" time, as it is either equal to what we are doing today, or better. Signed-off-by: Len Brown <len.brown@intel.com> Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2014-12-17cpuidle: menu: Better idle duration measurement without using ↵Len Brown
CPUIDLE_FLAG_TIME_INVALID When menu sees CPUIDLE_FLAG_TIME_INVALID, it ignores its timestamps, and assumes that idle lasted as long as the time till next predicted timer expiration. But if an interrupt was seen and serviced before that duration, it would actually be more accurate to use the measured time rather than rounding up to the next predicted timer expiration. And if an interrupt is seen and serviced such that the mesured time exceeds the time till next predicted timer expiration, then truncating to that expiration is the right thing to do -- since we can never stay idle past that timer expiration. So the code can do a better job without checking for CPUIDLE_FLAG_TIME_INVALID. Signed-off-by: Len Brown <len.brown@intel.com> Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org> Reviewed-by: Tuukka Tikkanen <tuukka.tikkanen@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2014-11-12cpuidle: Invert CPUIDLE_FLAG_TIME_VALID logicDaniel Lezcano
The only place where the time is invalid is when the ACPI_CSTATE_FFH entry method is not set. Otherwise for all the drivers, the time can be correctly measured. Instead of duplicating the CPUIDLE_FLAG_TIME_VALID flag in all the drivers for all the states, just invert the logic by replacing it by the flag CPUIDLE_FLAG_TIME_INVALID, hence we can set this flag only for the acpi idle driver, remove the former flag from all the drivers and invert the logic with this flag in the different governor. Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2014-08-26drivers/cpuidle: Replace __get_cpu_var uses for address calculationChristoph Lameter
All of these are for address calculation. Replace with this_cpu_ptr(). Cc: Daniel Lezcano <daniel.lezcano@linaro.org> Cc: linux-pm@vger.kernel.org Acked-by: Rafael J. Wysocki <rjw@sisk.pl> [cpufreq changes] Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2014-08-14Merge tag 'pm+acpi-3.17-rc1-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull more ACPI and power management updates from Rafael Wysocki: "These are a couple of regression fixes, cpuidle menu governor optimizations, fixes for ACPI proccessor and battery drivers, hibernation fix to avoid problems related to the e820 memory map, fixes for a few cpufreq drivers and a new version of the suspend profiling tool analyze_suspend.py. Specifics: - Fix for an ACPI-based device hotplug regression introduced in 3.14 that causes a kernel panic to trigger when memory hot-remove is attempted with CONFIG_ACPI_HOTPLUG_MEMORY unset from Tang Chen - Fix for a cpufreq regression introduced in 3.16 that triggers a "sleeping function called from invalid context" bug in dev_pm_opp_init_cpufreq_table() from Stephen Boyd - ACPI battery driver fix for a warning message added in 3.16 that prints silly stuff sometimes from Mariusz Ceier - Hibernation fix for safer handling of mismatches in the 820 memory map between the configurations during image creation and during the subsequent restore from Chun-Yi Lee - ACPI processor driver fix to handle CPU hotplug notifications correctly during system suspend/resume from Lan Tianyu - Series of four cpuidle menu governor cleanups that also should speed it up a bit from Mel Gorman - Fixes for the speedstep-smi, integrator, cpu0 and arm_big_little cpufreq drivers from Hans Wennborg, Himangi Saraogi, Markus Pargmann and Uwe Kleine-König - Version 3.0 of the analyze_suspend.py suspend profiling tool from Todd E Brandt" * tag 'pm+acpi-3.17-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: ACPI / battery: Fix warning message in acpi_battery_get_state() PM / tools: analyze_suspend.py: update to v3.0 cpufreq: arm_big_little: fix module license spec cpufreq: speedstep-smi: fix decimal printf specifiers ACPI / hotplug: Check scan handlers in acpi_scan_hot_remove() cpufreq: OPP: Avoid sleeping while atomic cpufreq: cpu0: Do not print error message when deferring cpufreq: integrator: Use set_cpus_allowed_ptr PM / hibernate: avoid unsafe pages in e820 reserved regions ACPI / processor: Make acpi_cpu_soft_notify() process CPU FROZEN events cpuidle: menu: Lookup CPU runqueues less cpuidle: menu: Call nr_iowait_cpu less times cpuidle: menu: Use ktime_to_us instead of reinventing the wheel cpuidle: menu: Use shifts when calculating averages where possible
2014-08-06Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial Pull trivial tree changes from Jiri Kosina: "Summer edition of trivial tree updates" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (23 commits) doc: fix two typos in watchdog-api.txt irq-gic: remove file name from heading comment MAINTAINERS: Add miscdevice.h to file list for char/misc drivers. scsi: mvsas: mv_sas.c: Fix for possible null pointer dereference doc: replace "practise" with "practice" in Documentation befs: remove check for CONFIG_BEFS_RW scsi: doc: fix 'SCSI_NCR_SETUP_MASTER_PARITY' drivers/usb/phy/phy.c: remove a leading space mfd: fix comment cpuidle: fix comment doc: hpfall.c: fix missing null-terminate after strncpy call usb: doc: hotplug.txt code typos kbuild: fix comment in Makefile.modinst SH: add proper prompt to SH_MAGIC_PANEL_R2_VERSION ARM: msm: Remove MSM_SCM crypto: Remove MPILIB_EXTRA doc: CN: remove dead link, kerneltrap.org no longer works media: update reference, kerneltrap.org no longer works hexagon: update reference, kerneltrap.org no longer works doc: LSM: update reference, kerneltrap.org no longer works ...
2014-08-06cpuidle: menu: Lookup CPU runqueues lessMel Gorman
The menu governer makes separate lookups of the CPU runqueue to get load and number of IO waiters but it can be done with a single lookup. Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2014-08-06cpuidle: menu: Call nr_iowait_cpu less timesMel Gorman
menu_select() via inline functions calls nr_iowait_cpu() twice as much as necessary. Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2014-08-06cpuidle: menu: Use ktime_to_us instead of reinventing the wheelMel Gorman
The ktime_to_us implementation is slightly better than the one implemented in menu.c. Use it Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2014-08-06cpuidle: menu: Use shifts when calculating averages where possibleMel Gorman
We use do_div even though the divisor will usually be a power-of-two unless there are unusual outliers. Use shifts where possible Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2014-07-29cpuidle: ladder governor - use macro instead of hardcoded valueMohammad Merajul Islam Molla
use CPUIDLE_DRIVER_STATE_START, instead of hardcoded value 0. As, CPUIDLE_DRIVER_STATE_START can be 1/0 based on CONFIG_ARCH_HAS_CPU_RELAX is defined or not. Signed-off-by: Mohammad Merajul Islam Molla <meraj.enigma@gmail.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2014-07-28cpuidle: menu governor - remove unused macro STDDEV_THRESHMohammad Merajul Islam Molla
STDDEV_THRESH was once defined and used in menu governor. But now its no longer used anywhere. So removing the define. Signed-off-by: Mohammad Merajul Islam Molla <meraj.enigma@gmail.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2014-06-19cpuidle: fix commentAntonio Ospite
Signed-off-by: Antonio Ospite <ao2@ao2.it> Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com> Cc: "tuukka.tikkanen@linaro.org" <tuukka.tikkanen@linaro.org> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2014-05-01cpuidle / menu: move repeated correction factor check to initChander Kashyap
In menu_select function we check for correction factor every time. If it is zero we are initializing to unity. Hence move it to init function and initialise by unity, hence avoid repeated comparisons. Signed-off-by: Chander Kashyap <chander.kashyap@linaro.org> Reviewed-by: Tuukka Tikkanen <tuukka.tikkanen@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2014-05-01cpuidle / menu: Return (-1) if there are no suitable statesRafael J. Wysocki
If there is a PM QoS latency limit and all of the sufficiently shallow C-states are disabled, the cpuidle menu governor returns 0 which on some systems is CPUIDLE_DRIVER_STATE_START and shouldn't be returned if that C-state has been disabled. Fix the issue by modifying the menu governor to return (-1) in such situations. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2014-03-06cpuidle: Move perf multiplier calculation out of the selection looptuukka.tikkanen@linaro.org
The menu governor performance multiplier defines a minimum predicted idle duration to latency ratio. Instead of checking this separately in every iteration of the state selection loop, adjust the overall latency restriction for the whole loop if this restriction is tighter than what is set by the QoS subsystem. The original test s->exit_latency * multiplier > data->predicted_us becomes s->exit_latency > data->predicted_us / multiplier by dividing both sides of the comparison by "multiplier". While division is likely to be several times slower than multiplication, the minor performance hit allows making a generic sleep state selection function based on (sleep duration, maximum latency) tuple. Signed-off-by: Tuukka Tikkanen <tuukka.tikkanen@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2014-03-06cpuidle: Do not substract exit latency from assumed sleep lengthtuukka.tikkanen@linaro.org
The menu governor statistics update function tries to determine the amount of time between entry to low power state and the occurrence of the wakeup event. However, the time measured by the framework includes exit latency on top of the desired value. This exit latency is substracted from the measured value to obtain the desired value. When measured value is not available, the menu governor assumes the wakeup was caused by the timer and the time is equal to remaining timer length. No exit latency should be substracted from this value. This patch prevents the erroneous substraction and clarifies the associated comment. It also removes one intermediate variable that serves no purpose. Signed-off-by: Tuukka Tikkanen <tuukka.tikkanen@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2014-03-06cpuidle: Ensure menu coefficients stay within domaintuukka.tikkanen@linaro.org
The menu governor uses coefficients as one method of actual idle period length estimation. The coefficients are, as detailed below, multipliers giving expected idle period length from time until next timer expiry. The multipliers are supposed to have domain of (0..1]. The coefficients are fractions where only the numerators are stored and denominators are a shared constant RESOLUTION*DECAY. Since the value of the coefficient should always be greater than 0 and less than or equal to 1, the numerator must have a value greater than 0 and less than or equal to RESOLUTION*DECAY. If the coefficients are updated with measured idle durations exceeding timer length, the multiplier may reach values exceeding unity (i.e. the stored numerator exceeds RESOLUTION*DECAY). This patch ensures that the multipliers are updated with durations capped to timer length. Signed-off-by: Tuukka Tikkanen <tuukka.tikkanen@linaro.org> Acked-by: Nicolas Pitre <nico@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2014-03-06cpuidle: Use actual state latency in menu governortuukka.tikkanen@linaro.org
Currently menu governor records the exit latency of the state it has chosen for the idle period. The stored latency value is then later used to calculate the actual length of the idle period. This value may however be incorrect, as the entered state may not be the one chosen by the governor. The entered state information is available, so we can use that to obtain the real exit latency. Signed-off-by: Tuukka Tikkanen <tuukka.tikkanen@linaro.org> Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2014-03-06cpuidle: rename expected_us to next_timer_us in menu governortuukka.tikkanen@linaro.org
The field expected_us is used to store the time remaining until next timer expiry. The name is inaccurate, as we really do not expect all wakeups to be caused by timers. In addition, another field with a very similar name (predicted_us) is used to store the predicted time remaining until any wakeup source being active. This patch renames expected_us to next_timer_us in order to better reflect the contained information. Signed-off-by: Tuukka Tikkanen <tuukka.tikkanen@linaro.org> Acked-by: Nicolas Pitre <nico@linaro.org> Acked-by: Len Brown <len.brown@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2013-08-23cpuidle: Change struct menu_device field typesTuukka Tikkanen
Field predicted_us value can never exceed expected_us value, but it has a potentially larger type. As there is no need for additional 32 bits of zeroes on 32 bit plaforms, change the type of predicted_us to match the type of expected_us. Field correction_factor is used to store a value that cannot exceed the product of RESOLUTION and DECAY (default 1024*8 = 8192). The constants cannot in practice be incremented to such values, that they'd overflow unsigned int even on 32 bit systems, so the type is changed to avoid unnecessary 64 bit arithmetic on 32 bit systems. One multiplication of (now) 32 bit values needs an added cast to avoid truncation of the result and has been added. In order to avoid another multiplication from 32 bit domain to 64 bit domain, the new correction_factor calculation has been changed from new = old * (DECAY-1) / DECAY to new = old - old / DECAY, which with infinite precision would yeild exactly the same result, but now changes the direction of rounding. The impact is not significant as the maximum accumulated difference cannot exceed the value of DECAY, which is relatively small compared to product of RESOLUTION and DECAY (8 / 8192). Signed-off-by: Tuukka Tikkanen <tuukka.tikkanen@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2013-08-23cpuidle: Add a comment warning about possible overflowTuukka Tikkanen
The menu governor has a number of tunable constants that may be changed in the source. If certain combination of values are chosen, an overflow is possible when the correction_factor is being recalculated. This patch adds a warning regarding this possibility and describes the change needed for fixing the issue. The change should not be permanently enabled, as it will hurt performance when it is not needed. Signed-off-by: Tuukka Tikkanen <tuukka.tikkanen@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2013-08-23cpuidle: Fix variable domains in get_typical_interval()Tuukka Tikkanen
The menu governor uses a static function get_typical_interval() to try to detect a repeating pattern of wakeups. The previous interval durations are stored as an array of unsigned ints, but the arithmetic in the function is performed exclusively as 64 bit values, even when the value stored in a variable is known not to exceed unsigned int, which may be smaller and more efficient on some platforms. This patch changes the types of varibles used to store some intermediates, the maximum and and the cutoff threshold to unsigned ints. Average and standard deviation are still treated as 64 bit values, even when the values are known to be within the domain of unsigned int, to avoid casts to ensure correct integer promotion for arithmetic operations. Signed-off-by: Tuukka Tikkanen <tuukka.tikkanen@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2013-08-23cpuidle: Fix menu_device->intervals typeTuukka Tikkanen
Struct menu_device member intervals is declared as u32, but the value stored is (unsigned) int. The type is changed to match the value being stored. Signed-off-by: Tuukka Tikkanen <tuukka.tikkanen@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2013-08-23cpuidle: CodingStyle: Break up multiple assignments on single lineTuukka Tikkanen
The function get_typical_interval() initializes a number of variables that are immediately after declarations assigned constant values. In addition, there are multiple assignments on a single line, which is explicitly forbidden by Documentation/CodingStyle. This patch removes redundant initial values for the variables and breaks up the multiple assignment line. Signed-off-by: Tuukka Tikkanen <tuukka.tikkanen@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2013-08-23cpuidle: Check called function parameter in get_typical_interval()Tuukka Tikkanen
get_typical_interval() uses int_sqrt() in calculation of standard deviation. The formal parameter of int_sqrt() is unsigned long, which may on some platforms be smaller than the 64 bit unsigned integer used as the actual parameter. The overflow can occur frequently when actual idle period lengths are in hundreds of milliseconds. This patch adds a check for such overflow and rejects the candidate average when an overflow would occur. Signed-off-by: Tuukka Tikkanen <tuukka.tikkanen@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2013-08-23cpuidle: Rearrange code and comments in get_typical_interval()Tuukka Tikkanen
This patch rearranges a if-return-elsif-goto-fi-return sequence into if-return-fi-if-return-fi-goto sequence. The functionality remains the same. Also, a lengthy comment that did not describe the functionality in the order it occurs is split into half and top half is moved closer to actual implementation it describes. Signed-off-by: Tuukka Tikkanen <tuukka.tikkanen@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2013-08-23cpuidle: Ignore interval prediction result when timer is shorterTuukka Tikkanen
This patch prevents cpuidle menu governor from using repeating interval prediction result if the idle period predicted is longer than the one allowed by shortest running timer. Signed-off-by: Tuukka Tikkanen <tuukka.tikkanen@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2013-08-14Merge back earlier 'pm-cpuidle' material.Rafael J. Wysocki
2013-07-29Revert "cpuidle: Quickly notice prediction failure for repeat mode"Rafael J. Wysocki
Revert commit 69a37bea (cpuidle: Quickly notice prediction failure for repeat mode), because it has been identified as the source of a significant performance regression in v3.8 and later as explained by Jeremy Eder: We believe we've identified a particular commit to the cpuidle code that seems to be impacting performance of variety of workloads. The simplest way to reproduce is using netperf TCP_RR test, so we're using that, on a pair of Sandy Bridge based servers. We also have data from a large database setup where performance is also measurably/positively impacted, though that test data isn't easily share-able. Included below are test results from 3 test kernels: kernel reverts ----------------------------------------------------------- 1) vanilla upstream (no reverts) 2) perfteam2 reverts e11538d1f03914eb92af5a1a378375c05ae8520c 3) test reverts 69a37beabf1f0a6705c08e879bdd5d82ff6486c4 e11538d1f03914eb92af5a1a378375c05ae8520c In summary, netperf TCP_RR numbers improve by approximately 4% after reverting 69a37beabf1f0a6705c08e879bdd5d82ff6486c4. When 69a37beabf1f0a6705c08e879bdd5d82ff6486c4 is included, C0 residency never seems to get above 40%. Taking that patch out gets C0 near 100% quite often, and performance increases. The below data are histograms representing the %c0 residency @ 1-second sample rates (using turbostat), while under netperf test. - If you look at the first 4 histograms, you can see %c0 residency almost entirely in the 30,40% bin. - The last pair, which reverts 69a37beabf1f0a6705c08e879bdd5d82ff6486c4, shows %c0 in the 80,90,100% bins. Below each kernel name are netperf TCP_RR trans/s numbers for the particular kernel that can be disclosed publicly, comparing the 3 test kernels. We ran a 4th test with the vanilla kernel where we've also set /dev/cpu_dma_latency=0 to show overall impact boosting single-threaded TCP_RR performance over 11% above baseline. 3.10-rc2 vanilla RX + c0 lock (/dev/cpu_dma_latency=0): TCP_RR trans/s 54323.78 ----------------------------------------------------------- 3.10-rc2 vanilla RX (no reverts) TCP_RR trans/s 48192.47 Receiver %c0 0.0000 - 10.0000 [ 1]: * 10.0000 - 20.0000 [ 0]: 20.0000 - 30.0000 [ 0]: 30.0000 - 40.0000 [ 59]: *********************************************************** 40.0000 - 50.0000 [ 1]: * 50.0000 - 60.0000 [ 0]: 60.0000 - 70.0000 [ 0]: 70.0000 - 80.0000 [ 0]: 80.0000 - 90.0000 [ 0]: 90.0000 - 100.0000 [ 0]: Sender %c0 0.0000 - 10.0000 [ 1]: * 10.0000 - 20.0000 [ 0]: 20.0000 - 30.0000 [ 0]: 30.0000 - 40.0000 [ 11]: *********** 40.0000 - 50.0000 [ 49]: ************************************************* 50.0000 - 60.0000 [ 0]: 60.0000 - 70.0000 [ 0]: 70.0000 - 80.0000 [ 0]: 80.0000 - 90.0000 [ 0]: 90.0000 - 100.0000 [ 0]: ----------------------------------------------------------- 3.10-rc2 perfteam2 RX (reverts commit e11538d1f03914eb92af5a1a378375c05ae8520c) TCP_RR trans/s 49698.69 Receiver %c0 0.0000 - 10.0000 [ 1]: * 10.0000 - 20.0000 [ 1]: * 20.0000 - 30.0000 [ 0]: 30.0000 - 40.0000 [ 59]: *********************************************************** 40.0000 - 50.0000 [ 0]: 50.0000 - 60.0000 [ 0]: 60.0000 - 70.0000 [ 0]: 70.0000 - 80.0000 [ 0]: 80.0000 - 90.0000 [ 0]: 90.0000 - 100.0000 [ 0]: Sender %c0 0.0000 - 10.0000 [ 1]: * 10.0000 - 20.0000 [ 0]: 20.0000 - 30.0000 [ 0]: 30.0000 - 40.0000 [ 2]: ** 40.0000 - 50.0000 [ 58]: ********************************************************** 50.0000 - 60.0000 [ 0]: 60.0000 - 70.0000 [ 0]: 70.0000 - 80.0000 [ 0]: 80.0000 - 90.0000 [ 0]: 90.0000 - 100.0000 [ 0]: ----------------------------------------------------------- 3.10-rc2 test RX (reverts 69a37beabf1f0a6705c08e879bdd5d82ff6486c4 and e11538d1f03914eb92af5a1a378375c05ae8520c) TCP_RR trans/s 47766.95 Receiver %c0 0.0000 - 10.0000 [ 1]: * 10.0000 - 20.0000 [ 1]: * 20.0000 - 30.0000 [ 0]: 30.0000 - 40.0000 [ 27]: *************************** 40.0000 - 50.0000 [ 2]: ** 50.0000 - 60.0000 [ 0]: 60.0000 - 70.0000 [ 2]: ** 70.0000 - 80.0000 [ 0]: 80.0000 - 90.0000 [ 0]: 90.0000 - 100.0000 [ 28]: **************************** Sender: 0.0000 - 10.0000 [ 1]: * 10.0000 - 20.0000 [ 0]: 20.0000 - 30.0000 [ 0]: 30.0000 - 40.0000 [ 11]: *********** 40.0000 - 50.0000 [ 0]: 50.0000 - 60.0000 [ 1]: * 60.0000 - 70.0000 [ 0]: 70.0000 - 80.0000 [ 3]: *** 80.0000 - 90.0000 [ 7]: ******* 90.0000 - 100.0000 [ 38]: ************************************** These results demonstrate gaining back the tendency of the CPU to stay in more responsive, performant C-states (and thus yield measurably better performance), by reverting commit 69a37beabf1f0a6705c08e879bdd5d82ff6486c4. Requested-by: Jeremy Eder <jeder@redhat.com> Tested-by: Len Brown <len.brown@intel.com> Cc: 3.8+ <stable@vger.kernel.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2013-07-29Revert "cpuidle: Quickly notice prediction failure in general case"Rafael J. Wysocki
Revert commit e11538d1 (cpuidle: Quickly notice prediction failure in general case), since it depends on commit 69a37be (cpuidle: Quickly notice prediction failure for repeat mode) that has been identified as the source of a significant performance regression in v3.8 and later. Requested-by: Jeremy Eder <jeder@redhat.com> Tested-by: Len Brown <len.brown@intel.com> Cc: 3.8+ <stable@vger.kernel.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2013-07-15cpuidle: Make it clear that governors cannot be modulesDaniel Lezcano
cpufreq governors are defined as modules in the code, but the Kconfig options do not allow them to be built as modules. This is not really a problem, but the cpuidle init ordering is: the cpuidle init functions (framework and driver) and then the governors. That leads to some weirdness in the cpuidle framework. Namely, cpuidle_register_device() calls cpuidle_enable_device() which fails at the first attempt, because governors have not been registered yet. When a governor is registered, the framework calls cpuidle_enable_device() again which runs __cpuidle_register_device() only then. Of course, for that to work, the cpuidle_enable_device() return value has to be ignored by cpuidle_register_device(). Instead of having this cyclic call graph and relying on a positive side effects of the hackish back and forth cpuidle_enable_device() calls it is better to fix the cpuidle init ordering. To that end, replace the module init code with postcore_initcall() so we have: * cpuidle framework : core_initcall * cpuidle governors : postcore_initcall * cpuidle drivers : device_initcall and remove the corresponding module exit code as it is dead anyway (governors can't be built as modules). [rjw: Changelog] Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2013-01-15cpuidle: remove the power_specified field in the driverDaniel Lezcano
We realized that the power usage field is never filled and when it is filled for tegra, the power_specified flag is not set causing all of these values to be reset when the driver is initialized with set_power_state(). However, the power_specified flag can be simply removed under the assumption that the states are always backward sorted, which is the case with the current code. This change allows the menu governor select function and the cpuidle_play_dead() to be simplified. Moreover, the set_power_states() function can removed as it does not make sense any more. Drop the power_specified flag from struct cpuidle_driver and make the related changes as described above. As a consequence, this also fixes the bug where on the dynamic C-states system, the power fields are not initialized. [rjw: Changelog] References: https://bugzilla.kernel.org/show_bug.cgi?id=42870 References: https://bugzilla.kernel.org/show_bug.cgi?id=43349 References: https://lkml.org/lkml/2012/10/16/518 Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2013-01-03cpuidle: Fix finding state with min power_usageSivaram Nair
Since cpuidle_state.power_usage is a signed value, use INT_MAX (instead of -1) to init the local copies so that functions that tries to find cpuidle states with minimum power usage works correctly even if they use non-negative values. Signed-off-by: Sivaram Nair <sivaramn@nvidia.com> Reviewed-by: Rik van Riel <riel@redhat.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2012-11-23cpuidle: fix a suspicious RCU usage in menu governorLi Zhong
I saw this suspicious RCU usage on the next tree of 11/15 [ 67.123404] =============================== [ 67.123413] [ INFO: suspicious RCU usage. ] [ 67.123423] 3.7.0-rc5-next-20121115-dirty #1 Not tainted [ 67.123434] ------------------------------- [ 67.123444] include/trace/events/timer.h:186 suspicious rcu_dereference_check() usage! [ 67.123458] [ 67.123458] other info that might help us debug this: [ 67.123458] [ 67.123474] [ 67.123474] RCU used illegally from idle CPU! [ 67.123474] rcu_scheduler_active = 1, debug_locks = 0 [ 67.123493] RCU used illegally from extended quiescent state! [ 67.123507] 1 lock held by swapper/1/0: [ 67.123516] #0: (&cpu_base->lock){-.-...}, at: [<c0000000000979b0>] .__hrtimer_start_range_ns+0x28c/0x524 [ 67.123555] [ 67.123555] stack backtrace: [ 67.123566] Call Trace: [ 67.123576] [c0000001e2ccb920] [c00000000001275c] .show_stack+0x78/0x184 (unreliable) [ 67.123599] [c0000001e2ccb9d0] [c0000000000c15a0] .lockdep_rcu_suspicious+0x120/0x148 [ 67.123619] [c0000001e2ccba70] [c00000000009601c] .enqueue_hrtimer+0x1c0/0x1c8 [ 67.123639] [c0000001e2ccbb00] [c000000000097aa0] .__hrtimer_start_range_ns+0x37c/0x524 [ 67.123660] [c0000001e2ccbc20] [c0000000005c9698] .menu_select+0x508/0x5bc [ 67.123678] [c0000001e2ccbd20] [c0000000005c740c] .cpuidle_idle_call+0xa8/0x6e4 [ 67.123699] [c0000001e2ccbdd0] [c0000000000459a0] .pSeries_idle+0x10/0x34 [ 67.123717] [c0000001e2ccbe40] [c000000000014dc8] .cpu_idle+0x130/0x280 [ 67.123738] [c0000001e2ccbee0] [c0000000006ffa8c] .start_secondary+0x378/0x384 [ 67.123758] [c0000001e2ccbf90] [c00000000000936c] .start_secondary_prolog+0x10/0x14 hrtimer_start was added in 198fd638 and ae515197. The patch below tries to use RCU_NONIDLE around it to avoid the above report. Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Rik van Riel <riel@redhat.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2012-11-15cpuidle: Get typical recent sleep intervalYouquan Song
The function detect_repeating_patterns was not very useful for workloads with alternating long and short pauses, for example virtual machines handling network requests for each other (say a web and database server). Instead, try to find a recent sleep interval that is somewhere between the median and the mode sleep time, by discarding outliers to the up side and recalculating the average and standard deviation until that is no longer required. This should do something sane with a sleep interval series like: 200 180 210 10000 30 1000 170 200 The current code would simply discard such a series, while the new code will guess a typical sleep interval just shy of 200. The original patch come from Rik van Riel <riel@redhat.com>. Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Youquan Song <youquan.song@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2012-11-15cpuidle: Quickly notice prediction failure in general caseYouquan Song
The prediction for future is difficult and when the cpuidle governor prediction fails and govenor possibly choose the shallower C-state than it should. How to quickly notice and find the failure becomes important for power saving. The patch extends to general case that prediction logic get a small predicted residency, so it choose a shallow C-state though the expected residency is large . Once the prediction will be fail, the CPU will keep staying at shallow C-state for a long time. Acutally, the CPU has change enter into deep C-state. So when the expected residency is long enough but governor choose a shallow C-state, an timer will be added in order to monitor if the prediction failure. When C-state is waken up prior to the adding timer, the timer will be cancelled initiatively. When the timer is triggered and menu governor will quickly notice prediction failure and re-evaluates deeper C-states possibility. Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Youquan Song <youquan.song@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2012-11-15cpuidle: Quickly notice prediction failure for repeat modeYouquan Song
The prediction for future is difficult and when the cpuidle governor prediction fails and govenor possibly choose the shallower C-state than it should. How to quickly notice and find the failure becomes important for power saving. cpuidle menu governor has a method to predict the repeat pattern if there are 8 C-states residency which are continuous and the same or very close, so it will predict the next C-states residency will keep same residency time. There is a real case that turbostat utility (tools/power/x86/turbostat) at kernel 3.3 or early. turbostat utility will read 10 registers one by one at Sandybridge, so it will generate 10 IPIs to wake up idle CPUs. So cpuidle menu governor will predict it is repeat mode and there is another IPI wake up idle CPU soon, so it keeps idle CPU stay at C1 state even though CPU is totally idle. However, in the turbostat, following 10 registers reading is sleep 5 seconds by default, so the idle CPU will keep at C1 for a long time though it is idle until break event occurs. In a idle Sandybridge system, run "./turbostat -v", we will notice that deep C-state dangles between "70% ~ 99%". After patched the kernel, we will notice deep C-state stays at >99.98%. In the patch, a timer is added when menu governor detects a repeat mode and choose a shallow C-state. The timer is set to a time out value that greater than predicted time, and we conclude repeat mode prediction failure if timer is triggered. When repeat mode happens as expected, the timer is not triggered and CPU waken up from C-states and it will cancel the timer initiatively. When repeat mode does not happen, the timer will be time out and menu governor will quickly notice that the repeat mode prediction fails and then re-evaluates deeper C-states possibility. Below is another case which will clearly show the patch much benefit: #include <stdlib.h> #include <stdio.h> #include <unistd.h> #include <signal.h> #include <sys/time.h> #include <time.h> #include <pthread.h> volatile int * shutdown; volatile long * count; int delay = 20; int loop = 8; void usage(void) { fprintf(stderr, "Usage: idle_predict [options]\n" " --help -h Print this help\n" " --thread -n Thread number\n" " --loop -l Loop times in shallow Cstate\n" " --delay -t Sleep time (uS)in shallow Cstate\n"); } void *simple_loop() { int idle_num = 1; while (!(*shutdown)) { *count = *count + 1; if (idle_num % loop) usleep(delay); else { /* sleep 1 second */ usleep(1000000); idle_num = 0; } idle_num++; } } static void sighand(int sig) { *shutdown = 1; } int main(int argc, char *argv[]) { sigset_t sigset; int signum = SIGALRM; int i, c, er = 0, thread_num = 8; pthread_t pt[1024]; static char optstr[] = "n:l:t:h:"; while ((c = getopt(argc, argv, optstr)) != EOF) switch (c) { case 'n': thread_num = atoi(optarg); break; case 'l': loop = atoi(optarg); break; case 't': delay = atoi(optarg); break; case 'h': default: usage(); exit(1); } printf("thread=%d,loop=%d,delay=%d\n",thread_num,loop,delay); count = malloc(sizeof(long)); shutdown = malloc(sizeof(int)); *count = 0; *shutdown = 0; sigemptyset(&sigset); sigaddset(&sigset, signum); sigprocmask (SIG_BLOCK, &sigset, NULL); signal(SIGINT, sighand); signal(SIGTERM, sighand); for(i = 0; i < thread_num ; i++) pthread_create(&pt[i], NULL, simple_loop, NULL); for (i = 0; i < thread_num; i++) pthread_join(pt[i], NULL); exit(0); } Get powertop V2 from git://github.com/fenrus75/powertop, build powertop. After build the above test application, then run it. Test plaform can be Intel Sandybridge or other recent platforms. #./idle_predict -l 10 & #./powertop We will find that deep C-state will dangle between 40%~100% and much time spent on C1 state. It is because menu governor wrongly predict that repeat mode is kept, so it will choose the C1 shallow C-state even though it has chance to sleep 1 second in deep C-state. While after patched the kernel, we find that deep C-state will keep >99.6%. Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Youquan Song <youquan.song@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>