path: root/drivers/cpufreq/intel_pstate.c
Age  Commit message  Author
2017-04-17  cpufreq: schedutil: Use policy-dependent transition delays  Rafael J. Wysocki
Make the schedutil governor take the initial (default) value of the rate_limit_us sysfs attribute from the (new) transition_delay_us policy parameter (to be set by the scaling driver). That will allow scaling drivers to make schedutil use smaller default values of rate_limit_us and reduce the default average time interval between consecutive frequency changes. Make intel_pstate set transition_delay_us to 500. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
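The way a governor could pick its default rate limit from the policy parameter described above can be sketched roughly as follows. This is a minimal illustration, not the schedutil code: the struct, the function name and the latency-based fallback (including the multiplier value) are assumptions made for the example. Only the idea that a nonzero transition_delay_us supplied by the scaling driver becomes the default rate_limit_us comes from the commit message.

```c
#include <stdint.h>

/* Hypothetical, simplified stand-in for the relevant cpufreq policy fields. */
struct policy_sketch {
	uint32_t transition_delay_us;   /* set by the scaling driver, 0 if unset */
	uint32_t transition_latency_ns; /* hardware transition latency */
};

/* Assumed fallback scale for drivers that do not set a delay. */
#define FALLBACK_MULTIPLIER 1000u

/* Return the default rate_limit_us for a policy. */
static uint32_t default_rate_limit_us(const struct policy_sketch *p)
{
	uint32_t lat_us;

	/* A driver-provided delay takes precedence (intel_pstate sets 500). */
	if (p->transition_delay_us)
		return p->transition_delay_us;

	/* Assumed fallback: scale the transition latency, if it is known. */
	lat_us = p->transition_latency_ns / 1000u;
	return lat_us ? FALLBACK_MULTIPLIER * lat_us : FALLBACK_MULTIPLIER;
}
```

With the driver-provided value of 500 the fallback path is never taken, which is how a scaling driver would obtain a smaller default interval between frequency changes.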
2017-03-29  cpufreq: intel_pstate: Add support for Gemini Lake  Box, David E
Use same parameters as INTEL_FAM6_ATOM_GOLDMONT to enable Gemini Lake. Signed-off-by: Box, David E <david.e.box@intel.com> Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-03-28  cpufreq: intel_pstate: Eliminate intel_pstate_get_min_max()  Rafael J. Wysocki
Some computations in intel_pstate_get_min_max() are not necessary and one of its two callers doesn't even use the full result. First off, the fixed-point value of cpu->max_perf represents a non-negative number between 0 and 1 inclusive and cpu->min_perf cannot be greater than cpu->max_perf. It is not necessary to check those conditions every time the numbers in question are used. Moreover, since intel_pstate_max_within_limits() only needs the upper boundary, it doesn't make sense to compute the lower one in there, and returning min and max from intel_pstate_get_min_max() via pointers doesn't look particularly nice. For the above reasons, drop intel_pstate_get_min_max(), add a helper to get the base P-state for min/max computations and carry them out directly in the previous callers of intel_pstate_get_min_max(). Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-03-28  cpufreq: intel_pstate: Do not walk policy->cpus  Rafael J. Wysocki
intel_pstate_hwp_set() is the only function walking policy->cpus in intel_pstate. The rest of the code simply assumes one CPU per policy, including the initialization code. Therefore it doesn't make sense for intel_pstate_hwp_set() to walk policy->cpus as it is guaranteed to have only one bit set for policy->cpu. For this reason, rearrange intel_pstate_hwp_set() to take the CPU number as the argument and drop the loop over policy->cpus from it. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-03-28  cpufreq: intel_pstate: Introduce pid_in_use()  Rafael J. Wysocki
Add a new function pid_in_use() to return the information on whether or not the PID-based P-state selection algorithm is in use. That allows a couple of complicated conditions in the code to be reduced to simple checks against the new function's return value. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-03-28  cpufreq: intel_pstate: Drop struct cpu_defaults  Rafael J. Wysocki
The cpu_defaults structure is redundant, because it only contains one member of type struct pstate_funcs which can be used directly instead of struct cpu_defaults. For this reason, drop struct cpu_defaults, use struct pstate_funcs directly instead of it where applicable and rename all of the variables of that type accordingly. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-03-28  cpufreq: intel_pstate: Move cpu_defaults definitions  Rafael J. Wysocki
Move the definitions of the cpu_defaults structures after the definitions of utilization update callback routines to avoid extra declarations of the latter. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-03-28  cpufreq: intel_pstate: Add update_util callback to pstate_funcs  Rafael J. Wysocki
Avoid using extra function pointers during P-state selection by dropping the get_target_pstate member from struct pstate_funcs, adding a new update_util callback to it (to be registered with the CPU scheduler as the utilization update callback in the active mode) and reworking the utilization update callback routines to invoke specific P-state selection functions directly. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-03-28  cpufreq: intel_pstate: Use different utilization update callbacks  Rafael J. Wysocki
Notice that some overhead in the utilization update callbacks registered by intel_pstate in the active mode can be avoided if those callbacks are tailored to specific configurations of the driver. For example, the utilization update callback for the HWP enabled case only needs to update the average CPU performance periodically whereas the utilization update callback for the PID-based algorithm does not need to take IO-wait boosting into account and so on. With that in mind, define three utilization update callbacks for three different use cases: HWP enabled, the CPU load "powersave" P-state selection algorithm and the PID-based "powersave" P-state selection algorithm and modify the driver initialization to choose the callback matching its current configuration. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
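The idea of choosing one tailored callback at initialization time, rather than branching on the configuration in every update, can be sketched as below. All names here are hypothetical and the callback bodies only count the kinds of work each variant would do; the actual driver callbacks are not reproduced.

```c
#include <stddef.h>

/* Hypothetical per-CPU bookkeeping used only to make the sketch observable. */
struct cpu_sketch {
	int avg_perf_updates;    /* average performance refreshes */
	int iowait_boost_checks; /* IO-wait boosting evaluations */
	int pid_steps;           /* PID controller iterations */
};

typedef void (*update_util_fn)(struct cpu_sketch *cpu);

/* HWP enabled: only the average performance needs periodic updates. */
static void update_util_hwp(struct cpu_sketch *cpu)
{
	cpu->avg_perf_updates++;
}

/* CPU load "powersave" algorithm: also accounts for IO-wait boosting. */
static void update_util_load(struct cpu_sketch *cpu)
{
	cpu->avg_perf_updates++;
	cpu->iowait_boost_checks++;
}

/* PID-based "powersave" algorithm: no IO-wait boosting involved. */
static void update_util_pid(struct cpu_sketch *cpu)
{
	cpu->avg_perf_updates++;
	cpu->pid_steps++;
}

/* Pick the callback once, based on the (assumed) configuration flags. */
static update_util_fn choose_update_util(int hwp_active, int load_based)
{
	if (hwp_active)
		return update_util_hwp;
	return load_based ? update_util_load : update_util_pid;
}
```

The selection cost is paid once at initialization, so the per-update path carries no configuration checks.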
2017-03-28  cpufreq: intel_pstate: Modify check in intel_pstate_update_status()  Rafael J. Wysocki
One of the checks in intel_pstate_update_status() implicitly relies on the information that there are only two struct cpufreq_driver objects available, but it is better to do it directly against the value it really is about (to make the code easier to follow if nothing else). Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-03-28  cpufreq: intel_pstate: Drop driver_registered variable  Rafael J. Wysocki
The driver_registered variable in intel_pstate is used for checking whether or not the driver has been registered, but intel_pstate_driver can be used for that too (with the rule that the driver is not registered as long as it is NULL). That is a bit more straightforward and the code may be simplified a bit this way, so modify the driver accordingly. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-03-28  cpufreq: intel_pstate: Skip unnecessary PID resets on init  Rafael J. Wysocki
PID controller parameters only need to be initialized if the get_target_pstate_use_performance() P-state selection routine is going to be used. It is not necessary to initialize them otherwise, so don't do that. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-03-28  cpufreq: intel_pstate: Set HWP sampling interval once  Rafael J. Wysocki
In the HWP enabled case pid_params.sample_rate_ns only needs to be updated once, because it is global, so do that when setting hwp_active instead of doing it during the initialization of every CPU. Moreover, pid_params.sample_rate_ms is never used if HWP is enabled, so do not update it at all then. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-03-28  cpufreq: intel_pstate: Clean up intel_pstate_busy_pid_reset()  Rafael J. Wysocki
intel_pstate_busy_pid_reset() is the only caller of pid_reset(), pid_p_gain_set(), pid_i_gain_set(), and pid_d_gain_set(). Moreover, it passes constants as two parameters of pid_reset() and all of the other routines above essentially contain the same code, so fold all of them into the caller and drop unnecessary computations. Introduce percent_fp() for converting integer values in percent to fixed-point fractions and use it in the above code cleanup. Finally, rename intel_pstate_busy_pid_reset() to intel_pstate_pid_reset() as it also is used for the initialization of PID parameters for every CPU and the meaning of the "busy" part of the name is not particularly clear. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-03-28  cpufreq: intel_pstate: Fold intel_pstate_reset_all_pid() into the caller  Rafael J. Wysocki
There is only one caller of intel_pstate_reset_all_pid(), which is pid_param_set() used in the debugfs interface only, and having that code split does not make it particularly convenient to follow. For this reason, move the body of intel_pstate_reset_all_pid() into its caller and drop that function. Also change the loop from for_each_online_cpu() (which is obviously racy with respect to CPU offline/online) to for_each_possible_cpu(), so that all PID parameters are reset for all CPUs regardless of their online/offline status (to prevent, for example, a previously offline CPU from going online with a stale set of PID parameters). Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-03-28  cpufreq: intel_pstate: Initialize pid_params statically  Rafael J. Wysocki
Notice that both the existing struct cpu_defaults instances in which PID parameters are actually initialized use the same values of those parameters, so it is not really necessary to copy them over to pid_params dynamically. Instead, initialize pid_params statically with those values and drop the unused pid_policy member from struct cpu_defaults along with copy_pid_params() used for initializing it. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-03-28  cpufreq: intel_pstate: Drop pointless initialization of PID parameters  Rafael J. Wysocki
The P-state selection algorithm used by intel_pstate for Atom processors is not based on the PID controller and the initialization of PID parameters for those processors is pointless and confusing, so drop it. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-03-28  cpufreq: intel_pstate: Eliminate struct perf_limits  Rafael J. Wysocki
After recent changes the purpose of struct perf_limits is not particularly clear any more and the code may be made somewhat easier to follow by eliminating it, so go for that. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-03-24  cpufreq: intel_pstate: Avoid transient updates of cpuinfo.max_freq  Rafael J. Wysocki
Both intel_pstate_verify_policy() and intel_cpufreq_verify_policy() set policy->cpuinfo.max_freq depending on the turbo status, but the updates made by them are discarded by the core, because the policy object passed to them by the core is temporary and cpuinfo.max_freq from that object is not copied to the final policy object in cpufreq_set_policy(). However, cpufreq_set_policy() passes the temporary policy object to the ->setpolicy callback of the driver, so intel_pstate_set_policy() actually sees the policy->cpuinfo.max_freq value updated by intel_pstate_verify_policy() and not the final one. It also updates policy->max sometimes, which basically has no effect after it returns, because the core discards that update. To avoid confusion, eliminate policy->cpuinfo.max_freq updates from intel_pstate_verify_policy() and intel_cpufreq_verify_policy() entirely and check the maximum frequency explicitly in intel_pstate_update_perf_limits() instead of relying on the transiently updated policy->cpuinfo.max_freq value. Moreover, move the policy->max adjustment carried out in intel_pstate_set_policy() to a separate function and call that function from the ->verify driver callbacks to ensure that it will actually be effective. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-03-24  cpufreq: intel_pstate: Active mode P-state limits rework  Rafael J. Wysocki
The coordination of P-state limits used by intel_pstate in the active mode (ie. by default) is problematic, because it synchronizes all of the limits (ie. the global ones and the per-policy ones) so as to use one common pair of P-state limits (min and max) across all CPUs in the system. The drawbacks of that are as follows:

 - If P-states are coordinated in hardware, it is not necessary to coordinate them in software on top of that, so in that case all of the above activity is in vain.

 - If P-states are not coordinated in hardware, then the processor is actually capable of setting different P-states for different CPUs and coordinating them at the software level simply doesn't allow that capability to be utilized.

 - The coordination works in such a way that setting a per-policy limit (eg. scaling_max_freq) for one CPU causes the common effective limit to change (and it will affect all of the other CPUs too), but subsequent reads from the corresponding sysfs attributes for the other CPUs will return stale values (which is confusing).

 - Reads from the global P-state limit attributes, min_perf_pct and max_perf_pct, return the effective common values and not the last values set through these attributes. However, the last values set through these attributes become hard limits that cannot be exceeded by writes to scaling_min_freq and scaling_max_freq, respectively, and they are not exposed, so essentially users have to remember what they are.

All of that is painful enough to warrant a change of the management of P-state limits in the active mode. To that end, redesign the active mode P-state limits management in intel_pstate in accordance with the following rules:

 (1) All CPUs are affected by the global limits (that is, none of them can be requested to run faster than the global max and none of them can be requested to run slower than the global min).

 (2) Each individual CPU is affected by its own per-policy limits (that is, it cannot be requested to run faster than its own per-policy max and it cannot be requested to run slower than its own per-policy min).

 (3) The global and per-policy limits can be set independently.

Also, the global maximum and minimum P-state limits will be always expressed as percentages of the maximum supported turbo P-state.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
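Rules (1) to (3) amount to intersecting two independent ranges for each CPU. The sketch below illustrates that under stated assumptions: the type and function names are hypothetical, limits are expressed as plain P-state numbers, and the integer rounding of the percentages is a choice made for this example rather than the driver's.

```c
/* Hypothetical helper type; names are illustrative, not taken from the driver. */
struct pstate_range {
	int min;
	int max;
};

static int min_int(int a, int b) { return a < b ? a : b; }
static int max_int(int a, int b) { return a > b ? a : b; }

/*
 * Combine the global limits (percentages of the maximum turbo P-state)
 * with one CPU's per-policy limits (P-state numbers).
 */
static struct pstate_range effective_limits(int turbo_pstate,
					    int global_min_pct,
					    int global_max_pct,
					    struct pstate_range policy)
{
	struct pstate_range eff;
	int global_min = turbo_pstate * global_min_pct / 100;
	int global_max = turbo_pstate * global_max_pct / 100;

	/* Never faster than either maximum, never slower than either minimum. */
	eff.max = min_int(policy.max, global_max);
	eff.min = max_int(policy.min, global_min);

	/* Assumed tie-break: the maximum wins if the two ranges do not overlap. */
	if (eff.min > eff.max)
		eff.min = eff.max;
	return eff;
}
```

Because the inputs are kept separate and only combined on use, writing one CPU's per-policy limit changes nothing for the other CPUs, and the global attributes keep reporting the values last written to them.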
2017-03-24  cpufreq: intel_pstate: Use load-based P-state selection more widely  Rafael J. Wysocki
Extend the set of systems for which intel_pstate will use the "powersave" P-state selection algorithm based on CPU load in the active mode by systems with ACPI preferred profile set to "tablet", "appliance PC", "desktop", or "workstation" (ie. everything with a specified preferred profile that is not a "server"). Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
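The selection rule described above (every specified preferred profile that is not a server) can be sketched as a simple switch. The enum below is a hypothetical stand-in whose numeric values are assumed to follow the ACPI FADT preferred PM profile field; the "mobile" case is assumed to have been covered already before this change, since the message only lists the newly added profiles.

```c
#include <stdbool.h>

/* Hypothetical enum; values assumed to match the ACPI FADT preferred profile. */
enum pm_profile_sketch {
	PROFILE_UNSPECIFIED = 0,
	PROFILE_DESKTOP = 1,
	PROFILE_MOBILE = 2,
	PROFILE_WORKSTATION = 3,
	PROFILE_ENTERPRISE_SERVER = 4,
	PROFILE_SOHO_SERVER = 5,
	PROFILE_APPLIANCE_PC = 6,
	PROFILE_PERFORMANCE_SERVER = 7,
	PROFILE_TABLET = 8,
};

/* Return true if the load-based "powersave" algorithm would be chosen. */
static bool use_load_based_selection(enum pm_profile_sketch profile)
{
	switch (profile) {
	case PROFILE_MOBILE:       /* assumed to be covered before this change */
	case PROFILE_TABLET:
	case PROFILE_APPLIANCE_PC:
	case PROFILE_DESKTOP:
	case PROFILE_WORKSTATION:
		return true;
	default:                   /* servers and unspecified profiles */
		return false;
	}
}
```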
2017-03-24  cpufreq: intel_pstate: Support HWP processors in all operation modes  Rafael J. Wysocki
Currently, some processors supporting HWP are only supported by intel_pstate if HWP is actually going to be used and not supported otherwise, which is confusing. Specifically, they are not supported if "intel_pstate=no_hwp" is passed to the kernel in the command line or if the driver is started in the passive mode ("intel_pstate=passive"). There is no real reason for that, because everything about those processors is known anyway and the driver can work with them in all modes, so make that happen, but use the load-based P-state selection algorithm for the active mode "powersave" policy with them. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-03-24  Merge back intel_pstate updates for 4.12.  Rafael J. Wysocki
2017-03-21  cpufreq: intel_pstate: Fix policy data management in passive mode  Rafael J. Wysocki
The policy->cpuinfo.max_freq and policy->max updates in intel_cpufreq_turbo_update() are excessive as they are done for no good reason and may lead to problems in principle, so they should be dropped. However, after dropping them intel_cpufreq_turbo_update() becomes almost entirely pointless, because the check made by it is made again down the road in intel_pstate_prepare_request(). The only thing in it that still needs to be done is the call to update_turbo_state(), so drop intel_cpufreq_turbo_update() altogether and make its callers invoke update_turbo_state() directly instead of it. In addition to that, fix intel_cpufreq_verify_policy() so that it checks global.no_turbo in addition to global.turbo_disabled when updating policy->cpuinfo.max_freq to make it consistent with intel_pstate_verify_policy(). Fixes: 001c76f05b01 (cpufreq: intel_pstate: Generic governors support) Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-03-18  cpufreq: intel_pstate: One set of global limits in active mode  Rafael J. Wysocki
In the active mode intel_pstate currently uses two sets of global limits, each associated with one of the possible scaling_governor settings in that mode: "powersave" or "performance". The driver switches over from one of those sets to the other depending on the scaling_governor setting for the CPU whose per-policy cpufreq interface in sysfs was last used to change parameters exposed in there. That obviously leads to no end of issues when the scaling_governor settings differ between CPUs.

The most recent issue was introduced by commit a240c4aa5d0f (cpufreq: intel_pstate: Do not reinit performance limits in ->setpolicy) that eliminated the reinitialization of "performance" limits in intel_pstate_set_policy() preventing the max limit from being set to anything below 100, among other things.

Namely, an undesirable side effect of commit a240c4aa5d0f is that now, after setting scaling_governor to "performance" in the active mode, the per-policy limits for the CPU in question go to the highest level and stay there even when it is switched back to "powersave" later. As it turns out, some distributions set scaling_governor to "performance" temporarily for all CPUs to speed up system initialization, so that change causes them to misbehave later.

To fix that, get rid of the performance/powersave global limits split and use just one set of global limits for everything.

From the user's perspective, after this modification, when scaling_governor is switched from "performance" to "powersave" or the other way around on one CPU, the limits settings (ie. the global max/min_perf_pct and per-policy scaling_max/min_freq for any CPUs) will not change. Still, switching from "performance" to "powersave" or the other way around changes the way in which P-states are selected and in particular "performance" causes the driver to always request the highest P-state it is allowed to ask for for the given CPU.

Fixes: a240c4aa5d0f (cpufreq: intel_pstate: Do not reinit performance limits in ->setpolicy)
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-03-15  cpufreq: intel_pstate: Avoid percentages in limits-related computations  Rafael J. Wysocki
Currently, intel_pstate_update_perf_limits() first converts the policy minimum and maximum limits into percentages of the maximum turbo frequency (rounding up to an integer) and then converts these percentages to fractions (by using fixed-point arithmetic to divide them by 100). That introduces a rounding error unnecessarily, because the fractions can be obtained by carrying out fixed-point divisions directly on the input numbers. Rework the computations in intel_pstate_hwp_set() to use fractions instead of percentages (and drop redundant local variables from there) and modify intel_pstate_update_perf_limits() to compute the fractions directly and percentages out of them. While at it, introduce percent_ext_fp() for converting percentages to fractions (with extended number of fraction bits) and use it in the computations. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
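The rounding error mentioned above can be seen in a small fixed-point sketch. The helper names mirror the ones in the message, but the bodies, the 14-bit fraction width and the frequencies (2 GHz out of a 3.2 GHz maximum, in kHz) are assumptions chosen for illustration, not the driver code.

```c
#include <stdint.h>

/* Assumed number of fraction bits for the "extended" fixed-point format. */
#define EXT_FRAC_BITS 14

/* Fixed-point division: returns x / y scaled by 2^EXT_FRAC_BITS. */
static int64_t div_ext_fp(int64_t x, int64_t y)
{
	return (x << EXT_FRAC_BITS) / y;
}

/* Convert an integer percentage into a fixed-point fraction. */
static int64_t percent_ext_fp(int64_t percent)
{
	return div_ext_fp(percent, 100);
}

/* Old approach: round up to a whole percentage first, then convert. */
static int64_t frac_via_percent(int64_t freq, int64_t max_freq)
{
	int64_t pct = (freq * 100 + max_freq - 1) / max_freq; /* round up */

	return percent_ext_fp(pct);
}

/* New approach: divide the input numbers directly in fixed point. */
static int64_t frac_direct(int64_t freq, int64_t max_freq)
{
	return div_ext_fp(freq, max_freq);
}
```

For 2000000 kHz out of 3200000 kHz the direct division yields 10240, which is exactly 0.625 in this format, whereas going through the rounded-up 63 percent yields 10321, roughly 0.63.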
2017-03-14  cpufreq: intel_pstate: Correct frequency setting in the HWP mode  Srinivas Pandruvada
In the function intel_pstate_hwp_set(), the min/max range from the HWP capability MSR, along with max_perf_pct and min_perf_pct, is used to set the HWP request MSR. In some cases this doesn't result in the correct HWP max/min in the HWP request.

For example, in the following case:

 HWP capabilities from MSR 0x771
 0x70a1220

Here the cpufreq min/max frequencies from the above MSR dump are 700MHz and 3.2GHz respectively. This will result in

 hwp_min = 0x07
 hwp_max = 0x20

To limit max frequency to 2GHz:

 perf_limits->max_perf_pct = 63 (2GHz as a percent of 3.2GHz rounded up)

With the current calculation:

 adj_range = max_perf_pct * range / 100;
 adj_range = 63 * (32 - 7) / 100
 adj_range = 15
 max = hw_min + adj_range;
 max = 7 + 15 = 22

This will result in HWP request of 0x160f, which will result in a frequency cap of 2.2GHz not 2GHz.

The problem with the above calculation is that hwp_min of 7 is treated as 0% in the range. But max_perf_pct is calculated with respect to minimum as 0 and max as 3.2GHz or hwp_max, so adding hwp_min to it will result in more than the desired value.

Since min_perf_pct and max_perf_pct are already percentages of the max frequency or hwp_max, the min/max HWP request values can be calculated by directly applying these percentages to hwp_max.

Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
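The arithmetic in the message above can be checked with a short sketch that compares the range-based calculation with applying the percentage directly to hwp_max. The function names are hypothetical and truncating integer division is assumed in both variants; the actual fix may round differently.

```c
/* Range-based calculation: treats hwp_min as 0% of the range. */
static int hwp_max_range_based(int max_perf_pct, int hwp_min, int hwp_max)
{
	int range = hwp_max - hwp_min;
	int adj_range = max_perf_pct * range / 100;

	return hwp_min + adj_range;
}

/* Direct calculation: the percentage is applied to hwp_max itself. */
static int hwp_max_direct(int max_perf_pct, int hwp_max)
{
	return max_perf_pct * hwp_max / 100;
}
```

With max_perf_pct = 63, hwp_min = 7 and hwp_max = 32, the range-based variant returns 22 (a 2.2GHz cap), while the direct variant returns 20, which corresponds to the intended 2GHz.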
2017-03-13  cpufreq: intel_pstate: Update pid_params.sample_rate_ns in pid_param_set()  Rafael J. Wysocki
Fix the debugfs interface for PID tuning to actually update pid_params.sample_rate_ns on PID parameters updates, as changing pid_params.sample_rate_ms via debugfs has no effect now. Fixes: a4675fbc4a7a (cpufreq: intel_pstate: Replace timers with utilization update callbacks) Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
2017-03-12  cpufreq: intel_pstate: Drop redundant wrapper function  Rafael J. Wysocki
intel_pstate_hwp_set_policy() is a wrapper around intel_pstate_hwp_set(), but the only value it adds is to check hwp_active before calling the latter and one of its two callers has already checked hwp_active before that happens, so in that code path the additional check is redundant and using the wrapper is rather pointless. For this reason, drop intel_pstate_hwp_set_policy() and make its callers invoke intel_pstate_hwp_set() directly (after checking hwp_active). Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
2017-03-09  Merge branch 'pm-cpufreq'  Rafael J. Wysocki
* pm-cpufreq:
  cpufreq: intel_pstate: Do not reinit performance limits in ->setpolicy
  cpufreq: intel_pstate: Fix intel_pstate_verify_policy()
  cpufreq: intel_pstate: Fix global settings in active mode
  cpufreq: Add the "cpufreq.off=1" cmdline option
  cpufreq: intel_pstate: Avoid triggering cpu_frequency tracepoint unnecessarily
  cpufreq: intel_pstate: Fix intel_cpufreq_verify_policy()
  cpufreq: intel_pstate: Do not use performance_limits in passive mode
2017-03-06  cpufreq: intel_pstate: Do not reinit performance limits in ->setpolicy  Rafael J. Wysocki
If the current P-state selection algorithm is set to "performance" in intel_pstate_set_policy(), the limits may be initialized from scratch, but only if no_turbo is not set and the maximum frequency allowed for the given CPU (i.e. the policy object representing it) is at least equal to the max frequency supported by the CPU. In all of the other cases, the limits will not be updated.

For example, the following can happen:

 # cat intel_pstate/status
 active
 # echo performance > cpufreq/policy0/scaling_governor
 # cat intel_pstate/min_perf_pct
 100
 # echo 94 > intel_pstate/min_perf_pct
 # cat intel_pstate/min_perf_pct
 100
 # cat cpufreq/policy0/scaling_max_freq
 3100000
 # echo 3000000 > cpufreq/policy0/scaling_max_freq
 # cat intel_pstate/min_perf_pct
 94
 # echo 95 > intel_pstate/min_perf_pct
 # cat intel_pstate/min_perf_pct
 95

That is confusing for two reasons. First, the initial attempt to change min_perf_pct to 94 seems to have no effect, even though setting the global limits should always work. Second, after changing scaling_max_freq for policy0 the global min_perf_pct attribute shows 94, even though it should not have been affected by that operation in principle.

Moreover, the final attempt to change min_perf_pct to 95 worked as expected, because scaling_max_freq for the only policy with scaling_governor equal to "performance" was different from the maximum at that time.

To make all that confusion go away, modify intel_pstate_set_policy() so that it doesn't reinitialize the limits at all.

At the same time, change intel_pstate_set_performance_limits() to set min_sysfs_pct to 100 in the "performance" limits set so that switching the P-state selection algorithm to "performance" causes intel_pstate/min_perf_pct in sysfs to go to 100 (or whatever value min_sysfs_pct in the "performance" limits is set to later).

That requires per-CPU limits to be initialized explicitly rather than by copying the global limits to avoid setting min_sysfs_pct in the per-CPU limits to 100.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-03-06  cpufreq: intel_pstate: Fix intel_pstate_verify_policy()  Rafael J. Wysocki
The code added to intel_pstate_verify_policy() by commit 1443ebbacfd7 (cpufreq: intel_pstate: Fix sysfs limits enforcement for performance policy) should use perf_limits instead of limits, because otherwise setting global limits via sysfs may affect policies inconsistently.

For example, in the sequence of shell commands below, the scaling_min_freq attribute for policy1 and policy2 should be affected in the same way, because scaling_governor is set in the same way for both of them:

 # cat cpufreq/policy1/scaling_governor
 powersave
 # cat cpufreq/policy2/scaling_governor
 powersave
 # echo performance > cpufreq/policy0/scaling_governor
 # echo 94 > intel_pstate/min_perf_pct
 # cat cpufreq/policy0/scaling_min_freq
 2914000
 # cat cpufreq/policy1/scaling_min_freq
 2914000
 # cat cpufreq/policy2/scaling_min_freq
 800000

They are affected differently, because intel_pstate_verify_policy() is invoked with limits set to &performance_limits (left behind by policy0) for policy1 and with limits set to &powersave_limits (left behind by policy1) for policy2. Since perf_limits is set to the set of limits matching the policy being updated, using it instead of limits fixes the inconsistency.

Fixes: 1443ebbacfd7 (cpufreq: intel_pstate: Fix sysfs limits enforcement for performance policy)
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-03-06  cpufreq: intel_pstate: Fix global settings in active mode  Rafael J. Wysocki
Commit 111b8b3fe4fa (cpufreq: intel_pstate: Always keep all limits settings in sync) changed intel_pstate to invoke cpufreq_update_policy() for every registered CPU on global sysfs attributes updates, but that led to undesirable effects in the active mode if the "performance" P-state selection algorithm is configured for one CPU and the "powersave" one is chosen for all of the other CPUs.

Namely, in that case, the following is possible:

 # cd /sys/devices/system/cpu/
 # cat intel_pstate/max_perf_pct
 100
 # cat intel_pstate/min_perf_pct
 26
 # echo performance > cpufreq/policy0/scaling_governor
 # cat intel_pstate/max_perf_pct
 100
 # cat intel_pstate/min_perf_pct
 100
 # echo 94 > intel_pstate/min_perf_pct
 # cat intel_pstate/min_perf_pct
 26

This happens because intel_pstate attempts to maintain two sets of global limits in the active mode, one for the "performance" P-state selection algorithm and one for the "powersave" P-state selection algorithm, but the P-state selection algorithms are set per policy, so the global limits cannot reflect all of them at the same time if they are different for different policies.

In the particular situation above, the attempt to change min_perf_pct to 94 caused cpufreq_update_policy() to be run for a CPU with the "powersave" P-state selection algorithm and intel_pstate_set_policy() called by it silently switched the global limits to the "powersave" set which finally was reflected by the sysfs interface.

To prevent that from happening, modify intel_pstate_update_policies() to always switch back to the set of limits that was used right before it was invoked.

Fixes: 111b8b3fe4fa (cpufreq: intel_pstate: Always keep all limits settings in sync)
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-03-04  cpufreq: intel_pstate: Avoid triggering cpu_frequency tracepoint unnecessarily  Rafael J. Wysocki
In the passive mode the cpu_frequency trace event is already triggered by the cpufreq core or by scaling governors, so intel_pstate should not trigger it once again for the same P-state updates. In addition to that, the frequency returned by intel_cpufreq_fast_switch() and passed via freqs.new from intel_cpufreq_target() to cpufreq_freq_transition_end() should reflect the P-state actually set, so make that happen. Fixes: 001c76f05b01 (cpufreq: intel_pstate: Generic governors support) Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-03-04  cpufreq: intel_pstate: Fix intel_cpufreq_verify_policy()  Rafael J. Wysocki
The intel_pstate_update_perf_limits() called from intel_cpufreq_verify_policy() may cause global P-state limits to change which is generally confusing and unnecessary. In the passive mode the global limits are only applied to the frequency selected by the scaling governor (they are not taken into account by governors when making decisions anyway), so making them follow the per-policy limits serves no purpose and may go against user expectations (as it generally causes the global attributes in sysfs to change even though they have not been written to in some cases). Fix that by dropping the intel_pstate_update_perf_limits() invocation from intel_cpufreq_verify_policy() (which also reduces the code size by a few lines). This change does not affect the per-CPU limits case, because those limits allow any P-state to be set by default in the passive mode and it removes the only piece of code updating them in that mode, so the per-policy settings will be the only ones taken into account in that case as expected. Fixes: 001c76f05b01 (cpufreq: intel_pstate: Generic governors support) Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-03-04  cpufreq: intel_pstate: Do not use performance_limits in passive mode  Rafael J. Wysocki
Using performance_limits in the passive mode doesn't make sense, because in that mode the global limits are applied to the frequency selected by the scaling governor. The maximum and minimum P-state limits in performance_limits are both set to 100 percent which will put all CPUs into the turbo range regardless of what governor is used and what frequencies are selected by it (that is particularly undesirable on CPUs with the generic powersave governor attached). For this reason, make intel_pstate_register_driver() always point limits to powersave_limits in the passive mode. Fixes: 001c76f05b01 (cpufreq: intel_pstate: Generic governors support) Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-03-03  Merge branch 'WIP.sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  Linus Torvalds
Pull sched.h split-up from Ingo Molnar:

"The point of these changes is to significantly reduce the <linux/sched.h> header footprint, to speed up the kernel build and to have a cleaner header structure.

After these changes the new <linux/sched.h>'s typical preprocessed size goes down from a previous ~0.68 MB (~22K lines) to ~0.45 MB (~15K lines), which is around 40% faster to build on typical configs.

Not much changed from the last version (-v2) posted three weeks ago: I eliminated quirks, backmerged fixes plus I rebased it to an upstream SHA1 from yesterday that includes most changes queued up in -next plus all sched.h changes that were pending from Andrew.

I've re-tested the series both on x86 and on cross-arch defconfigs, and did a bisectability test at a number of random points.

I tried to test as many build configurations as possible, but some build breakage is probably still left - but it should be mostly limited to architectures that have no cross-compiler binaries available on kernel.org, and non-default configurations"

* 'WIP.sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (146 commits)
  sched/headers: Clean up <linux/sched.h>
  sched/headers: Remove #ifdefs from <linux/sched.h>
  sched/headers: Remove the <linux/topology.h> include from <linux/sched.h>
  sched/headers, hrtimer: Remove the <linux/wait.h> include from <linux/hrtimer.h>
  sched/headers, x86/apic: Remove the <linux/pm.h> header inclusion from <asm/apic.h>
  sched/headers, timers: Remove the <linux/sysctl.h> include from <linux/timer.h>
  sched/headers: Remove <linux/magic.h> from <linux/sched/task_stack.h>
  sched/headers: Remove <linux/sched.h> from <linux/sched/init.h>
  sched/core: Remove unused prefetch_stack()
  sched/headers: Remove <linux/rculist.h> from <linux/sched.h>
  sched/headers: Remove the 'init_pid_ns' prototype from <linux/sched.h>
  sched/headers: Remove <linux/signal.h> from <linux/sched.h>
  sched/headers: Remove <linux/rwsem.h> from <linux/sched.h>
  sched/headers: Remove the runqueue_is_locked() prototype
  sched/headers: Remove <linux/sched.h> from <linux/sched/hotplug.h>
  sched/headers: Remove <linux/sched.h> from <linux/sched/debug.h>
  sched/headers: Remove <linux/sched.h> from <linux/sched/nohz.h>
  sched/headers: Remove <linux/sched.h> from <linux/sched/stat.h>
  sched/headers: Remove the <linux/gfp.h> include from <linux/sched.h>
  sched/headers: Remove <linux/rtmutex.h> from <linux/sched.h>
  ...
2017-03-02Merge tag 'pm-turbostat-4.11-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull turbostat utility updates from Rafael Wysocki: "Power management turbostat utility updates. These update turbostat significantly and in particular: - default output is now verbose, --debug is no longer required to get all counters. As a result, some options have been added to specify exactly what output is wanted. - added --quiet to skip system configuration output - added --list, --show and --hide parameters - added --cpu parameter - enhanced Baytrail SoC support - added Gemini Lake SoC support - added sysfs C-state columns Also the symbol definitions in arch/x86/include/asm/intel-family.h and arch/x86/include/asm/msr-index.h are updated and the intel_idle and intel_pstate drivers are modified to use the updated symbols. Credits to Len Brown for all of these changes" * tag 'pm-turbostat-4.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (44 commits) tools/power turbostat: version 17.02.24 tools/power turbostat: bugfix: --add u32 was printed as u64 tools/power turbostat: show error on exec tools/power turbostat: dump p-state software config tools/power turbostat: show package number, even without --debug tools/power turbostat: support "--hide C1" etc. 
tools/power turbostat: move --Package and --processor into the --cpu option tools/power turbostat: turbostat.8 update tools/power turbostat: update --list feature tools/power turbostat: use wide columns to display large numbers tools/power turbostat: Add --list option to show available header names tools/power turbostat: fix zero IRQ count shown in one-shot command mode tools/power turbostat: add --cpu parameter tools/power turbostat: print sysfs C-state stats tools/power turbostat: extend --add option to accept /sys path tools/power turbostat: skip unused counters on BDX tools/power turbostat: fix decoding for GLM, DNV, SKX turbo-ratio limits tools/power turbostat: skip unused counters on SKX tools/power turbostat: Denverton: use HW CC1 counter, skip C3, C7 tools/power turbostat: initial Gemini Lake SOC support ...
2017-03-02sched/headers: Prepare for new header dependencies before moving code to ↵Ingo Molnar
<linux/sched/cpufreq.h> We are going to split <linux/sched/cpufreq.h> out of <linux/sched.h>, which will have to be picked up from other headers and a couple of .c files. Create a trivial placeholder <linux/sched/cpufreq.h> file that just maps to <linux/sched.h> to make this patch obviously correct and bisectable. Include the new header in the files that are going to need it. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-01Merge branch 'turbostat' of ↵Rafael J. Wysocki
git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux Pull changes related to turbostat for v4.11 from Len Brown. * 'turbostat' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux: (44 commits) tools/power turbostat: version 17.02.24 tools/power turbostat: bugfix: --add u32 was printed as u64 tools/power turbostat: show error on exec tools/power turbostat: dump p-state software config tools/power turbostat: show package number, even without --debug tools/power turbostat: support "--hide C1" etc. tools/power turbostat: move --Package and --processor into the --cpu option tools/power turbostat: turbostat.8 update tools/power turbostat: update --list feature tools/power turbostat: use wide columns to display large numbers tools/power turbostat: Add --list option to show available header names tools/power turbostat: fix zero IRQ count shown in one-shot command mode tools/power turbostat: add --cpu parameter tools/power turbostat: print sysfs C-state stats tools/power turbostat: extend --add option to accept /sys path tools/power turbostat: skip unused counters on BDX tools/power turbostat: fix decoding for GLM, DNV, SKX turbo-ratio limits tools/power turbostat: skip unused counters on SKX tools/power turbostat: Denverton: use HW CC1 counter, skip C3, C7 tools/power turbostat: initial Gemini Lake SOC support ...
2017-03-01intel_pstate: use MSR_ATOM_RATIOS definitions from msr-index.hLen Brown
Originally, these MSRs were locally defined in this driver. Now the definitions are in msr-index.h -- use them. Signed-off-by: Len Brown <len.brown@intel.com>
2017-02-28cpufreq: intel_pstate: Fix limits issue with operation mode switchingRafael J. Wysocki
There is a problem with intel_pstate operation mode switching introduced by commit fb1fe1041c04 (cpufreq: intel_pstate: Operation mode control from sysfs), because the global sysfs limits are preserved across operation modes while per-policy limits are reinitialized from scratch on a mode switch and both sets of limits may get out of sync this way. Fix that by always reinitializing the global limits upon the registration of the driver. Fixes: fb1fe1041c04 (cpufreq: intel_pstate: Operation mode control from sysfs) Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
2017-02-09Merge back earlier cpufreq changes for v4.11.Rafael J. Wysocki
2017-02-06Merge branches 'pm-core-fixes' and 'pm-cpufreq-fixes'Rafael J. Wysocki
* pm-core-fixes: PM / runtime: Avoid false-positive warnings from might_sleep_if() * pm-cpufreq-fixes: cpufreq: intel_pstate: Disable energy efficiency optimization cpufreq: brcmstb-avs-cpufreq: properly retrieve P-state upon suspend cpufreq: brcmstb-avs-cpufreq: extend sysfs entry brcm_avs_pmap
2017-02-04cpufreq: intel_pstate: Disable energy efficiency optimizationSrinivas Pandruvada
Some Kabylake desktop processors may not reach max turbo when running in HWP mode, even if running under sustained 100% utilization. This occurs when the HWP.EPP (Energy Performance Preference) is set to "balance_power" (0x80) -- the default on most systems. It occurs because the platform BIOS may erroneously enable an energy-efficiency setting -- MSR_IA32_POWER_CTL BIT-EE, which is not recommended to be enabled on this SKU. On the failing systems, this BIOS issue was not discovered when the desktop motherboard was tested with Windows, because the BIOS also neglects to provide the ACPI/CPPC table that Windows requires to enable HWP, and so Windows runs in legacy P-state mode, where this setting has no effect. Linux' intel_pstate driver does not require ACPI/CPPC to enable HWP, and so it runs in HWP mode, exposing this incorrect BIOS configuration. There are several ways to address this problem. First, Linux can also run in legacy P-state mode on this system. As intel_pstate is how Linux enables HWP, booting with "intel_pstate=disable" will run in acpi-cpufreq/ondemand legacy P-state mode. Or second, the "performance" governor can be used with intel_pstate, which will modify HWP.EPP to 0. Or third, starting in 4.10, the /sys/devices/system/cpu/cpufreq/policy*/energy_performance_preference attribute can be updated from "balance_power" to "performance". Or fourth, apply this patch, which fixes the erroneous setting of MSR_IA32_POWER_CTL BIT_EE on this model, allowing the default configuration to function as designed. Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Reviewed-by: Len Brown <len.brown@intel.com> Cc: 4.6+ <stable@vger.kernel.org> # 4.6+ Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
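The fourth option comes down to a read-modify-write of a single bit in MSR_IA32_POWER_CTL. The following is a minimal userspace sketch of the bit manipulation only, not the driver's code. It assumes that the EE control sits at bit index 19 and that setting the bit disables the optimization; the commit message states neither. The helper name power_ctl_disable_ee() is hypothetical, and the actual per-CPU MSR read and write are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: bit index of the energy-efficiency control in
 * MSR_IA32_POWER_CTL. The commit message does not give the value. */
#define POWER_CTL_BIT_EE 19

/*
 * Hypothetical helper: return the POWER_CTL value with the EE bit set
 * (assumed here to mean "optimization disabled"). *changed is set to 1
 * when the bit was clear, so the caller knows an MSR write is needed,
 * and to 0 when the value is already correct.
 */
static uint64_t power_ctl_disable_ee(uint64_t power_ctl, int *changed)
{
	const uint64_t ee_mask = UINT64_C(1) << POWER_CTL_BIT_EE;

	assert(changed != NULL);
	*changed = (power_ctl & ee_mask) == 0;
	return power_ctl | ee_mask;
}
```

Reporting whether the value changed lets a caller skip the write on systems where the BIOS setting is already correct.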
2017-02-04cpufreq: intel_pstate: Calculate guaranteed performance for HWPSrinivas Pandruvada
When HWP is active, the turbo activation ratio is not used to calculate the max non-turbo ratio. Instead, on these systems the max non-turbo ratio is decided by the config TDP settings. This change removes the use of MSR_TURBO_ACTIVATION_RATIO for HWP systems and uses the TDP ratios directly when more than one TDP level is available. Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-02-04cpufreq: intel_pstate: Make HWP limits compatible with legacySrinivas Pandruvada
Under HWP, the performance limits are calculated from max_perf_pct and min_perf_pct relative to possible performance, not available performance. The available performance can be reduced by the no_turbo setting. For compatibility with legacy mode, use the max/min performance percentages with respect to available performance. Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
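The difference between "possible" and "available" performance can be illustrated with a small sketch. Both helpers and all numbers are hypothetical, and the arithmetic is plain integer division rounding down, which is an assumption made for readability rather than a copy of the driver's fixed-point code.

```c
#include <assert.h>

/* Hypothetical helper: the highest performance level that can currently
 * be reached. With no_turbo set, the turbo range is not available. */
static int available_max(int turbo_max, int nonturbo_max, int no_turbo)
{
	return no_turbo ? nonturbo_max : turbo_max;
}

/* Hypothetical helper: scale a percentage limit against a reference
 * performance level (integer arithmetic, rounded down). */
static int pct_to_level(int pct, int reference)
{
	assert(pct >= 0 && pct <= 100);
	assert(reference > 0);
	return pct * reference / 100;
}
```

With a turbo maximum of 40, a non-turbo maximum of 30 and min_perf_pct at 50, the limit is 20 when taken against possible performance, but 15 when no_turbo is set and the percentage is applied to available performance, which is what legacy mode would give.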
2017-02-04cpufreq: intel_pstate: Lower frequency than expected under no_turboSrinivas Pandruvada
When turbo is not disabled by the BIOS, but the user disables it through the intel_pstate sysfs interface and changes max/min using cpufreq sysfs, the resulting frequency is lower than what the user requested. The reason is that when the perf limits are calculated in the set_policy() callback, they are relative to the max CPU frequency (the turbo frequency), but when they are enforced in intel_pstate_get_min_max() they are relative to the max available performance as documented in the intel_pstate documentation (in this case the max non-turbo P-state). This needs a change similar to the one made in intel_cpufreq_verify_policy() for passive mode. Set policy->cpuinfo.max_freq based on the turbo status. Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
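A worked example of the mismatch, as a sketch under stated assumptions: the helper names and the frequencies are hypothetical, and the percentage is computed with integer arithmetic rounded down rather than with the driver's fixed-point math.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: express a requested frequency as a percentage of
 * a reference maximum (integer arithmetic, rounded down). */
static uint64_t freq_to_pct(uint64_t req_khz, uint64_t ref_max_khz)
{
	assert(ref_max_khz > 0);
	return req_khz * 100 / ref_max_khz;
}

/* Hypothetical helper: turn a percentage back into a frequency against
 * the maximum that is actually available. */
static uint64_t pct_to_freq(uint64_t pct, uint64_t avail_max_khz)
{
	return pct * avail_max_khz / 100;
}

/* Frequency that results when the percentage is derived against
 * ref_max_khz but enforced against avail_max_khz. */
static uint64_t effective_khz(uint64_t req_khz, uint64_t ref_max_khz,
			      uint64_t avail_max_khz)
{
	return pct_to_freq(freq_to_pct(req_khz, ref_max_khz), avail_max_khz);
}
```

Take a turbo maximum of 3000000 kHz, a non-turbo maximum of 2400000 kHz and a request of 1800000 kHz with turbo disabled. Deriving the percentage against the turbo maximum gives 60%, which is enforced as 1440000 kHz. Deriving it against the non-turbo maximum, which is the effect of setting cpuinfo.max_freq from the turbo status, gives 75% and the requested 1800000 kHz.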
2017-02-04cpufreq: intel_pstate: Operation mode control from sysfsRafael J. Wysocki
Make it possible to change the operation mode of intel_pstate with the help of a new sysfs attribute called "status". There are three possible configurations that can be selected using this attribute: "off" - The driver is not in use at this time. "active" - The driver works as a P-state governor (default). "passive" - The driver works as a regular cpufreq one and collaborates with the generic cpufreq governors (it sets P-states as requested by those governors). [This is the same mode the driver can be started in by passing intel_pstate=passive in the kernel command line.] The current setting is returned by reads from this attribute. Writing one of the above strings to it changes the operation mode as indicated by that string, if possible. If the HW-managed P-states (HWP) feature is enabled, it is not possible to change the driver's operation mode and attempts to write to this attribute will fail. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
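Reading and writing the attribute from userspace could look roughly like the sketch below. The helper names and the enum are hypothetical. The attribute location is passed in by the caller because the commit message does not state it; /sys/devices/system/cpu/intel_pstate/status is assumed here to be the usual place.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

enum pstate_mode { MODE_OFF, MODE_ACTIVE, MODE_PASSIVE, MODE_UNKNOWN };

/* Hypothetical helper: map the text read from "status" to a mode. */
static enum pstate_mode parse_status(const char *text)
{
	size_t len;

	assert(text != NULL);
	len = strcspn(text, "\n");	/* ignore the trailing newline */
	if (len == 3 && strncmp(text, "off", len) == 0)
		return MODE_OFF;
	if (len == 6 && strncmp(text, "active", len) == 0)
		return MODE_ACTIVE;
	if (len == 7 && strncmp(text, "passive", len) == 0)
		return MODE_PASSIVE;
	return MODE_UNKNOWN;
}

/* Read the current mode from the attribute at path. */
static enum pstate_mode read_status(const char *path)
{
	char buf[16] = "";
	FILE *f = fopen(path, "r");

	if (!f)
		return MODE_UNKNOWN;
	if (!fgets(buf, sizeof(buf), f))
		buf[0] = '\0';
	fclose(f);
	return parse_status(buf);
}

/* Request a mode switch. Returns 0 on success and -1 if the write is
 * rejected (per the commit message, this happens when HWP is enabled). */
static int write_status(const char *path, const char *mode)
{
	FILE *f = fopen(path, "w");
	int ok;

	if (!f)
		return -1;
	ok = fputs(mode, f) >= 0;
	/* stdio buffers the write, so an error may only show up on close */
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}
```

A caller would typically check read_status() first and call write_status(path, "passive") only when the mode differs, treating a -1 return as "mode switching is not possible on this system".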
2017-02-04cpufreq: intel_pstate: Expose global sysfs attributes upfrontRafael J. Wysocki
Expose intel_pstate's global sysfs attributes before registering the driver, to prepare for the addition of an attribute that will also have to work when the driver is not registered. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>