diff options
author | Mel Gorman <mgorman@techsingularity.net> | 2020-11-30 22:32:02 +0000 |
---|---|---|
committer | Rafael J. Wysocki <rafael.j.wysocki@intel.com> | 2020-12-01 17:59:18 +0100 |
commit | 7a25759eaa04b8c0ecb3db134922d6641ab2e6d1 (patch) | |
tree | 205d63ac320e8a5f1948a89fba7a22413cc27344 | |
parent | 0f6e2cb45bcb003e5b3a5332b4de79bf82814f45 (diff) |
cpuidle: Select polling interval based on a c-state with a longer target residency
It was noted that a few workloads that idle rapidly regressed when commit
36fcb4292473 ("cpuidle: use first valid target residency as poll time")
was merged. The workloads in question were heavy communicators that idle
rapidly and were impacted by the c-state exit latency as the active CPUs
were not polling at the time of wakeup. As they were not particularly
realistic workloads, it was not considered to be a major problem.
Unfortunately, a bug was reported for a real workload in a production
environment that relied on large numbers of threads operating in a worker
pool pattern. These threads would idle for periods of time longer than the
C1 target residency and so incurred the c-state exit latency penalty. The
application is very sensitive to wakeup latency and indirectly relying
on behaviour prior to commit on a37b969a61c1 ("cpuidle: poll_state: Add
time limit to poll_idle()") to poll for long enough to avoid the exit
latency cost.
The target residency of C1 is typically very short. On some x86 machines,
it can be as low as 2 microseconds. In poll_idle(), the clock is checked
every POLL_IDLE_RELAX_COUNT interations of cpu_relax() and even one
iteration of that loop can be over 1 microsecond so the polling interval is
very close to the granularity of what poll_idle() can detect. Furthermore,
a basic ping pong workload like perf bench pipe has a longer round-trip
time than the 2 microseconds meaning that the CPU will almost certainly
not be polling when the ping-pong completes.
This patch selects a polling interval based on an enabled c-state that
has an target residency longer than 10usec. If there is no enabled-cstate
then polling will be up to a TICK_NSEC/16 similar to what it was up until
kernel 4.20. Polling for a full tick is unlikely (rescheduling event)
and is much longer than the existing target residencies for a deep c-state.
As an example, consider a CPU with the following c-state information from
an Intel CPU;
residency exit_latency
C1 2 2
C1E 20 10
C3 100 33
C6 400 133
The polling interval selected is 20usec. If booted with
intel_idle.max_cstate=1 then the polling interval is 250usec as the deeper
c-states were not available.
On an AMD EPYC machine, the c-state information is more limited and
looks like
residency exit_latency
C1 2 1
C2 800 400
The polling interval selected is 250usec. While C2 was considered, the
polling interval was clamped by CPUIDLE_POLL_MAX.
Note that it is not expected that polling will be a universal win. As
well as potentially trading power for performance, the performance is not
guaranteed if the extra polling prevented a turbo state being reached.
Making it a tunable was considered but it's driver-specific, may be
overridden by a governor and is not a guaranteed polling interval making
it difficult to describe without knowledge of the implementation.
tbench4
vanilla polling
Hmean 1 497.89 ( 0.00%) 543.15 * 9.09%*
Hmean 2 975.88 ( 0.00%) 1059.73 * 8.59%*
Hmean 4 1953.97 ( 0.00%) 2081.37 * 6.52%*
Hmean 8 3645.76 ( 0.00%) 4052.95 * 11.17%*
Hmean 16 6882.21 ( 0.00%) 6995.93 * 1.65%*
Hmean 32 10752.20 ( 0.00%) 10731.53 * -0.19%*
Hmean 64 12875.08 ( 0.00%) 12478.13 * -3.08%*
Hmean 128 21500.54 ( 0.00%) 21098.60 * -1.87%*
Hmean 256 21253.70 ( 0.00%) 21027.18 * -1.07%*
Hmean 320 20813.50 ( 0.00%) 20580.64 * -1.12%*
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
-rw-r--r-- | drivers/cpuidle/cpuidle.c | 25 |
1 files changed, 23 insertions, 2 deletions
diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c index 83af15f77f66..ef2ea1b12cd8 100644 --- a/drivers/cpuidle/cpuidle.c +++ b/drivers/cpuidle/cpuidle.c @@ -368,6 +368,19 @@ void cpuidle_reflect(struct cpuidle_device *dev, int index) cpuidle_curr_governor->reflect(dev, index); } +/* + * Min polling interval of 10usec is a guess. It is assuming that + * for most users, the time for a single ping-pong workload like + * perf bench pipe would generally complete within 10usec but + * this is hardware dependant. Actual time can be estimated with + * + * perf bench sched pipe -l 10000 + * + * Run multiple times to avoid cpufreq effects. + */ +#define CPUIDLE_POLL_MIN 10000 +#define CPUIDLE_POLL_MAX (TICK_NSEC / 16) + /** * cpuidle_poll_time - return amount of time to poll for, * governors can override dev->poll_limit_ns if necessary @@ -382,15 +395,23 @@ u64 cpuidle_poll_time(struct cpuidle_driver *drv, int i; u64 limit_ns; + BUILD_BUG_ON(CPUIDLE_POLL_MIN > CPUIDLE_POLL_MAX); + if (dev->poll_limit_ns) return dev->poll_limit_ns; - limit_ns = TICK_NSEC; + limit_ns = CPUIDLE_POLL_MAX; for (i = 1; i < drv->state_count; i++) { + u64 state_limit; + if (dev->states_usage[i].disable) continue; - limit_ns = drv->states[i].target_residency_ns; + state_limit = drv->states[i].target_residency_ns; + if (state_limit < CPUIDLE_POLL_MIN) + continue; + + limit_ns = min_t(u64, state_limit, CPUIDLE_POLL_MAX); break; } |