
Note: due to limited personal time and energy this article is unfinished; I may complete it later when I get the chance.
The CPU provides several idle levels, which differ mainly in the power consumed while idle and in the latency of exiting idle. The driver layer defines these idle states (i.e. describes each state's power consumption and exit latency) and implements the enter/exit operations. The driver then hands this information to the governor layer, which decides, for the workload at hand, which idle level to use and on what basis: the CPU load? the number of iowaiters in the system?

The governor layer's job is therefore: define the selection rules and, following those rules, pick a suitable C-state.
1. Computing the latency the system can tolerate

1.1 cpuidle_governor_latency_req - compute the latency the system can tolerate

This is computed via PM QoS; PM QoS itself will be covered in a separate article.

linux-5.10.61/drivers/cpuidle/governor.c
/**
 * cpuidle_governor_latency_req - Compute a latency constraint for CPU
 * @cpu: Target CPU
 */
s64 cpuidle_governor_latency_req(unsigned int cpu)
{
    struct device *device = get_cpu_device(cpu);

    //1. Compute the latency the system can tolerate, via PM QoS
    //   (PM QoS itself is covered in a later article)
    int device_req = dev_pm_qos_raw_resume_latency(device);
    int global_req = cpu_latency_qos_limit();

    //2. Take the smaller, i.e. the stricter, of the two values
    if (device_req > global_req)
        device_req = global_req;

    //3. Convert from microseconds to nanoseconds
    return (s64)device_req * NSEC_PER_USEC;
}
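As a hedged illustration of where the global input comes from: since v5.4 the kernel exposes the cpu_latency_qos_* API (the value read back by cpu_latency_qos_limit() above). The sketch below is a hypothetical kernel module, not anything from this article's sources, pinning the global CPU latency limit:

static struct pm_qos_request dumpstack_qos;  /* made-up name */

static int __init dumpstack_qos_init(void)
{
    /* Ask the kernel to keep CPU wakeup latency under 20 us.
     * cpu_latency_qos_limit() now returns at most 20, so C-states
     * with a larger exit latency will no longer be selected. */
    cpu_latency_qos_add_request(&dumpstack_qos, 20);
    return 0;
}

static void __exit dumpstack_qos_exit(void)
{
    cpu_latency_qos_remove_request(&dumpstack_qos);
}

module_init(dumpstack_qos_init);
module_exit(dumpstack_qos_exit);
MODULE_LICENSE("GPL");

Userspace can achieve the same by holding /dev/cpu_dma_latency open with the desired value written to it.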
2. menu

ladder is used on systems with a periodic timer tick; menu is used on tickless systems.

2.1 Design principles

Note: the following is adapted from an article by WoWo Tech (蜗窝科技); credit to the original author!
The governor's main job is to pick a suitable idle state (in standard kernel terminology a C-state, i.e. CPU state) based on how the system is running. The algorithm rests on two considerations:

1) The cost of switching (called the "minimum residency" below)

The point of entering a C-state is to save power, but switching between a C-state and the normal state itself costs power. That cost is captured in the target_residency field of struct cpuidle_state (the "minimum residency"): when registering a cpuidle_state, the driver must state explicitly how long the CPU has to stay in that C-state for entering it to pay off.

So when choosing a C-state, the governor must predict how long the CPU is going to stay idle, compare that with the target_residency of each candidate cpuidle_state, and pick only C-states satisfying "predicted residency > minimum residency (target_residency)".

2) How much exit latency the system can tolerate

Among the candidate C-states, power and exit latency are an irreconcilable trade-off: the deeper the idle level, the deeper the sleep, so the lower the power but the longer the exit latency; conversely, the shallower the sleep, the higher the power but the shorter the exit latency. The goal of power management is to save as much power as possible while keeping the exit latency within what the system can accept.
To describe power and exit latency, struct cpuidle_state provides two more fields:

- the CPU's power consumption in the C-state, in the power_usage field;
- the latency of exiting the C-state, in the exit_latency field.
Then, if we know the latency the system can currently tolerate (the latency_req variable in the code), we can take all C-states whose exit_latency is below latency_req and pick the one with the lowest power. The governor's problem thus reduces to obtaining "the latency the system can currently tolerate", which is exactly what PM QoS is good at.
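A minimal sketch of that selection rule (not the kernel's actual code; the helper name pick_state is made up). It relies only on the fact that cpuidle_driver->states[] is ordered shallow to deep, i.e. by decreasing power:

/* states[] is ordered shallow -> deep: increasing exit latency and
 * target residency, decreasing power. */
static int pick_state(const struct cpuidle_state *states, int count,
                      s64 predicted_ns, s64 latency_req)
{
    int i, best = 0;

    for (i = 0; i < count; i++) {
        if (states[i].target_residency_ns > predicted_ns)
            break;      /* too deep: entry/exit cost would not pay off */
        if (states[i].exit_latency_ns > latency_req)
            break;      /* too deep: wakeup would violate the latency bound */
        best = i;       /* deepest state satisfying both constraints so far */
    }
    return best;        /* lowest-power admissible state */
}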
Based on these considerations, the menu governor's work reduces to two tasks:

- predict how long the CPU will stay in the C-state, based on how the system is running (the predicted_us variable);
- obtain the latency the system can currently tolerate, with help from the PM QoS framework (the latency_req variable).
For task 1, menu proceeds as follows.

As mentioned above, menu is used on tickless systems. As a simplification, menu takes the time until the next timer event (menu_device::next_timer_us) as the baseline residency estimate predicted_us.

Of course this baseline is inaccurate: other wakeup events can occur at any moment before the next timer event. To improve the prediction, a correction factor is applied (menu_device::correction_factor), equal to the ratio between past actual residencies and next_timer_us. For example, if actual wakeups always occurred at half of the predicted next-timer interval, the factor would be 0.5. For better accuracy, menu maintains this factor as a decaying (running) average.
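A minimal sketch of that decaying average, using the kernel's constants but a made-up helper name. The factor is kept in fixed point, scaled by RESOLUTION * DECAY, so integer arithmetic suffices:

#include <linux/math64.h>

#define RESOLUTION 1024
#define DECAY      8

/* Fold one observation (measured residency vs. predicted sleep length)
 * into the running correction factor: drop 1/DECAY of the old value and
 * mix in the latest measured/expected ratio, scaled by RESOLUTION. */
static unsigned int update_factor(unsigned int factor,
                                  u64 measured_ns, u64 next_timer_ns)
{
    factor -= factor / DECAY;
    factor += div64_u64(RESOLUTION * measured_ns, next_timer_ns);
    return factor;
}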
Moreover, the correction factor does not affect all ranges of next_timer_us equally. For example, for next timer events of 50 ms and 500 ms, with wakeups in both cases at 10 ms, the factor clearly matters more for the 500 ms case. Mixing the two when computing the average would hurt the prediction, so the correction factor is kept per range of next_timer_us: next_timer_us is divided into tiers, the "buckets" described below.

Likewise, whether or not the system has outstanding IO waits changes how the correction factor should behave.

For these reasons menu keeps an array of correction factors, menu_device::correction_factor[BUCKETS], currently 12 of them, one per combination of next_timer_us range and IO-wait condition.

Finally, there are cases where the next_timer_us prediction is simply wrong, e.g. with fixed-period interrupts (audio, etc.). For these, menu uses a different predictor: it computes the standard deviation of the last 8 residencies, and if it falls below a threshold, it uses the average of those 8 residencies as the prediction.
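A worked example of that detector: suppose the last 8 residencies (in us) were 1000, 980, 1020, 990, 1010, 1000, 995 and 1005. The average is 1000 us, and the variance is (0^2 + 20^2 + 20^2 + 10^2 + 10^2 + 0^2 + 5^2 + 5^2)/8, roughly 131 us^2, i.e. a standard deviation of about 11.5 us. That is below the kernel's "small in absolute terms" threshold (variance <= 400 us^2, i.e. stddev <= 20 us), so menu would predict 1000 us regardless of when the next timer fires.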
Task 2 is the estimation of the exit latency the system can tolerate (latency_req). menu combines two factors:

1) the CPU/DMA latency requirement obtained from PM QoS; this is a hard constraint;

2) a rule of thumb: the busier the system, the more latency-sensitive it is (i.e. the shorter the exit latency must be). Combining the residency predicted in task 1 (predicted_us) with the current CPU load average and iowaiter count, older kernels computed an empirical bound as

latency_req = predicted_us / (1 + 2 * loadavg + 10 * iowaiters)

This ratio between exit latency and expected residency means: the larger loadavg and the iowaiter count, the stricter the requirement, i.e. the lower the allowed exit latency. Note that in the 5.10 kernel quoted in this article the load-average term has been dropped upstream, so performance_multiplier() is simply 1 + 10 * nr_iowaiters, with the iowaiter count obtained from nr_iowait_cpu().

Finally, latency_req is the minimum of these two estimates.
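A quick numeric check: with predicted_us = 600, loadavg = 1 and iowaiters = 2, the old formula gives 600 / (1 + 2*1 + 10*2) = 600/23, about 26 us; the 5.10 multiplier gives 600 / (1 + 10*2), about 28 us. If PM QoS allows, say, 100 us, the final latency_req is the smaller value, so only C-states with an exit latency below roughly 26-28 us remain candidates.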
The intent behind menu is documented in detail in a comment in the kernel source:

linux-5.10.61/drivers/cpuidle/governors/menu.c
/*
 * Concepts and ideas behind the menu governor
 *
 * For the menu governor, there are 3 decision factors for picking a C
 * state:
 * 1) Energy break even point
 * 2) Performance impact
 * 3) Latency tolerance (from pmqos infrastructure)
 * These these three factors are treated independently.
 *
 * Energy break even point
 * -----------------------
 * C state entry and exit have an energy cost, and a certain amount of time in
 * the C state is required to actually break even on this cost. CPUIDLE
 * provides us this duration in the "target_residency" field. So all that we
 * need is a good prediction of how long we'll be idle. Like the traditional
 * menu governor, we start with the actual known "next timer event" time.
 *
 * Since there are other source of wakeups (interrupts for example) than
 * the next timer event, this estimation is rather optimistic. To get a
 * more realistic estimate, a correction factor is applied to the estimate,
 * that is based on historic behavior. For example, if in the past the actual
 * duration always was 50% of the next timer tick, the correction factor will
 * be 0.5.
 *
 * menu uses a running average for this correction factor, however it uses a
 * set of factors, not just a single factor. This stems from the realization
 * that the ratio is dependent on the order of magnitude of the expected
 * duration; if we expect 500 milliseconds of idle time the likelihood of
 * getting an interrupt very early is much higher than if we expect 50 micro
 * seconds of idle time. A second independent factor that has big impact on
 * the actual factor is if there is (disk) IO outstanding or not.
 * (as a special twist, we consider every sleep longer than 50 milliseconds
 * as perfect; there are no power gains for sleeping longer than this)
 *
 * For these two reasons we keep an array of 12 independent factors, that gets
 * indexed based on the magnitude of the expected duration as well as the
 * "is IO outstanding" property.
 *
 * Repeatable-interval-detector
 * ----------------------------
 * There are some cases where "next timer" is a completely unusable predictor:
 * Those cases where the interval is fixed, for example due to hardware
 * interrupt mitigation, but also due to fixed transfer rate devices such as
 * mice.
 * For this, we use a different predictor: We track the duration of the last 8
 * intervals and if the stand deviation of these 8 intervals is below a
 * threshold value, we use the average of these intervals as prediction.
 *
 * Limiting Performance Impact
 * ---------------------------
 * C states, especially those with large exit latencies, can have a real
 * noticeable impact on workloads, which is not acceptable for most sysadmins,
 * and in addition, less performance has a power price of its own.
 *
 * As a general rule of thumb, menu assumes that the following heuristic
 * holds:
 *     The busier the system, the less impact of C states is acceptable
 *
 * This rule-of-thumb is implemented using a performance-multiplier:
 * If the exit latency times the performance multiplier is longer than
 * the predicted duration, the C state is not considered a candidate
 * for selection due to a too high performance impact. So the higher
 * this multiplier is, the longer we need to be idle to pick a deep C
 * state, and thus the less likely a busy CPU will hit such a deep
 * C state.
 *
 * Two factors are used in determing this multiplier:
 * a value of 10 is added for each point of "per cpu load average" we have.
 * a value of 5 points is added for each process that is waiting for
 * IO on this CPU.
 * (these values are experimentally determined)
 *
 * The load average factor gives a longer term (few seconds) input to the
 * decision, while the iowait value gives a cpu local instantanious input.
 * The iowait factor may look low, but realize that this is also already
 * represented in the system load average.
 */
2.2 menu_device - menu's per-CPU bookkeeping data

struct menu_device {
    //Each time the CPU returns from a C-state, the kernel calls the
    //governor's reflect hook, giving the governor a chance to evaluate the
    //outcome of this round and update its statistics so the next selection
    //is better.  menu's reflect hook merely sets the needs_update flag; the
    //actual update happens at the next select.  Why not do it directly in
    //reflect?  Because reflect runs in a context with interrupts disabled:
    //it must finish quickly so interrupts can be re-enabled and serviced.
    //See the detailed discussion below.
    int needs_update;

    //flag: whether the last wakeup from idle was delivered by the scheduler
    //tick (set in menu_reflect via tick_nohz_idle_got_tick())
    int tick_wakeup;

    //time until the next timer event, recorded at selection time
    u64 next_timer_ns;

    //which correction factor to use for this selection, i.e. an index into
    //correction_factor[] below
    unsigned int bucket;

    //the correction factors; BUCKETS is currently 12
    unsigned int correction_factor[BUCKETS];

    //the last INTERVALS (currently 8) residencies, used for the standard
    //deviation check described above
    unsigned int intervals[INTERVALS];

    //index into intervals[] above
    int interval_ptr;
};
2.3 menu_governor

static struct cpuidle_governor menu_governor = {
    .name    = "menu",
    .rating  = 20,
    .enable  = menu_enable_device,
    .select  = menu_select,
    .reflect = menu_reflect,
};
2.4 init_menu - registration

/**
 * init_menu - initializes the governor
 */
static int __init init_menu(void)
{
    return cpuidle_register_governor(&menu_governor);
}

postcore_initcall(init_menu);   //runs during driver initialization at boot
2.5 menu_enable_device - per-CPU setup when the governor is enabled

cpuidle_enable_device calls this hook through cpuidle_curr_governor->enable. As the call paths below show, it runs when a CPU is registered at boot or hot-plugged, and initializes the governor's per-CPU data for that CPU; for menu this mainly means initializing the correction factors:

- cpuidle_register_device -> cpuidle_enable_device
- acpi_processor_hotplug -> cpuidle_enable_device
- acpi_processor_power_state_has_changed -> cpuidle_enable_device
/**
 * menu_enable_device - scans a CPU's states and does setup
 * @drv: cpuidle driver
 * @dev: the CPU
 */
static int menu_enable_device(struct cpuidle_driver *drv,
                              struct cpuidle_device *dev)
{
    //1. Each CPU has its own menu_device
    struct menu_device *data = &per_cpu(menu_devices, dev->cpu);
    int i;

    memset(data, 0, sizeof(struct menu_device));

    /*
     * if the correction factor is 0 (eg first time init or cpu hotplug
     * etc), we actually want to start out with a unity factor.
     */
    //2. Initialize the correction factors.  BUCKETS = 12, RESOLUTION = 1024,
    //   DECAY = 8; RESOLUTION * DECAY = 8192 is the fixed-point encoding of
    //   a ratio of 1.0 ("unity")
    for (i = 0; i < BUCKETS; i++)
        data->correction_factor[i] = RESOLUTION * DECAY;

    return 0;
}
2.6 menu_select - pick a C-state

The call path is shown below: in cpuidle_idle_call, a suitable C-state is selected first and then entered:

- cpuidle_idle_call -> cpuidle_select
/**
 * menu_select - selects the next idle state to enter
 * @drv: cpuidle driver containing state data
 * @dev: the CPU
 * @stop_tick: indication on whether or not to stop the tick
 */
static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
                       bool *stop_tick)  //out: whether the tick should be stopped
{
    struct menu_device *data = this_cpu_ptr(&menu_devices);

    //1. Compute the exit latency the system can tolerate (latency_req),
    //   via PM QoS (see section 1.1)
    s64 latency_req = cpuidle_governor_latency_req(dev->cpu);

    //predicted_us is "how long we expect to stay idle"
    unsigned int predicted_us;
    u64 predicted_ns;
    u64 interactivity_req;
    unsigned long nr_iowaiters;
    ktime_t delta_next;
    int i, idx;

    //2. If needs_update is set, call menu_update to refresh the statistics
    //   for the previous idle period.  This work conceptually belongs in
    //   reflect, but reflect runs with interrupts disabled and must be
    //   quick, so it is deferred to here.
    if (data->needs_update) {
        menu_update(drv, dev);
        data->needs_update = 0;
    }

    /* determine the expected residency time, round up */
    //3. Expected length of this sleep, i.e. time until the next timer event
    data->next_timer_ns = tick_nohz_get_sleep_length(&delta_next);

    //4. How many tasks are waiting for IO on this CPU
    nr_iowaiters = nr_iowait_cpu(dev->cpu);

    //5. From the expected sleep length and the iowaiter count, pick which
    //   bucket's correction factor to use
    data->bucket = which_bucket(data->next_timer_ns, nr_iowaiters);

    //6. Return 0 (state 0, the shallowest state) when any of the following
    //   holds:
    //   a) the driver has only one state, so there is nothing to choose;
    //   b) the tolerable exit latency computed above is 0, i.e. the system
    //      is under a strict latency constraint and must not go idle;
    //   c) even the shallowest real idle state (state 1) does not qualify:
    //      i)  residency: drv->states[] is ordered from shallow to deep, so
    //          if the expected sleep next_timer_ns is below state 1's
    //          target residency, no real idle state is worthwhile;
    //      ii) latency: the tolerable exit latency is below state 1's exit
    //          latency (deeper states have even larger exit latencies);
    //      iii) ...and state 0 is not disabled (!dev->states_usage[0].disable),
    //          so it may actually be returned directly here.
    if (unlikely(drv->state_count <= 1 || latency_req == 0) ||
        ((data->next_timer_ns < drv->states[1].target_residency_ns ||
          latency_req < drv->states[1].exit_latency_ns) &&
         !dev->states_usage[0].disable)) {
        /*
         * In this case state[0] will be used no matter what, so return
         * it right away and keep the tick running if state[0] is a
         * polling one.
         */
        //6.1 Not going idle; whether to stop the tick depends on whether
        //    state[0] is a polling state
        *stop_tick = !(drv->states[0].flags & CPUIDLE_FLAG_POLLING);
        return 0;
    }

    /* Round up the result for half microseconds. */
    //7. predicted_us = next_timer_ns * correction factor, converted to us.
    //   The factor is fixed point, scaled by RESOLUTION (1024) * DECAY (8),
    //   so the product is divided by RESOLUTION * DECAY * NSEC_PER_USEC;
    //   adding half the divisor first rounds to nearest.
    predicted_us = div_u64(data->next_timer_ns *
                           data->correction_factor[data->bucket] +
                           (RESOLUTION * DECAY * NSEC_PER_USEC) / 2,
                           RESOLUTION * DECAY * NSEC_PER_USEC);

    /* Use the lowest expected idle interval to pick the idle state. */
    //8. get_typical_interval checks for a repeating pattern by computing the
    //   standard deviation of the last 8 residencies; take the minimum of
    //   the two predictions
    predicted_ns = (u64)min(predicted_us,
                            get_typical_interval(data, predicted_us)) *
                            NSEC_PER_USEC;

    if (tick_nohz_tick_stopped()) {
        /*
         * If the tick is already stopped, the cost of possible short
         * idle duration misprediction is much higher, because the CPU
         * may be stuck in a shallow idle state for a long time as a
         * result of it.  In that case say we might mispredict and use
         * the known time till the closest timer event for the idle
         * state selection.
         */
        if (predicted_ns < TICK_NSEC)
            predicted_ns = delta_next;
    } else {
        /*
         * Use the performance multiplier and the user-configurable
         * latency_req to determine the maximum exit latency.
         */
        //9. Derive a second latency bound from predicted_ns and the load
        //   situation (nr_iowaiters), and keep the minimum of it and the
        //   PM QoS latency_req
        interactivity_req = div64_u64(predicted_ns,
                                      performance_multiplier(nr_iowaiters));
        if (latency_req > interactivity_req)
            latency_req = interactivity_req;
    }

    //10. At this point both latency_req ("tolerable exit latency") and
    //    predicted_ns ("expected idle duration") are known; use them to
    //    pick an idle level

    /*
     * Find the idle state with the lowest power while satisfying
     * our constraints.
     */
    idx = -1;

    //drv->states[] is ordered from shallow (high power, low exit latency)
    //to deep, so this loop walks from the shallowest to the deepest state
    for (i = 0; i < drv->state_count; i++) {
        struct cpuidle_state *s = &drv->states[i];

        //skip disabled states
        if (dev->states_usage[i].disable)
            continue;

        if (idx == -1)
            idx = i; /* first enabled state */

        //the expected idle duration is below this state's minimum
        //residency: this state (and every deeper one) is too deep
        if (s->target_residency_ns > predicted_ns) {
            /*
             * Use a physical idle state, not busy polling, unless
             * a timer is going to trigger soon enough.
             */
            //idx still holds the previous (shallower) candidate; if that
            //candidate is only a polling state, prefer this physical state
            //as long as it satisfies the latency bound and a timer will
            //fire within its residency anyway
            if ((drv->states[idx].flags & CPUIDLE_FLAG_POLLING) &&
                s->exit_latency_ns <= latency_req &&
                s->target_residency_ns <= data->next_timer_ns) {
                predicted_ns = s->target_residency_ns;
                idx = i;
                break;
            }
            if (predicted_ns < TICK_NSEC)
                break;

            if (!tick_nohz_tick_stopped()) {
                /*
                 * If the state selected so far is shallow,
                 * waking up early won't hurt, so retain the
                 * tick in that case and let the governor run
                 * again in the next iteration of the loop.
                 */
                predicted_ns = drv->states[idx].target_residency_ns;
                break;
            }

            /*
             * If the state selected so far is shallow and this
             * state's target residency matches the time till the
             * closest timer event, select this one to avoid getting
             * stuck in the shallow one for too long.
             */
            if (drv->states[idx].target_residency_ns < TICK_NSEC &&
                s->target_residency_ns <= delta_next)
                idx = i;

            return idx;
        }

        //this state's exit latency exceeds what the system can tolerate;
        //deeper states are only worse, so stop searching here
        if (s->exit_latency_ns > latency_req)
            break;

        idx = i;
    }

    //no usable state found, fall back to state 0
    if (idx == -1)
        idx = 0; /* No states enabled. Must use 0. */

    /*
     * Don't stop the tick if the selected state is a polling one or if the
     * expected idle duration is shorter than the tick period length.
     */
    //at this point a candidate state has been found
    if (((drv->states[idx].flags & CPUIDLE_FLAG_POLLING) ||
         predicted_ns < TICK_NSEC) && !tick_nohz_tick_stopped()) {
        *stop_tick = false;

        if (idx > 0 && drv->states[idx].target_residency_ns > delta_next) {
            /*
             * The tick is not going to be stopped and the target
             * residency of the state to be returned is not within
             * the time until the next timer event including the
             * tick, so try to correct that.
             */
            for (i = idx - 1; i >= 0; i--) {
                if (dev->states_usage[i].disable)
                    continue;

                idx = i;
                if (drv->states[i].target_residency_ns <= delta_next)
                    break;
            }
        }
    }

    return idx;
}
2.6.1 menu_update - deferred statistics update

Each time the CPU returns from a C-state, the kernel calls the governor's reflect hook, giving the governor a chance to evaluate the outcome of this round and update its statistics so that the next selection is better.

For menu, the reflect hook merely sets the needs_update flag; the real update happens at the next select, which calls menu_update to do the bookkeeping.
/**
 * menu_update - attempts to guess what happened after entry
 * @drv: cpuidle driver containing state data
 * @dev: the CPU
 */
static void menu_update(struct cpuidle_driver *drv, struct cpuidle_device *dev)
{
    struct menu_device *data = this_cpu_ptr(&menu_devices);

    //1. The C-state this CPU entered last time
    int last_idx = dev->last_state_idx;
    struct cpuidle_state *target = &drv->states[last_idx];
    u64 measured_ns;
    unsigned int new_factor;

    //2. As the comment below explains, work out how long the CPU actually
    //   slept in the last C-state.  Whatever the measurement method, the
    //   value includes the exit latency; since we care about when the wakeup
    //   began (not when it completed), the exit latency must be subtracted.
    //   If the measured time is smaller than the exit latency, assume the
    //   state was never really reached and treat the exit latency as 0.
    /*
     * Try to figure out how much time passed between entry to low
     * power state and occurrence of the wakeup event.
     *
     * If the entered idle state didn't support residency measurements,
     * we use them anyway if they are short, and if long,
     * truncate to the whole expected time.
     *
     * Any measured amount of time will include the exit latency.
     * Since we are interested in when the wakeup begun, not when it
     * was completed, we must subtract the exit latency. However, if
     * the measured amount of time is less than the exit latency,
     * assume the state was never reached and the exit latency is 0.
     */
    if (data->tick_wakeup && data->next_timer_ns > TICK_NSEC) {
        //tick_wakeup is set in reflect, right after the wakeup: true means
        //the CPU was woken by the scheduler tick.
        //next_timer_ns was set at the previous menu_select, i.e. the "time
        //until the next timer event" computed before entering idle.
        //Both being true means the nohz code predicted no event within the
        //tick boundary, yet a tick woke us up anyway.
        //MAX_INTERESTING is 50 ms.
        /*
         * The nohz code said that there wouldn't be any events within
         * the tick boundary (if the tick was stopped), but the idle
         * duration predictor had a differing opinion. Since the CPU
         * was woken up by a tick (that wasn't stopped after all), the
         * predictor was not quite right, so assume that the CPU could
         * have been idle long (but not forever) to help the idle
         * duration predictor do a better job next time.
         */
        measured_ns = 9 * MAX_INTERESTING / 10;
    } else if ((drv->states[last_idx].flags & CPUIDLE_FLAG_POLLING) &&
               dev->poll_time_limit) {
        /*
         * The CPU exited the "polling" state due to a time limit, so
         * the idle duration prediction leading to the selection of that
         * state was inaccurate. If a better prediction had been made,
         * the CPU might have been woken up from idle by the next timer.
         * Assume that to be the case.
         */
        //the CPU left the polling state because poll_time_limit expired, so
        //the prediction that selected it was too short; assume the CPU
        //would have slept until the next timer
        measured_ns = data->next_timer_ns;
    } else {
        /* measured value */
        //how long the CPU actually stayed idle last time
        measured_ns = dev->last_residency_ns;

        /* Deduct exit latency */
        //subtract the exit latency, per the comment above
        if (measured_ns > 2 * target->exit_latency_ns)
            measured_ns -= target->exit_latency_ns;
        else
            measured_ns /= 2;
    }

    /* Make sure our coefficients do not exceed unity */
    //cap the measured residency at the predicted sleep length, so the
    //resulting ratio never exceeds 1.0
    if (measured_ns > data->next_timer_ns)
        measured_ns = data->next_timer_ns;

    /* Update our correction ratio */
    //update the correction factor: take the old factor and decay it by
    //1/DECAY
    new_factor = data->correction_factor[data->bucket];
    new_factor -= new_factor / DECAY;

    //then mix in this period's measured/predicted ratio, normalized to
    //RESOLUTION (1024); MAX_INTERESTING is 50 ms
    if (data->next_timer_ns > 0 && measured_ns < MAX_INTERESTING)
        new_factor += div64_u64(RESOLUTION * measured_ns,
                                data->next_timer_ns);
    else
        /*
         * we were idle so long that we count it as a perfect
         * prediction
         */
        new_factor += RESOLUTION;

    /*
     * We don't want 0 as factor; we always want at least
     * a tiny bit of estimated time. Fortunately, due to rounding,
     * new_factor will stay nonzero regardless of measured_us values
     * and the compiler can eliminate this test as long as DECAY > 1.
     */
    //we never want a factor of 0 (that would predict zero idle time); due
    //to rounding, new_factor stays nonzero anyway, and the compiler can
    //drop this test whenever DECAY > 1 (it is 8 in menu)
    if (DECAY == 1 && unlikely(new_factor == 0))
        new_factor = 1;

    //store it back; data->bucket still indexes the bucket that was used for
    //the period just measured
    data->correction_factor[data->bucket] = new_factor;

    /* update the repeating-pattern data */
    //record this measured residency (in us) for the pattern detector
    data->intervals[data->interval_ptr++] = ktime_to_us(measured_ns);

    //wrap the index around
    if (data->interval_ptr >= INTERVALS)
        data->interval_ptr = 0;
}
2.6.2 tick_nohz_get_sleep_length - estimate how long the current sleep will be
//with CONFIG_NO_HZ_COMMON defined, the tick may be stopped
#ifdef CONFIG_NO_HZ_COMMON
/**
 * tick_nohz_get_sleep_length - return the expected length of the current sleep
 * @delta_next: duration until the next event if the tick cannot be stopped
 *
 * Called from power state control code with interrupts disabled
 */
ktime_t tick_nohz_get_sleep_length(ktime_t *delta_next)
{
    struct clock_event_device *dev = __this_cpu_read(tick_cpu_device.evtdev);
    struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);
    int cpu = smp_processor_id();

    /*
     * The idle entry time is expected to be a sufficient approximation of
     * the current time at this point.
     */
    //use the "idle entry timestamp" as now
    ktime_t now = ts->idle_entrytime;
    ktime_t next_event;

    WARN_ON_ONCE(!ts->inidle);

    //how long this CPU may sleep if the tick cannot be stopped:
    //dev->next_event is when the next event fires; the difference from now
    //is the expected time in idle
    *delta_next = ktime_sub(dev->next_event, now);

    //can the tick be stopped at all?
    if (!can_stop_idle_tick(cpu, ts))
        return *delta_next;

    //timestamp of the next event assuming the tick is stopped
    next_event = tick_nohz_next_event(ts, cpu);
    if (!next_event)
        return *delta_next;

    /*
     * If the next highres timer to expire is earlier than next_event, the
     * idle governor needs to know that.
     */
    //take the next high-resolution timer into account as well
    next_event = min_t(u64, next_event,
                       hrtimer_next_event_without(&ts->sched_timer));

    //recompute how long the CPU may stay in idle
    return ktime_sub(next_event, now);
}
#else
//without CONFIG_NO_HZ_COMMON the tick arrives strictly periodically
static inline ktime_t tick_nohz_get_sleep_length(ktime_t *delta_next)
{
    *delta_next = TICK_NSEC;
    return *delta_next;
}
#endif
2.6.3 nr_iowait_cpu - how many tasks are waiting for IO on this CPU

unsigned long nr_iowait_cpu(int cpu)
{
    return atomic_read(&cpu_rq(cpu)->nr_iowait);
}
2.6.4 which_bucket - pick which bucket's correction factor to use

static inline int which_bucket(
        u64 duration_ns,            //expected sleep length
        unsigned long nr_iowaiters) //number of iowaiters on this CPU
{
    int bucket = 0;

    /*
     * We keep two groups of stats; one with no
     * IO pending, one without.
     * This allows us to calculate
     * E(duration)|iowait
     */
    //BUCKETS is 12: per the comment above, buckets 0-5 are used when there
    //are no iowaiters, buckets 6-11 when there are
    if (nr_iowaiters)
        bucket = BUCKETS/2;

    //within each group, the bucket boundaries are decade-spaced:
    //<10us, <100us, <1ms, <10ms, <100ms, and everything above
    if (duration_ns < 10ULL * NSEC_PER_USEC)
        return bucket;
    if (duration_ns < 100ULL * NSEC_PER_USEC)
        return bucket + 1;
    if (duration_ns < 1000ULL * NSEC_PER_USEC)
        return bucket + 2;
    if (duration_ns < 10000ULL * NSEC_PER_USEC)
        return bucket + 3;
    if (duration_ns < 100000ULL * NSEC_PER_USEC)
        return bucket + 4;
    return bucket + 5;
}
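For example, an expected sleep of 250 us falls in the "< 1 ms" tier, so which_bucket returns 2 when nothing is waiting for IO, and 6 + 2 = 8 when at least one task is in iowait on this CPU: two independently trained correction factors for the same duration range.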
2.6.5 get_typical_interval - detect a repeating pattern via the standard deviation

As the comment below explains, get_typical_interval tries to detect repeating patterns by tracking the last 8 intervals and checking whether the standard deviation of that set of points is below a threshold. If it is, the average of those 8 points is used as the estimate.
/*
 * Try detecting repeating patterns by keeping track of the last 8
 * intervals, and checking if the standard deviation of that set
 * of points is below a threshold. If it is... then use the
 * average of these 8 points as the estimated value.
 */
static unsigned int get_typical_interval(struct menu_device *data,
                                         unsigned int predicted_us)
{
    int i, divisor;
    unsigned int min, max, thresh, avg;
    uint64_t sum, variance;

    //samples above this threshold are discarded as outliers
    thresh = INT_MAX; /* Discard outliers above this value */

again:

    /* First calculate the average of past intervals */
    min = UINT_MAX;
    max = 0;
    sum = 0;
    divisor = 0;

    //accumulate the sum and the sample count for the average;
    //INTERVALS is 8, i.e. an 8-sample window
    for (i = 0; i < INTERVALS; i++) {
        unsigned int value = data->intervals[i];
        if (value <= thresh) {
            sum += value;
            divisor++;

            //track the maximum and minimum along the way
            if (value > max)
                max = value;

            if (value < min)
                min = value;
        }
    }

    /*
     * If the result of the computation is going to be discarded anyway,
     * avoid the computation altogether.
     */
    //if even the minimum sample exceeds the baseline prediction passed in,
    //the result would be discarded anyway (the caller takes the minimum),
    //so skip the computation and return immediately
    if (min >= predicted_us)
        return UINT_MAX;

    //compute the average; a shift suffices when all 8 samples were kept
    if (divisor == INTERVALS)
        avg = sum >> INTERVAL_SHIFT;
    else
        avg = div_u64(sum, divisor);

    /* Then try to determine variance */
    //now compute the variance of the retained samples
    variance = 0;
    for (i = 0; i < INTERVALS; i++) {
        unsigned int value = data->intervals[i];
        if (value <= thresh) {
            int64_t diff = (int64_t)value - avg;
            variance += diff * diff;
        }
    }
    if (divisor == INTERVALS)
        variance >>= INTERVAL_SHIFT;
    else
        do_div(variance, divisor);

    /*
     * The typical interval is obtained when standard deviation is
     * small (stddev <= 20 us, variance <= 400 us^2) or standard
     * deviation is small compared to the average interval (avg >
     * 6*stddev, avg^2 > 36*variance). The average is smaller than
     * UINT_MAX aka U32_MAX, so computing its square does not
     * overflow a u64. We simply reject this candidate average if
     * the standard deviation is greater than 715 s (which is
     * rather unlikely).
     *
     * Use this result only if there is no timer to wake us up sooner.
     */
    //accept the average when the spread is small either absolutely
    //(variance <= 400 us^2, i.e. stddev <= 20 us) or relative to the
    //average (avg > 6*stddev, compared as avg^2 > 36*variance)
    if (likely(variance <= U64_MAX/36)) {
        if ((((u64)avg*avg > variance*36) &&
             (divisor * 4 >= INTERVALS * 3)) ||
            variance <= 400) {
            return avg;
        }
    }

    /*
     * If we have outliers to the upside in our distribution, discard
     * those by setting the threshold to exclude these outliers, then
     * calculate the average and standard deviation again. Once we get
     * down to the bottom 3/4 of our samples, stop excluding samples.
     *
     * This can deal with workloads that have long pauses interspersed
     * with sporadic activity with a bunch of short pauses.
     */
    //if the distribution has upside outliers, lower the threshold to
    //exclude them and recompute; stop once only the bottom 3/4 of the
    //samples remain.  This handles workloads where long pauses are
    //interspersed with bursts of activity and short pauses.
    if ((divisor * 4) <= INTERVALS * 3)
        return UINT_MAX;

    thresh = max - 1;
    goto again;
}
2.7 menu_reflect - flag that an update is needed

Each time the CPU returns from a C-state, the kernel calls the governor's reflect hook, giving the governor a chance to evaluate the outcome of this round and update its statistics so that the next selection is better.

For menu, the reflect hook merely sets the needs_update flag; the real update happens at the next select. Why not update directly in reflect? Because reflect runs with interrupts disabled, it should do as little work as possible and return quickly so interrupts can be re-enabled and their handlers serviced.

menu_reflect is executed on the path below: in the second half of cpuidle_idle_call, when leaving idle, cpuidle_reflect is called so the governor can do its own bookkeeping:
- cpuidle_idle_call -> cpuidle_reflect
/**
 * menu_reflect - records that data structures need update
 * @dev: the CPU
 * @index: the index of actual entered state
 *
 * NOTE: it's important to be fast here because this operation will add to
 * the overall exit latency.
 */
static void menu_reflect(struct cpuidle_device *dev,
                         int index)   //the C-state we just exited
{
    //1. Each CPU has its own menu_device; this also shows that each CPU can
    //   enter idle independently of the other CPUs in its cluster
    struct menu_device *data = this_cpu_ptr(&menu_devices);

    dev->last_state_idx = index;

    //2. Set data->needs_update so the statistics are refreshed at the next
    //   select.  Why defer instead of updating here?  As the idle-thread
    //   analysis showed, local interrupts are disabled before executing wfi;
    //   when an interrupt brings the CPU out of wfi, local_irq is still
    //   disabled, so the CPU cannot jump to the vector table yet: it keeps
    //   executing until interrupts are re-enabled, and only then runs the
    //   handler.  reflect is thus called with interrupts off and must return
    //   as quickly as possible.
    data->needs_update = 1;

    //3. Boolean: true means the wakeup was delivered by the scheduler tick
    data->tick_wakeup = tick_nohz_idle_got_tick();
}
3. ladder

As the name suggests, ladder climbs up and down the C-states one rung at a time.

ladder is used on systems with a periodic timer tick; menu is used on tickless systems.
3.1 Data structures

The core layer already provides cpuidle_device and cpuidle_state to describe a CPU and a C-state, but these are not sufficient for the ladder governor, so ladder defines the two structures below to supplement them.
3.1.1 ladder_device_state - ladder中对C state的补充描述
struct ladder_device_state { struct { u32 promotion_count; //默认值为4 u32 demotion_count; //默认值为1
//在该C state的实际睡眠时间必须在[demotion_time_ns, promotion_time_ns]之间, //超出该范围则说明这个C state不适用,需要重新选择更深或者更浅的睡眠状态 //promotion_time_ns控制上调阈值,实际睡眠时间大于该值,则表示之前选的C state //已经不适合了,需要选择更深睡眠等级 //demotion_time_ns控制下调阈值,如果实际的睡眠时间小于该值,则表示之前选的 //C state也不太适合,此时需要选择更浅的睡眠等级 //注意:这里所说的睡眠时间不包括退出延迟,而是真正的睡眠时间 u64 promotion_time_ns; u64 demotion_time_ns; } threshold;
struct { int promotion_count; int demotion_count; } stats; }; |
3.1.2 ladder_device - ladder's supplementary per-CPU data

struct ladder_device {
    struct ladder_device_state states[CPUIDLE_STATE_MAX];
};
3.2 ladder_governor

static struct cpuidle_governor ladder_governor = {
    .name    = "ladder",
    .rating  = 10,
    .enable  = ladder_enable_device,
    .select  = ladder_select_state,
    .reflect = ladder_reflect,
};
3.3 init_ladder - registration

/**
 * init_ladder - initializes the governor
 */
static int __init init_ladder(void)
{
    /*
     * When NO_HZ is disabled, or when booting with nohz=off, the ladder
     * governor is better so give it a higher rating than the menu
     * governor.
     */
    if (!tick_nohz_enabled)
        ladder_governor.rating = 25;

    return cpuidle_register_governor(&ladder_governor);
}

postcore_initcall(init_ladder);   //runs during driver initialization at boot
3.4 ladder_enable_device - data setup when the governor is enabled

/**
 * ladder_enable_device - setup for the governor
 * @drv: cpuidle driver
 * @dev: the CPU
 */
static int ladder_enable_device(struct cpuidle_driver *drv,
                                struct cpuidle_device *dev)
{
    int i;

    //1. Decide the first C-state to consider.  If the state at index 0 is
    //   tagged CPUIDLE_FLAG_POLLING, it is a busy-wait loop rather than a
    //   real idle state, so ladder skips it: first_idx restricts the climb
    //   to genuine idle states.
    int first_idx = drv->states[0].flags & CPUIDLE_FLAG_POLLING ? 1 : 0;
    struct ladder_device *ldev = &per_cpu(ladder_devices, dev->cpu);
    struct ladder_device_state *lstate;
    struct cpuidle_state *state;

    dev->last_state_idx = first_idx;

    //2. Iterate over all C-states
    for (i = first_idx; i < drv->state_count; i++) {
        state = &drv->states[i];
        lstate = &ldev->states[i];

        lstate->stats.promotion_count = 0;
        lstate->stats.demotion_count = 0;

        lstate->threshold.promotion_count = PROMOTION_COUNT;   //4
        lstate->threshold.demotion_count = DEMOTION_COUNT;     //1

        //both thresholds are initialized from the state's exit latency
        if (i < drv->state_count - 1)
            lstate->threshold.promotion_time_ns = state->exit_latency_ns;
        if (i > first_idx)
            lstate->threshold.demotion_time_ns = state->exit_latency_ns;
    }

    return 0;
}
3.5 ladder_select_state - pick a C-state

/**
 * ladder_select_state - selects the next state to enter
 * @drv: cpuidle driver
 * @dev: the CPU
 * @dummy: not used
 */
static int ladder_select_state(struct cpuidle_driver *drv,
                               struct cpuidle_device *dev, bool *dummy)
{
    struct ladder_device *ldev = this_cpu_ptr(&ladder_devices);
    struct ladder_device_state *last_state;

    //1. The idle state this CPU was in last time
    int last_idx = dev->last_state_idx;

    //2. CPUIDLE_FLAG_POLLING marks a busy-wait, not true idle;
    //   first_idx restricts the search to genuine idle states
    int first_idx = drv->states[0].flags & CPUIDLE_FLAG_POLLING ? 1 : 0;

    //3. The exit latency this CPU can tolerate
    s64 latency_req = cpuidle_governor_latency_req(dev->cpu);
    s64 last_residency;

    /* Special case when user has set very strict latency requirement */
    //4. latency_req == 0 means the system is under a strict constraint and
    //   must not go idle: return 0, the normal/shallowest state
    if (unlikely(latency_req == 0)) {
        ladder_do_selection(dev, ldev, last_idx, 0);
        return 0;
    }

    //5. Per-state data for the state used last time
    last_state = &ldev->states[last_idx];

    //6. Actual time slept last time, excluding the exit latency:
    //   dev->last_residency_ns is the total time spent in the state, which
    //   includes the exit latency; drv->states[last_idx].exit_latency_ns is
    //   that state's exit latency; the difference is the real sleep time
    last_residency = dev->last_residency_ns -
                     drv->states[last_idx].exit_latency_ns;

    /* consider promotion */
    //7. Promotion, i.e. go one level deeper (one rung at a time).  The four
    //   conditions are:
    //   a) last_idx is not the deepest state, so there is room to climb;
    //   b) the next state up (last_idx + 1) is not disabled;
    //   c) the real sleep time exceeded the current state's promotion
    //      threshold;
    //   d) the next state's exit latency is within the tolerable latency
    if (last_idx < drv->state_count - 1 &&
        !dev->states_usage[last_idx + 1].disable &&
        last_residency > last_state->threshold.promotion_time_ns &&
        drv->states[last_idx + 1].exit_latency_ns <= latency_req) {

        //7.1 Update the stats first: bump the promotion count and reset the
        //    demotion count
        last_state->stats.promotion_count++;
        last_state->stats.demotion_count = 0;

        //7.2 Not every promotion request is honored: the promotion only
        //    actually happens once enough consecutive requests have
        //    accumulated (4 by default)
        if (last_state->stats.promotion_count >=
            last_state->threshold.promotion_count) {
            ladder_do_selection(dev, ldev, last_idx, last_idx + 1);
            return last_idx + 1;
        }
    }

    //8. Reaching here means no promotion happened: maybe none was needed,
    //   maybe the consecutive-request count has not been reached yet, or
    //   maybe the deeper state's exit latency does not meet the system's
    //   current requirement.

    /* consider demotion */
    //9. Demotion.  First handle the two urgent cases:
    //   a) the state used last time has been disabled, or
    //   b) its exit latency no longer satisfies the latency requirement
    if (last_idx > first_idx &&
        (dev->states_usage[last_idx].disable ||
         drv->states[last_idx].exit_latency_ns > latency_req)) {
        int i;

        //9.1 Search downwards for a state whose exit latency fits.  If none
        //    qualifies, the loop ends at first_idx, the shallowest genuine
        //    idle state.
        for (i = last_idx - 1; i > first_idx; i--) {
            if (drv->states[i].exit_latency_ns <= latency_req)
                break;
        }

        //9.2 Switch to the state found above
        ladder_do_selection(dev, ldev, last_idx, i);
        return i;
    }

    //10. Regular demotion; the two conditions are:
    //    a) last_idx is not the shallowest state, so there is room to drop;
    //    b) the real time slept last time fell below the demotion threshold
    if (last_idx > first_idx &&
        last_residency < last_state->threshold.demotion_time_ns) {

        //10.1 Update the stats
        last_state->stats.demotion_count++;
        last_state->stats.promotion_count = 0;

        //10.2 As with promotion, the demotion only happens once enough
        //     requests have accumulated; but the default threshold is 1,
        //     so a single demotion request is enough
        if (last_state->stats.demotion_count >=
            last_state->threshold.demotion_count) {
            ladder_do_selection(dev, ldev, last_idx, last_idx - 1);
            return last_idx - 1;
        }
    }

    /* otherwise remain at the current state */
    //11. Stay where we are
    return last_idx;
}
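Putting the hysteresis together with a small example: suppose the CPU sits in C1, with PROMOTION_COUNT = 4 and DEMOTION_COUNT = 1. Four consecutive wakeups whose real residency exceeds C1's promotion_time_ns move it to C2; from then on, a single wakeup whose residency falls below C2's demotion_time_ns drops it straight back to C1. Promotions are deliberately sluggish while demotions are immediate, which biases ladder toward low latency when the workload picks up again.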
3.5.1 ladder_do_selection

/**
 * ladder_do_selection - prepares private data for a state change
 * @ldev: the ladder device
 * @old_idx: the current state index
 * @new_idx: the new target state index
 */
static inline void ladder_do_selection(struct cpuidle_device *dev,
                                       struct ladder_device *ldev,
                                       int old_idx, int new_idx)
{
    //bookkeeping: once a promotion or demotion has happened, reset the
    //counters of the state we are leaving
    ldev->states[old_idx].stats.promotion_count = 0;
    ldev->states[old_idx].stats.demotion_count = 0;
    dev->last_state_idx = new_idx;
}
3.6 ladder_reflect - update the relevant data

For ladder this only updates last_state_idx, i.e. which idle state was used last.

/**
 * ladder_reflect - update the correct last_state_idx
 * @dev: the CPU
 * @index: the index of actual state entered
 */
static void ladder_reflect(struct cpuidle_device *dev, int index)
{
    //reflect merely records which C-state we just woke up from
    if (index > 0)
        dev->last_state_idx = index;
}
4. haltpoll

haltpoll needs no additional data structures.

4.1 Global tunables
static unsigned int guest_halt_poll_ns __read_mostly = 200000;
module_param(guest_halt_poll_ns, uint, 0644);

/* division factor to shrink halt_poll_ns */
static unsigned int guest_halt_poll_shrink __read_mostly = 2;
module_param(guest_halt_poll_shrink, uint, 0644);

/* multiplication factor to grow per-cpu poll_limit_ns */
static unsigned int guest_halt_poll_grow __read_mostly = 2;
module_param(guest_halt_poll_grow, uint, 0644);

/* value in us to start growing per-cpu halt_poll_ns */
static unsigned int guest_halt_poll_grow_start __read_mostly = 50000;
module_param(guest_halt_poll_grow_start, uint, 0644);

/* allow shrinking guest halt poll */
static bool guest_halt_poll_allow_shrink __read_mostly = true;
module_param(guest_halt_poll_allow_shrink, bool, 0644);
4.2 haltpoll_governor

static struct cpuidle_governor haltpoll_governor = {
    .name    = "haltpoll",
    .rating  = 9,
    .enable  = haltpoll_enable_device,
    .select  = haltpoll_select,
    .reflect = haltpoll_reflect,
};
4.3 init_haltpoll - registration

static int __init init_haltpoll(void)
{
    if (kvm_para_available())
        return cpuidle_register_governor(&haltpoll_governor);

    return 0;
}

postcore_initcall(init_haltpoll);   //runs during driver initialization at boot
4.4 haltpoll_enable_device - setup when the governor is enabled

/**
 * haltpoll_enable_device - scans a CPU's states and does setup
 * @drv: cpuidle driver
 * @dev: the CPU
 */
static int haltpoll_enable_device(struct cpuidle_driver *drv,
                                  struct cpuidle_device *dev)
{
    dev->poll_limit_ns = 0;

    return 0;
}
4.5 haltpoll_select - pick an idle level

Note that this function only ever returns 0 or 1: the companion haltpoll driver defines exactly two states, state[0] (poll) and state[1] (halt); see the reader comment at the end of this article.
/**
 * haltpoll_select - selects the next idle state to enter
 * @drv: cpuidle driver containing state data
 * @dev: the CPU
 * @stop_tick: indication on whether or not to stop the tick
 */
static int haltpoll_select(struct cpuidle_driver *drv,
                           struct cpuidle_device *dev,
                           bool *stop_tick)  //out: whether to stop the tick
{
    //1. The latency the system can tolerate
    s64 latency_req = cpuidle_governor_latency_req(dev->cpu);

    //2. latency_req == 0 means the system is under a strict latency
    //   constraint: return state 0 (poll, i.e. no real sleep) and keep the
    //   tick running
    if (!drv->state_count || latency_req == 0) {
        *stop_tick = false;
        return 0;
    }

    //3. poll_limit_ns == 0 means no polling budget has been established yet
    //   (it is reset in haltpoll_enable_device and grown in reflect), so go
    //   straight to state 1, i.e. halt
    if (dev->poll_limit_ns == 0)
        return 1;

    /* Last state was poll? */
    //4. For the haltpoll driver, state[0] is the polling state (not a
    //   "normal" state): if the poll window expired without an event, the
    //   polling was wasted, so halt this time; otherwise poll again
    if (dev->last_state_idx == 0) {
        /* Halt if no event occurred on poll window */
        if (dev->poll_time_limit == true)
            return 1;

        *stop_tick = false;
        /* Otherwise, poll again */
        return 0;
    }

    *stop_tick = false;
    /* Last state was halt: poll */
    return 0;
}
4.6 haltpoll_reflect

/**
 * haltpoll_reflect - update variables and update poll time
 * @dev: the CPU
 * @index: the index of actual entered state
 */
static void haltpoll_reflect(struct cpuidle_device *dev, int index)
{
    dev->last_state_idx = index;

    //after a halt (state 1), adapt the per-CPU poll budget to how long the
    //CPU actually stayed blocked
    if (index != 0)
        adjust_poll_limit(dev, dev->last_residency_ns);
}
4.6.1 adjust_poll_limit

static void adjust_poll_limit(struct cpuidle_device *dev, u64 block_ns)
{
    unsigned int val;

    /* Grow cpu_halt_poll_us if
     * cpu_halt_poll_us < block_ns < guest_halt_poll_us
     */
    if (block_ns > dev->poll_limit_ns && block_ns <= guest_halt_poll_ns) {
        val = dev->poll_limit_ns * guest_halt_poll_grow;

        //clamp val to [guest_halt_poll_grow_start, guest_halt_poll_ns]
        if (val < guest_halt_poll_grow_start)
            val = guest_halt_poll_grow_start;
        if (val > guest_halt_poll_ns)
            val = guest_halt_poll_ns;

        dev->poll_limit_ns = val;
    } else if (block_ns > guest_halt_poll_ns &&
               guest_halt_poll_allow_shrink) {
        unsigned int shrink = guest_halt_poll_shrink;

        val = dev->poll_limit_ns;
        if (shrink == 0)
            val = 0;
        else
            val /= shrink;
        dev->poll_limit_ns = val;
    }
}
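Tracing the default tunables through one scenario: poll_limit_ns starts at 0, so the first select halts (returns 1). If that halt blocks for, say, 60 us (block_ns = 60000, which is above poll_limit_ns and within guest_halt_poll_ns = 200 us), the grow path runs: 0 * 2 = 0 is raised to guest_halt_poll_grow_start = 50 us. Further short blocks double the budget to 100 us and then 200 us, where it is capped; a block longer than 200 us means polling was pointless, so the budget is divided by guest_halt_poll_shrink (2). The CPU therefore polls only as long as recent history suggests an event will arrive quickly.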
5. teo

teo stands for "timer events oriented".
5.1 Data structures

5.1.1 teo_idle_state - describes a C-state

/**
 * struct teo_idle_state - Idle state data used by the TEO cpuidle governor.
 * @early_hits: "Early" CPU wakeups "matching" this state.
 * @hits: "On time" CPU wakeups "matching" this state.
 * @misses: CPU wakeups "missing" this state.
 *
 * A CPU wakeup is "matched" by a given idle state if the idle duration measured
 * after the wakeup is between the target residency of that state and the target
 * residency of the next one (or if this is the deepest available idle state, it
 * "matches" a CPU wakeup when the measured idle duration is at least equal to
 * its target residency).
 *
 * Also, from the TEO governor perspective, a CPU wakeup from idle is "early" if
 * it occurs significantly earlier than the closest expected timer event (that
 * is, early enough to match an idle state shallower than the one matching the
 * time till the closest timer event). Otherwise, the wakeup is "on time", or
 * it is a "hit".
 *
 * A "miss" occurs when the given state doesn't match the wakeup, but it matches
 * the time till the closest timer event used for idle state selection.
 */
struct teo_idle_state {
    unsigned int early_hits;
    unsigned int hits;
    unsigned int misses;
};
5.1.2 teo_cpu - describes a CPU

/**
 * struct teo_cpu - CPU data used by the TEO cpuidle governor.
 * @time_span_ns: Time between idle state selection and post-wakeup update.
 * @sleep_length_ns: Time till the closest timer event (at the selection time).
 * @states: Idle states data corresponding to this CPU.
 * @interval_idx: Index of the most recent saved idle interval.
 * @intervals: Saved idle duration values.
 */
struct teo_cpu {
    u64 time_span_ns;
    u64 sleep_length_ns;
    struct teo_idle_state states[CPUIDLE_STATE_MAX];
    int interval_idx;
    u64 intervals[INTERVALS];
};
5.2 teo_governor

static struct cpuidle_governor teo_governor = {
    .name    = "teo",
    .rating  = 19,
    .enable  = teo_enable_device,
    .select  = teo_select,
    .reflect = teo_reflect,
};
5.3 teo_governor_init - registration

static int __init teo_governor_init(void)
{
    return cpuidle_register_governor(&teo_governor);
}

postcore_initcall(teo_governor_init);
5.4 teo_enable_device - data initialization when the governor is enabled

/**
 * teo_enable_device - Initialize the governor's data for the target CPU.
 * @drv: cpuidle driver (not used).
 * @dev: Target CPU.
 */
static int teo_enable_device(struct cpuidle_driver *drv,
                             struct cpuidle_device *dev)
{
    struct teo_cpu *cpu_data = per_cpu_ptr(&teo_cpus, dev->cpu);
    int i;

    memset(cpu_data, 0, sizeof(*cpu_data));

    //INTERVALS is 8 in teo as well: the last 8 measured idle durations
    for (i = 0; i < INTERVALS; i++)
        cpu_data->intervals[i] = U64_MAX;

    return 0;
}
5.5 teo_select - pick an idle level

/**
 * teo_select - Selects the next idle state to enter.
 * @drv: cpuidle driver containing state data.
 * @dev: Target CPU.
 * @stop_tick: Indication on whether or not to stop the scheduler tick.
 */
static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
                      bool *stop_tick)   //out: whether to stop the tick
{
    struct teo_cpu *cpu_data = per_cpu_ptr(&teo_cpus, dev->cpu);

    //1. The latency the system can tolerate
    s64 latency_req = cpuidle_governor_latency_req(dev->cpu);
    u64 duration_ns;
    unsigned int hits, misses, early_hits;
    int max_early_idx, prev_max_early_idx, constraint_idx, idx, i;
    ktime_t delta_tick;

    //2. last_state_idx >= 0 means the CPU really did go idle last time, so
    //   update the statistics
    if (dev->last_state_idx >= 0) {
        teo_update(drv, dev);
        dev->last_state_idx = -1;
    }

    cpu_data->time_span_ns = local_clock();

    duration_ns = tick_nohz_get_sleep_length(&delta_tick);
    cpu_data->sleep_length_ns = duration_ns;

    hits = 0;
    misses = 0;
    early_hits = 0;
    max_early_idx = -1;
    prev_max_early_idx = -1;
    constraint_idx = drv->state_count;
    idx = -1;

    for (i = 0; i < drv->state_count; i++) {
        struct cpuidle_state *s = &drv->states[i];

        if (dev->states_usage[i].disable) {
            /*
             * Ignore disabled states with target residencies beyond
             * the anticipated idle duration.
             */
            if (s->target_residency_ns > duration_ns)
                continue;

            /*
             * This state is disabled, so the range of idle duration
             * values corresponding to it is covered by the current
             * candidate state, but still the "hits" and "misses"
             * metrics of the disabled state need to be used to
             * decide whether or not the state covering the range in
             * question is good enough.
             */
            hits = cpu_data->states[i].hits;
            misses = cpu_data->states[i].misses;

            if (early_hits >= cpu_data->states[i].early_hits ||
                idx < 0)
                continue;

            /*
             * If the current candidate state has been the one with
             * the maximum "early hits" metric so far, the "early
             * hits" metric of the disabled state replaces the
             * current "early hits" count to avoid selecting a
             * deeper state with lower "early hits" metric.
             */
            if (max_early_idx == idx) {
                early_hits = cpu_data->states[i].early_hits;
                continue;
            }

            /*
             * The current candidate state is closer to the disabled
             * one than the current maximum "early hits" state, so
             * replace the latter with it, but in case the maximum
             * "early hits" state index has not been set so far,
             * check if the current candidate state is not too
             * shallow for that role.
             */
            if (teo_time_ok(drv->states[idx].target_residency_ns)) {
                prev_max_early_idx = max_early_idx;
                early_hits = cpu_data->states[i].early_hits;
                max_early_idx = idx;
            }

            continue;
        }

        if (idx < 0) {
            idx = i; /* first enabled state */
            hits = cpu_data->states[i].hits;
            misses = cpu_data->states[i].misses;
        }

        if (s->target_residency_ns > duration_ns)
            break;

        if (s->exit_latency_ns > latency_req && constraint_idx > i)
            constraint_idx = i;

        idx = i;
        hits = cpu_data->states[i].hits;
        misses = cpu_data->states[i].misses;

        if (early_hits < cpu_data->states[i].early_hits &&
            teo_time_ok(drv->states[i].target_residency_ns)) {
            prev_max_early_idx = max_early_idx;
            early_hits = cpu_data->states[i].early_hits;
            max_early_idx = i;
        }
    }

    /*
     * If the "hits" metric of the idle state matching the sleep length is
     * greater than its "misses" metric, that is the one to use. Otherwise,
     * it is more likely that one of the shallower states will match the
     * idle duration observed after wakeup, so take the one with the maximum
     * "early hits" metric, but if that cannot be determined, just use the
     * state selected so far.
     */
    if (hits <= misses) {
        /*
         * The current candidate state is not suitable, so take the one
         * whose "early hits" metric is the maximum for the range of
         * shallower states.
         */
        if (idx == max_early_idx)
            max_early_idx = prev_max_early_idx;

        if (max_early_idx >= 0) {
            idx = max_early_idx;
            duration_ns = drv->states[idx].target_residency_ns;
        }
    }

    /*
     * If there is a latency constraint, it may be necessary to use a
     * shallower idle state than the one selected so far.
     */
    if (constraint_idx < idx)
        idx = constraint_idx;

    if (idx < 0) {
        idx = 0; /* No states enabled. Must use 0. */
    } else if (idx > 0) {
        unsigned int count = 0;
        u64 sum = 0;

        /*
         * Count and sum the most recent idle duration values less than
         * the current expected idle duration value.
         */
        for (i = 0; i < INTERVALS; i++) {
            u64 val = cpu_data->intervals[i];

            if (val >= duration_ns)
                continue;

            count++;
            sum += val;
        }

        /*
         * Give up unless the majority of the most recent idle duration
         * values are in the interesting range.
         */
        if (count > INTERVALS / 2) {
            u64 avg_ns = div64_u64(sum, count);

            /*
             * Avoid spending too much time in an idle state that
             * would be too shallow.
             */
            if (teo_time_ok(avg_ns)) {
                duration_ns = avg_ns;
                if (drv->states[idx].target_residency_ns > avg_ns)
                    idx = teo_find_shallower_state(drv, dev,
                                                   idx, avg_ns);
            }
        }
    }

    /*
     * Don't stop the tick if the selected state is a polling one or if the
     * expected idle duration is shorter than the tick period length.
     */
    if (((drv->states[idx].flags & CPUIDLE_FLAG_POLLING) ||
         duration_ns < TICK_NSEC) && !tick_nohz_tick_stopped()) {
        *stop_tick = false;

        /*
         * The tick is not going to be stopped, so if the target
         * residency of the state to be returned is not within the time
         * till the closest timer including the tick, try to correct
         * that.
         */
        if (idx > 0 && drv->states[idx].target_residency_ns > delta_tick)
            idx = teo_find_shallower_state(drv, dev, idx, delta_tick);
    }

    return idx;
}
5.5.1 teo_update

/**
 * teo_update - Update CPU data after wakeup.
 * @drv: cpuidle driver containing state data.
 * @dev: Target CPU.
 */
static void teo_update(struct cpuidle_driver *drv, struct cpuidle_device *dev)
{
    struct teo_cpu *cpu_data = per_cpu_ptr(&teo_cpus, dev->cpu);
    int i, idx_hit = -1, idx_timer = -1;
    u64 measured_ns;

    if (cpu_data->time_span_ns >= cpu_data->sleep_length_ns) {
        /*
         * One of the safety nets has triggered or the wakeup was close
         * enough to the closest timer event expected at the idle state
         * selection time to be discarded.
         */
        measured_ns = U64_MAX;
    } else {
        u64 lat_ns = drv->states[dev->last_state_idx].exit_latency_ns;

        /*
         * The computations below are to determine whether or not the
         * (saved) time till the next timer event and the measured idle
         * duration fall into the same "bin", so use last_residency_ns
         * for that instead of time_span_ns which includes the cpuidle
         * overhead.
         */
        measured_ns = dev->last_residency_ns;
        /*
         * The delay between the wakeup and the first instruction
         * executed by the CPU is not likely to be worst-case every
         * time, so take 1/2 of the exit latency as a very rough
         * approximation of the average of it.
         */
        if (measured_ns >= lat_ns)
            measured_ns -= lat_ns / 2;
        else
            measured_ns /= 2;
    }

    /*
     * Decay the "early hits" metric for all of the states and find the
     * states matching the sleep length and the measured idle duration.
     */
    for (i = 0; i < drv->state_count; i++) {
        unsigned int early_hits = cpu_data->states[i].early_hits;

        cpu_data->states[i].early_hits -= early_hits >> DECAY_SHIFT;

        if (drv->states[i].target_residency_ns <=
            cpu_data->sleep_length_ns) {
            idx_timer = i;
            if (drv->states[i].target_residency_ns <= measured_ns)
                idx_hit = i;
        }
    }

    /*
     * Update the "hits" and "misses" data for the state matching the sleep
     * length. If it matches the measured idle duration too, this is a hit,
     * so increase the "hits" metric for it then. Otherwise, this is a
     * miss, so increase the "misses" metric for it. In the latter case
     * also increase the "early hits" metric for the state that actually
     * matches the measured idle duration.
     */
    if (idx_timer >= 0) {
        unsigned int hits = cpu_data->states[idx_timer].hits;
        unsigned int misses = cpu_data->states[idx_timer].misses;

        hits -= hits >> DECAY_SHIFT;
        misses -= misses >> DECAY_SHIFT;

        if (idx_timer > idx_hit) {
            misses += PULSE;
            if (idx_hit >= 0)
                cpu_data->states[idx_hit].early_hits += PULSE;
        } else {
            hits += PULSE;
        }

        cpu_data->states[idx_timer].misses = misses;
        cpu_data->states[idx_timer].hits = hits;
    }

    /*
     * Save idle duration values corresponding to non-timer wakeups for
     * pattern detection.
     */
    cpu_data->intervals[cpu_data->interval_idx++] = measured_ns;
    if (cpu_data->interval_idx >= INTERVALS)
        cpu_data->interval_idx = 0;
}
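A note on the decaying metrics above, assuming the 5.10 values DECAY_SHIFT == 3 and PULSE == 1024 (check the macros in teo.c before relying on them): each update first discards 1/8 of a metric and then adds PULSE on the matching outcome. A metric that is reinforced every single time converges to the fixed point m = m - m/8 + 1024, i.e. m = 8192, while one that stops being reinforced decays geometrically toward 0. The comparisons between hits, misses and early_hits therefore compare recent event frequencies, not lifetime counts.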
5.6 teo_reflect - flag that an update is needed

/**
 * teo_reflect - Note that governor data for the CPU need to be updated.
 * @dev: Target CPU.
 * @state: Entered state.
 */
static void teo_reflect(struct cpuidle_device *dev,
                        int state)   //the C-state we just exited
{
    struct teo_cpu *cpu_data = per_cpu_ptr(&teo_cpus, dev->cpu);

    //1. Record the C-state used last time
    dev->last_state_idx = state;
    /*
     * If the wakeup was not "natural", but triggered by one of the safety
     * nets, assume that the CPU might have been idle for the entire sleep
     * length time.
     */
    if (dev->poll_time_limit ||
        (tick_nohz_idle_got_tick() && cpu_data->sleep_length_ns > TICK_NSEC)) {
        dev->poll_time_limit = false;
        cpu_data->time_span_ns = cpu_data->sleep_length_ns;
    } else {
        cpu_data->time_span_ns = local_clock() - cpu_data->time_span_ns;
    }
}
Comments

haltpoll is a cpuidle governor used to optimize virtual machine performance. drivers/cpuidle/cpuidle-haltpoll.c implements a dedicated haltpoll_driver that defines only two states: state[0] is the poll state, and state[1] enters the default idle handler through the default_enter_idle callback.

https://lwn.net/Articles/792618/

@Clarke Li Thanks for the reply; I'll study this area further and improve the article later!