
Note: due to limited personal time and energy this article is unfinished; I may complete it later when I get the chance.
The CPU provides several idle levels, which differ mainly in the power consumed while idle and in the latency of exiting idle. The driver layer defines these idle states (i.e. describes each state's power consumption and exit latency) and implements the enter/exit operations. The driver then hands this information to the governor layer, which decides, for the workload at hand, which idle level to use and on what basis: the CPU load? the number of iowaiters in the system?

The governor layer's job is therefore: define the selection rules and, following those rules, pick a suitable C-state.
1. Computing the latency the system can tolerate

1.1 cpuidle_governor_latency_req - compute the latency the system can tolerate

This is computed via PM QoS; PM QoS itself will be covered in a separate article.

linux-5.10.61/drivers/cpuidle/governor.c
/**
 * cpuidle_governor_latency_req - Compute a latency constraint for CPU
 * @cpu: Target CPU
 */
s64 cpuidle_governor_latency_req(unsigned int cpu)
{
    struct device *device = get_cpu_device(cpu);

    //1. Compute the latency the system can tolerate, via PM QoS
    //   (PM QoS itself is covered in a later article)
    int device_req = dev_pm_qos_raw_resume_latency(device);
    int global_req = cpu_latency_qos_limit();

    //2. Take the smaller, i.e. the stricter, of the two values
    if (device_req > global_req)
        device_req = global_req;

    //3. Convert from microseconds to nanoseconds
    return (s64)device_req * NSEC_PER_USEC;
}
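As a hedged illustration of where the global input comes from: since v5.4 the kernel exposes the cpu_latency_qos_* API (the value read back by cpu_latency_qos_limit() above). The sketch below is a hypothetical kernel module, not anything from this article's sources, pinning the global CPU latency limit:

static struct pm_qos_request dumpstack_qos;  /* made-up name */

static int __init dumpstack_qos_init(void)
{
    /* Ask the kernel to keep CPU wakeup latency under 20 us.
     * cpu_latency_qos_limit() now returns at most 20, so C-states
     * with a larger exit latency will no longer be selected. */
    cpu_latency_qos_add_request(&dumpstack_qos, 20);
    return 0;
}

static void __exit dumpstack_qos_exit(void)
{
    cpu_latency_qos_remove_request(&dumpstack_qos);
}

module_init(dumpstack_qos_init);
module_exit(dumpstack_qos_exit);
MODULE_LICENSE("GPL");

Userspace can achieve the same by holding /dev/cpu_dma_latency open with the desired value written to it.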
2. menu

ladder is used on systems with a periodic timer tick; menu is used on tickless systems.

2.1 Design principles

Note: the following is adapted from an article by WoWo Tech (蜗窝科技); credit to the original author!
The governor's main job is to pick a suitable idle state (in standard kernel terminology a C-state, i.e. CPU state) based on how the system is running. The algorithm rests on two considerations:

1) The cost of switching (called the "minimum residency" below)

The point of entering a C-state is to save power, but switching between a C-state and the normal state itself costs power. That cost is captured in the target_residency field of struct cpuidle_state (the "minimum residency"): when registering a cpuidle_state, the driver must state explicitly how long the CPU has to stay in that C-state for entering it to pay off.

So when choosing a C-state, the governor must predict how long the CPU is going to stay idle, compare that with the target_residency of each candidate cpuidle_state, and pick only C-states satisfying "predicted residency > minimum residency (target_residency)".

2) How much exit latency the system can tolerate

Among the candidate C-states, power and exit latency are an irreconcilable trade-off: the deeper the idle level, the deeper the sleep, so the lower the power but the longer the exit latency; conversely, the shallower the sleep, the higher the power but the shorter the exit latency. The goal of power management is to save as much power as possible while keeping the exit latency within what the system can accept.
To describe power and exit latency, struct cpuidle_state provides two more fields:

- the CPU's power consumption in the C-state, in the power_usage field;
- the latency of exiting the C-state, in the exit_latency field.
Then, if we know the latency the system can currently tolerate (the latency_req variable in the code), we can take all C-states whose exit_latency is below latency_req and pick the one with the lowest power. The governor's problem thus reduces to obtaining "the latency the system can currently tolerate", which is exactly what PM QoS is good at.
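A minimal sketch of that selection rule (not the kernel's actual code; the helper name pick_state is made up). It relies only on the fact that cpuidle_driver->states[] is ordered shallow to deep, i.e. by decreasing power:

/* states[] is ordered shallow -> deep: increasing exit latency and
 * target residency, decreasing power. */
static int pick_state(const struct cpuidle_state *states, int count,
                      s64 predicted_ns, s64 latency_req)
{
    int i, best = 0;

    for (i = 0; i < count; i++) {
        if (states[i].target_residency_ns > predicted_ns)
            break;      /* too deep: entry/exit cost would not pay off */
        if (states[i].exit_latency_ns > latency_req)
            break;      /* too deep: wakeup would violate the latency bound */
        best = i;       /* deepest state satisfying both constraints so far */
    }
    return best;        /* lowest-power admissible state */
}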
Based on these considerations, the menu governor's work reduces to two tasks:

- predict how long the CPU will stay in the C-state, based on how the system is running (the predicted_us variable);
- obtain the latency the system can currently tolerate, with help from the PM QoS framework (the latency_req variable).
For task 1, menu proceeds as follows.

As mentioned above, menu is used on tickless systems. As a simplification, menu takes the time until the next timer event (menu_device::next_timer_us) as the baseline residency estimate predicted_us.

Of course this baseline is inaccurate: other wakeup events can occur at any moment before the next timer event. To improve the prediction, a correction factor is applied (menu_device::correction_factor), equal to the ratio between past actual residencies and next_timer_us. For example, if actual wakeups always occurred at half of the predicted next-timer interval, the factor would be 0.5. For better accuracy, menu maintains this factor as a decaying (running) average.
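A minimal sketch of that decaying average, using the kernel's constants but a made-up helper name. The factor is kept in fixed point, scaled by RESOLUTION * DECAY, so integer arithmetic suffices:

#include <linux/math64.h>

#define RESOLUTION 1024
#define DECAY      8

/* Fold one observation (measured residency vs. predicted sleep length)
 * into the running correction factor: drop 1/DECAY of the old value and
 * mix in the latest measured/expected ratio, scaled by RESOLUTION. */
static unsigned int update_factor(unsigned int factor,
                                  u64 measured_ns, u64 next_timer_ns)
{
    factor -= factor / DECAY;
    factor += div64_u64(RESOLUTION * measured_ns, next_timer_ns);
    return factor;
}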
Moreover, the correction factor does not affect all ranges of next_timer_us equally. For example, for next timer events of 50 ms and 500 ms, with wakeups in both cases at 10 ms, the factor clearly matters more for the 500 ms case. Mixing the two when computing the average would hurt the prediction, so the correction factor is kept per range of next_timer_us: next_timer_us is divided into tiers, the "buckets" described below.

Likewise, whether or not the system has outstanding IO waits changes how the correction factor should behave.

For these reasons menu keeps an array of correction factors, menu_device::correction_factor[BUCKETS], currently 12 of them, one per combination of next_timer_us range and IO-wait condition.

Finally, there are cases where the next_timer_us prediction is simply wrong, e.g. with fixed-period interrupts (audio, etc.). For these, menu uses a different predictor: it computes the standard deviation of the last 8 residencies, and if it falls below a threshold, it uses the average of those 8 residencies as the prediction.
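A worked example of that detector: suppose the last 8 residencies (in us) were 1000, 980, 1020, 990, 1010, 1000, 995 and 1005. The average is 1000 us, and the variance is (0^2 + 20^2 + 20^2 + 10^2 + 10^2 + 0^2 + 5^2 + 5^2)/8, roughly 131 us^2, i.e. a standard deviation of about 11.5 us. That is below the kernel's "small in absolute terms" threshold (variance <= 400 us^2, i.e. stddev <= 20 us), so menu would predict 1000 us regardless of when the next timer fires.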
Task 2 is the estimation of the exit latency the system can tolerate (latency_req). menu combines two factors:

1) the CPU/DMA latency requirement obtained from PM QoS; this is a hard constraint;

2) a rule of thumb: the busier the system, the more latency-sensitive it is (i.e. the shorter the exit latency must be). Combining the residency predicted in task 1 (predicted_us) with the current CPU load average and iowaiter count, older kernels computed an empirical bound as

latency_req = predicted_us / (1 + 2 * loadavg + 10 * iowaiters)

This ratio between exit latency and expected residency means: the larger loadavg and the iowaiter count, the stricter the requirement, i.e. the lower the allowed exit latency. Note that in the 5.10 kernel quoted in this article the load-average term has been dropped upstream, so performance_multiplier() is simply 1 + 10 * nr_iowaiters, with the iowaiter count obtained from nr_iowait_cpu().

Finally, latency_req is the minimum of these two estimates.
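A quick numeric check: with predicted_us = 600, loadavg = 1 and iowaiters = 2, the old formula gives 600 / (1 + 2*1 + 10*2) = 600/23, about 26 us; the 5.10 multiplier gives 600 / (1 + 10*2), about 28 us. If PM QoS allows, say, 100 us, the final latency_req is the smaller value, so only C-states with an exit latency below roughly 26-28 us remain candidates.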
The intent behind menu is documented in detail in a comment in the kernel source:

linux-5.10.61/drivers/cpuidle/governors/menu.c
/*
 * Concepts and ideas behind the menu governor
 *
 * For the menu governor, there are 3 decision factors for picking a C
 * state:
 * 1) Energy break even point
 * 2) Performance impact
 * 3) Latency tolerance (from pmqos infrastructure)
 * These these three factors are treated independently.
 *
 * Energy break even point
 * -----------------------
 * C state entry and exit have an energy cost, and a certain amount of time in
 * the C state is required to actually break even on this cost. CPUIDLE
 * provides us this duration in the "target_residency" field. So all that we
 * need is a good prediction of how long we'll be idle. Like the traditional
 * menu governor, we start with the actual known "next timer event" time.
 *
 * Since there are other source of wakeups (interrupts for example) than
 * the next timer event, this estimation is rather optimistic. To get a
 * more realistic estimate, a correction factor is applied to the estimate,
 * that is based on historic behavior. For example, if in the past the actual
 * duration always was 50% of the next timer tick, the correction factor will
 * be 0.5.
 *
 * menu uses a running average for this correction factor, however it uses a
 * set of factors, not just a single factor. This stems from the realization
 * that the ratio is dependent on the order of magnitude of the expected
 * duration; if we expect 500 milliseconds of idle time the likelihood of
 * getting an interrupt very early is much higher than if we expect 50 micro
 * seconds of idle time. A second independent factor that has big impact on
 * the actual factor is if there is (disk) IO outstanding or not.
 * (as a special twist, we consider every sleep longer than 50 milliseconds
 * as perfect; there are no power gains for sleeping longer than this)
 *
 * For these two reasons we keep an array of 12 independent factors, that gets
 * indexed based on the magnitude of the expected duration as well as the
 * "is IO outstanding" property.
 *
 * Repeatable-interval-detector
 * ----------------------------
 * There are some cases where "next timer" is a completely unusable predictor:
 * Those cases where the interval is fixed, for example due to hardware
 * interrupt mitigation, but also due to fixed transfer rate devices such as
 * mice.
 * For this, we use a different predictor: We track the duration of the last 8
 * intervals and if the stand deviation of these 8 intervals is below a
 * threshold value, we use the average of these intervals as prediction.
 *
 * Limiting Performance Impact
 * ---------------------------
 * C states, especially those with large exit latencies, can have a real
 * noticeable impact on workloads, which is not acceptable for most sysadmins,
 * and in addition, less performance has a power price of its own.
 *
 * As a general rule of thumb, menu assumes that the following heuristic
 * holds:
 *     The busier the system, the less impact of C states is acceptable
 *
 * This rule-of-thumb is implemented using a performance-multiplier:
 * If the exit latency times the performance multiplier is longer than
 * the predicted duration, the C state is not considered a candidate
 * for selection due to a too high performance impact. So the higher
 * this multiplier is, the longer we need to be idle to pick a deep C
 * state, and thus the less likely a busy CPU will hit such a deep
 * C state.
 *
 * Two factors are used in determing this multiplier:
 * a value of 10 is added for each point of "per cpu load average" we have.
 * a value of 5 points is added for each process that is waiting for
 * IO on this CPU.
 * (these values are experimentally determined)
 *
 * The load average factor gives a longer term (few seconds) input to the
 * decision, while the iowait value gives a cpu local instantanious input.
 * The iowait factor may look low, but realize that this is also already
 * represented in the system load average.
 */
2.2 menu_device - menu's per-CPU bookkeeping data

struct menu_device {
    //Each time the CPU returns from a C-state, the kernel calls the
    //governor's reflect hook, giving the governor a chance to evaluate the
    //outcome of this round and update its statistics so the next selection
    //is better.  menu's reflect hook merely sets the needs_update flag; the
    //actual update happens at the next select.  Why not do it directly in
    //reflect?  Because reflect runs in a context with interrupts disabled:
    //it must finish quickly so interrupts can be re-enabled and serviced.
    //See the detailed discussion below.
    int needs_update;

    //flag: whether the last wakeup from idle was delivered by the scheduler
    //tick (set in menu_reflect via tick_nohz_idle_got_tick())
    int tick_wakeup;

    //time until the next timer event, recorded at selection time
    u64 next_timer_ns;

    //which correction factor to use for this selection, i.e. an index into
    //correction_factor[] below
    unsigned int bucket;

    //the correction factors; BUCKETS is currently 12
    unsigned int correction_factor[BUCKETS];

    //the last INTERVALS (currently 8) residencies, used for the standard
    //deviation check described above
    unsigned int intervals[INTERVALS];

    //index into intervals[] above
    int interval_ptr;
};
2.3 menu_governor

static struct cpuidle_governor menu_governor = {
    .name    = "menu",
    .rating  = 20,
    .enable  = menu_enable_device,
    .select  = menu_select,
    .reflect = menu_reflect,
};
2.4 init_menu - registration

/**
 * init_menu - initializes the governor
 */
static int __init init_menu(void)
{
    return cpuidle_register_governor(&menu_governor);
}

postcore_initcall(init_menu);   //runs during driver initialization at boot
2.5 menu_enable_device - per-CPU setup when the governor is enabled

cpuidle_enable_device calls this hook through cpuidle_curr_governor->enable. As the call paths below show, it runs when a CPU is registered at boot or hot-plugged, and initializes the governor's per-CPU data for that CPU; for menu this mainly means initializing the correction factors:

- cpuidle_register_device -> cpuidle_enable_device
- acpi_processor_hotplug -> cpuidle_enable_device
- acpi_processor_power_state_has_changed -> cpuidle_enable_device
/**
 * menu_enable_device - scans a CPU's states and does setup
 * @drv: cpuidle driver
 * @dev: the CPU
 */
static int menu_enable_device(struct cpuidle_driver *drv,
                              struct cpuidle_device *dev)
{
    //1. Each CPU has its own menu_device
    struct menu_device *data = &per_cpu(menu_devices, dev->cpu);
    int i;

    memset(data, 0, sizeof(struct menu_device));

    /*
     * if the correction factor is 0 (eg first time init or cpu hotplug
     * etc), we actually want to start out with a unity factor.
     */
    //2. Initialize the correction factors.  BUCKETS = 12, RESOLUTION = 1024,
    //   DECAY = 8; RESOLUTION * DECAY = 8192 is the fixed-point encoding of
    //   a ratio of 1.0 ("unity")
    for (i = 0; i < BUCKETS; i++)
        data->correction_factor[i] = RESOLUTION * DECAY;

    return 0;
}
2.6 menu_select - pick a C-state

The call path is shown below: in cpuidle_idle_call, a suitable C-state is selected first and then entered:

- cpuidle_idle_call -> cpuidle_select
/**
 * menu_select - selects the next idle state to enter
 * @drv: cpuidle driver containing state data
 * @dev: the CPU
 * @stop_tick: indication on whether or not to stop the tick
 */
static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
                       bool *stop_tick)  //out: whether the tick should be stopped
{
    struct menu_device *data = this_cpu_ptr(&menu_devices);

    //1. Compute the exit latency the system can tolerate (latency_req),
    //   via PM QoS (see section 1.1)
    s64 latency_req = cpuidle_governor_latency_req(dev->cpu);

    //predicted_us is "how long we expect to stay idle"
    unsigned int predicted_us;
    u64 predicted_ns;
    u64 interactivity_req;
    unsigned long nr_iowaiters;
    ktime_t delta_next;
    int i, idx;

    //2. If needs_update is set, call menu_update to refresh the statistics
    //   for the previous idle period.  This work conceptually belongs in
    //   reflect, but reflect runs with interrupts disabled and must be
    //   quick, so it is deferred to here.
    if (data->needs_update) {
        menu_update(drv, dev);
        data->needs_update = 0;
    }

    /* determine the expected residency time, round up */
    //3. Expected length of this sleep, i.e. time until the next timer event
    data->next_timer_ns = tick_nohz_get_sleep_length(&delta_next);

    //4. How many tasks are waiting for IO on this CPU
    nr_iowaiters = nr_iowait_cpu(dev->cpu);

    //5. From the expected sleep length and the iowaiter count, pick which
    //   bucket's correction factor to use
    data->bucket = which_bucket(data->next_timer_ns, nr_iowaiters);

    //6. Return 0 (state 0, the shallowest state) when any of the following
    //   holds:
    //   a) the driver has only one state, so there is nothing to choose;
    //   b) the tolerable exit latency computed above is 0, i.e. the system
    //      is under a strict latency constraint and must not go idle;
    //   c) even the shallowest real idle state (state 1) does not qualify:
    //      i)  residency: drv->states[] is ordered from shallow to deep, so
    //          if the expected sleep next_timer_ns is below state 1's
    //          target residency, no real idle state is worthwhile;
    //      ii) latency: the tolerable exit latency is below state 1's exit
    //          latency (deeper states have even larger exit latencies);
    //      iii) ...and state 0 is not disabled (!dev->states_usage[0].disable),
    //          so it may actually be returned directly here.
    if (unlikely(drv->state_count <= 1 || latency_req == 0) ||
        ((data->next_timer_ns < drv->states[1].target_residency_ns ||
          latency_req < drv->states[1].exit_latency_ns) &&
         !dev->states_usage[0].disable)) {
        /*
         * In this case state[0] will be used no matter what, so return
         * it right away and keep the tick running if state[0] is a
         * polling one.
         */
        //6.1 Not going idle; whether to stop the tick depends on whether
        //    state[0] is a polling state
        *stop_tick = !(drv->states[0].flags & CPUIDLE_FLAG_POLLING);
        return 0;
    }

    /* Round up the result for half microseconds. */
    //7. predicted_us = next_timer_ns * correction factor, converted to us.
    //   The factor is fixed point, scaled by RESOLUTION (1024) * DECAY (8),
    //   so the product is divided by RESOLUTION * DECAY * NSEC_PER_USEC;
    //   adding half the divisor first rounds to nearest.
    predicted_us = div_u64(data->next_timer_ns *
                           data->correction_factor[data->bucket] +
                           (RESOLUTION * DECAY * NSEC_PER_USEC) / 2,
                           RESOLUTION * DECAY * NSEC_PER_USEC);

    /* Use the lowest expected idle interval to pick the idle state. */
    //8. get_typical_interval checks for a repeating pattern by computing the
    //   standard deviation of the last 8 residencies; take the minimum of
    //   the two predictions
    predicted_ns = (u64)min(predicted_us,
                            get_typical_interval(data, predicted_us)) *
                            NSEC_PER_USEC;

    if (tick_nohz_tick_stopped()) {
        /*
         * If the tick is already stopped, the cost of possible short
         * idle duration misprediction is much higher, because the CPU
         * may be stuck in a shallow idle state for a long time as a
         * result of it.  In that case say we might mispredict and use
         * the known time till the closest timer event for the idle
         * state selection.
         */
        if (predicted_ns < TICK_NSEC)
            predicted_ns = delta_next;
    } else {
        /*
         * Use the performance multiplier and the user-configurable
         * latency_req to determine the maximum exit latency.
         */
        //9. Derive a second latency bound from predicted_ns and the load
        //   situation (nr_iowaiters), and keep the minimum of it and the
        //   PM QoS latency_req
        interactivity_req = div64_u64(predicted_ns,
                                      performance_multiplier(nr_iowaiters));
        if (latency_req > interactivity_req)
            latency_req = interactivity_req;
    }

    //10. At this point both latency_req ("tolerable exit latency") and
    //    predicted_ns ("expected idle duration") are known; use them to
    //    pick an idle level

    /*
     * Find the idle state with the lowest power while satisfying
     * our constraints.
     */
    idx = -1;

    //drv->states[] is ordered from shallow (high power, low exit latency)
    //to deep, so this loop walks from the shallowest to the deepest state
    for (i = 0; i < drv->state_count; i++) {
        struct cpuidle_state *s = &drv->states[i];

        //skip disabled states
        if (dev->states_usage[i].disable)
            continue;

        if (idx == -1)
            idx = i; /* first enabled state */

        //the expected idle duration is below this state's minimum
        //residency: this state (and every deeper one) is too deep
        if (s->target_residency_ns > predicted_ns) {
            /*
             * Use a physical idle state, not busy polling, unless
             * a timer is going to trigger soon enough.
             */
            //idx still holds the previous (shallower) candidate; if that
            //candidate is only a polling state, prefer this physical state
            //as long as it satisfies the latency bound and a timer will
            //fire within its residency anyway
            if ((drv->states[idx].flags & CPUIDLE_FLAG_POLLING) &&
                s->exit_latency_ns <= latency_req &&
                s->target_residency_ns <= data->next_timer_ns) {
                predicted_ns = s->target_residency_ns;
                idx = i;
                break;
            }
            if (predicted_ns < TICK_NSEC)
                break;

            if (!tick_nohz_tick_stopped()) {
                /*
                 * If the state selected so far is shallow,
                 * waking up early won't hurt, so retain the
                 * tick in that case and let the governor run
                 * again in the next iteration of the loop.
                 */
                predicted_ns = drv->states[idx].target_residency_ns;
                break;
            }

            /*
             * If the state selected so far is shallow and this
             * state's target residency matches the time till the
             * closest timer event, select this one to avoid getting
             * stuck in the shallow one for too long.
             */
            if (drv->states[idx].target_residency_ns < TICK_NSEC &&
                s->target_residency_ns <= delta_next)
                idx = i;

            return idx;
        }

        //this state's exit latency exceeds what the system can tolerate;
        //deeper states are only worse, so stop searching here
        if (s->exit_latency_ns > latency_req)
            break;

        idx = i;
    }

    //no usable state found, fall back to state 0
    if (idx == -1)
        idx = 0; /* No states enabled. Must use 0. */

    /*
     * Don't stop the tick if the selected state is a polling one or if the
     * expected idle duration is shorter than the tick period length.
     */
    //at this point a candidate state has been found
    if (((drv->states[idx].flags & CPUIDLE_FLAG_POLLING) ||
         predicted_ns < TICK_NSEC) && !tick_nohz_tick_stopped()) {
        *stop_tick = false;

        if (idx > 0 && drv->states[idx].target_residency_ns > delta_next) {
            /*
             * The tick is not going to be stopped and the target
             * residency of the state to be returned is not within
             * the time until the next timer event including the
             * tick, so try to correct that.
             */
            for (i = idx - 1; i >= 0; i--) {
                if (dev->states_usage[i].disable)
                    continue;

                idx = i;
                if (drv->states[i].target_residency_ns <= delta_next)
                    break;
            }
        }
    }

    return idx;
}
2.6.1 menu_update - deferred statistics update

Each time the CPU returns from a C-state, the kernel calls the governor's reflect hook, giving the governor a chance to evaluate the outcome of this round and update its statistics so that the next selection is better.

For menu, the reflect hook merely sets the needs_update flag; the real update happens at the next select, which calls menu_update to do the bookkeeping.
/**
 * menu_update - attempts to guess what happened after entry
 * @drv: cpuidle driver containing state data
 * @dev: the CPU
 */
static void menu_update(struct cpuidle_driver *drv, struct cpuidle_device *dev)
{
    struct menu_device *data = this_cpu_ptr(&menu_devices);

    //1. The C-state this CPU entered last time
    int last_idx = dev->last_state_idx;
    struct cpuidle_state *target = &drv->states[last_idx];
    u64 measured_ns;
    unsigned int new_factor;

    //2. As the comment below explains, work out how long the CPU actually
    //   slept in the last C-state.  Whatever the measurement method, the
    //   value includes the exit latency; since we care about when the wakeup
    //   began (not when it completed), the exit latency must be subtracted.
    //   If the measured time is smaller than the exit latency, assume the
    //   state was never really reached and treat the exit latency as 0.
    /*
     * Try to figure out how much time passed between entry to low
     * power state and occurrence of the wakeup event.
     *
     * If the entered idle state didn't support residency measurements,
     * we use them anyway if they are short, and if long,
     * truncate to the whole expected time.
     *
     * Any measured amount of time will include the exit latency.
     * Since we are interested in when the wakeup begun, not when it
     * was completed, we must subtract the exit latency. However, if
     * the measured amount of time is less than the exit latency,
     * assume the state was never reached and the exit latency is 0.
     */
    if (data->tick_wakeup && data->next_timer_ns > TICK_NSEC) {
        //tick_wakeup is set in reflect, right after the wakeup: true means
        //the CPU was woken by the scheduler tick.
        //next_timer_ns was set at the previous menu_select, i.e. the "time
        //until the next timer event" computed before entering idle.
        //Both being true means the nohz code predicted no event within the
        //tick boundary, yet a tick woke us up anyway.
        //MAX_INTERESTING is 50 ms.
        /*
         * The nohz code said that there wouldn't be any events within
         * the tick boundary (if the tick was stopped), but the idle
         * duration predictor had a differing opinion. Since the CPU
         * was woken up by a tick (that wasn't stopped after all), the
         * predictor was not quite right, so assume that the CPU could
         * have been idle long (but not forever) to help the idle
         * duration predictor do a better job next time.
         */
        measured_ns = 9 * MAX_INTERESTING / 10;
    } else if ((drv->states[last_idx].flags & CPUIDLE_FLAG_POLLING) &&
               dev->poll_time_limit) {
        /*
         * The CPU exited the "polling" state due to a time limit, so
         * the idle duration prediction leading to the selection of that
         * state was inaccurate. If a better prediction had been made,
         * the CPU might have been woken up from idle by the next timer.
         * Assume that to be the case.
         */
        //the CPU left the polling state because poll_time_limit expired, so
        //the prediction that selected it was too short; assume the CPU
        //would have slept until the next timer
        measured_ns = data->next_timer_ns;
    } else {
        /* measured value */
        //how long the CPU actually stayed idle last time
        measured_ns = dev->last_residency_ns;

        /* Deduct exit latency */
        //subtract the exit latency, per the comment above
        if (measured_ns > 2 * target->exit_latency_ns)
            measured_ns -= target->exit_latency_ns;
        else
            measured_ns /= 2;
    }

    /* Make sure our coefficients do not exceed unity */
    //cap the measured residency at the predicted sleep length, so the
    //resulting ratio never exceeds 1.0
    if (measured_ns > data->next_timer_ns)
        measured_ns = data->next_timer_ns;

    /* Update our correction ratio */
    //update the correction factor: take the old factor and decay it by
    //1/DECAY
    new_factor = data->correction_factor[data->bucket];
    new_factor -= new_factor / DECAY;

    //then mix in this period's measured/predicted ratio, normalized to
    //RESOLUTION (1024); MAX_INTERESTING is 50 ms
    if (data->next_timer_ns > 0 && measured_ns < MAX_INTERESTING)
        new_factor += div64_u64(RESOLUTION * measured_ns,
                                data->next_timer_ns);
    else
        /*
         * we were idle so long that we count it as a perfect
         * prediction
         */
        new_factor += RESOLUTION;

    /*
     * We don't want 0 as factor; we always want at least
     * a tiny bit of estimated time. Fortunately, due to rounding,
     * new_factor will stay nonzero regardless of measured_us values
     * and the compiler can eliminate this test as long as DECAY > 1.
     */
    //we never want a factor of 0 (that would predict zero idle time); due
    //to rounding, new_factor stays nonzero anyway, and the compiler can
    //drop this test whenever DECAY > 1 (it is 8 in menu)
    if (DECAY == 1 && unlikely(new_factor == 0))
        new_factor = 1;

    //store it back; data->bucket still indexes the bucket that was used for
    //the period just measured
    data->correction_factor[data->bucket] = new_factor;

    /* update the repeating-pattern data */
    //record this measured residency (in us) for the pattern detector
    data->intervals[data->interval_ptr++] = ktime_to_us(measured_ns);

    //wrap the index around
    if (data->interval_ptr >= INTERVALS)
        data->interval_ptr = 0;
}
2.6.2 tick_nohz_get_sleep_length - estimate how long the current sleep will be
//with CONFIG_NO_HZ_COMMON defined, the tick may be stopped
#ifdef CONFIG_NO_HZ_COMMON
/**
 * tick_nohz_get_sleep_length - return the expected length of the current sleep
 * @delta_next: duration until the next event if the tick cannot be stopped
 *
 * Called from power state control code with interrupts disabled
 */
ktime_t tick_nohz_get_sleep_length(ktime_t *delta_next)
{
    struct clock_event_device *dev = __this_cpu_read(tick_cpu_device.evtdev);
    struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);
    int cpu = smp_processor_id();

    /*
     * The idle entry time is expected to be a sufficient approximation of
     * the current time at this point.
     */
    //use the "idle entry timestamp" as now
    ktime_t now = ts->idle_entrytime;
    ktime_t next_event;

    WARN_ON_ONCE(!ts->inidle);

    //how long this CPU may sleep if the tick cannot be stopped:
    //dev->next_event is when the next event fires; the difference from now
    //is the expected time in idle
    *delta_next = ktime_sub(dev->next_event, now);

    //can the tick be stopped at all?
    if (!can_stop_idle_tick(cpu, ts))
        return *delta_next;

    //timestamp of the next event assuming the tick is stopped
    next_event = tick_nohz_next_event(ts, cpu);
    if (!next_event)
        return *delta_next;

    /*
     * If the next highres timer to expire is earlier than next_event, the
     * idle governor needs to know that.
     */
    //take the next high-resolution timer into account as well
    next_event = min_t(u64, next_event,
                       hrtimer_next_event_without(&ts->sched_timer));

    //recompute how long the CPU may stay in idle
    return ktime_sub(next_event, now);
}
#else
//without CONFIG_NO_HZ_COMMON the tick arrives strictly periodically
static inline ktime_t tick_nohz_get_sleep_length(ktime_t *delta_next)
{
    *delta_next = TICK_NSEC;
    return *delta_next;
}
#endif
2.6.3 nr_iowait_cpu - how many tasks are waiting for IO on this CPU

unsigned long nr_iowait_cpu(int cpu)
{
    return atomic_read(&cpu_rq(cpu)->nr_iowait);
}
2.6.4 which_bucket - pick which bucket's correction factor to use

static inline int which_bucket(
        u64 duration_ns,            //expected sleep length
        unsigned long nr_iowaiters) //number of iowaiters on this CPU
{
    int bucket = 0;

    /*
     * We keep two groups of stats; one with no
     * IO pending, one without.
     * This allows us to calculate
     * E(duration)|iowait
     */
    //BUCKETS is 12: per the comment above, buckets 0-5 are used when there
    //are no iowaiters, buckets 6-11 when there are
    if (nr_iowaiters)
        bucket = BUCKETS/2;

    //within each group, the bucket boundaries are decade-spaced:
    //<10us, <100us, <1ms, <10ms, <100ms, and everything above
    if (duration_ns < 10ULL * NSEC_PER_USEC)
        return bucket;
    if (duration_ns < 100ULL * NSEC_PER_USEC)
        return bucket + 1;
    if (duration_ns < 1000ULL * NSEC_PER_USEC)
        return bucket + 2;
    if (duration_ns < 10000ULL * NSEC_PER_USEC)
        return bucket + 3;
    if (duration_ns < 100000ULL * NSEC_PER_USEC)
        return bucket + 4;
    return bucket + 5;
}
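For example, an expected sleep of 250 us falls in the "< 1 ms" tier, so which_bucket returns 2 when nothing is waiting for IO, and 6 + 2 = 8 when at least one task is in iowait on this CPU: two independently trained correction factors for the same duration range.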
2.6.5 get_typical_interval - detect a repeating pattern via the standard deviation

As the comment below explains, get_typical_interval tries to detect repeating patterns by tracking the last 8 intervals and checking whether the standard deviation of that set of points is below a threshold. If it is, the average of those 8 points is used as the estimate.
/*
 * Try detecting repeating patterns by keeping track of the last 8
 * intervals, and checking if the standard deviation of that set
 * of points is below a threshold. If it is... then use the
 * average of these 8 points as the estimated value.
 */
static unsigned int get_typical_interval(struct menu_device *data,
                                         unsigned int predicted_us)
{
    int i, divisor;
    unsigned int min, max, thresh, avg;
    uint64_t sum, variance;

    //samples above this threshold are discarded as outliers
    thresh = INT_MAX; /* Discard outliers above this value */

again:

    /* First calculate the average of past intervals */
    min = UINT_MAX;
    max = 0;
    sum = 0;
    divisor = 0;

    //accumulate the sum and the sample count for the average;
    //INTERVALS is 8, i.e. an 8-sample window
    for (i = 0; i < INTERVALS; i++) {
        unsigned int value = data->intervals[i];
        if (value <= thresh) {
            sum += value;
            divisor++;

            //track the maximum and minimum along the way
            if (value > max)
                max = value;

            if (value < min)
                min = value;
        }
    }

    /*
     * If the result of the computation is going to be discarded anyway,
     * avoid the computation altogether.
     */
    //if even the minimum sample exceeds the baseline prediction passed in,
    //the result would be discarded anyway (the caller takes the minimum),
    //so skip the computation and return immediately
    if (min >= predicted_us)
        return UINT_MAX;

    //compute the average; a shift suffices when all 8 samples were kept
    if (divisor == INTERVALS)
        avg = sum >> INTERVAL_SHIFT;
    else
        avg = div_u64(sum, divisor);

    /* Then try to determine variance */
    //now compute the variance of the retained samples
    variance = 0;
    for (i = 0; i < INTERVALS; i++) {
        unsigned int value = data->intervals[i];
        if (value <= thresh) {
            int64_t diff = (int64_t)value - avg;
            variance += diff * diff;
        }
    }
    if (divisor == INTERVALS)
        variance >>= INTERVAL_SHIFT;
    else
        do_div(variance, divisor);

    /*
     * The typical interval is obtained when standard deviation is
     * small (stddev <= 20 us, variance <= 400 us^2) or standard
     * deviation is small compared to the average interval (avg >
     * 6*stddev, avg^2 > 36*variance). The average is smaller than
     * UINT_MAX aka U32_MAX, so computing its square does not
     * overflow a u64. We simply reject this candidate average if
     * the standard deviation is greater than 715 s (which is
     * rather unlikely).
     *
     * Use this result only if there is no timer to wake us up sooner.
     */
    //accept the average when the spread is small either absolutely
    //(variance <= 400 us^2, i.e. stddev <= 20 us) or relative to the
    //average (avg > 6*stddev, compared as avg^2 > 36*variance)
    if (likely(variance <= U64_MAX/36)) {
        if ((((u64)avg*avg > variance*36) &&
             (divisor * 4 >= INTERVALS * 3)) ||
            variance <= 400) {
            return avg;
        }
    }

    /*
     * If we have outliers to the upside in our distribution, discard
     * those by setting the threshold to exclude these outliers, then
     * calculate the average and standard deviation again. Once we get
     * down to the bottom 3/4 of our samples, stop excluding samples.
     *
     * This can deal with workloads that have long pauses interspersed
     * with sporadic activity with a bunch of short pauses.
     */
    //if the distribution has upside outliers, lower the threshold to
    //exclude them and recompute; stop once only the bottom 3/4 of the
    //samples remain.  This handles workloads where long pauses are
    //interspersed with bursts of activity and short pauses.
    if ((divisor * 4) <= INTERVALS * 3)
        return UINT_MAX;

    thresh = max - 1;
    goto again;
}
2.7 menu_reflect - flag that an update is needed

Each time the CPU returns from a C-state, the kernel calls the governor's reflect hook, giving the governor a chance to evaluate the outcome of this round and update its statistics so that the next selection is better.

For menu, the reflect hook merely sets the needs_update flag; the real update happens at the next select. Why not update directly in reflect? Because reflect runs with interrupts disabled, it should do as little work as possible and return quickly so interrupts can be re-enabled and their handlers serviced.

menu_reflect is executed on the path below: in the second half of cpuidle_idle_call, when leaving idle, cpuidle_reflect is called so the governor can do its own bookkeeping:
- cpuidle_idle_call -> cpuidle_reflect
/**
 * menu_reflect - records that data structures need update
 * @dev: the CPU
 * @index: the index of actual entered state
 *
 * NOTE: it's important to be fast here because this operation will add to
 * the overall exit latency.
 */
static void menu_reflect(struct cpuidle_device *dev,
                         int index)   //the C-state we just exited
{
    //1. Each CPU has its own menu_device; this also shows that each CPU can
    //   enter idle independently of the other CPUs in its cluster
    struct menu_device *data = this_cpu_ptr(&menu_devices);

    dev->last_state_idx = index;

    //2. Set data->needs_update so the statistics are refreshed at the next
    //   select.  Why defer instead of updating here?  As the idle-thread
    //   analysis showed, local interrupts are disabled before executing wfi;
    //   when an interrupt brings the CPU out of wfi, local_irq is still
    //   disabled, so the CPU cannot jump to the vector table yet: it keeps
    //   executing until interrupts are re-enabled, and only then runs the
    //   handler.  reflect is thus called with interrupts off and must return
    //   as quickly as possible.
    data->needs_update = 1;

    //3. Boolean: true means the wakeup was delivered by the scheduler tick
    data->tick_wakeup = tick_nohz_idle_got_tick();
}
3. ladder

As the name suggests, ladder climbs up and down the C-states one rung at a time.

ladder is used on systems with a periodic timer tick; menu is used on tickless systems.
3.1 Data structures

The core layer already provides cpuidle_device and cpuidle_state to describe a CPU and a C-state, but these are not sufficient for the ladder governor, so ladder defines the two structures below to supplement them.
3.1.1 ladder_device_state - ladder中对C state的补充描述
struct ladder_device_state { struct { u32 promotion_count; //默认值为4 u32 demotion_count; //默认值为1
//在该C state的实际睡眠时间必须在[demotion_time_ns, promotion_time_ns]之间, //超出该范围则说明这个C state不适用,需要重新选择更深或者更浅的睡眠状态 //promotion_time_ns控制上调阈值,实际睡眠时间大于该值,则表示之前选的C state //已经不适合了,需要选择更深睡眠等级 //demotion_time_ns控制下调阈值,如果实际的睡眠时间小于该值,则表示之前选的 //C state也不太适合,此时需要选择更浅的睡眠等级 //注意:这里所说的睡眠时间不包括退出延迟,而是真正的睡眠时间 u64 promotion_time_ns; u64 demotion_time_ns; } threshold;
struct { int promotion_count; int demotion_count; } stats; }; |
3.1.2 ladder_device - ladder's supplementary per-CPU data

struct ladder_device {
    struct ladder_device_state states[CPUIDLE_STATE_MAX];
};
3.2 ladder_governor

static struct cpuidle_governor ladder_governor = {
    .name    = "ladder",
    .rating  = 10,
    .enable  = ladder_enable_device,
    .select  = ladder_select_state,
    .reflect = ladder_reflect,
};
3.3 init_ladder - registration

/**
 * init_ladder - initializes the governor
 */
static int __init init_ladder(void)
{
    /*
     * When NO_HZ is disabled, or when booting with nohz=off, the ladder
     * governor is better so give it a higher rating than the menu
     * governor.
     */
    if (!tick_nohz_enabled)
        ladder_governor.rating = 25;

    return cpuidle_register_governor(&ladder_governor);
}

postcore_initcall(init_ladder);   //runs during driver initialization at boot
3.4 ladder_enable_device - data setup when the governor is enabled

/**
 * ladder_enable_device - setup for the governor
 * @drv: cpuidle driver
 * @dev: the CPU
 */
static int ladder_enable_device(struct cpuidle_driver *drv,
                                struct cpuidle_device *dev)
{
    int i;

    //1. Decide the first C-state to consider.  If the state at index 0 is
    //   tagged CPUIDLE_FLAG_POLLING, it is a busy-wait loop rather than a
    //   real idle state, so ladder skips it: first_idx restricts the climb
    //   to genuine idle states.
    int first_idx = drv->states[0].flags & CPUIDLE_FLAG_POLLING ? 1 : 0;
    struct ladder_device *ldev = &per_cpu(ladder_devices, dev->cpu);
    struct ladder_device_state *lstate;
    struct cpuidle_state *state;

    dev->last_state_idx = first_idx;

    //2. Iterate over all C-states
    for (i = first_idx; i < drv->state_count; i++) {
        state = &drv->states[i];
        lstate = &ldev->states[i];

        lstate->stats.promotion_count = 0;
        lstate->stats.demotion_count = 0;

        lstate->threshold.promotion_count = PROMOTION_COUNT;   //4
        lstate->threshold.demotion_count = DEMOTION_COUNT;     //1

        //both thresholds are initialized from the state's exit latency
        if (i < drv->state_count - 1)
            lstate->threshold.promotion_time_ns = state->exit_latency_ns;
        if (i > first_idx)
            lstate->threshold.demotion_time_ns = state->exit_latency_ns;
    }

    return 0;
}
3.5 ladder_select_state - pick a C-state

/**
 * ladder_select_state - selects the next state to enter
 * @drv: cpuidle driver
 * @dev: the CPU
 * @dummy: not used
 */
static int ladder_select_state(struct cpuidle_driver *drv,
                               struct cpuidle_device *dev, bool *dummy)
{
    struct ladder_device *ldev = this_cpu_ptr(&ladder_devices);
    struct ladder_device_state *last_state;

    //1. The idle state this CPU was in last time
    int last_idx = dev->last_state_idx;

    //2. CPUIDLE_FLAG_POLLING marks a busy-wait, not true idle;
    //   first_idx restricts the search to genuine idle states
    int first_idx = drv->states[0].flags & CPUIDLE_FLAG_POLLING ? 1 : 0;

    //3. The exit latency this CPU can tolerate
    s64 latency_req = cpuidle_governor_latency_req(dev->cpu);
    s64 last_residency;

    /* Special case when user has set very strict latency requirement */
    //4. latency_req == 0 means the system is under a strict constraint and
    //   must not go idle: return 0, the normal/shallowest state
    if (unlikely(latency_req == 0)) {
        ladder_do_selection(dev, ldev, last_idx, 0);
        return 0;
    }

    //5. Per-state data for the state used last time
    last_state = &ldev->states[last_idx];

    //6. Actual time slept last time, excluding the exit latency:
    //   dev->last_residency_ns is the total time spent in the state, which
    //   includes the exit latency; drv->states[last_idx].exit_latency_ns is
    //   that state's exit latency; the difference is the real sleep time
    last_residency = dev->last_residency_ns -
                     drv->states[last_idx].exit_latency_ns;

    /* consider promotion */
    //7. Promotion, i.e. go one level deeper (one rung at a time).  The four
    //   conditions are:
    //   a) last_idx is not the deepest state, so there is room to climb;
    //   b) the next state up (last_idx + 1) is not disabled;
    //   c) the real sleep time exceeded the current state's promotion
    //      threshold;
    //   d) the next state's exit latency is within the tolerable latency
    if (last_idx < drv->state_count - 1 &&
        !dev->states_usage[last_idx + 1].disable &&
        last_residency > last_state->threshold.promotion_time_ns &&
        drv->states[last_idx + 1].exit_latency_ns <= latency_req) {

        //7.1 Update the stats first: bump the promotion count and reset the
        //    demotion count
        last_state->stats.promotion_count++;
        last_state->stats.demotion_count = 0;

        //7.2 Not every promotion request is honored: the promotion only
        //    actually happens once enough consecutive requests have
        //    accumulated (4 by default)
        if (last_state->stats.promotion_count >=
            last_state->threshold.promotion_count) {
            ladder_do_selection(dev, ldev, last_idx, last_idx + 1);
            return last_idx + 1;
        }
    }

    //8. Reaching here means no promotion happened: maybe none was needed,
    //   maybe the consecutive-request count has not been reached yet, or
    //   maybe the deeper state's exit latency does not meet the system's
    //   current requirement.

    /* consider demotion */
    //9. Demotion.  First handle the two urgent cases:
    //   a) the state used last time has been disabled, or
    //   b) its exit latency no longer satisfies the latency requirement
    if (last_idx > first_idx &&
        (dev->states_usage[last_idx].disable ||
         drv->states[last_idx].exit_latency_ns > latency_req)) {
        int i;

        //9.1 Search downwards for a state whose exit latency fits.  If none
        //    qualifies, the loop ends at first_idx, the shallowest genuine
        //    idle state.
        for (i = last_idx - 1; i > first_idx; i--) {
            if (drv->states[i].exit_latency_ns <= latency_req)
                break;
        }

        //9.2 Switch to the state found above
        ladder_do_selection(dev, ldev, last_idx, i);
        return i;
    }

    //10. Regular demotion; the two conditions are:
    //    a) last_idx is not the shallowest state, so there is room to drop;
    //    b) the real time slept last time fell below the demotion threshold
    if (last_idx > first_idx &&
        last_residency < last_state->threshold.demotion_time_ns) {

        //10.1 Update the stats
        last_state->stats.demotion_count++;
        last_state->stats.promotion_count = 0;

        //10.2 As with promotion, the demotion only happens once enough
        //     requests have accumulated; but the default threshold is 1,
        //     so a single demotion request is enough
        if (last_state->stats.demotion_count >=
            last_state->threshold.demotion_count) {
            ladder_do_selection(dev, ldev, last_idx, last_idx - 1);
            return last_idx - 1;
        }
    }

    /* otherwise remain at the current state */
    //11. Stay where we are
    return last_idx;
}
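Putting the hysteresis together with a small example: suppose the CPU sits in C1, with PROMOTION_COUNT = 4 and DEMOTION_COUNT = 1. Four consecutive wakeups whose real residency exceeds C1's promotion_time_ns move it to C2; from then on, a single wakeup whose residency falls below C2's demotion_time_ns drops it straight back to C1. Promotions are deliberately sluggish while demotions are immediate, which biases ladder toward low latency when the workload picks up again.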
3.5.1 ladder_do_selection

/**
 * ladder_do_selection - prepares private data for a state change
 * @ldev: the ladder device
 * @old_idx: the current state index
 * @new_idx: the new target state index
 */
static inline void ladder_do_selection(struct cpuidle_device *dev,
                                       struct ladder_device *ldev,
                                       int old_idx, int new_idx)
{
    //bookkeeping: once a promotion or demotion has happened, reset the
    //counters of the state we are leaving
    ldev->states[old_idx].stats.promotion_count = 0;
    ldev->states[old_idx].stats.demotion_count = 0;
    dev->last_state_idx = new_idx;
}
3.6 ladder_reflect - update the relevant data

For ladder this only updates last_state_idx, i.e. which idle state was used last.

/**
 * ladder_reflect - update the correct last_state_idx
 * @dev: the CPU
 * @index: the index of actual state entered
 */
static void ladder_reflect(struct cpuidle_device *dev, int index)
{
    //reflect merely records which C-state we just woke up from
    if (index > 0)
        dev->last_state_idx = index;
}
4. haltpoll

haltpoll needs no additional data structures.

4.1 Global tunables
static unsigned int guest_halt_poll_ns __read_mostly = 200000;
module_param(guest_halt_poll_ns, uint, 0644);

/* division factor to shrink halt_poll_ns */
static unsigned int guest_halt_poll_shrink __read_mostly = 2;
module_param(guest_halt_poll_shrink, uint, 0644);

/* multiplication factor to grow per-cpu poll_limit_ns */
static unsigned int guest_halt_poll_grow __read_mostly = 2;
module_param(guest_halt_poll_grow, uint, 0644);

/* value in us to start growing per-cpu halt_poll_ns */
static unsigned int guest_halt_poll_grow_start __read_mostly = 50000;
module_param(guest_halt_poll_grow_start, uint, 0644);

/* allow shrinking guest halt poll */
static bool guest_halt_poll_allow_shrink __read_mostly = true;
module_param(guest_halt_poll_allow_shrink, bool, 0644);
4.2 haltpoll_governor

static struct cpuidle_governor haltpoll_governor = {
    .name    = "haltpoll",
    .rating  = 9,
    .enable  = haltpoll_enable_device,
    .select  = haltpoll_select,
    .reflect = haltpoll_reflect,
};
4.3 init_haltpoll - registration

static int __init init_haltpoll(void)
{
    if (kvm_para_available())
        return cpuidle_register_governor(&haltpoll_governor);

    return 0;
}

postcore_initcall(init_haltpoll);   //runs during driver initialization at boot
4.4 haltpoll_enable_device - setup when the governor is enabled

/**
 * haltpoll_enable_device - scans a CPU's states and does setup
 * @drv: cpuidle driver
 * @dev: the CPU
 */
static int haltpoll_enable_device(struct cpuidle_driver *drv,
                                  struct cpuidle_device *dev)
{
    dev->poll_limit_ns = 0;

    return 0;
}
4.5 haltpoll_select - pick an idle level

Note that this function only ever returns 0 or 1: the companion haltpoll driver defines exactly two states, state[0] (poll) and state[1] (halt); see the reader comment at the end of this article.
/**
 * haltpoll_select - selects the next idle state to enter
 * @drv: cpuidle driver containing state data
 * @dev: the CPU
 * @stop_tick: indication on whether or not to stop the tick
 */
static int haltpoll_select(struct cpuidle_driver *drv,
                           struct cpuidle_device *dev,
                           bool *stop_tick)  //out: whether to stop the tick
{
    //1. The latency the system can tolerate
    s64 latency_req = cpuidle_governor_latency_req(dev->cpu);

    //2. latency_req == 0 means the system is under a strict latency
    //   constraint: return state 0 (poll, i.e. no real sleep) and keep the
    //   tick running
    if (!drv->state_count || latency_req == 0) {
        *stop_tick = false;
        return 0;
    }

    //3. poll_limit_ns == 0 means no polling budget has been established yet
    //   (it is reset in haltpoll_enable_device and grown in reflect), so go
    //   straight to state 1, i.e. halt
    if (dev->poll_limit_ns == 0)
        return 1;

    /* Last state was poll? */
    //4. For the haltpoll driver, state[0] is the polling state (not a
    //   "normal" state): if the poll window expired without an event, the
    //   polling was wasted, so halt this time; otherwise poll again
    if (dev->last_state_idx == 0) {
        /* Halt if no event occurred on poll window */
        if (dev->poll_time_limit == true)
            return 1;

        *stop_tick = false;
        /* Otherwise, poll again */
        return 0;
    }

    *stop_tick = false;
    /* Last state was halt: poll */
    return 0;
}
4.6 haltpoll_reflect

/**
 * haltpoll_reflect - update variables and update poll time
 * @dev: the CPU
 * @index: the index of actual entered state
 */
static void haltpoll_reflect(struct cpuidle_device *dev, int index)
{
    dev->last_state_idx = index;

    //after a halt (state 1), adapt the per-CPU poll budget to how long the
    //CPU actually stayed blocked
    if (index != 0)
        adjust_poll_limit(dev, dev->last_residency_ns);
}
4.6.1 adjust_poll_limit

static void adjust_poll_limit(struct cpuidle_device *dev, u64 block_ns)
{
    unsigned int val;

    /* Grow cpu_halt_poll_us if
     * cpu_halt_poll_us < block_ns < guest_halt_poll_us
     */
    if (block_ns > dev->poll_limit_ns && block_ns <= guest_halt_poll_ns) {
        val = dev->poll_limit_ns * guest_halt_poll_grow;

        //clamp val to [guest_halt_poll_grow_start, guest_halt_poll_ns]
        if (val < guest_halt_poll_grow_start)
            val = guest_halt_poll_grow_start;
        if (val > guest_halt_poll_ns)
            val = guest_halt_poll_ns;

        dev->poll_limit_ns = val;
    } else if (block_ns > guest_halt_poll_ns &&
               guest_halt_poll_allow_shrink) {
        unsigned int shrink = guest_halt_poll_shrink;

        val = dev->poll_limit_ns;
        if (shrink == 0)
            val = 0;
        else
            val /= shrink;
        dev->poll_limit_ns = val;
    }
}
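Tracing the default tunables through one scenario: poll_limit_ns starts at 0, so the first select halts (returns 1). If that halt blocks for, say, 60 us (block_ns = 60000, which is above poll_limit_ns and within guest_halt_poll_ns = 200 us), the grow path runs: 0 * 2 = 0 is raised to guest_halt_poll_grow_start = 50 us. Further short blocks double the budget to 100 us and then 200 us, where it is capped; a block longer than 200 us means polling was pointless, so the budget is divided by guest_halt_poll_shrink (2). The CPU therefore polls only as long as recent history suggests an event will arrive quickly.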
5. teo

teo stands for "timer events oriented".
5.1 Data structures

5.1.1 teo_idle_state - describes a C-state

/**
 * struct teo_idle_state - Idle state data used by the TEO cpuidle governor.
 * @early_hits: "Early" CPU wakeups "matching" this state.
 * @hits: "On time" CPU wakeups "matching" this state.
 * @misses: CPU wakeups "missing" this state.
 *
 * A CPU wakeup is "matched" by a given idle state if the idle duration measured
 * after the wakeup is between the target residency of that state and the target
 * residency of the next one (or if this is the deepest available idle state, it
 * "matches" a CPU wakeup when the measured idle duration is at least equal to
 * its target residency).
 *
 * Also, from the TEO governor perspective, a CPU wakeup from idle is "early" if
 * it occurs significantly earlier than the closest expected timer event (that
 * is, early enough to match an idle state shallower than the one matching the
 * time till the closest timer event). Otherwise, the wakeup is "on time", or
 * it is a "hit".
 *
 * A "miss" occurs when the given state doesn't match the wakeup, but it matches
 * the time till the closest timer event used for idle state selection.
 */
struct teo_idle_state {
    unsigned int early_hits;
    unsigned int hits;
    unsigned int misses;
};
5.1.2 teo_cpu - describes a CPU

/**
 * struct teo_cpu - CPU data used by the TEO cpuidle governor.
 * @time_span_ns: Time between idle state selection and post-wakeup update.
 * @sleep_length_ns: Time till the closest timer event (at the selection time).
 * @states: Idle states data corresponding to this CPU.
 * @interval_idx: Index of the most recent saved idle interval.
 * @intervals: Saved idle duration values.
 */
struct teo_cpu {
    u64 time_span_ns;
    u64 sleep_length_ns;
    struct teo_idle_state states[CPUIDLE_STATE_MAX];
    int interval_idx;
    u64 intervals[INTERVALS];
};
5.2 teo_governor

static struct cpuidle_governor teo_governor = {
    .name    = "teo",
    .rating  = 19,
    .enable  = teo_enable_device,
    .select  = teo_select,
    .reflect = teo_reflect,
};
5.3 teo_governor_init - registration

static int __init teo_governor_init(void)
{
    return cpuidle_register_governor(&teo_governor);
}

postcore_initcall(teo_governor_init);
5.4 teo_enable_device - data initialization when the governor is enabled

/**
 * teo_enable_device - Initialize the governor's data for the target CPU.
 * @drv: cpuidle driver (not used).
 * @dev: Target CPU.
 */
static int teo_enable_device(struct cpuidle_driver *drv,
                             struct cpuidle_device *dev)
{
    struct teo_cpu *cpu_data = per_cpu_ptr(&teo_cpus, dev->cpu);
    int i;

    memset(cpu_data, 0, sizeof(*cpu_data));

    //INTERVALS is 8 in teo as well: the last 8 measured idle durations
    for (i = 0; i < INTERVALS; i++)
        cpu_data->intervals[i] = U64_MAX;

    return 0;
}
5.5 teo_select - pick an idle level

/**
 * teo_select - Selects the next idle state to enter.
 * @drv: cpuidle driver containing state data.
 * @dev: Target CPU.
 * @stop_tick: Indication on whether or not to stop the scheduler tick.
 */
static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
                      bool *stop_tick)   //out: whether to stop the tick
{
    struct teo_cpu *cpu_data = per_cpu_ptr(&teo_cpus, dev->cpu);

    //1. The latency the system can tolerate
    s64 latency_req = cpuidle_governor_latency_req(dev->cpu);
    u64 duration_ns;
    unsigned int hits, misses, early_hits;
    int max_early_idx, prev_max_early_idx, constraint_idx, idx, i;
    ktime_t delta_tick;

    //2. last_state_idx >= 0 means the CPU really did go idle last time, so
    //   update the statistics
    if (dev->last_state_idx >= 0) {
        teo_update(drv, dev);
        dev->last_state_idx = -1;
    }

    cpu_data->time_span_ns = local_clock();

    duration_ns = tick_nohz_get_sleep_length(&delta_tick);
    cpu_data->sleep_length_ns = duration_ns;

    hits = 0;
    misses = 0;
    early_hits = 0;
    max_early_idx = -1;
    prev_max_early_idx = -1;
    constraint_idx = drv->state_count;
    idx = -1;

    for (i = 0; i < drv->state_count; i++) {
        struct cpuidle_state *s = &drv->states[i];

        if (dev->states_usage[i].disable) {
            /*
             * Ignore disabled states with target residencies beyond
             * the anticipated idle duration.
             */
            if (s->target_residency_ns > duration_ns)
                continue;

            /*
             * This state is disabled, so the range of idle duration
             * values corresponding to it is covered by the current
             * candidate state, but still the "hits" and "misses"
             * metrics of the disabled state need to be used to
             * decide whether or not the state covering the range in
             * question is good enough.
             */
            hits = cpu_data->states[i].hits;
            misses = cpu_data->states[i].misses;

            if (early_hits >= cpu_data->states[i].early_hits ||
                idx < 0)
                continue;

            /*
             * If the current candidate state has been the one with
             * the maximum "early hits" metric so far, the "early
             * hits" metric of the disabled state replaces the
             * current "early hits" count to avoid selecting a
             * deeper state with lower "early hits" metric.
             */
            if (max_early_idx == idx) {
                early_hits = cpu_data->states[i].early_hits;
                continue;
            }

            /*
             * The current candidate state is closer to the disabled
             * one than the current maximum "early hits" state, so
             * replace the latter with it, but in case the maximum
             * "early hits" state index has not been set so far,
             * check if the current candidate state is not too
             * shallow for that role.
             */
            if (teo_time_ok(drv->states[idx].target_residency_ns)) {
                prev_max_early_idx = max_early_idx;
                early_hits = cpu_data->states[i].early_hits;
                max_early_idx = idx;
            }

            continue;
        }

        if (idx < 0) {
            idx = i; /* first enabled state */
            hits = cpu_data->states[i].hits;
            misses = cpu_data->states[i].misses;
        }

        if (s->target_residency_ns > duration_ns)
            break;

        if (s->exit_latency_ns > latency_req && constraint_idx > i)
            constraint_idx = i;

        idx = i;
        hits = cpu_data->states[i].hits;
        misses = cpu_data->states[i].misses;

        if (early_hits < cpu_data->states[i].early_hits &&
            teo_time_ok(drv->states[i].target_residency_ns)) {
            prev_max_early_idx = max_early_idx;
            early_hits = cpu_data->states[i].early_hits;
            max_early_idx = i;
        }
    }

    /*
     * If the "hits" metric of the idle state matching the sleep length is
     * greater than its "misses" metric, that is the one to use. Otherwise,
     * it is more likely that one of the shallower states will match the
     * idle duration observed after wakeup, so take the one with the maximum
     * "early hits" metric, but if that cannot be determined, just use the
     * state selected so far.
     */
    if (hits <= misses) {
        /*
         * The current candidate state is not suitable, so take the one
         * whose "early hits" metric is the maximum for the range of
         * shallower states.
         */
        if (idx == max_early_idx)
            max_early_idx = prev_max_early_idx;

        if (max_early_idx >= 0) {
            idx = max_early_idx;
            duration_ns = drv->states[idx].target_residency_ns;
        }
    }

    /*
     * If there is a latency constraint, it may be necessary to use a
     * shallower idle state than the one selected so far.
     */
    if (constraint_idx < idx)
        idx = constraint_idx;

    if (idx < 0) {
        idx = 0; /* No states enabled. Must use 0. */
    } else if (idx > 0) {
        unsigned int count = 0;
        u64 sum = 0;

        /*
         * Count and sum the most recent idle duration values less than
         * the current expected idle duration value.
         */
        for (i = 0; i < INTERVALS; i++) {
            u64 val = cpu_data->intervals[i];

            if (val >= duration_ns)
                continue;

            count++;
            sum += val;
        }

        /*
         * Give up unless the majority of the most recent idle duration
         * values are in the interesting range.
         */
        if (count > INTERVALS / 2) {
            u64 avg_ns = div64_u64(sum, count);

            /*
             * Avoid spending too much time in an idle state that
             * would be too shallow.
             */
            if (teo_time_ok(avg_ns)) {
                duration_ns = avg_ns;
                if (drv->states[idx].target_residency_ns > avg_ns)
                    idx = teo_find_shallower_state(drv, dev,
                                                   idx, avg_ns);
            }
        }
    }

    /*
     * Don't stop the tick if the selected state is a polling one or if the
     * expected idle duration is shorter than the tick period length.
     */
    if (((drv->states[idx].flags & CPUIDLE_FLAG_POLLING) ||
         duration_ns < TICK_NSEC) && !tick_nohz_tick_stopped()) {
        *stop_tick = false;

        /*
         * The tick is not going to be stopped, so if the target
         * residency of the state to be returned is not within the time
         * till the closest timer including the tick, try to correct
         * that.
         */
        if (idx > 0 && drv->states[idx].target_residency_ns > delta_tick)
            idx = teo_find_shallower_state(drv, dev, idx, delta_tick);
    }

    return idx;
}
5.5.1 teo_update

/**
 * teo_update - Update CPU data after wakeup.
 * @drv: cpuidle driver containing state data.
 * @dev: Target CPU.
 */
static void teo_update(struct cpuidle_driver *drv, struct cpuidle_device *dev)
{
    struct teo_cpu *cpu_data = per_cpu_ptr(&teo_cpus, dev->cpu);
    int i, idx_hit = -1, idx_timer = -1;
    u64 measured_ns;

    if (cpu_data->time_span_ns >= cpu_data->sleep_length_ns) {
        /*
         * One of the safety nets has triggered or the wakeup was close
         * enough to the closest timer event expected at the idle state
         * selection time to be discarded.
         */
        measured_ns = U64_MAX;
    } else {
        u64 lat_ns = drv->states[dev->last_state_idx].exit_latency_ns;

        /*
         * The computations below are to determine whether or not the
         * (saved) time till the next timer event and the measured idle
         * duration fall into the same "bin", so use last_residency_ns
         * for that instead of time_span_ns which includes the cpuidle
         * overhead.
         */
        measured_ns = dev->last_residency_ns;
        /*
         * The delay between the wakeup and the first instruction
         * executed by the CPU is not likely to be worst-case every
         * time, so take 1/2 of the exit latency as a very rough
         * approximation of the average of it.
         */
        if (measured_ns >= lat_ns)
            measured_ns -= lat_ns / 2;
        else
            measured_ns /= 2;
    }

    /*
     * Decay the "early hits" metric for all of the states and find the
     * states matching the sleep length and the measured idle duration.
     */
    for (i = 0; i < drv->state_count; i++) {
        unsigned int early_hits = cpu_data->states[i].early_hits;

        cpu_data->states[i].early_hits -= early_hits >> DECAY_SHIFT;

        if (drv->states[i].target_residency_ns <=
            cpu_data->sleep_length_ns) {
            idx_timer = i;
            if (drv->states[i].target_residency_ns <= measured_ns)
                idx_hit = i;
        }
    }

    /*
     * Update the "hits" and "misses" data for the state matching the sleep
     * length. If it matches the measured idle duration too, this is a hit,
     * so increase the "hits" metric for it then. Otherwise, this is a
     * miss, so increase the "misses" metric for it. In the latter case
     * also increase the "early hits" metric for the state that actually
     * matches the measured idle duration.
     */
    if (idx_timer >= 0) {
        unsigned int hits = cpu_data->states[idx_timer].hits;
        unsigned int misses = cpu_data->states[idx_timer].misses;

        hits -= hits >> DECAY_SHIFT;
        misses -= misses >> DECAY_SHIFT;

        if (idx_timer > idx_hit) {
            misses += PULSE;
            if (idx_hit >= 0)
                cpu_data->states[idx_hit].early_hits += PULSE;
        } else {
            hits += PULSE;
        }

        cpu_data->states[idx_timer].misses = misses;
        cpu_data->states[idx_timer].hits = hits;
    }

    /*
     * Save idle duration values corresponding to non-timer wakeups for
     * pattern detection.
     */
    cpu_data->intervals[cpu_data->interval_idx++] = measured_ns;
    if (cpu_data->interval_idx >= INTERVALS)
        cpu_data->interval_idx = 0;
}
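A note on the decaying metrics above, assuming the 5.10 values DECAY_SHIFT == 3 and PULSE == 1024 (check the macros in teo.c before relying on them): each update first discards 1/8 of a metric and then adds PULSE on the matching outcome. A metric that is reinforced every single time converges to the fixed point m = m - m/8 + 1024, i.e. m = 8192, while one that stops being reinforced decays geometrically toward 0. The comparisons between hits, misses and early_hits therefore compare recent event frequencies, not lifetime counts.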
5.6 teo_reflect - flag that an update is needed

/**
 * teo_reflect - Note that governor data for the CPU need to be updated.
 * @dev: Target CPU.
 * @state: Entered state.
 */
static void teo_reflect(struct cpuidle_device *dev,
                        int state)   //the C-state we just exited
{
    struct teo_cpu *cpu_data = per_cpu_ptr(&teo_cpus, dev->cpu);

    //1. Record the C-state used last time
    dev->last_state_idx = state;
    /*
     * If the wakeup was not "natural", but triggered by one of the safety
     * nets, assume that the CPU might have been idle for the entire sleep
     * length time.
     */
    if (dev->poll_time_limit ||
        (tick_nohz_idle_got_tick() && cpu_data->sleep_length_ns > TICK_NSEC)) {
        dev->poll_time_limit = false;
        cpu_data->time_span_ns = cpu_data->sleep_length_ns;
    } else {
        cpu_data->time_span_ns = local_clock() - cpu_data->time_span_ns;
    }
}
Comments

haltpoll is a cpuidle governor used to optimize virtual machine performance. drivers/cpuidle/cpuidle-haltpoll.c implements a dedicated haltpoll_driver that defines only two states: state[0] is the poll state, and state[1] enters the default idle handler through the default_enter_idle callback.

https://lwn.net/Articles/792618/

@Clarke Li Thanks for the reply; I'll study this area further and improve the article later!