linux内核工程师 2.09节 Scheduling Domains（一）

Scheduling Domains 引入的背景

Scheduling Domains 是现代硬件技术尤其是多 CPU 多核技术发展的产物。现在，一个复杂的高端系统由上到下可以这样构成：

它是一个 NUMA 架构的系统，系统中的每个 Node 访问系统中不同区域的内存有不同的速度。
同时它又是一个 SMP 系统。由多个物理 CPU(Physical Package) 构成。这些物理 CPU 共享系统中所有的内存。但都有自己独立的 Cache 。
每个物理 CPU 又由多个核 (Core) 构成，即 Multi-core 技术或者叫 Chip-level Multi processor(CMP) 。这些核都被集成在一块 die 里面。一般有自己独立的 L1 Cache，但可能共享 L2 Cache 。
每个核中又通过 SMT 之类的技术实现多个硬件线程，或者叫 Virtual CPU( 比如 Intel 的 Hyper-threading 技术 ) 。这些硬件线程，逻辑上看是就是一个 CPU 。它们之间几乎所有的东西都共享。包括 L1 Cache，甚至是逻辑运算单元 (ALU) 以及 Power 。

在上述系统中，最小的执行单元是逻辑 CPU，进程的调度执行也是相对于逻辑 CPU 的。因此，后文皆简称逻辑 CPU 为 CPU，是物理 CPU 时会特别说明。

在这样复杂的系统，调度器要解决的一个首要问题就是如何发挥这么多 CPU 的性能，使得负载均衡。不存某些 CPU 一直很忙，进程在排队等待运行，而某些 CPU 却是处于空闲状态。

但是在这些 CPU 之间进行 Load Balance 是有代价的，比如对处于两个不同物理 CPU 的进程之间进行负载平衡的话，将会使得 Cache 失效。造成效率的下降。而且过多的 Load Balance 会大量占用 CPU 资源。

还有一个要考虑的就是功耗 (Power) 的问题。一个物理 CPU 中的两个 Virtual CPU 各执行一个进程，显然比两个物理 CPU 中的 Virtual CPU 各执行一个进程节省功耗。因为硬件上可以实现一个物理 CPU 不用时关掉它以节省功耗。

为了解决上述的这些问题，内核开发人员 Nick Piggin 等人在 Linux 2.6 中引入基于 Scheduling Domains 的解决方案。

Scheduling Domains 原理

每个 Scheduling Domain 其实就是具有相同属性的一组 cpu 的集合。并且跟据 Hyper-threading, Multi-core, SMP, NUMA architectures 这样的系统结构划分成不同的级别。不同级之间通过指针链接在一起，从而形成一种的树状的关系。如下图所示：

图 1. Scheduling Domains 原理

负载平衡就是针对 Scheduling domain 的。从叶节点往上遍历。直到所有的 domain 中的负载都是平衡的。当然对不同的 domain 会有不同的策略识别是否负载不平衡，以及不同的调度策略。通过这样的方式，从而很好的发挥众多 cpu 的效率。

Scheduling Domains 结构

基于 Scheduling Domains 的调度器引入了一组新的数据结构。下面先讲一下两个主要的数据结构：

struct sched_domain: 代表一个 Scheduling Domain，也就是一个 CPU 集合，这个集合里所有的 CPU 都具有相同的属性和调度策略。 Load Balance 是针对每个 domain 里的 CPU 进行的。这里要注意 Scheduling Domains 是分级的。像上节所讲的复杂系统就分为 Allnuma_domain,Numa_domain, Phy_domain, Core_domain, Smt_domain(Cpu_domain) 五个等级。

表 1 列出了这个结构的主要字段。

struct sched_group: 每个 Scheduling domain 都有一个或多个 CPU group，每个 group 都被 domain 当做一个单独的单元来对待。 Load Balance 就是在这些 CPU group 之间的 CPU 进行的。表 2 列出了它的主要字段。

表 1. sched_domain 数据结构

类型	名称	描述
struct domain *	Parent	当前 domain 的父 domain
struct domain *	Child	当前 domain 的子 domain
cpumask_t	Span	当前 domain 中的所有 cpu 位图
unsigned long	min_interval	最小的 load balance 间隔
unsigned long	max_interval	最大的 load balance 间隔
unsigned int	busy_factor	Busy 时延迟进行 balance 的系数
unsigned int	busy_idx
unsigned int	idle_idx
unsigned int	newidle_idx
unsigned int	wake_idx
unsigned int	forkexec_idx
int	flags	当前 domain 的一些状态标记
enum sched_domain_level	level	当前 domain 的级别
unsigned long	last_balance	当前 domain 最近一次进行 balance 时的时间 (jiffies 为单位 )
unsigned int	balance_interval	进行 balance 的时间间隔（ms 为单位）
unsigned int	nr_balance_failed	balance 失败的次数

表 2. sched_group 数据结构

类型	名称	描述
struct sched_group *	next	下一个 group 的指针
cpumask_t	cpumask	当前 group 有哪些 CPU
unsigned int	__cpu_power	当前 group 的 CPU power
u32	reciprocal_cpu_power	CPU power 的倒数

下面以我们之前提到的 NUMA，SMP，Multcore，SMT 结合的复杂系统为例来具体解释一下 Scheduling domains 和 CPU groups 之间的关系。

这里为了简化讨论，假设每个物理 CPU 只有两个核，每个核只有两个逻辑 CPU 。

如下图所示：

图 2. 物理 CPU 示意图

当系统启动时，会分别把每个核的两个逻辑 CPU 放入一个 Scheduling Domain，

这个级别的 domain 叫做 cpu_domain 。其中每个 domain 包括两个 CPU groups，每个 CPU group 只有一个逻辑 CPU 。

如下图所示：

图 3. 逻辑 CPU

同时每个物理 CPU 的两个核被放入一个高一级的 Scheduling Domain 。这个 domain 命名为 core_domain 。其中每个 domain 包括两个 CPU groups，每个 CPU group 有两个逻辑 CPU 。

如下图所示：

图 4. CPU group

对于我们前述的复杂系统，再往上的话依次还有 phys_domain( 物理 CPU 放入的 domain) ；

numa_domain(NUMA 中的 16 个 nodes 放入的 domain) ； allnode_domain( 所有 NUMA 中的 node 放入的 domain) 。从而将所有 CPU 组织成一个基于 domain 的层次结构。

Scheduling Domains 实现

对每个 Scheduling Domain 中的 CPU groups 之间的 CPU 进行的。

每个 Scheduling Domain 都包含一些重要的信息用来决定在这级 domain 的 CPU groups 之间如何进行 Load Balance 。

典型的一些原则如下：
1. 在 cpu_domain 级：因为是共享 cache，cpu_power 也基本是共用的。所以可以在这个 domain 级的 cpu groups 之间可以不受限制的进行 load balance 。
2. 在 core_domain 级：可能会共享 L2 级 cache, 具体跟实现相关了。因此这一级的 balance 相对没那么频繁。要 core 之间负载的不平衡达到一定程度才进行 balance 。
3. 在 phys_domain 级：在这一个 domain 级，如果进行 balance 。则每个物理 CPU 上的 Cache 会失效一段时间，并且要考虑 cpu_power，因为物理 CPU 级的 power 一般是被数关系。比如两个物理 CPU 是 power*2，而不像 core, 或逻辑 CPU，只是 power*1.1 这样的关系。
4. 在 numa_domain 级：这一级的开销最大，一般很少进行 balance 。

基本实现

Linux 主要通过以下几个方面来实现基于 Scheduling domains 的 Load Balance 。

周期性的 load balance

每次时钟中断到来，如果发现当前 cpu 的运行队列需要进行下一次的 balance 的时间已

经到了，则触发 SCHED_SOFTIRQ 软中断。

触发软中断

if (time_after_eq(jiffies, rq->next_balance))

 
              raise_softirq(SCHED_SOFTIRQ);

软中断的执行函数是 run_rebalance_domains(), 它主要是再调 rebalance_domains() 来实现。

清单 1. run_rebalance_domains()

 
              static void run_rebalance_domains(struct softirq_action *h)  
             
              {  
             
              int this_cpu = smp_processor_id();  
             
              struct rq *this_rq = cpu_rq(this_cpu);  
             
              enum cpu_idle_type idle = this_rq->idle_at_tick ?  
             
              CPU_IDLE : CPU_NOT_IDLE;  
             
              rebalance_domains(this_cpu, idle);  
             
              #ifdef CONFIG_NO_HZ  
             
              #endif  
             
              }

rebalance_domains()，根据 domain 的级别，从下往上扫描每一级 Scheduling Domain 。如果发现这个 domain 的 balance 之间的间隔时间到了，则进一步进行 task 的移动。不同级别的 domain 是会有不同的间隔时间的。而且级别越高值越大，因为移动 task 的代价越大。

清单 2. rebalance_domains()

 
              static void rebalance_domains(int cpu, enum cpu_idle_type idle)  
             
              {  
             
              int balance = 1;  
             
              struct rq *rq = cpu_rq(cpu);  
             
              unsigned long interval;  
             
              struct sched_domain *sd;  
             
              /* Earliest time when we have to do rebalance again */  
             
              unsigned long next_balance = jiffies + 60*HZ;  
             
              int update_next_balance = 0;  
             
              int need_serialize;  
             
              cpumask_t tmp;  
             
              for_each_domain(cpu, sd) {  
             
              if (!(sd->flags & SD_LOAD_BALANCE))  
             
              continue;  
             
              interval = sd->balance_interval;  
             
              if (idle != CPU_IDLE)  
             
              interval *= sd->busy_factor;  
             
              /* scale ms to jiffies */  
             
              interval = msecs_to_jiffies(interval);  
             
              if (unlikely(!interval))  
             
              interval = 1;  
             
              if (interval > HZ*NR_CPUS/10)  
             
              interval = HZ*NR_CPUS/10;  
             
              need_serialize = sd->flags & SD_SERIALIZE;  
             
              if (need_serialize) {  
             
              if (!spin_trylock(&balancing))  
             
              goto out;  
             
              }  
             
              if (time_after_eq(jiffies, sd->last_balance + interval)) {  
             
              if (load_balance(cpu, rq, sd, idle, &balance, &tmp)) {  
             
              /*  
             
              * We've pulled tasks over so either we're no  
             
              * longer idle, or one of our SMT siblings is  
             
              * not idle.  
             
              */  
             
              idle = CPU_NOT_IDLE;  
             
              }  
             
              sd->last_balance = jiffies;  
             
              }  
             
              if (need_serialize)  
             
              spin_unlock(&balancing);  
             
              out:  
             
              if (time_after(next_balance, sd->last_balance + interval)) {  
             
              next_balance = sd->last_balance + interval;  
             
              update_next_balance = 1;  
             
              }  
             
              /*  
             
              * Stop the load balance at this level. There is another  
             
              * CPU in our sched group which is doing load balancing more  
             
              * actively.  
             
              */  
             
              if (!balance)  
             
              break;  
             
              }  
             
              /*  
             
              * next_balance will be updated only when there is a need.  
             
              * When the cpu is attached to null domain for ex, it will not be  
             
              * updated.  
             
              */  
             
              if (likely(update_next_balance))  
             
              rq->next_balance = next_balance;  
             
              }

针对 CPU IDLE 的处理
如果一个逻辑 CPU 进入了 IDLE 状态，并且它所属的 domain 设置了 SD_BALANCE_NEWIDLE，则马上就会进行 balance，把忙的 CPU 上的进程 move 过来，从而最大的发挥多 CPU 的优势。
针对 fork(), exec() 的处理
当一个进行调用 exec() 执行时，本来就是要加载一个新进程，缓存本来就会失效。所以，move 到哪里都可以。因此找设置了 SD_BALANCE_EXEC 标记的 domain 。然后把进程移动到那个 domain 中最闲的 CPU group 的 CPU 上。
fork() 时也进行类似的处理。
其它因素比如针对 cpu_power 的处理