CFS Group Scheduling

Note: the abbreviations used in this article are listed below.

db14c3e7066cc1e5be2017109782754f.png

1. Introduction to CFS Group Scheduling

1.1. Reason for existence

In short, the goal is that under high load, tasks in different groups receive a controllable proportion of CPU resources. Why is this needed? Consider a multi-user computer system in which each user's tasks form one group. User A has 90 identical tasks while user B has only 10. If the CPU is fully loaded, user A takes 90% of the CPU time and user B only 10%, which is clearly unfair to B. Or, a user who wants a fast -j64 build but does not want the compilation to disturb his other work can put the compilation tasks into a dedicated group to limit their CPU resources.

1.2. Group status on mobile devices

99e96464e0df770c3cee8fcc52cd2ab0.png

The /dev/cpuctl directory is represented by struct task_group root_task_group. Each subdirectory below it is abstracted into a task_group structure. There are a few points to note:

  1. The default value of the cpu.shares file under the root group is 1024, and setting is not supported. The load for the root group is also not updated.

  2. The root group is also a task group; tasks directly under the root group belong to it and to no other group.

  3. Kernel threads sit under the root group by default and allocate time slices directly from the root cfs_rq, which is a big advantage: if such a thread stays runnable it appears to be "running all the time" in a trace, the common kswapd kernel thread being one example.

  4. With the default cpu.shares configuration, all tasks at nice=0, and only a single core considered, the time slice a single task directly under the root group receives at full load equals the sum of the time slices received by all tasks of any one other group (each group se carries the same weight, 1024, as a single nice-0 task).

Note: both tasks and task groups are allocated time slices by weight, but a task's weight comes from its priority while a task group's weight comes from the value written to the cpu.shares file in its cgroup directory. Once group scheduling is enabled, the time slice a task receives cannot be deduced from the weight of its prio alone; the weight allocated to its task group and the running state of the other tasks in the group must also be considered.

2. Grouping tasks into task groups

The CFS group scheduling function is mainly reflected by grouping tasks, and a group is represented by a struct task_group.

2.1. How to set up grouping

The task group grouping configuration interface is exported to user space by the cpu cgroup subsystem through the cgroup directory hierarchy.

9e94d656874a6c80d1576da49389eb88.png

How is a task removed from a task group? It cannot be removed directly: under cgroup semantics a task must belong to some task group at every moment, so the only way to remove it from the current group is to echo it into another group.

2.2. How to set grouping in Android

Process.java provides setProcessGroup(int pid, int group) for other modules to move the process pid into the group specified by the group parameter. For example, OomAdjuster.java calls it with group=THREAD_GROUP_TOP_APP or THREAD_GROUP_BACKGROUND when switching a task to the foreground or background, respectively.

libprocessgroup provides a configuration file named task_profiles.json, whose AggregateProfiles aggregate attribute fields define the behavior triggered by the upper-layer setting. For example, THREAD_GROUP_TOP_APP corresponds to the aggregate profile SCHED_SP_TOP_APP, and the behavior of its MaxPerformance attribute is to join the top-app group of the cpu subsystem.

e2eea5144b4f9c1a8176cc01642f4a9d.png

The configuration of the "MaxPerformance" attribute is very readable, and it can be seen that it is added to the top-app group of the cpu subsystem.

2.3. What setting a task to the TOP-APP group in Android actually configures in cgroup

There are multiple cgroup subsystems. Besides the cpu cgroup subsystem that CFS group scheduling attaches to, there are the cpuset cgroup subsystem (restricting the CPUs and memory nodes a task may use), the blkio cgroup subsystem (limiting a process's block-device I/O), the freezer cgroup subsystem (providing process freezing), and so on. An upper-layer grouping change may therefore switch more than one cgroup; exactly which subsystems are switched is reflected in the array members of the AggregateProfiles aggregate attribute. In the example above, the other two attributes respectively join the root group of the blkio subsystem and set the task's timer_slack_ns (a parameter trading hrtimer wakeup timeliness against power consumption) to 50000ns.

192454aea67f52589432d17db0fab2f8.png

2.4. Placing a service into a specified group at startup

A service can be placed into a group at startup via task_profiles, for example:

588f84b44b1d6d7594ad17d86b61d175.png

3. Overview of Kernel Implementation

Having covered how task group grouping is configured, this section moves into the kernel to understand the implementation.

3.1. Related functional dependencies

Kernel-related CONFIG_* dependencies are as follows:

figure 1:

81a8b2a01d3759969c887bb0176b9a3e.png

CGROUP provides the cgroup directory hierarchy; CGROUP_SCHED provides the cpu cgroup directory hierarchy (such as /dev/cpuctl/top-app) and gives each cpu cgroup directory the notion of a task group; FAIR_GROUP_SCHED provides the CFS group scheduling function on top of the task groups supplied by the cpu cgroup. The gray dashed boxes in Figure 1 are not enabled by default in the Android phone kernel.

3.2. Task group data structure diagram

Figure 2 below is a block diagram of the main data structures a task group maintains in the kernel; it is easier to understand alongside the following points:

  1. Since it cannot be known in advance which CPU a group's tasks will run on, a task group maintains a group cfs_rq on every CPU; and because the task group itself participates in scheduling (the group must be picked before a task on its group cfs_rq can be picked), a group se is also maintained on every CPU.

  2. The my_q member of task se is NULL, and the my_q member of group se points to its corresponding group cfs_rq, and the ready tasks grouped on this CPU are hung on this group cfs_rq.

  3. The task group can be nested, and the parent/siblings/children members form an inverted tree hierarchy, and the parent of the root task group points to NULL.

  4. The root cfs_rq of all CPUs belongs to the root task group.

figure 2:

2146a94a2b535cf7190127b177855470.png

Note: drawing diagrams is tedious; this one is adapted from wowotech.

4. Relevant structures

4.1. struct task_group

A struct task_group represents a cpu cgroup grouping. When FAIR_GROUP_SCHED is enabled, a struct task_group represents a task group in CFS group scheduling.

1231664649ad6579d922f92b95f681e0.png

css: The cgroup status information corresponding to the task group, through which it is attached to the cgroup directory hierarchy.

se: the group se, an array pointer whose size is the number of CPUs. A task group has multiple tasks that may run on any CPU, so each CPU needs an se for the task group.

cfs_rq: the cfs_rq owned by each group se, i.e. the task group's cfs_rq on each CPU; also an array pointer whose size is the number of CPUs. When a task of this task group becomes ready it is enqueued on this cfs_rq. On each CPU this pointer equals the corresponding group se's se->my_q, see init_tg_cfs_entry().

shares: the weight of the task group, scale_load(1024) by default. As with a task se's weight, a larger value means the task group obtains more CPU time slices. Unlike a task, however, the task group has a group se on every CPU, so this weight must be distributed to the per-CPU group se according to certain rules, explained below.

load_avg: the load of the task group. It is a single load_avg variable (unlike se and cfs_rq, whose load is a full struct sched_avg), explained below. Note that it is not per-CPU: it is updated whenever a task of this group updates its load on any CPU, so its impact on performance deserves attention.

parent/siblings/children: The hierarchy that makes up the task group.

There is a global variable struct task_group root_task_group in the kernel, which represents the root group. Its cfs_rq[] is the cfs_rq of each cpu, and its se[] is NULL. Its weights are not allowed to be set, see sched_group_set_shares(). The load is also not updated, see update_tg_load_avg().

All task group structures in the system will be added to the task_groups linked list, which is used in CFS bandwidth control.

Beginners may confuse the two concepts of group scheduling and scheduling group, as well as the difference between struct task_group and struct sched_group. Group scheduling corresponds to struct task_group, which is used to describe a group of tasks. It is mainly used for util uclamp (to clamp the computing power requirements of a group of tasks) and CPU resource usage restrictions, which is also the content to be explained in this article. The scheduling group corresponds to struct sched_group, which is a concept in the CPU topology sched_domain. It is used to describe the attributes of a CPU (MC level)/Cluster (DIE level), and is mainly used for core selection and load balancing.

4.2. struct sched_entity

A sched_entity can represent not only a task se, but also a group se. The following mainly introduces some new members after enabling group scheduling.

024369a93f988ebdead9c96c06bb4bca.png

load: Indicates the weight of se. For gse, it is initialized to NICE_0_LOAD when new, see init_tg_cfs_entry().

depth: the nesting depth of the task group; se directly under the root group have depth 0, and each deeper nesting level adds 1. For example, with a tg1 directory under /dev/cpuctl and a tg2 directory under tg1, the group se of tg1 has depth 0, a task se under tg1 has depth 1, and a task se under tg2 has depth 2. See init_tg_cfs_entry()/attach_entity_cfs_rq() for where it is updated.

parent: points to the parent se; parent and child se always belong to the same CPU. For tasks under the root group it is NULL.

cfs_rq: the cfs_rq this se is queued on. For tasks under the root group it points to the rq's cfs_rq; for tasks in a non-root group it points to parent->my_q, see init_tg_cfs_entry().

my_q: The cfs_rq of this se, only the group se has cfs_rq, and the task se is NULL. The entity_is_task() macro judges whether it is a task se or a group se through this member.

runnable_weight: Cache the value of gse->my_q->h_nr_running, used when calculating the runnable load of gse.

avg: The load of se, for tse it will be initialized as its weight (it is assumed that its load is high when it is created), and for gse it will be initialized to 0, see init_entity_runnable_average(). There is a certain difference between task se and group se, which will be explained in Chapter 5 below.

4.3. struct cfs_rq

struct cfs_rq represents both the per-CPU CFS ready queue and a gse's my_q queue. The members most relevant to group scheduling are explained below.

0d853eb1e91c1d0481d528309cbe891f.jpeg

load: Indicates the weight of cfs_rq, whether it is the root cfs_rq or grq, the weight here is equal to the sum of the weights of all tasks hung on its queue.

nr_running: The sum of the number of task se and group se under the current level.

h_nr_running: The sum of the number of task se under the current level and all sub-levels, excluding group se.

avg: load of cfs_rq. The following will explain the load by comparing task se, group se, and cfs_rq.

removed: when a task exits, or migrates to another CPU after waking up, the load it contributed must be removed from the original CPU's cfs_rq. The removal is first recorded in the removed member, and actually applied the next time update_cfs_rq_load_avg() updates the cfs_rq load. nr is the number of se awaiting removal, and the *_avg fields are the sums of the various loads to remove.

tg_load_avg_contrib: It is a cache for grq->avg.load_avg, indicating the contribution value of the current grq load to tg. Used to reduce the number of accesses to tg->load_avg while updating tg->load_avg. It is also used in the approximate algorithm when calculating the weight quota that gse gets from tg, see calc_group_shares()/update_tg_load_avg().

propagate: Mark whether there is a load that needs to be propagated to the upper layer. Section 7.3 below will explain.

prop_runnable_sum: When the load propagates to the upper layer along the task group hierarchy, it indicates the load_sum value of the tse/gse to be uploaded.

h_load: Hierarchy load, indicating the contribution value of load_avg of cfs_rq of this layer to load_avg of CPU, mainly used in the load balancing path. It will be explained below.

last_h_load_update: Indicates the last time h_load was updated (unit jiffies).

h_load_next: points to the child gse. To compute a task's hierarchical load (task_h_load()), the h_load of each level's cfs_rq must be updated top-down from the root cfs_rq, so h_load_next forms the cfs_rq--se--cfs_rq--se chain from the topmost cfs_rq down to the bottom one.

rq: a member added when group scheduling is enabled. Without group scheduling, cfs_rq is embedded in rq and container_of() is used to get from cfs_rq to rq; with group scheduling enabled, this rq member provides the route from cfs_rq to rq.

on_list/leaf_cfs_rq_list: Try to connect leaf cfs_rq in series and use them in CFS load balancing and bandwidth control related logic.

tg: The task group to which the cfs_rq belongs.

5. Task group weight

The weight of the task group is represented by the shares member of the struct task_group, and the default value is scale_load(1024). You can read and write through the cpu.shares file in the cgroup directory, echo weight > cpu.shares is to configure the task group weight as weight, and the value saved in the shares member variable is scale_load(weight). root_task_group does not support setting weights.

The weight of different task groups indicates which task group can run more and which task group should run less after the system CPU is full.

5.1. Weight of gse

The task group has a group se on each CPU, so it is necessary to assign the weight tg->shares of the task group to each gse according to certain rules. The rule is formula (1):

                      tg->weight * grq->load.weight
  ge->load.weight = -------------------------------        (1)
                       \Sum grq->load.weight

Here tg->weight is tg->shares, and grq->load.weight is the weight of tg's grq on each CPU. That is, each gse receives a share of tg's weight proportional to the weight of its cfs_rq, and a cfs_rq's weight equals the sum of the weights of the tasks queued on it. Suppose tg's weight is 1024 and the system has only 2 CPUs, hence two gse. If the tasks on the grqs are as shown in Figure 3, then the weight of gse[0] is 1024 * (1024+2048+3072)/(1024+2048+3072+1024+1024) = 768, and the weight of gse[1] is 1024 * (1024+1024)/(1024+2048+3072+1024+1024) = 256.

Figure 3:

8ae44155b35b5a67a3d159a5753227c6.png

The weight update function of gse is update_cfs_group(), see its specific implementation below:

d6491d8c2c6a0d5850c09d71353def50.png

The distribution of the weight of tg to gse[X] is done in calc_group_shares().

Formula (1) uses \Sum grq->load.weight, meaning that updating one gse's weight requires accessing the grq on every CPU, and the cost of lock contention is relatively high, so a series of approximations is made.

First do the replacement:

  grq->load.weight --> grq->avg.load_avg                   (2)

and then get:

                      tg->weight * grq->avg.load_avg
  ge->load.weight = ---------------------------------      (3)
                             tg->load_avg

  where: tg->load_avg ~= \Sum grq->avg.load_avg

Since cfs_rq->avg.load_avg = cfs_rq->avg.load_sum / divider, and cfs_rq->avg.load_sum equals cfs_rq->load.weight multiplied by the geometric series accumulated over non-idle time, this approximation is exact only if the non-idle time series of tg's grqs are identical on every CPU. In other words, the more uniformly tg's tasks run across the CPUs, the closer the approximation holds.

Consider the special case of an idle task group on which a task has just started: grq->avg.load_avg needs time to build up, and during that ramp-up formula (1) simplifies to:

                      tg->weight * grq->load.weight
  ge->load.weight = -------------------------------   =   tg->weight   (4)
                          grq->load.weight

It is equivalent to the state of a single-core system. In order to make formula (3) closer to formula (4) in this special case, another approximation is made to get:

                               tg->weight * grq->load.weight
  ge->load.weight = -----------------------------------------------------   (5)
                    tg->load_avg - grq->avg.load_avg + grq->load.weight

But since grq may hold no tasks, grq->load.weight can drop to 0 and cause a division by zero, so grq->avg.load_avg is used as a lower bound, giving:

                      tg->weight * grq->load.weight
  ge->load.weight = -------------------------------        (6)
                             tg_load_avg'

  where:

  tg_load_avg' = tg->load_avg - grq->avg.load_avg
                 + max(grq->load.weight, grq->avg.load_avg)

max(grq->load.weight, grq->avg.load_avg) usually evaluates to grq->load.weight, because grq->avg.load_avg only approaches grq->load.weight when tasks on grq are continuously running+runnable.

The calc_group_shares() function approximates the weight of each gse through the formula (6):

f4b7dbdb680cb440a4ff4c405b496141.png

Since every task in tg contributes to the weight of a gse, a gse's weight must be updated whenever the number of tasks on its grq changes. Because the approximation uses se load, the weight is also refreshed in entity_tick(). Call path:

b8065ddf7cf243bd4db441c8a77257b3.png

5.2. How each tse shares the weight of gse

The tasks in a task group likewise split the gse's weight in proportion to their own weights. For the three tasks queued on gse[0]'s grq in Figure 3, the weight of tse1 is 768*1024/(1024+2048+3072) = 128, the weight of tse2 is 768*2048/(1024+2048+3072) = 256, and the weight of tse3 is 768*3072/(1024+2048+3072) = 384.

This proportional weight is used when the tasks in tg divide up the time slice allocated to tg. The deeper the group nesting, the smaller the proportional weight that can be allocated, so under high load the tasks inside a task group are at a disadvantage when time slices are handed out.

6. Task group time slice

6.1. Time slice allocation

With CFS group scheduling enabled, the time slice handed down from the upper layer is distributed level by level, top to bottom, according to the weight ratios; the allocation function is sched_slice(). Traversing top-down is inconvenient, however, so the walk is done bottom-up instead; after all, A*B*C equals C*B*A.

9530895e61c8f8621bcc84a9f24b7962.png

The main path of sched_slice is as follows:

5dab32d84992ae330f47495f1f1734da.png

In the tick interrupt, if it is found that se's running time has exceeded its allocated time slice, it will trigger preemption so that it can give up the CPU.

As shown in Figure 4, assume tg is nested two levels deep, each gse along tg's chain has weight 1024 on the current CPU, and the period is computed directly from the number of tasks: with 5 tse, period = 3ms * 5 = 15ms. Then:

tse1 gets 1024/(1024+1024) * 15 = 7.5ms;

tse2 gets [1024/(1024+1024+1024)] * {[1024/(1024+1024)] * 15 }= 2.5ms

tse4 gets [1024/(1024+1024)] * {[1024/(1024+1024+1024)] * [1024/(1024+1024)] * 15} = 1.25ms

Figure 4:

73625b62657fcda8b896b7382989418b.png

Note: The weights of tg1 and tg2 are configured through the cpu.shares file, and then the gse on each cpu distributes the weight according to the weight ratio of grq on it from the weight configured by cpu.shares. The weight of gse is no longer linked to the nice value.

6.2. Runtime conduction

pick_next_task_fair() always prefers the se with the smallest virtual time, so how is a gse's virtual time updated? Virtual time is updated in update_curr(), and the gse virtual times are then updated level by level via for_each_sched_entity: if a tse runs for 5ms, every gse above it has also run for 5ms, and each level converts that into virtual time according to its own weight.

60b2685dbb2a0a4200fee852c411e2a8.png

Main calling path:

4f59c378f24b994d8e71630f04294a9c.png

When selecting the next task to run, select se with the smallest virtual time level by level. If gse is selected, continue to select from its grq until tse is selected.

7. PELT load of task group

7.1. Calculate the timeline used by the load

The timeline used for load calculation differs from the one used for virtual time. Virtual time uses rq->clock_task, which simply measures how long something has run. Load calculation uses rq->clock_pelt, which is scaled by the CPU's capacity and its current frequency point, and is synchronized back to rq->clock_task when the CPU goes idle. As a result the load computed by PELT can be used directly, without the extra scaling that WALT-computed load requires. The function that advances rq->clock_pelt is update_rq_clock_pelt().

778a45ee47680d5c876133eb46a484d6.png

The resulting delta = delta * (capacity_cpu / capacity_max(1024)) * (cur_cpu_freq / max_cpu_freq) converts the time actually run on the current CPU at its current frequency into the equivalent time on the highest-performance CPU at its maximum frequency, and this is what gets added to clock_pelt. For example, running 5ms at 1GHz on a little core may be equivalent to only about 1ms on a prime core, so running for the same wall time on cores of different clusters produces different load increases.

7.2. Load definition and calculation

load_avg is defined as: load_avg = runnable% * scale_load_down(load).

runnable_avg is defined as: runnable_avg = runnable% * SCHED_CAPACITY_SCALE.

util_avg is defined as: util_avg = running% * SCHED_CAPACITY_SCALE.

These load values are stored in struct sched_avg, which is embedded in both the se and cfs_rq structures; struct sched_avg additionally carries load_sum, runnable_sum and util_sum members to assist the calculation. For any entity (tse/gse/grq/cfs_rq), runnable% expresses how much it wants to run, which is distinct from running%, how much it actually runs. Both factors lie in [0,1] only for a tse; for the other entities they can exceed that range.

7.2.1. tse load

Let's take a look at the tse load calculation formula. In order to deepen the impression, I will give an example of running an endless loop. See update_load_avg --> __update_load_avg_se() for the calculation function.

load_avg: equal to weight * load_sum / divider, where weight = sched_prio_to_weight[prio-100]. Since load_sum is the geometric progression of the task running+runnable state, divider is approximately the maximum value of the geometric progression, so the load_avg of an infinite loop task is close to its weight.

runnable_avg: equal to runnable_sum / divider. Since runnable_sum is the geometric progression of the task running+runnable state and then scaled up, the divider is approximately the maximum value of the geometric progression, so the runnable_avg of an infinite loop task is close to SCHED_CAPACITY_SCALE.

util_avg: equal to util_sum / divider. Since util_sum is the geometric progression of the running state of the task and then scaled up, the divider is approximately the maximum value of the geometric progression, so the util_avg of an infinite loop task is close to SCHED_CAPACITY_SCALE.

load_sum: It is the cumulative value of the geometric progression for the simple running+runnable state of the task. For an infinite loop, this value approaches LOAD_AVG_MAX.

runnable_sum: It is the cumulative value of the geometric progression of the task running+runnable state and then the value after scale up. For an infinite loop, this value tends to be LOAD_AVG_MAX * SCHED_CAPACITY_SCALE.

util_sum: It is the cumulative value of the geometric progression of the running state of the task and then the value after scale up. For an infinite loop that monopolizes a certain core, this value tends to be LOAD_AVG_MAX * SCHED_CAPACITY_SCALE. If it cannot be monopolized, it will be smaller than this value.

7.2.2. Load of cfs_rq

Let's take a look at the cfs_rq load calculation formula. In order to deepen the impression, I will give an example of running an infinite loop. For the calculation function, see update_load_avg --> update_cfs_rq_load_avg --> __update_load_avg_cfs_rq().

load_avg: directly equal to load_sum / divider. cfs_rq runs full (running an infinite loop or multiple infinite loops), approaching the weight of cfs_rq, which is the sum of the weights of all scheduling entities attached to it, namely Sum(sched_prio_to_weight[prio-100]).

runnable_avg: equal to runnable_sum / divider. cfs_rq runs full (running an infinite loop or multiple infinite loops), approaching the number of tasks on cfs_rq multiplied by SCHED_CAPACITY_SCALE.

util_avg: equal to util_sum / divider. cfs_rq runs full (running an infinite loop or multiple infinite loops), approaching SCHED_CAPACITY_SCALE.

load_sum: the weight of cfs_rq, i.e. the sum of the weights of all se at this level, multiplied by the geometric series over non-idle time. Note that this covers this level only, which matters when the hierarchical load h_load is explained below.

runnable_sum: The number of runnable+running state tasks at all levels on cfs_rq multiplied by the geometric progression in the non-idle state, and then multiplied by the value of SCHED_CAPACITY_SCALE. See __update_load_avg_cfs_rq().

util_sum: the sum of the geometric progression of all tasks running on cfs_rq multiplied by SCHED_CAPACITY_SCALE.

load_avg, runnable_avg, and util_avg describe the CPU load from three dimensions: weight (priority), number of tasks, and CPU time slice occupation.

7.2.3. gse load

gse explained by contrast with tse:

(1) gse will follow the same load update process as tse (updating layer by layer, it will update to gse).

(2) The runnable load of gse is different from that of tse. The runnable_sum of tse is the cumulative value of the geometric progression of the task running+runnable state and then the value after scale up. And gse is the sum of the number of tse of all levels under the current level multiplied by the geometric progression of time and then scale up, see the difference in the runnable parameters of the __update_load_avg_se() function.

(3) Although the load_avg of gse and tse is equal to se->weight * load_sum/divider, see the parameter difference of ___update_load_avg(). But the source of weight is different, so it can be regarded as a point of difference. tse->weight comes from its priority, and gse comes from its quota allocated from tg.

(4) gse will have one more load conduction update process than tse, which will be explained below (if CFS group scheduling is not enabled, there is only one layer, and there is no hierarchical structure of tg, so conduction is not required, only need to be updated to cfs_rq can be).

7.2.4. grq load

The grq load is no different from the cfs_rq load in terms of updates. grq will have one more load conduction update process than cfs_rq, which will be explained below.

7.2.5. Load of tg

tg has only one load, which is tg->load_avg, and the value is \Sum tg->cfs_rq[]->avg.load_avg, which is the sum of load_avg of grq on all CPUs of tg. The tg load update is implemented in update_tg_load_avg(), which is mainly used to assign weights to gse[].

c4b15b815d91ab779b2fb94edd787395.png

Call path:

5adc48146396ed83c0e064455ed11602.png

7.3. Load conduction

Load conduction is a concept only after CFS group scheduling is enabled. When a tse is inserted or deleted on the tg hierarchy, the load of the entire hierarchy changes, so it needs to be conducted layer by layer.

7.3.1. Load conduction trigger conditions

Whether load conduction is needed is marked via the propagate member of struct cfs_rq. Conduction is triggered when a tse is added to or removed from a grq: the tse's load_sum is recorded in the grq's prop_runnable_sum member and then conducted upward level by level. The other loads (runnable_*, util_*) are conducted level by level along tse --> grq --> gse --> grq ...

Mark the need for load conduction in add_tg_cfs_propagate():

6e86350720f0d77e658414866698bc88.png

This function call path:

655160c0dfc68804c6c74094ba52cdab.png

From the above, the load conduction process is triggered when a task switches from a non-CFS scheduling class to CFS, is moved into the current tg, is newly created and first enqueued on a cfs_rq, or migrates to the current CPU; in these cases the load the task brings is added throughout the hierarchy. Conversely, when a task migrates away from the current CPU, leaves the CFS scheduling class, or is moved out of the tg, the load removed with it is subtracted throughout the hierarchy.

Note that a task's load is not removed when it sleeps; it simply stops growing during sleep and decays over time.

7.3.2. Load conduction process

The load conduction process is reflected in the process of updating the load layer by layer. As follows, the load update function update_load_avg() is called on each layer under the main path:

7b619c7b683639b9d9cdefdb23544b4c.png

The function that marks load for conduction and the function that conducts it are one and the same, add_tg_cfs_propagate(); its call path is as follows:

c5fbec465e0d5cc8d4ac9d7ffa15cbf3.png

7.3.2.1. update_tg_cfs_util() updates the util_* loads of gse and grq and conducts them to the upper layer.

eed9c5be988788e08fc11e35a78417e0.png

As can be seen, during conduction the gse's util load is taken directly from its grq, and it is then conducted to the upper layer by updating the upper-layer grq's util_avg.

7.3.2.2. update_tg_cfs_runnable() updates the runnable_* loads of gse and grq and conducts them to the upper layer.

7fe9c357740b63291ac657ecad35506f.png

Likewise, during conduction the gse's runnable load is taken directly from its grq, and it is then conducted to the upper layer by updating the upper-layer grq's runnable_avg.

7.3.2.3. update_tg_cfs_load() updates the load_* loads of gse and grq and conducts them to the upper layer.

The load load is special. During propagation it is not taken directly from the grq's load load; instead, the load_sum value of the tse is recorded when the task is added to or removed from the grq, and this value is then propagated layer by layer in add_tg_cfs_propagate(). The call paths where propagation is marked are:

17f1327c45ea7043b5c58a9a8850303e.png

Marking and propagating the load load both go through this function:

e15a5a5228beeb980103b811fa769fbb.png

The load load update function:

1aff3e0f193d891621cc118ab49d3b5e.png

On task removal, the gse is assigned the average load_sum of the se's on the grq; on task addition, the delta value is added directly to the gse's load_sum.

load_avg is then calculated in the same way as for an ordinary tse: load_sum * se_weight(gse) / divider.
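The add/remove arithmetic can be sketched as follows. This is a simplified toy model (toy_* names and the divider value are illustrative); the kernel's update_tg_cfs_load() additionally clamps load_sum against the divider and against the running-time sum:

```c
#include <assert.h>

#define TOY_DIVIDER 47742  /* stand-in for the PELT divider, illustrative */

/* Toy model of load_* propagation: on task addition the tse's load_sum
 * delta is added to the gse's load_sum; on removal the gse takes the
 * per-weight average load_sum of its grq. In both cases load_avg is
 * recomputed as load_sum * se_weight(gse) / divider. */
struct toy_se { long weight; long load_sum; long load_avg; };

static void toy_propagate_add(struct toy_se *gse, long tse_load_sum)
{
    gse->load_sum += tse_load_sum;
    gse->load_avg = gse->load_sum * gse->weight / TOY_DIVIDER;
}

static void toy_propagate_remove(struct toy_se *gse,
                                 long grq_load_sum, long grq_weight)
{
    /* grq->avg.load_sum is weight-scaled, so dividing by the grq's
     * weight yields an average time-based sum for the gse */
    gse->load_sum = grq_weight ? grq_load_sum / grq_weight : 0;
    gse->load_avg = gse->load_sum * gse->weight / TOY_DIVIDER;
}
```

Note that only the time-based load_sum crosses the gse/tse boundary; the gse's own weight (derived from tg->shares) is reapplied when load_avg is recomputed.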

Comparing the three: the runnable and util loads propagate in the direction grq --> gse, through runnable_avg and util_avg respectively, with the gse taking the grq's value directly; the load load propagates in the direction gse --> grq, carried by load_sum.

The reason the load load is propagated and assigned differently from the runnable and util loads is probably tied to how each is computed. For runnable_avg, a gse accounts for the number of tse at all levels below it multiplied by the runnable-time geometric series, as a ratio of the time series; adding one tse to the upper layer is equivalent to incrementing the tse count by one. For util_avg, a gse accounts for the running-time geometric series of all tse as a ratio of the time series; adding one tse to the upper layer is equivalent to adding that tse's running-time series. load_avg, however, involves the weight of the se, and the weights of a gse and a tse come from different sources: the former from the quota allocated out of tg->shares, the latter from the task's priority, so they cannot simply be added or subtracted. load_sum, by contrast, is a plain runnable-time series that does not involve weight, so it works for both tse and gse.

As an example of load_avg propagation, consider Figure 5 below. If ts2 sleeps the whole time while ts1 and ts3 are two busy loops, the load_avg of gse1's grq1 approaches 4096 and the load of the root cfs_rq approaches 2048. If ts3 is then migrated away and the load were reduced by direct subtraction, as is done for the runnable and util loads, the delta would be -4096 and the load_avg of the root cfs_rq would become negative (2048 - 4096 < 0), which is clearly unreasonable. Propagating via load_sum instead subtracts only a time series, and after the removal this amounts to the root cfs_rq losing about 50% of its load.

Figure 5:

f89e8c534ef4322614183a4f814eeb28.png
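The underflow hazard can be shown with a small piece of arithmetic. The numbers below are made up but representative: a tse's weight (from its priority) can exceed its gse's weight (from tg->shares), so subtracting the tse's load_avg from the parent's can go negative, while recomputing from the propagated, weight-free load_sum cannot:

```c
#include <assert.h>

#define TOY_DIVIDER 47742  /* stand-in for the PELT divider, illustrative */

/* Naive propagation: subtract the departing tse's load_avg from the
 * parent's load_avg. Can underflow below zero, because tse and gse
 * weights come from different sources and are not directly comparable. */
static long naive_root_load(long root_load_avg, long tse_load_avg)
{
    return root_load_avg - tse_load_avg;
}

/* load_sum-based propagation: recompute the gse's load_avg from the
 * remaining time-based sum, reapplying the gse's own weight. The result
 * is never negative. (Assumes the gse is the only entity on the root.) */
static long propagated_root_load(long gse_weight, long remaining_load_sum)
{
    return remaining_load_sum * gse_weight / TOY_DIVIDER;
}
```

With a high-weight tse (weight 4096, always runnable, so load_avg near 4096) inside a group whose gse weight is only 1024, the naive subtraction drives the root load to -3072, whereas the load_sum path bottoms out at zero.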

NOTE: the above is only the propagation update path for the loads when tasks are added to or removed from a tg's hierarchy. Over time, the loads of gse/grq are updated even if no task is added or removed, since the normal load update functions __update_load_avg_se()/update_cfs_rq_load_avg() do not distinguish tse from gse, or cfs_rq from grq.

7.4. Hierarchy load (h_load)

During load balancing, load needs to be migrated between CPUs to restore balance, which means migrating tasks. However, the load_avg of a task se cannot truly reflect its load contribution to the root cfs_rq (that is, to the CPU), because each se/cfs_rq always computes its load_avg at its own level. For example, the load_avg of a grq is not equal to the sum of the load_avg of all tse attached to it: the sum of the tse runnable-time series necessarily exceeds the grq's own (tasks queued on the grq are runnable while waiting to run).

In order to calculate a task's load contribution to the CPU, the concept of hierarchy load (h_load) is introduced for each cfs_rq. For the top-level cfs_rq, its h_load equals its load_avg. Descending the hierarchy, the h_load of a cfs_rq is defined as:

h_load of the lower-level cfs_rq = h_load of the upper-level cfs_rq x load_avg of the gse / load_avg of the upper-level cfs_rq

The h_load of a bottom-level tse is then:

h_load of tse = h_load of grq x load_avg of tse / load_avg of grq
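The two formulas above can be walked through with a toy hierarchy. This sketch (toy_* names, hypothetical values) mirrors the definitions, not the kernel's update_cfs_rq_h_load()/task_h_load(), which walk the se parent chain and guard the denominator with +1:

```c
#include <assert.h>

/* One level of the hierarchy: the gse's load_avg and the load_avg of
 * the cfs_rq the gse is queued on (its parent). */
struct toy_level { long gse_load_avg; long parent_load_avg; };

/* h_load of a cfs_rq: start from the root cfs_rq's load_avg and scale
 * by gse_load_avg / parent_load_avg at each descending level. */
static long toy_cfs_rq_h_load(long root_load_avg,
                              const struct toy_level *levels, int n)
{
    long h_load = root_load_avg;
    for (int i = 0; i < n; i++)
        h_load = h_load * levels[i].gse_load_avg
                        / levels[i].parent_load_avg;
    return h_load;
}

/* h_load of a task: its grq's h_load scaled by the task's share of
 * the grq's load_avg. */
static long toy_task_h_load(long grq_h_load, long tse_load_avg,
                            long grq_load_avg)
{
    return grq_h_load * tse_load_avg / grq_load_avg;
}
```

For instance, with a root load_avg of 2048, a single group whose gse contributes half of it, and a task holding a quarter of that grq's load_avg, the task's contribution to the CPU comes out to 256 rather than its level-local load_avg of 512.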

The functions that obtain and update the h_load of a task are as follows:

3e31215e0b38d5d0a3db56e0dc15ae14.png

The function that updates the h_load of a grq is as follows:

7ead88e80f03e2694358006984e02fa9.png

Call path:

526c2f62de17408d5dee38a63aa1ac0e.png

As can be seen, h_load is mainly used by the wake_affine_weight mechanism and by load-balancing logic. For example, when the migration type of a load-balancing pass is "load", task_h_load() is used to determine how much load_avg must be migrated to reach balance; see detach_tasks().

8. Summary

This article has introduced the motivation for the CFS group scheduling feature, its configuration, and some implementation details. The feature "soft-limits" (in contrast to CFS bandwidth control) the share of CPU resources each group's tasks can use under high load, achieving fair use of the CPU among groups. Older native Android code restricted the background group severely (background/cpu.shares was even set to 52), tilting CPU resources heavily toward the foreground group; yet with that configuration, foreground tasks could in some scenarios still be stalled by background tasks. For a more universal configuration, the latest Android versions set cpu.shares of every group to 1024, pursuing fairness of CPU resources among the groups.

9. Reference

1. Kernel source code (https://www.kernel.org/) and Android source code (https://source.android.com/docs/setup/download/downloading)

2. Kernel documentation Documentation/admin-guide/cgroup-v1

3. CFS scheduler (3) - group scheduling: http://www.wowotech.net/process_management/449.html

4. Analysis of PELT algorithm: http://www.wowotech.net/process_management/pelt.html


Origin: blog.csdn.net/feelabclihu/article/details/128586905