CFS scheduling: main code analysis (part one)

The previous articles covered the principles of CFS scheduling and its main data structures. This article begins the code analysis. As usual, we follow only the main trunk of the code and leave the fine details aside, and we organize the walk-through around one question: how does a process come to be scheduled?

Before analyzing the main path, a few small helper functions need to be covered first. As the saying goes, tall buildings rise from the ground up, and these small functions are the foundation for everything that follows.

calc_delta_fair

The calc_delta_fair function calculates the virtual runtime (vruntime) of a process. As covered in the earlier article on CFS principles, the vruntime of a process is calculated as follows:

vruntime = delta_exec * NICE_0_LOAD / weight

That is, the vruntime of a process equals its actual running time multiplied by the weight of a nice-0 process and divided by the weight of the process itself. To keep floating-point operations out of the division, the Linux kernel first shifts left by 32 bits and later shifts right by 32 bits. The modified formula is:

vruntime = (delta_exec * NICE_0_LOAD * 2^32 / weight) >> 32

which is then rewritten as:

vruntime = (delta_exec * NICE_0_LOAD * inv_weight) >> 32, where inv_weight = 2^32 / weight

The inv_weight values are precalculated in the kernel source, so obtaining inv_weight only requires a table lookup:

/*
 * Inverse (2^32/x) values of the sched_prio_to_weight[] array, precalculated.
 *
 * In cases where the weight does not change often, we can use the
 * precalculated inverse to speed up arithmetics by turning divisions
 * into multiplications:
 */
const u32 sched_prio_to_wmult[40] = {
 /* -20 */     48388,     59856,     76040,     92818,    118348,
 /* -15 */    147320,    184698,    229616,    287308,    360437,
 /* -10 */    449829,    563644,    704093,    875809,   1099582,
 /*  -5 */   1376151,   1717300,   2157191,   2708050,   3363326,
 /*   0 */   4194304,   5237765,   6557202,   8165337,  10153587,
 /*   5 */  12820798,  15790321,  19976592,  24970740,  31350126,
 /*  10 */  39045157,  49367440,  61356676,  76695844,  95443717,
 /*  15 */ 119304647, 148102320, 186737708, 238609294, 286331153,
};

With this method the vruntime of a process is easy to compute. Now that the calculation is clear, let's look at the code:

static inline u64 calc_delta_fair(u64 delta, struct sched_entity *se)
{
	if (unlikely(se->load.weight != NICE_0_LOAD))
		delta = __calc_delta(delta, NICE_0_LOAD, &se->load);

	return delta;
}

If the weight of the scheduling entity equals NICE_0_LOAD, the actual running time is returned unchanged, because for a nice-0 process virtual time equals physical time. Otherwise, __calc_delta is called to compute the vruntime of the process:

static u64 __calc_delta(u64 delta_exec, unsigned long weight, struct load_weight *lw)
{
	u64 fact = scale_load_down(weight);
	int shift = WMULT_SHIFT;

	__update_inv_weight(lw);

	if (unlikely(fact >> 32)) {
		while (fact >> 32) {
			fact >>= 1;
			shift--;
		}
	}

	/* hint to use a 32x32->64 mul */
	fact = (u64)(u32)fact * lw->inv_weight;

	while (fact >> 32) {
		fact >>= 1;
		shift--;
	}

	return mul_u64_u32_shr(delta_exec, fact, shift);
}

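__calc_delta applies the formula above to produce the virtual running time of a process. The extra shift loops only keep the intermediate factor within 32 bits. We will not walk through the code line by line, since it is not needed for the rest of the analysis; interested readers can study it on their own.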
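
To make the inverse-weight trick concrete, here is a minimal user-space sketch of the last formula. The helper name vruntime_delta is hypothetical and not part of the kernel; the weights 1024, 335 and 3121 are the nice 0, nice 5 and nice -5 entries of sched_prio_to_weight[], paired with their inverses from the table above. Unlike __calc_delta, the sketch does no overflow handling, so it assumes the product fits in 64 bits.

```c
#include <assert.h>
#include <stdint.h>

#define NICE_0_LOAD  1024u   /* nice-0 entry of sched_prio_to_weight[] */
#define WMULT_SHIFT  32

/*
 * Hypothetical helper, not kernel code: weighted runtime via the
 * precalculated inverse. Assumes delta_ns * NICE_0_LOAD * inv_weight
 * does not overflow 64 bits (the kernel guards against this, we do not).
 */
static uint64_t vruntime_delta(uint64_t delta_ns, uint32_t weight,
			       uint32_t inv_weight)
{
	assert(weight != 0);

	if (weight == NICE_0_LOAD)	/* nice 0: virtual time == physical time */
		return delta_ns;

	/* multiply by 2^32 / weight, then shift right by 32: no division */
	return (delta_ns * NICE_0_LOAD * inv_weight) >> WMULT_SHIFT;
}
```

Under these assumptions, 1 ms (1,000,000 ns) of real time advances vruntime by roughly 3.06 ms for a nice 5 task (vruntime_delta(1000000, 335, 12820798)) and by roughly 0.33 ms for a nice -5 task (vruntime_delta(1000000, 3121, 1376151)), so the lower-priority task moves to the right of the tree much faster.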

sched_slice

This function calculates how much running time a scheduling entity is allotted within one scheduling period:

static u64 sched_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
	u64 slice = __sched_period(cfs_rq->nr_running + !se->on_rq);

	for_each_sched_entity(se) {
		struct load_weight *load;
		struct load_weight lw;

		cfs_rq = cfs_rq_of(se);
		load = &cfs_rq->load;

		if (unlikely(!se->on_rq)) {
			lw = cfs_rq->load;

			update_load_add(&lw, se->load.weight);
			load = &lw;
		}
		slice = __calc_delta(slice, se->load.weight, load);
	}
	return slice;
}

The __sched_period function calculates the scheduling period. When the number of runnable processes does not exceed 8, the scheduling period equals the scheduling latency, which is 6 ms by default. Otherwise, the scheduling period equals the number of processes multiplied by 0.75 ms, which guarantees each process at least 0.75 ms of running time and keeps context switches from happening too frequently.

Next, for_each_sched_entity walks up from the scheduling entity. If the entity is not part of a scheduling group, the loop body runs only once. It gets the CFS run queue cfs_rq that the entity belongs to and takes the weight of that run queue from cfs_rq->load. If the entity is not yet on the queue, its weight is first added to a local copy of that load. Finally, __calc_delta computes the actual running time allotted to this process.

__calc_delta was introduced above for calculating virtual time. It is not limited to that: here it calculates the share of the total scheduling period that a process receives. The formula is:

process running time = (scheduling period * process weight) / total weight of the CFS run queue
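
The period and slice arithmetic can be sketched in user space as follows. The function names and constants here are illustrative assumptions: the constants mirror the default values quoted above (6 ms latency, 0.75 ms minimum granularity, threshold of 8 tasks), and plain division stands in for __calc_delta. Group scheduling is ignored.

```c
#include <assert.h>
#include <stdint.h>

#define SCHED_LATENCY_NS    6000000ULL	/* 6 ms, default scheduling latency */
#define MIN_GRANULARITY_NS   750000ULL	/* 0.75 ms, default minimum slice   */
#define SCHED_NR_LATENCY          8u	/* latency / minimum granularity    */

/* Scheduling period for nr_running runnable tasks (hypothetical helper). */
static uint64_t sched_period_ns(unsigned int nr_running)
{
	if (nr_running > SCHED_NR_LATENCY)
		return (uint64_t)nr_running * MIN_GRANULARITY_NS;
	return SCHED_LATENCY_NS;
}

/* Share of the period for one task: period * weight / total queue weight. */
static uint64_t slice_ns(unsigned int nr_running, uint64_t weight,
			 uint64_t total_weight)
{
	assert(total_weight != 0 && weight <= total_weight);
	return sched_period_ns(nr_running) * weight / total_weight;
}
```

With two nice-0 tasks (weight 1024 each, total 2048), slice_ns(2, 1024, 2048) gives each task 3 ms of the 6 ms period. With 10 tasks the period stretches to 7.5 ms so that no slice drops below 0.75 ms.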

place_entity

This function adjusts the vruntime of a scheduling entity when it is placed on a run queue: a newly created process is penalized, and a process waking up from sleep is compensated.

static void
place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
{
	u64 vruntime = cfs_rq->min_vruntime;

	/*
	 * The 'current' period is already promised to the current tasks,
	 * however the extra weight of the new task will slow them down a
	 * little, place the new task so that it fits in the slot that
	 * stays open at the end.
	 */
	if (initial && sched_feat(START_DEBIT))
		vruntime += sched_vslice(cfs_rq, se);

	/* sleeps up to a single latency don't count. */
	if (!initial) {
		unsigned long thresh = sysctl_sched_latency;

		/*
		 * Halve their sleep time's effect, to allow
		 * for a gentler effect of sleepers:
		 */
		if (sched_feat(GENTLE_FAIR_SLEEPERS))
			thresh >>= 1;

		vruntime -= thresh;
	}

	/* ensure we never gain time by being placed backwards. */
	se->vruntime = max_vruntime(se->vruntime, vruntime);
}
  • Start from the min_vruntime value of the current CFS run queue
  • If the parameter initial is true, the entity is a newly created process. With START_DEBIT enabled, one virtual slice (sched_vslice) is added to its vruntime. This is a penalty for the new process: without it, the new process would start with a small vruntime and could occupy the CPU at the expense of existing tasks
  • If initial is not true, the entity is a process being woken up, and it is compensated instead. The compensation is at most one scheduling latency, or half of that when GENTLE_FAIR_SLEEPERS is enabled
  • max_vruntime keeps the larger of the entity's previous vruntime and the newly computed value, so the vruntime of a scheduling entity never goes backwards
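
The placement rules can be condensed into a small user-space sketch. This is a simplified model rather than the kernel function: place_vruntime is a hypothetical name, vslice_ns stands in for the result of sched_vslice(), latency_ns stands in for sysctl_sched_latency, and both START_DEBIT and GENTLE_FAIR_SLEEPERS are assumed to be enabled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Wrap-safe maximum of two vruntime values, using the signed difference. */
static uint64_t max_vruntime(uint64_t max_v, uint64_t v)
{
	int64_t delta = (int64_t)(v - max_v);

	if (delta > 0)
		max_v = v;
	return max_v;
}

/* Hypothetical, simplified model of the placement decision. */
static uint64_t place_vruntime(uint64_t min_vruntime, uint64_t se_vruntime,
			       bool initial, uint64_t vslice_ns,
			       uint64_t latency_ns)
{
	uint64_t vruntime = min_vruntime;

	assert(latency_ns > 0);

	if (initial)
		vruntime += vslice_ns;		/* new task: pushed back by one slice */
	else
		vruntime -= latency_ns >> 1;	/* woken task: credited half a latency */

	/* never gain time by being placed backwards */
	return max_vruntime(se_vruntime, vruntime);
}
```

For example, with min_vruntime at 100 ms, a new task with a 3 ms virtual slice lands at 103 ms. A task that slept for a long time (old vruntime 50 ms) lands at 97 ms, while a task that slept only briefly (old vruntime 99 ms) keeps its 99 ms because its vruntime may not move backwards.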

update_curr

The update_curr function updates the runtime accounting of the currently running process:

static void update_curr(struct cfs_rq *cfs_rq)
{
	struct sched_entity *curr = cfs_rq->curr;
	u64 now = rq_clock_task(rq_of(cfs_rq));
	u64 delta_exec;

	if (unlikely(!curr))
		return;

	delta_exec = now - curr->exec_start;
	if (unlikely((s64)delta_exec <= 0))
		return;

	curr->exec_start = now;

	schedstat_set(curr->statistics.exec_max,
		      max(delta_exec, curr->statistics.exec_max));

	curr->sum_exec_runtime += delta_exec;
	schedstat_add(cfs_rq->exec_clock, delta_exec);

	curr->vruntime += calc_delta_fair(delta_exec, curr);
	update_min_vruntime(cfs_rq);


	account_cfs_rq_runtime(cfs_rq, delta_exec);
}
  • delta_exec = now - curr->exec_start; computes the actual (physical) time the current process has run since the last update
  • curr->exec_start = now; updates the value of exec_start
  • curr->sum_exec_runtime += delta_exec; updates the total execution time of the current process
  • calc_delta_fair converts delta_exec into virtual time, which is added to the vruntime of the current process
  • update_min_vruntime updates the min_vruntime value of the CFS run queue
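
The accounting steps above can be modelled in a few lines of user-space C. The struct and function names are hypothetical stand-ins, not kernel definitions, and plain division replaces the inverse-weight multiplication used by calc_delta_fair. Statistics, min_vruntime tracking and bandwidth accounting are left out.

```c
#include <assert.h>
#include <stdint.h>

#define NICE_0_WEIGHT 1024ULL

/* Hypothetical, pared-down stand-in for struct sched_entity. */
struct entity {
	uint64_t weight;
	uint64_t exec_start;		/* timestamp of the last accounting update */
	uint64_t sum_exec_runtime;	/* total physical runtime */
	uint64_t vruntime;		/* total weighted (virtual) runtime */
};

/* Charge the time since the last update; returns the physical delta. */
static uint64_t account_runtime(struct entity *curr, uint64_t now)
{
	int64_t delta_exec = (int64_t)(now - curr->exec_start);

	assert(curr->weight != 0);
	if (delta_exec <= 0)
		return 0;

	curr->exec_start = now;
	curr->sum_exec_runtime += (uint64_t)delta_exec;
	/* plain division here; the kernel multiplies by the inverse weight */
	curr->vruntime += (uint64_t)delta_exec * NICE_0_WEIGHT / curr->weight;
	return (uint64_t)delta_exec;
}
```

An entity with weight 2048 that runs for 4000 ns is charged 4000 ns of physical time but only 2000 ns of virtual time, which is exactly how a heavier task ends up receiving more CPU.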

New process creation

Following the creation of a new process shows how the CFS scheduler sets up a newly created task. When a new process is created in fork, the kernel calls into the sched module through sched_fork, which is the part we focus on here.

int sched_fork(unsigned long clone_flags, struct task_struct *p)
{
	unsigned long flags;

	__sched_fork(clone_flags, p);
	/*
	 * We mark the process as NEW here. This guarantees that
	 * nobody will actually run it, and a signal or other external
	 * event cannot wake it up and insert it on the runqueue either.
	 */
	p->state = TASK_NEW;

	/*
	 * Make sure we do not leak PI boosting priority to the child.
	 */
	p->prio = current->normal_prio;


	if (dl_prio(p->prio))
		return -EAGAIN;
	else if (rt_prio(p->prio))
		p->sched_class = &rt_sched_class;
	else
		p->sched_class = &fair_sched_class;

	init_entity_runnable_average(&p->se);

	raw_spin_lock_irqsave(&p->pi_lock, flags);
	/*
	 * We're setting the CPU for the first time, we don't migrate,
	 * so use __set_task_cpu().
	 */
	__set_task_cpu(p, smp_processor_id());
	if (p->sched_class->task_fork)
		p->sched_class->task_fork(p);
	raw_spin_unlock_irqrestore(&p->pi_lock, flags);

	init_task_preempt_count(p);
	return 0;
}
  • __sched_fork mainly initializes the scheduling entity of the new process. Nothing is inherited from the parent here: the child starts running from scratch, so these values are assigned afresh
  • The state of the process is set to TASK_NEW, which marks it as a new process that cannot yet be run or woken up
  • The dynamic priority of the newly created process is taken from the current process: p->prio = current->normal_prio
  • The scheduling class is set according to the priority of the process: rt_sched_class for an RT process, fair_sched_class for a normal process. If the priority is a deadline priority, sched_fork returns -EAGAIN
  • The CPU of the new process is set to the current CPU. This is only a preliminary setting; the CPU is set again when the process joins a run queue of the scheduler
  • Finally, the task_fork function pointer of the scheduling class is called. For fair_sched_class this is task_fork_fair
static void task_fork_fair(struct task_struct *p)
{
	struct cfs_rq *cfs_rq;
	struct sched_entity *se = &p->se, *curr;
	struct rq *rq = this_rq();
	struct rq_flags rf;

	rq_lock(rq, &rf);
	update_rq_clock(rq);

	cfs_rq = task_cfs_rq(current);
	curr = cfs_rq->curr;
	if (curr) {
		update_curr(cfs_rq);
		se->vruntime = curr->vruntime;
	}
	place_entity(cfs_rq, se, 1);

	if (sysctl_sched_child_runs_first && curr && entity_before(curr, se)) {
		/*
		 * Upon rescheduling, sched_class::put_prev_task() will place
		 * 'current' within the tree based on its new key value.
		 */
		swap(curr->vruntime, se->vruntime);
		resched_curr(rq);
	}

	se->vruntime -= cfs_rq->min_vruntime;
	rq_unlock(rq, &rf);
}
  • The current CFS run queue is obtained through current, and the running scheduling entity through the curr pointer of that queue. update_curr then brings the running time of the current scheduling entity up to date, and its vruntime is copied to the vruntime of the newly created process.
  • place_entity penalizes the newly created process: because the third parameter is 1, the entity is treated as a new process.
  • se->vruntime -= cfs_rq->min_vruntime; subtracts min_vruntime from the virtual runtime of the new scheduling entity, leaving a relative value. min_vruntime may have changed by the time the process is actually added to a run queue, so it is added back at that point against the then-current min_vruntime, which keeps the placement fair.
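
One way to read the last step is as a change of reference point: the vruntime is stored relative to the queue it leaves and made absolute again on the queue it joins. The sketch below illustrates this under that assumption. All three function names are hypothetical, and the values are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Make a vruntime relative to the min_vruntime of the queue it leaves. */
static uint64_t vruntime_to_relative(uint64_t vruntime, uint64_t src_min)
{
	return vruntime - src_min;
}

/* Make it absolute again against the min_vruntime of the queue it joins. */
static uint64_t vruntime_to_absolute(uint64_t rel, uint64_t dst_min)
{
	return rel + dst_min;
}

/* Both steps together: the lead over min_vruntime is what gets preserved. */
static uint64_t renormalize(uint64_t vruntime, uint64_t src_min,
			    uint64_t dst_min)
{
	uint64_t rel = vruntime_to_relative(vruntime, src_min);
	uint64_t absolute = vruntime_to_absolute(rel, dst_min);

	assert(absolute - dst_min == vruntime - src_min);
	return absolute;
}
```

If the child was placed 3 ms ahead of a min_vruntime of 1000 ms at fork time and is later enqueued on a queue whose min_vruntime is 500 ms, renormalize(1003000000, 1000000000, 500000000) yields 503 ms: the child is still exactly 3 ms behind the leftmost position, regardless of how far the two timelines have drifted apart.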

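The above is the flow for a newly created process. In summary:

sched_fork -> __sched_fork -> set TASK_NEW, priority and scheduling class -> __set_task_cpu -> task_fork_fair -> update_curr -> place_entity -> se->vruntime -= cfs_rq->min_vruntime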

Add the new process to the ready queue

After fork completes, the wake_up_new_task function wakes up the newly created process and adds it to the ready queue:

void wake_up_new_task(struct task_struct *p)
{
	struct rq_flags rf;
	struct rq *rq;

	raw_spin_lock_irqsave(&p->pi_lock, rf.flags);
	p->state = TASK_RUNNING;
#ifdef CONFIG_SMP
	/*
	 * Fork balancing, do it here and not earlier because:
	 *  - cpus_allowed can change in the fork path
	 *  - any previously selected CPU might disappear through hotplug
	 *
	 * Use __set_task_cpu() to avoid calling sched_class::migrate_task_rq,
	 * as we're not fully set-up yet.
	 */
	p->recent_used_cpu = task_cpu(p);
	__set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0));
#endif
	rq = __task_rq_lock(p, &rf);
	update_rq_clock(rq);
	post_init_entity_util_avg(&p->se);

	activate_task(rq, p, ENQUEUE_NOCLOCK);
	p->on_rq = TASK_ON_RQ_QUEUED;
	trace_sched_wakeup_new(p);
	check_preempt_curr(rq, p, WF_FORK);
	task_rq_unlock(rq, p, &rf);
}
  • The state of the process is set to TASK_RUNNING, which means the process is now ready to run
  • If CONFIG_SMP is enabled, select_task_rq picks a suitable CPU for the new process, and __set_task_cpu assigns the process to it
  • Finally, activate_task(rq, p, ENQUEUE_NOCLOCK) adds the newly created process to the ready queue
void activate_task(struct rq *rq, struct task_struct *p, int flags)
{
	if (task_contributes_to_load(p))
		rq->nr_uninterruptible--;

	enqueue_task(rq, p, flags);
}

static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
{
	if (!(flags & ENQUEUE_NOCLOCK))
		update_rq_clock(rq);

	if (!(flags & ENQUEUE_RESTORE)) {
		sched_info_queued(rq, p);
		psi_enqueue(p, flags & ENQUEUE_WAKEUP);
	}

	p->sched_class->enqueue_task(rq, p, flags);
}

This eventually calls the enqueue_task function pointer of the scheduling class, which for CFS is enqueue_task_fair.

static void
enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
{
	struct cfs_rq *cfs_rq;
	struct sched_entity *se = &p->se;

	for_each_sched_entity(se) {
		if (se->on_rq)
			break;
		cfs_rq = cfs_rq_of(se);
		enqueue_entity(cfs_rq, se, flags);

		/*
		 * end evaluation on encountering a throttled cfs_rq
		 *
		 * note: in the case of encountering a throttled cfs_rq we will
		 * post the final h_nr_running increment below.
		 */
		if (cfs_rq_throttled(cfs_rq))
			break;
		cfs_rq->h_nr_running++;

		flags = ENQUEUE_WAKEUP;
	}

}
  • If on_rq of the scheduling entity is already set, the entity is already on the ready queue and the loop exits immediately
  • The enqueue_entity function puts the scheduling entity on the queue
  • h_nr_running, the number of runnable tasks on the CFS run queue, is incremented
static void
enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
	bool renorm = !(flags & ENQUEUE_WAKEUP) || (flags & ENQUEUE_MIGRATED);
	bool curr = cfs_rq->curr == se;

	/*
	 * If we're the current task, we must renormalise before calling
	 * update_curr().
	 */
	if (renorm && curr)
		se->vruntime += cfs_rq->min_vruntime;

	update_curr(cfs_rq);

	/*
	 * Otherwise, renormalise after, such that we're placed at the current
	 * moment in time, instead of some random moment in the past. Being
	 * placed in the past could significantly boost this task to the
	 * fairness detriment of existing tasks.
	 */
	if (renorm && !curr)
		se->vruntime += cfs_rq->min_vruntime;

	/*
	 * When enqueuing a sched_entity, we must:
	 *   - Update loads to have both entity and cfs_rq synced with now.
	 *   - Add its load to cfs_rq->runnable_avg
	 *   - For group_entity, update its weight to reflect the new share of
	 *     its group cfs_rq
	 *   - Add its new weight to cfs_rq->load.weight
	 */
	update_load_avg(cfs_rq, se, UPDATE_TG | DO_ATTACH);
	update_cfs_group(se);
	enqueue_runnable_load_avg(cfs_rq, se);
	account_entity_enqueue(cfs_rq, se);

	if (flags & ENQUEUE_WAKEUP)
		place_entity(cfs_rq, se, 0);

	check_schedstat_required();
	update_stats_enqueue(cfs_rq, se, flags);
	check_spread(cfs_rq, se);
	if (!curr)
		__enqueue_entity(cfs_rq, se);
	se->on_rq = 1;

	if (cfs_rq->nr_running == 1) {
		list_add_leaf_cfs_rq(cfs_rq);
		check_enqueue_throttle(cfs_rq);
	}
}
  • se->vruntime += cfs_rq->min_vruntime; adds min_vruntime back to the virtual time of the scheduling entity. It was subtracted in fork earlier; adding it back now uses the current, more accurate min_vruntime
  • update_curr(cfs_rq); updates the running time of the current scheduling entity and the min_vruntime of the CFS run queue
  • As the comment explains, when a scheduling entity is added to the ready queue, the load of the run queue and the load of the scheduling entity both need to be updated
  • If ENQUEUE_WAKEUP is set, the process is being woken up, and place_entity gives it some compensation
  • __enqueue_entity inserts the scheduling entity into the CFS red-black tree
  • se->on_rq = 1; marks the entity as having been added to the run queue

Select the next running process

Once a process has been created by fork and added to the red-black tree of the CFS run queue, the scheduler has to select it to run. We go straight to the schedule function. The code below is simplified to keep the main trunk readable.

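static void __sched notrace __schedule(bool preempt)
{
	struct task_struct *prev, *next;
	struct rq_flags rf;
	struct rq *rq;
	int cpu;

	cpu = smp_processor_id();             // get the current CPU
	rq = cpu_rq(cpu);                     // get this CPU's struct rq, a per-CPU variable
	prev = rq->curr;                      // get the currently running process through the curr pointer

	next = pick_next_task(rq, prev, &rf); // select the next process through the pick_next_task callback

	if (likely(prev != next)) {
		// the next process differs from the current one, so a switch takes place
		rq = context_switch(rq, prev, next, &rf);
	}
}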
  • pick_next_task selects the next process to run
  • context_switch performs the context switch
static inline struct task_struct *
pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
{
	const struct sched_class *class;
	struct task_struct *p;

	if (likely((prev->sched_class == &idle_sched_class ||
		    prev->sched_class == &fair_sched_class) &&
		   rq->nr_running == rq->cfs.h_nr_running)) {

		p = fair_sched_class.pick_next_task(rq, prev, rf);
		if (unlikely(p == RETRY_TASK))
			goto again;

		/* Assumes fair_sched_class->next == idle_sched_class */
		if (unlikely(!p))
			p = idle_sched_class.pick_next_task(rq, prev, rf);

		return p;
	}

again:
	for_each_class(class) {
		p = class->pick_next_task(rq, prev, rf);
		if (p) {
			if (unlikely(p == RETRY_TASK))
				goto again;
			return p;
		}
	}

	/* The idle class should always have a runnable task: */
	BUG();
}

pick_next_task has two main steps:

  • Most processes in a system are normal processes, so an optimization comes first: if the previous process belongs to the fair or idle scheduling class, and the number of runnable processes on the run queue equals the number of runnable processes on the CFS run queue, then every runnable process is a normal process, and the pick_next_task callback of fair_sched_class is called directly
  • Otherwise, execution reaches again, which walks the scheduling classes in order from highest to lowest priority and calls the pick_next_task function pointer of each class until one returns a process
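
The two steps can be modelled with per-class task counters. This is a deliberately reduced sketch: struct rq_sketch, the CLASS_* enum and pick_class are hypothetical names, the class order (stop, deadline, RT, fair, idle) is the priority order the kernel iterates in, and the function returns the class that would supply the next task rather than a task.

```c
#include <assert.h>

enum { CLASS_STOP, CLASS_DL, CLASS_RT, CLASS_FAIR, CLASS_IDLE, NR_CLASSES };

/* Hypothetical model of a run queue: runnable task count per class. */
struct rq_sketch {
	unsigned int nr[NR_CLASSES];
};

/* Returns the class that would supply the next task. */
static int pick_class(const struct rq_sketch *rq, int prev_class)
{
	unsigned int total = 0;

	assert(prev_class >= 0 && prev_class < NR_CLASSES);

	for (int c = CLASS_STOP; c < CLASS_IDLE; c++)
		total += rq->nr[c];

	/* Fast path: prev was fair or idle and only fair tasks are runnable. */
	if ((prev_class == CLASS_FAIR || prev_class == CLASS_IDLE) &&
	    total == rq->nr[CLASS_FAIR])
		return rq->nr[CLASS_FAIR] ? CLASS_FAIR : CLASS_IDLE;

	/* Slow path: ask every class, highest priority first. */
	for (int c = CLASS_STOP; c < CLASS_IDLE; c++)
		if (rq->nr[c])
			return c;

	return CLASS_IDLE;	/* the idle class always has a task */
}
```

With three fair tasks and nothing else, the fast path answers CLASS_FAIR without touching the other classes. As soon as one RT task is runnable, the counts no longer match and the slow path returns CLASS_RT.
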
static struct task_struct *
pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
{
	struct cfs_rq *cfs_rq = &rq->cfs;
	struct sched_entity *se;
	struct task_struct *p;
	int new_tasks;

again:
	if (!cfs_rq->nr_running)
		goto idle;

	put_prev_task(rq, prev);

	do {
		se = pick_next_entity(cfs_rq, NULL);
		set_next_entity(cfs_rq, se);
		cfs_rq = group_cfs_rq(se);
	} while (cfs_rq);

	/* ... remainder of the function omitted ... */
  • If the CFS run queue has no runnable process left, control jumps to idle and the idle process ends up being chosen
  • pick_next_entity obtains a scheduling entity, normally the leftmost node of the CFS red-black tree
  • set_next_entity stores the chosen scheduling entity in the curr pointer of the CFS run queue
  • context_switch later performs the actual switch, which is covered in a following article
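
Choosing the leftmost node amounts to choosing the entity with the smallest vruntime. The sketch below shows that selection with a linear scan standing in for the red-black tree; struct entity_sketch and pick_leftmost are hypothetical names, and the comparison follows the signed-difference style of the kernel's entity_before(), which stays correct when vruntime wraps around.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a scheduling entity. */
struct entity_sketch {
	const char *name;
	uint64_t vruntime;
};

/* Wrap-safe "a should run before b". */
static int entity_before(const struct entity_sketch *a,
			 const struct entity_sketch *b)
{
	return (int64_t)(a->vruntime - b->vruntime) < 0;
}

/* Linear scan standing in for the leftmost lookup in the red-black tree. */
static const struct entity_sketch *pick_leftmost(const struct entity_sketch *q,
						 size_t n)
{
	assert(n > 0);

	const struct entity_sketch *best = &q[0];

	for (size_t i = 1; i < n; i++)
		if (entity_before(&q[i], best))
			best = &q[i];
	return best;
}
```

Given entities with vruntime 300, 100 and 200, the one at 100 is picked. The signed comparison also treats a vruntime just below the 64-bit wrap point as earlier than a small value just past it, which a plain unsigned comparison would get wrong.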

 


Origin blog.csdn.net/longwang155069/article/details/104578189