[Reading notes] Linux Kernel Development - process scheduling

Process scheduling is the subtle and interesting way in which the kernel decides which process runs, when, and in what order.

To the operating system, a process is the running form of a program.

The process scheduler (scheduler for short) is the kernel subsystem that divides the finite resource of processor time among runnable processes.

The scheduler's principle for making maximum use of processor time is: as long as there are runnable processes, some process should always be executing.

The scheduler's basic job is to select one process from the set of runnable processes and put it into execution.

1. Multitasking - preemption, time slices and yielding

A multitasking operating system is one that can concurrently and interactively execute multiple processes.
Whether on a single-processor or multiprocessor machine, a multitasking operating system can keep many processes blocked or sleeping, that is, not actually executing, until their work is actually ready.

Multitasking systems fall into two categories: non-preemptive (cooperative) multitasking and preemptive multitasking.

Q: What is preemption?
A: The scheduler decides when to stop a running process so that other processes get a chance to execute. This forced suspension is called preemption.

Q: What is a time slice (timeslice)?
A: The time a process may run before it is preempted is set in advance; this amount of time is the process's time slice.

Q: What is yielding?
A: A process voluntarily suspending its own execution is called yielding.

The advantages and disadvantages of the two types are self-evident. In non-preemptive multitasking, other processes get a chance to run only when the current process voluntarily yields, so a hung process that never yields can bring down the whole system.

2. Linux process scheduling - the O(1) scheduler

3. Policy - deciding when the scheduler runs, and which process

3.1 I/O-bound and processor-bound processes

Processes can be classified as I/O-bound or processor-bound.
An I/O-bound process spends most of its time submitting I/O requests or waiting for them to complete.
A processor-bound process spends most of its time executing code; unless preempted, such processes typically run nonstop.

ps: A process can exhibit both behaviors at once, i.e. be both I/O-bound and processor-bound.

Scheduling policies usually seek a balance between two contradictory goals: fast process response (short response time) and maximal system utilization (high throughput).

3.2 Process priority

The most basic type of scheduling algorithm is priority-based scheduling.
The scheduler always selects the highest-priority process whose time slice has not been exhausted.

Linux uses two separate priority ranges: the nice value and the real-time priority.

Type: nice value
  1. The range is from -20 to +19; the default is 0;
  2. A larger nice value means lower priority;
  3. Related command: ps -el, where the NI column shows each process's nice value.

Type: real-time priority
  1. The value is configurable;
  2. The default range is 0 to 99 (inclusive);
  3. A higher real-time priority value means higher process priority;
  4. Any real-time process has higher priority than any ordinary process (real-time priorities and nice values occupy two disjoint ranges);
  5. Related command, whose RTPRIO column shows the real-time priority:

ps -eo state,uid,pid,ppid,rtprio,time,comm


3.3 Time slice

A time slice is a numeric value that indicates how long a process can continue to run before it is preempted.

Experience shows that overly long time slices result in poor interactive performance.

Linux's CFS (Completely Fair Scheduler) does not allocate time slices to processes directly; instead it allocates each process a proportion of the processor.
The processor time a process receives is therefore closely tied to the system load. The proportion is further affected by the process's nice value, which acts as a weight that adjusts the process's share of processor time.

Linux is preemptive. When a process enters the runnable state, it becomes eligible to run. Under CFS, the timing of preemption depends on how much of the processor the newly runnable process has consumed: if it has consumed a smaller share than the currently running process, it runs immediately, preempting the current process; otherwise its execution is deferred.

3.4 Scheduling policy in action - the original book's walkthrough of "policy" in this section is well worth reading

4. Linux scheduling algorithm

ps: jump to chapter 11 temporarily

4.1 Scheduler classes - a modular structure

The Linux scheduler is implemented in a modular way, so that different types of processes can choose the scheduling algorithms that suit them.

The scheduler class mechanism allows multiple, dynamically addable scheduling algorithms to coexist, each scheduling the processes in its own category.

Each scheduler class has a priority; the base scheduler code is defined in kernel/sched.c.

Completely fair scheduling (CFS) is the scheduler class for ordinary processes, called SCHED_NORMAL in Linux (SCHED_OTHER in POSIX); the CFS algorithm is implemented in kernel/sched_fair.c.

4.2 Process scheduling in Unix systems

4.3 Fair scheduling-CFS

CFS starts from a simple concept: process scheduling should behave as if the system had an ideal, perfect multitasking processor.

On such a system, each process would receive 1/n of the processor time, where n is the number of runnable processes.

This ideal model is unrealistic, because a single processor cannot truly run multiple processes simultaneously.

Rather than allocating each process a time slice, CFS lets each process run for a while, cycles to the next, and always selects the process that has run the least as the next one to run.

CFS calculates how long a process should run from the total number of runnable processes, instead of deriving a time slice from the nice value.

In CFS, the nice value is used as a weight for the proportion of processor time a process receives: a higher nice value (lower priority) gives the process a lower weight and thus a smaller processor share relative to a default-nice process; conversely, a lower nice value (higher priority) gives a higher weight and a larger share.

Each process then runs for a "time slice" proportional to its weight among all runnable processes. To compute an exact slice, CFS sets a target for its approximation of the infinitely small scheduling period of perfect multitasking, called the "target latency".

CFS also imposes a floor on the time slice each process can receive, called the minimum granularity.

Only relative nice values affect the distribution of processor time; the absolute values do not.

5. Implementation of Linux scheduling - kernel/sched_fair.c

5.1 Time accounting

Every scheduler must keep track of process run time.
When a process's time slice is reduced to 0, it is preempted by a runnable process whose time slice has not yet reached 0.

CFS has no concept of time slices, but it must still keep run-time accounting for each process to ensure that each process runs only within its fair share of processor time.
CFS uses the scheduler entity structure, struct sched_entity defined in <linux/sched.h>, to track this per-process run accounting.

ps: The scheduler entity structure is embedded as a member variable named se in the process descriptor struct task_struct.

The vruntime variable (a member of struct sched_entity) stores the virtual runtime of the process: its actual execution time normalized by the total number of runnable processes.

CFS uses vruntime to record how long a process has run and, by comparison, how much longer it should run.

The update_curr() function, defined in kernel/sched_fair.c, implements this accounting.

5.2 Process selection - the CFS scheduling algorithm: pick the process with the smallest vruntime

CFS uses a red-black tree (rbtree, a self-balancing binary search tree) to organize the queue of runnable processes and to find the process with the smallest vruntime quickly.

  1. Pick the next task;
  2. Add processes to the tree;
  3. Delete the process from the tree.

5.3 The scheduler entry point

The main entry point to process scheduling is the function schedule(), defined in kernel/sched.c.
It is the entry point the rest of the kernel uses to invoke the process scheduler: deciding which process can run and when to put it into execution.

schedule() is generic with respect to scheduler classes: it finds the highest-priority scheduler class that has a runnable queue and asks that class which process should run next.

The schedule() implementation is quite simple: it calls pick_next_task() (also defined in kernel/sched.c).

pick_next_task() checks each scheduler class in order of descending priority and selects the highest-priority process from the highest-priority class that has one.

CFS's implementation of pick_next_task() calls pick_next_entity(), which in turn calls the __pick_next_entity() function described in section 5.2 above.

ps: CFS is the scheduler class for ordinary processes. For details, see the article "Linux System Process Scheduling - Detailed Analysis of Scheduling Architecture".

5.4 Sleep and wake

A sleeping (blocked) process is in a special non-runnable state.

Processes sleep for many reasons, but always while waiting for some event.

The kernel implements sleeping and waking as two complementary operations:
Sleeping: the process marks itself as sleeping, removes itself from the red-black tree of runnable processes, puts itself on a wait queue, and then calls schedule() to select and run another process.
Waking: the reverse of sleeping. The process is set to the runnable state and moved from the wait queue back into the red-black tree of runnable processes.

Wait queues (sleeping):
Sleeping is handled via wait queues.
A wait queue is a simple linked list of processes waiting for certain events to occur.
The kernel represents a wait queue with wait_queue_head_t.
Wait queues can be created statically with DECLARE_WAIT_QUEUE_HEAD() or dynamically with init_waitqueue_head().

A process adds itself to a wait queue by performing the following steps:

  1. Call the macro DEFINE_WAIT() to create a wait queue entry;
  2. Call add_wait_queue() to add itself to the queue. The queue will wake the process when the awaited condition is satisfied; of course, code elsewhere must call wake_up() on the queue when the event actually occurs;
  3. Call prepare_to_wait() to change the process state to TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE. This function also re-adds the process to the wait queue if necessary, which is needed on subsequent iterations of the loop;
  4. If the state is TASK_INTERRUPTIBLE, a signal can wake the process. This is a so-called spurious wakeup (a wakeup not caused by the awaited event), so check for and handle pending signals;
  5. When the process is awakened, it checks again whether the condition is true. If so, it exits the loop; if not, it calls schedule() again and repeats;
  6. Once the condition holds, the process sets itself to TASK_RUNNING and calls finish_wait() to remove itself from the wait queue.

eg:

/* 'q' is the wait queue on which we wish to sleep */
DEFINE_WAIT(wait);

add_wait_queue(q, &wait);
while (!condition) {	/* 'condition' is the event we are waiting for */
	prepare_to_wait(&q, &wait, TASK_INTERRUPTIBLE);
	if (signal_pending(current)) {
		/* handle the signal, e.g.: */
		/* do_something; */
	}
	schedule();
}
finish_wait(&q, &wait);

ps: If you hold any locks when you intend to sleep, remember to release them before calling schedule() and reacquire them afterward, or otherwise respond to other events.

The function inotify_read(), located in fs/notify/inotify/inotify_user.c, copies information read from a notification file descriptor; its implementation is a typical use of wait queues (the n_tty_write() function in the tty serial driver n_tty.c is another typical example).

Waking up:
The wakeup operation is performed by the function wake_up(), which wakes all processes on the specified wait queue.
wake_up() calls try_to_wake_up(), which sets the process to the TASK_RUNNING state and calls enqueue_task() to put the process back into the red-black tree; the need_resched flag is set when appropriate.

ps: The code that causes the awaited event (condition) to occur is usually what calls wake_up() on the corresponding queue.
ps: One thing to note about sleeping is the existence of spurious wakeups: a process is sometimes awakened even though the condition it is waiting for has not been satisfied, which is why a loop is needed to ensure the awaited condition has truly occurred.

6. Preemption and context switching

Context switching, i.e. switching from one runnable process to another, is handled by the context_switch() function defined in kernel/sched.c.
Whenever a new process is selected and ready to run, schedule () will call this function.
The context_switch () function completes the following two basic tasks:

  1. Call switch_mm(), declared in <asm/mmu_context.h>, which switches the virtual memory mapping from the previous process to the new one;
  2. Call switch_to(), declared in <asm/system.h>, which switches the processor state from the previous process to the new one. This includes saving and restoring stack information and register contents, as well as any other architecture-specific state that must be managed and saved per process.

The kernel must know when to call schedule(). It provides a need_resched flag to indicate whether rescheduling should be performed.

The functions used to access and manipulate need_resched are as follows:

Function                      Purpose
set_tsk_need_resched()        Set the need_resched flag in the specified process
clear_tsk_need_resched()      Clear the need_resched flag in the specified process
need_resched()                Test the need_resched flag; returns true if set, otherwise false

Q: When is the need_resched flag set?
A: scheduler_tick() sets the flag when a process should be preempted; try_to_wake_up() sets it when a higher-priority process enters the runnable state. The kernel checks the flag and, finding it set, calls schedule() to switch to a new process. When returning to user space or returning from an interrupt, the kernel also checks need_resched; if it is set, the kernel invokes the scheduler before continuing execution.

The need_resched flag is per-process rather than a global variable because accessing a value in the process descriptor is faster than accessing a global (the current macro is fast, and the descriptor is usually cache-hot). In kernels after 2.6, need_resched was moved into the thread_info structure and is represented by a single bit in a special flags variable.

6.1 User preemption

When the kernel is about to return to user space, if the need_resched flag is set, schedule() is called and user preemption occurs.

User preemption occurs when:

  1. When returning to user space from a system call;
  2. When returning to user space from an interrupt handler.

6.2 Kernel preemption - as long as no lock is held, the kernel can preempt

Linux fully supports kernel preemption; that is, the scheduler may reschedule a task even while it is executing in the kernel.

As long as the scheduling is safe, the kernel can preempt the task being executed at any time.

Q: When is rescheduling safe?
A: As long as no lock is held, the kernel can preempt; held locks mark non-preemptible regions.

Kernel preemption occurs when:

  1. The interrupt handler is executing and before it returns to kernel space;
  2. When the kernel code is once again preemptible;
  3. If the task in the kernel explicitly calls schedule ();
  4. If the task in the kernel is blocked (this will also cause schedule () to be called).

ps: The first change made to support kernel preemption was the introduction of a preempt_count counter in each process's thread_info. The counter starts at 0, is incremented by 1 each time a lock is acquired, and decremented by 1 when a lock is released. Preemption is possible only when the counter is 0.

7. Real-time scheduling policies - SCHED_FIFO and SCHED_RR

Linux provides two real-time scheduling policies: SCHED_FIFO and SCHED_RR.
The ordinary, non-real-time scheduling policy is SCHED_NORMAL.

The real-time policies are managed not by the completely fair scheduler but by a special real-time scheduler, implemented in the file kernel/sched_rt.c.

The real-time scheduling policies in detail:

SCHED_FIFO: implements a simple first-in, first-out scheduling algorithm with no time slices. A runnable SCHED_FIFO process is always scheduled ahead of any SCHED_NORMAL process. Once a SCHED_FIFO process becomes runnable, it keeps executing until it blocks or explicitly yields the processor; not being time-sliced, it can run indefinitely, and only a higher-priority SCHED_FIFO or SCHED_RR task can preempt it. Two or more SCHED_FIFO processes at the same priority run in turn, but each runs until it is willing to give up the processor. As long as a SCHED_FIFO process is executing, lower-priority processes must wait for it to become unrunnable before they get a chance to run. What it implements is static priority: the kernel does not compute dynamic priorities for real-time processes, which guarantees that a real-time process at a given priority can always preempt any process of lower priority.

SCHED_RR: largely identical to SCHED_FIFO, except that a SCHED_RR process may not continue executing once it has exhausted the time allotted to it in advance. In other words, SCHED_RR is SCHED_FIFO with time slices: a real-time round-robin scheduling algorithm. When a SCHED_RR task runs out of its time slice, the other real-time processes at the same priority are scheduled in turn; the time slice is used only to reschedule processes of equal priority. As with SCHED_FIFO, a higher-priority process always preempts a lower-priority one immediately, but a lower-priority process may never preempt a SCHED_RR task, even when that task's time slice is exhausted. It too implements static priority, with the same guarantee as above.

Linux's real-time scheduling algorithms provide soft real-time behavior. Soft real-time means the kernel tries to schedule processes to run before their deadlines, but does not guarantee that it can always meet their requirements.

Real-time priorities range from 0 to MAX_RT_PRIO minus 1 (0 to 99); by default, MAX_RT_PRIO is 100.
SCHED_NORMAL processes' nice values share this priority space: they occupy the range MAX_RT_PRIO to MAX_RT_PRIO + 40, so by default the nice values -20 to +19 map directly onto the priority range 100 to 139.

8. System calls related to scheduling

Linux provides a family of system calls for managing parameters related to the scheduler.
The system calls related to scheduling are shown in the following table

System call                 Description
nice()                      Set the nice value of a process
sched_setscheduler()        Set a process's scheduling policy
sched_getscheduler()        Get a process's scheduling policy
sched_setparam()            Set a process's real-time priority
sched_getparam()            Get a process's real-time priority
sched_get_priority_max()    Get the maximum real-time priority
sched_get_priority_min()    Get the minimum real-time priority
sched_rr_get_interval()     Get a process's time slice value
sched_setaffinity()         Set a process's processor affinity
sched_getaffinity()         Get a process's processor affinity
sched_yield()               Temporarily give up the processor

Origin blog.csdn.net/qq_23327993/article/details/105072421