In-depth understanding of Linux scheduling subsystem - Part 1

CPU

As a computing resource, the CPU has always been a core competitive advantage for cloud vendors. The goal is to schedule computing tasks sensibly: raise CPU utilization, leave headroom for fault tolerance, improve system stability, let tasks finish faster, and cut wasted power consumption, thereby reducing cost and improving market competitiveness.

Abstract logic diagram of CPU implementation

  1. Start with an automatic counter that increments with every cycle of the clock; this serves as the program counter (PC).

  2. The counter feeds a decoder, which is connected to memory built from a large number of D flip-flops.

  3. As the counter increments with the clock, the decoder resolves the counter value into a memory address, and the CPU instruction stored there is read out.

  4. The fetched instruction is latched, under control of the CPU clock, into a register built from D flip-flops: the instruction register.

  5. The instruction register feeds a second decoder. This one is not used for addressing; it parses the instruction into an opcode and its operands.

  6. Once the opcode and operands are available, the corresponding output lines drive the ALU, which performs the arithmetic or logic operation; the result is written back to a register or to memory.

That, roughly, is how the CPU executes a single instruction. To speed up instruction execution, CPUs have accumulated many optimizations over the years: pipelining, branch prediction, superscalar execution, hyper-threading, SIMD, multi-level caches, NUMA, and so on. Here we focus on the Linux scheduling system.


CPU context

Linux is a multitasking operating system: it supports running far more tasks than there are CPUs. Of course, these tasks do not literally run at the same time; the system hands the CPU to each of them in turn for short periods, creating the illusion of simultaneous execution.

Before each task runs, the CPU needs to know where the task was loaded and where it should start executing; that is, the system must set up the CPU registers and the program counter (PC) in advance.

CPU registers are small but extremely fast storage built into the CPU. The program counter holds the address of the instruction being executed, or of the next instruction to execute. Together they form the environment the CPU depends on before running any task, and are therefore called the CPU context (execution environment):

                             

The saved context is stored on a kernel stack and loaded again when the task is next scheduled. The task's original state is preserved, so it appears to run continuously.

Linux has two working modes: kernel space and user space. The operating system runs in kernel space, while user-mode applications run in user space. They represent different privilege levels with different access rights to system resources.

Code (instructions) therefore executes under different CPU contexts, and scheduling requires switching between them. Linux uses different stacks to save these contexts: every process has its own kernel stack, and each CPU prepares two independent stacks for interrupt handling, the hardirq stack and the softirq stack:

Linux system call CPU context switching stack structure:

        

  • Interrupt context : interrupt code runs in kernel space. The interrupt context is the CPU environment needed to run that code: parameters passed by the hardware, plus whatever else the kernel must save (mainly the state of the interrupted process, or of another interrupt). This state is generally stored on the interrupt stack (independent on x86; on other architectures it may share the kernel stack, which depends closely on the specific architecture). After the interrupt ends, the process can resume from its original state.

  • Process context : processes are managed and scheduled by the kernel, and process switches happen in kernel mode. A process's context includes user-space resources such as virtual memory, the stack, and global variables, as well as kernel-space state such as the kernel stack and registers.

  • System call context : a process can run in both kernel space and user space, called the kernel mode and user mode of the process. Moving from user mode to kernel mode happens through a system call, which requires a CPU context switch: the user-mode CPU context (the user-mode stack) is saved to the kernel stack, and the kernel-mode CPU context is then loaded.

  • The CPU is always in one of the following states:
    1. Kernel mode, running in process context: the kernel executes in kernel space on behalf of a process;
    2. Kernel mode, running in interrupt context: the kernel executes in kernel space on behalf of hardware;
    3. User mode, running in user space.

Interrupts

Interrupts are generated by hardware devices as physical electrical signals. They reach the CPU through the interrupt controller; the CPU determines which hardware device the interrupt came from (this mapping is defined in the kernel) and dispatches it to the kernel, which handles the interrupt.

Simple processing flow of hard interrupt:

Hard interrupt implementation: interrupt controller + interrupt service routine

Interrupt framework design (x86):

An x86 CPU provides only two external pins for interrupts: NMI and INTR. NMI is the non-maskable interrupt, usually used for power failures and physical memory parity errors. INTR is the maskable interrupt, which can be masked by setting the interrupt mask bit and is mainly used to accept interrupt signals from external hardware; the signal is delivered to the CPU by the interrupt controller. Mainstream x86 SMP systems use a multi-level I/O APIC (Advanced Programmable Interrupt Controller) interrupt architecture.

  • Local APIC: mainly responsible for delivering interrupt signals to the specified processor;

  • I/O APIC: mainly collects interrupt signals from I/O devices and sends a signal to the local APIC when those devices need to raise an interrupt;

 

Interrupt classification:

Interrupts fall into synchronous and asynchronous interrupts:

  • A synchronous interrupt is generated by the CPU control unit while executing an instruction. It is called synchronous because the CPU raises it only after an instruction completes, not in the middle of one; a system call is an example. In Intel's terminology, synchronous interrupts are called exceptions, and exceptions divide into three classes: faults, traps, and aborts.

  • Asynchronous interrupts are generated by other hardware devices at arbitrary times relative to the CPU clock signal, which means they can arrive between any two instructions; a keyboard interrupt is an example. In Intel's terminology these are simply called interrupts, and they divide into maskable interrupts and non-maskable interrupts (NMI).

  1. Non-maskable interrupts (NMI) : as the name implies, these interrupts cannot be ignored or canceled by the CPU. NMIs are delivered on a separate interrupt line and are usually used for critical hardware failures, such as memory errors, fan failures, and temperature sensor failures.

  2. Maskable interrupts : these can be ignored or delayed by the CPU. They are raised when an external pin of the interrupt controller is triggered, and the interrupt mask register can mask them: clearing the corresponding bit to 0 disables interrupts triggered on that pin.

Processing flow:

Interrupts vs. exceptions:

Similarities:

1. Both are ultimately delivered to the kernel by the CPU and handled there;

2. The handler flow follows a similar design.

Differences:

1. The sources of generation are different. Traps and exceptions are generated by the CPU, while interrupts are generated by hardware devices;

2. The kernel needs to call different handlers according to whether it is an exception, a trap, or an interrupt;

3. Interrupts are not clock-synchronous, which means that interrupts may come at any time; traps and exceptions are generated by the CPU, so they are clock-synchronous;

4. When processing interrupts, it is in the interrupt context; when processing traps and exceptions, it is in the process context.

Interrupt affinity:

  • On SMP systems, CPU affinity can be set through system calls and a set of related macros, binding one or more processes to run on one or more processors. Interrupts have the same property: interrupt affinity means binding one or more interrupt sources to run on a specific CPU;

  • Under /proc/irq, every hardware device with a registered interrupt handler gets a directory named after its interrupt number, containing a smp_affinity file (SMP architectures only). It holds a CPU bitmask used to set the interrupt's affinity; the default value 0xffffffff means the interrupt may be delivered to any CPU. If the interrupt controller does not support IRQ affinity, this default cannot be changed; and the mask can never be set to 0x0, i.e. all CPUs cannot be excluded;

  • Interrupt affinity pays off under heavy hardware-interrupt load. For file servers and high-traffic web servers, spreading the network card IRQs evenly across CPUs relieves any single CPU and improves the system's overall interrupt-handling capacity. For database servers, binding the disk controller to one CPU and the network card to another improves database response time. Balancing IRQs according to your own production environment and workload helps improve overall throughput and performance;


Classification of Common Interrupts in Linux System

Clock interrupt:

Generated by the clock chip, the clock interrupt's main jobs are to handle all time-related bookkeeping, decide whether to run the scheduler, and process bottom halves. Time-related bookkeeping includes the system time, process time slices, delays, CPU time, and the various timers; updating process time slices provides the basis for scheduling, and on return from the clock interrupt the kernel decides whether to invoke the scheduler. The bottom-half handler is a mechanism Linux provides to defer part of the work. The clock interrupt must keep system time absolutely accurate; it is the pulse of the entire operating system.

NMI interrupt:

Triggered either by external hardware through the CPU's NMI pin (hardware trigger), or by software delivering an NMI-type interrupt on the CPU's system bus (software trigger). NMIs have two main uses:

  • It is used to inform the operating system of hardware failure (Hardware Failure), such as memory error, fan failure, temperature sensor failure, etc.;

  • Used as a watchdog timer to detect CPU deadlocks, etc.;

Hardware I/O interrupts:

Most hardware peripherals raise I/O interrupts: network cards, keyboards, hard disks, mice, USB, serial ports, etc.;

Virtual interrupts:

Interrupt exits and interrupt injection in KVM; software-simulated interrupts;

To view them: cat /proc/interrupts

Linux system interrupt handling

Because an interrupt suspends the kernel's normal scheduling of processes, interrupt service routines should be as short as possible. In real systems, however, substantial time-consuming work often must be done when an interrupt arrives. The two goals, a fast handler and a handler that accomplishes a lot, pull against each other, and the top-half/bottom-half mechanism was born to reconcile them.

Interrupt the top half:

The interrupt handler proper is the top half: it begins executing as soon as the interrupt is accepted, but performs only work with strict timing requirements. Anything that can wait is deferred to the bottom half, which runs later at a suitable time. The top half is simple and fast, and executes with some or all interrupts disabled.

Interrupt bottom half:

The bottom half runs later, with all interrupts enabled while it executes. This design keeps the time the system spends with interrupts masked as short as possible, improving responsiveness. The top half has only one mechanism, the interrupt handler; the bottom half is implemented by softirqs, tasklets, and work queues;

soft interrupt

Softirqs, the representative bottom-half mechanism, arose together with SMP (symmetric multiprocessing) and are also the basis on which tasklets are implemented (a tasklet is essentially a softirq with an extra serialization mechanism added). "Soft interrupt" is often used as a general term for deferrable functions, sometimes including tasklets (the reader should judge from context whether tasklets are included). Softirqs exist to satisfy the top-half/bottom-half split described above: time-insensitive work is deferred, and it can execute in parallel on multiple CPUs, raising overall system efficiency. A softirq cannot run immediately when raised; it must wait for the kernel to schedule it.

A softirq cannot be interrupted by another softirq on the same CPU (softirqs do not nest locally); it can only be interrupted by a hard interrupt (the top half). The same softirq type can, however, run concurrently on multiple CPUs, so softirq handlers must be written as reentrant functions, and spinlocks are needed to protect their shared data structures.


Scheduling timing of soft interrupts:

  1. When do_IRQ finishes handling an I/O interrupt and calls irq_exit.

  2. When the system handles the local clock interrupt (on systems using the I/O APIC).

  3. In local_bh_enable, i.e. when local softirqs are re-enabled.

  4. On SMP systems, when a CPU finishes the function triggered by a CALL_FUNCTION_VECTOR inter-processor interrupt.

  5. When a ksoftirqd/n thread is woken up.

softirq kernel thread

In Linux, interrupts have the highest priority. Whenever an interrupt event occurs, the kernel immediately executes the corresponding handler, and normal tasks resume only after all pending hard and soft interrupts have been processed, so real-time tasks may not be served in time. With threaded interrupts, an interrupt runs as a kernel thread and is assigned a real-time priority; a real-time task can then be given higher priority than the interrupt thread, so the most urgent real-time work runs first and real-time behavior is preserved even under heavy load. Not every interrupt can be threaded, however. The clock interrupt maintains system time and timers; it is the pulse of the operating system, and if it were threaded it could be suspended, with unthinkable consequences, so it is not threaded.

Softirqs are first executed in irq_exit(); if certain limits are exceeded, execution is handed over to the ksoftirqd thread. A softirq is deferred to ksoftirqd if any of the following holds:

  • irq_exit()->__do_softirq() has been running for more than 2 ms.

  • irq_exit()->__do_softirq() has polled the pending softirqs more than 10 times.

  • irq_exit()->__do_softirq() is running and the current thread needs to be rescheduled.

Note: when raise_softirq() is called outside interrupt context, the softirq is run by waking the ksoftirqd thread.

TASKLET

Because softirq handlers must be reentrant, their design is more complex, which burdens device-driver developers. And when a task does not need to run in parallel on multiple CPUs, a softirq is unnecessary. The tasklet was created to address both points. Its properties:

a) A tasklet of a given type runs on only one CPU at a time; it executes serially, never in parallel with itself.

b) Tasklets of different types can run in parallel on multiple CPUs.

c) Softirqs are statically allocated and cannot be changed after the kernel is compiled, while tasklets are far more flexible and can be created at runtime (for example, when a module is loaded).

Tasklets are implemented on top of two softirq types, so when softirq-style parallelism is not needed, a tasklet is the best choice. In other words, a tasklet is a special use of softirqs: deferred, serialized execution.

There are two types of tasklets, tasklet and hi-tasklet:

  • tasklet   corresponds to softirq_vec[TASKLET_SOFTIRQ];

  • hi-tasklet  corresponds to softirq_vec[HI_SOFTIRQ]; the latter simply comes first in softirq_vec[], so it is executed earlier;

/proc/softirqs provides the running status of softirqs

# cat /proc/softirqs
                    CPU0
          HI:          1   // high-priority tasklet softirq
       TIMER:   12571001   // timer softirq
      NET_TX:     826165   // NIC transmit softirq
      NET_RX:    6263015   // NIC receive softirq
       BLOCK:    1403226   // block device softirq
BLOCK_IOPOLL:          0   // block device polling softirq
     TASKLET:       3752   // ordinary tasklet softirq
       SCHED:          0   // scheduler softirq
     HRTIMER:          0   // no longer used
         RCU:    9729155   // RCU softirq, mainly callback processing

work queue

The work queue (workqueue) is a Linux kernel mechanism for deferring work. Softirqs run in interrupt context and therefore cannot block or sleep; tasklets are implemented with softirqs, so of course neither can they. A work queue instead hands the deferred work to a kernel thread, so this bottom half always runs in process context. The advantage of work queues is therefore that they allow rescheduling and even sleeping.

Several role relationships in workqueue:

  • work : a unit of work/task.

  • workqueue : a collection of work items. workqueue to work is a one-to-many relationship.

  • worker : the worker; in the code a worker corresponds to a worker_thread() kernel thread.

  • worker_pool : a collection of workers. worker_pool to worker is one-to-many.

  • pwq (pool_workqueue): the intermediary that connects a workqueue to a worker_pool. workqueue to pwq is one-to-many; pwq to worker_pool is one-to-one.

In general, to choose between work queues and softirqs/tasklets, the following rules can be used:

  • If the deferred task needs to sleep, only a work queue will do.

  • If the deferred task must be triggered after a specified delay, use a work queue, which can rely on a timer (kernel timer) for the delay.

  • If the deferred task must be handled within one tick, use a softirq or tasklet, which can preempt ordinary processes and kernel threads (and must not sleep).

  • If the deferred task has no latency requirement at all, use a work queue; such tasks are usually unimportant.

In essence, a work queue hands work to a kernel thread, so a raw kernel thread could be used instead. But creating and destroying kernel threads correctly is demanding, and the work queue wraps kernel threads in a less error-prone interface, so work queues are recommended.

Interrupt context

Interrupt code runs in kernel space. The interrupt context is the CPU environment needed to run that code: parameters passed by the hardware, plus whatever else the kernel must save (mainly the state of the interrupted process, or of another interrupt). This state is generally stored on the interrupt stack (independent on x86; on other architectures it may share the kernel stack, depending on the architecture). After the interrupt ends, the process can resume from its original state.

Linux determines whether it is in interrupt context from preempt_count, as follows:

#define in_irq()       (hardirq_count())  // handling a hard interrupt

#define in_softirq()   (softirq_count())  // handling a softirq

#define in_interrupt() (irq_count())      // handling a hard or soft interrupt

#define in_atomic()    ((preempt_count() & ~PREEMPT_ACTIVE) != 0)  // any atomic context: all of the above, plus preemption-disabled regions

  

Summary and points to note:

1. The designer of the Linux kernel made the rules:

  • Interrupt context is not a scheduling entity ; the scheduling entity is the task [a process (main thread) or a thread];

  • Priority order : hard interrupt context > soft interrupt context > process context;

Interrupt contexts (hardirq and softirq context) do not participate in scheduling (setting aside threaded interrupts for now); they are a mechanism for handling asynchronous events, and the goal is to finish processing as quickly as possible and return. All interrupt contexts therefore take priority over process contexts. That is, for user processes (whether in kernel or user mode) and kernel threads alike, unless local interrupts are disabled on the CPU, once an interrupt occurs nothing can stop the interrupt context from preempting the current process context.

2. Linux splits interrupt handling into two stages, the top half and the bottom half :

  • The top half handles the interrupt quickly. It runs with interrupts disabled and mainly deals with hardware-related or time-sensitive work that must execute fast;

  • The bottom half handles the work the top half left unfinished. It usually runs as softirqs and can be deferred.

3. Hard and soft interrupts (anything in interrupt context) cannot be preempted by the kernel (kernel preemption is discussed in later chapters of this series). In interrupt context, the only thing that can interrupt the current handler is a higher-priority interrupt; a process never can (the same holds for softirqs and tasklets, which is why these bottom halves must not sleep). If code sleeps in interrupt context, there is no way to wake it: every wake_up_xxx targets a process, but in interrupt context there is no process and no corresponding task_struct (again true for softirqs and tasklets). So if interrupt context really sleeps, for example by calling a routine that blocks, the kernel will all but hang.

4. A hard interrupt can be "interrupted" by a higher-priority hard interrupt, but not by one of the same level (the same type of hard interrupt) or a lower level, let alone by a softirq. A softirq can be "interrupted" by a hard interrupt but not by another softirq: on any one CPU, softirqs always execute serially. On a uniprocessor, therefore, accessing a softirq's data structures requires no synchronization primitives.

5. Disabling interrupts does not lose them, but multiple identical interrupts arriving in the meantime are merged into one and handled only once. The jiffies count is updated in the clock interrupt, so if several clock interrupts are merged, another hardware clock is needed to correct jiffies and limit the loss of accuracy.

 


Origin blog.csdn.net/youzhangjing_/article/details/131577331