Understanding the essence of epoll, starting from the hardware

Anyone doing server-side development will sooner or later run into network programming. Epoll is an essential technology for high-performance network servers on Linux; Nginx, Redis, Skynet and most game servers all use this multiplexing technique.

Epoll is clearly important, but how does it differ from select, and where does its efficiency come from?


There are many articles about epoll on the Internet, but they are either too superficial or buried in source-code analysis, and few are easy to follow. This article is therefore written so that readers without a systems background can also understand how epoll works.

The core goal of this article: make it clear to the reader why epoll performs so well.

The article starts from how a network card receives data, ties in CPU interrupts and operating-system process scheduling, then walks through the evolution from blocking recv, to select, to epoll, step by step, and finally explores epoll's implementation details.


1. Starting from how the network card receives data

Below is a typical computer structure diagram: a computer consists of a CPU, memory and a network interface. The first step to understanding the essence of epoll is to see, from the hardware's perspective, how the computer receives network data.


Computer structure diagram (source: the microcomputer structure diagram in the book Linux Kernel Fully Annotated)

The following figure shows the process of the network card receiving data.

In stage ①, the network card receives data from the network cable;
in stage ②, the data passes through the hardware transmission circuitry;
in stage ③, the data is finally written to an address in memory.
This process involves hardware details such as DMA transfers and IO path selection, but all we need to know is this: the network card writes the data it receives into memory.

The process of network card receiving data

After this hardware transfer, the data received by the network card sits in memory, and the operating system can then read it.

2. How do we know that data has been received?

The second step to understanding the essence of epoll is to look at data reception from the CPU's perspective. To answer this question, we first need to understand a concept: the interrupt.

When a computer runs programs, some work has higher priority than others. For example, when the computer receives a power-off signal, it should save data immediately; the data-saving routine has a higher priority than ordinary programs (a capacitor can store enough charge to keep the CPU running for a short while).

Generally speaking, signals generated by hardware demand an immediate CPU response, otherwise data may be lost, so they have high priority. The CPU interrupts the program it is currently executing to respond to the hardware, and resumes the user program once the response is finished. The interrupt process, shown in the figure below, is similar to a function call, except that a function's call site is fixed in advance, whereas the point at which an interrupt occurs is determined by when the "signal" arrives.


Interrupt program call

Take the keyboard as an example: when the user presses a key, the keyboard raises a high level on the CPU's interrupt pin. The CPU detects this signal and then runs the keyboard interrupt handler. The figure below shows how various hardware devices interact with the CPU through interrupts.

CPU interrupt (picture source: net.pku.edu.cn)

Now we can answer the question "how do we know that data has been received?": when the network card writes data into memory, it sends an interrupt signal to the CPU; the operating system then knows that new data has arrived and processes it in the network card's interrupt handler.

3. Why doesn't a blocked process consume CPU?

The third step to understanding the essence of epoll is to look at data reception from the perspective of operating-system process scheduling. Blocking is a key part of process scheduling: it is the state a process waits in before some event (such as the arrival of network data) occurs. recv, select and epoll are all blocking methods. Let's analyze why a blocked process does not consume CPU.

For simplicity, let's start the analysis with an ordinary blocking recv. First look at the following code:

// create a socket
int s = socket(AF_INET, SOCK_STREAM, 0);
// bind
bind(s, ...)
// listen
listen(s, ...)
// accept a client connection
int c = accept(s, ...)
// receive data from the client
recv(c, ...);
// print the data
printf(...)

This is the most basic network programming code: first create a socket, then call bind, listen and accept in turn, and finally call recv to receive data. recv is a blocking call: when the program reaches recv, it waits until data arrives before continuing.
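For readers who want to try it, here is a minimal compilable version of the sketch above (the port 8080, the backlog and the buffer size are arbitrary choices for illustration, and error handling is omitted):

#include <arpa/inet.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void) {
    // create a socket
    int s = socket(AF_INET, SOCK_STREAM, 0);

    // bind to 0.0.0.0:8080 (port chosen arbitrarily for this example)
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(8080);
    bind(s, (struct sockaddr *)&addr, sizeof(addr));

    // listen
    listen(s, 128);

    // accept a client connection
    int c = accept(s, NULL, NULL);

    // receive data from the client -- the process blocks here until data arrives
    char buf[1024];
    ssize_t n = recv(c, buf, sizeof(buf), 0);

    // print the data
    if (n > 0) printf("%.*s\n", (int)n, buf);

    close(c);
    close(s);
    return 0;
}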

So what is the principle of blocking?

Work queue

To support multitasking, the operating system implements process scheduling and divides processes into several states such as "running" and "waiting". Running means the process holds the CPU and its code is executing; waiting is the blocked state. For example, when the program above reaches recv, it switches from running to waiting, and switches back to running after data is received. The operating system executes the running processes in a time-sliced fashion; because the switching is so fast, it looks as if many tasks are being executed at the same time.

The computer in the figure below is running three processes, A, B and C. Process A is running the basic network program above. Initially, all three processes are referenced by the operating system's work queue; they are in the running state and will be executed in time slices.


There are three processes A, B and C in the work queue

Waiting queue

When process A executes the statement that creates the socket, the operating system creates a socket object managed by the file system (as shown below). This socket object contains members such as a send buffer, a receive buffer and a waiting queue. The waiting queue is a very important structure: it points to all the processes that need to wait for events on this socket.


Create socket

When the program reaches recv, the operating system moves process A from the work queue to the socket's waiting queue (as shown in the figure below). Since only processes B and C remain in the work queue, the CPU will, according to process scheduling, execute only these two processes and will not execute process A's code. Process A is therefore blocked: it does not run and does not consume CPU.


Socket waiting queue

Note: when the operating system adds a process to a socket's waiting queue, it only stores a reference to the "waiting" process, so that it can later retrieve the process object and wake it up when data arrives; the process is not literally handed over to the socket for management. For ease of illustration, the figure above simply draws the process hanging off the waiting queue.

Wake up process

When the socket receives data, the operating system moves the process in the socket's waiting queue back into the work queue; the process becomes runnable again and continues executing its code. At the same time, because the socket's receive buffer now contains data, recv can return the received data.

4. The whole process of the kernel receiving network data

This step combines what we know about network cards, interrupts and process scheduling to describe the whole process by which the kernel receives data while recv is blocking.

As shown in the figure below, while recv is blocked, the computer receives data from the peer (step ①), the data is transferred into memory by the network card (step ②), and the network card then notifies the CPU of the data's arrival through an interrupt signal, so the CPU executes the interrupt handler (step ③).

The interrupt handler here does two main things: first it writes the network data into the receive buffer of the corresponding socket (step ④), then it wakes up process A (step ⑤), putting process A back into the work queue.


The whole process of kernel receiving data

The process of waking up the process is shown in the following figure:


Wake up process

The above is the whole process of the kernel receiving data. At this point two questions are worth thinking about:

First, how does the operating system know which socket a piece of network data belongs to?
Second, how can the data of multiple sockets be monitored at the same time?

The first question: a socket corresponds to a port number, and a network packet carries ip and port information, so the kernel can find the corresponding socket by port number. Of course, to speed this up, the operating system maintains an index structure from port number to socket for fast lookup.
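The idea of such an index can be pictured with a toy sketch (this is only a conceptual illustration, not actual kernel code; the real kernel hashes on the full connection tuple, not just the destination port):

// Conceptual sketch only: map a destination port to the socket object that owns it.
#define PORT_TABLE_SIZE 65536

struct socket_obj;                                     // stands in for the kernel's socket
static struct socket_obj *port_table[PORT_TABLE_SIZE];

// Called on the receive path to find which socket an incoming packet belongs to.
struct socket_obj *lookup_socket(unsigned short dst_port) {
    return port_table[dst_port];                       // O(1) lookup by destination port
}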

The second question is the crux of multiplexing and is the focus of the second half of this article.

5. A simple way to monitor multiple sockets at the same time

A server needs to manage many client connections, but recv can only wait on a single socket. Out of this tension, people began to look for ways to monitor multiple sockets, and the essence of epoll is to monitor multiple sockets efficiently.

Looking at the historical development, an inefficient method inevitably appears first and is then improved; this is exactly the relationship of select to epoll.

Understanding the less efficient select first makes it easier to grasp the essence of epoll.

Suppose we could pass in a list of sockets in advance: if none of the sockets in the list has data, suspend the process, and as soon as any socket receives data, wake the process up. This method is very straightforward, and it is exactly the design idea of select.

To make this easier to follow, let's first review how select is used. In the code below, an array fds first gathers all the sockets that need to be monitored, then select is called. If none of the sockets in fds has data, select blocks until some socket receives data, at which point select returns and the process is woken up. The user then traverses fds, uses FD_ISSET to determine which sockets received data, and processes them.

int s = socket(AF_INET, SOCK_STREAM, 0);
bind(s, ...);
listen(s, ...);

int fds[] = /* the sockets to be monitored */;

while(1){
    int n = select(..., fds, ...);
    for(int i = 0; i < fds.count; i++){
        if(FD_ISSET(fds[i], ...)){
            // process the data on fds[i]
        }
    }
}
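A compilable version of this sketch could look like the following toy server: it accepts a fixed number of clients first and then uses select to wait for data on them (the port 8080, the client count and the missing error handling are all simplifications for illustration):

#include <arpa/inet.h>
#include <stdio.h>
#include <string.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <unistd.h>

#define NFDS 2   // number of client sockets monitored in this toy example

int main(void) {
    int s = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(8080);          // arbitrary port for illustration
    bind(s, (struct sockaddr *)&addr, sizeof(addr));
    listen(s, 128);

    // accept a fixed number of clients up front, just to have sockets to watch
    int fds[NFDS];
    for (int i = 0; i < NFDS; i++)
        fds[i] = accept(s, NULL, NULL);

    char buf[1024];
    while (1) {
        // the fd_set must be rebuilt and handed to the kernel on every call
        fd_set readfds;
        FD_ZERO(&readfds);
        int maxfd = 0;
        for (int i = 0; i < NFDS; i++) {
            FD_SET(fds[i], &readfds);
            if (fds[i] > maxfd) maxfd = fds[i];
        }

        // blocks until at least one monitored socket is readable
        int n = select(maxfd + 1, &readfds, NULL, NULL, NULL);
        if (n <= 0) continue;

        // after waking up we still scan the whole list to find the ready sockets
        for (int i = 0; i < NFDS; i++) {
            if (FD_ISSET(fds[i], &readfds)) {
                ssize_t len = recv(fds[i], buf, sizeof(buf), 0);
                if (len > 0) printf("fd %d: %.*s\n", fds[i], (int)len, buf);
            }
        }
    }
}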

select process

The implementation idea of select is straightforward. Suppose the program monitors the three sockets sock1, sock2 and sock3 shown in the figure below; after select is called, the operating system adds process A to the waiting queue of each of these three sockets.


The operating system adds process A to the waiting queue of these three sockets

When any of the sockets receives data, the interrupt handler wakes the process up. The figure below shows the flow when sock2 receives data:

Note: the interrupt callbacks registered by recv and by select can be set to do different things.


sock2 receives data and the interrupt handler wakes up process A

Waking up the process means removing it from all the waiting queues and adding it back to the work queue, as shown in the following figure:


Remove process A from all waiting queues and add it to the work queue

After these steps, when process A is woken up it knows that at least one socket has received data. The program then only needs to traverse the socket list once to find the ready sockets.

This simple method works well and has corresponding implementations in almost all operating systems.

But simple methods often have disadvantages, mainly:

First, every call to select has to add the process to the waiting queue of every monitored socket, and every wake-up has to remove it from every queue again. This means two traversals, and the whole fds list has to be passed to the kernel on every call, which carries non-trivial overhead. Precisely because these traversals are expensive, select limits the number of sockets it can monitor; by default only 1024 sockets can be monitored (FD_SETSIZE).

Second, after the process is woken up, the program still does not know which sockets received data, so it has to traverse the list once more.

So, is there a way to reduce these traversals? Is there a way to record which sockets are ready? These are exactly the two problems that epoll sets out to solve.

Supplementary note: this section only covers one of select's scenarios. When select is called, the kernel first scans all the sockets; if any socket's receive buffer already contains data, select returns immediately without blocking (and if several sockets have data, the return value is greater than 1). The process only blocks if no socket has data.

6. epoll's design ideas

Epoll was invented many years after select appeared. It is an enhanced version of select and poll (poll is basically the same as select, with a few improvements). Epoll improves efficiency with the following measures:

Measure 1: Function separation

One reason select is inefficient is that it bundles the two steps "maintain the waiting queues" and "block the process" into one. As shown in the figure below, every call to select performs both steps. In most applications, however, the set of monitored sockets is relatively stable and does not need to change on every call. Epoll separates the two operations: first the waiting queues are maintained with epoll_ctl, then the process blocks in epoll_wait. The efficiency gain is obvious.


Compared with select, epoll splits the functionality into separate calls

To make the rest of the article easier to follow, let's first look at how epoll is used. In the code below, epoll_create first creates an epoll object epfd, then the sockets to be monitored are added to epfd via epoll_ctl, and finally epoll_wait is called to wait for data:

int s = socket(AF_INET, SOCK_STREAM, 0);
bind(s, ...)
listen(s, ...)

int epfd = epoll_create(...);
epoll_ctl(epfd, ...); // add all sockets to be monitored to epfd

while(1){
    int n = epoll_wait(...);
    for(each socket that received data){
        // process it
    }
}
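For comparison with the select example above, here is a compilable sketch of the same toy server using epoll (again, port 8080, the fixed client count and the missing error handling are simplifications for illustration):

#include <arpa/inet.h>
#include <stdio.h>
#include <string.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

#define NFDS 2   // number of client sockets in this toy example

int main(void) {
    int s = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(8080);          // arbitrary port for illustration
    bind(s, (struct sockaddr *)&addr, sizeof(addr));
    listen(s, 128);

    // create the epoll object (the eventpoll described below);
    // the size argument is ignored on modern kernels but must be > 0
    int epfd = epoll_create(1);

    // register the monitored sockets once, up front, with epoll_ctl
    for (int i = 0; i < NFDS; i++) {
        int c = accept(s, NULL, NULL);
        struct epoll_event ev;
        ev.events = EPOLLIN;              // interested in "readable" events
        ev.data.fd = c;
        epoll_ctl(epfd, EPOLL_CTL_ADD, c, &ev);
    }

    char buf[1024];
    struct epoll_event events[NFDS];
    while (1) {
        // blocks until at least one registered socket is ready;
        // only the ready sockets are returned, so no full traversal is needed
        int n = epoll_wait(epfd, events, NFDS, -1);
        for (int i = 0; i < n; i++) {
            int fd = events[i].data.fd;
            ssize_t len = recv(fd, buf, sizeof(buf), 0);
            if (len > 0) printf("fd %d: %.*s\n", fd, (int)len, buf);
        }
    }
}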

Separation of functions makes it possible for epoll to be optimized.

Measure 2: Ready list

Another reason select is inefficient is that the program does not know which sockets received data and can only check them one by one. If the kernel maintained a "ready list" referencing the sockets that have received data, that traversal could be avoided. As shown in the figure below, the computer has three sockets; sock2 and sock3, which have received data, are referenced by the ready list rdlist. When the process is woken up, it only has to read rdlist to know which sockets received data.

Schematic diagram of ready list

7. epoll's principle and workflow

This section will explain the principle and workflow of epoll with examples and diagrams.

Create epoll object

As shown in the figure below, when a process calls epoll_create, the kernel creates an eventpoll object (the object represented by epfd in the program). The eventpoll object is also managed by the file system and, like a socket, it has its own waiting queue.


The kernel creates eventpoll objects

It is necessary to have an eventpoll object representing the epoll instance, because the kernel needs to maintain data such as the "ready list", and these can simply be members of eventpoll.

Maintain watch list

After creating the epoll object, epoll_ctl can be used to add or remove the sockets to be monitored. Take adding as an example: as shown in the figure below, when sock1, sock2 and sock3 are added through epoll_ctl, the kernel adds eventpoll to the waiting queue of each of these three sockets.


Add the socket to be monitored

From then on, when a socket receives data, the interrupt handler operates on the eventpoll object rather than directly on the process.

Receive data

When a socket receives data, the interrupt handler adds a reference to that socket to eventpoll's "ready list". The figure below shows that after sock2 and sock3 receive data, the interrupt handler makes rdlist reference these two sockets.


Add a reference to the ready list

The eventpoll object acts as an intermediary between the sockets and the process: data arriving on a socket does not affect the process directly; instead, it changes the process's state by changing eventpoll's ready list.

When the program reaches epoll_wait, it returns immediately if rdlist already references some socket; if rdlist is empty, the process blocks.

Block and wake up the process

Suppose processes A and B are running on the computer, and at some point process A reaches the epoll_wait statement. As shown in the figure below, the kernel puts process A into eventpoll's waiting queue and the process blocks.


epoll_wait blocking process

When a socket receives data, the interrupt handler on the one hand updates rdlist, and on the other hand wakes up the processes in eventpoll's waiting queue, so process A becomes runnable again (as shown below). And precisely because rdlist exists, process A knows which sockets changed.


epoll wakes up the process

8. Implementation details of epoll

By now readers should have a fair understanding of what epoll essentially is. But we still need to know what the eventpoll data structure looks like.

Also, what data structure should the ready list use? And with what data structure should eventpoll manage the sockets added and removed through epoll_ctl?

As shown in the figure below, eventpoll contains members such as lock, mtx, wq (the waiting queue), rdlist and rbr; rdlist and rbr are the ones we care about here.

Schematic diagram of epoll principle, picture source: "Deep Understanding of Nginx: Module Development and Architecture Analysis (Second Edition)", Tao Hui

Data structure of the ready list

The ready list references the ready sockets, so it should support fast insertion.

The program may call epoll_ctl at any time to add a monitored socket, and may also remove one at any time. When a socket is removed, if it is already on the ready list, it must be taken off that list as well. So the ready list should be a data structure that supports both fast insertion and fast deletion.

A doubly linked list is exactly such a data structure, and epoll uses a doubly linked list to implement the ready list (the rdllist in the figure above).
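To see why a doubly linked list fits, here is a minimal sketch of the idea (in the spirit of the kernel's list_head, not the actual epoll code): with both prev and next pointers, inserting at the tail and unlinking an arbitrary node are both O(1), with no traversal:

// Minimal circular doubly linked list sketch.
struct list_node {
    struct list_node *prev, *next;
};

// initialize an empty list: the head points to itself in both directions
static void list_init(struct list_node *head) {
    head->prev = head;
    head->next = head;
}

// insert `node` just before `head`, i.e. at the tail of the list: O(1)
static void list_add_tail(struct list_node *node, struct list_node *head) {
    node->prev = head->prev;
    node->next = head;
    head->prev->next = node;
    head->prev = node;
}

// unlink `node` from whatever list it is on: O(1), no traversal needed
static void list_del(struct list_node *node) {
    node->prev->next = node->next;
    node->next->prev = node->prev;
}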

Index structure

Since epoll separates "maintaining the monitored set" from "blocking the process", it also needs a data structure to store the monitored sockets, one that at least makes adding and removing easy, and makes lookup easy so that duplicate additions can be avoided. A red-black tree is a self-balancing binary search tree whose search, insertion and deletion are all O(log N), which performs well, so epoll uses a red-black tree as its index structure (the rbr in the figure above).

Note: because the operating system has to support many features and keep more per-socket data, rdlist does not reference the sockets directly but indirectly through epitem objects; the nodes of the red-black tree are also epitem objects. Likewise, the file system does not reference the sockets directly. To keep the explanation simple, some of these intermediate structures are omitted in this article.
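To tie the pieces together, here is a heavily simplified sketch of how eventpoll and epitem relate (the field names echo the figure above, but this is not the real kernel definition; many fields, types and locking details are omitted):

// Simplified for illustration only -- not the actual kernel structures.
struct list_node { struct list_node *prev, *next; };          // doubly linked list
struct rbt_node  { struct rbt_node *parent, *left, *right; }; // red-black tree node (color omitted)

struct eventpoll {
    struct list_node wq;      // waiting queue: processes blocked in epoll_wait
    struct list_node rdlist;  // ready list: epitems whose socket has data
    struct rbt_node *rbr;     // root of the red-black tree of monitored sockets
    /* the real eventpoll also holds locks (lock, mtx) and other bookkeeping */
};

struct epitem {
    struct rbt_node   rbn;     // this item's node in the red-black tree
    struct list_node  rdllink; // this item's node in the ready list (when ready)
    int               fd;      // the monitored socket's file descriptor
    struct eventpoll *ep;      // back-pointer to the owning eventpoll
};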

9. Summary

Building on select and poll, epoll inserts eventpoll as an intermediate layer and uses well-chosen data structures, which makes it an efficient multiplexing technique. To close, here is a brief comparison of select, poll and epoll based on the points above.

Passing the monitored sockets to the kernel: select and poll hand over the whole list on every call, while epoll registers each socket once with epoll_ctl.
Finding the ready sockets: after waking up, select and poll still require scanning the whole list, while epoll reads the ready list (rdlist) maintained by the kernel.
Maximum number of monitored sockets: select is limited to 1024 by default, while poll and epoll have no such fixed limit.

I hope readers have gained something from this article.


Source: https://blog.csdn.net/lingshengxueyuan/article/details/111747197