[Linux] Advanced IO --- multiplexing, select, poll, epoll

All pleasure obtained through shortcuts, whether money, sex or fame, will eventually bring pain to oneself


1. Five IO models

1. What is efficient IO? (Reduce the proportion of waiting time)

1.
The most common network IO design pattern for back-end servers is Reactor, also called the reactor pattern. A Reactor can run as a single process with a single thread, yet it can handle network IO requests from many clients; because there is only one execution flow, its cost is low, CPU and memory usage stay low, and that reduces the server's overhead and improves its performance.
Compared with Reactor, the drawbacks of multi-process and multi-threaded servers are obvious. Under high concurrency the server faces a huge number of connections, and every thread needs its own memory, stack and kernel data structures, so a large number of threads consumes resources and drags performance down. Multi-threading also brings thread context switches, i.e. switches between execution flows; every switch must save and restore the thread's context, which costs CPU time, and frequent switching lowers performance further. Those problems are all from the server's point of view. From the programmer's point of view, the most annoying thing about a server with many execution flows is debugging and bug hunting, so the ecosystem around such servers is weaker, problems are harder to troubleshoot, and the server is harder to maintain. And because multiple execution flows may access critical resources at the same time, the server is also less safe: resource races and data corruption can occur.

2.
Having talked about the benefits of the Reactor pattern, let's figure out what efficient IO really is; only by truly understanding IO can we understand Reactor.
IO is nothing new to us. In fact our computers are doing IO all the time: the von Neumann architecture dictates it, since data has to be fetched from storage devices into memory and results written back to memory. Looking at network IO, the computer has to move data from hardware such as the network card into memory, and after the data is processed it may have to be sent back out through the network card; that whole round trip is an IO process, so IO is a very ordinary thing for a computer.

3.
But the understanding above is still not deep enough. Back when we learned TCP socket programming we said that IO essentially means copying: send copies data from our own buffer into the kernel's sk_buff, and recv copies data from the kernel's receive buffer into a buffer defined at the application layer. So at that time we thought IO was simply a data copy.
But think about it more carefully. If you call recv, is the data guaranteed to be copied into the application-layer buffer? What if there is nothing in the sk_buff, because no client has sent anything to my server yet? Likewise, if you call send, is the data guaranteed to be copied into the kernel's sk_buff? What if the sk_buff is already full and has no space left?
If those situations occur, what do interfaces such as recv and send do? The answer is that they wait! They wait until the condition is ready and only then copy the data: recv waits until there is data in the sk_buff, and send waits until there is free space in it. Once the condition is ready, the IO interface performs the copy. So today we redefine IO: IO is not just data copying, it also includes waiting; when the condition being waited for becomes ready, the interface copies the data. IO = waiting + data copying.

4.
We have actually run into this kind of waiting before. When we studied inter-process communication through pipes, if the writing end writes nothing, the reading end blocks; the essence of blocking is waiting, in that case waiting for the writer to put data into the pipe. Waiting is common when an IO interface reads, but less common when it writes, because in most cases the write event is ready immediately: there is usually free space in the kernel send buffer, and TCP has its own sliding-window strategy for sending, so it is rare for a write event not to be ready. A read event not being ready, on the other hand, is very common, especially over the network. After send is called, the packet only reaches the kernel's sk_buff; when and how it is actually transmitted is decided by TCP, and along the way it may go through delayed acknowledgement, routing-table lookups to pick the next hop, and forwarding inside the LAN, all of which take time. If the peer calls recv inside that time window, doesn't recv simply have to wait?

5.
So-called efficient IO really means reducing the proportion of time spent waiting, because the proportion spent on the copy itself is basically fixed: it is determined by the hardware, by operating-system and compiler optimizations, or you can raise the bandwidth and copy more data at a time, but these factors are largely set in stone and do not improve copying much. The biggest factor affecting IO efficiency is the waiting; as long as the proportion of waiting time in an IO model is very low, we call that IO efficient.

2. What IO models are there? Which models are efficient?

1.
There are five IO models: blocking IO, non-blocking IO, signal-driven IO, multiplexed IO, and asynchronous IO. Let's use an analogy to sketch how each of the five works.
Once upon a time there was a small river full of fish, and a young man named Zhang San loved fishing. He took his rod to the river, but Zhang San was stubborn: as long as no fish took the bait he just kept waiting, doing nothing, staring at the float. Only when the float moved did Zhang San move and pull in the fish, and then he went right back to waiting motionless for the next bite. Then a young man named Li Si came along. Li Si also liked fishing, but he was different: he had "Linux High-Performance Server Programming" in his left pocket, "Introduction to Algorithms" in his right pocket, a phone in his left hand and a rod in his right. After sitting down on his stool Li Si started fishing, but unlike Zhang San he did not stare at the float: he read one book for a while, played with his phone for a while, read the other book for a while, and glanced at the float for a while, looping through these actions until, on one of his glances, he found the float had long since moved, at which point he pulled in the fish and then continued the loop. Next came Wang Wu, who brought his iPhone 14 Pro Max, a rod, and a bell. Wang Wu hung the bell on the rod; when a fish took the bait the bell would ring. He never even looked at the rod and just kept playing with his phone; only when the bell rang did he pull in the fish, and then he went back to his phone. Then another man, Zhao Liu, arrived. Zhao Liu was different from the previous three: he was a minor tycoon. Along the river he planted hundreds of fishing rods, one every few meters, stretching for hundreds of meters, and then walked along them checking each rod in turn; whenever one had a fish he pulled it in, and then went on traversing the rods. Finally came Qian Qi, who was even richer than Zhao Liu: the CEO of a listed company with his own driver. Qian Qi did not like fishing, but he liked eating fish, so he left the driver on the bank with a phone and a bucket and told him: when you have caught a full bucket of fish, call me, and I will drive over from the company to pick you up. Then Qian Qi drove straight back to the company for a shareholders' meeting while his driver stayed behind and kept fishing.

2.
In the example above, whose way of fishing is the most efficient? First, if someone is constantly reeling in fish, and the proportion of time spent waiting for a bite is very low, then in my view that person fishes efficiently. If someone spends most of the time waiting and only reels in a fish now and then, that person fishes inefficiently.
In the example, the fish are the data, the rods are file descriptors (fd), and every person is a process, except Qian Qi's driver, who is the operating system; the river is the kernel buffer, and the float is the ready event that tells the process it may copy the data.
Zhao Liu's method is actually the most efficient, which is to say the multiplexing IO model is the most efficient: Zhao Liu has many rods, so at any moment the chance that some rod has a fish is much greater, while everyone else has only one rod and can only attend to that single rod, so naturally they are not as efficient as Zhao Liu. It is the same reason a player has so many girlfriends: he casts a wide net, can attend to many chats at once, and replies to whichever girl messages him, which is certainly more efficient than having only one girl on WeChat.
Therefore this article mainly introduces the multiplexing IO model, and also covers blocking and non-blocking IO. Note that in real projects blocking IO is still the most commonly used, and most fds are blocking by default, because this kind of IO is the simplest, and the simpler something is the more reliable it tends to be: the code is easier to write, easier to debug, and easier to maintain, so it is used the most.

3. Differences in the characteristics of the five IO models

1.
In terms of IO efficiency there is no difference among blocking, non-blocking and signal-driven IO, because all three have only one fishing rod: the probability of a fish biting is the same, which is to say the probability of the event becoming ready is the same, so from the point of view of IO efficiency these three models are identical. What differs is how they wait. Blocking IO just waits; non-blocking IO typically polls, and during the waiting period it may do other things; signal-driven IO likewise does other things while waiting, for example watching whether other connections are ready. IO is waiting plus data copying, and these three are equally efficient at both parts; they differ only in the manner of waiting: signal-driven IO waits passively, while blocking and non-blocking IO both wait actively, and when the signal arrives, signal-driven IO handles the ready event through a callback.

2.
Multiplexing is more efficient than the previous three models because it can wait for many file descriptors at once. These four models still share one trait: the process takes part in the IO itself, so we call this synchronous communication. Asynchronous IO, by contrast, is typical asynchronous communication: it hands the waiting for data over to the kernel, and when the data is ready the operating system notifies the process through a signal or a callback function, so the process can then handle data that is already in place.

3.
So whenever you see the word "synchronous" in the future, first determine its context: synchronous versus asynchronous communication refers to a process's message-passing mechanism. The former is the synchronous IO style, which actively waits for the result of the call; the latter is the asynchronous IO style, where the call returns immediately and the operating system that carries it out later notifies the process in some way.
"Synchronization" may also refer to thread synchronization and mutual exclusion: thread synchronization means multiple threads cooperating with each other, typically through condition variables, to complete some task.

2. Blocking and non-blocking IO

1.
The file descriptor fd behind every opened file is blocking by default; whether it is an ordinary file fd or a network socket, the default IO mode is blocking.
There are roughly three ways to make an fd non-blocking. When opening a file you can pass the O_NONBLOCK flag; when calling a network IO interface such as send or recv you can pass the extra MSG_DONTWAIT flag; and the most commonly used way is the fcntl system call. The open method only applies to opening files, not to network sockets, and having to remember extra flags everywhere is not very programmer-friendly, whereas fcntl works for both regular files and network sockets.


2.
fcntl provides five kinds of functionality; here we only use the third one, getting and setting a file descriptor's status flags. First fetch the descriptor's current flags with the F_GETFL command, then OR those flags with O_NONBLOCK and write them back with F_SETFL. That sets the file descriptor to non-blocking.


3.
Running the program gives typical blocking IO behavior. The execution flow blocks inside read, because read here is reading file descriptor 0, the keyboard: as long as I type nothing, read stays blocked and the process is suspended by the operating system; it is not put back into the CPU's run queue until the keyboard has data. As soon as we type something, the process immediately echoes the result and then immediately blocks again, waiting for the next input. This is the typical blocking style, and also the most common and simplest IO style.
One extra note: on the Linux command line the shortcut that marks the end of input is Ctrl+D. When the user presses it, the write end of file descriptor 0 is closed, the read end hits end-of-file, and read returns 0. Besides printing a message such as "read file end", the process should then break out of the loop and exit.


4.
Non-blocking IO behaves differently in the experiment. When there is no data to read, it does not get stuck inside the read system call waiting for data on fd 0; read returns immediately with -1, so the while loop keeps printing the >>> prompt, because read comes straight back whenever there is nothing to read and never blocks.

SetNonBlock() is the helper that switches an fd into non-blocking mode. The method is very simple: fetch the fd's current flags fl, OR fl with O_NONBLOCK, and write the result back to the file descriptor. That's all there is to it.
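A minimal sketch of such a helper, assuming the name SetNonBlock used above:

```cpp
#include <fcntl.h>
#include <cstdio>

// Minimal sketch: fetch the current file status flags, OR in O_NONBLOCK,
// and write them back with F_SETFL.
bool SetNonBlock(int fd)
{
    int fl = fcntl(fd, F_GETFL);                      // get the current flags
    if (fl < 0)
    {
        perror("fcntl F_GETFL");
        return false;
    }
    if (fcntl(fd, F_SETFL, fl | O_NONBLOCK) < 0)      // add O_NONBLOCK and set them back
    {
        perror("fcntl F_SETFL");
        return false;
    }
    return true;
}
```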
5.
During non-blocking IO read returns -1 when there is no data at the bottom layer. Is that reasonable? Is "no data yet" really an error? It is not an error; it is just that when the bottom layer has nothing, read reports the situation through its error return. But how do we distinguish a genuine failure of read (for example reading an fd that does not exist) from the case where there simply is no data yet? Not by the return value, because read returns -1 in both cases, but by the error code: when a non-blocking read returns because there is no data, errno is EWOULDBLOCK (or EAGAIN); when read fails for a real reason, errno carries the corresponding error.
Meanwhile, during the non-blocking wait the process can do other work, such as printing logs, downloading files, or executing SQL statements. Arranging this is simple: load the tasks into a vector and, inside the non-blocking IO loop, run a helper that walks the container and invokes each stored function in turn.
One error code that deserves extra explanation is EINTR, "interrupted": the system call was interrupted by a signal. While read is in progress, the process may receive a signal from the operating system; the kernel checks the process's three signal tables (block, pending, handler), and if a user-defined handler has to run, the process returns to user mode and executes that handler. When the handler returns, the interrupted read does not resume: it returns -1 with the error code set to EINTR, and the caller can simply retry.
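A sketch of the non-blocking read loop described above; the task functions in the vector are made-up examples, and the fcntl calls at the top are the same trick as in SetNonBlock():

```cpp
#include <fcntl.h>
#include <unistd.h>
#include <cerrno>
#include <cstdio>
#include <vector>
#include <functional>

using task_t = std::function<void()>;

int main()
{
    int fl = fcntl(0, F_GETFL);
    fcntl(0, F_SETFL, fl | O_NONBLOCK);        // fd 0 (the keyboard) becomes non-blocking

    std::vector<task_t> tasks = {              // other work to do while waiting
        [] { printf("write a log line...\n"); },
        [] { printf("run some sql...\n"); }
    };

    char buffer[1024];
    while (true)
    {
        ssize_t n = read(0, buffer, sizeof(buffer) - 1);
        if (n > 0)                             // data arrived: handle it
        {
            buffer[n] = '\0';
            printf("echo# %s", buffer);
        }
        else if (n == 0)                       // Ctrl+D: the write end was closed
        {
            printf("read file end\n");
            break;
        }
        else if (errno == EWOULDBLOCK || errno == EAGAIN)
        {
            for (auto &t : tasks) t();         // no data yet: do the other work
        }
        else if (errno == EINTR)
        {
            continue;                          // interrupted by a signal: just retry
        }
        else
        {
            perror("read");                    // a genuine error
            break;
        }
        sleep(1);
    }
    return 0;
}
```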


3. select_server

1. Detailed explanation of select system call

1.
select is the first multiplexing IO interface we learn, and it is responsible only for the waiting step of the IO process. Suppose the user cares about read events on a sock and wants to read data from it: calling recv directly may block waiting for data to arrive, and then the server process is blocked and suspended. A server that hangs like that can no longer serve its clients, which can have all sorts of unpredictable bad consequences. What if a customer is in the middle of a transfer and the server suddenly hangs? The customer's money has left, the merchant has not received it, and whom does the customer turn to? So a hung server is exactly the kind of problem we must avoid.
What select does is watch the read event on the sock for the user. When the sock has data, select returns and tells the user: the read event you care about on this sock is ready; now you can call recv and read the data! Multiplexing thus splits the IO process apart: the multiplexing interface watches whether events on the fds are ready, and the moment one is ready it notifies the upper layer, which then calls the corresponding interface to do the data copy. Waiting and copying are carried out separately, and this kind of IO is bound to be efficient, because an interface like select can wait on many fds at once and, when it returns, hand back to the upper layer all the fds that became ready.

2.
When the program runs, it really does wait inside select: select scans the monitored fds once at the bottom layer, sees which are ready, and returns the ready ones to the upper layer. select's first parameter, nfds, is the largest monitored fd value plus one; since select has to traverse all monitored fds in the kernel, nfds tells it the range of that traversal. The last four parameters are all input-output parameters: on input the user tells the kernel something, and on output the kernel tells the user something. Take timeout: on input it tells select how to wait for the fds. Passing nullptr means block until some fd is ready and only then return; passing a zero timeval means wait non-blockingly, i.e. select scans the fds once and returns whether or not anything is ready; passing a value greater than 0 means block for at most that long and then return non-blockingly. Suppose you pass a timeout of 5 s and select finds an fd ready at the 3-second mark: inside the call the kernel rewrites timeout to 2 s. That is the output role of the parameter: the kernel tells the user that 2 s of the timeout were left, i.e. that select waited 3 s.
select's return value has three meanings. Greater than 0 is the number of ready file descriptors; equal to 0 means select timed out; less than 0 means select genuinely failed, for example if you ask it to monitor an fd that does not exist at all, in which case it returns -1.


3.
The three middle parameters of select are also input-output parameters, acting as the bridge between user and kernel. Looking at the definition of the fd_set type, fd_set is easy to understand: it is a bitmap, implemented as an array wrapped inside a struct. __fd_mask is an 8-byte long int, and __FD_SETSIZE/__NFDBITS is 16, so we can view the bitmap as an array of 16 long ints, 16 × 8 × 8 = 1024 bits in total. That is why select can monitor at most 1024 fds, which is one of select's shortcomings; we will summarize the rest later.
fd_set really is a bitmap, but we should not add, remove or modify fds in it by hand; we should use the bit-manipulation interfaces provided for it (FD_ZERO, FD_SET, FD_CLR, FD_ISSET). Through these the user tells the kernel which events on which fds he cares about. The three middle parameters carry exactly that information: readfds is the user telling the kernel to watch read events on those fds, writefds is for write events, and exceptfds for exception events. When the select call returns, the same fd_set acts as an output parameter: the kernel tells the user which of the fds you cared about are ready, rewriting the user's fd_set so that it now holds only the ready fds. That is the role of an input-output parameter: a bridge through which user and kernel inform each other.
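A minimal sketch of one select call using the FD_* macros, watching read events on fd 0 for up to five seconds (the fd and timeout are just example values):

```cpp
#include <sys/select.h>
#include <sys/time.h>
#include <cstdio>

int main()
{
    fd_set rfds;
    FD_ZERO(&rfds);                  // clear the bitmap
    FD_SET(0, &rfds);                // tell the kernel: watch read events on fd 0

    struct timeval timeout = {5, 0}; // in/out: block for at most 5 seconds
    int n = select(0 + 1, &rfds, nullptr, nullptr, &timeout);
    if (n > 0 && FD_ISSET(0, &rfds)) // the kernel left fd 0 set: it is readable
        printf("fd 0 is readable, %ld.%06ld s of the timeout were left\n",
               (long)timeout.tv_sec, (long)timeout.tv_usec);
    else if (n == 0)
        printf("select timed out\n");
    else
        perror("select");
    return 0;
}
```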


Here is a link with what we learned earlier: back when we studied Linux signals, the blocked table among the three signal tables was also a bitmap, and we likewise manipulated it through interfaces provided by the operating system, so the fd_set bitmap here is exactly analogous to the signal set we used back then.

2. Select server code writing

1.
The interfaces the server exposes are initServer() and start(), which initialize and start the server. In start the server begins to accept connections that have completed the three-way handshake, but can it just call accept directly? How does the server know there is a ready connection sitting in the kernel's listen queue right now? Only select can tell us whether the read event on listensock is ready, so we must first add listensock's read event to the fd_set bitmap and let the kernel watch it for us; only after that read event is ready does the server call accept, and then accept does not block but picks up the ready connection immediately.
We drive this work off the return value of select: when the return value is greater than 0, we call the HandlerReadEvent interface to handle the ready read events recorded in the rfds bitmap.

2.
Inside HandlerReadEvent there may be two kinds of read events to handle: read events on listensock and read events on an ordinary communication sock, so we must look at which bits in rfds are set to tell them apart. But there is a further problem. Since fd_set may hold many fds, the server has many events to handle: after accept returns a communication sock, can we recv on it right away? Of course not. This sock must also be handed to select to monitor, and only when it is ready can recv read its data. But how does HandlerReadEvent tell select: don't just watch listensock for me, also watch this communication sock?
You cannot add it through the output parameter. The most annoying part of select is that the kernel modifies the rfds bitmap: if you want the fds you care about to be carried in rfds, then after every select call you need a record of every fd you care about, and before the next call you must put them all back into rfds. The crux is that you have to record every fd somewhere. listensock is easy, it is just a private member of the server, but what about the hundred-plus connections we may accept later? Define a hundred-plus sock members, record them all, and add them into fd_set one by one? Far too troublesome!
So with select we keep a third-party array, fd_array, to store the fds the user cares about. Before each call to select we add every legal fd in this array into the fd_set and then let select watch them. We need this third-party array precisely because the bitmap parameter rfds of select is an input-output parameter: every time select returns, the kernel has overwritten rfds, so before the next call we must re-set the fds of interest into the bitmap. This is another shortcoming of the select interface.
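A sketch of the resulting main loop under the structure just described; fd_array, fd_num and HandlerReadEvent follow the names used in the text, and HandlerReadEvent is only declared here to keep the sketch short:

```cpp
#include <sys/select.h>
#include <iostream>
#include <cerrno>

static const int fd_num = sizeof(fd_set) * 8;   // at most 1024 slots
int fd_array[fd_num];                            // fds the user cares about, -1 = free slot

void HandlerReadEvent(fd_set &rfds);             // dispatches ready fds (see the pieces below)

void start(int listensock)
{
    for (int i = 0; i < fd_num; ++i) fd_array[i] = -1;
    fd_array[0] = listensock;                    // always watch the listen socket

    while (true)
    {
        fd_set rfds;
        FD_ZERO(&rfds);
        int maxfd = listensock;
        for (int i = 0; i < fd_num; ++i)         // re-set every legal fd before each call
        {
            if (fd_array[i] == -1) continue;
            FD_SET(fd_array[i], &rfds);
            if (fd_array[i] > maxfd) maxfd = fd_array[i];
        }

        int n = select(maxfd + 1, &rfds, nullptr, nullptr, nullptr); // block until ready
        if (n > 0)       HandlerReadEvent(rfds);
        else if (n == 0) std::cout << "select timeout" << std::endl;
        else             std::cerr << "select error: " << errno << std::endl;
    }
}
```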


3.
The Accepter interface handles read events on listensock. Today our server does not handle write events; when we write the Reactor network library in a later article we will handle every kind of event, but for now we only deal with read events.
After accept returns the communication sock, we add it to fd_array so that the next call to select asks the kernel to watch that sock's read event. There is a pitfall when adding: since the fd_set bitmap can hold at most 1024 bits, there is an upper limit to how many fds select can watch at the same time. Inside the loop that inserts the sock into fd_array we look for a free slot and put the sock there; the loop can end in two ways, either fd_array is genuinely full, or a free slot was found.
Once the sock has been added, the execution flow returns to start, and before the next select every legal fd in fd_array is put back into the rfds bitmap.
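A sketch of such an Accepter, under the assumption that fd_array and fd_num come from the loop sketch above and that plain accept is used instead of the article's wrapped socket class:

```cpp
#include <sys/socket.h>
#include <unistd.h>
#include <iostream>

// listensock is ready, so accept will not block; the new sock goes into fd_array
// so that the next select round asks the kernel to watch it.
void Accepter(int listensock, int fd_array[], int fd_num)
{
    int sock = accept(listensock, nullptr, nullptr);   // will not block here
    if (sock < 0) return;

    int pos = 0;
    for (; pos < fd_num; ++pos)                        // look for a free slot
        if (fd_array[pos] == -1) break;

    if (pos == fd_num)                                 // the bitmap is already full
    {
        std::cerr << "select server is full, close sock " << sock << std::endl;
        close(sock);
    }
    else
    {
        fd_array[pos] = sock;                          // watched from the next round on
    }
}
```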

4.
Recver is the interface that handles a ready communication sock. Once we are inside Recver the sock really is ready, but reading it with a single recv still has problems, the most typical being the sticky-packet problem: how can you guarantee that one read yields exactly one complete message? And if you read in a loop, how do you guarantee the later recv calls will not block? Actually we need not worry too much here: once we are inside Recver there is certainly data at the bottom of the sock, and if one read does not yield a complete message we can read a second time, a third time, until the accumulated data parses into a complete message. After reading a complete message we still have to deserialize it, turning the byte stream into structured data; that is application-layer work and we will not go into the details.
When recv returns 0 the write end has closed the sock, so the server should close the socket too and invalidate its slot in fd_array by setting it to -1. Today the server's application-layer work is trivial: it simply returns the client's message. func is a callback passed into the select_server class from main; it does the business-logic processing of the client's message, and after processing we call send to return the response to the client. There is a problem here as well: how do you know send will actually be able to send? You have not confirmed that the sock's write event is ready. We ignore that today, because the write event is almost certainly ready: the server has not sent anything yet, so there is almost certainly space in the send buffer.
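A sketch of such a Recver, with the same illustrative fd_array as above and a func_t callback type standing in for the business callback described in the text; sticky packets are deliberately ignored here:

```cpp
#include <sys/socket.h>
#include <unistd.h>
#include <iostream>
#include <functional>
#include <string>

using func_t = std::function<std::string(const std::string&)>;  // business callback from main

// The sock is ready: read once, hand the text to the business callback,
// then send the response back. pos is the sock's slot in fd_array.
void Recver(int sock, int pos, int fd_array[], func_t func)
{
    char buffer[1024];
    ssize_t n = recv(sock, buffer, sizeof(buffer) - 1, 0);   // will not block: data is ready
    if (n > 0)
    {
        buffer[n] = '\0';
        std::string response = func(buffer);                 // application-layer processing
        send(sock, response.c_str(), response.size(), 0);    // assume the write event is ready
    }
    else   // n == 0: peer closed; n < 0: real error; stop watching this fd either way
    {
        std::cout << "client quit or recv error, close sock " << sock << std::endl;
        fd_array[pos] = -1;                                   // invalidate the slot first
        close(sock);
    }
}
```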


5.
The complete select_server is just these pieces put together, and implementing it is actually quite simple. The key point is that select needs the third-party array fd_array to save the fds the user cares about, and before each call to select the fds recorded in fd_array must be re-set into select's parameters.
Another point to note is that when handling read events, listensock and an ordinary communication sock need different logic, and that logic should live in different modules; in the code, Accepter and Recver handle the two kinds of sock events separately.


The server also uses some wrapped socket programming interfaces; after wrapping they are simpler to use and the code is more readable.



3. Disadvantages of select server

1.
select is not a great solution for multiplexing. That does not mean it is broken, but it costs more to use and there are more details to watch out for, which is why we say it is not a great solution.

2.
select also has real shortcomings. There is an upper limit on the number of fds it can monitor; on my cloud server's kernel version the limit is 1024 fds. This is mainly because fd_set is a fixed-size bitmap: the array inside it does not grow once defined, it is a kernel data structure, and unless you change the kernel parameters it will not change, so once the number of fds select watches exceeds 1024, select fails.
In addition, most of select's parameters are input-output parameters that both user and kernel keep modifying, so the contents of the fd_set bitmaps must be reset before every call to select, which imposes a lot of unnecessary traversal and copying on the user.
select also forces the user to maintain a third-party array holding the fds of interest, another sign of how inconvenient it is to use. These problems are exactly why the other multiplexing interfaces exist: poll solves many of the problems of the select interface.


4. poll_server

1. Detailed explanation of poll system call

1.
The poll interface mainly solves two problems of select: the upper limit on the number of monitored fds, and the need for a third-party array to reset the fds of interest into fd_set before every call.
poll's first parameter is the address of a structure array; each element is a struct pollfd with three fields. fd is the user telling the kernel which fd to watch; events is the user telling the kernel which events on that fd to watch; revents is the kernel telling the user which of the events you care about are ready. The second parameter, nfds, is the size of that array. nfds_t is a typedef of unsigned long int, 8 bytes on a 64-bit system, so nfds can in principle go up to 2^64 (about 4.2 billion times 4.2 billion, roughly 18446744073709551616). No computer can have anywhere near that many file descriptors, so although the array has an upper limit in the mathematical sense, in a computer that limit is meaningless, and we consider poll to have solved select's problem of the fd limit.
Also, the fd and events fields of the structure are input parameters, while revents is an output parameter: what the user tells the kernel and what the kernel tells the user are decoupled in poll, instead of both being squeezed through the same input-output parameters as in select. So poll does not need a third-party array for resetting; you simply keep the structures in the structure array, and there is no need to reset them before every call, because poll decouples input from output.
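A minimal, self-contained use of poll, watching read events on fd 0 (the keyboard) for up to five seconds; the fd and timeout are just example values:

```cpp
#include <poll.h>
#include <cstdio>

int main()
{
    struct pollfd pfd;
    pfd.fd = 0;            // in : the fd the user wants watched
    pfd.events = POLLIN;   // in : care about read events
    pfd.revents = 0;       // out: the kernel sets ready events here

    int n = poll(&pfd, 1, 5000);               // timeout is in milliseconds
    if (n > 0 && (pfd.revents & POLLIN))
        printf("fd %d is readable\n", pfd.fd); // safe to read now
    else if (n == 0)
        printf("poll timed out\n");
    else
        perror("poll");
    return 0;
}
```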


2.
The values used for events and revents are macros; the most commonly used are POLLIN and POLLOUT, which mean caring about read events and write events on the fd respectively, while POLLRDBAND refers to priority (out-of-band) data being readable. These upper-case names are macros whose values are stored into the 2-byte short fields events and revents.


3.
poll's return value means the same as select's: greater than 0 is the number of ready fds, equal to 0 means a timeout return, less than 0 means an error return. timeout is poll's monitoring strategy: greater than 0 means block for at most that long (in milliseconds) and then return non-blockingly; less than 0 means block until some fd is ready; equal to 0 means non-blocking monitoring, i.e. poll scans the fds the user cares about once and returns whether or not anything is ready. Unlike select's timeout, which is an input-output parameter, poll's timeout is a pure input parameter: only the user sets it, and the kernel never modifies it.

2. Writing poll server code

1.
Next we rewrite the select server above using the poll interface.
The poll server is very similar to the select server: where the select server kept the member fd_array to store the fds the user cares about, the poll server keeps a member _rfds holding the starting address of the struct pollfd array, and that pointer is the first argument passed to the poll interface.
The main interfaces of pollServer.hpp are the same as before; only the select-specific parts are replaced with poll. When initializing the server we allocate the structure array on the heap. A more standard implementation would make this array growable, i.e. a vector, but for simplicity today it is a fixed-size array. After allocating it we reset every structure in it: fd is set to -1, and events and revents to 0, meaning no fd and no events.
After initializing the array we put listensock's read event into the first slot, because the first thing the server must care about is the read event on listensock. With that, the server's initialization is complete.
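A sketch of that initialization; num, _rfds and _listensock follow the names used in the text:

```cpp
#include <poll.h>

static const int num = 2048;          // size of the pollfd array, as in the text
struct pollfd *_rfds = nullptr;       // first argument later handed to poll()

// Allocate the array on the heap, reset every slot, then park listensock's
// read event in slot 0. _listensock is the already listening socket.
void initEvents(int _listensock)
{
    _rfds = new struct pollfd[num];
    for (int i = 0; i < num; ++i)
    {
        _rfds[i].fd = -1;             // -1 marks a free slot
        _rfds[i].events = 0;
        _rfds[i].revents = 0;
    }
    _rfds[0].fd = _listensock;        // always watch the listen socket first
    _rfds[0].events = POLLIN;         // care about its read event
}
```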

2.
The server's run loop is much simpler to write than the select version, because poll does not require re-setting the fds into the structures before each call. When poll's return value is greater than 0, some event is ready and we call the HandlerReadEvent() interface.
HandlerReadEvent() has to traverse the whole structure array, whose size was fixed at initialization: num, a global static constant that I set to 2048. So HandlerReadEvent walks the entire array to see which structures had their revents value set by the kernel; a set revents means that structure is ready and its events should be handled. Today pollServer, like selectServer, only handles read events, so inside the for loop we first check that fd is valid, then check whether revents has POLLIN set; if not, that fd needs no handling. The branch statements below that distinguish the two cases, a read event on listensock and a read event on a communication sock, and just as in selectServer they are handed to Accepter and Recver respectively.
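A sketch of that loop and of HandlerReadEvent, continuing the initialization sketch above; Accepter and Recver play the same roles as in the select server and are only declared here:

```cpp
#include <poll.h>
#include <iostream>

static const int num = 2048;                         // same constant as in the init sketch
void Accepter(struct pollfd rfds[], int listensock); // handles a ready listensock
void Recver(struct pollfd rfds[], int pos);          // handles a ready communication sock

void HandlerReadEvent(struct pollfd rfds[], int listensock)
{
    for (int i = 0; i < num; ++i)                    // walk the whole array
    {
        if (rfds[i].fd == -1) continue;              // free slot
        if (!(rfds[i].revents & POLLIN)) continue;   // read event not ready

        if (rfds[i].fd == listensock) Accepter(rfds, listensock); // new connection
        else                          Recver(rfds, i);            // data on a client sock
    }
}

void start(struct pollfd rfds[], int listensock)
{
    while (true)
    {
        int n = poll(rfds, num, -1);                 // -1: block until something is ready
        if (n > 0)       HandlerReadEvent(rfds, listensock);
        else if (n == 0) std::cout << "poll timeout" << std::endl;
        else             std::cerr << "poll error" << std::endl;
    }
}
```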


3.
Inside Accepter we can call the accept system call directly to obtain the sock used for IO, because the read event on listensock is certainly ready at this point. After obtaining the sock, the next step is to hand it to poll for monitoring, watching the read events on it. Handing it to poll really just means putting the sock into the structure array _rfds, so we only need to traverse _rfds to find a free structure, then fill the sock and the POLLIN event into it. Again there are two ways out of the loop. One is that the structure array has no free slot, though in practice this case barely exists, since the array could be made growable and simply expanded; the other case is simple, just fill in the three fields of struct pollfd. That completes the IO module of the server.

4.
The calling logic of pollServer is exactly the same as selectServer's; there is not much to say, and the code is easy to follow. It uses a smart pointer to bind the lifetime of the server object to the lifetime of the pointer, relying on RAII to avoid possible memory leaks: when the smart pointer is destroyed, the server object's resources are destroyed with it.


Putting the pieces above together gives the complete pollServer code.

3. Disadvantages of poll

1.
poll's advantage is that it solves two of select's problems: the upper limit on supported fds, and the coupling of the user's input with the kernel's output in the same parameters.
But poll's shortcoming also shows in the code above. When the kernel checks whether fds are ready it has to traverse the whole structure array and inspect the events values, and likewise when the user handles ready fd events he has to traverse the whole array and inspect the revents values. As the rfds structure array grows larger and larger, every full traversal of the array lowers the server's efficiency. For exactly this reason the kernel provides the epoll interface to solve the problem.
Like select, poll still requires the user to maintain an array holding the fds and events to care about, but poll does not need to re-set the fds of interest before every call, because the user's input and the kernel's output are separated into the events and revents fields of the structure.

5. epoll_server

1. Detailed explanation of epoll system call

1.
epoll is widely recognized as the most efficient multiplexing interface. The man page describes epoll as an improved poll designed to handle large numbers of handles, i.e. an extended, scalable poll. As we said above, when a large number of handles arrives, poll loses efficiency because it keeps traversing all of them; epoll exists to solve exactly that problem.
Although epoll is an improved poll, epoll and poll differ greatly both in how the interfaces are used and in the underlying implementation. The epoll interface was added in Linux kernel 2.5.44, and mainstream kernel versions today are already at 3.x and beyond.

2.
epoll_create creates an epoll model for us in the kernel. This epoll model matters a great deal: it explains why epoll is efficient and how it works. The so-called epoll model hangs off a struct file, so when epoll_create succeeds it returns a file descriptor. The size parameter of epoll_create has been ignored since kernel 2.6.8; in early kernel versions it specified the initial size of the kernel data structures created for the epoll model, but it no longer matters, because the kernel automatically resizes the epoll model according to the user's needs. The epoll model is essentially a red-black tree + a ready queue + an underlying callback mechanism, all of them kernel data structures.
Although size is ignored, the value passed to epoll_create must still be greater than 0; passing something like 128 or 256 is fine.


3.
epoll_ctl's first parameter is the return value of epoll_create, the file descriptor of the epoll model. The second parameter selects which function of epoll_ctl you want: adding the events an fd cares about, modifying them, or deleting them, expressed with the macros EPOLL_CTL_ADD, EPOLL_CTL_MOD and EPOLL_CTL_DEL. The third parameter is the fd the user wants watched. The fourth parameter is a structure with two fields: a 32-bit events field in which the user tells the kernel which events on the fd to care about, using the EPOLL* macros, each of which has exactly one of its 32 bits set to 1 and the rest 0, the most commonly used still being EPOLLIN and EPOLLOUT; and a union named data, whose important member fd likewise records which fd the kernel should care about.
epoll_ctl returns 0 when the call succeeds and -1 when it fails.
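A sketch of creating the epoll model and registering listensock's read event, as described above; listensock stands for an already listening socket:

```cpp
#include <sys/epoll.h>
#include <cstdlib>
#include <cstdio>

int BuildEpoll(int listensock)
{
    int epfd = epoll_create(128);          // size is ignored nowadays, but must be > 0
    if (epfd < 0) { perror("epoll_create"); exit(1); }

    struct epoll_event ev;
    ev.events = EPOLLIN;                   // care about read events on listensock
    ev.data.fd = listensock;               // remember which fd this entry is about
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, listensock, &ev) < 0)  // insert into the red-black tree
    {
        perror("epoll_ctl");
        exit(2);
    }
    return epfd;                           // the epoll model's file descriptor
}
```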


4.
The third interface is epoll_wait. Its first parameter is again the return value of epoll_create, the file descriptor of the epoll model. The second parameter is a pure output parameter: the kernel places the ready struct epoll_event structures into this array one after another. The third parameter is the size of the events array supplied by the user, and timeout is the monitoring strategy used while epoll_wait watches the fds, with the same meaning as in poll, so we will not repeat it.
epoll_wait's return value is the number of ready struct epoll_event structures.
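A sketch of the waiting side: the kernel fills revs with ready events, and the return value says exactly how many entries to look at; epfd is the fd returned by epoll_create (for example from BuildEpoll above):

```cpp
#include <sys/epoll.h>
#include <cstdio>

void WaitLoop(int epfd)
{
    const int num = 64;
    struct epoll_event revs[num];                  // pure output parameter

    while (true)
    {
        int n = epoll_wait(epfd, revs, num, -1);   // -1: block until something is ready
        for (int i = 0; i < n; ++i)                // only the first n entries are valid
        {
            int sock = revs[i].data.fd;
            if (revs[i].events & EPOLLIN)
                printf("fd %d has a ready read event\n", sock);
        }
    }
}
```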


2. The underlying principle of the epoll model

2.1 The whole process of data flow when software and hardware interact

1.
Copying data from software memory out to a hardware peripheral is relatively easy to understand: the data passes down through the protocol stack, each layer adding its own header, and finally the hardware's driver hands the packet to the specific device; the bottom of the protocol stack is the physical layer.
But when data arrives, how does the operating system know that something has come in from the network? We never learned this before, because it belongs to hardware territory; we software people study it only to understand the entire path data travels during IO.
When data reaches the network card, the card has a corresponding interrupt device (the classic 8259 interrupt controller) whose job is to send an interrupt signal to one of the CPU's pins. The CPU has many pins, and some of them are wired to hardware interrupt devices. When a pin receives the interrupt signal from the network card's interrupt device, the pin is driven high, and the register associated with it (the CPU's "workbench") interprets the raised pin as a binary sequence, which is the interrupt number corresponding to that pin.
Next the CPU uses this number to look up a data structure called the interrupt vector table, which was loaded to a fixed location in memory when the machine started. The interrupt vector table can be understood as an array that stores, for each interrupt number, the entry address of its handler, essentially a function pointer; that handler calls back into the network card's driver to copy the data from the hardware card into the operating system's memory. (This whole chain of logic is implemented by the operating system.)
At this point the journey of data from hardware into software memory is complete. Once the data is inside the operating system, the rest is familiar: it travels up through the protocol stack, each layer splitting off its header from the payload, until it is finally delivered to the application layer; the data flow inside software is something we already know well.

2.
The network card is not the only piece of hardware with an interrupt device. The keyboard, a far more ordinary device, has its own: every key press actually triggers a hardware interrupt. The timer module has one as well, which allows the kernel to manage and schedule processes at the level of the whole machine.


2.2 epoll model kernel structure diagram

1.
When epoll_create is called, the kernel builds an epoll model underneath, consisting mainly of three parts: a red-black tree + a ready queue + an underlying callback mechanism.
Each node of the red-black tree essentially records a struct epoll_event. When the upper layer calls epoll_ctl to add the events an fd cares about, it is really inserting a node into the red-black tree, so epoll_ctl's adding, deleting and modifying of the events an fd cares about is, underneath, adding, deleting and modifying nodes of the red-black tree the kernel created. The user telling the kernel "please care about this fd for me" boils down to managing that red-black tree.
The ready queue stores the ready struct epoll_event structures. When the kernel informs the user which events on which fds are ready, it simply copies the nodes in the ready queue into the pure output array events that the user passed to epoll_wait. The ready queue itself is a doubly linked list.
So the essence of "an event became ready" is that a node of the red-black tree gets linked into the ready queue, and linking it is simple: give each red-black tree node one extra pointer of the linked-list node type, initialize it to null, and when the event that node cares about becomes ready, hook that pointer onto the tail of the ready queue.
A node can sit in several data structures at the same time. The trick is simply to embed in the node the link fields of each structure; by adjusting those pointers, the node can be threaded into a new structure. Logically we separate the ready queue from the red-black tree, but in code it only takes one extra pointer inside the structure for the same node to live in the red-black tree and in the ready queue at the same time.
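A much-simplified sketch of that idea: one node type carrying the link fields of both structures, so the same node can sit in the tree and in the ready queue. (In the real kernel this role is played by struct epitem, which embeds an rb_node for the tree and a list_head for the ready list inside struct eventpoll; the names below are purely illustrative.)

```cpp
struct epoll_node
{
    int          fd;            // the fd this node is about
    unsigned int events;        // events the user registered (EPOLLIN, ...)
    unsigned int revents;       // events the kernel found ready

    epoll_node  *left, *right;  // red-black tree links (color bookkeeping omitted)
    epoll_node  *ready_next;    // ready-queue link; nullptr means "not ready yet"
};

struct epoll_model               // what epoll_create builds in the kernel
{
    epoll_node *tree_root;       // every fd registered through epoll_ctl
    epoll_node *ready_head;      // ready queue: filled by the callback mechanism
    epoll_node *ready_tail;
};
```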

2.
We now know the general shape of the epoll model, but one question remains: how does the operating system know which nodes in the red-black tree are ready? Does it have to traverse the whole tree and test each node? It does not; if it did, what efficiency could epoll claim? You would still be traversing every fd, so how would that differ from poll's traversal? A red-black tree is efficient at searching, not at traversal; if every node is visited, the red-black tree is about as fast as walking a linked list, which is not efficient at all!
So how does the operating system know which node of the red-black tree is ready? Through the underlying callback mechanism, which is precisely the part of the epoll interface that everyone recognizes as highly efficient!
When data reaches the network card, we already know it goes through the hardware interrupt, the CPU's lookup of the interrupt vector table, and so on, and ends up inside the operating system in memory. As it runs up the protocol stack in the OS, at the transport layer the data is copied into the receive queue of the struct file that backs the communication sockfd obtained from accept. That structure also has a very important field, private_data, a pointer that leads to a callback; the callback links the struct epoll_event node belonging to this sock into the ready queue. By that point the data has already been copied into the kernel's socket receive buffer, so the event really is ready, and while the kernel copies the data it also invokes the private_data callback that hooks the sock's red-black tree node into the ready queue. The operating system therefore never has to traverse the red-black tree to test readiness: when data arrives, the underlying callback mechanism automatically links the ready red-black tree nodes into the ready queue.

3.
To summarize the underlying workflow when an fd's event becomes ready:
When data reaches the machine's network card, the hardware interrupt is the starting point: the interrupt device sends a signal to a CPU pin; the CPU looks up the interrupt vector table to find the driver callback for that interrupt number, and inside that callback the data is copied from the hardware device, the network card, into the software OS. The packet then climbs the protocol stack inside the OS; when it reaches the transport layer the data is copied into the kernel buffer of the corresponding struct file, and at the same time the OS runs the callback function pointed to via the private_data field, which modifies the ready-queue pointer inside the red-black tree node and links that node into the ready queue. When the kernel later informs the user which fds are ready, it only has to copy the node contents in the ready queue into epoll_wait's output parameter events. That is the epoll model's underlying callback mechanism!


4.
Below I roughly simulate the private_data callback idea: a void* field can hold a function pointer, and to call back you first convert the pointer from void* back to the function-pointer type and then invoke it.
The so-called epoll model is actually a red-black tree + ready queue + underlying callback mechanism.

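A toy simulation of that idea; this is purely illustrative and not kernel code:

```cpp
#include <cstdio>

typedef void (*callback_t)(int fd);          // the callback's real type

struct fake_file
{
    int   fd;
    void *private_data;                      // opaque pointer, here hiding a callback
};

void link_into_ready_queue(int fd)           // pretend this hooks the node into the queue
{
    printf("fd %d is ready, link its node into the ready queue\n", fd);
}

int main()
{
    fake_file f;
    f.fd = 4;
    f.private_data = reinterpret_cast<void *>(link_into_ready_queue);   // store the callback

    // Data "arrived": recover the function pointer from void* and call it.
    callback_t cb = reinterpret_cast<callback_t>(f.private_data);
    cb(f.fd);
    return 0;
}
```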

2.3 Questions raised by the epoll model

1. Why is the epoll model efficient?

Because the operating system does most of the work for us. To add a node to the red-black tree we only need to call epoll_ctl, and returning the ready fds amounts to returning the nodes in the ready queue, so the upper layer gets the ready fds directly and never has to traverse anything to check readiness. Instead, when the underlying data becomes ready, the callback mechanism automatically links the red-black tree node into the ready queue, so the operating system does not traverse the red-black tree for readiness checks either. And once the upper layer has the ready fds, it knows the range of the output struct epoll_event array it needs to walk, instead of blindly traversing every element of the whole array.

2. Why choose red-black tree as the underlying data structure of epoll model?
Because search in a red-black tree is very fast, reaching O(log N) time. Whatever epoll_ctl is doing, inserting, deleting or modifying, the first prerequisite is to find the target node or target position before carrying out the operation, and the red-black tree makes that search step very efficient.
Some people will say that a red-black tree has to rotate to rebalance, and although it feels like rotating and rebalancing are time-consuming and would drag the tree's efficiency down, that is not actually the case: the so-called rotation and rebalancing are only complicated logically; in actual execution they just modify pointers inside the nodes, and their impact on the tree's efficiency is small.
Also, a red-black tree's balance requirement is looser than an AVL tree's, so it rotates to rebalance far less often than an AVL tree and its overall efficiency is higher. That is why a red-black tree is used here rather than an AVL tree.

3. What are the details of epoll_wait?
(1) epoll_wait places all ready fds contiguously, in order, in the output parameter events. When the user traverses the array to handle ready events, no extra fds need to be examined: it is enough to walk the entries from index 0 up to epoll_wait's return value (a short usage sketch follows this list).
(2) If there are many nodes in the ready queue and the output array of epoll_wait cannot take them all out at once, that is not a problem: the queue is first-in, first-out, so the remaining ready events will simply be returned by the next call to epoll_wait.
(3) With select and poll the programmer must maintain a third-party array to record the fds and events the user cares about, but epoll does not need one, because the kernel maintains a red-black tree for epoll; the user adds, deletes, or modifies nodes in that tree directly through epoll_ctl, and no third-party array has to be maintained at the application layer.
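To make details (1) and (2) concrete, here is a small usage sketch; epfd and events are assumed to have been prepared elsewhere with epoll_create/epoll_ctl, and the dispatch body is only a placeholder.

```cpp
#include <sys/epoll.h>

void wait_and_dispatch(int epfd, struct epoll_event *events, int maxevents) {
    int n = epoll_wait(epfd, events, maxevents, -1);   // block until something is ready
    for (int i = 0; i < n; ++i) {                       // traverse on demand: 0 .. n-1 only
        int fd = events[i].data.fd;
        if (events[i].events & EPOLLIN) {
            // handle the readable fd here ...
            (void)fd;
        }
    }
    // If more than maxevents fds were ready, the rest stay in the ready queue
    // (FIFO) and will be returned by the next epoll_wait call -- detail (2).
}
```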

3. Writing epoll server code

1.
The initialization module is also very simple to implement. First create the server's socket as usual, bind it, and listen on it, then create the epoll model. Once it is created successfully, add listensock's read event to the epoll model's red-black tree; adding it is also very simple: define a struct epoll_event, fill in its events and data.fd fields, and call epoll_ctl to add listensock and its event to the red-black tree. The last step is to allocate the space for epoll_wait's output parameter, which only requires a single new[].

Insert image description here
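Since the initialization code itself is shown only as a screenshot, the following is a minimal sketch of the steps just described. The names (listensock, epfd, revs, kMaxEvents) and the port are illustrative assumptions, and error checks are omitted for brevity.

```cpp
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <sys/epoll.h>
#include <cstdint>

const int kMaxEvents = 128;   // size of the epoll_wait output array (assumption)

// Returns the epoll fd; a real server would also keep listensock around.
int init_epoll_server(uint16_t port, epoll_event *&revs) {
    // 1. Create the listening socket as usual: socket / bind / listen.
    int listensock = socket(AF_INET, SOCK_STREAM, 0);
    sockaddr_in addr{};
    addr.sin_family      = AF_INET;
    addr.sin_port        = htons(port);
    addr.sin_addr.s_addr = INADDR_ANY;
    bind(listensock, (sockaddr *)&addr, sizeof(addr));   // error checks omitted
    listen(listensock, 16);

    // 2. Create the epoll model (the size argument is ignored since kernel 2.6.8).
    int epfd = epoll_create(256);

    // 3. Register listensock's read event on the red-black tree.
    epoll_event ev{};
    ev.events  = EPOLLIN;
    ev.data.fd = listensock;
    epoll_ctl(epfd, EPOLL_CTL_ADD, listensock, &ev);

    // 4. Allocate the output-parameter array for epoll_wait -- just a new[].
    revs = new epoll_event[kMaxEvents];
    return epfd;
}
```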

2.
The server's start logic is also very simple. First call epoll_wait to monitor the fds; when the return value is greater than 0, call HandlerEvent to process the events. Since only read events are handled here, two branch statements are enough. HandlerEvent takes readyNum, the number of ready events, as a parameter; when traversing the _revs array, only the first readyNum structures need to be visited, without touching any redundant fd.
In the accept branch, simply take up the ready connection, then add the new sock to the red-black tree, and wait until data on that sock is ready next time before recv-ing it.
In the recv branch, reading only once still leaves the byte-stream "sticky packet" problem, just like the previous two server examples; the next article, on Reactor, will solve all of these problems. One reminder: it is recommended to remove the node from the red-black tree first, and only then close the sock. If you close the sock first, the fd becomes invalid, and a subsequent epoll_ctl call to remove the node would be passed an invalid sock and would report an error!
(Isn't the epoll server simple to write? The more efficient the interface, the less the programmer has to do, the more the kernel does, and the lower the cost of writing the code.)

Insert image description here
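Again, the screenshot is not reproduced here, so below is a sketch of the Start/HandlerEvent flow just described, handling read events only; the helper names and buffer sizes are assumptions, not the original code.

```cpp
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cstdio>

void HandlerEvent(int epfd, int listensock, epoll_event *revs, int readyNum) {
    for (int i = 0; i < readyNum; ++i) {          // traverse only the ready entries
        int fd = revs[i].data.fd;
        if (!(revs[i].events & EPOLLIN)) continue;

        if (fd == listensock) {
            // accept branch: take up the ready connection and add it to the tree
            int sock = accept(listensock, nullptr, nullptr);
            epoll_event ev{};
            ev.events  = EPOLLIN;
            ev.data.fd = sock;
            epoll_ctl(epfd, EPOLL_CTL_ADD, sock, &ev);
        } else {
            // recv branch: a single read, so the byte-stream "sticky packet"
            // problem mentioned above still exists in this simple version
            char buf[1024];
            ssize_t n = recv(fd, buf, sizeof(buf) - 1, 0);
            if (n > 0) {
                buf[n] = '\0';
                std::printf("client says: %s\n", buf);
            } else {
                // remove the node from the red-black tree FIRST, then close the fd,
                // otherwise epoll_ctl would be handed an already-invalid fd
                epoll_ctl(epfd, EPOLL_CTL_DEL, fd, nullptr);
                close(fd);
            }
        }
    }
}

void Start(int epfd, int listensock, epoll_event *revs, int maxevents) {
    while (true) {
        int n = epoll_wait(epfd, revs, maxevents, -1);
        if (n > 0) HandlerEvent(epfd, listensock, revs, n);
    }
}
```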

The following is the complete epoll_server code
Insert image description here

The following is the server's invocation logic. It is no different from the previous select and poll versions, and is still very simple.

Insert image description here

4. Summary of the advantages and disadvantages of select, poll, and epoll

Disadvantages of select:

(1) The number of file descriptors it supports has an upper limit; in my kernel version the maximum is 1024.
(2) The programmer has to maintain a third-party array to record the fds and events the user cares about.
(3) Because input and output are coupled in the same fd_set, the fds and events of interest must be reset and handed to select again before every call (see the sketch after this list).
(4) The user must traverse the entire fd_set bitmap each time to figure out which ready fds need handling; with a very large descriptor set, even if only one file descriptor is ready, all the bits have to be checked. The kernel likewise traverses the fd_set bitmap each time to determine which fds are ready. All this user-side and kernel-side traversal of the fd_set reduces efficiency.
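For reference, here is a small sketch of what disadvantages (3) and (4) look like in practice; fds and nfds_tracked stand for the user-maintained third-party array, and the handling logic is only a placeholder.

```cpp
#include <sys/select.h>

void select_loop_sketch(int *fds, int nfds_tracked) {
    while (true) {
        fd_set rfds;
        FD_ZERO(&rfds);
        int maxfd = 0;
        for (int i = 0; i < nfds_tracked; ++i) {   // rebuild the set before every call
            FD_SET(fds[i], &rfds);
            if (fds[i] > maxfd) maxfd = fds[i];
        }
        if (select(maxfd + 1, &rfds, nullptr, nullptr, nullptr) <= 0) continue;
        for (int i = 0; i < nfds_tracked; ++i)     // scan everything to find the ready fds
            if (FD_ISSET(fds[i], &rfds)) { /* handle fds[i] */ }
    }
}
```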
Advantages of select:
(1) It can monitor multiple file descriptors at the same time, letting one process or thread manage several IO operations at once, which improves IO efficiency.
(2) select is a cross-platform system call, supported on almost all mainstream operating systems, including Linux, Unix, and Windows.


Disadvantages of poll:

(1) The programmer has to maintain a third-party array of structures to record the fds and events the user cares about.
(2) As with select, the user still has to traverse the entire structure array to find the ready file descriptors, even if only a single structure's revents is set; the kernel likewise traverses the structure array each time to determine which fds are ready. All this user-side and kernel-side traversal reduces efficiency.
(3) poll has poor cross-platform portability.

Advantages of poll:
(1) poll has no hard-coded limit like select's 1024: it can monitor as many fds at the same time as the process is allowed to open (the theoretical ceiling of the count type is 2^64).
(2) There is no need to reset the fds and events of interest before each call to poll, because the input (events) and output (revents) fields are separate.


Disadvantages of epoll:
(1) epoll is not well suited to small numbers of connections, because it maintains quite a lot of kernel data structures; it is better suited to large-scale, high-concurrency IO. For small-scale use, the complex data structures and callback mechanism that epoll maintains only bring unnecessary overhead to the system.
(2) epoll has poor cross-platform portability.

Advantages of epoll:
(1) epoll can monitor as many fds at the same time as the process is allowed to open (the mathematical upper limit is 2^32).
(2) The programmer does not need to maintain a third-party array to store the fds and events the user cares about, because the kernel builds a red-black tree for epoll; nodes are simply added, deleted, or modified in that tree through epoll_ctl.
(3) The user still traverses a structure array, but because epoll_wait places the ready fds contiguously in the events array passed in by the user, the traversal is on demand. The kernel, for its part, does not need to traverse the whole red-black tree to detect which fds are ready, because the epoll model has its own underlying callback mechanism; this greatly reduces the overhead of scanning the whole set and improves efficiency.
(4) There is no need to reset the fds and events of interest before each call to epoll_wait.

Personally, I think the kernel could have let programmers traverse only the ready fds with select and poll as well, instead of scanning the entire array or bitmap of fds every time. The kernel could certainly have done it; perhaps the people who designed these interfaces at the time were unwilling to, or the kernel was not yet able to, but they must at least have considered this question of on-demand traversal.
With select and poll, the kernel's underlying approach mirrors the upper layer's: it simply traverses the whole array or bitmap, no matter how many fds the upper layer actually placed in the set.

Insert image description here
