Advanced IO (select, poll, epoll)

Table of contents

Five IO models

blocking IO

non-blocking IO

Signal driven IO 

IO multiplexing

Asynchronous I/O

summary 

Synchronous communication vs asynchronous communication

Synchronous and asynchronous focus on the message communication mechanism

blocking vs non-blocking

Other advanced I/O

non-blocking IO

fcntl

Implement the function SetNoBlock

I/O multiplexing select

Understand the select execution process

socket ready condition

read ready

write ready

 select features

select disadvantages

Select usage example: check standard input and output

select usage example

I/O multiplexing poll

poll function interface

return result

poll example: use poll to monitor standard input

epoll for I/O multiplexing

Introduction to epoll

epoll related system calls

epoll_create

epoll_ctl

epoll_wait

underlying mechanism

Why is epoll efficient?

How epoll works

The advantages of epoll (corresponding to the disadvantages of select)

epoll working modes

Level Triggered (LT) working mode

Edge Triggered (ET) working mode

Comparing LT and ET

Understanding ET mode and non-blocking file descriptors

Usage scenarios of epoll

Thundering herd problem in epoll

epoll example: epoll server (LT mode)

tcp_epoll_server.hpp

epoll example: epoll server (ET mode)

tcp_socket.hpp

tcp_epoll_server.hpp 


Advanced IO: the basic IO covered earlier (the system's file descriptors) belongs to local communication. Advanced IO is a set of interfaces for network communication, that is, for writing efficient network server code.

I/O multiplexing (multi-way transfer): Linux offers three options (select, poll, epoll),
plus one design pattern, the Reactor pattern, which builds on multiplexing to structure the code better.

Five IO models 

In an IO scenario, if the process itself takes part in the actual IO steps, we call it synchronous. If the process is in an IO scenario but does not take part in the details of the IO, we call it asynchronous IO. (So the difference between synchronous IO and asynchronous IO is whether the process participates in the IO details.)

blocking IO

non-blocking IO

If the kernel has not prepared the data yet, the system call still returns immediately, with the EWOULDBLOCK error code.

Non-blocking IO often requires the programmer to repeatedly try to read and write the file descriptor in a loop. This process is called polling. It wastes a lot of CPU and is generally used only in specific scenarios.

Signal driven IO 

When the kernel prepares the data, it uses the SIGIO signal to notify the application to perform IO operations

IO multiplexing

Although it looks similar to blocking IO in the flowchart, the key point is that IO multiplexing can wait for multiple file descriptors to become ready at the same time.

Asynchronous I/O 

The kernel notifies the application when the data copy has completed (whereas signal-driven IO tells the application when it can start copying the data).

summary 

In any IO operation there are two steps: the first is waiting, the second is copying. In real application scenarios, the time spent waiting is usually much higher than the time spent copying, so the key to making IO more efficient is to minimize the waiting time.
The difference between asynchronous IO and the signal-driven approach:
Asynchronous IO: the process does not care about the waiting or the copying of the data. Once the copy is finished the process is told, and it then handles the data via a callback.
Signal-driven: the process does not care about the waiting, but it still has to fetch the data itself once it is ready.

Synchronous communication vs asynchronous communication

Synchronous and asynchronous focus on the message communication mechanism

  • So-called synchronization means that when a call is issued, it does not return until the result is available; once the call returns, the return value is there. In other words, the caller actively waits for the result of the call.
  • Asynchronous is the opposite: the call returns immediately after it is issued, so no result comes back with it. In other words, when an asynchronous procedure call is issued, the caller does not get the result right away; instead, the callee notifies the caller through status or notifications, or handles the call through a callback function.

In addition, recall that when we discussed multi-processing and multi-threading we also mentioned synchronization and mutual exclusion. Synchronous communication here and synchronization between processes are completely unrelated concepts.
  • Process/thread synchronization is a direct constraint relationship between processes/threads.
  • It arises when two or more threads, created to complete some task, need to coordinate their working order at certain points, waiting for and passing information to each other, especially when accessing critical resources.

blocking vs non-blocking

Blocking and non-blocking describe the state of the program while it waits for the result of a call (a message or return value).
  • A blocking call means the current thread is suspended until the result of the call comes back; the calling thread only resumes once it has the result.
  • A non-blocking call means the call does not block the current thread even when the result cannot be obtained immediately.

Other advanced I/O

Non-blocking IO, record locks, the System V STREAMS mechanism, I/O multiplexing (also called multiplexed I/O), the readv and writev functions, and memory-mapped IO (mmap) are collectively referred to as advanced IO.
What we focus on here is I/O multiplexing.

non-blocking IO

fcntl

A file descriptor is blocking IO by default.
The fcntl function prototype is as follows:
#include <unistd.h>
#include <fcntl.h>
int fcntl(int fd, int cmd, ... /* arg */ );
Depending on the value of cmd passed in, the arguments that follow differ.
The fcntl function has 5 functions :
  • Duplicate an existing descriptor ( cmd=F_DUPFD ) .
  • Get / set file descriptor flags (cmd=F_GETFD or F_SETFD).
  • Get / set file status flags (cmd=F_GETFL or F_SETFL).
  • Get / set asynchronous I/O ownership (cmd=F_GETOWN or F_SETOWN).
  • Get / set record lock (cmd=F_GETLK, F_SETLK or F_SETLKW)
Here we only use the third capability, getting/setting the file status flags, which lets us set a file descriptor to non-blocking.

Implement the function SetNoBlock

For fcntl, we implement a SetNoBlock function to set the file descriptor to non-blocking
void SetNoBlock(int fd) { 
 int fl = fcntl(fd, F_GETFL); 
 if (fl < 0) { 
 perror("fcntl");
 return; 
 }
 fcntl(fd, F_SETFL, fl | O_NONBLOCK); 
}
  • Use F_GETFL to get the current flags of the file descriptor (this is a bitmap).
  • Then use F_SETFL to write the flags back, OR-ing in O_NONBLOCK.
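As a quick usage sketch (not from the original text; it assumes the SetNoBlock function defined above is in the same file), here is a small program that sets standard input to non-blocking and polls it in a loop, treating EAGAIN / EWOULDBLOCK as "no data yet":

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main() {
    SetNoBlock(0);                   // make stdin non-blocking, using the SetNoBlock defined above
    char buf[1024];
    for (;;) {
        ssize_t n = read(0, buf, sizeof(buf) - 1);
        if (n < 0) {
            if (errno == EAGAIN || errno == EWOULDBLOCK) {
                printf("no data yet, doing other work...\n");
                sleep(1);            // stands in for useful work
                continue;
            }
            perror("read");
            break;
        }
        if (n == 0) break;           // end of input
        buf[n] = '\0';
        printf("read: %s", buf);
    }
    return 0;
}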

I/O multiplexing select

The system provides the select function to implement the multiplexed input/output model.
The select system call lets our program monitor the state changes of multiple file descriptors;
the program blocks in select until one or more of the monitored file descriptors change state.
The function prototype of select is as follows:

#include <sys/select.h>
int select(int nfds, fd_set *readfds, fd_set *writefds,
           fd_set *exceptfds, struct timeval *timeout);
Note that select uses the same bitmaps for input and output; this is one of its problems, and also something to keep in mind when writing code.
PS: why max_fd + 1? Because the bottom layer traverses the descriptors as an array over the half-open range [0, nfds).
About the time (the last parameter): setting a time value means select times out after that interval if nothing is ready; setting it to 0 means non-blocking waiting (return immediately); setting it to NULL (nullptr) means block forever until something is ready.
PS: multiplexing is meant to replace the multi-process / multi-thread approach.
Design points of a SelectServer: set descriptors in batch, monitor in batch, process in batch.
1. The multiplexed server is designed to replace the multi-process / multi-threaded version of the server.
2. select uses the same fd_set arrays for both input and output.
Parameter explanation :
  • The parameter nfds is the largest file descriptor value +1 that needs to be monitored ;
  • readfds, writefds and exceptfds correspond, respectively, to the set of readable file descriptors, the set of writable file descriptors and the set of exceptional file descriptors to be monitored;
  • The parameter timeout is the structure timeval , which is used to set the waiting time of select() .

Parameter timeout value :

  • NULL : means select () has no timeout , select will be blocked until an event occurs on a file descriptor ;
  • 0 : Only check the state of the descriptor set, and then return immediately without waiting for the occurrence of external events.
  • Specific time value: If no event occurs within the specified time period, select will timeout and return.
About the fd_set structure

This structure is in fact an array of integers, or more strictly a "bitmap"; the corresponding bit in the bitmap represents a file descriptor to be monitored.

A set of interfaces for operating fd_set is provided to facilitate the operation of bitmaps

void FD_CLR(int fd, fd_set *set);   // clear the bit for fd in the descriptor set
int  FD_ISSET(int fd, fd_set *set); // test whether the bit for fd in the descriptor set is set
void FD_SET(int fd, fd_set *set);   // set the bit for fd in the descriptor set
void FD_ZERO(fd_set *set);          // clear all bits in the descriptor set

About the timeval structure

The timeval structure is used to describe the length of a period of time. If no event occurs in the descriptor that needs to be monitored within this time, the function returns, and the return value is 0.
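For example, here is a small sketch (not from the original text) of waiting up to 5 seconds for a descriptor to become readable; note that on Linux select may modify the timeval, so it has to be re-initialized before every call:

#include <stdio.h>
#include <sys/select.h>

// Wait up to 5 seconds for fd to become readable.
// Returns 1 if readable, 0 on timeout, -1 on error.
int wait_readable(int fd) {
    fd_set readfds;
    FD_ZERO(&readfds);
    FD_SET(fd, &readfds);
    struct timeval tv;
    tv.tv_sec = 5;    // seconds
    tv.tv_usec = 0;   // microseconds
    int ret = select(fd + 1, &readfds, NULL, NULL, &tv);
    if (ret < 0) { perror("select"); return -1; }
    if (ret == 0) return 0;                     // timed out
    return FD_ISSET(fd, &readfds) ? 1 : 0;
}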

Function return value: 

  • On success, it returns the number of file descriptors whose state has changed.
  • A return value of 0 means the timeout expired before any descriptor changed state; nothing is ready.
  • On error it returns -1 and the cause is stored in errno; in this case the contents of readfds, writefds, exceptfds and timeout become undefined.

Error values may be:

EBADF: an invalid file descriptor was given, or the file has already been closed
EINTR: the call was interrupted by a signal
EINVAL: the argument nfds is negative
ENOMEM: the kernel ran out of memory
Common program fragments are as follows :
fd_set readset;
FD_ZERO(&readset);
FD_SET(fd, &readset);
select(fd + 1, &readset, NULL, NULL, NULL);
if (FD_ISSET(fd, &readset)) { …… }

Understand the select execution process

The key to understanding the select model is to understand fd_set. For the convenience of explanation, the length of fd_set is taken as 1 byte, and each bit in fd_set can correspond to a file descriptor fd . Then a 1- byte long fd_set can correspond to a maximum of 8 fds
(1) Execute fd_set set; FD_ZERO(&set); then set is 0000,0000 in bits.
(2) If fd = 5, after executing FD_SET(fd, &set); set becomes 0001,0000 (the fifth bit is set to 1).
(3) If fd = 2 and fd = 1 are then added, set becomes 0001,0011.
(4) Execute select(6, &set, 0, 0, 0) to block and wait.
(5) If readable events occur on both fd = 1 and fd = 2, select returns and set becomes 0000,0011. Note: fd = 5, which had no event, is cleared.

socket ready condition

read ready

  • In the socket kernel, the number of bytes in the receive buffer is greater than or equal to the low-water mark SO_RCVLOWAT; at this time the file descriptor can be read without blocking, and the return value is greater than 0;
  • In socket TCP communication , the peer end closes the connection , and if the socket is read at this time , it returns 0;
  • There is a new connection request on the listening socket ;
  • There was an unhandled error on the socket ;

write ready

  • In the socket kernel , the number of available bytes in the send buffer ( the free position size of the send buffer ) is greater than or equal to the low water mark SO_SNDLOWAT, and it can be written without blocking at this time , and the return value is greater than 0;
  • The write operation of the socket is closed (close or shutdown). Writing to a socket whose write operation is closed will trigger the SIGPIPE signal ;
  • After the socket connects successfully or fails using non-blocking connect ;
  • There was an unread error on the socket ;
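The two low-water marks mentioned above can be inspected with getsockopt (and SO_RCVLOWAT can also be changed with setsockopt); a small sketch, assuming fd is an already-created socket:

#include <stdio.h>
#include <sys/socket.h>

// Print the receive/send low-water marks of socket fd.
void print_lowat(int fd) {
    int rcv_lowat = 0, snd_lowat = 0;
    socklen_t len = sizeof(int);
    if (getsockopt(fd, SOL_SOCKET, SO_RCVLOWAT, &rcv_lowat, &len) == 0)
        printf("SO_RCVLOWAT = %d\n", rcv_lowat);   // defaults to 1
    len = sizeof(int);
    if (getsockopt(fd, SOL_SOCKET, SO_SNDLOWAT, &snd_lowat, &len) == 0)
        printf("SO_SNDLOWAT = %d\n", snd_lowat);   // not changeable on Linux
    // SO_RCVLOWAT can be raised, e.g.:
    // int val = 128;
    // setsockopt(fd, SOL_SOCKET, SO_RCVLOWAT, &val, sizeof(val));
}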

 select features

  • The number of file descriptors that can be monitored depends on sizeof(fd_set). On my server sizeof(fd_set) = 512, and each bit represents one file descriptor, so the maximum number of file descriptors supported on my server is 512 * 8 = 4096.
  • When an fd is added to the select monitoring set, an auxiliary array is normally also used to save it, for two reasons (see the sketch below):
  • First, after select returns, the array is the source data, checked against the fd_set with FD_ISSET to see which fds are ready.
  • Second, select clears the fds that were added but had no event, so before every call to select the fds have to be re-added from the array one by one (after FD_ZERO), and while scanning the array the maximum fd value, maxfd, is recorded for use as the first parameter of select.
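A minimal sketch of the auxiliary-array pattern described above (the names fd_array, MAX_FDS and the handling placeholder are illustrative, not taken from the original code):

#include <sys/select.h>

#define MAX_FDS 64
int fd_array[MAX_FDS];                        // source data; -1 means the slot is unused

void select_loop(void) {
    for (;;) {
        fd_set readfds;
        FD_ZERO(&readfds);                    // 1. clear the set before every call
        int max_fd = -1;
        for (int i = 0; i < MAX_FDS; ++i) {   // 2. re-add every saved fd, track maxfd
            if (fd_array[i] < 0) continue;
            FD_SET(fd_array[i], &readfds);
            if (fd_array[i] > max_fd) max_fd = fd_array[i];
        }
        if (max_fd < 0) break;                // nothing left to monitor
        int n = select(max_fd + 1, &readfds, NULL, NULL, NULL);
        if (n <= 0) continue;
        for (int i = 0; i < MAX_FDS; ++i) {   // 3. use the array as source data for FD_ISSET
            if (fd_array[i] >= 0 && FD_ISSET(fd_array[i], &readfds)) {
                /* handle fd_array[i] here ... */
            }
        }
    }
}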

select disadvantages

  • Every call to select requires manually re-setting the fd set, which is inconvenient from the point of view of interface usability.
  • Every call to select copies the fd set from user space into the kernel; this overhead becomes large when there are many fds.
  • Likewise, every call to select makes the kernel traverse all the fds passed in; this overhead is also large when there are many fds.
  • The number of file descriptors supported by select is too small.

Select usage example : detect standard input and output

#include <stdio.h>
#include <unistd.h>
#include <sys/select.h>
int main() {
 fd_set read_fds;
 FD_ZERO(&read_fds);
 FD_SET(0, &read_fds);
 for (;;) {
 printf("> ");
 fflush(stdout);
 int ret = select(1, &read_fds, NULL, NULL, NULL);
 if (ret < 0) {
 perror("select");
 continue;
 }
 if (FD_ISSET(0, &read_fds)) {
 char buf[1024] = {0};
 read(0, buf, sizeof(buf) - 1);
 printf("input: %s", buf);
 } else {
 printf("error! invaild fd\n");
 continue;
 }
 FD_ZERO(&read_fds);
 FD_SET(0, &read_fds);
 }
 return 0;
}
Note:
Only file descriptor 0 (standard input) is detected here. The read condition only becomes ready when you actually type something, so if you never enter anything, select just keeps waiting (with a finite timeout set, it would report a timeout instead).
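As a variation (not in the original code), giving select a finite timeout makes the loop print a message while you are not typing; a sketch of the changed part:

// Replace the select call in the loop above with a timed wait:
struct timeval tv;
tv.tv_sec = 3;                      // re-initialize every iteration: select may modify tv
tv.tv_usec = 0;
int ret = select(1, &read_fds, NULL, NULL, &tv);
if (ret == 0) {
    printf("select timeout\n");     // nothing was typed within 3 seconds
    FD_ZERO(&read_fds);             // the set was cleared, so re-arm it
    FD_SET(0, &read_fds);
    continue;
}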

select usage example

Use select to implement dictionary server
tcp_select_server.hpp
#pragma once
#include <vector>
#include <unordered_map>
#include <functional>
#include <sys/select.h>
#include "tcp_socket.hpp"
// A debugging helper: print the fds currently in the set
inline void PrintFdSet(fd_set* fds, int max_fd) {
 printf("select fds: ");
 for (int i = 0; i < max_fd + 1; ++i) {
 if (!FD_ISSET(i, fds)) {
 continue;
 }
 printf("%d ", i);
 }
 printf("\n");
}
typedef std::function<void (const std::string& req, std::string* resp)> Handler;
// Wrap select in a class. Although this class holds many TcpSocket objects, it does not manage their memory
class Selector {
public:
 Selector() {
 // [Note!] Never forget the initialization!!
 max_fd_ = 0;
 FD_ZERO(&read_fds_);
 }
 bool Add(const TcpSocket& sock) {
 int fd = sock.GetFd();
 printf("[Selector::Add] %d\n", fd);
 if (fd_map_.find(fd) != fd_map_.end()) {
 printf("Add failed! fd has in Selector!\n");
 return false;
 }
 fd_map_[fd] = sock;
 FD_SET(fd, &read_fds_);
 if (fd > max_fd_) {
 max_fd_ = fd;
 }
 return true;
 }
 bool Del(const TcpSocket& sock) {
 int fd = sock.GetFd();
 printf("[Selector::Del] %d\n", fd);
 if (fd_map_.find(fd) == fd_map_.end()) {
 printf("Del failed! fd has not in Selector!\n");
 return false;
 }
 fd_map_.erase(fd);
 FD_CLR(fd, &read_fds_);
 // Find the new maximum file descriptor; scanning from high to low is faster
 for (int i = max_fd_; i >= 0; --i) {
 if (!FD_ISSET(i, &read_fds_)) {
 continue;
 }
 max_fd_ = i;
 break;
 }
 return true;
 }
 // Return the set of file descriptors that are ready for reading
 bool Wait(std::vector<TcpSocket>* output) {
 output->clear();
 // [Note] A temporary variable is required here, otherwise the original set would be overwritten
 fd_set tmp = read_fds_;
 // DEBUG
 PrintFdSet(&tmp, max_fd_);
 int nfds = select(max_fd_ + 1, &tmp, NULL, NULL, NULL);
 if (nfds < 0) {
 perror("select");
 return false;
 }
 // [Note!] The loop condition here must be i < max_fd_ + 1
 for (int i = 0; i < max_fd_ + 1; ++i) {
 if (!FD_ISSET(i, &tmp)) {
 continue;
 }
 output->push_back(fd_map_[i]);
 }
 return true;
 }
private:
 fd_set read_fds_;
 int max_fd_;
 // Mapping between file descriptors and socket objects
 std::unordered_map<int, TcpSocket> fd_map_;
};
class TcpSelectServer {
public:
 TcpSelectServer(const std::string& ip, uint16_t port) : ip_(ip), port_(port) {
 
 }
 bool Start(Handler handler) const {
 // 1. Create the socket
 TcpSocket listen_sock;
 bool ret = listen_sock.Socket();
 if (!ret) {
 return false;
 }
 // 2. Bind the port
 ret = listen_sock.Bind(ip_, port_);
 if (!ret) {
 return false;
 }
 // 3. Start listening
 ret = listen_sock.Listen(5);
 if (!ret) {
 return false;
 }
 // 4. Create the Selector object
 Selector selector;
 selector.Add(listen_sock);
 // 5. Enter the event loop
 for (;;) {
 std::vector<TcpSocket> output;
 bool ret = selector.Wait(&output);
 if (!ret) {
 continue;
 }
 // 6. Decide what to do based on which file descriptor is ready
 for (size_t i = 0; i < output.size(); ++i) {
 if (output[i].GetFd() == listen_sock.GetFd()) {
 // If the ready fd is listen_sock, call accept and add the new socket to the Selector
 TcpSocket new_sock;
 listen_sock.Accept(&new_sock, NULL, NULL);
 selector.Add(new_sock);
 } else {
 // If the ready fd is a client socket, handle one request
 std::string req, resp;
 bool ret = output[i].Recv(&req);
 if (!ret) {
 selector.Del(output[i]);
 // [Note!] The socket needs to be closed
 output[i].Close();
 continue;
 }
 // Call the business handler to compute the response
 handler(req, &resp);
 // Send the result back to the client
 output[i].Send(resp);
 }
 } // end for
 } // end for (;;)
 return true;
 }
private:
 std::string ip_;
 uint16_t port_;
};

I/O multiplexing poll

poll function interface

#include <poll.h>
int poll(struct pollfd *fds, nfds_t nfds, int timeout);
// pollfd structure
struct pollfd {
        int fd; /* file descriptor */
        short events; /* requested events */
        short revents; /* returned events */
};

 

Parameter Description
  • fds is a list of structures monitored by the poll function . Each element contains three parts : a file descriptor , a collection of monitored events , and a collection of returned events.
  • nfds indicates the length of the fds array .
  • timeout indicates the timeout period of the poll function , in milliseconds (ms)

The values of events and revents are bit masks of the event flags defined in <poll.h>. The most commonly used ones are:

  • POLLIN: there is data to read
  • POLLPRI: there is urgent (out-of-band) data to read
  • POLLOUT: writing is possible without blocking
  • POLLERR: an error condition occurred (only returned in revents; ignored in events)
  • POLLHUP: the peer hung up (only returned in revents; ignored in events)
  • POLLNVAL: invalid request, the fd is not open (only returned in revents; ignored in events)

return result

  • If the return value is less than 0, it means an error ;
  • The return value is equal to 0, indicating that the poll function waits for a timeout ;
  • The return value is greater than 0, indicating that poll returns because the monitored file descriptor is ready .
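Since the whole point of poll is monitoring several descriptors at once, here is a small sketch (not from the original text; sock_fd is assumed to be an already-connected socket) that watches standard input and a socket at the same time:

#include <poll.h>
#include <stdio.h>
#include <unistd.h>

// Monitor stdin and an already-connected socket sock_fd at the same time.
void poll_two(int sock_fd) {
    struct pollfd fds[2];
    fds[0].fd = 0;        fds[0].events = POLLIN;   // standard input
    fds[1].fd = sock_fd;  fds[1].events = POLLIN;   // the socket
    for (;;) {
        int n = poll(fds, 2, 1000);                 // 1000 ms timeout
        if (n < 0) { perror("poll"); break; }
        if (n == 0) { printf("poll timeout\n"); continue; }
        for (int i = 0; i < 2; ++i) {
            if (fds[i].revents & POLLIN) {          // test with &: revents may carry other bits
                char buf[1024] = {0};
                ssize_t s = read(fds[i].fd, buf, sizeof(buf) - 1);
                if (s > 0) printf("fd %d: %s", fds[i].fd, buf);
            }
        }
    }
}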

socket ready condition

Same as select.

Advantages of poll

Unlike select , which uses three bitmaps to represent three fdsets , poll uses a pollfd pointer .

  • The pollfd structure contains both the events to monitor and the events that occurred, so select's "parameter doubles as return value" style is no longer used. The interface is more convenient to use than select's.
  • poll has no limit on the maximum number of descriptors (although performance degrades when the number is very large).

Disadvantages of poll

When the number of file descriptors monitored by poll grows:
  • Like select, after poll returns, the pollfd array still has to be traversed to find the ready descriptors.
  • Every call to poll copies a large number of pollfd structures from user space into the kernel.
  • Of a large number of simultaneously connected clients, only a few may be ready at any moment, so as the number of monitored descriptors grows, poll's efficiency decreases linearly.

poll example : use poll to monitor standard input

#include <poll.h>
#include <unistd.h>
#include <stdio.h>
int main() {
 struct pollfd poll_fd;
 poll_fd.fd = 0;
 poll_fd.events = POLLIN;
 
 for (;;) {
 int ret = poll(&poll_fd, 1, 1000);
 if (ret < 0) {
 perror("poll");
 continue;
 }
 if (ret == 0) {
 printf("poll timeout\n");
 continue;
 }
 if (poll_fd.revents & POLLIN) {
 char buf[1024] = {0};
 read(0, buf, sizeof(buf) - 1);
 printf("stdin:%s", buf);
 }
 }
}

epoll for I/O multiplexing

Introduction to epoll

According to the man manual : it is an improved poll for handling large batches of handles .
It was introduced in the 2.5.44 kernel (epoll(4) is a new API introduced in Linux kernel 2.5.44)
It has almost all of the advantages mentioned earlier, and is recognized as the best-performing multiplexed I/O readiness notification method on Linux 2.6.

epoll related system calls


epoll_create: on success it returns a file descriptor; this means an epoll model has been created.

epoll_wait: the core job of multiplexing is waiting, and this function is how that waiting is done.

epoll_create

int epoll_create(int size);
Creates an epoll handle.
Since Linux 2.6.8 the size parameter is ignored (but it must still be greater than 0).
After use, close() must be called on it.

epoll_ctl

int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event);
Event registration function of epoll
  • Unlike select(), where the kernel is told which kinds of events to watch at the moment of waiting, with epoll the event types to be listened for are registered in advance through this function.
  • The first parameter is the return value of epoll_create() ( the handle of epoll ).
  • The second parameter represents the action, represented by three macros .
  • The third parameter is the fd that needs to be monitored.
  • The fourth parameter is to tell the kernel what to monitor
The value of the second parameter :
  • EPOLL_CTL_ADD : Register a new fd to epfd ;
  • EPOLL_CTL_MOD : Modify the listening event of the registered fd ;
  • EPOLL_CTL_DEL : delete a fd from epfd ;

The structure of struct epoll_event is as follows:
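The definition given in epoll_ctl(2) (from <sys/epoll.h>) is:

typedef union epoll_data {
    void     *ptr;
    int       fd;
    uint32_t  u32;
    uint64_t  u64;
} epoll_data_t;

struct epoll_event {
    uint32_t      events;  /* epoll events: a bit mask of the macros below */
    epoll_data_t  data;    /* user data variable */
};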

events can be a collection of the following macros:

  • EPOLLIN : Indicates that the corresponding file descriptor can be read ( including the normal closing of the peer SOCKET );
  • EPOLLOUT : Indicates that the corresponding file descriptor can be written ;
  • EPOLLPRI: Indicates that the corresponding file descriptor has urgent data readable ( here it should indicate that there is out-of-band data coming );
  • EPOLLERR: Indicates that the corresponding file descriptor has an error ;
  • EPOLLHUP: Indicates that the corresponding file descriptor is hung up ;
  • EPOLLET: Set EPOLL to Edge Triggered mode , which is relative to Level Triggered .
  • EPOLLONESHOT : Only listen to one event . After listening to this event , if you need to continue to monitor this socket , you need to add this socket to the EPOLL queue again
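A small sketch (not from the original text) of how an EPOLLONESHOT descriptor is typically re-armed after its event has been handled; epfd and fd are assumed to be an existing epoll instance and an already-registered socket:

#include <stdio.h>
#include <sys/epoll.h>

// After handling one event on fd, re-arm it so epoll_wait can report it again.
// Without this EPOLL_CTL_MOD call, an EPOLLONESHOT fd stays silent after its first event.
void rearm_oneshot(int epfd, int fd) {
    struct epoll_event ev;
    ev.data.fd = fd;
    ev.events = EPOLLIN | EPOLLONESHOT;
    if (epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &ev) < 0) {
        perror("epoll_ctl MOD");
    }
}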

epoll_wait

epoll_wait extracts ready events from a given epoll model. The middle two parameters are output parameters that return the ready events, and the last parameter sets how epoll waits (the same three options as before: block, return immediately, or time out). This call solves the "kernel tells the user" problem.
poll keeps the request and the result in the same structure, while epoll separates them at the interface level: the events to be waited for go in through epoll_ctl, and the ready events come back through epoll_wait (the epoll_event passed to these two functions carries different meanings!).

int epoll_wait(int epfd, struct epoll_event * events, int maxevents, int timeout);

Collect the events that have been sent in the events monitored by epoll

  • The parameter events is an array of allocated epoll_event structures .
  • epoll will assign the events that occurred to the events array (events cannot be a null pointer, the kernel is only responsible for copying data to the events array, and will not help us allocate memory in user mode ).
  • maxevents tells the kernel how large the events array is; it must be greater than zero. The parameter timeout is the timeout in milliseconds (0 returns immediately, -1 blocks permanently).
  • If the function call is successful, it returns the number of file descriptors that have been prepared on the corresponding I/O . If it returns 0 , it means it has timed out . If it returns less than 0 , it means the function failed .

underlying mechanism

When a node is inserted into the red-black tree, the kernel also installs a callback at the bottom layer. When the data becomes ready, the callback fires to tell the OS that there is something to return, and the ready item is linked into the ready queue.
We call the combination of the red-black tree, the ready queue and the underlying callback mechanism the epoll model.
(**epoll_create** creates an epoll model and returns a file descriptor. That is, it sets up the callback mechanism at the bottom layer, creates an empty red-black tree and creates an empty ready queue; these are packaged into a data structure and hooked into our file system (a struct file whose other fields, such as the file type and the pointer to the red-black tree model, are then filled in), so that this file descriptor can later be used to find the epoll model.)
(**epoll_ctl** adds a node to the red-black tree (keyed by fd), finds a node, modifies a node or deletes a node. This red-black tree is how the user tells the kernel which file descriptors and which events the kernel needs to care about.)
(**epoll_wait**: once the epoll model has been built, the user does not have to care about how events on an fd become ready, because the bottom layer has callbacks (roughly, the callback: 1. gets the ready fd; 2. records which event is ready; 3. builds a queue node; 4. links the node into the ready queue). So the job of this function is to fetch the data from the underlying ready queue, i.e. it is waiting (is the ready queue empty?) plus copying.)

Why is epoll efficient?

1. After the epoll model has been created, epoll_ctl is used to manage file descriptors and their events. The data structure epoll_ctl operates on is a red-black tree, and the scenario it covers is the "user tells the kernel" direction. Because the red-black tree has no upper limit on its nodes, there is no upper limit on the number of file descriptors.
2. The OS is completely freed from traversing every file descriptor. When the epoll model is created, the OS automatically installs a callback mechanism at the bottom layer; when data arrives, the callback obtains the ready file descriptor, records the ready event, builds the ready node and links it into the ready queue, all automatically. Because this is callback-driven, the OS does not need to poll actively, which greatly improves detection efficiency: the OS now only does work for the nodes that are actually ready, instead of traversing everything in O(N).
3. epoll_wait does not need to traverse anything to get the ready nodes; it simply takes them from the ready queue (the "kernel tells the user" direction).

How epoll works

When a process calls the epoll_create method, the Linux kernel will create an eventpoll structure, which has two members closely related to the use of epoll
struct eventpoll{
....
/* The root node of the red-black tree, which stores all the events that need to be monitored that are added to epoll */
struct rb_root rbr;
/* The double-linked list stores the events that meet the conditions that will be returned to the user through epoll_wait */
struct list_head rdlist;
....
};
  • Each epoll object has an independent eventpoll structure, which is used to store events added to the epoll object through the epoll_ctl method .
  • These events are mounted in the red-black tree, so repeatedly added events can be identified efficiently (insertion into a red-black tree takes O(log n) time, where n is the number of nodes).
  • And all events added to epoll will establish a callback relationship with the device ( network card ) driver, that is, this callback method will be called when the corresponding event occurs.
  • This callback method is called ep_poll_callback in the kernel , and it will add the event that occurred to the rdlist double-linked list .
  • In epoll , for each event, an epitem structure is created
struct epitem{
struct rb_node rbn;// red-black tree node
struct list_head rdllink;// doubly linked list node
struct epoll_filefd ffd; // event handle information
struct eventpoll *ep; // point to the eventpoll object it belongs to
struct epoll_event event; // The expected event type
}
  • When calling epoll_wait to check whether an event has occurred, you only need to check whether there is an epitem element in the rdlist double-linked list in the eventpoll object.
  • If rdlist is not empty, copy the events that occurred to the user mode, and return the number of events to the user at the same time . The time complexity of this operation is O(1)

To sum up , the process of using epoll is a trilogy :

  • Call epoll_create to create an epoll handle ;
  • Call epoll_ctl to register the file descriptor to be monitored ;
  • Call epoll_wait to wait for the file descriptor to be ready ;
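A minimal, self-contained sketch of this three-step flow (not from the original text), monitoring standard input:

#include <stdio.h>
#include <unistd.h>
#include <sys/epoll.h>

int main() {
    // 1. Create the epoll handle
    int epfd = epoll_create(10);                 // the size argument is ignored since 2.6.8
    if (epfd < 0) { perror("epoll_create"); return 1; }
    // 2. Register stdin for read events
    struct epoll_event ev;
    ev.data.fd = 0;
    ev.events = EPOLLIN;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, 0, &ev) < 0) { perror("epoll_ctl"); return 1; }
    // 3. Wait for readiness
    for (;;) {
        struct epoll_event events[16];
        int n = epoll_wait(epfd, events, 16, -1);
        if (n < 0) { perror("epoll_wait"); break; }
        for (int i = 0; i < n; ++i) {
            if (events[i].data.fd == 0 && (events[i].events & EPOLLIN)) {
                char buf[1024] = {0};
                ssize_t s = read(0, buf, sizeof(buf) - 1);
                if (s > 0) printf("stdin: %s", buf);
            }
        }
    }
    close(epfd);
    return 0;
}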

The advantages of epoll ( corresponding to the disadvantages of select )

  • The interface is easy to use : although it is split into three functions , it is more convenient and efficient to use . It is not necessary to set the concerned file descriptor every cycle, and the input and output parameters are also separated.
  • Lightweight data copy : only call EPOLL_CTL_ADD to copy the file descriptor structure to the kernel when appropriate , this operation is infrequent ( while select/poll is copied every cycle )
  • Event callback mechanism : avoid using traversal , but use the callback function to add the ready file descriptor structure to the ready queue , and epoll_wait returns to directly access the ready queue to know which file descriptors are ready . The time complexity of this operation is O( 1). Even if the number of file descriptors is large, the efficiency will not be affected .
  • Unlimited number : unlimited number of file descriptors

epoll working modes

epoll has 2 working modes - level trigger (LT) and edge trigger (ET)

Level Triggered (LT) working mode

If there is such an example :

We have added a TCP socket to the epoll descriptor.
The peer then writes 2KB of data into this socket.
We call epoll_wait, and it returns, indicating the socket is ready for a read operation.
We then call read, but only read 1KB of the data.
We continue to call epoll_wait...
By default, epoll is in LT working mode .
  • When epoll detects that an event on the socket is ready, you do not have to process it immediately, or you can process only part of it.
  • As in the example above, since only 1K of data was read, 1K of data is still left in the buffer. When epoll_wait is called a second time, it still returns immediately and reports that the socket's read event is ready.
  • Only when all of the data in the buffer has been consumed will epoll_wait stop returning immediately.
  • Support blocking read and write and non-blocking read and write

Edge Triggered (ET) working mode

If we use the EPOLLET flag when we add the socket to the epoll descriptor in step 1 , epoll enters the ET working mode

  • When epoll detects that the event on the socket is ready , it must be processed immediately .
  • As in the above example , although only 1K of data is read , there is still 1K of data in the buffer . When epoll_wait is called for the second time , epoll_wait will not return .
  • That is to say , in ET mode , after the event on the file descriptor is ready , there is only one chance to process it .
  • The performance of ET is higher than that of LT (epoll_wait returns far fewer times). Nginx uses epoll in ET mode by default.
  • Only supports non-blocking read and write
select and poll actually work in LT mode . epoll can support both LT and ET

Comparing LT and ET

LT is the default behavior of epoll. Using ET reduces the number of times epoll triggers, but the price is that the programmer is forced to process all of the data in one response.
It is as if a file descriptor, once ready, is never reported as ready again, which looks more efficient than LT. But under LT, if every ready file descriptor is likewise processed immediately so that it is not reported repeatedly, the performance is actually the same.
On the other hand, the code complexity of ET is higher.

Understanding ET mode and non-blocking file descriptors

To use epoll in ET mode, the file descriptors must be set to non-blocking. This is not a requirement of the interface itself, but a requirement of "engineering practice".
Suppose this scenario: the server receives a 10KB request and will return a response to the client; until the client receives the response, it will not send a second 10KB request.
If the server code does a blocking read and only reads 1KB at a time (read cannot guarantee to return all the data at once; as the man page notes, it may for example be interrupted by a signal), the remaining 9KB stays in the buffer.

At this point, since epoll is in ET mode, it does not consider the file descriptor ready for reading again, so epoll_wait does not return. The remaining 9KB stays in the buffer until the next time the client writes data to the server, at which point epoll_wait can return.

but here comes the problem

  • The server only reads 1k pieces of data , and will return response data to the client only after 10k pieces of data are read .
  • The client will not send the next request until it reads the server's response
  • After the client sends the next request , epoll_wait will return to read the remaining data in the buffer
Therefore, to solve the above problem (a blocking read may not get the complete request in one call), the buffer should be read in a non-blocking loop until it is drained, which guarantees the complete request is read out (a minimal loop is sketched below).

If LT is used instead, there is no such problem: as long as there is unread data in the buffer, epoll_wait keeps reporting the file descriptor as read-ready.
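A minimal sketch of the non-blocking "read until drained" loop described above (not from the original code; fd is assumed to already be non-blocking, and append_to_request is a hypothetical helper for accumulating the request):

#include <errno.h>
#include <unistd.h>

// Keep reading until the kernel buffer is drained (read reports EAGAIN/EWOULDBLOCK).
// Returns 0 on success, -1 if the peer closed the connection or a real error occurred.
int drain_fd(int fd) {
    char buf[4096];
    for (;;) {
        ssize_t n = read(fd, buf, sizeof(buf));
        if (n > 0) {
            // append_to_request(buf, n);       // hypothetical: accumulate the request data
            continue;
        }
        if (n == 0) return -1;                  // peer closed the connection
        if (errno == EAGAIN || errno == EWOULDBLOCK) return 0;   // buffer drained
        if (errno == EINTR) continue;           // interrupted by a signal, just retry
        return -1;                              // real error
    }
}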

Usage scenarios of epoll

The high performance of epoll shows up in specific scenarios. If the scenario does not fit, epoll's design can even work against you.

For multiple connections , and only a part of the multiple connections are active , it is more suitable to use epoll.

For example , a typical server that needs to handle tens of thousands of clients , such as the entry server of various Internet APPs , such a server is very suitable for epoll

If it is only within the system , the server communicates with the server , and there are only a few connections , it is not appropriate to use epoll in this case . The specific IO model to be used depends on the needs and the characteristics of the scene.

Thundering herd problem in epoll

The thundering herd problem is something some interviewers may ask about.
Reference http://blog.csdn.net/fsmiy/article/details/36873357

epoll example : epoll server (LT mode )

tcp_epoll_server.hpp

///
// Wrap an Epoll server; only the read-ready case is considered
///
#pragma once
#include <vector>
#include <functional>
#include <sys/epoll.h>
#include "tcp_socket.hpp"
typedef std::function<void (const std::string&, std::string* resp)> Handler;
class Epoll {
public:
 Epoll() {
 epoll_fd_ = epoll_create(10);
 }
 ~Epoll() {
 close(epoll_fd_);
 }
 bool Add(const TcpSocket& sock) const {
 int fd = sock.GetFd();
 printf("[Epoll Add] fd = %d\n", fd);
 epoll_event ev;
 ev.data.fd = fd;
 ev.events = EPOLLIN;
 int ret = epoll_ctl(epoll_fd_, EPOLL_CTL_ADD, fd, &ev);
 if (ret < 0) {
 perror("epoll_ctl ADD");
 return false;
 }
 return true;
 }
 bool Del(const TcpSocket& sock) const {
 int fd = sock.GetFd();
 printf("[Epoll Del] fd = %d\n", fd);
 int ret = epoll_ctl(epoll_fd_, EPOLL_CTL_DEL, fd, NULL);
 if (ret < 0) {
 perror("epoll_ctl DEL");
 return false;
 }
 return true;
 }
 bool Wait(std::vector<TcpSocket>* output) const {
 output->clear();
 epoll_event events[1000];
 int nfds = epoll_wait(epoll_fd_, events, sizeof(events) / sizeof(events[0]), -1);
 if (nfds < 0) {
 perror("epoll_wait");
 return false;
 }
 // [Note!] Loop only up to nfds here, no further
 for (int i = 0; i < nfds; ++i) {
 TcpSocket sock(events[i].data.fd);
 output->push_back(sock);
 }
 return true;
 }
private:
 int epoll_fd_;
};
class TcpEpollServer {
public:
 TcpEpollServer(const std::string& ip, uint16_t port) : ip_(ip), port_(port) {
 }
 bool Start(Handler handler) {
 // 1. Create the socket
 TcpSocket listen_sock;
 CHECK_RET(listen_sock.Socket());
 // 2. Bind
 CHECK_RET(listen_sock.Bind(ip_, port_));
 // 3. Listen
 CHECK_RET(listen_sock.Listen(5));
 // 4. Create the Epoll object and add listen_sock to it
 Epoll epoll;
 epoll.Add(listen_sock);
 // 5. Enter the event loop
 for (;;) {
 // 6. Call epoll_wait
 std::vector<TcpSocket> output;
 if (!epoll.Wait(&output)) {
 continue;
 }
 // 7. Decide how to handle each kind of ready file descriptor
 for (size_t i = 0; i < output.size(); ++i) {
 if (output[i].GetFd() == listen_sock.GetFd()) {
 // If it is listen_sock, call accept
 TcpSocket new_sock;
 listen_sock.Accept(&new_sock);
 epoll.Add(new_sock);
 } else {
 // If it is a client socket, do one read and write
 std::string req, resp;
 bool ret = output[i].Recv(&req);
 if (!ret) {
 // [Note!!] Unused sockets need to be closed
 // Don't get the order backwards (in practice, closing the socket also removes it from epoll)
 epoll.Del(output[i]);
 output[i].Close();
 continue;
 }
 handler(req, &resp);
 output[i].Send(resp);
 }
 } // end for
 } // end for (;;)
 return true;
 }
 
private:
 std::string ip_;
 uint16_t port_;
};

epoll example : epoll server (ET mode )

Based on the LT version, it can be slightly modified
1. Modify tcp_socket.hpp, add non-blocking read and non-blocking write interfaces
2. For the new_sock returned by accept , add options such as EPOLLET
Note: this code does not handle the case of listen_sock in ET mode for now. If listen_sock is set to ET, accept must be called in a non-blocking loop; otherwise, when a large number of clients connect at the same time, only one of them gets accepted per wakeup.

tcp_socket.hpp

 // The following code is added inside the TcpSocket class
 // Non-blocking IO interfaces
 bool SetNoBlock() {
 int fl = fcntl(fd_, F_GETFL);
 if (fl < 0) {
 perror("fcntl F_GETFL");
 return false;
 }
 int ret = fcntl(fd_, F_SETFL, fl | O_NONBLOCK);
 if (ret < 0) {
 perror("fcntl F_SETFL");
 return false;
 }
 return true;
 }
 bool RecvNoBlock(std::string* buf) const {
 // For non-blocking reads, if the TCP receive buffer is empty the call returns an error.
 // The error code is EAGAIN or EWOULDBLOCK; this is expected and we simply retry.
 // If the amount read this time is less than the size of the buffer we tried to fill, exit the loop.
 // Strictly speaking this is not very rigorous (it does not deal with TCP message boundaries).
 buf->clear();
 char tmp[1024 * 10] = {0};
 for (;;) {
 ssize_t read_size = recv(fd_, tmp, sizeof(tmp) - 1, 0);
 if (read_size < 0) {
 if (errno == EWOULDBLOCK || errno == EAGAIN) {
 continue;
 }
 perror("recv");
 return false;
 }
 if (read_size == 0) {
 // The peer closed the connection, return false
 return false;
 }
 tmp[read_size] = '\0';
 *buf += tmp;
 if (read_size < (ssize_t)sizeof(tmp) - 1) {
 break;
 }
 }
 return true;
 }
 bool SendNoBlock(const std::string& buf) const {
 // For non-blocking writes, if the TCP send buffer is already full the call fails.
 // The error code is EAGAIN or EWOULDBLOCK. In that case we should not give up,
 // but retry the write.
 ssize_t cur_pos = 0; // current position written so far
 ssize_t left_size = buf.size();
 for (;;) {
 ssize_t write_size = send(fd_, buf.data() + cur_pos, left_size, 0);
 if (write_size < 0) {
 if (errno == EAGAIN || errno == EWOULDBLOCK) {
 // retry the write
 continue;
 }
 return false;
 }
 cur_pos += write_size;
 left_size -= write_size;
 // this condition means all the required data has been written
 if (left_size <= 0) {
 break;
 }
 }
 return true;
 }

tcp_epoll_server.hpp 

///
// Wrap an Epoll ET server
// Changes:
// 1. For the new sock, add the EPOLLET flag
// 2. Modify TcpSocket to support non-blocking read/write
// [Note!] If listen_sock is set to ET, accept has to be called in a non-blocking way.
// That is a bit more involved, so it is not implemented here.
///
#pragma once
#include <vector>
#include <functional>
#include <sys/epoll.h>
#include "tcp_socket.hpp"
typedef std::function<void (const std::string&, std::string* resp)> Handler;
class Epoll {
public:
 Epoll() {
 epoll_fd_ = epoll_create(10);
 }
 ~Epoll() {
 close(epoll_fd_);
 }
 bool Add(const TcpSocket& sock, bool epoll_et = false) const {
 int fd = sock.GetFd();
 printf("[Epoll Add] fd = %d\n", fd);
 epoll_event ev;
 ev.data.fd = fd;
 if (epoll_et) {
 ev.events = EPOLLIN | EPOLLET;
 } else {
 ev.events = EPOLLIN;
 }
 int ret = epoll_ctl(epoll_fd_, EPOLL_CTL_ADD, fd, &ev);
 if (ret < 0) {
 perror("epoll_ctl ADD");
 return false;
 }
 return true;
 }
 bool Del(const TcpSocket& sock) const {
 int fd = sock.GetFd();
 printf("[Epoll Del] fd = %d\n", fd);
 int ret = epoll_ctl(epoll_fd_, EPOLL_CTL_DEL, fd, NULL);
 if (ret < 0) {
 perror("epoll_ctl DEL");
 return false;
 }
 return true;
 }
 bool Wait(std::vector<TcpSocket>* output) const {
 output->clear();
 epoll_event events[1000];
 int nfds = epoll_wait(epoll_fd_, events, sizeof(events) / sizeof(events[0]), -1);
 if (nfds < 0) {
 perror("epoll_wait");
 return false;
 }
 // [Note!] Loop only up to nfds here, no further
 for (int i = 0; i < nfds; ++i) {
 TcpSocket sock(events[i].data.fd);
 output->push_back(sock);
 }
 return true;
 }
private:
 int epoll_fd_;
};
class TcpEpollServer {
public:
 TcpEpollServer(const std::string& ip, uint16_t port) : ip_(ip), port_(port) {
 }
 bool Start(Handler handler) {
 // 1. Create the socket
 TcpSocket listen_sock;
 CHECK_RET(listen_sock.Socket());
 // 2. Bind
 CHECK_RET(listen_sock.Bind(ip_, port_));
 // 3. Listen
 CHECK_RET(listen_sock.Listen(5));
 // 4. Create the Epoll object and add listen_sock to it
 Epoll epoll;
 epoll.Add(listen_sock);
 // 5. Enter the event loop
 for (;;) {
 // 6. Call epoll_wait
 std::vector<TcpSocket> output;
 if (!epoll.Wait(&output)) {
 continue;
 }
 // 7. Decide how to handle each kind of ready file descriptor
 for (size_t i = 0; i < output.size(); ++i) {
 if (output[i].GetFd() == listen_sock.GetFd()) {
 // If it is listen_sock, call accept
 TcpSocket new_sock;
 listen_sock.Accept(&new_sock);
 epoll.Add(new_sock, true);
 } else {
 // If it is a client socket, do one read and write
 std::string req, resp;
 bool ret = output[i].RecvNoBlock(&req);
 if (!ret) {
 // [Note!!] Unused sockets need to be closed
 // Don't get the order backwards (in practice, closing the socket also removes it from epoll)
 epoll.Del(output[i]);
 output[i].Close();
 continue;
 }
 handler(req, &resp);
 output[i].SendNoBlock(resp);
 printf("[client %d] req: %s, resp: %s\n", output[i].GetFd(), 
 req.c_str(), resp.c_str());
 }
 } // end for
 } // end for (;;)
 return true;
 }
 
private:
 std::string ip_;
 uint16_t port_;
};


Origin blog.csdn.net/weixin_62700590/article/details/131183435