High concurrency cornerstone | A deep understanding of epoll and I/O multiplexing

1. Foreword

Today let's learn about an important foundation of high concurrency: I/O multiplexing technology and the principle of epoll.
Through this article, you will learn the following:

The concept of IO multiplexing
The IO multiplexing tools before epoll appeared
The epoll "three-stage rocket" API
How epoll is implemented at the bottom layer
ET mode & LT mode
A Tencent interview question
The epoll thundering herd problem




2. A first look at multiplexing and IO multiplexing

Before diving into epoll, let's first look at the general concept of multiplexing and what "IO multiplexing" actually means.

2.1 The concept of multiplexing and resource characteristics
2.1.1 The concept of multiplexing

Multiplexing is not a new technology but a design idea. Frequency-division, time-division, wavelength-division and code-division multiplexing are all used in communications and hardware design, and there are plenty of examples of multiplexing in daily life as well, so don't be intimidated by the jargon.

In essence, multiplexing exists to resolve the imbalance between limited resources and too many users, so that the resources are utilized as fully as possible and more work gets done.

2.1.2 Releasable resources

For example:
Non-releasable scenario: a ventilator in an ICU ward is a limited resource. Once a patient occupies it, the patient cannot give it up until he or she is out of danger, so it is impossible for several patients in the same condition to take turns using it.

Releasable scenario: other resources, such as medical staff, can monitor multiple patients at the same time. In principle there is no scenario where one patient occupies the medical staff permanently without releasing them.

So we can ask: is the resource shared by multiple IOs (the processing thread) releasable?

2.1.3 Understanding IO multiplexing

The meaning of IO: in computing, IO usually includes disk IO and network IO, and the IO multiplexing we discuss here refers mainly to network IO. In Linux everything is a file, so network IO is usually represented by a file descriptor (fd).

The meaning of multiplexing: so what do these file descriptors actually multiplex? In the network scenario it is the task-processing thread that is multiplexed; put simply, multiple IOs share one processing thread.

The feasibility of IO multiplexing: the basic IO operations are read and write. Because of the nature of network interaction, waiting is unavoidable; in other words, reads and writes on the fds of a network connection happen alternately. An fd is sometimes readable or writable and sometimes idle, so IO multiplexing is feasible.

To sum up, IO multiplexing coordinates multiple fds, whose resources can be released, to alternately share a task-processing thread, so that many fds correspond to one processing thread.

In real life, IO multiplexing is like one border collie managing hundreds of sheep:

Picture from the Internet: many sheep share the management of one border collie

2.1.4 Background of the emergence of IO reuse
Business needs are the driving force behind the evolution of technology.

In the primitive era when network concurrency was very small, even a process-per-request model could meet the requirements, but as concurrency grew, the original approach inevitably became a bottleneck, which stimulated the implementation and adoption of IO multiplexing mechanisms.

An efficient IO multiplexing mechanism should satisfy: the coordinator consumes as few system resources as possible, fds wait as little as possible, the number of fds handled is as large as possible, task-processing threads are idle as little as possible, and tasks are completed as quickly and smoothly as possible.

Voiceover: the paragraph above may read a bit abstractly. In plain words, it means letting the task-processing thread coordinate more network connections with less resource consumption. IO multiplexing tools have evolved gradually, and if you compare them before and after you will find this principle running through all of them.

After understanding the basic concepts of IO multiplexing, let's look at the IO multiplexing tools that have appeared in Linux and their respective characteristics to deepen our understanding.

3. Overview of Linux IO reuse tools

In Linux, select, poll, and epoll appeared one after another; FreeBSD's kqueue is also an excellent IO multiplexing tool whose principle is very similar to epoll's. This article uses the Linux environment as the example and does not go into the implementation details of select and poll.

3.1 The pioneer: select

select is the oldest of the three (it dates back to 4.2BSD in the early 1980s). Its external interface is defined as follows:

/* According to POSIX.1-2001 */
#include <sys/select.h>

/* According to earlier standards */
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

int select(int nfds, fd_set *readfds, fd_set *writefds,
           fd_set *exceptfds, struct timeval *timeout);
void FD_CLR(int fd, fd_set *set);
int  FD_ISSET(int fd, fd_set *set);
void FD_SET(int fd, fd_set *set);
void FD_ZERO(fd_set *set);

3.1.1 Official tips

As the first IO multiplexing system call, select uses macro functions to fill an fd_set based on a bitmap whose default size is 1024. Therefore, if an fd value is greater than 1024 there may be problems; see the official warning:

Macro: int FD_SETSIZE The value of this macro is the maximum number of
file descriptors that a fd_set object can hold information about. On
systems with a fixed maximum number, FD_SETSIZE is at least that
number. On some systems, including GNU, there is no absolute limit on
the number of descriptors open, but this macro still has a constant
value which controls the number of bits in an fd_set; if you get a
file descriptor with a value as high as FD_SETSIZE, you cannot put
that descriptor into an fd_set.

Briefly, this passage means: when the value of an fd reaches or exceeds 1024, behavior becomes uncontrollable, because the underlying structure is a bit array. The official recommendation is not to exceed 1024, but we cannot control the absolute value of an fd. I looked into this problem before, and the conclusion is that the system has its own strategy for allocating fds, which will most likely keep them within 1024. I don't fully understand this, but it is a pitfall worth mentioning.
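To make the fd_set mechanics concrete, here is a minimal, hedged sketch of a select-based loop for a single listening socket (listen_sock is assumed to be already bound and listening, and the accept/read handling is only sketched); note how the sets must be rebuilt on every iteration and how no fd value may reach FD_SETSIZE:

#include <sys/select.h>
#include <sys/socket.h>

void select_loop(int listen_sock)
{
    fd_set readfds;
    int maxfd = listen_sock;

    for (;;) {
        FD_ZERO(&readfds);               /* the sets are modified in place,  */
        FD_SET(listen_sock, &readfds);   /* so rebuild them before each call */
        /* FD_SET() every tracked client fd here and keep maxfd up to date */

        int n = select(maxfd + 1, &readfds, NULL, NULL, NULL);
        if (n == -1)
            break;                       /* error handling omitted */

        if (FD_ISSET(listen_sock, &readfds)) {
            /* accept() the new connection and remember its fd */
        }
        /* FD_ISSET() each client fd and read from the ready ones */
    }
}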

3.1.2 Existing problems and objective evaluation

Due to the limitations of its underlying implementation, select has several problems:

The number of monitored fds, and their values, cannot exceed 1024, so high concurrency cannot be achieved.
Each call traverses the fd sets with O(n) complexity to check whether fds are readable or writable, which is inefficient.
Each call involves bulk copies of the fd sets between kernel mode and user mode.
The fd sets are modified in place by the kernel, so they must be rebuilt and passed in again on every call, which is redundant.

select implements IO multiplexing in a simple way and raises concurrency to roughly the thousands, but the cost and flexibility of completing this task leave room for improvement.

In any case, select as a pioneer gave IO multiplexing a huge push and pointed out the direction for later optimization, so do not dismiss it ignorantly.

3.2 The successor: epoll

Before the emergence of epoll, poll made improvements to select, but the essence has not changed much, so we skip poll and look at epoll directly.

epoll first appeared in kernel version 2.5.44, and the code was subsequently tidied up in the 2.6.x series to make it more concise. Facing doubts from the outside world, additional settings were later added to solve some latent problems, so epoll already has well over a decade of history.

In the third edition of "Unix Network Programming" (2003) epoll is not covered, because it had barely appeared at the time; the book only introduces select and poll. epoll solves the problems of select one by one. Its advantages include:

No hard limit on the number of fds (poll also solved this).
The bitmap array is abandoned in favor of a new structure that can store multiple event types.
fds do not need to be copied in repeatedly; they are added once and removed when no longer needed.
Readiness is reported in an event-driven way through callbacks, avoiding polling every fd for read/write events.

The application status of epoll:

After epoll appeared, the achievable concurrency increased greatly and the C10K problem became easy to handle. Even the later arrival of true asynchronous IO has not (for the time being) shaken epoll's position.

Because epoll can handle tens or hundreds of thousands of concurrent connections, it already covers most current scenarios. Asynchronous IO is excellent, but it is harder to program than epoll; on balance, epoll remains full of vitality.

4. Getting to know epoll

epoll inherits select's style of keeping the user-facing interface very concise. It is simple on the surface but not simplistic underneath; let's get a feel for it together.

4.1 Basic API and data structure of epoll

epoll mainly involves the epoll_data and epoll_event structures plus three APIs, as shown below:

// user data carrier
typedef union epoll_data {
   void    *ptr;
   int      fd;
   uint32_t u32;
   uint64_t u64;
} epoll_data_t;

// the carrier that loads an fd into the kernel
struct epoll_event {
     uint32_t     events;    /* Epoll events */
     epoll_data_t data;      /* User data variable */
};

// the three core APIs
int epoll_create(int size);
int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event);
int epoll_wait(int epfd, struct epoll_event *events,
                 int maxevents, int timeout);

4.2 The epoll three-stage rocket: API-level understanding

epoll_create

This call creates a set of epoll-related structures in the kernel and returns a handle (an fd) to user mode; all subsequent operations are based on this fd. The size parameter hints to the kernel how many fds will be monitored, similar to sizing an STL vector up front so that copying and expansion can be avoided; note that since Linux 2.6.8 the size argument is ignored by the kernel, although it must still be greater than zero.

epoll_ctl

This call adds, modifies, or removes an fd on the epfd returned by epoll_create. epoll_event is the structure exchanged between user mode and kernel mode; it defines which event types the user cares about and carries the epoll_data payload returned when an event fires.

epoll_wait

This call blocks waiting for the kernel to return read/write events. epfd is the value returned by epoll_create; events is a pointer to an array of epoll_event structures used to store all the pending epoll_event structures returned by the kernel; maxevents tells the kernel the maximum number of events to return this time and must match the size of the array pointed to by events.

One data structure runs through all three APIs: epoll_event. It can be regarded as the user-mode spokesperson of each monitored fd, and subsequent operations on an fd by the user program are based on this structure.

4.3 A plain-language explanation of the epoll three-stage rocket

Maybe the description above is a bit abstract, so here is an everyday example to help understanding:

epoll_create scene

In the first week of university you, as the class monitor, need to help all your classmates get various items handled. You tell the staff at the student office that you are the monitor of class xx of college xx; the staff confirms your identity and gives you a voucher that you will need for everything that follows (that is, calling epoll_create to ask the kernel for the epfd structure, and the kernel returns the epfd handle for you to use).

epoll_ctl scene

You take the voucher and start working in the service hall. The staff at the sorting desk says: "Monitor, please register every student's handbook and what needs to be done for each of them." So the monitor writes the corresponding request into each student's handbook: Li Ming needs laboratory access enabled, Sun Daxiong needs a swimming card... The monitor writes them all down in one go and hands them to the staff (that is, telling the kernel which fds need which operations).

epoll_wait scene

You take the voucher and wait in front of the pickup desk. The loudspeaker keeps announcing: "Monitor of class xx, Sun Daxiong's swimming card is ready, please pick it up; Li Ming's laboratory access card is ready, please pick it up..." For the classmates whose items are not yet done, the monitor can only keep waiting (that is, calling epoll_wait to wait for the kernel to report readable/writable events and then handling them).

4.4 epoll official demo

You can see the official demo via man epoll. It is only about 50 lines, but packed with substance:

#define MAX_EVENTS 10
struct epoll_event ev, events[MAX_EVENTS];
struct sockaddr_in local;
socklen_t addrlen = sizeof(local);
int listen_sock, conn_sock, nfds, epollfd, n;

/* Set up listening socket, 'listen_sock' (socket(),
   bind(), listen()) */

epollfd = epoll_create(10);
if (epollfd == -1) {
    perror("epoll_create");
    exit(EXIT_FAILURE);
}

ev.events = EPOLLIN;
ev.data.fd = listen_sock;
if (epoll_ctl(epollfd, EPOLL_CTL_ADD, listen_sock, &ev) == -1) {
    perror("epoll_ctl: listen_sock");
    exit(EXIT_FAILURE);
}

for (;;) {
    nfds = epoll_wait(epollfd, events, MAX_EVENTS, -1);
    if (nfds == -1) {
        perror("epoll_wait");
        exit(EXIT_FAILURE);
    }

    for (n = 0; n < nfds; ++n) {
        if (events[n].data.fd == listen_sock) {
            // new connection on the main listening socket
            conn_sock = accept(listen_sock,
                               (struct sockaddr *) &local, &addrlen);
            if (conn_sock == -1) {
                perror("accept");
                exit(EXIT_FAILURE);
            }
            setnonblocking(conn_sock);
            ev.events = EPOLLIN | EPOLLET;
            ev.data.fd = conn_sock;
            if (epoll_ctl(epollfd, EPOLL_CTL_ADD, conn_sock,
                          &ev) == -1) {
                perror("epoll_ctl: conn_sock");
                exit(EXIT_FAILURE);
            }
        } else {
            // a readable/writable fd on an established connection
            do_use_fd(events[n].data.fd);
        }
    }
}

Pay special attention: inside the epoll_wait loop you must distinguish whether the event is a new connection on the main listening fd or a read/write request on an already established connection, and handle the two cases separately.

5. Low-level details of epoll

The bottom layer of epoll is built around two key data structures: epitem and eventpoll.

You can roughly think of epitem as corresponding to each fd whose IO is being monitored from user mode, while eventpoll corresponds to the epoll instance created by epoll_create and manages all the monitored fds. Let's look at epoll's data structures from the part to the whole, from the inside out.

5.1 The underlying data structure

Red-black tree node definition:

#ifndef  _LINUX_RBTREE_H
#define  _LINUX_RBTREE_H
#include <linux/kernel.h>
#include <linux/stddef.h>
#include <linux/rcupdate.h>

struct rb_node {
  unsigned long  __rb_parent_color;
  struct rb_node *rb_right;
  struct rb_node *rb_left;
} __attribute__((aligned(sizeof(long))));
/* The alignment might seem pointless, but allegedly CRIS needs it */

struct rb_root {
  struct rb_node *rb_node;
};

Epitem definition:

struct epitem {
  struct rb_node  rbn;
  struct list_head  rdllink;
  struct epitem  *next;
  struct epoll_filefd  ffd;
  int  nwait;
  struct list_head  pwqlist;
  struct eventpoll  *ep;
  struct list_head  fllink;
  struct epoll_event  event;
};

Eventpoll definition:

struct eventpoll {
  spinlock_t        lock;
  struct mutex      mtx;
  wait_queue_head_t     wq;
  wait_queue_head_t   poll_wait;
  struct list_head    rdllist;   // ready list
  struct rb_root      rbr;       // red-black tree root node
  struct epitem      *ovflist;
};

5.2 The underlying calling process

epoll_create creates an object of type struct eventpoll and returns a file descriptor corresponding to it. From then on the application uses this file descriptor whenever it uses epoll; internally, epoll uses the file descriptor to look up the eventpoll object and then performs the corresponding operation, bridging user mode and kernel mode.

The bottom layer of epoll_ctl (for the add operation) calls ep_insert() to:

create and initialize a struct epitem object and associate it with the monitored event and the epoll object eventpoll;
add the struct epitem object to the red-black tree of the eventpoll object for management;
add the struct epitem object to the wait queue of the target file corresponding to the monitored event and register the callback invoked when the event becomes ready; in epoll this callback is ep_poll_callback();
ovflist is used for transient handling: when ep_poll_callback() runs and finds that the ovflist member of eventpoll is not EP_UNACTIVE_PTR, it means rdllist is currently being scanned, so the epitem that just became ready is temporarily stored on ovflist; once the scan of rdllist finishes, the elements on ovflist are moved onto rdllist.

(Figure: the relationship between the red-black tree, the doubly linked ready list, and epitem.)

5.3 Confusing data copy

A widely circulated but false view:

"When epoll_wait returns, for ready events epoll uses shared memory, i.e. both user mode and kernel mode point to the ready list, so memory copy overhead is avoided."

Shared memory? It does not exist!

The claim that epoll_wait uses shared memory to accelerate the exchange between user mode and kernel mode and avoid memory copies is not supported by the 2.6 kernel source. The copy is actually implemented like this:

revents = ep_item_poll(epi, &pt); // get the ready events
if (revents) {
  if (__put_user(revents, &uevent->events) ||
      __put_user(epi->event.data, &uevent->data)) {
    list_add(&epi->rdllink, head); // on failure, put it back on the list
    ep_pm_stay_awake(epi);
    return eventcnt ? eventcnt : -EFAULT;
  }
  eventcnt++;
  uevent++;
  if (epi->event.events & EPOLLONESHOT)
    epi->event.events &= EP_PRIVATE_BITS; // handling of the EPOLLONESHOT flag
  else if (!(epi->event.events & EPOLLET)) {
    list_add_tail(&epi->rdllink, &ep->rdllist); // LT mode handling
    ep_pm_stay_awake(epi);
  }
}

6. LT mode and ET mode

epoll's two trigger modes give users room to optimize and are also a key point to understand.

6.1 Simple understanding of LT/ET

LT (level-triggered) mode is the default. LT supports both blocking and non-blocking sockets; ET (edge-triggered) mode only supports non-blocking sockets. ET is more efficient than LT, while LT is safer.

In both LT and ET modes, events are obtained through epoll_wait. In LT mode, after an event has been delivered to the user program, if it is not processed, or not fully processed, it will be reported again on the next call; in other words, you keep being reminded, so data is not easily missed.

In ET mode, if an event is not processed, or not fully processed, the user program will not be notified again. This avoids repeated reminders but places stricter requirements on how the user program reads and writes.

6.2 In-depth understanding of LT/ET

The explanation above appears in almost every article on the Internet, yet actually using LT and ET correctly is still not easy.

6.2.1 LT read and write operations

In LT mode, reads are relatively simple: when there is a read event, just read; reading more or less in one go is not a problem. Writes are not so easy. Generally speaking, when a socket is idle its send buffer is almost never full, so if the fd stays registered for write events it will keep being notified as writable, which is annoying.

Therefore you must make sure that when there is no data to send, the write-event registration for the fd is removed from epoll, and added back when it is needed again, and so on.

There is no free lunch: being reminded all the time is not cost-free. The price of LT's over-eager write notifications is that users must add and remove the write interest as they go, otherwise they will keep being notified of writable events; see the sketch below.
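The following is a minimal sketch (not from the original article) of how an LT-mode server can avoid endless writable notifications: register EPOLLOUT only while there is unsent data, and drop it again once the application-level output buffer is drained. epollfd and fd are assumed to be an existing epoll instance and a connected non-blocking socket.

#include <sys/epoll.h>

/* enable = 1: also watch for writable events; enable = 0: read events only */
static int want_write(int epollfd, int fd, int enable)
{
    struct epoll_event ev;
    ev.events  = EPOLLIN | (enable ? EPOLLOUT : 0);  /* keep read interest */
    ev.data.fd = fd;
    /* modify the existing registration in place */
    return epoll_ctl(epollfd, EPOLL_CTL_MOD, fd, &ev);
}

Call want_write(epollfd, fd, 1) when send() could not push everything out, and want_write(epollfd, fd, 0) once the pending data has all been written.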

6.2.2 ET read and write operations

If an fd is readable, a readable event is returned. If the developer does not read all the data, epoll will not report the read event again; in other words, if you do not drain the data, epoll will not notify you about that socket's readability again. In practice this simply means you keep reading until read returns EAGAIN.

If the send buffer is not full, epoll reports a writable event. It will not report another writable event until the developer has filled the send buffer and it later changes from full back to not full.

In ET mode, notifications happen only when the state of the socket changes: a read event is reported when the receive buffer goes from empty to having data, and a write event is reported when the send buffer goes from full to not full.
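Under these rules, an ET-mode read handler usually loops until the kernel reports EAGAIN. Here is a minimal sketch (an assumption of this article's setting: fd is a non-blocking socket registered with EPOLLIN | EPOLLET; handing the bytes to the application is omitted):

#include <errno.h>
#include <unistd.h>

static void drain_fd(int fd)
{
    char buf[4096];
    for (;;) {
        ssize_t n = read(fd, buf, sizeof(buf));
        if (n > 0)
            continue;            /* hand buf[0..n) to the application here */
        if (n == 0) {
            close(fd);           /* peer closed the connection */
            return;
        }
        if (errno == EINTR)
            continue;            /* interrupted by a signal, retry */
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            return;              /* buffer drained: wait for the next event */
        close(fd);               /* a real error occurred */
        return;
    }
}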

6.2.3 A Tencent interview question

Things may feel a bit tricky by now, so let's look at an interview question:
"Using the LT (level-triggered) mode of the Linux epoll model, when a socket is writable it keeps triggering writable events continuously. How do you deal with it?"
It is indeed a very good question! Let's analyze it and appreciate its deeper meaning.
The question probes LT and ET in depth and confirms the LT write problem mentioned above.

Common practice:
When you need to write data to a socket, add the socket to epoll and wait for a writable event. After receiving the writable event, call write or send to send the data. When all the data has been written, remove the socket descriptor from the epoll set. This approach requires repeated additions and deletions.

Improved method (sketched below):
When writing data to the socket, call send directly. Only when send returns the error code EAGAIN is the socket added to epoll; the remaining data is sent after the writable event arrives, and once all data has been sent the socket is removed from the epoll set again. The improvement amounts to assuming the socket is writable most of the time, and only asking epoll to watch it when it is not.

Both methods patch around the frequent writable notifications of LT mode. In essence, ET mode handles this directly and needs no such user-level workaround.
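A hedged sketch of the improved method follows: try send() first, and only register for EPOLLOUT when the kernel send buffer is full (EAGAIN). The helper queue_pending() is hypothetical; it stores the unsent tail in a user-space buffer to be flushed when the writable event arrives.

#include <errno.h>
#include <sys/epoll.h>
#include <sys/socket.h>

void queue_pending(int fd, const char *data, size_t len);  /* hypothetical */

void try_send(int epollfd, int fd, const char *data, size_t len)
{
    ssize_t n = send(fd, data, len, 0);
    if (n == (ssize_t)len)
        return;                              /* all sent, nothing to watch */

    size_t sent = (n > 0) ? (size_t)n : 0;
    if (n >= 0 || errno == EAGAIN || errno == EWOULDBLOCK) {
        queue_pending(fd, data + sent, len - sent);
        struct epoll_event ev;
        ev.events  = EPOLLIN | EPOLLOUT;     /* now ask for writable events */
        ev.data.fd = fd;
        epoll_ctl(epollfd, EPOLL_CTL_MOD, fd, &ev);
    }
    /* other errno values: treat as a real error and close the connection */
}

When EPOLLOUT later fires, flush the queued data; once the queue is empty, switch the registration back to EPOLLIN only.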

6.2.4 Thread starvation in ET mode

If one socket keeps receiving a great deal of data, then while we try to read all of it, other sockets may go unserved, causing a starvation problem.

Solution:
Maintain a queue of ready descriptors so the program knows which descriptors have data ready but not yet fully read, and then read them in rounds, a bounded amount at a time, removing a descriptor only once it has been drained, until the queue is empty. This ensures every fd gets read and no data is lost.

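A rough sketch of this idea follows (all names here are illustrative, not from the article): epoll_wait marks fds as ready, and the program then serves them in bounded slices, round-robin, keeping the undrained ones in the queue. The helper read_slice() is hypothetical: it reads at most quota bytes from the fd and returns 1 once the fd reports EAGAIN (fully drained), 0 otherwise.

#include <stddef.h>

#define MAX_READY   1024
#define READ_QUOTA  (16 * 1024)   /* read at most this much per fd per round */

int read_slice(int fd, size_t quota);   /* hypothetical helper, see above */

static int ready[MAX_READY];
static int nready = 0;

/* called from the epoll_wait loop whenever EPOLLIN fires on fd */
void mark_ready(int fd)
{
    if (nready < MAX_READY)
        ready[nready++] = fd;
}

/* serve every ready fd a bounded slice until all of them are drained */
void serve_ready(void)
{
    while (nready > 0) {
        int kept = 0;
        for (int i = 0; i < nready; i++) {
            int fd = ready[i];
            if (!read_slice(fd, READ_QUOTA))
                ready[kept++] = fd;   /* not drained yet: keep for next pass */
        }
        nready = kept;
    }
}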

6.2.5 EPOLLONESHOT settings

Suppose thread A reads data from a socket and starts processing it; meanwhile new data arrives on the same socket and thread B is woken up to read it. The result is two threads operating on one socket at the same time. EPOLLONESHOT guarantees that a socket connection is handled by only one thread at any moment.
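A minimal sketch of EPOLLONESHOT usage (epollfd and conn_sock are assumed to exist already): with EPOLLONESHOT the fd is reported at most once and then disabled, so after a worker thread finishes with it the registration must be re-armed with EPOLL_CTL_MOD.

#include <sys/epoll.h>

/* op is EPOLL_CTL_ADD for the initial registration,
   EPOLL_CTL_MOD to re-arm the fd after one thread has handled it */
void arm_oneshot(int epollfd, int conn_sock, int op)
{
    struct epoll_event ev;
    ev.events  = EPOLLIN | EPOLLET | EPOLLONESHOT;
    ev.data.fd = conn_sock;
    epoll_ctl(epollfd, op, conn_sock, &ev);
}

Register once with arm_oneshot(epollfd, conn_sock, EPOLL_CTL_ADD); in the worker thread, after the data has been fully handled, call arm_oneshot(epollfd, conn_sock, EPOLL_CTL_MOD) to re-arm it.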

6.2.6 Selection of LT and ET

From the comparison above, LT mode is safer and the resulting code is clearer, while ET mode is the high-speed mode that shines in large, high-concurrency scenarios when used properly. The choice depends on your actual needs and your team's coding ability.

There is a discussion on Zhihu comparing ET and LT in which many experts have weighed in; check it out if you are interested.

7. The epoll thundering herd problem

If you don't know what the thundering herd effect is, imagine this:
you feed pigeons in a square and throw only one piece of food, but a whole flock of pigeons fights over it; in the end only one pigeon gets the food, and for all the other pigeons the effort was wasted.

The same phenomenon exists in network programming.

The thundering herd problem of accept was solved in the 2.6.18 kernel, but the problem still exists for epoll. It shows up when multiple processes/threads block in epoll_wait: when the kernel triggers a read/write event, all of them are woken up, but in fact only one process/thread actually handles the event.

Before epoll officially addressed this problem, Nginx, as a well-known user, used a global lock to limit how many processes could monitor the listening fd at a time, so that only one process listened at any moment. Later, the SO_REUSEPORT option was added in Linux kernel 3.9, providing kernel-level load balancing, and Nginx 1.9.1 added support for the reuseport feature to mitigate the thundering herd problem.

EPOLLEXCLUSIVE is an epoll flag added in the Linux 4.5 kernel in 2016; Nginx added the NGX_EXCLUSIVE_EVENT option in 1.11.3 to support it. The EPOLLEXCLUSIVE flag ensures that only one of the waiters is woken up when an event occurs, avoiding the thundering herd caused by multiple listeners.
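A minimal sketch (requires Linux 4.5 or newer; epollfd and listen_sock are assumed to exist) of registering a shared listening socket with EPOLLEXCLUSIVE so that only one of the blocked epoll waiters is woken per incoming connection; note the flag is only accepted with EPOLL_CTL_ADD.

#include <stdio.h>
#include <sys/epoll.h>

void add_exclusive(int epollfd, int listen_sock)
{
    struct epoll_event ev;
    ev.events  = EPOLLIN | EPOLLEXCLUSIVE;
    ev.data.fd = listen_sock;
    if (epoll_ctl(epollfd, EPOLL_CTL_ADD, listen_sock, &ev) == -1)
        perror("epoll_ctl: EPOLLEXCLUSIVE");
}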

Origin: blog.csdn.net/lingshengxueyuan/article/details/111314869