Meituan Second Interview: Why is epoll's performance so high?

Before we begin

In the reader exchange groups (50+) of Nien, a 40-year-old veteran architect, some readers recently landed interviews at first-tier Internet companies such as Meituan, Pinduoduo, Jitu, Youzan, and Xiyin. One important set of interview questions keeps coming up:

  • Talk about epoll's data structures
  • Talk about epoll's implementation principle
  • How does the protocol stack communicate with epoll?
  • How does epoll handle locking for thread safety?
  • Talk about the implementation of ET and LT
  • ……

Here, Nien gives you a systematic walkthrough, so you can fully show off your "technical muscles" and leave the interviewer so impressed they can hardly contain themselves.

This question and its reference answer are also included in the V88 edition of our "Nien Java Interview Collection" for later readers, to help everyone improve their "3-high" (high concurrency, high performance, high availability) architecture, design, and development skills.

For the latest PDFs of "Nien's Architecture Notes", "Nien's High Concurrency Trilogy", and "Nien's Java Interview Collection", visit the official account [Technical Freedom Circle].

The data structure of epoll

Choosing the data structures

epoll needs at least two collections:

  • the total set of all monitored fds
  • the set of ready fds

So what data structure is used to store this total set?

We know that each fd corresponds, underneath, to a TCB (TCP Control Block). In other words key=fd, val=TCB: a typical key-value shape. For key-value storage, three structures are the usual candidates:

  1. hash table
  2. red-black tree
  3. B/B+ tree

If a hash table is used, the advantage is very fast lookup: O(1).

But when epoll_create() is called, how large should the hash table's underlying array be?

If we must manage millions of fds, the bigger the array the better; but if we only ever manage a dozen fds, a big array wastes space at creation time. Since we cannot know the number of fds in advance, a hash table is not suitable.

A B/B+ tree is a multiway tree: one node stores multiple keys, mainly to reduce tree height, which matters for disk indexing. That does not help our in-memory scenario, so it is not suitable either.

For in-memory indexing, the red-black tree is generally the first choice. First, its lookup is fast: O(log N). Second, epoll_create() only needs to create a tree root, wasting no extra space.

What about the ready set? It is not organized around lookup: its job is to copy its elements out to user space for processing, and the elements have no priority among themselves. A linear structure therefore suffices: a queue. First in, first out; first ready, first processed.

Total set of all fds  ----->  red-black tree

Set of ready fds  ----->  queue

The relationship between the red-black tree and the ready queue

A red-black tree node and a ready-queue node are one and the same node: "joining the ready queue" just means linking the node's forward and backward pointers into the list. So when an fd becomes ready, the node is not deleted from the tree and then added to the queue; it is the same node, and no deletion is needed. The structs below, from a userspace epoll implementation, show both linkages living in one epitem.

struct epitem {
    RB_ENTRY(epitem) rbn;       // red-black tree linkage (sys/tree.h)
    LIST_ENTRY(epitem) rdlink;  // ready-queue linkage (sys/queue.h)
    int rdy;                    // 1 if the node already sits on the ready list

    int sockfd;                 // the fd this node tracks
    struct epoll_event event;   // registered/triggered events for this fd
};
struct eventpoll {
    ep_rb_tree rbr;              // root of the red-black tree (total set)
    int rbcnt;                   // node count of the tree

    LIST_HEAD(, epitem) rdlist;  // head of the ready queue
    int rdnum;                   // node count of the ready queue

    int waiting;                 // is some thread blocked in epoll_wait()?

    pthread_mutex_t mtx;         // protects red-black tree updates
    pthread_spinlock_t lock;     // protects ready-list updates
    pthread_cond_t cond;         // lets epoll_wait() block for an event
    pthread_mutex_t cdmtx;       // mutex paired with cond
};
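
To see why epoll_create() wastes no space, here is a minimal sketch of a constructor for the eventpoll above. It assumes the BSD sys/tree.h and sys/queue.h macros that the structs use; ep_create is a hypothetical name for illustration, not a real API.

#include <stdlib.h>
#include <pthread.h>

// Hypothetical constructor: creating an epoll instance is just allocating
// one object and initializing an empty tree root and an empty list head.
// No array sizing decision is needed, unlike a hash table.
struct eventpoll *ep_create(void)
{
    struct eventpoll *ep = calloc(1, sizeof(*ep));
    if (ep == NULL) return NULL;

    RB_INIT(&ep->rbr);        // empty red-black tree: a null root, O(1)
    LIST_INIT(&ep->rdlist);   // empty ready queue: a null head, O(1)

    pthread_mutex_init(&ep->mtx, NULL);
    pthread_spin_init(&ep->lock, PTHREAD_PROCESS_PRIVATE);
    pthread_cond_init(&ep->cond, NULL);
    pthread_mutex_init(&ep->cdmtx, NULL);
    return ep;
}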

How does the protocol stack communicate with the epoll module?

Working environment of epoll

An application can only operate epoll through its three API calls. When an io becomes ready, how does epoll find out? The protocol stack parses the incoming data and triggers a callback to notify epoll. So epoll's working environment has three parts: the application's API on the left, epoll in the middle, and the protocol stack's callbacks on the right. (Strictly speaking, the protocol stack cannot operate epoll directly; the VFS layer sits in between, but it is not the focus here, so we skip it.)

When does the protocol stack trigger a callback to notify epoll?

Sockets come in two kinds: the listening socket (listenfd) and client sockets (clientfd). For a socket fd we generally care most about the two events EPOLLIN and EPOLLOUT. For a listenfd, the usual reaction is accept; for a clientfd, recv when readable and send when writable, as the sketch below illustrates.
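
A minimal sketch of this convention using the standard Linux epoll API; handle_event is a hypothetical helper, and epfd/listen_fd are assumed to be set up elsewhere.

#include <sys/epoll.h>
#include <sys/socket.h>

// Hypothetical dispatch helper: accept on a readable listenfd,
// recv/send on a readable/writable clientfd.
void handle_event(int epfd, int listen_fd, struct epoll_event *e)
{
    if (e->data.fd == listen_fd) {
        // a readable listenfd means a completed connection is waiting
        int cfd = accept(listen_fd, NULL, NULL);
        struct epoll_event ev = { .events = EPOLLIN };
        ev.data.fd = cfd;
        epoll_ctl(epfd, EPOLL_CTL_ADD, cfd, &ev);
    } else if (e->events & EPOLLIN) {
        // a readable clientfd: recv (len == 0 means the peer closed)
        char buf[4096];
        ssize_t len = recv(e->data.fd, buf, sizeof(buf), 0);
        (void)len;  // ... process buf, or close(e->data.fd) on len == 0 ...
    } else if (e->events & EPOLLOUT) {
        // a writable clientfd: send any pending data here
    }
}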

The protocol stack parses the data and triggers a callback to notify epoll, but how does epoll know which io is ready? From the IP header we can extract the source IP, destination IP, and protocol; from the TCP header, the source and destination ports. Together they form the five-tuple <source IP, source port, destination IP, destination port, protocol>, and an fd corresponds to exactly such a five-tuple. Knowing the fd, we can find the corresponding node in the red-black tree.

So what does this callback function do? It receives two parameters, the fd and the specific event, and then performs two operations (a sketch follows this list):

  1. find the corresponding node via the fd;
  2. link that node into the ready queue.
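
A minimal sketch of such a callback, built on the userspace structs above; epoll_event_callback and rbtree_find_fd are hypothetical names for illustration, not kernel functions.

#include <stdint.h>

// Hypothetical protocol-stack callback: find the node by fd, link it
// into the ready queue, and wake any thread blocked in epoll_wait().
static void epoll_event_callback(struct eventpoll *ep, int sockfd, uint32_t events)
{
    // 1. find the node for this fd in the red-black tree
    struct epitem *epi = rbtree_find_fd(&ep->rbr, sockfd);  // hypothetical lookup
    if (epi == NULL) return;  // fd was never registered

    // 2. link the node into the ready queue (same node, no tree delete)
    pthread_spin_lock(&ep->lock);
    if (!epi->rdy) {
        LIST_INSERT_HEAD(&ep->rdlist, epi, rdlink);
        epi->rdy = 1;
        ep->rdnum++;
    }
    epi->event.events |= events;  // record which event(s) fired
    pthread_spin_unlock(&ep->lock);

    // wake a thread blocked in epoll_wait()
    pthread_mutex_lock(&ep->cdmtx);
    if (ep->waiting) pthread_cond_signal(&ep->cond);
    pthread_mutex_unlock(&ep->cdmtx);
}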

The protocol stack triggers this callback at five points:

1. After the three-way handshake completes, the protocol stack adds a TCB node to the full-connection (accept) queue and triggers the callback to notify epoll of an EPOLLIN event on the listenfd.

2. When the client sends a data packet, the protocol stack replies with an ACK after receiving it, then triggers the callback to notify epoll of an EPOLLIN event.

3. Each connection's TCB contains a sendbuf. When the peer receives data and returns an ACK, the acknowledged bytes can be cleared from the sendbuf, freeing space in it; at that moment a callback is triggered to notify epoll of an EPOLLOUT event.

4. When the peer closes the connection, the protocol stack replies with an ACK after receiving the FIN, then calls the callback to notify epoll of an EPOLLIN event.

5. When an RST flag is received, the callback is likewise triggered after the ACK is sent, notifying epoll of an EPOLLERR event.

Summary: notification timing

Five notification points in total:

  1. after the three-way handshake completes;
  2. after receiving data and replying with an ACK;
  3. after sending data and receiving an ACK;
  4. after receiving a FIN and replying with an ACK;
  5. after receiving an RST and replying with an ACK.

The difference between epoll and select/poll, seen from the callback mechanism

Since select and poll have no essential difference, both are referred to as poll below.

// poll is similar to select; in effect, poll merges select's three fd sets into a single set.

int select(int nfds, fd_set *readfds, fd_set *writefds, fd_set *exceptfds, struct timeval *timeout);

int poll(struct pollfd *fds, nfds_t nfds, int timeout);

Notice that every call to poll copies the whole fd set into kernel space, and after detection it is copied back to user space. epoll is different: whenever there is a new io, epoll_ctl() adds it to the red-black tree once; when events fire, epoll_wait() brings out only the nodes that have events. That is the first difference: poll always copies the total set. With a million fds of which only two or three are ready, that is an enormous waste; epoll copies only what actually needs copying.
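
To make the contrast concrete, here is a minimal event-loop sketch with the standard Linux epoll API: the fd is registered once, and only ready events come back out. listen_fd is assumed to be an already-listening socket.

#include <sys/epoll.h>

// Register-once model: the fd set lives in the kernel's red-black tree;
// epoll_wait() copies out only the fds that actually have events.
void event_loop(int listen_fd)
{
    int epfd = epoll_create(1);                      // create the tree root

    struct epoll_event ev = { .events = EPOLLIN };
    ev.data.fd = listen_fd;
    epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);  // copied in ONCE

    struct epoll_event events[1024];
    for (;;) {
        // unlike poll(), no total-set copy happens on each call
        int n = epoll_wait(epfd, events, 1024, -1);
        for (int i = 0; i < n; i++) {
            // ... handle events[i].data.fd (accept/recv/send) ...
        }
    }
}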

The second difference: as shown above, epoll events are added to the ready queue by protocol-stack callbacks. What about poll? How does the kernel detect that poll's io is ready? Only by traversing all fds, so poll's traversal-based detection is comparatively slow.

So, the two differences:

1. select/poll must copy the total fd set into the kernel on every call; epoll does not.

2. In implementation, select/poll must loop over the total set to check readiness; epoll links a node into the ready queue the moment it becomes ready.

Note: poll is not necessarily slower than epoll. With a small number of ios, poll can be faster; with a large number, epoll clearly dominates. Exactly where the cutoff lies is hard to say; 500 or 1024 fds are commonly quoted figures (a rough guess, nothing more).

How epoll handles locking for thread safety

What do the 3 APIs do

epoll_create() ===> creates the root node of the red-black tree

epoll_ctl() ===> add / del / mod: inserts, deletes, or modifies nodes

epoll_wait() ===> copies the ready queue's nodes into the user-space events array, much like recv

Analyzing the locks

If 3 threads operate on epoll at the same time, where is locking needed? User level exposes only 3 APIs:

If epoll_create() is called simultaneously by all three, three separate red-black trees get created; no shared resource is contended, so no lock is needed.

If epoll_ctl() is called simultaneously to add, delete, or modify nodes of the same red-black tree, resources are contended and locking is required. Here the entire tree is locked.

If epoll_wait() is called at the same time, it operates on the ready queue, so the ready queue needs to be locked.

We also need to keep epoll's working environment in mind: while the application calls epoll_ctl(), might the protocol stack's callback be operating on the red-black tree nodes? While epoll_wait() is copying data out, might the protocol stack be adding nodes to the ready queue? In summary:

epoll_ctl()   locks the red-black tree
epoll_wait()  locks the ready queue
callback()    locks the red-black tree and the ready queue

So what locks are added to the red-black tree, and what locks are added to the ready queue?

The red-black tree can hold many nodes and an update may take a while, so a mutex is used to lock it.

The ready queue follows the producer-consumer pattern: nodes are produced by the protocol stack's callback and consumed by epoll_wait(). For the queue, a spin lock is used: inserting into and removing from a list is so quick that the CPU cost of briefly spinning is lower than the cost of yielding. A sketch of how the locks cooperate follows.
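
A minimal sketch of these locks inside a userspace epoll_wait(), assuming the eventpoll struct above; ep_wait_sketch is a hypothetical name, and timeout handling is omitted.

// Hypothetical wait: block on cond/cdmtx until the callback signals,
// then drain the ready queue under the spin lock.
int ep_wait_sketch(struct eventpoll *ep, struct epoll_event *events, int maxevents)
{
    pthread_mutex_lock(&ep->cdmtx);
    while (ep->rdnum == 0) {          // nothing ready: block for an event
        ep->waiting = 1;
        pthread_cond_wait(&ep->cond, &ep->cdmtx);
        ep->waiting = 0;
    }
    pthread_mutex_unlock(&ep->cdmtx);

    // drain under the spin lock: list insert/remove is so cheap that
    // spinning costs less than putting the thread to sleep
    int n = 0;
    pthread_spin_lock(&ep->lock);
    while (!LIST_EMPTY(&ep->rdlist) && n < maxevents) {
        struct epitem *epi = LIST_FIRST(&ep->rdlist);
        LIST_REMOVE(epi, rdlink);
        epi->rdy = 0;
        ep->rdnum--;
        events[n++] = epi->event;     // copy the event out to the caller
    }
    pthread_spin_unlock(&ep->lock);
    return n;
}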

How ET and LT are implemented

ET (edge-triggered): the event fires only once.

LT (level-triggered): the event keeps firing as long as the data has not been fully read.

How does the code achieve the ET and LT behaviors? Level trigger and edge trigger were not deliberately designed in; they fall out of the mechanism naturally, and only a tiny code change separates them. ET invokes the callback once, at the moment data arrives from the protocol stack. LT invokes the callback whenever it detects that data still sits in the recvbuf. So ET and LT differ simply in how many times the callback gets invoked.

So how is this implemented? Triggering the callback inside the protocol-stack processing naturally matches ET's fire-once behavior. For LT: after a recv, if data remains in the buffer, re-add the node to the ready queue; after a send, if space remains in the buffer, re-add the node to the ready queue. That is all LT takes; see the sketch after the reference below.

(For the real implementation, see the ep_send_events function in linux-2.6.24/fs/eventpoll.c.)
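
A minimal sketch of the LT re-add idea, placed inside the drain loop of ep_wait_sketch() above; still_readable and still_writable are hypothetical checks standing in for "data left in recvbuf" and "space left in sendbuf".

// After copying the event out, LT puts the node straight back on the
// ready queue if the condition still holds, so the next epoll_wait()
// reports it again. ET skips this, so the event fires only once.
events[n++] = epi->event;
if (!(epi->event.events & EPOLLET)) {                  // LT mode only
    int again = 0;
    if ((epi->event.events & EPOLLIN) && still_readable(epi->sockfd))
        again = 1;                                     // recvbuf not empty
    if ((epi->event.events & EPOLLOUT) && still_writable(epi->sockfd))
        again = 1;                                     // sendbuf has space
    if (again) {
        LIST_INSERT_HEAD(&ep->rdlist, epi, rdlink);    // re-add: fires again
        epi->rdy = 1;
        ep->rdnum++;
    }
}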

In closing

Linux-related questions are very common in interviews.

If you can answer the above fluently and confidently, the interviewer will be impressed, even shocked, by your depth.

By the end, the interviewer will be so won over that the offer follows naturally.

If you have questions during your study, come and discuss them with Nien, the 40-year-old architect.

Recommended related reading

" Didi is too ruthless: Distributed ID, how to achieve 1000Wqps?" "

" Tencent is too ruthless: 4 billion QQ accounts, given 1G memory, how to deduplicate? "

" Starting from 0, Handwriting Redis "

" Starting from 0, Handwriting MySQL Transaction ManagerTM "

" Starting from 0, Handwriting MySQL Data Manager DM "

"Nin's Architecture Notes", "Nin's High Concurrency Trilogy", "Nin's Java Interview Collection" PDF, please go to the following official account [Technical Freedom Circle] to take it↓↓↓

Origin: blog.csdn.net/crazymakercircle/article/details/131887110