As a C++ back-end developer, you should thoroughly understand how epoll is implemented

epoll is an I/O multiplexing mechanism specific to the Linux platform. Compared with the traditional select, epoll offers a large performance improvement. This article focuses on how epoll is implemented; for how to use epoll, refer to related books or articles.

The creation of epoll

To use epoll, you first need to call the epoll_create() function to create an epoll handle. The epoll_create() function is defined as follows:

int epoll_create(int size);

The size parameter is a historical leftover and no longer has any effect, although, as the code below shows, it must still be greater than zero. When a user program calls epoll_create(), execution enters kernel space and the sys_epoll_create() kernel function creates the epoll handle. The code of sys_epoll_create() is as follows:

asmlinkage long sys_epoll_create(int size)
{
    int error, fd = -1;
    struct eventpoll *ep;

    error = -EINVAL;
    if (size <= 0 || (error = ep_alloc(&ep)) < 0) {
        fd = error;
        goto error_return;
    }

    fd = anon_inode_getfd("[eventpoll]", &eventpoll_fops, ep);
    if (fd < 0)
        ep_free(ep);

error_return:
    return fd;
}

sys_epoll_create() mainly does two things:

  1. Call the ep_alloc() function to create and initialize an eventpoll object.
  2. Call the anon_inode_getfd() function to map the eventpoll object to a file handle and return the file handle.
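
The size check at the top of sys_epoll_create() can be observed from user space. The following is a minimal sketch, not part of the original article; try_epoll_create() is a hypothetical helper name introduced only for illustration. It reports the errno left behind by epoll_create() for a given size.

```c
#include <errno.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: returns 0 if epoll_create(size) succeeded,
 * otherwise the errno value that the call set. */
int try_epoll_create(int size)
{
    int epfd = epoll_create(size);

    if (epfd < 0)
        return errno;
    close(epfd); /* the epoll handle is an ordinary file descriptor */
    return 0;
}
```

With this helper, try_epoll_create(0) should yield EINVAL (the size <= 0 branch above), while any positive size should succeed regardless of its value.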

Let's take a look at the eventpoll object first. The eventpoll object is used to manage the list of files monitored by epoll. Its definition is as follows:

struct eventpoll {
    ...
    wait_queue_head_t wq;
    ...
    struct list_head rdllist;
    struct rb_root rbr;
    ...
};

Let me first explain the role of each member of the eventpoll object:

  1. wq: The wait queue. A process that calls epoll_wait() is added to the wq wait queue of the eventpoll object while it waits.
  2. rdllist: The list of files that are ready (the ready list).
  3. rbr: The root of the red-black tree that manages all monitored files.

The following figure shows the relationship between the eventpoll object and the monitored file:

 

Since each monitored file is managed by an epitem object, the nodes in the figure above are all epitem objects. Why use a red-black tree to manage the monitored files? So that the epitem object corresponding to a file handle can be found quickly. A red-black tree is a self-balancing binary search tree, so a lookup takes O(log n) time. If you are not familiar with it, refer to related documentation.
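
To make the lookup idea concrete, here is a small user-space sketch. It is not the kernel code: the real tree and its comparison function are not shown in this article, so the key layout (file pointer first, then fd) is an assumption based on the ep_find(ep, tfile, fd) call shown later, and a sorted array with binary search stands in for the red-black tree. All toy_ names are hypothetical.

```c
#include <stdint.h>
#include <stdlib.h>

/* Assumed lookup key: the monitored file plus its descriptor number. */
struct toy_key {
    const void *file;
    int fd;
};

/* Total order over keys: compare the file pointer first, then the fd. */
static int toy_key_cmp(const void *pa, const void *pb)
{
    const struct toy_key *a = pa, *b = pb;
    uintptr_t fa = (uintptr_t)a->file, fb = (uintptr_t)b->file;

    if (fa != fb)
        return fa < fb ? -1 : 1;
    return (a->fd > b->fd) - (a->fd < b->fd);
}

/* Sort the keys so that they can be searched in O(log n). */
void toy_sort(struct toy_key *keys, size_t n)
{
    qsort(keys, n, sizeof keys[0], toy_key_cmp);
}

/* Binary search over sorted keys; returns the index or -1 if absent. */
int toy_find(const struct toy_key *keys, size_t n, const void *file, int fd)
{
    struct toy_key want = { file, fd };
    const struct toy_key *hit =
        bsearch(&want, keys, n, sizeof keys[0], toy_key_cmp);

    return hit ? (int)(hit - keys) : -1;
}
```

A sorted array gives the same logarithmic lookup but makes insertion and deletion linear, which is the cost a balanced tree avoids when files are added and removed frequently.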

Add file handle to epoll

The previous section described how to create an epoll handle. This section describes how to add the files to be monitored to it.

The file to be monitored can be added to epoll by calling the epoll_ctl() function. The prototype is as follows:

int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event);

The following explains the role of each parameter:

  1. epfd: The file handle returned by calling the epoll_create() function.
  2. op: The operation to be performed. There are 3 options:
    1. EPOLL_CTL_ADD: Indicates that an add operation is to be performed.
    2. EPOLL_CTL_DEL: Indicates that a delete operation is to be performed.
    3. EPOLL_CTL_MOD: Indicates that a modify operation is to be performed.
  3. fd: The file handle to be monitored.
  4. event: Tells the kernel what to monitor. Its definition is as follows:
struct epoll_event {
    __uint32_t events;  /* Epoll events */
    epoll_data_t data;  /* User data variable */
};

events is a bitwise OR of the following macros:

  • EPOLLIN: indicates that the corresponding file handle can be read (including the normal closing of the peer socket);
  • EPOLLOUT: indicates that the corresponding file handle can be written;
  • EPOLLPRI: indicates that the corresponding file handle has urgent data to read;
  • EPOLLERR: indicates that the corresponding file handle has an error;
  • EPOLLHUP: indicates that the corresponding file handle has been hung up;
  • EPOLLET: uses edge-triggered notification for this file handle instead of the default level-triggered notification;
  • EPOLLONESHOT: reports the event only once. After the event has been delivered, the file handle stays registered but is disabled; to keep monitoring it, re-arm it by calling epoll_ctl() with EPOLL_CTL_MOD.

data is used to save user-defined data.
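
The difference between level-triggered and edge-triggered notification can be seen with a small experiment. The sketch below is not from the original article; second_wait_result() is a hypothetical helper that uses epoll_wait(), which is covered later in this article. It registers the read end of a pipe, writes one byte, and then polls twice without reading the byte.

```c
#include <stdint.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper. extra_flags is 0 for level-triggered or EPOLLET for
 * edge-triggered. Returns the result of the SECOND epoll_wait() call, or -1
 * if any setup step or the first wait does not behave as expected. */
int second_wait_result(uint32_t extra_flags)
{
    int pipefd[2], epfd, n = -1;
    struct epoll_event ev = {0}, got;

    if (pipe(pipefd) < 0)
        return -1;
    epfd = epoll_create(1);
    if (epfd < 0)
        goto close_pipe;

    ev.events = EPOLLIN | extra_flags;
    ev.data.fd = pipefd[0]; /* user data: handed back unchanged by epoll_wait() */
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, pipefd[0], &ev) < 0)
        goto close_all;
    if (write(pipefd[1], "x", 1) != 1)
        goto close_all;

    /* First poll: the byte just written makes the read end ready. */
    if (epoll_wait(epfd, &got, 1, 0) != 1 || got.data.fd != pipefd[0])
        goto close_all;
    /* Second poll: the byte is still unread and nothing new has arrived. */
    n = epoll_wait(epfd, &got, 1, 0);

close_all:
    close(epfd);
close_pipe:
    close(pipefd[0]);
    close(pipefd[1]);
    return n;
}
```

In level-triggered mode the second poll still reports the file because unread data remains. With EPOLLET the second poll reports nothing, because no new data has arrived since the first notification.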

The epoll_ctl() function will call the sys_epoll_ctl() kernel function. The implementation of the sys_epoll_ctl() kernel function is as follows:

asmlinkage long sys_epoll_ctl(int epfd, int op,
    int fd, struct epoll_event __user *event)
{
    ...
    file = fget(epfd);
    tfile = fget(fd);
    ...
    ep = file->private_data;

    mutex_lock(&ep->mtx);

    epi = ep_find(ep, tfile, fd);

    error = -EINVAL;
    switch (op) {
    case EPOLL_CTL_ADD:
        if (!epi) {
            epds.events |= POLLERR | POLLHUP;

            error = ep_insert(ep, &epds, tfile, fd);
        } else
            error = -EEXIST;
        break;
    ...
    }
    mutex_unlock(&ep->mtx);

    ...
    return error;
}

The sys_epoll_ctl() function performs different operations depending on the value of op. For example, when EPOLL_CTL_ADD is passed in and the file is not yet monitored (ep_find() finds no epitem), it calls ep_insert() to add the file; if the file is already monitored, it returns -EEXIST.
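
The -EEXIST branch above is visible from user space as errno EEXIST. The following sketch is not from the original article; ctl_errno() and ctl_sequence() are hypothetical helpers. The EPOLL_CTL_DEL branch is elided in the excerpt, so the ENOENT result for deleting an unregistered file is taken from the documented behavior of epoll_ctl(2), not from the code shown here.

```c
#include <errno.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: returns 0 on success, otherwise errno. */
static int ctl_errno(int epfd, int op, int fd)
{
    struct epoll_event ev = {0};

    ev.events = EPOLLIN;
    ev.data.fd = fd;
    return epoll_ctl(epfd, op, fd, &ev) == 0 ? 0 : errno;
}

/* Runs ADD, ADD, DEL, DEL on the read end of a pipe and stores the four
 * results. Returns 0 if the setup succeeded, -1 otherwise. */
int ctl_sequence(int results[4])
{
    int pipefd[2], epfd;

    if (pipe(pipefd) < 0)
        return -1;
    epfd = epoll_create(1);
    if (epfd < 0) {
        close(pipefd[0]);
        close(pipefd[1]);
        return -1;
    }

    results[0] = ctl_errno(epfd, EPOLL_CTL_ADD, pipefd[0]); /* first add */
    results[1] = ctl_errno(epfd, EPOLL_CTL_ADD, pipefd[0]); /* duplicate add */
    results[2] = ctl_errno(epfd, EPOLL_CTL_DEL, pipefd[0]); /* remove */
    results[3] = ctl_errno(epfd, EPOLL_CTL_DEL, pipefd[0]); /* remove again */

    close(epfd);
    close(pipefd[0]);
    close(pipefd[1]);
    return 0;
}
```

The expected results are 0, EEXIST, 0 and ENOENT, in that order.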

Let's continue to analyze the implementation of the add operation ep_insert() function:

static int ep_insert(struct eventpoll *ep, struct epoll_event *event,
             struct file *tfile, int fd)
{
    ...
    error = -ENOMEM;
    // Allocate an epitem object
    if (!(epi = kmem_cache_alloc(epi_cache, GFP_KERNEL)))
        goto error_return;

    // Initialize the epitem object
    INIT_LIST_HEAD(&epi->rdllink);
    INIT_LIST_HEAD(&epi->fllink);
    INIT_LIST_HEAD(&epi->pwqlist);
    epi->ep = ep;
    ep_set_ffd(&epi->ffd, tfile, fd);
    epi->event = *event;
    epi->nwait = 0;
    epi->next = EP_UNACTIVE_PTR;

    epq.epi = epi;
    // Equivalent to: epq.pt.qproc = ep_ptable_queue_proc
    init_poll_funcptr(&epq.pt, ep_ptable_queue_proc);

    // Call the poll interface of the monitored file.
    // Each file type implements this interface itself; for a TCP socket, the call ends up in tcp_poll().
    revents = tfile->f_op->poll(tfile, &epq.pt);
    ...
    ep_rbtree_insert(ep, epi); // Add the epitem object to epoll's red-black tree

    spin_lock_irqsave(&ep->lock, flags);

    // If the monitored file is already ready for the requested read/write events,
    // add it to epoll's ready list rdllist and wake up the process blocked in epoll_wait().
    if ((revents & event->events) && !ep_is_linked(&epi->rdllink)) {
        list_add_tail(&epi->rdllink, &ep->rdllist);

        if (waitqueue_active(&ep->wq))
            wake_up_locked(&ep->wq);
        if (waitqueue_active(&ep->poll_wait))
            pwake++;
    }

    spin_unlock_irqrestore(&ep->lock, flags);
    ...
    return 0;
    ...
}

Each monitored file is managed by an epitem object: the file is wrapped in an epitem object, which is then added to the red-black tree of the eventpoll object (this is the ep_rbtree_insert(ep, epi) call).
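
ep_insert() also checks whether the file is already ready at the moment it is added, and if so puts it on rdllist right away. A minimal user-space sketch of that effect follows; it is not from the original article, and ready_right_after_add() is a hypothetical helper. It makes a pipe readable before registering it and then polls with a zero timeout.

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: returns the number of ready files reported
 * immediately after EPOLL_CTL_ADD, or -1 on a setup failure. */
int ready_right_after_add(void)
{
    int pipefd[2], epfd, n = -1;
    struct epoll_event ev = {0}, got;

    if (pipe(pipefd) < 0)
        return -1;
    epfd = epoll_create(1);
    if (epfd < 0)
        goto close_pipe;

    /* Make the read end readable BEFORE it is registered. */
    if (write(pipefd[1], "x", 1) != 1)
        goto close_all;

    ev.events = EPOLLIN;
    ev.data.fd = pipefd[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, pipefd[0], &ev) < 0)
        goto close_all;

    /* Timeout 0: do not block, only report what is already ready. */
    n = epoll_wait(epfd, &got, 1, 0);

close_all:
    close(epfd);
close_pipe:
    close(pipefd[0]);
    close(pipefd[1]);
    return n;
}
```

The function should return 1 even though no state change happened after registration, which matches the revents check inside ep_insert().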

The line tfile->f_op->poll(tfile, &epq.pt) calls the poll() interface of the monitored file. If the monitored file is a TCP socket, tcp_poll() ends up being called. Let's take a look at what tcp_poll() does:

unsigned int tcp_poll(struct file *file, struct socket *sock, poll_table *wait)
{
    struct sock *sk = sock->sk;
    ...
    poll_wait(file, sk->sk_sleep, wait);
    ...
    return mask;
}

Each socket object has a wait queue (for background, see the article "Principle and Implementation of Waiting Queue"), which holds the processes waiting for the state of the socket to change.

From the above code, we can see that tcp_poll() calls poll_wait(), passing it the wait queue of the socket (sk->sk_sleep). poll_wait() in turn invokes the callback that ep_insert() registered with init_poll_funcptr(), namely ep_ptable_queue_proc(). The ep_ptable_queue_proc() function is implemented as follows:

static void ep_ptable_queue_proc(struct file *file,
    wait_queue_head_t *whead, poll_table *pt)
{
    struct epitem *epi = ep_item_from_epqueue(pt);
    struct eppoll_entry *pwq;

    if (epi->nwait >= 0 && (pwq = kmem_cache_alloc(pwq_cache, GFP_KERNEL))) {
        init_waitqueue_func_entry(&pwq->wait, ep_poll_callback);
        pwq->whead = whead;
        pwq->base = epi;
        add_wait_queue(whead, &pwq->wait);
        list_add_tail(&pwq->llink, &epi->pwqlist);
        epi->nwait++;
    } else {
        epi->nwait = -1;
    }
}

The main job of ep_ptable_queue_proc() is to add the current epitem object to the wait queue of the socket object and to set its wake-up function to ep_poll_callback(). In other words, when the state of the socket changes, ep_poll_callback() is called. The ep_poll_callback() function is implemented as follows:

static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *key)
{
    ...
    // Add the ready file to the ready list
    list_add_tail(&epi->rdllink, &ep->rdllist);

is_linked:
    // Wake up the process blocked in epoll_wait()
    if (waitqueue_active(&ep->wq))
        wake_up_locked(&ep->wq);
    ...
    return 1;
}

The main job of the ep_poll_callback() function is to add the ready file to the ready list of the eventpoll object, and then wake up the process that is blocked in epoll_wait().
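
The callback mechanism described above can be modeled in a few lines of user-space C. This is a toy model, not kernel code: all toy_ names are hypothetical, a counter stands in for rdllist, and another counter stands in for waking up wq. It assumes, as the is_linked label in the excerpt suggests, that an item already on the ready list is not added a second time.

```c
#include <stddef.h>

/* A wait queue entry carries the function to run on wake-up. */
struct toy_wait_entry {
    void (*func)(struct toy_wait_entry *self);
    struct toy_wait_entry *next;
};

struct toy_wait_queue {
    struct toy_wait_entry *head;
};

struct toy_eventpoll {
    int ready_count; /* stands in for the length of rdllist */
    int wakeups;     /* counts wake-ups of the epoll wait queue */
};

struct toy_epitem {
    struct toy_wait_entry wait; /* first member, so its address is the item's */
    struct toy_eventpoll *ep;
    int linked;                 /* already on the ready list? */
};

/* Plays the role of ep_poll_callback(). */
static void toy_poll_callback(struct toy_wait_entry *self)
{
    struct toy_epitem *epi = (struct toy_epitem *)self;

    if (!epi->linked) {
        epi->linked = 1;
        epi->ep->ready_count++;
    }
    epi->ep->wakeups++;
}

void toy_item_init(struct toy_epitem *epi, struct toy_eventpoll *ep)
{
    epi->wait.func = toy_poll_callback;
    epi->wait.next = NULL;
    epi->ep = ep;
    epi->linked = 0;
}

/* Plays the role of add_wait_queue() in ep_ptable_queue_proc(). */
void toy_queue_add(struct toy_wait_queue *q, struct toy_wait_entry *e)
{
    e->next = q->head;
    q->head = e;
}

/* What the file does when its state changes: run every queued callback. */
void toy_queue_wake(struct toy_wait_queue *q)
{
    struct toy_wait_entry *e;

    for (e = q->head; e != NULL; e = e->next)
        e->func(e);
}
```

Waking the queue twice in this model increments wakeups twice but leaves ready_count at 1, because the item is linked only once.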

Waiting for the status of the monitored file to change

After adding the file handle to be monitored to epoll, you can wait for the status of the file being monitored to change by calling epoll_wait(). The epoll_wait() call will block the current process. When the status of the monitored file changes, the epoll_wait() call will return.

The prototype of the epoll_wait() system call is as follows:

int epoll_wait(int epfd, struct epoll_event *events, int maxevents, int timeout);

The meaning of each parameter:

  1. epfd: The epoll handle created by calling the epoll_create() function.
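  2. events: An array that receives the list of ready files.
  3. maxevents: The size of the events array. It must be greater than zero.
  4. timeout: The maximum time to wait, in milliseconds. A value of -1 waits indefinitely, and 0 returns immediately.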
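
The parameters can be exercised on an epoll handle that monitors nothing. The sketch below is not from the original article; try_wait() is a hypothetical helper. It relies on the documented epoll_wait(2) behavior that the timeout is given in milliseconds and that a maxevents value of zero or less fails with EINVAL.

```c
#include <errno.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: waits on an empty epoll handle and returns the result
 * of epoll_wait(), or the negated errno on failure. */
int try_wait(int maxevents, int timeout_ms)
{
    struct epoll_event events[8];
    int epfd, n;

    epfd = epoll_create(1);
    if (epfd < 0)
        return -errno;
    if (maxevents > 8)
        maxevents = 8; /* never claim more room than the array has */

    n = epoll_wait(epfd, events, maxevents, timeout_ms);
    if (n < 0)
        n = -errno;

    close(epfd);
    return n;
}
```

Since nothing is registered, try_wait(8, 0) returns 0 at once and try_wait(8, 10) returns 0 after the timeout expires, while try_wait(0, 0) fails with -EINVAL.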

The epoll_wait() function will call the sys_epoll_wait() kernel function, and the sys_epoll_wait() function will eventually call the ep_poll() function. Let’s take a look at the implementation of the ep_poll() function:

static int ep_poll(struct eventpoll *ep,
    struct epoll_event __user *events, int maxevents, long timeout)
{
    ...
    // If the ready list is empty
    if (list_empty(&ep->rdllist)) {
        // Add the current process to epoll's wait queue
        init_waitqueue_entry(&wait, current);
        wait.flags |= WQ_FLAG_EXCLUSIVE;
        __add_wait_queue(&ep->wq, &wait);

        for (;;) {
            set_current_state(TASK_INTERRUPTIBLE); // Mark the current process as sleeping (interruptible)
            if (!list_empty(&ep->rdllist) || !jtimeout) // Leave the loop if a file is ready or the timeout has expired
                break;
            if (signal_pending(current)) { // Also leave the loop if a signal is pending
                res = -EINTR;
                break;
            }

            spin_unlock_irqrestore(&ep->lock, flags);
            jtimeout = schedule_timeout(jtimeout); // Yield the CPU so that other processes can run
            spin_lock_irqsave(&ep->lock, flags);
        }
        // Execution reaches this point in three cases:
        // 1. A monitored file has become ready
        // 2. A timeout was set and it has expired
        // 3. A signal was received
        __remove_wait_queue(&ep->wq, &wait);

        set_current_state(TASK_RUNNING);
    }
    /* Are any files ready? */
    eavail = !list_empty(&ep->rdllist);

    spin_unlock_irqrestore(&ep->lock, flags);

    if (!res && eavail
        && !(res = ep_send_events(ep, events, maxevents)) && jtimeout)
        goto retry;

    return res;
}

The ep_poll() function mainly does the following things:

  1. Check whether the ready list rdllist already contains ready files. If it does, skip straight to step 4.
  2. If not, add the current process to the wait queue of epoll and put it to sleep.
  3. The process sleeps until one of the following occurs:
    1. A monitored file becomes ready.
    2. A timeout was set and it has expired.
    3. A signal is received.
  4. If there are ready files, call the ep_send_events() function to copy them to the events parameter.
  5. Return the number of ready files, or -EINTR if the wait was interrupted by a signal.
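
The number returned in the last step is bounded by maxevents, since ep_send_events() receives that limit. A small sketch of this effect follows; it is not from the original article, and ready_with_cap() is a hypothetical helper. It makes two pipes readable and asks for at most maxevents results.

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: with two ready files registered, returns what
 * epoll_wait() reports for the given maxevents (1 or 2), or -1 on error. */
int ready_with_cap(int maxevents)
{
    int a[2], b[2], epfd, n = -1;
    struct epoll_event ev = {0}, got[2];

    if (maxevents < 1 || maxevents > 2)
        return -1;
    if (pipe(a) < 0)
        return -1;
    if (pipe(b) < 0) {
        close(a[0]);
        close(a[1]);
        return -1;
    }
    epfd = epoll_create(1);
    if (epfd < 0)
        goto close_pipes;

    ev.events = EPOLLIN;
    ev.data.fd = a[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, a[0], &ev) < 0)
        goto close_all;
    ev.data.fd = b[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, b[0], &ev) < 0)
        goto close_all;

    /* Make both read ends ready. */
    if (write(a[1], "x", 1) != 1 || write(b[1], "x", 1) != 1)
        goto close_all;

    n = epoll_wait(epfd, got, maxevents, 0);

close_all:
    close(epfd);
close_pipes:
    close(a[0]);
    close(a[1]);
    close(b[0]);
    close(b[1]);
    return n;
}
```

With both files ready, a maxevents of 1 yields 1 and a maxevents of 2 yields 2. The file left out of a capped call stays ready and can be collected by a later call.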

Finally, we summarize the principle of epoll through a picture:

 

The following text describes this process:

  1. Create and initialize an eventpoll object by calling the epoll_create() function.
  2. Encapsulate the monitored file handle (such as the socket handle) into an epitem object by calling the epoll_ctl() function and add it to the red-black tree of the eventpoll object for management.
  3. Wait for the status of the monitored file to change by calling the epoll_wait() function.
  4. When the status of a monitored file changes (for example, the socket receives data), ep_poll_callback() adds the epitem object corresponding to the file handle to the ready list rdllist of the eventpoll object and wakes up the process that is blocked (sleeping) in epoll_wait().
  5. The awakened process copies the files on the ready list to the events parameter of the epoll_wait() function, and epoll_wait() returns.
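
The whole flow can be tried end to end from user space. The following is a minimal sketch under stated assumptions, not part of the original article: roundtrip() is a hypothetical helper, a pipe stands in for a socket, and a child process plays the role of the peer that sends data. A 2000 ms timeout is used instead of -1 so that the example cannot hang.

```c
#include <sys/epoll.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: returns the byte received through the pipe,
 * or -1 if any step fails. */
int roundtrip(void)
{
    int pipefd[2], epfd, got = -1;
    struct epoll_event ev = {0}, out;
    pid_t pid;
    char c;

    if (pipe(pipefd) < 0)
        return -1;
    epfd = epoll_create(1);                 /* step 1: create the eventpoll */
    if (epfd < 0)
        goto close_pipe;

    ev.events = EPOLLIN;
    ev.data.fd = pipefd[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, pipefd[0], &ev) < 0) /* step 2 */
        goto close_all;

    pid = fork();
    if (pid < 0)
        goto close_all;
    if (pid == 0) {
        /* Child: change the state of the monitored file by writing to it. */
        ssize_t w = write(pipefd[1], "A", 1);
        _exit(w == 1 ? 0 : 1);
    }

    /* Step 3: wait. Steps 4 and 5 happen in the kernel when the child writes. */
    if (epoll_wait(epfd, &out, 1, 2000) == 1 &&
        out.data.fd == pipefd[0] &&
        read(pipefd[0], &c, 1) == 1)
        got = c;

    waitpid(pid, NULL, 0);                  /* reap the child */

close_all:
    close(epfd);
close_pipe:
    close(pipefd[0]);
    close(pipefd[1]);
    return got;
}
```

If everything works, roundtrip() returns the character 'A' that the child wrote.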

Origin blog.csdn.net/Linuxhus/article/details/114135530