epoll source code analysis and implementation in Redis

1. Overview

This article walks through how epoll is implemented in the Linux kernel, with the goal of deepening your understanding of network system calls. Many widely used frameworks rely on epoll: the JDK's NIO implementation on Linux, Netty, Redis, and most other software that handles long-lived network connections. At the end of the article, we briefly look at the Redis source code to see how it uses epoll for I/O multiplexing to achieve high concurrency.

2. Implementation details

The official documentation describes it as follows:

The central concept of the epoll API is the epoll instance, an in-kernel data structure which, from a user-space perspective, can be considered as a container for two lists

So an epoll instance is a kernel data structure that, seen from user space, holds two lists: the interest list (the file descriptors being monitored) and the ready list (the monitored descriptors that currently have events pending). Internally, as we will see, the interest list is kept in a red-black tree and the ready list is a linked list. With this in mind, the three calls that epoll provides are easy to understand:

  • epoll_create initializes this kernel data structure and returns an fd. In Unix everything is a file, so what gets created here is a file, and we receive its descriptor. Every later operation just passes this fd in, and the kernel uses it to find the corresponding epoll data structure.
  • epoll_ctl operates on one of the lists: the one that stores the I/O events the user is interested in. Registering an event involves some additional work, which is explained in detail later.
  • epoll_wait returns the events of interest that are ready, so that the application layer can handle them.
int epoll_create(int size);
int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event);
int epoll_wait(int epfd, struct epoll_event * events, int maxevents, int timeout);

struct epoll_event {
    __uint32_t events;
    epoll_data_t data;
};
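
As a rough illustration of how the three calls fit together from user space, the sketch below watches the read end of a pipe. The helper name wait_for_pipe_readable and the choice of a pipe are assumptions made for this example and do not come from the kernel or Redis source; epoll_create1(0) is used as the modern counterpart of epoll_create.

```c
#include <assert.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: returns the number of ready events reported for a
 * pipe whose read end is registered with epoll, or -1 on error. */
int wait_for_pipe_readable(void)
{
    int pipefd[2];
    if (pipe(pipefd) == -1)
        return -1;

    /* 1. Create the epoll instance; the result is an ordinary fd. */
    int epfd = epoll_create1(0);
    if (epfd == -1) {
        close(pipefd[0]);
        close(pipefd[1]);
        return -1;
    }

    /* 2. Register interest in "readable" on the pipe's read end. */
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = pipefd[0] };
    int n = -1;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, pipefd[0], &ev) == 0 &&
        write(pipefd[1], "x", 1) == 1) {
        /* 3. Collect ready events; wait at most one second. */
        struct epoll_event ready[4];
        n = epoll_wait(epfd, ready, 4, 1000);
        if (n == 1)
            assert(ready[0].data.fd == pipefd[0]);
    }

    close(epfd);
    close(pipefd[0]);
    close(pipefd[1]);
    return n;
}
```

Because one byte was written before the wait, the call should report exactly one ready event, carrying the fd that was stored in data.fd at registration time.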

epoll_create method

SYSCALL_DEFINE1(epoll_create1, int, flags)
{
    int error, fd;
    struct eventpoll *ep = NULL;
    struct file *file;
    // Create the internal eventpoll data structure
    error = ep_alloc(&ep);
    // Find an unused fd
    fd = get_unused_fd_flags(O_RDWR | (flags & O_CLOEXEC));
    file = anon_inode_getfile("[eventpoll]", &eventpoll_fops, ep,
                 O_RDWR | (flags & O_CLOEXEC));

    ep->file = file;
    fd_install(fd, file);  // associate fd with file
    return fd;
out_free_fd:
    put_unused_fd(fd);
out_free_ep:
    ep_free(ep);
    return error;
}

A few words about this method. It returns a file descriptor, and through that descriptor we can find the corresponding structure, that is, a region of kernel memory where all of the epoll instance's data is stored. This memory region is represented by the eventpoll structure. (The excerpt above omits the error checks that jump to the out_free_* labels.) The logic of the method is as follows:

1. Create the eventpoll structure and initialize its fields.

2. Find an unused fd, then create a file for epoll and point file->private_data to ep. The details of creating the file are not covered here.

3. Point ep->file to file, so that the two structures reference each other.

4. Associate fd with file, so that we can find the file through fd and, from it, the structure (memory region) behind ep. Note that the private_data field of file is very important in device drivers: it can point to a custom data structure, which is what allows one device driver to serve multiple devices, since different devices may have different attributes. epoll simply uses private_data in the same way to point to its own data structure.
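
The four steps can be mimicked in user space with a toy model. This is only a sketch: toy_eventpoll, toy_file, toy_fd_table and both functions are hypothetical names invented for illustration, and the real kernel code handles reference counting, flags and error paths that are ignored here.

```c
#include <stddef.h>
#include <stdlib.h>

/* Toy stand-ins for the kernel types; every name here is hypothetical. */
struct toy_eventpoll { int nr_watched; };
struct toy_file { void *private_data; };

#define TOY_MAX_FDS 16
static struct toy_file *toy_fd_table[TOY_MAX_FDS];

/* Mirrors the four steps: allocate ep, find a free fd, create the file
 * with private_data pointing at ep, then install the file under that fd. */
int toy_epoll_create(void)
{
    struct toy_eventpoll *ep = calloc(1, sizeof(*ep));
    if (!ep)
        return -1;

    int fd = -1;
    for (int i = 0; i < TOY_MAX_FDS; i++) {
        if (!toy_fd_table[i]) {
            fd = i;
            break;
        }
    }
    if (fd == -1) {
        free(ep);
        return -1;
    }

    struct toy_file *file = malloc(sizeof(*file));
    if (!file) {
        free(ep);
        return -1;
    }

    file->private_data = ep;      /* file -> ep */
    toy_fd_table[fd] = file;      /* fd -> file */
    return fd;
}

/* Later calls only receive the fd and walk fd -> file -> private_data. */
struct toy_eventpoll *toy_ep_from_fd(int fd)
{
    if (fd < 0 || fd >= TOY_MAX_FDS || !toy_fd_table[fd])
        return NULL;
    return toy_fd_table[fd]->private_data;
}
```

The lookup in toy_ep_from_fd corresponds to the line ep = f.file->private_data in epoll_ctl: the caller never holds a pointer to the instance, only the fd.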

The eventpoll structure is shown below; its fields are discussed in detail as we encounter them.

struct eventpoll {
    spinlock_t lock;
    struct mutex mtx;
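    wait_queue_head_t wq; // wait queue used by sys_epoll_wait()
    wait_queue_head_t poll_wait; // wait queue used by file->poll()
    struct list_head rdllist; // list of all ready file descriptors
    struct rb_root rbr; // root of the red-black tree that stores the monitored fds
    
    struct epitem *ovflist; // singly linked chain of epitems; events that arrive while rdllist is being transferred to user space are linked here
    struct wakeup_source *ws; // wakeup_source used while ep_scan_ready_list is running
    struct user_struct *user; // the user that created the eventpoll descriptor
    struct file *file;
    int visited;           // used to optimize loop detection checks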
    struct list_head visited_list_link;
};

epoll_ctl method

This method adds, removes, and modifies monitored events.

SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
        struct epoll_event __user *, event)
{
    int error;
    int full_check = 0;
    struct fd f, tf;
    struct eventpoll *ep;    
    struct epitem *epi;     
    struct epoll_event epds; 
    struct eventpoll *tep = NULL;
    error = -EFAULT;
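    // Unless this is a delete, copy the epoll_event from user space into the kernel
    if (ep_op_has_event(op) &&
        copy_from_user(&epds, event, sizeof(struct epoll_event)))
        goto error_return; // the label is in the part of the function omitted here
    f = fdget(epfd); // the file behind epfd
    tf = fdget(fd); // the file behind fd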
    ...
    ep = f.file->private_data; // fetch the ep created by epoll_create
    ...
    epi = ep_find(ep, tf.file, fd); // look up this fd in ep's red-black tree
    switch (op) {
    case EPOLL_CTL_ADD:
        if (!epi) {
            epds.events |= POLLERR | POLLHUP;
            error = ep_insert(ep, &epds, tf.file, fd, full_check); 
        }
        if (full_check)
            clear_tfile_check_list();
        break;
    case EPOLL_CTL_DEL:
        if (epi)
            error = ep_remove(ep, epi); 
        break;
    case EPOLL_CTL_MOD:
        if (epi) {
            epds.events |= POLLERR | POLLHUP;
            error = ep_modify(ep, epi, &epds); 
        }
        break;
    }
    mutex_unlock(&ep->mtx);
    fdput(tf);
    fdput(f);
    ...
    return error;
}

Some checks are omitted from the code above. The core of the method is to dispatch to a different function depending on the operation type.

epoll_ctl calls ep_find to look up the corresponding epi in the red-black tree. If the epi already exists, it cannot be added again; if it does not exist, the remove and modify operations cannot be performed. The whole process runs with ep->mtx held. Because the container is a red-black tree, lookup and insertion both take O(log n) time, so registration stays fast even in high-concurrency scenarios with many monitored descriptors. Let's look at the logic behind these operations, starting with ep_insert.
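
These rules are visible from user space through the error codes of epoll_ctl. The following sketch probes them on a fresh pipe; ctl_errno and probe_ctl_rules are hypothetical helper names, while the EEXIST and ENOENT results are the ones documented in the epoll_ctl manual page.

```c
#include <errno.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: performs one epoll_ctl call and returns 0 on
 * success or the resulting errno value on failure. */
static int ctl_errno(int epfd, int op, int fd, unsigned int events)
{
    struct epoll_event ev = { .events = events, .data.fd = fd };
    return epoll_ctl(epfd, op, fd, &ev) == 0 ? 0 : errno;
}

/* Fills results[] with the outcome of five calls on a fresh pipe:
 * MOD before ADD, ADD, ADD again, DEL, DEL again.
 * Returns 0 if the probe itself could be set up, -1 otherwise. */
int probe_ctl_rules(int results[5])
{
    int pipefd[2];
    if (pipe(pipefd) == -1)
        return -1;
    int epfd = epoll_create1(0);
    if (epfd == -1) {
        close(pipefd[0]);
        close(pipefd[1]);
        return -1;
    }

    results[0] = ctl_errno(epfd, EPOLL_CTL_MOD, pipefd[0], EPOLLIN);
    results[1] = ctl_errno(epfd, EPOLL_CTL_ADD, pipefd[0], EPOLLIN);
    results[2] = ctl_errno(epfd, EPOLL_CTL_ADD, pipefd[0], EPOLLIN);
    results[3] = ctl_errno(epfd, EPOLL_CTL_DEL, pipefd[0], 0);
    results[4] = ctl_errno(epfd, EPOLL_CTL_DEL, pipefd[0], 0);

    close(epfd);
    close(pipefd[0]);
    close(pipefd[1]);
    return 0;
}
```

On Linux the five slots should come out as ENOENT, 0, EEXIST, 0, ENOENT, matching the "already exists" and "does not exist" branches described above.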

ep_insert operation

As the name suggests, it adds a monitored event.

static int ep_insert(struct eventpoll *ep, struct epoll_event *event,
             struct file *tfile, int fd, int full_check)
{
    int error, revents, pwake = 0;
    unsigned long flags;
    long user_watches;
    struct epitem *epi;
    struct ep_pqueue epq; // binds epi to a poll_table (definition shown below)


    user_watches = atomic_long_read(&ep->user->epoll_watches);
    if (unlikely(user_watches >= max_user_watches))
        return -ENOSPC;
    if (!(epi = kmem_cache_alloc(epi_cache, GFP_KERNEL)))
        return -ENOMEM;

    // Initialize and fill in the epi structure
    INIT_LIST_HEAD(&epi->rdllink);
    INIT_LIST_HEAD(&epi->fllink);
    INIT_LIST_HEAD(&epi->pwqlist);
    epi->ep = ep;
    ep_set_ffd(&epi->ffd, tfile, fd); // store both tfile and fd in ffd
    epi->event = *event;
    epi->nwait = 0;
    epi->next = EP_UNACTIVE_PTR;
    if (epi->event.events & EPOLLWAKEUP) {
        error = ep_create_wakeup_source(epi);
    } else {
        RCU_INIT_POINTER(epi->ws, NULL);
    }
    epq.epi = epi;
    // Set the poll callback function
    init_poll_funcptr(&epq.pt, ep_ptable_queue_proc);
    // Call the file's poll method
    revents = ep_item_poll(epi, &epq.pt);
    spin_lock(&tfile->f_lock);
    list_add_tail_rcu(&epi->fllink, &tfile->f_ep_links);
    spin_unlock(&tfile->f_lock);
    ep_rbtree_insert(ep, epi); // add the current epi to the RB tree
    spin_lock_irqsave(&ep->lock, flags);
    // An event of interest is ready and epi is not yet on the ready list
    if ((revents & event->events) && !ep_is_linked(&epi->rdllink)) {
        list_add_tail(&epi->rdllink, &ep->rdllist);
        ep_pm_stay_awake(epi);

        // Wake up processes waiting for files to become ready, i.e. those sleeping in epoll_wait
        if (waitqueue_active(&ep->wq))
            wake_up_locked(&ep->wq);
        if (waitqueue_active(&ep->poll_wait))
            pwake++;
    }
    spin_unlock_irqrestore(&ep->lock, flags);
    atomic_long_inc(&ep->user->epoll_watches);


    if (pwake)
        ep_poll_safewake(&ep->poll_wait); // wake up processes waiting for the eventpoll file itself to become ready
    return 0;
...
}

1. First, an epi is initialized. The epi is what tracks the monitoring of a target file: each monitored file corresponds to one epi, which is stored in ep's red-black tree.

struct epitem {
    union {
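        struct rb_node rbn; // RB tree node linking this structure into the eventpoll RB tree
        struct rcu_head rcu; // used when freeing the epitem structure
    };


    struct list_head rdllink; // ready-list link; chains this item onto eventpoll's rdllist
    struct epitem *next; // used with eventpoll's ovflist to form a singly linked chain
    struct epoll_filefd ffd; // information about the monitored file descriptor; every socket fd has its own epitem, associated through this field
    int nwait; // number of active wait queues attached by poll operations

    struct list_head pwqlist; // list of poll wait queue entries for the monitored file
    struct eventpoll *ep;  // the ep that owns this epi
    struct list_head fllink; // links this structure onto the file's f_ep_links, so that one file can be monitored by several epoll instances
    struct wakeup_source __rcu *ws; // wakeup_source used when EPOLLWAKEUP is set
    struct epoll_event event; // the monitored events and the user data (for example the file descriptor)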
};

2. After initialization, the file's fd and the corresponding file pointer are stored in the epi's ffd field. Its main purpose is to bind the fd to this epi.

struct epoll_filefd {
    struct file *file;
    int fd;
} __packed;

3. Register the function ep_ptable_queue_proc in the pt member of epq (pt is a poll_table).

struct ep_pqueue {
    poll_table pt;
    struct epitem *epi;
};

typedef struct poll_table_struct {
    poll_queue_proc _qproc;
    unsigned long _key;
} poll_table;

Here epq is a structure that binds an epi to a poll_table. The poll_table mainly holds the registered ep_ptable_queue_proc function, and _key records the events of interest. So epq stores both the epi and the corresponding ep_ptable_queue_proc. When the callback is executed later, it receives only the poll_table pointer; from that pointer we can recover the address of the enclosing epq, and from there the epi. That is the purpose of defining this structure.
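
Recovering the enclosing structure from a pointer to one of its members is the container_of idiom. The sketch below reproduces it in user space with offsetof; the toy_ types and the function name are simplified, hypothetical mirrors of poll_table, epitem, ep_pqueue and ep_item_from_epqueue, not the kernel definitions.

```c
#include <stddef.h>

/* Userspace version of the kernel's container_of idea: step back from a
 * member's address to the start of the enclosing structure. */
#define toy_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Simplified, hypothetical mirrors of poll_table, epitem and ep_pqueue. */
struct toy_poll_table { unsigned long key; };
struct toy_epitem { int fd; };
struct toy_ep_pqueue {
    struct toy_poll_table pt;
    struct toy_epitem *epi;
};

/* Plays the role of ep_item_from_epqueue(): the callback only receives
 * the poll_table pointer, yet can still reach the epitem. */
struct toy_epitem *toy_item_from_queue(struct toy_poll_table *pt)
{
    return toy_container_of(pt, struct toy_ep_pqueue, pt)->epi;
}
```

The design choice is that the generic poll interface only needs to know about poll_table, while each user of it (epoll here) can wrap extra per-call context around that member.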

4. Call the ep_item_poll method. In short, it calls the poll method that the file's underlying implementation provides.

static inline unsigned int ep_item_poll(struct epitem *epi, poll_table *pt)
{
    pt->_key = epi->event.events;
    return epi->ffd.file->f_op->poll(epi->ffd.file, pt) & epi->event.events;
}

Different drivers have their own poll method. For a TCP socket, the poll method is tcp_poll. When ep_item_poll invokes it, tcp_poll calls sock_poll_wait(), which in turn calls the ep_ptable_queue_proc function registered here. Through this mechanism epoll adds its own callback to the file's wait queue, which is exactly the purpose of ep_ptable_queue_proc. From then on, nothing needs to poll the socket periodically: the protocol stack wakes the wait queue when an event arrives.

5. ep_item_poll returns revents, the events currently pending on the fd. If they include an event we are interested in, the epi is inserted into ep's rdllist. If a process is waiting for files to become ready, that is, a process sleeping in epoll_wait, it is woken up.

 if ((revents & event->events) && !ep_is_linked(&epi->rdllink)) {
        list_add_tail(&epi->rdllink, &ep->rdllist);
        ep_pm_stay_awake(epi);

        // Wake up processes waiting for files to become ready, i.e. those sleeping in epoll_wait
        if (waitqueue_active(&ep->wq))
            wake_up_locked(&ep->wq);
        if (waitqueue_active(&ep->poll_wait))
            pwake++;
    }
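
One observable consequence of this immediate check is that an fd which is already readable at registration time is reported by the very next epoll_wait, even though no new event arrives afterwards. A minimal user-space sketch follows; ready_before_add is a hypothetical helper name and the pipe is just a convenient event source.

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: makes a pipe readable first, registers it second,
 * then polls without blocking. Returns the epoll_wait result or -1. */
int ready_before_add(void)
{
    int pipefd[2];
    if (pipe(pipefd) == -1)
        return -1;
    int epfd = epoll_create1(0);
    if (epfd == -1) {
        close(pipefd[0]);
        close(pipefd[1]);
        return -1;
    }

    int n = -1;
    /* Data arrives BEFORE the fd is known to epoll. */
    if (write(pipefd[1], "x", 1) == 1) {
        struct epoll_event ev = { .events = EPOLLIN, .data.fd = pipefd[0] };
        if (epoll_ctl(epfd, EPOLL_CTL_ADD, pipefd[0], &ev) == 0) {
            struct epoll_event out;
            /* timeout 0: return immediately with whatever is ready. */
            n = epoll_wait(epfd, &out, 1, 0);
        }
    }

    close(epfd);
    close(pipefd[0]);
    close(pipefd[1]);
    return n;
}
```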

ep_ptable_queue_proc method

The whole process, then, consists of registering the ep_ptable_queue_proc function through the file's poll method. ep_ptable_queue_proc in turn installs the callback that is invoked when an event arrives on the file behind the file descriptor.

Note: "file" here means the kernel's file structure. Several fds can point to the same file structure, and several file structures can point to the same inode. In Linux, the contents of a file are defined and described by its inode, whereas a file structure is created only when we open the file to operate on it. It is worth being clear about this distinction.

static void ep_ptable_queue_proc(struct file *file, wait_queue_head_t *whead,
                 poll_table *pt)
{
    struct epitem *epi = ep_item_from_epqueue(pt);
    struct eppoll_entry *pwq;  


    if (epi->nwait >= 0 && (pwq = kmem_cache_alloc(pwq_cache, GFP_KERNEL))) {
        // Initialize the wait queue entry with the callback
        init_waitqueue_func_entry(&pwq->wait, ep_poll_callback);
        pwq->whead = whead;
        pwq->base = epi;
        // Put the entry carrying ep_poll_callback on the wait queue whead
        add_wait_queue(whead, &pwq->wait);
        // Append llink to the tail of epi->pwqlist
        list_add_tail(&pwq->llink, &epi->pwqlist);
        epi->nwait++;
    } else {
        epi->nwait = -1; // mark that an error occurred
    }
}

static inline void
init_waitqueue_func_entry(wait_queue_t *q, wait_queue_func_t func)
{
    q->flags    = 0;
    q->private    = NULL;
    q->func        = func;
}

ep_ptable_queue_proc takes three parameters: file is a pointer to the file being monitored, whead is the wait queue of the device behind the fd, and pt is the poll_table we passed in when calling the file's poll method.

ep_ptable_queue_proc introduces eppoll_entry, referred to as pwq. Its main job is to tie the epi to the callback function that runs when an event occurs for that epi.

As the code above shows, the function first obtains the epi from pt, and then uses pwq to associate the three: the epi, the wait queue head, and the callback.

Finally, add_wait_queue hangs the eppoll_entry on the device wait queue of the fd, which registers epoll's callback function.

So the main goal of this method is to hang an eppoll_entry on the device wait queue of the fd. When data arrives at the device, the hardware interrupt handling path wakes up the entries on that queue, and the wake-up function ep_poll_callback is called.
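
The registration and wake-up path can be pictured with a small user-space model. Everything in it is a simplified, hypothetical stand-in (the toy_ names are not kernel identifiers), and an int flag takes the place of the epitem and its ready list.

```c
#include <stddef.h>

/* All types below are simplified, hypothetical stand-ins. */
struct toy_wait_entry;
typedef void (*toy_wait_func)(struct toy_wait_entry *entry);

struct toy_wait_entry {
    toy_wait_func func;            /* callback run on wake-up */
    struct toy_wait_entry *next;   /* next entry on the same queue */
    int *ready_flag;               /* stands in for the epitem */
};

struct toy_wait_queue_head { struct toy_wait_entry *first; };

/* Role of add_wait_queue(): hang an entry on the device's queue. */
void toy_add_wait_queue(struct toy_wait_queue_head *head,
                        struct toy_wait_entry *entry)
{
    entry->next = head->first;
    head->first = entry;
}

/* Role of the driver's wake-up path: run every registered callback
 * and report how many were called. */
int toy_wake_up(struct toy_wait_queue_head *head)
{
    int called = 0;
    for (struct toy_wait_entry *e = head->first; e; e = e->next) {
        e->func(e);
        called++;
    }
    return called;
}

/* Role of ep_poll_callback(): mark the watched item as ready. */
void toy_poll_callback(struct toy_wait_entry *entry)
{
    *entry->ready_flag = 1;
}
```

The device side only knows about the queue and the function pointer; it has no idea that the entry belongs to epoll, which is why the same mechanism serves select, poll and ordinary blocking reads.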

ep_poll_callback method

The main job of this function is to add the epi of a monitored file to the ready queue when an event on that file becomes ready. When the application layer calls epoll_wait(), the kernel copies the events on the ready queue to user space and so reports them to the application.

static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *key)
{
    int pwake = 0;
    unsigned long flags;
    struct epitem *epi = ep_item_from_wait(wait);
    struct eventpoll *ep = epi->ep;
    spin_lock_irqsave(&ep->lock, flags);
    if (unlikely(ep->ovflist != EP_UNACTIVE_PTR)) {
        if (epi->next == EP_UNACTIVE_PTR) {
            epi->next = ep->ovflist;
            ep->ovflist = epi;
            if (epi->ws) {
                __pm_stay_awake(ep->ws);
            }
        }
        goto out_unlock;
    }
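    // If this file is already on the ready list, there is nothing to add
    if (!ep_is_linked(&epi->rdllink)) {
        // Insert the ready epi into ep's ready list
        list_add_tail(&epi->rdllink, &ep->rdllist);
        ep_pm_stay_awake_rcu(epi);
    }
    // Wake up the eventpoll wait queue and the ->poll() wait queue if they are active
    if (waitqueue_active(&ep->wq))
        wake_up_locked(&ep->wq);  // the queue is not empty, so wake up the waiting process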
    if (waitqueue_active(&ep->poll_wait))
        pwake++;
out_unlock:
    spin_unlock_irqrestore(&ep->lock, flags);
    if (pwake)
        ep_poll_safewake(&ep->poll_wait);


    if ((unsigned long)key & POLLFREE) {
        list_del_init(&wait->task_list); // remove the corresponding wait entry
        smp_store_release(&ep_pwq_from_wait(wait)->whead, NULL);
    }
    return 1;
}

// Returns non-zero if the wait queue is not empty
static inline int waitqueue_active(wait_queue_head_t *q)
{
    return !list_empty(&q->task_list);
}

3. epoll implementation summary

Looking past the details to the essence, the heart of epoll is ep_item_poll and ep_poll_callback.

epoll relies on the poll method of the virtual file system, invoked through ep_item_poll, to register ep_poll_callback on the wait queue of the corresponding file. When data arrives for that file, the registered callback is called, and it adds the file's epi to the ready queue.

When the user calls epoll_wait(), epoll takes the lock and transfers the ready queue's data to user space; any events that arrive during this transfer are chained onto ovflist instead.
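
The interplay between the ready list and ovflist can be sketched as a toy state machine. This is an assumption-laden simplification: the toy_ names are hypothetical, a fixed array replaces the kernel's linked list, and locking is left out entirely. It only shows the rule that events arriving during a transfer go to the overflow chain and are requeued afterwards.

```c
#include <stddef.h>

#define TOY_UNACTIVE ((struct toy_item *)-1L)  /* like EP_UNACTIVE_PTR */

struct toy_item {
    int id;
    int on_ready;            /* 1 if currently on the ready list */
    struct toy_item *next;   /* overflow chain link, TOY_UNACTIVE if unused */
};

struct toy_ep {
    struct toy_item *ready[8];   /* stand-in for rdllist */
    int nready;
    struct toy_item *ovflist;    /* TOY_UNACTIVE when no transfer is running */
};

void toy_ep_init(struct toy_ep *ep)
{
    ep->nready = 0;
    ep->ovflist = TOY_UNACTIVE;
}

/* Role of ep_poll_callback(). */
void toy_event(struct toy_ep *ep, struct toy_item *it)
{
    if (ep->ovflist != TOY_UNACTIVE) {      /* a transfer is in progress */
        if (it->next == TOY_UNACTIVE) {     /* not chained yet */
            it->next = ep->ovflist;
            ep->ovflist = it;
        }
        return;
    }
    if (!it->on_ready && ep->nready < 8) {
        ep->ready[ep->nready++] = it;
        it->on_ready = 1;
    }
}

/* Start of the transfer: detach the ready list and open the overflow
 * chain. Returns the number of items handed to the caller. */
int toy_scan_begin(struct toy_ep *ep)
{
    int taken = ep->nready;
    for (int i = 0; i < taken; i++)
        ep->ready[i]->on_ready = 0;
    ep->nready = 0;
    ep->ovflist = NULL;
    return taken;
}

/* End of the transfer: requeue everything that arrived meanwhile. */
void toy_scan_end(struct toy_ep *ep)
{
    struct toy_item *it = ep->ovflist;
    ep->ovflist = TOY_UNACTIVE;
    if (it == TOY_UNACTIVE)
        return;                              /* no transfer was running */
    while (it) {
        struct toy_item *next = it->next;
        it->next = TOY_UNACTIVE;
        toy_event(ep, it);
        it = next;
    }
}
```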

4. How Redis uses epoll

The specific implementation is in ae_epoll.c

typedef struct aeApiState {

    // epoll instance descriptor
    int epfd;
    // Event slots
    struct epoll_event *events;

} aeApiState;

aeApiCreate method

Redis calls the aeCreateEventLoop method when it initializes the server. aeCreateEventLoop in turn calls aeApiCreate to create an epoll instance.

static int aeApiCreate(aeEventLoop *eventLoop) {
    aeApiState *state = zmalloc(sizeof(aeApiState));


    if (!state) return -1;
    state->events = zmalloc(sizeof(struct epoll_event)*eventLoop->setsize);
    if (!state->events) {
        zfree(state);
        return -1;
    }
    state->epfd = epoll_create(1024); /* 1024 is just a hint for the kernel */
    if (state->epfd == -1) {
        zfree(state->events);
        zfree(state);
        return -1;
    }
    eventLoop->apidata = state;
    return 0;    
}

aeApiAddEvent method

This method associates an event with epoll, so it calls epoll_ctl.

static int aeApiAddEvent(aeEventLoop *eventLoop, int fd, int mask) {
    aeApiState *state = eventLoop->apidata;
    struct epoll_event ee;

    /* If the fd was already monitored for some event, we need a MOD
     * operation. Otherwise we need an ADD operation. */
    int op = eventLoop->events[fd].mask == AE_NONE ?
            EPOLL_CTL_ADD : EPOLL_CTL_MOD;

    // Register the event with epoll
    ee.events = 0;
    mask |= eventLoop->events[fd].mask; /* Merge old events */
    if (mask & AE_READABLE) ee.events |= EPOLLIN;
    if (mask & AE_WRITABLE) ee.events |= EPOLLOUT;
    ee.data.u64 = 0; /* avoid valgrind warning */
    ee.data.fd = fd;

    if (epoll_ctl(state->epfd,op,fd,&ee) == -1) return -1;
    return 0;
}

This method is called when the Redis server accepts a new client, to register the read event for that client.

When Redis needs to write data to a client, it calls the prepareClientToWrite method, whose main job is to register the write event for the client's fd.

If the registration fails, Redis does not write the data to the buffer.

Once the corresponding socket is writable, the Redis event loop writes the new data in the buffer to the socket.
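
The ADD-or-MOD decision and the mask merging can be tried out in isolation. The sketch below is a trimmed-down imitation, not Redis code: mini_loop, mini_loop_init, mini_add_event and the MINI_ constants are hypothetical names, and the per-fd mask is updated inside the same function for brevity, which is an assumption of this example.

```c
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

#define MINI_NONE     0
#define MINI_READABLE 1
#define MINI_WRITABLE 2
#define MINI_SETSIZE  64

/* Hypothetical, trimmed-down counterpart of aeEventLoop + aeApiState. */
struct mini_loop {
    int epfd;
    int mask[MINI_SETSIZE];   /* events currently registered per fd */
};

int mini_loop_init(struct mini_loop *loop)
{
    for (int i = 0; i < MINI_SETSIZE; i++)
        loop->mask[i] = MINI_NONE;
    loop->epfd = epoll_create1(0);
    return loop->epfd == -1 ? -1 : 0;
}

/* Same decision as aeApiAddEvent: ADD for a new fd, MOD otherwise, and
 * always merge the new mask with the one already registered. */
int mini_add_event(struct mini_loop *loop, int fd, int mask)
{
    if (fd < 0 || fd >= MINI_SETSIZE)
        return -1;
    int op = loop->mask[fd] == MINI_NONE ? EPOLL_CTL_ADD : EPOLL_CTL_MOD;
    int merged = mask | loop->mask[fd];

    struct epoll_event ee = { 0 };
    if (merged & MINI_READABLE) ee.events |= EPOLLIN;
    if (merged & MINI_WRITABLE) ee.events |= EPOLLOUT;
    ee.data.fd = fd;

    if (epoll_ctl(loop->epfd, op, fd, &ee) == -1)
        return -1;
    loop->mask[fd] = merged;   /* remember what is registered now */
    return 0;
}
```

Registering MINI_READABLE and then MINI_WRITABLE on the same socket issues one ADD followed by one MOD; had the second call used ADD again, epoll_ctl would have failed with EEXIST.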

aeMain method

The main loop of the Redis event handler.

void aeMain(aeEventLoop *eventLoop) {

    eventLoop->stop = 0;

    while (!eventLoop->stop) {

        // If there is a function to run before processing events, run it
        if (eventLoop->beforesleep != NULL)
            eventLoop->beforesleep(eventLoop);

        // Start processing events
        aeProcessEvents(eventLoop, AE_ALL_EVENTS);
    }
}

This method eventually calls epoll_wait() (through aeProcessEvents and aeApiPoll) to fetch the ready events and run their handlers.
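
The shape of such a loop can be sketched in a few lines. The code below is an illustration in the spirit of aeMain and aeProcessEvents, not the Redis implementation: mini_watch, mini_run and mini_read_byte are hypothetical names, and the handler is looked up through data.ptr here, whereas Redis stores the fd and indexes its own events array.

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical handler type: called with the ready fd and user data;
 * a non-zero return asks the loop to stop. */
typedef int (*mini_handler)(int fd, void *arg);

struct mini_watch {
    int fd;
    mini_handler on_readable;
    void *arg;
};

/* Block in epoll_wait, then dispatch each ready event to its handler.
 * Returns the number of handler invocations, or -1 on error. */
int mini_run(int epfd, int timeout_ms)
{
    int dispatched = 0, stop = 0;
    while (!stop) {
        struct epoll_event events[16];
        int n = epoll_wait(epfd, events, 16, timeout_ms);
        if (n == -1)
            return -1;
        if (n == 0)
            break;                       /* timed out: nothing to do */
        for (int i = 0; i < n; i++) {
            struct mini_watch *w = events[i].data.ptr;
            dispatched++;
            if (w->on_readable(w->fd, w->arg))
                stop = 1;
        }
    }
    return dispatched;
}

/* Example handler: read one byte into *arg and stop on 'q'. */
int mini_read_byte(int fd, void *arg)
{
    char c = 0;
    if (read(fd, &c, 1) != 1)
        return 1;
    *(char *)arg = c;
    return c == 'q';
}
```

Since registration is level-triggered by default, a pipe holding several unread bytes keeps being reported, so the handler runs once per byte until it asks the loop to stop.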

Origin blog.csdn.net/Linuxhus/article/details/114985548