Linux Network Programming: An Analysis of the Underlying Implementation of epoll

Basic data structures

Let's take a look at the data structures epoll is built on: eventpoll, epitem, and eppoll_entry.

The eventpoll structure is the handle created on the kernel side when epoll_create is called; it represents an epoll instance. Later calls such as epoll_ctl and epoll_wait all operate on this eventpoll data, which is saved in the private_data field of the anonymous file created by epoll_create.

/*
 * This structure is stored inside the "private_data" member of the file
 * structure and represents the main data structure for the eventpoll
 * interface.
 */
struct eventpoll {
    /* Protect the access to this structure */
    spinlock_t lock;

    /*
     * This mutex is used to ensure that files are not removed
     * while epoll is using them. This is held during the event
     * collection loop, the file cleanup path, the epoll file exit
     * code and the ctl operations.
     */
    struct mutex mtx;

    /* Wait queue used by sys_epoll_wait() */
    // This queue holds the processes that are blocked in epoll_wait on this instance
    wait_queue_head_t wq;

    /* Wait queue used by file->poll() */
    // This queue holds waiters that poll on this eventpoll instance itself;
    // since an eventpoll is itself a file, it can also be the target of a poll operation
    wait_queue_head_t poll_wait;

    /* List of ready file descriptors */
    // The list of fds whose events are ready; each element is an epitem (defined below)
    struct list_head rdllist;

    /* RB tree root used to store monitored fd structs */
    // A red-black tree used to look up fds quickly
    struct rb_root_cached rbr;

    /*
     * This is a single linked list that chains all the "struct epitem" that
     * happened while transferring ready events to userspace w/out
     * holding ->lock.
     */
    struct epitem *ovflist;

    /* wakeup_source used when ep_scan_ready_list is running */
    struct wakeup_source *ws;

    /* The user that created the eventpoll descriptor */
    struct user_struct *user;

    // The anonymous file corresponding to this eventpoll, which fully embodies the Linux idea that everything is a file
    struct file *file;

    /* used to optimize loop detection check */
    int visited;
    struct list_head visited_list_link;

#ifdef CONFIG_NET_RX_BUSY_POLL
    /* used to track busy poll napi_id */
    unsigned int napi_id;
#endif
};

The code above mentions epitem. What is this epitem structure for?

Whenever we call epoll_ctl to add an fd, the kernel creates an epitem instance for us and inserts it as a node into the red-black tree held in the eventpoll structure; the corresponding field is rbr. From then on, checking whether an event has occurred on a given fd is done through that fd's epitem in the red-black tree.

/*
 * Each file descriptor added to the eventpoll interface will
 * have an entry of this type linked to the "rbr" RB tree.
 * Avoid increasing the size of this struct, there can be many thousands
 * of these on a server and we do not want this to take another cache line.
 */
struct epitem {
    union {
        /* RB tree node links this structure to the eventpoll RB tree */
        struct rb_node rbn;
        /* Used to free the struct epitem */
        struct rcu_head rcu;
    };

    /* List header used to link this structure to the eventpoll ready list */
    // List pointers that link this epitem into the rdllist inside eventpoll
    struct list_head rdllink;

    /*
     * Works together "struct eventpoll"->ovflist in keeping the
     * single linked chain of items.
     */
    struct epitem *next;

    /* The file descriptor information this item refers to */
    // The fd monitored by epoll
    struct epoll_filefd ffd;

    /* Number of active wait queue attached to poll operations */
    // A file can be monitored by multiple epoll instances; this records how many times the current file is being watched
    int nwait;

    /* List containing poll wait queues */
    struct list_head pwqlist;

    /* The "container" of this item */
    // The eventpoll that this epitem belongs to
    struct eventpoll *ep;

    /* List header used to link this item to the "struct file" items list */
    struct list_head fllink;

    /* wakeup_source used when EPOLLWAKEUP is set */
    struct wakeup_source __rcu *ws;

    /* The structure that describe the interested events and the source fd */
    struct epoll_event event;
};

Every time an fd is associated with an epoll instance, an eppoll_entry is created. The structure of eppoll_entry is as follows:

/* Wait structure used by the poll hooks */
struct eppoll_entry {
    /* List header used to link this structure to the "struct epitem" */
    struct list_head llink;

    /* The "base" pointer is set to the container "struct epitem" */
    struct epitem *base;

    /*
     * Wait queue item that will be linked to the target file wait
     * queue head.
     */
    wait_queue_entry_t wait;

    /* The wait queue head that linked the "wait" wait queue item */
    wait_queue_head_t *whead;
};
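
Putting the three structures together, their relationships can be summarized roughly as follows (a comment-only sketch of the fields above, not kernel code):

/*
 * eventpoll  (one per epoll instance, lives in file->private_data)
 *   rbr:     red-black tree of epitem, one per monitored fd
 *   rdllist: list of epitem whose events are ready
 *   wq:      processes sleeping in epoll_wait
 *
 * epitem  (one per fd added with epoll_ctl)
 *   ffd:     the monitored struct file plus fd
 *   rdllink: links this epitem into eventpoll->rdllist
 *   pwqlist: list of eppoll_entry hooked into the target file
 *
 * eppoll_entry  (one per wait queue that the fd's poll() registers)
 *   wait:    wait queue entry carrying ep_poll_callback
 *   whead:   the target file's wait queue head
 */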

epoll_create

When we use epoll, we first call epoll_create to create an epoll instance.

First of all, epoll_create will simply verify the passed flags parameter.

/* Check the EPOLL_* constant for consistency.  */
BUILD_BUG_ON(EPOLL_CLOEXEC != O_CLOEXEC);

if (flags & ~EPOLL_CLOEXEC)
    return -EINVAL;

Next, the kernel allocates the memory needed for the eventpoll structure.

/* Create the internal data structure ("struct eventpoll"). */
error = ep_alloc(&ep);
if (error < 0)
  return error;

In the next step, epoll_create allocates an anonymous file and a file descriptor for the epoll instance: fd is the file descriptor, and file is the anonymous file. This fully embodies the UNIX idea that everything is a file. Note that the eventpoll instance keeps a reference to the anonymous file, and the file descriptor is bound to the anonymous file by calling the fd_install function.

One more point deserves special attention: when calling anon_inode_getfile, epoll_create stores the eventpoll as the private_data of the anonymous file, which makes it possible to locate the eventpoll object quickly later when all the kernel has is the epoll instance's file descriptor.

Finally, this file descriptor serves as the handle of the epoll instance and is returned to the caller of epoll_create.

/*
 * Creates all the items needed to setup an eventpoll file. That is,
 * a file structure and a free file descriptor.
 */
fd = get_unused_fd_flags(O_RDWR | (flags & O_CLOEXEC));
if (fd < 0) {
    error = fd;
    goto out_free_ep;
}
file = anon_inode_getfile("[eventpoll]", &eventpoll_fops, ep,
             O_RDWR | (flags & O_CLOEXEC));
if (IS_ERR(file)) {
    error = PTR_ERR(file);
    goto out_free_fd;
}
ep->file = file;
fd_install(fd, file);
return fd;
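
From user space, the whole path above is triggered by a single call. Here is a minimal sketch using epoll_create1, the flags-taking variant whose argument is validated by the check shown earlier (error handling deliberately simple):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/epoll.h>

int main(void)
{
    /* The kernel allocates an eventpoll, wraps it in an anonymous file,
     * stores it in file->private_data, and returns a file descriptor. */
    int epfd = epoll_create1(EPOLL_CLOEXEC);
    if (epfd < 0) {
        perror("epoll_create1");
        exit(EXIT_FAILURE);
    }
    printf("epoll instance fd = %d\n", epfd);
    close(epfd);    /* drops the anonymous file, freeing the eventpoll */
    return 0;
}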

epoll_ctl

Next, look at how the socket is added to the epoll instance.

Find epoll instance:

First, the epoll_ctl function obtains the corresponding anonymous file from the epoll instance handle. This is easy to understand: under UNIX everything is a file, and an epoll instance is likewise backed by an anonymous file.

// Get the anonymous file corresponding to the epoll instance
f = fdget(epfd);
if (!f.file)
    goto error_return;

Next, get the file corresponding to the socket being added; tf stands for the target file, i.e. the file to be operated on.

/* Get the "struct file *" for the target file */
// Get the real file, such as a listening socket or a connected socket
tf = fdget(fd);
if (!tf.file)
    goto error_fput;

Next, a series of checks is performed to ensure that the parameters passed in by the user are valid, for example that epfd really is an epoll instance handle rather than an ordinary file descriptor.

/* The target file descriptor must support poll */
// If the file does not support poll, the file descriptor is invalid
error = -EPERM;
if (!tf.file->f_op->poll)
    goto error_tgt_fput;
...

Once epfd is known to be a genuine epoll instance handle, the eventpoll instance created earlier can be retrieved through private_data.

/*
 * At this point it is safe to assume that the "private_data" contains
 * our own data structure.
 */
ep = f.file->private_data;

Red-black tree search:

epoll_ctl then uses the target file and descriptor to check whether the socket already exists in the red-black tree; this lookup is part of why epoll is efficient. The red-black tree (RB-tree) is a common data structure; eventpoll uses one to track all the file descriptors it is currently monitoring, and the root of this tree is stored in the eventpoll structure.

/* RB tree root used to store monitored fd structs */
struct rb_root_cached rbr;

Each monitored file descriptor has a corresponding epitem, and the epitem is stored as a node in the red-black tree.

/*
 * Try to lookup the file inside our RB tree, Since we grabbed "mtx"
 * above, we can be sure to be able to use the item looked up by
 * ep_find() till we release the mutex.
 */
epi = ep_find(ep, tf.file, fd);

The red-black tree is a binary search tree, so epitem, as a node of the tree, must provide a comparison function so that an ordered tree can be built. The ordering relies on the epoll_filefd structure, which can simply be understood as the key identifying the monitored file descriptor; it is what each tree node is compared by.

As the comparison function below shows, this is easy to follow: nodes are ordered by the address of the struct file, and if two entries refer to the same file, by the file descriptor number.

struct epoll_filefd {
  struct file *file; // pointer to the target file struct corresponding to the fd
  int fd; // target file descriptor number
} __packed;

/* Compare RB tree keys */
static inline int ep_cmp_ffd(struct epoll_filefd *p1,
                            struct epoll_filefd *p2)
{
  return (p1->file > p2->file ? +1:
       (p1->file < p2->file ? -1 : p1->fd - p2->fd));
}

After the red-black tree lookup, if the operation is an ADD and no corresponding node was found in the tree, ep_insert is called to insert a new node.

case EPOLL_CTL_ADD:
    if (!epi) {
        epds.events |= POLLERR | POLLHUP;
        error = ep_insert(ep, &epds, tf.file, fd, full_check);
    } else
        error = -EEXIST;
    if (full_check)
        clear_tfile_check_list();
    break;
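
From user space, the path just described is driven by a call like the one in this small sketch; epfd and sockfd are assumed to be an existing epoll instance and socket:

#include <stdio.h>
#include <sys/epoll.h>

/* Register sockfd with the epoll instance epfd; inside the kernel this
 * runs ep_find on the red-black tree and then ep_insert for the new node. */
int watch_readable(int epfd, int sockfd)
{
    struct epoll_event ev;

    ev.events = EPOLLIN;    /* the events we subscribe to */
    ev.data.fd = sockfd;    /* cookie handed back later by epoll_wait */
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, sockfd, &ev) < 0) {
        perror("epoll_ctl(EPOLL_CTL_ADD)");
        return -1;
    }
    return 0;
}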

ep_insert

ep_insert first checks whether the number of files currently being watched by the user exceeds the limit in /proc/sys/fs/epoll/max_user_watches; if it does, an error is returned immediately.

user_watches = atomic_long_read(&ep->user->epoll_watches);
if (unlikely(user_watches >= max_user_watches))
    return -ENOSPC;

The next step is to allocate resources and initialize actions.

if (!(epi = kmem_cache_alloc(epi_cache, GFP_KERNEL)))
    return -ENOMEM;

/* Item initialization follow here ... */
INIT_LIST_HEAD(&epi->rdllink);
INIT_LIST_HEAD(&epi->fllink);
INIT_LIST_HEAD(&epi->pwqlist);
epi->ep = ep;
ep_set_ffd(&epi->ffd, tfile, fd);
epi->event = *event;
epi->nwait = 0;
epi->next = EP_UNACTIVE_PTR;

What happens next is very important: ep_insert sets up a callback for each file descriptor that is added, via the function ep_ptable_queue_proc. What does this callback do? It is invoked whenever an event occurs on the corresponding file descriptor, for example when data arrives in a socket's receive buffer, and the function it arranges to be called is ep_poll_callback. Here we can see that the kernel's own design is built around the event-callback principle.

/*
 * This is the callback that is used to add our wait queue to the
 * target file wakeup lists.
 */
static void ep_ptable_queue_proc(struct file *file, wait_queue_head_t *whead,
                                 poll_table *pt)
{
    struct epitem *epi = ep_item_from_epqueue(pt);
    struct eppoll_entry *pwq;

    if (epi->nwait >= 0 && (pwq = kmem_cache_alloc(pwq_cache, GFP_KERNEL))) {
        init_waitqueue_func_entry(&pwq->wait, ep_poll_callback);
        pwq->whead = whead;
        pwq->base = epi;
        if (epi->event.events & EPOLLEXCLUSIVE)
            add_wait_queue_exclusive(whead, &pwq->wait);
        else
            add_wait_queue(whead, &pwq->wait);
        list_add_tail(&pwq->llink, &epi->pwqlist);
        epi->nwait++;
    } else {
        /* We have to signal that an error occurred */
        epi->nwait = -1;
    }
}

ep_poll_callback

The ep_poll_callback function is very important. It is what actually connects kernel events to the epoll object. How is that achieved?

First, the callback uses the file's wait_queue_entry_t to find the corresponding epitem. Because the wait_queue_entry_t is embedded in an eppoll_entry, the address of the eppoll_entry can be computed from the address of the wait_queue_entry_t, and from there the epitem is reached. This work is done in the ep_item_from_wait function. Once the epitem is in hand, the owning eventpoll instance can be traced as well.
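
This address arithmetic is the classic container_of pattern; ep_item_from_wait is essentially the following (see the eppoll_entry definition above):

/* The wait entry is embedded inside an eppoll_entry, so container_of
 * subtracts the field offset to recover the eppoll_entry, whose "base"
 * pointer is the owning epitem. */
static struct epitem *ep_item_from_wait(wait_queue_entry_t *p)
{
    return container_of(p, struct eppoll_entry, wait)->base;
}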

/*
 * This is the callback that is passed to the wait queue wakeup
 * mechanism. It is called by the stored file descriptors when they
 * have events to report.
 */
static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, void *key)
{
    int pwake = 0;
    int ewake = 0;
    unsigned long flags;
    struct epitem *epi = ep_item_from_wait(wait);
    struct eventpoll *ep = epi->ep;

Next, the eventpoll lock is acquired.

spin_lock_irqsave(&ep->lock, flags);

Next, the events that triggered the callback are filtered. Why is filtering needed? For performance reasons, the registration done in ep_insert causes this callback to fire for any event on the monitored file, but the events the user actually subscribed to may not match the kernel event. For example, if the user subscribed only to a socket's readable event, a writable event on that socket does not need to be delivered to user space.

/*
 * Check the events coming with the callback. At this stage, not
 * every device reports the events in the "key" parameter of the
 * callback. We need to be able to handle both cases here, hence the
 * test for "key" != NULL before the event match test.
 */
if (key && !((unsigned long) key & epi->event.events))
    goto out_unlock;

Next, the callback checks whether events are currently being transferred to user space. If so (ep->ovflist is active), the epitem is chained onto ovflist instead of the ready list, and the callback exits early.

if (unlikely(ep->ovflist != EP_UNACTIVE_PTR)) {
  if (epi->next == EP_UNACTIVE_PTR) {
      epi->next = ep->ovflist;
      ep->ovflist = epi;
      if (epi->ws) {
          /*
           * Activate ep->ws since epi->ws may get
           * deactivated at any time.
           */
          __pm_stay_awake(ep->ws);
      }
  }
  goto out_unlock;
}

Otherwise, if the epitem corresponding to the event is not already on the eventpoll ready list, it is added to it so that the event can be delivered to user space.

/* If this file is already in the ready list we exit soon */
if (!ep_is_linked(&epi->rdllink)) {
    list_add_tail(&epi->rdllink, &ep->rdllist);
    ep_pm_stay_awake_rcu(epi);
}

We know that when epoll_wait is called, the calling process may be suspended; from the kernel's point of view, it goes to sleep. When an event occurs on a descriptor monitored by the epoll instance, the sleeping process should be woken up promptly so it can handle the event. That is what the following code does: the wake_up_locked function wakes up the processes waiting on the current eventpoll.

/*
 * Wake up ( if active ) both the eventpoll wait list and the ->poll()
 * wait list.
 */
if (waitqueue_active(&ep->wq)) {
    if ((epi->event.events & EPOLLEXCLUSIVE) &&
                !((unsigned long)key & POLLFREE)) {
        switch ((unsigned long)key & EPOLLINOUT_BITS) {
        case POLLIN:
            if (epi->event.events & POLLIN)
                ewake = 1;
            break;
        case POLLOUT:
            if (epi->event.events & POLLOUT)
                ewake = 1;
            break;
        case 0:
            ewake = 1;
            break;
        }
    }
    wake_up_locked(&ep->wq);
}

epoll_wait

Find the epoll instance:

The epoll_wait function first performs a series of checks, for example, the incoming maxevents should be greater than 0.

/* The maximum number of event must be greater than zero */
if (maxevents <= 0 || maxevents > EP_MAX_EVENTS)
    return -EINVAL;
/* Verify that the area passed by the user is writeable */
if (!access_ok(VERIFY_WRITE, events, maxevents * sizeof(struct epoll_event)))
    return -EFAULT;

As with epoll_ctl described earlier, the corresponding anonymous file is obtained through the epoll instance's descriptor and then checked and verified.

/* Get the "struct file *" for the eventpoll file */
f = fdget(epfd);
if (!f.file)
    return -EBADF;
/*
 * We have to check that the file structure underneath the fd
 * the user passed to us _is_ an eventpoll file.
 */
error = -EINVAL;
if (!is_file_epoll(f.file))
    goto error_fput;

Again, the eventpoll instance is obtained by reading the private_data of the anonymous file behind the epoll instance.

/*
 * At this point it is safe to assume that the "private_data" contains
 * our own data structure.
 */
ep = f.file->private_data;

Next, ep_poll is called to collect the ready events and deliver them to user space.

/* Time to fish for events ... */
error = ep_poll(ep, events, maxevents, timeout);
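
The user-space half of this hand-off is simply epoll_wait. A minimal sketch follows; epfd is assumed to be an existing epoll instance and handle_readable a hypothetical handler:

#include <stdio.h>
#include <sys/epoll.h>

#define MAX_EVENTS 64

void handle_readable(int fd);    /* hypothetical event handler */

void wait_once(int epfd)
{
    struct epoll_event events[MAX_EVENTS];

    /* Blocks inside ep_poll until events are ready or 1000 ms elapse. */
    int n = epoll_wait(epfd, events, MAX_EVENTS, 1000);
    if (n < 0) {
        perror("epoll_wait");
        return;
    }
    for (int i = 0; i < n; i++) {
        if (events[i].events & EPOLLIN)
            handle_readable(events[i].data.fd);
    }
}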

ep_poll

Recall from the earlier introduction of epoll_wait that the timeout value may be greater than 0, equal to 0, or less than 0. ep_poll handles each of these cases: if timeout is greater than 0, a timeout is set up; if it is 0, the function immediately checks whether an event has occurred; if it is negative, ep_poll waits indefinitely.

static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
                   int maxevents, long timeout)
{
    int res = 0, eavail, timed_out = 0;
    unsigned long flags;
    u64 slack = 0;
    wait_queue_entry_t wait;
    ktime_t expires, *to = NULL;

    if (timeout > 0) {
        struct timespec64 end_time = ep_set_mstimeout(timeout);

        slack = select_estimate_accuracy(&end_time);
        to = &expires;
        *to = timespec64_to_ktime(end_time);
    } else if (timeout == 0) {
        /*
         * Avoid the unnecessary trip to the wait queue loop, if the
         * caller specified a non blocking operation.
         */
        timed_out = 1;
        spin_lock_irqsave(&ep->lock, flags);
        goto check_events;
    }

Next try to obtain the lock on eventpoll:

spin_lock_irqsave(&ep->lock, flags);

After the lock is obtained, the function checks whether any event is already pending. If not, it adds the current process to the eventpoll wait queue wq, so that ep_poll_callback can wake the process up when an event occurs.

if (!ep_events_available(ep)) {
    /*
     * Busy poll timed out.  Drop NAPI ID for now, we can add
     * it back in when we have moved a socket with a valid NAPI
     * ID onto the ready list.
     */
    ep_reset_busy_poll_napi_id(ep);

    /*
     * We don't have any available event to return to the caller.
     * We need to sleep here, and we will be wake up by
     * ep_poll_callback() when events will become available.
     */
    init_waitqueue_entry(&wait, current);
    __add_wait_queue_exclusive(&ep->wq, &wait);

Then comes an infinite loop. In this loop, the current process is put to sleep by calling schedule_hrtimeout_range, and the scheduler gives the CPU to other processes. Of course, the current process may be woken up, under any of the following four conditions:

  1. The current process times out;
  2. The current process receives a signal;
  3. An event occurs on some monitored descriptor;
  4. The current process is simply rescheduled and re-enters the loop check; if none of conditions 1-3 holds, it goes back to sleep.

// In this loop the current process may be woken up; the wake-up paths are:
// 1. the current process times out
// 2. the current process receives a signal
// 3. an event occurs on some descriptor
// Cases 1, 2 and 3 all break out of the loop;
// a 4th possibility is that the process is rescheduled by the CPU: it re-enters
// the loop check and, if none of conditions 1-3 holds, goes back to sleep
for (;;) {
    /*
     * We don't want to sleep if the ep_poll_callback() sends us
     * a wakeup in between. That's why we set the task state
     * to TASK_INTERRUPTIBLE before doing the checks.
     */
    set_current_state(TASK_INTERRUPTIBLE);
    /*
     * Always short-circuit for fatal signals to allow
     * threads to make a timely exit without the chance of
     * finding more events available and fetching
     * repeatedly.
     */
    if (fatal_signal_pending(current)) {
        res = -EINTR;
        break;
    }
    if (ep_events_available(ep) || timed_out)
        break;
    if (signal_pending(current)) {
        res = -EINTR;
        break;
    }

    spin_unlock_irqrestore(&ep->lock, flags);

    // By calling schedule_hrtimeout_range, the current process goes to sleep and the scheduler gives the CPU time to other processes
    if (!schedule_hrtimeout_range(to, slack, HRTIMER_MODE_ABS))
        timed_out = 1;

    spin_lock_irqsave(&ep->lock, flags);
}

When the process comes back from sleep, it is removed from the eventpoll wait queue and its state is set back to TASK_RUNNING.

// Back from sleep: remove the current process from the wait queue, set its state to TASK_RUNNING, and then fall through to check_events to see whether any event occurred
    __remove_wait_queue(&ep->wq, &wait);
    __set_current_state(TASK_RUNNING);

Finally, call ep_send_events to copy the event to user space.

// ep_send_events copies the events to user space
/*
 * Try to transfer events to user space. In case we get 0 events and
 * there's still timeout left over, we go trying again in search of
 * more luck.
 */
if (!res && eavail &&
    !(res = ep_send_events(ep, events, maxevents)) && !timed_out)
    goto fetch_events;

return res;

ep_send_events

The ep_send_events function passes ep_send_events_proc as a callback to ep_scan_ready_list, which in turn calls ep_send_events_proc for each epitem on the ready list.

While ep_send_events_proc iterates over the ready list, it calls each file descriptor's poll method again to make sure the event has really occurred. Why do this? To confirm that the registered event is still valid at this moment.

Even though ep_send_events_proc does its best to ensure that the event notifications handed to user space are real and current, there is still a window in which a notification becomes stale: by the time user space acts on it, the event may no longer be valid, perhaps because it has already been consumed, or because of some other change. In that case, if the socket is not non-blocking, the whole process can block on the subsequent I/O call, which is why combining epoll with non-blocking sockets is considered best practice.
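
As a concrete illustration of that best practice, here is a small sketch that puts a socket into non-blocking mode before it is added to epoll (plain fcntl usage, nothing epoll-specific):

#include <fcntl.h>

/* Put fd into non-blocking mode, so that a readiness notification that
 * turns out to be stale cannot block the process inside a later read()
 * or write(). */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}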

After a simple event-mask check, ep_send_events_proc copies the event structure into the form required by user space; this is done with the __put_user method.

Level-triggered VS Edge-triggered

Earlier we repeatedly emphasized the difference between level-triggered and edge-triggered modes.

From the implementation's point of view, the difference is actually very simple. At the end of ep_send_events_proc, in the level-triggered case, the current epitem is added back onto the eventpoll ready list, so that it will be processed again the next time epoll_wait is called.

As mentioned earlier, before the valid events are finally copied to user space, the poll method of the corresponding file is called again to determine whether the event is still valid. So if the user-space program has already handled the event, it will not be notified again; if it has not, the event is still valid and will be reported once more.

// This is the level-triggered handling: the event is put back on the ready list,
// so it will be re-checked on the next round of epoll_wait
else if (!(epi->event.events & EPOLLET)) {
    /*
     * If this file has been added with Level
     * Trigger mode, we need to insert back inside
     * the ready list, so that the next call to
     * epoll_wait() will check again the events
     * availability. At this point, no one can insert
     * into ep->rdllist besides us. The epoll_ctl()
     * callers are locked out by
     * ep_scan_ready_list() holding "mtx" and the
     * poll callback will queue them in ep->ovflist.
     */
    list_add_tail(&epi->rdllink, &ep->rdllist);
    ep_pm_stay_awake(epi);
}
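
From user space, the switch between the two modes is just the EPOLLET bit. A small sketch (epfd and sockfd are assumed to exist, with sockfd already registered and non-blocking):

#include <sys/epoll.h>

/* Switch an already-registered sockfd to edge-triggered mode. With EPOLLET
 * set, the branch above is skipped: the epitem is not put back on the ready
 * list, so the event is reported once per state change rather than on every
 * epoll_wait. */
int make_edge_triggered(int epfd, int sockfd)
{
    struct epoll_event ev;

    ev.events = EPOLLIN | EPOLLET;
    ev.data.fd = sockfd;
    return epoll_ctl(epfd, EPOLL_CTL_MOD, sockfd, &ev);
}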

epoll VS poll/select

Let's explain from an implementation perspective why epoll is much more efficient than poll/select.

First, each call to poll/select copies the set of fds to be monitored from user space into kernel space, processes it in the kernel, and then copies the results back out. This involves kernel memory allocation, copying, and freeing, which becomes very time-consuming when there are many fds. epoll instead maintains a red-black tree in the kernel: operating on the tree avoids most of this copying and the repeated allocation and release of memory, and lookups in it are fast.

The following code shows how select allocates memory in kernel space. Note that it first tries to use storage on the stack; if the fd sets are too large to fit there, it allocates from the heap instead.

int core_sys_select(int n, fd_set __user *inp, fd_set __user *outp,
               fd_set __user *exp, struct timespec64 *end_time)
{
    fd_set_bits fds;
    void *bits;
    int ret, max_fds;
    size_t size, alloc_size;
    struct fdtable *fdt;
    /* Allocate small arguments on the stack to save memory and be faster */
    long stack_fds[SELECT_STACK_ALLOC/sizeof(long)];

    ret = -EINVAL;
    if (n < 0)
        goto out_nofds;

    /* max_fds can increase, so grab it once to avoid race */
    rcu_read_lock();
    fdt = files_fdtable(current->files);
    max_fds = fdt->max_fds;
    rcu_read_unlock();
    if (n > max_fds)
        n = max_fds;

    /*
     * We need 6 bitmaps (in/out/ex for both incoming and outgoing),
     * since we used fdset we need to allocate memory in units of
     * long-words. 
     */
    size = FDS_BYTES(n);
    bits = stack_fds;
    if (size > sizeof(stack_fds) / 6) {
        /* Not enough space in on-stack array; must use kmalloc */
        ret = -ENOMEM;
        if (size > (SIZE_MAX / 6))
            goto out_nofds;


        alloc_size = 6 * size;
        bits = kvmalloc(alloc_size, GFP_KERNEL);
        if (!bits)
            goto out_nofds;
    }
    fds.in      = bits;
    fds.out     = bits +   size;
    fds.ex      = bits + 2*size;
    fds.res_in  = bits + 3*size;
    fds.res_out = bits + 4*size;
    fds.res_ex  = bits + 5*size;
    ...

Second, when select/poll wakes from sleep, even if only one monitored fd has an event, the kernel must traverse the whole fd set to find out which events arrived. epoll is different: each fd's callback is directly associated with the eventpoll object, and the ready fd is added straight onto the eventpoll ready list.

static int do_select(int n, fd_set_bits *fds, struct timespec64 *end_time)
{
    ...
    retval = 0;
    for (;;) {
        unsigned long *rinp, *routp, *rexp, *inp, *outp, *exp;
        bool can_busy_loop = false;

        inp = fds->in; outp = fds->out; exp = fds->ex;
        rinp = fds->res_in; routp = fds->res_out; rexp = fds->res_ex;

        for (i = 0; i < n; ++rinp, ++routp, ++rexp) {
            unsigned long in, out, ex, all_bits, bit = 1, mask, j;
            unsigned long res_in = 0, res_out = 0, res_ex = 0;

            in = *inp++; out = *outp++; ex = *exp++;
            all_bits = in | out | ex;
            if (all_bits == 0) {
                i += BITS_PER_LONG;
                continue;
            }
            ...
        }
        ...
        if (!poll_schedule_timeout(&table, TASK_INTERRUPTIBLE,
                                   to, slack))
            timed_out = 1;
        ...

In short

epoll maintains a red-black tree to track all the file descriptors being monitored. Using the tree avoids the heavy data copying between kernel and user space and the repeated memory allocations, which greatly improves performance. At the same time, epoll maintains a linked list of ready events: when an event occurs on a monitored file, a kernel callback registers the corresponding item on that ready list. This callback-and-wakeup mechanism between the monitored file and the eventpoll removes the need to traverse all monitored descriptors, greatly speeding up event notification and detection, and it also makes both level-triggered and edge-triggered modes straightforward to implement.

By comparing this with the implementation of poll/select, we can see that epoll has indeed overcome their drawbacks, and it truly is the crown jewel of high-performance network programming on Linux. We should thank the masters of the Linux community for designing such a powerful event-dispatch mechanism, which lets us enjoy the technical dividends that high-performance network servers bring on Linux.
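
To tie the pieces together, here is a minimal sketch of an epoll-based accept-and-read loop; listen_fd is assumed to be a bound, listening, non-blocking socket, and error handling is abbreviated:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/epoll.h>
#include <sys/socket.h>

#define MAX_EVENTS 64

void event_loop(int listen_fd)
{
    struct epoll_event ev, events[MAX_EVENTS];
    int epfd = epoll_create1(0);    /* the kernel builds the eventpoll */
    if (epfd < 0) {
        perror("epoll_create1");
        exit(1);
    }

    ev.events = EPOLLIN;            /* level-triggered readable events */
    ev.data.fd = listen_fd;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev) < 0) {   /* ep_insert */
        perror("epoll_ctl");
        exit(1);
    }

    for (;;) {
        int n = epoll_wait(epfd, events, MAX_EVENTS, -1);       /* ep_poll */
        for (int i = 0; i < n; i++) {
            if (events[i].data.fd == listen_fd) {
                int conn = accept(listen_fd, NULL, NULL);
                if (conn < 0)
                    continue;
                ev.events = EPOLLIN;
                ev.data.fd = conn;
                epoll_ctl(epfd, EPOLL_CTL_ADD, conn, &ev);
            } else {
                char buf[4096];
                ssize_t r = read(events[i].data.fd, buf, sizeof(buf));
                if (r <= 0) {       /* peer closed the connection, or error */
                    epoll_ctl(epfd, EPOLL_CTL_DEL, events[i].data.fd, NULL);
                    close(events[i].data.fd);
                }
                /* ... otherwise process r bytes from buf ... */
            }
        }
    }
}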

 

Learn the new by reviewing the past!

 


Origin blog.csdn.net/qq_24436765/article/details/104829648