Understanding Linux I/O Multiplexing

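1. Multiplexing means that, instead of dedicating one process or thread to each individual I/O channel, a single process or thread handles many of them.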

2. Kernel code cannot directly dereference user-space pointers.

select and poll

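select takes fd_set* pointers, and the kernel still has to copy each fd_set from user space into kernel space. poll takes a pollfd* pointer, and likewise every pollfd has to be copied from user space into kernel space (via copy_from_user).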

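fd_set is just an array wrapped in a struct, which amounts to a 1024-bit bitmap (FD_SETSIZE bits). The same bitmap marks the descriptors to monitor on the way in and marks which descriptors have events on the way out, so it has to be reinitialized before every call. fd_set is a fixed-size array, so the number of file descriptors select can handle is limited. poll instead receives what amounts to a dynamic array (a pointer plus an element count), so it has no such built-in limit; the count is bounded only by the process's open-file limit (RLIMIT_NOFILE).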
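The reinitialization requirement can be sketched roughly as follows. This is a minimal illustration rather than a recommended pattern; the helper name wait_readable_select and the demo function are hypothetical, and the demo assumes a pipe is an acceptable stand-in for a socket.

```c
#include <assert.h>
#include <sys/select.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if fd becomes readable within timeout_ms,
   0 on timeout, -1 on error. */
static int wait_readable_select(int fd, int timeout_ms)
{
    fd_set rfds;
    struct timeval tv;

    /* FD_SET is only defined for descriptors below FD_SETSIZE. */
    assert(fd >= 0 && fd < FD_SETSIZE);

    /* The set is rebuilt on every call: select overwrites it with results. */
    FD_ZERO(&rfds);
    FD_SET(fd, &rfds);
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    int n = select(fd + 1, &rfds, NULL, NULL, &tv);
    if (n <= 0)
        return n;
    return FD_ISSET(fd, &rfds) ? 1 : 0;
}

/* Hypothetical demo: an empty pipe times out, a pipe holding one byte is
   reported readable. Returns 1 if both observations hold, -1 on setup error. */
static int select_pipe_demo(void)
{
    int p[2];
    if (pipe(p) != 0)
        return -1;
    int before = wait_readable_select(p[0], 10);
    int wrote = (write(p[1], "x", 1) == 1);
    int after = wait_readable_select(p[0], 10);
    close(p[0]);
    close(p[1]);
    return wrote && before == 0 && after == 1;
}
```

If the FD_ZERO/FD_SET pair were hoisted out of the helper and the set reused, a call that timed out would leave the bitmap empty and the next call would monitor nothing.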

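pollfd keeps the file descriptor and the events in separate fields, bound together in one struct. On input, events holds the events to monitor; on output, revents holds the events that occurred. So, unlike select, the array does not have to be reinitialized before every call.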

struct pollfd
  {
    int fd;			/* File descriptor to poll.  */
    short int events;		/* Types of events poller cares about.  */
    short int revents;		/* Types of events that actually occurred.  */
  };
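A small sketch of the events/revents split, assuming a pipe as the monitored descriptor. The helper name poll_twice is hypothetical; it reuses one pollfd across two calls without resetting it.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/* Hypothetical helper: calls poll() twice on the same, untouched pollfd and
   returns how many of the two calls reported POLLIN (-1 on error). */
static int poll_twice(int fd, int timeout_ms)
{
    /* poll ignores negative descriptors, so reject them up front. */
    assert(fd >= 0);

    struct pollfd pfd = { .fd = fd, .events = POLLIN, .revents = 0 };
    int hits = 0;

    for (int i = 0; i < 2; i++) {
        /* events is never modified by the kernel; only revents is written. */
        int n = poll(&pfd, 1, timeout_ms);
        if (n < 0)
            return -1;
        if (n > 0 && (pfd.revents & POLLIN))
            hits++;
    }
    return hits;
}

/* Hypothetical demo: returns 1 if an empty pipe yields 0 hits and a pipe
   with one unread byte yields 2 hits, -1 on setup error. */
static int poll_pipe_demo(void)
{
    int p[2];
    if (pipe(p) != 0)
        return -1;
    int empty_hits = poll_twice(p[0], 10);
    int wrote = (write(p[1], "x", 1) == 1);
    int full_hits = poll_twice(p[0], 10);
    close(p[0]);
    close(p[1]);
    return wrote && empty_hits == 0 && full_hits == 2;
}
```

Because the byte is never read, both calls on the non-empty pipe report POLLIN, which also shows poll's level-triggered behavior.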

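The problem with select and poll is that the set of interesting file descriptors is always kept in user space, whereas epoll hands it over to the kernel to manage. With select and poll, the kernel has to walk every file descriptor in the structure passed in and check each one in turn for the monitored events. Before checking a descriptor, it adds the current process to that descriptor's wait queue.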

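1. If an event has already occurred before select or poll is called (the NIC notifies the kernel of pending reads and writes through interrupts and interrupt coalescing), the kernel detects the event during its scan, copies the resulting fd_set/pollfd back to user space, and the call returns. User space then walks all monitored file descriptors, checks the results, and handles them accordingly.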

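2. If no event has occurred before select or poll is called, the call blocks and the process sleeps until it times out or is interrupted. When a new event arrives, the kernel handles it per file descriptor, and the waiters on that fd's wait queue are woken in turn. In the usual implementation, select/poll does not know which fd woke it, so it has to walk all the passed-in fds once more, then return to user space and be handled exactly as in case 1.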

epoll

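epoll hands all monitored file descriptors over to the kernel to manage, so they do not have to be copied on every call. The work is split into finer steps across three calls.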

typedef union epoll_data
{
  void *ptr;
  int fd;
  uint32_t u32;
  uint64_t u64;
} epoll_data_t;

struct epoll_event
{
  uint32_t events;	/* Epoll events */
  epoll_data_t data;	/* User data variable */
} __EPOLL_PACKED;


/* Creates an epoll instance.  Returns an fd for the new instance.
   The "size" parameter is a hint specifying the number of file
   descriptors to be associated with the new instance.  The fd
   returned by epoll_create() should be closed with close().  */
extern int epoll_create (int __size) __THROW;

/* Manipulate an epoll instance "epfd". Returns 0 in case of success,
   -1 in case of error ( the "errno" variable will contain the
   specific error code ) The "op" parameter is one of the EPOLL_CTL_*
   constants defined above. The "fd" parameter is the target of the
   operation. The "event" parameter describes which events the caller
   is interested in and any associated user data.  */
extern int epoll_ctl (int __epfd, int __op, int __fd,
		      struct epoll_event *__event) __THROW;


/* Wait for events on an epoll instance "epfd". Returns the number of
   triggered events returned in "events" buffer. Or -1 in case of
   error with the "errno" variable set to the specific error code. The
   "events" parameter is a buffer that will contain triggered
   events. The "maxevents" is the maximum number of events to be
   returned ( usually size of "events" ). The "timeout" parameter
   specifies the maximum wait time in milliseconds (-1 == infinite).

   This function is a cancellation point and therefore not marked with
   __THROW.  */
extern int epoll_wait (int __epfd, struct epoll_event *__events,
		       int __maxevents, int __timeout);

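What is passed to epoll resembles poll: a dynamic array (a pointer plus a count), so there is no fixed limit on the number of descriptors either. The kernel uses a red-black tree to add and remove monitored file descriptors quickly. It is also event driven: when an event occurs on a file descriptor, a kernel callback adds that fd to a ready list maintained by the kernel. So when epoll_wait is called, the kernel only has to check the ready list and copy the results to user space. Even if the ready list is empty when epoll_wait is called and the process goes to sleep, the kernel has already added the event to the ready list by the time the process is woken, so it still only has to return the contents of the ready list to user space. In other words, the result seen in user space contains only the file descriptors that actually produced events.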
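The three-call flow might look roughly like the sketch below. The helper name epoll_ready_fds and the MAX_EVENTS constant are hypothetical, and the sketch uses epoll_create1(0) in place of epoll_create, which is an assumption about the target kernel (Linux 2.6.27 or later). A real program would keep the epoll instance alive across waits instead of creating it per call.

```c
#include <assert.h>
#include <sys/epoll.h>
#include <unistd.h>

#define MAX_EVENTS 64 /* hypothetical buffer size for this sketch */

/* Hypothetical helper: registers nfds descriptors for EPOLLIN, waits once,
   stores the descriptors that are ready into ready[], and returns how many
   were ready (0 on timeout, -1 on error). */
static int epoll_ready_fds(const int *fds, int nfds,
                           int *ready, int max_ready, int timeout_ms)
{
    /* epoll_wait requires maxevents > 0; the local buffer caps it above. */
    assert(max_ready > 0 && max_ready <= MAX_EVENTS);

    int epfd = epoll_create1(0);
    if (epfd < 0)
        return -1;

    /* Registration happens once per descriptor, not once per wait. */
    for (int i = 0; i < nfds; i++) {
        struct epoll_event ev = { .events = EPOLLIN };
        ev.data.fd = fds[i];
        if (epoll_ctl(epfd, EPOLL_CTL_ADD, fds[i], &ev) < 0) {
            close(epfd);
            return -1;
        }
    }

    /* Only descriptors on the ready list are copied into events[]. */
    struct epoll_event events[MAX_EVENTS];
    int n = epoll_wait(epfd, events, max_ready, timeout_ms);
    for (int i = 0; i < n; i++)
        ready[i] = events[i].data.fd;

    close(epfd);
    return n;
}
```

With two pipes registered and data written to only one of them, the helper returns 1 and ready[0] holds the read end of the pipe that was written to; the idle pipe never appears in the result.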

Finally, select and poll are in effect level-triggered, whereas epoll supports level-triggered mode and can also be set to edge-triggered mode.
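The difference between the two modes can be observed with a sketch like the following. The helper name count_wakeups is hypothetical; it assumes the descriptor already has unread data and deliberately never reads it.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: registers fd for EPOLLIN plus extra_flags (0 for
   level-triggered, EPOLLET for edge-triggered), calls epoll_wait twice
   without reading from fd, and returns how many calls reported an event
   (-1 on error). */
static int count_wakeups(int fd, uint32_t extra_flags)
{
    assert(fd >= 0);

    int epfd = epoll_create1(0);
    if (epfd < 0)
        return -1;

    struct epoll_event ev = { .events = EPOLLIN | extra_flags };
    ev.data.fd = fd;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) < 0) {
        close(epfd);
        return -1;
    }

    int hits = 0;
    for (int i = 0; i < 2; i++) {
        struct epoll_event out;
        int n = epoll_wait(epfd, &out, 1, 10); /* 10 ms timeout */
        if (n < 0) {
            close(epfd);
            return -1;
        }
        hits += n;
    }

    close(epfd);
    return hits;
}
```

On a pipe holding one unread byte, count_wakeups(fd, 0) reports the descriptor on both waits, while count_wakeups(fd, EPOLLET) reports it only on the first, since no new data arrives to produce another edge. This is why edge-triggered code usually reads until EAGAIN on a non-blocking descriptor.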