Detailed explanation of Linux IO multiplexing: select, poll, epoll

Linux network programming multiplexing

There are five IO models: blocking IO, non-blocking IO, multiplexed IO, signal-driven IO, and asynchronous IO.


foreword

In simple terms, IO is reading and writing. It generally happens in two steps: 1. wait for the data to be ready; 2. copy the data from the kernel buffer to user space, or from user space to the kernel. The essence of efficient IO is reducing the time spent waiting for data to be ready. An IO-multiplexing server is also called a multi-task IO server. The overall design idea is that the application no longer monitors client connections itself; instead, the kernel monitors events on its behalf.
There are three main interfaces: select, poll, epoll.



Five IO models

blocking IO

All sockets are blocking by default: the call blocks while waiting for the data to be ready.


non-blocking IO

If the data is not ready, the system call still returns immediately, with the EWOULDBLOCK error code.


Signal driven IO

When the kernel has the data ready, the application receives a SIGIO signal. This IO model is rarely used.


Multiplexing IO

The most essential difference from blocking IO is that it can wait on multiple fds at once.


Asynchronous I/O

The kernel notifies the application only once the data copy is complete (both the waiting and the copying are done).



Setting non-blocking mode with fcntl


  • Duplicate an existing descriptor (cmd=F_DUPFD)
  • Get/set file descriptor flags (cmd=F_GETFD or F_SETFD)
  • Get/set file status flags (cmd=F_GETFL or F_SETFL)
  • Get/set asynchronous I/O ownership (cmd=F_GETOWN or F_SETOWN)
  • Get/set record locks (cmd=F_GETLK, F_SETLK or F_SETLKW)

File descriptors are blocking by default.


SetNonBlock

void SetNonBlock(int fd) {
	int fl = fcntl(fd, F_GETFL);           // get the fd's current file status flags
	if (fl < 0) {
		perror("fcntl F_GETFL");
		return;
	}
	fcntl(fd, F_SETFL, fl | O_NONBLOCK);   // add O_NONBLOCK on top of the existing flags
}



select

select monitors status changes on multiple fds, waiting until data is ready on one of them.

Under the hood, select polls every fd to check whether data is ready, so with a large number of connections the server's efficiency drops sharply.

function prototype

int select(int nfds, fd_set *readfds, fd_set *writefds,
			fd_set *exceptfds, struct timeval *timeout);

	nfds: 		the highest-numbered fd in any of the monitored sets, plus 1
	readfds:	set of fds monitored for readability; in/out parameter
	writefds:	set of fds monitored for writability; in/out parameter
	exceptfds:	set of fds monitored for exceptional conditions (e.g. out-of-band data); in/out parameter
	timeout:	how long to block, three cases:
				1. NULL: block indefinitely
				2. a timeval with a nonzero value: wait at most that long
				3. a timeval with both fields zero: return immediately (used for polling)
	struct timeval {
		long tv_sec; /* seconds */
		long tv_usec; /* microseconds */
	};

	fd_set : an array of integers used as a bitmap; each bit marks one fd to monitor.

	void FD_CLR(int fd, fd_set *set); 	// clear fd's bit in the set
	int FD_ISSET(int fd, fd_set *set); 	// test whether fd's bit is set
	void FD_SET(int fd, fd_set *set); 	// set fd's bit in the set
	void FD_ZERO(fd_set *set); 			// clear all bits in the set


Return value:
On success, returns the number of file descriptors whose status changed.
Returns 0 if the timeout expired before any descriptor changed status.
Returns -1 on error, with the cause stored in errno; in that case the contents of readfds, writefds, exceptfds and timeout are undefined.
Possible errno values:
EBADF An invalid (e.g. already closed) file descriptor was in one of the sets
EINTR The call was interrupted by a signal
EINVAL nfds is negative
ENOMEM Out of kernel memory



socket ready condition


read ready

  • The number of bytes in the socket receive buffer in the kernel is greater than or equal to the low-water mark SO_RCVLOWAT; reading the fd at this point does not block and returns a value greater than 0
  • In TCP communication, the peer has closed the connection; reading the socket now returns 0
  • There is a new connection request on a listening socket;
  • There is an unhandled error on the socket;

write ready

  • The number of free bytes in the socket send buffer in the kernel is greater than or equal to the low-water mark SO_SNDLOWAT; writing at this point does not block and returns a value greater than 0
  • The write side of the socket has been closed (close or shutdown); writing to such a socket triggers the SIGPIPE signal
  • A non-blocking connect has completed, whether it succeeded or failed;
  • There is an unread error on the socket

abnormally ready

  • The socket receives out-of-band data.


select disadvantages:

  • Every call must rebuild the fd_set, since it is an in/out parameter
  • Every call copies the sets between user mode and kernel mode, which is costly when done frequently
  • The kernel polls all monitored fds, so efficiency is very low with a large number of connections
  • The number of fds select can monitor is limited (FD_SETSIZE, typically 1024)


poll

poll solves two of select's problems:
1. poll has no hard limit on the number of file descriptors it can wait on.
2. poll does not need to rebuild the monitored set on every call, because the requested events and the returned revents are separate fields.


function prototype

#include <sys/poll.h>
int poll(struct pollfd *fds, nfds_t nfds, int timeout);

	nfds 			number of file descriptors in the fds array to monitor

	timeout 		wait time in milliseconds
		-1: block indefinitely
		0: return immediately, do not block
		>0: wait the given number of milliseconds (rounded up if the system clock is coarser than a millisecond)


	struct pollfd {
		int fd; /* file descriptor */
		short events; /* events to monitor */
		short revents; /* events that actually occurred */
	};
	
	POLLIN			normal or out-of-band data readable, i.e. POLLRDNORM | POLLRDBAND
	POLLRDNORM		normal data readable
	POLLRDBAND		priority-band data readable
	POLLPRI 		high-priority data readable
	POLLOUT		normal data writable
	POLLWRNORM		normal data writable
	POLLWRBAND		priority-band data writable
	POLLERR 		an error occurred
	POLLHUP 		hang-up occurred
	POLLNVAL 		the descriptor is not an open file

Return value
A value less than 0 means an error occurred;
a value of 0 means poll timed out;
a value greater than 0 means one or more of the monitored file descriptors are ready.



epoll

epoll can be considered an improved version of poll, designed to handle large numbers of connections. When only a few connections are active among many concurrent ones, it significantly improves the system's CPU utilization. It is widely regarded as the best IO readiness notification mechanism on Linux 2.6 and later.

Besides the level-triggered (Level Triggered) notification of IO events that select/poll provide, epoll also offers edge triggering (Edge Triggered), which lets the user program cache the IO state, reduce the number of epoll_wait/epoll_pwait calls, and improve efficiency.



function prototype


epoll_create

int epoll_create(int size);

Creates an epoll handle.

Since Linux 2.6.8 the size parameter is ignored, but it must be greater than 0.
When you are done with the handle, you must close() it.


epoll_ctl

int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event);

Registers, modifies, or removes an fd and its events in the epoll instance.

This is epoll's event registration function.

The first parameter is the return value of epoll_create() (the epoll handle).
The second parameter is the operation, expressed with one of three macros:
EPOLL_CTL_ADD: register a new fd with epfd;
EPOLL_CTL_MOD: modify the monitored events of an already registered fd;
EPOLL_CTL_DEL: remove an fd from epfd.

The third parameter is the fd to monitor.
The fourth parameter tells the kernel which events to monitor:

	struct epoll_event {
			__uint32_t events; /* Epoll events */
			epoll_data_t data; /* User data variable */
		};
		typedef union epoll_data {
			void *ptr;
			int fd;
			uint32_t u32;
			uint64_t u64;
		} epoll_data_t;

		EPOLLIN :	the fd is readable (including a normal close of the peer socket)
		EPOLLOUT:	the fd is writable
		EPOLLPRI:	the fd has urgent data to read (out-of-band data arrived)
		EPOLLERR:	an error occurred on the fd
		EPOLLHUP:	the fd was hung up
		EPOLLET: 	put epoll into edge-triggered (Edge Triggered) mode, as opposed to the default level-triggered (Level Triggered) mode
		EPOLLONESHOT:	monitor the event only once; to keep monitoring the socket afterwards, it must be added to the epoll set again

epoll_wait

#include <sys/epoll.h>
	int epoll_wait(int epfd, struct epoll_event *events, int maxevents, int timeout);
		events:		array that receives the ready events from the kernel
		maxevents:	tells the kernel how large the events array is (must be greater than 0)
		timeout:	timeout in milliseconds
			-1:	block indefinitely
			0:	return immediately, do not block
			>0:	wait the given number of milliseconds
		Return value:	the number of ready file descriptors on success, 0 when the timeout expires, -1 on error


epoll principle

When a process calls the epoll_create method, the Linux kernel creates an eventpoll structure. Two members of this structure are central to how epoll works.

struct eventpoll {
    ...
    // root of the red-black tree that stores every monitored event added to this epoll instance
    struct rb_root rbr;
    // doubly linked list holding the ready events that epoll_wait will return to the user
    struct list_head rdllist;
    ...
};

Each epoll object has its own eventpoll structure, which stores the events added to it through the epoll_ctl method.
These events are mounted in the red-black tree, so repeatedly added events can be identified efficiently (insertion into a red-black tree takes O(log n) time, where n is the number of nodes). Every event added
to epoll also establishes a callback relationship with the device (e.g. network card) driver; that callback runs when the corresponding event occurs.
The callback is named ep_poll_callback in the kernel, and it appends the event that occurred to the rdllist doubly linked list. For each event, epoll creates an epitem structure.

struct epitem {
    struct rb_node rbn;          // red-black tree node
    struct list_head rdllink;    // doubly linked list node
    struct epoll_filefd ffd;     // information about the event's file descriptor
    struct eventpoll *ep;        // the eventpoll object this item belongs to
    struct epoll_event event;    // the event types we expect to occur
};

When epoll_wait is called to check whether any event has occurred, it only needs to check whether the rdllist doubly linked list in the eventpoll object contains any epitem elements.
If rdllist is not empty, the ready events are copied to user space and returned to the caller together with their event data; this check is O(1).



Advantages of epoll

  • Compared with select, when there are many connections but only a few of them are active, epoll is far more efficient: it reads directly from the ready queue in O(1), while select must poll every fd, which is O(N).

  • The interface is easy to use. Although it is split into three functions, they are convenient and efficient: there is no need to rebuild the set of watched file descriptors on every cycle, and input and output parameters are separated.

  • Data copying is lightweight: the file descriptor structure is copied into the kernel only on EPOLL_CTL_ADD, which happens rarely (whereas select/poll copy the whole set on every call).

  • Event callback mechanism: instead of traversal, a callback function adds each ready file descriptor structure to the ready queue, and epoll_wait simply reads that queue to learn which file descriptors are ready. This is O(1), so efficiency does not degrade even with a huge number of file descriptors.

  • There is no hard limit on the number of monitored sockets.



epoll event model


There are two EPOLL event models:

  • Edge Triggered (ET): an fd is reported only when new data arrives (a state change), regardless of whether unread data is still left in the buffer.

  • Level Triggered (LT): an fd is reported as long as there is data in the buffer.


LT:

epoll works in LT mode by default.

  • When epoll reports that an event on a socket is ready, the application does not have to handle it immediately, or may handle only part of it
  • If only part of the data is read and data remains in the buffer, the next epoll_wait call still returns immediately and reports the socket as read-ready
  • epoll_wait keeps reporting the fd until all data in the buffer has been consumed
  • Works with both blocking and non-blocking reads and writes

ET

When epoll works in ET mode, the sockets must be non-blocking, otherwise a blocking IO operation on one file descriptor could starve all the other fds.

  • Use non-blocking reads and writes
  • Keep reading (or writing) in a loop until the call returns EAGAIN (for a non-blocking read, no data is available for now); only then is the event considered fully handled. As a shortcut, when read returns fewer bytes than requested, you can also conclude that the buffer is drained and treat the read event as handled.

ET is more efficient than LT: user space can cache the IO state and call epoll_wait less often, but it is harder to program correctly.



Origin blog.csdn.net/juggte/article/details/123267895