Server: Linux IO models and a detailed explanation of select, poll, and epoll (simple and clear)

What are synchronous IO and asynchronous IO, blocking IO and non-blocking IO, and how do they differ? Different people give different answers in different contexts, so let's first pin down the context of this article.

The background discussed in this article is network IO in the Linux environment. 

Part 1: Concepts

Before going further, let's clarify a few concepts:
- User space and kernel space
- Process switching
- Process blocking
- File descriptors
- Cached I/O

User space and kernel space

Modern operating systems use virtual memory, so for a 32-bit operating system the addressable space (virtual address space) is 4G (2 to the 32nd power). The core of the operating system is the kernel, which is independent of ordinary applications; it can access the protected memory space and has full permission to access the underlying hardware devices. To ensure that user processes cannot operate on the kernel directly and to keep the kernel safe, the operating system divides the virtual address space into two parts: kernel space and user space. For the Linux operating system, the highest 1G bytes (virtual addresses 0xC0000000 to 0xFFFFFFFF) are used by the kernel and are called kernel space, while the lower 3G bytes (virtual addresses 0x00000000 to 0xBFFFFFFF) are used by each process and are called user space.

Process switching

In order to control the execution of a process, the kernel must have the ability to suspend a process running on the CPU and resume execution of a previously suspended process. This behavior is called process switching. Therefore, it can be said that any process runs with the support of the operating system kernel and is closely related to the kernel.

Switching from one running process to another involves the following steps:

1. Save the processor context, including the program counter and other registers.
2. Update PCB information.
3. Move the PCB of the process into the corresponding queue, such as ready, blocked in a certain event queue, etc.
4. Select another process to execute and update its PCB.
5. Update memory management data structures.
6. Restore the processor context.

Note: in short, process switching is very resource-intensive.

Process blocking

When a running process finds that some expected event has not yet occurred (for example, a request for a system resource fails, it is waiting for an operation to complete, new data has not yet arrived, or there is no new work to do), the system automatically executes the blocking primitive (block), switching the process from the running state to the blocked state. Blocking is therefore an active behavior of the process itself, and only a process in the running state (one that holds the CPU) can move itself into the blocked state. While a process is in the blocked state, it consumes no CPU resources.

File descriptor (fd)

A file descriptor is a term in computer science, an abstraction used to express a reference to a file.

Formally, a file descriptor is a non-negative integer. In practice it is an index into the table of open files that the kernel maintains for each process. When a program opens an existing file or creates a new one, the kernel returns a file descriptor to the process. Much low-level programming revolves around file descriptors, although the concept really only applies to UNIX-like operating systems such as Linux.
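As a small illustration (not from the original article; the path is chosen only for the example), the following sketch shows the kernel handing back a descriptor from open(2), which is then used by read(2) and close(2):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    //open() asks the kernel to open the file and returns a small non-negative
    //integer, the file descriptor (or -1 on error)
    int fd = open("/etc/hostname", O_RDONLY);
    if (fd == -1) {
        perror("open");
        return 1;
    }

    char buf[128];
    ssize_t n = read(fd, buf, sizeof(buf) - 1);  //the fd indexes this process's open-file table
    if (n > 0) {
        buf[n] = '\0';
        printf("read %zd bytes via fd %d: %s", n, fd, buf);
    }

    close(fd);                                   //release the descriptor
    return 0;
}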

Cached I/O

Cached I/O is also known as standard I/O, and the default I/O operation for most file systems is cached I/O. In Linux's cached I/O mechanism, the operating system keeps I/O data in the file system's page cache; that is, the data is first copied into a kernel buffer and then copied from the kernel buffer into the application's address space.

Disadvantages of cached I/O:

During data transmission, the data has to be copied multiple times between the application address space and the kernel, and the CPU and memory overhead of these copies is considerable.

Part 2: IO models

As I just said, for an IO access (take read as an example), the data will be copied to the buffer of the operating system kernel first, and then copied from the buffer of the operating system kernel to the address space of the application. So, when a read operation happens, it goes through two phases:

1. Waiting for the data to be ready
2. Copying the data from the kernel to the process

Precisely because of these two phases, Linux has the following five network IO models.

- Blocking I/O
- Non-blocking I/O
- I/O multiplexing
- Signal-driven I/O
- Asynchronous I/O

Note: Since signal driven IO is not commonly used in practice, I only mention the remaining four IO Models.

Blocking I/O

In Linux, all sockets are blocking by default. A typical read operation flow is like this:

When the user process calls the recvfrom system call, the kernel starts the first phase of IO: preparing the data (for network IO, the data often has not arrived yet when the call is made; for example, a complete UDP packet has not been received, so the kernel must wait for enough data to arrive). This waiting takes time: the data has to be copied into a buffer in the operating system kernel. On the user side, the whole process is blocked during this time (by its own choice, of course). When the kernel has waited until the data is ready, it copies the data from the kernel to user memory and then returns the result; only then does the user process leave the blocked state and run again.

Therefore, the characteristic of blocking IO is that it is blocked in both stages of IO execution.
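As a hedged sketch (not the article's code; it assumes sockfd is an already-connected TCP socket left in its default blocking mode), a single blocking read looks like this:

#include <stdio.h>
#include <sys/socket.h>

//minimal sketch: one blocking read from an already-connected socket
static void blocking_read(int sockfd) {
    char buf[1024];
    ssize_t n = recv(sockfd, buf, sizeof(buf), 0);   //blocks through both phases:
                                                     //1) wait for data, 2) copy to user memory
    if (n > 0) {
        //n bytes of data are now in buf
    } else if (n == 0) {
        //the peer closed the connection
    } else {
        perror("recv");
    }
}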

Non-blocking I/O

Under Linux, a socket can be made non-blocking by setting an option on it. When a read operation is performed on a non-blocking socket, the flow looks like this:

When the user process issues a read operation and the data in the kernel is not ready yet, the kernel does not block the user process but returns an error immediately. From the user process's point of view, after it initiates a read it does not have to wait; it gets a result right away. When the process sees that the result is an error, it knows the data is not ready, so it can issue the read again. Once the kernel data is ready and a system call from the user process arrives again, the kernel immediately copies the data to user memory and returns.

Therefore, the defining feature of non-blocking IO is that the user process must keep actively asking the kernel whether the data is ready.
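As a hedged sketch (assuming sockfd is an already-connected socket), this is roughly what the "keep asking the kernel" loop looks like: fcntl(2) switches the socket to non-blocking mode, and EAGAIN/EWOULDBLOCK means the data is not ready yet:

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

static ssize_t nonblocking_read(int sockfd, char *buf, size_t len) {
    //switch the socket to non-blocking mode
    int flags = fcntl(sockfd, F_GETFL, 0);
    fcntl(sockfd, F_SETFL, flags | O_NONBLOCK);

    for (;;) {
        ssize_t n = recv(sockfd, buf, len, 0);
        if (n >= 0)
            return n;                        //data copied (or peer closed if n == 0)
        if (errno == EAGAIN || errno == EWOULDBLOCK) {
            //kernel data not ready yet: do something else, then ask again
            usleep(1000);                    //busy polling is wasteful; this is only a sketch
            continue;
        }
        perror("recv");                      //a real error
        return -1;
    }
}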

I/O multiplexing

IO multiplexing is what we usually mean by select, poll, and epoll; in some places this IO model is also called event-driven IO. The advantage of select/epoll is that a single process can handle the IO of many network connections at the same time. The basic principle is that select, poll, or epoll continuously polls all the sockets it is responsible for and notifies the user process when data arrives on any of them.

When the user process calls select, the whole process is blocked; meanwhile, the kernel "monitors" all the sockets that select is responsible for, and as soon as the data in any of them is ready, select returns. The user process then calls the read operation to copy the data from the kernel to the user process.

Therefore, the defining feature of I/O multiplexing is that, through a single mechanism, one process can wait on multiple file descriptors at the same time, and as soon as any of these file descriptors (socket descriptors) becomes read-ready, the select() function returns.

This flow does not look very different from blocking IO; in fact it is worse, because two system calls (select and recvfrom) are needed here, while blocking IO uses only one (recvfrom). The advantage of select, however, is that it can handle many connections at the same time.

Therefore, if the number of connections is not very high, a web server using select/epoll is not necessarily faster than one using multithreading plus blocking IO, and its latency may even be higher. The advantage of select/epoll is not that it handles a single connection faster, but that it can handle many more connections.

In the IO multiplexing model, each socket is generally set to non-blocking in practice. However, the user process as a whole is still blocked the entire time; it is just blocked by the select call rather than by socket IO.

Asynchronous I/O

Asynchronous IO is actually rarely used under Linux. Its flow looks like this:

After the user process initiates the read, it can immediately go on to do other things. From the kernel's perspective, when it receives an asynchronous read request it returns right away, so the user process is never blocked. The kernel then waits for the data to be ready, copies it to user memory, and when all of this is done sends a signal to the user process telling it that the read has completed.
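The article only describes the asynchronous model conceptually. As a hedged illustration (not the author's code), the POSIX AIO interface from <aio.h> follows roughly this pattern: aio_read() returns at once and the process checks for completion later, here by polling aio_error(); a signal or thread notification can be requested through aio_sigevent instead. Note that glibc's POSIX AIO is implemented with user-space threads rather than true kernel asynchronous IO, and older glibc versions need linking with -lrt.

#include <aio.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

static void async_read_sketch(int fd) {
    static char buf[1024];
    struct aiocb cb;
    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd;
    cb.aio_buf    = buf;
    cb.aio_nbytes = sizeof(buf);
    cb.aio_offset = 0;

    if (aio_read(&cb) == -1) {               //returns immediately; the read itself
        perror("aio_read");                  //proceeds in the background
        return;
    }

    //... the process is free to do other work here ...

    while (aio_error(&cb) == EINPROGRESS)    //poll for completion (a sketch; requesting a
        ;                                    //signal via aio_sigevent avoids busy-waiting)

    ssize_t n = aio_return(&cb);             //result of the finished read
    printf("asynchronously read %zd bytes\n", n);
}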

Summary

The difference between blocking and non-blocking

Calling blocking IO will block the corresponding process until the operation is completed, while non-blocking IO will return immediately when the kernel is still preparing data.

The difference between synchronous IO and asynchronous IO

Before explaining the difference between synchronous IO and asynchronous IO, we need to give the definitions of the two. The POSIX definition is like this:
- A synchronous I/O operation causes the requesting process to be blocked until that I/O operation completes;
- An asynchronous I/O operation does not cause the requesting process to be blocked;

The difference between the two is that synchronous IO blocks the process when doing "IO operation". According to this definition, the aforementioned blocking IO, non-blocking IO, and IO multiplexing all belong to synchronous IO.

Some people will say that non-blocking IO is not blocked. There is a subtle ("cunning") point here: the "I/O operation" in the definition refers to the real IO operation, which in our example is the recvfrom system call. With non-blocking IO, if the kernel data is not ready when recvfrom is called, the process is not blocked. But once the kernel data is ready, recvfrom copies the data from the kernel to user memory, and during that copy the process is blocked.

Asynchronous IO is different: when the process initiates the IO operation, the call returns immediately and the process can forget about it until the kernel sends a signal telling it that the IO is complete. Throughout this whole process, the process is never blocked.

Comparing the IO models:

The difference between non-blocking IO and asynchronous IO is still quite clear. With non-blocking IO, although the process is not blocked most of the time, it still has to check actively, and once the data is ready it must actively call recvfrom again to copy the data into user memory. Asynchronous IO is completely different: it is as if the user process hands the whole IO operation over to someone else (the kernel) to complete and gets a signal when that someone is done. In the meantime, the user process neither checks the status of the IO operation nor copies the data itself.

Part 3: I/O multiplexing in detail: select, poll, and epoll

select, poll, and epoll are all IO multiplexing mechanisms. I/O multiplexing is a mechanism through which a single process can monitor multiple descriptors; once some descriptor becomes ready (usually read-ready or write-ready), the program is notified so that it can perform the corresponding read or write. select, poll, and epoll are still essentially synchronous I/O, however, because after a read/write event is ready the process itself must do the read or write, and that read or write blocks; with asynchronous I/O the process does not do the read or write itself, since the asynchronous I/O implementation is responsible for copying the data from the kernel into user space. (A bit long-winded, but worth repeating.)

select

int select (int n, fd_set *readfds, fd_set *writefds, fd_set *exceptfds, struct timeval *timeout);

select monitors three categories of file descriptors: readfds, writefds, and exceptfds. After the call, select blocks until some descriptor becomes ready (readable, writable, or has an exceptional condition) or until the timeout expires (timeout specifies the maximum waiting time; passing NULL makes select block indefinitely, while a zeroed timeval makes it return immediately), and then the function returns. After select returns, the ready descriptors are found by traversing the fd sets.

select is currently supported on almost all platforms, and this good cross-platform support is one of its advantages. One disadvantage of select is that there is a maximum limit on the number of file descriptors a single process can monitor, generally 1024 on Linux. The limit can be raised by changing the macro definition or even recompiling the kernel, but that also tends to reduce efficiency.
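A minimal hedged sketch of the pattern just described (assuming sockfd is a socket we want to watch for readability): build the fd_set, call select, and after it returns test membership with FD_ISSET:

#include <stdio.h>
#include <sys/select.h>
#include <sys/socket.h>

static void select_sketch(int sockfd) {
    fd_set readfds;
    struct timeval tv = { 5, 0 };            //wait at most 5 seconds

    FD_ZERO(&readfds);                       //the fd_set must be rebuilt before every call
    FD_SET(sockfd, &readfds);

    int ready = select(sockfd + 1, &readfds, NULL, NULL, &tv);   //first argument: max fd + 1
    if (ready == -1) {
        perror("select");
    } else if (ready == 0) {
        printf("timeout, no descriptor ready\n");
    } else if (FD_ISSET(sockfd, &readfds)) {
        char buf[1024];
        recv(sockfd, buf, sizeof(buf), 0);   //data is ready; this copy is the second phase
    }
}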

poll

int poll (struct pollfd *fds, unsigned int nfds, int timeout);

Unlike select, which uses three bitmaps to represent the three fd sets, poll uses a single array of pollfd structures passed in by pointer.

struct pollfd {
    int fd; /* file descriptor */
    short events; /* requested events to watch */
    short revents; /* returned events witnessed */
};

The pollfd structure contains both the events to monitor and the events that actually occurred, so the "parameter-value" style of select is no longer needed. Also, poll has no limit on the maximum number of descriptors (although performance still drops when the number is very large). As with select, after poll returns, the pollfd array must be traversed to find the ready descriptors.
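And a hedged sketch of the same pattern with poll (again assuming sockfd is the socket to watch): fill the pollfd array, call poll, then check revents after it returns:

#include <poll.h>
#include <stdio.h>
#include <sys/socket.h>

static void poll_sketch(int sockfd) {
    struct pollfd fds[1];
    fds[0].fd     = sockfd;
    fds[0].events = POLLIN;                  //interested in readability

    int ready = poll(fds, 1, 5000);          //timeout in milliseconds
    if (ready == -1) {
        perror("poll");
    } else if (ready == 0) {
        printf("timeout\n");
    } else if (fds[0].revents & POLLIN) {    //after poll returns we still scan revents
        char buf[1024];
        recv(sockfd, buf, sizeof(buf), 0);
    }
}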

As described above, after select and poll return, the caller must traverse the file descriptors to find the sockets that are ready. In practice, out of a large number of concurrently connected clients only a few may be ready at any moment, so efficiency decreases linearly as the number of monitored descriptors grows.

epoll

epoll was introduced in the 2.6 kernel as an enhanced version of the earlier select and poll. Compared with select and poll, epoll is more flexible and has no descriptor limit. epoll uses one file descriptor to manage many others and keeps the events for the file descriptors the user cares about in an event table inside the kernel, so the copy between user space and kernel space only needs to happen once.

1. epoll operation flow

The epoll operation process requires three interfaces, which are as follows:

int epoll_create(int size);//create an epoll handle; size is a hint to the kernel for how many descriptors will be monitored
int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event);
int epoll_wait(int epfd, struct epoll_event * events, int maxevents, int timeout);

1. int epoll_create(int size);
Creates an epoll handle. size tells the kernel roughly how many descriptors will be monitored; unlike the first parameter of select(), which is the maximum monitored fd plus 1, size does not limit the maximum number of descriptors epoll can listen on, it is only a hint for the kernel's initial allocation of internal data structures.
Once created, the epoll handle itself occupies an fd (you can see it under /proc/<pid>/fd/ on Linux), so when you are done with epoll you must call close() on it, otherwise fds may be exhausted.

2. int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event);
Performs the operation op on the specified descriptor fd.
- epfd: the return value of epoll_create().
- op: the operation, expressed by one of three macros: EPOLL_CTL_ADD (add), EPOLL_CTL_DEL (delete), and EPOLL_CTL_MOD (modify); these add, delete, and modify the events being listened for on fd.
- fd: the file descriptor to be monitored.
- event: tells the kernel which events to monitor; struct epoll_event is defined as follows:

struct epoll_event {
  __uint32_t events;  /* Epoll events */
  epoll_data_t data;  /* User data variable */
};

//events can be a bitwise OR of the following macros:
EPOLLIN: the corresponding file descriptor is readable (this includes a normal close of the peer socket);
EPOLLOUT: the corresponding file descriptor is writable;
EPOLLPRI: the corresponding file descriptor has urgent data to read (this indicates the arrival of out-of-band data);
EPOLLERR: an error occurred on the corresponding file descriptor;
EPOLLHUP: the corresponding file descriptor was hung up;
EPOLLET: puts this fd into edge-triggered (ET) mode, as opposed to the default level-triggered (LT) mode;
EPOLLONESHOT: monitor the event only once; after it has fired, if you need to keep monitoring this socket you must add it to the epoll set again.

3. int epoll_wait(int epfd, struct epoll_event * events, int maxevents, int timeout);
Waits for IO events on epfd and returns at most maxevents events.
The events parameter receives the set of ready events from the kernel, and maxevents tells the kernel how many entries the events array can hold (it must be greater than 0). The timeout parameter is in milliseconds: 0 makes the call return immediately, and -1 makes it block indefinitely. The function returns the number of events that need to be handled; a return value of 0 means the call timed out.

2. Working modes

epoll operates on file descriptors in two modes: LT (level-triggered) and ET (edge-triggered). LT is the default mode. The difference between them is as follows:
LT mode: when epoll_wait detects that an event has occurred on a descriptor and notifies the application of this event, the application does not have to handle the event immediately. The next time epoll_wait is called, it will notify the application of this event again.

ET mode: when epoll_wait detects that an event has occurred on a descriptor and notifies the application of this event, the application must handle the event immediately. If it does not, the next time epoll_wait is called it will not notify the application of this event again.

1. LT mode

LT (level-triggered) is the default working mode and supports both blocking and non-blocking sockets. In this mode the kernel tells you when a file descriptor is ready, and you may then perform IO on that ready fd. If you do nothing, the kernel will keep notifying you.

2. ET mode

ET (edge-triggered) is the high-speed working mode and supports only non-blocking sockets. In this mode the kernel tells you, via epoll, when a descriptor goes from not ready to ready. After that it assumes you know the file descriptor is ready and will not send further readiness notifications for it until you do something that makes it not ready again (for example, a send or receive returns an EWOULDBLOCK/EAGAIN error, or transfers less data than requested). Note that if you never perform IO on the fd (so that it never becomes not-ready again), the kernel will not send any more notifications; you are told only once.

ET mode greatly reduces the number of times epoll events are triggered repeatedly, so it is more efficient than LT mode. When epoll works in ET mode, non-blocking sockets must be used, so that a blocking read or write on one file handle does not starve the task of handling the other file descriptors.
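A hedged sketch of what this means in practice (function and variable names are made up for the example): before adding a descriptor in ET mode, put it into non-blocking mode with fcntl, then drain it until EAGAIN whenever an event fires, as in the read loop shown a bit further below:

#include <fcntl.h>
#include <sys/epoll.h>

//register connfd with an existing epoll instance (epollfd) in edge-triggered mode
static void add_fd_et(int epollfd, int connfd) {
    int flags = fcntl(connfd, F_GETFL, 0);
    fcntl(connfd, F_SETFL, flags | O_NONBLOCK);   //ET requires a non-blocking socket

    struct epoll_event ev;
    ev.events  = EPOLLIN | EPOLLET;               //edge-triggered readability
    ev.data.fd = connfd;
    epoll_ctl(epollfd, EPOLL_CTL_ADD, connfd, &ev);
}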

3. Summary

Consider the following example:

1. We add the file handle (RFD) that represents the read end of a pipe to the epoll descriptor.
2. 2KB of data is written to the other end of the pipe.
3. We call epoll_wait(2), which reports RFD as ready for reading.
4. We read 1KB of data.
5. We call epoll_wait(2) again...

LT mode:
In LT mode, the call to epoll_wait(2) in step 5 will still notify you that RFD is ready.

ET mode:
If we used the EPOLLET flag when adding RFD to the epoll descriptor in step 1, the call to epoll_wait(2) in step 5 is likely to hang, even though the remaining data is still sitting in the file's input buffer and the data sender may still be waiting for a response to the data it already sent. ET mode reports events only when something happens on the monitored file handle, so at step 5 the caller may end up waiting for data that is already in the input buffer.

When working with epoll's ET model, once an EPOLLIN event is generated, keep the following in mind when reading the data: if the size returned by recv() equals the requested size, there is very likely still data left in the buffer, which means the event has not been fully handled and the fd must be read again:

while (rs) {
    buflen = recv(activeevents[i].data.fd, buf, sizeof(buf), 0);
    if (buflen < 0) {
        // In non-blocking mode, errno == EAGAIN means there is no more data
        // to read in the buffer; treat the event as fully handled here.
        if (errno == EAGAIN) {
            break;
        } else {
            return;
        }
    } else if (buflen == 0) {
        // The peer's socket has been closed normally.
    }

    if (buflen == sizeof(buf)) {
        rs = 1;   // the buffer was filled completely: there may be more data, read again
    } else {
        rs = 0;   // a short read: the buffer has been drained
    }
}

The meaning of EAGAIN in Linux

Development in the Linux environment frequently runs into errors (reported via errno), and EAGAIN is one of the more common ones (for example, during non-blocking operations).
Literally, it is a prompt to try again. This error typically occurs when an application performs non-blocking operations on files or sockets.

For example, if a file, socket, or FIFO is opened with the O_NONBLOCK flag and you keep reading when there is no data to read, the program does not block waiting for data to become ready; instead read() returns the error EAGAIN, telling the application that there is nothing to read right now and to try again later.
As another example, when a system call (such as fork) fails because resources (such as virtual memory) are insufficient, it returns EAGAIN to suggest calling it again (it may succeed next time).
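A tiny hedged sketch of the FIFO case mentioned above (the path is only a placeholder; the FIFO would have to be created beforehand with mkfifo): the file is opened with O_NONBLOCK, and EAGAIN from read() simply means "no data yet, try again later":

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    int fd = open("/tmp/myfifo", O_RDONLY | O_NONBLOCK);     //placeholder path
    if (fd == -1) {
        perror("open");
        return 1;
    }

    char buf[128];
    ssize_t n = read(fd, buf, sizeof(buf));
    if (n == -1 && errno == EAGAIN)
        printf("no data available yet, try again later\n");  //not a real failure
    else if (n == 0)
        printf("end of file (no writer on the FIFO)\n");
    else if (n == -1)
        perror("read");

    close(fd);
    return 0;
}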

3. Code demo

The code below is incomplete and loosely formatted; it is only meant to illustrate the flow described above, and some boilerplate (such as socket_bind() and the surrounding declarations) has been removed.

//headers the demo relies on (assumed here; the original omits them together with
//main(), the variable declarations and socket_bind() as removed boilerplate)
#include <sys/epoll.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <errno.h>

#define IPADDRESS   "127.0.0.1"
#define PORT        8787
#define MAXSIZE     1024
#define LISTENQ     5
#define FDSIZE      1000
#define EPOLLEVENTS 100

listenfd = socket_bind(IPADDRESS, PORT);

struct epoll_event events[EPOLLEVENTS];

//create an epoll descriptor
epollfd = epoll_create(FDSIZE);

//add the event for the listening descriptor
add_event(epollfd, listenfd, EPOLLIN);

//loop and wait
for ( ; ; ) {
    //epoll_wait returns the number of descriptor events that are ready
    ret = epoll_wait(epollfd, events, EPOLLEVENTS, -1);
    //handle the ready events (new connections and IO)
    handle_events(epollfd, events, ret, listenfd, buf);
}

//event handler
static void handle_events(int epollfd,struct epoll_event *events,int num,int listenfd,char *buf)
{
     int i;
     int fd;
     //Perform the traversal; here, just traverse the io events that have been prepared. num is not the FDSIZE of the original epoll_create.
     for (i = 0;i < num;i++)
     {
         fd = events[i].data.fd;
        // Process according to the type of descriptor and event type
         if ((fd == listenfd) && (events[i].events & EPOLLIN))
             handle_accept(epollfd, listenfd);
         else if (events[i].events & EPOLLIN)
             do_read(epollfd, fd, buf);
         else if (events[i].events & EPOLLOUT)
             do_write(epollfd, fd, buf);
     }
}

//add event
static void add_event(int epollfd,int fd,int state){
    struct epoll_event ev;
    ev.events = state;
    ev.data.fd = fd;
    epoll_ctl(epollfd,EPOLL_CTL_ADD,fd,&ev);
}

// handle the received connection
static void handle_accept(int epollfd, int listenfd) {
     int clifd;
     struct sockaddr_in cliaddr;
     socklen_t cliaddrlen = sizeof(cliaddr);   //must be initialized before calling accept()
     clifd = accept(listenfd, (struct sockaddr*)&cliaddr, &cliaddrlen);
     if (clifd == -1)
         perror("accept error:");
     else {
         printf("accept a new client: %s:%d\n", inet_ntoa(cliaddr.sin_addr), ntohs(cliaddr.sin_port));
         //add the client descriptor and its read event
         add_event(epollfd, clifd, EPOLLIN);
     }
}

//read processing
static void do_read(int epollfd, int fd, char *buf) {
    int nread;
    nread = read(fd, buf, MAXSIZE - 1);          //leave room for the terminating '\0'
    if (nread == -1) {
        perror("read error:");
        close(fd);                               //remember to close fd
        delete_event(epollfd, fd, EPOLLIN);      //delete the listening event
    }
    else if (nread == 0) {
        fprintf(stderr, "client close.\n");
        close(fd);                               //remember to close fd
        delete_event(epollfd, fd, EPOLLIN);      //delete the listening event
    }
    else {
        buf[nread] = '\0';                       //read() does not null-terminate the buffer
        printf("read message is : %s", buf);
        //modify the descriptor's event from read to write
        modify_event(epollfd, fd, EPOLLOUT);
    }
}

//write processing
static void do_write(int epollfd, int fd, char *buf) {
    int nwrite;
    nwrite = write(fd, buf, strlen(buf));
    if (nwrite == -1) {
        perror("write error:");
        close(fd);                               //remember to close fd
        delete_event(epollfd, fd, EPOLLOUT);     //delete the listening event
    } else {
        //switch the descriptor back to waiting for read events
        modify_event(epollfd, fd, EPOLLIN);
    }
    memset(buf, 0, MAXSIZE);
}

// delete event
static void delete_event(int epollfd,int fd,int state) {
    struct epoll_event ev;
    ev.events = state;
    ev.data.fd = fd;
    epoll_ctl(epollfd,EPOLL_CTL_DEL,fd,&ev);
}

//modify event
static void modify_event(int epollfd,int fd,int state){     
    struct epoll_event ev;
    ev.events = state;
    ev.data.fd = fd;
    epoll_ctl(epollfd,EPOLL_CTL_MOD,fd,&ev);
}

//Note: the client side (the other end) is omitted here.

4. epoll summary

In select/poll, the kernel scans all the monitored file descriptors only after the process calls the relevant function, whereas epoll registers each file descriptor in advance through epoll_ctl(). Once some file descriptor becomes ready, the kernel uses a callback-like mechanism to activate that descriptor quickly, and the process is notified when it calls epoll_wait(). (The traversal of all file descriptors is gone, replaced by this callback-based notification; that is exactly the charm of epoll.)

The advantages of epoll are mainly in the following aspects:

1. The number of monitored descriptors is not limited. The upper limit on fds it supports is the maximum number of files that can be opened, which is generally far larger than 2048; on a machine with 1GB of memory it is roughly 100,000. The exact number can be checked with cat /proc/sys/fs/file-max and depends largely on system memory. The biggest disadvantage of select is precisely the limit on the number of fds a single process can open, which is simply not enough for a server with very many connections. A multi-process solution can work around it (this is how Apache is implemented), but although creating a process on Linux is relatively cheap it is still not negligible, and data synchronization between processes is far less efficient than synchronization between threads, so it is not a perfect solution.

2. IO efficiency does not decrease as the number of monitored fds grows. Unlike the polling done by select and poll, epoll is implemented with a callback registered on each fd, and only the fds that become ready execute their callback.

If there is not a large number of idle or dead connections, epoll is not much more efficient than select/poll; but when there are many idle connections, epoll turns out to be far more efficient than select/poll.

 
