Promise me: this time we take down I/O multiplexing in one fell swoop

This time, we will work our way up from the simplest Socket network model to I/O multiplexing, step by step.

But I won't go into the parameters of each system call in detail; reference books cover that far more thoroughly than I could.


The most basic socket model

For a client and a server to communicate over a network, Socket programming is required. It is a special form of inter-process communication; what makes it special is that it can communicate across hosts.

The name "Socket" can be confusing at first glance. In fact, before two parties can communicate over the network, each must create a Socket, which is like opening a "hole" on both the client and the server: all reads and writes of data go through this hole. Seen this way, it is as if a network cable connects the two, with one end plugged into the client and the other into the server, and communication flows through it.

When creating a Socket, you can specify whether the network layer uses IPv4 or IPv6, and whether the transport layer uses TCP or UDP.

UDP Socket programming is relatively simple, so here we only cover Socket programming based on TCP.

The server program must run first and then wait for client connections and data. Let's look at the server's Socket programming flow first.

The server first calls the socket() function to create a Socket whose network protocol is IPv4 and whose transport protocol is TCP, then calls the bind() function to bind an IP address and port to the Socket. What is the purpose of binding these two?

  • Binding the port: when the kernel receives a TCP segment, it uses the destination port in the TCP header to find our application and pass the data to it;
  • Binding the IP address: a machine can have multiple network cards, each with its own IP address; by binding an address, we tell the kernel to deliver to us the packets arriving on that network card;

After binding the IP address and port, you can call the listen() function to start listening, which corresponds to the LISTEN state in the TCP state diagram. To check whether a server program has started, we can use the netstat command to see whether the corresponding port number is being listened on.

After the server enters the listening state, it calls the accept() function to take a client connection from the kernel. If no client has connected yet, it blocks and waits for one to arrive.

How does the client initiate a connection? After creating its Socket, the client calls the connect() function to initiate the connection, specifying the server's IP address and port number as parameters; then the much-anticipated TCP three-way handshake begins.

During TCP connection setup, the server's kernel actually maintains two queues for each listening Socket:

  • One holds connections that have not yet been fully established, called the TCP half-connection queue: connections that have not completed the three-way handshake, with the server side in the SYN_RCVD state;
  • The other holds established connections, called the TCP full connection queue: connections that have completed the three-way handshake, with the server side in the ESTABLISHED state;

When the TCP full connection queue is not empty, the server's accept() function takes a Socket whose connection is complete out of the kernel's full connection queue and returns it to the application for subsequent data transfer.

Note that the listening Socket and the Socket actually used to transfer data are two different Sockets:

  • one is called the listening Socket;
  • the other is called the connected Socket;

After the connection is established, the client and server begin transferring data to each other; both sides can read and write data through the read() and write() functions.

At this point, the Socket calling flow for the TCP protocol is complete. The whole process is as follows:

Seeing this, does the way Sockets are read and written remind you of reading and writing files?

Yes. Based on Linux's "everything is a file" philosophy, Sockets also exist in the kernel in the form of "files", and they too have corresponding file descriptors.

PS: The kernel data structures are discussed next. If you are not interested, you can skip this part; it will not affect the later content.

What is the role of a file descriptor? Each process has a data structure called task_struct, which contains a pointer to a "file descriptor array" listing the descriptors of all files the process has opened. The array index is the file descriptor (an integer), and each array element is a pointer into the kernel's table of open files. In other words, the kernel can find the corresponding open file through the file descriptor.

Each file then has an inode. The inode of a Socket file points to the Socket structure in the kernel, which contains two queues: a send queue and a receive queue. Both queues store sk_buff structures, strung together as a linked list.

sk_buff can represent a packet at any layer: at the application layer the unit is called data, at the TCP layer a segment, at the IP layer a packet, and at the data link layer a frame.

You may wonder why a single structure describes packets at every layer. The protocol stack is layered: when data moves down, each layer adds its header; when data moves up, each layer strips its header. If each layer used its own structure, passing data between layers would require multiple copies, which would greatly reduce CPU efficiency.

Therefore, to pass data between layers without copying, a single sk_buff structure is used to describe all network packets. How does it do that? By adjusting the data pointer inside sk_buff. For example:

  • When receiving a packet, starting from the network card driver, the datagram travels up through the protocol stack layer by layer, and each protocol header is stripped by increasing the value of skb->data;
  • When sending a packet, an sk_buff structure is allocated with enough headroom reserved at the front of the data buffer for all the layers' headers; as the packet passes down through each protocol layer, a header is prepended by decreasing the value of skb->data.

The figure below shows how the data pointer moves while a packet is being sent.


How to serve more users?

The TCP Socket flow described above is the simplest and most basic one. It can essentially only handle one-to-one communication, because it uses a synchronous blocking model: while the server is still handling one client's network I/O, or blocked on a read or write, no other client can be served.

But a server that can serve only one client is a waste of resources, so we must improve this network I/O model to support more clients.

Before improving the network I/O model, let me ask a question: do you know the theoretical maximum number of clients that can connect to a single server?

I believe you know that a TCP connection is uniquely identified by a four-tuple: local IP, local port, peer IP, peer port.

A server usually listens on a fixed local port and waits for client connections, so its local IP and port are fixed. In the server's connection four-tuples, only the peer IP and peer port vary, which means the theoretical maximum number of TCP connections = number of client IPs × number of client ports.

For IPv4, there are at most 2 to the 32nd power client IPs and 2 to the 16th power client ports, so the theoretical maximum number of TCP connections for a single server is about 2 to the 48th power.

This theoretical number is quite "generous", but of course a server cannot actually carry that many connections. It is mainly limited in two ways:

  • File descriptors: a Socket is really a file and consumes a file descriptor. Under Linux, the number of file descriptors a single process may open is limited; the unmodified default is typically 1024, though it can be raised with ulimit;
  • System memory: each TCP connection has corresponding data structures in the kernel, so each connection occupies a certain amount of memory;

If the server's memory is only 2 GB and the network card is gigabit, can it support 10,000 concurrent requests?

Ten thousand concurrent requests is the classic C10K problem: C stands for Client, and C10K means a single machine handling 10,000 clients at the same time.

From the hardware's perspective, a server with 2 GB of memory and a gigabit NIC can satisfy 10,000 concurrent requests, provided each request uses less than 200 KB of memory and 100 Kbit of network bandwidth.

However, to truly build a C10K server, the key consideration is the server's network I/O model: an inefficient model adds system overhead and moves us further and further away from the C10K goal.


Multi-process model

Based on the most primitive blocking network I/O, if the server needs to support multiple clients, the more traditional way is the multi-process model: assign each client its own process to handle the request.

The server's main process is responsible for listening for client connections. Once a connection with a client is established, accept() returns a "connected Socket", and a child process is created with the fork() function. The child is a copy of everything relevant in the parent, including file descriptors, memory address space, program counter, the code being executed, and so on.

At the moment of the copy, the two processes are almost identical. They are distinguished by fork()'s return value: 0 in the child, and the child's PID (a positive integer) in the parent.

Because the child process inherits the parent's file descriptors, it can use the "connected Socket" directly to communicate with the client.

Notice that the child process does not need to care about the "listening Socket", only the "connected Socket"; the parent is the opposite: having handed client service off to the child, it only needs to care about the "listening Socket", not the "connected Socket".

The diagram below shows the flow from connection request to connection establishment, with the parent creating a child process to serve the client.

In addition, when a child process exits, some of its information is retained in the kernel and continues to occupy memory. If this "cleanup" work is not done, the child becomes a zombie process, and enough of them will gradually exhaust our system resources.

So the parent process must clean up after its children. How? There are two ways to reclaim resources after a child exits: calling the wait() or waitpid() functions.

This one-process-per-client approach is still feasible for 100 clients, but it cannot hold up when clients number 10,000: every process created occupies a certain amount of system resources, and the "burden" of context switching between processes is heavy, so performance drops sharply.

A process context switch involves not only user-space resources such as virtual memory, the stack, and global variables, but also kernel-space resources such as the kernel stack and registers.


Multi-threading model

Since the "burden" of context switching between processes is very heavy, we will create a relatively lightweight model to deal with multi-user requests -  multi-threading model .

A thread is a "logical flow" running in a process. Multiple threads can run in a single process. Threads in the same process can share some resources of the process, such as file descriptor list, process space, code, global data, heap , shared libraries, etc. These shared resources do not need to be switched during context switching, but only need to switch private data, registers and other unshared data of threads, so the overhead of thread context switching under the same process is much smaller than that of the process. many.

When the TCP connection with a client completes, a thread is created with the pthread_create() function, the file descriptor of the "connected Socket" is passed to the thread function, and communication with the client happens inside that thread, achieving concurrent handling.

If a new thread is created for every connection, the operating system must also destroy each thread when it finishes. Although thread switching is cheap, frequently creating and destroying threads still carries significant overhead.

We can avoid frequent thread creation and destruction with a thread pool: create a number of threads in advance, and whenever a new connection is established, put its connected Socket into a queue; the pool's threads are then responsible for taking connected Sockets out of the queue and handling them.

Note that this queue is global and every thread operates on it, so to avoid races between threads, a thread must take a lock before touching the queue.

Both the process-based and thread-based models above still have a problem: each new TCP connection needs its own process or thread. Reaching C10K would mean one machine maintaining 10,000 connections, i.e. 10,000 processes/threads, which the operating system simply cannot sustain.


I/O multiplexing

Since allocating a process/thread per request is not appropriate, can a single process maintain multiple Sockets? The answer is yes: this is I/O multiplexing.

A process can only handle one request at any instant, but if the events of each request can be handled within 1 millisecond, thousands of requests can be handled within a second. Many requests multiplexing one process: that is multiplexing. The idea is much like a single CPU running multiple processes concurrently, so it is also called time-division multiplexing.

The familiar select/poll/epoll are the multiplexing system calls the kernel provides to user mode: with a single system call, a process can obtain multiple events from the kernel.

How do select/poll/epoll obtain network events? To get events, all connections (file descriptors) are first passed to the kernel; the kernel then returns the connections on which events occurred, and user mode handles the requests for those connections.

select/poll/epoll are three multiplexing interfaces. Can all of them achieve C10K? Let's discuss them one by one.


select/poll

select implements multiplexing by putting the connected Sockets into a file descriptor set, then calling the select function to copy that set into the kernel, which checks for network events. The check is crude: the kernel traverses the set, and when an event is detected it marks the Socket readable or writable; the whole set is then copied back to user mode, where the program must traverse it again to find the readable or writable Sockets and handle them.

So with select, the file descriptor set is "traversed" twice, once in kernel mode and once in user mode, and "copied" twice: passed from user space into the kernel, modified there, and then copied back out to user space.

select uses a fixed-length bitmap (BitsMap) to represent the file descriptor set, so the number of descriptors it supports is limited. On Linux it is limited by the kernel's FD_SETSIZE, whose default maximum is 1024, meaning only file descriptors 0 through 1023 can be monitored.

poll no longer uses a bitmap to store the file descriptors of interest; instead it uses a dynamic array, organized as a linked list, which breaks select's limit on the number of descriptors. It is of course still bounded by the system's file descriptor limits.

However, poll is not fundamentally different from select. Both store the process's Socket set of interest in a "linear structure", so both must traverse the file descriptor set to find a readable or writable Socket, which is O(n), and both must copy the set between user mode and kernel mode. As concurrency grows, the performance loss grows rapidly.


epoll

epoll solves the problem of select/poll in two ways.

The first point: epoll uses a red-black tree inside the kernel to track all the file descriptors the process wants detected, and a Socket to be monitored is added to that tree with the epoll_ctl() function. The red-black tree is an efficient data structure, with O(log n) insertion, deletion, and lookup in general. Because the kernel keeps this tree, each operation only needs to pass in the one Socket being changed, rather than the whole Socket set as select/poll require, which eliminates a large amount of data copying and memory allocation between kernel and user space.

The second point: epoll uses an event-driven mechanism. The kernel maintains a linked list of ready events; when an event occurs on a Socket, the kernel adds it to this ready list via a callback. When the user calls the epoll_wait() function, only the file descriptors with events are returned; there is no need to poll and scan the entire Socket set as select/poll do, which greatly improves detection efficiency.

From the figure below, you can see the role of epoll-related interfaces:

With epoll, efficiency does not degrade much even as the number of monitored Sockets grows, and the number that can be monitored simultaneously is very large: the upper limit is the system-defined maximum number of file descriptors a process may open. This is why epoll is known as the sharp tool for solving the C10K problem.

As a digression: many articles online claim that when epoll_wait returns, epoll uses shared memory for the ready events, i.e. both user mode and kernel mode point at the ready list, thereby avoiding a memory copy.

This is wrong! Anyone who has read the epoll kernel source knows that shared memory is not used at all. In the kernel code implementing epoll_wait, the __put_user function is called, which copies data from the kernel to user space.

Okay, that's all for this digression, let's move on!

epoll supports two event trigger modes: edge-triggered (ET) and level-triggered (LT).

These two terms are quite abstract, but the difference between them is well understood.

  • In edge-triggered mode, when a readable event occurs on the monitored Socket, the server is woken from epoll_wait only once, even if the process never calls the read function to drain the kernel buffer. Our program must therefore make sure to read all the data in the kernel buffer in one go;
  • In level-triggered mode, when the monitored Socket has a readable event, the server keeps being woken from epoll_wait until the kernel buffer has been drained by the read function. The point is to keep telling us that there is data to read;

For example: suppose your package has been placed in a delivery locker. If the locker texts you only once, and never sends a second reminder even if you never pick the package up, that is edge triggering. If the locker keeps texting you until you finally take the package out, that is level triggering.

That is the difference between the two. Level triggering keeps delivering the event to the user as long as its condition holds, for example while there is data in the kernel waiting to be read; edge triggering fires only at the moment the condition first becomes true, and the same event is not delivered again afterwards.

With level-triggered mode, when the kernel reports a file descriptor readable or writable, we can keep checking its state, so we do not have to finish all the reads and writes at once after a notification.

With edge-triggered mode, an I/O event is delivered only once and we do not know how much data can be read or written, so on notification we should read and write as much as possible to avoid losing the chance. We therefore read and write the file descriptor in a loop. But if the descriptor is blocking, the process will block inside the read or write function once no data remains, and the program cannot continue. Edge-triggered mode is therefore generally paired with non-blocking I/O: the program keeps performing I/O until the system call (such as read or write) returns an error with errno set to EAGAIN or EWOULDBLOCK.

Generally speaking, edge triggering is more efficient than level triggering, because it reduces the number of epoll_wait system calls, and system calls carry their own overhead, including context switches.

select/poll have only level-triggered mode. epoll's default trigger mode is level-triggered, but it can be set to edge-triggered to suit the application.

In addition, when using I/O multiplexing, it is best to pair it with non-blocking I/O. The Linux manual says the following about select:

Under Linux, select() may report a socket file descriptor as "ready for reading", while nevertheless a subsequent read blocks. This could for example happen when data has arrived but upon examination has wrong checksum and is discarded. There may be other circumstances in which a file descriptor is spuriously reported as ready. Thus it may be safer to use O_NONBLOCK on sockets that should not block.


Put simply, the events returned by the multiplexing API are not guaranteed to be readable or writable. With blocking I/O, the program may block inside read/write when such a spurious event occurs, so it is best to use non-blocking I/O to cope with these rare special cases.


Summary

The most basic TCP Socket programming uses a blocking I/O model and can essentially only communicate one-to-one. To serve more clients, we need to improve the network I/O model.

The more traditional approach is the multi-process/multi-thread model: each client connection is given a process or thread, and subsequent reads and writes happen in that process or thread. This handles 100 clients without trouble, but once clients number 10,000, the scheduling, context switching, and memory of 10,000 processes/threads become the bottleneck.

To solve this problem, I/O multiplexing appeared: a single process can handle I/O on multiple files. Linux provides three I/O multiplexing APIs: select, poll, and epoll.

There is no essential difference between select and poll. They both use a "linear structure" to store the Socket collection that the process pays attention to.

To use them, the Socket set of interest is first copied from user mode to kernel mode via the select/poll system call, and the kernel then detects events. When a network event occurs, the kernel traverses the process's Socket set, finds the relevant Socket and marks it readable/writable, then copies the entire Socket set back from kernel mode to user mode, where the program traverses the whole set again to find the readable/writable Sockets and handle them.

Clearly, the weakness of select and poll is that as the number of clients grows, i.e. as the Socket set gets larger, traversing and copying the set brings heavy overhead, so they struggle to reach C10K.

epoll is a powerful tool to solve the C10K problem. It solves the problem of select/poll in two ways.

  • epoll uses the "red-black tree" in the kernel to pay attention to all sockets to be detected in the process. The red-black tree is an efficient data structure. The general time complexity of adding, deleting and checking is O(logn). Through the management of this black-red tree, There is no need to pass in the entire Socket collection for each operation like select/poll, which reduces a large amount of data copying and memory allocation in the kernel and user space.
  • epoll uses an event-driven mechanism. A "linked list" is maintained in the kernel to record ready events. Only the set of Sockets that have events occurred is passed to the application. There is no need to poll and scan the entire set (including yes and no) like select/poll. event Socket), greatly improving the efficiency of detection.

Moreover, epoll supports both edge triggering and level triggering, while select/poll support only level triggering. In general, edge triggering is more efficient than level triggering.

Origin blog.csdn.net/m0_63437643/article/details/123793896