11 - An in-depth look at the implementation principles behind NIO optimization

One Tomcat tuning measure that is often mentioned is changing the connector's I/O model. Before Tomcat 8.5, the BIO thread model was used by default; in high-load, high-concurrency scenarios you can switch to the NIO thread model to improve the system's network communication performance.

We can run a performance comparison test to see how BIO and NIO communication performs under high load or high concurrency (here page requests are used to simulate requests involving multiple I/O read and write operations):

Test results: Tomcat with the NIO thread model has a clear advantage when a request involves many I/O read and write operations.

Tomcat looks like a simple piece of configuration, but it embodies a lot of optimization knowledge. Next, we will start with the optimization of the underlying network I/O model, move on to memory-copy and thread-model optimization, and analyze in depth how communication frameworks such as Tomcat and Netty improve system performance by optimizing I/O.

1. Network I/O model optimization

In network communication, the bottom layer is the network I/O model in the kernel. As technology has evolved, the operating system kernel has developed five I/O models. The book "UNIX Network Programming" divides them into blocking I/O, non-blocking I/O, I/O multiplexing, signal-driven I/O, and asynchronous I/O. Each I/O model emerged as an optimization and upgrade of the one before it.

The original blocking I/O requires one user thread per connection, and whenever an I/O operation is not ready or has ended, the thread is suspended and enters a blocked waiting state, so I/O becomes the root cause of performance bottlenecks.

So where exactly does blocking happen in socket communication?

In "Unix Network Programming", socket communication can be divided into stream socket (TCP) and datagram socket (UDP). Among them, the TCP connection is the most commonly used one. Let's understand the workflow of the TCP server (because the data transmission of TCP is more complicated, there is the possibility of unpacking and packing, here I only assume the simplest TCP data transmission):

  • First, the application creates a socket through the socket system call; the socket is a file descriptor assigned to the application by the system;
  • Second, the application binds an address and port number to the socket through the bind system call, giving the socket a name;
  • Then, the application calls listen to create a queue that stores incoming connections from clients;
  • Finally, the application waits for client connection requests through the accept system call.

When a client connects to the server, the server calls fork to create a child process, which reads the data sent by the client through the read system call and returns a response through write.
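To make the workflow concrete, here is a minimal blocking echo server in Java (a sketch only; the port 8080 and the echo logic are arbitrary choices). ServerSocket wraps the same socket/bind/listen/accept calls described above, and the thread-per-connection handling plays the role of the fork in the C workflow:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.ServerSocket;
import java.net.Socket;

public class BlockingEchoServer {
    public static void main(String[] args) throws Exception {
        // socket() + bind() + listen() all happen inside the ServerSocket constructor
        try (ServerSocket serverSocket = new ServerSocket(8080)) {
            while (true) {
                Socket client = serverSocket.accept();       // blocks until a client connects
                new Thread(() -> handle(client)).start();    // one thread per connection, like fork in the C model
            }
        }
    }

    private static void handle(Socket client) {
        try (BufferedReader in = new BufferedReader(new InputStreamReader(client.getInputStream()));
             PrintWriter out = new PrintWriter(client.getOutputStream(), true)) {
            String line;
            while ((line = in.readLine()) != null) {         // read blocks until the client sends data
                out.println("echo: " + line);                // write the response back
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}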

1.1. Blocking I/O

Throughout the socket communication workflow, a socket is blocking by default. That is to say, when a socket call that cannot complete immediately is issued, the process is blocked, suspended by the system, and put to sleep, waiting for the corresponding operation to respond. In this workflow, blocking can mainly occur in three places.

connect blocking: when the client initiates a TCP connection request, the connect system call is made. Establishing a TCP connection requires completing the three-way handshake: the client has to wait for the SYN and ACK sent back by the server, and likewise the server has to block waiting for the client's ACK confirming the connection. This means every TCP connect blocks until the connection is confirmed.

accept blocking: a server using blocking sockets calls the accept function to receive external connections. If no new connection arrives, the calling process is suspended and enters a blocked state.

read/write blocking: when a socket connection has been established, the server uses fork to create a child process and calls the read function to wait for the client to write data. If no data is written, the calling child process is suspended and enters a blocked state.

1.2. Non-blocking I/O

We can use fcntl to make the three operations above non-blocking. If no data is ready, the call returns an EWOULDBLOCK or EAGAIN error immediately instead of keeping the process blocked.

Once these operations are set to non-blocking, we need a thread to poll them and keep checking whether they are ready; this is the most traditional non-blocking I/O model.
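A minimal Java sketch of this polling style, assuming a server on localhost:8080 (configureBlocking(false) plays roughly the role that fcntl with O_NONBLOCK plays in C):

import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SocketChannel;

public class PollingClient {
    public static void main(String[] args) throws Exception {
        SocketChannel channel = SocketChannel.open();
        channel.configureBlocking(false);             // roughly what fcntl(fd, F_SETFL, O_NONBLOCK) does in C
        channel.connect(new InetSocketAddress("localhost", 8080));
        while (!channel.finishConnect()) {
            // connect no longer blocks; the thread just keeps checking until the handshake completes
        }
        ByteBuffer buffer = ByteBuffer.allocate(1024);
        while (true) {
            int read = channel.read(buffer);          // returns immediately; 0 means no data yet
            if (read > 0) {
                buffer.flip();
                System.out.print(new String(buffer.array(), 0, read));
                buffer.clear();
            } else if (read == -1) {
                break;                                // the peer closed the connection
            }
            // every empty iteration burns CPU, which is exactly the drawback discussed next
        }
        channel.close();
    }
}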

1.3. I/O multiplexing

Using user threads to poll the status of an I/O operation is a disaster for CPU usage when there are a large number of requests. So besides polling, is there another way to implement non-blocking I/O sockets?

Linux provides the I/O multiplexing functions select/poll/epoll. A process can block on one of these system calls while waiting on one or more descriptors, and the kernel detects for us whether any of the monitored read operations are ready.

select() function: its purpose is to monitor, within a timeout period, whether readable, writable, or exception events occur on the file descriptors the user is interested in. The Linux kernel treats every external device as a file; reading from or writing to a file goes through system calls provided by the kernel, which return a file descriptor (fd).

int select(int maxfdp1, fd_set *readset, fd_set *writeset, fd_set *exceptset, const struct timeval *timeout);

Looking at the prototype above, the file descriptors monitored by select() fall into three categories: readset (readable file descriptors), writeset (writable file descriptors), and exceptset (exception-event file descriptors).

After being called, select() blocks until some descriptor becomes ready or the timeout expires, then returns. After select() returns, we can traverse the fd_set with FD_ISSET to find the ready descriptors. An fd_set can be understood as a set of file descriptors, manipulated with the following four macros:

void FD_ZERO(fd_set *fdset);           // clear all bits in the set
void FD_SET(int fd, fd_set *fdset);    // add a given file descriptor to the set
void FD_CLR(int fd, fd_set *fdset);    // remove a given file descriptor from the set
int  FD_ISSET(int fd, fd_set *fdset);  // test whether a given file descriptor in the set is ready

poll() function: every call to select() requires copying the fd sets from user space to kernel space, which imposes a certain performance overhead. In addition, the number of fds a single process can monitor is limited to 1024 by default; we can raise this limit by modifying the macro definition or even recompiling the kernel, but because fd_set is implemented as an array, adding and removing fds becomes less efficient as the number grows.

The mechanism of poll() is similar to that of select(); there is not much essential difference between the two. poll() manages multiple descriptors by polling and processes them according to their status, but poll() has no limit on the maximum number of file descriptors.

poll() and select() share the same drawback: the array containing a large number of file descriptors is copied as a whole between user space and the kernel address space on every call, and regardless of whether those descriptors are ready, the overhead grows linearly with the number of file descriptors.

epoll() function: select/poll scans the fds sequentially to see whether they are ready, and the number of fds it supports should not be too large, so its use is subject to some restrictions.

Linux introduced the epoll call in kernel version 2.6. epoll uses an event-driven approach instead of polling and scanning fds: a file descriptor is registered in advance through epoll_ctl() and stored in a kernel event table. This event table is implemented as a red-black tree, so under a large number of I/O requests, insertions and deletions perform better than the array-based fd_set of select/poll. epoll therefore performs better and is not limited by the number of fds.

int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event);

In the prototype above, epfd is the epoll-specific file descriptor created by epoll_create(), op specifies the type of operation, fd is the file descriptor to be associated, and event describes the events to monitor.

Once a file descriptor becomes ready, the kernel uses a callback mechanism to activate it quickly. When the process calls epoll_wait(), it is notified and can then perform the related I/O operations.

int epoll_wait(int epfd, struct epoll_event *events, int maxevents, int timeout);

1.4. Signal-driven I/O

Signal-driven I/O is similar to the observer pattern: the kernel is the observer, and the signal callback is the notification. When a user process initiates an I/O request, it registers a signal handler for the corresponding socket through the sigaction system call. The user process is not blocked and continues to run. When the kernel has the data ready, it sends a SIGIO signal to the process, and the signal handler then performs the related I/O operations.

Compared with the previous three I/O models, signal-driven I/O does not block the process while waiting for data to be ready; the main loop can keep working, so performance is better.

For TCP, however, signal-driven I/O is hardly ever used. SIGIO is a Unix signal that carries no extra information, so if a signal source can be triggered for several reasons, the receiver cannot tell exactly what happened. A TCP socket can generate SIGIO for as many as seven different events, so when the application receives SIGIO it has no way to distinguish them and handle each one.

Signal-driven I/O is used in UDP communication, though. As the UDP communication flow chart in Lecture 10 shows, UDP has only one data-arrival event, which means that in normal cases a UDP process only needs to catch the SIGIO signal and call recvfrom to read the arriving datagram; if something goes wrong, an error is returned. The NTP server, for example, uses this model.

1.5. Asynchronous I/O

Although signal-driven I/O does not block the process while waiting for data to be ready, the I/O operation performed after the notification still blocks: the process has to wait for the data to be copied from kernel space into user space. Asynchronous I/O implements truly non-blocking I/O.

When a user process initiates an I/O request, it tells the kernel to start the operation and to notify the process only after the entire operation, including waiting for the data to be ready and copying it from kernel space to user space, has completed. Because of the high code complexity, the difficulty of debugging, and the fact that operating systems with mature asynchronous I/O support are rare (at present Linux does not truly support asynchronous I/O, while Windows already implements it), the asynchronous I/O model is seldom seen in real production environments.
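For reference, Java exposes this model through the NIO.2 asynchronous channels introduced in JDK 7 (on Linux they are currently built on epoll plus an internal thread pool rather than true kernel asynchronous I/O). A minimal sketch, with port 8080 as an arbitrary choice:

import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.AsynchronousServerSocketChannel;
import java.nio.channels.AsynchronousSocketChannel;
import java.nio.channels.CompletionHandler;

public class AsyncEchoServer {
    public static void main(String[] args) throws Exception {
        AsynchronousServerSocketChannel server =
                AsynchronousServerSocketChannel.open().bind(new InetSocketAddress(8080));

        // accept() returns immediately; the handler is invoked when a connection arrives
        server.accept(null, new CompletionHandler<AsynchronousSocketChannel, Void>() {
            @Override
            public void completed(AsynchronousSocketChannel client, Void attachment) {
                server.accept(null, this);                       // keep accepting further connections
                ByteBuffer buffer = ByteBuffer.allocate(1024);
                client.read(buffer, buffer, new CompletionHandler<Integer, ByteBuffer>() {
                    @Override
                    public void completed(Integer bytesRead, ByteBuffer buf) {
                        if (bytesRead > 0) {
                            buf.flip();
                            client.write(buf);                   // echo back; also asynchronous
                        }
                    }
                    @Override
                    public void failed(Throwable exc, ByteBuffer buf) { exc.printStackTrace(); }
                });
            }
            @Override
            public void failed(Throwable exc, Void attachment) { exc.printStackTrace(); }
        });

        Thread.currentThread().join();                           // keep the main thread alive
    }
}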

In Lecture 08, I mentioned that NIO uses the multiplexer Selector to implement non-blocking I/O. The Selector is based on the I/O multiplexing model among the five models above; Java's Selector is in fact a wrapper over select/poll/epoll.

As mentioned in the TCP communication process above, connect, accept, read and write in socket communication are blocking operations; they correspond to the Selector's four SelectionKey monitoring events: OP_ACCEPT, OP_CONNECT, OP_READ and OP_WRITE.

In NIO server-side programming, we first create a Channel to listen for client connections; then we create a multiplexer Selector and register the Channel with it. The program polls the registered Channels through the Selector; when one or more Channels are ready, the Selector returns the ready listening events, and the program matches each event and performs the corresponding I/O operation.
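A condensed sketch of that flow, again with port 8080 and the buffer size as arbitrary choices:

import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.util.Iterator;

public class NioEchoServer {
    public static void main(String[] args) throws Exception {
        ServerSocketChannel serverChannel = ServerSocketChannel.open();   // the Channel that listens for connections
        serverChannel.bind(new InetSocketAddress(8080));
        serverChannel.configureBlocking(false);

        Selector selector = Selector.open();                              // the multiplexer
        serverChannel.register(selector, SelectionKey.OP_ACCEPT);         // register the Channel with the Selector

        while (true) {
            selector.select();                                            // blocks until at least one Channel is ready
            Iterator<SelectionKey> keys = selector.selectedKeys().iterator();
            while (keys.hasNext()) {
                SelectionKey key = keys.next();
                keys.remove();
                if (key.isAcceptable()) {                                 // OP_ACCEPT: a new connection is ready
                    SocketChannel client = serverChannel.accept();
                    client.configureBlocking(false);
                    client.register(selector, SelectionKey.OP_READ);
                } else if (key.isReadable()) {                            // OP_READ: data has arrived
                    SocketChannel client = (SocketChannel) key.channel();
                    ByteBuffer buffer = ByteBuffer.allocate(1024);
                    int read = client.read(buffer);
                    if (read > 0) {
                        buffer.flip();
                        client.write(buffer);                             // echo the data back
                    } else if (read == -1) {
                        client.close();
                    }
                }
            }
        }
    }
}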

When a Selector is created, the program chooses which I/O multiplexing function to use according to the operating system version. Since JDK 1.5, if the program runs on Linux with kernel 2.6 or later, NIO uses epoll instead of the traditional select/poll, which greatly improves NIO's communication performance.

Since signal-driven I/O does not work for TCP communication, and asynchronous I/O in the Linux kernel is not yet mature, most frameworks still build their network communication on the I/O multiplexing model.

2. Zero copy

In the I/O multiplexing model, read and write I/O operations are still blocking, and each of them involves multiple memory copies and context switches, which add to the system's performance overhead.

Zero copy is a technique that avoids these repeated memory copies in order to optimize read and write I/O operations.

In network programming, an I/O read/write cycle is usually completed with read and write. Each cycle requires four memory copies, along the path: I/O device -> kernel space -> user space -> kernel space -> other I/O device.

The mmap function in the Linux kernel can replace read/write I/O operations so that user space and kernel space share one buffer. mmap maps an address in user space and an address in kernel space to the same physical memory address: both addresses are virtual, and they are resolved to the same physical page through address mapping. This avoids copying data between kernel space and user space. The epoll function used in I/O multiplexing also uses mmap to reduce memory copies.
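In Java, the closest counterpart to mmap is FileChannel.map(), which returns a MappedByteBuffer backed by pages shared with the kernel. A small sketch, assuming a non-empty file named data.txt exists in the working directory:

import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MmapExample {
    public static void main(String[] args) throws Exception {
        try (FileChannel channel = FileChannel.open(Path.of("data.txt"),
                StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            // Map the file into memory: reads and writes operate on the shared pages directly,
            // with no extra copy between kernel space and user space.
            MappedByteBuffer mapped = channel.map(FileChannel.MapMode.READ_WRITE, 0, channel.size());
            if (mapped.capacity() > 0) {
                byte first = mapped.get(0);   // read straight from the mapped region
                mapped.put(0, first);         // write straight back, without read()/write() system calls
            }
        }
    }
}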

In Java NIO programming, a Direct Buffer is used to achieve zero copy of memory. Java allocates a block of physical memory outside the JVM heap, so that the kernel and the user process can share the same buffer. This was explained in detail in Lecture 08, which you can review.
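A short sketch of what using a Direct Buffer looks like in practice (the host, port, and message here are placeholders):

import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SocketChannel;
import java.nio.charset.StandardCharsets;

public class DirectBufferWrite {
    public static void main(String[] args) throws Exception {
        try (SocketChannel channel = SocketChannel.open(new InetSocketAddress("localhost", 8080))) {
            // The buffer lives in native memory outside the JVM heap, so the kernel can
            // read from it directly without an extra copy from the heap to native memory.
            ByteBuffer direct = ByteBuffer.allocateDirect(1024);
            direct.put("hello".getBytes(StandardCharsets.UTF_8));
            direct.flip();
            while (direct.hasRemaining()) {
                channel.write(direct);        // data flows from the direct buffer to the socket
            }
        }
    }
}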

3. Thread model optimization

Besides the kernel's optimization of the network I/O model, NIO has also been optimized and upgraded at the user level. NIO performs I/O operations based on an event-driven model. The Reactor model is a common model for synchronous I/O event handling: its core idea is to register I/O events with a multiplexer; once an I/O event is triggered, the multiplexer dispatches the event to an event handler, which performs the ready I/O operation. The model has the following three main components:

  • Event receiver Acceptor: mainly responsible for receiving request connections;
  • Event separator Reactor: after a request is received, the established connection is registered with the Reactor, which keeps polling the multiplexer Selector in a loop; once an event is detected, it dispatches the event to an event handler;
  • Event handler Handlers: The event handler mainly completes related event processing, such as reading and writing I/O operations.

3.1. Single-threaded Reactor threading model

Initially, NIO was implemented based on a single thread, and all I/O operations were performed on one NIO thread. Since NIO is non-blocking I/O, theoretically one thread can complete all I/O operations.

However, NIO has not truly eliminated blocking, because the user process is still blocked while the data for a read or write operation is being copied. This approach therefore runs into a performance bottleneck under high load and high concurrency: if a single NIO thread has to handle the I/O of tens of thousands of connections at the same time, the system cannot support requests of that magnitude.

3.2. Multi-threaded Reactor threading model

To solve the performance bottleneck of single-threaded NIO in high-load, high-concurrency scenarios, a thread pool was later introduced.

Both Tomcat and Netty use a single Acceptor thread to listen for connection-request events. When a connection is established, it is registered with the multiplexer; once an event is detected on it, it is handed over to the Worker thread pool for processing. In most cases this threading model meets the performance requirements, but if the number of connected clients grows by an order of magnitude, the single Acceptor thread may become a performance bottleneck.

3.3. Master-slave Reactor threading model

The NIO components of today's mainstream communication frameworks are implemented on the master-slave Reactor thread model. In this model, the Acceptor is no longer a single NIO thread but a thread pool. The Acceptor accepts the client's TCP connection request, and once the connection is established, the subsequent I/O operations are handed over to the Worker I/O threads.
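A minimal Netty-style sketch of this arrangement (Netty is one of the frameworks mentioned above; the boss group plays the Acceptor role, shown here with one thread although it can be a pool, and the worker group handles the subsequent I/O; the port and the logging handler are placeholders):

import io.netty.bootstrap.ServerBootstrap;
import io.netty.channel.ChannelInitializer;
import io.netty.channel.EventLoopGroup;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.SocketChannel;
import io.netty.channel.socket.nio.NioServerSocketChannel;
import io.netty.handler.logging.LoggingHandler;

public class MasterSlaveReactorServer {
    public static void main(String[] args) throws Exception {
        EventLoopGroup bossGroup = new NioEventLoopGroup(1);     // "master" Reactor: accepts connections
        EventLoopGroup workerGroup = new NioEventLoopGroup();    // "slave" Reactors: handle read/write I/O
        try {
            ServerBootstrap bootstrap = new ServerBootstrap();
            bootstrap.group(bossGroup, workerGroup)
                     .channel(NioServerSocketChannel.class)
                     .childHandler(new ChannelInitializer<SocketChannel>() {
                         @Override
                         protected void initChannel(SocketChannel ch) {
                             // business handlers for each accepted connection go here
                             ch.pipeline().addLast(new LoggingHandler());
                         }
                     });
            bootstrap.bind(8080).sync().channel().closeFuture().sync();
        } finally {
            bossGroup.shutdownGracefully();
            workerGroup.shutdownGracefully();
        }
    }
}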

4. Tomcat parameter tuning based on thread model

In Tomcat, BIO and NIO are implemented based on the master-slave Reactor thread model.

In BIO, Tomcat's Acceptor is only responsible for listening for new connections. Once a connection is established, it is handed over to a Worker thread, which performs the I/O read and write operations.

In NIO, Tomcat adds a Poller thread pool. After the Acceptor detects a connection, it does not hand the request directly to a Worker thread; instead, it first puts it into the Poller's buffer queue. Each Poller maintains a Selector object; when the Poller takes a connection out of the queue, it registers the connection with the Selector, then traverses the Selector to find the ready I/O operations and lets a Worker thread process the corresponding request.

You can configure the Acceptor and Worker thread pools with the following parameters (a sample Connector configuration is shown after the parameters).

acceptorThreadCount: the number of Acceptor threads. When the volume of client requests is very large, this can be increased appropriately to improve the ability to accept connections. The default value is 1.

maxThreads: the number of Worker threads dedicated to processing I/O operations; the default is 200. This parameter can be adjusted to the actual environment, but bigger is not necessarily better.

acceptCount: Tomcat's Acceptor thread takes connections out of the accept queue and hands them over to Worker threads; acceptCount is the size of that accept queue.

When HTTP keep-alive is disabled and concurrency is relatively high, this value can be increased appropriately. When HTTP keep-alive is enabled, Worker threads are limited in number and may be occupied for a long time, so connections can time out while waiting in the accept queue; if the accept queue is too large, connections are easily wasted.

maxConnections: the maximum number of socket connections Tomcat will accept. In BIO mode one thread handles only one connection, so maxConnections is generally set equal to maxThreads; in NIO mode one thread handles multiple connections at once, so maxConnections should be set much larger than maxThreads. The default is 10000.
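Put together, these parameters are typically set on the Connector element in server.xml. A sketch of an NIO connector, where the values are purely illustrative rather than recommendations:

<!-- Illustrative NIO connector configuration; values are examples, not recommendations -->
<Connector port="8080"
           protocol="org.apache.coyote.http11.Http11NioProtocol"
           acceptorThreadCount="2"
           maxThreads="400"
           acceptCount="200"
           maxConnections="10000"
           connectionTimeout="20000" />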
