[Repost] How Java NIO handles OP_WRITE to deal with slow connections

For enterprise-class server software, high performance and scalability are basic requirements. The server must also cope with a wide range of environments: good server software should not assume that every client has fast processing power and a good network connection. If a client runs slowly, or its network is slow, each of its requests takes longer to complete. The delay is not caused by the server, so CPU usage does not rise, but the connection stays open longer and the processing thread is occupied longer. As a result, the thread and the other resources tied to the request cannot be released quickly for reuse by other clients. Tomcat is an example: when there are many slow clients, its threads are consumed by those slow connections and the server cannot respond to other requests.
 
The non-blocking mode of NIO enables a small number of threads to serve a large number of requests. Channels are registered with a Selector, which returns only the channels that are ready, so there is no need to allocate a separate thread to each request.
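
The selective behaviour of the Selector can be seen on a loopback connection. The following is a minimal sketch rather than code from any framework: the class ReadySelectionDemo and its method are hypothetical names, and the sketch assumes at least one client. Several accepted channels are registered for OP_READ, only one client sends data, and select() reports just that one channel.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.util.ArrayList;
import java.util.List;

public class ReadySelectionDemo {

    /**
     * Connects the given number of clients (at least 1) over loopback,
     * registers every accepted channel for OP_READ on one Selector, lets
     * only the last client send a byte, and returns how many channels the
     * Selector reports as ready.
     */
    static int readyAfterOneClientWrites(int clients) throws IOException {
        try (Selector selector = Selector.open();
             ServerSocketChannel server = ServerSocketChannel.open()) {
            server.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), 0));
            List<SocketChannel> open = new ArrayList<>();
            try {
                SocketChannel talker = null;
                for (int i = 0; i < clients; i++) {
                    SocketChannel client = SocketChannel.open(server.getLocalAddress());
                    SocketChannel accepted = server.accept();
                    // A channel must be non-blocking before it can be registered.
                    accepted.configureBlocking(false);
                    accepted.register(selector, SelectionKey.OP_READ);
                    open.add(client);
                    open.add(accepted);
                    talker = client;
                }
                // Only one client sends data; the others stay silent.
                talker.write(ByteBuffer.wrap(new byte[] {42}));
                // Wait up to two seconds; only the channel with data is selected.
                return selector.select(2000);
            } finally {
                for (SocketChannel channel : open) {
                    channel.close();
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("Ready channels: " + readyAfterOneClientWrites(3));
    }
}
```

A single thread calling select() therefore learns which of the registered channels need attention, without dedicating a thread to each idle connection.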
 
Popular NIO frameworks usually show how OP_ACCEPT and OP_READ are handled, but they rarely handle OP_WRITE. The code we most often see writes the result straight back to the client once the request has been processed:
 
[Example 17.7] An example of not processing OP_WRITE:
 
while (bb.hasRemaining()) {
    int len = socketChannel.write(bb);
    if (len < 0) {
        throw new EOFException();
    }
}
 
This is fine in most cases. But when the client's network environment is very bad, the server will be hit hard.
 
If there is a problem with the client's network or with an intermediate switch, transmission becomes very slow, and the result that the server has prepared cannot be passed to the client through the TCP/IP layer. When the code above runs in that situation, the following happens.
 
(1) bb.hasRemaining() is always "true" because the server's return result is ready.
 
(2) socketChannel.write(bb) keeps returning 0, because the network cannot take any more data and the socket's send buffer is full.
 
(3) Because the channel is in non-blocking mode, socketChannel.write(bb) does not block and returns immediately.
 
(4) For a period of time, the loop spins rapidly and endlessly, consuming a lot of CPU. It does no useful work until the network allows the pending data to be sent.
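
The situation in points (1) to (4) can be reproduced on a loopback connection. The following is a minimal sketch, not code from any framework: the class ZeroWriteDemo and its method are hypothetical names, and the "slow client" is simulated by a peer that simply never reads, so the socket buffers eventually fill up.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

public class ZeroWriteDemo {

    /**
     * Writes to a non-blocking channel whose peer never reads, and returns
     * the number of bytes accepted before write() first returns 0.
     */
    static long bytesAcceptedBeforeStall() throws IOException {
        try (ServerSocketChannel server = ServerSocketChannel.open()) {
            server.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), 0));
            try (SocketChannel slowClient = SocketChannel.open(server.getLocalAddress());
                 SocketChannel serverSide = server.accept()) {
                serverSide.configureBlocking(false);
                ByteBuffer chunk = ByteBuffer.allocate(64 * 1024);
                long total = 0;
                while (true) {
                    chunk.clear(); // reuse the same 64 KB of filler data
                    int len = serverSide.write(chunk);
                    if (len == 0) {
                        // Buffers are full: a loop like Example 17.7 would
                        // now spin here without making progress.
                        return total;
                    }
                    total += len;
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("Bytes accepted before write() returned 0: "
                + bytesAcceptedBeforeStall());
    }
}
```

The exact byte count depends on the operating system's socket buffer sizes. What matters is that write() returns 0 immediately instead of blocking, which is what turns the loop in Example 17.7 into a busy loop.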
 
This result is obviously not what we want, so OP_WRITE should be handled as well. The most common approach in NIO is as follows.
 
[Example 17.8] The processing of OP_WRITE in the general NIO framework:
 
while (bb.hasRemaining()) {
    int len = socketChannel.write(bb);
    if (len < 0){
        throw new EOFException();
    }
    if (len == 0) {
        selectionKey.interestOps(
                        selectionKey.interestOps() | SelectionKey.OP_WRITE);
        mainSelector.wakeup();
        break;
    }
}
 
When the network is poor, the program above registers interest in OP_WRITE for this channel with the Selector. When the network recovers and the channel can accept more data, the Selector notifies the application through the SelectionKey, and the application then performs the write. This saves a lot of CPU and lets the server adapt to harsh network environments.
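
Example 17.8 shows only the registration side. The other half, run by the selector loop when the key becomes writable, might look like the following sketch. The class PendingWriteHandler and its method are hypothetical names, and the sketch assumes that the unsent ByteBuffer was attached to the SelectionKey when OP_WRITE was registered, which Example 17.8 does not show.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.SocketChannel;

public class PendingWriteHandler {

    /**
     * Intended to be called from the selector loop when key.isWritable()
     * is true. Assumes the unsent data was attached to the key as a
     * ByteBuffer. Returns true once all pending data has been written.
     */
    static boolean handleWritable(SelectionKey key) throws IOException {
        SocketChannel channel = (SocketChannel) key.channel();
        ByteBuffer pending = (ByteBuffer) key.attachment();

        channel.write(pending);
        if (pending.hasRemaining()) {
            // Still not everything sent: keep OP_WRITE and wait for the
            // next notification instead of looping.
            return false;
        }

        // Everything sent: stop asking for OP_WRITE notifications.
        key.interestOps(key.interestOps() & ~SelectionKey.OP_WRITE);
        key.attach(null);
        return true;
    }
}
```

Clearing OP_WRITE once the buffer is drained matters because a healthy socket is writable almost all the time. If the interest were left set, select() would return immediately on every call and the CPU problem would simply move into the selector loop.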
 
However, Grizzly does not handle OP_WRITE this way. Let's look at the Grizzly source code first. In Grizzly, the result of a request is returned from ProcessTask through the SocketChannelOutputBuffer class, and the OutputWriter class finally performs the write. The code that handles OP_WRITE in OutputWriter is as follows:
 
[Example 17.9] Handling of OP_WRITE in Grizzly:
 
public static long flushChannel(SocketChannel socketChannel,
        ByteBuffer bb, long writeTimeout) throws IOException
{
    SelectionKey key = null;
    Selector writeSelector = null;
    int attempts = 0;
    int bytesProduced = 0;
    try {
        while (bb.hasRemaining()) {
            int len = socketChannel.write(bb);
            attempts++;
            if (len < 0){
                throw new EOFException();
            }
            bytesProduced += len;
            if (len == 0) {
                if (writeSelector == null){
                    writeSelector = SelectorFactory.getSelector();
                    if (writeSelector == null){
                        // Continue using the main one
                        continue;
                    }
                }
                key = socketChannel.register(writeSelector, SelectionKey.OP_WRITE);
                if (writeSelector.select(writeTimeout) == 0) {
                    if (attempts > 2)
                        throw new IOException("Client disconnected");
                } else {
                    attempts--;
                }
            } else {
                attempts = 0;
            }
        }
    } finally {
        if (key != null) {
            key.cancel();
            key = null;
        }
        if (writeSelector != null) {
            // Cancel the key.
            writeSelector.selectNow();
            SelectorFactory.returnSelector(writeSelector);
        }
    }
    return bytesProduced;
}

Examples 17.8 and 17.9 differ in what they do when the write stalls (len == 0) because of network conditions. Example 17.8 registers the channel with the current Selector. Example 17.9 instead obtains a temporary Selector from the SelectorFactory and then performs a blocking operation on it: writeSelector.select(writeTimeout). This call waits up to writeTimeout for the channel to become writable. If the wait times out repeatedly, the client connection is considered to have been interrupted abnormally.
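
To make the role of the temporary Selector more concrete, here is a minimal sketch of a pool that hands out spare Selectors. It is not Grizzly's SelectorFactory: the class TemporarySelectorPool and its fixed-size design are assumptions made for illustration. Only the two method names and the convention of returning null when no Selector is available are taken from the calls and the null check visible in Example 17.9.

```java
import java.io.IOException;
import java.nio.channels.Selector;
import java.util.ArrayDeque;
import java.util.Deque;

public class TemporarySelectorPool {

    // Selectors that are currently not lent out.
    private final Deque<Selector> idle = new ArrayDeque<>();

    /** Opens a fixed number of Selectors up front (hypothetical design). */
    TemporarySelectorPool(int size) throws IOException {
        for (int i = 0; i < size; i++) {
            idle.push(Selector.open());
        }
    }

    /** Returns an idle Selector, or null if all of them are in use. */
    synchronized Selector getSelector() {
        return idle.poll();
    }

    /** Gives a Selector back to the pool once the write has finished. */
    synchronized void returnSelector(Selector selector) {
        idle.push(selector);
    }
}
```

Reusing Selectors in this way avoids opening and closing one for every stalled write. A caller has to cancel its key and flush the cancellation, as the finally block in Example 17.9 does with key.cancel() and selectNow(), before returning the Selector, so that the next borrower does not inherit a stale registration.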
 
This implementation is quite controversial. Many developers have asked why the authors of Grizzly did not use the pattern in Example 17.8. Grizzly's approach gives up the non-blocking advantage of NIO by blocking in writeSelector.select(writeTimeout). CPU resources are not wasted, but the thread is held by this request for as long as it blocks and cannot be released to serve other requests.
 
The author of Grizzly responded as follows.
 
(1) The purpose of using a temporary Selector is to reduce switching between threads. The main Selector generally handles OP_ACCEPT and OP_READ, so using a temporary Selector lightens its load. Registering OP_WRITE with the main Selector, by contrast, requires a thread switch and causes unnecessary system calls. Avoiding this frequent thread switching improves system performance.
 
(2) Although writeSelector.select(writeTimeout) blocks, this happens only in a few extreme environments. Most clients rarely trigger it, so not many threads are blocked at the same time.
 
(3) The blocking operation is also used to detect client connections that have been interrupted abnormally.
 
(4) Stress tests have shown that this implementation performs very well.

Origin http://10.200.1.11:23101/article/api/json?id=326802607&siteId=291194637