The NIO framework in detail: Netty's road to high performance

1. Overview

 

1.1 Amazing performance data


Recently a friend messaged me privately: using Netty 4 plus Thrift's compact binary codec, his team achieved 100,000 TPS (10W TPS, with 1KB complex POJO payloads) for cross-node remote service calls. Compared with a traditional communication framework based on Java serialization and BIO (synchronous blocking IO), that is a performance improvement of more than eight times.

In fact, this figure does not surprise me. Based on more than five years of NIO programming experience, I know such numbers are entirely achievable: choose a suitable NIO framework, design the Reactor thread model carefully, and pair it with a high-performance compact binary codec.

Let's look at how Netty supports 100,000 TPS cross-node remote service calls. Before diving in, here is a brief introduction to Netty.

1.2 Getting started with Netty basics


Netty is a high-performance, asynchronous, event-driven NIO framework that supports TCP, UDP, and file transfer. As an asynchronous NIO framework, all of Netty's IO operations are asynchronous and non-blocking; through the Future-Listener mechanism, users can obtain IO results either actively or via notification.
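For a feel of the Future-Listener mechanism, here is a minimal, hypothetical client sketch (the address, port, and handler are placeholders, not from the original article) showing both the notification style and the active-wait style:

```java
import io.netty.bootstrap.Bootstrap;
import io.netty.channel.ChannelFuture;
import io.netty.channel.ChannelFutureListener;
import io.netty.channel.ChannelInboundHandlerAdapter;
import io.netty.channel.EventLoopGroup;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.nio.NioSocketChannel;

public class FutureListenerExample {
    public static void main(String[] args) {
        EventLoopGroup group = new NioEventLoopGroup();
        Bootstrap bootstrap = new Bootstrap()
                .group(group)
                .channel(NioSocketChannel.class)
                .handler(new ChannelInboundHandlerAdapter()); // placeholder handler

        // connect() returns immediately with a ChannelFuture; nothing blocks here
        ChannelFuture future = bootstrap.connect("127.0.0.1", 8080);

        // Passive notification: the listener fires when the connect completes
        future.addListener((ChannelFutureListener) f -> {
            if (f.isSuccess()) {
                System.out.println("connected: " + f.channel());
            } else {
                System.err.println("connect failed: " + f.cause());
                group.shutdownGracefully();
            }
        });

        // Active retrieval would be: future.sync(); // blocks the caller, not the IO thread
    }
}
```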

As the most popular NIO framework, Netty is widely used on the Internet, in big data and distributed computing, in gaming, and in telecommunications. Many well-known open source components in these fields are built on top of Netty.

2. RPC call performance model analysis

 

2.1 The three sins behind traditional RPC's poor performance

 

1) The network transport problem


Traditional RPC frameworks, and remote service (procedure) calls based on RMI and the like, use synchronous blocking IO. When client concurrency or network latency rises, synchronous blocking IO frequently parks IO threads in long waits; since blocked threads cannot do useful work, IO throughput naturally declines.

Below, the BIO communication model diagram illustrates the drawbacks of BIO communication:
[Figure: the BIO communication model]

A server built on the BIO communication model usually has an independent Acceptor thread listening for client connections. For each accepted connection it creates a new thread; that thread processes the request, returns a response to the client, and is then destroyed. This is the typical one-request-one-response model. The architecture's biggest problem is that it does not scale elastically: as concurrent access grows, the number of server threads grows linearly with it. Threads are a precious resource of the Java virtual machine, so once the thread count balloons, system performance drops sharply, and as concurrency keeps rising, handle exhaustion and thread stack overflows can follow, ultimately taking the server down.
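A minimal sketch of this thread-per-connection BIO server (an illustrative echo server; the port and protocol are arbitrary) makes the scaling problem visible:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.ServerSocket;
import java.net.Socket;

public class BioServer {
    public static void main(String[] args) throws IOException {
        try (ServerSocket server = new ServerSocket(8080)) {
            while (true) {
                Socket socket = server.accept(); // Acceptor thread blocks here
                // One new thread per connection: thread count grows linearly with load
                new Thread(() -> {
                    try (Socket s = socket;
                         BufferedReader in = new BufferedReader(
                                 new InputStreamReader(s.getInputStream()));
                         PrintWriter out = new PrintWriter(s.getOutputStream(), true)) {
                        String line;
                        while ((line = in.readLine()) != null) { // blocks until data arrives
                            out.println("echo: " + line);
                        }
                    } catch (IOException ignored) {
                    }
                    // the thread dies when the connection closes
                }).start();
            }
        }
    }
}
```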
 

2) The serialization problem


Java serialization has the following typical problems (a small size-comparison sketch follows the list):
 

  • The Java serialization mechanism is an encoding/decoding technology internal to Java and cannot be used across languages; when integrating heterogeneous systems, the byte stream produced by Java serialization would have to be deserialized into an equivalent object by another language, which is currently hard to support;
  • Compared with other open source serialization frameworks, the byte stream produced by Java serialization is too large, costing extra resources whether it travels over the network or is persisted to disk;
  • Serialization performance is poor (high CPU usage).
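To make the stream-size point concrete, here is a small self-contained sketch (the POJO and its fields are invented for illustration) comparing Java native serialization with a hand-rolled binary encoding of the same data:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class SerializationSizeDemo {
    // A hypothetical POJO, invented purely for this size comparison
    static class User implements Serializable {
        private static final long serialVersionUID = 1L;
        int id = 42;
        String name = "netty";
    }

    public static void main(String[] args) throws IOException {
        // Java native serialization: the stream carries class metadata as well
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(new User());
        }
        System.out.println("Java serialization: " + bos.size() + " bytes");

        // Hand-rolled binary encoding of the same fields: 4-byte id + length-prefixed name
        User u = new User();
        byte[] name = u.name.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(4 + 4 + name.length);
        buf.putInt(u.id).putInt(name.length).put(name);
        System.out.println("manual binary encoding: " + buf.position() + " bytes");
    }
}
```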

 

3) The thread model problem


Because synchronous blocking IO is used, each TCP connection occupies one thread. Threads are a precious JVM resource, so when blocked reads and writes keep threads from being released in time, system performance drops sharply, and in severe cases the virtual machine cannot even create new threads.

2.2 Three themes of high performance


1) Transport:
What kind of channel is used to send data to the peer: BIO, NIO, or AIO. The IO model largely determines the framework's performance.

2) Protocol:
Which communication protocol is used: HTTP or an internal private protocol. Different protocol choices give different performance profiles; an internal private protocol can usually be designed to perform better than a public one.

3) Thread:
How is the datagram read? In which thread are messages encoded and decoded after reading? How are decoded messages dispatched? Different Reactor thread models have a large impact on performance.

Three elements of RPC call performance:
[Figure: the three elements of RPC call performance]

3. Netty's high performance in detail

3.1 Asynchronous non-blocking communication


In IO programming, when multiple client requests must be handled at once, you can use multiple threads or IO multiplexing. IO multiplexing folds the blocking of many IO operations into the blocking of a single select call, letting the system serve multiple clients in one thread. Compared with the traditional multi-thread/multi-process model, its biggest advantage is low system overhead: no extra processes or threads need to be created or maintained, which cuts maintenance work and saves system resources.

JDK 1.4 introduced support for non-blocking IO (NIO), and JDK 1.5_update10 replaced the traditional select/poll with epoll, greatly improving NIO communication performance.

The JDK NIO communication model is as follows:
[Figure: the JDK NIO communication model]

Corresponding to the Socket and ServerSocket classes, NIO provides two socket channel implementations: SocketChannel and ServerSocketChannel. Both support blocking and non-blocking modes. Blocking mode is very simple to use, but its performance and reliability are poor; non-blocking mode is exactly the opposite. Developers can choose the mode that fits their needs: in general, low-load, low-concurrency applications can use synchronous blocking IO to reduce programming complexity, while high-load, high-concurrency network applications need NIO's non-blocking mode.
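To ground this, here is a minimal non-blocking echo server skeleton built on the JDK NIO classes just described (port and buffer size are arbitrary choices):

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.util.Iterator;

public class NioServer {
    public static void main(String[] args) throws IOException {
        Selector selector = Selector.open();
        ServerSocketChannel server = ServerSocketChannel.open();
        server.bind(new InetSocketAddress(8080));
        server.configureBlocking(false);                  // non-blocking mode
        server.register(selector, SelectionKey.OP_ACCEPT);

        while (true) {
            selector.select();                            // one thread blocks for all channels
            Iterator<SelectionKey> it = selector.selectedKeys().iterator();
            while (it.hasNext()) {
                SelectionKey key = it.next();
                it.remove();
                if (key.isAcceptable()) {
                    SocketChannel client = server.accept();
                    client.configureBlocking(false);
                    client.register(selector, SelectionKey.OP_READ);
                } else if (key.isReadable()) {
                    SocketChannel client = (SocketChannel) key.channel();
                    ByteBuffer buf = ByteBuffer.allocate(1024);
                    if (client.read(buf) < 0) {           // peer closed the connection
                        client.close();
                    } else {
                        buf.flip();
                        client.write(buf);                // echo back (may be partial)
                    }
                }
            }
        }
    }
}
```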

Netty's architecture is designed and implemented according to the Reactor pattern. Its server-side communication sequence diagram is as follows:
[Figure: Netty server-side communication sequence]

The client-side communication sequence diagram is as follows:
[Figure: Netty client-side communication sequence]

Netty's IO thread, NioEventLoop, aggregates a multiplexer (Selector) and can therefore serve hundreds of client Channels concurrently. Because reads and writes are non-blocking, the IO thread stays fully utilized and avoids the hangs caused by frequent IO blocking. And because Netty communicates asynchronously, one IO thread can handle N client connections and their reads and writes concurrently, fundamentally fixing the traditional one-connection-one-thread model of synchronous blocking IO and greatly improving performance, elastic scalability, and reliability.

3.2 Zero copy


Many users have heard that Netty supports "zero copy" but are unclear where exactly it shows up. This section explains Netty's "zero copy" in detail.

Netty's "zero copy" is mainly reflected in the following three aspects:

  • Netty receives and sends data through DIRECT BUFFERS, using off-heap direct memory for Socket reads and writes, so no second copy of the byte buffer is needed. With traditional heap buffers (HEAP BUFFERS), the JVM copies the heap buffer into direct memory before writing it to the Socket, one extra buffer copy on the send path compared with off-heap direct memory.
  • Netty provides a composite Buffer object that aggregates multiple ByteBuf objects; users operate on the composite as conveniently as on a single buffer, avoiding the traditional approach of merging several small buffers into one large one via memory copies.
  • Netty's file transfer uses the transferTo method, which sends data from the file channel directly to the target Channel, avoiding the memory copies of the traditional write-in-a-loop approach.


Let's walk through these three kinds of "zero copy" in turn. First, the creation of Netty's receive buffer ("zero copy" for asynchronous message reads):
[Figure: creation of the receive buffer in the message read loop]

On each iteration of the message read loop, the ByteBuf is obtained through the ioBuffer method of ByteBufAllocator. Let's look at that interface definition next.

ByteBufAllocator allocates off-heap memory through ioBuffer:
[Figure: ByteBufAllocator allocating off-heap memory via ioBuffer]

For Socket reads and writes, Netty's ByteBuf allocator creates off-heap memory directly, avoiding the copy from heap memory to direct memory and thus the buffer's second copy. This "zero copy" improves read and write performance.

Let's continue with the second kind of "zero copy": CompositeByteBuf, which presents multiple ByteBufs to the outside as one ByteBuf behind a single unified interface. Its class hierarchy is as follows:
[Figure: CompositeByteBuf class inheritance relationship]

From the inheritance hierarchy we can see that CompositeByteBuf is really a ByteBuf wrapper: it gathers multiple ByteBufs into a set and exposes a single ByteBuf interface over them. The relevant definition is as follows:
[Figure: CompositeByteBuf class definition]

Adding a ByteBuf requires no memory copy; the relevant code is as follows:
[Figure: adding a ByteBuf without copying]
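As an illustration of this second kind of "zero copy" (a sketch assuming Netty 4.1; the payload contents are made up), a header and a body can be aggregated without copying either buffer:

```java
import io.netty.buffer.ByteBuf;
import io.netty.buffer.CompositeByteBuf;
import io.netty.buffer.Unpooled;
import java.nio.charset.StandardCharsets;

public class CompositeByteBufDemo {
    public static void main(String[] args) {
        ByteBuf header = Unpooled.copiedBuffer("LEN=4|", StandardCharsets.UTF_8);
        ByteBuf body = Unpooled.copiedBuffer("ping", StandardCharsets.UTF_8);

        // Aggregate both buffers: only references are kept, no bytes are copied.
        // The 'true' flag advances the writer index so the composite is readable.
        CompositeByteBuf message = Unpooled.compositeBuffer();
        message.addComponents(true, header, body);

        // The composite behaves like one ordinary ByteBuf
        System.out.println(message.toString(StandardCharsets.UTF_8)); // LEN=4|ping
        message.release(); // also releases the underlying components
    }
}
```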

Finally, let's look at the "zero copy" of file transfer:
[Figure: file transfer "zero copy" in Netty]

Netty's file transfer class DefaultFileRegion sends the file to the target Channel through the transferTo method. The key is FileChannel's transferTo method, whose API documentation says the following:
[Figure: FileChannel.transferTo API documentation]

On many operating systems it transfers the contents of the file cache directly to the target Channel without any explicit copy, a much more efficient transmission path; this is the "zero copy" of file transfer.
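A minimal sketch of the same pattern using the JDK API directly (the file path and target address are illustrative):

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class TransferToDemo {
    public static void main(String[] args) throws IOException {
        try (SocketChannel target = SocketChannel.open(new InetSocketAddress("127.0.0.1", 8080));
             FileChannel file = FileChannel.open(Paths.get("data.bin"), StandardOpenOption.READ)) {
            long position = 0;
            long remaining = file.size();
            while (remaining > 0) {
                // transferTo may move fewer bytes than requested, so loop until done.
                // On supporting kernels this maps to sendfile(): no copy through user space.
                long sent = file.transferTo(position, remaining, target);
                position += sent;
                remaining -= sent;
            }
        }
    }
}
```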

3.3 Memory Pool


As JVMs and JIT compilation have matured, allocating and collecting ordinary objects has become a very cheap operation. Buffers are different, however, especially off-heap direct memory, whose allocation and reclamation are time-consuming. To reuse buffers as much as possible, Netty provides a pooled buffer reuse mechanism. Let's look at Netty's ByteBuf implementations:
[Figure: Netty ByteBuf implementation hierarchy]

Netty offers several memory-management strategies, and differentiated tuning is possible by configuring the relevant parameters in the bootstrap class. Below, a performance test compares pool-recycled ByteBufs with ordinary ByteBufs.

Use case one: create a direct memory buffer with the pooled allocator:
[Figure: pooled direct buffer allocation test]

Use case two: create a direct memory buffer with the unpooled allocator:
[Figure: unpooled direct buffer allocation test]

Each case was executed 3 million times; the performance comparison is as follows:
[Figure: pooled vs. unpooled allocation results]

The test shows the pooled ByteBuf to be roughly 23 times faster than transient allocate-and-discard ByteBufs (the exact figure is strongly tied to the usage scenario).
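A hedged reconstruction of such a micro-benchmark is sketched below (the loop count and buffer size are assumptions; absolute numbers vary by machine and Netty version):

```java
import io.netty.buffer.ByteBuf;
import io.netty.buffer.PooledByteBufAllocator;
import io.netty.buffer.UnpooledByteBufAllocator;

public class PooledVsUnpooledBenchmark {
    static final int LOOPS = 3_000_000;

    public static void main(String[] args) {
        long t0 = System.nanoTime();
        for (int i = 0; i < LOOPS; i++) {
            ByteBuf buf = PooledByteBufAllocator.DEFAULT.directBuffer(1024);
            buf.release(); // returns the buffer to the pool for reuse
        }
        long pooled = System.nanoTime() - t0;

        long t1 = System.nanoTime();
        for (int i = 0; i < LOOPS; i++) {
            ByteBuf buf = UnpooledByteBufAllocator.DEFAULT.directBuffer(1024);
            buf.release(); // frees the direct memory outright
        }
        long unpooled = System.nanoTime() - t1;

        System.out.printf("pooled: %d ms, unpooled: %d ms%n",
                pooled / 1_000_000, unpooled / 1_000_000);
    }
}
```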

Let's briefly analyze how the Netty memory pool allocates memory:
[Figure: memory pool allocation entry point]

Continuing to the newDirectBuffer method, we find it is abstract, implemented by subclasses of AbstractByteBufAllocator:
[Figure: AbstractByteBufAllocator.newDirectBuffer]

The code jumps to PooledByteBufAllocator's newDirectBuffer method, which gets the memory region PoolArena from the thread cache and calls its allocate method:
[Figure: PooledByteBufAllocator.newDirectBuffer]

PoolArena's allocate method is as follows:
[Figure: PoolArena.allocate]

We focus on the implementation of newByteBuf, also an abstract method; the different buffer types are allocated by the subclasses DirectArena and HeapArena. Since the test case uses off-heap memory:
[Figure: PoolArena.newByteBuf]

we concentrate on DirectArena's implementation. If sun.misc.Unsafe is not enabled, then:
[Figure: DirectArena.newByteBuf]

the newInstance method of PooledDirectByteBuf executes, with the following code:
[Figure: PooledDirectByteBuf.newInstance]

Here the ByteBuf object is reused via the RECYCLER's get method; a non-pooled implementation would simply create a new ByteBuf. After the ByteBuf is taken from the pool, AbstractReferenceCountedByteBuf's setRefCnt method initializes the reference counter used for reference counting and memory reclamation (similar in spirit to JVM garbage collection).

3.4 Efficient Reactor threading model


There are three commonly used Reactor threading models, as follows:

  • Reactor single-threaded model;
  • Reactor multi-threaded model;
  • Master-slave Reactor multi-threaded model.


The Reactor single-threaded model means that all IO operations are completed on the same NIO thread. The responsibilities of the NIO thread are as follows:

  • As the NIO server, it receives the TCP connection of the client;
  • As a NIO client, initiate a TCP connection to the server;
  • Read the request or response message of the communication peer;
  • Send a message request or response message to the communication peer.


The schematic diagram of the Reactor single-threaded model is as follows:
[Figure: the Reactor single-threaded model]

Since the Reactor pattern uses asynchronous non-blocking IO, no IO operation ever blocks, so in theory one thread can handle all IO work by itself. Architecturally, a single NIO thread can indeed fulfill these duties: the Acceptor receives clients' TCP connection requests; once a link is established, the corresponding ByteBuffer is dispatched via Dispatch to the designated Handler for decoding; and user Handlers send messages back to clients through the same NIO thread.

For some small-volume application scenarios, a single-threaded model can be used. However, it is not suitable for applications with high load and large concurrency. The main reasons are as follows:

  • One NIO thread handling hundreds or thousands of links at once cannot keep up: even with the NIO thread's CPU at 100%, it cannot cover the encoding, decoding, reading, and sending of massive message volumes;
  • When the NIO thread is overloaded, processing slows, which makes large numbers of client connections time out and then retransmit; this loads the NIO thread even more, ultimately producing large message backlogs and processing timeouts, and the NIO thread becomes the system's performance bottleneck;
  • Reliability: if the NIO thread dies unexpectedly or enters an infinite loop, the whole system's communication module becomes unavailable, unable to receive or process external messages, causing node failure.


To solve these problems, the Reactor multi-threaded model evolved. Let's look at it next.

The biggest difference between the Reactor multi-threaded model and the single-threaded model is that a pool of NIO threads now handles the IO operations. Its schematic diagram is as follows:
[Figure: the Reactor multi-threaded model]

Features of the Reactor multi-threaded model:

  • A dedicated NIO thread, the Acceptor thread, listens on the server and accepts clients' TCP connection requests;
  • Network IO operations (read, write, and so on) are handled by an NIO thread pool, which can be a standard JDK thread pool containing a task queue and N available threads; these NIO threads take care of reading, decoding, encoding, and sending messages;
  • One NIO thread can serve N links at the same time, but each link maps to exactly one NIO thread, preventing concurrent-access problems.


In most scenarios the Reactor multi-threaded model meets performance needs. In a few special scenarios, however, a single NIO thread that both listens for and processes all client connections can run short of capacity; for example, with millions of concurrent client connections, or when the server must security-authenticate each client's handshake message and the authentication itself is expensive. In such cases a lone Acceptor thread may not suffice, and to fix this a third Reactor thread model was created: the master-slave Reactor multi-threaded model.

In the master-slave Reactor thread model, the server accepts client connections not with a single NIO thread but with an independent NIO thread pool. After the Acceptor accepts a client's TCP connection and finishes its processing (possibly including access authentication), it registers the newly created SocketChannel with an IO thread in the IO thread pool (the sub-reactor pool), which then owns that SocketChannel's reads, writes, and codec work. The Acceptor pool handles only client login, handshake, and security authentication; once the link is established it is registered with an IO thread of the back-end subReactor pool, which performs all subsequent IO.

Its threading model is shown in the following figure:
[Figure: the master-slave Reactor multi-threaded model]

The master-slave NIO thread model solves the problem of a single server listening thread lacking the capacity to handle all client connections effectively, which is why Netty's official demos recommend this thread model.

In fact, Netty's thread model is not fixed: by creating different EventLoopGroup instances in the bootstrap class and setting the appropriate parameters, all three Reactor thread models above can be supported, as shown below. It is precisely this flexible, customizable support for the Reactor thread model that lets Netty meet the performance needs of different business scenarios.
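As a sketch (thread counts are illustrative, and each bootstrap would still need a channel initializer before binding), the same ServerBootstrap API can express all three models:

```java
import io.netty.bootstrap.ServerBootstrap;
import io.netty.channel.EventLoopGroup;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.nio.NioServerSocketChannel;

public class ReactorModels {
    public static void main(String[] args) {
        // 1) Single-threaded Reactor: one thread both accepts connections and does all IO
        EventLoopGroup single = new NioEventLoopGroup(1);
        ServerBootstrap b1 = new ServerBootstrap()
                .group(single, single)
                .channel(NioServerSocketChannel.class);

        // 2) Multi-threaded Reactor: one Acceptor thread, a pool of IO threads
        ServerBootstrap b2 = new ServerBootstrap()
                .group(new NioEventLoopGroup(1), new NioEventLoopGroup())
                .channel(NioServerSocketChannel.class);

        // 3) Master-slave Reactor: a pool of Acceptor threads, a pool of IO threads
        ServerBootstrap b3 = new ServerBootstrap()
                .group(new NioEventLoopGroup(4), new NioEventLoopGroup())
                .channel(NioServerSocketChannel.class);
        // ... each bootstrap still needs a childHandler before bind() is called
    }
}
```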

3.5 Lock-free serial design


In most scenarios, parallel multi-threaded processing improves a system's concurrent performance. But if concurrent access to shared resources is handled poorly, severe lock contention results and performance ultimately degrades. To avoid the cost of lock contention as much as possible, a serialized design can be used: each message is processed within one thread from start to finish with no thread switching along the way, eliminating multi-thread contention and synchronization locks.

To squeeze out as much performance as possible, Netty adopts a serial lock-free design: operations inside the IO thread run serially, avoiding the degradation caused by multi-thread contention. On the surface this looks like low CPU utilization and weak concurrency, but by tuning the NIO thread pool's parameters you can start multiple serialized threads running in parallel at the same time. This locally lock-free serial design performs better than a one-queue/many-worker-threads model.

The working principle of Netty's serial design is shown below:
[Figure: Netty's serialized (lock-free) processing pipeline]

After Netty's NioEventLoop reads a message, it directly invokes ChannelPipeline's fireChannelRead(Object msg). As long as the user does not actively switch threads, the NioEventLoop calls straight through to the user's Handler with no thread switch at all. This serialized processing avoids the lock contention of multi-threaded operation and is optimal from a performance standpoint.
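Netty itself applies an equivalent check internally when a foreign thread calls writeAndFlush; the following sketch just makes the idiom explicit for user code (the class and method names are made up):

```java
import io.netty.channel.Channel;

public final class SerializedWriter {
    private SerializedWriter() {}

    /** Writes on the channel's own EventLoop so all channel state is touched by one thread. */
    public static void write(Channel channel, Object msg) {
        if (channel.eventLoop().inEventLoop()) {
            channel.writeAndFlush(msg);           // already on the IO thread: write directly
        } else {
            channel.eventLoop().execute(          // otherwise enqueue the work; no locks needed
                    () -> channel.writeAndFlush(msg));
        }
    }
}
```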

3.6 Efficient concurrent programming


Netty's efficient concurrent programming mainly shows up in the following points (a small CAS sketch follows the list):

  • Extensive and correct use of volatile;
  • Wide use of CAS and the atomic classes;
  • Use of thread-safe containers;
  • Improving concurrency through read-write locks.
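As a generic illustration of the CAS point (this is not code from Netty itself), a lock-free counter replaces synchronized with a compare-and-set retry loop:

```java
import java.util.concurrent.atomic.AtomicLong;

public class CasCounter {
    private final AtomicLong value = new AtomicLong();

    /** Increments with a CAS loop instead of taking a lock. */
    public long increment() {
        long current;
        do {
            current = value.get();
        } while (!value.compareAndSet(current, current + 1)); // retry on contention
        return current + 1;
    }
}
```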


For the details of Netty's efficient concurrent programming, see "Analysis of the Application of Multithreaded Concurrent Programming in Netty", which I shared on Weibo; it introduces and analyzes Netty's multithreading techniques and their application in detail.

3.7 High-performance serialization framework


The key factors affecting serialization performance are summarized as follows:

  • Size of the serialized byte stream (network bandwidth usage);
  • Serialization and deserialization performance (CPU usage);
  • Cross-language support (integration of heterogeneous systems and switching of development languages).


Netty supports Google Protobuf out of the box; by extending Netty's codec interfaces, users can plug in other high-performance serialization frameworks, such as Thrift's compact binary codec.
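A minimal sketch of wiring Netty's built-in Protobuf codecs into a pipeline; MyRequest stands for a hypothetical protoc-generated message class:

```java
import io.netty.channel.ChannelInitializer;
import io.netty.channel.socket.SocketChannel;
import io.netty.handler.codec.protobuf.ProtobufDecoder;
import io.netty.handler.codec.protobuf.ProtobufEncoder;
import io.netty.handler.codec.protobuf.ProtobufVarint32FrameDecoder;
import io.netty.handler.codec.protobuf.ProtobufVarint32LengthFieldPrepender;

public class ProtobufPipelineInitializer extends ChannelInitializer<SocketChannel> {
    @Override
    protected void initChannel(SocketChannel ch) {
        ch.pipeline()
          // inbound: split frames by varint length prefix, then decode protobuf
          .addLast(new ProtobufVarint32FrameDecoder())
          .addLast(new ProtobufDecoder(MyRequest.getDefaultInstance())) // hypothetical generated class
          // outbound: prepend the varint length, then encode protobuf
          .addLast(new ProtobufVarint32LengthFieldPrepender())
          .addLast(new ProtobufEncoder());
          // .addLast(new MyBusinessHandler()); // application logic goes last
    }
}
```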

Let's take a look at the comparison of the byte arrays serialized by different serialization & deserialization frameworks:
[Figure: serialized byte-stream size comparison across frameworks]

As the figure shows, the byte stream produced by Protobuf serialization is only about a quarter the size of Java serialization's. It is precisely the poor performance of Java's native serialization that spawned the many high-performance open source serialization technologies and frameworks we have today (poor performance is only one reason; cross-language support, IDL definitions, and other factors matter too).

3.8 Flexible TCP parameter configuration capability


Setting TCP parameters properly can noticeably improve performance in certain scenarios, SO_RCVBUF and SO_SNDBUF for example; set them wrongly and the impact on performance can be severe.

Below we summarize several configuration items that have a greater impact on performance:

  • SO_RCVBUF and SO_SNDBUF: 128K or 256K are the usual recommended values;
  • TCP_NODELAY: Nagle's algorithm coalesces small packets in the buffer into larger ones, preventing floods of tiny packets from clogging the network and so improving network efficiency; latency-sensitive applications, however, need this optimization switched off;
  • Soft interrupts: if the Linux kernel supports RPS (version 2.6.35 and above), enabling RPS spreads soft-interrupt handling and raises network throughput. RPS computes a hash from a packet's source address, destination address, and source and destination ports, then uses that hash to choose the CPU that runs the soft interrupt; seen from above, this binds each connection to a CPU and balances soft interrupts across multiple CPUs, improving parallel network processing.


Netty lets you configure TCP parameters flexibly in the bootstrap class to suit different user scenarios. The relevant configuration interface is defined as follows:
[Figure: Netty's TCP parameter configuration interface]
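A short sketch of setting these options through ServerBootstrap (the values follow the recommendations above and should be tuned per workload):

```java
import io.netty.bootstrap.ServerBootstrap;
import io.netty.channel.ChannelOption;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.nio.NioServerSocketChannel;

public class TcpTuningExample {
    public static void main(String[] args) {
        ServerBootstrap b = new ServerBootstrap()
                .group(new NioEventLoopGroup(1), new NioEventLoopGroup())
                .channel(NioServerSocketChannel.class)
                // options for the listening socket
                .option(ChannelOption.SO_BACKLOG, 1024)
                // options for each accepted connection
                .childOption(ChannelOption.SO_RCVBUF, 256 * 1024)
                .childOption(ChannelOption.SO_SNDBUF, 256 * 1024)
                .childOption(ChannelOption.TCP_NODELAY, true); // disable Nagle for low latency
        // a childHandler must still be added before calling b.bind(port)
    }
}
```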

4. Summary


Analyzing Netty's architecture and performance model shows that Netty's high performance is the product of careful design and implementation. Thanks to its high-quality architecture and code, supporting 100,000 TPS cross-node service calls is not a particularly difficult thing for Netty.

Origin blog.csdn.net/sinat_37903468/article/details/108631270