An In-Depth Analysis of Netty and NIO: Achieving High Performance and High Concurrency

1. Background

1.1. Astonishing performance data

Recently a friend told me via private message that, by using Netty4 + Thrift compressed binary codec technology, his team achieved 100,000 (10W) TPS for cross-node remote service calls (with complex 1 KB POJO objects). Compared with a traditional framework based on Java serialization + BIO (synchronous blocking IO), that is a performance improvement of more than 8x.

In fact, this figure does not surprise me. Based on my five years of NIO programming experience, reaching this performance level is entirely achievable by choosing a suitable NIO framework, pairing it with high-performance compressed binary codec technology, and carefully designing the Reactor threading model.

Here we will look together at how Netty supports 100,000 TPS cross-node remote service calls. Before we formally begin, a brief introduction to Netty.

1.2. Netty Basics

Netty is a high-performance, asynchronous, event-driven NIO framework that supports TCP, UDP, and file transfer. As an asynchronous NIO framework, all of Netty's IO operations are asynchronous and non-blocking; through the Future-Listener mechanism, users can conveniently obtain the results of IO operations, either actively or through notification.

As the most popular NIO framework, Netty is widely used in the Internet field, big data and distributed computing, the game industry, and the communications industry; some of the industry's leading open-source components are also built on Netty's NIO framework.

2. Netty's Road to High Performance

2.1. Performance Analysis Model of RPC Calls

2.1.1. The three sins of poor performance in traditional RPC calls

Network transmission problem: traditional RPC frameworks and remote service (procedure) calls such as RMI use synchronous blocking IO. When client concurrency pressure or network latency increases, synchronous blocking IO causes IO threads to block frequently while waiting; because the threads cannot work efficiently, IO processing capacity drops.

Below, let's look at the drawbacks of BIO communication through the BIO communication model diagram:

Figure 2-1. BIO communication model

In the BIO communication model, the server usually has an independent Acceptor thread responsible for listening for client connections. After receiving a client connection, it creates a new thread to handle the request messages on that connection; when processing completes, it returns a response to the client and the thread is destroyed. This is the typical request-response model. The biggest problem with this architecture is that it has no elastic scalability: as concurrent traffic increases, the number of server threads grows linearly with the number of concurrent connections. Because threads are very precious system resources for the Java virtual machine, when the thread count expands, system performance drops sharply, and as concurrency continues to grow, problems such as handle overflow and thread stack overflow occur, eventually bringing the server down.
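The thread-per-connection pattern described above can be sketched in plain JDK code. This is a minimal illustration with invented class and method names, not production code; the key point is the `new Thread` per accepted connection:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.ServerSocket;
import java.net.Socket;

// Thread-per-connection BIO server: one Acceptor loop, one new thread per client.
class BioEchoServer {
    private final ServerSocket serverSocket;

    BioEchoServer() throws Exception {
        this.serverSocket = new ServerSocket(0); // ephemeral port for the demo
    }

    int port() { return serverSocket.getLocalPort(); }

    // The Acceptor loop: every accepted connection costs a dedicated thread,
    // so the thread count grows linearly with the number of concurrent clients.
    void start() {
        Thread acceptor = new Thread(() -> {
            try {
                while (true) {
                    Socket client = serverSocket.accept();
                    new Thread(() -> handle(client)).start(); // one thread per connection
                }
            } catch (Exception ignored) { /* server socket closed */ }
        });
        acceptor.setDaemon(true);
        acceptor.start();
    }

    private void handle(Socket client) {
        try (BufferedReader in = new BufferedReader(new InputStreamReader(client.getInputStream()));
             PrintWriter out = new PrintWriter(client.getOutputStream(), true)) {
            String line;
            while ((line = in.readLine()) != null) {
                out.println("echo: " + line); // request-response; the thread dies with the connection
            }
        } catch (Exception ignored) {
        }
    }

    void stop() throws Exception { serverSocket.close(); }
}
```

Every blocking `readLine` pins its thread; with ten thousand idle connections this design holds ten thousand parked threads, which is exactly the scalability wall described above.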

Serialization problem: Java serialization has the following typical problems:

1) Java serialization is a Java-internal codec mechanism and cannot cross languages. For example, for interfaces between heterogeneous systems, the other language would need to deserialize the Java serialization byte stream back into the original object (a copy), which is very difficult to support;

2) Compared with other open-source serialization frameworks, the byte stream produced by Java serialization is too large; whether transmitted over the network or persisted to disk, it causes additional resource consumption;

3) Serialization performance is poor (high CPU resource consumption).

Threading model problem: because of synchronous blocking IO, each TCP connection occupies one thread. Since threads are precious JVM resources, when IO reads or writes block and threads cannot be released in time, system performance drops sharply, and in severe cases the virtual machine can no longer create new threads.

2.1.2. The three themes of high performance

1) Transport: what kind of channel is used to send data to the peer: BIO, NIO, or AIO. The IO model largely determines the framework's performance.

2) Protocol: what communication protocol is used: HTTP or an internal private protocol. Different protocols have different performance models. Compared with public protocols, an internal private protocol can usually be designed for better performance.

3) Threads: how is the datagram read? In which thread do decoding and message dispatch take place after reading? Different Reactor threading models have a very large impact on performance.

Figure 2-2. The three elements of RPC call performance

2.2. Netty's Road to High Performance

2.2.1. Asynchronous non-blocking communication

In IO programming, when multiple client requests must be handled, either multi-threading or IO multiplexing can be used. IO multiplexing multiplexes multiple blocking IO streams onto a single blocking select, so that the system can handle multiple client requests simultaneously in a single thread. Compared with the traditional multi-thread/multi-process model, the biggest advantage of I/O multiplexing is low system overhead: the system does not need to create additional processes or threads, nor maintain their execution, which reduces maintenance workload and saves system resources.
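The "many channels, one thread" idea can be demonstrated with the JDK's own `Selector`. Below is a minimal sketch with illustrative names and error handling reduced to the essentials: one thread registers a listening channel, then accepts any number of queued connections from a single select loop:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

// Single-threaded IO multiplexing: one Selector watches many channels at once.
class NioMultiplexServer {
    static int acceptClients(int clientCount) throws IOException {
        Selector selector = Selector.open();
        ServerSocketChannel server = ServerSocketChannel.open();
        server.bind(new InetSocketAddress("127.0.0.1", 0));
        server.configureBlocking(false);                  // non-blocking mode is mandatory with a Selector
        server.register(selector, SelectionKey.OP_ACCEPT);
        int port = ((InetSocketAddress) server.getLocalAddress()).getPort();

        // Open the clients first; their pending connections queue up in the listen backlog.
        SocketChannel[] clients = new SocketChannel[clientCount];
        for (int i = 0; i < clientCount; i++) {
            clients[i] = SocketChannel.open(new InetSocketAddress("127.0.0.1", port));
        }

        // One thread, one select loop: every ready channel is handled in turn.
        int accepted = 0;
        while (accepted < clientCount) {
            selector.select();
            for (SelectionKey key : selector.selectedKeys()) {
                if (key.isAcceptable()) {
                    SocketChannel ch = ((ServerSocketChannel) key.channel()).accept();
                    if (ch != null) { ch.close(); accepted++; }
                }
            }
            selector.selectedKeys().clear();              // consumed keys must be cleared manually
        }
        for (SocketChannel c : clients) c.close();
        server.close();
        selector.close();
        return accepted;
    }
}
```

A real server would also register the accepted channels for `OP_READ` on the same selector; this sketch stops at the accept step to keep the mechanism visible.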

JDK 1.4 introduced support for non-blocking IO (NIO), and JDK 1.5_update10 replaced the traditional select/poll with epoll, greatly enhancing NIO communication performance.

The JDK NIO communication model is as follows:

Figure 2-3. NIO multiplexing model

Corresponding to Socket and ServerSocket, NIO provides two socket channel implementations: SocketChannel and ServerSocketChannel. Both new channels support blocking and non-blocking modes. Blocking mode is very simple to use, but its performance and reliability are poor; non-blocking mode is just the opposite. Developers can choose the appropriate mode according to their needs: in general, low-load, low-concurrency applications can choose synchronous blocking IO to reduce programming complexity, while high-load, high-concurrency network applications need to be developed with NIO's non-blocking mode.

Netty's architecture is designed and implemented following the Reactor pattern; its server-side communication sequence is as follows:

Figure 2-3. NIO server communication sequence

The client-side communication sequence is as follows:

Figure 2-4. NIO client communication sequence

Because Netty's IO thread, NioEventLoop, aggregates the Selector multiplexer, it can handle hundreds of client Channels concurrently. Since reads and writes are non-blocking, this fully raises the IO thread's operating efficiency and avoids thread suspension caused by frequent IO blocking. Moreover, because Netty uses an asynchronous communication mode, a single IO thread can concurrently handle the reads and writes of N client connections, which fundamentally solves the traditional synchronous-blocking-IO one-connection-one-thread model; the architecture's performance, elastic scalability, and reliability are all greatly improved.

2.2.2. Zero-copy

Many readers have heard that Netty has a "zero-copy" feature but cannot say exactly where it is embodied; this section explains Netty's "zero-copy" in detail.

Netty's "zero-copy" is mainly reflected in the following three aspects:

1) Netty receives and sends ByteBuffers using DIRECT BUFFERS, using off-heap direct memory for Socket reads and writes, so no secondary copy of the byte buffer is needed. If conventional heap memory (HEAP BUFFERS) were used for Socket reads and writes, the JVM would first copy the heap buffer into direct memory before writing it to the Socket. Compared with off-heap direct memory, the message goes through one extra buffer copy during transmission.

2) Netty provides a composite Buffer object that can aggregate multiple ByteBuf objects; the user can operate on the composite Buffer as conveniently as on a single Buffer, avoiding the traditional approach of merging several small Buffers into one large Buffer via memory copies.

3) Netty's file transfer uses the transferTo method, which can send the data in the file buffer directly to the target Channel, avoiding the memory-copy problem caused by the traditional write-loop approach.

Having introduced these three aspects of "zero-copy", let's look at how Netty creates the Buffer on message reception:

Figure 2-5 Asynchronous message reading "zero-copy"

On each message read cycle, a ByteBuf object is obtained through the ByteBufAllocator's ioBuffer method. Let's continue by looking at its interface definition:

Figure 2-6. ByteBufAllocator allocating off-heap memory via ioBuffer

When performing Socket IO reads and writes, in order to avoid the copy from heap memory to direct memory, Netty's ByteBuf allocator directly creates off-heap memory buffers, avoiding the secondary copy and improving read/write performance through "zero-copy".
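The heap-versus-direct distinction also exists in plain JDK NIO. A tiny sketch (the 1 KB size is arbitrary): a heap buffer is backed by a `byte[]` that the JVM must copy into native memory before a Socket write, while a direct buffer already lives off-heap:

```java
import java.nio.ByteBuffer;

// Heap buffers force the JVM to copy into a native buffer before a Socket write;
// direct buffers skip that extra copy.
class DirectBufferDemo {
    static ByteBuffer heap()   { return ByteBuffer.allocate(1024); }       // backed by a byte[] on the heap
    static ByteBuffer direct() { return ByteBuffer.allocateDirect(1024); } // off-heap native memory
}
```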

Let's continue with the second "zero-copy" implementation, CompositeByteBuf. It wraps multiple ByteBufs into one ByteBuf and provides a unified ByteBuf interface to the outside after wrapping. Its class hierarchy is as follows:

Figure 2-7. CompositeByteBuf class inheritance

From the inheritance relationship we can see that CompositeByteBuf is actually a ByteBuf wrapper: it combines multiple ByteBufs into one collection and then provides a unified ByteBuf interface externally, defined as follows:

Figure 2-8. CompositeByteBuf class definition

Adding a ByteBuf requires no memory copy; the relevant code is as follows:

Figure 2-9. "Zero-copy" when adding a ByteBuf
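Assuming the Netty 4 dependency is on the classpath, the copy-free aggregation can be sketched as follows (a simplified illustration of the API usage, not Netty's internal code). A typical use is stitching a protocol header and body together without copying either:

```java
import io.netty.buffer.ByteBuf;
import io.netty.buffer.CompositeByteBuf;
import io.netty.buffer.Unpooled;

// Aggregating a header and a body without copying either one:
// the composite only records references to its component buffers.
class CompositeDemo {
    static ByteBuf merge(ByteBuf header, ByteBuf body) {
        CompositeByteBuf composite = Unpooled.compositeBuffer();
        // addComponents(true, ...) also advances the writer index; no bytes are copied.
        composite.addComponents(true, header, body);
        return composite;
    }
}
```

The traditional alternative would allocate a buffer of `header.readableBytes() + body.readableBytes()` and copy both into it; the composite avoids exactly that copy.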

Finally, let's look at "zero-copy" in file transfer:

Figure 2-10 File Transfer "zero-copy"

Netty's file transfer class DefaultFileRegion sends the file to the destination Channel via the transferTo method. The key is FileChannel's transferTo method; its API doc is as follows:

Figure 2-11 File Transfer "zero-copy"

On many operating systems, the file buffer's contents are sent directly to the target Channel without passing through a copy; this is a more efficient means of transmission, and it implements "zero-copy" for file transfer.
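The same mechanism is available in plain JDK NIO through `FileChannel.transferTo` (a minimal sketch; the loop is needed because, per the javadoc, transferTo may transfer fewer bytes than requested):

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// transferTo hands the copy to the OS (sendfile on Linux where available),
// so the bytes need not pass through a user-space buffer at all.
class TransferDemo {
    static long copy(Path src, Path dst) throws IOException {
        try (FileChannel in = FileChannel.open(src, StandardOpenOption.READ);
             FileChannel out = FileChannel.open(dst, StandardOpenOption.WRITE,
                                                StandardOpenOption.CREATE,
                                                StandardOpenOption.TRUNCATE_EXISTING)) {
            long position = 0, size = in.size();
            while (position < size) {                 // transferTo may move fewer bytes than asked
                position += in.transferTo(position, size - position, out);
            }
            return position;
        }
    }
}
```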

2.2.3. Memory pool

With the development of the JVM and JIT compilation technology, object allocation and collection have become very lightweight operations. But for buffers the situation is slightly different, especially for off-heap direct memory, whose allocation and reclamation are time-consuming operations. To reuse buffers as much as possible, Netty provides a buffer reuse mechanism based on a memory pool. Let's look at the implementation of Netty's pooled ByteBuf:

Figure 2-12. Memory-pooled ByteBuf

Netty provides multiple memory management strategies; by configuring the relevant parameters in the startup helper classes, differentiated customization can be achieved.
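Netty's PoolArena/PoolChunk machinery is far more sophisticated (size classes, thread-local caches, reference counting), but the core reuse idea can be sketched with a trivial free list. This is a toy illustration with invented names, not thread-safe, and not Netty code:

```java
import java.nio.ByteBuffer;
import java.util.ArrayDeque;

// The idea behind a memory pool, reduced to a few lines: keep released
// direct buffers in a free list instead of handing them back to the allocator.
class SimpleBufferPool {
    private final ArrayDeque<ByteBuffer> free = new ArrayDeque<>();
    private final int bufferSize;

    SimpleBufferPool(int bufferSize) { this.bufferSize = bufferSize; }

    ByteBuffer acquire() {
        ByteBuffer buf = free.poll();
        return buf != null ? buf : ByteBuffer.allocateDirect(bufferSize); // allocate only on a pool miss
    }

    void release(ByteBuffer buf) {
        buf.clear();          // reset indices so the next user starts fresh
        free.push(buf);
    }
}
```

The expensive operation, direct-memory allocation, happens only on a pool miss; steady-state traffic recycles the same buffers.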

Through the following performance tests, let's look at the performance difference between the memory-pool-based recyclable ByteBuf and the ordinary ByteBuf.

Use case 1: create a direct memory buffer using the pooled allocator:

Figure 2-13. Memory-pool-based off-heap buffer test case

Use case 2: create a direct memory buffer using the non-pooled allocator:

Figure 2-14. Non-pooled off-heap buffer test case

Each case was executed 3,000,000 times; the performance comparison results are shown below:

Figure 2-15. Write performance comparison between pooled and non-pooled buffers

The performance tests show that the memory-pooled ByteBuf is about 23 times faster than the transient, created-and-discarded ByteBuf (the performance data is strongly scenario-dependent).

Let's briefly analyze memory allocation in Netty's memory pool:

Figure 2-16. AbstractByteBufAllocator buffer allocation

Continuing with the newDirectBuffer method, we find it is an abstract method, implemented by concrete subclasses of AbstractByteBufAllocator, as the following code shows:

Figure 2-17. Different implementations of newDirectBuffer

The code branches into PooledByteBufAllocator's newDirectBuffer method, which obtains the memory arena PoolArena from the Cache and calls its allocate method to perform the memory allocation:

Figure 2-18. PooledByteBufAllocator memory allocation

PoolArena's allocate method is as follows:

Figure 2-18. PoolArena buffer allocation

Our analysis focuses on the implementation of newByteBuf. It too is an abstract method, implemented by the subclasses DirectArena and HeapArena for the different buffer types. Because the test case uses off-heap memory,

Figure 2-19. PoolArena's abstract newByteBuf method

the analysis focuses on DirectArena's implementation: if sun.misc.Unsafe is not enabled, then

Figure 2-20. DirectArena's newByteBuf method implementation

it executes PooledDirectByteBuf's newInstance method, whose code is as follows:

Figure 2-21. PooledDirectByteBuf newInstance method implementation

The RECYCLER's get method recycles a ByteBuf object (in a non-pooled implementation, a new ByteBuf object would be created directly). After the ByteBuf is obtained from the buffer pool, the setRefCnt method provided by AbstractReferenceCountedByteBuf is called to set the reference counter, which is used for reference counting and memory reclamation of the object (similar to JVM garbage collection).

2.2.4. Efficient Reactor threading model

There are three common Reactor threading models, as follows:

1) Reactor single-threaded model;

2) Reactor multi-threaded model;

3) Reactor master-slave multi-threaded model.

The Reactor single-threaded model means that all IO operations are completed on the same NIO thread. The duties of that NIO thread are as follows:

1) As an NIO server, receive clients' TCP connections;

2) As an NIO client, initiate TCP connections to servers;

3) Read request or response messages from the communication peer;

4) Send request or response messages to the communication peer.

A schematic of the Reactor single-threaded model is as follows:

Figure 2-22 Reactor single-threaded model

Because the Reactor pattern uses asynchronous non-blocking IO, no IO operation causes blocking; in theory one thread can independently handle all IO-related operations. From an architectural perspective, one NIO thread can indeed fulfill all these duties. For example, the Acceptor receives a client's TCP connection request message; after the link is established successfully, the corresponding ByteBuffer is dispatched through Dispatch to the designated Handler for message decoding. The user's Handler can then send messages to the client through the NIO thread.

For some small-capacity application scenarios, the single-threaded model can be used. But it is inappropriate for high-load, high-concurrency applications, mainly for the following reasons:

1) One NIO thread handling hundreds of links cannot keep up: even if the NIO thread's CPU load reaches 100%, it cannot satisfy the encoding, decoding, reading, and sending of massive message volumes;

2) When the NIO thread is overloaded, processing slows down, causing large numbers of client connections to time out; timeouts tend to be followed by retransmissions, which further increases the NIO thread's load, eventually leading to a large backlog of messages and processing timeouts, and the NIO thread becomes the system's performance bottleneck;

3) Reliability: once the NIO thread unexpectedly dies or enters an infinite loop, the entire system's communication module becomes unavailable, unable to receive and process external messages, causing node failure.

To address these issues, the Reactor multi-threaded model evolved. Let's learn about the Reactor multi-threaded model next.

The biggest difference between the Reactor multi-threaded model and the single-threaded model is that a pool of NIO threads handles the IO operations. Its principle is as follows:

Figure 2-23 Reactor multi-threaded model

Features of the Reactor multi-threaded model:

1) There is a dedicated NIO thread, the Acceptor thread, for listening on the server and receiving clients' TCP connection requests;

2) Network IO operations (reads and writes) are handled by an NIO thread pool, which can be implemented with a standard JDK thread pool containing a task queue and N available threads; these NIO threads are responsible for message reading, decoding, encoding, and sending;

3) One NIO thread can handle N links simultaneously, but one link corresponds to only one NIO thread, which prevents concurrency problems.

In most scenarios, the Reactor multi-threaded model meets the performance requirements. However, in some special scenarios, a single NIO thread listening for and handling all client connections may run into performance problems: for example, millions of concurrent client connections, or a server that must perform security authentication on clients' handshake messages, where authentication itself is very costly. In such scenarios, a single Acceptor thread may be insufficient, and to solve this performance problem, a third Reactor threading model arose: the Reactor master-slave multi-threaded model.

The feature of the Reactor master-slave model is that the server no longer uses a single NIO thread to receive client connections, but an independent NIO thread pool. After the Acceptor receives and finishes processing a client's TCP connection request (which may include access authentication, etc.), it registers the newly created SocketChannel to an IO thread in the IO thread pool (the sub-reactor thread pool), which becomes responsible for that SocketChannel's reads, writes, and codec work. The Acceptor thread pool is used only for client login, handshake, and security authentication; once the link is established, it is registered to a thread of the back-end subReactor IO thread pool, and that IO thread is responsible for subsequent IO operations.

Its threading model is shown below:

Figure 2-24 Reactor master-slave multi-threaded model

Using the master-slave NIO threading model solves the problem of one server listening thread being unable to effectively handle all client connections. Therefore, Netty's official demos recommend this threading model.

In fact, Netty's threading model is not fixed: by creating different EventLoopGroup instances in the startup helper classes and choosing appropriate configuration parameters, all three Reactor threading models can be supported. It is precisely because Netty provides flexible customization of the Reactor threading model that it can satisfy the performance demands of different business scenarios.
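Assuming the Netty 4 dependency on the classpath, the three models map onto EventLoopGroup configuration roughly as follows. The thread counts are illustrative, and note that with a single bound port only one boss thread is actually used, so a boss "pool" matters mainly when binding several listening ports:

```java
import io.netty.bootstrap.ServerBootstrap;
import io.netty.channel.ChannelInitializer;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.SocketChannel;
import io.netty.channel.socket.nio.NioServerSocketChannel;

// How the three Reactor models map onto EventLoopGroup configuration.
class ReactorModels {
    static ServerBootstrap singleThreaded() {
        NioEventLoopGroup group = new NioEventLoopGroup(1);      // one thread accepts AND does IO
        return bootstrap(group, group);
    }

    static ServerBootstrap multiThreaded() {
        return bootstrap(new NioEventLoopGroup(1),               // one Acceptor thread
                         new NioEventLoopGroup());               // worker pool for IO (defaults to 2*cores)
    }

    static ServerBootstrap masterSlave() {
        return bootstrap(new NioEventLoopGroup(4),               // boss pool: accept + handshake/auth
                         new NioEventLoopGroup());               // sub-reactor pool: reads, writes, codec
    }

    private static ServerBootstrap bootstrap(NioEventLoopGroup boss, NioEventLoopGroup worker) {
        return new ServerBootstrap()
                .group(boss, worker)
                .channel(NioServerSocketChannel.class)
                .childHandler(new ChannelInitializer<SocketChannel>() {
                    @Override protected void initChannel(SocketChannel ch) {
                        // business handlers would be added to ch.pipeline() here
                    }
                });
    }
}
```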

2.2.5. Lock-free serial design concept

In most scenarios, parallel multi-threaded processing can improve the system's concurrent performance. However, if concurrent access to shared resources is handled improperly, severe lock contention results, which ultimately degrades performance. To avoid the performance loss of lock contention as much as possible, a serial design can be used: the processing of a message is completed within the same thread, with no thread switching along the way, thereby avoiding multi-thread contention and synchronization locks.

To maximize performance, Netty adopts a lock-free serial design, performing serial operations inside the IO thread to avoid the performance degradation caused by multi-thread contention. On the surface, the serial design seems to have low CPU utilization and insufficient concurrency. However, by adjusting the NIO thread pool's parameters, multiple serialized threads can be started to run in parallel simultaneously. This locally lock-free serial thread design performs better than the one-queue-to-multiple-worker-threads model.

Netty's serial design works as follows:

Figure 2-25. Netty serialization working principle

After Netty's NioEventLoop reads a message, it directly calls ChannelPipeline's fireChannelRead(Object msg). As long as the user does not actively switch threads, the NioEventLoop keeps calling through to the user's Handler without any thread switching in between. This serialized processing avoids the lock contention that multi-threaded operation would cause, and from a performance point of view it is optimal.
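The serialization-instead-of-locks idea can be illustrated outside Netty with a plain single-threaded executor (a toy sketch with invented names): because every mutation of the state is funneled onto one thread, the state itself needs no synchronization at all.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Serialization instead of locks: if every operation on a piece of state runs
// on the same single thread, the state needs no locks and no atomics.
class SerialCounter {
    private int count;                                    // deliberately NOT volatile, NOT atomic
    private final ExecutorService loop = Executors.newSingleThreadExecutor();

    void increment() { loop.execute(() -> count++); }     // all mutations funneled to one thread

    int awaitCount() throws InterruptedException {
        loop.shutdown();
        loop.awaitTermination(10, TimeUnit.SECONDS);
        return count;                                     // safe to read after the loop has drained
    }
}
```

Many client threads may call `increment()` concurrently, yet `count++` itself never races, which is exactly the property a per-channel NioEventLoop gives Netty's handlers.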

2.2.6. Efficient concurrent programming

Netty's efficient concurrent programming is mainly reflected in the following points:

1) Extensive and correct use of volatile;

2) Wide use of CAS and the atomic classes;

3) Use of thread-safe containers;

4) Performance improvement through read-write locks.

If you want to know the details of Netty's efficient concurrent programming, you can read my earlier Weibo post "The Application of Multi-threaded Programming in Netty", which presents and analyzes Netty's multithreading techniques and their application in detail.
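As one concrete illustration of point 2, a CAS retry loop is the shape reference counting takes in Netty. The sketch below is a simplification of my own, not Netty's actual retain() code, but it shows the lock-free pattern:

```java
import java.util.concurrent.atomic.AtomicInteger;

// CAS retry loop: a lock-free update in the style of reference counting.
class CasDemo {
    private final AtomicInteger refCnt = new AtomicInteger(1);

    // Increment only while the count is still positive: the classic retain() shape.
    boolean retain() {
        for (;;) {
            int current = refCnt.get();
            if (current == 0) return false;              // already released, refuse to revive
            if (refCnt.compareAndSet(current, current + 1)) return true;
            // CAS failed: another thread changed the count; re-read and retry
        }
    }

    int count() { return refCnt.get(); }
}
```

No lock is ever held; under contention a thread simply re-reads the current value and retries, which scales far better than a synchronized counter.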

2.2.7. High-performance serialization framework

The key factors affecting serialization performance are summarized as follows:

1) The size of the serialized byte stream (network bandwidth usage);

2) Serialization & deserialization performance (CPU resource consumption);

3) Whether cross-language use is supported (for integration with heterogeneous systems and development-language switching).

Netty provides support for Google Protobuf by default; by extending Netty's codec interface, users can implement other high-performance serialization frameworks, e.g. the Thrift compressed binary codec framework.

Let's look at a comparison of the serialized byte arrays produced by different serialization frameworks:

Figure 2-26. Comparison of serialized byte-stream sizes across serialization frameworks

As can be seen from the figure, the Protobuf serialized stream is only about 1/4 the size of Java serialization's. It is precisely the poor performance of native Java serialization that gave birth to a variety of high-performance open-source serialization technologies and frameworks (poor performance is only one reason; cross-language support, IDL definitions, and other factors also matter).
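The size gap is easy to reproduce with the JDK alone by comparing Java serialization against a hand-rolled binary encoding of the same object. The POJO below is an illustrative example of my own; the exact ratio varies with the object, but the overhead of class metadata in the Java-serialized stream is always visible:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Why the Java-serialized stream is large: it carries class metadata,
// field descriptors, and a serialVersionUID, not just the field values.
class SerializationSize {
    static class User implements Serializable {
        private static final long serialVersionUID = 1L;
        String name; int age;
        User(String name, int age) { this.name = name; this.age = age; }
    }

    static byte[] javaSerialize(User u) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) { oos.writeObject(u); }
        return bos.toByteArray();
    }

    // Hand-rolled binary encoding: just the field values, as a compact codec would emit.
    static byte[] binaryEncode(User u) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (DataOutputStream dos = new DataOutputStream(bos)) {
            dos.writeUTF(u.name);
            dos.writeInt(u.age);
        }
        return bos.toByteArray();
    }
}
```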

2.2.8. Flexible TCP parameter configuration capability

Setting TCP parameters reasonably can have a significant performance effect in certain scenarios, e.g. SO_RCVBUF and SO_SNDBUF; set improperly, the impact on performance is very large. Below we summarize several configuration items with a relatively large performance impact:

1) SO_RCVBUF and SO_SNDBUF: 128K or 256K is generally recommended;

2) TCP_NODELAY: the Nagle algorithm automatically coalesces small packets in the buffer into larger packets, preventing the network congestion caused by sending large numbers of small packets and thereby improving network application efficiency. But for latency-sensitive application scenarios, this optimization algorithm needs to be disabled;

3) Soft interrupts: if the Linux kernel version supports RPS (2.6.35 or later), enabling RPS can improve network throughput. RPS computes a hash from the packet's source address, destination address, and source and destination ports, then selects the CPU that runs the soft interrupt according to this hash. Viewed from a higher level, this binds each connection to a specific CPU and uses the hash value to balance soft interrupts across multiple CPUs, improving network parallel processing performance.

In Netty, TCP parameters can be flexibly configured in the startup helper classes to meet different user scenarios. The configuration interface is shown below:

Figure 2-27. Netty TCP parameter configuration definition
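For comparison, the options Netty exposes as `ChannelOption.SO_RCVBUF`, `SO_SNDBUF`, and `TCP_NODELAY` map directly onto the plain JDK socket setters; a small sketch using the 128K recommendation above:

```java
import java.io.IOException;
import java.net.Socket;

// The same TCP options set through the underlying JDK socket API.
class TcpTuning {
    static void tune(Socket socket) throws IOException {
        socket.setReceiveBufferSize(128 * 1024);  // SO_RCVBUF: 128K
        socket.setSendBufferSize(128 * 1024);     // SO_SNDBUF: 128K
        socket.setTcpNoDelay(true);               // disable Nagle for latency-sensitive traffic
    }
}
```

Note that the OS may round the requested buffer sizes up or down, so `getReceiveBufferSize()` can report a different value than was requested.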

2.3 Summary

Through the analysis of Netty's architecture and performance model, we find that Netty's high performance comes from a carefully designed and implemented architecture. Thanks to its high-quality architecture and code, supporting 100,000 TPS cross-node service invocation with Netty is not especially difficult.

Origin www.cnblogs.com/java889/p/12156676.html