Netty long connection service

Push service

I still remember a project from a year and a half ago that needed an Android push service. Unlike iOS, the Android ecosystem has no unified push service. Google does have Google Cloud Messaging, but it is not universally adopted even outside China, and inside China it is simply blocked.

Therefore, most push on Android can only be done by polling. During our technical research at the time we came across JPush's blog, which described some of their technical highlights. What they mainly provide is a long connection service over mobile networks, and 500K to 1M connections on a single machine really surprised me! We later adopted their free plan; since our product has a small audience, the free tier was enough for us. It has been running stably for more than a year, very good!

Two years later, after changing departments, I was handed the task of optimizing the company's own long connection server.

Searching the technical articles online again, I found that most of the related hard problems had already been solved and there are plenty of write-ups. 500K to 1M connections on a single machine is not a dream at all; in fact, anyone can do it. But connections alone are not enough; QPS has to go up with them.

So this article summarizes the difficulties and optimization points we encountered while building a long connection service with Netty.

 

 

What is Netty

Netty: http://netty.io/

Netty is an asynchronous event-driven network application framework for rapid development of maintainable high performance protocol servers & clients.

The official description is the most accurate, and the most attractive part of it is the high performance. But many people will have this question: wouldn't it be faster to implement things directly with NIO? It is like writing raw JDBC instead of using iBatis: a bit more code, but surely faster!

However, if you know Netty, you will find out that this is not necessarily true!

The advantages of using Netty instead of writing directly with NIO are:

  • High-performance, highly extensible architecture design; in most cases you only need to focus on the business logic rather than the plumbing

  • Zero-copy techniques to minimize memory copies

  • A native socket transport for Linux

  • The same code is compatible with both the NIO2 of Java 1.7 and the NIO of earlier versions

  • Pooled buffers, which greatly reduce the pressure of allocating and releasing buffers

  • ……

There are many more features; you can read the book "Netty in Action" to learn about them.

Also, the Netty source code is a great textbook! It is well worth reading as you use the framework.

 

 

What is the bottleneck

To build a long connection service, what is the ultimate goal, and where are the bottlenecks?

In fact, there are two main goals:

  1. More connections

  2. Higher QPS

So let's look at the difficulties and the points to watch for each of these two goals.

 

 

More connections

 

non-blocking IO

In fact, whether you use Java NIO or Netty, there is no difficulty in reaching millions of connections. Because they are non-blocking IO, there is no need to create a thread for each connection.

For more background, you can look up the relevant knowledge about BIO, NIO, and AIO.

 

 

Java NIO achieves millions of connections
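A minimal sketch of such a server might look like the following (an illustrative reconstruction rather than the original listing; the port and backlog are arbitrary example values):

    import java.net.InetSocketAddress;
    import java.nio.channels.SelectionKey;
    import java.nio.channels.Selector;
    import java.nio.channels.ServerSocketChannel;
    import java.nio.channels.SocketChannel;
    import java.util.Iterator;

    public class NioAcceptServer {
        public static void main(String[] args) throws Exception {
            Selector selector = Selector.open();
            ServerSocketChannel server = ServerSocketChannel.open();
            server.configureBlocking(false);
            server.socket().bind(new InetSocketAddress(8888), 1024); // backlog of 1024
            server.register(selector, SelectionKey.OP_ACCEPT);

            int connections = 0;
            while (true) {
                selector.select();
                Iterator<SelectionKey> it = selector.selectedKeys().iterator();
                while (it.hasNext()) {
                    SelectionKey key = it.next();
                    it.remove();
                    if (key.isAcceptable()) {
                        // Accept the connection and then leave it idle:
                        // we only care about how many connections can be held.
                        SocketChannel channel = server.accept();
                        if (channel != null) {
                            channel.configureBlocking(false);
                            System.out.println("connections: " + (++connections));
                        }
                    }
                }
            }
        }
    }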

 

 

This code only accepts incoming connections and does nothing else; it is purely to test the limit on the number of idle connections.

You can see that this code is the basic way of writing NIO, nothing special.

 

 

Netty achieves millions of connections
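As an illustrative sketch (using the Netty 4 API; values such as the port and backlog are arbitrary examples), the initialization is roughly the following:

    import io.netty.bootstrap.ServerBootstrap;
    import io.netty.channel.ChannelInitializer;
    import io.netty.channel.ChannelOption;
    import io.netty.channel.EventLoopGroup;
    import io.netty.channel.nio.NioEventLoopGroup;
    import io.netty.channel.socket.SocketChannel;
    import io.netty.channel.socket.nio.NioServerSocketChannel;

    public class NettyAcceptServer {
        public static void main(String[] args) throws Exception {
            EventLoopGroup bossGroup = new NioEventLoopGroup(1);
            EventLoopGroup workerGroup = new NioEventLoopGroup();
            try {
                ServerBootstrap bootstrap = new ServerBootstrap();
                bootstrap.group(bossGroup, workerGroup)
                         .channel(NioServerSocketChannel.class)
                         .option(ChannelOption.SO_BACKLOG, 1024)
                         .childHandler(new ChannelInitializer<SocketChannel>() {
                             @Override
                             protected void initChannel(SocketChannel ch) {
                                 // No handlers: we only hold the connection open.
                             }
                         });
                bootstrap.bind(8888).sync().channel().closeFuture().sync();
            } finally {
                bossGroup.shutdownGracefully();
                workerGroup.shutdownGracefully();
            }
        }
    }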

 

 

This is just very ordinary Netty initialization code. Again, there is nothing special about it at all in order to reach a million connections.

 

 

Where is the bottleneck

Both implementations above are very simple and present no difficulty, so someone will surely ask: where is the bottleneck in reaching millions of connections, then?

In fact, as long as you use non-blocking IO in Java (both NIO and AIO count), a single thread can handle a huge number of socket connections. Unlike BIO, no thread is created per connection, so the code itself is not the bottleneck.

The real bottleneck is the Linux kernel configuration. The default configuration limits the global maximum number of open files (Max Open Files) and also limits the number of processes, so some changes to the Linux kernel configuration are required.

This seems very simple now, just copy the configuration you find online, but you can imagine how hard it was for the people who had to work it out for the first time.

Here are a few articles that describe how to modify the relevant configuration:

Building a server for the C1000K

(http://www.ideawu.net/blog/archives/740.html)

Notes on a 1-million-concurrent-connection server: the 1M concurrent connection goal reached
(http://www.blogjava.net/yongboy/archive/2013/04/11/397677.html)

Taobao engineering's attempt at and tuning of 2 million HTTP long connections
(http://www.linuxde.net/2013/08/15150.html)
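As a rough illustration, the key settings those articles cover are the open-file limits; the values and file locations below are only examples and vary by distribution:

    # /etc/sysctl.conf : system-wide cap on open file handles (example value)
    fs.file-max = 1048576

    # /etc/security/limits.conf : per-process open file limit (example values)
    *    soft    nofile    1048576
    *    hard    nofile    1048576

    # check the limit in the current shell
    ulimit -n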

 

 

How to verify

Making the server support one million connections is not difficult, and we quickly had a test server ready. But the biggest question was: how do we verify that this server can really hold one million connections?

We wrote a test client with Netty, which also uses non-blocking IO, so there is no need to spawn many threads. However, the number of ports on a single machine is limited: even with root permission, you can open at most a little over 60,000 connections. So here we use Netty to write a client and simply use up all the connections a single machine can make.
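A sketch of what such a client can look like (illustrative only; the host, port, and the 60,000 loop count are example values):

    import io.netty.bootstrap.Bootstrap;
    import io.netty.channel.ChannelInitializer;
    import io.netty.channel.EventLoopGroup;
    import io.netty.channel.nio.NioEventLoopGroup;
    import io.netty.channel.socket.SocketChannel;
    import io.netty.channel.socket.nio.NioSocketChannel;

    public class ConnectOnlyClient {
        public static void main(String[] args) throws Exception {
            String host = args.length > 0 ? args[0] : "127.0.0.1";  // server address
            EventLoopGroup group = new NioEventLoopGroup();
            Bootstrap bootstrap = new Bootstrap()
                    .group(group)
                    .channel(NioSocketChannel.class)
                    .handler(new ChannelInitializer<SocketChannel>() {
                        @Override
                        protected void initChannel(SocketChannel ch) {
                            // Nothing to do: we only hold the connection open.
                        }
                    });
            // Open as many connections as the local port range allows (roughly 60K per IP).
            for (int i = 0; i < 60000; i++) {
                bootstrap.connect(host, 8888);
            }
            // Keep the process alive so the connections stay open.
            Thread.sleep(Long.MAX_VALUE);
        }
    }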

 

The code is also very simple: it just connects and performs no other operations.

Then just find a machine and start the program. Note that the client should preferably have its Linux kernel parameters adjusted in the same way as the server.

 

 

How to find so many machines

As described above, a single client machine can open at most about 60,000 connections, so at least 17 machines would be needed for one million connections!

How do we get around this limit? In fact, the limit comes from the network card, that is, from the local IP address, which only provides about 60,000 usable ports. We later solved it by running virtual machines and setting each VM's virtual network card to bridged mode, so that every VM gets its own IP address.

Depending on the physical machine's memory, a single physical machine can run at least 4 to 5 virtual machines, so in the end only 4 physical machines were enough to create a million connections.

 

 

A playful approach

Besides squeezing machine resources with virtual machines, there is another very neat trick, which I also stumbled upon during the verification process.

According to the TCP/IP protocol, a normal disconnect happens only after one side sends a FIN. If the network is interrupted momentarily, the connection is not disconnected automatically.

So can we do this?

  1. Start the server without setting the socket's keep-alive property (it is not set by default)

  2. Connect to the server with a virtual machine

  3. Force shutdown of the virtual machine

  4. Modify the MAC address of the virtual machine network card, restart and connect to the server

  5. The server accepts new connections and keeps previous connections

What we want to verify is the server's limit, so we just need to keep the server believing it has that many connections, right?

In our tests this behaves exactly the same as connecting from real machines, because the server simply assumes the peer's network is bad and will not disconnect it.

Keep-alive is disabled because, if it were enabled, the socket would periodically probe whether the connection is still usable and forcibly close it if not.
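In Netty, for example, keep-alive is just a channel option and stays off unless you explicitly enable it (serverBootstrap below is an assumed variable name, shown only as a sketch):

    // SO_KEEPALIVE defaults to false; for this test we simply never enable it.
    serverBootstrap.childOption(ChannelOption.SO_KEEPALIVE, false);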

 

 

Higher QPS

Since both NIO and Netty are non-blocking IO, only a few threads are needed no matter how many connections there are, and QPS does not drop as the number of connections grows (provided there is enough memory).

And Netty itself is well enough designed that it is not the bottleneck for high QPS. So what is the bottleneck for high QPS?

It's the design of the data structure!

 

 

How to optimize data structures

First of all, you need to be familiar with the characteristics of the various data structures. In a complex project you can rarely get by with a single collection; you often end up combining several of them.

Achieving high performance while also keeping the data consistent, with no possibility of deadlock: the difficulty here is really not small...

My takeaway here is, don't optimize prematurely. Prioritize consistency, ensure data accuracy, and then find ways to optimize performance.

Consistency matters much more than performance, and for many performance problems the bottleneck ends up in a completely different place once the volume grows. So I think the best practice is: while writing the code, focus on consistency and treat performance as secondary; once the code is done, find the number-one hotspot and fix it!

 

 

Resolve CPU Bottlenecks

Before doing this optimization, first put your server under heavy load in the test environment.

Once you have stress tests running, you need tools to find the performance bottlenecks!

I like to use VisualVM: open the tool, go to the Sampler, and sort in descending order by Self Time (CPU); the entry at the top is the point you need to optimize!

Note: what is the difference between the Sampler and the Profiler? The former works by sampling, so the data is not perfectly accurate but it barely affects performance; the latter gives exact statistics but has a large impact on performance. If your program already consumes a lot of CPU, prefer the Sampler: turning on the Profiler would slow things down badly, even though sampling sacrifices some accuracy.

I still remember that the first bottleneck we found in the project turned out to be the size() method of ConcurrentLinkedQueue. With a small queue it has no effect, but when the queue is large, size() counts the elements from scratch every time, and we were calling it very frequently, so it hurt performance.

The implementation of size() is as follows:
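Roughly, from the JDK 7/8 sources (the exact code can differ slightly between JDK versions), it walks the whole linked list node by node:

    public int size() {
        int count = 0;
        for (Node<E> p = first(); p != null; p = succ(p))
            if (p.item != null)
                // Collection.size() is specified to return at most Integer.MAX_VALUE
                if (++count == Integer.MAX_VALUE)
                    break;
        return count;
    }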

 

Later we solved it by maintaining a separate AtomicInteger as the counter. But doesn't keeping the count separately mean strict consistency can no longer be guaranteed? It doesn't matter: this part of our code only cares about eventual consistency, so eventual consistency is all we need to guarantee.
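A minimal sketch of the idea (the class and method names here are hypothetical, not our actual code):

    import java.util.Queue;
    import java.util.concurrent.ConcurrentLinkedQueue;
    import java.util.concurrent.atomic.AtomicInteger;

    // Keeps an O(1) counter next to the queue. The counter can lag briefly behind
    // the real queue contents, which is fine when only eventual consistency is needed.
    public class CountedQueue<E> {
        private final Queue<E> queue = new ConcurrentLinkedQueue<E>();
        private final AtomicInteger size = new AtomicInteger();

        public void add(E e) {
            queue.add(e);
            size.incrementAndGet();
        }

        public E poll() {
            E e = queue.poll();
            if (e != null) {
                size.decrementAndGet();
            }
            return e;
        }

        public int size() {
            return size.get();   // O(1), no traversal of the queue
        }
    }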

In short, analyze each case on its own merits; different business requirements call for different implementations.

 

 

Solve GC bottlenecks

GC bottlenecks are also part of CPU bottlenecks, because unreasonable GC can greatly affect CPU performance.

VisualVM is still used here, but you need to install a plugin: VisualGC

With this plugin, you can visually see the GC activity.

According to our understanding, it is normal to have a large number of New GCs during the stress test, because a large number of objects are being created and destroyed.

But seeing a lot of Old GCs right from the start was puzzling!

It turned out that in our stress-test environment, since Netty's QPS is largely independent of the number of connections, we had opened only a small number of connections, so not much memory was being allocated.

In the JVM, the default ratio of the young generation to the old generation is 1:2, so much of the old generation was wasted while the young generation was not big enough.

After adjusting -XX:NewRatio, the number of Old GCs dropped significantly.
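For example, flags along these lines give the young generation a larger share of the heap (the values are purely illustrative, not our actual production settings, and your-server.jar is a placeholder):

    # -XX:NewRatio=1 makes the young and old generations the same size
    java -Xms8g -Xmx8g -XX:NewRatio=1 -jar your-server.jar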

The production environment is different, however. Production will not see such a high QPS, but there will be many connections, and the objects tied to those connections live for a long time, so production should give more space to the old generation.

In short, GC tuning, like CPU optimization, requires continuous adjustment; it does not happen overnight.

 

 

Other optimizations

If you have finished the optimizations above, be sure to check out this presentation by the author of Netty in Action: Netty Best Practices a.k.a. Faster == Better (http://normanmaurer.me/presentations/2014-facebook-eng-netty/slides.html).

I believe you will benefit a lot from it; after applying some of the small optimizations it mentions, our overall QPS improved considerably.

Last but not least, Java 1.7 performs much better than Java 1.6! Netty's programming style is event-driven, which looks like AIO, but Java 1.6 has no AIO while Java 1.7 does, so running on Java 1.7 brings a noticeable performance improvement.

 

 

Final result

After several weeks of continuous stress testing and optimization, we finally achieved 600,000 connections and 200,000 QPS on Java 1.6, on a machine with 16 cores and 120G of memory (only 8G of which was allocated to the JVM).

This is not actually the limit: the JVM was given only 8G of memory, and with a larger memory configuration the number of connections could go even higher.

The QPS looks high, yet the System Load Average stayed very low, which means the bottleneck is neither CPU nor memory, so it should be IO! The Linux configuration above was aimed at reaching millions of connections and was not tuned for our own business scenario.

Because the current performance is completely sufficient (in production a single machine sees at most about 10,000 QPS), we turned our attention elsewhere first. I believe we will keep optimizing this area in the future and look forward to a bigger breakthrough in QPS!

 

Editor's note:

Netty is well suited to the rapid development of high-performance, highly available network services and is widely used in the industry; for example, the distributed service framework Dubbo uses Netty as its underlying communication component. Netty is a worthwhile area of study for advanced Java programmers.

The recommended reading is as follows:

Netty threading model

http://www.infoq.com/cn/articles/netty-threading-model/

gRPC principle analysis

http://shift-alt-ctrl.iteye.com/blog/2292862
