When the business demands it, how do you make performance optimization thorough and its results outstanding?

Performance optimization can seem like a rather intangible technical requirement. Unless the code is unbearably slow, few companies are willing to devote resources to it. Even with performance metrics in hand, it's hard to convince leadership that cutting a request from 300ms to 150ms is worth doing, because the improvement has no obvious business value.

It's sad, but that's the reality.

Performance optimization is usually initiated by people with technical ambition: proactive optimization driven by observed metrics. They are often craftsmen, fussing over every millisecond and striving for perfection. Provided, of course, that they have the time.

1. Optimization context and goals

This round of performance optimization happened because things had become unbearable; the work was carried out as an after-the-fact, problem-driven remedy. That is usually fine. After all, business comes first, and you iterate your way out of the hole.

Some background first. The service to be optimized had very unstable response times. As the volume of data grew, most requests took about 5-6 seconds, beyond what any ordinary user can bear.

Of course optimization is required.

To illustrate the optimization target, I roughly sketched its topology. As the figure shows, it is a set of services in a microservice architecture.

Our optimization target is a relatively upstream service. It calls a large number of downstream providers through Feign interfaces, obtains data, aggregates and splices it, and finally returns it to the browser through the zuul gateway and nginx.

To observe the calling relationships between services and the monitoring data, we hooked the services up to the SkyWalking tracing platform and the Prometheus monitoring platform to collect the key data that optimization decisions can be based on. Before optimizing, let's look at the two technical indicators the work will reference:

  • Throughput: the number of operations per unit time, such as QPS, TPS, HPS.
  • Average response time: the average time spent on each request.

Average response time should naturally be as small as possible, and the smaller it is, the higher the throughput. Raising throughput also makes reasonable use of multiple cores, increasing the number of operations per unit time through parallelism. Roughly, QPS ≈ concurrency / average response time, so halving the response time doubles the throughput at the same concurrency.

The goal of this optimization is to reduce the average response time of certain interfaces to under 1 second, and to raise throughput, that is, to increase QPS, so that a single instance can handle more concurrent requests.

2. Sharply reduce time spent through compression

I'll start with one of the most important optimizations, the one that made the system fly: compression.

Looking at the request data in Chrome's inspector, we found a key interface that transfers about 10MB of data per request. What on earth is in there?

Downloading that much data takes a long time. The figure below shows a request I made to the juejin homepage; its "Content Download" phase represents the time the data spends in transit on the network. If the user's bandwidth is low, this request will take very long.

To reduce the data's transfer time over the network, gzip compression can be enabled. Gzip trades CPU time for a smaller payload. For most services, the last hop is nginx, and most people do their compression at the nginx layer. Its main configuration looks like this:

gzip on;
gzip_vary on;
gzip_min_length 10240;
gzip_proxied expired no-cache no-store private auth;
gzip_types text/plain text/css text/xml text/javascript application/x-javascript application/xml;
gzip_disable "MSIE [1-6]\.";

How dramatic is the compression ratio? Take a look at this screenshot: after compression, the data shrinks from 8.95MB to 368KB! The browser can download it in an instant.

But wait: nginx is only the outermost hop, and we're not done yet. We can make requests even faster.

Look at the request path below. With microservices, the flow of a request gets complicated: nginx does not call the relevant service directly; it calls the zuul gateway, the zuul gateway calls the real target service, and the target service in turn calls other services. Intranet bandwidth is still bandwidth, and network latency still affects call speed, so these hops should be compressed too.

nginx -> zuul -> Service A -> Service E

To make all the Feign calls in between travel through a compressed channel as well, extra configuration is needed. Ours is a Spring Boot service, so this can be handled through okhttp's transparent compression.

Add its dependency:

<dependency>
	<groupId>io.github.openfeign</groupId>
	<artifactId>feign-okhttp</artifactId>
</dependency>

Enable server configuration:

server:
  port: 8888
  compression:
    enabled: true
    min-response-size: 1024
    mime-types: ["text/html","text/xml","application/xml","application/json","application/octet-stream"]

Enable client configuration:

feign:
  httpclient:
    enabled: false
  okhttp:
    enabled: true

After these compression steps, the average response time of our interface dropped straight from 5-6 seconds to 2-3 seconds, a very significant improvement.

Of course, we also worked on the result set itself: unused objects and fields in the data returned to the front end were trimmed away. But in general these changes are invasive and require a lot of code adjustment, so we spent limited energy here and the effect was correspondingly limited.

3. Fetch data in parallel and respond quickly

Next, we need to dig into the internal code logic. As mentioned above, the user-facing interface is really a data-aggregation interface: each request calls the interfaces of dozens of other services through Feign, fetches the data, and splices the result sets together.

Why is it slow? Because these requests are all serial! Feign calls are remote calls, that is, network-I/O-intensive calls that spend most of their time waiting. As long as the data dependencies permit, they are very well suited to being made in parallel.

First, we needed to analyze the dependencies among these dozens of sub-interfaces and see whether they had strict ordering requirements. If most of them didn't, so much the better.

The analysis results were mixed. By calling logic, the interfaces fall roughly into two categories, A and B. The class-A interfaces must be requested first; once their data has been spliced together, it is consumed by the class-B interfaces. Within class A, and within class B, there is no ordering requirement.

In other words, we could split the interface into two phases executed sequentially, with the data inside each phase fetched in parallel.

We then transformed the code along these lines. With CountDownLatch from the java.util.concurrent package, the parallel-fetch logic is easy to implement:

CountDownLatch latch = new CountDownLatch(jobSize);
// submit jobs: each one fetches data from one downstream Feign interface
executor.execute(() -> {
    try {
        // job code: call the downstream service and collect its result
    } finally {
        latch.countDown(); // always count down, even if the job fails
    }
});
executor.execute(() -> {
    try {
        // job code
    } finally {
        latch.countDown();
    }
});
// ... more jobs
// wait for all jobs to finish, but never longer than the timeout
latch.await(timeout, TimeUnit.MILLISECONDS);

The result was very satisfying: the interface latency was cut almost in half, down to under 2 seconds.

You might ask: why not use Java's parallel streams? For the pitfalls of parallel streams you can refer to the article below; I strongly recommend against using them here. (In short: by default, all parallel streams in a JVM share the common ForkJoinPool, so blocking I/O jobs in one stream can starve everything else that uses the pool.)

"The pit of parallelStream, you don't know if you don't step on it, you will be shocked when you step on it"

Be careful with concurrent programming, especially in business code. We built a dedicated thread pool to back this parallel fetching:

// Dedicated pool for parallel fetching: 100 core threads, at most 200,
// and a bounded queue of 100 (the default AbortPolicy rejects overflow).
final ThreadPoolExecutor executor = new ThreadPoolExecutor(100, 200, 1,
        TimeUnit.HOURS, new ArrayBlockingQueue<>(100));

Compression and parallelization were the most effective measures in our optimization; together they cut away most of the request time. But we were still not satisfied, because each request still took more than 1 second.

4. Tiered caching for further acceleration

We found that some data was being fetched inside a loop, generating many redundant requests, which could not be tolerated:

// pseudocode: one remote call per loop iteration
for (String id : idList) {
    client.getData(id);
}

If these commonly used results are cached, the number of network I/O requests drops sharply and the program runs more efficiently.

Caching plays a huge role in optimizing most applications. In our scenario, however, its effect was less dramatic than compression and parallelization; it still trimmed about 30-40 milliseconds off each request.

We did it anyway.

First, the data with simple code paths that fits the Cache-Aside Pattern went into the distributed cache Redis. Concretely: on a read, check the cache first and only go to the database on a miss; on an update, write the database first and then delete the cache entry (with delayed double deletion). This covers most caching scenarios with simple business logic and keeps the data consistent.
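A minimal sketch of this pattern, assuming Spring's StringRedisTemplate and hypothetical loadFromDb/updateDb helpers (the 60-second TTL and 1-second delete delay are illustrative assumptions, not values from our system):

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import org.springframework.data.redis.core.StringRedisTemplate;

public class CacheAsideRepository {
    private final StringRedisTemplate redis;
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    public CacheAsideRepository(StringRedisTemplate redis) {
        this.redis = redis;
    }

    public String read(String key) {
        String cached = redis.opsForValue().get(key);
        if (cached != null) {
            return cached;                            // cache hit
        }
        String value = loadFromDb(key);               // miss: read the database
        redis.opsForValue().set(key, value, 60, TimeUnit.SECONDS);
        return value;
    }

    public void update(String key, String value) {
        updateDb(key, value);                         // 1. update the database first
        redis.delete(key);                            // 2. then delete the cache entry
        // 3. delayed double deletion: evict again shortly afterwards, in case a
        //    concurrent read wrote a stale value back between steps 1 and 2
        scheduler.schedule(() -> { redis.delete(key); }, 1, TimeUnit.SECONDS);
    }

    private String loadFromDb(String key) { /* hypothetical DB read */ return ""; }
    private void updateDb(String key, String value) { /* hypothetical DB write */ }
}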

But that alone was not enough. Some business logic is very complex and its update code is scattered, so it is not a good fit for a Cache-Aside transformation. We noticed that certain data has the following characteristics:

  1. After a time-consuming acquisition, the data will be used again within an extremely short time.
  2. The business can tolerate the data being stale for up to about a second.
  3. The data is used across code paths and across threads, in all sorts of ways.

For this situation we designed a very short-lived in-heap cache: after 1 second the data expires and is read from the database again. Where a newly added node used to hit a server interface 1,000 times per second, we cut that down to once.

We used Guava's LoadingCache for this, and the Feign interface calls dropped by orders of magnitude:

// In-heap cache: entries expire 1 second after being written, so repeated
// reads within that second hit memory instead of the network.
LoadingCache<String, String> lc = CacheBuilder
        .newBuilder()
        .expireAfterWrite(1, TimeUnit.SECONDS)
        .build(new CacheLoader<String, String>() {
            @Override
            public String load(String key) throws Exception {
                return slowMethod(key); // the original slow fetch, on cache miss only
            }
        });

5. MySQL index optimization

Our business system uses MySQL. Since no professional DBA was involved and the tables were generated through JPA, the optimization surfaced a large number of unreasonable indexes, which of course had to be fixed.

Because the SQL itself is sensitive, I will only describe some index-optimization rules we ran into; I'm sure you can map them onto your own business system.

Indexes are very useful, but be aware: if you apply a function to a column, its index will not be used. The two common index-failure cases we encountered are below (see the sketch after this list).

  • The type of the indexed column differs from the type of the parameter passed in, forcing an implicit conversion. For example, passing an int parameter against a varchar column.
  • The two tables being joined use different character sets, so the join column cannot use its index.
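As a hedged illustration of the first case, here is a hypothetical JPA repository (the User entity, table t_user and column user_no are all made up). The user_no column is varchar and indexed; binding a numeric parameter makes MySQL convert the column value row by row, so the index is skipped:

import java.util.List;
import org.springframework.data.jpa.repository.JpaRepository;
import org.springframework.data.jpa.repository.Query;
import org.springframework.data.repository.query.Param;

public interface UserRepository extends JpaRepository<User, Long> {

    // Numeric parameter vs. varchar column: implicit conversion on every row,
    // so the query degrades to a full scan.
    @Query(value = "select * from t_user where user_no = :userNo", nativeQuery = true)
    List<User> findByUserNoBad(@Param("userNo") Long userNo);

    // A String parameter matches the column type, so the index is used.
    @Query(value = "select * from t_user where user_no = :userNo", nativeQuery = true)
    List<User> findByUserNoGood(@Param("userNo") String userNo);
}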

The most basic rule of MySQL index optimization is the leftmost-prefix principle. Given three fields a, b, c, if queries filter on a, or on a and b, or on a, b and c, then one composite index (a, b, c) serves them all, because it implicitly contains the indexes (a) and (a, b). Strings can also be indexed by prefix, though this is less common in ordinary applications.

Sometimes the MySQL optimizer picks the wrong index and we have to pin the right one with force index. In JPA this means using nativeQuery to write SQL bound specifically to MySQL, a situation we try to avoid as much as possible.
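When it cannot be avoided, a minimal sketch looks like the following, continuing the hypothetical repository above (table, columns and the index name idx_a_b_c are made up; note that filtering on a and b also respects the index's leftmost prefix):

// Hypothetical: force MySQL to use the composite index idx_a_b_c.
// nativeQuery ties the statement to MySQL, which is why we avoid it when possible.
@Query(value = "select id, a, b, c from t_order force index (idx_a_b_c) "
             + "where a = :a and b = :b", nativeQuery = true)
List<Order> findWithForcedIndex(@Param("a") String a, @Param("b") String b);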

Another optimization is reducing back-to-table lookups. InnoDB stores data in a B+ tree: a query through a secondary (non-primary-key) index first finds the primary key and then takes one more step into the clustered index to locate the row, and that extra step is the back-to-table lookup. Using a covering index avoids it to a large extent, and it is a common optimization: put the columns the query selects into the composite index itself, trading space for time. For example, an index on (user_id, status) can answer a query like select status from t where user_id = ? without touching the clustered index.

6. JVM optimization

I usually leave JVM optimization until the very end, and unless the system has serious stutters or OOM problems, I don't go out of my way to over-optimize it.

Unfortunately our application, with its large heap (8GB+), frequently stuttered under the default Parallel collector of JDK 1.8. It was not constant, but pauses of a few seconds at a time seriously affected the smoothness of some requests.

At first the program ran bare on the JVM: GC information, OOM dumps, nothing was recorded. To capture GC information, we made the following changes.

The first step was to add various parameters for GC troubleshooting:

-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/opt/xxx.hprof  -DlogPath=/opt/logs/ -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime -XX:+PrintTenuringDistribution -Xloggc:/opt/logs/gc_%p.log -XX:ErrorFile=/opt/logs/hs_error_pid%p.log

With this in place, we can take the generated GC log and upload it to a platform such as gceasy for analysis, where we can inspect the JVM's throughput, the latency of each GC phase, and so on.

The second step was to expose Spring Boot's GC information and hook it into Prometheus monitoring.

Add the dependency to the pom:

<dependency>
  <groupId>org.springframework.boot</groupId>
  <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
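Note that the prometheus endpoint is only registered when the Micrometer Prometheus registry is also on the classpath:

<dependency>
  <groupId>io.micrometer</groupId>
  <artifactId>micrometer-registry-prometheus</artifactId>
</dependency>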

Then configure the exposure points. The metrics become available at /actuator/prometheus, giving us real-time data and a basis for optimization decisions:

management.endpoints.web.exposure.include=health,info,prometheus

After observing the JVM's behavior, we switched to the G1 garbage collector. G1 has a maximum pause-time target, which makes our GC pauses smoother. Its main tuning parameters are the following (an example startup line is sketched after this list):

  • -XX:MaxGCPauseMillis sets the target pause time; G1 will try to meet it.
  • -XX:G1HeapRegionSize sets the region size. The value is a power of 2, neither too large nor too small; if you don't know what to choose, keep the default.
  • -XX:InitiatingHeapOccupancyPercent starts the concurrent marking phase when usage of the whole heap reaches this percentage (45% by default).
  • -XX:ConcGCThreads sets the number of threads the concurrent collector uses. The default varies with the platform the JVM runs on; changing it is not recommended.
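For reference, a hedged example of what such a G1 startup line might look like (the 200ms pause target is an illustrative assumption, and 45 is already the default; tune both against your own latency data):

-XX:+UseG1GC -XX:MaxGCPauseMillis=200 -XX:InitiatingHeapOccupancyPercent=45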

After switching to G1, the recurring stutters magically disappeared! Along the way we also ran into quite a few memory-overflow problems, but with the artifact MAT (Eclipse Memory Analyzer) to help, they were all solved fairly easily.

7. Other optimizations

If there are flaws at the level of project structure and architecture, code-level optimization can only do so much, as our case shows.

But the main code still needed adjusting. We paid special attention to critical code on the high-latency paths and cleaned it up against our development conventions. A few points left a deep impression.

To reuse map collections, some colleagues called the clear method after every use:

map1.clear();
map2.clear();
map3.clear();
map4.clear();

These maps held a lot of data, and the clear method is special: its time complexity is O(n) in the capacity of the backing table, so these calls were surprisingly expensive:

// java.util.HashMap#clear walks the entire underlying table array:
public void clear() {
    Node<K,V>[] tab;
    modCount++;
    if ((tab = table) != null && size > 0) {
        size = 0;
        for (int i = 0; i < tab.length; ++i)
            tab[i] = null;
    }
}

Similarly with thread-safe queues: ConcurrentLinkedQueue's size() method traverses the entire queue, so its time complexity is O(n), yet a colleague was calling it for some reason. Calls like these are performance killers; if you only need to know whether the queue is empty, isEmpty() is the cheap way to ask:

// java.util.concurrent.ConcurrentLinkedQueue#size traverses the whole queue:
public int size() {
    restartFromHead: for (;;) {
        int count = 0;
        for (Node<E> p = first(); p != null;) {
            if (p.item != null)
                if (++count == Integer.MAX_VALUE)
                    break;  // @see Collection.size()
            if (p == (p = p.next))
                continue restartFromHead;
        }
        return count;
    }
}

In addition, some of the service's web pages responded very slowly. This was due to complex business logic plus slow front-end JavaScript execution, and that part of the optimization was handled by front-end colleagues. As shown in the figure, the Performance tab of Chrome or Firefox makes it easy to find the time-consuming front-end code.

8. Summary

Performance optimization does follow routines, but teams generally wait until problems occur before optimizing and rarely take precautions. With monitoring and APM in place, however, things are different: we can obtain data at any time and work backwards from it to drive optimization.

Some performance problems can be solved at the business-requirements level or at the architecture level. Any optimization that has fallen all the way to the code layer and needs a programmer's intervention has usually reached the point where the requirements side and the architecture side can no longer move, or no longer want to.

Performance optimization must start by collecting information and identifying the bottleneck, weighing CPU, memory, network, I/O and other resources, and then work to reduce the average response time and raise the throughput.

Caching, buffering, pooling, reducing lock contention, asynchrony, parallelism, and compression are all common optimization techniques. In our scenario, compression and parallel requests contributed the most. With the other techniques assisting, our business interface's latency fell from 5-6 seconds to under 1 second; the results are impressive enough that it probably won't need optimizing again for a long time.


Origin blog.csdn.net/wdjnb/article/details/124426381