Performance optimization of an online service

Background

There are several services in our group that, business logic aside, process requests in basically the same way. The logic these services execute is simple, yet they share several problems:

  1. Single-machine QPS is very low, so a lot of machines are needed
  2. The 99th-percentile latency is far higher than the theoretical value
  3. When the service hits its load ceiling, the server's load is still very low

None of this is a problem when the company has deep pockets: performance not good enough? Add machines until it is! But when upstream timeout complaints came around yet again, programmer dignity demanded more than just adding machines, so this time I decided to actually optimize the program, and get to practice a bit of dragon-slaying.

Technology stack:

Language: Java
Communication: Thrift, with THsHaServer as the server mode
Redis client: jedis
Model: xgboost
Runtime: Docker, 8 cores / 8 GB
Cache cluster: speaks the redis protocol, but its implementation differs from redis

Optimization Results

Optimization process

To improve a program's performance, you first need to find where its bottlenecks are, then optimize them in a targeted way.

Process Analysis

Flowchart before optimization

[Figure: flowchart before optimization]

The business-logic flowchart is omitted; only the interaction between threads is shown. The part that builds redis requests actually uses two forkjoin thread pools, but their logic is so similar that the diagram merges them into one flow. The two steps drawn in magenta are clearly the places that hurt performance most.

Combined with the service's execution environment, the following problems can be identified:

Too many threads

Because the operating system schedules threads preemptively, frequent thread context switches cause several problems:

  1. Time spent executing kernel-mode instructions rises (the cpu.sys metric), wasting CPU execution time
  2. The user-mode instruction time each thread gets per unit time falls (the cpu.user metric)
  3. Together these not only raise machine load but also lengthen the execution time of every single task
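Context-switch pressure is easy to confirm on Linux, where the kernel exposes per-process counters in /proc. A quick illustrative check (not the article's monitoring system):

```shell
# Linux keeps per-process context-switch counters in /proc/<pid>/status:
#   voluntary_ctxt_switches    - switches caused by blocking (IO, locks, sleep)
#   nonvoluntary_ctxt_switches - switches caused by scheduler preemption
# Inspect the current shell as an example; substitute the service's pid.
grep ctxt /proc/self/status
```

A service heavy on blocking IO shows the voluntary counter climbing fast, which matches the problems listed above.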

Two blocking points

Two steps in the flowchart block the red thread's execution: thrift-server blocks waiting for all the redis results, and the forkjoin threads block waiting for redis to respond, because the network communication uses synchronous IO.

When a thread moves from the running state to the blocked state, a thread context switch occurs and the thread then has to wait to be rescheduled. That is the cost at the operating-system level.

Suppose the service receives 40 requests simultaneously. From the chart, the service can send at most 86 concurrent redis requests, but for all 40 tasks to proceed it would have to send 40 × 20 = 800 redis requests at once. Raising the forkjoin pool's maximum thread count that high is obviously not an option; in fact, this initial version of the service placed no limit at all on the forkjoin maximum thread count, so when service load rose the thread count soared, the server-load alarms fired nonstop, and request latency climbed very high.

  1. Because blocking operations stall a thread, the only way to raise concurrency is to open more threads, and more threads in turn hurt performance
  2. Because redis requests are sent over synchronous IO, the number of redis requests in flight at once is very limited. This not only drags QPS down badly, it also inflates the 99th-percentile latency.

Tasks can wait indefinitely

The flowchart also shows that tasks have no timeout and can wait forever. That has the following effects:

  1. If the client sets no timeout, a task may wait indefinitely, possibly outliving the client itself.
  2. If the client has already cancelled because of a timeout, executing the task anyway is pointless; it also squeezes out execution time for other tasks, and can even trigger an avalanche.

To sum up:

From the analysis above, the program's main bottleneck is the redis query flow, followed by the blocking of the thrift-server logic thread.

Process Optimization

Since the program's JVM GC metrics were within the normal range, and GC behavior would need to be observed again after any architecture change anyway, we started with the architectural transformation.

Cache query flow

This is mainly a network optimization; at the business layer it usually starts from two angles:

  1. Reduce the number of network round trips, i.e., merge multiple requests into one. Because our company's cache cluster is implemented differently from redis, pipeline and mget bring limited gains here, and can even make things slower.
  2. Use asynchronous IO, which both reduces the number of threads and raises the number of redis requests that can be in flight at once.

The redis client in use was jedis; we replaced it with lettuce, an open-source asynchronous redis client built on netty.
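The lettuce migration itself is environment-specific, but the fan-out pattern it enables can be sketched with plain CompletableFutures. Here fakeAsyncGet is a stand-in for a real async client call (e.g. lettuce's RedisAsyncCommands.get, which returns a future immediately instead of blocking):

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;

public class AsyncFanOut {
    // Stand-in for an async client call: returns a future immediately
    // instead of blocking the calling thread until the value arrives.
    static CompletableFuture<String> fakeAsyncGet(String key) {
        return CompletableFuture.supplyAsync(() -> "value-of-" + key);
    }

    // Issue all lookups at once, then collect the results; no thread is
    // tied up per key while the requests are in flight.
    static List<String> fetchAll(List<String> keys) {
        List<CompletableFuture<String>> futures =
                keys.stream().map(AsyncFanOut::fakeAsyncGet).collect(Collectors.toList());
        return futures.stream().map(CompletableFuture::join).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(fetchAll(List.of("a", "b")));
    }
}
```

With a real async client, a small number of event-loop threads can keep hundreds of requests in flight, instead of one blocked thread per request.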

thrift-server

Use AsyncProcessor to make it asynchronous.

Add a task timeout

Set timeout logic on the top-level CompletableFuture: when a task times out, cancel the CompletableFuture and propagate the cancellation down, layer by layer.
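A minimal sketch of such timeout logic, using the JDK's own CompletableFuture.orTimeout (JDK 9+). Note that orTimeout only fails the future; in the real service the cancellation still has to be propagated to the underlying work, as described above:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

public class TimeoutDemo {
    // Returns "done" if the task finishes within the timeout,
    // or "timed out" once orTimeout fails the future with TimeoutException.
    static String callWithTimeout(long taskMillis, long timeoutMillis) {
        CompletableFuture<String> task = CompletableFuture.supplyAsync(() -> {
            try { Thread.sleep(taskMillis); } catch (InterruptedException e) { /* ignored in sketch */ }
            return "done";
        });
        return task.orTimeout(timeoutMillis, TimeUnit.MILLISECONDS) // fail the future on timeout
                   .exceptionally(t -> "timed out")                 // map the failure to a fallback
                   .join();
    }

    public static void main(String[] args) {
        System.out.println(callWithTimeout(5, 1000));  // fast task completes normally
        System.out.println(callWithTimeout(1000, 50)); // slow task is timed out early
    }
}
```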

Flowchart after optimization:

[Figure: flowchart after optimization]

The flow above still has a few weak points; one of them surfaces as the first problem described near the end of this article.

With the new flowchart in hand, the code changes were purely an implementation of the diagram. To avoid introducing new bugs during the rewrite, I held to one principle: reuse existing code as much as possible.

Load testing

The load-testing tool was goperf2, provided by the QA group, and jprofiler was used to monitor the JVM process. jprofiler proved essential: its visual interface let me see the state of every thread in the JVM process, and tune parameters again based on that state.

The load-testing process is tedious, but watching the metrics improve is very rewarding.

The initial load-test results showed that the optimization approach was right. But no road is without its little surprises.

Problems found during load testing:

  1. During load testing, even the service's null response took particularly long. The null response runs no business logic; it returns immediately upon receiving a request.
    Cause: the selector is single-threaded and could not keep up.
  2. After the load test had run for a while, a large number of TIME_WAIT connections appeared on the server and new connections could not be assigned an address. A classic TCP interview question: either ports or the process's handle limit was exhausted.
  3. Also on the server, many TIME_WAIT connections: a lettuce connection-pool parameter problem that kept closing connections.
  4. Individual request latency was unstable: JVM class loading, connection-pool initialization, and thread-pool initialization issues.
  5. The connection pool's borrowObject took a long time: fixed by parameter tuning.
  6. Regular-expression performance problems in the original code logic
  7. java stream overhead
  8. HashMaps being allocated over and over
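The article does not show the offending code, but item 6 is typically the classic per-call recompilation problem. A hypothetical sketch of the fix, compiling the Pattern once:

```java
import java.util.regex.Pattern;

public class RegexReuse {
    // Compile once and reuse: Pattern is immutable and thread-safe,
    // while Matcher instances are cheap, per-call objects.
    private static final Pattern DIGITS = Pattern.compile("\\d+");

    // Recompiles the pattern on every call (String.matches does this too).
    static boolean hasDigitsSlow(String s) {
        return Pattern.compile("\\d+").matcher(s).find();
    }

    // Reuses the precompiled pattern; same result, no recompilation cost.
    static boolean hasDigitsFast(String s) {
        return DIGITS.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(hasDigitsFast("abc123"));
    }
}
```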

The final load-test results:

| Scenario | QPS | Avg latency (ms) | 90th | 95th | 99th | 99.9th |
| --- | --- | --- | --- | --- | --- | --- |
| Before optimization, fetching only the 45 real-time features | 2951 | 6.761 | 7 | 10 | 50 | 55 |
| After optimization, fetching only the 45 real-time features | 11781 | 2.1 | 3 | 3 | 4 | 9 |
| After optimization, 45 real-time features, three processes | 17218 | 2.25 | 3 | 3 | 7 | 11 |
| Before optimization, prediction | 1774 | 14 | 19 | 35 | 61 | 68 |
| After optimization, prediction | 2702 | 9.2 | 11 | 12 | 14 | 17 |
| After optimization, prediction, three processes | 4705 | 8.5 | 13 | 14 | 17 | 24 |

Commands used during load testing

Check the TCP half-open and accept queues:

netstat -s | egrep "listen|LISTEN"
ss -lnt

Count TCP connections in each state:

netstat -n | grep <port> | awk '/^tcp/ {++S[$NF]} END {for(a in S) print a, S[a]}'

jstack thread count inconsistent with the number of threads the JVM process has opened

After the architectural optimization, the new version of the program finally went live. Upstream no longer saw timeouts, machine load stayed low, and the service could take far more traffic.

But one day, looking at the monitoring, I found the JVM process had actually opened more than 800 threads, far beyond the theoretical count of under 50. My god, how exciting, what a good problem to have!!!

I had not planned to write this part up, so I will just briefly describe the analysis and the cause.

Locating the cause

Verify whether the monitoring is wrong

Checking the process's thread count and thread states with the commands below matched the monitoring, proving the monitoring was correct.

top -H -p <pid>
ps -o nlwp <pid>
ls /proc/<pid>/task | wc -l
cat /proc/<pid>/status | grep Threads

Verifying with jstack

Reviewing the jstack documentation showed that jstack <pid> only outputs threads managed by the JVM, while jstack -m <pid> also outputs threads opened by .so libraries; however, jstack -m prints incomplete stack information.

Locating the code that opens the non-JVM threads

Printing the process stacks with gstack <pid> showed that, apart from the JVM threads, most thread stacks were identical and all included libomp.so. That library is a C/C++ multithreading runtime; see OpenMP for details. In other words, some code called by the program was opening a large number of extra threads through libomp.so.

Opening a thread necessarily goes through system library calls, so strace can trace which thread opens the libomp threads. During this, jstack is needed to print the JVM thread stacks; comparing thread ids then pinpoints the exact JVM thread responsible.

xgboost

The problem was traced to the model jar provided by the algorithm team. Browsing its code finally led to the root cause: xgboost. The relevant code is shown in the figure below.

[Figure: xgboost OpenMP-related code (/assets/img/omp1.png)]

Every JVM thread that calls the Predict method opens its own team of OpenMP threads, so the final total of xgboost threads matches count(jvm threads) × omp_get_num_threads.

Solution

There are two solutions:

  1. Set an environment variable before the program starts, e.g. OMP_THREAD_LIMIT=1 control.sh start

  2. Replace the official xgboost library with a pure-Java implementation of xgboost
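For option 1, the variable only needs to be visible to the launched process. A minimal illustration of the scoping (control.sh is the article's start script, shown here only as a comment):

```shell
# For the real service (control.sh is the article's start script):
#   OMP_THREAD_LIMIT=1 ./control.sh start
# The variable is scoped to the launched child process only:
OMP_THREAD_LIMIT=1 sh -c 'echo "limit inside child: $OMP_THREAD_LIMIT"'
echo "limit in parent: ${OMP_THREAD_LIMIT:-unset}"
```

This avoids capping OpenMP for every process on the machine, which an exported variable in a shared profile would do.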

This service also ran into an off-heap memory problem, but the process of locating it is enough material for another article.

Further optimization

So far, the optimization has met expectations. Further optimization is actually still possible, but the effort required is very large.

At its core, optimizing a program tests the programmer's grasp of operating-system fundamentals:

  1. Process/thread principles: the thread is the operating system's smallest unit of execution; truly internalize that sentence and optimizing any part of an architecture becomes easy
  2. IO principles: storage, buses, DMA
  3. Synchronization between threads ultimately shows up in how the threads execute

Thinking one step further: because multiple threads inevitably compete for resources, they generally synchronize through locks. Find the program's synchronization points and try to eliminate them, and the program's performance will improve greatly. Optimizing synchronization points is critical, but also very energy-consuming.
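As one concrete way to eliminate a synchronization point (an illustration, not code from the article): replace a single contended lock with a striped structure. The JDK's LongAdder does exactly this for counters:

```java
import java.util.concurrent.atomic.LongAdder;

public class ContentionDemo {
    // A plain synchronized counter funnels every thread through one lock,
    // a classic synchronization point. LongAdder stripes the count across
    // internal cells, so concurrent threads rarely contend.
    static long run(int threads, int incrementsPerThread) {
        LongAdder adder = new LongAdder();
        Thread[] ts = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            ts[i] = new Thread(() -> {
                for (int j = 0; j < incrementsPerThread; j++) adder.increment();
            });
            ts[i].start();
        }
        for (Thread t : ts) {
            try { t.join(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        }
        return adder.sum(); // exact total, computed after all writers finish
    }

    public static void main(String[] args) {
        System.out.println(run(4, 100_000));
    }
}
```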

Once a program is running, it is in essence a process that accepts input and produces output. Trace the data of a single request end to end and you can see, along its complete path, where synchronization occurs. A multi-threaded program uses thread pools as its logical execution units, and threads in the same pool generally run the same logic; so first analyze synchronization coarsely through the data relationships, then analyze the synchronization relationships between thread pools and among the threads within each pool.

A single process cannot avoid lock contention entirely, but running multiple processes reduces each lock's impact. This round of load testing also measured the performance of the multi-process network service; for multi-process network services, see this article.


References

  1. Performance optimization patterns
  2. [Book] Modern Operating Systems
  3. How jstack works
  4. OpenMP
  5. Getting started with OpenMP


Source: blog.csdn.net/jiangxiaoma111/article/details/104441347