Background
Our online service exposes several interfaces. Their business logic differs, but request processing is basically the same, and the logic itself is simple. Even so, the service had several problems:
- Single-machine QPS was very low, so many machines were needed
- The 99th-percentile latency was far above the theoretical value
- When the service hit its load ceiling, server load was still very low
With the company's deep pockets, none of this is really a problem: performance not good enough? Add machines until it is! But when upstream timeout complaints came in yet again, for the dignity of a programmer, this time I decided not to simply add machines but to properly optimize the program. Time for some dragon-slaying.
Technology stack:
- Language: Java
- Communication: Thrift, with THsHaServer as the server mode
- Redis client: Jedis
- Model: xgboost
- Runtime: Docker, 8 cores / 8 GB
- Cache cluster: speaks the Redis protocol, but its implementation differs from Redis
Optimization process
To improve a program's performance, first find its bottleneck, then optimize with a clear target.
Process Analysis
Flow chart before optimization
The business-logic flowchart is omitted; only the interactions between threads are shown. The part that assembles Redis requests actually uses two ForkJoin thread pools, but since their logic is very similar they are merged into one in the diagram. The two magenta steps in the flowchart are clearly the two places that hurt performance most.
Combined with the service's execution environment, the following problems can be identified:
Excessive number of threads
Because the operating system schedules threads preemptively, frequent thread context switches cause several problems:
- More time is spent executing kernel-mode instructions (reflected in the `cpu.sys` metric), wasting CPU execution time
- The user-mode instruction time allotted per unit time shrinks (reflected in `cpu.user`)
- Taken together, this not only raises machine load but also makes each individual task take longer
Two blocking points
Two steps in the flowchart block the thread executing them. The thrift-server thread blocks waiting for all Redis results; the ForkJoin threads block waiting for a single Redis response, because the network communication uses synchronous IO.
When a thread switches from running to blocked, a thread context switch occurs and the thread must then wait to be rescheduled. That is the cost at the operating-system level.
Suppose the service receives 40 requests simultaneously. From the diagram, the service can have at most 86 Redis requests in flight, but for 40 tasks to proceed concurrently it would need about 2000 Redis requests in flight at once. Raising the ForkJoin pool's maximum thread count to 2000 is obviously not feasible. In fact, the initial version of the service placed no limit at all on the ForkJoin thread count, so when load increased the thread count soared, server load alarms fired like crazy, and request latency shot up as well.
- Because blocked threads stall, the only way to increase concurrency is to open more threads, yet more threads in turn hurt the program's performance
- Because Redis requests are sent with synchronous IO, the number of Redis requests in flight at once is very limited. This not only drags down QPS badly but also inflates the 99th-percentile latency.
Tasks can wait indefinitely
The flowchart also shows that tasks have no timeout and can wait indefinitely, which leads to:
- If the client sets no timeout, a task may wait forever, possibly outliving the client.
- If the client cancels because of its own timeout, the task is still executed, which is pointless work that squeezes execution time away from other tasks and can even trigger an avalanche.
To sum up:
From the analysis above, the program's main bottleneck is the Redis query flow, followed by the blocking logic in the thrift-server thread.
Process optimization
Since the program's JVM GC metrics were within the normal range, and GC would need to be observed again after any architectural change anyway, we started with the architectural transformation.
Cache query flow
This flow is mostly network IO; at the business layer, optimization usually starts from two directions:
- Reduce the number of round trips, i.e., merge multiple requests into one network request. Because our company's cache cluster implementation differs from Redis, pipeline and mget bring limited gains, and can even be slower.
- Make the IO asynchronous. This both reduces the number of threads needed and increases the number of Redis requests in flight at once.
The Redis client in use was Jedis; we replaced it with Lettuce, an open-source asynchronous Redis client built on top of Netty.
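Lettuce's async API returns futures that implement `CompletionStage`, so results can be composed without a blocked thread per request. The sketch below shows the shape of the change with the actual client call stubbed out (`asyncGet` and the key names are stand-ins, not real Lettuce calls): fan out all requests at once and combine them with `allOf`.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;

public class AsyncFanOut {
    // Stand-in for an async client call (e.g. a Lettuce GET) that
    // returns a future instead of blocking the calling thread.
    static CompletableFuture<String> asyncGet(String key) {
        return CompletableFuture.supplyAsync(() -> "value-of-" + key);
    }

    // Fan out all keys at once; no thread blocks while waiting.
    static CompletableFuture<List<String>> getAll(List<String> keys) {
        List<CompletableFuture<String>> futures = keys.stream()
                .map(AsyncFanOut::asyncGet)
                .collect(Collectors.toList());
        return CompletableFuture.allOf(futures.toArray(new CompletableFuture[0]))
                .thenApply(v -> futures.stream()
                        .map(CompletableFuture::join) // all completed by now
                        .collect(Collectors.toList()));
    }

    public static void main(String[] args) {
        System.out.println(getAll(List.of("a", "b")).join());
        // prints [value-of-a, value-of-b]
    }
}
```

With this shape, the number of in-flight Redis requests is no longer bounded by the thread count, only by the connection and event-loop capacity.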
thrift-server
Use AsyncProcessor to make the server side asynchronous.
Adding a task timeout limit
Set timeout logic on the top-level CompletableFuture: when a task times out, cancel that CompletableFuture and propagate the cancellation down layer by layer.
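A minimal sketch of the deadline idea using only the JDK (Java 9+'s `orTimeout`); the "stuck Redis call" is simulated by a future that never completes:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TaskTimeout {
    // Returns "timeout" if the future fails to complete within the deadline.
    static String waitWithDeadline(CompletableFuture<String> task, long millis) {
        // orTimeout (Java 9+) completes the future exceptionally with a
        // TimeoutException once the deadline passes; every downstream stage
        // built on it then completes exceptionally too, which is how the
        // cancellation propagates along the chain.
        task.orTimeout(millis, TimeUnit.MILLISECONDS);
        try {
            return task.join();
        } catch (CompletionException e) {
            return e.getCause() instanceof TimeoutException ? "timeout" : "error";
        }
    }

    public static void main(String[] args) {
        CompletableFuture<String> stuck = new CompletableFuture<>(); // never completes
        System.out.println(waitWithDeadline(stuck, 50)); // prints "timeout"
    }
}
```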
The flowchart after optimization:
The optimized flow still has a few weak points; they come up again in the first problem of the final section.
With the new flowchart in hand, the code change was purely a matter of implementing the diagram. To avoid introducing new bugs during the rewrite, I held to one principle: reuse the existing code as much as possible.
Load testing
The load-testing tool was goperf2, provided by the QA team, and I used jprofiler to monitor the JVM process. jprofiler was very important here: its visual interface shows the execution state of every thread in the JVM process, and I tuned parameters based on that state.
The testing process is very boring, but watching the metrics climb is very rewarding.
Preliminary results showed that the optimization ideas above were correct. But no road is without its little surprises.
Problems found during load testing:
- Under load, even an empty response was extremely slow ("empty response" meaning no business logic at all: return immediately on receiving a request). Cause: the selector is single-threaded and couldn't keep up.
- After the test ran for a while, a large number of TIME_WAIT connections appeared and the client reported "Cannot assign requested address". A classic TCP interview question: the local port range or the process's file-handle limit was exhausted.
- A large number of TIME_WAIT connections also appeared on the server. A Lettuce connection-pool parameter problem: connections were being closed constantly.
- Individual request latency was unstable: JVM class loading, connection-pool initialization, and thread-pool initialization issues.
- Connection-pool borrowObject took a long time: parameter tuning.
- Regular-expression performance problems in the original code logic
- Java streams were time-consuming
- HashMaps were being allocated non-stop
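On the regex item: a common cause is calling `String.matches()` on a hot path, which recompiles the pattern on every call. A minimal sketch of the fix (the pattern and key format here are hypothetical, for illustration only):

```java
import java.util.regex.Pattern;

public class RegexHotPath {
    // Compile once and reuse; String.matches() recompiles the
    // pattern on every call, which is expensive on a hot path.
    private static final Pattern FEATURE_KEY = Pattern.compile("[a-z]+_\\d+");

    static boolean isFeatureKey(String s) {
        return FEATURE_KEY.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isFeatureKey("ctr_42"));  // prints true
        System.out.println(isFeatureKey("bad key")); // prints false
    }
}
```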
The final load-test results:
| Scenario | QPS | Avg latency (ms) | 90th (ms) | 95th (ms) | 99th (ms) | 99.9th (ms) |
|---|---|---|---|---|---|---|
| Before optimization, fetch 45 real-time features only | 2951 | 6.761 | 7 | 10 | 50 | 55 |
| After optimization, fetch 45 real-time features only | 11781 | 2.1 | 3 | 3 | 4 | 9 |
| After optimization, fetch 45 real-time features only, three processes | 17218 | 2.25 | 3 | 3 | 7 | 11 |
| Before optimization, prediction | 1774 | 14 | 19 | 35 | 61 | 68 |
| After optimization, prediction | 2702 | 9.2 | 11 | 12 | 14 | 17 |
| After optimization, prediction, three processes | 4705 | 8.5 | 13 | 14 | 17 | 24 |
Commands used during load testing
- Check the TCP half-open queue and accept queue: `netstat -s | egrep "listen|LISTEN"; ss -lnt`
- Count TCP connections by state: `netstat -n | grep <port> | awk '/^tcp/ {++S[$NF]} END {for(a in S) print a, S[a]}'`
Jstack's thread count disagrees with the number of threads the JVM process started
After the architectural optimization and reimplementation, the new version finally went live. Upstream no longer saw timeouts, machine load stayed low, and the service could take on much more traffic.
But one day I looked at the monitoring and found the JVM process had actually opened more than 800 threads, far beyond the theoretical figure of under 50. My God, how exciting, what a good problem!!!
I didn't want to write a separate article about it, so here is just a brief account of the analysis and the cause.
Locating the problem
Verify whether the monitoring is wrong
The following commands show the process's thread count and states; the results agreed with the monitoring, proving the monitoring itself was fine.
top -H -p <pid>
ps -o nlwp <pid>
ls /proc/<pid>/task | wc -l
grep Threads /proc/<pid>/status
Verifying jstack
Reviewing the jstack documentation showed that plain `jstack <pid>` outputs only the threads managed by the JVM; `jstack -m <pid>` also outputs threads opened by `.so` libraries, but the stack information `jstack -m` prints is incomplete.
Locating the code that opens the non-JVM threads
Printing the process stacks with `gstack <pid>` showed that, apart from the JVM threads, most of the remaining threads had nearly identical stacks, all containing `libomp.so`. That is a C/C++ multithreading runtime library; see OpenMP. In other words, some code called by the program had opened a large number of extra threads through `libomp.so`.
Opening a thread must go through the system library, so `strace` (for example `strace -f -e trace=clone -p <pid>`) can trace which thread creates the libomp threads. While doing this, print the JVM thread stacks with `jstack`; comparing thread ids pinpoints the exact JVM thread responsible.
xgboost
The problem was located in the model jar provided by the algorithm team. Browsing its code finally led to the root cause: xgboost. The relevant code is shown in the figure below.
(figure: the OpenMP-related code in xgboost; original image /assets/img/omp1.png is missing)
Each JVM thread that calls the `Predict` method opens its own team of OpenMP threads, so the final total thread count indeed matched count(jvm threads) * omp_get_num_threads().
Solutions
There are two solutions:
- Set an environment variable before the program starts, for example `OMP_THREAD_LIMIT=1 control.sh start`
- Replace the official xgboost library with a pure-Java implementation of xgboost prediction
This service also ran into an off-heap memory problem, but that locating process is enough material for another article.
Further optimization
So far, the optimization had met expectations. Further gains are still possible, but the effort required is much larger.
Program optimization is, at its core, a test of the programmer's grasp of operating-system fundamentals:
- Process/thread fundamentals: the thread is the operating system's smallest unit of execution. Truly understand that sentence, and optimizing any part of an architecture becomes easy.
- IO fundamentals: storage, bus, DMA
- Synchronization between threads ultimately shows up in how the threads execute
Going one step further: interaction between multiple threads inevitably involves competition for resources, which is usually synchronized with locks. Find the program's synchronization points and try to eliminate them, and performance will improve substantially. Optimizing synchronization points is critical but also very labor-intensive.
A running program is ultimately a process that accepts input and produces output. Trace a single request through it and you can see, over the request's complete lifecycle, where synchronization happens. A multi-threaded program uses thread pools as its execution units, and the threads within one pool generally run the same logic, so first analyze the data-synchronization relationships between pools at a coarse grain, then analyze the synchronization among the threads of a single pool.
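One concrete way to eliminate a synchronization point, sketched with JDK classes only (the counter scenario is illustrative, not from the service's code): replace a single lock-protected counter with `LongAdder`, which spreads increments across internal cells so writer threads rarely collide.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.LongAdder;

public class ContentionDemo {
    // LongAdder spreads increments across internal cells, so under heavy
    // write contention threads rarely collide on one memory location,
    // unlike a synchronized counter or a single hot AtomicLong.
    static long countWith(int threads, int incrementsPerThread) {
        LongAdder counter = new LongAdder();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.submit(() -> {
                for (int i = 0; i < incrementsPerThread; i++) counter.increment();
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return counter.sum();
    }

    public static void main(String[] args) {
        System.out.println(countWith(4, 100_000)); // prints 400000
    }
}
```

The trade-off is that `sum()` is not a single atomic snapshot, which is fine for metrics-style counters but not for values read on every request.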
A single process cannot avoid lock contention entirely, but running multiple processes reduces a lock's impact. During this round of load testing I also measured the performance of multi-process network services; refer to this article.
Reference
- Performance optimization patterns
- [Book] Modern Operating Systems
- How jstack works
- OpenMP
- Getting started with OpenMP