Troubleshooting and fixing low CPU usage with a high load average during stress testing

1. Background description
We recently added request-parameter decryption and response encryption to this service, after which it could no longer pass the stress test: the interface response time exceeded the company's requirement. This post records the troubleshooting.

2. Troubleshooting process
Method 1:
Print time-consuming logs
1. We suspected that the encryption and decryption algorithm (we use RSA+AES) was too time-consuming, so we first added logs before and after the relevant code.
2. Looking at the logs, most of the time was actually spent converting the input stream to a String. Ordinary blocking IO is the culprit: reads and writes block, and under high concurrency the service cannot keep up with responses.
This method is rather laborious: when many places need to be checked, a lot of logging has to be added, which is inconvenient. The second method is recommended.
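A minimal sketch of this "log before and after" approach, assuming SLF4J-style logging (the same log field used in the code later in this post); readBody(...) and decrypt(...) are hypothetical stand-ins for the stream-to-String conversion and the RSA+AES decryption, not the project's real method names:

private String readAndDecryptWithTiming(ServletInputStream inputStream) {
    long start = System.currentTimeMillis();
    String body = readBody(inputStream);   // hypothetical stream -> String helper
    log.info("stream -> String cost {} ms", System.currentTimeMillis() - start);

    start = System.currentTimeMillis();
    String plain = decrypt(body);          // hypothetical RSA+AES decryption helper
    log.info("decrypt cost {} ms", System.currentTimeMillis() - start);
    return plain;
}

It was exactly this kind of log that pointed at the stream-to-String conversion rather than the cryptography itself.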

Method 2:
1. Use the top command to check CPU and memory usage. Our machine is 6C6G (6 cores, 6 GB of memory). The CPU was not saturated, but the load average was very high, which means many tasks were queuing. For details:
load average is not the CPU usage percentage; it is a statistic, over a period of time, of the number of processes running on the CPU plus those waiting for it, i.e. the length of the run queue. On Linux it also counts tasks in uninterruptible sleep (usually blocked on IO), which is why heavy blocking IO can drive the load average up even when the CPU is not busy. On our 6-core machine, a load average well above 6 means tasks are queuing.
2. So the next step is to dump the thread information for analysis, using
jstack -l pid > jstack.log, for example jstack -l 14 > jstack.log

3. Then use
top -Hp pid to find the threads that consume the most CPU, for example top -Hp 14

4. Use
printf "%x\n" tid to convert the ID of the most CPU-consuming thread (found in step 3) to hexadecimal, because jstack prints native thread IDs (nid) in hex. For example, printf "%x\n" 1234 prints 4d2.

5. Search jstack.log for the resulting hexadecimal value to find the corresponding thread's stack information.
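The five steps above are a pure command-line workflow. For completeness, roughly the same "find the hottest thread" idea can also be done from inside the JVM with ThreadMXBean; this is only an illustrative alternative, not what we ran during the stress test, and note that the IDs it returns are JVM-level thread IDs, not the native nid (hex) that top -Hp and jstack show:

ThreadMXBean mxBean = ManagementFactory.getThreadMXBean();
long busiestId = -1;
long maxCpuNanos = -1;
for (long id : mxBean.getAllThreadIds()) {
    long cpuNanos = mxBean.getThreadCpuTime(id);   // nanoseconds, -1 if unsupported or disabled
    if (cpuNanos > maxCpuNanos) {
        maxCpuNanos = cpuNanos;
        busiestId = id;
    }
}
// print the hottest thread with up to 20 stack frames
System.out.println(mxBean.getThreadInfo(busiestId, 20));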

Combining this with the actual situation of our project, two things stood out from the thread dumps: 1. a large number of IO operations; 2. the OneAPM full-link probes deployed on the service. So in the end we removed the probes and optimized the IO.

3. Repair plan
1. Remove the OneAPM probe.
2. First wrap the stream with a buffered wrapper stream (a rough sketch follows this list); the improvement was only moderate.
3. Then operate on the stream with NIO, which gave a further improvement over the buffered version. The final NIO stream-to-String conversion is shown below.
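A rough sketch of the buffered-stream attempt from step 2, assuming the encrypted body is ASCII-safe (e.g. Base64) so charset handling can stay simple; the method name is illustrative, not the project's real code:

private String getStringWithBuffer(ServletInputStream inputStream) {
    StringBuilder builder = new StringBuilder();
    try (BufferedReader reader = new BufferedReader(
            new InputStreamReader(inputStream, StandardCharsets.UTF_8))) {
        char[] buf = new char[1024];
        int len;
        while ((len = reader.read(buf)) != -1) {
            builder.append(buf, 0, len);
        }
    } catch (Exception e) {
        log.error("getStringWithBuffer error", e);
    }
    return builder.toString();
}

In our tests this only helped a little; the NIO version we ended up keeping is shown next.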

private String getStringFromStream(ServletInputStream inputStream) {
    StringBuilder builder = new StringBuilder();
    try (ReadableByteChannel readableByteChannel = Channels.newChannel(inputStream)) {
        ByteBuffer buffer = ByteBuffer.allocate(1024);
        while (readableByteChannel.read(buffer) != -1) {
            // switch the buffer from write mode to read mode
            buffer.flip();
            while (buffer.hasRemaining()) {
                // the request body is already encrypted, so multi-byte
                // (Chinese) characters are not a concern here
                builder.append((char) buffer.get());
            }
            // clear the buffer for the next read
            buffer.clear();
        }
    } catch (Exception e) {
        log.error("getStringFromStream error: " + e);
    }
    return builder.toString();
}
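For context, this is roughly how such a method would be called, e.g. from a servlet filter that reads the encrypted body before decryption; the filter shown here is a hypothetical example, not the project's actual class:

@Override
public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
        throws IOException, ServletException {
    HttpServletRequest httpRequest = (HttpServletRequest) request;
    String encryptedBody = getStringFromStream(httpRequest.getInputStream());
    // ... decrypt encryptedBody with RSA+AES and wrap the request before continuing
    chain.doFilter(request, response);
}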

Then we retested and found that the load average had dropped and the stress test passed.
4. Conclusion
For daily troubleshooting you need to be proficient with the JDK tools such as jstack, jmap, and jstat; they help you quickly locate and analyze the cause of a problem.

Origin blog.csdn.net/u010857795/article/details/112788775