"The following are the problems I encountered, and some simple troubleshooting ideas. If there is something wrong, please leave a message to discuss. If you have encountered the OOM problem caused by InMemoryReporterMetrics and it has been resolved, you can ignore this article. If you are right CPU100% and online application OOM troubleshooting ideas are not clear, you can browse this article.
Problem symptoms
[Alarm Notification-Application Abnormal Alarm]
A quick look at the alert message shows: connection refused. In short, the service is down; please ignore the redacted (mosaicked) parts of the screenshot.
Environmental description
Spring Cloud Finchley release.
By default, the project uses spring-cloud-sleuth-zipkin, which pulls in zipkin-reporter. Inspecting the dependency tree showed that the zipkin-reporter version is 2.7.3.
<dependency>
    <groupId>org.springframework.cloud</groupId>
    <artifactId>spring-cloud-sleuth-zipkin</artifactId>
</dependency>
Version: 2.0.0.RELEASE
Troubleshooting
The alert information tells you which service on which server has the problem. First, log in to the server to check.
1. Check the service status and verify whether the health check URL is ok
"This step can be ignored/skip. It is related to the health check of the actual company and is not universal.
① Check whether the service process exists:
ps -ef | grep <service-name>  or  ps aux | grep <service-name>
② Check whether the service's health check address is correct, verifying the IP and port.
Is the URL configured for the alert check wrong? Generally this is not the cause.
③ Verify the health check address.
The health check address looks like http://192.168.1.110:20606/serviceCheck; confirm the IP and port are correct.
# Service healthy: returns a result
curl http://192.168.1.110:20606/serviceCheck
{"appName":"test-app","status":"UP"}
# Service abnormal: the service is down
curl http://192.168.1.110:20606/serviceCheck
curl: (7) couldn't connect to host
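The same check can be scripted. Below is a minimal Java sketch (the class and method names are mine, and the host/port/path are just the examples above); it treats any IOException, which is what a refused connection raises, as "down":

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

public class HealthCheck {
    // Returns true if the health URL answers with HTTP 200, false otherwise
    // (connection refused, timeout, or any other I/O failure).
    static boolean isUp(String host, int port, String path) {
        try {
            HttpURLConnection conn = (HttpURLConnection)
                    new URL("http://" + host + ":" + port + path).openConnection();
            conn.setConnectTimeout(2000);
            conn.setReadTimeout(2000);
            return conn.getResponseCode() == 200;
        } catch (IOException e) { // "connection refused" lands here
            return false;
        }
    }

    public static void main(String[] args) {
        // Nothing should be listening on port 1, so this reports "down".
        System.out.println(isUp("127.0.0.1", 1, "/serviceCheck"));
    }
}
```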
2. View service logs
Check whether the service log is still being written and whether requests are coming in. Checking the log revealed that the service had thrown an OOM error.
OOM error
Tips: java.lang.OutOfMemoryError: GC overhead limit exceeded
Oracle's official documentation gives the cause and action for this error: Exception in thread thread_name: java.lang.OutOfMemoryError: GC Overhead limit exceeded. Cause: The detail message "GC overhead limit exceeded" indicates that the garbage collector is running all the time and the Java program is making very slow progress. After a garbage collection, if the Java process is spending more than approximately 98% of its time doing garbage collection and if it is recovering less than 2% of the heap and has been doing so for the last 5 (compile time constant) consecutive garbage collections, then a java.lang.OutOfMemoryError is thrown. This exception is typically thrown because the amount of live data barely fits into the Java heap, leaving little free space for new allocations. Action: Increase the heap size. The java.lang.OutOfMemoryError exception for GC Overhead limit exceeded can be turned off with the command line flag -XX:-UseGCOverheadLimit.
In short: the JVM is spending roughly 98% of its time on garbage collection but recovering only about 2% of the heap, and after at least 5 consecutive garbage collections like this, the JVM throws java.lang.OutOfMemoryError: GC overhead limit exceeded.
Source of the above tips: java.lang.OutOfMemoryError GC overhead limit exceeded cause analysis and solutions
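The rule Oracle describes can be paraphrased as a simple predicate. This is only a simplified illustration of the documented thresholds, not HotSpot's actual implementation:

```java
public class GcOverheadRule {
    // Simplified paraphrase of the documented rule: more than 98% of time
    // spent in GC, less than 2% of the heap recovered, sustained for at
    // least 5 consecutive collections.
    static boolean overheadLimitExceeded(double gcTimeFraction,
                                         double heapRecoveredFraction,
                                         int consecutiveBadCollections) {
        return gcTimeFraction > 0.98
                && heapRecoveredFraction < 0.02
                && consecutiveBadCollections >= 5;
    }

    public static void main(String[] args) {
        System.out.println(overheadLimitExceeded(0.99, 0.01, 6)); // would trigger the OOM
        System.out.println(overheadLimitExceeded(0.50, 0.30, 1)); // healthy JVM
    }
}
```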
3. Check the server resource usage
Use the top command to check the resource usage of each process on the system. It showed that process 11441 had CPU usage of around 300%, as in the screenshot below:
CPU maxed out (screenshot)
Then query the CPU usage of all threads in this process:
top -H -p <pid>; to save the output to a file: top -H -n 1 -p <pid> > /tmp/<pid>_top.txt
# top -H -p 11441
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
11447 test 20 0 4776m 1.6g 13m R 92.4 20.3 74:54.19 java
11444 test 20 0 4776m 1.6g 13m R 91.8 20.3 74:52.53 java
11445 test 20 0 4776m 1.6g 13m R 91.8 20.3 74:50.14 java
11446 test 20 0 4776m 1.6g 13m R 91.4 20.3 74:53.97 java
....
Among the threads under PID 11441, several occupy high CPU.
4. Save the stack data
1. Print a system load snapshot: top -b -n 2 > /tmp/top.txt and top -H -n 1 -p <pid> > /tmp/<pid>_top.txt
2. Print the process's thread list sorted by CPU: ps -mp <pid> -o THREAD,tid,time | sort -k2r > /tmp/<pid>_threads.txt
3. Check the number of TCP connections (preferably sample several times): lsof -p <pid> > /tmp/<pid>_lsof.txt; lsof -p <pid> > /tmp/<pid>_lsof2.txt
4. View thread information (preferably sample several times): jstack -l <pid> > /tmp/<pid>_jstack.txt (repeat for _jstack2.txt, _jstack3.txt)
5. View an overview of heap memory usage: jmap -heap <pid> > /tmp/<pid>_jmap_heap.txt
6. View statistics of objects in the heap: jmap -histo <pid> | head -n 100 > /tmp/<pid>_jmap_histo.txt
7. View GC statistics: jstat -gcutil <pid> > /tmp/<pid>_jstat_gc.txt
8. Produce a heap snapshot (heap dump): jmap -dump:format=b,file=/tmp/<pid>_jmap_dump.hprof <pid>
"All the data of the heap, the generated file is larger.
jmap -dump:live,format=b,file=/tmp/process number_live_jmap_dump.hprof process number
"Dump:live, this parameter means that we need to grab the memory objects that are currently in the life cycle, that is to say, the objects that the GC cannot collect, generally use this.
Get the snapshot data with the problem, and then restart the service.
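As an aside, the same live heap dump can also be triggered from inside the JVM via the HotSpotDiagnosticMXBean (a HotSpot-specific API; the output path here is just an example):

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import com.sun.management.HotSpotDiagnosticMXBean;

public class DumpHeap {
    // Writes an hprof snapshot to `path`; live=true keeps only reachable
    // objects, the in-process equivalent of `jmap -dump:live`.
    static void dumpTo(String path) throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        HotSpotDiagnosticMXBean bean = ManagementFactory.newPlatformMXBeanProxy(
                server, "com.sun.management:type=HotSpotDiagnostic",
                HotSpotDiagnosticMXBean.class);
        bean.dumpHeap(path, true);
    }

    public static void main(String[] args) throws Exception {
        // dumpHeap fails if the target file already exists
        dumpTo("/tmp/live_jmap_dump.hprof");
    }
}
```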
Problem analysis
According to the above operations, the GC information, thread stack, heap snapshot and other data of the service in question have been obtained. Let's analyze it to see where the problem is.
1. Analyze the threads with ~100% CPU usage
Convert the thread IDs
Analyze the thread stack file generated by jstack. Convert the thread IDs found above to hexadecimal, because the nid recorded in the jstack output is hexadecimal: 11447 → 0x2cb7, 11444 → 0x2cb4, 11445 → 0x2cb5, 11446 → 0x2cb6. The first conversion method:
$ printf "0x%x\n" 11447
0x2cb7
The second method: convert with any calculator and prefix the result with 0x.
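If you are scripting the lookup, the conversion is one line in Java as well (class name is mine):

```java
public class TidHex {
    // Convert a decimal thread id (from top) to the hex nid used in jstack output.
    static String toNid(long tid) {
        return "0x" + Long.toHexString(tid);
    }

    public static void main(String[] args) {
        System.out.println(toNid(11447)); // 0x2cb7
    }
}
```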
Find thread stack
$ cat 11441_jstack.txt | grep "GC task thread"
"GC task thread#0 (ParallelGC)" os_prio=0 tid=0x00007f971401e000 nid=0x2cb4 runnable
"GC task thread#1 (ParallelGC)" os_prio=0 tid=0x00007f9714020000 nid=0x2cb5 runnable
"GC task thread#2 (ParallelGC)" os_prio=0 tid=0x00007f9714022000 nid=0x2cb6 runnable
"GC task thread#3 (ParallelGC)" os_prio=0 tid=0x00007f9714023800 nid=0x2cb7 runnable
These high-CPU threads are all GC threads: the process is spending its CPU doing garbage collection.
2. Analyze the generated GC file
S0 S1 E O M CCS YGC YGCT FGC FGCT GCT
0.00 0.00 100.00 99.94 90.56 87.86 875 9.307 3223 5313.139 5322.446
- S0: Survivor space 0 utilization (%)
- S1: Survivor space 1 utilization (%)
- E: Eden space utilization (%)
- O: Old generation utilization (%)
- M: Metaspace utilization (%)
- CCS: Compressed class space utilization (%)
- YGC: Number of young-generation GC events
- YGCT: Young-generation GC time (s)
- FGC: Number of full GC events
- FGCT: Full GC time (s)
- GCT: Total GC time (s)
FGC is very frequent.
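The jstat line above already tells the story: the old generation is 99.94% full, and 3223 full GCs have consumed 5313 seconds. A quick sanity check of the average full-GC pause, using those two values:

```java
public class GcStats {
    // Average full-GC pause in seconds = FGCT / FGC
    static double avgFullGcSeconds(double fgct, long fgc) {
        return fgct / fgc;
    }

    public static void main(String[] args) {
        // Values from the jstat sample above: FGC=3223, FGCT=5313.139
        System.out.printf("%.2f%n", avgFullGcSeconds(5313.139, 3223)); // 1.65
    }
}
```

Roughly 1.65 seconds per full GC, thousands of times over, which matches the "GC overhead limit exceeded" picture.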
3. Analyze the generated heap snapshot
Use the Eclipse Memory Analyzer tool. Download link: https://www.eclipse.org/mat/downloads.php
Results of the analysis:
Looking at the contents of the large retained objects:
The rough cause of the problem: OOM caused by InMemoryReporterMetrics. zipkin2.reporter.InMemoryReporterMetrics @ 0xc1aeaea8 — Shallow Size: 24 B, Retained Size: 925.9 MB.
You can also use Java Memory Dump (https://www.perfma.com/docs/memory/memory-start) for the analysis (screenshot below); it is not as powerful as MAT and some features are paid.
4. Reason analysis and verification
Since the problem pointed at zipkin, I checked the zipkin configuration of the problematic service: it was identical to the other services.
Then I compared the zipkin jar packages and found that the problematic service depends on a lower version of zipkin-reporter:
the problematic service uses zipkin-reporter-2.7.3.jar,
while the other services use zipkin-reporter-2.8.4.jar.
After upgrading the dependency in the problematic service and verifying in the test environment, the heap snapshot no longer showed the problem.
Reason exploration
Searching zipkin-reporter's GitHub for related information (https://github.com/openzipkin/zipkin-reporter-java/issues?q=InMemoryReporterMetrics) leads to this issue: https://github.com/openzipkin/zipkin-reporter-java/issues/139
The fix and its verification code: https://github.com/openzipkin/zipkin-reporter-java/pull/119/files. Comparing the code between the two versions:
Simple DEMO verification:
// Code before the fix:
private final ConcurrentHashMap<Throwable, AtomicLong> messagesDropped = new ConcurrentHashMap<Throwable, AtomicLong>();
// Code after the fix:
private final ConcurrentHashMap<Class<? extends Throwable>, AtomicLong> messagesDropped = new ConcurrentHashMap<>();
The fix replaces the map key type Throwable with Class<? extends Throwable>, so dropped-message counts are grouped by exception class instead of accumulating one entry per exception instance.
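Why the original key type leaks can be shown with a small standalone demo (class and method names are mine, not zipkin's). Throwable does not override equals/hashCode, so every new exception instance is a distinct map key:

```java
import java.net.ConnectException;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

public class DroppedMessagesLeakDemo {
    // Before the fix: keyed by the Throwable instance itself.
    static final ConcurrentHashMap<Throwable, AtomicLong> byInstance =
            new ConcurrentHashMap<>();
    // After the fix: keyed by the Throwable's class.
    static final ConcurrentHashMap<Class<? extends Throwable>, AtomicLong> byClass =
            new ConcurrentHashMap<>();

    static void record(Throwable t) {
        // Every new exception instance is a distinct key here,
        // so this map grows without bound.
        byInstance.computeIfAbsent(t, k -> new AtomicLong()).incrementAndGet();
        // All ConnectExceptions share one key here, so this map stays bounded.
        byClass.computeIfAbsent(t.getClass(), k -> new AtomicLong()).incrementAndGet();
    }

    public static void main(String[] args) {
        for (int i = 0; i < 10_000; i++) {
            record(new ConnectException("Connection refused"));
        }
        System.out.println(byInstance.size()); // 10000 entries: the leak
        System.out.println(byClass.size());    // 1 entry
    }
}
```

With an endpoint repeatedly failing (e.g. zipkin unreachable, connection refused), the pre-fix map accumulates one entry per failure, which is exactly the retained-size pattern seen in the MAT analysis above.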
Simple verification:
Solution
Upgrade the zipkin-reporter version. With the following dependency configuration, the imported zipkin-reporter version is 2.8.4.
<!-- zipkin dependency -->
<dependency>
<groupId>io.zipkin.brave</groupId>
<artifactId>brave</artifactId>
<version>5.6.4</version>
</dependency>
Tips: add the following JVM parameters so that the JVM writes a heap snapshot automatically when an OutOfMemoryError occurs.
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=path/filename.hprof