Experience the troubleshooting and resolution process of an online CPU 100% and application OOM

"The following are the problems I encountered, and some simple troubleshooting ideas. If there is something wrong, please leave a message to discuss. If you have encountered the OOM problem caused by InMemoryReporterMetrics and it has been resolved, you can ignore this article. If you are right CPU100% and online application OOM troubleshooting ideas are not clear, you can browse this article.

Problem phenomenon

[Alarm Notification-Application Abnormal Alarm]

[Image: alarm notification screenshot]

In short, the alert message says connection refused: something is wrong with the service. Please ignore the mosaic in the screenshot.

Environment description

Spring Cloud F release (Finchley).

The project uses spring-cloud-sleuth-zipkin, which by default pulls in zipkin-reporter. Analyzing the dependencies shows that the zipkin-reporter version is 2.7.3.

<dependency>
 <groupId>org.springframework.cloud</groupId>
 <artifactId>spring-cloud-sleuth-zipkin</artifactId>
</dependency>
  Version: 2.0.0.RELEASE

[Image: dependency version screenshot]

Troubleshooting

The alarm tells you which service on which server has the problem, so first log in to that server to check.

1. Check the service status and verify whether the health check URL is ok

"This step can be ignored/skip. It is related to the health check of the actual company and is not universal.

①Check whether the service process exists.

"Ps -ef | grep service name ps -aux | grep service name

②Check whether the service's health-check address responds normally, and whether the IP and port are correct.

"Is the url configured for the alert service check wrong? Generally, this will not cause a problem


③Verify the health check address

"This health check address is like: http://192.168.1.110:20606/serviceCheck to check whether the IP and Port are correct.

# Normal result when the service is healthy
curl http://192.168.1.110:20606/serviceCheck
{"appName":"test-app","status":"UP"}
# Result when the service is abnormal / has died
curl http://192.168.1.110:20606/serviceCheck
curl: (7) couldn't connect to host

2. View service logs

Check whether the service is still writing logs and whether requests are coming in. Looking at the log, the service has thrown an OOM.

 

 

OOM error

Tip: java.lang.OutOfMemoryError: GC overhead limit exceeded

Oracle's documentation gives the cause of this error and how to deal with it: Exception in thread thread_name: java.lang.OutOfMemoryError: GC Overhead limit exceeded. Cause: The detail message "GC overhead limit exceeded" indicates that the garbage collector is running all the time and the Java program is making very slow progress. After a garbage collection, if the Java process is spending more than approximately 98% of its time doing garbage collection, and if it is recovering less than 2% of the heap and has been doing so for the last 5 (compile time constant) consecutive garbage collections, then a java.lang.OutOfMemoryError is thrown. This exception is typically thrown because the amount of live data barely fits into the Java heap, leaving little free space for new allocations. Action: Increase the heap size. The java.lang.OutOfMemoryError exception for GC Overhead limit exceeded can be turned off with the command line flag -XX:-UseGCOverheadLimit.

In short: the JVM is spending roughly 98% of its time on garbage collection but recovering only about 2% of the heap, and this has happened for at least 5 consecutive collections, so the JVM throws java.lang.OutOfMemoryError: GC overhead limit exceeded.
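To make the error concrete, here is a minimal, illustrative sketch (not taken from the service in question) that typically reproduces it when run with a small heap; the class name and heap size are just for the demo:

import java.util.HashMap;
import java.util.Map;

// Illustrative demo only: run with a small heap and the parallel collector,
// e.g. java -Xmx32m -XX:+UseParallelGC GcOverheadDemo
// The map keeps every entry reachable, so the collector runs almost
// continuously while reclaiming very little, which typically ends in
// "java.lang.OutOfMemoryError: GC overhead limit exceeded"
// (depending on timing it may instead report "Java heap space").
public class GcOverheadDemo {
    public static void main(String[] args) {
        Map<Integer, String> live = new HashMap<>();
        int i = 0;
        while (true) {
            live.put(i, "value-" + i); // every entry stays reachable
            i++;
        }
    }
}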


Source of the above tips: java.lang.OutOfMemoryError GC overhead limit exceeded cause analysis and solutions

 

3. Check the server resource usage

Use the top command to check the resource usage of each process on the system. There is a process with PID 11441 whose CPU usage reaches about 300%, as shown in the following screenshot:

[Image: top output, CPU maxed out]

Then query the CPU usage of all threads in this process:

top -H -p <pid>; to save the output to a file: top -H -n 1 -p <pid> > /tmp/<pid>_top.txt

# top -H -p 11441
PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
11447 test    20   0 4776m 1.6g  13m R 92.4 20.3  74:54.19 java
11444 test    20   0 4776m 1.6g  13m R 91.8 20.3  74:52.53 java
11445 test    20   0 4776m 1.6g  13m R 91.8 20.3  74:50.14 java
11446 test    20   0 4776m 1.6g  13m R 91.4 20.3  74:53.97 java
....

Looking at the threads under PID 11441, several of them occupy a high share of CPU.

 

4. Save the stack data

1. Print a snapshot of the system load: top -b -n 2 > /tmp/top.txt and top -H -n 1 -p <pid> > /tmp/<pid>_top.txt

2. Print the process's thread list sorted by CPU usage (descending): ps -mp <pid> -o THREAD,tid,time | sort -k2r > /tmp/<pid>_threads.txt

3. Check the TCP connections (preferably sample more than once): lsof -p <pid> > /tmp/<pid>_lsof.txt and again lsof -p <pid> > /tmp/<pid>_lsof2.txt

4. Dump the thread stacks (preferably sample more than once): jstack -l <pid> > /tmp/<pid>_jstack.txt, jstack -l <pid> > /tmp/<pid>_jstack2.txt, jstack -l <pid> > /tmp/<pid>_jstack3.txt

5. Check the heap memory usage overview: jmap -heap <pid> > /tmp/<pid>_jmap_heap.txt

6. Check the object statistics in the heap: jmap -histo <pid> | head -n 100 > /tmp/<pid>_jmap_histo.txt

7. Check the GC statistics: jstat -gcutil <pid> > /tmp/<pid>_jstat_gc.txt

8. Generate a heap dump: jmap -dump:format=b,file=/tmp/<pid>_jmap_dump.hprof <pid>

"All the data of the heap, the generated file is larger.

jmap -dump:live,format=b,file=/tmp/<pid>_live_jmap_dump.hprof <pid>

"Dump:live, this parameter means that we need to grab the memory objects that are currently in the life cycle, that is to say, the objects that the GC cannot collect, generally use this.

Once the snapshot data capturing the problem has been collected, restart the service.

Problem analysis

With the steps above, we have the GC statistics, thread stacks, heap snapshot and other data from the problematic service. Now let's analyze them to find where the problem is.

1. Analyze the threads pegging the CPU at 100%

 

Convert the thread IDs

Analyze the thread stack file generated by jstack.

Convert the thread IDs above to hexadecimal (the nid values recorded in the jstack output are hexadecimal): 11447 → 0x2cb7, 11444 → 0x2cb4, 11445 → 0x2cb5, 11446 → 0x2cb6. The first conversion method:

$ printf "0x%x" 11447
0x2cb7

The second conversion method: convert the decimal value with any tool and prefix the result with 0x.
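If you prefer to do the conversion in code, the same mapping can be produced with a few lines of Java (a small sketch; the thread IDs are the ones from the top output above):

// Convert the decimal thread IDs reported by top into the hexadecimal
// nid values that appear in the jstack output.
public class TidToNid {
    public static void main(String[] args) {
        int[] tids = {11447, 11444, 11445, 11446};
        for (int tid : tids) {
            System.out.printf("%d -> 0x%x%n", tid, tid); // e.g. 11447 -> 0x2cb7
        }
    }
}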

[Image: hexadecimal conversion]

Find the thread stacks

$ cat 11441_jstack.txt | grep "GC task thread"
"GC task thread#0 (ParallelGC)" os_prio=0 tid=0x00007f971401e000 nid=0x2cb4 runnable
"GC task thread#1 (ParallelGC)" os_prio=0 tid=0x00007f9714020000 nid=0x2cb5 runnable
"GC task thread#2 (ParallelGC)" os_prio=0 tid=0x00007f9714022000 nid=0x2cb6 runnable
"GC task thread#3 (ParallelGC)" os_prio=0 tid=0x00007f9714023800 nid=0x2cb7 runnable

These threads are all GC worker threads, i.e. the CPU is being burned by garbage collection.

2. Analyze the generated GC file

  S0     S1     E      O      M     CCS    YGC     YGCT    FGC    FGCT     GCT   
  0.00   0.00 100.00  99.94  90.56  87.86    875    9.307  3223 5313.139 5322.446
  • S0: Survivor space 0 utilization (%)
  • S1: Survivor space 1 utilization (%)
  • E: Eden space utilization (%)
  • O: Old generation utilization (%)
  • M: Metaspace utilization (%)
  • CCS: Compressed class space utilization (%)
  • YGC: Number of young generation GC events
  • YGCT: Young generation GC time (s)
  • FGC: Number of full GC events
  • FGCT: Full GC time (s)
  • GCT: Total GC time (s)

Full GCs are very frequent.
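Reading the numbers: the 3223 full GCs account for 5313.139 s of the 5322.446 s total GC time (about 99.8%), roughly 1.6 s per full GC on average, while the 875 young collections add only about 9.3 s. With the old generation at 99.94% and Eden at 100%, the collector is running constantly and reclaiming almost nothing, which matches the GC overhead limit exceeded error above.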

3. Analyze the generated heap snapshot

Use the Eclipse Memory Analyzer tool. Download link: https://www.eclipse.org/mat/downloads.php

Results of the analysis:

[Images: MAT analysis results]

Looking at the contents of the accumulated large object:

[Image: contents of the accumulated large object]

The rough cause of the problem: the OOM is caused by InMemoryReporterMetrics. zipkin2.reporter.InMemoryReporterMetrics @ 0xc1aeaea8, Shallow Size: 24 B, Retained Size: 925.9 MB.

You can also use Java Memory Dump (https://www.perfma.com/docs/memory/memory-start) for the analysis, as in the screenshot below; it is not as powerful as MAT, and some features are paid.

[Image: Java Memory Dump analysis screenshot]

4. Cause analysis and verification

Given this, I checked the zipkin configuration of the problematic service and found it is no different from that of the other services; the configuration is the same.

Then I compared the zipkin jar packages and found that the problematic service depends on a lower version of zipkin-reporter.

The problematic service depends on zipkin-reporter-2.7.3.jar, while the other services depend on zipkin-reporter-2.8.4.jar.

[Image: dependency jar versions]

After upgrading the dependency version in the problematic service and verifying in the test environment, the heap snapshot no longer shows this problem.

Root cause exploration

Searching the zipkin-reporter GitHub repository for related information (https://github.com/openzipkin/zipkin-reporter-java/issues?q=InMemoryReporterMetrics) turns up the following issue: https://github.com/openzipkin/zipkin-reporter-java/issues/139

[Image: the related GitHub issue]

The fix and its verification code: https://github.com/openzipkin/zipkin-reporter-java/pull/119/files. Comparing the differences between the two versions of the code:

[Image: code diff between the two versions]

Simple DEMO verification:

// Code before the fix:
  private final ConcurrentHashMap<Throwable, AtomicLong> messagesDropped =
      new ConcurrentHashMap<Throwable, AtomicLong>();
// Code after the fix:
  private final ConcurrentHashMap<Class<? extends Throwable>, AtomicLong> messagesDropped =
      new ConcurrentHashMap<>();

After the fix, the map key is Class<? extends Throwable> instead of Throwable.
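To see why the old key type leaks, here is a minimal sketch (not the actual zipkin-reporter code path, just the map behaviour the fix relies on): Throwable does not override equals/hashCode, so every freshly created exception instance becomes a new map entry, while keying by the exception class keeps the map bounded. Class and variable names below are made up for the demo.

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Sketch of the difference between the two key types.
public class ReporterMetricsLeakDemo {
    // Before the fix: keyed by Throwable; every new exception instance is a new key.
    static final ConcurrentHashMap<Throwable, AtomicLong> byInstance = new ConcurrentHashMap<>();
    // After the fix: keyed by the exception class; the map size is bounded
    // by the number of distinct exception types.
    static final ConcurrentHashMap<Class<? extends Throwable>, AtomicLong> byClass = new ConcurrentHashMap<>();

    public static void main(String[] args) {
        for (int i = 0; i < 100_000; i++) {
            Throwable dropCause = new RuntimeException("connection refused"); // fresh instance each time
            byInstance.computeIfAbsent(dropCause, k -> new AtomicLong()).incrementAndGet();
            byClass.computeIfAbsent(dropCause.getClass(), k -> new AtomicLong()).incrementAndGet();
        }
        System.out.println("keyed by Throwable instance: " + byInstance.size()); // 100000
        System.out.println("keyed by Throwable class:    " + byClass.size());    // 1
    }
}

Under a steady stream of report failures, such as the connection refused errors behind the alarm, the instance-keyed map grows without bound and eventually retains hundreds of megabytes, which is what the MAT screenshot shows.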

Simple verification:

[Images: verification results]

Solution

Simply upgrade the zipkin-reporter version. With the following dependency, the resolved zipkin-reporter version is 2.8.4.

<!-- zipkin dependency -->
<dependency>
  <groupId>io.zipkin.brave</groupId>
  <artifactId>brave</artifactId>
  <version>5.6.4</version>
</dependency>

Tip: add the following JVM parameters so that a heap dump is written automatically when an out-of-memory error occurs.

 -XX:+HeapDumpOnOutOfMemoryError 
 -XX:HeapDumpPath=path/filename.hprof 
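To confirm the flags are effective, you can run a throwaway class like the following with a small heap and check that the .hprof file appears at the configured path (class name and path are only for illustration):

import java.util.ArrayList;
import java.util.List;

// Illustration only: run with
//   java -Xmx32m -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/oom-demo.hprof HeapDumpFlagDemo
// and check that /tmp/oom-demo.hprof is written when the OOM occurs.
public class HeapDumpFlagDemo {
    public static void main(String[] args) {
        List<byte[]> live = new ArrayList<>();
        while (true) {
            live.add(new byte[1024 * 1024]); // keep allocating 1 MB blocks until OOM
        }
    }
}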


Origin blog.csdn.net/qq_45401061/article/details/108719913