Online Full GC Troubleshooting in Practice: Teaching You How to Troubleshoot Online Problems | JD Cloud Technical Team

Author: Han Kai of JD Technology

1. Problem discovery and troubleshooting

1.1 Find the cause of the problem

The problem surfaced when we received a JDOS container CPU alarm: CPU usage had reached 104%.

(Figure: container CPU usage alarm)

The machine logs showed that many threads were executing batch tasks at the time. Batch tasks are normally low-CPU, high-memory workloads, so our first guess was that the high CPU usage was caused by Full GC (we had seen a similar situation before, and the problem went away after asking the user to restart the application).

Check the memory usage of the machine through Taishan:

(Figure: machine CPU and memory usage in Taishan)

The CPU usage is indeed high, but the memory usage is only 62%, which is well within the normal range.

This was confusing: if Full GC were the cause, the memory should have been close to full at this point.

Based on other indicators, such as a sudden inflow of traffic, I then suspected that the CPU was being saturated by a burst of calls to the JSF interface, which would explain why memory usage stayed low, but that was gradually ruled out as well. At this point we were somewhat at a loss: the symptoms did not match the guess, with only the CPU rising and not the memory. So what could cause a one-sided CPU increase? I spent a long time investigating in that direction before it, too, was ruled out.

Then it suddenly occurred to me: could the "problem" be with the monitoring itself?

More precisely, the monitoring we had been looking at was the machine-level monitoring, not the JVM monitoring!

The CPU consumed by the JVM shows up directly in the machine metrics, but high usage inside the JVM heap is much less visible there, since the machine only sees the memory the process has reserved, not how full the heap is.

So we went to SGM to check the JVM metrics of the corresponding node:

(Figure: JVM heap memory and GC monitoring in SGM)

We can see that the old generation of the heap had indeed been filling up and then being collected, and the CPU usage at those moments lines up with the GC times.

At this point we could confirm that the problem was caused by Full GC.

1.2 Find the cause of the Full GC

We first dumped heap snapshots from before and after a GC,

then used JProfiler for memory analysis. (JProfiler is a heap analysis tool that can attach directly to an online JVM to view information in real time, or analyze a dumped heap snapshot to inspect the heap at a particular moment.)

First unzip the dumped file, rename it with a .bin suffix, and open it. (We used the dump tool that comes with Xingyun; alternatively you can log in to the machine and dump manually, for example with `jmap -dump:format=b,file=heap.bin <pid>`.)

(Figure: the heap dump opened in JProfiler)

First select Biggest Objects to view the largest objects in the heap memory at that time.

The figure shows that four List objects occupy nearly 900 MB of memory, while the maximum heap size is only 1.3 GB. With the other objects added in, it is easy to see how the old generation fills up and triggers Full GC.

(Figure: the four large List objects in the Biggest Objects view)

Pick one of the largest objects as the one to examine.

From here we can already locate where this large object comes from:

(Figure: reference chain of the selected large object)

At this point we had roughly located the problem. If you are still not sure, you can inspect the object contents as follows:

(Figure: inspecting the contents of the large List object)

You can see that our large List object actually contains many Map objects, and each Map object has many key-value pairs.

Here you can also see the relevant attribute information in the Map.

You can also directly see relevant information on the following interface:

(Figure: object view showing the Map entries)

Then click all the way down to see the corresponding properties.

At this point we had, in theory, located where the large object sits in the code.

2. Problem solving

2.1 Find the location of the large object in the code and the root cause of the problem

First, we located the corresponding code and logic following the process above.

The general logic of our project is as follows (a rough code sketch follows the list):

  1. The Excel sample uploaded by the user is parsed and loaded into memory as a List variable (the variable we saw above). For a 200,000-row sample, each row has a fields at this point, taking up about 100 MB of space.
  2. We then loop over the user samples, add some extra request fields based on the data in each sample, and request the relevant results with this data. Each row now has a + n fields, and the structure already occupies about 200 MB.
  3. After the loop finishes, the 200 MB of data is stored in the cache.
  4. Excel generation starts: the 200 MB of data is taken out of the cache, and the original sample fields (the a fields recorded earlier) are extracted and written into the result Excel.

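To make the flow above easier to follow, here is a condensed sketch of the four steps in Java. Every class, method and field name in it is an illustrative assumption rather than the project's real code.

```java
import java.util.List;
import java.util.Map;

public class BatchTaskSketch {

    void run(String excelPath) {
        // 1. Parse the uploaded Excel into heap memory (~100 MB for a 200,000-row sample)
        List<Map<String, String>> samples = parseExcel(excelPath);

        // 2. Enrich every row with extra request fields and the remote result,
        //    so each Map grows from a to a + n entries (~200 MB in total)
        for (Map<String, String> row : samples) {
            row.putAll(callRemoteService(row));
        }

        // 3. Put the whole ~200 MB structure into the cache once the loop finishes
        cache().put("task:samples", samples);

        // 4. Read it back and write only the original sample fields into the result Excel
        writeExcel(cache().get("task:samples"));
    }

    // Placeholders standing in for the real parsing / RPC / cache / Excel-writing code
    List<Map<String, String>> parseExcel(String path) { throw new UnsupportedOperationException(); }
    Map<String, String> callRemoteService(Map<String, String> row) { throw new UnsupportedOperationException(); }
    Map<String, List<Map<String, String>>> cache() { throw new UnsupportedOperationException(); }
    void writeExcel(List<Map<String, String>> rows) { throw new UnsupportedOperationException(); }
}
```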
Expressed as a flow chart:

(Flow chart: jvmgc-Page-1.drawio)

Combined with some screenshots from the actual troubleshooting:

(Figure: heap usage curve during the task)

One visible symptom is that the memory low point after each GC keeps rising, which corresponds to step 2 above: the data in memory is gradually expanding.

Conclusion:

The Excel sample uploaded by the user is loaded into memory and stored as a List<Map<String, String>> structure. Stored this way, a 20 MB Excel file expands to occupy about 120 MB of heap memory. This step alone consumes a large amount of heap, and because of the task's logic the large object stays in the JVM for as long as 4-12 hours, so the heap is easily filled up once there are too many concurrent tasks.

Here is why using HashMap causes this memory expansion; the main reason is that its storage space efficiency is relatively low:

Memory calculation for a Long object: in a HashMap<Long, Long> structure, only the two long values stored as Key and Value are valid data, 16 bytes in total (2 × 8 bytes). Once these two long values are boxed into java.lang.Long objects, each gains an 8-byte Mark Word and an 8-byte Klass pointer on top of the 8 bytes holding the long value, so one wrapper object occupies 24 bytes.

After the two Long objects are combined into a Map.Entry, there is an additional 16-byte object header (8-byte Mark Word + 8-byte Klass pointer), then an 8-byte next field and a 4-byte int hash field, plus 4 bytes of blank padding for alignment, and finally an 8-byte reference to this Entry from the HashMap. Adding two long values therefore actually consumes Long(24 bytes) × 2 + Entry(32 bytes) + HashMap reference(8 bytes) = 88 bytes, and the space efficiency is the valid data divided by the total memory, i.e. 16 bytes / 88 bytes ≈ 18%.

—— "In-depth understanding of Java virtual machine" 5.2.6

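The numbers in the quoted calculation can be checked empirically. Below is a minimal sketch using the OpenJDK JOL tool (assuming the org.openjdk.jol:jol-core dependency); the exact sizes it prints depend on JVM settings such as compressed oops, whereas the book's figures assume uncompressed 8-byte Klass pointers.

```java
import java.util.HashMap;
import java.util.Map;

import org.openjdk.jol.info.ClassLayout;
import org.openjdk.jol.info.GraphLayout;

public class BoxedLongFootprint {
    public static void main(String[] args) {
        // Field-by-field layout of one java.lang.Long: object header plus the 8-byte value
        System.out.println(ClassLayout.parseInstance(Long.valueOf(123L)).toPrintable());

        // Total retained size of a HashMap holding a single boxed key/value pair,
        // to compare against the 16 bytes of actually useful data it carries
        Map<Long, Long> map = new HashMap<>();
        map.put(1L, 2L);
        System.out.println(GraphLayout.parseInstance(map).toFootprint());
    }
}
```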
Below is the heap object dumped right after an Excel upload: it occupies 128 MB of memory, while the uploaded Excel file itself is only 17.11 MB.

(Figures: retained size of the parsed Excel in the heap dump vs. the size of the uploaded file)

Space efficiency: 17.11 MB / 128 MB ≈ 13.4%

2.2 How to solve this problem

Leaving aside whether the process above is reasonable, the solutions fall into two categories. One is to cure the root cause: do not keep the object in JVM memory at all, but store it in a cache; if it is not in the heap, the large-object problem naturally disappears. The other is to treat the symptoms: shrink the large object so that it no longer triggers frequent Full GC in everyday usage scenarios.

Both approaches have pros and cons:

2.2.1 Radical treatment: do not store it in memory

The logic of this solution is simple: as each sample row is loaded, store it into the Redis cache one by one. Then we only need to keep the number of samples in memory and fetch rows back from the cache in order when they are needed.

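A minimal sketch of this idea using the Jedis client is shown below; the key naming scheme, the Redis address and the 12-hour expiry are assumptions for illustration, not the project's actual implementation.

```java
import java.util.List;
import java.util.Map;

import redis.clients.jedis.Jedis;

public class SampleCache {
    private final Jedis jedis = new Jedis("localhost", 6379); // assumed Redis address

    // Write every parsed Excel row to its own hash key; only the row count stays in the JVM
    public long store(String taskId, List<Map<String, String>> rows) {
        for (int i = 0; i < rows.size(); i++) {
            String key = "sample:" + taskId + ":" + i;
            jedis.hmset(key, rows.get(i));
            jedis.expire(key, 12 * 60 * 60); // roughly match the task's lifetime
        }
        return rows.size();
    }

    // Read rows back one at a time (or in small batches) when generating the result Excel
    public Map<String, String> load(String taskId, int index) {
        return jedis.hgetAll("sample:" + taskId + ":" + index);
    }
}
```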
Advantages: the problem is solved at its root and will basically not recur; no matter how large the data volume grows, you only need to add the corresponding Redis resources.

Disadvantages: first, it consumes a lot of extra Redis cache space; second, in our project the code involved is old and obscure, so the change requires a large amount of work and regression testing.

(Flow chart: Fuxi operation background fullgc-page 2.drawio)

2.2.2 Conservative treatment: reducing its data volume

Analyzing the process in 2.1: first, step 3 is completely unnecessary. Storing the data in the cache and then taking it straight back out only consumes extra cache space. (Presumably a historical legacy, so I won't dig into it here.)

Second, in step 2 the extra n fields are useless once the request finishes, so they can be deleted as soon as the request ends.

There are two options here: one is to delete only the useless fields to shrink the Map, and then pass it along as the input for Excel generation; the other is to delete the Map entirely once the request completes, and re-read the Excel sample uploaded by the user when generating the result Excel.

Advantages: small changes, no need for complicated regression testing.

Disadvantages: with extremely large data volumes, Full GC may still occur.

(Flow chart: jvmgc-page 3.drawio)

The specific implementation will not be expanded on here.

One implementation of the first option (removing the useless fields):

// Get the fields that are actually needed
String[] colEnNames = (String[]) colNameMap.get(Constant.BATCH_COL_EN_NAMES);
List<String> colList = Arrays.asList(colEnNames);
// Remove the useless fields
param.keySet().removeIf(key -> !colList.contains(key));

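For completeness, here is a sketch of the second option under the same assumptions: release each row's Map as soon as the remote call finishes, then re-read the uploaded Excel when the result file is generated. All method names below are hypothetical placeholders, not project code.

```java
import java.util.List;
import java.util.Map;

public class ReleaseAfterRequest {

    void process(List<Map<String, String>> samples, String uploadedExcelPath) {
        for (Map<String, String> row : samples) {
            callRemoteService(row); // use the enriched data while it is needed
            row.clear();            // then drop it so it no longer pins heap memory
        }
        // When generating the result Excel, reload only the original sample fields
        writeExcel(parseExcel(uploadedExcelPath));
    }

    // Placeholders for the real RPC call, Excel parser and Excel writer
    void callRemoteService(Map<String, String> row) { }
    List<Map<String, String>> parseExcel(String path) { throw new UnsupportedOperationException(); }
    void writeExcel(List<Map<String, String>> rows) { }
}
```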
3. Extended thinking

First of all, the monitoring screenshots in this article come from an artificially induced GC created while reproducing the original scene.

In the CPU usage graph you can see that the CPU spikes do line up with the GC times, but the 104% CPU usage seen in the original incident did not reappear.

(Figures: CPU usage and GC activity during the reproduction)

The direct reason is actually simple: although the system does run Full GC, it does not run frequently.

Occasional, small-scale Full GCs do not make the system CPU soar, which matches what we saw in the reproduction.

So what caused the original incident?

(Figure: heap usage during the original incident)

As mentioned above, the large objects in the heap gradually expand as tasks progress. With enough tasks running for long enough, the space reclaimed by each Full GC gets smaller and smaller. Once the free space shrinks past a certain point, every Full GC finishes only to find that there is still not enough room, which immediately triggers the next GC. The end result is back-to-back GCs, and that is what makes the CPU soar.

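One simple way to reproduce this "free space keeps shrinking until Full GCs run back to back" pattern locally is sketched below (an assumption about how such a scene can be rebuilt, not the team's actual reproduction code). Run it with a small heap, for example -Xmx256m, and watch the GC activity climb in jstat or SGM.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FullGcReproducer {
    // Retained references simulate the long-lived task data that survives every Full GC
    private static final List<Map<String, String>> RETAINED = new ArrayList<>();

    public static void main(String[] args) throws InterruptedException {
        while (true) {
            Map<String, String> row = new HashMap<>();
            for (int i = 0; i < 100; i++) {
                row.put("field-" + i, String.valueOf(System.nanoTime()));
            }
            RETAINED.add(row); // the old generation fills up a little more on every iteration
            Thread.sleep(1);   // slow the growth so the shrinking free space is visible
        }
    }
}
```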
4. Troubleshooting summary

  • When online CPU usage is too high, first check whether the problem is caused by Full GC. Look at the JVM monitoring, or check with jstat (for example `jstat -gcutil <pid> 1000`). Don't be misled by machine-level memory monitoring.
  • Once the problem is confirmed to be GC-related, you can attach JProfiler directly to the online JVM, or dump a heap snapshot and analyze it offline.
  • Start by finding the largest objects, since Full GC is usually caused by large objects. It may not always be as obvious as four huge objects: it could also be a dozen or so evenly sized 50 MB objects. Each case needs its own analysis.
  • Use the tools above to find the problematic object and the code location behind its reference chain, analyze the code to find the specific cause, and verify your guess against the other symptoms until the real root cause is found.
  • Fix the problem based on its cause.

Of course, this was not a particularly complicated case. Different systems will have different memory behavior, and each problem should be analyzed on its own terms; what this case teaches is the general approach to troubleshooting and solving such problems.


Origin: my.oschina.net/u/4090830/blog/8704836