Elasticsearch GC optimization practice

Recently, the online ES cluster serving business queries has been triggering frequent timeout alarms, especially a wave of timeouts at a fixed time every morning. From the call-chain monitoring alone it is hard to tell which business behavior is causing it.

Preliminary guess

Looking at the basic Elasticsearch monitoring in Grafana, we found that the business alarms basically coincide with the ES Old GC (old-generation GC) pause times:


At the same time, we noticed that Old-area memory keeps growing and the Old area fills up in less than an hour, while almost all of it is reclaimed after each Old GC:

Guesses:

  • What is causing the rapid growth of the Old area? Could the allocation rate be too high, causing premature promotion? Or is something allocating very large objects?
  • Why is the Old GC so slow? The ES stalls are most likely related to it, and the regular occurrence every morning may be tied to some business behavior.

View GC configuration

So we first used the standard JVM tools to check the GC configuration and the overall heap state from the outside.

jmap shows the ES heap layout (the commands are sketched after the list):

  • MaxHeapSize: the whole heap is 31GB.
  • MaxNewSize: the Young area is only 1GB.
  • OldSize: the Old area is 30GB.
  • NewRatio: the value 2 means the Young area should occupy 1/3 of the heap, i.e. about 10GB, yet it is actually only 1GB, which is very strange.
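
A rough sketch of how the heap was inspected from the outside; <es-pid> is a placeholder for the Elasticsearch process ID, and the commented values are simply the figures summarized above, not verbatim tool output:

# find the ES process ID, then dump its heap configuration (JDK 8)
jps -l | grep -i elasticsearch
jmap -heap <es-pid>
# relevant values observed on this node:
#   MaxHeapSize = 31GB, MaxNewSize = 1GB, OldSize = 30GB, NewRatio = 2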

The default behavior we expect is Young ≈ 10GB and Old ≈ 20GB. Why did they end up as 1GB and 30GB?

Checking the final startup parameters of the running JVM confirmed that the Young area really is only 1GB. Is this a JVM bug?
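
A quick way to double-check which values the running JVM actually resolved (a sketch; <es-pid> is a placeholder, and the heap size in the second command should match your node):

# flags of the already-running ES process
jinfo -flags <es-pid>
# or ask a fresh JVM how it resolves the sizes with the same options
java -Xmx31g -XX:+UseConcMarkSweepGC -XX:+PrintFlagsFinal -version | grep -E 'NewSize|NewRatio|OldSize'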

Googling the remaining GC parameters turned up reports that, with JDK 8's -XX:+UseConcMarkSweepGC, the NewRatio parameter is ignored for reasons that are not entirely clear.

Such a small Young area inevitably causes very frequent Young GCs (jstat -gc shows 1-2 YGCs per second), which is definitely bad for ES performance. Although this is not directly related to the slow Old GC, it should be fixed first: we can bypass NewRatio and set a 10GB Young area directly with -Xmn:
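
A minimal sketch of the change in config/jvm.options, assuming the 31GB heap described above:

# config/jvm.options (JDK 8 + CMS)
-Xms31g
-Xmx31g
# fix the Young area at 10GB explicitly instead of relying on NewRatio
-Xmn10g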

After restarting ES, the Young area size is correct, and jstat -gc shows the Young GC frequency dropping to roughly one sixth of what it was.
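
The before/after YGC frequency can be watched with jstat, sampling once per second (a sketch; <es-pid> is a placeholder):

# print GC counters every 1000 ms; watch how fast the YGC column climbs
jstat -gc <es-pid> 1000
# jstat -gcutil <es-pid> 1000 shows the same counters with occupancy as percentages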

Looking at Grafana again, the YGC frequency has indeed dropped (there are now visible gaps between YGCs):

The Old area, however, still grows rapidly:

Because the Young area has grown from 1GB to 10GB, each YGC now shrinks the JVM heap much more visibly, so the graph shows obvious ups and downs. The overall upward trend of the heap has not changed, though: objects are still promoted to the Old area quickly, until the Old area fills up and an Old GC drops it sharply.

Enable GC logs

Next we need to analyze why the Old area grows so fast, and also why the Old GC pauses for about 1 second and whether that can be optimized.

Enable GC logging in the configuration and restart ES:
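
A sketch of the JDK 8 GC-logging flags in config/jvm.options; the log path and rotation sizes here are assumptions, adjust them to your environment:

# write GC events with timestamps to a rotating log file
-Xloggc:/var/log/elasticsearch/gc.log
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-XX:+PrintGCApplicationStoppedTime
-XX:+UseGCLogFileRotation
-XX:NumberOfGCLogFiles=8
-XX:GCLogFileSize=64m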

To determine whether many "middle-aged" objects are being promoted to the Old area, we set -XX:MaxTenuringThreshold=15, raising the promotion threshold to 15 survived YGCs, so that the full age distribution of objects in the Young area becomes observable.
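
Concretely, that is two flags in jvm.options (a sketch; I am assuming -XX:+PrintTenuringDistribution was also enabled, since that is what makes the per-age breakdown appear in the GC log):

# let objects stay in the Young area for up to 15 YGCs before promotion
-XX:MaxTenuringThreshold=15
# print an "age N: ... bytes" breakdown after each YGC
-XX:+PrintTenuringDistribution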

At first we ran with the default threshold, which promotes objects at age 6. This is the screenshot from that time:

The figure shows objects aged 1 through 6, each age occupying tens of MB. After a YGC, the age-6 objects enter the Old area and the age 1-5 objects each grow one age older, so I suspected that every YGC promotes tens of MB of "middle-aged" age-6 objects into the Old area. A rough calculation showed this rate is indeed close to the observed Old GC cycle, so this looks like the most likely cause of the Old area's growth.

If the promotion age were raised to 15, "middle-aged" objects that only live to around age 10 would eventually be collected by a YGC instead of being promoted, which might slow the growth of the Old area. With this idea in mind, I adjusted -XX:MaxTenuringThreshold to 15 to give "middle-aged" objects more chances to be collected by YGC.

The reality, however, was that objects were spread evenly across all 15 ages: the "middle-aged" objects live longer than imagined and survive 15 YGCs without being released. Since there really are that many "middle-aged" objects, slowing the growth of the Old area this way is hard to achieve, so we simply moved on to studying why the Old GC is so slow.

Optimize Old GC speed

From the GC log written to disk we can analyze in detail where the CMS Old GC spends its time, focusing on the phases that cause STW (stop-the-world) pauses.

CMS is the collector used for the Old area, and a GC cycle begins with a log line like this:

2022-03-12T13:19:54.273+0800: 96253.129: [GC (CMS Initial Mark) [1 CMS-initial-mark: 23554181K(31398336K)] 23611096K(32395136K), 0.0063801 secs] [Times: user=0.01 sys=0.00, real=0.01 secs]

The initial-mark phase's STW pause is only a few milliseconds and can basically be ignored.

A full CMS cycle goes through initial mark, concurrent mark, concurrent preclean, remark, and concurrent sweep/reset; only initial mark and remark stop the world.

The Remark phase causes an obvious STW pause, which a Zhihu answer explains roughly as follows:

Because the whole Old GC cycle is relatively long, the Young area fills up again while it runs, and the Remark phase has to scan those Young-area objects as well when it rescans the heap. If a YGC (which is itself very fast) is forced right before Remark, there are far fewer Young-area objects left to scan, which should shorten Remark's STW pause.

Currently the GC log shows the Remark phase taking this long:

An STW pause of 0.8 seconds is really painful. We add this option:

-XX:+CMSScavengeBeforeRemark

After adding this option, the Remark phase's STW time is reduced to roughly one seventh of what it was:

Optimization effect

The red line is the comparison instance still running the original online configuration; the new configuration has taken effect on the other machines:

Both the GC frequency and the GC pause times have dropped significantly. The Old GC pause time is now down at the level of the original Young GC pauses, and the morning timeout alarms have disappeared.
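
To recap, a sketch of the GC-related flags that changed during this tuning (whether -XX:MaxTenuringThreshold=15 was kept in the final configuration is not stated; it is listed only because it was set above):

# explicit Young area size, since NewRatio is ignored with CMS on this JDK 8 setup
-Xmn10g
# run a Young GC right before the CMS remark phase to shrink its STW pause
-XX:+CMSScavengeBeforeRemark
# raised to observe the object age distribution; revisit whether to keep it
-XX:MaxTenuringThreshold=15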


Source: blog.csdn.net/h952520296/article/details/127983683