Sharp eyes see through ES pseudo-slow queries | JD Logistics technical team

1. Symptoms

Service symptoms

TP99 performance degradation for service interfaces

ES symptoms

  • YGC: severely abnormal, peaking at 200+ collections with a total pause time of 7s+
  • Full GC: abnormal; only 1 shown per monitoring window, but recurring, with 5s stop-the-world pauses
  • Slow queries: 5+ slow queries observed

2. Troubleshooting process

1. Rule out interfering factors

  • Judging from the symptoms, something in the application was driving JVM memory usage steadily upward, triggering frequent YGC and then Full GC (at this point, only a bold guess).
  • At the time, the ES JVM was configured with a 40G heap and the CMS garbage collector. With a heap that large, CMS clearly performs worse than G1, which is the better fit.
  • An ES operations colleague was asked to switch the garbage collector from CMS to G1 (an illustrative jvm.options sketch follows the tip below).

(Tip: not every ES cluster is a good fit for G1. Under many large queries, a G1 Full GC degenerates into a serial scan of the entire heap, producing pauses of tens of seconds or even minutes. Such long pauses not only affect user queries; they can also cause inter-node communication to time out, making master and data nodes drop out of the cluster and hurting cluster stability.)
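For reference, the collector is configured in Elasticsearch's config/jvm.options file. The lines below are only an illustrative sketch of such a change; the 40G heap size is taken from this article, while the exact flags used on this cluster are assumptions.

# config/jvm.options (illustrative sketch, not the actual production values)
-Xms40g
-Xmx40g

# remove the CMS flags, e.g.
# -XX:+UseConcMarkSweepGC
# -XX:CMSInitiatingOccupancyFraction=75
# -XX:+UseCMSInitiatingOccupancyOnly

# enable G1 instead (tuning values commonly shipped with ES; verify for your version)
-XX:+UseG1GC
-XX:G1ReservePercent=25
-XX:InitiatingHeapOccupancyPercent=30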

GC behavior after switching to G1:

  • YGC: back to normal, peaking at 35+ collections with a total pause time of around 800ms
  • Full GC: normal, 0 occurrences
  • Slow queries: still 10+ slow queries

2. Locate the problem

After the ES garbage collector was adjusted, however, the TP99 of the business service interface still had not recovered: solving the GC problem had not solved the performance problem.

  • Talking with colleagues on the ES side revealed that the refresh activity on this cluster was wildly abnormal: 20,000+ refreshes (a stats call to confirm this is sketched after this list).

  • The statements flagged as slow queries in ES monitoring were not slow when executed on their own.
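For reference, the refresh counters that monitoring surfaced can also be read directly from the ES index stats API (the index name below is a placeholder):

# refresh statistics for one index
GET /my-index/_stats/refresh

# the relevant part of the response, per index:
#   "refresh": { "total": ..., "total_time_in_millis": ... }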

Root cause:

The application talks to ES through spring-data-elasticsearch 3.1.9.RELEASE (dependency shown below). The ES data-synchronization job writes documents through the save method of that API, and in this version every save is followed by a refresh of the index.

<dependency>
    <groupId>org.springframework.data</groupId>
    <artifactId>spring-data-elasticsearch</artifactId>
    <version>3.1.9.RELEASE</version>
</dependency>
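In simplified form, the 3.1.x repository save path behaves roughly like this (a sketch of SimpleElasticsearchRepository from that generation of the library, not a verbatim copy of its source):

// simplified sketch of spring-data-elasticsearch 3.1.x repository.save(entity)
public <S extends T> S save(S entity) {
    elasticsearchOperations.index(createIndexQuery(entity));           // write the document
    elasticsearchOperations.refresh(entityInformation.getIndexName()); // then refresh the whole index, on EVERY save
    return entity;
}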


Why does every refresh have an impact on queries? Following the trend, we also put the question to GPT; the gist is this: each refresh flushes the in-memory indexing buffer into a new Lucene segment and opens a new searcher, which invalidates caches and leaves behind many small segments that later have to be merged. Done after every write, this burns CPU and I/O on the same data nodes that serve searches, so queries that are fast in isolation show up as slow.
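For context, the two refresh paths look like this at the REST level (the index name is a placeholder; by default ES refreshes each index on a schedule rather than on every write):

# explicit refresh of an index -- this is what the 3.1.x save path effectively triggers after every document
POST /my-index/_refresh

# the scheduled refresh is controlled per index by refresh_interval (default 1s)
PUT /my-index/_settings
{
  "index": { "refresh_interval": "1s" }
}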

3. Repair plan

  • Upgrade spring-data-elasticsearch to 4.x or later. Since the newer versions are not backward compatible, the cost of this change is high: every place in the project that touches the API would have to be modified.

  • Replace the save operation with a write operation that does not trigger a refresh (the solution chosen here, since it requires fewer changes; see the sketch after this list).
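As an illustration only (the article does not show the exact replacement code, so the entity type and class names below are hypothetical), one way to write documents in 3.1.x without the per-save refresh is to go through ElasticsearchTemplate, whose index/bulkIndex methods do not call refresh themselves:

import java.util.ArrayList;
import java.util.List;
import org.springframework.data.elasticsearch.core.ElasticsearchTemplate;
import org.springframework.data.elasticsearch.core.query.IndexQuery;
import org.springframework.data.elasticsearch.core.query.IndexQueryBuilder;

// Hypothetical sketch: bulk-index documents via ElasticsearchTemplate instead of repository.save(),
// so no refresh is issued per document. "OrderDoc" is a placeholder entity type.
public class EsSyncWriter {

    private final ElasticsearchTemplate elasticsearchTemplate;

    public EsSyncWriter(ElasticsearchTemplate elasticsearchTemplate) {
        this.elasticsearchTemplate = elasticsearchTemplate;
    }

    public void syncToEs(List<OrderDoc> docs) {
        List<IndexQuery> queries = new ArrayList<>();
        for (OrderDoc doc : docs) {
            queries.add(new IndexQueryBuilder()
                    .withId(doc.getId())
                    .withObject(doc)
                    .build());
        }
        // bulkIndex() only writes; documents become searchable on the index's normal refresh cycle
        elasticsearchTemplate.bulkIndex(queries);
    }
}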

The slow queries are gone.

The number of refreshes has also dropped.

3. Problem resolved

In the end, the TP99 of the business service interface returned to normal.

As our teachers often said, we tend to be led astray by past experience and opportunistic shortcuts. The fundamental way to solve a problem like this is to seek truth from facts: practice is the ultimate test of truth.

Author: JD Logistics Wang Yijie

Source: JD Cloud Developer Community, Ziyuanqishuo Tech. Please indicate the source when reprinting.


Reprinted from: blog.csdn.net/jdcdev_/article/details/135011354