How sweet are sub-millisecond GC pauses? First experience of JDK17+ZGC|Dewu Technology

1 Introduction

Garbage collection pauses have always been a concern for Java engineers. For services with strict real-time response requirements, the pauses of tens or even hundreds of milliseconds produced by mainstream collectors such as CMS and G1 can be fatal. Tuning these collectors also has a high barrier to entry: effective tuning requires some understanding of the collector's internals. To address these problems, JDK 11 introduced ZGC, a low-latency garbage collector. ZGC uses new techniques and optimized algorithms to keep GC pauses under 10 milliseconds, and with JDK 17 its pauses can even be kept at the sub-millisecond level!

2 ZGC

Many articles already cover ZGC and its principles, so this section gives only a brief introduction.

2.1 Design goals

ZGC was first introduced as an experimental feature in JDK 11 and declared production-ready in JDK 15. As a low-latency garbage collector, it is designed to meet the following goals:

  • Heap size support from 8MB to 16TB

  • Maximum GC pause of 10ms

  • Throughput reduced by at most 15% in the worst case (trading some throughput for low latency is worthwhile, and lost throughput can be recovered by scaling out)

1.png

2.2 ZGC memory distribution

Unlike the traditional CMS and G1 collectors, ZGC (as of JDK 17) has no concept of generations. It only has Regions, a concept similar to G1's. A ZGC Region comes in one of three size classes, as shown in the figure below:

  • Small Region: capacity fixed at 2MB; holds small objects of less than 256KB.

  • Medium Region: capacity fixed at 32MB; holds objects larger than 256KB but smaller than 4MB.

  • Large Region: capacity is not fixed and can change dynamically, but must be an integer multiple of 2MB; holds large objects of 4MB or more. Each Large Region stores exactly one large object, which means that despite its name, a Large Region may actually be smaller than a Medium Region, with a minimum capacity of 4MB. ZGC does not relocate Large Regions (relocation is the ZGC phase in which the collector copies objects), because copying large objects is very expensive.

2.png
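The size classes above can be sketched as a small classifier. This is only an illustration of the thresholds quoted in this article, not HotSpot's allocation code; the class and method names are hypothetical, and placing an object of exactly 256KB in a Medium Region is an assumption, since the description above leaves that boundary open.

```java
final class RegionSizing {
    static final long KB = 1024L;
    static final long MB = 1024L * KB;

    enum RegionType { SMALL, MEDIUM, LARGE }

    /** Picks a region type from the object size, using the thresholds quoted in the article. */
    static RegionType classify(long objectBytes) {
        if (objectBytes < 256 * KB) {
            return RegionType.SMALL;   // placed in a fixed 2 MB region
        }
        if (objectBytes < 4 * MB) {
            return RegionType.MEDIUM;  // placed in a fixed 32 MB region (exactly 256 KB lands here by assumption)
        }
        return RegionType.LARGE;       // one object per region
    }

    /** Capacity of the region holding the object; large regions are rounded up to a multiple of 2 MB. */
    static long regionCapacity(long objectBytes) {
        switch (classify(objectBytes)) {
            case SMALL:
                return 2 * MB;
            case MEDIUM:
                return 32 * MB;
            default:
                long granule = 2 * MB;
                return ((objectBytes + granule - 1) / granule) * granule;
        }
    }

    public static void main(String[] args) {
        long[] sizes = {100 * KB, 1 * MB, 4 * MB, 5 * MB};
        for (long size : sizes) {
            System.out.println(size + " bytes -> " + classify(size)
                    + ", region capacity " + regionCapacity(size) / MB + " MB");
        }
    }
}
```

Note how a 5MB object ends up in a 6MB Large Region, which is smaller than the 32MB Medium Region.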

2.3 GC working process

Like ParNew (the young-generation collector used with CMS) and G1, ZGC uses a mark-copy algorithm. The difference is that ZGC uses colored pointers and read barriers (load barriers) to keep object access correct while objects are being moved, so almost all of the work in the marking, transfer (relocation) and remapping phases runs concurrently with the application. This is the key reason ZGC can meet its goal of pauses below 10ms.
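To make the idea more concrete, here is a toy model of a colored pointer and a read barrier. It is a deliberately simplified sketch: the bit positions, the single "remapped" color, and the forwarding map are all hypothetical, and real ZGC uses several metadata bits and per-region forwarding tables inside the JVM. The point is only that a reference carries state in its unused high bits, and that every load checks that state and repairs ("heals") a stale reference before the application sees it.

```java
import java.util.HashMap;
import java.util.Map;

final class ColoredPointerSketch {
    // Hypothetical layout: the low 42 bits hold the address, one higher bit is the "good" color.
    static final long ADDRESS_MASK = (1L << 42) - 1;
    static final long REMAPPED_BIT = 1L << 44;

    // Old address -> new address, filled in as objects are moved (stands in for forwarding tables).
    final Map<Long, Long> forwarding = new HashMap<>();

    /** Simulated read barrier: returns a reference that is safe to use. */
    long loadBarrier(long ref) {
        if ((ref & REMAPPED_BIT) != 0) {
            return ref; // fast path: the reference already has the good color
        }
        long oldAddress = ref & ADDRESS_MASK;
        // Slow path: follow the forwarding entry if the object was moved.
        long newAddress = forwarding.getOrDefault(oldAddress, oldAddress);
        return newAddress | REMAPPED_BIT; // healed reference
    }

    public static void main(String[] args) {
        ColoredPointerSketch gc = new ColoredPointerSketch();
        gc.forwarding.put(0x1000L, 0x2000L); // pretend the object at 0x1000 moved to 0x2000
        long healed = gc.loadBarrier(0x1000L);
        System.out.println("address after barrier: 0x" + Long.toHexString(healed & ADDRESS_MASK));
    }
}
```

Because the barrier fixes references lazily as threads load them, the collector in this model never has to stop the application to update every pointer at once.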

3.png

As the figure above shows, ZGC has only three STW phases: initial mark, remark, and initial transfer. The transfer process itself is covered in detail by many existing articles, so it is not repeated here. If you are interested, you can refer to the following articles:

  • Exploration and practice of the new generation of garbage collector ZGC

  • ZGC The latest generation of garbage collector | Programmer advanced

3 Why choose JDK17?

Released on September 14, 2021, JDK 17 is a long-term support (LTS) release, meaning it will be supported and updated for many years. It is also the first LTS release that includes a production-ready ZGC. To recap, an experimental version of ZGC was included in JDK 11 (the previous LTS release), while the first production-ready version of ZGC appeared in JDK 15 (a non-LTS release).

4 Upgrade process

Upgrading from JDK8+G1 to JDK17+ZGC mainly requires adaptation at two levels: the code and the JVM startup parameters.

4.1 JDK download

We chose OpenJDK as our JDK 17 distribution. Download it from https://jdk.java.net/archive/ and select version 17 GA.

4.png

4.2 Code adaptation

  • JDK11 removes Java EE and CORBA modules

If the project uses packages from the removed modules, such as javax.annotation.* or javax.xml.bind.*, you need to add the corresponding dependencies manually:

<dependency>
    <groupId>javax.annotation</groupId>
    <artifactId>javax.annotation-api</artifactId>
</dependency>
<dependency>
    <groupId>javax.xml.bind</groupId>
    <artifactId>jaxb-api</artifactId>
</dependency>
<dependency>
    <groupId>com.sun.xml.bind</groupId>
    <artifactId>jaxb-core</artifactId>
</dependency>
<dependency>
    <groupId>com.sun.xml.bind</groupId>
    <artifactId>jaxb-impl</artifactId>
</dependency>

  • Upgrade the versions of Maven plugins and related dependencies

<!-- For reference only -->
<maven-compiler-plugin.version>3.8.1</maven-compiler-plugin.version>
<maven-assembly-plugin.version>3.3.0</maven-assembly-plugin.version>
<maven-resources-plugin.version>3.2.0</maven-resources-plugin.version>
<maven-jar-plugin.version>3.2.0</maven-jar-plugin.version>
<maven-surefire-plugin.version>3.0.0-M5</maven-surefire-plugin.version>
<maven-deploy-plugin.version>3.0.0-M1</maven-deploy-plugin.version>
<maven-release-plugin.version>3.0.0-M1</maven-release-plugin.version>
<maven-site-plugin.version>3.9.1</maven-site-plugin.version>
<maven-enforcer-plugin.version>3.0.0-M2</maven-enforcer-plugin.version>
<maven-project-info-reports-plugin.version>3.1.0</maven-project-info-reports-plugin.version>
<maven-plugin-plugin.version>3.6.1</maven-plugin-plugin.version>
<maven-javadoc-plugin.version>3.3.0</maven-javadoc-plugin.version>
<maven-source-plugin.version>3.2.1</maven-source-plugin.version>
<maven-jxr-plugin.version>3.0.0</maven-jxr-plugin.version>

<dependency>
    <groupId>org.projectlombok</groupId>
    <artifactId>lombok</artifactId>
   <!-- <version>1.16.20</version>-->
    <version>1.18.22</version>
</dependency>

5.png

  • Since the module system was introduced in Java 9, application code can no longer reflectively access all JDK classes by default, which breaks some reflection operations. The affected packages can be opened with the following JVM options:
--add-opens=java.base/java.lang=ALL-UNNAMED --add-opens=java.base/java.util=ALL-UNNAMED --add-opens=java.base/java.lang.reflect=ALL-UNNAMED
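The effect of these options can be checked from inside the application. The sketch below is an illustration rather than part of the original upgrade: the class name is hypothetical, and it relies only on the standard `Module.isOpen` and `Field.setAccessible` APIs (JDK 9 or later). It probes the private `value` field of `java.lang.String`, which on JDK 17 should be reachable only when `java.base/java.lang` has been opened to the calling code.

```java
import java.lang.reflect.Field;
import java.lang.reflect.InaccessibleObjectException;

final class AddOpensCheck {
    /** True if the package of the given JDK class is open to this class's module. */
    static boolean isOpenToUs(Class<?> jdkClass) {
        Module target = jdkClass.getModule();
        Module self = AddOpensCheck.class.getModule();
        return target.isOpen(jdkClass.getPackageName(), self);
    }

    /** Tries deep reflection on a field; returns false if the module system blocks it. */
    static boolean canReflect(Class<?> cls, String fieldName) {
        try {
            Field field = cls.getDeclaredField(fieldName);
            field.setAccessible(true); // throws if the package is not open to us
            return true;
        } catch (InaccessibleObjectException | NoSuchFieldException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("java.lang open to this module: " + isOpenToUs(String.class));
        System.out.println("can reflect on String.value:   " + canReflect(String.class, "value"));
    }
}
```

Running it once without and once with `--add-opens=java.base/java.lang=ALL-UNNAMED` is a quick way to confirm that the option reaches the JVM.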

  • With transmittable-thread-local-2.14.2.jar attached as a Java agent locally, the application reports an error at startup

6.png

This can be resolved by appending the log-output argument to the agent path. We have not confirmed the cause; our guess is that it is related to class loading order.

-javaagent:/Users/admin/Documents/transmittable-thread-local-2.14.2.jar=ttl.agent.logger:STDOUT

The issues above are only those we ran into while upgrading the Rainbow Bridge project. Other codebases may need different adaptations, so look for a solution that fits your actual situation.

4.3 JVM parameter replacement

The following are the general GC options, the ZGC-specific options, and some ZGC diagnostic options, taken from the ZGC main page of the OpenJDK Wiki:

13.png

The meaning of each parameter is not covered here. See the official document "The java Command", which describes them in detail.

The startup parameters of JDK8+G1:

-server -Xms36600m -Xmx36600m
-XX:+UseG1GC
-XX:MaxGCPauseMillis=200
-XX:+PrintReferenceGC
-XX:+ParallelRefProcEnabled
-XX:G1HeapRegionSize=16m
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/opt/apps/errorDump.hprof
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-XX:+PrintHeapAtGC
-XX:+PrintGCApplicationConcurrentTime
-verbose:gc
-Xloggc:/opt/apps/logs/${app_name}-gc.log

The startup parameters of JDK17+ZGC are as follows:

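-server -Xms36600m -Xmx36600m
# Enable ZGC
-XX:+UseZGC
# Maximum interval between GC cycles (in seconds)
-XX:ZCollectionInterval=120
# Officially described as ZGC's allocation spike tolerance; the larger the value, the earlier GC is triggered
-XX:ZAllocationSpikeTolerance=4
# Disable proactive GC cycles. In proactive mode (the default), ZGC also starts collections while the system is idle, to reduce the impact of GC when the application is busy. With it disabled, ZGC collects only when needed (for example, when the heap is running short for allocation requests).
-XX:-ZProactive
# GC log
-Xlog:safepoint=trace,classhisto*=trace,age*=info,gc*=info:file=/opt/logs/gc-%t.log:time,level,tid,tags:filesize=50M
# Dump the heap when an OOM occurs
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/opt/apps/errorDump.hprof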
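After switching the parameters, it helps to confirm that the process really runs with ZGC. A minimal sketch using the standard `java.lang.management` API follows. The class name is hypothetical, and the check that collector bean names start with "ZGC" is an assumption based on the names current JDKs report, not a documented contract.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.List;
import java.util.stream.Collectors;

final class GcInfo {
    /** Names of the garbage collector beans registered in this JVM. */
    static List<String> collectorNames() {
        return ManagementFactory.getGarbageCollectorMXBeans().stream()
                .map(GarbageCollectorMXBean::getName)
                .collect(Collectors.toList());
    }

    /** Heuristic: assumes ZGC beans are named with a "ZGC" prefix. */
    static boolean looksLikeZgc() {
        return collectorNames().stream().anyMatch(name -> name.startsWith("ZGC"));
    }

    /** JVM options the process was actually started with. */
    static List<String> jvmFlags() {
        return ManagementFactory.getRuntimeMXBean().getInputArguments();
    }

    public static void main(String[] args) {
        System.out.println("Collectors: " + collectorNames());
        System.out.println("Looks like ZGC: " + looksLikeZgc());
        System.out.println("JVM flags: " + jvmFlags());
    }
}
```

Logging this once at startup makes it easy to spot a deployment that silently fell back to the old startup script.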

5 Pressure test results

The charts speak for themselves:

7.png

8.png

9.png

As the ZGC design goals promise, GC pauses drop from tens of milliseconds to the sub-millisecond level. This ultra-low latency comes at a price, however: ZGC consumes some CPU to achieve it. Normally ZGC uses no more than 15% of the CPU. In the Rainbow Bridge project, with the JVM parameters recommended above, ZGC used about 6% of the CPU.

6 ZGC log

6.1 Output ZGC log

The GC log contains detailed information about GC operations and helps us analyze GC problems. First, look at the GC log parameter among the JVM parameters above:

-Xlog:safepoint=trace,classhisto*=trace,age*=info,gc*=info:file=/opt/logs/gc-%t.log:time,level,tid,tags:filesize=50M


  • safepoint=trace: record trace-level logs about safepoints. A safepoint is a special state in the JVM, used to make sure all threads have reached a safe state before certain operations (such as garbage collection or code deoptimization) run.

  • classhisto*=trace: record trace-level logs related to class histograms.

  • age*=info: record info-level logs related to object age (how long an object has survived in the young generation of a generational collector).

  • gc*=info: record info-level logs related to garbage collection.

  • file=/opt/logs/gc-%t.log: write the log to a file in the /opt/logs/ directory. The file name is gc-%t.log, where %t is a placeholder that expands to the JVM startup timestamp.

  • time,level,tid,tags: include the timestamp, log level, thread id, and tags in each log record.

  • filesize=50M: limit the log file size to 50MB. When the file reaches this limit, the JVM rotates to a new log file and continues logging.

For more detailed GC log configuration, please refer to: https://docs.oracle.com/en/java/javase/17/docs/specs/man/java.html#enable-logging-with-the-jvm-unified-logging-framework
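The option follows the unified logging shape `-Xlog:<what>:<output>:<decorators>:<output-options>`. As a rough illustration, the sketch below splits the option used in this article into those four parts. The class is hypothetical and only handles this simple form; it assumes the file path contains no colon, which would not hold for a Windows path such as `C:\logs\gc.log`.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class XlogOption {
    final List<String> selectors;     // tag=level pairs, e.g. gc*=info
    final String output;              // e.g. file=/opt/logs/gc-%t.log
    final List<String> decorators;    // e.g. time, level, tid, tags
    final List<String> outputOptions; // e.g. filesize=50M

    private XlogOption(List<String> selectors, String output,
                       List<String> decorators, List<String> outputOptions) {
        this.selectors = selectors;
        this.output = output;
        this.decorators = decorators;
        this.outputOptions = outputOptions;
    }

    /** Splits a simple -Xlog option into its four colon-separated sections. */
    static XlogOption parse(String option) {
        String prefix = "-Xlog:";
        if (!option.startsWith(prefix)) {
            throw new IllegalArgumentException("not an -Xlog option: " + option);
        }
        String[] parts = option.substring(prefix.length()).split(":", 4);
        return new XlogOption(splitCommas(part(parts, 0)), part(parts, 1),
                splitCommas(part(parts, 2)), splitCommas(part(parts, 3)));
    }

    private static String part(String[] parts, int index) {
        return index < parts.length ? parts[index] : "";
    }

    private static List<String> splitCommas(String text) {
        return text.isEmpty() ? Collections.<String>emptyList() : Arrays.asList(text.split(","));
    }

    public static void main(String[] args) {
        XlogOption parsed = parse("-Xlog:safepoint=trace,classhisto*=trace,age*=info,gc*=info"
                + ":file=/opt/logs/gc-%t.log:time,level,tid,tags:filesize=50M");
        System.out.println("selectors:  " + parsed.selectors);
        System.out.println("output:     " + parsed.output);
        System.out.println("decorators: " + parsed.decorators);
        System.out.println("options:    " + parsed.outputOptions);
    }
}
```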

6.2 STW key log

In these logs we focus on the STW behavior of the GC. The following keywords mark the STW phases:

  • The three basic STW phases: initial mark (Pause Mark Start in the log), remark (Pause Mark End in the log), and initial transfer (Pause Relocate Start in the log).

10.png

  • Memory allocation stall: this generally happens when garbage is produced faster than it is collected, so the collector cannot keep up. When the heap is full, the allocating thread blocks and waits for the GC to complete. The keyword is Allocation Stall (followed by the name of the blocked thread)

11.jpeg

If such logs appear, you can try the following solutions:

  1. -XX:ZCollectionInterval sets the maximum interval between two GC cycles, in seconds. The default is 0 (disabled). Setting a suitable value shortens the interval between GC cycles so that garbage is reclaimed sooner, at the cost of higher CPU usage for the application.

  2. -XX:ZAllocationSpikeTolerance is officially described as ZGC's allocation spike tolerance. In practice, the larger the value, the earlier a collection is triggered. Raising it appropriately triggers collection earlier so that garbage is reclaimed sooner, again at the cost of higher CPU usage for the application.

  • Safepoint: GC can only run after all threads have reached a safepoint, and ZGC periodically enters a safepoint to decide whether a GC is needed. Threads that reach the safepoint first must wait for the threads that arrive later, until all threads are suspended. The log keywords are safepoint ... stopped

  • Thread and heap dumps: commands such as jstack and jmap, usually triggered manually. The log keyword is HeapDumper
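The keywords above lend themselves to a quick scan of the GC log. The sketch below simply counts the lines that contain each keyword. The class name is hypothetical, and it makes no assumption about the rest of the log line format beyond the keywords listed in this section.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class StwKeywordCounter {
    // Keywords taken from the list above.
    static final String[] KEYWORDS = {
        "Pause Mark Start", "Pause Mark End", "Pause Relocate Start",
        "Allocation Stall", "HeapDumper"
    };

    /** Counts how many log lines contain each keyword. */
    static Map<String, Integer> count(List<String> logLines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String keyword : KEYWORDS) {
            counts.put(keyword, 0);
        }
        for (String line : logLines) {
            for (String keyword : KEYWORDS) {
                if (line.contains(keyword)) {
                    counts.merge(keyword, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java StwKeywordCounter <gc-log-file>");
            return;
        }
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        count(lines).forEach((keyword, n) -> System.out.println(keyword + ": " + n));
    }
}
```

A non-zero count for Allocation Stall is the signal to revisit the two tuning options described above.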

7 Linux Huge Page Memory

The OpenJDK wiki also notes that enabling Linux huge pages improves application performance.

12.png

For how to enable it, see the official document https://wiki.openjdk.org/display/zgc/Main#Main-EnablingLargePagesOnLinux. Note that besides changing the system configuration, you also need to add -XX:+UseLargePages to the JVM startup parameters.

After several rounds of stress testing, we found that CPU usage drops by about 8% once Linux huge pages are enabled. However, because huge pages reserve a fixed amount of memory in advance, the machine's memory usage will be high. In addition, no other application in our production environment has enabled this configuration, so its stability still needs to be studied. Evaluate for yourself whether to enable it in production.
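To get a feel for how much memory has to be reserved, the number of huge pages can be estimated from the heap size. The sketch below is simple arithmetic under one assumption: a huge page size of 2MB, which is common on x86-64 but should be checked against the `Hugepagesize` line in `/proc/meminfo`. The class name is hypothetical, and the result is best treated as a lower bound, since the JVM may need somewhat more than the heap itself.

```java
final class HugePageMath {
    static final long MB = 1024L * 1024L;

    /** Number of huge pages needed to back the given heap size, rounded up. */
    static long pagesNeeded(long heapBytes, long hugePageBytes) {
        if (heapBytes <= 0 || hugePageBytes <= 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        return (heapBytes + hugePageBytes - 1) / hugePageBytes;
    }

    public static void main(String[] args) {
        long heapBytes = 36600 * MB;   // matches -Xms36600m -Xmx36600m used in this article
        long hugePageBytes = 2 * MB;   // assumed huge page size
        System.out.println("huge pages needed: " + pagesNeeded(heapBytes, hugePageBytes));
    }
}
```

For the 36600MB heap used here, this works out to 18300 pages of 2MB, all of which are reserved up front whether or not the heap is full.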

8 Summary

In this article, we explored how to upgrade to JDK 17 and use ZGC, the latest generation of garbage collector. In practice and under stress testing, the upgraded system performed well in garbage collection, with pause times kept within 1 millisecond. Although ZGC consumes additional CPU, the ultra-low pause times it delivers are clearly worth the cost. In short, ZGC's performance and stability are already very good compared with other garbage collectors, and it requires little tuning: in most cases the officially recommended default settings give excellent results. For RT-sensitive applications, upgrading to JDK 17 and adopting ZGC is a wise choice.

Text: Shinichi

This article belongs to Dewu technology original, source: Dewu technology official website

It is strictly forbidden to reprint without the permission of Dewu Technology, otherwise legal responsibility will be investigated according to law!


Origin my.oschina.net/u/5783135/blog/10083247