Fueling the Arm ecosystem: Tencent Kona JDK optimization practice on the Arm architecture

The Arm architecture is widely used in smart devices and embedded systems thanks to its balance of performance and power consumption, and its influence keeps growing. On PCs and in data centers, x86 has traditionally played the leading role. Recently, with the rise of artificial intelligence and cloud computing and the maturing of 5G networks, application needs in the era of the Internet of Everything have become more and more diverse, and so have the requirements placed on chip architectures.

Because it delivers reliable performance with low power consumption and low overhead, the Arm architecture is increasingly used in data centers and cloud computing and has become an indispensable part of them. Amazon has invested heavily in self-developed Arm servers and applied them to AWS services, reducing costs by up to 45%; Alibaba has also adopted Arm servers in large numbers in its cloud services and actively participates in Linaro, Adoptium and other organizations to keep promoting the development of the Arm architecture.

In recent years, Tencent's demand for the Arm architecture has also kept increasing: product lines continue to introduce Arm servers, and demand for Arm software grows with them. The KonaJDK team, which provides high-performance, highly stable commercial JDK releases within Tencent, regards Arm as one of the key architectures supported by KonaJDK and keeps extending the JDK's functionality and improving its performance on Arm.


With the widespread use of the Arm architecture in device and cloud computing scenarios, the JDK needs to support Arm well in order to keep developing. In the JDK community, Arm is currently a first-tier supported architecture. Java's "compile once, run anywhere" property allows business applications to move to the Arm platform seamlessly, and the JDK is a prerequisite for running Java applications, so the JDK's support for Arm is also strong support for the growth of the Arm ecosystem. The KonaJDK team hopes to work closely with Arm throughout this process.

Cooperation between Tencent and Arm on JDK

Tencent and Arm have had in-depth exchanges and cooperation on the JDK. The two parties have held extensive discussions on common JDK performance problems on the Arm architecture and on support for new Arm architecture features.

Introduction to the KonaJDK team's Arm platform optimizations

On the Arm architecture, KonaJDK has so far released two versions, JDK8 and JDK11, and the latest JDK17 version will be released later in 2021. The KonaJDK team supports KonaJDK's general features on Arm in terms of both functionality and performance, and optimizes for architecture-specific characteristics, so that Java applications behave consistently when migrated to the Arm platform and the ground is prepared for wider Arm adoption.

ZGC

With GC, programs no longer need to release memory manually, which effectively reduces the likelihood of memory management errors. For a GC algorithm, however, reclaiming memory accurately and efficiently is a complicated process. GC algorithms keep evolving as business requirements develop, and a business is best served by the GC algorithm that fits its particular goals. In recent years, as server hardware has become more powerful, applications often need larger heaps, ranging from 10 GB to 100 GB or even terabytes. In this environment, the pause times of traditional GC algorithms such as CMS and G1 tend to grow with heap size, and a full GC on a very large heap can even cause pauses of minutes. For latency-sensitive applications, GC pauses have become a stubborn problem that hinders wider adoption, and more suitable GC algorithms are needed to meet the needs of these businesses.

ZGC was introduced into the JDK by JEP 333 with the goal of solving the latency problem caused by GC pauses. Its design goals are: keep each GC pause under 10ms; lose no more than 15% of throughput compared with G1; and support large and very large heaps, with pause times that do not grow with heap size. ZGC first shipped as an experimental feature in JDK11, was extended and improved in subsequent JDK releases, and became a production feature in JDK15. It ensures that Java pause times do not grow with heap size or business scale, and gives businesses with strict GC pause requirements a better choice.


Figure 1 ZGC performance (from The Design of ZGC, Per Lidén)

To meet business needs, the KonaJDK team completed the ZGC functionality in Tencent Kona JDK11 and carried out long-term verification and deployment, so that businesses sensitive to GC pauses can get low GC latency on JDK11 as well. JDK11 was released in the second half of 2018 and is a Long-Term Support (LTS) version; the next LTS version is JDK17, expected in September 2021. The intermediate versions are transitional releases that do not receive ongoing updates and fixes. The KonaJDK team therefore chose to complete ZGC in JDK11. Even after JDK17 is released, upgrading business applications takes time, and JDK11 must still be supported during that period.

Supporting ZGC in JDK11 is a bigger challenge on Arm than on x86. On x86, ZGC has been available as an experimental feature since JDK11, but on Arm it has only been supported since JDK13. The KonaJDK team did a lot of work to bring ZGC to Arm in JDK11:

  • Selecting the appropriate Arm-related JDK commits to port to JDK11

  • Between JDK11 and JDK13, the ZGC and HotSpot code was refactored many times. During porting, the function and impact of each refactoring had to be analyzed, and then either the refactored code was ported as well or the ported code was adapted to the JDK11 code base.

  • Optimizing ZGC, enhancing its functionality, and fixing bugs according to the characteristics of the Arm architecture

  • Arm is a RISC architecture with a weakly ordered memory model. When adapting the relevant assembly code (especially the barriers used by ZGC), the choice of instructions therefore has to be considered carefully, keeping overhead as low as possible while guaranteeing correctness.

  • Testing comprehensively on the Arm platform to ensure the robustness of the code

The biggest difficulty the KonaJDK team encountered while supporting ZGC on the Arm architecture was adding barrier instructions correctly. Because Arm uses a weakly ordered memory model, code that executes correctly on x86 may fail randomly on Arm if a necessary barrier is missing. After the team completed the initial ZGC support code and ran ZGC stress tests, it found that the JDK crashed randomly after several GCs, with a probability of about 1 in 1,000. Analysis of the crash scenes pointed strongly to a missing barrier. The team investigated by studying the community code and the ZGC logic, and the differences in code structure between JDK13 and JDK11 made the analysis harder. In the end the KonaJDK team fixed the problem, and the ZGC code now runs millions of consecutive times on Arm without problems.
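The class of bug described here has an application-level analogue that can be sketched in Java. The example below is not ZGC or HotSpot code; it is a minimal illustration, assuming JDK 9 or later, of publishing a value to another thread with release/acquire ordering through `VarHandle`. The class and method names are hypothetical. If the flag were written and read as a plain field instead, a weakly ordered CPU or the JIT compiler would be free to reorder the accesses, and the reader could observe the flag without the payload.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

// Hypothetical illustration of release/acquire publication; not taken from ZGC.
final class ReleaseAcquirePublish {
    private int payload;     // plain field, written before the flag
    private boolean ready;   // flag, accessed only through the READY handle

    private static final VarHandle READY;
    static {
        try {
            READY = MethodHandles.lookup()
                    .findVarHandle(ReleaseAcquirePublish.class, "ready", boolean.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    void publish(int value) {
        payload = value;
        // Release store: the payload write above cannot be reordered after it.
        READY.setRelease(this, true);
    }

    int awaitAndRead() {
        // Acquire load: the payload read below cannot be reordered before it.
        while (!(boolean) READY.getAcquire(this)) {
            Thread.onSpinWait();
        }
        return payload;
    }

    // Starts a reader thread, publishes a value, and returns what the reader saw.
    static int demo(int value) {
        ReleaseAcquirePublish box = new ReleaseAcquirePublish();
        int[] seen = new int[1];
        Thread reader = new Thread(() -> seen[0] = box.awaitAndRead());
        reader.start();
        box.publish(value);
        try {
            reader.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return seen[0];
    }

    public static void main(String[] args) {
        System.out.println(demo(42));
    }
}
```

Inside the JVM the same ordering has to be expressed directly in the generated assembly, which is why choosing the cheapest sufficient barrier instruction matters so much on Arm.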

Like other GC algorithms, ZGC has its own applicable business scenarios. Its biggest advantage is that it keeps pause times below 10ms, which makes it especially suitable for pause-sensitive services. To achieve such short pauses, however, ZGC pays a price in throughput and memory consumption. ZGC makes several GC tasks concurrent, so that work that would otherwise have to be done during a pause runs alongside the application code, which effectively shortens the necessary pause. This concurrent execution, together with the barriers it introduces, also lowers application throughput to some degree. Thanks to continuous investment by the whole OpenJDK community, the throughput loss of ZGC in unfavorable scenarios has been brought down to a small range. In terms of performance, with large heaps and sufficient memory ZGC can exceed G1 by about 5% to 20% in various benchmarks, while with small heaps it is about 10% lower than G1.

Different businesses therefore need to choose the GC algorithm that best fits their situation, so that both throughput and pause time meet their needs. At present, if a business application uses a very large heap (tens or even hundreds of gigabytes), ZGC is recommended in order to avoid the pauses of tens of seconds or even minutes that a full GC can cause with G1 and other traditional GC algorithms. ZGC is also recommended if the business has strict limits on pause time.
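When switching collectors, it helps to confirm which one the JVM is actually running. In upstream OpenJDK 11, ZGC is experimental and is enabled with `-XX:+UnlockExperimentalVMOptions -XX:+UseZGC`; from JDK 15 on, `-XX:+UseZGC` alone is enough. Whether Kona JDK11 needs the unlock flag is not stated here, so treat that as an assumption to check against its documentation. The sketch below uses the standard `java.lang.management` API to list the active collectors; the class name is hypothetical.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: reports the collectors registered in the running JVM.
final class ActiveCollectors {
    // Returns the names of the GC beans, which depend on the collector selected
    // at startup (for example via -XX:+UseG1GC or -XX:+UseZGC).
    static List<String> names() {
        List<String> names = new ArrayList<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            names.add(gc.getName());
        }
        return names;
    }

    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: %d collections, %d ms%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
    }
}
```

Running the same program once with G1 and once with ZGC gives a quick sanity check that the startup flags took effect before any latency comparison is made.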

KonaFiber

When an application needs to execute multiple tasks concurrently, it typically creates multiple threads, each responsible for one task. As business scale grows, however, creating a thread per task becomes expensive: each thread consumes a significant amount of memory, and thread switching has to go through the kernel, so with a large number of threads the frequent switching overhead also hurts the efficiency of concurrent execution. Coroutines were created to address this. A coroutine is a lightweight thread that balances development efficiency and execution efficiency. Coroutine switching is done in user mode, which is much cheaper than thread switching, and coroutines need less memory. In return, application code has to take some aspects of coroutine switching into account. Compared with threads, coroutines achieve better performance in high-concurrency scenarios and are increasingly widely used. OpenJDK has also launched a project for native Java coroutine support, Project Loom, which has been in development for more than 3.5 years, is still being improved, and will soon become an experimental feature.
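For reference, the thread-per-task baseline that coroutines aim to replace looks roughly like the sketch below. It uses only plain platform threads, so it runs on any JDK; the class name is hypothetical. Each task gets its own kernel-backed thread with its own stack, which is exactly what becomes costly when the task count reaches the tens or hundreds of thousands.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration of the thread-per-task model with platform threads.
final class ThreadPerTask {
    // Runs taskCount trivial tasks, one dedicated thread each, and returns how
    // many of them completed.
    static int runAll(int taskCount) {
        AtomicInteger completed = new AtomicInteger();
        Thread[] threads = new Thread[taskCount];
        for (int i = 0; i < taskCount; i++) {
            threads[i] = new Thread(() -> completed.incrementAndGet());
            threads[i].start();
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return completed.get();
    }

    public static void main(String[] args) {
        System.out.println(runAll(200) + " tasks completed");
    }
}
```

In the Loom API as it was later finalized in upstream JDK 21, the same loop can start virtual threads (for example through `Thread.ofVirtual()`) instead of platform threads. How that API surface looks in KonaFiber on JDK8 and JDK11 is not shown in this article, so it is not sketched here.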

KonaFiber is a coroutine solution implemented by the KonaJDK team. It is compatible with the OpenJDK community's Loom API and provides better switching performance, at the cost of some additional memory. KonaFiber is currently implemented in JDK8 and JDK11 according to business needs, and its community-compatible API allows it to evolve alongside the community solution over the long term. KonaFiber already supports the Arm architecture and can meet the coroutine needs of applications on Arm.


Figure 2 KonaJDK and Loom comparison

To meet business needs and provide better coroutine switching performance, KonaFiber adopts the JKU-based stackful solution, which creates an independent stack for each coroutine. To switch between coroutines, the JDK only needs to check the coroutine's pin status, save the context, and modify the values of the frame pointer and stack pointer. The logic is simple and the performance overhead is small. Compared with the community solution, however, KonaFiber's stackful approach uses more memory, so it suits business scenarios that are less sensitive to memory consumption but more sensitive to performance. The performance data is shown in Figure 3: the left chart compares the number of coroutine switches per second for different numbers of coroutines, and the right chart compares memory consumption.


Figure 3 KonaFiber performance comparison

KonaFiber's implementation focuses on optimization and code refactoring, and keeps being optimized in several ways:

  • Lightweight coroutines: continuously reducing the resources each coroutine consumes

  • On-demand creation: creating coroutines only as the business needs them, to reduce memory usage

  • GC optimization: reducing the overhead that coroutines add to GC

  • Stability fixes: improving robustness through extensive testing and business adaptation

Compared with Loom, the OpenJDK community's coroutine solution, KonaFiber provides higher and more stable scheduling performance. Figure 4 compares the number of scheduling operations per second for KonaFiber and Loom with different numbers of coroutines.


Figure 4 Scheduling performance comparison

At present, KonaFiber has been open sourced in KonaJDK8, and will be open sourced in KonaJDK11 in the future. KonaJDK will continue to follow up with the Loom community and continue to improve the implementation of KonaFiber.

OWST optimization

During a GC, several GC threads process tasks in parallel, but processing time varies from task to task, so the load across GC threads becomes unbalanced. The JDK balances the load and shortens GC pauses as follows: when a GC thread has finished its assigned tasks, it checks the task queues of the other GC threads, and if it finds a task it can execute, it "steals" that task and runs it. This loop continues until the GC ends. The scheme achieves automatic load balancing, but because multiple GC threads may try to steal at the same time, lock contention becomes fierce when the number of threads is large. Spin-waiting also adds overhead, so the algorithm performs unsatisfactorily in practice.
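The basic work-stealing loop described above can be sketched as follows. This is a deliberately simplified illustration, not HotSpot's implementation: the class name is hypothetical, the queues are ordinary `ConcurrentLinkedDeque` instances, and the tasks never create new tasks, so a worker can simply stop once every queue looks empty. All tasks start on worker 0 to force the imbalance that stealing is meant to correct.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedDeque;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical, simplified work-stealing sketch; not the HotSpot task queues.
final class WorkStealingSketch {
    // Runs `tasks` trivial tasks on `workers` threads and returns how many ran.
    static int run(int workers, int tasks) {
        List<ConcurrentLinkedDeque<Runnable>> queues = new ArrayList<>();
        for (int i = 0; i < workers; i++) {
            queues.add(new ConcurrentLinkedDeque<>());
        }
        AtomicInteger done = new AtomicInteger();
        for (int t = 0; t < tasks; t++) {
            queues.get(0).addLast(() -> done.incrementAndGet());  // unbalanced on purpose
        }

        Thread[] threads = new Thread[workers];
        for (int w = 0; w < workers; w++) {
            final int self = w;
            threads[w] = new Thread(() -> {
                while (true) {
                    // Take from the tail of our own queue first.
                    Runnable task = queues.get(self).pollLast();
                    // Otherwise try to steal from the head of each other queue.
                    for (int i = 1; task == null && i < workers; i++) {
                        task = queues.get((self + i) % workers).pollFirst();
                    }
                    if (task == null) {
                        return;  // nothing left anywhere; safe because tasks spawn no tasks
                    }
                    task.run();
                }
            });
            threads[w].start();
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return done.get();
    }

    public static void main(String[] args) {
        System.out.println(run(4, 10_000) + " tasks executed");
    }
}
```

In a real collector, tasks do produce more tasks, so idle threads cannot just exit: they must keep looking for work until all threads agree that none is left. That repeated probing by many idle threads at once is where the contention and spinning costs come from.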

To optimize this process, a paper from Google at ISMM 2016 proposed a new load balancing algorithm: Optimized Work-Stealing Threads (OWST). The basic idea is that when multiple GC threads need to steal tasks, only one thread performs the stealing operation and the others enter a waiting state. The stealing thread checks the task queue of each GC thread and wakes up waiting threads according to the number of tasks available, and the awakened threads then execute those tasks. The algorithm effectively reduces lock contention between GC threads and makes load balancing more efficient overall.

The OpenJDK community first implemented the OWST algorithm in Shenandoah GC; it was merged into the main branch in JDK12 and became the default parallel task terminator. To better support the LTS versions, the KonaJDK team ported the OWST code to JDK8 and JDK11 and completed the related adaptation and testing. After verification by business teams, production-grade OWST support was added to JDK8 and JDK11, effectively reducing the execution time of parallel GC tasks and the GC pause time.

In SPECjbb2015 tests with ParallelGC, OWST improves the critical-jOPS score by about 8% without affecting max-jOPS. For Tencent's internal big data Map/Reduce and Spark SQL tasks, execution performance also improves by more than 10%.

Business Applications

ZGC on the Arm architecture is already in large-scale production use at Tencent.

Because ZGC keeps pause times under 10ms, it is especially suitable for pause-sensitive services. Tencent's WAF team uses Java to iterate on and launch product features quickly. The team runs a bypass security service, an HTTP service built on the Netty framework, with very strict end-to-end latency requirements: its SLA target is that 99.99% of requests complete in less than 80ms. Traditional GC algorithms struggle to meet such a high standard, because their longer pauses hurt the service, so a GC algorithm with shorter pauses was needed. The WAF team previously used G1 and spent a lot of time and effort on G1 option tuning and code-level changes, but because of G1's inherent limitations, request latency still jittered and the SLA target could not be met. With the cooperation of the KonaJDK team, the service then switched to ZGC, and its P9999 request latency has since stayed below 80ms, giving users faster and more stable service.
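To make the P9999 target concrete, the sketch below computes a nearest-rank percentile over a set of request latencies and checks it against a limit. The class and method names are hypothetical and the sample data is made up purely for illustration; it is not measurement data from the WAF service. The percentile is passed as a fraction of integers (9999/10000) to avoid floating-point rounding in the rank calculation.

```java
import java.util.Arrays;

// Hypothetical helper for checking a tail-latency SLA; sample data is synthetic.
final class LatencyPercentile {
    // Nearest-rank percentile: the smallest sample such that at least
    // numerator/denominator of all samples are less than or equal to it.
    static long percentile(long[] samplesMs, long numerator, long denominator) {
        if (samplesMs.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        long[] sorted = samplesMs.clone();
        Arrays.sort(sorted);
        long rank = (sorted.length * numerator + denominator - 1) / denominator;  // ceiling
        return sorted[(int) Math.max(rank, 1) - 1];
    }

    static boolean meetsSla(long[] samplesMs, long limitMs) {
        return percentile(samplesMs, 9999, 10000) < limitMs;
    }

    public static void main(String[] args) {
        long[] samples = new long[10_000];
        Arrays.fill(samples, 5);       // synthetic: most requests take 5 ms
        samples[0] = 200;              // one slow request out of 10,000
        System.out.println("P99.99 = " + percentile(samples, 9999, 10000) + " ms");
        System.out.println("meets 80 ms SLA: " + meetsSla(samples, 80));
        samples[1] = 200;              // a second slow request breaks the target
        System.out.println("meets 80 ms SLA: " + meetsSla(samples, 80));
    }
}
```

With 10,000 requests, the target tolerates exactly one request over the limit, which is why even occasional long GC pauses are enough to miss it.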

Follow-up plan

On the Arm architecture, the KonaJDK team currently focuses on optimizing and supporting JDK8 and JDK11, and will support JDK17 and other versions in the future. The team will continue to analyze and test modules such as the JDK base class library, runtime, memory management, and execution engine, extending the JDK's functionality and improving its performance.

The KonaJDK team will continue to treat Arm as a key supported platform and keep increasing its investment in the JDK's Arm support to meet the growing demand for the Arm architecture.
