Tencent JDK Support for Domestic CPU Architectures

GIAC (Global Internet Architecture Conference) is an annual technology architecture conference for architects, technical leaders, and senior technical practitioners, organized by the High Availability Architecture community and msup, with a long-standing focus on Internet technology and architecture. It is one of the largest technology conferences in China.

At this year's sixth GIAC, under the Java track's topic on the evolution of big data architecture, Tencent senior engineer Dr. Fu Jie gave a talk titled "Tencent JDK Support for Domestic CPU Architectures." The following is a transcript of the speech:

Dear guests, good afternoon everyone! I am very happy to have the opportunity to share the topic of Tencent JDK's support for domestic CPU architectures with you. I am Jiefu (Fu Jie) from Tencent's JVM team. I began working on OpenJDK research and development during my graduate and Ph.D. studies at the Institute of Computing Technology, Chinese Academy of Sciences, and I am currently a committer in the OpenJDK community. I previously worked at Loongson as a core developer of the OpenJDK MIPS port, where I developed and implemented the OpenJDK C2 compiler for Loongson processors. Since joining Tencent, I have mainly been devoted to the exploration and practice of KonaJDK in the fields of big data and machine learning.

Today, I will first give you a brief introduction to Tencent Kona JDK; then I will elaborate on the JVM's support for domestic CPU architectures; finally, I will discuss with you the impact of the processor memory model on JVM implementation.

Introduction to Tencent Kona JDK

Tencent Kona is a JDK product developed by Tencent based on OpenJDK. It was open-sourced free of charge in 2019 and provides long-term support (LTS). Every Kona release is tested and verified on Tencent Cloud and in Tencent's internal production environments. You are welcome to download and use it.

When JDK 14 was released in March 2020, our company was one of only a few Chinese companies to enter the global list of outstanding contributors/organizations. The OpenJDK contributor list, announced by Oracle with each new JDK release, is an authoritative accounting of the contributions that companies and individuals around the world make to OpenJDK.

Tencent's JVM team (which includes many OpenJDK community authors/committers) is responsible for developing and maintaining Kona. In the last six months alone, the team has contributed dozens of bug-fix patches to the OpenJDK community. At the same time, Tencent (affectionately known as the "Goose Factory") contributes its experience with massive production workloads and its cutting-edge practices back to the OpenJDK community. Going forward, we will actively embrace open source with an even more open attitude and continue to contribute.

JVM support for domestic CPU architecture

Now let me share how the JVM supports domestic CPU architectures. Domestic processors are the foundation of China's Xinchuang (IT application innovation) industry. The domestic processors that have entered the official catalog currently fall into four major architectures: ARM, MIPS, Alpha, and X86. ARM is represented by Kunpeng and Phytium (Feiteng), MIPS by Loongson, Alpha by Sunway (Shenwei), and X86 by Zhaoxin and Hygon (Haiguang). Of these four architectures, only ARM and X86 are supported by the OpenJDK community; MIPS and Alpha have no upstream support and must be developed and maintained independently. Mastering the techniques by which the JVM supports a processor is therefore of great significance for breaking foreign monopolies and promoting the sustainable, healthy development of domestic processors.

OpenJDK's HotSpot is the most widely used high-performance Java virtual machine in the world. At the macro design level, HotSpot can be divided into four modules: the class loader, the runtime, the execution engine, and the garbage collector. Of these, only the execution engine is closely tied to the processor architecture; the other three modules are almost platform-independent (or only partially tied to the operating system, as with the runtime). The execution engine is responsible for translating Java bytecode into machine instructions supported by the processor hardware, so it is the module most dependent on the CPU. Supporting a domestic processor architecture in the JVM therefore essentially means implementing the JVM execution engine on that processor. So how should the execution engine be implemented at the code level?

The left part of this slide shows the organization of the HotSpot source code. According to its dependence on the underlying hardware and operating system, the HotSpot source is divided into four subdirectories: cpu (processor-dependent), os (operating-system-dependent), os_cpu (dependent on both processor and operating system), and share (platform-independent). The middle part lists the main functionality implemented in each subdirectory; the parts marked in yellow are those tied to the CPU architecture. The right side takes ARM's AArch64 architecture as an example to quantify how much code the JVM needs to support one processor architecture: the CPU-architecture-dependent code is about 64,000 lines, while the remainder is about 700,000 lines, so architecture support accounts for roughly 8% of the code. The architecture-dependent code mainly comprises the assembler, the interpreter, and the compiler backend. In addition, since the Java language natively supports multithreading, the port must also provide the processor's atomic operations and memory barriers to guarantee the correctness of concurrent programs. Below we examine the assembler, the interpreter, the compiler, and CPU atomic operations and memory barriers one by one.

The assembler is the first module to implement, because the construction of both the interpreter and the compiler depends on the interfaces it provides. The assembler abstracts and encapsulates the processor hardware, exposing the registers and instructions needed for code generation. Functionally it is the simplest of these modules, but from an engineering standpoint the work is arduous: modern processors support thousands of instructions, and errors in instruction formats and encodings are easy to introduce. Developers must be familiar with the processor's instruction set and be careful during implementation.

After the assembler is complete, the interpreter must be implemented next. Let me ask everyone a question: can you skip the interpreter and directly implement only the compiler of the HotSpot virtual machine? Some people, noting the interpreter's low performance, would like to drop the interpreter module to reduce the workload of porting the JVM to a CPU architecture. The answer is no. HotSpot must rely on the interpreter. First, for some special Java methods (for example, very large ones), the compiler refuses to compile, and they can only be executed by the interpreter. Second, HotSpot's compilers, especially C2, make extensive use of aggressive optimizations based on assumptions that are not always true; when an assumption fails, the virtual machine must fall back from compiled execution to the interpreter to continue. Finally, in scenarios that require fast startup and response, direct interpretation may even outperform compiling first and then executing. The interpreter is therefore indispensable.

HotSpot's interpreter is a high-performance template interpreter. A "template" is a sequence of assembly instructions that implements the semantics of a Java bytecode. This slide shows how an add method is compiled by javac into four bytecodes and then interpreted. Interpretation is simply the process of executing, bytecode by bytecode along the program's control flow, the instruction sequence in each bytecode's template. The right side of the slide shows the interpreter template for the integer-add bytecode iadd: the machine instructions in the upper yellow dashed box fetch the operands, the single add instruction in the middle implements the iadd semantics, and the machine instructions in the lower yellow dashed box jump to the template of the next bytecode to continue execution. All interpreter templates follow a fixed pattern: fetch the operands, execute, then dispatch to the next template.

Once the interpreter is successfully debugged, compiler support can begin. Compiler support is the most difficult part, with the longest debugging cycle. HotSpot contains two compilers, C1 and C2. C1 compiles quickly but generates lower-quality code; it suits scenarios that require fast startup and response, and is therefore also called the client compiler. C2 generates high-quality code but compiles slowly; it suits server applications that run repeatedly for a long time, and is therefore also called the server compiler. Compared with C1, C2 uses more numerous and more aggressive optimization algorithms, so it is also more complex. Since the two compilers are structurally similar, let's take the more complex C2 as an example of how to implement compiler support for a new CPU architecture in the JVM.

This slide shows how the C2 compiler is structured. To reduce the difficulty of porting, C2 is split into a platform-independent part and a platform-dependent part. The platform-independent code applies to all processor architectures; only the platform-dependent part needs to be ported and adapted for each architecture. Furthermore, to reduce the amount of platform-dependent code that must be written by hand, C2 uses the ADL compiler to generate the architecture-specific code automatically. ADL stands for Architecture Description Language, an architecture description language embedded in the OpenJDK source code. The ADL compiler produces C2 code by parsing architecture description files (files with the *.ad suffix, such as aarch64.ad). Most of the work of supporting C2 on a new processor architecture is therefore writing the processor's architecture description file correctly. The description file covers three main aspects: register descriptions, operand descriptions, and instruction descriptions.

This slide shows register descriptions, using AArch64 as the example. Register descriptions usually cover general-purpose registers, floating-point registers, and vector registers. For compatibility with 32-bit operating systems, registers are described in 32-bit units. For example, R1 and R1_H in the upper part of the slide together represent the 64-bit R1 register, while V0, V0_H, V0_J, and V0_K in the lower part together represent the 128-bit V0 vector register.

This slide shows operand descriptions. Operands describe the kinds of data the processor supports directly, in three categories: immediates, register operands, and memory operands. Each category is further subdivided into specific subtypes such as character, integer, floating-point, and pointer.

This slide shows an instruction description. Note that instruction descriptions do more than state which instructions the hardware supports: they also drive C2's instruction selection and generation, and thus affect the performance of the compiled code. In effect, the instruction descriptions in the architecture file specify how CPU machine instructions match the compiler's intermediate representation. The addI_reg_reg description on the left side of the slide matches an AddI node and its operands src1/src2 in the compiler's intermediate representation, as shown in the figure on the right.
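Putting the three kinds of descriptions together, an aarch64-style .ad fragment looks roughly like the following. This is a simplified reconstruction from memory for illustration, not an authoritative quote of aarch64.ad; operand class names and attributes may differ in the real file.

```text
// Register description: 32-bit units; R1 + R1_H together form the 64-bit r1.
reg_def R1   (SOC, SOC, Op_RegI, 1, r1->as_VMReg());
reg_def R1_H (SOC, SOC, Op_RegI, 1, r1->as_VMReg()->next());

// Operand description: a 32-bit integer register operand.
operand iRegI()
%{
  constraint(ALLOC_IN_RC(any_reg32));
  match(RegI);
  interface(REG_INTER);
%}

// Instruction description: match the AddI IR node with the addw instruction.
instruct addI_reg_reg(iRegINoSp dst, iRegI src1, iRegI src2)
%{
  match(Set dst (AddI src1 src2));
  format %{ "addw  $dst, $src1, $src2" %}
  ins_encode %{
    __ addw(as_Register($dst$$reg),
            as_Register($src1$$reg),
            as_Register($src2$$reg));
  %}
  ins_pipe(ialu_reg_reg);
%}
```

The `match(Set dst (AddI src1 src2))` line is the heart of it: the ADL compiler turns these patterns into C2's instruction-selection tables.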

Once the register, operand, and instruction descriptions are complete, the JVM's support for the CPU architecture is nearly done. At this point, you must not forget the CPU atomic operations and memory barriers mentioned earlier. As the next slide shows, HotSpot defines very clear interfaces for atomic operations and memory barriers; you only need to implement them one by one according to the processor's characteristics. Everyone is familiar with atomic operations, but what is a memory barrier? I will introduce it in detail in the next section.

The processor memory model and JVM implementation

Finally, let's discuss the impact of the processor memory model on JVM design. Why single out this topic? Years of practical experience tell us that the part of JVM implementation that most tests an engineer's skill is adapting the JVM to the processor's memory model. This work determines whether the virtual machine can run stably on the processor, so I hope it gets your full attention.

Processor memory models range from strong to weak. The strong memory model is represented by X86; weak memory models are represented by architectures such as ARM and PowerPC. How is the strength of a memory model defined? This slide shows the basis for the classification: how much reordering of memory access instructions the processor allows. In general, the more reordering is allowed, the weaker the memory model, and vice versa. Memory accesses divide into reads (Load) and writes (Store), so the possible reorderings are read-read (Load/Load), read-write (Load/Store), write-read (Store/Load), and write-write (Store/Store). X86 processors allow only write-read (Store/Load) reordering, while ARM and PowerPC allow all four. X86 is therefore generally regarded as a strong memory model, and ARM and PowerPC as weak memory models.

In programming, however, and especially in concurrent programming, we sometimes need to forbid the processor's reordering behavior. That is done with memory barriers. A "memory barrier" is a machine instruction, supported by the processor hardware, whose specific purpose is to forbid the reordering of particular memory access instructions. As the next slide shows, the HotSpot virtual machine provides a memory barrier interface for each of the four possible reorderings. For example, to forbid write-read reordering on an X86 processor, you only need to call the barrier interface OrderAccess::storeload(). Beyond these four basic interfaces, the virtual machine also defines acquire, release, and fence: acquire forbids read-read and read-write reordering, release forbids read-write and write-write reordering, and fence forbids all reordering.

The compiler must fully adapt to the processor's memory model during instruction generation. This slide shows the C2 compiler's MemBarStoreStore intermediate node and the target code generated for it on the X86 and AArch64 architectures. The semantics of MemBarStoreStore is to forbid write-write reordering by the processor. Since the X86 memory model already forbids write-write reordering, no extra machine instruction is needed on X86 to guarantee correctness. An AArch64 processor, by contrast, does allow write-write reordering, so an explicit write-write memory barrier must be emitted to implement the node's semantics. In general, the weaker the memory model, the more memory barriers must be generated.

What happens if the JVM does not correctly adapt to the processor's memory model? Bugs, inevitably. Such bugs are typically random, hard to reproduce, and varied in appearance, which makes them difficult to analyze and debug. Let me share a bug I fixed (JDK-8229169) in which OpenJDK was not correctly adapted to the memory model. It was first fixed in JDK 14 and then backported to LTS versions such as JDK 8 and JDK 11.

The bug lies in the work-stealing phase of HotSpot's garbage collection framework and affects every garbage collector except the serial GC. Its mechanism: when the processor executes the GenericTaskQueue::pop method, the two reads of _age (shown in yellow on the slide) can be reordered by the processor. The fix inserts a read memory barrier (shown in green on the slide) between the two reads to forbid out-of-order loads. Someone may ask: since X86 processors do not reorder loads, is the barrier unnecessary on X86, as in the alternative patch in the lower right corner of the slide? The correct answer is that X86 also needs the OrderAccess::loadload() fix. Although X86 hardware does not reorder loads at execution time, the C++ compiler may reorder this code at compile time; to forbid compile-time reordering, X86 needs the patch too. As this analysis shows, the JVM's OrderAccess barriers forbid reordering by both the processor and the compiler. Please keep this point in mind in your future development work.

That is all I have to share with you today. Thank you all! In addition, everyone is welcome to follow and star Tencent Kona JDK 8:

https://github.com/Tencent/TencentKona-8

At the same time, outstanding developers are welcome to join the Tencent JVM R&D team: scan the QR code below, or click [ Read the full text ] to join us!

Reply with the keyword [GIAC] to this official account to receive the speakers' slides.

Originally published at blog.csdn.net/Tencent_TEG/article/details/108301544