Basic Skills | Java Just-In-Time Compiler: Principle Analysis and Practice


1. Overview

Reprinted: Basic Skills | Principle Analysis and Practice of Java Just-In-Time Compiler

Compiled languages such as C++ usually compile source code directly into machine code that the CPU can execute. To achieve the "compile once, run anywhere" property, Java splits the compilation process into two stages: javac first compiles the source code into a universal intermediate form, bytecode, and an interpreter then translates the bytecode into machine code instruction by instruction at run time. In terms of performance, Java is therefore usually inferior to ahead-of-time compiled languages such as C++.

To optimize Java's performance, the JVM introduces a Just-In-Time (JIT) compiler in addition to the interpreter. When a program starts running, the interpreter takes effect first, so code can execute immediately. As time passes, the just-in-time compiler gradually kicks in, compiling more and more code into optimized native code for higher execution efficiency. The interpreter then serves as a fallback for compiled execution: when an unreliable compiler optimization goes wrong, execution can switch back to interpretation, guaranteeing that the program still runs correctly.

The just-in-time compiler greatly improves the running speed of Java programs, and compared with static compilation it can selectively compile only the hot code, saving a great deal of compilation time and space. Just-in-time compilers are now very mature, and in terms of peak performance can even rival ahead-of-time compiled languages. In this field, however, people are still exploring how to combine different compilation methods and use more intelligent means to improve program speed.

2. The execution process of Java

The overall execution process of Java can be divided into two parts. First, javac compiles the source code into bytecode; this step performs lexical analysis, syntax analysis, and semantic analysis, and in compiler theory is called front-end compilation. Next, the bytecode is interpreted and executed instruction by instruction, with no compilation up front. While interpreting, the virtual machine collects information about the running program, and based on this information compiles bytecode into machine code. Not all code is compiled, though: only the code the JVM identifies as hot may be compiled.

What counts as hot code? The JVM sets a threshold: when the number of calls to a method or code block within a certain period exceeds this threshold, the code is compiled and stored in the CodeCache. The next time this code is reached, the machine code is read from the CodeCache and executed directly, improving the program's performance. The overall execution process is roughly as follows:

(Figure: overall Java execution flow)
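The counter-and-CodeCache mechanism just described can be sketched as a toy model in plain Java. The class, method names, and threshold below are purely illustrative; the real JVM keeps these counters in its internal method metadata and stores machine code, not strings, in the CodeCache:

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of hot-code detection: count invocations per method and,
// once a threshold is crossed, pretend to "compile" by caching a result.
class HotSpotModel {
    static final int THRESHOLD = 3; // illustrative; e.g. C2's CompileThreshold defaults to 10000

    private final Map<String, Integer> invocationCounts = new HashMap<>();
    private final Map<String, String> codeCache = new HashMap<>();

    // Runs `method`: once its counter passes THRESHOLD, "machine code"
    // is placed in the code cache and all later calls hit the cache.
    String execute(String method) {
        if (codeCache.containsKey(method)) {
            return "compiled";        // served directly from the code cache
        }
        int count = invocationCounts.merge(method, 1, Integer::sum);
        if (count > THRESHOLD) {
            codeCache.put(method, "native code of " + method);
        }
        return "interpreted";
    }
}
```

With a threshold of 3, the first four calls run interpreted (the fourth one fills the cache), and every call after that hits the code cache.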

2.1 Compilers in the JVM

Two compilers are integrated in the JVM, the Client Compiler and the Server Compiler, and their roles differ. The Client Compiler focuses on startup speed and local optimizations; the Server Compiler pays more attention to global optimizations and produces better-performing code, but because it does more global analysis it is slower to get going. The two compilers suit different application scenarios and work in the virtual machine at the same time.

2.2 Client Compiler

HotSpot VM ships with a Client Compiler, the C1 compiler. This compiler starts fast, but its generated code performs worse than the Server Compiler's. C1 does three things:

  • Local, simple, reliable optimizations, such as basic optimizations on the bytecode, method inlining, and constant propagation, while giving up many time-consuming global optimizations.

  • The bytecode is constructed into a high-level intermediate representation (High-level Intermediate Representation, hereinafter referred to as HIR). The HIR is independent of the platform and usually adopts a graph structure, which is more suitable for the JVM to optimize the program.

  • Finally, the HIR is converted into a low-level intermediate representation (Low-level Intermediate Representation, hereinafter LIR), and register allocation, peephole optimization, and other passes are performed on the LIR before machine code is finally generated. (Peephole optimization is a local technique: within one or a few basic blocks, the compiler examines the generated code and, using conversion rules that may improve performance together with the characteristics of the CPU's own instruction set, rewrites short instruction sequences into faster equivalents.)

2.3 Server Compiler

The Server Compiler mainly focuses on global optimizations that take longer to compile, and even performs some unreliable aggressive optimizations based on information about the running program. This compiler takes longer to warm up and is suitable for long-running background programs; its peak performance is usually more than 30% higher than the Client Compiler's. Currently, two kinds of Server Compiler are used in the HotSpot virtual machine: C2 and Graal.

2.4 C2 Compiler

In Hotspot VM, the default Server Compiler is the C2 compiler.

When the C2 compiler performs compilation optimization, it will use a graph data structure that combines control flow and data flow, called Ideal Graph. The Ideal Graph represents the current program's data flow and the dependencies between instructions. With this graph structure, some optimization steps (especially those involving floating code blocks) become less complex.

The Ideal Graph is constructed while parsing the bytecode: nodes are added to an initially empty graph according to the bytecode instructions. A node in the graph usually corresponds to a block of instructions, and each block contains multiple associated instructions. During parsing, the JVM applies optimization techniques to these instructions, such as Global Value Numbering and constant folding, and after parsing it also performs some dead-code elimination. Once the Ideal Graph has been generated, some global optimizations are performed on it, combined with the collected program-run information. At this stage, if the JVM determines that global optimization is not needed, it skips this part of the optimization.

Whether or not global optimization is performed, the Ideal Graph is then converted into a MachNode Graph that is closer to the machine level, and the final machine code is generated from the MachNode Graph. Before machine code is emitted there are further passes, including register allocation and peephole optimization. The Ideal Graph and the various global optimization methods are introduced in detail in the following chapters. The Server Compiler's compilation and optimization process is shown in the following figure:

(Figure: Server Compiler compilation and optimization flow)

2.5 Graal Compiler

Starting from JDK 9, a new Server Compiler, Graal compiler, is integrated in Hotspot VM. Compared with the C2 compiler, Graal has several key features:

  • As mentioned earlier, the JVM collects all kinds of information about the running program during interpretation, and the compiler then performs prediction-based aggressive optimizations on top of this information, such as branch prediction: according to the running probability of the program's different branches, it selectively compiles the high-probability branches. Graal favors this kind of optimization more than C2 does, so Graal's peak performance is generally better than C2's.

  • Written in Java, it is more friendly to the Java language, especially new features such as Lambda and Stream.

  • Deeper optimizations, such as inlining of virtual functions, partial escape analysis, etc.

The Graal compiler can be enabled via the Java virtual machine parameters -XX:+UnlockExperimentalVMOptions -XX:+UseJVMCICompiler. When enabled, it replaces the C2 compiler in HotSpot and responds to compilation requests that would otherwise be handled by C2.

3. Layered compilation

Before Java 7, developers had to choose a compiler according to the nature of the service. For services that need to start quickly, or that will not run for long, C1, with its higher compilation efficiency, could be chosen with the -client parameter. For long-running services, or background services that demand peak performance, C2, with its better peak performance, could be chosen with the -server parameter. Java 7 introduced the concept of layered compilation, which combines the advantages of C1 and C2 and pursues a balance between startup speed and peak performance. Layered compilation divides the execution state of the JVM into five levels:

  • Level 0: interpreted execution.
  • Level 1: C1-compiled code without profiling.
  • Level 2: C1-compiled code with only method-invocation-count and loop-back-edge-count profiling.
  • Level 3: C1-compiled code with full profiling.
  • Level 4: C2-compiled code.

Profiling means collecting data that reflects the program's execution state. The most basic profiling data are the method invocation count and the loop back-edge count.

Usually, C2 code executes more than 30% more efficiently than C1 code. Among the levels executed by C1, execution efficiency from high to low is level 1 > level 2 > level 3. Of the five levels, levels 1 and 4 are terminal states: once a method reaches a terminal state, the JVM will not issue another compilation request for it as long as the compiled code is not invalidated. When a service actually runs, the JVM chooses different compilation paths for it, starting from interpreted execution, until a terminal state is reached. The following figure lists several common compilation paths:

(Figure: common layered-compilation paths)

  • The first path in the figure represents the general case: a hot method is interpreted, then compiled by C1 at level 3, and finally compiled by C2 at level 4.

  • If a method is relatively small (such as the getter/setter methods common in Java services), level-3 profiling collects no valuable data, so the JVM concludes that the method's C1 code and C2 code would execute equally efficiently and takes the second path in the figure: after the level-3 compilation it gives up on C2 and instead runs the level-1 C1 compilation of the method.

  • When C1 is busy, path ③ in the figure is taken: the program is profiled during interpreted execution and then compiled directly by C2 at level 4 based on that information.

  • As mentioned above, execution efficiency within C1 is level 1 > level 2 > level 3, and level 3 is generally more than 35% slower than level 2. Therefore, when C2 is busy, path ④ in the figure is taken: the method is first compiled by C1 at level 2, and later by C1 at level 3, to reduce the time the method spends at level 3.

  • If the compiler performs aggressive optimizations, such as branch prediction, and discovers at run time that a prediction was wrong, it de-optimizes and falls back to interpreted execution; this is path ⑤ in the figure.

In general, C1 compiles faster and C2 compiles better code; the different compilation paths of layered compilation are the process by which the JVM finds the best balance point according to how the current service behaves. Since JDK 8, the JVM has enabled layered compilation by default.

4. Trigger of just-in-time compilation

The Java virtual machine triggers just-in-time compilation based on the method invocation count and the loop back-edge count. The loop back edge is a concept from the control-flow graph; in the program it can simply be understood as an instruction that jumps backward, such as in the following code:

loop back edge

public void nlp(Object obj) {
  int sum = 0;
  for (int i = 0; i < 200; i++) {
    sum += i;
  }
}

The above code compiles into the bytecode below, in which the bytecode at offset 18 jumps back to the bytecode at offset 4. During interpretation, each time this instruction runs, the Java virtual machine increments the method's back-edge counter by 1.

bytecode

public void nlp(java.lang.Object);
    Code:
       0: iconst_0
       1: istore_1
       2: iconst_0
       3: istore_2
       4: iload_2
       5: sipush        200
       8: if_icmpge     21
      11: iload_1
      12: iload_2
      13: iadd
      14: istore_1
      15: iinc          2, 1
      18: goto          4
      21: return

During just-in-time compilation, the compiler recognizes the head and tail of the loop. In the bytecode above, the head and tail of the loop body are the bytecodes at offsets 11 and 15, respectively. The compiler counts loop iterations by incrementing the back-edge counter at the tail of the loop body.

Just-in-time compilation is triggered when the sum of the method invocation count and the back-edge count exceeds the threshold specified by the -XX:CompileThreshold parameter (1500 by default when using C1, 10000 by default when using C2).

When layered compilation is enabled, the threshold set by -XX:CompileThreshold is no longer used, and compilation is instead triggered by the following conditions:

  • The method invocation count is greater than the threshold specified by the -XX:TierXInvocationThreshold parameter multiplied by a coefficient.
  • The method invocation count is greater than the threshold specified by the -XX:TierXMinInvocationThreshold parameter multiplied by the coefficient, and the sum of the method invocation count and the back-edge count is greater than the threshold specified by the -XX:TierXCompileThreshold parameter multiplied by the coefficient.

Layered compilation trigger condition formula

i > TierXInvocationThreshold * s || (i > TierXMinInvocationThreshold * s && i + b > TierXCompileThreshold * s)
where i is the invocation count and b is the loop back-edge count

Just-in-time compilation is triggered if either of the above conditions is met, and the JVM dynamically adjusts the coefficient s according to the current number of methods being compiled and the number of compiler threads.
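The formula can be turned directly into a small helper method. This is only a sketch of the condition itself, with i, b, and s as defined above; it makes no claim about the actual threshold values of any particular tier:

```java
// Sketch of the layered-compilation trigger condition:
//   i > TierXInvocationThreshold * s
//   || (i > TierXMinInvocationThreshold * s && i + b > TierXCompileThreshold * s)
// i: invocation count, b: loop back-edge count, s: coefficient the JVM
// adjusts dynamically. The three thresholds correspond to the
// -XX:TierX... parameters named in the text.
class TierThresholds {
    static boolean shouldCompile(long i, long b, double s,
                                 long invocationThreshold,
                                 long minInvocationThreshold,
                                 long compileThreshold) {
        return i > invocationThreshold * s
                || (i > minInvocationThreshold * s
                    && i + b > compileThreshold * s);
    }
}
```

Note that a hot loop can trigger compilation through the second clause even when the method itself is called only a few times, because b contributes to the sum.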

5. Compilation optimization

The just-in-time compiler performs a series of optimizations on the running service, including analysis during bytecode parsing, local optimizations on intermediate forms of the code during compilation, and global optimizations based on the program dependence graph, before finally generating machine code.

5.1 Intermediate Representation

In compiler theory, a compiler is usually divided into a front end and a back end. The front end produces an intermediate representation (Intermediate Representation, hereinafter IR) through lexical analysis, syntax analysis, and semantic analysis, and the back end optimizes the IR and generates the target code.

Java bytecode is itself a kind of IR, but the structure of bytecode is complex, and linear, code-like IRs such as bytecode are not well suited to global analysis and optimization. Modern compilers generally use graph-structured IRs, and Static Single Assignment (SSA) IR is the most commonly used. The characteristic of this IR is that each variable is assigned exactly once, and a variable can only be used after it has been assigned. For example:

SSA IR

{
  a = 1;
  a = 2;
  b = a;
}

In the code above we can easily see that the assignment a = 1 is redundant, but the compiler cannot see this directly. A traditional compiler needs data-flow analysis, working from back to front, to determine which variable values are overwritten. With the help of SSA IR, however, the compiler can identify redundant assignments easily.

The pseudocode in SSA IR form of the above code can be expressed as:

SSA IR

{
  a_1 = 1;
  a_2 = 2;
  b_1 = a_2;
}

Since each variable in SSA IR can be assigned only once, the variable a in the code becomes the two SSA variables a_1 and a_2. The compiler can then simply scan these variables and find that a_1 is assigned but never used, so its assignment is redundant.
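This scan is easy to sketch. In the toy model below, each SSA instruction is a String array whose first element is the defined name and whose remaining elements are its operands (a deliberate simplification of real IR); because every name is defined exactly once, a definition is dead exactly when its name never appears as an operand:

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Why SSA makes dead-assignment detection trivial: collect every name
// that appears as an operand, then report definitions whose name is
// never in that set.
class DeadStoreFinder {
    static List<String> deadDefinitions(List<String[]> instructions) {
        Set<String> used = new LinkedHashSet<>();
        for (String[] insn : instructions) {
            for (int i = 1; i < insn.length; i++) {
                used.add(insn[i]); // element 0 is the target; the rest are uses
            }
        }
        List<String> dead = new ArrayList<>();
        for (String[] insn : instructions) {
            if (!used.contains(insn[0])) {
                dead.add(insn[0]);
            }
        }
        return dead;
    }
}
```

Feeding it the three instructions a_1 = 1, a_2 = 2, b_1 = a_2 reports a_1 as dead (and b_1 too, since nothing in this tiny fragment uses it).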

In addition, SSA IR is also very helpful for other optimization methods, such as the following example of Dead Code Elimination:

DeadCodeElimination

public void DeadCodeElimination() {
  int a = 2;
  int b = 0;
  if (2 > 1) {
    a = 1;
  } else {
    b = 2;
  }
  add(a, b);
}

The SSA IR pseudocode can be obtained:

DeadCodeElimination

a_1 = 2;
b_1 = 0;
if true:
  a_2 = 1;
else:
  b_2 = 2;
add(a_2, b_1)

By analyzing this form, the compiler can find that b_2 is never used after its assignment and that the else branch will never execute. After the dead code is removed, the following code is obtained:

DeadCodeElimination

public void DeadCodeElimination() {
  int a = 1;
  int b = 0;
  add(a, b);
}

We can think of each optimization of the compiler as a graph optimization algorithm that takes an IR graph and outputs a transformed IR graph. The process of compiler optimization is a series of optimizations of graph nodes.

5.2 Intermediate representations in C1

As mentioned above, the C1 compiler internally uses the high-level intermediate representation HIR and the low-level intermediate representation LIR for its optimizations, both of which are in SSA form.

HIR is a control flow graph structure composed of many basic blocks (Basic Block), each block contains many instructions in the form of SSA. The structure of the basic block is shown in the following figure:

(Figure: basic block structure in C1's HIR)
Here, predecessors represents the predecessor basic blocks (because there may be several, it is a BlockList structure, an expandable array composed of multiple BlockBegin nodes). Likewise, successors represents the multiple successor basic blocks, held at the BlockEnd. Besides these two parts there is the block body, which contains the program's execution instructions and a next pointer to the next body block to be executed.

The construction of the HIR from bytecode is ultimately done by GraphBuilder. GraphBuilder traverses the bytecode to construct all basic blocks and stores them as a linked structure; at this point, however, each basic block is only a BlockBegin and contains no concrete instructions. In a second step, GraphBuilder uses a ValueStack as the operand stack and local variable table to simulate the execution of the bytecode, constructs the corresponding HIR, and fills in the previously empty basic blocks. The following simple bytecode fragment illustrates the process of constructing HIR:

Bytecode constructs HIR

      Bytecode                 Local variables         Operand stack              HIR
      5: iload_1                  [i1,i2]                 [i1]
      6: iload_2                  [i1,i2]                 [i1,i2]
                                  ................................................   i3: i1 * i2
      7: imul
      8: istore_3                 [i1,i2,i3]              [i3]

It can be seen that when iload_1 executes, the variable i1 is pushed onto the operand stack; when iload_2 executes, i2 is pushed; and when the multiplication instruction imul executes, the two values at the top of the stack are popped, the HIR instruction i3: i1 * i2 is constructed, and the generated i3 is pushed onto the stack.

Most of the C1 compiler's optimizations are performed on the HIR. When optimization is complete, the HIR is converted into LIR. LIR is similar to HIR and is likewise an IR used internally by the compiler, but since optimization has eliminated some intermediate nodes, its form is more simplified.

5.3 Sea-of-Nodes IR

The Ideal Graph in the C2 compiler uses an intermediate representation called Sea-of-Nodes, which is also in SSA form. Its biggest feature is that it removes the concept of variables and operates directly on values. To make this easier to understand, the IR visualization tool Ideal Graph Visualizer (IGV) can be used to display the concrete IR graph, for example for the following code:

example

public static int foo(int count) {
  int sum = 0;
  for (int i = 0; i < count; i++) {
    sum += i;
  }
  return sum;
}

The corresponding IR diagram is as follows:

(Figure: Sea-of-Nodes IR graph of foo)
Sequentially executed nodes in the figure are grouped into the same basic block, such as B0 and B1. The Start node (node 0) in basic block B0 is the method entry, and the Return node (node 21) in B3 is the method exit. Bold red lines are control flow, blue lines are data flow, and lines of other colors are special control flow or data flow. Fixed nodes are connected by control-flow edges; the rest are floating nodes. (A floating node can be placed at any position where its data dependencies are satisfied; the process of placing floating nodes is called Schedule.)

Such graphs have a lightweight edge structure: an edge in the graph is simply represented by a pointer to another node. A node is an instance of a Node subclass and carries an array of pointers specifying its input edges. The advantage of this representation is that changing a node's input edge is very fast: to change an input edge, you only point the pointer at a different Node and store it in the node's pointer array.

Relying on this graph structure and the collected program-run information, the JVM can schedule the floating nodes so as to obtain the best compilation result.
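The pointer-array edge representation can be sketched as follows; IrNode and its fields are illustrative stand-ins, not the actual C2 Node classes:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the lightweight edge structure: a node stores its input
// edges as a list of references to other nodes, so rewiring an input
// is a single reference assignment.
class IrNode {
    final String op;
    final List<IrNode> inputs = new ArrayList<>();

    IrNode(String op, IrNode... in) {
        this.op = op;
        for (IrNode n : in) {
            inputs.add(n);
        }
    }

    // Changing an input edge only overwrites one pointer slot.
    void setInput(int index, IrNode newInput) {
        inputs.set(index, newInput);
    }
}
```

This is why the optimizer can cheaply rewrite the graph, for example redirecting all users of a redundant node to its surviving duplicate, without touching any other part of the IR.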

5.4 Phi and Region Nodes

The Ideal Graph is an SSA IR. Since there is no concept of variables, a problem arises: different execution paths may assign different values to the same variable. For example, in the two branches of the if statement in the following code, a is set to 5 and 6 respectively, and the value read afterwards depends on which path was executed.

example

int test(int x) {
  int a = 0;
  if (x == 1) {
    a = 5;
  } else {
    a = 6;
  }
  return a;
}

To solve this problem, the concept of Phi nodes is introduced: a Phi node can select different values according to the execution path taken. The above code can therefore be represented as the following picture:

(Figure: IR graph with Phi and Region nodes)
A Phi node stores all the values produced on the different paths, and a Region node uses the judgment conditions of the different paths to obtain from the Phi node the value that the variable should hold on the current execution path. The SSA pseudocode with Phi nodes is as follows:

Phi Nodes

int test(int x) {
  a_1 = 0;
  if (x == 1) {
    a_2 = 5;
  } else {
    a_3 = 6;
  }
  a_4 = Phi(a_2, a_3);
  return a_4;
}

5.5 Global Value Numbering

Global Value Numbering (GVN) is an optimization technique made very easy by Sea-of-Nodes.

GVN assigns a unique number to each computed value and then traverses the instructions looking for optimization opportunities; it is a technique that can discover and eliminate equivalent computations. If a program contains multiple multiplications with the same operands, the JIT compiler can merge them into one, reducing the size of the generated machine code; if these multiplications occur on the same execution path, GVN also saves the redundant multiplication operations. In Sea-of-Nodes, since there is only the concept of values, the GVN algorithm becomes very simple: the just-in-time compiler only needs to check whether a floating node performs the same operation as an existing floating node and has identical input IR nodes, and if so, the two floating nodes can be merged into one. For example, the following code:

GVN

a = 1;
b = 2;
c = a + b;
d = a + b;
e = d;

GVN uses a hash algorithm for numbering: computing a = 1 yields number 1, computing b = 2 yields number 2, and computing c = a + b yields number 3; these numbers are stored in a hash table. When the compiler reaches d = a + b, it finds that a + b already exists in the hash table, so it performs no further computation and takes the computed value directly from the hash table. The final e = d can likewise be found in the hash table and reused.

GVN can be understood as Common Subexpression Elimination (CSE) on the IR graph. The difference between the two is that GVN directly compares whether values are the same, whereas CSE relies on the lexical analyzer to decide whether two expressions are the same.
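The numbering scheme described above can be sketched with a hash table. A real compiler hashes IR nodes rather than strings, so the key format here is purely illustrative:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of GVN's hashing step: each constant or (op, operands)
// expression is keyed into a hash table; a repeated key reuses the
// existing value number instead of creating a new computation.
class ValueNumbering {
    private final Map<String, Integer> table = new HashMap<>();
    private int nextNumber = 1;

    int numberOfConstant(String literal) {
        return table.computeIfAbsent("const:" + literal, k -> nextNumber++);
    }

    // Value number of op(left, right); an equivalent expression seen
    // before yields the same number, so the duplicate node can be merged.
    int numberOf(String op, int left, int right) {
        return table.computeIfAbsent(op + "(" + left + "," + right + ")",
                                     k -> nextNumber++);
    }
}
```

Replaying the example, c = a + b and d = a + b receive the same value number, so the second addition can be eliminated.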

6. Method inlining

Method inlining means that when a method call is encountered during compilation, the body of the target method is brought into the scope of the compilation and replaces the original method call. Most JIT optimizations are performed on the basis of inlining, so method inlining is a very important part of a just-in-time compiler.

Java services contain a large number of getter/setter methods. Without method inlining, calling a getter/setter requires saving the current method's execution position, creating and pushing a stack frame for the getter/setter, accessing the field, popping the stack frame, and finally resuming execution of the current method. Once the call to the getter/setter is inlined, only the field access remains. In the C2 compiler, method inlining is done during bytecode parsing: when a method-call bytecode is encountered, the compiler decides, based on some threshold parameters, whether the current method call needs to be inlined; if so, it starts parsing the target method's bytecode. For example, the following example (from the Internet):

method inlining

public static boolean flag = true;
public static int value0 = 0;
public static int value1 = 1;

public static int foo(int value) {
    int result = bar(flag);
    if (result != 0) {
        return result;
    } else {
        return value;
    }
}

public static int bar(boolean flag) {
    return flag ? value0 : value1;
}

IR diagram of the bar method:

(figure)

IR graph after inlining:

(figure)
Inlining not only copies the called method's IR graph nodes into the caller's IR graph; it also does the following.

The called method's parameters are replaced by the arguments passed at the call site. In the example above, the P(0) node (node 1) in the bar method is replaced by the LoadField node (node 3) in the foo method.

In the caller's IR graph, data dependencies on the method-call node become dependencies on the called method's return value. If there are multiple return nodes, a Phi node is generated to aggregate the return values and serve as the replacement for the original call node. In the figure, the edges from the == node (node 8) and the Return node (node 12) that originally connected to the Invoke node (node 5) are redirected to the newly generated Phi node (node 24).

If the called method can throw an exception of some type, and the calling method happens to have a handler for that exception type that covers the method call, then the just-in-time compiler needs to connect the called method's exception-throwing paths to the caller's exception handler.

6.1 Conditions for method inlining

Most of the compiler's optimizations are based on method inlining, so in general, the more methods are inlined, the more efficient the generated code. For a just-in-time compiler, however, the more methods are inlined, the longer compilation takes, and the later the program reaches peak performance.

The number of inlining levels can be adjusted with the virtual machine parameter -XX:MaxInlineLevel, and direct recursive calls are inlined 1 level deep by default (adjustable with the virtual machine parameter -XX:MaxRecursiveInlineLevel). Some common inlining-related parameters are shown in the following table:

(Table: common inlining-related JVM parameters)

6.2 Virtual function inlining

Inlining is the main way the JIT improves performance, but virtual functions make inlining difficult, because at the inlining stage the compiler does not know which method will actually be called. For example, suppose we have a data-processing interface in which one method has three implementations: add, sub, and multi. The JVM stores all of a class's virtual functions in a virtual method table (Virtual Method Table, hereinafter VMT) in the class object, and each instance of the class stores a VMT pointer. At run time, the instance object is loaded first, then the VMT is found through the instance, and finally the address of the target method is found through the VMT; the performance of a virtual call is therefore worse than that of a classic call that jumps directly to a method address. Unfortunately, in Java every call to a non-private member function is a virtual call.
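The dispatch sequence described above (object → VMT pointer → table slot → target method) can be modeled in a few lines. The class names, the table layout, and the use of lambdas as "method addresses" are all illustrative simplifications:

```java
import java.util.function.IntUnaryOperator;

// Toy model of virtual dispatch through a virtual method table: each
// "class" owns a table of function pointers, each "object" carries a
// reference to its class's table, and the call site indexes into the
// table at run time.
class VmtModel {
    static final int METHOD_CALL_SLOT = 0; // slot of methodCall() in the table

    static class Klass {
        final IntUnaryOperator[] vmt;
        Klass(IntUnaryOperator... vmt) { this.vmt = vmt; }
    }

    static class Instance {
        final Klass klass; // the "VMT pointer" stored in every object
        Instance(Klass klass) { this.klass = klass; }
    }

    // A virtual call: load the object, follow its VMT pointer, index
    // the table, then invoke the entry found there.
    static int virtualCall(Instance obj, int slot, int arg) {
        return obj.klass.vmt[slot].applyAsInt(arg);
    }
}
```

The dependent loads in virtualCall are what make an unoptimized virtual call slower than a direct call; they also hide the concrete target from the inliner until profiling reveals which implementation actually runs.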

The C2 compiler is smart enough to detect this situation and optimize for virtual calls. For example the following code example:

virtual call

public class SimpleInliningTest {
    public static void main(String[] args) throws InterruptedException {
        VirtualInvokeTest obj = new VirtualInvokeTest();
        VirtualInvoke1 obj1 = new VirtualInvoke1();
        for (int i = 0; i < 100000; i++) {
            invokeMethod(obj);
            invokeMethod(obj1);
        }
        Thread.sleep(1000);
    }

    public static void invokeMethod(VirtualInvokeTest obj) {
        obj.methodCall();
    }

    private static class VirtualInvokeTest {
        public void methodCall() {
            System.out.println("virtual call");
        }
    }

    private static class VirtualInvoke1 extends VirtualInvokeTest {
        @Override
        public void methodCall() {
            super.methodCall();
        }
    }
}

After JIT optimization, disassembling yields the following assembly code:

 0x0000000113369d37: callq  0x00000001132950a0  ; OopMap{off=476}
                                                ;*invokevirtual methodCall  // a virtual call
                                                ; - SimpleInliningTest::invokeMethod@1 (line 18)
                                                ;   {optimized virtual_call}  // the virtual call has been optimized

You can see that the JIT has turned the virtual_call for the methodCall method into an optimized virtual_call, and the optimized call can then be inlined. However, the C2 compiler's ability here is limited: it is "powerless" against virtual calls with multiple reachable implementations.

For example, in the following code, we add an implementation:

Multiple-implemented virtual calls

public class SimpleInliningTest {
    public static void main(String[] args) throws InterruptedException {
        VirtualInvokeTest obj = new VirtualInvokeTest();
        VirtualInvoke1 obj1 = new VirtualInvoke1();
        VirtualInvoke2 obj2 = new VirtualInvoke2();
        for (int i = 0; i < 100000; i++) {
            invokeMethod(obj);
            invokeMethod(obj1);
            invokeMethod(obj2);
        }
        Thread.sleep(1000);
    }

    public static void invokeMethod(VirtualInvokeTest obj) {
        obj.methodCall();
    }

    private static class VirtualInvokeTest {
        public void methodCall() {
            System.out.println("virtual call");
        }
    }

    private static class VirtualInvoke1 extends VirtualInvokeTest {
        @Override
        public void methodCall() {
            super.methodCall();
        }
    }

    private static class VirtualInvoke2 extends VirtualInvokeTest {
        @Override
        public void methodCall() {
            super.methodCall();
        }
    }
}

After disassembly, the following assembly code is obtained:

code block

 0x000000011f5f0a37: callq  0x000000011f4fd2e0  ; OopMap{off=28}
                                                ;*invokevirtual methodCall  // a virtual call
                                                ; - SimpleInliningTest::invokeMethod@1 (line 20)
                                                ;   {virtual_call}  // the virtual call has not been optimized

It can be seen that the virtual calls of multiple implementations are not optimized and are still virtual_calls.

For this situation, the Graal compiler collects execution information. For example, if after a period of time it finds that the interface method's implementations add and sub are each called with 50% probability, then the JVM inlines add wherever it encounters add and inlines sub wherever it encounters sub, improving the execution efficiency of both paths. If some other, uncommon case is encountered later, the JVM de-optimizes, marks that position, and switches back to interpreted execution when such a case occurs.

7. Escape Analysis

Escape analysis is "a static analysis that determines the dynamic scope of pointers: it analyzes where in a program a pointer can be accessed". The just-in-time compiler of the Java virtual machine performs escape analysis on newly created objects to determine whether an object escapes the thread or the method. The just-in-time compiler uses two criteria to judge whether an object escapes:

  1. Whether the object is stored in the heap (as a static field or an instance field of a heap object). Once the object is stored in the heap, other threads can obtain a reference to it, and the just-in-time compiler can no longer track all the code locations that use it.

  2. Whether the object is passed to unknown code. The just-in-time compiler treats code that has not been inlined as unknown, since it cannot confirm whether the method call stores the caller or the passed arguments in the heap. In this case, the caller and the arguments of the call are simply considered to have escaped.
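A minimal sketch of criterion 1 (the class and field names are made up for illustration): storing an object into a static field publishes it to the heap, where other threads can reach it, so it escapes; an object that never leaves its method does not.

```java
public class EscapeToHeap {
    static Object published;            // heap location visible to all threads

    static void escapes() {
        Object o = new Object();
        published = o;                  // o escapes: criterion 1, stored in the heap
    }

    static int doesNotEscape() {
        int[] box = new int[]{42};      // never leaves this method, never stored in the heap
        return box[0];                  // a candidate for scalar replacement
    }

    public static void main(String[] args) {
        escapes();
        System.out.println(doesNotEscape()); // 42
    }
}
```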

Escape analysis is usually performed on the basis of method inlining, and the just-in-time compiler can perform optimizations such as lock elimination, stack allocation, and scalar replacement based on its results. The following code is an example in which the object does not escape:

public class Example {
    public static void main(String[] args) {
        example();
    }

    public static void example() {
        Foo foo = new Foo();
        Bar bar = new Bar();
        bar.setFoo(foo);
    }
}

class Foo {
}

class Bar {
    private Foo foo;

    public void setFoo(Foo foo) {
        this.foo = foo;
    }
}

In this example, two objects, foo and bar, are created, and one of them is passed as an argument to a method of the other. The method setFoo() stores a reference to the received Foo object. If the Bar object were on the heap, the reference to Foo would escape. But in this case, the compiler can determine through escape analysis that the Bar object itself does not escape the call to example(), which means the reference to Foo cannot escape either. Therefore, the compiler can safely allocate both objects on the stack.

7.1 Lock Elimination

Anyone who has studied Java concurrent programming knows about lock elimination, which is performed on the basis of escape analysis.

If the just-in-time compiler can prove that a lock object does not escape, then the lock and unlock operations on that object are meaningless, because other threads can never acquire the lock object. In that case, the just-in-time compiler eliminates the lock and unlock operations on the non-escaping lock object. In fact, the compiler only needs to prove that the lock object does not escape the thread in order to eliminate the lock. Due to the limitations of the Java virtual machine's just-in-time compilation, that condition is strengthened to proving that the lock object does not escape the method currently being compiled. However, lock elimination based on escape analysis is actually rare.
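A classic candidate for this optimization, sketched here with a method-local StringBuffer: its append methods are synchronized, but the buffer provably never escapes the method, so the JIT may elide the lock and unlock operations entirely (the method name is invented for illustration).

```java
public class LockElision {
    public static String concat(String a, String b) {
        // StringBuffer's append() and toString() are synchronized, but sb is
        // a local that never escapes this method, so escape analysis lets
        // the JIT remove the lock/unlock operations on it.
        StringBuffer sb = new StringBuffer();
        sb.append(a);
        sb.append(b);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(concat("lock ", "elision")); // lock elision
    }
}
```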

7.2 On-Stack Allocation

We all know that Java objects are allocated on the heap, and the heap is visible to all threads. At the same time, the JVM must manage the allocated heap memory and reclaim the memory an object occupies when it is no longer referenced. If escape analysis can prove that certain newly created objects do not escape, the JVM can allocate them on the stack instead and reclaim the memory automatically by popping the current stack frame when the method containing the new statement exits. In this way, the garbage collector is no longer needed to handle objects that are no longer referenced. However, the HotSpot virtual machine does not actually perform stack allocation; instead it uses the technique of scalar replacement. A scalar is a variable that can hold only one value, such as a primitive type in Java code. In contrast, an aggregate can store multiple values at the same time; a typical example is a Java object. The compiler decomposes an unescaped aggregate into multiple scalars inside the method to reduce heap allocation. Here is an example of scalar replacement:

scalar replacement

public class Example {
    @AllArgsConstructor
    static class Cat {
        int age;
        int weight;
    }

    public static void example() {
        Cat cat = new Cat(1, 10);
        addAgeAndWeight(cat.age, cat.weight);
    }
}

After escape analysis, the cat object does not escape the call to example(), so the aggregate cat can be decomposed into two scalars, age and weight. The pseudocode after scalar replacement:

public class Example {
    @AllArgsConstructor
    static class Cat {
        int age;
        int weight;
    }

    public static void example() {
        int age = 1;
        int weight = 10;
        addAgeAndWeight(age, weight);
    }
}

7.3 Partial Escape Analysis

Partial escape analysis is another of Graal's applications of probabilistic prediction. Generally, if an object is found to escape the method or thread, the JVM will not optimize it, but the Graal compiler still analyzes the program's execution paths, collecting and determining, on top of escape analysis, on which paths the object escapes and on which it does not. Based on this information, it then performs optimizations such as lock elimination and stack allocation on the paths where the object does not escape.
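A hypothetical sketch of such a path-sensitive case (all names invented for illustration): the object escapes only on the branch that publishes it into a static field, so on the non-publishing path it can still be scalar-replaced.

```java
public class PartialEscape {
    static class Point {
        final int x, y;
        Point(int x, int y) { this.x = x; this.y = y; }
    }

    static Point cached; // heap location: storing p here makes it escape

    static int squaredLength(boolean publish) {
        Point p = new Point(3, 4);
        if (publish) {
            cached = p;  // p escapes only on this path
        }
        // On the publish == false path p never escapes, so partial escape
        // analysis can scalar-replace it there even though it escapes elsewhere.
        return p.x * p.x + p.y * p.y;
    }

    public static void main(String[] args) {
        System.out.println(squaredLength(false)); // 25
    }
}
```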

8. Loop Transformations

As mentioned in the introduction to the C2 compiler earlier in this article, C2 performs a lot of global optimization after building the Ideal Graph, including loop transformations. The two most important transformations are loop unrolling and loop separation.

8.1 Loop Unrolling

Loop unrolling is a loop transformation technique that tries to optimize execution speed at the cost of a larger program binary; it trades space for time.

Loop unrolling reduces computational overhead by reducing or eliminating the instructions that control the loop, such as the pointer arithmetic that advances to the next index or instruction on each iteration. If the compiler can compute these indexes in advance and build them into the machine instructions, the program does not have to perform this computation at run time. In other words, some loops can be rewritten as repeated, independent statements. For example, the following loop:

loop unrolling

public void loopRolling() {
    for (int i = 0; i < 200; i++) {
        delete(i);
    }
}

The above code calls delete in a loop 200 times; loop unrolling yields the following code:

loop unrolling

public void loopRolling() {
    for (int i = 0; i < 200; i += 5) {
        delete(i);
        delete(i + 1);
        delete(i + 2);
        delete(i + 3);
        delete(i + 4);
    }
}

In this way, unrolling reduces the number of loop iterations, and the computations inside each iteration can also exploit the CPU pipeline for better efficiency. Of course, this is just an example: when actually unrolling, the JVM evaluates the benefit that unrolling would bring before deciding whether to unroll.

8.2 Loop Separation

Loop separation (loop peeling) is another loop transformation technique. It peels one or more special iterations out of the loop and executes them outside of it. For example, the following code:

loop separation

int a = 10;
for (int i = 0; i < 10; i++) {
    b[i] = x[i] + x[a];
    a = i;
}

It can be seen that in this code, except for the first iteration, where a = 10, in every other case a equals i - 1. So the special case can be separated out and the code transformed into the following:

loop separation

b[0] = x[0] + x[10];
for (int i = 1; i < 10; i++) {
    b[i] = x[i] + x[i - 1];
}

This equivalent transformation removes the need for the variable a inside the loop, thereby reducing overhead.

9. Peephole Optimization and Register Allocation

The peephole optimization mentioned above is the last step of optimization, after which the program is converted to machine code. Peephole optimization examines adjacent instructions in the intermediate code (or target code) generated by the compiler and replaces certain combinations of them with more efficient instruction groups; common examples include strength reduction and constant folding. The following is an example of strength reduction:

Strength reduction

y1 = x1 * 3    after strength reduction becomes    y1 = (x1 << 1) + x1

The compiler uses a shift and an addition to reduce the strength of the multiplication, producing a more efficient instruction sequence.
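A quick sanity check of the reduced form (the helper name is invented for illustration): for any int x, x * 3 equals (x << 1) + x, even under overflow, since shifts and multiplication both wrap modulo 2^32.

```java
public class StrengthReduction {
    // x * 3 rewritten as x * 2 + x, using a shift for the doubling
    static int timesThreeShifted(int x) {
        return (x << 1) + x;
    }

    public static void main(String[] args) {
        // Verify equivalence over a range of values, including negatives.
        for (int x = -1000; x <= 1000; x++) {
            if (timesThreeShifted(x) != x * 3) {
                throw new AssertionError("mismatch at " + x);
            }
        }
        System.out.println(timesThreeShifted(7)); // 21
    }
}
```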

Register allocation is also a compilation optimization commonly used in the C2 compiler. By keeping frequently used variables in registers, which the CPU can access far faster than memory, the running speed of the program can be improved.

Register allocation and peephole optimization are the final steps in program optimization. After register allocation and peephole optimization, the program is converted into machine code and stored in codeCache.

10. Practice

Just-in-time compilers are complex, and there is little practical experience with them available online. Here is some tuning experience from our team.

10.1. Important parameters related to compilation

-XX:+TieredCompilation: enable tiered compilation (on by default since JDK 8)
-XX:CICompilerCount=N: number of compiler threads; once set, the JVM divides them automatically, C1:C2 = 1:2
-XX:TierXBackEdgeThreshold: threshold for OSR compilation
-XX:TierXMinInvocationThreshold: per-tier invocation threshold when tiered compilation is enabled
-XX:TierXCompileThreshold: compilation threshold when tiered compilation is enabled
-XX:ReservedCodeCacheSize: maximum codeCache size
-XX:InitialCodeCacheSize: initial codeCache size

-XX:TierXMinInvocationThreshold is the threshold parameter that triggers compilation when tiered
compilation is enabled. Tier X just-in-time compilation is triggered when the method invocation
count exceeds the threshold specified by -XX:TierXInvocationThreshold multiplied by a coefficient,
or when the method invocation count exceeds the threshold specified by
-XX:TierXMinInvocationThreshold multiplied by the coefficient and the sum of the method invocation
count and the loop back-edge count exceeds the threshold specified by -XX:TierXCompileThreshold
multiplied by the coefficient. With tiered compilation enabled, the thresholds are multiplied by a
coefficient that depends on the number of methods currently being compiled and the number of
compiler threads. Lowering the thresholds increases the number of compiled methods, so some
frequently used methods that could not otherwise be compiled can be compiled and optimized to
improve performance.

Because the compilation situation is complex, the JVM also dynamically adjusts the relevant thresholds to ensure its own performance, so manually adjusting compilation-related parameters is not recommended, except in specific cases: for example, when the codeCache fills up and compilation stops, the codeCache size can be increased appropriately; or when some very hot methods are not being inlined and drag down performance, the inlining depth or inlinable method size can be adjusted.

10.2. Analyzing compilation logs with JITWatch

By adding the parameters -XX:+UnlockDiagnosticVMOptions -XX:+PrintCompilation -XX:+PrintInlining -XX:+PrintCodeCache -XX:+PrintCodeCacheOnCompilation -XX:+TraceClassLoading -XX:+LogCompilation -XX:LogFile=LogPath, you can output compilation, inlining, and codeCache information to a file. However, the printed compilation log is large and complex, and it is hard to extract information from it directly, so you can use the JITWatch tool to analyze it. Select the log file with Open Log on the JITWatch home page and click Start to begin analyzing it.

(figure: the JITWatch analysis interface)
As shown in the figure above, area 1 lists all the Java classes in the project, including third-party dependencies. Area 2 is the function area: Timeline displays the JIT compilation timeline graphically, Histo is a histogram of various statistics, TopList ranks objects and data produced during compilation, Cache shows the free codeCache space, NMethod lists the native methods, and Threads shows the JIT compiler threads. Area 3 displays JITWatch's analysis of the log, where Suggestions gives code-optimization advice, for example as shown in the following figure:

(figure: JITWatch Suggestions view)
We can see that when the read method of ZipInputStream is called, the method is not marked as a hot method and is also "too big", so it cannot be inlined. The inline directive of -XX:CompileCommand can force a method to be inlined, but it should be used with caution: unless it is certain that inlining a method will bring a significant performance improvement, it is not recommended, and heavy use puts a lot of pressure on the compiler threads and the codeCache.

The -Allocs and -Locks annotations in area 3 show the optimizations the JVM made to the code after escape analysis, including stack allocation and lock elimination.

10.3. Using the Graal compiler

Since the JVM dynamically adjusts the compilation thresholds according to the current number of compiling methods and compiler threads, there is not much room to tune this part in a real service; the JVM already does enough.

To improve performance, we tried the latest Graal compiler in our service. It only takes -XX:+UnlockExperimentalVMOptions -XX:+UseJVMCICompiler to have the Graal compiler start in place of C2 and serve C2 compilation requests. Note, however, that the Graal compiler is not compatible with ZGC and can only be used with G1.

As mentioned earlier, Graal is a just-in-time compiler written in Java, integrated into the JDK as an experimental just-in-time compiler since Java 9. The Graal compiler comes out of GraalVM, a high-performance execution environment that supports multiple programming languages. It can run on traditional OpenJDK, be compiled ahead of time (AOT) into a standalone executable, or even be embedded in a database.

As mentioned several times above, Graal's optimizations are based on assumptions. When an assumption turns out to be wrong, the Java virtual machine uses deoptimization to switch from the machine code generated by the just-in-time compiler back to interpreted execution; if necessary, it even discards that machine code and compiles again after re-collecting the program profile.

These aggressive measures give Graal better peak performance than C2, and it does even better for languages such as Scala and Ruby. Twitter already uses Graal extensively in its services to improve performance, and the enterprise version of GraalVM has improved Twitter's service performance by 22%.

Performance after using the Graal compiler

In our online service, after enabling the Graal compiler, TP9999 dropped from 60 ms to 50 ms, a decrease of 10 ms (16.7%), and peak performance during operation was higher. It can be seen that the Graal compiler brought a certain performance improvement to this service.

Problems with the Graal compiler

The Graal compiler optimizes more aggressively, so it performs more compilation at startup, and the Graal compiler itself must also be just-in-time compiled, so performance is poor right after the service starts.

Solutions under consideration: JDK 9 introduced the jaotc tool, and GraalVM's Native Image can statically compile a service and greatly improve its startup speed. However, GraalVM then uses its own garbage collector, a very primitive copying collector whose performance is poor compared with excellent modern collectors such as G1 and ZGC. In addition, GraalVM's support for some Java features is configuration-based; for example, reflection requires a JSON file listing every class accessed reflectively, and for a service that uses reflection heavily such configuration is a lot of work. We are researching this area as well.

11. Summary

This article mainly introduced the principles of JIT just-in-time compilation, some practical experience at Meituan, and the use of the most cutting-edge just-in-time compilers. As a technique for improving performance in interpreted languages, JIT is quite mature and used in many languages. For Java services, the JVM itself already does enough, but we should keep deepening our understanding of JIT's optimization principles and the latest compilation techniques, so as to make up for JIT's shortcomings, improve the performance of Java services, and keep striving for excellence.

Origin blog.csdn.net/qq_21383435/article/details/127344021