How Just-In-Time (JIT) compilation works in the JVM

Background

To achieve "write once, run anywhere", the Java compiler first compiles source code into an intermediate language, bytecode, which the Java Virtual Machine (JVM) then translates into machine code for the target platform and executes. Because this translation is performed by interpretation, there is a certain performance loss compared with statically compiled languages such as C++.

To improve Java's runtime efficiency, the JVM introduced just-in-time compilation, or JIT (Just-In-Time). Bytecode is still interpreted, but by analyzing the running program the JVM selectively compiles hot code into machine code and caches it, improving Java's overall execution performance. The JIT compiler also applies a range of optimizations during compilation so that the generated code runs more efficiently.

Let's walk through the JIT-related details and practice in the JVM below.

JIT compilers in the JVM

The most commonly used JVM is HotSpot. By default it runs in mixed mode, meaning the interpreter and the JIT compiler work together at runtime. The execution mode can also be set explicitly with JVM parameters:

  • -Xint: execute the program entirely in interpreted mode.
  • -Xcomp: execute the program entirely in just-in-time compiled mode; if just-in-time compilation runs into a problem, the interpreter steps in.
  • -Xmixed: run with the interpreter and the just-in-time compiler working together; this is the JVM's default execution mode.
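
As a usage sketch, these modes are passed on the command line when launching an application (app.jar here is only a placeholder):

java -Xint -jar app.jar
java -Xcomp -jar app.jar
java -Xmixed -jar app.jar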

Generally speaking, we don't need to change the execution mode. Before describing how mixed mode works, let's first introduce the JIT compilers. HotSpot provides two of them, which differ in their code optimization strategies:

  • C1 compiler: also known as the Client Compiler or Client mode. It is typically used for programs, such as desktop clients, that have limited local resources and are sensitive to startup time. Compilation is usually fast and still yields a useful level of optimization.
  • C2 compiler: also known as the Server Compiler or Server mode. It targets environments where local resources are relatively plentiful and performs deeper optimizations; compilation usually takes longer, but the compiled code runs faster.

As you can see, each has its own strengths and applicable scenarios. In the early days, the C1 compiler could be selected with the JVM parameter -client and the C2 compiler with -server. So is there a way to combine the advantages of both? Starting with Java 7, the JVM introduced the concept of layered (tiered) compilation to do exactly that.

Layered compilation

With layered compilation, there are multiple possible transitions from the interpreter executing bytecode to the JIT compilers taking over, and compiled code is divided into 5 levels:

(Figure: transition paths between the five compilation levels)

  • Level 0, interpreted code: when the JVM first starts, all code is executed by the interpreter.
  • Level 1, C1 simple compiled code: at this level the JVM compiles with C1 but, given how simple the method is, concludes that even C2 compilation would bring no noticeable performance gain, so it skips further profiling for such methods and compiles them with C1 directly.
  • Level 2, C1 limited-profile compiled code: C1 compiles the code using only the profile information provided by the method invocation and back edge counters.
  • Level 3, C1 full-profile compiled code: building on Level 2, the code is compiled with complete profile information.
  • Level 4, C2 compiled code: at this level, code compiled by the C2 compiler is executed.

While interpreting bytecode, the JVM collects information about the running code, and dedicated JIT compilation threads step in to optimize when necessary. From the transition paths in the figure above, the ideal path goes from L0 straight to L3 and finally to L4. But if a method is very simple, such as a trivial get/set method, even C2 compilation would bring no further gain, so L1 is enough for it.

Since compilation threads are limited and methods are compiled through a queue, methods can get stuck waiting when the compilers are busy. If the C2 compilation threads are busy, a method first transitions to L2 so the code is optimized as soon as possible, then moves to L3, and finally, once a C2 thread is free, to L4. The C1 compilation threads can also be busy at times; if enough profile information has already been collected while the code was interpreted, the method is handed straight to a C2 thread and transitions directly to L4.

Starting with Java 8, layered compilation is enabled by default, and the old parameters for manually selecting the C1 or C2 compiler, -client and -server, are no longer effective.
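
If you want to experiment with these behaviors on a modern JVM, two HotSpot flags are still available; a quick sketch (app.jar is again a placeholder):

java -XX:-TieredCompilation -jar app.jar
java -XX:TieredStopAtLevel=1 -jar app.jar

The first disables layered compilation entirely (C2 only), and the second stops tiering at L1, roughly approximating the old C1-only behavior.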

Compilation threads

As mentioned above, the JVM starts dedicated compilation threads to perform JIT compilation. By default the number of compilation threads is derived from the number of CPU cores of the current machine, as follows:

CPUs    C1 compilation threads    C2 compilation threads
1       1                         1
2       1                         1
4       1                         2
8       1                         2
16      2                         6
32      3                         7
64      4                         8
128     4                         10

You can also set the number of compilation threads manually with -XX:CICompilerCount. When you set it manually, one third of the threads are C1 threads and the rest are C2 threads; for example, if you specify 6 compilation threads, you get 2 C1 compilation threads and 4 C2 compilation threads.
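
For example, to get the 2 + 4 split described above (app.jar is a placeholder):

java -XX:CICompilerCount=6 -jar app.jar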

When just-in-time compilation is triggered

As mentioned above, information is collected while code is interpreted, and the JIT compiler steps in when necessary. So what counts as necessary? In other words, what kind of code is considered worth optimizing? Naturally, it is hot code that executes frequently. Hot code comes in two forms: methods that are called many times, and loop bodies that execute many times. The JVM maintains counters for each method and code block; once the execution count reaches a certain threshold, the code is considered hot and the JIT compiler intervenes.

Method call counter

It counts the relative number of method invocations, that is, how many times a method is called within a window of time. If, after a certain time limit, the method's count is still not high enough to submit it to the compiler, the count is halved. This process is called counter decay (Counter Decay), and the time window is called the counter half-life (Counter Half Life Time) of the method's statistics.

Back edge counter

First of all, what is a back edge? A back edge is a backward jump that the bytecode takes when it hits a control-flow branch during execution, i.e. a jump back to the start of a loop. This counter counts the absolute number of times the method's loops execute, and it has no decay process.

Before the layered compilation mechanism existed, the threshold of the method call counter could be set with the JVM parameter -XX:CompileThreshold: the default is 1500 in Client mode and 10000 in Server mode, and the back edge counter has its own fixed threshold formula. With layered compilation, these parameters no longer take effect, and the thresholds are computed dynamically from several parameters:

invocations > Tier{X}InvocationThreshold * s ||
invocations > Tier{X}MinInvocationThreshold * s && invocations + backEdgeCount > Tier{X}CompileThreshold * s

Here X refers to the compilation level mentioned above, i.e. 3 or 4. The default values of these parameters are:

Tier3InvocationThreshold = 200
Tier4InvocationThreshold = 5000
Tier3CompileThreshold = 2000
Tier4CompileThreshold = 15000
Tier3MinInvocationThreshold = 100
Tier4MinInvocationThreshold = 600

The coefficient s is calculated as follows:

s = (number of methods queued for compilation at level X) / (TierXLoadFeedback * number of compilation threads) + 1

where X again refers to compilation level 3 or 4, and the TierXLoadFeedback default values are as follows:

Tier3LoadFeedback = 5
Tier4LoadFeedback = 3

Although the formula is more complicated and it is hard to say exactly where it comes from, a few things are easy to read off it: when the application has just started and the number of methods waiting to be compiled is small or even zero, the first methods to become hot are compiled quickly. As the application starts up, more and more methods are invoked and the compile queue grows, so methods that become hot later need more invocations before they are optimized.
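
To make the formula concrete, here is a minimal Java sketch, not JVM source code (the class, method, and parameter names are invented for illustration), that mirrors the Tier 3 check using the default values listed above:

public class TierThresholdSketch {

    static final int TIER3_INVOCATION_THRESHOLD = 200;
    static final int TIER3_MIN_INVOCATION_THRESHOLD = 100;
    static final int TIER3_COMPILE_THRESHOLD = 2000;
    static final int TIER3_LOAD_FEEDBACK = 5;

    // s grows with the compile queue: an empty queue gives s = 1.
    static double scale(int queuedMethods, int compilerThreads) {
        return (double) queuedMethods / (TIER3_LOAD_FEEDBACK * compilerThreads) + 1;
    }

    static boolean shouldCompileAtTier3(int invocations, int backEdges,
                                        int queuedMethods, int compilerThreads) {
        double s = scale(queuedMethods, compilerThreads);
        return invocations > TIER3_INVOCATION_THRESHOLD * s
                || (invocations > TIER3_MIN_INVOCATION_THRESHOLD * s
                    && invocations + backEdges > TIER3_COMPILE_THRESHOLD * s);
    }

    public static void main(String[] args) {
        // Empty queue: s = 1, so 201 invocations are already enough.
        System.out.println(shouldCompileAtTier3(201, 0, 0, 1));  // true
        // 50 methods queued on one C1 thread: s = 11, so the bar is much higher.
        System.out.println(shouldCompileAtTier3(201, 0, 50, 1)); // false
    }
}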

Compared with the earlier fixed-threshold approach, dynamically computed thresholds are more flexible: they balance the timeliness of JIT compilation against the load that JIT compilation places on the application.

OSR (On-Stack Replacement)

One more concept worth mentioning here is OSR. The JIT compiler's unit of compilation is the method, but the hot code identified by the back edge counter is the code inside a loop body. Since the compilation unit is still the method, the entry point of the compiled code differs: while the method is executing, the loop body is replaced with optimized code. In other words, the method is swapped out while its stack frame is still on the stack, which is why OSR is called "on-stack replacement".
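
As a hedged illustration (the class name is invented), the method below is entered only once, so its invocation counter never gets hot, but the loop's back edge executes a million times. The JVM therefore compiles it via OSR while its frame is still on the stack; such compilations appear with a % flag in the -XX:+PrintCompilation output shown later.

public class OsrExample {

    public static void main(String[] args) {
        long sum = 0;
        // main() is called once, but this loop's back edge is taken
        // 1,000,000 times, which makes the loop body hot.
        for (int i = 0; i < 1_000_000; i++) {
            sum += i;
        }
        System.out.println(sum);
    }
}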

About deoptimization

Having covered when the compiler starts optimizing, let's talk about deoptimization. Deoptimization means that an optimization the JIT compiler applied to some code is no longer valid. That code is rolled back to interpreted execution and goes through the analyze-then-optimize cycle again. Deoptimized code has two states: made not entrant and made zombie.

Not Entrant Code

made not entrant can be translated as "no longer enterable": as the name implies, code in this state will not be entered again. There are two cases in which code is marked made not entrant:

The first case is related to aggressive optimization. To get better results, the C2 compiler performs some aggressive (speculative) optimizations, and as the code keeps running, an earlier speculation may become invalid. Here is an example involving branch prediction:

StockPriceHistory sph;
String log = request.getParameter("log");
if (log != null && log.equals("true")) {
    sph = new StockPriceHistoryLogger(...);
} else {
    sph = new StockPriceHistoryImpl(...);
}
// Then the JSP makes calls to:
sph.getHighPrice();
sph.getStdDev();

Suppose your server runs the code above, and for a long stretch a large number of requests arrive without the log parameter, so they all take the else branch. The runtime type of sph is then always StockPriceHistoryImpl, and the compiler applies optimizations such as inlining the calls on that assumption. But as soon as a request with the log parameter arrives, the logic takes the if branch, sph is a StockPriceHistoryLogger, the earlier speculative optimization is considered invalid, and deoptimization occurs.

The other case is tied to the layered compilation mechanism itself. As the transition process above shows, the final state of compiled code is L4 or L1, so the intermediate L2 and L3 compilation results produced along the way are deoptimized once the method reaches its final state.

Zombie Code

This state can be understood as follows: code that was previously marked made not entrant is now marked as reclaimable. Zombie code is removed when the Code Cache cleanup mechanism is triggered; the Code Cache is introduced below.

Code Cache

The JVM has a memory area called the Code Cache that stores the JIT compiler's output. On subsequent executions, when the program calls a compiled method again, the native code in the Code Cache is used directly, without recompiling. Note that the Code Cache has a fixed size: with layered compilation enabled, the default is 240MB. If the Code Cache runs out of space, the JIT compiler cannot compile any new code, which degrades application performance.

When the Code Cache is full, the JVM logs "CodeCache is full. Compiler has been disabled.". You can adjust the size with -XX:ReservedCodeCacheSize.
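
For example, to enlarge the cache and also have the JVM print Code Cache usage on exit via the -XX:+PrintCodeCache flag (the size is only an example, app.jar a placeholder):

java -XX:ReservedCodeCacheSize=512m -XX:+PrintCodeCache -jar app.jar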

In addition, the Code Cache has a speculative cleanup mechanism, controlled by -XX:+UseCodeCacheFlushing and enabled by default since Java 7. When the Code Cache is nearly full, half of the previously compiled methods are put into an "old methods" queue, which is checked at a fixed interval (30s by default). If a method in the queue is still unused after two checks, it is marked made not entrant and is gradually cleaned up.

JIT compilation log analysis and practice

Observe the JIT compilation log

We can observe the JIT compilation process with -XX:+PrintCompilation. Take the following test code as an example:

public class SomeComputation {

    public String doSomething(String str) {
        if (str == null || str.isEmpty()) {
            return "Hello World!";
        }

        return str.toUpperCase() + str.toLowerCase();
    }
}

public class TrivialObject {

    private int a;

    private String b;

    public int getA() {
        return a;
    }

    public void setA(int a) {
        this.a = a;
    }

    public String getB() {
        return b;
    }

    public void setB(String b) {
        this.b = b;
    }
}

public class JITCompilationPlaygroundMain {

    public static void main(String[] args) {
        SomeComputation sth = new SomeComputation();

        for (int i = 0; i < 100000; i++) {
            TrivialObject obj = new TrivialObject();
            obj.setA(i);
            obj.setB(String.valueOf(i));
            sth.doSomething(obj.getB());
        }
    }
}
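
The log below suggests these classes live in the package me.leozdgao.playground. Assuming that, and assuming the compiled classes sit in target/classes (both are assumptions), the run looks like:

java -XX:+PrintCompilation -cp target/classes me.leozdgao.playground.JITCompilationPlaygroundMain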

Executing the code with this parameter produces output like the following (only the key part is shown, since the full log is long):

149   61       3       me.leozdgao.playground.TrivialObject::setB (6 bytes)
149   62       3       me.leozdgao.playground.SomeComputation::doSomething (39 bytes)
150   72       1       me.leozdgao.playground.TrivialObject::setB (6 bytes)
150   61       3       me.leozdgao.playground.TrivialObject::setB (6 bytes)   made not entrant
150   42       1       me.leozdgao.playground.TrivialObject::getB (5 bytes)
150   56       3       me.leozdgao.playground.TrivialObject::<init> (5 bytes)
150   57       3       me.leozdgao.playground.TrivialObject::setA (6 bytes)
150   58       3       java.lang.String::valueOf (5 bytes)
150   60       3       java.lang.Integer::toString (48 bytes)
150   32       3       java.lang.String::getChars (62 bytes)   made not entrant
150   71       4       java.lang.StringBuilder::append (8 bytes)
151   73       1       me.leozdgao.playground.TrivialObject::setA (6 bytes)
151   57       3       me.leozdgao.playground.TrivialObject::setA (6 bytes)   made not entrant
151   30   !   3       sun.misc.URLClassPath$JarLoader::ensureOpen (36 bytes)
152   33       1       java.net.URL::getProtocol (5 bytes)
152   55  s    1       java.util.Vector::size (5 bytes)
156   82       4       me.leozdgao.playground.SomeComputation::doSomething (39 bytes)
163   80       4       java.lang.String::valueOf (5 bytes)
163   34       3       java.lang.String::<init> (82 bytes)   made not entrant
163   81       4       java.lang.Integer::toString (48 bytes)
171   58       3       java.lang.String::valueOf (5 bytes)   made not entrant
172   60       3       java.lang.Integer::toString (48 bytes)   made not entrant
172   62       3       me.leozdgao.playground.SomeComputation::doSomething (39 bytes)   made not entrant
176   84 %     3       me.leozdgao.playground.JITCompilationPlaygroundMain::main @ 10 (53 bytes)
177   85       3       me.leozdgao.playground.JITCompilationPlaygroundMain::main (53 bytes)

First, the log format:

timestamp  compile ID  flags  compilation level  compiled method  deoptimization flag

In the output above, the timestamp is the time from application startup until the compilation was triggered. As for the flags column, three flags appear in this log:

  • %: OSR compilation
  • s: synchronized method
  • !: the method contains an exception handler

Two more flags do not appear in the example above:

  • n: native method
  • b: compilation blocked the application thread

Watching the compilation-level column confirms the transition process described above. Take the method TrivialObject::setB as an example: it is a simple setter, i.e. a trivial method, so it is compiled at L3 at 149ms and transitions to L1 at 150ms. Now look at SomeComputation::doSomething, whose body contains some real logic: it is compiled at L3 at 149ms and transitions to L4 at 156ms, which is the most common transition path.

A careful look also shows that for methods that transitioned to L1 or L4, the earlier L3 record reappears later with made not entrant in the last column. This is the deoptimization flag mentioned above: under layered compilation, L3 is an intermediate state, so its code is marked "no longer enterable" once the final-state code is ready.

Application warm-up

In real development, because of the JIT mechanism, hot code has not yet been optimized right after an application is redeployed and started. Service response times can rise for a short period, and the burst of JIT compilation drives CPU utilization up, which can hurt service availability, especially for consumer-facing service interfaces.

To address this, consider "application warm-up". How to do it has to be judged against each application's actual situation. Taking a Spring Boot application as an example, you can listen for ApplicationReadyEvent and, before the application really starts serving external traffic, invoke what you believe to be the hot code paths, triggering JIT compilation in advance so that the hot code is already optimized when real traffic arrives.
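
A minimal sketch of this idea, reusing SomeComputation from the earlier example; the component name, iteration count, and warmed code path are illustrative assumptions, not a prescription:

import org.springframework.boot.context.event.ApplicationReadyEvent;
import org.springframework.context.event.EventListener;
import org.springframework.stereotype.Component;

@Component
public class JitWarmUpRunner {

    private final SomeComputation computation = new SomeComputation();

    // Invoked once the application context is ready; in practice you would
    // run this before the instance starts receiving external traffic.
    @EventListener(ApplicationReadyEvent.class)
    public void warmUp() {
        // Drive the hot path past the tiered-compilation thresholds
        // so it is already compiled when real requests arrive.
        for (int i = 0; i < 20_000; i++) {
            computation.doSomething(String.valueOf(i));
        }
    }
}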

Summary

The author ran into a scenario that required application warm-up, which prompted this walkthrough of the JIT mechanism, from layered compilation to the Code Cache and log analysis. Of course, warm-up depends on each application's situation, and in practice it involves more factors than JIT compilation alone; those are not expanded on in this article.

Since Java 9 introduced the Graal compiler, AOT-compiled code can be translated into machine code for the target environment at build time, with no JIT mechanism at runtime at all, so there is no need to worry about application warm-up there. An introduction to and practice with that approach will follow in a later article.
