JMH Must-Knows for Performance Tuning 3: Writing Correct Microbenchmark Test Cases

Part of the "JMH Must-Knows" series of articles (continuously updated)

I. Introduction

The previous two articles introduced what JMH is and the basic rules of JMH. This article explains how to write correct JMH microbenchmark test cases. [Unit conversion: 1 second (s) = 1,000,000 microseconds (us) = 1,000,000,000 nanoseconds (ns)]
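The result tables later in this article mix us/op and ns/op, so it can help to see the conversion written out. The following is a small sketch that relies only on java.util.concurrent.TimeUnit from the JDK; the class name UnitConversionDemo and the helper microsToNanos are made up for illustration.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper class, for illustration only.
class UnitConversionDemo {

    // Converts a fractional microsecond score (as printed by JMH) to nanoseconds.
    static double microsToNanos(double micros) {
        return micros * 1_000.0;
    }

    public static void main(String[] args) {
        // Whole-number conversions via the JDK's TimeUnit
        System.out.println(TimeUnit.SECONDS.toMicros(1)); // 1 s expressed in us
        System.out.println(TimeUnit.SECONDS.toNanos(1));  // 1 s expressed in ns
        // A score of 0.002 us/op corresponds to roughly 2 ns/op
        System.out.printf("%.3f ns%n", microsToNanos(0.002));
    }
}
```

TimeUnit only converts whole numbers (long values), which is why the fractional score is converted with a plain multiplication.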

The official JMH source code (including the samples, in the jmh-samples module) can be downloaded from: https://github.com/openjdk/jmh/tags .

The official JMH samples can be browsed online at: http://hg.openjdk.java.net/code-tools/jmh/file/tip/jmh-samples/src/main/java/org/openjdk/jmh/samples/ .

This article draws on the book "Detailed Explanation of Java High Concurrency Programming: In-depth Understanding of Concurrent Core Library" by Wang Wenjun. Readers who want more detail are encouraged to buy a legitimate copy.

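This article is original work by @大白有点菜. Please do not plagiarize it, and cite the source when reposting. If you find it useful, a like and a follow are appreciated. Thanks!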

II. Writing Correct Microbenchmark Test Cases

1. Add the JMH dependencies

Search the Maven repository for the dependencies jmh-core and jmh-generator-annprocess; version 1.36 is used here. The "<scope>test</scope>" line that the repository page suggests for jmh-generator-annprocess must be commented out, otherwise an error is reported when the project runs (presumably because a test-scoped annotation processor does not process benchmark classes under src/main).

<!-- https://mvnrepository.com/artifact/org.openjdk.jmh/jmh-core -->
<dependency>
    <groupId>org.openjdk.jmh</groupId>
    <artifactId>jmh-core</artifactId>
    <version>1.36</version>
</dependency>

<!-- https://mvnrepository.com/artifact/org.openjdk.jmh/jmh-generator-annprocess -->
<dependency>
    <groupId>org.openjdk.jmh</groupId>
    <artifactId>jmh-generator-annprocess</artifactId>
    <version>1.36</version>
    <!-- <scope>test</scope> -->
</dependency>

2. Avoid DCE (Dead Code Elimination)

Dead Code Elimination means that the JVM (more precisely, its JIT compiler) removes code whose result has no effect on the rest of the program, that is, code it can prove will never be used, such as the following block.

public void test() {
    int x = 10;
    int y = 10;
    int z = x + y;
}

The test method defines x and y and adds them to obtain z, but nothing else in the method uses z: it is neither returned nor used in a further computation, and it is not even a field. The JVM is therefore likely to treat test() as an empty method; that is, it removes the definitions of x and y and the code that computes z.

[Verifying whether the JVM removes context-independent code while running Java code - code]

package cn.zhuangyt.javabase.jmh;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

import java.util.concurrent.TimeUnit;

/**
 * JMH test 14: writing correct microbenchmark test cases (avoiding DCE, i.e. dead code elimination)
 * @author 大白有点菜
 */
@BenchmarkMode(Mode.AverageTime)
@Warmup(iterations = 5)
@Measurement(iterations = 5)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@State(Scope.Thread)
public class JmhTestApp14_Coding_Correct_Benchmark_Case_DCE {

    @Benchmark
    public void baseline() {
        // Empty method
    }

    @Benchmark
    public void measureLog1() {
        // Performs a math operation, but the result stays local to the method and is never used
        Math.log(Math.PI);
    }

    @Benchmark
    public void measureLog2() {
        // result comes from a math operation and is used on the next line
        double result = Math.log(Math.PI);
        // A second operation on result, whose outcome is neither stored, returned, nor used again
        Math.log(result);
    }

    @Benchmark
    public double measureLog3() {
        // Returns the result of the math operation
        return Math.log(Math.PI);
    }

    public static void main(String[] args) throws RunnerException {
        Options opt = new OptionsBuilder()
                .include(JmhTestApp14_Coding_Correct_Benchmark_Case_DCE.class.getSimpleName())
                .forks(1)
                .build();

        new Runner(opt).run();
    }
}

[Verifying whether the JVM removes context-independent code while running Java code - results]

Benchmark                                                   Mode  Cnt   Score    Error  Units
JmhTestApp14_Coding_Correct_Benchmark_Case_DCE.baseline     avgt    5  ≈ 10⁻⁴           us/op
JmhTestApp14_Coding_Correct_Benchmark_Case_DCE.measureLog1  avgt    5  ≈ 10⁻⁴           us/op
JmhTestApp14_Coding_Correct_Benchmark_Case_DCE.measureLog2  avgt    5  ≈ 10⁻⁴           us/op
JmhTestApp14_Coding_Correct_Benchmark_Case_DCE.measureLog3  avgt    5   0.002 ±  0.001  us/op

The baseline method is empty and serves as the reference measurement.
measureLog1 performs a log operation, but the result is neither used nor returned.
measureLog2 also performs log operations. Although the result of the first call is passed as the argument to the second, the result of the second call is not used any further.
measureLog3 is similar to measureLog1, except that it returns the result of the operation.

The results show that measureLog1 and measureLog2 perform almost identically to the baseline, so we can conclude that the code in both methods has been eliminated. Such code is called dead code: code whose result is not used anywhere else. measureLog3 differs from the other two because it returns its result, so Math.log(Math.PI) is not treated as dead code and the method takes measurably more CPU time. (Its input is a constant, however, so part of the work may still be constant-folded; see section 4.)

Conclusion: to write a correct microbenchmark method, make sure it contains no dead code; ideally, every benchmark method returns a value.

[Appendix: the official Dead Code sample (JMHSample_08_DeadCode) - code]

package cn.zhuangyt.javabase.jmh;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

import java.util.concurrent.TimeUnit;

/**
 * JMH test 14: official Dead Code sample
 * @author 大白有点菜
 */
@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class JmhTestApp14_DeadCode {

    /**
     * The downfall of many benchmarks is Dead-Code Elimination (DCE): compilers
     * are smart enough to deduce some computations are redundant and eliminate
     * them completely. If the eliminated part was our benchmarked code, we are
     * in trouble.
     *
     * Fortunately, JMH provides the essential infrastructure to fight this
     * where appropriate: returning the result of the computation will ask JMH
     * to deal with the result to limit dead-code elimination (returned results
     * are implicitly consumed by Blackholes, see JMHSample_09_Blackholes).
     */

    private double x = Math.PI;

    private double compute(double d) {
        for (int c = 0; c < 10; c++) {
            d = d * d / Math.PI;
        }
        return d;
    }

    @Benchmark
    public void baseline() {
        // do nothing, this is a baseline
    }

    @Benchmark
    public void measureWrong() {
        // This is wrong: result is not used and the entire computation is optimized away.
        compute(x);
    }

    @Benchmark
    public double measureRight() {
        // This is correct: the result is being used.
        return compute(x);
    }

    public static void main(String[] args) throws RunnerException {
        Options opt = new OptionsBuilder()
                .include(JmhTestApp14_DeadCode.class.getSimpleName())
                .forks(1)
                .build();

        new Runner(opt).run();
    }
}

[Appendix: the official Dead Code sample (JMHSample_08_DeadCode) - results]

Benchmark                           Mode  Cnt   Score   Error  Units
JmhTestApp14_DeadCode.baseline      avgt    5   0.261 ± 0.027  ns/op
JmhTestApp14_DeadCode.measureRight  avgt    5  13.920 ± 0.591  ns/op
JmhTestApp14_DeadCode.measureWrong  avgt    5   0.266 ± 0.039  ns/op

[Comments from the official Dead Code sample (JMHSample_08_DeadCode), translated with the help of Google and Baidu Translate]

The downfall of many benchmarks is dead code elimination (DCE): compilers are smart enough to deduce that some computations are redundant and eliminate them entirely. If the part that gets eliminated is our benchmarked code, we are in trouble.

Fortunately, JMH provides the necessary infrastructure to deal with this where appropriate: returning the result of the computation asks JMH to handle the result so as to limit dead code elimination (returned results are implicitly consumed by Blackholes, see JMHSample_09_Blackholes).

3. Use Blackhole

Suppose a benchmark method produces two results that both need to be returned. What should we do? Our first thought might be to store the results in an array or a container and return that, but writing to the array or container also costs CPU time and therefore distorts the measurement.

JMH provides the Blackhole class, which prevents dead code elimination without requiring a return value. It is much like the /dev/null "black hole" device on Linux.
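To make the idea of "consuming" a value concrete, here is a deliberately simplified sink written in plain Java. It is an illustration only: the class SimpleSink and its methods are hypothetical, and this is not how JMH's Blackhole is implemented (the real class is considerably more elaborate). The sketch assumes that, in practice, HotSpot does not discard writes to a volatile field, so a value stored there counts as "used".

```java
// Hypothetical, simplified sink for illustration; NOT JMH's actual Blackhole.
class SimpleSink {

    // Assumption: a volatile write is treated as an observable side effect,
    // so the computation that produced the value cannot be treated as dead code.
    private volatile double last;

    // "Consumes" a value by storing it in the volatile field.
    void consume(double value) {
        last = value;
    }

    // Returns the most recently consumed value.
    double lastConsumed() {
        return last;
    }

    public static void main(String[] args) {
        SimpleSink sink = new SimpleSink();
        double x1 = Math.PI;
        double x2 = Math.PI * 2;
        // Both results are handed to the sink, so neither call is obviously unused
        sink.consume(Math.pow(x1, 2));
        sink.consume(Math.pow(x2, 2));
        System.out.println(sink.lastConsumed());
    }
}
```

Every consume call here costs a volatile write, which illustrates why consuming values is never free. In real benchmarks, use org.openjdk.jmh.infra.Blackhole rather than a hand-written sink like this one.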

[Blackhole sample - code]

package cn.zhuangyt.javabase.jmh;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

import java.util.concurrent.TimeUnit;

/**
 * JMH test 14: writing correct microbenchmark test cases (using Blackhole)
 * @author 大白有点菜
 */
@BenchmarkMode(Mode.AverageTime)
@Warmup(iterations = 5)
@Measurement(iterations = 5)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Thread)
public class JmhTestApp14_Coding_Correct_Benchmark_Case_Blackhole {

    double x1 = Math.PI;
    double x2 = Math.PI * 2;

    @Benchmark
    public double baseline()
    {
        // Not dead code, because the result is returned
        return Math.pow(x1, 2);
    }

    @Benchmark
    public double powButReturnOne()
    {
        // Dead code; it will be eliminated
        Math.pow(x1, 2);
        // Not eliminated, because the result is returned
        return Math.pow(x2, 2);
    }

    @Benchmark
    public double powThenAdd()
    {
        // The two results are merged by addition, so both calculations take effect
        return Math.pow(x1, 2) + Math.pow(x2, 2);
    }

    @Benchmark
    public void useBlackhole(Blackhole hole)
    {
        // The results are placed into the black hole, so both pow operations take effect
        hole.consume(Math.pow(x1, 2));
        hole.consume(Math.pow(x2, 2));
    }

    public static void main(String[] args) throws RunnerException {
        Options opt = new OptionsBuilder()
                .include(JmhTestApp14_Coding_Correct_Benchmark_Case_Blackhole.class.getSimpleName())
                .forks(1)
                .build();

        new Runner(opt).run();
    }
}

[Blackhole sample - code running result]

Benchmark                                                             Mode  Cnt  Score   Error  Units
JmhTestApp14_Coding_Correct_Benchmark_Case_Blackhole.baseline         avgt    5  2.126 ± 0.163  ns/op
JmhTestApp14_Coding_Correct_Benchmark_Case_Blackhole.powButReturnOne  avgt    5  2.065 ± 0.112  ns/op
JmhTestApp14_Coding_Correct_Benchmark_Case_Blackhole.powThenAdd       avgt    5  2.181 ± 0.151  ns/op
JmhTestApp14_Coding_Correct_Benchmark_Case_Blackhole.useBlackhole     avgt    5  3.748 ± 0.342  ns/op

In the baseline method, pow is applied to x1 and the result is returned, so this benchmark method is sound.
In powButReturnOne, the first pow call cannot escape being treated as dead code, so the method does not measure the cost of two pow calls; the pow call on x2 is returned, so it is not dead code.
powThenAdd is smarter: it also returns a value and both pow calls are executed, but because the results are merged by addition, the cost of the addition is included in the measurement of the two pow calls.
In useBlackhole, pow is executed twice, and instead of being returned, each result is written into the black hole.

The results show that baseline and powButReturnOne perform almost identically. powThenAdd takes slightly longer than the first two because it performs two pow operations (in this run the difference is within the reported error). useBlackhole does not merge the two results, but executing the Blackhole's consume method takes some CPU time of its own. Even so, if every benchmark method without a return value hands its local results to a Blackhole, all of them run under the same conditions, just as boxers in a match must compete in the same weight class.

Conclusion: Blackhole helps you avoid dead code in benchmark methods that return no value.

[Appendix: the official Blackhole sample (JMHSample_09_Blackholes) - code]

package cn.zhuangyt.javabase.jmh;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

import java.util.concurrent.TimeUnit;

/**
 * JMH test 14: official Blackhole sample
 * @author 大白有点菜
 */
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Thread)
public class JmhTestApp14_Balckhole {

    /**
     * Should your benchmark require returning multiple results, you have to
     * consider two options (detailed below).
     *
     * NOTE: If you are only producing a single result, it is more readable to
     * use the implicit return, as in JMHSample_08_DeadCode. Do not make your benchmark
     * code less readable with explicit Blackholes!
     */

    double x1 = Math.PI;
    double x2 = Math.PI * 2;

    private double compute(double d) {
        for (int c = 0; c < 10; c++) {
            d = d * d / Math.PI;
        }
        return d;
    }

    /**
     * Baseline measurement: how much a single compute() costs.
     */

    @Benchmark
    public double baseline() {
        return compute(x1);
    }

    /**
     * While the compute(x2) computation is intact, compute(x1)
     * is redundant and optimized out.
     */

    @Benchmark
    public double measureWrong() {
        compute(x1);
        return compute(x2);
    }

    /**
     * This demonstrates Option A:
     *
     * Merge multiple results into one and return it.
     * This is OK when the computation is relatively heavyweight, and merging
     * the results does not offset the results much.
     */

    @Benchmark
    public double measureRight_1() {
        return compute(x1) + compute(x2);
    }

    /**
     * This demonstrates Option B:
     *
     * Use explicit Blackhole objects, and sink the values there.
     * (Background: Blackhole is just another @State object, bundled with JMH).
     */

    @Benchmark
    public void measureRight_2(Blackhole bh) {
        bh.consume(compute(x1));
        bh.consume(compute(x2));
    }

    public static void main(String[] args) throws RunnerException {
        Options opt = new OptionsBuilder()
                .include(JmhTestApp14_Balckhole.class.getSimpleName())
                .forks(1)
                .build();

        new Runner(opt).run();
    }
}

[Appendix: the official Blackhole sample (JMHSample_09_Blackholes) - results]

Benchmark                              Mode  Cnt   Score   Error  Units
JmhTestApp14_Balckhole.baseline        avgt    5  13.691 ± 0.661  ns/op
JmhTestApp14_Balckhole.measureRight_1  avgt    5  22.318 ± 1.664  ns/op
JmhTestApp14_Balckhole.measureRight_2  avgt    5  26.079 ± 3.079  ns/op
JmhTestApp14_Balckhole.measureWrong    avgt    5  13.276 ± 1.931  ns/op

[Comments from the official Blackhole sample (JMHSample_09_Blackholes)]

See the comments in the code.

4. Avoid Constant Folding

Constant folding is an early, compile-time optimization performed by the Java compiler. While compiling a source file, javac can recognize that some expressions consist only of constants and fold them, that is, store the computed result directly in place of the expression so that nothing needs to be calculated at run time. For example:

private final int x = 10;
private final int y = x * 20;

At compile time, y is assigned the value 200 directly. This is constant folding.
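The effect of javac's folding can be observed from plain Java, without JMH. In the sketch below (the class and method names are made up for illustration), concatenating two final String constants yields a compile-time constant, which the Java Language Specification requires to be interned, so it is the very same object as the literal "JMH". With non-final variables the concatenation happens at run time and produces a new object. The == comparisons are intentional here: they compare object identity, not content.

```java
// Hypothetical demo class; relies only on the Java language rules for constant expressions.
class ConstantFoldingDemo {

    // Both operands are final and initialized with constants, so (a + b) is a
    // compile-time constant and refers to the same interned object as "JMH".
    static boolean foldedAtCompileTime() {
        final String a = "JM";
        final String b = "H";
        return (a + b) == "JMH";
    }

    // Without final, a and b are not constant variables: the concatenation is
    // performed at run time and creates a new String object.
    static boolean computedAtRunTime() {
        String a = "JM";
        String b = "H";
        return (a + b) == "JMH";
    }

    public static void main(String[] args) {
        System.out.println(foldedAtCompileTime()); // true
        System.out.println(computedAtRunTime());   // false
    }
}
```

The same distinction between final constants and ordinary variables is what separates the folded and the non-folded benchmark methods in this section.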

[Constant Folding sample - code]

package cn.zhuangyt.javabase.jmh;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

import java.util.concurrent.TimeUnit;

/**
 * JMH test 14: writing correct microbenchmark test cases (avoiding constant folding)
 * @author 大白有点菜
 */
@BenchmarkMode(Mode.AverageTime)
@Warmup(iterations = 5)
@Measurement(iterations = 5)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Thread)
public class JmhTestApp14_Coding_Correct_Benchmark_Case_Constant_Folding {

    /**
     * x1 and x2 are constants declared final
     */
    private final double x1 = 124.456;
    private final double x2 = 342.456;

    /**
     * y1 is an ordinary (non-final) member variable
     */
    private double y1 = 124.456;
    /**
     * y2 is an ordinary (non-final) member variable
     */
    private double y2 = 342.456;

    /**
     * Directly returns the result of 124.456 × 342.456; used mainly as the baseline
     * @return
     */
    @Benchmark
    public double returnDirect()
    {
        return 42_620.703936d;
    }

    /**
     * Multiplies two constants; we want to verify whether the compiler's early
     * optimization stage computes the value of x1 times x2 directly
     * @return
     */
    @Benchmark
    public double returnCalculate_1()
    {
        return x1 * x2;
    }

    /**
     * A more complex calculation on two variables that are not declared final;
     * also used mainly as a reference for comparison
     * @return
     */
    @Benchmark
    public double returnCalculate_2()
    {
        return Math.log(y1) * Math.log(y2);
    }

    /**
     * A more complex calculation, again on final constants, to see whether
     * constant folding takes place during compiler optimization
     * @return
     */
    @Benchmark
    public double returnCalculate_3()
    {
        return Math.log(x1) * Math.log(x2);
    }

    public static void main(String[] args) throws RunnerException {
        Options opt = new OptionsBuilder()
                .include(JmhTestApp14_Coding_Correct_Benchmark_Case_Constant_Folding.class.getSimpleName())
                .forks(1)
                .build();

        new Runner(opt).run();
    }
}

[Constant Folding sample - code running result]

Benchmark                                                                      Mode  Cnt   Score   Error  Units
JmhTestApp14_Coding_Correct_Benchmark_Case_Constant_Folding.returnCalculate_1  avgt    5   1.873 ± 0.119  ns/op
JmhTestApp14_Coding_Correct_Benchmark_Case_Constant_Folding.returnCalculate_2  avgt    5  36.126 ± 2.372  ns/op
JmhTestApp14_Coding_Correct_Benchmark_Case_Constant_Folding.returnCalculate_3  avgt    5   1.888 ± 0.169  ns/op
JmhTestApp14_Coding_Correct_Benchmark_Case_Constant_Folding.returnDirect       avgt    5   1.869 ± 0.115  ns/op

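The scores of returnCalculate_1, returnCalculate_3, and returnDirect are almost identical, which shows that constant folding took place and these methods simply return a precomputed result. x1 * x2 is a compile-time constant expression that javac can fold; Math.log(x1) * Math.log(x2) is not one, so in that case the folding is most likely done by the JIT compiler at run time. returnCalculate_2 scores much worse because its inputs are non-final fields and the calculation cannot be folded.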

[Appendix: the official Constant Folding sample (JMHSample_10_ConstantFold) - code]

package cn.zhuangyt.javabase.jmh;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

import java.util.concurrent.TimeUnit;

/**
 * JMH test 14: official Constant Folding sample
 * @author 大白有点菜
 */
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Thread)
public class JmhTestApp14_ConstantFolding {

    /**
     * The flip side of dead-code elimination is constant-folding.
     *
     * If JVM realizes the result of the computation is the same no matter what,
     * it can cleverly optimize it. In our case, that means we can move the
     * computation outside of the internal JMH loop.
     *
     * This can be prevented by always reading the inputs from non-final
     * instance fields of @State objects, computing the result based on those
     * values, and follow the rules to prevent DCE.
     */

    // IDEs will say "Oh, you can convert this field to local variable". Don't. Trust. Them.
    // (While this is normally fine advice, it does not work in the context of measuring correctly.)
    private double x = Math.PI;

    // IDEs will probably also say "Look, it could be final". Don't. Trust. Them. Either.
    // (While this is normally fine advice, it does not work in the context of measuring correctly.)
    private final double wrongX = Math.PI;

    private double compute(double d) {
        for (int c = 0; c < 10; c++) {
            d = d * d / Math.PI;
        }
        return d;
    }

    @Benchmark
    public double baseline() {
        // simply return the value, this is a baseline
        return Math.PI;
    }

    @Benchmark
    public double measureWrong_1() {
        // This is wrong: the source is predictable, and computation is foldable.
        return compute(Math.PI);
    }

    @Benchmark
    public double measureWrong_2() {
        // This is wrong: the source is predictable, and computation is foldable.
        return compute(wrongX);
    }

    @Benchmark
    public double measureRight() {
        // This is correct: the source is not predictable.
        return compute(x);
    }

    public static void main(String[] args) throws RunnerException {
        Options opt = new OptionsBuilder()
                .include(JmhTestApp14_ConstantFolding.class.getSimpleName())
                .forks(1)
                .build();

        new Runner(opt).run();
    }
}

[Appendix: the official Constant Folding sample (JMHSample_10_ConstantFold) - results]

Benchmark                                    Mode  Cnt   Score   Error  Units
JmhTestApp14_ConstantFolding.baseline        avgt    5   1.871 ± 0.077  ns/op
JmhTestApp14_ConstantFolding.measureRight    avgt    5  13.989 ± 0.909  ns/op
JmhTestApp14_ConstantFolding.measureWrong_1  avgt    5   1.846 ± 0.075  ns/op
JmhTestApp14_ConstantFolding.measureWrong_2  avgt    5   1.870 ± 0.090  ns/op

[Comments from the official Constant Folding sample (JMHSample_10_ConstantFold)]

See the comments in the code.

5. Avoid Loop Unwinding

When writing JMH code, besides avoiding dead code and reducing references to constants, we should also avoid or minimize loops in benchmark methods, because loop code is very likely to be aggressively optimized during the run phase (the JVM's late, JIT-time optimization). This optimization is called loop unwinding (also known as loop unrolling). Let's look at what it means.

int sum = 0;
for (int i = 0; i < 100; i++) {
    sum += i;
}

In the example above, sum += i is executed 100 times; that is, the JVM issues this calculation to the CPU 100 times. That may seem harmless, but the JVM's designers saw that such code can be optimized, possibly as follows.

int sum = 0;
for (int i = 0; i < 100; i += 5) {
    sum += i;
    sum += i + 1;
    sum += i + 2;
    sum += i + 3;
    sum += i + 4;
}

After this optimization, the instructions in the loop body are issued to the CPU in batches, which improves efficiency. Suppose that computing 1 + 2 takes 1 nanosecond of CPU time. For a loop of 10 iterations we might then expect 10 nanoseconds, but the actual time may well be less than that.
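An unrolled loop must produce exactly the same result as the loop it replaces. The plain-Java sketch below writes out a 100-iteration summing loop and a hand-unrolled version that covers the same 100 values in steps of five, then checks that they agree. The class and method names are made up for illustration, and the unrolling the JIT compiler actually performs may look quite different.

```java
// Hypothetical demo class: a loop and a manually unrolled equivalent.
class LoopUnrollingDemo {

    // Straightforward loop: 100 iterations, one addition each.
    static int rolled() {
        int sum = 0;
        for (int i = 0; i < 100; i++) {
            sum += i;
        }
        return sum;
    }

    // Unrolled by a factor of 5: 20 iterations, five additions each.
    // The bound stays at 100; only the step changes.
    static int unrolledBy5() {
        int sum = 0;
        for (int i = 0; i < 100; i += 5) {
            sum += i;
            sum += i + 1;
            sum += i + 2;
            sum += i + 3;
            sum += i + 4;
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(rolled());      // 4950
        System.out.println(unrolledBy5()); // 4950
    }
}
```

The unrolled version performs a fifth of the loop-condition checks and branches, which is one reason it can be cheaper per addition.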

[Loop Unwinding Sample - Code]

package cn.zhuangyt.javabase.jmh;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

import java.util.concurrent.TimeUnit;

/**
 * JMH test 14: writing correct microbenchmark test cases (avoiding loop unwinding)
 * @author 大白有点菜
 */
@BenchmarkMode(Mode.AverageTime)
@Warmup(iterations = 5)
@Measurement(iterations = 5)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Thread)
public class JmhTestApp14_Coding_Correct_Benchmark_Case_Loop_Unwinding {

    private int x = 1;
    private int y = 2;

    @Benchmark
    public int measure()
    {
        return (x + y);
    }

    private int loopCompute(int times)
    {
        int result = 0;
        for (int i = 0; i < times; i++)
        {
            result += (x + y);
        }
        return result;
    }

    @OperationsPerInvocation
    @Benchmark
    public int measureLoop_1()
    {
        return loopCompute(1);
    }

    @OperationsPerInvocation(10)
    @Benchmark
    public int measureLoop_10()
    {
        return loopCompute(10);
    }

    @OperationsPerInvocation(100)
    @Benchmark
    public int measureLoop_100()
    {
        return loopCompute(100);
    }

    @OperationsPerInvocation(1000)
    @Benchmark
    public int measureLoop_1000()
    {
        return loopCompute(1000);
    }

    public static void main(String[] args) throws RunnerException {
        Options opt = new OptionsBuilder()
                .include(JmhTestApp14_Coding_Correct_Benchmark_Case_Loop_Unwinding.class.getSimpleName())
                .forks(1)
                .build();

        new Runner(opt).run();
    }
}

[Loop Unwinding sample - code running result]

Benchmark                                                                   Mode  Cnt  Score   Error  Units
JmhTestApp14_Coding_Correct_Benchmark_Case_Loop_Unwinding.measure           avgt    5  2.038 ± 0.167  ns/op
JmhTestApp14_Coding_Correct_Benchmark_Case_Loop_Unwinding.measureLoop_1     avgt    5  2.112 ± 0.548  ns/op
JmhTestApp14_Coding_Correct_Benchmark_Case_Loop_Unwinding.measureLoop_10    avgt    5  0.226 ± 0.013  ns/op
JmhTestApp14_Coding_Correct_Benchmark_Case_Loop_Unwinding.measureLoop_100   avgt    5  0.026 ± 0.003  ns/op
JmhTestApp14_Coding_Correct_Benchmark_Case_Loop_Unwinding.measureLoop_1000  avgt    5  0.023 ± 0.002  ns/op

In the code above, measure() computes x + y. measureLoop_1() is almost equivalent to measure() and also computes x + y once, whereas measureLoop_10() executes result += (x + y) 10 times; put bluntly, it amounts to calling measure() or loopCompute(1) 10 times. We cannot directly compare the CPU time of 10 operations with that of 1 operation, so the @OperationsPerInvocation(10) annotation tells JMH to count each invocation of measureLoop_10() as 10 operations.

The JMH results show that the more loop iterations there are, the more of them get merged, so the reported time per operation drops. This indicates that the JVM optimizes our code at run time.
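Because JMH divides the measured time of one invocation by the operation count declared in @OperationsPerInvocation, multiplying a score by that count gives back the approximate time of one whole invocation. The sketch below does this arithmetic for the scores in the result table above; the class and method names are hypothetical, and the numbers are simply copied from that table.

```java
// Hypothetical helper: recovers the approximate per-invocation time from a JMH per-op score.
class PerInvocationCost {

    // score is in ns/op; opsPerInvocation is the value given to @OperationsPerInvocation.
    static double nanosPerInvocation(double nanosPerOp, int opsPerInvocation) {
        return nanosPerOp * opsPerInvocation;
    }

    public static void main(String[] args) {
        // Scores (ns/op) of measureLoop_1, _10, _100 and _1000 from the table
        double[] scores = {2.112, 0.226, 0.026, 0.023};
        int[] ops = {1, 10, 100, 1000};
        for (int i = 0; i < scores.length; i++) {
            System.out.printf("loopCompute(%d): about %.3f ns per invocation%n",
                    ops[i], nanosPerInvocation(scores[i], ops[i]));
        }
    }
}
```

A 100-iteration loop that finishes in roughly 2.6 ns, about the cost of a single measure() call, cannot consist of 100 separately executed additions, which is consistent with the loop having been optimized.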

[Appendix: the official Loop Unwinding sample (JMHSample_11_Loops) - code]

package cn.zhuangyt.javabase.jmh;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

import java.util.concurrent.TimeUnit;

/**
 * JMH test 14: official Loop Unwinding sample
 * @author 大白有点菜
 */
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Thread)
public class JmhTestApp14_LoopUnwinding {

    /**
     * It would be tempting for users to do loops within the benchmarked method.
     * (This is the bad thing Caliper taught everyone). These tests explain why
     * this is a bad idea.
     *
     * Looping is done in the hope of minimizing the overhead of calling the
     * test method, by doing the operations inside the loop instead of inside
     * the method call. Don't buy this argument; you will see there is more
     * magic happening when we allow optimizers to merge the loop iterations.
     */

    /**
     * Suppose we want to measure how much it takes to sum two integers:
     */

    int x = 1;
    int y = 2;

    /**
     * This is what you do with JMH.
     */

    @Benchmark
    public int measureRight() {
        return (x + y);
    }

    /**
     * The following tests emulate the naive looping.
     * This is the Caliper-style benchmark.
     */
    private int reps(int reps) {
        int s = 0;
        for (int i = 0; i < reps; i++) {
            s += (x + y);
        }
        return s;
    }

    /**
     * We would like to measure this with different repetitions count.
     * Special annotation is used to get the individual operation cost.
     */

    @Benchmark
    @OperationsPerInvocation(1)
    public int measureWrong_1() {
        return reps(1);
    }

    @Benchmark
    @OperationsPerInvocation(10)
    public int measureWrong_10() {
        return reps(10);
    }

    @Benchmark
    @OperationsPerInvocation(100)
    public int measureWrong_100() {
        return reps(100);
    }

    @Benchmark
    @OperationsPerInvocation(1_000)
    public int measureWrong_1000() {
        return reps(1_000);
    }

    @Benchmark
    @OperationsPerInvocation(10_000)
    public int measureWrong_10000() {
        return reps(10_000);
    }

    @Benchmark
    @OperationsPerInvocation(100_000)
    public int measureWrong_100000() {
        return reps(100_000);
    }

    public static void main(String[] args) throws RunnerException {
        Options opt = new OptionsBuilder()
                .include(JmhTestApp14_LoopUnwinding.class.getSimpleName())
                .forks(1)
                .build();

        new Runner(opt).run();
    }
}

[Official Loop Unwinding sample (JMHSample_11_Loops) - run results]

Benchmark                                       Mode  Cnt  Score   Error  Units
JmhTestApp14_LoopUnwinding.measureRight         avgt    5  2.326 ± 0.089  ns/op
JmhTestApp14_LoopUnwinding.measureWrong_1       avgt    5  2.052 ± 0.085  ns/op
JmhTestApp14_LoopUnwinding.measureWrong_10      avgt    5  0.225 ± 0.006  ns/op
JmhTestApp14_LoopUnwinding.measureWrong_100     avgt    5  0.026 ± 0.001  ns/op
JmhTestApp14_LoopUnwinding.measureWrong_1000    avgt    5  0.022 ± 0.001  ns/op
JmhTestApp14_LoopUnwinding.measureWrong_10000   avgt    5  0.019 ± 0.001  ns/op
JmhTestApp14_LoopUnwinding.measureWrong_100000  avgt    5  0.017 ± 0.002  ns/op

[Official Loop Unwinding sample (JMHSample_11_Loops) - explanatory notes]

See the comments in the code above.
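
To make the result table easier to read, here is a small plain-Java sketch (no JMH needed) that works through the arithmetic. The class and method names are made up for illustration, and the scores are simply copied from the table above. It illustrates two points: the loop in reps has a closed-form result, which hints at how much room an optimizer has to merge iterations, and a score reported under @OperationsPerInvocation is the time of one invocation divided by the declared operation count.

```java
import java.util.Locale;

final class LoopScoreArithmetic {

    // Same fields and loop as in the sample above.
    static int x = 1;
    static int y = 2;

    static int reps(int reps) {
        int s = 0;
        for (int i = 0; i < reps; i++) {
            s += (x + y);
        }
        return s;
    }

    /** Time of one whole invocation, recovered from the per-operation score. */
    static double nanosPerInvocation(double scoreNsPerOp, int opsPerInvocation) {
        return scoreNsPerOp * opsPerInvocation;
    }

    /** How many times "faster" the looped variant appears compared with the baseline. */
    static double apparentSpeedup(double baselineNsPerOp, double loopedNsPerOp) {
        return baselineNsPerOp / loopedNsPerOp;
    }

    public static void main(String[] args) {
        // The whole loop is equivalent to a single multiplication.
        System.out.println(reps(100_000) == 100_000 * (x + y));

        // Scores copied from the result table above (ns/op).
        double measureRight = 2.326;
        double measureWrong100000 = 0.017;

        System.out.printf(Locale.ROOT, "one invocation of measureWrong_100000: %.0f ns%n",
                nanosPerInvocation(measureWrong100000, 100_000));
        System.out.printf(Locale.ROOT, "apparent speedup over measureRight: %.1fx%n",
                apparentSpeedup(measureRight, measureWrong100000));
    }
}
```

A score of 0.017 ns/op is well below one clock cycle on any GHz-class CPU, so it cannot be the real cost of a single addition. The looped variants are measuring how well the loop was optimized, not the operation itself.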

6. Use Fork to avoid interference from profile-guided optimizations

What is Fork for? This section explains the role of Fork and how the JVM's profile-guided optimizations can distort benchmark results.

Before explaining Fork, consider how we usually test application performance. Suppose we want to measure how quickly Redis responds when 50, 100, and 200 concurrent threads each perform a total of 100 million write operations. First we clear the Redis database so that every test case starts from the same baseline: the server's memory, disk, and CPU resources should be essentially the same for each run, otherwise the comparison is meaningless. Then we run the test case, clean up the Redis server so that it returns to its pre-test state, and finally compile a report from the collected results.

Fork was introduced with the same concern in mind. All the code of a Java program runs inside a single JVM process, so the same code executed at different times can be affected by the profile data the JVM collected earlier in that process, and that profile may even be mixed with data collected while other code was running. This can easily make the microbenchmarks we write inaccurate. That may sound abstract, so let's look at an example.

[Fork set to 0 sample - code]

package cn.zhuangyt.javabase.jmh;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

import java.util.concurrent.TimeUnit;

/**
 * JMH test 14: writing correct microbenchmark test cases (Fork is used to avoid profile-guided optimizations)
 * @author 大白有点菜
 */
@BenchmarkMode(Mode.AverageTime)
// Set Fork to 0
@Fork(0)
// Set Fork to 1
//@Fork(1)
@Warmup(iterations = 5)
@Measurement(iterations = 5)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@State(Scope.Thread)
public class JmhTestApp14_Coding_Correct_Benchmark_Case_Fork {

    // Inc1 and Inc2 have exactly the same implementation
    interface Inc
    {
        int inc();
    }

    public static class Inc1 implements Inc
    {
        private int i = 0;

        @Override
        public int inc()
        {
            return ++i;
        }
    }

    public static class Inc2 implements Inc
    {
        private int i = 0;

        @Override
        public int inc()
        {
            return ++i;
        }
    }

    private Inc inc1 = new Inc1();
    private Inc inc2 = new Inc2();

    private int measure(Inc inc)
    {
        int result = 0;
        for (int i = 0; i < 10; i++)
        {
            result += inc.inc();
        }
        return result;
    }

    @Benchmark
    public int measure_inc_1()
    {
        return this.measure(inc1);
    }

    @Benchmark
    public int measure_inc_2()
    {
        return this.measure(inc2);
    }

    @Benchmark
    public int measure_inc_3()
    {
        return this.measure(inc1);
    }

    public static void main(String[] args) throws RunnerException {
        Options opt = new OptionsBuilder()
                .include(JmhTestApp14_Coding_Correct_Benchmark_Case_Fork.class.getSimpleName())
                .build();

        new Runner(opt).run();
    }
}

[Fork set to 0 example - code running result]

Benchmark                                                      Mode  Cnt  Score    Error  Units
JmhTestApp14_Coding_Correct_Benchmark_Case_Fork.measure_inc_1  avgt    5  0.002 ±  0.001  us/op
JmhTestApp14_Coding_Correct_Benchmark_Case_Fork.measure_inc_2  avgt    5  0.012 ±  0.001  us/op
JmhTestApp14_Coding_Correct_Benchmark_Case_Fork.measure_inc_3  avgt    5  0.012 ±  0.001  us/op

With Fork set to 0, every benchmark method runs in the same JVM process as the JmhTestApp14_Coding_Correct_Benchmark_Case_Fork launcher itself, so the profile data the JVM gathers for one benchmark method gets mixed with the data gathered for the others.

measure_inc_1 and measure_inc_2 are implemented almost identically, yet their scores differ greatly. measure_inc_1 and measure_inc_3 run exactly the same code, and their scores still differ. This is caused by profile-guided optimizations: because all the benchmark methods share the JVM process of JmhTestApp14_Coding_Correct_Benchmark_Case_Fork, their profiles inevitably mix. When Fork is set to 1, a new JVM process is created for each benchmark, and the benchmarks no longer interfere with one another.
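
One way to picture why measure_inc_3 stays slow is a toy model of the receiver-type profile kept for the call site inc.inc(). The sketch below is plain Java and only a model: the class CallSiteProfileModel and its shape() labels are made up for illustration, and the profiling data of a real JVM is far more detailed than a set of classes. The labels follow the terms commonly used for call sites that have seen one, two, or more receiver types. The only property modelled here is that, within one process, the profile grows and does not shrink.

```java
import java.util.LinkedHashSet;
import java.util.Set;

final class CallSiteProfileModel {

    interface Inc {
        int inc();
    }

    static final class Inc1 implements Inc {
        private int i = 0;

        @Override
        public int inc() {
            return ++i;
        }
    }

    static final class Inc2 implements Inc {
        private int i = 0;

        @Override
        public int inc() {
            return ++i;
        }
    }

    // Toy stand-in for the receiver types recorded at the call site inc.inc().
    private final Set<Class<?>> seenTypes = new LinkedHashSet<>();

    int measure(Inc inc) {
        int result = 0;
        for (int i = 0; i < 10; i++) {
            seenTypes.add(inc.getClass()); // types are only ever added, never removed
            result += inc.inc();
        }
        return result;
    }

    /** Describes the call site by the number of receiver types seen so far. */
    String shape() {
        switch (seenTypes.size()) {
            case 0:
                return "not reached";
            case 1:
                return "monomorphic";
            case 2:
                return "bimorphic";
            default:
                return "megamorphic";
        }
    }

    public static void main(String[] args) {
        // One model instance plays the role of one shared JVM process (Fork = 0).
        CallSiteProfileModel sharedJvm = new CallSiteProfileModel();
        Inc inc1 = new Inc1();
        Inc inc2 = new Inc2();

        sharedJvm.measure(inc1);
        System.out.println("after measure_inc_1: " + sharedJvm.shape()); // monomorphic
        sharedJvm.measure(inc2);
        System.out.println("after measure_inc_2: " + sharedJvm.shape()); // bimorphic
        sharedJvm.measure(inc1);
        System.out.println("after measure_inc_3: " + sharedJvm.shape()); // bimorphic
    }
}
```

In this model the third call sees the same implementation as the first, but the call site has already recorded two types. That matches the table above, where measure_inc_3 scores like measure_inc_2 rather than like measure_inc_1. Creating a fresh CallSiteProfileModel for each call would correspond to running each benchmark in its own forked JVM.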

[Fork set to 1 sample - code]

package cn.zhuangyt.javabase.jmh;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

import java.util.concurrent.TimeUnit;

/**
 * JMH test 14: writing correct microbenchmark test cases (Fork is used to avoid profile-guided optimizations)
 * @author 大白有点菜
 */
@BenchmarkMode(Mode.AverageTime)
// Set Fork to 0
//@Fork(0)
// Set Fork to 1
@Fork(1)
@Warmup(iterations = 5)
@Measurement(iterations = 5)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@State(Scope.Thread)
public class JmhTestApp14_Coding_Correct_Benchmark_Case_Fork {

    // Inc1 and Inc2 have exactly the same implementation
    interface Inc
    {
        int inc();
    }

    public static class Inc1 implements Inc
    {
        private int i = 0;

        @Override
        public int inc()
        {
            return ++i;
        }
    }

    public static class Inc2 implements Inc
    {
        private int i = 0;

        @Override
        public int inc()
        {
            return ++i;
        }
    }

    private Inc inc1 = new Inc1();
    private Inc inc2 = new Inc2();

    private int measure(Inc inc)
    {
        int result = 0;
        for (int i = 0; i < 10; i++)
        {
            result += inc.inc();
        }
        return result;
    }

    @Benchmark
    public int measure_inc_1()
    {
        return this.measure(inc1);
    }

    @Benchmark
    public int measure_inc_2()
    {
        return this.measure(inc2);
    }

    @Benchmark
    public int measure_inc_3()
    {
        return this.measure(inc1);
    }

    public static void main(String[] args) throws RunnerException {
        Options opt = new OptionsBuilder()
                .include(JmhTestApp14_Coding_Correct_Benchmark_Case_Fork.class.getSimpleName())
                .build();

        new Runner(opt).run();
    }
}

[Fork set to 1 example - code running result]

Benchmark                                                      Mode  Cnt  Score    Error  Units
JmhTestApp14_Coding_Correct_Benchmark_Case_Fork.measure_inc_1  avgt    5  0.003 ±  0.001  us/op
JmhTestApp14_Coding_Correct_Benchmark_Case_Fork.measure_inc_2  avgt    5  0.003 ±  0.001  us/op
JmhTestApp14_Coding_Correct_Benchmark_Case_Fork.measure_inc_3  avgt    5  0.003 ±  0.001  us/op

The output above is the result with Fork set to 1. Much more reasonable, isn't it? With Fork set to 0, the benchmarks share the process, and therefore the profile, of the class that launches them. With Fork set to 1, a new process is started for each benchmark method. Fork can also be set to a value greater than 1, in which case each benchmark is run several times in separate processes. In general, though, setting Fork to 1 is enough.
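
The idea of "a new process for each benchmark" can be pictured with the standard ProcessBuilder API. The following is only a sketch under stated assumptions: the class ForkSketch and its methods are hypothetical, and the child JVM merely runs "-version" as a stand-in for real work. JMH's own launcher does considerably more, such as passing the classpath and benchmark parameters to the child and collecting its results.

```java
import java.io.IOException;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ForkSketch {

    /** Builds a command line that starts a new JVM from the current Java installation. */
    static List<String> childJvmCommand(String... jvmArgs) {
        String java = Paths.get(System.getProperty("java.home"), "bin", "java").toString();
        List<String> command = new ArrayList<>();
        command.add(java);
        command.addAll(Arrays.asList(jvmArgs));
        return command;
    }

    /** Starts the child JVM, waits for it to finish, and returns its exit code. */
    static int runInFreshJvm(String... jvmArgs) throws IOException, InterruptedException {
        Process child = new ProcessBuilder(childJvmCommand(jvmArgs))
                .inheritIO() // let the child print to this console
                .start();
        return child.waitFor();
    }

    public static void main(String[] args) throws Exception {
        int forks = 2; // comparable to Fork set to 2: two separate child JVMs, one after another
        for (int fork = 1; fork <= forks; fork++) {
            int exitCode = runInFreshJvm("-version");
            System.out.println("fork " + fork + " exited with code " + exitCode);
        }
    }
}
```

Each child process starts with no knowledge of what ran in the parent or in earlier children, which is the property that keeps the benchmarks from interfering with one another.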

[Official Fork sample (JMHSample_12_Forking) - code]

package cn.zhuangyt.javabase.jmh;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

import java.util.concurrent.TimeUnit;

/**
 * JMH test 14: official Forks sample
 * @author 大白有点菜
 */
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Thread)
public class JmhTestApp14_Fork {

    /**
     * JVMs are notoriously good at profile-guided optimizations. This is bad
     * for benchmarks, because different tests can mix their profiles together,
     * and then render the "uniformly bad" code for every test. Forking (running
     * in a separate process) each test can help to evade this issue.
     *
     * JMH will fork the tests by default.
     */

    /**
     * Suppose we have this simple counter interface, and two implementations.
     * Even though those are semantically the same, from the JVM standpoint,
     * those are distinct classes.
     */

    public interface Counter {
        int inc();
    }

    public static class Counter1 implements Counter {
        private int x;

        @Override
        public int inc() {
            return x++;
        }
    }

    public static class Counter2 implements Counter {
        private int x;

        @Override
        public int inc() {
            return x++;
        }
    }

    /**
     * And this is how we measure it.
     * Note this is susceptible for same issue with loops we mention in previous examples.
     */

    public int measure(Counter c) {
        int s = 0;
        for (int i = 0; i < 10; i++) {
            s += c.inc();
        }
        return s;
    }

    /**
     * These are two counters.
     */
    Counter c1 = new Counter1();
    Counter c2 = new Counter2();

    /**
     * We first measure the Counter1 alone...
     * Fork(0) helps to run in the same JVM.
     */

    @Benchmark
    @Fork(0)
    public int measure_1_c1() {
        return measure(c1);
    }

    /**
     * Then Counter2...
     */

    @Benchmark
    @Fork(0)
    public int measure_2_c2() {
        return measure(c2);
    }

    /**
     * Then Counter1 again...
     */

    @Benchmark
    @Fork(0)
    public int measure_3_c1_again() {
        return measure(c1);
    }

    /**
     * These two tests have explicit @Fork annotation.
     * JMH takes this annotation as the request to run the test in the forked JVM.
     * It's even simpler to force this behavior for all the tests via the command
     * line option "-f". The forking is default, but we still use the annotation
     * for the consistency.
     *
     * This is the test for Counter1.
     */

    @Benchmark
    @Fork(1)
    public int measure_4_forked_c1() {
        return measure(c1);
    }

    /**
     * ...and this is the test for Counter2.
     */

    @Benchmark
    @Fork(1)
    public int measure_5_forked_c2() {
        return measure(c2);
    }

    public static void main(String[] args) throws RunnerException {
        Options opt = new OptionsBuilder()
                .include(JmhTestApp14_Fork.class.getSimpleName())
                .build();

        new Runner(opt).run();
    }
}

[Official Fork sample (JMHSample_12_Forking) - run results]

Benchmark                              Mode  Cnt   Score   Error  Units
JmhTestApp14_Fork.measure_1_c1         avgt    5   2.162 ± 0.129  ns/op
JmhTestApp14_Fork.measure_2_c2         avgt    5  12.490 ± 0.304  ns/op
JmhTestApp14_Fork.measure_3_c1_again   avgt    5  12.182 ± 0.605  ns/op
JmhTestApp14_Fork.measure_4_forked_c1  avgt    5   3.138 ± 0.162  ns/op
JmhTestApp14_Fork.measure_5_forked_c2  avgt    5   3.179 ± 0.302  ns/op

[Official Fork sample (JMHSample_12_Forking) - explanatory notes]

See the comments in the code above.