Why is 2 * (i * i) faster than 2 * i * i in Java?

This article is translated from: Why is 2 * (i * i) faster than 2 * i * i in Java?

The following Java program takes on average between 0.50 secs and 0.55 secs to run:

public static void main(String[] args) {
    long startTime = System.nanoTime();
    int n = 0;
    for (int i = 0; i < 1000000000; i++) {
        n += 2 * (i * i);
    }
    System.out.println((double) (System.nanoTime() - startTime) / 1000000000 + " s");
    System.out.println("n = " + n);
}

If I replace 2 * (i * i) with 2 * i * i, it takes between 0.60 and 0.65 secs to run. How come?

I ran each version of the program 15 times, alternating between the two. Here are the results:

 2*(i*i)  |  2*i*i
----------+----------
0.5183738 | 0.6246434
0.5298337 | 0.6049722
0.5308647 | 0.6603363
0.5133458 | 0.6243328
0.5003011 | 0.6541802
0.5366181 | 0.6312638
0.515149  | 0.6241105
0.5237389 | 0.627815
0.5249942 | 0.6114252
0.5641624 | 0.6781033
0.538412  | 0.6393969
0.5466744 | 0.6608845
0.531159  | 0.6201077
0.5048032 | 0.6511559
0.5232789 | 0.6544526

The fastest run of 2 * i * i took longer than the slowest run of 2 * (i * i). If they were equally efficient, the probability of this happening would be less than 1/2^15 * 100% = 0.00305%.
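To see where that bound comes from: if the two versions really were equally fast, then every 2 * (i * i) run beating every 2 * i * i run would in particular mean that each of the 15 paired runs went the same way, which is no more likely than 15 fair coin flips all landing the same side. A minimal sketch of the arithmetic (not part of the original question; the class name is mine):

public class ProbabilityBound {
    public static void main(String[] args) {
        // 15 comparisons, each no more likely than a fair coin flip to favor
        // the same version by luck alone:
        double bound = Math.pow(0.5, 15) * 100;
        System.out.printf("upper bound: %.5f%%%n", bound); // prints "upper bound: 0.00305%"
    }
}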


#1

Reference: https://stackoom.com/question/3cHUX/为什么-i-i-比Java中的-i-i更快


#2

The two methods of adding do generate slightly different byte code:

  17: iconst_2
  18: iload         4
  20: iload         4
  22: imul
  23: imul
  24: iadd

For 2 * (i * i) vs:

  17: iconst_2
  18: iload         4
  20: imul
  21: iload         4
  23: imul
  24: iadd

For 2 * i * i.
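The bytecode differs because * is left-associative in Java: 2 * i * i parses as (2 * i) * i, whereas the brackets keep i * i together as a subexpression that is computed first. A minimal sketch (not from the original answer) illustrating the two parse orders:

public class Associativity {
    public static void main(String[] args) {
        int i = 7;
        int noBrackets   = (2 * i) * i;  // how the compiler parses 2 * i * i
        int withBrackets = 2 * (i * i);  // i * i stays a single subexpression
        // Both are 98; only the order of the two multiplications differs,
        // which is exactly what the two bytecode listings above show.
        System.out.println(noBrackets + " " + withBrackets);
    }
}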

And when using a JMH benchmark like this:

@Warmup(iterations = 5, batchSize = 1)
@Measurement(iterations = 5, batchSize = 1)
@Fork(1)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@State(Scope.Benchmark)
public class MyBenchmark {

    @Benchmark
    public int noBrackets() {
        int n = 0;
        for (int i = 0; i < 1000000000; i++) {
            n += 2 * i * i;
        }
        return n;
    }

    @Benchmark
    public int brackets() {
        int n = 0;
        for (int i = 0; i < 1000000000; i++) {
            n += 2 * (i * i);
        }
        return n;
    }

}
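(The results below were presumably produced with the standard JMH uber-jar. As an alternative, a benchmark like this can also be launched from a plain main method; the BenchmarkRunner class below is a minimal sketch of mine, assuming the usual jmh-core dependency, and is not part of the original answer.)

import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public class BenchmarkRunner {
    public static void main(String[] args) throws RunnerException {
        // Run only the benchmarks defined in MyBenchmark above.
        Options opt = new OptionsBuilder()
                .include(MyBenchmark.class.getSimpleName())
                .build();
        new Runner(opt).run();
    }
}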

The difference is clear:

# JMH version: 1.21
# VM version: JDK 11, Java HotSpot(TM) 64-Bit Server VM, 11+28
# VM options: <none>

Benchmark                      (n)  Mode  Cnt    Score    Error  Units
MyBenchmark.brackets    1000000000  avgt    5  380.889 ± 58.011  ms/op
MyBenchmark.noBrackets  1000000000  avgt    5  512.464 ± 11.098  ms/op

What you observe is correct, and not just an anomaly of your benchmarking style (i.e. no warmup; see How do I write a correct micro-benchmark in Java?).

Running again with Graal:

# JMH version: 1.21
# VM version: JDK 11, Java HotSpot(TM) 64-Bit Server VM, 11+28
# VM options: -XX:+UnlockExperimentalVMOptions -XX:+EnableJVMCI -XX:+UseJVMCICompiler

Benchmark                      (n)  Mode  Cnt    Score    Error  Units
MyBenchmark.brackets    1000000000  avgt    5  335.100 ± 23.085  ms/op
MyBenchmark.noBrackets  1000000000  avgt    5  331.163 ± 50.670  ms/op

You see that the results are much closer, which makes sense, since Graal is an overall better-performing, more modern compiler.

So this really just comes down to how well the JIT compiler is able to optimize a particular piece of code, and doesn't necessarily have a logical reason behind it.


#3

I got similar results:

2 * (i * i): 0.458765943 s, n=119860736
2 * i * i: 0.580255126 s, n=119860736

I got the SAME results whether both loops were in the same program, or each was in a separate .java file/.class executed in a separate run.

Finally, here is a javap -c -v <.class> disassembly of each:

     3: ldc           #3                  // String 2 * (i * i):
     5: invokevirtual #4                  // Method java/io/PrintStream.print:(Ljava/lang/String;)V
     8: invokestatic  #5                  // Method java/lang/System.nanoTime:()J
    11: lstore_1
    12: iconst_0
    13: istore_3
    14: iconst_0
    15: istore        4
    17: iload         4
    19: ldc           #6                  // int 1000000000
    21: if_icmpge     40
    24: iload_3
    25: iconst_2
    26: iload         4
    28: iload         4
    30: imul
    31: imul
    32: iadd
    33: istore_3
    34: iinc          4, 1
    37: goto          17

vs.

     3: ldc           #3                  // String 2 * i * i:
     5: invokevirtual #4                  // Method java/io/PrintStream.print:(Ljava/lang/String;)V
     8: invokestatic  #5                  // Method java/lang/System.nanoTime:()J
    11: lstore_1
    12: iconst_0
    13: istore_3
    14: iconst_0
    15: istore        4
    17: iload         4
    19: ldc           #6                  // int 1000000000
    21: if_icmpge     40
    24: iload_3
    25: iconst_2
    26: iload         4
    28: imul
    29: iload         4
    31: imul
    32: iadd
    33: istore_3
    34: iinc          4, 1
    37: goto          17

FYI -

java -version
java version "1.8.0_121"
Java(TM) SE Runtime Environment (build 1.8.0_121-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode)

#4

Byte codes: https://cs.nyu.edu/courses/fall00/V22.0201-001/jvm2.html
Byte code viewer: https://github.com/Konloch/bytecode-viewer

On my JDK (Windows 10 64-bit, 1.8.0_65-b17) I can reproduce and explain it:

public static void main(String[] args) {
    int repeat = 10;
    long A = 0;
    long B = 0;
    for (int i = 0; i < repeat; i++) {
        A += test();
        B += testB();
    }

    System.out.println(A / repeat + " ms");
    System.out.println(B / repeat + " ms");
}


private static long test() {
    int n = 0;
    for (int i = 0; i < 1000; i++) {
        n += multi(i);
    }
    long startTime = System.currentTimeMillis();
    for (int i = 0; i < 1000000000; i++) {
        n += multi(i);
    }
    long ms = (System.currentTimeMillis() - startTime);
    System.out.println(ms + " ms A " + n);
    return ms;
}


private static long testB() {
    int n = 0;
    for (int i = 0; i < 1000; i++) {
        n += multiB(i);
    }
    long startTime = System.currentTimeMillis();
    for (int i = 0; i < 1000000000; i++) {
        n += multiB(i);
    }
    long ms = (System.currentTimeMillis() - startTime);
    System.out.println(ms + " ms B " + n);
    return ms;
}

private static int multiB(int i) {
    return 2 * (i * i);
}

private static int multi(int i) {
    return 2 * i * i;
}

Output:

...
405 ms A 785527736
327 ms B 785527736
404 ms A 785527736
329 ms B 785527736
404 ms A 785527736
328 ms B 785527736
404 ms A 785527736
328 ms B 785527736
410 ms
333 ms

So why? The byte code is this:

 private static multiB(int arg0) { // 2 * (i * i)
     <localVar:index=0, name=i , desc=I, sig=null, start=L1, end=L2>

     L1 {
         iconst_2
         iload0
         iload0
         imul
         imul
         ireturn
     }
     L2 {
     }
 }

 private static multi(int arg0) { // 2 * i * i
     <localVar:index=0, name=i , desc=I, sig=null, start=L1, end=L2>

     L1 {
         iconst_2
         iload0
         imul
         iload0
         imul
         ireturn
     }
     L2 {
     }
 }

The difference being: with brackets (2 * (i * i)):

  • push constant onto the stack
  • push local onto the stack
  • push local onto the stack
  • multiply the top of the stack
  • multiply the top of the stack

Without brackets (2 * i * i):

  • push constant onto the stack
  • push local onto the stack
  • multiply the top of the stack
  • push local onto the stack
  • multiply the top of the stack

Loading everything onto the stack and then working back down is faster than switching between pushing onto the stack and operating on it.


#5

When the multiplication is 2 * (i * i), the JVM is able to factor out the multiplication by 2 from the loop, resulting in this equivalent but more efficient code:

int n = 0;
for (int i = 0; i < 1000000000; i++) {
    n += i * i;
}
n *= 2;

but when the multiplication is (2 * i) * i, the JVM doesn't optimize it, since the multiplication by a constant is no longer right before the addition.

Here are a few reasons why I think this is the case:

  • Adding an if (n == 0) n = 1 statement at the start of the loop results in both versions being equally efficient, since factoring out the multiplication no longer guarantees that the result will be the same
  • The optimized version (with the multiplication by 2 factored out) is exactly as fast as the 2 * (i * i) version

Here is the test code that I used to draw these conclusions:

public static void main(String[] args) {
    long fastVersion = 0;
    long slowVersion = 0;
    long optimizedVersion = 0;
    long modifiedFastVersion = 0;
    long modifiedSlowVersion = 0;

    for (int i = 0; i < 10; i++) {
        fastVersion += fastVersion();
        slowVersion += slowVersion();
        optimizedVersion += optimizedVersion();
        modifiedFastVersion += modifiedFastVersion();
        modifiedSlowVersion += modifiedSlowVersion();
    }

    System.out.println("Fast version: " + (double) fastVersion / 1000000000 + " s");
    System.out.println("Slow version: " + (double) slowVersion / 1000000000 + " s");
    System.out.println("Optimized version: " + (double) optimizedVersion / 1000000000 + " s");
    System.out.println("Modified fast version: " + (double) modifiedFastVersion / 1000000000 + " s");
    System.out.println("Modified slow version: " + (double) modifiedSlowVersion / 1000000000 + " s");
}

private static long fastVersion() {
    long startTime = System.nanoTime();
    int n = 0;
    for (int i = 0; i < 1000000000; i++) {
        n += 2 * (i * i);
    }
    return System.nanoTime() - startTime;
}

private static long slowVersion() {
    long startTime = System.nanoTime();
    int n = 0;
    for (int i = 0; i < 1000000000; i++) {
        n += 2 * i * i;
    }
    return System.nanoTime() - startTime;
}

private static long optimizedVersion() {
    long startTime = System.nanoTime();
    int n = 0;
    for (int i = 0; i < 1000000000; i++) {
        n += i * i;
    }
    n *= 2;
    return System.nanoTime() - startTime;
}

private static long modifiedFastVersion() {
    long startTime = System.nanoTime();
    int n = 0;
    for (int i = 0; i < 1000000000; i++) {
        if (n == 0) n = 1;
        n += 2 * (i * i);
    }
    return System.nanoTime() - startTime;
}

private static long modifiedSlowVersion() {
    long startTime = System.nanoTime();
    int n = 0;
    for (int i = 0; i < 1000000000; i++) {
        if (n == 0) n = 1;
        n += 2 * i * i;
    }
    return System.nanoTime() - startTime;
}

And here are the results:

Fast version: 5.7274411 s
Slow version: 7.6190804 s
Optimized version: 5.1348007 s
Modified fast version: 7.1492705 s
Modified slow version: 7.2952668 s

#6

I tried a JMH benchmark using the default archetype. I also added an optimized version based on Runemoro's explanation.

@State(Scope.Benchmark)
@Warmup(iterations = 2)
@Fork(1)
@Measurement(iterations = 10)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
//@BenchmarkMode({ Mode.All })
@BenchmarkMode(Mode.AverageTime)
public class MyBenchmark {
  @Param({ "100", "1000", "1000000000" })
  private int size;

  @Benchmark
  public int two_square_i() {
    int n = 0;
    for (int i = 0; i < size; i++) {
      n += 2 * (i * i);
    }
    return n;
  }

  @Benchmark
  public int square_i_two() {
    int n = 0;
    for (int i = 0; i < size; i++) {
      n += i * i;
    }
    return 2*n;
  }

  @Benchmark
  public int two_i_() {
    int n = 0;
    for (int i = 0; i < size; i++) {
      n += 2 * i * i;
    }
    return n;
  }
}

The results are here:

Benchmark                           (size)  Mode  Samples          Score   Score error  Units
o.s.MyBenchmark.square_i_two           100  avgt       10         58,062         1,410  ns/op
o.s.MyBenchmark.square_i_two          1000  avgt       10        547,393        12,851  ns/op
o.s.MyBenchmark.square_i_two    1000000000  avgt       10  540343681,267  16795210,324  ns/op
o.s.MyBenchmark.two_i_                 100  avgt       10         87,491         2,004  ns/op
o.s.MyBenchmark.two_i_                1000  avgt       10       1015,388        30,313  ns/op
o.s.MyBenchmark.two_i_          1000000000  avgt       10  967100076,600  24929570,556  ns/op
o.s.MyBenchmark.two_square_i           100  avgt       10         70,715         2,107  ns/op
o.s.MyBenchmark.two_square_i          1000  avgt       10        686,977        24,613  ns/op
o.s.MyBenchmark.two_square_i    1000000000  avgt       10  652736811,450  27015580,488  ns/op

On my PC (Core i7 860; it is doing nothing much apart from me reading on my smartphone):

  • n += i*i then n*2 is first
  • 2 * (i * i) is second.

The JVM is clearly not optimizing the same way a human does (based on Runemoro's answer).

Now then, reading the bytecode: javap -c -v ./target/classes/org/sample/MyBenchmark.class

I am no expert on bytecode, but we iload_2 before we imul: that's probably where you get the difference. I can suppose that the JVM optimizes reading i twice (i is already there, and there is no need to load it again), whilst in 2*i*i it can't.
