This article is translated from: Why is 2 * (i * i) faster than 2 * i * i in Java?
The following Java program takes on average between 0.50 and 0.55 seconds to run:
public static void main(String[] args) {
    long startTime = System.nanoTime();
    int n = 0;
    for (int i = 0; i < 1000000000; i++) {
        n += 2 * (i * i);
    }
    System.out.println((double) (System.nanoTime() - startTime) / 1000000000 + " s");
    System.out.println("n = " + n);
}
If I replace 2 * (i * i) with 2 * i * i, it takes between 0.60 and 0.65 seconds to run. How come?
I ran each version of the program 15 times, alternating between the two. Here are the results:
2*(i*i) | 2*i*i
----------+----------
0.5183738 | 0.6246434
0.5298337 | 0.6049722
0.5308647 | 0.6603363
0.5133458 | 0.6243328
0.5003011 | 0.6541802
0.5366181 | 0.6312638
0.515149 | 0.6241105
0.5237389 | 0.627815
0.5249942 | 0.6114252
0.5641624 | 0.6781033
0.538412 | 0.6393969
0.5466744 | 0.6608845
0.531159 | 0.6201077
0.5048032 | 0.6511559
0.5232789 | 0.6544526
The fastest run of 2 * i * i took longer than the slowest run of 2 * (i * i). If they were both equally efficient, the probability of this happening would be less than 1/2^15 * 100% = 0.00305%.
#1
Reference: https://stackoom.com/question/3cHUX/为什么-i-i-比Java中的-i-i更快
#2
The two orders of multiplication do generate slightly different bytecode:
17: iconst_2
18: iload 4
20: iload 4
22: imul
23: imul
24: iadd
for 2 * (i * i), vs:
17: iconst_2
18: iload 4
20: imul
21: iload 4
23: imul
24: iadd
for 2 * i * i.
And when using a JMH benchmark like this:
@Warmup(iterations = 5, batchSize = 1)
@Measurement(iterations = 5, batchSize = 1)
@Fork(1)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@State(Scope.Benchmark)
public class MyBenchmark {

    @Benchmark
    public int noBrackets() {
        int n = 0;
        for (int i = 0; i < 1000000000; i++) {
            n += 2 * i * i;
        }
        return n;
    }

    @Benchmark
    public int brackets() {
        int n = 0;
        for (int i = 0; i < 1000000000; i++) {
            n += 2 * (i * i);
        }
        return n;
    }
}
The difference is clear:
# JMH version: 1.21
# VM version: JDK 11, Java HotSpot(TM) 64-Bit Server VM, 11+28
# VM options: <none>
Benchmark (n) Mode Cnt Score Error Units
MyBenchmark.brackets 1000000000 avgt 5 380.889 ± 58.011 ms/op
MyBenchmark.noBrackets 1000000000 avgt 5 512.464 ± 11.098 ms/op
What you observe is correct, and not just an anomaly of your benchmarking style (i.e. no warmup; see How do I write a correct micro-benchmark in Java?).
Running again with Graal:
# JMH version: 1.21
# VM version: JDK 11, Java HotSpot(TM) 64-Bit Server VM, 11+28
# VM options: -XX:+UnlockExperimentalVMOptions -XX:+EnableJVMCI -XX:+UseJVMCICompiler
Benchmark (n) Mode Cnt Score Error Units
MyBenchmark.brackets 1000000000 avgt 5 335.100 ± 23.085 ms/op
MyBenchmark.noBrackets 1000000000 avgt 5 331.163 ± 50.670 ms/op
You see that the results are much closer, which makes sense, since Graal is an overall better-performing, more modern compiler.
So this really just comes down to how well the JIT compiler is able to optimize a particular piece of code, and doesn't necessarily have a logical reason behind it.
#3
I got similar results:
2 * (i * i): 0.458765943 s, n=119860736
2 * i * i: 0.580255126 s, n=119860736
I got the SAME results whether both loops were in the same program or each was in a separate .java file/.class executed in a separate run.
Finally, here is a javap -c -v <.java> decompile of each:
3: ldc #3 // String 2 * (i * i):
5: invokevirtual #4 // Method java/io/PrintStream.print:(Ljava/lang/String;)V
8: invokestatic #5 // Method java/lang/System.nanoTime:()J
11: lstore_1
12: iconst_0
13: istore_3
14: iconst_0
15: istore 4
17: iload 4
19: ldc #6 // int 1000000000
21: if_icmpge 40
24: iload_3
25: iconst_2
26: iload 4
28: iload 4
30: imul
31: imul
32: iadd
33: istore_3
34: iinc 4, 1
37: goto 17
vs.
3: ldc #3 // String 2 * i * i:
5: invokevirtual #4 // Method java/io/PrintStream.print:(Ljava/lang/String;)V
8: invokestatic #5 // Method java/lang/System.nanoTime:()J
11: lstore_1
12: iconst_0
13: istore_3
14: iconst_0
15: istore 4
17: iload 4
19: ldc #6 // int 1000000000
21: if_icmpge 40
24: iload_3
25: iconst_2
26: iload 4
28: imul
29: iload 4
31: imul
32: iadd
33: istore_3
34: iinc 4, 1
37: goto 17
FYI -
java -version
java version "1.8.0_121"
Java(TM) SE Runtime Environment (build 1.8.0_121-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode)
#4
Bytecodes: https://cs.nyu.edu/courses/fall00/V22.0201-001/jvm2.html
Bytecode viewer: https://github.com/Konloch/bytecode-viewer
On my JDK (Windows 10 64-bit, 1.8.0_65-b17) I can reproduce and explain:
public static void main(String[] args) {
    int repeat = 10;
    long A = 0;
    long B = 0;
    for (int i = 0; i < repeat; i++) {
        A += test();
        B += testB();
    }
    System.out.println(A / repeat + " ms");
    System.out.println(B / repeat + " ms");
}

private static long test() {
    int n = 0;
    for (int i = 0; i < 1000; i++) {
        n += multi(i);
    }
    long startTime = System.currentTimeMillis();
    for (int i = 0; i < 1000000000; i++) {
        n += multi(i);
    }
    long ms = (System.currentTimeMillis() - startTime);
    System.out.println(ms + " ms A " + n);
    return ms;
}

private static long testB() {
    int n = 0;
    for (int i = 0; i < 1000; i++) {
        n += multiB(i);
    }
    long startTime = System.currentTimeMillis();
    for (int i = 0; i < 1000000000; i++) {
        n += multiB(i);
    }
    long ms = (System.currentTimeMillis() - startTime);
    System.out.println(ms + " ms B " + n);
    return ms;
}

private static int multiB(int i) {
    return 2 * (i * i);
}

private static int multi(int i) {
    return 2 * i * i;
}
Output:
...
405 ms A 785527736
327 ms B 785527736
404 ms A 785527736
329 ms B 785527736
404 ms A 785527736
328 ms B 785527736
404 ms A 785527736
328 ms B 785527736
410 ms
333 ms
So why? The bytecode is this:
private static multiB(int arg0) { // 2 * (i * i)
<localVar:index=0, name=i , desc=I, sig=null, start=L1, end=L2>
L1 {
iconst_2
iload0
iload0
imul
imul
ireturn
}
L2 {
}
}
private static multi(int arg0) { // 2 * i * i
<localVar:index=0, name=i , desc=I, sig=null, start=L1, end=L2>
L1 {
iconst_2
iload0
imul
iload0
imul
ireturn
}
L2 {
}
}
The difference being: with brackets (2 * (i * i)):
- push constant onto the stack
- push local onto the stack
- push local onto the stack
- multiply top of the stack
- multiply top of the stack
Without brackets (2 * i * i):
- push constant onto the stack
- push local onto the stack
- multiply top of the stack
- push local onto the stack
- multiply top of the stack
Loading everything onto the stack and then working back down is faster than switching back and forth between pushing onto the stack and operating on it.
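To make the two instruction orders concrete, here is an illustrative sketch (not how the JVM actually executes, just a replay of the two imul sequences on an explicit operand stack); both leave the same value, only the load/multiply interleaving differs:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative only: replays the two bytecode sequences on an explicit
// operand stack to show they compute the same value in different orders.
public class OperandStackSketch {
    static int withBrackets(int i) {     // iconst_2, iload, iload, imul, imul
        Deque<Integer> s = new ArrayDeque<>();
        s.push(2); s.push(i); s.push(i);
        s.push(s.pop() * s.pop());       // i * i
        s.push(s.pop() * s.pop());       // 2 * (i * i)
        return s.pop();
    }

    static int withoutBrackets(int i) {  // iconst_2, iload, imul, iload, imul
        Deque<Integer> s = new ArrayDeque<>();
        s.push(2); s.push(i);
        s.push(s.pop() * s.pop());       // 2 * i
        s.push(i);
        s.push(s.pop() * s.pop());       // (2 * i) * i
        return s.pop();
    }

    public static void main(String[] args) {
        System.out.println(withBrackets(7) + " " + withoutBrackets(7)); // 98 98
    }
}
```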
#5
When the multiplication is 2 * (i * i), the JVM is able to factor out the multiplication by 2 from the loop, resulting in this equivalent but more efficient code:
int n = 0;
for (int i = 0; i < 1000000000; i++) {
    n += i * i;
}
n *= 2;
but when the multiplication is (2 * i) * i, the JVM doesn't optimize it, since the multiplication by a constant is no longer right before the addition.
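As a quick sanity check (my own sketch, not the JIT's actual transformation): the factored form produces the same n as the original even though n overflows int many times, because int multiplication distributes over addition modulo 2^32; that is what makes the hoisting legal.

```java
// Sanity check: hoisting "* 2" out of the loop preserves the result
// even under int overflow, since arithmetic wraps modulo 2^32.
public class HoistCheck {
    public static void main(String[] args) {
        int a = 0, b = 0;
        for (int i = 0; i < 100_000; i++) {
            a += 2 * (i * i); // original form
            b += i * i;       // factored form
        }
        b *= 2;
        System.out.println(a == b); // true
    }
}
```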
Here are a few reasons why I think this is the case:
- Adding an if (n == 0) n = 1 statement at the start of the loop results in both versions being equally efficient, since factoring out the multiplication no longer guarantees that the result will be the same
- The optimized version (with the multiplication by 2 factored out) is exactly as fast as the 2 * (i * i) version
Here is the test code that I used to draw these conclusions:
public static void main(String[] args) {
    long fastVersion = 0;
    long slowVersion = 0;
    long optimizedVersion = 0;
    long modifiedFastVersion = 0;
    long modifiedSlowVersion = 0;
    for (int i = 0; i < 10; i++) {
        fastVersion += fastVersion();
        slowVersion += slowVersion();
        optimizedVersion += optimizedVersion();
        modifiedFastVersion += modifiedFastVersion();
        modifiedSlowVersion += modifiedSlowVersion();
    }
    System.out.println("Fast version: " + (double) fastVersion / 1000000000 + " s");
    System.out.println("Slow version: " + (double) slowVersion / 1000000000 + " s");
    System.out.println("Optimized version: " + (double) optimizedVersion / 1000000000 + " s");
    System.out.println("Modified fast version: " + (double) modifiedFastVersion / 1000000000 + " s");
    System.out.println("Modified slow version: " + (double) modifiedSlowVersion / 1000000000 + " s");
}

private static long fastVersion() {
    long startTime = System.nanoTime();
    int n = 0;
    for (int i = 0; i < 1000000000; i++) {
        n += 2 * (i * i);
    }
    return System.nanoTime() - startTime;
}

private static long slowVersion() {
    long startTime = System.nanoTime();
    int n = 0;
    for (int i = 0; i < 1000000000; i++) {
        n += 2 * i * i;
    }
    return System.nanoTime() - startTime;
}

private static long optimizedVersion() {
    long startTime = System.nanoTime();
    int n = 0;
    for (int i = 0; i < 1000000000; i++) {
        n += i * i;
    }
    n *= 2;
    return System.nanoTime() - startTime;
}

private static long modifiedFastVersion() {
    long startTime = System.nanoTime();
    int n = 0;
    for (int i = 0; i < 1000000000; i++) {
        if (n == 0) n = 1;
        n += 2 * (i * i);
    }
    return System.nanoTime() - startTime;
}

private static long modifiedSlowVersion() {
    long startTime = System.nanoTime();
    int n = 0;
    for (int i = 0; i < 1000000000; i++) {
        if (n == 0) n = 1;
        n += 2 * i * i;
    }
    return System.nanoTime() - startTime;
}
And here are the results:
Fast version: 5.7274411 s
Slow version: 7.6190804 s
Optimized version: 5.1348007 s
Modified fast version: 7.1492705 s
Modified slow version: 7.2952668 s
#6
I tried a JMH benchmark using the default archetype. I also added an optimized version based on Runemoro's explanation.
@State(Scope.Benchmark)
@Warmup(iterations = 2)
@Fork(1)
@Measurement(iterations = 10)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
//@BenchmarkMode({ Mode.All })
@BenchmarkMode(Mode.AverageTime)
public class MyBenchmark {

    @Param({ "100", "1000", "1000000000" })
    private int size;

    @Benchmark
    public int two_square_i() {
        int n = 0;
        for (int i = 0; i < size; i++) {
            n += 2 * (i * i);
        }
        return n;
    }

    @Benchmark
    public int square_i_two() {
        int n = 0;
        for (int i = 0; i < size; i++) {
            n += i * i;
        }
        return 2 * n;
    }

    @Benchmark
    public int two_i_() {
        int n = 0;
        for (int i = 0; i < size; i++) {
            n += 2 * i * i;
        }
        return n;
    }
}
The results are here:
Benchmark (size) Mode Samples Score Score error Units
o.s.MyBenchmark.square_i_two 100 avgt 10 58,062 1,410 ns/op
o.s.MyBenchmark.square_i_two 1000 avgt 10 547,393 12,851 ns/op
o.s.MyBenchmark.square_i_two 1000000000 avgt 10 540343681,267 16795210,324 ns/op
o.s.MyBenchmark.two_i_ 100 avgt 10 87,491 2,004 ns/op
o.s.MyBenchmark.two_i_ 1000 avgt 10 1015,388 30,313 ns/op
o.s.MyBenchmark.two_i_ 1000000000 avgt 10 967100076,600 24929570,556 ns/op
o.s.MyBenchmark.two_square_i 100 avgt 10 70,715 2,107 ns/op
o.s.MyBenchmark.two_square_i 1000 avgt 10 686,977 24,613 ns/op
o.s.MyBenchmark.two_square_i 1000000000 avgt 10 652736811,450 27015580,488 ns/op
On my PC (Core i7 860; it is doing nothing much apart from reading on my smartphone):
- n += i * i then n *= 2 is first
- 2 * (i * i) is second
The JVM is clearly not optimizing the same way a human does (based on Runemoro's answer).
Now then, reading the bytecode: javap -c -v ./target/classes/org/sample/MyBenchmark.class
- Differences between 2*(i*i) (left) and 2*i*i (right) here: https://www.diffchecker.com/cvSFppWI
- Differences between 2*(i*i) and the optimized version here: https://www.diffchecker.com/I1XFu5dP
I am not an expert on bytecode, but we iload_2 before we imul: that's probably where the difference comes from. I can suppose that the JVM optimizes reading i twice (i is already there, and there is no need to load it again), whilst with 2*i*i it can't.