How does array access affect performance?

user6611764 :
int steps = 256 * 1024 * 1024;
int[] a = new int[2];

// Loop 1
for (int i=0; i<steps; i++) { a[0]++; a[0]++; }

// Loop 2
for (int i=0; i<steps; i++) { a[0]++; a[1]++; }

Can someone explain why the second loop is roughly 12 times slower than the first (19 ms vs 232 ms)?

This is how I'm timing it:

long start_time = System.currentTimeMillis();

// Loop

long end_time = System.currentTimeMillis();
System.out.println(end_time - start_time);
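Putting the pieces together, a complete runnable version of the benchmark (the class name Test here is just illustrative) looks roughly like this:

public class Test {
    public static void main(String[] args) {
        int steps = 256 * 1024 * 1024;
        int[] a = new int[2];

        // Loop 1: both increments target a[0]
        long start_time = System.currentTimeMillis();
        for (int i = 0; i < steps; i++) { a[0]++; a[0]++; }
        long end_time = System.currentTimeMillis();
        System.out.println("Loop 1: " + (end_time - start_time) + " ms");

        // Loop 2: the increments target a[0] and a[1]
        start_time = System.currentTimeMillis();
        for (int i = 0; i < steps; i++) { a[0]++; a[1]++; }
        end_time = System.currentTimeMillis();
        System.out.println("Loop 2: " + (end_time - start_time) + " ms");
    }
}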
Peter de Rivaz :

Summary

The JIT compiler is converting the first loop into a multiply, but not optimizing the second loop very much.

Discussion

The bytecode for both loops is basically the same (you can view this with javap -c test.class).

In Java, the bytecode is converted into x86 instructions by a JIT compiler, which can perform additional optimizations.

You can actually view the assembly produced by the JIT via java -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly ... if you have the hsdis plugin.

I changed the value you add to each element to 0xbad to make it easier to spot the relevant code and changed the loop counter to long.
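In other words, the measured loops become something like the following (a reconstruction based on the changes described above; the timing harness is omitted):

long steps = 256L * 1024 * 1024;
int[] a = new int[2];

// Loop 1: two additions of 0xbad to the same element
for (long i = 0; i < steps; i++) { a[0] += 0xbad; a[0] += 0xbad; }

// Loop 2: one addition of 0xbad to each of two elements
for (long i = 0; i < steps; i++) { a[0] += 0xbad; a[1] += 0xbad; }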

The first loop produces:

  mov     r11d,dword ptr [r13+10h]    Load from memory a[0]
  ...
  add     r11d,175ah                  Add 2 * 0xbad to the value
  mov     dword ptr [r13+10h],r11d    Store to memory a[0]

The second loop produces:

   mov     ebx,dword ptr [rax+10h]    Load from memory a[0]
   add     ebx,0badh                  Add 0xbad
   ...
   mov     dword ptr [rax+10h],ebx    Store to memory
   ...
   mov     ebx,dword ptr [rax+14h]    Load from memory a[1]
   add     ebx,0badh                  Add 0xbad
   ...
   mov     dword ptr [rax+14h],ebx    Store to memory a[1]

So you can see that the compiler is already able to optimize the first loop into fewer instructions.

In particular, it has spotted that the two additions to the same array element can be coalesced into a single addition of twice the value.
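In source terms, the effect is as if the first loop's body had been rewritten like this (a sketch of the transformation, not actual JIT output):

// Original body:     a[0] += 0xbad; a[0] += 0xbad;
// After coalescing:  a[0] += 2 * 0xbad;   // 2 * 0xbad == 0x175a, matching the add of 175ah above
for (long i = 0; i < steps; i++) { a[0] += 0x175a; }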

When I changed the loop counter back to int, I noticed that the compiler manages to do even better with your first loop:

mov     r10d,dword ptr [r14+10h]
imul    ecx,r13d,175ah     This line converts lots of adds of 0xbad into a single multiply  
mov     r11d,r10d
sub     r11d,ecx
add     r10d,175ah
mov     dword ptr [r14+10h],r10d

In this case it has spotted that it can actually implement several iterations of your loop in a single pass by using a multiplication! This explains how the first loop can be an order of magnitude faster than the second.
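As a source-level sketch of the idea (not the JIT's literal output, which still contains a loop), the combined effect of many iterations on a[0] can be collapsed into a single multiply:

// 'steps' additions of 0x175a have the same effect as one addition of
// steps * 0x175a; Java int arithmetic wraps around consistently, so the
// end result is identical even when the product overflows.
a[0] += steps * 0x175a;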
