Java Math.abs(int) optimizations: why is this code 6x slower?

xtern :

As you may know, Math.abs(Integer.MIN_VALUE) == Integer.MIN_VALUE. To prevent a negative result, the following safeAbs method was implemented in my project:

    public static int safeAbs(int i) {
        i = Math.abs(i);

        return i < 0 ? 0 : i;
    }

I compared its performance with the following variant:

    public static int safeAbs(int i) {
        return i == Integer.MIN_VALUE ? 0 : Math.abs(i);
    }
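As a sanity check before comparing speed, the two variants can be run side by side on the boundary values. This is only a minimal sketch: the class name SafeAbsCheck and the method names safeAbsClamp and safeAbsGuard are made up here so that both versions can live in one file.

```java
public class SafeAbsCheck {
    // First variant from the question: take abs, then clamp a negative result to 0.
    public static int safeAbsClamp(int i) {
        i = Math.abs(i);
        return i < 0 ? 0 : i;
    }

    // Second variant from the question: test for MIN_VALUE before calling abs.
    public static int safeAbsGuard(int i) {
        return i == Integer.MIN_VALUE ? 0 : Math.abs(i);
    }

    public static void main(String[] args) {
        // Boundary values, including the one input where Math.abs overflows.
        int[] samples = {Integer.MIN_VALUE, Integer.MIN_VALUE + 1, -1, 0, 1, Integer.MAX_VALUE};
        for (int v : samples) {
            System.out.println(v + " -> " + safeAbsClamp(v) + " / " + safeAbsGuard(v));
        }
    }
}
```

Both methods should return the same value for every int, so any performance difference comes from how the code is compiled, not from what it computes.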

The first one is almost 6x slower than the second (the second performs almost the same as "pure" Math.abs(int)). From my point of view, there is no significant difference in the bytecode, so I guess the difference lies in the JIT-compiled assembly:

"slow" version:

  0x00007f0149119720: mov     %eax,0xfffffffffffec000(%rsp)
  0x00007f0149119727: push    %rbp
  0x00007f0149119728: sub     $0x20,%rsp
  0x00007f014911972c: test    %esi,%esi
  0x00007f014911972e: jl      0x7f0149119734
  0x00007f0149119730: mov     %esi,%eax
  0x00007f0149119732: jmp     0x7f014911973c
  0x00007f0149119734: neg     %esi
  0x00007f0149119736: test    %esi,%esi
  0x00007f0149119738: jl      0x7f0149119748
  0x00007f014911973a: mov     %esi,%eax
  0x00007f014911973c: add     $0x20,%rsp
  0x00007f0149119740: pop     %rbp
  0x00007f0149119741: test    %eax,0x1772e8b9(%rip)  ;   {poll_return}
  0x00007f0149119747: retq
  0x00007f0149119748: mov     %esi,(%rsp)
  0x00007f014911974b: mov     $0xffffff65,%esi
  0x00007f0149119750: nop
  0x00007f0149119753: callq   0x7f01490051a0    ; OopMap{off=56}
                                                ;*ifge
                                                ; - math.FastAbs::safeAbsSlow@6 (line 16)
                                                ;   {runtime_call}
  0x00007f0149119758: callq   0x7f015f521d20    ;   {runtime_call}

"normal" version:

  # {method} {0x00007f31acf28cd8} 'safeAbsFast' '(I)I' in 'math/FastAbs'
  # parm0:    rsi       = int
  #           [sp+0x30]  (sp of caller)
  0x00007f31b08c7360: mov     %eax,0xfffffffffffec000(%rsp)
  0x00007f31b08c7367: push    %rbp
  0x00007f31b08c7368: sub     $0x20,%rsp
  0x00007f31b08c736c: cmp     $0x80000000,%esi
  0x00007f31b08c7372: je      0x7f31b08c738e
  0x00007f31b08c7374: mov     %esi,%r10d
  0x00007f31b08c7377: neg     %r10d
  0x00007f31b08c737a: test    %esi,%esi
  0x00007f31b08c737c: mov     %esi,%eax
  0x00007f31b08c737e: cmovl   %r10d,%eax
  0x00007f31b08c7382: add     $0x20,%rsp
  0x00007f31b08c7386: pop     %rbp
  0x00007f31b08c7387: test    %eax,0x162c2c73(%rip)  ;   {poll_return}
  0x00007f31b08c738d: retq
  0x00007f31b08c738e: mov     %esi,(%rsp)
  0x00007f31b08c7391: mov     $0xffffff65,%esi
  0x00007f31b08c7396: nop
  0x00007f31b08c7397: callq   0x7f31b07b11a0    ; OopMap{off=60}
                                                ;*if_icmpne
                                                ; - math.FastAbs::safeAbsFast@3 (line 17)
                                                ;   {runtime_call}
  0x00007f31b08c739c: callq   0x7f31c5863d20    ;   {runtime_call}

Benchmark code:

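import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Level;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Threads;
import org.openjdk.jmh.annotations.Warmup;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

@BenchmarkMode(Mode.AverageTime)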
@Fork(value = 1, jvmArgsAppend = {"-Xms3g", "-Xmx3g", "-server"})
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Benchmark)
@Threads(1)
@Warmup(iterations = 10)
@Measurement(iterations = 10)
public class SafeAbsMicroBench {

    @State(Scope.Benchmark)
    public static class Data {
        final int len = 10_000_000; 

        final int[] values = new int[len];

        @Setup(Level.Trial)
        public void setup() {
            // preparing 10 million random integers without MIN_VALUE
            for (int i = 0; i < len; i++) {
                int val;

                do {
                    val = ThreadLocalRandom.current().nextInt();
                } while (val == Integer.MIN_VALUE);

                values[i] = val;
            }
        }
    }

    @Benchmark
    public int safeAbsSlow(Data data) {
        int sum = 0;

        for (int i = 0; i < data.len; i++)
            sum += safeAbsSlow(data.values[i]);

        return sum;
    }

    @Benchmark
    public int safeAbsFast(Data data) {
        int sum = 0;

        for (int i = 0; i < data.len; i++)
            sum += safeAbsFast(data.values[i]);

        return sum;
    }

    private int safeAbsSlow(int i) {
        i = Math.abs(i);

        return i < 0 ? 0 : i;
    }

    private int safeAbsFast(int i) {
        return i == Integer.MIN_VALUE ? 0 : Math.abs(i);
    }

    public static void main(String[] args) throws RunnerException {
        final Options options = new OptionsBuilder()
            .include(SafeAbsMicroBench.class.getSimpleName())
            .build();

        new Runner(options).run();
    }
}

Results (Linux x86-64, 7820HQ; checked on Oracle JDK 8 and 11 with very similar results):

Benchmark                      Mode  Cnt         Score        Error  Units
SafeAbsMicroBench.safeAbsFast  avgt   10   6435155.516 ±  47130.767  ns/op
SafeAbsMicroBench.safeAbsSlow  avgt   10  35646411.744 ± 776173.621  ns/op

Can someone explain why the first version is significantly slower than the second?

Oleksandr Pyrohov :

There is a difference in the generated native code for the safeAbsSlow and safeAbsFast methods.

safeAbsSlow (C2, level 4):

0x0000023d12ec4b14: add     eax,ecx
0x0000023d12ec4b16: inc     ebx

0x0000023d12ec4b18: cmp     ebx,989680h
0x0000023d12ec4b1e: jnl     23d12ec4b4eh ; jump if `ebx` was not less than `10_000_000`

0x0000023d12ec4b20: mov     ecx,dword ptr [r9+rbx*4+10h]

0x0000023d12ec4b25: test    ecx,ecx
0x0000023d12ec4b27: jnl     23d12ec4b14h ; jump if `ecx` was not less-than `0`

0x0000023d12ec4b29: neg     ecx

0x0000023d12ec4b2b: test    ecx,ecx
0x0000023d12ec4b2d: jnl     23d12ec4b14h ; jump if `ecx` was not less-than `0`

safeAbsFast (C2, level 4):

0x000001d89e8a4b20: mov     ecx,dword ptr [r9+rdi*4+10h]

0x000001d89e8a4b25: cmp     ecx,80000000h
0x000001d89e8a4b2b: je      1d89e8a4b66h ; jump if `ecx` was equal to `Integer.MIN_VALUE` (80000000h)

0x000001d89e8a4b2d: mov     r11d,ecx
0x000001d89e8a4b30: neg     r11d
0x000001d89e8a4b33: test    ecx,ecx
0x000001d89e8a4b35: cmovl   ecx,r11d

0x000001d89e8a4b39: add     eax,ecx
0x000001d89e8a4b3b: inc     edi

0x000001d89e8a4b3d: cmp     edi,989680h
0x000001d89e8a4b43: jl      1d89e8a4b20h ; jump if `edi` was less than `10_000_000`

As we can see from the above, safeAbsSlow has more conditional jumps than safeAbsFast.

In particular, the Math.abs implementation inlined into safeAbsFast has no conditional jumps; it uses a conditional move (cmovl) instead:

0x000001d89e8a4b2d: mov     r11d,ecx
0x000001d89e8a4b30: neg     r11d
0x000001d89e8a4b33: test    ecx,ecx
0x000001d89e8a4b35: cmovl   ecx,r11d
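The cmovl sequence picks between the value and its negation without jumping. A rough source-level analogue is the classic sign-mask idiom, sketched below. This is not what HotSpot emits and not necessarily faster; the class and method names (BranchFreeAbs, absNoBranch, safeAbsNoBranch) are hypothetical, and whether the JIT keeps this code branch-free would have to be checked in the generated assembly.

```java
public class BranchFreeAbs {
    // mask is 0 for non-negative i and -1 (all bits set) for negative i.
    // For negative i, (i ^ mask) - mask equals ~i + 1, i.e. -i.
    // Like Math.abs, this still returns MIN_VALUE for MIN_VALUE.
    public static int absNoBranch(int i) {
        int mask = i >> 31;
        return (i ^ mask) - mask;
    }

    // After abs, the result is negative only for MIN_VALUE.
    // ~(a >> 31) is 0 in that case and -1 otherwise, so the AND
    // either clears the value to 0 or leaves it unchanged.
    public static int safeAbsNoBranch(int i) {
        int a = absNoBranch(i);
        return a & ~(a >> 31);
    }
}
```

The point of the sketch is only that both the absolute value and the clamp to 0 can be expressed as straight-line arithmetic, which gives the branch predictor nothing to mispredict.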

Because of these extra conditional jumps, the slow version incurs many more branch misses than the normal version when positive and negative values are scattered randomly across the array. Below are the corresponding statistics, collected using the perf Linux profiler:

Benchmark                          Mode  Cnt          Score         Error  Units
safeAbsFast                        avgt   10    9611659.726 ± 1429082.431  ns/op
safeAbsFast:branch-misses          avgt            2869.853                 #/op
safeAbsFast:branches               avgt        12492918.020                 #/op
safeAbsFast:cycles                 avgt        28212203.936                 #/op
safeAbsFast:instructions           avgt        92352048.153                 #/op
safeAbsSlow                        avgt   10   44524180.366 ± 6324887.086  ns/op
safeAbsSlow:branch-misses          avgt         5006493.144                 #/op
safeAbsSlow:branches               avgt        17496069.911                 #/op
safeAbsSlow:cycles                 avgt       126413171.674                 #/op
safeAbsSlow:instructions           avgt        67549877.558                 #/op

In contrast, here is the result for the sorted data set:

Benchmark                          Mode  Cnt         Score         Error  Units
safeAbsFast                        avgt   10   9026800.584 ±  528992.157  ns/op
safeAbsFast:branch-misses          avgt           2785.463                 #/op
safeAbsFast:branches               avgt       12474751.905                 #/op
safeAbsFast:cycles                 avgt       27379727.603                 #/op
safeAbsFast:instructions           avgt       92418075.715                 #/op
safeAbsSlow                        avgt   10   6981828.374 ± 2375480.834  ns/op
safeAbsSlow:branch-misses          avgt           2801.022                 #/op
safeAbsSlow:branches               avgt       17496585.992                 #/op
safeAbsSlow:cycles                 avgt       19478382.113                 #/op
safeAbsSlow:instructions           avgt       67589946.278                 #/op

The previously slow version becomes even faster than the normal one when the data set is sorted, since costly branch misses are minimized in this case.
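To get a feel for why sorting helps, one can count how often the sign changes between neighbouring elements, which is how often the `test ecx,ecx` / `jnl` branch changes direction. The sketch below is not part of the benchmark: the class SignFlipCount, the method countSignFlips, the array size and the seed are all arbitrary choices made for illustration.

```java
import java.util.Arrays;
import java.util.Random;

public class SignFlipCount {
    // Counts how often consecutive elements differ in sign, i.e. how often
    // a "value < 0" branch would change direction while scanning the array.
    public static int countSignFlips(int[] values) {
        int flips = 0;
        for (int i = 1; i < values.length; i++) {
            if ((values[i - 1] < 0) != (values[i] < 0)) {
                flips++;
            }
        }
        return flips;
    }

    public static void main(String[] args) {
        Random random = new Random(42); // fixed seed, chosen arbitrarily
        int[] values = new int[1_000_000];
        for (int i = 0; i < values.length; i++) {
            values[i] = random.nextInt();
        }
        System.out.println("random: " + countSignFlips(values));

        Arrays.sort(values); // all negatives first, then all non-negatives
        System.out.println("sorted: " + countSignFlips(values));
    }
}
```

In a sorted array the sign can change at most once, so the branch is almost perfectly predictable. With uniformly random values the direction changes roughly every other element, which a predictor cannot learn.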


Environment:

openjdk version "12-internal" 2019-03-19
OpenJDK Runtime Environment (slowdebug build 12-internal+0-adhoc.jdk12)
OpenJDK 64-Bit Server VM (slowdebug build 12-internal+0-adhoc.jdk12, mixed mode)
