Saturday 30 May 2015

hotspot code generation, optimisation and deoptimisation

As promised here is an article about the hotspot code generation using the disassembler plugin mention in the last post. I was nearly going to not do it but i'd already done some playing with it.

Unfortunately I had to use AMD64 instructions here; i think the ISA is pretty shithouse so I haven't bothered to learn it very well so i'm doing some guessing below. I even downloaded the APMs from AMD (i find the intel docs quite poor) to look some stuff up.

For the C code i'm using gcc 4.8.2 with -mtune=native -std=gnu99 and -Ox as indicated in the text.

The actual test calculates 1000x dot products of 2^20 elements each. For java i'm using System.nanoTime() and printing the best result across all runs. For C i couldn't be bothered with the gettimeofday() stuff so i'm just using the time command - over 1000 iterations the difference should be negligable and there are some interesting results regardless.

Simple loop

This is the starting function; obvious what it does.

public float dot(float[] a, float[] b, int len) {
        float v = 0;
        for (int i=0;i<len;i++)
                v += a[i] * b[i];
        return v;
}
A C version is identical apart from using pointers rather than arrays and some extra fluffly conventions.
float dot(const float *a, const float  *b, int len) {
        float v = 0;
        for (int i=0;i<len;i++)
                v += a[i] * b[i];
        return v;
}

First pass

After some iterations hotspot will recognise this function could benefit from optimisation and this is what jdk8 spits out at the first compilation pass.

This is using gcc syntax so instruction operands are srca,[srcb,],dst rather than the more conventional dst,srca[,srcb].

.1: movslq %esi,%rdi
    jae    .exception0
    vmovss 0x10(%rdx,%rdi,4),%xmm1

    movslq %esi,%rdi
    jae    .exception1
    vmovss 0x10(%rcx,%rdi,4),%xmm2

    vmulss %xmm2,%xmm1,%xmm1
    vaddss %xmm0,%xmm1,%xmm1
    inc    %esi

    mov    $0x7ffdffc00ce8,%rdi
    mov    0xe0(%rdi),%ebx
    add    $0x8,%ebx
    mov    %ebx,0xe0(%rdi)

    mov    $0x7ffdffc00488,%rdi
    and    $0xfff8,%ebx
    cmp    $0x0,%ebx
    je     .2

.3: test   %eax,0x15e4076a(%rip)
    mov    $0x7ffdffc00ce8,%rdi
    incl   0x128(%rdi)
    vmovaps %xmm1,%xmm0

    cmp    %r8d,%esi
    mov    $0x7ffdffc00ce8,%rdi
    mov    $0x108,%rbx
    jge    .4
    mov    $0x118,%rbx
.4: mov    (%rdi,%rbx,1),%rax
    lea    0x1(%rax),%rax
    mov    %rax,(%rdi,%rbx,1)
    jl     .1

;; clean up and exit

.2: mov    %rdi,0x8(%rsp)
    movq   $0x1d,(%rsp)
    callq  some_function
    jmpq   .3

Of these 11 are for the loop itself, the rest seem to be for profiling the loop.

As far as it goes it looks fairly decent - pretty much gcc -O2 level of optimisation with array bounds checking performed at each array read.

Of course the profiling adds a lot of overhead here.

The following is the output for the inner loop of gcc -O2.

  10:   f3 0f 10 0c 87          movss  (%rdi,%rax,4),%xmm1
  15:   f3 0f 59 0c 86          mulss  (%rsi,%rax,4),%xmm1
  1a:   48 ff c0                inc    %rax
  1d:   39 c2                   cmp    %eax,%edx
  1f:   f3 0f 58 c1             addss  %xmm1,%xmm0
  23:   7f eb                   jg     10

The only real difference apart from having no bounds checking is that it multiplies directly from memory rather than through a register. The latter is how every other mainstream cpu does it so that may have some bearing on it.

I can't easy do comparison timing of the loops (and it isn't very meaningful) but obviously the java will be slower here, and probably on-par with -O0 output from gcc.

Final pass

After it has gained some profiling information the result will be optimised - in this case it recompiles it twice more. The inner loop of the final pass is below:

.1: vmovss 0x10(%rbx,%r14,4),%xmm0
    vmulss 0x10(%rcx,%r14,4),%xmm0,%xmm1
    vaddss %xmm3,%xmm1,%xmm0
    movslq %r14d,%r10
    vmovss 0x2c(%rbx,%r10,4),%xmm2
    vmulss 0x2c(%rcx,%r10,4),%xmm2,%xmm8
    vmovss 0x14(%rbx,%r10,4),%xmm1
    vmulss 0x14(%rcx,%r10,4),%xmm1,%xmm2
    vmovss 0x18(%rcx,%r10,4),%xmm1
    vmulss 0x18(%rbx,%r10,4),%xmm1,%xmm3
    vmovss 0x28(%rbx,%r10,4),%xmm1
    vmulss 0x28(%rcx,%r10,4),%xmm1,%xmm4
    vmovss 0x1c(%rcx,%r10,4),%xmm1
    vmulss 0x1c(%rbx,%r10,4),%xmm1,%xmm5
    vmovss 0x20(%rbx,%r10,4),%xmm1
    vmulss 0x20(%rcx,%r10,4),%xmm1,%xmm6
    vmovss 0x24(%rbx,%r10,4),%xmm1
    vmulss 0x24(%rcx,%r10,4),%xmm1,%xmm7
    vaddss %xmm2,%xmm0,%xmm0
    vaddss %xmm0,%xmm3,%xmm1
    vaddss %xmm1,%xmm5,%xmm1
    vaddss %xmm1,%xmm6,%xmm0
    vaddss %xmm0,%xmm7,%xmm1
    vaddss %xmm1,%xmm4,%xmm0
    vaddss %xmm0,%xmm8,%xmm3

    add    $0x8,%r14d

    cmp    %r8d,%r14d
    jl     .1

    cmp    %ebp,%r14d
    jge    .done
    xchg   %ax,%ax
.2: cmp    %edi,%r14d
    jae    .stuff0
    vmovss 0x10(%rcx,%r14,4),%xmm1

    cmp    %r9d,%r14d
    jae    .stuff1

    vmulss 0x10(%rbx,%r14,4),%xmm1,%xmm1
    vaddss %xmm1,%xmm3,%xmm3

    inc    %r14d

    cmp    %ebp,%r14d
    jl     .2
.done:

So this has removed all the array bounds checking from inside the loop (it's elsewhere - too bulky/not important here). It's also unrolled the loop 8x and is using modern 3-operand instructions to stagger most of the operations for better throughput on typical RISC cpus (I have no knowledge of the AMD scheduling rules). And finally it tacks on a simple 1-element loop to finish off anything left over.

Comparing this to the output of gcc -O3 ...

  30:   f3 0f 10 09             movss  (%rcx),%xmm1
  34:   41 83 c0 10             add    $0x10,%r8d
  38:   0f 18 49 50             prefetcht0 0x50(%rcx)
  3c:   0f 18 48 50             prefetcht0 0x50(%rax)
  40:   48 83 c1 40             add    $0x40,%rcx
  44:   48 83 c0 40             add    $0x40,%rax
  48:   f3 0f 59 48 c0          mulss  -0x40(%rax),%xmm1
  4d:   f3 0f 58 c1             addss  %xmm1,%xmm0
  51:   f3 0f 10 49 c4          movss  -0x3c(%rcx),%xmm1
  56:   f3 0f 59 48 c4          mulss  -0x3c(%rax),%xmm1
  5b:   f3 0f 58 c1             addss  %xmm1,%xmm0
  5f:   f3 0f 10 49 c8          movss  -0x38(%rcx),%xmm1
  64:   f3 0f 59 48 c8          mulss  -0x38(%rax),%xmm1
  69:   f3 0f 58 c1             addss  %xmm1,%xmm0
  6d:   f3 0f 10 49 cc          movss  -0x34(%rcx),%xmm1
  72:   f3 0f 59 48 cc          mulss  -0x34(%rax),%xmm1
  77:   f3 0f 58 c1             addss  %xmm1,%xmm0
  7b:   f3 0f 10 49 d0          movss  -0x30(%rcx),%xmm1
  80:   f3 0f 59 48 d0          mulss  -0x30(%rax),%xmm1
  85:   f3 0f 58 c1             addss  %xmm1,%xmm0
  89:   f3 0f 10 49 d4          movss  -0x2c(%rcx),%xmm1
  8e:   f3 0f 59 48 d4          mulss  -0x2c(%rax),%xmm1
  93:   f3 0f 58 c1             addss  %xmm1,%xmm0
  97:   f3 0f 10 49 d8          movss  -0x28(%rcx),%xmm1
  9c:   f3 0f 59 48 d8          mulss  -0x28(%rax),%xmm1
  a1:   f3 0f 58 c1             addss  %xmm1,%xmm0
  a5:   f3 0f 10 49 dc          movss  -0x24(%rcx),%xmm1
  aa:   f3 0f 59 48 dc          mulss  -0x24(%rax),%xmm1
  af:   f3 0f 58 c1             addss  %xmm1,%xmm0
  b3:   f3 0f 10 49 e0          movss  -0x20(%rcx),%xmm1
  b8:   f3 0f 59 48 e0          mulss  -0x20(%rax),%xmm1
  bd:   f3 0f 58 c1             addss  %xmm1,%xmm0
  c1:   f3 0f 10 49 e4          movss  -0x1c(%rcx),%xmm1
  c6:   f3 0f 59 48 e4          mulss  -0x1c(%rax),%xmm1
  cb:   f3 0f 58 c1             addss  %xmm1,%xmm0
  cf:   f3 0f 10 49 e8          movss  -0x18(%rcx),%xmm1
  d4:   f3 0f 59 48 e8          mulss  -0x18(%rax),%xmm1
  d9:   f3 0f 58 c1             addss  %xmm1,%xmm0
  dd:   f3 0f 10 49 ec          movss  -0x14(%rcx),%xmm1
  e2:   f3 0f 59 48 ec          mulss  -0x14(%rax),%xmm1
  e7:   f3 0f 58 c1             addss  %xmm1,%xmm0
  eb:   f3 0f 10 49 f0          movss  -0x10(%rcx),%xmm1
  f0:   f3 0f 59 48 f0          mulss  -0x10(%rax),%xmm1
  f5:   f3 0f 58 c1             addss  %xmm1,%xmm0
  f9:   f3 0f 10 49 f4          movss  -0xc(%rcx),%xmm1
  fe:   f3 0f 59 48 f4          mulss  -0xc(%rax),%xmm1
 103:   f3 0f 58 c1             addss  %xmm1,%xmm0
 107:   f3 0f 10 49 f8          movss  -0x8(%rcx),%xmm1
 10c:   f3 0f 59 48 f8          mulss  -0x8(%rax),%xmm1
 111:   f3 0f 58 c1             addss  %xmm1,%xmm0
 115:   f3 0f 10 49 fc          movss  -0x4(%rcx),%xmm1
 11a:   f3 0f 59 48 fc          mulss  -0x4(%rax),%xmm1
 11f:   45 39 c8                cmp    %r9d,%r8d
 122:   f3 0f 58 c1             addss  %xmm1,%xmm0
 126:   0f 85 04 ff ff ff       jne    30
  
 12c:   49 63 c0                movslq %r8d,%rax
 12f:   48 c1 e0 02             shl    $0x2,%rax
 133:   48 01 c7                add    %rax,%rdi
 136:   48 01 c6                add    %rax,%rsi
 139:   31 c0                   xor    %eax,%eax
 13b:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)

 140:   f3 0f 10 0c 87          movss  (%rdi,%rax,4),%xmm1
 145:   f3 0f 59 0c 86          mulss  (%rsi,%rax,4),%xmm1
 14a:   48 ff c0                inc    %rax
 14d:   41 8d 0c 00             lea    (%r8,%rax,1),%ecx
 151:   39 ca                   cmp    %ecx,%edx
 153:   f3 0f 58 c1             addss  %xmm1,%xmm0
 157:   7f e7                   jg     140

The main differences here are that it unrolls the loop 16x here. It only uses the 2-operand instructions - it uses fewer registers. It has also transformed the array indexing into pre-increment pointer arithmetic (in batches).

Well this definitely isn't a RISC cpu as that scheduling looks pants as everything keeps writing to the same registers. But x86 being so dominant has allowed a lot of money to be spent optimising the chip to run shitty code faster to make up for the compiler.

Benchmarks

Here are some timing results. All values are in ms for equivalent of 1 loop (or seconds for 1000 loops).

what       time
 gcc -O0    4.86
     -O2    1.44
     -O3    1.44

 java       1.60
 time java  1.7

The last is using the 'time' command on the whole java loop. i.e. this includes the jvm startup, profiling, and compilation. This isn't too shabby.

Either way these times are pretty good vs effort - maybe one or the other is more tuned to the cpu I have vs intel stuff but it's really neither here nor there.

Unrolled loop

Actually what prompted the idea for this article was some results I had from unrolling loops 4x in Java. I subsequently found that unrolling 2x did just as good a job in this case so i'll do that here just for simplicity. The assembly is almost identical anyway as it just gets unrolled an additional 2x rather than 4x by the compiler.

public float dot(float[] a, float[] b, int len) {
 float v0 = 0, v1=0;
 int i = 0;
 for (int e = len & ~1;i<e;i+=2) {
  v0 += a[i] * b[i];
  v1 += a[i+1] * b[i+1];
 }
 for (;i<len;i++)
  v0 += a[i] * b[i];
 return v0+v1;
}

Final pass

And here's just the inner loop of the final pass:

.1: vmovss 0x10(%rcx,%r8,4),%xmm0
    vmulss 0x10(%rdx,%r8,4),%xmm0,%xmm1
    vaddss %xmm3,%xmm1,%xmm0
    movslq %r8d,%r11
    vmovss 0x2c(%rcx,%r11,4),%xmm2
    vmulss 0x2c(%rdx,%r11,4),%xmm2,%xmm9
    vmovss 0x24(%rcx,%r11,4),%xmm1
    vmulss 0x24(%rdx,%r11,4),%xmm1,%xmm8
    vmovss 0x1c(%rcx,%r11,4),%xmm2
    vmulss 0x1c(%rdx,%r11,4),%xmm2,%xmm1
    vmovss 0x18(%rcx,%r11,4),%xmm3
    vmulss 0x18(%rdx,%r11,4),%xmm3,%xmm2
    vmovss 0x14(%rcx,%r11,4),%xmm4
    vmulss 0x14(%rdx,%r11,4),%xmm4,%xmm3
    vmovss 0x20(%rcx,%r11,4),%xmm5
    vmulss 0x20(%rdx,%r11,4),%xmm5,%xmm4
    vmovss 0x28(%rcx,%r11,4),%xmm6

    vmulss 0x28(%rdx,%r11,4),%xmm6,%xmm5
    vaddss %xmm3,%xmm0,%xmm3
    vaddss %xmm2,%xmm3,%xmm0
    vaddss %xmm1,%xmm0,%xmm1
    vaddss %xmm1,%xmm4,%xmm0
    vaddss %xmm8,%xmm0,%xmm1
    vaddss %xmm1,%xmm5,%xmm0
    vaddss %xmm0,%xmm9,%xmm3

    add    $0x8,%r8d

    cmp    %r10d,%r8d
    jl     .1

So now it's unrolled the loop an addition 4x times so that it looks the same at first glance. But now the moves have been spread across many registers rather than mostly going through xmm1. It runs quite a bit faster.

This is getting too long so i wont include it but the same simple modification applied to the C version also makes a difference - quite a big one. The generated code is almost identical apart from every second xmm0 being replaced with xmm1 - i.e. interleaved as written.

Benchmarks

And here's some benchmarks of this 'version'.

what       time
 gcc -O0    2.76
     -O2    0.833
     -O3    0.735

 java       1.00
 time java  1.2

Conclusions

Well hotspot is pretty good, but could be a little bit better. And it seems mostly just to fall down on some seemingly simple areas like instruction scheduling (simple compared to the rest of the work it's done).

Although I don't have enough knowledge of the architecture here to state that the original scheduling isn't very optimal the benchmark results probably speak loud enough in that absence. It is clearly not optimal as the same machine code which interleaves the output register runs 2x faster in the C case. I don't really feel like translating this to assembly so i can see if some simple re-arrangement would make a difference.

But what is odd that neither compiler is doing this on it's own, one could argue (quite convincingly) that due to floating point peculiarities (addition is only weakly associative) both loops are not actually calculating the same result. In the case of hotspot however this argument is weak because the optimised version is already spreading the addition accross multiple registers.

Lambdas & de-optimisation

This is getting long and the next part could probably go into another article but i've spent enough of my weekend on this so i'll get it out of the way now with a quick summary.

For simplcity I created the following simple 3-parameter map/reduce operation.

public interface FloatTrinaryFunction {
 public float applyAsFloat(float a, float b, float c);
}

public float reduce(float[] a, float[] b, int len, FloatTrinaryFunction func) {
 float v = 0;
 for (int i=0;i<len;i++)
  v = func.applyAsFloat(v, a[i], b[i]);
 return v;  
}
And invoke it thus:
  reduce(a, b, a.length, (float v, float x, float y)->v + x*y);

Opt and de-opt

In short, if you use up to two lambdas it results in equivalent code to the direct dot product equation - nice. But once you go to three or more it de-optimises the loop and reverts to a function call. It also spends more time in the compiler.

The following is what the deoptimised loop looks like:

.1: mov    (%rsp),%rdx
    mov    %rdx,(%rsp)
    cmp    %r10d,%ebp
    jae    .exception0
    mov    0x8(%rsp),%r10
    vmovss 0x10(%rdx,%rbp,4),%xmm1

    cmp    %r10d,%ebp
    jae    .exception1
    mov    0x8(%rsp),%r10
    vmovss 0x10(%r10,%rbp,4),%xmm2

    mov    0x18(%rsp),%rsi
    xchg   %ax,%ax
    mov    $0xffffffffffffffff,%rax
    callq  applyAsFloat

    inc    %ebp

    cmp    0x10(%rsp),%ebp
    jl     .1

So it retains the array bounds checks inside the loop (bummer) and invokes the interface as a function call (expected), but it removes any profiling instrumentation that was present in the first pass (expected also) and generates the smallest code.

This hits around the 4.5ms mark.

Conclusions 2

It's important to note that this is just a run-time decision made by the current version of hotspot - this could be changed or could be tweaked in the future. And as I showed in some previous posts it can be worked-around even with the current hotspot using some bytecode foo.

Given the prevalence of lambdas in java8 i suspect it is something that will gain some tuning attention in future revisions. It's not something one would change lightly so it will probably be based on profiling data and usage.

No comments: