Saturday 29 March 2014

add/or, or, add?

Another micro-optimisation that gcc isn't grabbing.

Example code:

 unsigned int a;
 unsigned int b;

 unsigned int foo(void *dma) {
     return ((unsigned int)dma << 16) | 1;
 }

 -->
00000000 _foo:
   0:   0216            lsl r0,r0,0x10
   2:   2023            mov r1,0x1
   4:   00fa            orr r0,r0,r1
   6:   194f 0402       rts

This is part of the calculation to form a dma start code.

This compiles literally as expected - but orr requires a register argument so it needs an additional register and the load (unfortunately, would be nice to have a fully orthogonal instruction set on such matters). Because the lower 5 bits will always be clear thanks to the shift one can just use an add instead.

 unsigned int fooa(void *dma) {
     return ((unsigned int)dma << 16) + 1;
 }

 -->
00000000 _fooa:
   0:   0216            lsl r0,r0,0x10
   2:   0093            add r0,r0,1
   4:   194f 0402       rts

If the constant is greater than 3 bits (or something that can be made with a negative 3-bit number) then the code-size will grow by two bytes - however it is still only 2 instructions and requires no auxiliary register and allows setting up to 10 bits with a constant.

Actually trying to inline some dma setting stuff hit some interesting issues. Because the base pointer to the dma control packet is only turned into an integer it isn't referenced - it's possible for the compiler to completely optimise the initialisation code out of existence.

static dma_start(int chan, ez_dma_desc_t *dma) {
   uint start = ((uint)dma << 16) | 1;

   set_reg(E_DMA0CONFIG, start);
}

static dma_run(int chan, ez_dma_desc_t *dma) {
   dma_wait(chan);
   dma_start(chan, dma);
   dma_wait(chan);
}

void ez_dma_memcpy(int chan, void *dst, void *src, size_t size) {
 uint align = ((uint) dst | (uint)src | (uint)size) & 0x7;
 uint shift = dma_shift[align];
 ez_dma_desc_t dma;

 dma.config = 3 + (shift << 5);
 dma.inner_stride = 0x00010001 << shift;
 dma.count = 0x10000 | (size >> shift);
 dma.outer_stride = 0x00010001 << shift;
 dma.src_addr = src;
 dma.dst_addr = dst;

 dma_run(chan, &dma);
}
If this code is in the same file optimisation may decide to inline all the functions even they weren't marked as such. Since the only use of dma is as an integer it "loses" it's pointedness and thus it's reference to the object. I dunno, I suppose it's a valid optimisation but an interesting gotcha nonetheless. It is fixed by making the dma declaration volatile.

Actually I had some other strange behaviour with this routine. Originally I was using an initialiser to set the content of dma. As in:

    ez_dma_desc_t dma = {
        .config = 3 + (shift << 5),
        .inner_stride = 0x00010001 << shift,
        etc
    };

But yeah, this did weird shit. It seemed to build the full content of the structure on the stack in a staging buffer, and then copy it to the actual structure, also on the stack. In -Os mode this even uses a call memcpy? *shrug* I'm really not sure why it should do that unless it's trying to implement some alignment restriction but I'm pretty sure structs are 8-byte aligned anyway (they would have to be). It's a syntax I use all the time because it's so handy ...

ez-lib-tiny

I've been working on more compact 'e-lib' for epiphany, this is mostly gained by generous use of inline and inline asm where appropriate in many cases converting a function call into fewer instructions than the invocation sequence. But I have also investigated the code size of almost every routine compared to hand crafted code.

The compiler always loses, sometimes by over a 100% increase over the hand-crafted code size.

Bullshit quote of the day: "compilers can create better code than you can".

So obviously I've been looking at a lot of compiler output in the last few days and that's clearly far from the truth.

Actually i'm pretty bummed out that compilers are still not able to do a really great job "in this day and age", even with cpu that has such a simple instruction set and a ton of registers which should be very compiler friendly. I think it just shows what a complex optimisation problem translating high level text into machine code really is.

I'm not having a go at compiler writers but those who claim how good they are usually from a position of ignorance and because they read it somewhere and it sounded authoritative. Optimising compilers are dreadfully complex things and trying to convert expert knowledge into an algorithm isn't an easy task at all.

No comments: