Monday 30 June 2014

static resource init

I noticed some activity on a thread in the parallella forum and whilst following that up I had another look at how to get the file loader to automgically map symbols across multiple cores.

Right now the loader will simply resolve the address to a core-local address and the on-core code needs to manually do any resolution to remote cores using ez_global_address() and related functions.

First I tried creating runtime meta-data for symbols which let the loader know how the address should be mapped on a given core. Once I settled on some basic requirements I started coding something up but quickly realised that it was going to take quite a bit of work. The first approach was to use address matching so that for example any address which matched a remapped address could be rewritten as necessary. But this wont work because in-struct addresses could fall through. So then I thought about using the reloc record symbol index. This could work fine if the address is in the same program on a different core, but it falls over if the label is in a different program; which is the primary case i'm interested in. Then I looked into just including a symbolic name but this is a bit of a pain because the loader then needs to link the string pointers before it can be used. So I just copied the symbol name into a fixed sized buffer. But after all that I threw it away as there were still a couple of fiddly cases to handle and it just seemed like too much effort.

So then I investigated an earlier idea I had which was to include some meta-data in a mangled symbol name. One reason I never followed it up was it seemed too inconvenient but then I realised I could use some macros to do some of the work:

  #define REF(var, row, col) _$_ ## row ## $_ ## col ## $_ ## var

  // define variable a stored on core 0,0
  extern int REF(a, 0, 0);

  // access variable a on core 0,0
  int b = REF(a, 0, 0);

This just creates a variable named "_$_0$_0$_a" and when the loader goes to resolve this symbol it can be mapped to the variable 'a' on the specific core as requested (i'm using _N to map group-relative, or N to map this-core-relative). The same mechanism just works directly for symbols defined in the same program as well since the linker doesn't know of any relation between the symbols.

So although this looked relatively straightforward and would probably work ... it would mean that the loader would need to explicitly relink (at least some of) the code for each core at load-time rather than just linking each program once and copying it over as a block; adding a lot of on-host overheads which can't be discounted. It didn't really seem to make much difference to code size in some simple tests either.

So that idea went out the window as well. I think i'll just stick to hand coding this stuff for now.

resources

However I thought i would revisit the meta-data section idea but use it in a higher level way: to define resources which need to be initialised before the code executes - e.g. to avoid race conditions or other messy setup issues.

I coded something up and it seems to be fairly clean and lean; I might keep this in. It still needs a bit of manipulation of the core image before it gets copied to each core but the processing is easier to separate and shouldn't have much overhead.

An example is initialising a port pair as used in the workgroup-test sample. There is actually no race condition here because each end only needs to initialise itself but needing two separate steps is error prone.

First the local endpoint is defined with a static initialiser and then it's reference to the remote port is setup at the beginning of main.

extern ez_port_t portb EZ_REMOTE;

ez_port_t porta = {
  .mask = 4-1
};

main() {
    int lid = ez_get_core_row();

    ez_port_setup(&aport, ez_global_core(&bport, lid, 0));

   ... can now use the port
}
This sets up one end of a port-pair between two columns of cores. Each column is loading a different program which communicates with the other via a port and some variables.

Using a resource record, this can be rewritten as:

extern ez_port_t bport EZ_REMOTE;

// includes the remote port which the loader/linker will resolve to the
// lower 15 bits of the address
ez_port_t aport = {
    .other = &bport,
    .mask = 4-1
};

// resource record goes into .eze.restab section which is used but not loaded
// the row coordinate is relative, i.e. 'this.row + 0',
// and the column of the target is absolute '1'.
static const struct ez_restab __r_aport
__attribute__((used,section(".eze.restab,\"\",@progbits ;"))) = {
    .flags = EZ_RES_REL_ROW | EZ_RES_PORT,
    .row = 0,
    .col = 1,
    .target= &aport
};

main() {
    ... can now use the port
}
It looks a bit bulky but it can be hidden in a single macro. I guess it could also update an arbitrary pointer too and this isn't far off what I started with - but it can only update data pointers not in-code immediate pointer loads that elf reloc records allow for.

Not sure what other resources I might want to add though and it's probably not worth it for this alone.

Sunday 29 June 2014

cexpi, power of 2 pi

I did a bit more playing with the sincos stuff and ended up rolling my own using a taylor series.

I was having issues with using single floating point to calculate the sincos values, the small rounding errors add up over a 1M element FFT. I tried creating a fixed-point implementation using 8.24 fixed point but that was slightly worse than a float version. It was quite close in accuracy though, but as it requires a 16x32-bit multiply with a 48-bit result, it isn't of any use on epiphany. For the integer algorithm I scaled the input by 0.5/PI which allows the taylor series coefficients to be represented in 8.24 fixed-point without much 'bit wastage'.

My final implementation uses an integer argument rather than a floating point argument which makes it easier to calculate the quadrant the value falls into and also reduces rounding error across the full input range. It calculates the value slightly differently and mirrors the angle in integer space before converting to float to ensure no extra error is added - this and the forming of the input value using an integer is where the additional accuracy comes from.

I tried a bunch of different ways to handle the swapping and ordering of the code (which surprisingly affects the compiler output) but the following was the best case with the epiphany compiler.

/*
 Calculates cexp(I * x * M_PI / (2^20))
   = cos(y) + I * sin(y)  |y = (x * M_PI / (2^20))

  where x is a positive integer.

  The constant terms are encoded into the taylor series coefficients.
*/
complex float
ez_cexpii(int ai) {
    // PI = 1
    const float As = 3.1415926535897931159979634685441851615906e+00f;
    const float Bs = -5.1677127800499702559022807690780609846115e+00f;
    const float Cs = 2.5501640398773455231662410369608551263809e+00f;
    const float Ds = -5.9926452932079210533800051052821800112724e-01f;
    const float Es = 8.2145886611128232646095170821354258805513e-02f;
    const float Fs = -7.3704309457143504444309733969475928461179e-03f;

    const float Bc = -4.9348022005446789961524700629524886608124e+00f;
    const float Cc = 4.0587121264167684842050221050158143043518e+00f;
    const float Dc = -1.3352627688545894990568285720655694603920e+00f;
    const float Ec = 2.3533063035889320580018591044790809974074e-01f;
    const float Fc = -2.5806891390014061182789362192124826833606e-02f;

    int j = ai >> (20-2);
    int x = ai & ((1<<(20-2))-1);

    // odd quadrant, angle = PI/4 - angle
    if (j & 1)
        x = (1<<(20-2)) - x;

    float a = zfloat(x) * (1.0f / (1<<20));
    float b = a * a;

    float ssin = a * ( As + b * ( Bs + b * (Cs + b * (Ds + b * (Es + b * Fs)))));
    float scos = 1.0f + b * (Bc + b * (Cc + b * (Dc + b * (Ec + b * Fc))));

    int swap = (j + 1) & 2;

    if (swap)
        return bnegate(ssin, j+2, 2) + I * bnegate(scos, j, 2);
    else
        return bnegate(scos, j+2, 2) + I * bnegate(ssin, j, 2);
}

This takes about 90 cycles per call and provides slightly better accuracy than libc's sinf and cosf (compared on amd64) across the full 2PI range. The function is 242 bytes long. If the taylor series only used 5 terms is about as accurate as libc's sinf and cosf and executes 6 cycles faster and fits in only 218 bytes.

Apart from accuracy the other reason for having a power of 2 being PI is that this what is required for the FFT anyway so it simplifies the angle calculation for that.

Whether this is worth it depends on how long it would take to load the values from a 1M entry table

As per the previous post a radix-1024 fft pass (using radix-4 fft stages) requires 341 x 2 = 682 sincos pairs for one calculation. For the first pass this can be re-used for each batch but for the second values must be calculated (or loaded) each time. This is approximately 61K cycles.

The calculation of the radix-1024 pass using radix-4 stages itself needs 1024 / 4 x log4(1024) x 8 complex multiply+add operations, or at least 40K cycles (complex multiply+add using 4xfma) plus some overheads.

Going on the memory profiling i've done so far, it's probably going to be faster calculating the coefficients on-core rather than loading them from memory and saves all that memory too.

Saturday 21 June 2014

sincos

So i've continuted to hack away on a 1M-element fft implementation for epiphany.

The implementation does 2xradix-1024 passes. The first just does 1024x1024-element fft's, and the second does the same but on elements which are strided by 1024 and changes the twiddle factors to account for the frequency changes. This lets it implement all the reads/writes using LDS and it only has to go to global memory twice.

At this point i'm pretty much doing it just for something to do because I did some testing with ffts and that can execute on a single arm core about 2x faster than the epiphany can even read and write the memory twice because all global memory accesses are uncached. According to some benchmarks I ran anyway - assuming the rev0 board is the same. Given that it's about the same speed for accessing the shared memory in the same way from the ARM (it's non-cached) I think the benchmark was valid enough.

Anyway one problem is that it still needs to load the twiddle factors from memory and for the second pass each stage needs it's own set. i.e. more bandwidth consumption. Since there's such a flop v mop impedance mismatch I had a look into generating the twiddle factors on the epiphany core itself.

Two issues arise: generating the frequency offset, and then taking the cosine and sine of this.

The frequency offset involves a floating-point division by a power of two. This could be implemented using fixed-point, pre-calculating the value on the ARM (or at compile time), or a reciprocal. Actually there's one other way - just shift the exponent on the IEEE representation of the floating point value directly. libc has a function for it called scalbnf() but because it has to be standards compliant and handle edge cases and overflows it's pretty bulky. A smaller/simpler version:

float scalbnf(float f, int logN) {
    union {
        float f;
        int i;
    } u;

    u.f = f;
    if (u.i)
        u.i += (logN << 23);

    return u.f;
}
This shifts f by logN - i.e. multiplies f by 2^N, and allows division by using negative logN. i.e. divide by 1024 by using logN=-10.

The second part is the sine and cosine. In software this is typically computed using enough steps of a Taylor Series to exhaust the floating point precision. I took the version from cephes which does this also but has some pre-processing to increase accuracy with fewer steps; it's calculation only works over over PI/4 and it uses pre and post-processing to map the input angle to this range.

But even then it was a bit bulky - standards compliance/range checking and the fpu mode register adds a bit of code to most simple operations due to hardware limitations. Some of the calculation is shared by sin and cos, and actually they both include the same two expressions which get executed depending on the octant of the input. By combining the two, removing the error handling, replacing some basic flops with custom ones good enough for the limited range and a couple of other tweaks I managed to get a single sincosf() function down to 222 bytes verses 482 for just the sinf() function (it only takes positive angles and does no error checking on range). I tried an even more cut-down version that only worked over a valid range of [0,PI/4] but it was too limited to be very useful for my purposes and it would just meant moving the octant handling to another place so wouldn't have saved any code-space.

I replaced int i = (int)f with a custom truncate function because I know the input is positive. It's a bit of a bummer that FIX changes it's behaviour based on a config register - which takes a long time to modify and restore.

I replaced float f = (float)i with a custom float function because I know the input is valid and in-range.

I replaced most of the octant-based decision making with non-branching arithmetic. For example this is part of the sinf() function which makes sign and pre-processing decisions based on the input octant.

    j = FOPI * x; /* integer part of x/(PI/4) */
    y = j;
    /* map zeros to origin */
    if( j & 1 ) {
        j += 1;
        y += 1.0f;
    }
    j &= 7; /* octant modulo 360 degrees */
    /* reflect in x axis */
    if( j > 3 ) {
        sign = -sign;
        j -= 4;
    }

 ... calculation ...

    if( (j==1) || (j==2) ) {
      // use the sin polynomial
    } else {
      // use the cos polynomial
    }

    if(sign < 0)
        y = -y;
    return( y);
This became:
   j = ztrunc(FOPI * x);
   j += (j & 1);
   y = zfloat(j);
   sign = zbnegate(1.0f, j+1, 2);
   swap = (j+1) & 2;

 ... calculation ...
    if( swap ) {
      // use the sin polynomial
    } else {
      // use the cos polynomial
    }

    return y * sign;
Another function zbnegate negates it's floating point argument if a specific bit of the input value is set. This can be implemented straightforward in C and ends up as three CPU instructions.
static inline float zbnegate(float a, unsigned int x, unsigned int bit) {
    union {
        float f;
        int i;
    } u;

    u.f = a;
    u.i ^= (x >> bit) << 31;

    return u.f;
}

Haven't done any timing on the chip yet. Even though the original implementation uses constants from the .data section the compiler has converted all the constant loads into literal loads. At first I thought this wasn't really ideal because double-word loads could load the constants from memory in fewer ops but once the code is sequenced there are a lot of execution slots left to fill between the dependent flops and so this ends up scheduling better (otoh such an implementation may be able to sqeeze a few more bytes out of the code).

Each radix-1024 fft needs to calculate 341x2 sin/cos pairs which will add to the processing time but the number of flops should account for the bandwidth limitations and it saves having to keep it around in memory - particularly if the problem gets any larger. The first pass only needs to calculate it once for all batches but the second pass needs to calculate each individually due to the phase shift of each pass. Some trigonometric identities should allow for some of the calculations to be removed too.

Tuesday 17 June 2014

borkenmacs

Emacs is giving me the shits today. It always used to be rock-solid but over the last few years the maintainers have been breaking shit left right and centre.

  • Indenting is slower than it used to be (and it was already pretty slow) and keeps breaking all the time - I have to keep quitting a file when it decides it's lost the plot and reload it;
  • Some fool thought that a line-oriented text-editor should track the cursor by visual lines by default;
  • c-macro-expand comes broken as packaged with a useless error message whenever you run it (incorrect cpp definition);
  • Locally rendered text bitmaps. i.e. slow text rendering, and slow as all shit over remote X11. And no misc-fixed fonts unless you hunt for them - i.e. shit fonts;
  • Shitty menu's and a really pointless toolbar that need to be turned off. The toolbar often breaks focus if it's left on;
  • Crappy colours, but then again most GNU/linux tools suffer from that and it's probably more to do with distro wank.

Some of it can be fixed by local configuration, but the rest is probably best fixed by going back to a previous version and compiling without the gtk+ toolkit.

Update: Oh i almost forgot because I haven't seen it in a while, bloody undo got broken at some point. It's particularly broken on ubuntu for some reason (where I just saw it, I don't normally use ubuntu but that's what parallella uses). Sometimes editing a file you can only undo a few lines. This is particularly bad if it's in the middle of a kill-yank-undo cycle as you can end up with unrecoverably lost work.

For a text editor I can think of no higher sin.

Monday 16 June 2014

bandwidth. limited.

I've been continuing to experiment with fft code. The indexing for a general-purpose routine got a bit fiddly so I went back to the parallella board to see about some basics first.

The news isn't really very good ...

I was looking at a problem of size 2^20, I was going to split it into two sections: 1024 lots of 1024-element fft's and then another pass which does 1024 lots of 1024-element fft's again but of data strided by 1024. This should mean the minimum external memory accesses - two full read+write passes - with most working running on-core. There's no chance of keeping it all on core, and even then two passes isn't much is it?

But I estimate that a simple radix-2 implementation executing on only two cores will be enough to saturate the external memory interface. FFT is pretty much memory bound even on a typical CPU so this isn't too surprising.

Saturday 14 June 2014

more fft musings

I was kinda getting nowhere with the fft code but then I saw something about doing large fft's on the parallella and it gave me some ideas to keep poking.

I did some experimenting with the memory access patterns (and towards the end had a dejavu moment - i did most of this last year too, ...) and a radix-8 implementation and started thinking about breaking a big problem onto the epiphany. I'm still hand-expanding the expressions using the basic radix-2 algorithm which is not optimal I believe but the more closer I get to grokking it the easier it will be to understand the more sophisticated implementations.

I was thinking of trying to use the mesh network as a sort of bandwidth multiplier, but the LDS is already one so i'll leave those experiments till later. It should be possible to do something like a 16K element fft on the board quite efficiently but one has to work out the addressing logic first.

So I took the first-stage of my algorithm which performs the bit-reversal permute and a radix-N transform (decimation in time) and converted it to a chunked version. The data transfer works out quite nicely - a single regular sparse DMA can load in the input data which can then be 'recursively' transformed in-place up to the available LDS size - 1K elements allows for double buffering. This can then be written out as a contiguous block ready for subsequent passes. I did some preliminary work on the subsequent passes too and I think I can do them in a similar way - and do multiple passes on-core before dumping it out. The trick will be to process the data in a way that writes can be grouped together at least some of the time.

So I think the data is doable in a fairly efficient way and the LDS can be used as a bandwidth multiplier. And if the LDS is effectively always doing the same-sized fft some other code efficiencies may come into play.

I haven't looked at the coefficient tables (aka 'twiddle' factors) yet - although i think there should be some regularities there as well.

Time passes ...

I had a closer look today after starting to write this this morning (it's now evening). The coefficients for the radix-4 algorithm turn out to be quite simple the way i'm stepping through everything and I only need to load 2 values for the 4 coefficients required for a radix-4 step. I then use them as a, a, b, ib (where i = sqrt(-1)). And they are evenly spaced across the table so they can be loaded using 2x dma - i should have enough memory to double-buffer the loads (hmm, or maybe just fit them all in for a given operation since the smaller stages need fewer coefficients).

I also started looking at how to break the stages into smaller parts. I broke a 256 element fft into two phases: first phase which includes the reordering which does 2x radix-4 passes for an equivalent of radix-16. It has to do 16x of these. I can then complete the job by doing another 16x 2x radix-4 passes using data spaced 16 elements apart. In this way each 'fft processor' only requires 16 elements of storage and the memory only has to be read and written twice in total. Scaling this up to what I can fit on the epiphany ... I could do a 1M element fft by performing radix-1024 stages on each core. I would have to do 2K of these.

Normally this would be a pretty inefficient memory access pattern with L1 cache unless you could load the data synchronously across the cores (if only the dma engines had 32-bit modulos) but given the memory interface on the epiphany and the lack of cache and burst mode reads I don't think it matters so much. It still affects writes but if that's a problem there are ways around it.

I haven't yet tried any of this on the epiphany though. On my workstation I did compare it to FFT Packaqe and although it's not as fast it's in the same city and i haven't fully broken it up into the way it needs to fit onto the epiphany yet. I'm just writing the expressions out in full using complex float and letting the compiler optimise all the sub-expressions and so forth.

Speaking of the compiler I tried some really basic ops to see how it did things, a bit odd but probably because it's sitting in a function:

complex float foo1(complex float a, complex float b, complex float w) {
 return a + b * w * I;
}

complex float foo0(complex float a, complex float b, complex float w) {
 return a + b * w;
}

I was just confirming that * I got converted into swapping (re,im) to (-im,re).

Compiler does this:

   1                            .file   "e-boo.c"
   2                            .section .text
   3                            .balign 4
   4                            .global _foo1
   5                    _foo1:
   6 0000 4C150044              ldr r16,[sp,#2]
   7 0004 EF840220              mov ip,r1
   8 0008 CC350044              ldr r17,[sp,#3]
   9 000c 2F2C0701              fmul r1,r3,r16
  10 0010 3F880721              fmadd ip,r2,r16
  11 0014 BF280701              fmadd r1,r2,r17
  12 0018 CF8C0721              fmsub ip,r3,r17
  13 001c 9740                  fsub r2,r0,r1
  14 001e EF300204              mov r1,ip
  15 0022 E208                  mov r0,r2
  16 0024 4F190204              rts
  17                            .size   _foo1, .-_foo1
  18                            .balign 4
  19                            .global _foo0
  20                    _foo0:
  21 0028 4C150044              ldr r16,[sp,#2]
  22 002c CC350044              ldr r17,[sp,#3]
  23 0030 2F8C0721              fmul ip,r3,r16
  24 0034 3F080701              fmadd r0,r2,r16
  25 0038 BF880721              fmadd ip,r2,r17
  26 003c EF000240              mov r16,r0
  27 0040 CF0C0741              fmsub r16,r3,r17
  28 0044 8F900724              fadd ip,ip,r1
  29 0048 EF000208              mov r0,r16
  30 004c EF300204              mov r1,ip
  31 0050 4F190204              rts
  32                            .size   _foo0, .-_foo0

It's doing the I * thing properly which is expected. But the way the result is formed isn't ideal. I think the delay slots added would be:

        ldr r16,[sp,#2]  ; w.r
        ldr r17,[sp,#3]  ; w.i
        fmul ip,r3,r16  ; b.i * w.r
        fmadd r0,r2,r16  ; a.r += b.r * w.r
 ;;
 ;;
 ;; 
        fmadd ip,r2,r17  ; b.i * a.r + b.r * w.i
        mov r16,r0              ; dual issue
        fmsub r16,r3,r17 ; a.r -= b.i * w.i
 ;;
 ;;
 ;; 
        fadd ip,ip,r1  ; a.i += (b.i * a.r + b.r * w.i)
        mov r0,r16
 ;;
 ;;
 ;; 
        mov r1,ip

Just using fma operations instead of a separate mul and add removes the need for an extra step, needs fewer instructions, and leaves the result in the right register anyway:

 ;; a.r += b.r * w.r - b.i * w.i;
 ;; a.i += b.i * w.r + b.r * w.i;

        ldr r16,[sp,#2]  ; w.r
        ldr r17,[sp,#3]  ; w.i

 fmadd r0,r2,r16 ; a.r += b.r * w.r
 fmadd r1,r2,r17 ; a.i += b.r * w.i
 ;;
 ;;
 ;; 
 fmsub r0,r3,r17 ; a.r -= b.i * w.i
 fmadd r1,r3,r16 ; a.i += b.i * w.r

This isn't really a big issue because normally there is more work going on and more opportunity to schedule things to fill the gaps, but I have noticed it isn't using the fma stuff as much as it might.

On a related note (not directly something i've seen out of the compiler) this is the basic step in an fft calculation:

  complex float b0 = a0 + w0 * a1;
  complex float b1 = a0 - w0 * a1;
One would normally expect a common sub-expression optimisation to (correctly, well perhaps: floats are very problematic in general and standards compliance might dictate a certain operational order - i'm using -ffast-math which ignores these details) turn this into:
  complex float w0a1 = w0 * a1;
  complex float b0 = a0 + w0a1;
  complex float b1 = a0 - w0a1;
And "on paper" it looks like an improvement: 1 complex multiply and 2 additions compared to 2 and 2; particularly since a complex multiple is 4x multiples and 2x additions itself.

But yeah, it isn't; actually it's worse ... The first can be represented by 8x multiplies and 8x additions in 10 instructions:

  mov b1r,a0r         ; dual issues
  fmadd a0r,w0r,a1r
  mov b1i,a0i         ; dual issues
  fmadd a0i,w0r,a1i
  fmsub b1r,w0r,a1r
  fmsub b1i,w0r,a1i
  ;;
  fmsub a0r,w0i,a1i
  fmadd a0i,w0i,a1r
  fmadd b1r,w0i,a1i
  fmsub b1i,w0i,a1r
The second can be represented as 4 multiplies and 6 additions in 8 instructions. Better!? No? It adds another delay stage requirement in the pipeline and has less work to do in each wasting further execution slots.
  fmul  w0a1r,w0r,a1r
  fmul  w0a1i,w0r,a1i
  ;;
  ;;
  ;;
  fmsub w0a1r,w0i,a1i
  fmadd w0a1i,w0i,a1r
  ;;
  ;;
  ;;
  fsub  b1r,a0r,w0a1r
  fadd  a0r,a0r,w0a1r
  fsub  b1i,a0i,w0a1i
  fadd  a0i,a0i,w0a1i
Assuming the dual issue ialu ops, the first takes 9 cycles and the second 14.

Again, further work to do or loop-unrolling will hide the delay slots, but it just shows you can't go on pure flops as a performance indicator.

On a slightly unrelated note I also had a quick look into extended precision arithmetic and a few other little bits and pieces. I found the paper ``Extended-Precision Floating-Point Numbers for GPU Computation'' by Andrew Thall that provides an extended precision floating point format using 2x 32-bit floating point numbers. I didn't get around to fully exploring it but it may be of use on the epiphany for certain applications since software doubles are not fast.

Thursday 12 June 2014

quick poke @ aparapi / hsa + fft = bummer

I have a very basic radix-2 fft algorithm that implements each pass as a single loop rather than multiple loops - this allows the algorithm to be parallelised[sic] trivially because each item is calculated in isolation. I converted it to java so I could experiment and compare - although given Java has no complex type it's easier experimenting in C! The single-loop version is slower than the two-loop one which is a bit of a shame but given that the radix-2 algorithm does so little work inside the loop it isn't really surprising.

(The algorithmic efficiency / performance isn't really important right now, it's just something to experiment with)

I tried running it using lambda expressions but the overhead of the thread communications swamped it - it's about 3x slower that way. This was no surprise.

So I thought i'd try it using HSA instead; and about the only bit of that I have handy right now is aparapi-lambda. I was hoping that using HSA would demonstrate where HSA could come into it's own so I hooked it up using aparapi-lambda given that's the only compiler I could think of that I have right now. Unfortunately there is a bit of an impedance mismatch between the way Aparapi and the JVM work and the way HSA does. Aparapi just translates the Java bytecode from javac directly into hsail assembly language; no problem there. But javac intentionally does no optimisation whatsoever - and leaves all that to the JVM instead which has more knowledge which allows it to do a better job. However HSA moves the optimisation to the compiler so that the HSA finaliser can be simpler - which makes it easier to port, smaller, and more robust and reliable.

To cut that long explanation short: it runs like shit because it's generating shit code and I can't really use it as any indication of performance. Bummer. It's about 3x slower than using lambdas.

So much for trying a short-cut - looks like I have to get my hands a bit dirtier on this one.

Oh then I remembered I had the graal stuff, but I forgot how to run it. So I tried updating and after a bit of frobbing about got it to run, ... I think. This generates better code but still has overly complex array indexing arithmetic ... and it's running much much slower too (coincidentally, around another 3x slower again).

So I updated gcc from the hsa branch and got that built but trying to do something will require a bit more work. I don't want to use libOkra for this so I started poking at the ioctls required to talk to the kfd device (not sure what kfd stands for but it's the kernel module which handles the HSA interface). I managed to get some info out of it so at least it's on the right track. It's a tiny interface and most of the work is done in userland and should be straightforward but there are some details which are important to do with cache coherency that I need to find out about.

I tried getting the HSA documents which would aid this work ... but they're all over the place, one is on some shitty website called sl1deshare which has an abysmal eye-hurting in-browser viewer and wont let me download the pdfs without a third-party account which I don't have.

Oh I see, if you send the HSA foundation a message the site sends you an email with a download link anyway. How ... annoying. I wonder what spam service I just inadvertently signed up for.

Hmm, I think that might be enough for today. And that reminds me that I haven't had breakfast yet, and together with another night of poor sleep i'm just not in the mood.

Update: Ok so I had a break and got back to it. But it seems like i misunderstood the abstraction a little bit and the finalising is done at the api level before it hits the device. Well that makes complete sense of course. Duh.

Anyway I tried getting libOkra to load a BRIG generated by gcc but it just aborts, probably due to some elf issues alluded to in the hsa branch of gcc.

So I guess it's just not ready for that kind of poking yet.

Monday 9 June 2014

fft to nowhere

I've been playing with fft's a bit the last few days.

I poked about trying to turn each pass into a single loop - this could allow it for example to be implemented using epiphany hardware loops. But with a radix-2 kernel there just isn't really enough flops to hide the overheads of the in-loop indexing calculations which are more complicated than simple loops.

For convolution it doesn't matter if the intermediate data is bit-reversed, so you can get away without reordering it but you need to use a decimation in frequency (dif) algorithm for the forward transform and a decimation in time (dit) for the inverse. So I gave that a go as well, but the dif algorithm was slower than the in-order algorithm, at least for the data and kernel-size I was playing with. I think I have too much loop arithmetic.

So after poking about with that stuff for a bit I tSo obviously stalls ried a radix-4 kernel. I did get something working but i'm not convinced i've done it the right way because i'm looking up the twiddle factors for each calculation as if were doing two stages of a radix-2 locally.

The compiler refuses to use double-word loads or stores for the complex float accesses though which is a bit of a bummer. So I tried coding up the first pass of the in-order routine - it does a bit-reversing permute and a radix-4 together - in assembly language for comparison. Using the hardware loops and some pre-calculation the inner loop only requires 3 instructions for the address calculations which is pretty tidy.

Probably not going to break any speed records but the code isn't too big.

I also looked at the inner loop - just the calculation part anyway, as the indexing arithmetic is pretty messy. Looks like you need a radix-4 kernel to remove all the stalls in the code, otherwise there isn't enough work to hide it (one could just unroll the loop once too, but since that's part of the calculation anyway may as well make it a radix-4 step). If i ever finish it i'll try putting it in a hardware loop as well.

To confirm some of the details about the timing on the hardware I created a small program which ran a small routine across all timings modes in CTIMER0. This records things like stalls and so on. It's quite interesting.

So for example the worst-case for a dependency stall for a fpu op might be:

stalls:
   fmadd r0,r0,r0
   fmadd r0,r0,r0
   rts
This code adds 4 stalls to the running time. "not much" perhaps, but with dual-issue, the following code executes in the same time:
stalls:
   fmadd r0,r0,r0
   add   r16,r16,r16
   fmadd r1,r1,r1
   add   r16,r16,r16
   fmadd r2,r2,r2
   add   r16,r16,r16
   fmadd r3,r3,r3
   add   r16,r16,r16
   fmadd r12,r12,r12
   add   r16,r16,r16
   fmadd r0,r0,r0
   rts

This dual issue should mean I can hide all the addressing mathematics from the execution time.

This scheduling stuff is handled in the compiler when you don't write assembly - and it typically does a reasonable job of it.

Thursday 5 June 2014

roadblocks

I've hit a roadblock for the moment in parallella hacking. It could just be a bug but I seem to be hitting a design fault when input to the epiphany is loaded - write transactions from the host cpu are getting lost or corrupted.

It happens when I try to communicate with the epu cores by writing directly to their memory like this:

   struct job *jobqueue = @some pointer on-core@

   int qid = ez_port_reserve(1)

   memcpy(jobqueue[qid], job, sizeof(job);

   ez_port_post(1);
And when the core is busy reading from shared memory:
struct job jobqueue[2];
ez_port_t port;

main() {
    while (1) {
        int qid = ez_port_await(1);

        ez_dma_memcpy(buffer, jobqueue[qid].input, jobqueue[qid].N);

        ez_port_complete(1);
    }
}
It's ok with one core but once I have multiple cores running concurrently it starts to fail - at 16 it fails fairly often. It often works once too but putting the requests in a loop causes failures.

If I change the memcpy of the job details to the core into a series of writes with long delays in between them (more than 1ms) it reduces the problem to a rarity but doesn't totally fix it.

I spent a few hours trying everything I could think of but that was the closest to success and still not good enough.

I'll have to wait to see if there is a newer fpga image which is compatible with my rev0 board, or perhaps it's time to finally start poking with the fpga tools myself, if they run on my box anyway.

fft

I was playing with an fft routine. Although you can just fit a 128x128 2D fft all on-core I was focusing on a length of up to 1024.

I did an enhancement to the loader so I can now specify the stack start location which lets me get access to 3 full banks of memory for data - which is just enough for 3x 1024 element complex float buffers to store the sin/cos tables and two work buffers for double-buffering. This lets me put the code, control data, and stack into the first 8k and leave the remaining 24k free for signal.

I also had to try to make an in-place fft routine - my previous one needed a single out-of-place pass to perform the bit reversal following which all operations are streamed in memory order (in two streams). I managed to include the in-place bit reversal with the first radix-2 kernel so execution time is nearly as good as the out of place version. However although it works on the arm it isn't working on the epiphany but I haven't worked out why yet. At worst I can use the out-of-place one and just shift the asynchronous dma of the next row of data to after the first pass of calculation.

Although i'm starting with radix-2 kernels i'm hoping I have enough code memory to include up to radix-8 kernels which can make a good deal of difference to performance. I might have to resort to assembly too, whilst the compiler is doing a pretty good job of the arithmetic all memory loads are just word loads rather than double-word. I think I can do a better job of the addressing and loop arithmetic, and probably even try out the hardware loop feature.

Tuesday 3 June 2014

parallella ezesdk 0.2

Just uploaded version 0.2 of the ezesdk stuff to the ezesdk homepage.

This includes everything i've been working on of late including the mandelbrot demo.

atomic counters

Yesterday I tried a mutex based implementation of an atomic counter to see how it compares.

My first test was to read the atomic counter 2^20 (1024x1024) from each core in a tight loop. Times are wall-clock on the host.

  RxC   Interrupt    Mutex
  ----  ----------   ------
  1x1   0.193488s    0.114833s
  1x2   0.317739s    0.122575s
  2x1   0.317739s    0.121737s
  4x1   0.393244s    0.298871s
  1x4   0.393244s    0.361574s
  4x2   0.542462s    1.122173s
  2x4   0.543283s    0.903163s
  4x4   0.849627s    3.493985s

Interesting to note that orientation of a single line of cores makes a difference. May have something to do with using core 0,0 as the location of the mutex. Also of interest is that the 4x4 case is accessing the atomic counter 16x as many times as the 1x1 case - here the low-bandwidth requirements of the interrupt implementation is scaling better than linear compared to the mutex implementation - because the requesting core is effectively batching up multiple requests if they come too fast, rather than having to serialise everything remotely.

But as can be seen - once more than 4x cores are in play the interrupt driven routine starts to win out comfortably. This is despite effectively blocking the host core while the others are running.

But this isn't a very useful test because no practical software simply increments a remote counter. So for something more realistic I added a delay to the loop using one of the ctimers.

So looking at the most congested case of all cores busy:

4x4

 Delay     Interrupt    Mutex
 ------    ----------   ------
     10    0.965649s    3.138225s
    100    1.083733s    3.919083s
    200    1.630165s    3.693539s
    300    1.780689s    3.792168s
    400    2.297966s    3.666745s
    500    2.448892s    3.563474s
   1000    3.840059s    1.851269s
   2000    4.923238s    3.402963s

So the cross-over point is around 1000 cycles worth of work, at least on a 16 core machine. 1000 cycles isn't much.

Given this data, it's probably better to go with a mutex implementation after-all. It requires about 1/3 of the code and doesn't load the servicing core nearly as much. Oh well, worth a try. (hang on, this doesn't let the host participate ... argh).

I have no direct data on the mesh load or the fairness of the count distribution. Intuitively the mutex based implementation wont be as fair due to the way the routing works, but once you have enough work to do the network shouldn't be particularly busy.

I'm hoping a future hardware revision may include an option on the ctimer to count reads of the ctimer register - which would effectively be a 2xatomic counters per core in hardware. In the meantime there's always that FPGA (maybe that's something I will look at, just no experience there).

Sunday 1 June 2014

note to self ...

Remember you need to save the status register before you execute an add.

I went about coding up that atomic counter idea from the previous post but had a hard time getting it to work. After giving up for a bit I had another go (couldn't let it rest as it was so close) and I discovered the register-saving preamble i'd copied from the async dma isr was simply broken.

_some_isr:
        sub     sp,sp,#24
        strd    r0,[sp]
        strd    r2,[sp,#1]
        strd    r4,[sp,#2]

        movfs   r5,status

... routine

        movts   status,r5

        ldrd    r4,[sp,#2]
        ldrd    r2,[sp,#1]
        ldrd    r0,[sp],#3
        rti

Well it certainly looked the part, but it clobbers the status register in the first instruction. Oops.

        strd    r4,[sp,#-1]
        movfs   r5,status
        sub     sp,sp,#24
        strd    r0,[sp]
        strd    r2,[sp,#1]

There goes the symmetry. Actually it could probably just leave the stack pointer where it is because the routine isn't allowing other interrupts to run at the same time.

Fixed, the atomic counter worked ok. It shows some ... interesting ... behaviour when things get busy. It's quite good at being fair to all the cores except the core on which the atomic counter resides. Actually it gets so busy that the core has no time left to run any other code. The interrupt routine is around 200 cycles for 16 possible clients.

So ... I might have to try a mutex implementation and do some profiling. Under a lot of contention it may still be better due to the fixed bandwidth requirements but it's probably worth finding out where the crossover point is and how each scales. A mutex implementation wont support host access either ... always trade-offs. Definitely a candidate for FPGA I guess.

ipc primitive ideas

Had a couple of ideas for parallella IPC mechanisms whilst in the shower this morning. I guess i'm not out of ideas yet then.

atomic counter

These are really useful for writing non-blocking algorithms ... and would best be implemented in hardware. But since they aren't (yet - it could go into fpga too) here's one idea for a possible solution.

The idea is to implement it as an interrupt routine on each core - each core provides a single atomic counter. This avoids the problem of having special code on some cores and could be integrated into the runtime since it would need so little code.

Each core maintains a list of atomic counter requests and the counter itself.

  unsigned int *atomic_counter_requests[MAX_CORES];
  unsigned int atomic_counter;

To request an atomic number:

  unsigned int local_counter;

  local_counter = ~0;
  remote atomic_counter = &local_counter;
  remote ILATST = set interrupt bit;
  while local_counter = ~0
    // nop
  wend

The ISR is straightforward:

atomic_isr:
  counter = atomic_counter;
  for i [get_group_size()]
    if (atomic_counter_requests[i])
       *atomic_counter_requests[i] = counter++;
       atomic_counter_requests[i] = NULL;
    fi
  done
  atomic_counter = counter;

Why not just use a mutex? Under high contention - which is possible with many cores - they cause a lot of blockage and flood the mesh with useless traffic of retries. Looking at the mesh traffic required:

do {
  -> mutex write request 
  <- mutex read request return
} while (!locked)
  -> counter read request 
  <- counter read request return
  -> counter write request 
  -> mutex write request 
Even in the best-case scenario of no contention there is two round trips and then two writes which don't block the callee.

In comparison the mesh traffic for the interrupt based routine is:

  -> counter location write request 
  -> ILATST write request 
  <- counter result write request

So even if the interrupt takes a while to scan the list of requests the mesh traffic is much lower and more importantly - fixed and bounded. There is also only a single data round-trip to offset the loop iteration time.

By having just one counter on each core the overhead is minimal on a given core but it still allows for lots of counters and for load balancing ideas (e.g. on a 4x4 grid, use one of the central cores if they all need to use it). Atomic counters let you implement a myriad of (basically) non-blocking multi-core primitives such as multi reader/multi-writer queues.

This is something i'll definitely look at adding to the ezecore runtime. Actually if it's embedded into the runtime it can size the request array to suit the workgroup or hardware size and it can include one or more slots for external requests (e.g. allow the host to participate).

Broadcast port

Another idea was for a broadcast 'port'. The current port implementation only works with a single reader / single writer at a time. Whilst this can be used to implement all sorts of variations by polling one idea might be to combine them into another structure to allow broadcasts - i.e. single producer and multiple readers which read the same entries.

If the assumption is that every receiver is at the same memory location then the port only needs to track the offset into the LDS for the port. The sender then has to write the head updates to every receiver and the receivers have to have their own tail slots. The sender still has to poll multiple results but they could for example be stored as bytes for greater efficiency.

This idea might need some more cooking - is it really useful enough or significantly more efficient enough compared to just having a port-pair for each receiver?