Sunday 22 May 2011

branchless stackless merge sort

After a previous post I had some further thoughts about implementing a branchless merge, and about removing any stack or recursion as well, at least in certain cases.

This is more of a mental exercise in this case because I only looked at sorting an array - where you're better off just using qsort - but it might be worth it on a linked list.

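The examples assume the usual headers and a qsort-style comparator type, something like:

#include <string.h>  /* memcpy */
#include <stdlib.h>  /* malloc, free */

/* qsort-style comparator: returns <0, 0 or >0 */
typedef int (*CmpFunc)(const void *a, const void *b);
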
First merge version:
static inline void merge1(void * restrict head1,
                          int len1,
                          void *tmpData,
                          CmpFunc cmp,
                          size_t itemsize)
{
    // both runs are len1 items long; the tail pointers point at the
    // last item of each run
    void * restrict tail1 = head1 + (len1-1)*itemsize;
    void * restrict head2 = head1 + len1*itemsize;
    void * restrict tail2 = head2 + (len1-1)*itemsize;

    void * restrict h1 = tmpData;
    void * restrict t1 = tmpData;
    void * restrict tail = head1;

    // copy first run to temporary buffer
    t1 = tmpData + (tail1-head1);
    memcpy(tmpData, head1, tail1-head1+itemsize);

    // merge temporary buffer and second run into target buffer
    while (1) {
        if (cmp(h1, head2) < 0) {
            memcpy(tail, h1, itemsize);
            tail += itemsize;
            if (h1 == t1)
                // end of first list - rest is already in place
                goto done;
            h1 += itemsize;
        } else {
            memcpy(tail, head2, itemsize);
            tail += itemsize;
            if (head2 == tail2) {
                // end of second list - copy rest of first list back
                memcpy(tail, h1, t1-h1+itemsize);
                goto done;
            }
            head2 += itemsize;
        }
    }
done:
    return;
}
Keeping the loop simple is the best policy in this case.

The branchless merge:
static inline void merge2(void * restrict head1,
                          int len1,
                          void * restrict tmp,
                          CmpFunc cmp,
                          size_t itemsize,
                          void *sentinel)
{
    void *h1 = tmp;
    void *h2 = head1+len1*itemsize;
    void *out = head1;
    void *end = head1+len1*2*itemsize;  // one past the block being merged
    int len = len1*2;
    int i;

    // copy first half to temporary buffer & append the sentinel
    memcpy(tmp, head1, len1*itemsize);
    memcpy(tmp+len1*itemsize, sentinel, itemsize);

    // save the item just past the second half and put the sentinel there instead
    memcpy(tmp+(len1+1)*itemsize, end, itemsize);
    memcpy(end, sentinel, itemsize);

    // merge: exactly len iterations, no end-of-run tests needed
    for (i=0;i<len;i++) {
        int c;

        c = cmp(h1, h2) < 0;
        memcpy(out, c ? h1 : h2, itemsize);
        h1 = c ? h1 + itemsize : h1;
        h2 = c ? h2 : h2 + itemsize;
        out += itemsize;
    }

    // restore the saved item
    memcpy(end, tmp+(len1+1)*itemsize, itemsize);
}

It uses a sentinel, so there is no loop-termination test at all beyond counting off the total number of elements being merged.
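
For example, calling it on ints might look like this - the sentinel has to compare strictly greater than every element present, and there has to be one addressable slot just past the block being merged, since merge2 temporarily parks the sentinel there:

#include <limits.h>

/* hypothetical qsort-style comparator for ints */
static int cmp_int(const void *a, const void *b)
{
    const int *x = a, *y = b;
    return (*x > *y) - (*x < *y);
}

void merge2_example(void)
{
    int data[9] = { 1, 4, 6, 7,  2, 3, 5, 8,  0 };  /* two sorted runs of 4 + 1 spare slot */
    int tmp[6];          /* len1 items + sentinel + the saved spare */
    int big = INT_MAX;   /* sentinel: assumes no real element equals INT_MAX */

    merge2(data, 4, tmp, cmp_int, sizeof(int), &big);
    /* data[0..7] is now 1..8, and data[8] has been restored to 0 */
}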

And finally, the stack or recursion used to work out when merges need to take place can be removed. One simple way (breadth-first) is just to iterate over the whole array at run size 2, then 4, 8, and so on, but this has poor locality of reference. So the simple bit of logic below is used to track when, and how large, merges need to take place, and does them in a depth-first fashion.

void MergeSort(void *data, size_t count, size_t itemsize, CmpFunc cmp, void *sentinel) {
    int i, j;
    int last = 0;
    void *tmp;

    tmp = malloc((count+1) * itemsize);

    for (i=0;i<count;) {
        // sort the next 8 elements in place
        batchers8(data+i*itemsize, tmp, cmp, itemsize);
        i += 8;
        // i counts elements sorted so far; j&(i^last) fires exactly when i
        // has just reached a multiple of j, i.e. when the two adjacent runs
        // of j/2 ending at i are ready to be merged
        for (j=16;j<=count;j<<=1) {
            if (j&(i^last)) {
                merge2(data+(i-j)*itemsize, j/2, tmp, cmp, itemsize, sentinel);
            }
        }
        last = i;
    }

    free(tmp);
}
This code also performs the bottom-level sort in 8-element lots using an unrolled Batcher's sort. So one might note that this particular sort is limited to sorting arrays whose number of elements is a power of 2 (of at least 8) ... It also needs one writable slot just past the end of the data array, since merge2 temporarily parks its sentinel there.
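
batchers8 isn't listed here, but a stand-in doing the same job - 8 items sorted in place with Batcher's 19 compare-exchange odd-even merge network, using tmp as swap space to match the call above - might look something like:

/* hypothetical stand-in for the unrolled Batcher's sort used above */
static void batchers8(void *d, void *tmp, CmpFunc cmp, size_t itemsize)
{
    /* compare-exchange: put the smaller of items i and j first */
#define CSWAP(i, j)                                             \
    do {                                                        \
        void *a = d + (i)*itemsize, *b = d + (j)*itemsize;      \
        if (cmp(a, b) > 0) {                                    \
            memcpy(tmp, a, itemsize);                           \
            memcpy(a, b, itemsize);                             \
            memcpy(b, tmp, itemsize);                           \
        }                                                       \
    } while (0)

    CSWAP(0,1); CSWAP(2,3); CSWAP(4,5); CSWAP(6,7);
    CSWAP(0,2); CSWAP(1,3); CSWAP(4,6); CSWAP(5,7);
    CSWAP(1,2); CSWAP(5,6);
    CSWAP(0,4); CSWAP(1,5); CSWAP(2,6); CSWAP(3,7);
    CSWAP(2,4); CSWAP(3,5);
    CSWAP(1,2); CSWAP(3,4); CSWAP(5,6);
#undef CSWAP
}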

I ran a few timings, and in certain cases this is slightly faster than the natural merge sort I had previously written - which of course is still much faster for the already-sorted case. The version I timed also had the merge step manually unrolled 8 times.

For random input, the branchless merge is about 10% faster than the original merge, and the original merge is about 10% faster than my original natural merge sort. So the lack of conditional branches ends up being a win, even considering the extra overheads and all the extra memory moves. This is on an Intel CPU anyway - which is so register-starved they've had to make such code run fast - so one might not end up with the same result on other architectures.

There is more possible here too if you're sorting basic types (i.e. not through a general interface as above) - all of these algorithms can be vectorised to run on SIMD processors. One could, for example, independently sort 4 interleaved arrays in the same number of operations, and then merge those in 2-3 passes at the end (or in 1, with a 4-way merge).
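
For instance, with 32-bit ints and SSE4.1 the compare-exchange at the heart of a sorting network becomes a min/max pair applied across 4 lanes at once - just a sketch of the idea:

#include <smmintrin.h>  /* SSE4.1: _mm_min_epi32 / _mm_max_epi32 */

/* one compare-exchange step applied to 4 independent int lanes at once;
 * driving a sorting network with this sorts 4 interleaved arrays in the
 * same number of steps as one scalar sort */
static inline void cswap4(__m128i *a, __m128i *b)
{
    __m128i lo = _mm_min_epi32(*a, *b);
    __m128i hi = _mm_max_epi32(*a, *b);
    *a = lo;
    *b = hi;
}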

Sentinels are a pretty nice optimisation trick, but they usually require some sort of extra housekeeping that makes their use impractical. For example, in the camel-mime-parser code I put a '\n' sentinel at the end of the i/o buffer every time I refreshed it, so the lowest-level loop which identified individual lines was only 3 instructions long - it only checked for running out of input after the loop had terminated. But I had to write my own i/o buffering to have that much control over the memory.
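
Roughly, the inner loop amounts to something like this (a sketch, not the actual camel-mime-parser code):

/* the refill code guarantees the buffer always ends with a '\n' sentinel,
 * so the line scan needs no bounds check at all */
char *p = start;
while (*p != '\n')
    p++;
/* only now check whether this is a real end-of-line or the sentinel,
 * i.e. whether the buffer needs refilling */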
