Thursday 29 March 2012

Web N Shit

So, as inevitably happens, the client wants to look at a mobile interface to some of the applications we have.

Which for various reasons at the moment, means: web/html5/all that jazz. Fortunately someone else is working on the browser stuff so I get to avoid javascript ...

I get to play on the backend, so that means Java EE 6 ... Boy it's been a long time since I used any of this. Well I had a play with a wiki/documentation system a couple of years ago for fun but the last time I was paid to work on it was around 12 years ago: you know back when CORBA was mentioned on the first page ...

Java EE 6 has a fairly steep learning curve - and a lot of reading required - but so far I'm pretty impressed. As you might imagine, with a decade of maturity behind it, there is a good bit of polish in the design. Some of the errors you get from the implementations could be a bit more meaningful though - an array index out of bounds exception because one didn't specify the proper attributes on a class or method doesn't help much. So after a couple of days of swearing violently ... it starts to fall into place really well.

I knew this was coming, so for the last month or so I'd been re-architecting the desktop application in a more tiered and re-usable fashion, so much of the jax-rs interface just slotted straight over the top with only a couple of lines of code for each request type. But that's the easy stuff: I also need long-running async tasks, so that's another big chunk of api to get to grips with.

Sunday 25 March 2012

And the winner is ...

Well I kept poking somewhat uninterestedly at some JNI code generator ideas. Just feeling pretty blah over-all, maybe it's the weather ... hay-fever, sleep apnoea, insomnia, all a big PITA over the last few days.

I started looking into m4, and it could probably create a fairly concise definition and generator: but boy, the escaping rules and special cases are such a prick to learn it just isn't worth the effort. For all the intricacies of learning m4 there's very little pay-off - it isn't a very useful tool in general, and as I won't be using it all the time I'd just have to learn it all again whenever I looked at it. No thanks.

The cpp stuff is still an idea, but the definitions are clumsier than they need to be; nor can one automate the parameter-passing convention inference that is often possible.

But the exercise did prove useful for the program design. I took the way it worked and ported it to perl instead (now perl is a tool which is easy to learn with a massive pay-off; definitely not the one to use for every problem though). Macro processors do more work than they appear to on the surface, but after a bit of mucking about (and some fighting with weird perl inconsistencies with references and so on) I had something going. Well I suppose it's an improvement; it halved the amount of code required to do the same thing, so that's gotta be better than a poke in the eye.

Because the jjmpeg binding generator is the oldest it's also the crustiest - so when I care to look into it I'll probably replace it: it isn't a high priority since what is there works just fine.

Friday 23 March 2012

f'ing the puppy

Hmm, I haven't really been keeping up much with nvidia lately - they've obviously never really cared about OpenCL, and what little effort they did for show seemed to disappear early 2011. Oh and gmail suddenly decided their developer updates were spam ... (they haven't mentioned opencl in any update for over a year).

Still, they make hardware that is capable of computing jobs, and they still seem to want to push their own platform for the job, so one must keep abreast - and if nothing else healthy competition is good for all of us.

I'd seen the odd fanboi post about how 'kepler' was going to wipe the floor with all the competition, and after all this time what has been released looks pretty disappointing. It's nice to see that they've tackled the power and heat problems, but this part is all about games; `compute' and OpenCL are left in the dust, behind even their older parts. And even then the games performance, whilst generally better, isn't always so. It looks like they skimped on the memory width in order to reduce the costs so they could be price competitive with AMD - so presumably a wider-bus version will be out eventually, which is sorely needed to re-balance the system design.

I guess they've decided general purpose compute is still a bit before its time, not ready as a mainstream technology and therefore not worth the investment. Well it's a gamble. Probably just as much a gamble as AMD putting all their eggs into that basket: although AMD have more to win and lose, as for them it is about CPU/system performance rather than simply graphics. Still, with multi-core cpus hitting all sorts of performance ceilings, neither AMD nor intel really has any choice, and time will tell who has the better strategy. AMD certainly have a totally awesome vision; whether they can pull it off is another matter. Nvidia's vision only seems to be about selling game cards - nothing wrong with that as money is what any company is about, but it's hardly inspiring.

I'm not sure how big the market is for high-end graphics-only cards. When reading fanboi 'enthusiast' sites one has to keep in perspective that they are still a niche product, and most graphics cards in use are embedded or at the lower end of the price range. Comments from kids saying they're going to throw away their recently bought top-of-the-range amd/nvidia card now that nvidia/amd have a new one out are just ludicrous and simply not believable (and if they really do such things, their behaviour isn't rational and isn't worth consideration).

Thursday 22 March 2012

Macro binding

So, late into the night I was experimenting with an alternative approach to a binding generator: using a macro processor.

Idea being: one relatively simple and expressive definition of an interface using macros, then by changing the macro definitions one can generate different outputs.

So for example, for field accessors I ended up with something like this:
class_start(AVCodecContext)
build_gs_int(AVCodecContext, bit_rate, BitRate)
...
class_end(AVCodecContext)

With the right definitions of the macros I then generate C, Java, or wrapper classes by passing this all through the C pre-processor. It's not the most advanced macro processor but it is sufficient for this task.
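For illustration, the definitions for the JNI C output might look something like this - a rough sketch, not the actual macros; the package name and the cached long 'handle' field id (e.g. AVCodecContext_p) are assumptions:

/* assumed sketch: package name and handle-field lookup are illustrative */
#define class_start(type)
#define class_end(type)

#define build_gs_int(type, cname, jname) \
JNIEXPORT jint JNICALL Java_au_notzed_jjmpeg_##type##_get##jname \
    (JNIEnv *env, jobject jo) { \
        type *p = (type *)(intptr_t)(*env)->GetLongField(env, jo, type##_p); \
        return p->cname; \
} \
JNIEXPORT void JNICALL Java_au_notzed_jjmpeg_##type##_set##jname \
    (JNIEnv *env, jobject jo, jint v) { \
        type *p = (type *)(intptr_t)(*env)->GetLongField(env, jo, type##_p); \
        p->cname = v; \
}

Re-defining the same macros to emit 'native' method declarations instead gives the Java side from the same definition file.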

Then I thought I hit a snag - won't this be totally unwieldy for defining the function calls themselves?

So I looked into an alternative approach: using Java with annotations to define the calling interface instead, and then using reflection to generate the C bindings from that. The problem with this is it just takes a lot of code. Types to define the type conventions, virtual methods for the generators, so on and so forth. Actually I spent several times the amount of time I'd spent on the field generator before I had something working (I'm not terribly familiar with the reflection api required).

The definitions are relatively succinct, but only for function calls: for field accessors one ends up writing each get/set function as well. Still not the end of the world and it might be a reasonable approach for some cases.

e.g.
@JGS(cname = "frame_size")
static native int getFrameSize(AVCodecContext p);
@JGS(cname = "frame_size")
static native void setFrameSize(AVCodecContext p, int v);
@JNI(cname = "avcodec_open")
static native int open(AVCodecContext p, AVCodec c);
The annotations are relatively obvious.

Then I had another thought this morning: dynamic cpp macro invocation. I'd forgotten you can create a macro call based on macro arguments. All I would then need to define is a new macro for each possible number of arguments, and other macros could handle the various conventions used to marshal each particular argument. And even then it would mostly be simple cut-and-paste code once a single-argument macro was defined.

So after half an hour of hacking and 50 lines of code (the java and perl versions were both about 400 lines to do this, with perl being much more dense), I came up with some macros that generate the JNI C code once something like this is defined:
call2(AVCodecContext, int, int, avcodec_open, open,
object, AVCodecContext, ctx, object, AVCodec, c)

The arguments are: java/c type, return 'convention', return type, c function call name, java method name, argument 1 convention, type, name, argument 2 convention, type, name. If I removed the argument names - which could be defined in the macro - it would be even simpler.
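To flesh that out a bit, the per-convention macros and a call2 built on them might look something like this - again an assumed sketch (package name, field ids and the exact conventions are illustrative), but it shows the dynamic dispatch: the convention argument is token-pasted into a macro name, so adding a new convention only means defining decl_/pass_ macros for it:

/* declare the JNI parameter for a given convention */
#define decl_object(type, name) jobject j##name
#define decl_int(type, name)    jint name

/* convert the JNI parameter into the C argument (handle field ids assumed cached) */
#define pass_object(type, name) \
    ((type *)(intptr_t)(j##name ? (*env)->GetLongField(env, j##name, type##_p) : 0))
#define pass_int(type, name)    (name)

/* wrap the C return value for a given return convention */
#define ret_int(expr)           return (jint)(expr)

#define call2(klass, retconv, rettype, cfunc, jname, \
              conv1, type1, name1, conv2, type2, name2) \
JNIEXPORT rettype JNICALL Java_au_notzed_jjmpeg_##klass##_##jname \
    (JNIEnv *env, jclass jcls, \
     decl_##conv1(type1, name1), decl_##conv2(type2, name2)) { \
        ret_##retconv(cfunc(pass_##conv1(type1, name1), \
                            pass_##conv2(type2, name2))); \
}

call0 and call1 are then just cut-and-paste variations with fewer argument triples.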

I 'just' need to define a set of macros for each convention type - then I realised I'd just written some dynamic object-oriented C pre-processor ... hah.

For a binding like FFmpeg where there are few conventions and it benefits from grouping and 'consistifying' names, i.e. a lot of manual work - this may be a practical approach. And even when there are consistent conventions, it may be a practical intermediate step: the first generator then has less to worry about and all it has to do is either infer or be told the specific conventions of a given argument.

Of course, using a more advanced macro processor may help here, although I'm not as familiar with them as with cpp (and more advanced means harder to learn). So that got me thinking of a programming exercise: try to solve this problem with different mechanisms and just see which turns out the easiest. XSLT anyone? Hmm, maybe not.

I'd started to realise the way I'd done it in jjmpeg's generator wasn't very good: I'm trying to infer the passing convention and coding it into the generator using special cases, rather than working it out at the start and parametrising the generator using code-fragment templates. The stuff I was playing with for the openal binding was more along these lines but still has some baggage.

Wednesday 21 March 2012

openal weirdness

Hmm, so after my previous post I later noticed a strange detail on the alcGetProcAddress and alGetProcAddress functions (BTW the specification, unlike the opencl one, isn't very concise or easy to read at all): they may return per-device or per-context addresses(!?).

So rather than the routing to the correct driver that occurs transparently with OpenCL, the application must do it manually by looking up function pointers on the current context or on a given device ... although it isn't clear exactly when which is needed, as some extension docs mention it and others do not (although some function args are suggestive of which way the functions go). Further complicating matters is that the context may be set per-thread if the ALC_EXT_thread_local_context extension is available. A search of the openal archives nets a conclusion that it's a bit confusing for everyone.
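For reference, this is roughly what the manual routing looks like from C (a sketch; the extension and function names here are just examples):

#include <AL/al.h>
#include <AL/alc.h>

void lookup_extensions(ALCdevice *dev) {
    /* ALC (device) extension: resolved against the device */
    if (alcIsExtensionPresent(dev, "ALC_EXT_thread_local_context")) {
        void *set_thread_context = alcGetProcAddress(dev, "alcSetThreadContext");
        /* ... cast to the matching function pointer type before use ... */
        (void)set_thread_context;
    }
    /* AL (context) extension: resolved against the current context */
    if (alIsExtensionPresent("AL_EXT_STATIC_BUFFER")) {
        void *buffer_data_static = alGetProcAddress("alBufferDataStatic");
        (void)buffer_data_static;
    }
}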

Obviously java can't be looking up function pointers explicitly, so what is a neat solution?
  1. When using openal-soft (i.e. any free operating system), just ignore it: that's all it does. i.e. just use the same static symbol resolution mechanism as for the core functions.
  2. Look up the function every time - openal-soft uses a linear scan to resolve function names, so this probably isn't a good solution.
  3. I thought of trying to use a function table which can be reset when the context changes and so on, but the thread_local_context extension makes this messy, and it's just messy anyway.
  4. Move all extension functions to a per-extension wrapper class that loads the right pointers on instantiation. Have factory methods only available on a valid context or device depending on the extension type and associate the wrapper directly with that.
4 is probably the cleanest solution ...

Tuesday 20 March 2012

More JNI'ing about

Well it looks like I brought my 'hacking day' forward a few days, and I ended up poking around with JNI most of the day ... Just one of those annoying things that once it was in my head I couldn't get it out (and the stuff I have to do for work isn't inspiring me at the moment - gui re-arrangements/database design and so on).

I took the stuff I discovered yesterday, and tweaked the openal binding I hacked up a week ago. I converted all array-plus-length arguments into java array types, and all factory methods create the object directly.

Then I ran a very crappy timing test ... most of the time in this is spent in alcOpenDevice(), but audio will have a fairly low call requirement anyway. I could spend forever trying to get a properly application-representative benchmark, but this will show something.

First, the C version.
#include <stddef.h>
#include <AL/al.h>
#include <AL/alc.h>

#define buflen 5
static ALuint buffers[buflen];

void testC() {
    ALCdevice *dev;
    ALCcontext *ctx;
    int i;

    dev = alcOpenDevice(NULL);
    ctx = alcCreateContext(dev, NULL);

    alcMakeContextCurrent(ctx);

    alGenSources(buflen, buffers);
    for (i = 0; i < buflen; i++) {
        alSourcei(buffers[i], AL_SOURCE_RELATIVE, 1);
    }
    for (i = 0; i < buflen; i++) {
        alSourcei(buffers[i], AL_SOURCE_RELATIVE, 0);
    }
    alDeleteSources(buflen, buffers);

    alcMakeContextCurrent(NULL);
    alcDestroyContext(ctx);
    alcCloseDevice(dev);
}
Not much to say about it.

Then the JOAL version.
AL al;
ALC alc;
IntBuffer buffers;
int bufcount = 5;

void testJOAL() {
    ALCdevice dev;
    ALCcontext ctx;

    dev = alc.alcOpenDevice(null);
    ctx = alc.alcCreateContext(dev, null);
    alc.alcMakeContextCurrent(ctx);

    al.alGenSources(bufcount, buffers);
    for (int i = 0; i < bufcount; i++) {
        al.alSourcei(buffers.get(i), AL.AL_SOURCE_RELATIVE, AL.AL_TRUE);
    }
    for (int i = 0; i < bufcount; i++) {
        al.alSourcei(buffers.get(i), AL.AL_SOURCE_RELATIVE, AL.AL_FALSE);
    }
    al.alDeleteSources(bufcount, buffers);

    alc.alcMakeContextCurrent(null);
    alc.alcDestroyContext(ctx);
    alc.alcCloseDevice(dev);
}
...
buffers = com.jogamp.common.nio.Buffers.newDirectIntBuffer(bufcount);
al = ALFactory.getAL();
alc = ALFactory.getALC();
...
I timed the initial factory calls which load the library: this isn't timed in the C version. Note that because it's passing around handles you need to go through an interface, and not directly through the native methods.

And then this new version, let's call it 'sal'.
import static au.notzed.al.ALC.*;
import static au.notzed.al.AL.*;

int bufcount;
int[] buffers;

void testSAL() {
    ALCdevice dev;
    ALCcontext ctx;

    dev = alcOpenDevice(null);

    ctx = alcCreateContext(dev, null);
    alcMakeContextCurrent(ctx);

    alGenSources(buffers);
    for (int i = 0; i < buffers.length; i++) {
        alSourcei(buffers[i], AL_SOURCE_RELATIVE, AL_TRUE);
    }
    for (int i = 0; i < buffers.length; i++) {
        alSourcei(buffers[i], AL_SOURCE_RELATIVE, AL_FALSE);
    }
    alDeleteSources(buffers);

    alcMakeContextCurrent(null);
    alcDestroyContext(ctx);
    alcCloseDevice(dev);
}
...
buffers = new int[bufcount];
...
Again the library load and symbol resolution is included - in this case it happens implicitly. Notice that when using the static import it's almost identical to the C version. Only here the array length isn't required as it's determined by the array itself.

Also this implementation only needs a small number of very trivial classes to be hand-coded, and everything else is done in the C code; although I also looked into wrapping the whole lot (buffers and sources included) in a high-level api as well. The openal headers are used completely untouched, although I have some mucky scripts which call gcc/cproto/grep to extract the information I need.

Apart from the code itself, I tried two array binding approaches: one which uses GetIntArrayRegion(), and the other which uses Get/ReleasePrimitiveArrayCritical(). Note that for the read-only case the binding JNI code only needs to read the array and it doesn't copy it back afterwards.
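As a rough illustration (made-up function names; not the generated code), the two variants for a read-only argument such as the source list of alDeleteSources() look like this:

#include <jni.h>
#include <alloca.h>
#include <AL/al.h>

/* 'region' version: copy the array contents onto the stack */
JNIEXPORT void JNICALL Java_Test_deleteSourcesRegion(JNIEnv *env, jclass jc, jintArray jsources) {
    jsize n = (*env)->GetArrayLength(env, jsources);
    ALuint *sources = alloca(n * sizeof(ALuint));

    (*env)->GetIntArrayRegion(env, jsources, 0, n, (jint *)sources);
    alDeleteSources(n, sources);
    /* read-only: no SetIntArrayRegion copy-back needed */
}

/* 'critical' version: access the elements in place (or a pinned copy) */
JNIEXPORT void JNICALL Java_Test_deleteSourcesCritical(JNIEnv *env, jclass jc, jintArray jsources) {
    jsize n = (*env)->GetArrayLength(env, jsources);
    jint *sources = (*env)->GetPrimitiveArrayCritical(env, jsources, NULL);

    /* no other JNI calls or blocking allowed until the release */
    alDeleteSources(n, (const ALuint *)sources);
    (*env)->ReleasePrimitiveArrayCritical(env, jsources, sources, JNI_ABORT); /* discard, don't copy back */
}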

Timings

The results: the above routine is run 1000 times for each run, and the times are in seconds. The runs are a loop within the process, so only the first time has any library load overheads. I used the Oracle JDK 1.7.

run     c       joal    range   critical
0       3.3     3.678   3.518   3.554
1       3.267   3.622   3.446   3.405
2       3.297   3.513   3.493   3.552
3       3.264   3.482   3.448   3.494
4       3.243   3.575   3.553   3.542
5       3.297   3.472   3.395   3.352
6       3.308   3.527   3.376   3.359
7       3.284   3.52    3.354   3.363
8       3.253   3.419   3.363   3.349
9       3.266   3.42    3.429   3.413

ave     3.2779  3.5228  3.4375  3.4383
min     3.243   3.419   3.354   3.349
max     3.308   3.678   3.553   3.554

As you'd expect, C wins out - it just calls all the functions directly, and even the temporary storage is allocated in the BSS.

Both of the 'sal' versions are much the same, and joal isn't too far behind either - but it is behind.

Ok, so it's not a very good benchmark. I'm not going to re-write all the above, but when I changed it to 100 iterations and repeated the inner 2 loops (between gen/deleteSources) 100 times as well: c was about 0.9s, joal averaged about 4s, and sal averaged 2s. But that case probably goes too far in over-representing the method call overheads relative to what you might expect for an application at run-time - JNI has overheads, but as long as you're not implementing java.lang.Math with it, it's barely going to be measurable once you add in i/o overheads, mmu costs, system scheduling and even cache misses.

At any rate, it validates the approach taken against another long-standing implementation (if not a particularly heavily developed one). Assuming, that is, that I don't have glaring errors in the code and it actually is doing all the work I ask of it.

SAL binding

Note also that the 'sal' binding hasn't skimped on safety where possible just to try to get a more favourable result (well, the al*v() methods have an implied data length which I am not checking ...), e.g. the symbols are looked up at run-time and suitable exceptions thrown on error.

An example bound method:
// manually written glue code
int checkfunc(JNIEnv *env, void *ptr, const char *name) {
// returns true if *ptr != null
// opens library if not opened
// sets exception and returns false if it can't open
// looks up method and field id's if not already done
// sets *ptr to dlsym() lookup
// sets exception and returns false if it can't find it
// returns true
}
// auto-generated binding
jobject Java_au_notzed_al_ALCAbstract_alcCreateContext(
JNIEnv *env, jclass jc, jobject jdevice, jintArray jattrlist) {
static LPALCCREATECONTEXT dalcCreateContext;
if (!dalcCreateContext
&& !checkfunc(env, (void **)&dalcCreateContext, "alcCreateContext")) {
return (jobject)0;
}
ALCdevice * device = (jdevice ?
(void *)(*env)->GetLongField(env, jdevice, ALCdevice_p) : NULL);
jsize attrlist_len = jattrlist ?
(*env)->GetArrayLength(env, jattrlist) : 0;
ALCint * attrlist = jattrlist ?
alloca(attrlist_len * sizeof(attrlist[0])) : NULL;
if (jattrlist)
(*env)->GetIntArrayRegion(env, jattrlist, 0, attrlist_len, attrlist);

ALCcontext * res;
res = (*dalcCreateContext)(device, attrlist);
jobject jres = res ?
(*env)->NewObject(env, ALCcontext_jc, ALCcontext_init_p, (long)res) : NULL;
return jres;
}
// auto-generated java side
public class ALCAbstract extends ALNative implements ALCConstants {
...
public static native ALCcontext alcCreateContext
(ALCdevice device, int[] attrlist);
...
}
// manually written classes (could obviously be auto-generated,
// but this isn't worth it if i want to add object api here)
public class ALCcontext extends ALObject {

long p;

ALCcontext(long p) {
this.p = p;
}
}
// and another
public class ALCdevice extends ALObject {

long p;

ALCdevice(long p) {
this.p = p;
}
}
So, the 32-bit cpus amongst you will notice that the handle is a long ... but that's only because I haven't bothered worrying about creating a version for 32-bit machines. Actually, because the JNI code is the only code which creates, accesses, or uses 'p' directly, it's actually easier to do this than if I were passing the handle to all of the native methods.

i.e. all I need is a different ALCdevice concrete implementation for each pointer size, and have the C code instantiate each instance itself. Neither the java native method declarations nor any java-side code needs to know the difference. If I wanted a high level ALCdevice object, that could just be abstract and it also needn't know about the type of 'p'.

Other stuff

So one thing I've noticed when doing these binding generators is that every library does things a bit differently.

The earlier versions of FFmpeg were pretty clean, and although the function names were all over the place, most of the calls took simple arguments with obvious wrappings. It's also a huge api ... which one would not want to have to hand-code.

For openal, it requires passing a lot of array+length pairs around, so to implement an array-based interface requires special-case code to remove the length specifiers from the api (of course, it may be that one actually wants these: e.g. to use less than the full array content, or for indexing within an array, but these cases I have intentionally ignored at this point). The api is fairly small too and changes slower than a wet week! It also has a separate symbol resolution function for extensions - which I haven't implemented yet.

I also looked at OpenCL - and the binding for that requires special-case handling for the event arrays for it to work practically from Java. It is also more of an object based api rather than a 'name' (i.e. an int reference id) based one.

(BTW I'm only experimenting with these apis because I've been looking at them recently and they provide examples of reasonably small and well-defined public interfaces. I am DEFINITELY NOT planning on a whole jogamp replacement here: e.g. opencl and openal are simple stand-alone interfaces that can work mostly independently of the rest of the system. OpenGL OTOH has a lot of weird system dependencies which are a pain to work with - xgl, wgl, etc. - before you even start to talk about toolkit integration).

So ... what was my point here? I think it's that trying to create a be-all-and-end-all binding generator is too much work for little pay-off. Each library will need special casing - and if one tries to include every possible api convention (many of which cannot be determined by examining the header files - e.g. reference counting or passing, null terminated arrays, arrays vs pass by reference, etc! etc! etc!) the cost becomes overwhelming.

For an interface like openal - which is fairly small, mostly repetitive, and changes very slowly - all the time spent on getting a code generator to work would probably be better spent on doing it by hand: with a few judiciously designed macros it would probably only be half a day to a day's work once you nutted out the mechanisms. Although once you have a generator working it's easier to experiment with those mechanisms. In either case once it's done it's pretty much done too - openal isn't going to be changing very fast.

Although a generator just seems like 'the right way to do it(tm)' - you know, the interface is already described in a machine-readable form, so why not use it? But a 680-line bit of write-only perl is probably going to be more work to maintain than the 1400 lines of simple, much-repeated and never-changing C that it generates.

Monday 19 March 2012

JNI overheads

Update: Also see the next post which has a slightly more real-world example.

So, I've been poking around with some JNI binding stuff, and this time I was experimenting with different types of interfaces: I was thinking of doing more work on the C side of things so I can just use the native interfaces directly rather than having to go through wrappers all the time.

So I was curious to see how much overhead a single field access would be.

I'll go straight to the results.

This is the type of stuff I'm testing:
class HObject {
    long handle;
    // case 1: calls a static native which takes the handle directly
    public void call1() { callStatic(handle); }
    static native void callStatic(long handle);
    // case 4: calls an instance native which takes the handle
    public void call2() { callLocal(handle); }
    native void callLocal(long handle);
    // case 5: the native method resolves this.handle itself
    native public void call3();
}
class BObject {
    java.nio.ByteBuffer handle;
    // case 7: calls a static native which takes the ByteBuffer handle
    public void call1() { callStatic(handle); }
    static native void callStatic(java.nio.ByteBuffer handle);
}
All each native function does is resolve the 'handle' to the desired pointer, and assign it to a static location. e.g. for a ByteBuffer it must call JNIEnv.GetDirectBufferAddress(), for a long it can just use the parameter directly, and for an object reference it must call back into the JVM to get the field before doing either of those.
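In C each of those is a one- or two-liner (class and method names here are illustrative, and the field id is assumed to have been cached via GetFieldID(env, cls, "handle", "J") at load time):

#include <jni.h>
#include <stdint.h>

static void *target;             /* the 'static location' the pointer is assigned to */
static jfieldID HObject_handle;  /* assumed cached at load time */

/* the handle is passed directly as a long (e.g. cases 2 and 3 below) */
JNIEXPORT void JNICALL Java_Test_callLong(JNIEnv *env, jclass jc, jlong handle) {
    target = (void *)(intptr_t)handle;
}

/* an object is passed and the native code fetches the field itself (e.g. case 0) */
JNIEXPORT void JNICALL Java_Test_callObject(JNIEnv *env, jclass jc, jobject o) {
    target = (void *)(intptr_t)(*env)->GetLongField(env, o, HObject_handle);
}

/* a direct ByteBuffer is passed as the handle (e.g. case 7) */
JNIEXPORT void JNICALL Java_Test_callBuffer(JNIEnv *env, jclass jc, jobject buffer) {
    target = (*env)->GetDirectBufferAddress(env, buffer);
}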

The timings ... are for 10x10^6 invocations, spread over 10050 objects (some attempt to avoid unrealistic optimisations), repeated 5 times: the last result is shown.
What                                            time (s)
0  static native void call(HObject o)           0.15
1  HObject.call1()                              0.10
2  static native void call(long handle)         0.10
3  static native void call(HObject.handle)      0.10
4  HObject.call2()                              0.12
5  HObject.call3()                              0.14
6  static native void call(BObject o)           0.59 (!)
7  BObject.call1()                              0.36 (!)
The timings varied a bit so I just showed them to 2 significant figures.

Whilst case 2 isn't useful, cases 1 and 3 show that hotspot has effectively optimised out the field dereferences after a few runs. Obviously this is the fastest way to do it. Although it's interesting that the static native method (1) is measurably different to the object native method (4).

The last case is basically how I implemented most of the bindings I've used so far, so I guess I should have a re-think here. There are historical reasons I implemented jjmpeg this way - I was going to write java-side struct accessors - but since I dropped that idea as impractical, a rethink makes sense. PDFZ does have some java-side struct builders, but these are only for simple arrays which are already typed appropriately.

I didn't test the more complex double-level approach used in jjmpeg which allows its native objects to be garbage collected more easily.

Conclusions

So I was thinking I could implement code using case 0 or 5: this way the native calls can just be used directly without any extra glue-code.

There are overheads compared to cases 1 and 4, but it's less than 50%, and relatively speaking it will be far less than that. And most of this is due to the fact that hotspot can remove the field access entirely (which is of course: very cool).

Although it is interesting that a static native call from a local method is faster than a local native call from a local method, whereas a static native call with an object parameter is slower than a local native call with an object parameter.

Then again, 10x10^6 calls is a lot of calls, so the absolute overhead is pretty insignificant even for the worst case. Even if it's 5x slower, it's still only 59 vs 10 ns per call.

Small Arrays

This has me curious now: I wonder what the overheads for small arrays are, versus using a ByteBuffer/IntBuffer, etc.

I ran some tests with ByteBuffer/IntBuffer vs int[], using Get/ReleaseArrayElements vs using alloca(), and Get(Set)ArrayRegion. The test passes from 0 to 60 integers to the C code, which just adds up the elements.
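The native side of such a test looks something like this (an assumed sketch, not the actual test code), comparing pinning/copying the whole array against copying a region onto the stack:

#include <jni.h>
#include <alloca.h>

/* Get/ReleaseIntArrayElements: pin (or copy) the whole array */
JNIEXPORT jint JNICALL Java_Test_sumElements(JNIEnv *env, jclass jc, jintArray ja) {
    jsize n = (*env)->GetArrayLength(env, ja);
    jint *a = (*env)->GetIntArrayElements(env, ja, NULL);
    jint sum = 0;

    for (jsize i = 0; i < n; i++)
        sum += a[i];
    (*env)->ReleaseIntArrayElements(env, ja, a, JNI_ABORT); /* read-only: discard, don't copy back */
    return sum;
}

/* GetIntArrayRegion: copy the elements onto the stack with alloca() */
JNIEXPORT jint JNICALL Java_Test_sumRegion(JNIEnv *env, jclass jc, jintArray ja) {
    jsize n = (*env)->GetArrayLength(env, ja);
    jint *a = alloca(n * sizeof(jint));
    jint sum = 0;

    (*env)->GetIntArrayRegion(env, ja, 0, n, a);
    for (jsize i = 0; i < n; i++)
        sum += a[i];
    return sum;
}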

Types used:
IntBuffer ib = ByteBuffer.allocateDirect(nels * 4)
        .order(ByteOrder.nativeOrder()).asIntBuffer();
ByteBuffer bb = ByteBuffer.allocateDirect(nels * 4)
        .order(ByteOrder.nativeOrder());
int[] ia = new int[nels];

Interestingly:
  • Using GetArrayElements() + ReleaseArrayElements() is basically the same as GetArrayRegion/SetArrayRegion up until there are 32 array elements, beyond that the second is faster. Which is most counter-intuitive.
  • I thought that using a ByteBuffer would be slower than using an IntBuffer (which is derived from a ByteBuffer using .asIntBuffer()), but it turns out that GetDirectBufferCapacity returns the buffer size in elements, not in bytes (i.e. as documented on the java side, but different to the JNI method docs I found). Actually a ByteBuffer is a tiny bit faster.
  • If one is only reading the data, then calling GetArrayRegion to a copy on the stack is always faster than anything else for these sizes.
  • For read/write the direct byte buffer is the fastest.

But this was just using the same array. What about if I update the content each time? Here I am using the same object, but setting its content before invocation.
  • Until 16 elements, the order is IntBuffer, Get/SetIntArrayRegion, Get/ReleaseIntArray, ByteBuffer
  • 16-24 elements, Get/SetIntArrayRegion, IntBuffer, Get/ReleaseIntArray, ByteBuffer
  • Beyond that, Get/SetIntArrayRegion, Get/ReleaseIntArray, IntBuffer, ByteBuffer

Obviously the ByteBuffer suffers from calling putInt(), but all the values are within 30% of each other so it's much of a muchness really.

And finally, what if the object is created every time as well?
  • Here, any of the direct buffer methods fall down - several times slower than the arrays - 6-10x slower than the fastest array version.
  • Here, using Get/SetIntArrayRegion is much faster than anything else, it was consistently at least twice as fast as the Get/ReleaseIntArray version.
So this contains a few curious results.

Firstly (well perhaps not so curious): only if you know the direct Buffer has been allocated beforehand is it always going to win. Dynamic allocation will be a killer; a cache might even it up, but I'm doubtful it would put any Buffer back into a winning spot.

Secondly - again not so curious: small array allocation is pretty damn fast. The timings hint that these small loops might be optimising away the allocation completely which cannot be done for the direct buffers.

And finally the strangest result: copying the whole array to the stack is usually faster than trying to access it directly. Possibly the latter case either has to copy the memory from the heap anyway - effectively just doing the same thing - or it needs to lock the region or perform other GC-related work which slows it down.

Small arrays aren't the only thing needed for a JNI binding, but they do come up often enough. Now I know they're just fine to use, I will look at using them more: they will be easier to use on the Java side too.

Update: So I realised I'd forgotten Get/ReleasePrimitiveArrayCritical: for the final test cases, this was always a bit faster even than Get/SetArrayRegion. I don't know if it has other detrimental effects in an MT application though.

However, it does seem to work fine for very large arrays too, so it might be the simple one-stop shop, as at least on Oracle's JVM it always seems to be the fastest solution.

I tried some runs of 1024 and 10240 elements, and oddly enough the same relative results hold in all cases. Direct buffers only win when pre-allocated, GetIntArrayRegion is equal to or faster than GetIntArrayElements, and GetPrimitiveArrayCritical is the fastest.

Saturday 17 March 2012

Filter Graph

This morning I threw together an implementation of some of the ideas I mentioned in the last few posts for a filter/processing graph and checked them into socles.

I had a pretty good result applying the same ideas to my work stuff and I want to eventually use them elsewhere and experiment as well. The main thing I'm trying to resolve is the memory management and scheduling issues. Well I guess the other thing which is also pretty `main' is that by having a component framework it is easier to encapsulate functionality into re-usable connectible parts: i.e. less code, more functionality.

At the moment I have a dummy graph executor (it executes jobs in the right order, but not with any event management), and I still need to play with the queuing. But simple examples work because it's always using the same queue, and the climage-to-java-image conversion is necessarily synchronous. If I get it all working right, separating the graph execution from the nodes will allow me to try out different ideas - e.g. multiple queues, even multiple devices(?) without having to change the underlying code.

Actually one of the biggest hassles was attempting to support Java2D images as much as possible. I've tried to ensure images never lose any bits, although they may need format conversion (e.g. there are no 3-channel x 8-bit image formats). I haven't tested all the formats as yet so quite likely the channel orderings are broken for the colour images (it's usually easier to just run it and tweak it till it works rather than try to grok all the endian-ness issues and ordering nomenclature of every library involved). I've got a feeling it was a bit of overkill - Greyscale/RGBA, UNORM_INT8/FLOAT are probably the only formats one really needs to handle at this level. Given that there are efficiency issues, it may be that I throw some of the code away so they can be addressed with less work.

Anyway, a simple example shows that even at this stage it can be used to create useful code: SimpleFilter.java. (at this stage it's not doing resource management, but most of the skeleton is there to handle it).

Eventually I want to be able to do similar things with video sequences (via jjmpeg).

(To what end? Who knows?)

Tuesday 13 March 2012

Processing Graphs 3

So based on the stuff in my previous post I decided to ditch the stuff I had already done for work and start with the stuff I worked on in socles on my hack-day as a baseline.

I had to fill out the event management stuff, and then I had to port the few processing nodes I had already ported to my other half-arsed attempt - which apart from a couple of cases was fairly straightforward. I kept the getInputX/getOutputX stuff, implemented some image converters, and wrote a video frame source node (which turned out a bit more involved than I wanted).

Some of the code is a bit messier, and now I've implemented the event stuff I had to go back and fill that out: which was a bit tedious. The 'source' interfaces are a bit of extra mess too but aren't any worse than implementing bound properties, and in some cases I can sub-class the implementation.

The graph executor is nice and simple: it visits all nodes and runs every one once in the right order based on the graph. Each execution produces a CLEvent, and takes a CLEventList of all the events from pre-requisite incoming nodes. Each node takes a queue, a CLEventList of conditions, and a CLEventList to which to add its (one-and-always-one) event output. It won't handle all cases but it will do what I need.

Because inputs/outputs use a simple component-ised syntax, I can auto-connect some of the sources as well using some simple reflective code: this is only run once per graph so isn't expensive.

Possibly I want to allow individual 'source' instances to be able to feed off their own valid-data events too: although I think that's just adding too much complication. If nodes produce partial results the nodes themselves can always be split.

So after all of that (it's been a long day for unrelated reasons), I think it's an improvement ... at least the individual nodes are more re-usable, although some of the glue code to make the stuff I have work is a bit shit. I still have a bunch of cpu-synchronous code because I can't tie in user events with the api I have (I need to update my jocl for that), but with the api I've chosen I should be able to retro-fit that, or other possible design changes later on (e.g. multiple devices? multiple queues?) without needing to change the node code.

Actually I wish I had had more time to play with it in socles before I went with it, because there are always things you don't notice until you try it out. And things are a lot simpler and easier to test in the 'WebcamFX' demo than in a large application. This is also just a side-issue which is keeping me from the main game, so I want to get clear of it ASAP, whereas in socles it's just another idea to play with.

Sunday 11 March 2012

Processing Graphs 2

So, I looked into implementing the graph idea on Friday.

I came up with a really simple bit of code that runs everything in the right order, and a few 'nodes' which wrap some of the stuff in socles in re-usable components. I'm working on a simple component architecture that could let a GUI for example auto-query some of the node capabilities and data i/o ports.

I changed my idea on the data format stuff, and now the components explicitly import and export a given type: so either the application can know what is needed to interface disparate functions, or potentially graph code could just add it in as needed.

For example, something that might take two images and blend them together might have:
ImageSource getOutputImage();
void setInputA(ImageSource source);
void setInputB(ImageSource source);
void setInputFactor(ValueSource factor);

The *Source types are interfaces which let you retrieve the actual data or value (and perhaps meta-data thereof). Reflection can be used based on the names (setInput*/getOutput*) and/or the argument/return types (instanceof Source) in order to auto-create a GUI or auto-connect parts.

Although I didn't quite get that far, I pretty much worked out the event generation stuff too: each node takes a CLEventList condition, and a CLEventList events (these are the way all CL calls are synchronised) and honours/updates them as if it were a single call. i.e. it waits on the condition, and then sets a single event on events - either via an enqueue call or using user events - when it's done. And although the graph is invoked synchronously, this allows for asynchronous operation: i.e. any call that might block (e.g. read-modify-write from gpu) just shuffles its work off to another thread/work queue and it can sync up later. Well that's the intention anyway. I'd like to be able to automate the parallelisation via queues across the graph paths as well, but I should get something working first.

I hit a bit of a hurdle with JOCL not allowing user events to be added to an event list: which means I couldn't get much further with some of the stuff I wanted to do (allow java code to participate in the OpenCL managed event graph). If JOCL wasn't such a pain to build and then propagate to the various projects I would've added the trivial patch to get it working, but by then it was pretty much the end of the productive day anyway so I left it at that.

As an aside: I'm not sure what possessed me to actually start work on this, but I have previously considered - and dismissed - hacking together another jni binding for opencl (there are little things every now and then that get me thinking about it, but usually my senses kick in before I start). Because there's a lot more involved than just the code it's really not an avenue I want to cycle down. But this morning - complete with hangover - ugh, which still seems to be lingering now at the other end of the day - I started with openal and then looked into opencl. I use gcc and cproto to simplify the stuff I need to parse, filtered further by a few greps in a shell script. Then another big ugly perl script to process these into bindings and classes. Anyway, that stuff probably won't go anywhere. Probably.

Friday 9 March 2012

Processing Graphs

So, I've been looking into how to re-do the processing graph stuff in my customer's application over the last few days. First reason is I need to change the way it works, from a real-time interface to an interactive/batched one. And the second reason is that thinking about how Aparapi works (or could work with added concurrency) got me thinking that I really could do a better job at simplifying the whole lot, whilst making it more useful.

Currently I have a processing tree, and use threads and multiple queues to feed the data around so the main feeding thread isn't blocked on long-running tasks. There are basically 3 levels of abstraction: the top-level tree, which consists of processing nodes which are invoked in the correct order: these might call kernels directly or higher-level components. The middle level is only there sometimes, and includes more complex routines which consist of several opencl kernels combined in various ways with their work data (e.g. something like KLT). And the lowest level is the direct kernel bindings (which usually do not manage their own data), very much the same as the ones in socles. The tree only defines the invocation order; data relationships are statically created/assigned at tree creation time or dynamically synchronised at run-time.

This works ok, but it is pretty messy and none of the top level is re-usable in the least (and there's little middle-level to re-use). It is actually fairly efficient - 'real time' means for some tasks I have plenty of spare time to waste and give up to background jobs, so small bits of cpu-synchronous code aren't a deal-breaker. Obviously I want to simplify the usage of this, whilst increasing the possibilities for automatic job-level parallelism.

My first re-cut of this was to take much the same idea, but make a couple of alterations. Firstly, to re-arrange the abstraction levels so that the top level doesn't do so much direct kernel invocation, moving this code into second-tier components which are hopefully more re-usable. Either way this should be a big plus. And secondly, to simplify the data management: use the tree to define data flow, automatic data conversion between stages, plus a bit of double buffering to cope with cpu synchronisation and some cpu parallelism (at least, one-way cpu-to-gpu). But otherwise basically just a synchronous fixed call-tree managed by a single queue.

But I don't think this is going to cut it either. It's not really flexible enough, and unless I have lots of batch processes running concurrently (which I won't) the device will be under-utilised in many cases.

Last night I started working on a socles version of something similar but quickly got bogged down in the data-conversion issues which get messy pretty fast (I want to be able to mix and match image, array, and java native graph nodes without each having to worry about where the data is coming from). And that was before I got onto the synchronisation stuff.

Yesterday was just a crap day anyway: very little sleep, grumpy as hell, and not able to think straight, so it was flying blind a bit by just not being on the ball ...

Then I realised this morning (I'm sure now that I knew this, and this is one reason I didn't want to do a call-graph interface in the first place; I'd just forgotten about it): opencl already has all of the stuff required to manage the processing graph (and it can handle a graph, not just a tree) (or I would've realised earlier if my searches hadn't kept pointing to the 1.0 spec). The graph invoker just has to call the processing nodes in the right order, and it can build and maintain the event nodes used to link them all together. User events can be used to synchronise with (asynchronous/mt) java-side processing so I don't need to stop that branch entirely just to do some cpu code. All the processing nodes need to do in addition to their work is use the standard opencl condition/event set to ensure synchronisation. I can possibly even manage queue stuff automatically. The data conversion stuff will still be a pain - but it's just a pain anyway and can't be avoided.
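For what it's worth, the plumbing in question is pretty simple from the C API side (a bare sketch, error checking omitted, kernel/queue setup assumed):

#include <CL/cl.h>

void run_graph(cl_context ctx, cl_command_queue q,
               cl_kernel source, cl_kernel filter, size_t *gws) {
    cl_event ev_source, ev_host, ev_filter;
    cl_event conditions[2];

    /* node A: no conditions, produces ev_source */
    clEnqueueNDRangeKernel(q, source, 1, NULL, gws, NULL, 0, NULL, &ev_source);

    /* a host-side node is represented by a user event, completed by whatever
       thread ends up doing the (asynchronous) java work */
    ev_host = clCreateUserEvent(ctx, NULL);

    /* node B: waits on both of its inputs, produces ev_filter */
    conditions[0] = ev_source;
    conditions[1] = ev_host;
    clEnqueueNDRangeKernel(q, filter, 1, NULL, gws, NULL, 2, conditions, &ev_filter);

    /* ... meanwhile, when the host-side work is done: */
    clSetUserEventStatus(ev_host, CL_COMPLETE);

    clWaitForEvents(1, &ev_filter);
    clReleaseEvent(ev_source);
    clReleaseEvent(ev_host);
    clReleaseEvent(ev_filter);
}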

Representing the graph in a simple way and turning that into an invocation sequence with events is another issue, but at least it gives me something new (to me) and useful to learn about.

Monday 5 March 2012

Aparapi

So, I kept watching some more of the AMD fusion summit videos yesterday evening - and now I remember one reason I haven't watched them before: they take a long time to get through!

The GCN one was interesting (from 'Mike and Mike'), although the first Mike was a bit nervous, and that together with his Texan accent and rushing a bit made him a bit hard to follow.

But the Aparapi one gave me some new respect for the project.

Although one thing I didn't think was correct was he kept going on about how Java programmers are completely braindead and don't know how to code for performance or deal with other issues (it's not like they're Python coders FFS):

Pointers?
Java coders have never seen a pointer eh? Object references behave the same because they are the same, and although you can't increment/index them, most coders stick to direct pointer dereferencing and array indexing anyway: i.e. exactly the same way they're used in Java. And they both have null pointer dereferences, it's just that Java's are non-fatal. So one could say Java programmers have never seen a fatal pointer access, but just because they're called object references or arrays, doesn't mean they're not pointers ...

Other languages?
Java programmers seem to love XML for some reason: they can cope with other languages. Look at ant: it's an XML form of bourne-shell! Not to mention Scala and so on.

Performance
Actually any of them already using Java for engineering/science know how to get performance: use objects of (single-dimensional) arrays, not arrays of objects. Write really ugly maths to cope with it. There's just no other practical choice without taking unacceptable performance hits.

And this goes for more fundamental understanding of the underlying architecture too. I know everyone likes to make out that abstraction means you don't have to know about registers and cache and how language constructs are executed by the hardware (and I know some schools of computer science try to hide such details), but that's bunkum. It affects every language because at the end of the day they're all executed as machine instructions.

Threading, barriers, etc.
Again for performance you can't avoid these. Java barriers are also the same semantically as OpenCL barriers and are a very simple concept if you're able to cope with the concept of concurrency at all.

Dynamic memory
Absolutely agree here, this is one thing Aparapi should hide.

So if a coder can't cope with these concepts already they are not going to be a potential user of Aparapi, so it seems strange to only target them as a potential adopter of the technology.

As with these concepts in Java, these 'braindead' coders will just use third party libraries and (ugh) frameworks to do this fiddly stuff for them, because even the simplest concurrency model that Aparapi presents (which is more like an OpenGL shader than a CL kernel) will be beyond them.

It's the actual concurrency which is the hard part: and for the most part OpenCL's concepts make the concurrency easier to deal with (or even, possible to deal with). So hiding the tools to deal with the concurrency whilst exposing the concurrency does seem a little counter-productive.

So I don't think hiding all the details of OpenCL is necessarily a good idea: sometimes one does need to know about data flow, local memory, 3-d workgroups, and barriers. 'system' and 'framework' programmers already need to know about threads and concurrency and so these related concepts will not be foreign to them. Although I noticed that Aparapi now has support for local memory blocks and so there's hope yet.

I'm not sure why Aparapi uses such an explicit memory transfer model either, when it could have managed the buffer memory in a simpler way. e.g. Java has accessors, why not use them to determine when buffer memory needs to be returned to the host? The data flow should be quite explicit from the kernel invocation order: no need to analyse the host code for this (and this is problematic: I don't see how the host code can invoke multiple kernels in a sequence from the example, since each kernel has its own single 'host' method - imho 'host' should be an attribute on any method rather than a single one, but maybe the api has moved on since then).

However, taking into account the future plans of the AMD/ARM platforms ... Aparapi has the potential to be much much more useful as the programming model could map quite well to the future target design. i.e. once one has zero-copy unified memory and a low-overhead job queue mechanism the cost of an Aparapi kernel call will become very low.

Although reviewing the plans, I suspect they're being a bit optimistic: I only recently discovered for example that async memory copies between CPU and GPU are only a 'preview' feature in the current sdk ... which was quite a 'WTF' moment - how seriously IBM-PC-XT is that ... OTOH moving the ringbuffer processing to the hardware removes most of the OS-specific code required to talk to the hardware and so will reduce the resource requirements for driver development (i.e. we won't have to wait for the driver devs, it'll already be done by the hardware guys).

I think the biggest problem for Java however is in the language itself, and in particular its lack of mathematical expressivity[sic]. It doesn't support vector types for starters - which are handy in this case more for their ability to concisely realise mathematical expressions than for their ability to map efficiently to SIMD processors. And even if one does wrap these in objects, using them is clumsy, error-prone, and even messier if you're aiming for efficiency. e.g. it's simply cleaner writing C or OpenCL C code to perform most maths than it is doing the same in Java (in its purest object form, let alone optimised for practicality).

So although a lot of problems will probably map fine enough, occasionally one will be better off with lower-level access. If one could override the kernel code with your own string, but let Aparapi handle the buffer management and let them all interoperate, it would be very cool. Even if, for example, it compiled the OpenCL code into Java bytecode as the fallback (ok, this is a rather much bigger problem to solve ...).

Sunday 4 March 2012

OpenCL

So I've been wondering just what opencl can be 'used for'. Apart from making an image editor or video tool run faster, I can't really think of what it might enable. As important as those tools are - and OpenCL does make some features possible now which were not before - they are very niche products of little direct benefit to most of us (unless all you do all day is sit on your arse consuming hollywood movies and glossy magazines, which is probably - utterly sadly - a majority or at least a major minority).

Unfortunately when searching for 'opencl kill app' the first hit is my own post of the same title ... ahh well.

Anyway, along the way I found this blog which has some interesting observations and ideas about OpenCL hacking.

For one I had missed that opencl 1.1 added a shuffle instruction: which is pretty much required for good performance in 'data streaming' applications: using SIMD to access non-aligned data. It can also be used to implement a vector index lookup. However, I'm not sure if this is a single-instruction function on GPU hardware, or whether it was just added to the spec for the CELL BE backend (without shuffle, the SPUs are kneecapped). Although it isn't too hard to find out by dumping the source code - which I did, and found that it isn't efficient at all. Oh well.

So back to the question of OpenCL applications: as from the last time I wrote about it, enabling desktop performance on a hand-held computer is probably still the main thing we will all eventually see (although it's taking its sweet time). This latest iteration of the thin client/server model (which harkens back to the original mainframe idea) won't necessarily last, as no previous iteration ever did. And although internet connectedness will be pervasive, you won't need to go to a back-end server to do the work when you have 100 gflops in your pocket.

The other thing, which I think is more interesting, is that OpenCL will enable cheaper design of complex systems - which means more competition and more products for the general public. So for example it has always been that even from the days of 8 bit computers you could design a hardware component to increase performance (sprite hardware, sound synth, etc), and this still allows advanced performance from portable or low-power devices. As OpenCL technology matures and die sizes continue to shrink, it will become feasible to replace the expensive custom silicon with more and more software, and so as we went from custom silicon to fpgas, we may well move toward more software-only solutions even in portable devices. OpenCL might be a prick to code in, but it's a lot easier than FPGAs are, and you can still get higher performance from purpose-built functional units (FPGA's big edge is in power requirements, and concurrency).

So I think OpenCL's killer application is not so much a consumer-facing one as a tool for system developers. Or at least, removing the need for system design input for a given application-specific device, and opening up similar facilities to all developers. This is pretty much the gist of one of the AMD talks from a couple of years ago (I think from the AMD developer summit, perhaps from 2010 as I can't see it in the 2011 talks - many of which look interesting enough to watch) about how they were able to produce results in less time and with fewer steps than going the hardware route. i.e. lower costs, higher payoffs, and so on. I.e. pretty much what FPGAs did for system design, OpenCL may do as well: it might not be as fast, but the reduced development costs and overheads for low-volume production, Moore's Law and so on more than make up for it.

As an aside, when I was playing around with the beagleboard I also looked up some arduino stuff from time to time. I thought it was amusing that so many 'old hands' would whine about how overpriced the arduino was and that you could get away with something simpler like a PIC and so on. It was amusing because we can all see where this is going with a computer such as the Raspberry PI coming out: these things are getting so cheap and ubiquitous that before too long those 'low cost' simple parts will only be made by niche manufacturers to service dying equipment: i.e. they will become too expensive for anything (not to mention the skills required). Before too long it won't be economically viable to use anything smaller than a 32-bit floating-point cpu for anything requiring some calculations.

It's a pity that the whole patent issue is getting so ridiculously out of hand: now that we are on the verge of effectively commoditising the entire platform as it pertains to signal processing, we're all going to be thrown into a dark age of propping up the leeches and rentiers who seem to have captured governments the world over for their own benefit.

Update: So ... I did end up spending a couple of hours today watching some of the AMD fusion summit videos from 2011 (with a bit of a hangover I wasn't up for much more than this!). Quite interesting, I guess I should keep an eye out on this stuff a bit more, but when I'm busy with work or on leave I don't always keep up with it.

The first keynote was quite interesting (after the waffle from the first guy), the future direction of the 'HSA' (heterogeneous system architecture) platform. At one point there is a graphic showing a bunch of hardware driven queues where applications feed in job tasks directly from their process space and they are picked up and executed by the hardware directly ... without a cpu context switch or kernel-mode memory copy in sight.

Now THAT is interesting.

Although ... TBH what the diagram most reminded me of was ... non-copying asynchronous message passing, and where have I seen that before? Oh ... AmigaDOS 1!? Well, it only took nearly 30 years to finally catch up ... sure there are differences in the capability, but given the technology of the time it's definitely the closest to the model proposed here. i.e. unified memory (CHIP ram!), separate processors working together, non-copying data flows, etc. It's more than just about the non-copying of data (although that is pretty important too), it's about keeping the ALUs fed with data, and avoiding the overheads of system interactions and exception/syscall processing (this is of course why AmigaOS worked in a similar way). Already with GPUs, keeping them fed with data is basically impossible - they spend a lot of time waiting around twiddling their thumbs.

Still, it's funny how nice hardware design is when intel aren't involved ...

Speaking of not-intel, the talk from the ARM guy was also quite interesting. From their perspective the push for heterogeneous compute is not just about getting the job done faster, it's about getting it done faster whilst using less power. With the process shrink there isn't enough power/heat sink to run all the transistors at once, so you have to divvy up the work in a way which gets it done the fastest without wasting the silicon, the power, and so on. And on a mobile device it's about trying to utilise all the extra gates at your disposal to keep pushing the edge of performance, whilst still keeping within the heat and power budget. This obviously fits directly in with AMD's software stuff: how to actually programme such a chip in a practical way.

Although to me what is most interesting about this approach is that it acknowledges that there really is a hard physical limit on the amount of work a single silicon chip can do. And even though they're making them smaller and can fit more transistors they can't turn the new ones on at the same time. I had seen this in the OMAP processors but I thought that was just about power saving: not burning out the chips. Thus the only way to improve performance and utilise the extra silicon further is not by increasing the concurrency but by increasing the specialisation of the functional units, and having more of them. This is quite a fundamental change to hardware design: where normally one tries to maximise resource utilisation. (and yes, it contradicts my speculation above to some extent: but I wrote that before seeing the talk)

And it has pretty far-reaching implications for software development too, particularly for free software, if the specific functional units aren't available to everyone or use proprietary instruction sets or firmware.

I'm not sure i'm convinced of the whole cache coherency model though: I thought the CELL BE was a better idea for high performance, since each CPU had its own dedicated simple memory that runs fast (=== LDS), and knowing you have no cache coherency just means you code differently (and avoid a few pitfalls along the way). The main benefit you get is much higher effective bandwidth: if everything has to go through a global cache you're still hitting a bottleneck. Still, it's all about the system performance, not just the fpu, and about being able to practically achieve a good fraction of the theoretical peak in a reasonable time with a decent programmer: and these are system issues beyond the raw numbers and queue mathematics.

Well off to the GCN video: although from the last AMD roadmap there is a NEW GPU architecture headed our way by the end of the year, so I'm not so sure what's going on with GCN.

Saturday 3 March 2012

socles denoise/sharpen

This morning I ported the ImageZ denoise & sharpen stuff to soclesdemo now that I have the QWT code (mostly?) working.


I still get occasional GPU hangs, but it is completely random: maybe it's code bugs, maybe it's driver bugs ...

It scales fairly well: I tried a 4288x2848 pic, and it was still interactively-quick (swing stuffed up visually since it isn't in a scroll-pane, so I couldn't see how long it was taking), and much of that time was to/fro java/swing/X. For a 1546x1178 image it was about 8ms for the full wavelet processing (kernel times). It's only greyscale. This is 1-2 orders of magnitude faster than a contemporary high-end cpu, so is well worth it: and it makes real-time 1080P processing possible.

For an interactive application, however, you would only need to perform the forward transform once and then just apply the thresholding and inverse with each change. So in that case it could be quite a bit faster.
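In code terms that split would look something like the minimal Java sketch below; the forward/shrink/inverse hooks are hypothetical placeholders only, not the actual socles or ImageZ API:

  // Cache the (expensive) forward transform; re-run only the cheap steps per change.
  abstract class InteractiveDenoise {
      private float[][] coeffs;   // cached forward-transform coefficients

      // Run once whenever the source image changes.
      void setImage(float[] image, int width, int height) {
          coeffs = forward(image, width, height);
      }

      // Run on every threshold slider change: shrink coefficients, then invert.
      float[] update(float threshold) {
          return inverse(shrink(coeffs, threshold));
      }

      protected abstract float[][] forward(float[] image, int w, int h);
      protected abstract float[][] shrink(float[][] coeffs, float threshold);
      protected abstract float[] inverse(float[][] coeffs);
  }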

Source

Friday 2 March 2012

QWT

Since I haven't done any work in OpenCL for a while I thought on my hacking-day today I'd poke around socles for something different.

I managed to mostly get the quaternion wavelet transform code I started pre-xmas working, at least for the forward transform. It is based on the code I have in ImageZ, which i've mentioned before. This can then be used to form a dual-tree complex-wavelet-transform 'on the fly' as I do in ImageZ, or by copying to another structure (these both have some fairly interesting properties, so far i've only done work with denoise/sharpen, but there are also other applications such as registration).

It was actually a bit of a ramp-up to remember where the code was at, and then to decipher what it was supposed to do. For what is a fairly simple algorithm (dual convolution - the difficulty in wavelets is the filter design) there is a lot of very fiddly addressing and mucking about. Implementing upfirdn, which is essentially what it uses, is a pain even with such simple ratios as 1:2 and 2:1.
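For reference, the two special cases of upfirdn a dyadic wavelet transform needs boil down to something like the plain-Java sketch below (zero-padded edges, no optimisation; this is only an illustration of the addressing, not the socles kernels, which handle boundaries and layout differently):

  class UpFirDn2 {
      // Analysis: filter then keep every second output, y[i] = sum_k h[k] * x[2*i - k].
      static float[] analysis(float[] x, float[] h) {
          float[] y = new float[(x.length + 1) / 2];
          for (int i = 0; i < y.length; i++) {
              float s = 0;
              for (int k = 0; k < h.length; k++) {
                  int j = 2 * i - k;
                  if (j >= 0 && j < x.length)
                      s += h[k] * x[j];
              }
              y[i] = s;
          }
          return y;
      }

      // Synthesis: insert a zero between each input sample then filter,
      // i.e. only even indices of the upsampled signal are non-zero.
      static float[] synthesis(float[] x, float[] h) {
          float[] y = new float[x.length * 2];
          for (int i = 0; i < y.length; i++) {
              float s = 0;
              for (int k = 0; k < h.length; k++) {
                  int j = i - k;
                  if (j >= 0 && (j & 1) == 0 && j / 2 < x.length)
                      s += h[k] * x[j / 2];
              }
              y[i] = s;
          }
          return y;
      }
  }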

Although I seem to be getting an ok result, sometimes when I execute it on my GPU I end up having to hit the reset button ... so it's stuck on the CPU for now. Probably just some bounds checking errors. I'm not really on the ball today, pretty tired and grumpy, and I spent the better part of the day working on it so far; bit shit considering I already had a working Java implementation and most of the OpenCL scaffolding done. I guess not every day can be a super-productive one.

As above, i've only worked on the forward transform so far. Most of the tricky stuff is in the DWTGenerator which started as a copy of the convolution generator class, although QWT.forward() is a lot hairier than it looks too. I also managed to improve on the ImageZ code a tad: the redundant copies and vertical transforms aren't needed as I in-lined the sub-routines and have simpler data management.

Update: Well I just kept poking at it, and eventually ended up with getting the inverse transform working: sometimes it just takes perseverance. Unfortunately the forward transform still crashes my GPU quite regularly and I haven't traced that down yet.

Between the resets I did manage to get a couple of runs through sprofile: it's about 350µs per transform for a 512x512 depth=1 transform, on a Radeon HD 6950. There are some LDS bank conflicts in the X convolutions, although the main bottleneck is the Y convolution since it cannot benefit from LDS and relies on coalesced reads and the global cache. I'm reasonably happy with that, and i'm not sure I can get much more out of it.

Thursday 1 March 2012

jjmpeg updates

I finally checked in some jjmpeg changes I had lying around, added seeking to JJMediaReader and an icon creation helper (hmmm, maybe that was api bloat ...) and a few other odds and sods.

I've been playing with some interactive video stuff for work and when the video file is in good condition (i.e. seeking works properly) it's quite amazing to me just how zippy it all is - coming from the days of the C=64 I still can't get over just how fast modern machines are. And Java of course makes the multi-threading required to make the GUI very responsive an absolute doddle.

Absorbing rapid events

One idea I borrowed from my experiments on ReaderZ was a fairly simple mechanism to collapse rapidly incoming events. e.g. for a slider bar calibrated in ms, one can get many, many updates as the slider moves; more than can be accommodated whilst panning an e-ink display or seeking around HD video. In the past i've either used a timeout, or some other throttling mechanism on the caller end such as an 'i'm busy' flag. This usually needs some other logic to handle completion cases, perhaps cancelling of jobs and other quite complex synchronisation tasks to ensure valid programme state when it's all done and dusted.

In ReaderZ I tried a different approach in order to simplify serial processing:
  1. Incoming tasks are queued as they arrive into a blocking queue.
  2. A consumer thread waits until something arrives on the queue.
  3. The consumer thread then polls the queue for any other tasks waiting to be processed.
  4. Based on the class of the request, jobs are discarded explicitly. For example, if you have a seek followed by an open or another seek, the first seek can be thrown away. i.e. the command is either kept, changed, or nullified.
  5. Then at most one job of each class is executed, in the correct order.
  6. Repeat: go back to step 1.

So basically a simplified broken-apart state-machine with explicit state reduction. If a given job is indivisible/can't be ignored (say, 'save current image'), then the collapse processing is cut short, and it jumps to step 5.

This way there's no need for any locking (apart from the task queue): the code is always called from a single thread, with guaranteed execution order and with simple state management. And even when something does take a while to run, it eventually catches up and never does more than one lot of extra work. It does need to ferry ALL sequentially-oriented tasks through the command queue, but usually one has a fairly limited number of operations required. In code it ends up looking something like the sketch below.
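A rough Java sketch of the loop; the Command interface and the 'keep only the newest command of each class' collapsing rule are simplified stand-ins, not the actual ReaderZ or video player code:

  import java.util.LinkedHashMap;
  import java.util.Map;
  import java.util.concurrent.BlockingQueue;
  import java.util.concurrent.LinkedBlockingQueue;

  class CommandWorker implements Runnable {
      interface Command {
          void execute();
          boolean collapsible();   // false for indivisible jobs, e.g. 'save current image'
      }

      private final BlockingQueue<Command> queue = new LinkedBlockingQueue<Command>();

      // Called from the GUI thread (slider callbacks, menu actions, ...).
      void post(Command c) {
          queue.add(c);
      }

      @Override
      public void run() {
          try {
              while (true) {
                  Command first = queue.take();               // steps 1+2: wait for work
                  Map<Class<?>, Command> pending = new LinkedHashMap<Class<?>, Command>();
                  pending.put(first.getClass(), first);
                  Command c;
                  while (first.collapsible() && (c = queue.poll()) != null) {   // step 3
                      pending.remove(c.getClass());           // step 4: newer replaces older
                      pending.put(c.getClass(), c);
                      if (!c.collapsible())
                          break;                              // indivisible job: stop collapsing
                  }
                  for (Command cmd : pending.values())        // step 5: run survivors in order
                      cmd.execute();
              }                                               // step 6: loop
          } catch (InterruptedException e) {
              Thread.currentThread().interrupt();
          }
      }
  }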

Because the same thread is used to decode and play the video for my video player, if it is in 'play' mode, this just polls the command queue after each frame is displayed rather than waiting for it to contain something; but the overall logic is the same.

Also because it's done in one place I can more easily add a timeout if I really want to make sure the system is idle: rather than a separate timeout callback which needs resetting and gets called once things are done, I can just add a timeout to the Queue.poll() invocation. Or just as easily not, for example if it's been faffing around collapsing too many commands and hasn't updated the output for a while.
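For instance, the take() in the sketch above could become a timed poll; a hypothetical helper, with the 250ms figure and the idle hook made up purely for illustration:

  import java.util.concurrent.BlockingQueue;
  import java.util.concurrent.TimeUnit;

  final class IdleAwareWait {
      // Wait for the next job, but notice when nothing has arrived for a while.
      static <T> T waitForWork(BlockingQueue<T> queue, Runnable onIdle)
              throws InterruptedException {
          T job = queue.poll(250, TimeUnit.MILLISECONDS);
          if (job == null)
              onIdle.run();   // idle: flush any deferred output, redraw, etc.
          return job;         // may be null; the caller just loops again
      }
  }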

Speaking of ReaderZ, although I don't have any other plans for it at the moment, I am waiting for the next version of mupdf to be released at which point I will update PDFZ to match that. It should be quite soon.

Although I wasn't going to look into JavaFX too closely, every time I hit a problem in Swing I keep thinking the solution is a dead-end and the time spent on it is wasted. Unfortunately the one thing I need the GUI toolkit to do - i.e. display an array of pixels generated elsewhere - is one of the major things it cannot do yet! It can only read images from a url/disk, and that feature is targeted for 3.0 - about 18 months away. I suppose i'll just have to wait ...

Damn I wish I wasn't so tired. Must be the weather ... Autumn started very suddenly on the 29th.

It's scary joining a free software project?

I started writing a comment on this post about contributing to free software, and it got so long I thought I'd move it here.

Overall I agree: it is quite scary, but the comment I was writing follows, somewhat expanded.



For the types of people for whom meeting people is difficult, software projects cannot be any different because the same notions apply: you don't really know how someone will react to you and whether you will be accepted or respected. I've been writing free software for about 15 years, and before that I gave away 'freeware' as well, and i'm probably more anxious about contributing to a new project than i've ever been ...

It is also unfortunate you use the term 'open source', because clearly merely having access to the source code makes no representation on whether a project is even interested in contributions. There are many reasons people write software and publish it freely, and for many projects, success or popularity is simply not a concern: the developers don't really care what anyone thinks because they have something they use themselves and the sharing is already an end in itself.

However, ignoring the specific terminology used, trying to tar a wide audience with the one characteristic they share is generally a pointless exercise. e.g. that all Greens voters are smelly hippy vegetarians, conservatives are all gun-totin' 4WD drivers, etc. People with only a few things in common are still very different from each other.

And just because the source is available and has a project page and a mailing list, it doesn't mean the project is interested in contributions from the general public. But clearly Layfield's experience is pretty poor - if a project purports to desire contributors and has a work-wanted list, then at the bare minimum civility and politeness should be present. If such a project intends to survive on external contributions and treats people that way, it won't live too long.

Some of my experiences:

a) 'my first elisp' code, which turned into a handy script to add java-doc like comments to C functions. I submitted this to emacs, but RMS wanted it integrated into CC mode and a bunch of other stuff which was well beyond the time I wanted to spend on it or the features I needed (and I wasn't particularly interested in the kudos of contributing to a high-profile project). So I just added it to the project repository and my .emacs and left it at that.

(needless to say, I never wrote any elisp subsequently, but that's because I just wasn't interested in lisp as a language and that was the sum-total of the lisp I ever wrote).

b) AROS - these guys were very easy, commit access was easy to get, and then it was pretty much commit what you liked - obviously avoiding stepping on any toes. Even for a project with a lot of politics, there were plenty of small holes to fill.

c) I submitted a patch to mplayer, which was accepted without too much fuss. Just a bit of formatting changes iirc. In hindsight this was smoother than i'd have thought: certainly at the time they gave an abrasive impression of themselves on their web-site.

d) I think the first free software i contributed to was a patch to amanda circa 1995 - amanda is a distributed backup system. It was a horrible patch in hindsight but they accepted it easily. Of course this was back in the day when the internet was only accessible to academics, students, engineers, and sysops and overall was a much nicer place (yes, despite the flame-wars).

e) Working on Evolution. This was a commercial product with a (reasonably) defined direction and design. It was also complex enough, and with enough of a user-base, that any changes needed a lot of checking to make sure they were going to work technically and be up to scratch in terms of quality. Although the whole team spent quite a bit of effort trying to increase the community involvement, in the end I didn't really like being offered anything but the most trivial of patches, because it was always much faster just to write the code myself. Or I felt like a real heel telling some young lad that we couldn't use his patch because it didn't fit with the PM-driven agenda. The one time we did accept a sizeable patch (I was on holidays, so was overridden), I spent weeks replacing a poor implementation which caused a lot of problems with a decent one. Nobody ever became a long-term contributor, so we were left to learn and maintain any patches they gave us as well. I thought the 'bounty' system was an unmitigated disaster and would never consider such an approach again. People who are desperate for the paltry money on offer are probably not the cream to start with, and it is also very unlikely to lead to long-term unpaid commitment.

Developers ...

Developer scalability was a huge issue in evolution: with thousands of reported bugs/feature requests and 2-3 coders there's just nowhere to even start making a dent. People wanted stuff we could never deliver (either too costly, unfit for the application, etc), and some people were nasty and insistent arseholes who wouldn't take no for an answer, or wouldn't take any time to try a patch or other work-around suggestions (which obviously took non-trivial effort to provide). Crash reports were rarely followed up, and without being able to re-create them were basically useless. Not to mention distributions (esp. debian) re-packaging the code in ways we could only guess at, and maintaining their own separate patch-sets. Placing bugs into 'wishlist' limbo was just as bad as saying 'wontfix', since they were never going to happen.

This latter point about scalability can't be ignored even for projects which do actively seek contributors. Every contributor comes along with a clean slate and thinks they're the first to be in their position. Yet for developers they might be one of hundreds, and even after giving out the same information only 10 times one gets pretty sick of it. This is actually one reason I find it more difficult to contribute to projects now: I don't want to piss someone off because I couldn't find their FAQ or didn't search the email archives enough, or they're still anal about 'top posting' (I really can't believe anyone still gives a rats-arse about that anymore ...).

Submitters ...

The 'problem' isn't just with the developers either: for example, what is the motivation behind the people submitting the patch? Why should a developer be particularly interested in a patch from someone who is just after the experience of submitting a patch? Or hoping for the fame of having a bit of their code included in a popular application?

I would certainly be much more interested in a patch from an active user who has found a deficiency in their day-to-day active use of the project versus someone who is just looking for something to do or something to add to their CV.

And if you're not a direct or close peer of the developer, the relationship is in quite a different space and the developer has become a mentor. It takes far more effort and resources to be a mentor, and in the vast majority of cases that effort is never returned to the project. The goal of most projects is to provide a solution to a problem, not to train people how to code or interact with a public project. It's quite arrogant and rude to assume that just because its code and mailing list are available to the public, the public has free rein on developer time ...

Me ...

Now, i'm definitely not interested in authoring applications for the general public. I get paid to work on a research project with a single individual as the sole customer I deal with. And of my free software projects, the only ones of worth are only useful to other developers. And even then most of those are just stuff i'm playing with for my own entertainment; it is therefore costing me nothing to share it with the world, and i'm more interested in helping people learn than solving their problems (that's not to say I don't get a buzz out of knowing my stuff is used - I check the stats all the time - but it isn't the motivation at all).

I don't think it will happen any time soon (not the least reason being that I'm miles away from building anything useful to the average user): but having a project of mine picked up by a distribution would be quite unappealing.

As for patches, I still submit the odd small patch here and there. But what turns me off is:
Anal retentiveness about specific code style, mailing list etiquette and so on.
I used to do this way way too much in evolution: if the patch was basically ok I should've just taken it and fixed it up to match my preferences and fixed minor errors. Once one has commit access things are different, but for a random patch it's just not worth the to-and-fro and aggro. Prior to ximian I'd had a pretty unpleasant experience working with an Indian sub-contractor (TATA Infotech) - we were being anal about their (really bad) code because we were paying a lot for it for no good reason - so I was a bit thingy with code reviews.

Worrying about 'top posting'? How 90s, get-the-fuck over it.

Using git (or some obscure cms)
I just hate git to start with. And being asked to create a public fork of a project for a one-off small patch is low on my list of `things to do before i die'.

I created the first patch I ever submitted with 'diff -ruN', and that's still a reliable way to do it without having to learn obscure commands for half a dozen popular systems.

Too many pre-requisites.
e.g. joining a mailing list and a bug tracker in which you must create a bug and attach the patch, copyright assignments, and so on. It's ok for a couple of projects or if you become an active long-term participant, but it quickly gets far too unwieldy if you're just submitting a rare patch to some product you use occasionally. Another thing we (totally!) fucked up in evolution.

Obviously for a team project a mailing list (or forum) is pretty much essential, and legalities might require a copyright assignment or other agreement, but the bug systems tend to give me the willies.

Build complexity.
Some projects are just too complex to build or require too many pre-requisites: rubbish like ant, cmake, and all the other weird-arsed build systems (jam, bitbake, custom python, rake ...) become insurmountable barriers that stop you even getting started.

I was quite astonished the other day that I even got a cmake based project to compile at all. If it wasn't for netbeans I wouldn't be using ant, and even then it sometimes fucks it up.

Python.
Which reminds me ... it's just a personal thing but I detest python in any form. tcl isn't far behind.

Arrogance.
Using a project leader's celebrity or the project's popularity to make one feel like it is an absolute privilege to be doing some of their work for free. I don't really encounter this because I'm just not attracted to such projects, but there needs to be some sort of recognition that work is being done for free (assuming it's at an adequate level for the complexity of the patch).


Concluding ...

To respond to the final question of what can be done to improve first impressions, I think i'd just say 'not much'.

Unless your specific goal is to maximise user contributions and popularity amongst potential contributors you're probably not too concerned about what they think. And if you are, you're probably already doing all you need to do.

And importantly beyond some fairly basic civility, there should be absolutely no obligation on you, as a free software developer, to provide any sort of expected level of support or accept patches in any form from anyone whatsoever.

If one wishes to be popular and extra-friendly then all the better for you and your users, but it is certainly no pre-requisite to calling your software free (or the related but somewhat meaningless term 'open source').