Sunday 29 March 2015

more in less

Curiosity got the better of me again and now there's 3:30am night and early rise behind me and pretty much another whole weekend "down the drain".

I had another go trying to grok VCDIFF format but failed again: not sure if it's my encoder or decoder but I can't get them to agree on the address compression and so it fails on most data.

Putting that aside I experimented with some ideas on address compression from VCDIFF, and then ended up creating a new format based on what i learnt from it and other experiments. My work with VCDIFF demonstrated the utility of address compression and dual operation instructions.

Anyway I think it turned out quite neat and tidy so here's a summary. It might be useful if you're trying to understand VCDIFF - its very different but borrows similar mechanisms.

Opcodes

There are basically two families of opcodes: dual-operation and single operation. The MSb selects which so each has a full 7 bits of information to play with. This allows for quite a wide range of data lengths and operations to be encoded into a single byte.

Only 6 individual operations are defined together with 2 dual operations.

Integers are encoded like many formats as unsigned 7-bit big-endian with the MSb as a continue bit. Addresses are encoded independently using a separate mechanism as described below.

There are a couple of other format parameters:

smallest
The smallest copy size possible. Encoder dependent.
split
The split point for the single operation opcodes. Fixed at 100 but could be changed.

Dual operation instructions

The encoder tracks the last instruction and if possible can combine two into the same instruction. This is one of the things vcdiff uses to get more compact deltas.

0Taaabbb
  0 - selects a dual instruction
  T - the type of the first operation, 0=ADD 1=COPY
aaa - 3-bit length for first operation.  This allows a range of
      (1-8 ) byte appends or (0-7)+smallest copy.
bbb - 3-bit length for second operation.  It is always COPY.
      This allows a range of (0-7)+smallest copy.

3 bits allows for an ADD of 1-8 bytes and a COPY of (0-7)+smallest bytes. With a smallest size of 6 bytes this allows ADD(1-8)+COPY(6-13) or COPY(6-13)+COPY(6-13).

This is all bits exhausted so the only dual instructions possible. Any data or addresses required are encoded in order in the following bytes.

I also investigated a fixed function 3-bit ADD and 4-bit COPY but it wasn't as good (albeit with very limited investigation).

Single operation instructions

The single op instructions allow for longer immediate copies as well as extended instructions which encode the length as a separate parameter.

Rather than split the data in bit-selected blocks a single 128-value number is broken into several ranges which are interpreted differently.

1nnnnnnnn
  1 - selects single operation
  n - 7-bit number interpreted via inclusive ranges as follows:

000 - 099   copy (0-99)+smallest bytes
100 - 123   add (1-24) bytes
      124   read length.  copy length+100+smallest bytes.
      125   read length.  add length+24+1 bytes.
      126   read length.  run of length+3 bytes.
      127   eof/reserved or something

The split was based on some statistical analysis of a couple of files: copies cover a larger range than adds, and runs are rare. Well and 100 is easier to work with.

For a smallest of size 6, this allows a single instruction to encode a copy of 6-105 bytes or an append of 1-24 bytes. These cover the majority of cases for the data i've been testing with (not that it is a very big set) and the behaviour of the matcher.

The multi-byte encoding is such that the 6+ and 4+ bits already implied taken care of are removed from their lengths which can save the occasional `overflow' byte.

Addresses

This format generated slightly smaller deltas than the previous format but I knew from my experiments with VCDIFF that address compression could increase the gains. The problem here is now i've run out of bits to use so I had to come up with a solution which encoded addresses independently of any operation codes.

Setting aside even a few bits for each address would be too costly so after a bit of thought I came up with a solution based on the observation that most addresses will be >127 and require 2 bytes at least anyway, therefore if I just encode the rare all-7-bit addresses in 16 bits instead it leaves a full 7 bits to use for other encoding schemes whilst retaining the `natural use' of those bits for longer values.

The next problem is how to use those 7 bits. VCDIFF uses 3 bits to select from a near/same table to chose either a positive offset of the last 4 addresses or together with an octet to select a specific address from a table of 768 (using the default code table). I did some experiments to find out which aids more: 'next' is used more often but 'same' saves a lot each time it is used; but it needs 11 bits to do it. Too much here. I also found that adding a sign to the near addresses offset improved the results.

I chose a trade-off which has features of both but requires fewer bits. It combines the same and near table into the same array and adds a sign to the offsets of near addresses. Because I don't need the sign bit for 'same' addresses I can use that to increase the address space of the same table. This allows a full 6 bits to be used for the match table and 5 for the near table.

It is necessarily slower than VCDIFF because I perform a linear search over these values to find an exact match (64 elements) or the nearest match (32 elements). The tables could be overlapped: they are just the last 32 or 64 addresses encoded or decoded stored using a cyclic index. Like VCDIFF both encoder and decoder must maintain this in sync.

This is the encoding used:

00nnnnnn - This is an exact match and a complete address.
           n is an index into a 64-element table.
01Smmmmm - This is a near match.  S is a sign bit and m is
           an index into a 32-element table.  The offset follows.
1aaaaaaa* 0aaaaaaa - An absolute address.  To avoid clashing with
          the above it is forced to at least 2 bytes length.

Note that absolute addresses are just encoded as simple integers: with no `wasted bits' for the other information if their value is greater than 127; which is very likely in the common case.

The encoder is free to choose the smallest option of these for any address in the stream.

Results

These are with the same encoder settings of a 'smallest' of 6 bytes, as with the other benchmark data from the home page.

                       dez-1.2     dez-?.?     gzip -4
  GPL2 to GPL3         13 591      12 053
  jjmpeg.dll           10 809       8 770
  bible (compress)  1 731 250   1 539 501   1 550 998 

Runtime is a tiny bit longer for the shorter files due to the address table lookup, although i haven't optimised the address table scan yet. It's still 180x slower than gzip on the KJV.

This actually beats my current VCDIFF encoder but since that is still broken it's pretty much useless to compare to it. Even a single bug can radically alter the patch size.

But one (admittedly small) plus is that unlike VCDIFF this format is fully streamed and doesn't require staging. Actually another plus is that the code is quite a bit simpler due to a more orthogonal instruction set and few special cases, but it only has an implied-fixed code table so isn't as flexible.

Can it go smaller?

Because the string matcher performs an exhaustive search it may find multiple equivalent-length matches for a given target sub-string. This provides an obvious opportunity for selecting a source address that can be represented in fewer bytes than others. This could save some more bytes at the cost of encoding time.

Update: Boy, didn't even let the paint dry on this one. Tried the last idea.

                       dez-1.2     dez-?.?     addr opt    gzip -4
  GPL2 to GPL3         13 591      12 053      11 965
  jjmpeg.dll           10 809       8 770       8 725
  bible (compress)  1 731 250   1 539 501   1 507 072   1 550 998 

Well it's more than zero, so I guess that's a yes?

No comments: