Using a variation on RLE to achieve lossless compression for near-consecutive or grouped tabular data

November 18, 2022 | 03:00 PM

I’m working on a project called Track that, among other things, requires me to store data about lots of trains in a tabular format. In the UK, most trains (or rolling stock) are ordered by operating companies in batches, leading to groups of almost identical stock. Each unit is numbered in a near-consecutive way, and most of the time this is the only way to distinguish between stock from the same group.

Therefore, when it comes to storing this as a table there is quite a lot of unnecessary repetition, which can be efficiently removed using a variation on run-length-encoding.

Examples

It’s probably best if I try and explain with some examples. Let’s say we need to store the following data about 5 trains:

| Number | Operator | Carriages | Built |
|--------|----------|-----------|-------|
| 453101 | AWR      | 8         | 2010  |
| 453102 | AWR      | 8         | 2011  |
| 453103 | AWR      | 5         | 2011  |
| 453104 | LWR      | 5         | 2011  |
| 453105 | LWR      | 5         | 2011  |

The first thing you might notice is that a lot of the data is grouped, and the only thing that changes every row is the number. This makes it a perfect candidate for a variable-length run-length-encoding implementation.

| Number | Operator | Carriages | Built |
|--------|----------|-----------|-------|
| 453101 | AWR      | 8         | 2010  |
| +      | /        | /         | +     |
| +      | /        | 5         | /     |
| +      | LWR      | /         | /     |
| +      | /        | /         | /     |

As you can see, the first record is stored in full; this sets the defaults for each of the following cells. Any occurrence where data is not filled (represented with a slash /) is replaced with the previous value in that column, which at the start is taken from the first record.

Occurrences where a value is incremented by one from the value above are represented with a plus +; this can be seen in the Number column.
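A minimal Python sketch of how such an encoder and decoder might look (the function names, and the choice to treat only an exact increment of one as +, are my assumptions rather than details from the post):

```python
def encode(rows):
    """Column-wise scheme: '/' repeats the value above, '+' means
    'one more than the value above'; anything else is stored literally.
    The first row is always kept in full."""
    out = [list(rows[0])]
    for prev, row in zip(rows, rows[1:]):
        enc = []
        for p, v in zip(prev, row):
            if v == p:
                enc.append("/")
            elif isinstance(v, int) and isinstance(p, int) and v == p + 1:
                enc.append("+")
            else:
                enc.append(v)
        out.append(enc)
    return out

def decode(rows):
    """Inverse of encode: rebuild each row from the decoded row above."""
    out = [list(rows[0])]
    for row in rows[1:]:
        prev = out[-1]
        out.append([p if v == "/" else p + 1 if v == "+" else v
                    for p, v in zip(prev, row)])
    return out

trains = [
    [453101, "AWR", 8, 2010],
    [453102, "AWR", 8, 2011],
    [453103, "AWR", 5, 2011],
    [453104, "LWR", 5, 2011],
    [453105, "LWR", 5, 2011],
]
encoded = encode(trains)
# encoded == [[453101, "AWR", 8, 2010],
#             ["+", "/", "/", "+"],
#             ["+", "/", 5, "/"],
#             ["+", "LWR", "/", "/"],
#             ["+", "/", "/", "/"]]
assert decode(encoded) == trains
```

Note that cells whose literal value is "/" or "+" would collide with the markers in this sketch; a real implementation would need an escaping rule.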

How does this differ from RLE?

RLE is a very simple algorithm that compresses data by storing the number of times a value is repeated in a run. For example, the following data:

Apple Apple Apple Apple Banana Banana Apple Apple Apple

can be compressed to:

4 Apple 2 Banana 3 Apple
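Classic RLE can be sketched in a few lines of Python (rle_encode is an illustrative helper of my own, not code from the post):

```python
def rle_encode(values):
    """Classic run-length encoding: collapse runs into (count, value) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][1] == v:
            runs[-1][0] += 1  # extend the current run
        else:
            runs.append([1, v])  # start a new run
    return [(count, value) for count, value in runs]

data = ["Apple"] * 4 + ["Banana"] * 2 + ["Apple"] * 3
print(rle_encode(data))
# → [(4, 'Apple'), (2, 'Banana'), (3, 'Apple')]
```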

The difference between this and my implementation is that mine only stores a value once until it is overridden, and thus only works with data represented vertically. ‘Rotating’ the data to the vertical and compressing it with my implementation would result in:

Apple
/
/
/
Banana
/
Apple
/
/

Which, as you can see, is more complex than the original data! It’s only when data is stored in a tabular format with multiple columns that my algorithm yields a substantial saving.
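To make the comparison concrete, here is a single-column version of the scheme applied to the fruit list (encode_column is my own illustrative helper):

```python
def encode_column(values):
    """Single-column version of the scheme: '/' means 'same as the value above'."""
    out = [values[0]]
    for prev, v in zip(values, values[1:]):
        out.append("/" if v == prev else v)
    return out

data = ["Apple"] * 4 + ["Banana"] * 2 + ["Apple"] * 3
print(encode_column(data))
# → ['Apple', '/', '/', '/', 'Banana', '/', 'Apple', '/', '/']
```

Nine tokens out for nine tokens in, versus RLE’s three (count, value) pairs: on a single column of runs, plain RLE wins.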

Performance

I’ve implemented this technique on a larger dataset of 87 trains with a size of 10,430 bytes; the compressed result is only 3,123 bytes, a reduction of 70.1%.

Caveats

This technique is not suitable for all data, and there are a few caveats to be aware of:

- It relies on rows being grouped or near-consecutive; on unsorted or highly varied data it can make things larger, as the fruit example above shows.
- Decoding a given row requires walking down from the first full record, so individual records can’t be read in isolation.
- Cell values that are literally / or + would need to be escaped to avoid being misread as markers.

Conclusion

This is a very simple technique that can be used to achieve a form of lossless compression for data that is stored in a tabular format. It’s not suitable for all data, but it can be very useful in certain scenarios as shown in the examples.


Update (early 2023): I have written a new post addressing some issues in this post.