Run length encoding (RLE) for lossless compression
Introduction
In 'lossless compression', the codecs keep all of the information about a file. The compressed file, once decompressed, can be reconstructed so it is exactly like the file before it was compressed, with no loss of any information at all.
RLE works by looking through the data in a file and identifying repeating strings of characters, called a 'run'. The run is then encoded into a small number of bytes, usually two. The first byte, called the 'run count', holds the number of characters in the run. The second character, called the 'run value' is the actual character in the run.
For example, the following data would take 20 bytes to store (because there are twenty characters): GGGGGGGGGGGGGGGGGGGG but the same data could be encoded to 20G using RLE and you would only need two bytes to do it. 20G is called a 'run packet'. The first byte of this run packet, the run count, holds 20 and the second byte, the run value, holds G.
In RLE, you create a new run packet each time the run character changes or each time the run length is exceeded. If you are using one byte to hold the run length, then the maximum length that can be held is 255.
Here is another string:
BBBBBaaa55555555X000000000
Using RLE, this would be encoded to:
5B3a851X90
Before encoding, you would need 25 bytes to store the data. After encoding, you would only need 10 bytes to store the data. Note that the X was stored as a run packet, even though it is only one character long. A run packet for X (two bytes) is actually twice the size of the single byte needed to store X. As we have already said, in RLE, you need to create a new run packet each time the run character changes. As it turned out, the single X didn't really change the overall compression we achieved very much. Suppose we had this data, though:
BONGOthedog
Using RLE coding, we actually double the storage size needed, from 11 bytes to 22:
1B1O1N1G1O1t1h1e1d1o1g
RLE algorithms are fast and simple. How well they compress data depends upon what is being encoded. Suppose you are encoding a picture of a page in a book that you've just scanned. If the page is mostly white then RLE will compress the file very well because there will be lots of runs of the same ASCII code for white. If the page is mostly a busy colour photo, then there will be far less runs of the same ASCII colour code.