Never waste your memory! (Huffman Encoding)

In Morse code, a simple “dit” (.) represents the letter e, while the more complex sequence “dah-dit-dah-dah” (-.--) represents the letter y. As you probably know, e occurs far more frequently than y in most languages. So Samuel F. B. Morse evidently tried to make his code as efficient as possible for transmitting large amounts of text.

Morse followed a principle that is fundamental to data compression. In an uncompressed plain text file, every character may be represented by its ASCII code. Each character then occupies exactly 8 bits, regardless of which character it is and how often it occurs in the text.

David A. Huffman (1925-1999) developed a clever method that allows us to encode symbols, e.g. characters, in such a way that frequently occurring symbols are represented by shorter bit patterns than rarely occurring ones. In fact, Huffman encoding is one of the most efficient ways to reduce redundancy (unnecessary data) while preserving 100% of the information. In computer science this is therefore called lossless data compression.

At this point I’d like to skip the mathematical background, which is not as hard to understand as you might think. Instead, I want to show you a concrete example. The German word “ERDBEEREIS” (strawberry ice cream in English) takes 10 × 8 bits = 80 bits when stored in 8-bit ASCII code.

Let’s now try to reduce its memory consumption by applying the Huffman algorithm. What we are going to do is to build a binary tree that helps us to find out the bit pattern for each letter.

Step 1 – For each distinct letter in the word, create a node and write down how often that letter occurs in the word.
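
If you want to follow along in code: counting the occurrences is a one-liner in Python, for example with collections.Counter (a minimal sketch, the variable names are my own):

```python
from collections import Counter

# Step 1: count how often each letter occurs in "ERDBEEREIS".
occurrences = Counter("ERDBEEREIS")
print(occurrences)  # Counter({'E': 4, 'R': 2, 'D': 1, 'B': 1, 'I': 1, 'S': 1})
```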

Step 2 – Take the two letters that have the lowest occurrence and combine them into a new node. Add up the occurrences of both letters. (In our example we could just as well have chosen ‘D’ and ‘B’ instead of ‘I’ and ‘S’.)
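
A priority queue (min-heap) is a natural fit for this step, because it always hands us the nodes with the lowest occurrence first. Here is a rough sketch in Python; note that ties between the four letters of occurrence 1 may be broken in any order, which is exactly why ‘D’ and ‘B’ would have worked just as well:

```python
import heapq

# Step 2: a min-heap always yields the nodes with the lowest occurrence first.
nodes = [(4, 'E'), (2, 'R'), (1, 'D'), (1, 'B'), (1, 'I'), (1, 'S')]
heapq.heapify(nodes)

a = heapq.heappop(nodes)  # one of the nodes with occurrence 1
b = heapq.heappop(nodes)  # another node with occurrence 1
heapq.heappush(nodes, (a[0] + b[0], a[1] + b[1]))  # combined node, occurrence 2
```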

Step 3 – Now again choose the two nodes with the lowest occurrence, combine them into a new node, and so on, until only a single root node remains.
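
Putting the three steps together, the entire tree construction fits in a few lines. The following is a sketch of the idea, not a reference implementation; depending on how ties are broken you may end up with a different, but equally optimal, tree:

```python
import heapq
from collections import Counter

class Node:
    def __init__(self, occ, symbol=None, left=None, right=None):
        self.occ, self.symbol, self.left, self.right = occ, symbol, left, right

    def __lt__(self, other):  # lets heapq order nodes by occurrence
        return self.occ < other.occ

def build_tree(word):
    # Step 1: one leaf node per distinct letter.
    heap = [Node(occ, sym) for sym, occ in Counter(word).items()]
    heapq.heapify(heap)
    # Steps 2 and 3: keep combining the two least frequent nodes.
    while len(heap) > 1:
        a, b = heapq.heappop(heap), heapq.heappop(heap)
        heapq.heappush(heap, Node(a.occ + b.occ, left=a, right=b))
    return heap[0]  # the root of the Huffman tree

root = build_tree("ERDBEEREIS")
```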

The Huffman tree is now complete. We can easily find the code for a letter by starting at the root and moving down to that letter’s leaf. Each time we follow the left branch, we append a 0 to the code. As you can guess, each time we follow the right branch, we append a 1.
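
Reading the codes off the tree is a simple recursive walk. The sketch below reuses Node and build_tree from the previous snippet; each left turn appends a 0, each right turn a 1:

```python
def assign_codes(node, prefix="", table=None):
    if table is None:
        table = {}
    if node.symbol is not None:  # leaf: the path taken so far is the code
        table[node.symbol] = prefix
    else:
        assign_codes(node.left, prefix + "0", table)   # left branch  -> 0
        assign_codes(node.right, prefix + "1", table)  # right branch -> 1
    return table

codes = assign_codes(build_tree("ERDBEEREIS"))
```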

So the code for letter ‘B’ is 1011 while ‘E’ is just a single 0.

With this tree, the word “ERDBEEREIS” is coded as
0 100 1010 1011 0 0 100 0 110 111
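
Encoding is then just a table lookup per letter. Using the code table from the tree above (written out by hand here), we can verify the 24-bit result:

```python
codes = {'E': '0', 'R': '100', 'D': '1010', 'B': '1011', 'I': '110', 'S': '111'}
encoded = "".join(codes[ch] for ch in "ERDBEEREIS")
print(encoded)       # 010010101011001000110111
print(len(encoded))  # 24 bits instead of 80
```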

The compressed word only takes 24 bits of memory instead of 80 bits. Pretty cool, eh? (For those of you who feel comfortable with math: the entropy of “ERDBEEREIS” is about 23.2 bits, which means less than 1 bit of redundancy remains.)
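
For the curious, that entropy figure comes from Shannon’s formula H = -Σ p·log2(p) per symbol, multiplied by the 10 letters of the word. A quick sketch:

```python
from collections import Counter
from math import log2

word = "ERDBEEREIS"
n = len(word)
# Entropy per symbol, times the number of symbols in the word.
entropy = -sum(occ / n * log2(occ / n) for occ in Counter(word).values()) * n
print(f"{entropy:.1f} bits")  # 23.2 bits
```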

You think we do not need to bother about “wasting” bits in an era where we can get terabytes of storage for almost no money and where we have Gigabit Ethernet? Well, you might be wrong, because the amount of data we need to store and to transmit has been increasing steadily (in fact, it has exploded over the last decade). Just think of television on demand, high-definition movies, and so on. Data compression is, and will remain, an issue of growing importance.

Well, now relax and get yourself some ERDBEEREIS. 🙂

Cheers,

— Andre M. Maier
