
0.10.4 Adaptive Huffman Compression

If there are x levels in a Huffman tree, read the node values on the bottom level, l(x), from left to right. When you have exhausted level l(x), move up to level l(x-1) and repeat the process. Continue until you reach the root node (and, thus, have read the value of every node in the tree).

You should notice an interesting pattern: the values never decrease. This is called the sibling property of Huffman trees. It tells us that, given a node n, its sibling s(n) is the node on the same level immediately to its right. If n is the rightmost node on its level, s(n) is the leftmost node on the next level up. If v(n) is the value of node n, then v(s(n)), the value of its sibling, is always greater than or equal to v(n). If you understand this property, understanding the rest of this algorithm is a breeze.
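To make the property concrete, here is a minimal sketch in C. It assumes (my assumption, not something the text mandates) that the node values have been laid out in an array in the bottom-up, left-to-right order described above, so that checking the sibling property reduces to checking that the values never decrease:

/* Sketch: check the sibling property.  Assumes value[] holds the
   node values in bottom-up, left-to-right order, so value[i+1]
   belongs to the sibling s() of node i.  (Hypothetical layout.) */
int sibling_property_holds(const int value[], int num_nodes)
{
  int i;
  for (i = 0; i + 1 < num_nodes; i++)
  {
    if (value[i] > value[i + 1])
    {
      return 0;   /* a value decreased: property violated */
    }
  }
  return 1;
}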

One of the drawbacks of static Huffman compression algorithms is the need to transmit the character frequency table along with the compressed text. While an intelligently encoded table adds only about 250 bytes (on average) to the compressed output, it would be nice to get rid of it altogether. This seems impossible: without the table's data the decompression routine would not know the structure of the Huffman tree used to encode the compressed data, and therefore could not ascertain the proper codeword-to-character mappings.

However, adaptive Huffman compression algorithms overcome the need to store the character counts by beginning with a mostly empty tree. They then build up and fill in the tree as they go. The Huffman trees generated by adaptive algorithms are dynamic, meaning they change in structure as the statistical tendencies of the text change. The compression function adds each new character it encounters to the tree. The algorithm does not, however, compress a character the first time it is seen. Instead the character is passed on to the compressed text as-is (in fact, a special flag symbol is prepended to it, as we will discuss below). When the compression system sees a character already present in the tree, it increments that character's count by one, adjusts the tree accordingly, and emits that character's Huffman code to the output stream.
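In outline, the compression loop might look like the sketch below. The huff_tree type and the helper functions (tree_contains, tree_add, tree_increment, emit_code) are hypothetical placeholders for whatever your implementation provides, and ESCAPE is the flag symbol discussed below:

#include <stdio.h>

/* Sketch of the adaptive compression loop.  huff_tree and all of
   the helper functions named here are hypothetical placeholders. */
void compress(FILE *in, FILE *out, huff_tree *tree)
{
  int c;
  while ((c = fgetc(in)) != EOF)
  {
    if (tree_contains(tree, c))
    {
      emit_code(out, tree, c);       /* character already in the tree */
      tree_increment(tree, c);       /* bump its count, fix the tree  */
    }
    else
    {
      emit_code(out, tree, ESCAPE);  /* flag: a literal byte follows  */
      fputc(c, out);                 /* pass the character through    */
      tree_add(tree, c);             /* enter it with a count of one  */
    }
  }
}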

The complementary decompression routine operates in much the same manner: when a new character is encountered it is added to the dynamic Huffman tree but otherwise left alone. When a codeword is found it is decoded, and the weight of the corresponding character is increased. This increase may cause the tree to be reorganized.
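A matching sketch of the decompression loop, using the same hypothetical helper names (read_code here is assumed to return the next decoded symbol, or EOS when the input is exhausted):

/* Sketch of the matching decompression loop; helpers hypothetical. */
void decompress(FILE *in, FILE *out, huff_tree *tree)
{
  int sym;
  while ((sym = read_code(in, tree)) != EOS)
  {
    if (sym == ESCAPE)
    {
      int c = fgetc(in);        /* a literal byte follows the flag    */
      fputc(c, out);
      tree_add(tree, c);        /* same tree update as the compressor */
    }
    else
    {
      fputc(sym, out);
      tree_increment(tree, sym);
    }
  }
}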

Plain (non-compressed) characters and (compressed) codewords are distinguished in the compressed stream by the use of a flag, or escape, symbol. When encountered, this symbol signifies that the next byte is a literal character; all other data is assumed to be codewords. The escape symbol is one of the only items present in the initial Huffman tree. When the decompression subroutine runs across the code for the escape symbol it immediately reads a byte from the input, adds it to the tree, and sends it to the output. Note that I am not talking about the ASCII escape character (ASCII 27) here, but rather a made-up symbol that has a node in the Huffman tree. This symbol, of course, produces no output in the uncompressed stream; its sole function is to convey a message from the compression routine to the decompression routine about the character following it in the stream.
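As a sketch, the initial setup might be as simple as this (tree_create and tree_add are, again, hypothetical; some implementations also seed the tree with an end-of-stream symbol):

/* Sketch: the initial, mostly empty tree.  Only the escape symbol
   (and, in many implementations, an end-of-stream symbol) has a
   node before any real input is seen. */
huff_tree *tree = tree_create();
tree_add(tree, ESCAPE);   /* the escape flag always has a codeword */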

The complicated part of this algorithm, as you might expect, is the tree manipulation. It is unreasonable to reconstruct the entire Huffman tree every time a symbol is added to it. But recall from the opening paragraph of this section that all Huffman trees must obey the sibling property.

Imagine the steps involved in incrementing a Huffman node's weight. First, since all characters are stored at the leaves of the tree, we add one to the count of a leaf. We must then make sure the tree still obeys the sibling property. If our incremented leaf, i, now has a larger value than its sibling, the sibling property is broken.

When the sibling property has been broken, matters can sometimes be fixed by swapping the offending incremented node i with its sibling. This will not always work, though. Imagine a node n with value 5. Its immediate sibling is node o, also with value 5, and o's immediate sibling, p, has value 5 as well. (That is, there are three leaf nodes in a row, all with value 5.) Now we increment node n to 6. The sibling property is broken because n now has a larger value than its sibling o (which is still 5). Swapping the two would not help, because n is also larger than o's sibling, p.

The proper way to restore the sibling property after an increment has violated it is to loop over the siblings, starting with the immediate sibling of the incremented node. Continue while the value of the nodes encountered stays the same; break out of the loop when the value changes. Then swap the incremented node with the last node in the run of equal values:

...
/* Bump node i's weight, then restore the sibling property if the
   increment broke it.  num_nodes is the length of the node array,
   and swap() is assumed to exchange the two nodes it points at. */
node[i].value++;
if (node[i].value > node[i+1].value)
{
  /* Scan right past the run of nodes that share the old value. */
  val = node[i+1].value;
  j = i + 1;
  while (j < num_nodes && node[j].value == val)
  {
    j++;
  }
  /* node[j-1] is the last node in the run; swap it with node i. */
  swap(&node[i], &node[j-1]);
}
...

The above code assumes, of course, that we can traverse siblings simply by moving to adjacent positions in a node array. Any implementation of an adaptive Huffman algorithm should make it cheap to find the sibling of a given node, since this lookup happens on every increment.
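One common arrangement (an assumption on my part; the text does not mandate a particular layout) is to keep every node, internal and leaf, in a single array ordered by weight. The sibling of node[i] is then simply node[i+1], and the code above works directly:

/* Sketch: a node layout that makes sibling lookup O(1).  Nodes live
   in one array ordered by weight, so the sibling of node[i] is
   node[i+1].  Field and constant names are illustrative only. */
struct huff_node
{
  int value;    /* weight (frequency count)                */
  int symbol;   /* the character, or -1 for internal nodes */
  int parent;   /* array index of the parent node          */
  int left;     /* array indices of the children, or -1    */
  int right;
};

struct huff_node node[MAX_NODES];   /* MAX_NODES: hypothetical bound */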



 
Scott Gasch
1999-07-09