CMU 15-112: Fundamentals of Programming and Computer Science
Class Notes: Data Compression with Huffman Encoding
- Data Compression: Overview
- Tree Data Structure: Overview
- Huffman Encoding: Overview
- Huffman Encoding: Implementation
- Huffman Encoding: Accessibility Application
- Discussion
-
Data Compression: Overview
- Root - The top node in a tree
- Parent - An internal node has one or more child nodes and is called the parent of its child nodes.
- Siblings - Nodes with the same parent
- Leaf - A node with no children
- Internal node - A node with at least one child
- Edge - Connection between one node to another
- Construct leaf Huffman trees for each character/frequency pair
- Repeatedly choose two minimum-frequency Huffman trees and join them together into a new Huffman tree whose frequency is the sum of their frequencies.
- When only one Huffman tree remains, it represents an optimal encoding.
Whenever we represent data in a computer, we need to choose some sort of encoding with which to represent it. When representing strings, for example, we have learned about ASCII codes to represent individual characters. Under the ASCII encoding, each character is represented using 8 bits, so a string of length n requires 8n bits of storage. So for example let's consider an encoding for the non-whitespace characters of the string "more free coffee". Ignoring spaces, this string can be represented in ASCII using 14 * 8 = 112 bits. The table below shows the relevant subset of the standard ASCII table:
Character | ASCII | bit pattern |
'm' | 109 | 01101101 |
'o' | 111 | 01101111 |
'r' | 114 | 01110010 |
'e' | 101 | 01100101 |
'f' | 102 | 01100110 |
'c' | 99 | 01100011 |
So the string "more free coffee" would be encoded in ASCII as:
109 | 111 | 114 | 101 | 102 | 114 | 101 | 101 | 99 | 111 | 102 | 102 | 101 | 101 |
or as the following stream of bits:
01101101 | 01101111 | 01110010 | 01100101 | 01100110 | 01110010 | 01100101 | 01100101 | 01100011 | 01101111 | 01100110 | 01100110 | 01100101 | 01100101 |
When using the ASCII encoding we confine ourselves to representing each character using the same number of bits ( fixed-length encoding ). What if we allowed ourselves to use a variable length encoding ? In that case we can take advatage of special properties of data, such as letter frequency, by assigning shorter codes to characters that occur more frequently. For example, consider using the following code:
Character | Code |
'e' | 0 |
'o' | 110 |
'm' | 1010 |
'c' | 1011 |
'r' | 100 |
'f' | 111 |
Notice that the above encoding is prefix-free : no code word is a prefix of any
other code word. For instance, m is encoded above as 1010, and for it's
three prefixes (1, 10, and 101), there are no characters with that encoding.
Can you think why this is an important feature?
According to the encoding we have specified above, the representation for the
non-whitespace characters in the string "more free coffee" would be:
1010 | 110 | 100 | 0 | 111 | 100 | 0 | 0 | 1011 | 110 | 111 | 111 | 0 | 0 |
which is encoded with only 34 bits. This is clearly much better than the ASCII encoding. In fact, it can be proven that this particular encoding is optimal for this string: no other encoding can represent the string using less than 34 bits. Furthermore, this encoding is optimal for any string that has the same distribution of characters. A method to define such an optimal coding scheme was developed by David Huffman and is discussed below.
Note: While we ignored whitespaces in the above examples, there really isn't anything special about spaces and we could easily determine a word code for space. We just chose not to as we can still communicate the basic principles without worrying about them. In real life, spaces matter and you do want to give them their own code word. Tree data structure: Overview
The Huffman coding scheme takes each symbol and its frequency of occurrence, and generates proper encodings for each symbol taking account of the frequency of each symbol, so that symbols with higher frequency have fewer bits in their encoding. Huffman encoding initially creates a tree of nodes and then utilizes this tree to read the codes for each of the specified characters. So let's first do a brief review of trees.
What is a tree?
A tree is a collection of nodes, where each node is a data structure consisting
of a value, together with a list of references to child nodes. Here, we'll use a
class to represent a tree node.
Huffman Encoding: Overview
The first step of Huffman encoding is building the Huffman tree. Given a set of characters and their associated frequencies, we can build an optimal Huffman tree as follows:
Then the code for each character can be obtained by following the path from the root of the tree to the leaf holding the given character, assigning and accumulating a '0' when following a left edge and a '1' when following a right edge. The accumulated zeros and ones at each leaf constitute a Huffman encoding for those symbols. The image below illustrates this process.
Huffman Encoding: Implementation
Note: We are using a module called heapq in the code below. This module allows us to quickly get the node with the lowest frequency as we build our huffman tree, instead of having to loop over the whole list of trees each time. You do not need to worry about exactly how it works, but if you are interested, read about the Heap data structure and see pq_heap.py
-
Accessibility Application
Obviously, Huffman encoding is widely used in compressing text and other files on your computer, but there are less obvious ways compression can be applied, like accessibility. Take a look at this typing interface and see how Huffman encoding can lead to a higher typing efficiency. Note that linear scanning as detailed on the demo is actually currently used as a typing method by many people with motor impairement who do not have fine motor control to use a keyboard or a touchscreen.
- If Huffman encoding is "better", ie makes shorter strings, why does ASCII exist? Why do computers by default use ASCII? The advantages of the ASCII encoding scheme is that boundaries between characters are easily determined. Furthermore, the pattern used for each character is fixed and universal. "Decoding" an ASCII represented string (i.e. translating the binary encoding back into the original text file) simply consists of breaking the encoding into 8-bit bytes, and converting each byte into the character is represents using the ASCII table. A file that is encoded in this scheme has the advantage of needing no additional information to be passed along with the encoding, since all files and computers have the same binary-to-character mapping. On the other hand, when using Huffman encoding, data with different character distribution will have different encoding and hence storing the Huffman tree that was used to encode the data is neccessary for decoding.
- When I zip my files, is it using Huffman encoding? Depends. A lot of the compression algorithms use a combination of different techniques. For example DEFLATE uses a combination of LZ77 algorithm and Huffman encoding. LZ77 is used for repetitive data, while Huffman encoding is used for non-repetitive data.
- What other kinds of data compressions are there? There are many other ways of compressing data. Here's a few links if you're curious to learn more:
- Arithmetic Encoding: An amazing accessibility application of Arithmetic encoding is Dasher. It's probably one of the most novel and creative ways to allow users with motor impairment to type. Here's a demo as well.
- LZ77 and LZ78
- Prediction by partial matching
- I think accessibility is super cool! Where can I learn more and how can I get more involved?
- Read more about Google Accessibility. Look through the features and products to get an idea of what is currently being offered to help people with certain disabilities have equal access to technology. Keep these teams in mind when you apply for internships!
- Get involved in research. Here's a few research groups at CMU that you might find interesting:
- Take some HCI-Accessibility classes. Here's one being offered in the spring.
- Read here about a novel idea to help blind people navigate.
- Email Rudina (rmorina). I would love to tell you more and help you get involved.
Discussion
All of the above algorithms belong to a data compression category known as lossless compression. You can also look into lossy data compression which is widely used in image and video compression.