CMU 15-112: Fundamentals of Programming and Computer Science
Class Notes: Data Compression with Huffman Encoding

Data Compression: Overview
Tree Data Structure: Overview
Huffman Encoding: Overview
Huffman Encoding: Implementation
Huffman Encoding: Accessibility Application
Discussion

Data Compression: Overview

Whenever we represent data in a computer, we need to choose an encoding with which to represent it. When representing strings, for example, we have learned about ASCII codes for individual characters. ASCII codes are conventionally stored in one byte each, so every character takes 8 bits and a string of length n requires 8n bits of storage. For example, consider an encoding for the non-whitespace characters of the string "more free coffee". Ignoring spaces, this string has 14 characters, so it can be represented in ASCII using 14 * 8 = 112 bits. The table below shows the relevant subset of the standard ASCII table:

Character	ASCII	bit pattern
'm'	109	01101101
'o'	111	01101111
'r'	114	01110010
'e'	101	01100101
'f'	102	01100110
'c'	99	01100011

Character	m	o	r	e	f	r	e	e	c	o	f	f	e	e
ASCII	109	111	114	101	102	114	101	101	99	111	102	102	101	101
bit pattern	01101101	01101111	01110010	01100101	01100110	01110010	01100101	01100101	01100011	01101111	01100110	01100110	01100101	01100101
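The fixed-length encoding can be reproduced with a few lines of Python. This is a minimal sketch; the helper name toAsciiBits is an assumption made for illustration and is not part of the course code.

```python
def toAsciiBits(s):
    # one 8-bit binary string per character, built from its ASCII code
    return [format(ord(ch), "08b") for ch in s]

# drop the spaces, as in the example above
text = "more free coffee".replace(" ", "")
bits = toAsciiBits(text)
for ch, pattern in zip(text, bits):
    print(ch, ord(ch), pattern)

totalBits = sum(len(pattern) for pattern in bits)
print("total bits:", totalBits)
```

Every character contributes exactly 8 bits regardless of how often it occurs, which is what the variable-length code in the next part improves on.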

When using the ASCII encoding, we confine ourselves to representing each character using the same number of bits (fixed-length encoding). What if we allowed ourselves to use a variable-length encoding? In that case we can take advantage of special properties of the data, such as letter frequency, by assigning shorter codes to characters that occur more frequently. For example, consider using the following code:

Character	Code
'e'	0
'o'	110
'm'	1010
'c'	1011
'r'	100
'f'	111

Notice that the above encoding is prefix-free: no code word is a prefix of any other code word. For instance, 'm' is encoded above as 1010, and none of its three proper prefixes (1, 10, and 101) is the code word of any character. Can you think why this is an important feature? According to the encoding we have specified above, the representation for the non-whitespace characters in the string "more free coffee" would be:

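Character	m	o	r	e	f	r	e	e	c	o	f	f	e	e
Code	1010	110	100	0	111	100	0	0	1011	110	111	111	0	0

Note that this representation takes only 34 bits, compared with 112 bits under ASCII.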
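One way to see why the prefix-free property matters is to decode a bit string with no separators between code words. The sketch below uses the variable-length code table from this section; the names CODE, isPrefixFree, encodeWithTable, and decodeWithTable are assumptions chosen for illustration.

```python
from itertools import permutations

# the variable-length code table from this section
CODE = {"e": "0", "o": "110", "m": "1010", "c": "1011", "r": "100", "f": "111"}

def isPrefixFree(code):
    # no code word may be a prefix of a different entry's code word
    words = list(code.values())
    return not any(b.startswith(a) for a, b in permutations(words, 2))

def encodeWithTable(s, code):
    return "".join(code[ch] for ch in s)

def decodeWithTable(bits, code):
    # emit a character as soon as the bits read so far form a code word;
    # this is unambiguous only because the code is prefix-free
    inverse = {word: ch for ch, word in code.items()}
    result = ""
    current = ""
    for bit in bits:
        current += bit
        if current in inverse:
            result += inverse[current]
            current = ""
    return result

text = "morefreecoffee"
encoded = encodeWithTable(text, CODE)
print(isPrefixFree(CODE))
print(encoded, len(encoded))
print(decodeWithTable(encoded, CODE))
```

With a table that is not prefix-free, such as {"a": "0", "b": "01"}, the decoder would emit 'a' as soon as it saw a 0 and could never produce 'b'.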

Tree Data Structure: Overview

The Huffman coding scheme takes each symbol and its frequency of occurrence and generates an encoding for each symbol, so that symbols with higher frequency get fewer bits in their encoding. Huffman encoding first builds a tree of nodes and then reads the code for each character off that tree. So let's first do a brief review of trees.

What is a tree?
A tree is a collection of nodes, where each node is a data structure consisting of a value, together with a list of references to child nodes. Here, we'll use a class to represent a tree node.

Vocabulary

Root - The top node in a tree
Parent - An internal node has one or more child nodes and is called the parent of its child nodes.
Siblings - Nodes with the same parent
Leaf - A node with no children
Internal node - A node with at least one child
Edge - Connection between one node and another
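A minimal sketch of a tree node class, with some of the vocabulary above expressed as code. The class name TreeNode and the helpers countLeaves and countEdges are assumptions made for this illustration; the Huffman implementation later in these notes uses its own node class with left and right children.

```python
class TreeNode(object):
    def __init__(self, value, children=None):
        self.value = value
        # list of references to child nodes (empty for a leaf)
        self.children = [] if children is None else children

    def isLeaf(self):
        return len(self.children) == 0

def countLeaves(node):
    if node.isLeaf():
        return 1
    return sum(countLeaves(child) for child in node.children)

def countEdges(node):
    # every child contributes exactly one edge to its parent
    return sum(1 + countEdges(child) for child in node.children)

# "A" is the root; "B" and "C" are siblings; "D" and "C" are leaves
root = TreeNode("A", [TreeNode("B", [TreeNode("D")]), TreeNode("C")])
print(countLeaves(root))  # 2
print(countEdges(root))   # 3
```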

Huffman Encoding: Overview

The first step of Huffman encoding is building the Huffman tree. Given a set of characters and their associated frequencies, we can build an optimal Huffman tree as follows:

Construct leaf Huffman trees for each character/frequency pair
Repeatedly choose two minimum-frequency Huffman trees and join them together into a new Huffman tree whose frequency is the sum of their frequencies.
When only one Huffman tree remains, it represents an optimal encoding.
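The three steps above can be traced on the "more free coffee" frequencies. This is only a sketch: it re-sorts a plain list at every step instead of using a priority queue, it tracks labels rather than real tree nodes, and the function name huffmanMergeTrace is an assumption made for illustration.

```python
def huffmanMergeTrace(freqData):
    # step 1: one (frequency, label) pair per character, smallest first
    trees = sorted((freq, char) for char, freq in freqData.items())
    merges = []
    # step 2: repeatedly join the two minimum-frequency trees
    while len(trees) > 1:
        (freq1, label1), (freq2, label2) = trees[0], trees[1]
        merged = (freq1 + freq2, "(" + label1 + label2 + ")")
        merges.append(merged)
        trees = sorted(trees[2:] + [merged])
    # step 3: one tree remains; return the merges that produced it
    return merges

freqData = {"e": 5, "o": 2, "m": 1, "c": 1, "r": 2, "f": 3}
merges = huffmanMergeTrace(freqData)
for freq, label in merges:
    print(freq, label)
```

For these frequencies the merged trees have frequencies 2, 4, 5, 9, and 14. Their sum, 34, is the total number of bits needed to encode the 14 characters. Ties between equal frequencies can be broken differently, which changes the shape of the tree but not that total.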

Huffman Encoding: Implementation

Note

Heap data structure

pq_heap.py
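The implementation below keeps its nodes in a heap through Python's heapq module. As a small sketch of the three calls it relies on, here they are applied to plain integers rather than Huffman nodes (the list of frequencies is just the example data from these notes):

```python
import heapq

freqs = [5, 2, 1, 1, 2, 3]
heapq.heapify(freqs)               # rearrange the list in place into a min-heap
smallest = heapq.heappop(freqs)    # removes and returns the minimum (1)
second = heapq.heappop(freqs)      # the next minimum (also 1)
heapq.heappush(freqs, smallest + second)
print(freqs[0])                    # the current minimum always sits at index 0
```

heapq compares elements with <, which is why the node class in the implementation defines __lt__.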

import heapq

class HuffmanNode(object):
    def __init__(self, freq, char=None, left=None, right=None):
        self.char = char
        self.freq = freq
        self.left = left
        self.right = right

    # used mainly for debugging purposes
    def __repr__(self):
        return "HuffmanNode(char=%s, freq=%s)" % (self.char, self.freq)

    # needed for node comparison. Utilized to order the nodes appropriately
    # in the priority queue
    def __lt__(self, other):
        return self.freq < other.freq

    def isLeaf(self):
        return (self.left is None and self.right is None)

def buildHTree(freqData):
    huffmanNodes = []
    for char in freqData:
        huffmanNodes.append(HuffmanNode(freqData[char], char))
    # the list of huffmanNodes is transformed into a priority queue to keep
    # track of the minimum-frequency Huffman Nodes
    heapq.heapify(huffmanNodes)
    while (len(huffmanNodes) > 1):
        # obtain the two minimum-frequency Huffman nodes
        child1 = heapq.heappop(huffmanNodes)
        child2 = heapq.heappop(huffmanNodes)
        parent = HuffmanNode(child1.freq + child2.freq, left=child1, right=child2)
        heapq.heappush(huffmanNodes, parent)
    return None if huffmanNodes == [] else heapq.heappop(huffmanNodes)

def hTreeToHCode(hTree):
    code = dict()
    # a left edge represents a 0 bit, a right edge represents a 1 bit, and
    # the path from the root to a leaf gives the code word for the character
    # stored at that leaf.
    def getCode(hNode, curCode=""):
        if (hNode is None): return
        if (hNode.isLeaf()):
            code[hNode.char] = curCode
            return
        getCode(hNode.left, curCode + "0")
        getCode(hNode.right, curCode + "1")
    getCode(hTree)
    return code

def encode(s, freqData):
    hTree = buildHTree(freqData)
    hCode = hTreeToHCode(hTree)
    hEncoded = ""
    for char in s:
        hEncoded += hCode[char]
    return hEncoded

def decode(s, freqData):
    hTree = buildHTree(freqData)
    decodedStr = ""
    curTreeNode = hTree
    for charCode in s:
        # follow the left edge on a 0 bit and the right edge on a 1 bit
        if (charCode == "0"):
            curTreeNode = curTreeNode.left
        else:
            curTreeNode = curTreeNode.right
        # reaching a leaf completes one character; restart from the root
        if (curTreeNode.isLeaf()):
            decodedStr += curTreeNode.char
            curTreeNode = hTree
    return decodedStr

freqData = {"e":5, "o":2, "m":1, "c":1, "r":2, "f":3}
encodedStr = encode("morefreecoffee", freqData)
print("encodedStr", encodedStr)
decodedStr = decode(encodedStr, freqData)
print("decodedStr", decodedStr)
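The freqData dictionary in the code above is written out by hand. One possible way to derive it from the text itself is collections.Counter; the helper name buildFreqData is an assumption for this sketch and is not part of the course code.

```python
from collections import Counter

def buildFreqData(s):
    # count how often each non-whitespace character occurs
    return dict(Counter(ch for ch in s if not ch.isspace()))

freqData = buildFreqData("more free coffee")
print(freqData)
```

The resulting dictionary has the same counts as the hand-written one (e: 5, f: 3, o: 2, r: 2, m: 1, c: 1). Note that the decoder needs the same frequency data as the encoder in order to rebuild the tree.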

Accessibility Application

typing interface

Discussion

If Huffman encoding is "better", i.e., makes shorter strings, why does ASCII exist? Why do computers use ASCII by default?

When I zip my files, is it using Huffman encoding?

DEFLATE
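DEFLATE, the method most commonly used inside ZIP files, combines LZ77-style matching of repeated substrings with Huffman coding. Python's standard zlib module implements DEFLATE, so a quick experiment is possible without writing any compression code. This is a sketch with made-up sample data, not a description of what any particular zip tool does internally.

```python
import zlib

# highly repetitive sample data compresses well
data = b"more free coffee " * 100
compressed = zlib.compress(data)
restored = zlib.decompress(compressed)

print("original bytes:  ", len(data))
print("compressed bytes:", len(compressed))
print("round trip ok:   ", restored == data)
```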

What other kinds of data compression are there?

Arithmetic Encoding: An amazing accessibility application of Arithmetic encoding is Dasher. It's probably one of the most novel and creative ways to allow users with motor impairment to type. Here's a demo as well.
LZ77 and LZ78
Prediction by partial matching

lossy data compression

I think accessibility is super cool! Where can I learn more and how can I get more involved?

Read more about Google Accessibility. Look through the features and products to get an idea of what is currently being offered to help people with certain disabilities have equal access to technology. Keep these teams in mind when you apply for internships!
Get involved in research. Here are a few research groups at CMU that you might find interesting:
- Neural Devices Engineering Lab
- Quality of life technologies
Take some HCI-Accessibility classes. Here's one being offered in the spring.
Read here about a novel idea to help blind people navigate.
Email Rudina (rmorina). I would love to tell you more and help you get involved.
