Computer Science 15-111 (Sections A & B), Spring 2007
Class Notes: 13-Apr-2007
Logistics:
1. Test 2 is Wednesday April 18th (5 days from today). It is comprehensive.
2. Reading:
   a. Wikipedia page on Heaps
      Note: you are responsible for insert and remove operations, not for
      increaseKey or merge, nor for heap variants.
   b. Wikipedia page on Huffman Coding
      Note: you are not responsible for the Variations.
   c. Wikipedia page on Hash Functions
      Note: read this, but you do not need an especially deep understanding
      of the material on this page (but simple questions regarding the
      material may appear on subsequent quizzes and tests).
Topics: Applications
1. Best-First Search (continued) – Heaps (PQ's), HashSets!
   BestFirstSearch.java
2. Huffman Encoding – Heaps (PQ's), Binary (Prefix) Trees, and HashMaps!
   a. Method for compressing data
   b. Use a variable-length code (different bit lengths for each character)
      i. Key idea: More-common characters get shorter codes
      ii. Simple example:
         1. Code: x == 0, y == 10, z == 11
         2. Create HashMap to Encode (compress):
            put('x', 0)
            put('y', 10)
            put('z', 11)
         3. Create Binary (Prefix) Tree to Decode (decompress):
                  *
                /0 \1
               x    *
                  /0 \1
                 y     z
         4. Encode(xyz) = 01011
         5. Decode(10011) = yxz
         6. Results:
            a. fixed-length encoding of 3 characters requires 2 bits/character
            b. Here, we have 5/3 = 1.67 bits/character
            c. Indeed, we have lossless compression! Well, assuming a uniform
               distribution of the symbols x, y, z!
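The x/y/z example above can be sketched directly in Java. This is a minimal illustration, not course code; the class name and the hard-wired decode branches are assumptions:

```java
import java.util.HashMap;
import java.util.Map;

public class TinyHuffman {
    // Encoding table from the example: x -> 0, y -> 10, z -> 11.
    // Codes are kept as Strings so leading zero bits are preserved.
    static final Map<Character, String> CODES = new HashMap<>();
    static {
        CODES.put('x', "0");
        CODES.put('y', "10");
        CODES.put('z', "11");
    }

    // Encode: look up each character's code and concatenate.
    static String encode(String text) {
        StringBuilder bits = new StringBuilder();
        for (char c : text.toCharArray())
            bits.append(CODES.get(c));
        return bits.toString();
    }

    // Decode: because no code is a prefix of another (prefix property),
    // we can match greedily from the left -- this walk mirrors the tree.
    static String decode(String bits) {
        StringBuilder text = new StringBuilder();
        int i = 0;
        while (i < bits.length()) {
            if (bits.charAt(i) == '0') {            // "0"  -> x
                text.append('x');
                i += 1;
            } else if (bits.charAt(i + 1) == '0') { // "10" -> y
                text.append('y');
                i += 2;
            } else {                                // "11" -> z
                text.append('z');
                i += 2;
            }
        }
        return text.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode("xyz"));   // prints: 01011
        System.out.println(decode("10011")); // prints: yxz
    }
}
```

Storing the codes as Strings (rather than ints, as in the put calls above) matters once codes like 010 appear: an int would silently drop the leading zero.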
   c. Compresses near Shannon's ideal limit of "codeword length"
      i. Measure of the minimum number of bits/symbol required to encode a
         message containing each symbol in a given alphabet
         A = { s1, s2, ..., sn }
      ii. Entropy of each symbol si with weight wi:
          H(si) = -wi(log2(wi))
      iii. Entropy of the entire alphabet A with symbols si:
          H(A) = Σ H(si)
      iv. Example: Say messages contain only symbols in
          A = { 'a', 'b', 'c', 'd', 'e' }
          Frequencies:
             a: 10%   b: 15%   c: 30%   d: 16%   e: 29%   (total = 100%)
          entropy(a) = H(a) = -0.10(log2(0.10)) = 0.332
          entropy(b) = H(b) = -0.15(log2(0.15)) = 0.411
          entropy(c) = H(c) = -0.30(log2(0.30)) = 0.521
          entropy(d) = H(d) = -0.16(log2(0.16)) = 0.423
          entropy(e) = H(e) = -0.29(log2(0.29)) = 0.518
          entropy(A) = H(A) = H(a) + ... + H(e) = 2.205
      v. So a minimum encoding of messages over "abcde" requires at least
         2.205 bits/symbol
      vi. Note: for 5 symbols, fixed-length encoding requires 3 bits/symbol
          (right?)
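The entropy arithmetic in the example can be checked with a few lines of Java (class and method names here are illustrative). Note that Java's Math.log is the natural log, so we divide by log(2) to get log base 2:

```java
public class Entropy {
    // H(si) = -wi * log2(wi)
    static double h(double w) {
        return -w * (Math.log(w) / Math.log(2));
    }

    public static void main(String[] args) {
        double[] freqs = {0.10, 0.15, 0.30, 0.16, 0.29}; // a, b, c, d, e
        double total = 0;
        for (double w : freqs)
            total += h(w);
        // H(A) = sum of the per-symbol entropies, about 2.205 bits/symbol
        System.out.printf("H(A) = %.3f bits/symbol%n", total);
    }
}
```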
   d. Algorithm:
      i. For each symbol, create a tree with just that symbol (a leaf), and
         add it to a PQ
      ii. While the PQ has at least two trees:
         1. Remove the two trees of smallest weights
         2. Create a new tree with these trees as children
         3. The weight of the new tree is the sum of the weights of its
            children
         4. Add the new tree to the PQ
      iii. When the PQ has only one tree, return that tree as our decoding
           tree.
   e. Example:
      i. PQ = (Ta, 0.10), (Tb, 0.15), (Td, 0.16), (Te, 0.29), (Tc, 0.30)
      ii. Remove first two trees, combine into new tree with weight 0.25,
          add to PQ:
          Tree Tab (weight 0.25):
                  *
                /0 \1
               a    b
      iii. PQ = (Td, 0.16), (Tab, 0.25), (Te, 0.29), (Tc, 0.30)
      iv. Remove first two trees, combine into new tree with weight 0.41,
          add to PQ:
          Tree Tabd (weight 0.41):
                  *
                /0 \1
               d    *
                  /0 \1
                 a     b
      v. PQ = (Te, 0.29), (Tc, 0.30), (Tabd, 0.41)
      vi. Remove first two trees, combine into new tree with weight 0.59,
          add to PQ:
          Tree Tce (weight 0.59):
                  *
                /0 \1
               c    e
      vii. PQ = (Tabd, 0.41), (Tce, 0.59)
      viii. Remove first two trees, combine into new tree with weight 1.00,
            add to PQ:
            Tree Tabcde (weight 1.00):
                       *
                   /0     \1
                  *         *
                /0 \1     /0 \1
               c    e    d    *
                            /0 \1
                           a    b
      ix. There is one tree remaining, so we are done!
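The algorithm maps almost directly onto Java's PriorityQueue. The Node class below is a hypothetical sketch, not the course's tree class:

```java
import java.util.PriorityQueue;

public class HuffmanBuild {
    // A tree node: leaves carry a symbol; internal nodes have two children.
    static class Node implements Comparable<Node> {
        char symbol;      // meaningful only for leaves
        double weight;
        Node left, right; // left = 0 branch, right = 1 branch

        Node(char symbol, double weight) {      // leaf constructor (step i)
            this.symbol = symbol;
            this.weight = weight;
        }
        Node(Node left, Node right) {           // merge constructor (step ii.2)
            this.left = left;
            this.right = right;
            this.weight = left.weight + right.weight; // step ii.3
        }
        public int compareTo(Node other) {      // PQ orders trees by weight
            return Double.compare(this.weight, other.weight);
        }
    }

    static Node build(char[] symbols, double[] weights) {
        PriorityQueue<Node> pq = new PriorityQueue<>();
        // Step i: one single-leaf tree per symbol.
        for (int i = 0; i < symbols.length; i++)
            pq.add(new Node(symbols[i], weights[i]));
        // Step ii: repeatedly merge the two lightest trees.
        while (pq.size() >= 2) {
            Node a = pq.remove();
            Node b = pq.remove();
            pq.add(new Node(a, b));
        }
        // Step iii: the last tree is the decoding tree.
        return pq.remove();
    }

    public static void main(String[] args) {
        Node root = build(new char[]{'a', 'b', 'c', 'd', 'e'},
                          new double[]{0.10, 0.15, 0.30, 0.16, 0.29});
        // The root's weight is the sum of all symbol weights.
        System.out.printf("%.2f%n", root.weight); // prints: 1.00
    }
}
```

Because the left/right order of children (and tie-breaking among equal weights) is arbitrary, the bit patterns this produces can differ from the hand-worked example, while the codeword lengths stay the same.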
   f. Results of example:
      i. Codes:
         a: 110   b: 111   c: 00   d: 10   e: 01
      ii. Expected codeword length (in bits/symbol):
          = Σ (frequency of si) x (length in bits of si's encoding)
          = (0.10)(3) + (0.15)(3) + (0.30)(2) + (0.16)(2) + (0.29)(2)
          = 2.25 bits/symbol
      iii. Remember: Shannon's limit is 2.205 bits/symbol
      iv. So this worked very well!!!
   g. Question: Our codes for a, b, c, d, e are different from those in the
      Wikipedia example. Why? Also, why did this not affect our expected
      codeword length of 2.25 bits/symbol?
3. Hashing
   a. Message Digest
      i. File checking: identify/verify files
      ii. MD5 was the standard, but is now considered "broken"
      iii. SHA-1 is something of a standard, but even it may now be "broken"
      iv. Java has a general framework supporting any message digest
          algorithm:
         import java.security.*;
         import java.math.BigInteger;

         public static String getDigest(String text, String algorithm) {
             try {
                 MessageDigest md = MessageDigest.getInstance(algorithm);
                 md.update(text.getBytes());
                 // A positive BigInteger renders the digest bytes as hex.
                 return new BigInteger(1, md.digest()).toString(16);
             } catch (NoSuchAlgorithmException e) {
                 System.out.printf("No such algorithm as %s, sorry.\n",
                                   algorithm);
                 return "";
             }
         }

         public static void main(String[] args) {
             String s = "This is a test";
             System.out.println("MD5: " + getDigest(s, "MD5"));
             // prints: MD5: ce114e4501d2f4e2dcea3e17b546f339
             System.out.println("SHA: " + getDigest(s, "SHA"));
             // prints: SHA: a54d88e06612d820bc3be72877c74f257b561b19
         }
   b. Challenge: Noisy Data
      i. A hash can detect an error in transmission
      ii. Clever adaptations can even correct small errors
      iii. But what about inherently noisy data with more than just a few
           errors?
         1. For example: Audio Data
            a. Challenge: are two MP3 files from the same source?
            b. Using a traditional hash function like MD5 "would be very
               sensitive to highly likely perturbations such as time-shifting,
               CD read errors, different compression algorithms or
               implementations or changes in volume"