Computer Science 15-111 (Sections A & B), Spring 2007

Class Notes:  13-Apr-2007

 

Logistics:

1.      Test 2 is Wednesday April 18th (5 days from today).  It is comprehensive.

2.      Reading:

a.      Wikipedia page on Heaps
Note:  you are responsible for the insert and remove operations, but not for increaseKey, merge, or the heap variants.

b.      Wikipedia page on Huffman Coding
Note: you are not responsible for the Variations.

c.      Wikipedia page on Hash Functions
Note:  read this, but you do not need an especially deep understanding of the material on this page (simple questions about it may still appear on subsequent quizzes and tests).

Topics:  Applications

1.      Best-First Search (continued) – Heaps (PQ’s), HashSets!
BestFirstSearch.java
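
A minimal sketch of the idea (hypothetical class and variable names, not the BestFirstSearch.java handout): a PriorityQueue (heap) always hands us the most promising state next, and a HashSet records visited states so we never expand one twice.

import java.util.*;

public class BestFirst {
   // One search state with a heuristic estimate; lower = more promising.
   static class State implements Comparable<State> {
      String name;
      int estimate;
      List<State> neighbors = new ArrayList<State>();
      State(String name, int estimate) { this.name = name; this.estimate = estimate; }
      public int compareTo(State other) { return this.estimate - other.estimate; }
   }

   // Returns true if goal is reachable from start, expanding best-first.
   public static boolean search(State start, State goal) {
      PriorityQueue<State> frontier = new PriorityQueue<State>();
      HashSet<State> visited = new HashSet<State>();
      frontier.add(start);
      while (!frontier.isEmpty()) {
         State s = frontier.remove();      // most promising state in the PQ
         if (s == goal) return true;
         if (!visited.add(s)) continue;    // add() returns false if seen before
         for (State n : s.neighbors)
            if (!visited.contains(n)) frontier.add(n);
      }
      return false;
   }
}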

2.      Huffman Encoding – Heaps (PQ’s), Binary (Prefix) Trees, and HashMaps!

a.      Method for compressing data

b.      Use a variable-length code (different characters may get codes of different bit lengths)

                                                  i.      Key idea:  More-common characters get shorter codes

                                                ii.      Simple example:

1.      Code:  x == 0,   y == 10,  z == 11

2.      Create HashMap to Encode (compress); the codes are bit strings, so store them as Strings (a code sketch follows these results):
put('x', "0")
put('y', "10")
put('z', "11")

3.      Create Binary (Prefix) Tree to Decode (decompress):
      *
   /0   \1
  x      *
       /0  \1
      y     z

4.      Encode(xyz) = 01011

5.      Decode(10011) = yxz

6.      Results:

a.      fixed-length encoding of 3 characters requires 2 bits/character

b.      Here, we have 5/3 = 1.67 bits/character

c.      Indeed, we have lossless compression!
Well, the savings (1.67 vs. 2 bits/character) assumes a uniform distribution of the symbols x, y, z; the code is lossless regardless!
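
Here is that encode/decode pair sketched in Java (hypothetical class and variable names): a HashMap maps each character to its bit string, and decoding walks the prefix tree one bit at a time, emitting a symbol at each leaf.

import java.util.*;

public class TinyCode {
   // Prefix-tree node: leaves hold a symbol; internal nodes have children.
   static class Node {
      char symbol;
      Node zero, one;
   }

   public static void main(String[] args) {
      // The encoding map: x == 0, y == 10, z == 11 (stored as bit strings)
      HashMap<Character, String> code = new HashMap<Character, String>();
      code.put('x', "0");
      code.put('y', "10");
      code.put('z', "11");

      // Encode by concatenating each character's bit string
      String bits = "";
      for (char c : "xyz".toCharArray())
         bits += code.get(c);
      System.out.println(bits);            // prints: 01011

      // Build the decode (prefix) tree from the diagram above, by hand
      Node root = new Node(), inner = new Node();
      Node x = new Node(), y = new Node(), z = new Node();
      x.symbol = 'x'; y.symbol = 'y'; z.symbol = 'z';
      root.zero = x;   root.one = inner;
      inner.zero = y;  inner.one = z;

      // Decode: follow one branch per bit; at a leaf, emit and restart
      String decoded = "";
      Node cur = root;
      for (char bit : "10011".toCharArray()) {
         cur = (bit == '0') ? cur.zero : cur.one;
         if (cur.zero == null) { decoded += cur.symbol; cur = root; }
      }
      System.out.println(decoded);         // prints: yxz
   }
}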

c.      Compresses near Shannon’s ideal limit on expected “codeword length”

                                                  i.      A measure of the minimum number of bits/symbol required to encode messages whose symbols come (with known frequencies) from a given alphabet A = { s1, s2, …, sn }

                                                ii.      Entropy of a symbol si with weight (frequency) wi:  H(si) = -wi log2(wi)

                                              iii.      Entropy of the entire alphabet A with symbols si:  H(A) = Σ H(si)

                                              iv.      Example:
Say messages contain only symbols in A = { ‘a’, ‘b’, ‘c’, ‘d’ , ‘e’}
Frequencies:
a: 10%   b: 15%  c: 30%  d: 16%  e: 29%   (total = 100%)
entropy(a) = H(a) = -0.10 log2(0.10) = 0.332
entropy(b) = H(b) = -0.15 log2(0.15) = 0.411
entropy(c) = H(c) = -0.30 log2(0.30) = 0.521
entropy(d) = H(d) = -0.16 log2(0.16) = 0.423
entropy(e) = H(e) = -0.29 log2(0.29) = 0.518
entropy(A) = H(A) = H(a) + … + H(e) = 2.205

                                                v.      So any encoding of messages with these symbol frequencies requires, on average, at least 2.205 bits/symbol (a quick check of this arithmetic appears below)

                                              vi.      Note:  for 5 symbols, fixed-length encoding requires 3 bits/symbol (right?)
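
A quick sketch to verify the entropy arithmetic (hypothetical class name; Java's Math.log is the natural log, so we divide by log 2 to get base-2 logarithms):

public class Entropy {
   public static void main(String[] args) {
      double[] w = { 0.10, 0.15, 0.30, 0.16, 0.29 };     // weights of a..e
      double total = 0;
      for (double wi : w)
         total += -wi * (Math.log(wi) / Math.log(2));    // H(si) = -wi log2(wi)
      System.out.printf("H(A) = %.3f bits/symbol%n", total);
      // prints: H(A) = 2.205 bits/symbol
   }
}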

d.      Algorithm (a code sketch follows these steps):

                                                  i.      For each symbol, create a tree with just that symbol (a leaf), add to a PQ

                                                ii.      While the PQ has at least two trees

1.      Remove the two trees of smallest weights

2.      Create a new tree with these trees as children

3.      The weight of the new tree is the sum of weights of its children

4.      Add the new tree to the PQ

                                              iii.      When the PQ has only one tree, return that tree as our decoding tree.
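
A sketch of steps i-iii in Java (hypothetical names; weights assumed to be doubles). It should reproduce the code lengths of the example below, though the exact 0/1 labels depend on tie-breaking and child order:

import java.util.*;

public class HuffmanBuild {
   // A tree with a weight; leaves carry a symbol, internal nodes have children.
   static class Tree implements Comparable<Tree> {
      double weight;
      char symbol;                 // meaningful only at leaves
      Tree zero, one;              // null at leaves
      Tree(char s, double w) { symbol = s; weight = w; }
      Tree(Tree z, Tree o) { zero = z; one = o; weight = z.weight + o.weight; }
      public int compareTo(Tree other) { return Double.compare(weight, other.weight); }
   }

   // Steps i-iii: one leaf per symbol, then repeatedly merge the two lightest.
   public static Tree build(char[] symbols, double[] weights) {
      PriorityQueue<Tree> pq = new PriorityQueue<Tree>();
      for (int i = 0; i < symbols.length; i++)
         pq.add(new Tree(symbols[i], weights[i]));
      while (pq.size() > 1) {
         Tree first = pq.remove();           // smallest weight
         Tree second = pq.remove();          // next smallest
         pq.add(new Tree(first, second));    // new weight = sum of children
      }
      return pq.remove();                    // the decoding tree
   }

   // Read each leaf's code off the tree (0 = left branch, 1 = right branch)
   public static void printCodes(Tree t, String prefix) {
      if (t.zero == null) { System.out.println(t.symbol + ": " + prefix); return; }
      printCodes(t.zero, prefix + "0");
      printCodes(t.one, prefix + "1");
   }

   public static void main(String[] args) {
      printCodes(build(new char[] { 'a', 'b', 'c', 'd', 'e' },
                       new double[] { 0.10, 0.15, 0.30, 0.16, 0.29 }), "");
   }
}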

e.      Example:

                                                  i.      PQ = (Ta,0.1), (Tb,0.15), (Td, 0.16), (Te,0.29), (Tc, 0.30)

                                                ii.      Remove first two trees, combine into new tree with weight 0.25, add to PQ:
Tree Tab (weight 0.25):
     *
   /0  \1
  a     b

                                              iii.      PQ = (Td, 0.16), (Tab,0.25), (Te,0.29), (Tc, 0.30)

                                              iv.      Remove first two trees, combine into new tree with weight 0.41, add to PQ:
Tree Tabd (weight 0.41):

             *
           /0  \1
          d     *
              /0  \1
             a     b

                                                v.      PQ = (Te,0.29), (Tc, 0.30), (Tabd, 0.41)

                                              vi.      Remove first two trees, combine into new tree with weight 0.59, add to PQ:
Tree Tce (weight 0.59):
     *
   /0  \1
  c     e

                                            vii.      PQ = (Tabd, 0.41), (Tce,0.59)

                                          viii.      Remove first two trees, combine into new tree with weight 1.00, add to PQ:
Tree Tabcde (weight 1.00):

             *
         /0      \1
        *         *
      /0 \1     /0  \1
     c    e    d     *
                   /0  \1
                  a     b

                                              ix.      There is one tree remaining, so we are done!

f.      Results of the example:

                                                  i.      Codes:
a:  110
b:  111
c:  00
d:  10
e:  01

                                                ii.      Expected codeword length (in bits/symbol)
= Σ (frequency of si) x (length in bits of si encoding)
= (0.1)(3) + (0.15)(3) + (0.3)(2) + (0.16)(2) + (0.29)(2)
= 2.25 bits/symbol

                                              iii.      Remember:  Shannon’s limit is 2.205 bits/symbol

                                              iv.      So this worked very well!!!

g.      Question: Our codes for a,b,c,d,e are different from those in the Wikipedia example.  Why?  Also, why did this not affect our expected codeword length of 2.25 bits/symbol?

 

3.      Hashing

a.      Message Digest

                                                  i.      File checking:  identify/verify files

                                                ii.      MD5 was the standard, but it is now considered “broken”

                                              iii.      SHA is something of a standard, but even it may now be “broken”

                                              iv.      Java has a general framework supporting any message digest algorithm:
import java.security.*;
import java.math.BigInteger;
import static java.lang.System.out;   // the bare "out" below refers to System.out

// Returns text's digest, as a hex string, under the named algorithm.
public static String getDigest(String text, String algorithm) {
   try {
      MessageDigest md = MessageDigest.getInstance(algorithm);
      md.update(text.getBytes());
      // Treat the digest bytes as one non-negative integer and render it
      // in hex (note: toString(16) drops a leading zero digit, if any).
      return new BigInteger(1, md.digest()).toString(16);
   } catch (NoSuchAlgorithmException e) {
      out.printf("No such algorithm as %s, sorry.\n", algorithm);
      return "";
   }
}

public static void main(String[] args) {
   String s = "This is a test";
   out.println("MD5: " + getDigest(s, "MD5"));
   // prints: MD5: ce114e4501d2f4e2dcea3e17b546f339
   out.println("SHA: " + getDigest(s, "SHA"));
   // prints: SHA: a54d88e06612d820bc3be72877c74f257b561b19
}


b.      Challenge:  Noisy Data

                                                  i.      Hash can detect error in transmission

                                                ii.      Clever adaptations can even correct small errors

                                              iii.      But what about inherently noisy data, with more than just a few errors?

1.      For example:  Audio Data

a.      Challenge:  are two MP3 files from the same source?

b.      Using a traditional hash function like MD5 “would be very sensitive to highly likely perturbations such as time-shifting, CD read errors, different compression algorithms or implementations or changes in volume”