Computer Science 15-111 (Sections A & B), Spring 2007

Class Notes:  13-Apr-2007

 

Logistics:

1.      Test 2 is Wednesday April 18th (5 days from today).  It is comprehensive.

2.      Reading:

a.      Wikipedia page on Heaps
Note:  you are responsible for the insert and remove operations, but not for increaseKey, merge, or the heap variants.

b.      Wikipedia page on Huffman Coding
Note: you are not responsible for the Variations.

c.      Wikipedia page on Hash Functions
Note:  read this, but you do not need an especially deep understanding of the material on this page (simple questions about it may still appear on subsequent quizzes and tests).

Topics:  Applications

1.      Best-First Search (continued) – Heaps (PQ’s), HashSets!
BestFirstSearch.java
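
A minimal sketch of the idea (hypothetical class and variable names, not the BestFirstSearch.java handout): a PriorityQueue (heap) always hands us the most promising state next, and a HashSet records visited states so we never expand one twice.

import java.util.*;

public class BestFirst {
   // One search state with a heuristic estimate; lower = more promising.
   static class State implements Comparable<State> {
      String name;
      int estimate;
      List<State> neighbors = new ArrayList<State>();
      State(String name, int estimate) { this.name = name; this.estimate = estimate; }
      public int compareTo(State other) { return this.estimate - other.estimate; }
   }

   // Returns true if goal is reachable from start, expanding best-first.
   public static boolean search(State start, State goal) {
      PriorityQueue<State> frontier = new PriorityQueue<State>();
      HashSet<State> visited = new HashSet<State>();
      frontier.add(start);
      while (!frontier.isEmpty()) {
         State s = frontier.remove();      // most promising state in the PQ
         if (s == goal) return true;
         if (!visited.add(s)) continue;    // add() returns false if seen before
         for (State n : s.neighbors)
            if (!visited.contains(n)) frontier.add(n);
      }
      return false;
   }
}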

2.      Huffman Encoding – Heaps (PQ’s), Binary (Prefix) Trees, and HashMaps!

a.      Method for compressing data

b.      Use a variable-length code (different characters may get codes of different bit lengths)

                                                  i.      Key idea:  More-common characters get shorter codes

                                                ii.      Simple example:

1.      Code:  x == 0,   y == 10,  z == 11

2.      Create HashMap to Encode (compress); the codes are bit strings, so store them as Strings (a code sketch follows these results):
put('x', "0")
put('y', "10")
put('z', "11")

3.      Create Binary (Prefix) Tree to Decode (decompress):
      *
   /0   \1
  x      *
       /0  \1
      y     z

4.      Encode(xyz) = 01011

5.      Decode(10011) = yxz

6.      Results:

a.      fixed-length encoding of 3 characters requires 2 bits/character

b.      Here, we have 5/3 = 1.67 bits/character

c.      Indeed, we have lossless compression!
Well, the savings (1.67 vs. 2 bits/character) assumes a uniform distribution of the symbols x, y, z; the code is lossless regardless!
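
Here is that encode/decode pair sketched in Java (hypothetical class and variable names): a HashMap maps each character to its bit string, and decoding walks the prefix tree one bit at a time, emitting a symbol at each leaf.

import java.util.*;

public class TinyCode {
   // Prefix-tree node: leaves hold a symbol; internal nodes have children.
   static class Node {
      char symbol;
      Node zero, one;
   }

   public static void main(String[] args) {
      // The encoding map: x == 0, y == 10, z == 11 (stored as bit strings)
      HashMap<Character, String> code = new HashMap<Character, String>();
      code.put('x', "0");
      code.put('y', "10");
      code.put('z', "11");

      // Encode by concatenating each character's bit string
      String bits = "";
      for (char c : "xyz".toCharArray())
         bits += code.get(c);
      System.out.println(bits);            // prints: 01011

      // Build the decode (prefix) tree from the diagram above, by hand
      Node root = new Node(), inner = new Node();
      Node x = new Node(), y = new Node(), z = new Node();
      x.symbol = 'x'; y.symbol = 'y'; z.symbol = 'z';
      root.zero = x;   root.one = inner;
      inner.zero = y;  inner.one = z;

      // Decode: follow one branch per bit; at a leaf, emit and restart
      String decoded = "";
      Node cur = root;
      for (char bit : "10011".toCharArray()) {
         cur = (bit == '0') ? cur.zero : cur.one;
         if (cur.zero == null) { decoded += cur.symbol; cur = root; }
      }
      System.out.println(decoded);         // prints: yxz
   }
}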

c.      Compresses near Shannon’s ideal limit on expected “codeword length”

                                                  i.      A measure of the minimum number of bits/symbol required to encode messages whose symbols come (with known frequencies) from a given alphabet A = { s1, s2, …, sn }

                                                ii.      Entropy of a symbol si with weight (frequency) wi:  H(si) = -wi log2(wi)

                                              iii.      Entropy of the entire alphabet A with symbols si:  H(A) = Σ H(si)

                                              iv.      Example:
Say messages contain only symbols in A = { ‘a’, ‘b’, ‘c’, ‘d’ , ‘e’}
Frequencies:
a: 10%   b: 15%  c: 30%  d: 16%  e: 29%   (total = 100%)
entropy(a) = H(a) = -0.10 log2(0.10) = 0.332
entropy(b) = H(b) = -0.15 log2(0.15) = 0.411
entropy(c) = H(c) = -0.30 log2(0.30) = 0.521
entropy(d) = H(d) = -0.16 log2(0.16) = 0.423
entropy(e) = H(e) = -0.29 log2(0.29) = 0.518
entropy(A) = H(A) = H(a) + … + H(e) = 2.205

                                                v.      So any encoding of messages with these symbol frequencies requires, on average, at least 2.205 bits/symbol (a quick check of this arithmetic appears below)

                                              vi.      Note:  for 5 symbols, fixed-length encoding requires 3 bits/symbol (right?)
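
A quick sketch to verify the entropy arithmetic (hypothetical class name; Java's Math.log is the natural log, so we divide by log 2 to get base-2 logarithms):

public class Entropy {
   public static void main(String[] args) {
      double[] w = { 0.10, 0.15, 0.30, 0.16, 0.29 };     // weights of a..e
      double total = 0;
      for (double wi : w)
         total += -wi * (Math.log(wi) / Math.log(2));    // H(si) = -wi log2(wi)
      System.out.printf("H(A) = %.3f bits/symbol%n", total);
      // prints: H(A) = 2.205 bits/symbol
   }
}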

d.      Algorithm (a code sketch follows these steps):

                                                  i.      For each symbol, create a tree with just that symbol (a leaf), add to a PQ

                                                ii.      While the PQ has at least two trees

1.      Remove the two trees of smallest weights

2.      Create a new tree with these trees as children

3.      The weight of the new tree is the sum of weights of its children

4.      Add the new tree to the PQ

                                              iii.      When the PQ has only one tree, return that tree as our decoding tree.
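
A sketch of steps i-iii in Java (hypothetical names; weights assumed to be doubles). It should reproduce the code lengths of the example below, though the exact 0/1 labels depend on tie-breaking and child order:

import java.util.*;

public class HuffmanBuild {
   // A tree with a weight; leaves carry a symbol, internal nodes have children.
   static class Tree implements Comparable<Tree> {
      double weight;
      char symbol;                 // meaningful only at leaves
      Tree zero, one;              // null at leaves
      Tree(char s, double w) { symbol = s; weight = w; }
      Tree(Tree z, Tree o) { zero = z; one = o; weight = z.weight + o.weight; }
      public int compareTo(Tree other) { return Double.compare(weight, other.weight); }
   }

   // Steps i-iii: one leaf per symbol, then repeatedly merge the two lightest.
   public static Tree build(char[] symbols, double[] weights) {
      PriorityQueue<Tree> pq = new PriorityQueue<Tree>();
      for (int i = 0; i < symbols.length; i++)
         pq.add(new Tree(symbols[i], weights[i]));
      while (pq.size() > 1) {
         Tree first = pq.remove();           // smallest weight
         Tree second = pq.remove();          // next smallest
         pq.add(new Tree(first, second));    // new weight = sum of children
      }
      return pq.remove();                    // the decoding tree
   }

   // Read each leaf's code off the tree (0 = left branch, 1 = right branch)
   public static void printCodes(Tree t, String prefix) {
      if (t.zero == null) { System.out.println(t.symbol + ": " + prefix); return; }
      printCodes(t.zero, prefix + "0");
      printCodes(t.one, prefix + "1");
   }

   public static void main(String[] args) {
      printCodes(build(new char[] { 'a', 'b', 'c', 'd', 'e' },
                       new double[] { 0.10, 0.15, 0.30, 0.16, 0.29 }), "");
   }
}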

e.      Example:

                                                  i.      PQ = (Ta,0.1), (Tb,0.15), (Td, 0.16), (Te,0.29), (Tc, 0.30)

                                                ii.      Remove first two trees, combine into new tree with weight 0.25, add to PQ:
Tree Tab (weight 0.25):
     *
   /0  \1
  a     b

                                              iii.      PQ = (Td, 0.16), (Tab,0.25), (Te,0.29), (Tc, 0.30)

                                              iv.      Remove first two trees, combine into new tree with weight 0.41, add to PQ:
Tree Tabd (weight 0.41):

             *
           /0  \1
          d     *
              /0  \1
             a     b

                                                v.      PQ = (Te,0.29), (Tc, 0.30), (Tabd, 0.41)

                                              vi.      Remove first two trees, combine into new tree with weight 0.59, add to PQ:
Tree Tce (weight 0.59):
     *
   /0  \1
  c     e

                                            vii.      PQ = (Tabd, 0.41), (Tce,0.59)

                                          viii.      Remove first two trees, combine into new tree with weight 1.00, add to PQ:
Tree Tabcde (weight 1.00):

             *
         /0      \1
        *         *
      /0 \1     /0  \1
     c    e    d     *
                   /0  \1
                  a     b

                                              ix.      There is one tree remaining, so we are done!

f.      Results of the example:

                                                  i.      Codes:
a:  110
b:  111
c:  00
d:  10
e:  01

                                                ii.      Expected codeword length (in bits/symbol)
= Σ (frequency of si) x (length in bits of si encoding)
= (0.1)(3) + (0.15)(3) + (0.3)(2) + (0.16)(2) + (0.29)(2)
= 2.25 bits/symbol

                                              iii.      Remember:  Shannon’s limit is 2.205 bits/symbol

                                              iv.      So this worked very well!!!

g.      Question: Our codes for a,b,c,d,e are different from those in the Wikipedia example.  Why?  Also, why did this not affect our expected codeword length of 2.25 bits/symbol?

 

3.      Hashing

a.      Message Digest

                                                  i.      File checking:  identify/verify files

                                                ii.      MD5 was the standard, but it is now considered “broken”

                                              iii.      SHA is something of a standard, but even it may now be “broken”

                                              iv.      Java has a general framework supporting any message digest algorithm:
import java.security.*;
import java.math.BigInteger;
import static java.lang.System.out;   // the bare "out" below refers to System.out

// Returns text's digest, as a hex string, under the named algorithm.
public static String getDigest(String text, String algorithm) {
   try {
      MessageDigest md = MessageDigest.getInstance(algorithm);
      md.update(text.getBytes());
      // Treat the digest bytes as one non-negative integer and render it
      // in hex (note: toString(16) drops a leading zero digit, if any).
      return new BigInteger(1, md.digest()).toString(16);
   } catch (NoSuchAlgorithmException e) {
      out.printf("No such algorithm as %s, sorry.\n", algorithm);
      return "";
   }
}

public static void main(String[] args) {
   String s = "This is a test";
   out.println("MD5: " + getDigest(s, "MD5"));
   // prints: MD5: ce114e4501d2f4e2dcea3e17b546f339
   out.println("SHA: " + getDigest(s, "SHA"));
   // prints: SHA: a54d88e06612d820bc3be72877c74f257b561b19
}


b.      Challenge:  Noisy Data

                                                  i.      Hash can detect error in transmission

                                                ii.      Clever adaptations can even correct small errors

                                              iii.      But what about inherently noisy data, with more than just a few errors?

1.      For example:  Audio Data

a.      Challenge:  are two MP3 files from the same source?

b.      Using a traditional hash function like MD5 “would be very sensitive to highly likely perturbations such as time-shifting, CD read errors, different compression algorithms or implementations or changes in volume”