Entropy Exercises

Below are three different "languages." Each language consists of the results of a series of "experiments" in one of the following scenarios:

The goal is to come up with encodings for each of these languages that are streaming, lossless, and uniquely decodable. For each language, answer the following six questions:

  1. What are the symbols in the language?
  2. Compute the entropy of this language. (Write down the instance of the formula; you don't have to compute a final numeric answer.)
  3. What is the naive encoding for this language?
  4. Devise an encoding with the required properties that does better on average than the naive encoding.
  5. Can your encoding ever be worse than the naive encoding? Explain.
  6. Provide a rigorous argument that your encoding is better on average than the naive encoding.
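
For question 2, the formula meant is presumably the standard Shannon entropy, H = -sum of p_i * log2(p_i) over all symbols i, where p_i is the probability of symbol i. Here is a minimal Python sketch of that formula. The function name `entropy` and the example distribution are made up for illustration; the distribution is not one of the three languages.

```python
import math

def entropy(probs):
    """Shannon entropy, in bits, of a discrete distribution.

    `probs` is a list of symbol probabilities that should sum to 1.
    """
    # Symbols with probability 0 contribute nothing, by the usual convention.
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical four-symbol language, chosen only for illustration.
example = [0.5, 0.25, 0.125, 0.125]
print(entropy(example))
```

A function like this is only a check on your arithmetic; the exercise still asks you to write out the instance of the formula for each language.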



In lecture 30, slide 5, we gave a pretty efficient encoding of a simple language with 16 symbols, under certain assumptions. Try to devise an even more efficient encoding. Calculate the efficiency of your new encoding. Show that your new encoding gives better average performance than the one given on the slide. Can you compute the entropy of the language? How does your encoding compare with the entropy?
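
One way to compare two encodings on average is to compute the expected number of bits per symbol for each. The sketch below does that in Python. The slide's language and its probabilities are not reproduced here, so the symbols, probabilities, and code tables are placeholders (a small four-symbol example), and `average_length` is a hypothetical helper name.

```python
def average_length(probs, codes):
    """Expected number of bits per symbol for a given code table.

    `probs` maps each symbol to its probability;
    `codes` maps each symbol to its codeword (a string of '0' and '1').
    """
    return sum(probs[s] * len(codes[s]) for s in probs)

# Placeholder language and encodings, NOT the ones from the slide.
probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
fixed_width = {"a": "00", "b": "01", "c": "10", "d": "11"}
variable = {"a": "0", "b": "10", "c": "110", "d": "111"}

print(average_length(probs, fixed_width))
print(average_length(probs, variable))
```

Substituting the slide's 16 symbols, your assumed probabilities, and your own code table would give the numbers needed for the comparison.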



Also, try this one: Your genetic code (DNA) uses a language of four bases (A, C, T, G). Each of twenty amino acids is coded with a sequence of three bases. The encoding is redundant since various possible sequences code for the same amino acid. If it weren't redundant, how many different amino acids could there be under this encoding? Is nature's encoding streaming, lossless, and uniquely decodable? Make some assumptions and see if you can estimate the entropy of the language. How efficient is the encoding nature has devised? (I haven't done this one, so I'm not sure how it comes out. But I'd love to see someone's solution and argument that it's right.)

Consider the following encoding: A: 0; B: 01; C: 011; D: 111.

I believe that this encoding is uniquely decodable, but it has a very unfortunate characteristic. Prove that it's uniquely decodable (or show that it's not). Explain why it's not a very good encoding. (Think about parsing the following two strings: [0 followed by 300 1's], [01 followed by 300 1's]. What is the first character of each? When do you know?)
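
If you want to experiment before attempting the proof, a brute-force parser can list every way to split a bit string into codewords. The following Python sketch is one possible approach, not an efficient decoder; `all_parses` is a hypothetical helper name. Running it on a few strings is evidence, not a proof.

```python
def all_parses(bits, codes):
    """Return every way to split `bits` into codewords from `codes`.

    Each parse is a list of symbols. An empty result means the string
    cannot be decoded at all under this code table.
    """
    if not bits:
        return [[]]  # exactly one way to parse the empty string
    parses = []
    for symbol, word in codes.items():
        if bits.startswith(word):
            # Try this codeword first, then parse whatever remains.
            for rest in all_parses(bits[len(word):], codes):
                parses.append([symbol] + rest)
    return parses

codes = {"A": "0", "B": "01", "C": "011", "D": "111"}

for prefix in ("0", "01"):
    bits = prefix + "1" * 300
    parses = all_parses(bits, codes)
    first_symbols = sorted({p[0] for p in parses})
    print(prefix, "+ 300 ones:", len(parses), "parse(s), first symbol(s):", first_symbols)
```

Note that the parser has the whole string in hand before it commits to anything. Ask yourself what a decoder would do if the bits arrived one at a time.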