Below are three different "languages," where each language is the set of outcomes of a series of "experiments" in one of the following scenarios:

The goal is to come up with an encoding for each of these languages that is streaming, lossless and uniquely decodable. For each language, answer the following six questions:

  1. What are the symbols in the language?
  2. Compute the entropy of this language. (Please write down the specific instance of the formula; you don't have to compute a final numeric answer.)
  3. What is the naive encoding for this language?
  4. Devise an encoding that has the required properties and does better on average than the naive encoding.
  5. Can your encoding ever be worse than the naive encoding? Explain your answer.
  6. Provide a rigorous argument that your encoding is better on average than the naive encoding.
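
Questions 2, 3, 5 and 6 can be sanity-checked numerically. The following is a minimal sketch, not a solution to any of the scenarios: the four-symbol distribution `probs` and the codewords in `codewords` are made up for illustration, and the helper names are hypothetical. It computes the entropy, the cost of a fixed-length (naive) encoding, and the average length of a variable-length prefix-free code.

```python
import math

def entropy(probs):
    """Shannon entropy in bits per symbol: H = -sum(p * log2(p))."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def naive_bits(num_symbols):
    """Bits per symbol for a fixed-length code over num_symbols symbols.

    (n - 1).bit_length() equals ceil(log2(n)) for n >= 1 without
    floating-point rounding.
    """
    return (num_symbols - 1).bit_length()

def average_length(probs, codewords):
    """Expected codeword length in bits per symbol."""
    return sum(p * len(c) for p, c in zip(probs, codewords))

def is_prefix_free(codewords):
    """True if no codeword is a prefix of a different codeword.

    A prefix-free code is uniquely decodable and can be decoded
    symbol by symbol as the bits arrive (streaming).
    """
    return not any(
        i != j and b.startswith(a)
        for i, a in enumerate(codewords)
        for j, b in enumerate(codewords)
    )

# Hypothetical language, assumed for illustration only.
probs = [0.5, 0.25, 0.125, 0.125]
codewords = ["0", "10", "110", "111"]

h = entropy(probs)                      # lower bound on average bits/symbol
naive = naive_bits(len(probs))          # fixed-length cost
avg = average_length(probs, codewords)  # variable-length cost

print(f"entropy = {h} bits, naive = {naive} bits, average = {avg} bits")
print("prefix-free:", is_prefix_free(codewords))
```

In this made-up example the average length is below the naive cost, yet the two rarest symbols take 3 bits each, which is more than the 2 bits the naive encoding spends on them. That is the kind of observation question 5 is after: an encoding can lose on particular messages and still win on average.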

Also, try this one: Your genetic code (DNA) uses a language of four bases (A, C, T, G). Each of twenty amino acids is coded with a sequence of three bases, so there are 4^3 = 64 possible sequences. The encoding is redundant, since several different sequences code for the same amino acid. Assume that all bases and encodings are equally likely. What is the entropy of this language? (I haven't done this one, so I'm not sure how it comes out.)
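
One way to start on the DNA question is to work out the arithmetic for the uniform case, where the entropy of N equally likely outcomes is log2(N). The sketch below assumes two possible readings of "equally likely" (uniform over three-base sequences, or uniform over the twenty amino acids); which reading the problem intends is not settled here, and the variable names are illustrative.

```python
import math

BASES = "ACTG"
SEQUENCE_LENGTH = 3   # bases per amino-acid code
AMINO_ACIDS = 20

# Uniform over bases: each base carries log2(4) bits.
bits_per_base = math.log2(len(BASES))

# Uniform over three-base sequences: 4**3 equally likely outcomes.
num_sequences = len(BASES) ** SEQUENCE_LENGTH
bits_per_sequence = math.log2(num_sequences)

# Uniform over amino acids (an assumed alternative reading).
bits_per_amino_acid = math.log2(AMINO_ACIDS)

# Gap between what a sequence can carry and what an amino acid needs.
redundancy = bits_per_sequence - bits_per_amino_acid

print(f"bits per base:       {bits_per_base}")
print(f"possible sequences:  {num_sequences}")
print(f"bits per sequence:   {bits_per_sequence}")
print(f"bits per amino acid: {bits_per_amino_acid:.3f}")
print(f"redundancy:          {redundancy:.3f} bits")
```

Under the first reading the answer is the entropy of the three-base sequences; under the second it is the entropy of the amino acids, and the difference between the two measures the redundancy the problem statement mentions.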