## Entropy Exercises

Below are three different "languages," each one the result of a series of "experiments" in one of the following scenarios:
• You have an unbalanced coin that is twice as likely to come up heads as tails.

• You have a six-sided die (one of a pair of dice) that is lopsided. Rolls of 1 and 2 are equally likely; so are rolls of 3 and 4; and so are rolls of 5 and 6. However, the die rolls a 1 twice as often as a 3, and rolls a 5 twice as often as a 1.

• You work as a programmer for the Goldilocks Porridge Polling Company. Company pollsters ask the local bear population about porridge preferences, both in variety (sweetened or unsweetened) and serving temperature (hot, cold, just-right). Thus, any respondent can have one of six preference profiles. Surprisingly, preferences in variety and preferences in temperature are entirely independent of one another. But sweetened porridge is twice as popular as unsweetened. Hot and cold are equally popular, but just-right is twice as popular as either hot or cold. (You'd like to send the sequence of results from the pollsters in the field to the office via an efficient encoding.)
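Each scenario gives only relative frequencies, so the first step in analyzing any of these languages is to normalize those frequencies into a probability distribution. A minimal sketch, using the lopsided die as an example (by the description above, faces 1 through 6 have relative weights 2, 2, 1, 1, 4, 4):

```python
from fractions import Fraction

# Relative weights for faces 1..6 of the lopsided die:
# 1 and 2 roll twice as often as 3 and 4, while 5 and 6
# roll twice as often as 1 and 2.
weights = [2, 2, 1, 1, 4, 4]
total = sum(weights)                        # 14
probs = [Fraction(w, total) for w in weights]
assert sum(probs) == 1                      # sanity check: a valid distribution
```

The same normalization handles the coin (weights 2 and 1), and, because variety and temperature are independent, each of the six porridge profiles gets the product of a variety probability and a temperature probability.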

The goal is to come up with an encoding for each of these languages that is streaming, lossless, and uniquely decodable. For each language, answer the following six questions:

1. What are the symbols in the language?
2. Compute the entropy of this language. (Write down the instance of the formula; you don't have to compute a final numeric answer).
3. What is the naive encoding for this language?
4. Devise an encoding with the required properties that does better on average than the naive encoding.
5. Can your encoding ever be worse than the naive encoding? Explain.
6. Provide a rigorous argument that your encoding is better on average than the naive encoding.
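For question 2, the quantity to instantiate is the Shannon entropy, H = -Σᵢ pᵢ log₂ pᵢ, measured in bits per symbol. A sketch of the computation, using the unbalanced coin (heads twice as likely as tails, so p(heads) = 2/3 and p(tails) = 1/3):

```python
import math

def entropy(probs):
    """Shannon entropy in bits: H = -sum(p * log2(p)) over nonzero p."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# The unbalanced coin from the first scenario: p(heads)=2/3, p(tails)=1/3.
h_coin = entropy([2/3, 1/3])   # less than 1 bit per flip
```

The entropy is the benchmark for questions 4 and 6: it is the floor that any lossless encoding's average bits-per-symbol is measured against.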

In lecture 30, slide 5, we gave a pretty efficient encoding of a simple language with 16 symbols, under certain assumptions. Try to devise an even more efficient encoding. Calculate the efficiency of your new encoding. Show that your new encoding gives better average performance than the one given on the slide. Can you compute the entropy of the language? How does your encoding compare with the entropy?

Also, try this one: your genetic code (DNA) uses a language of four bases (A, C, T, G). Each of the twenty amino acids is coded by a sequence of three bases. The encoding is redundant, since several different sequences can code for the same amino acid. If it weren't redundant, how many different amino acids could there be under this encoding? Is nature's encoding streaming, lossless, and uniquely decodable? Make some assumptions and see if you can estimate the entropy of the language. How efficient is the encoding nature has devised? (I haven't done this one, so I'm not sure how it comes out. But I'd love to see someone's solution and an argument that it's right.)
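As a quick sanity check on the codon counting (this is not a full solution — the entropy estimate still requires assumptions about base frequencies), enumerating the length-three sequences over four bases shows how many a non-redundant code could distinguish:

```python
from itertools import product

# Every possible codon: an ordered triple of bases drawn from {A, C, T, G}.
codons = [''.join(triple) for triple in product("ACTG", repeat=3)]
num_codons = len(codons)   # 4**3 = 64 distinct triplets
```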

Consider the following encoding: A: 0; B: 01; C: 011; D: 111.

I believe that this encoding is uniquely decodable, but it has a very unfortunate characteristic. Prove that it is uniquely decodable (or show that it isn't). Then explain why it's not a very good encoding. (Think about parsing the following two strings: [0 followed by 300 1's] and [01 followed by 300 1's]. What is the first character of each? When do you know?)
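To make the parsing hint concrete, here is a decoder sketch of my own construction for this code (it assumes well-formed input). The crux is that after reading a 0, the decoder must count the *entire* run of 1's that follows before it can tell which symbol the 0 began:

```python
def decode(bits):
    """Decode a bit string under the codebook A:0, B:01, C:011, D:111.

    After a 0, the length of the following run of 1's, taken mod 3,
    is what finally determines whether the 0 was an A, B, or C.
    Assumes the input is a valid encoding of some message.
    """
    out = []
    i, n = 0, len(bits)
    while i < n:
        if bits[i] == '1':
            # A run of 1's not following an undecided 0 can only be D's.
            out.append('D')
            i += 3
        else:
            # Count the 1's after this 0 — potentially the whole rest
            # of the input — before emitting anything.
            j = i + 1
            while j < n and bits[j] == '1':
                j += 1
            k = j - i - 1          # length of the run of 1's
            r = k % 3
            out.append('ABC'[r])   # r=0 -> A, r=1 -> B, r=2 -> C
            out.extend('D' * ((k - r) // 3))
            i = j
    return ''.join(out)
```

Running it on the two hinted strings, `decode('0' + '1' * 300)` and `decode('01' + '1' * 300)`, shows that the first character differs between them, yet neither is determined until the entire run of 1's has been read: the code is uniquely decodable but needs unbounded lookahead, which is exactly what disqualifies it for streaming.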