## Entropy Exercises

Below are three different "languages," each consisting of the sequence
of outcomes from a series of "experiments" in one of the following scenarios:
- You have an unbalanced coin that is twice as likely to come up heads
as tails.

- You have a six-sided die (one of a pair of dice) that is
lopsided. Rolls of 1 and 2 are equally likely; so are rolls of 3
and 4; and so are rolls of 5 and 6. However, the die rolls a 1
twice as often as a 3, and rolls a 5 twice as often as a 1.

- You work as a programmer for the Goldilocks Porridge Polling
Company. Company pollsters ask the local bear population about
porridge preferences, both in variety (sweetened or unsweetened) and
serving temperature (hot, cold, just-right). Thus, any respondent can
have one of six preference profiles. Surprisingly, preferences in
variety and preferences in temperature are entirely independent of one
another. But sweetened porridge is twice as popular as unsweetened.
Hot and cold are equally popular, but just-right is twice as popular
as either hot or cold. (You'd like to send the sequence of results from
the pollsters in the field to the office via an efficient encoding.)

The goal is to come up with an encoding for each of these languages
that is streaming, lossless, and uniquely decodable. For each
language, answer the following six questions:

- What are the symbols in the language?
- Compute the entropy of this language. (Write down the instance of
the formula; you don't have to compute a final numeric answer).
- What is the naive encoding for this language?
- Devise an encoding with the required properties that does better
on average than the naive encoding.
- Can your encoding ever be worse than the naive encoding? Explain.
- Provide a rigorous argument that your encoding is better on
average than the naive encoding.
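To make the "better on average" argument concrete, it can help to check two things mechanically: that your code is prefix-free (hence streaming and uniquely decodable, by the Kraft inequality), and that its expected length beats the naive cost. Here is a sketch using an illustrative code of my own for the coin language; it blocks flips into pairs, since no symbol-by-symbol binary code can beat the naive 1 bit per flip:

```python
from fractions import Fraction

# Illustrative only (not a prescribed answer): a prefix code on PAIRS of
# coin flips, assuming independent flips with P(H) = 2/3, P(T) = 1/3.
pairs = {
    "HH": Fraction(4, 9),  # (2/3) * (2/3)
    "HT": Fraction(2, 9),
    "TH": Fraction(2, 9),
    "TT": Fraction(1, 9),
}
code = {"HH": "0", "HT": "10", "TH": "110", "TT": "111"}

# Kraft inequality: sum over codewords of 2^(-length) <= 1 is necessary
# for any uniquely decodable code with these lengths.
kraft = sum(Fraction(1, 2) ** len(w) for w in code.values())
assert kraft <= 1

# Direct prefix-freeness check: no codeword is a prefix of another.
words = list(code.values())
assert all(not b.startswith(a) for a in words for b in words if a != b)

# Expected bits per flip: half the expected bits per pair.
expected_per_flip = sum(pairs[s] * len(code[s]) for s in pairs) / 2
print(expected_per_flip)  # 17/18 of a bit, beating the naive 1 bit/flip
```

Using exact `Fraction` arithmetic means the comparison against the naive encoding is a proof-by-computation rather than a floating-point approximation, which is the kind of rigor the last question asks for.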

In lecture 30, slide 5, we gave a pretty efficient encoding of a
simple language with 16 symbols, under certain assumptions. Try to
devise an even more efficient encoding. Calculate the efficiency of
your new encoding. Show that your new encoding gives better average
performance than the one given on the slide. Can you compute the
entropy of the language? How does your encoding compare with the
entropy?

Also, try this one: Your genetic code (DNA) uses a language of four
bases (A, C, T, G). Each of twenty amino acids is coded with a
sequence of three bases. The encoding is redundant since various
possible sequences code for the same amino acid. If it weren't
redundant, how many different amino acids *could there be* under
this encoding? Is nature's encoding streaming, lossless, and uniquely
decodable? Make some assumptions and see if you can estimate the
entropy of the language. How efficient is the encoding nature has
devised? (*I haven't done this one, so I'm not sure how it comes
out.* But I'd love to see someone's solution and argument that
it's right.)
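As a starting point for the codon count and the entropy estimate, here is a sketch. The uniform-usage assumption below is mine (real codon usage is biased, so the true entropy is lower), but the count of possible codons is exact:

```python
from itertools import product
from math import log2

# Enumerate every length-3 sequence over the four bases.
bases = "ACGT"
codons = ["".join(c) for c in product(bases, repeat=3)]
print(len(codons))  # 4**3 = 64: with no redundancy, the code could
                    # distinguish up to 64 amino acids

# Crude entropy estimate: if all 64 codons were equally likely, the
# per-codon entropy would be log2(64) = 6 bits (2 bits per base).
# Biased codon usage in real genomes can only lower this figure.
print(log2(len(codons)))
```

Comparing your estimate against the roughly log2(20) bits actually needed per amino acid gives one way to quantify how much of nature's encoding is "spent" on redundancy.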