**Due:** Friday, June 26, 2015

**This assignment sounds much more complicated than it actually is.
If you do it right the two encodings should use all of the same
machinery.**

Suppose you have a table indicating the relative frequencies of symbols in a language. You already know how to compute the entropy of the language. You also know that the Huffman algorithm generates an encoding that is pretty good. Your task in this assignment is to automate the process. The steps are described below.

- First, you will read in a list of n non-negative integers from a
file (one per line). Interpret these as the relative frequencies of
successive characters in an n-character alphabet. (It is OK to assume
that n < 27 and that this is an initial sequence of the English
alphabet.) E.g., suppose the list contains the numbers 4, 2, 3, 1.
Interpret this as follows: out of the total (10), the first character
(A) occurs 4/10 of the time, the second (B) 2/10 of the time, the
third (C) 3/10 of the time, and the last (D) 1/10 of the time. Use
these probabilities to compute the entropy of the language.
- Then, use the Huffman algorithm to devise a binary encoding on
this alphabet, given the probabilities computed. Print your encoding
to the screen in some nice format.
*Note that you do not have to code Huffman yourself, though you can if you like. There are Java implementations online. But if you do use someone else's, you must fully acknowledge in your README where you got the code and explain any changes you had to make to the implementation. Each group must do their own research and modification. I.e., don't get the code from another group; part of the assignment is getting it to work in your situation.*
- Create a volume of text (k characters, e.g., 10000 characters) in
your alphabet using a random function to generate characters at the
expected probabilities, and store the text in a file named
`testText`. There's a hint below on how to do this. Write a utility to encode the text using your encoding into a file named `testText.enc1`. Write another utility to decode this file into another file, `testText.dec1`. By performing both steps you should be able to recover the original text. Ideally, your encoded file should be a binary file containing a very long string of 0's and 1's. But if you can't figure out how to do that, you can instead create a file containing the ASCII representation of your character codes on separate lines. For example, suppose A=0, B=10, C=11. Ideally, the string "ABC" should be encoded in your file as the binary string 01011. But that may be hard to do in Java. It's OK to represent it instead as three lines containing the ASCII strings "0", "10", "11".
- Measure the efficiency of your encoding by computing the actual
average bits per symbol as you do the translation. Compare this with
the computed entropy of the language (from step 1) and record the
percentage difference.
- Then, consider the "2-symbol derived alphabet," where you use
sequences of 2 symbols, considering each pair of symbols to be a
single symbol in the derived alphabet (there's more on this below).
Use Huffman to devise an efficient encoding for that alphabet. Print
this encoding to the screen in a nice format.
**Note that you still consider each symbol to be independent of context. I.e., you don't have to build a second-order model. This is just to see if you gain any coding efficiency coding two symbols at a time instead of one at a time.**
- Re-run your text from step 3 above with this new encoding and
corresponding decoding. You should generate files
`testText.enc2` and `testText.dec2`. Compute the actual efficiency and compare with the entropy and with the efficiency using your 1-symbol encoding.
- At the end, print a nice summary of your results.
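
To make the entropy, Huffman, and efficiency steps above concrete, here is a minimal sketch in Java. It is only one possible arrangement, not a required design: the class name `HuffmanSketch` and all of its method names are made up for illustration. Symbols are stored as strings so that the same code could later be handed pair symbols such as "AB". It assumes the frequencies sum to a positive total.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical helper class; none of these names are required by the assignment.
class HuffmanSketch {
    /** Tree node: a leaf holds a symbol, an internal node holds two children. */
    static final class Node implements Comparable<Node> {
        final String symbol; // null for internal nodes
        final double weight;
        final Node left;
        final Node right;

        Node(String symbol, double weight, Node left, Node right) {
            this.symbol = symbol;
            this.weight = weight;
            this.left = left;
            this.right = right;
        }

        @Override
        public int compareTo(Node other) {
            return Double.compare(weight, other.weight);
        }
    }

    /** Turns relative frequencies into probabilities for "A", "B", ... (assumes total > 0). */
    static Map<String, Double> probabilities(int[] freqs) {
        int total = 0;
        for (int f : freqs) total += f;
        Map<String, Double> probs = new LinkedHashMap<>();
        for (int i = 0; i < freqs.length; i++) {
            probs.put(String.valueOf((char) ('A' + i)), freqs[i] / (double) total);
        }
        return probs;
    }

    /** Entropy in bits per symbol; zero-probability symbols contribute nothing. */
    static double entropy(Map<String, Double> probs) {
        double h = 0.0;
        for (double p : probs.values()) {
            if (p > 0) h -= p * Math.log(p) / Math.log(2);
        }
        return h;
    }

    /** Builds a Huffman tree by repeatedly merging the two lightest nodes. */
    static Map<String, String> huffman(Map<String, Double> probs) {
        PriorityQueue<Node> queue = new PriorityQueue<>();
        for (Map.Entry<String, Double> e : probs.entrySet()) {
            queue.add(new Node(e.getKey(), e.getValue(), null, null));
        }
        while (queue.size() > 1) {
            Node a = queue.poll();
            Node b = queue.poll();
            queue.add(new Node(null, a.weight + b.weight, a, b));
        }
        Map<String, String> codes = new LinkedHashMap<>();
        assign(queue.poll(), "", codes);
        return codes;
    }

    private static void assign(Node node, String prefix, Map<String, String> codes) {
        if (node == null) return;
        if (node.symbol != null) {
            // A one-symbol alphabet would otherwise get an empty code word.
            codes.put(node.symbol, prefix.isEmpty() ? "0" : prefix);
            return;
        }
        assign(node.left, prefix + "0", codes);
        assign(node.right, prefix + "1", codes);
    }

    /** Expected code length in bits per symbol under the given probabilities. */
    static double averageLength(Map<String, Double> probs, Map<String, String> codes) {
        double bits = 0.0;
        for (Map.Entry<String, Double> e : probs.entrySet()) {
            bits += e.getValue() * codes.get(e.getKey()).length();
        }
        return bits;
    }
}
```

For the example frequencies 4, 2, 3, 1, this sketch gives an entropy of about 1.846 bits per symbol and an expected code length of 1.9 bits per symbol, roughly 2.9% above the entropy. The actual average you measure on a random text of k characters will vary a little around that expected value.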

For this assignment: The primary file name should be
`Encoder.java`. Students are welcome to organize their
assignments into multiple files and must submit those files as well.
However, the main method must be in `Encoder.java` and we
should be able to compile your program with `javac *.java`.

We should be able to run your program with the command:
`java Encoder frequenciesFile k`,
where `frequenciesFile` is the name of a file containing
integers representing the relative frequencies of characters in the
alphabet (as described in the first step above), and `k` is a
positive integer that tells how many characters to generate (according
to that probability distribution) in step 3 above. The output of your
program should be: your two encodings displayed on the screen, a file
containing the generated test text, four files containing the test
text encoded and then decoded with each of the two codes, and a nice
display of the results on the screen.
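
As a rough illustration of the command line described above, the following sketch reads the two arguments and parses the frequencies file. The class name `EncoderArgsSketch` and the method `parseFrequencies` are hypothetical; in your submission the `main` method has to live in `Encoder.java`. The sketch assumes one integer per line and tolerates blank lines, which is a choice made here and not something the assignment specifies.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical class name; the real entry point must be Encoder.
class EncoderArgsSketch {
    /** Parses one non-negative integer per line, skipping blank lines. */
    static int[] parseFrequencies(List<String> lines) {
        List<Integer> values = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) continue;
            int f = Integer.parseInt(trimmed);
            if (f < 0) {
                throw new IllegalArgumentException("negative frequency: " + f);
            }
            values.add(f);
        }
        int[] freqs = new int[values.size()];
        for (int i = 0; i < freqs.length; i++) {
            freqs[i] = values.get(i);
        }
        return freqs;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java Encoder frequenciesFile k");
            System.exit(1);
        }
        int[] freqs = parseFrequencies(Files.readAllLines(Paths.get(args[0])));
        int k = Integer.parseInt(args[1]);
        if (k <= 0) {
            throw new IllegalArgumentException("k must be a positive integer");
        }
        System.out.println(freqs.length + " symbols, generating " + k + " characters");
    }
}
```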

Include in the README.txt file the output from one of your successful test runs and the frequencies file used to produce it.

Hint for generating the random text: with the example frequencies 4, 2, 3, 1, you can generate an integer randomly in the range [0..9]. If the integer is in [0..3], write down A. If it's in [4..5], write down B. If it's in [6..8], write down C. If it's 9, write down D. That is, the appropriate percentage of the range corresponds to the probability of the symbol within the alphabet. Assuming the number generated is actually random within the total range, this will produce the desired distribution. Now generalize this procedure to any total and any proportional distribution of symbols within that range.
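
One way the generalization might look is sketched below. The class and method names are invented for illustration, and the sketch assumes the frequencies sum to a positive total and that the alphabet is an initial sequence of A..Z. It draws a number in [0, total) and walks through the sub-ranges until the number falls inside one.

```java
import java.util.Random;

// Hypothetical helper; the names are illustrative only.
class TextGeneratorSketch {
    /** Generates k characters, where symbol i appears with probability freqs[i] / total. */
    static String generate(int[] freqs, int k, Random rng) {
        int total = 0;
        for (int f : freqs) total += f;
        if (total <= 0) {
            throw new IllegalArgumentException("frequencies must sum to a positive total");
        }
        StringBuilder text = new StringBuilder(k);
        for (int i = 0; i < k; i++) {
            int r = rng.nextInt(total); // uniform in [0, total)
            int symbol = 0;
            // Subtract each sub-range width until r lands inside the current one.
            while (r >= freqs[symbol]) {
                r -= freqs[symbol];
                symbol++;
            }
            text.append((char) ('A' + symbol));
        }
        return text.toString();
    }
}
```

Calling `TextGeneratorSketch.generate(new int[] {4, 2, 3, 1}, 10000, new Random())` should produce a string in which about 40% of the characters are A. A symbol with frequency 0 is never chosen, because its sub-range is empty.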

- Build the functionality so that, upon supplying an integer
parameter j > 1, your system will automatically generate an
encoding on the "j-symbol derived alphabet," i.e., taking sequences of
j symbols at a time. Note, you are already building functionality to
allow encoding j = 2 successive symbols.
- Using a large quantity of English text as your "reference file," compute the probabilities of characters (in the text). I'd suggest ignoring any non-letters and ignoring case. Generate an efficient encoding for this alphabet. Then use the encoding on your reference text (ignoring non-letters and case). Compute the actual efficiency of your encoding. In your README, explain how well you think this approximates the entropy of English.
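
Both items above reduce to producing a probability table and handing it to the machinery you already have. The sketch below shows one possible way to build each table; the class and method names are hypothetical. `derive` multiplies independent symbol probabilities to get the j-symbol derived alphabet, and `letterCounts` tallies the letters A..Z in a text while ignoring case and non-letters. Keep in mind that the derived alphabet has n^j symbols, so it grows quickly (26 letters with j = 3 already gives 17576 symbols).

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper; the names are illustrative only.
class DerivedAlphabetSketch {
    /** Probabilities of all j-symbol sequences, treating symbols as independent. */
    static Map<String, Double> derive(Map<String, Double> base, int j) {
        Map<String, Double> current = new LinkedHashMap<>();
        current.put("", 1.0); // the empty sequence, extended j times below
        for (int step = 0; step < j; step++) {
            Map<String, Double> next = new LinkedHashMap<>();
            for (Map.Entry<String, Double> prefix : current.entrySet()) {
                for (Map.Entry<String, Double> sym : base.entrySet()) {
                    next.put(prefix.getKey() + sym.getKey(),
                             prefix.getValue() * sym.getValue());
                }
            }
            current = next;
        }
        return current;
    }

    /** Counts the letters a..z in the text, ignoring case and all other characters. */
    static int[] letterCounts(String text) {
        int[] counts = new int[26];
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c >= 'a' && c <= 'z') {
                counts[c - 'a']++;
            } else if (c >= 'A' && c <= 'Z') {
                counts[c - 'A']++;
            }
        }
        return counts;
    }
}
```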