CS 361 Summer 2014

CS361 Assignment 3

Due: Friday, June 26, 2015

This assignment sounds much more complicated than it actually is. If you do it right, the two encodings should use all of the same machinery.

Suppose you have a table indicating the relative frequencies of symbols in a language. You already know how to compute the entropy of the language. You also know that the Huffman algorithm generates an encoding that is pretty good. Your task in this assignment is to automate the process. The steps are described below.

1. First, you will read in a list of n non-negative integers from a file (one per line). Interpret these as the relative frequencies of successive characters in an n-character alphabet. (It is OK to assume that n < 27 and that this is an initial sequence of the English alphabet.) E.g., suppose the list contains the numbers 4, 2, 3, 1. Interpret this as follows: out of the total (10), the first character (A) occurs 4/10 of the time, the second (B) 2/10 of the time, the third (C) 3/10 of the time, and the last (D) 1/10 of the time. Use these probabilities to compute the entropy of the language.
2. Then, use the Huffman algorithm to devise a binary encoding on this alphabet, given the probabilities computed. Print your encoding to the screen in some nice format. Note that you do not have to code Huffman yourself, though you can if you like. There are Java implementations online. But if you do use someone else's, you must acknowledge fully in your README where you got the code you are using and explain any changes you had to make to the implementation. Each group must do their own research and modification. I.e., don't get the code from another group; part of the assignment is getting it to work in your situation.
3. Create a volume of text (k characters, e.g., 10000 characters) in your alphabet using a random function to generate characters at the expected probabilities, and store the text in a file named testText. There's a hint below on how to do this. Write a utility to encode the text using your encoding into a file named testText.enc1. Write another utility to decode this file into another file testText.dec1. By performing both steps you should be able to recover the original text. Ideally, your encoded file should be a binary file containing a very long string of 0's and 1's. But if you can't figure out how to do that, you can instead create a file containing the ASCII representation of your character codes on separate lines. For example, suppose A=0, B=10, C=11. Ideally, the string "ABC" should be encoded in your file as the binary string: 01011. But that may be hard to do in Java. It's OK to represent it instead as three lines containing the ASCII strings "0", "10", "11".
4. Measure the efficiency of your encoding by computing the actual average bits per symbol as you do the translation. Compare this with the computed entropy of the language (from step 1) and record the percentage difference.
5. Then, consider the "2-symbol derived alphabet," where you use sequences of 2 symbols, considering each pair of symbols to be a single symbol in the derived alphabet (there's more on this below). Use Huffman to devise an efficient encoding for that alphabet. Print this encoding to the screen in a nice format. Note that you still consider each symbol to be independent of context. I.e., you don't have to build a second-order model. This is just to see if you gain any coding efficiency by coding two symbols at a time instead of one at a time.
6. Re-run your text from step 3 above with this new encoding and corresponding decoding. You should generate files testText.enc2 and testText.dec2. Compute the actual efficiency and compare with the entropy and with the efficiency using your 1-symbol encoding.
7. At the end, print a nice summary of your results.
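
To make the entropy computation, the Huffman construction, and the bits-per-symbol measurement in the steps above concrete, here is a minimal sketch in Java. It is one possible arrangement, not the required design: the class name HuffmanSketch and the methods entropy, buildCodes, and averageBits are hypothetical names chosen for illustration. It assumes the frequencies are already in an int array, and it computes the expected average from the frequencies rather than from a generated text.

```java
import java.util.PriorityQueue;

// Illustrative sketch only; class and method names are hypothetical.
class HuffmanSketch {

    // Tree node: a leaf holds a symbol index, an internal node holds two children.
    static final class Node implements Comparable<Node> {
        final long weight;
        final int symbol;   // index into the alphabet, or -1 for an internal node
        final int order;    // creation order, used only to break ties deterministically
        final Node left;
        final Node right;

        Node(long weight, int symbol, int order, Node left, Node right) {
            this.weight = weight;
            this.symbol = symbol;
            this.order = order;
            this.left = left;
            this.right = right;
        }

        @Override
        public int compareTo(Node other) {
            int byWeight = Long.compare(weight, other.weight);
            return byWeight != 0 ? byWeight : Integer.compare(order, other.order);
        }
    }

    // Shannon entropy in bits per symbol; zero-frequency symbols contribute nothing.
    static double entropy(int[] freqs) {
        long total = 0;
        for (int f : freqs) total += f;
        double h = 0.0;
        for (int f : freqs) {
            if (f > 0) {
                double p = (double) f / total;
                h -= p * (Math.log(p) / Math.log(2));
            }
        }
        return h;
    }

    // Repeatedly merge the two lightest nodes, then read the codes off the tree.
    static String[] buildCodes(int[] freqs) {
        PriorityQueue<Node> queue = new PriorityQueue<>();
        int order = 0;
        for (int i = 0; i < freqs.length; i++) {
            queue.add(new Node(freqs[i], i, order++, null, null));
        }
        while (queue.size() > 1) {
            Node a = queue.poll();
            Node b = queue.poll();
            queue.add(new Node(a.weight + b.weight, -1, order++, a, b));
        }
        String[] codes = new String[freqs.length];
        assign(queue.poll(), "", codes);
        return codes;
    }

    // Left edges append "0", right edges append "1".
    static void assign(Node node, String prefix, String[] codes) {
        if (node == null) return;
        if (node.symbol >= 0) {
            // A one-symbol alphabet would otherwise get the empty code.
            codes[node.symbol] = prefix.isEmpty() ? "0" : prefix;
            return;
        }
        assign(node.left, prefix + "0", codes);
        assign(node.right, prefix + "1", codes);
    }

    // Expected code length in bits per symbol, weighted by frequency.
    static double averageBits(int[] freqs, String[] codes) {
        long total = 0;
        long bits = 0;
        for (int i = 0; i < freqs.length; i++) {
            total += freqs[i];
            bits += (long) freqs[i] * codes[i].length();
        }
        return (double) bits / total;
    }

    public static void main(String[] args) {
        int[] freqs = {4, 2, 3, 1};
        String[] codes = buildCodes(freqs);
        for (int i = 0; i < codes.length; i++) {
            System.out.println((char) ('A' + i) + " = " + codes[i]);
        }
        double h = entropy(freqs);
        double avg = averageBits(freqs, codes);
        System.out.println("entropy = " + h + ", average bits = " + avg);
        System.out.println("percent difference = " + 100.0 * (avg - h) / h);
    }
}
```

For the 4, 2, 3, 1 example the entropy is about 1.846 bits per symbol and this tree gives an average of 1.9 bits per symbol. Different tie-breaking can produce different codes with the same average length, so your own output need not match bit for bit.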

To clarify what I mean by the 2-symbol derived alphabet: suppose you had only symbols A, B, C in your original alphabet with probabilities 1/2, 1/4, 1/4. Then generate the alphabet with the symbols AA, AB, AC, BA, BB, BC, CA, CB, CC. The probability of a symbol in this new alphabet is the product of the probabilities of its two component symbols (under the first-order model of the language). So AA has probability 1/2 * 1/2 = 1/4, AB has probability 1/2 * 1/4 = 1/8, etc. Then, use Huffman to find an encoding for this language. This is the "2-symbol derived language" encoding. Measured in bits per original symbol (each code word covers two of them), it's probably closer to the entropy than the 1-symbol version.
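
Here is a small sketch of building the derived alphabet for the A, B, C example. The class name PairAlphabetSketch and its methods are hypothetical, and it assumes the single-symbol probabilities are already available as doubles. It only builds the table of pair probabilities; running Huffman on that table is left out.

```java
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only; class and method names are hypothetical.
class PairAlphabetSketch {

    // Each pair's probability is the product of its two symbols' probabilities,
    // because symbols are treated as independent of context.
    static Map<String, Double> pairProbabilities(char[] symbols, double[] probs) {
        Map<String, Double> pairs = new LinkedHashMap<>();
        for (int i = 0; i < symbols.length; i++) {
            for (int j = 0; j < symbols.length; j++) {
                pairs.put("" + symbols[i] + symbols[j], probs[i] * probs[j]);
            }
        }
        return pairs;
    }

    // Entropy in bits per derived symbol (here, per pair).
    static double entropy(Collection<Double> probs) {
        double h = 0.0;
        for (double p : probs) {
            if (p > 0) {
                h -= p * (Math.log(p) / Math.log(2));
            }
        }
        return h;
    }

    public static void main(String[] args) {
        char[] symbols = {'A', 'B', 'C'};
        double[] probs = {0.5, 0.25, 0.25};
        Map<String, Double> pairs = pairProbabilities(symbols, probs);
        for (Map.Entry<String, Double> e : pairs.entrySet()) {
            System.out.println(e.getKey() + " : " + e.getValue());
        }
        // Divide by 2 to compare with the single-symbol entropy.
        System.out.println("bits per original symbol = " + entropy(pairs.values()) / 2);
    }
}
```

Note that the entropy of the pair alphabet is twice the single-symbol entropy, so any comparison between the two encodings should be made per original symbol.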

For this assignment: The primary file name should be Encoder.java. Students are welcome to organize their assignments into multiple files and must submit those files as well. However, the main method must be in Encoder.java and we should be able to compile your program with javac *.java.

We should be able to run your program with the command: java Encoder frequenciesFile k, where frequenciesFile is the name of a file containing integers representing the relative frequencies of characters in the alphabet (as described in the first step above), and k is a positive integer that tells how many characters to generate (according to that probability distribution) in step 3 above. The output of your program should be: your two encodings displayed on the screen, a file containing the generated test text, four files containing the test text encoded and then decoded with each of the two codes, and a nice display of the results on the screen.

Include in the README.txt file the output from one of your successful test runs and the frequencies file used to produce it.

Hint for generating characters in the right proportion:

Suppose you want to generate characters in the alphabet {A,B,C,D} with proportions of 4/10, 2/10, 3/10, and 1/10, respectively. One way would be to divide a dartboard into regions of those exact proportional sizes and throw a dart at it. If the dart lands in the A region (which takes up 40% of the total area), write down "A", etc.

You can implement this by generating an integer randomly in the range [0..9]. If the integer is in [0..3], write down A. If it's in [4..5], write down B. If it's in [6..8], write down C. If it's 9, write down D. That is, the fraction of the range assigned to a symbol equals the probability of that symbol within the alphabet. Assuming the number generated is uniformly random over the total range, this will produce the desired distribution. Now generalize this procedure to any total and any proportional distribution of symbols within that range.
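
One way to generalize the dartboard idea is sketched below: walk the frequencies, keeping a running upper bound, and stop at the first symbol whose range contains the random number. The class name WeightedPickSketch and the methods indexFor and generate are hypothetical, and the sketch assumes the alphabet is an initial sequence of A..Z with at least one non-zero frequency.

```java
import java.util.Random;

// Illustrative sketch only; class and method names are hypothetical.
class WeightedPickSketch {

    // Map r in [0, total) to the index of the symbol whose cumulative range contains r.
    static int indexFor(int r, int[] freqs) {
        if (r < 0) {
            throw new IllegalArgumentException("r out of range: " + r);
        }
        int upper = 0;
        for (int i = 0; i < freqs.length; i++) {
            upper += freqs[i];
            if (r < upper) {
                return i;
            }
        }
        throw new IllegalArgumentException("r out of range: " + r);
    }

    // Generate k characters, each drawn independently with the given relative frequencies.
    static String generate(int[] freqs, int k, Random rng) {
        int total = 0;
        for (int f : freqs) total += f;
        StringBuilder text = new StringBuilder(k);
        for (int n = 0; n < k; n++) {
            int r = rng.nextInt(total);  // uniform in [0, total)
            text.append((char) ('A' + indexFor(r, freqs)));
        }
        return text.toString();
    }

    public static void main(String[] args) {
        int[] freqs = {4, 2, 3, 1};
        System.out.println(generate(freqs, 60, new Random()));
    }
}
```

With frequencies 4, 2, 3, 1, the values 0 through 9 map to A, A, A, A, B, B, C, C, C, D, which matches the ranges in the hint. A symbol with frequency 0 gets an empty range and is never chosen.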

Extra Credit:

You can obtain up to 2 extra credit points as follows:

Build the functionality so that, upon supplying an integer parameter j > 1, your system will automatically generate an encoding on the "j-symbol derived alphabet," i.e., taking sequences of j symbols at a time. Note, you are already building functionality to allow encoding j = 2 successive symbols.
Using a large quantity of English text as your "reference file," compute the probabilities of characters (in the text). I'd suggest ignoring any non-letters and ignoring case. Generate an efficient encoding for this alphabet. Then use the encoding on your reference text (ignoring non-letters and case). Compute the actual efficiency of your encoding. In your README, explain how well you think this approximates the entropy of English.
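
For the reference-text option, counting letters while ignoring case and non-letters might look like the sketch below. The class name LetterCountSketch and the method countLetters are hypothetical. Reading the reference file into a string is left out, and only the 26 unaccented English letters are counted.

```java
// Illustrative sketch only; class and method names are hypothetical.
class LetterCountSketch {

    // Returns 26 counts, index 0 for A through index 25 for Z.
    static int[] countLetters(String text) {
        int[] counts = new int[26];
        for (int i = 0; i < text.length(); i++) {
            char c = Character.toUpperCase(text.charAt(i));
            if (c >= 'A' && c <= 'Z') {
                counts[c - 'A']++;  // everything else (digits, spaces, punctuation) is skipped
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        int[] counts = countLetters("Hello, World!");
        for (int i = 0; i < counts.length; i++) {
            if (counts[i] > 0) {
                System.out.println((char) ('A' + i) + " " + counts[i]);
            }
        }
    }
}
```

The resulting array has the same shape as the frequencies read from the frequencies file, so it can feed the same entropy and Huffman machinery.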