CS361 Assignment 3
Due: Friday, June 26, 2015
This assignment sounds much more complicated than it actually is.
If you do it right, the two encodings should use all of the same
machinery.
Suppose you have a table indicating the relative frequencies of
symbols in a language. You already know how to compute the entropy of
the language. You also know that the Huffman algorithm generates an
encoding that is pretty good (in fact, optimal among prefix codes that
encode one symbol at a time). Your task in this assignment is to
automate the process. The steps are described below.
- First, you will read in a list of n non-negative integers from a
file (one per line). Interpret these as the relative frequencies of
successive characters in an n-character alphabet. (It is OK to assume
that n <= 26 and that the alphabet is an initial segment of the English
alphabet, i.e., A, B, C, and so on.) E.g., suppose the list contains the
numbers 4, 2, 3, 1.
Interpret this as follows: out of the total (10), the first character
(A) occurs 4/10 of the time, the second (B) 2/10 of the time, the
third (C) 3/10 of the time, and the last (D) 1/10 of the time. Use
these probabilities to compute the entropy of the language.
- Then, use the Huffman algorithm to devise a binary encoding on
this alphabet, given the probabilities computed. Print your encoding
to the screen in some nice format. Note that you do not have to
code Huffman yourself, though you can if you like. There are Java
implementations on-line. But, if you do use someone else's, you must
acknowledge fully in your README where you got the code you are using
and explain any changes you had to make to the implementation. Each
group must do their own research and modification. I.e., don't get
the code from another group; part of the assignment is getting it to
work in your situation.
- Create a volume of text (k characters, e.g., 10000 characters) in
your alphabet using a random function to generate characters at the
expected probabilities, and store the text in a file
named testText. There's a hint below on how to do this. Write
a utility to encode the text using your encoding into a file
named testText.enc1. Write another utility to decode this
file into another file testText.dec1. After both steps,
testText.dec1 should match the original testText exactly. Ideally, your
encoded file should be a binary file containing a very long string of
0's and 1's. But if you can't figure out how to do that, you can
instead create a file containing the ASCII representation of your
character codes on separate lines. For example, suppose A=0, B=10,
C=11. Ideally, the string "ABC" should be encoded in your file as the
binary string: 01011. But that may be hard to do in Java. It's OK to
represent it instead as three lines containing the ASCII strings, "0",
"10", "11".
- Measure the efficiency of your encoding by computing the actual
average bits per symbol as you do the translation. Compare this with
the computed entropy of the language (from step 1) and record the
percentage difference.
- Then, consider the "2-symbol derived alphabet," where you use
sequences of 2 symbols, considering each pair of symbols to be a
single symbol in the derived alphabet (there's more on this below).
Use Huffman to devise an efficient encoding for that alphabet. Print
this encoding to the screen in a nice format. Note
that you still consider each symbol to be independent of
context. I.e., you don't have to build a second-order model. This is
just to see if you gain any coding efficiency coding two symbols at a
time instead of one at a time.
- Re-run your text from step 3 above with this new encoding and
corresponding decoding. You should generate
files testText.enc2 and testText.dec2. Compute
the actual efficiency (in bits per original symbol, so that the numbers
are comparable) and compare with the entropy and with the
efficiency using your 1-symbol encoding.
- At the end, print a nice summary of your results.
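The entropy computation from step 1 and the efficiency comparison from
steps 4 and 6 can be sketched as follows. This is only a minimal sketch,
not a required design: the class and method names (EfficiencySketch,
entropy, bitsPerSymbol, percentDifference) are hypothetical, and
measuring the percentage difference relative to the entropy is one
reasonable reading of the requirement, not the only one.

```java
final class EfficiencySketch {
    /** Entropy in bits per symbol: H = -sum of p * log2(p) over all symbols. */
    static double entropy(int[] freqs) {
        long total = 0;
        for (int f : freqs) total += f;
        if (total == 0) throw new IllegalArgumentException("all frequencies are zero");
        double h = 0.0;
        for (int f : freqs) {
            if (f == 0) continue;                 // a symbol that never occurs contributes 0
            double p = (double) f / total;
            h -= p * (Math.log(p) / Math.log(2)); // log2(p) via natural logs
        }
        return h;
    }

    /** Actual average: bits written divided by the number of ORIGINAL symbols encoded. */
    static double bitsPerSymbol(long totalBits, long symbolCount) {
        return (double) totalBits / symbolCount;
    }

    /** Percentage by which the actual average exceeds the entropy (assumed definition). */
    static double percentDifference(double actual, double entropy) {
        return 100.0 * (actual - entropy) / entropy;
    }

    public static void main(String[] args) {
        int[] freqs = {4, 2, 3, 1};               // the A, B, C, D example from step 1
        double h = entropy(freqs);
        // Huffman on 4, 2, 3, 1 gives code lengths A=1, B=3, C=2, D=3, so the
        // 10-character text "AAAABBCCCD" takes 4*1 + 2*3 + 3*2 + 1*3 = 19 bits.
        double actual = bitsPerSymbol(19, 10);
        System.out.printf("entropy = %.4f bits/symbol%n", h);
        System.out.printf("actual  = %.4f bits/symbol%n", actual);
        System.out.printf("difference = %.2f%%%n", percentDifference(actual, h));
    }
}
```

For the 2-symbol encoding, pass the number of original characters (k),
not the number of pairs, as symbolCount, so that both results are in the
same units as the entropy.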
To clarify what I mean by the 2-symbol derived alphabet: suppose you
had only symbols A, B, C in your original alphabet with probabilities
1/2, 1/4, 1/4. Then generate the alphabet with the symbols AA, AB,
AC, BA, BB, BC, CA, CB, CC. The probability of a symbol in this new
alphabet is the product of the probabilities of its two component
symbols (under the
first-order model of the language). So AA has probability 1/2 * 1/2 =
1/4, etc. Then, use Huffman to find an encoding for this language.
This is the "2-symbol derived language" encoding. Its average number of
bits per original symbol is probably closer to the entropy than that of
the 1-symbol version.
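Here is a minimal sketch of building the derived alphabet from the
original frequencies. The names PairAlphabetSketch and pairWeights are
hypothetical. It works with integer weights rather than probabilities:
the weight of a pair is the product of the two frequencies, and dividing
by the square of the original total gives the probability. It assumes,
as in step 1, that the alphabet is A, B, C, and so on.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class PairAlphabetSketch {
    /** Maps each pair (AA, AB, ...) to the product of its two symbols' frequencies. */
    static Map<String, Long> pairWeights(int[] freqs) {
        Map<String, Long> weights = new LinkedHashMap<>(); // keeps AA, AB, AC, BA, ... order
        for (int i = 0; i < freqs.length; i++) {
            for (int j = 0; j < freqs.length; j++) {
                String pair = "" + (char) ('A' + i) + (char) ('A' + j);
                weights.put(pair, (long) freqs[i] * freqs[j]);
            }
        }
        return weights;
    }

    public static void main(String[] args) {
        int[] freqs = {2, 1, 1};            // A, B, C with probabilities 1/2, 1/4, 1/4
        long total = 0;
        for (int f : freqs) total += f;
        long totalSquared = total * total;  // the pair weights sum to this value
        for (Map.Entry<String, Long> e : pairWeights(freqs).entrySet()) {
            System.out.println(e.getKey() + " " + e.getValue() + "/" + totalSquared);
        }
    }
}
```

Because the weights are just another list of relative frequencies, they
can be fed to the same Huffman code you use for the 1-symbol alphabet.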
For this assignment: The primary file name should be
Encoder.java. You are welcome to organize your
assignment into multiple files, but you must submit all of those files.
However, the main method must be in Encoder.java and we
should be able to compile your program with javac *.java.
We should be able to run your program with the command:
java Encoder frequenciesFile k,
where frequenciesFile is the name of a file containing
integers representing the relative frequencies of characters in the
alphabet (as described in the first step above), and k is a
positive integer that tells how many characters to generate (according
to that probability distribution) in step 3 above. The output of your
program should be: your two encodings displayed on the screen, a file
containing the generated test text, four files containing the test
text encoded and then decoded with each of the two codes, and a nice
display of the results on the screen.
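A minimal sketch of handling the two command-line arguments might look
like the following. The class name ArgsSketch and the helper
parseFrequencies are hypothetical (in your submission the main method
belongs in Encoder.java), and skipping blank lines and rejecting
negative numbers are assumptions about how you want to treat odd input.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

final class ArgsSketch {
    /** Parses one non-negative integer per line; blank lines are skipped (an assumption). */
    static int[] parseFrequencies(List<String> lines) {
        List<Integer> values = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) continue;
            int f = Integer.parseInt(trimmed);   // throws NumberFormatException on bad input
            if (f < 0) throw new IllegalArgumentException("negative frequency: " + f);
            values.add(f);
        }
        int[] freqs = new int[values.size()];
        for (int i = 0; i < freqs.length; i++) freqs[i] = values.get(i);
        return freqs;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java Encoder frequenciesFile k");
            return;
        }
        int[] freqs = parseFrequencies(Files.readAllLines(Paths.get(args[0])));
        int k = Integer.parseInt(args[1]);       // number of characters to generate
        System.out.println(freqs.length + " symbols, k = " + k);
    }
}
```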
Include in the README.txt file the output from one of your successful
test runs and the frequencies file used to produce it.
Hint for generating characters in the right proportion:
Suppose you want to generate characters in the alphabet {A,B,C,D} with
proportions of 4/10, 2/10, 3/10, and 1/10, respectively. One way
would be to divide a dartboard into regions of those exact
proportional sizes and throw a dart at it. If the dart landed in the
A region (which consumes 40% of the total area), write down "A",
etc.
You can implement this by generating an integer randomly in the range
[0..9]. If the integer is in [0..3], write down A. If it's in [4..5],
write down B. If it's in [6..8], write down C. If it's 9, write down D. That
is, the appropriate percentage of the range corresponds to the
probability of the symbol within the alphabet. Assuming the number
generated is uniformly distributed over the total range, this will produce
the desired distribution. Now generalize this procedure to any total
and any proportional distribution of symbols within that range.
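One way to generalize the dartboard idea is to walk the cumulative
ranges until the random number falls inside one. This is a minimal
sketch under that approach; the names SamplerSketch, pick, and generate
are hypothetical, and it assumes the total of the frequencies is
positive and fits in an int.

```java
import java.util.Random;

final class SamplerSketch {
    /** Maps a dart position r in [0, total) to a symbol index using cumulative ranges. */
    static int pick(int[] freqs, int r) {
        int upper = 0;                        // exclusive upper end of the current region
        for (int i = 0; i < freqs.length; i++) {
            upper += freqs[i];
            if (r < upper) return i;          // r landed in symbol i's region
        }
        throw new IllegalArgumentException("r is outside [0, total)");
    }

    /** Generates k characters from A, B, C, ... in proportion to freqs. */
    static String generate(int[] freqs, int k, Random rng) {
        int total = 0;
        for (int f : freqs) total += f;
        StringBuilder sb = new StringBuilder(k);
        for (int n = 0; n < k; n++) {
            int r = rng.nextInt(total);       // uniform in [0, total)
            sb.append((char) ('A' + pick(freqs, r)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int[] freqs = {4, 2, 3, 1};           // the A, B, C, D example above
        System.out.println(generate(freqs, 40, new Random()));
    }
}
```

With the frequencies 4, 2, 3, 1, pick returns A for 0 through 3, B for 4
and 5, C for 6 through 8, and D for 9, which matches the ranges in the
hint. A symbol with frequency 0 gets an empty region and is never chosen.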
Extra Credit:
You can obtain up to 2 extra credit points as follows:
- Build the functionality so that, upon supplying an integer
parameter j > 1, your system will automatically generate an
encoding on the "j-symbol derived alphabet," i.e., taking sequences of
j symbols at a time. Note, you are already building functionality to
allow encoding j = 2 successive symbols.
- Using a large quantity of English text as your "reference file,"
compute the probabilities of characters (in the text). I'd suggest
ignoring any non-letters and ignoring case. Generate an
efficient encoding for this alphabet. Then use the encoding on your
reference text (ignoring non-letters and case). Compute the actual
efficiency of your encoding. In your README, explain how well you
think this approximates the entropy of English.
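For the second extra credit item, counting letters while ignoring case
and non-letters can be sketched like this. The names LetterCountSketch
and countLetters are hypothetical, and the sketch only recognizes the 26
unaccented ASCII letters, which is an assumption about what counts as a
"letter" in your reference text.

```java
final class LetterCountSketch {
    /** Returns 26 counts (index 0 = A, ..., 25 = Z), ignoring case and non-letters. */
    static int[] countLetters(String text) {
        int[] counts = new int[26];
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c >= 'a' && c <= 'z') {
                counts[c - 'a']++;            // lower case folds onto the same slot
            } else if (c >= 'A' && c <= 'Z') {
                counts[c - 'A']++;
            }                                 // everything else is skipped
        }
        return counts;
    }

    public static void main(String[] args) {
        int[] counts = countLetters("Hello, World!");
        for (int i = 0; i < counts.length; i++) {
            if (counts[i] > 0) {
                System.out.println((char) ('A' + i) + " " + counts[i]);
            }
        }
    }
}
```

The resulting array has the same shape as the list read from the
frequencies file, so it can go through the same entropy and Huffman
steps as the main assignment.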