CS 314 - Specification 11 - Huffman Coding and Compression
Programming Assignment 11: Pair Assignment. You may work with one other person on this assignment using the pair programming technique. Review this paper on pair programming. You are not required to work in a pair on the assignment. (You may complete it by yourself if you wish.) If you begin working with one partner and do not wish to finish the assignment with that partner you must complete the assignment individually. If you work with a partner, the intent is that you work together, at the same computer. One person "drives" (does the typing and explains what they are doing) and the other person "navigates" (watches and asks questions when something is unclear). You shall not partition the work, work on your own, and then put things together.
You and your partner may not acquire from any source (e.g. another student or an internet site) a partial or complete solution to a problem or project that has been assigned. You and your partner may not show other students your solution to an assignment. You may not have another person (current student other than your partner, former student, tutor, friend, anyone) “walk you through” how to solve an assignment. You may get help from the instructional staff. You may discuss general ideas and approaches with other students. Review the class policy on collaboration from the syllabus. If you took CS314 previously and work with a partner you must start from scratch on the assignment.
The purposes of this assignment are:
Many, many thanks to Owen Astrachan of Duke University for allowing me to use and modify his version of this assignment.
There is a lot to do. This is a complicated problem and program. Start early.
Read the documentation for the provided classes and the how to. Use the constants from the IHuffConstants interface instead of hard coding numbers.
Because the compression scheme involves reading and writing in a bits-at-a-time manner as opposed to a char-at-a-time manner, the program can be hard to debug. In order to facilitate the design/code/debug cycle, you should take care to develop the program in an incremental fashion. If you try to write the whole program at once, it will be difficult to get a completely working program. Past students have suffered much pain and gnashing of teeth because they made an error in their code early, didn't test it, and spent hours trying to find the bug. The howto development section has more information on incremental development.
I have also provided a few files that show only the data portion of a file as the characters '0' and '1', with spaces as separator characters. These can be used to check part of your work.
Implement a program that performs Huffman coding and compression on files of any type. You are given a number of classes to start with. Your solution shall use many different classes. This is a difficult assignment. The hardest of the semester. I strongly urge you to work with a partner and to start early.
For a review of Huffman coding see the class slides. I expect you to use a priority queue as shown on the slides and discussed in class. Also note your actual results will be different from the first example in the middle of the slides, because the period character will be before any of the other letters in the initial priority queue AND because the example does not show the PSEUDO-EOF character with a frequency of 1. Your algorithm shall match the example at the end of the slides.
There are many techniques used to compress digital data. This assignment implements the Huffman coding algorithm.
Several algorithms for data compression have been patented. To use the MP3 codec (which uses Huffman encoding as one of the steps of the algorithm) requires a license from mp3licensing.com which coordinates licensing.
Huffman coding was invented by David Huffman while he was a graduate student at MIT in 1950 when given the option of a term paper or a final exam. For details see this 1991 Scientific American Article. In an autobiography Huffman had this to say about the epiphany that led to his invention of the coding method that bears his name:
"-- but a week before the end of the term I seemed to have nothing to show for weeks of effort. I knew I'd better get busy fast, do the standard final, and forget the research problem. I remember, after breakfast that morning, throwing my research notes in the wastebasket. And at that very moment, I had a sense of sudden release, so that I could see a simple pattern in what I had been doing, that I hadn't been able to see at all until then. The result is the thing for which I'm probably best known: the Huffman Coding Procedure. I've had many breakthroughs since then, but never again all at once, like that. It was very exciting."
Huffman's original paper is available, though it's a tough read. The Wikipedia reference is extensive, as is this online material developed as one of the original Nifty Assignments. Both JPEG and MP3 encodings use Huffman coding as part of their compression algorithms. In this assignment you'll implement a program to compress and uncompress files using Huffman coding.
For this assignment you'll build a program that performs two related tasks: compressing (huff) files and uncompressing (unhuff) files that are compressed by the first task. This is done via a single program with the choice of compressing a file or uncompressing a file specified by choosing a menu-option in the GUI front-end to the code you write. Abstractly you're writing a program to read an input file and create a corresponding output file --- either from uncompressed to compressed or vice versa.
The Huff class is a simple main that launches a GUI with an IHuffProcessor implementation. The implementation corresponds to an object oriented model named the model-view architecture or pattern. In this pattern the view/GUI makes calls on the model's methods, which in turn may display information in the view/GUI. The code you write will also create files of compressed or uncompressed data when the GUI front-end calls methods you will write. You'll implement methods and store state in your IHuffProcessor implementation so that it can either compress/huff or uncompress/unhuff. You should implement additional classes to break the problem up into smaller, more manageable pieces and prevent repeated code. You should have a class that models a Huffman code tree, a priority queue, and separate classes to manage the details of compression (Compressor) and decompression.
With these additional classes the
The howto for this assignment contains a lot of detail and help.
The resulting program will be a complete and useful compression program although not, perhaps, as powerful as other programs such as winrar or zip which use slightly different algorithms resulting in a higher degree of compression than Huffman coding.
Important Constraints: These will make more sense after you get going on the assignment. These constraints are necessary to test and grade your program.
[[X, 1], [A, 6], [E, 6], [S, 10]] -> now enqueue [Y, 6]. This value must come after the [E, 6], but before the [S,10]
resulting priority queue after enqueuing [Y, 6]
[[X, 1], [A, 6], [E, 6], [Y, 6], [S, 10]]
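The tie-breaking rule in the example above can be sketched as a list-backed queue that inserts a new item after every existing item with an equal or smaller frequency. This is only a minimal illustration: the names FairPriorityQueue and Entry are hypothetical and are not part of the provided classes, and your own design may differ.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical names: FairPriorityQueue and Entry are illustrative only,
// not classes from the provided starter code.
class FairPriorityQueue {
    /** A value paired with its frequency; lower frequency means higher priority. */
    static final class Entry {
        final String value;
        final int freq;

        Entry(String value, int freq) {
            this.value = value;
            this.freq = freq;
        }

        @Override
        public String toString() {
            return "[" + value + ", " + freq + "]";
        }
    }

    private final List<Entry> items = new ArrayList<>();

    /** Inserts e after every entry whose frequency is <= e.freq, so ties keep arrival order. */
    void enqueue(Entry e) {
        int i = 0;
        while (i < items.size() && items.get(i).freq <= e.freq) {
            i++;
        }
        items.add(i, e);
    }

    /** Removes and returns the front entry (lowest frequency, earliest arrival among ties). */
    Entry dequeue() {
        return items.remove(0);
    }

    int size() {
        return items.size();
    }

    @Override
    public String toString() {
        return items.toString();
    }

    public static void main(String[] args) {
        FairPriorityQueue pq = new FairPriorityQueue();
        pq.enqueue(new Entry("X", 1));
        pq.enqueue(new Entry("A", 6));
        pq.enqueue(new Entry("E", 6));
        pq.enqueue(new Entry("S", 10));
        pq.enqueue(new Entry("Y", 6));
        System.out.println(pq); // prints [[X, 1], [A, 6], [E, 6], [Y, 6], [S, 10]]
    }
}
```

Note that the comparison is `<=` rather than `<`; using `<` would place [Y, 6] before [A, 6], which breaks the required ordering.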
The howto compression section has complete information on how to create a compressed file. Create a Huffman tree to derive per-chunk encodings, then write bits based on these encodings. The Huff main program has a GUI front-end whose menu offers four choices: count chunks, compress, uncompress, and quit.
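Deriving per-chunk encodings from a Huffman tree, as mentioned above, usually comes down to a traversal that appends '0' when going left and '1' when going right. The sketch below assumes a hypothetical Node class and represents codes as Strings for readability; neither is taken from the provided classes, and the left-is-0 convention is an assumption you should check against the howto.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical names: CodeTableSketch and Node are illustrative only.
class CodeTableSketch {
    static final class Node {
        final int value;        // chunk value; only meaningful in a leaf
        final Node left;
        final Node right;

        Node(int value, Node left, Node right) {
            this.value = value;
            this.left = left;
            this.right = right;
        }

        boolean isLeaf() {
            return left == null && right == null;
        }
    }

    /** Returns a map from chunk value to its code, written as a String of '0' and '1'. */
    static Map<Integer, String> buildCodes(Node root) {
        Map<Integer, String> codes = new TreeMap<>();
        collect(root, "", codes);
        return codes;
    }

    // Assumption: a left branch contributes '0' and a right branch contributes '1'.
    private static void collect(Node n, String path, Map<Integer, String> codes) {
        if (n == null) {
            return;
        }
        if (n.isLeaf()) {
            codes.put(n.value, path);
            return;
        }
        collect(n.left, path + "0", codes);
        collect(n.right, path + "1", codes);
    }

    /** A tiny hand-built tree: 'a' on the left, and 'b' and 'c' under the right child. */
    static Node sampleTree() {
        Node a = new Node('a', null, null);
        Node b = new Node('b', null, null);
        Node c = new Node('c', null, null);
        return new Node(-1, a, new Node(-1, b, c));
    }

    public static void main(String[] args) {
        for (Map.Entry<Integer, String> e : buildCodes(sampleTree()).entrySet()) {
            System.out.println((char) e.getKey().intValue() + " -> " + e.getValue());
        }
    }
}
```

In the sample tree 'a' sits one level down and gets a one-bit code, while 'b' and 'c' sit two levels down and get two-bit codes, which is the usual pattern: more frequent chunks end up closer to the root.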
The count option in the GUI counts chunks, generates Huffman codes, and determines the number of bits in the resulting file if it were to be written using the new Huffman codes determined by this method along with all required header data. This option does not actually write any data out to a new file. Note, to determine the number of bits saved, the number of bits written includes ALL bits that will be written including the magic number, the header format number, the header to reproduce the tree, AND the actual data.
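As a rough illustration of the counting step, the sketch below tallies 8-bit chunks from a byte array and then adds the pseudo-EOF entry with a frequency of 1. The class name ChunkCounter is hypothetical, the constant values shown are assumptions (take the real ones from IHuffConstants), and a byte array stands in for whatever bit-level input stream the provided classes supply.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical class; a plain byte[] stands in for the provided bit input stream.
class ChunkCounter {
    // Assumed values for illustration; use the constants in IHuffConstants instead.
    static final int ALPH_SIZE = 256;
    static final int PSEUDO_EOF = ALPH_SIZE;

    /** Returns frequencies indexed by chunk value, with one extra slot for the pseudo-EOF. */
    static int[] countChunks(byte[] data) {
        int[] freqs = new int[ALPH_SIZE + 1];
        for (byte b : data) {
            freqs[b & 0xFF]++;   // mask so bytes 128..255 index correctly
        }
        freqs[PSEUDO_EOF] = 1;   // the pseudo-EOF always has a frequency of 1
        return freqs;
    }

    public static void main(String[] args) {
        byte[] data = "Eerie eyes seen near lake.".getBytes(StandardCharsets.US_ASCII);
        int[] freqs = countChunks(data);
        for (int i = 0; i < ALPH_SIZE; i++) {
            if (freqs[i] > 0) {
                System.out.println("'" + (char) i + "' occurs " + freqs[i] + " time(s)");
            }
        }
    }
}
```

The `& 0xFF` mask matters for files of any type: Java bytes are signed, so without it any chunk value above 127 would produce a negative array index.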
The compress option performs the same operations as the count characters option, but also writes the data out to a file. Note, the compress option in the HuffViewer will first call the preprocessCompress method BEFORE calling the compress method.
SimpleHuffProcessor has a reference back to the HuffViewer. It shall make calls to the showError, showMessage, and update methods to display the status, codes, and other information. Don't use System.out.println; use the viewer's methods instead.
Note, many students fail grading test cases, because they do not return the correct number from the preprocessCompress, compress, and uncompress methods. The preprocessCompress method must return the number of bits in the original file - the number of bits written to the compressed file (including the Huffman magic number, the header constant, the header, the data, and the pseudo-eof, but NOT any padding added by the file system to get the file size up to a multiple of 8.) The compress method returns the number of bits actually written. The uncompress method returns the number of bits written to the output file.
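The return-value arithmetic described above can be sketched as follows. The class and method names are hypothetical, and `headerBits` is an assumed parameter standing for the magic number, the header constant, and the header itself; the frequency and code-length arrays are assumed to include the pseudo-EOF entry.

```java
// Hypothetical helper; names and parameters are illustrative only.
class BitCountSketch {
    /**
     * Total bits that compress would write, excluding padding.
     * headerBits is assumed to cover the magic number, header constant, and header.
     * freqs and codeLengths are parallel arrays that include the pseudo-EOF entry.
     */
    static long totalBitsWritten(long headerBits, int[] freqs, int[] codeLengths) {
        long total = headerBits;
        for (int i = 0; i < freqs.length; i++) {
            total += (long) freqs[i] * codeLengths[i];
        }
        return total;
    }

    /** Value preprocessCompress should return: original bits minus bits written. */
    static long bitsSaved(long originalBits, long bitsWritten) {
        return originalBits - bitsWritten;
    }

    public static void main(String[] args) {
        // Small sample file from this page: 208 original bits,
        // 8346 bits written with the Standard Count Format.
        System.out.println(bitsSaved(208, 8346)); // prints -8138
    }
}
```

A negative result is legitimate: for a tiny file the header outweighs the savings, so the "compressed" version is larger than the original.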
The uncompress option is used to decompress a file.
To uncompress a file your program previously compressed you'll need to read header information from the compressed file your program (or other versions of the Huffman compressor) creates. The header information is data used to recreate the Huffman tree that was originally used to compress the data. Your code will then read one-bit-at-a-time to uncompress the data and recreate the original file. There's information in the howto uncompress section on doing this.
Read the header information to recreate the tree, then do a tree-walk one bit at a time to find the characters stored in the leaves of the Huffman tree. Each time you find a leaf, write the value to the output (and if debugging, to the GUI). This process recreates the original, uncompressed file.
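The tree walk described above might look like the following sketch. To keep it self-contained, the bits are given as a String of '0' and '1' characters rather than read from a bit stream, the Node class is hypothetical, and the PSEUDO_EOF value of 256 is an assumption (use the constant from IHuffConstants). Throwing an exception when the bits run out is one possible choice, not necessarily what your program should do.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical names: DecodeSketch and Node are illustrative only.
class DecodeSketch {
    static final int PSEUDO_EOF = 256; // assumed value; see IHuffConstants

    static final class Node {
        final int value;
        final Node left;
        final Node right;

        Node(int value, Node left, Node right) {
            this.value = value;
            this.left = left;
            this.right = right;
        }

        boolean isLeaf() {
            return left == null && right == null;
        }
    }

    /** Walks the tree one bit at a time; stops at the pseudo-EOF leaf and ignores later bits. */
    static byte[] decode(Node root, String bits) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        Node current = root;
        for (int i = 0; i < bits.length(); i++) {
            current = (bits.charAt(i) == '0') ? current.left : current.right;
            if (current.isLeaf()) {
                if (current.value == PSEUDO_EOF) {
                    return out.toByteArray(); // anything after this is padding
                }
                out.write(current.value);
                current = root;               // start over for the next code
            }
        }
        throw new IllegalStateException("ran out of bits before reaching the pseudo-EOF");
    }

    /** Hand-built tree with codes a=0, b=10, c=110, pseudo-EOF=111. */
    static Node sampleTree() {
        Node a = new Node('a', null, null);
        Node b = new Node('b', null, null);
        Node c = new Node('c', null, null);
        Node eof = new Node(PSEUDO_EOF, null, null);
        return new Node(-1, a, new Node(-1, b, new Node(-1, c, eof)));
    }

    public static void main(String[] args) {
        // a, b, c, a, pseudo-EOF, followed by three padding bits
        byte[] result = decode(sampleTree(), "0" + "10" + "110" + "0" + "111" + "000");
        System.out.println(new String(result)); // prints abca
    }
}
```

The trailing "000" in the usage example plays the role of padding bits: because the walk returns as soon as it reaches the pseudo-EOF leaf, they never affect the output.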
Run the program HuffMark, which will read every file in a directory and compress it to another file in the same directory with a ".hf" suffix. The HuffMark program does not provide a viewer for your SimpleHuffProcessor, so you have to comment out any calls to viewer.showMessage. You can leave in the calls to showError, assuming you won't suffer any errors. The method calls may be in other classes if you created separate classes.
You may want to modify this benchmarking program to print more data than it currently does. You may also want to run it on the calgary directory, which represents the Calgary Corpus, a standard compression suite of files for empirical analysis (see this reference for comparisons of the Calgary Corpus); on the waterloo directory, which is a collection of .tiff images used in some compression benchmarking; and on the BooksAndHTML directory, which contains a number of text files and HTML documents. All of those collections are in zip files which you must download and unzip. You can, of course, run on other data/collections.
The benchmarking program skips files with .hf suffixes, but you may want to eventually remove this restriction. (In other words, what happens if you take a compressed file and try to compress it again?) In your assignment write-up discuss your benchmark results and provide some insight as to their meaning. Your analysis is worth 20% of your grade on this assignment, so you should try to come to some conclusions in addition to simply listing your results.
Some Results From my version of the program.
This small sample file (that contains "Eerie eyes seen near lake.") is 26 bytes or 208 bits.
When I "compress" it with the Standard Count Format (see the howto document for an explanation) the resulting size is 8346 bits. (The resulting file is 8352 bits, not 8346 bits. Why? 8346 bits is 1043.25 bytes, or 1043 bytes and 2 bits. The file must be written as whole bytes, so there are 6 bits of padding. This is the whole purpose of the pseudo-eof character: to know when the real data has ended and to ignore whatever padding may follow.)
When I "compress" it with the Standard Tree Format the resulting size is 328 bits.
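The padding arithmetic for the small sample file can be checked with a few lines of integer math. This is only a sketch with hypothetical names; it rounds a bit count up to the next multiple of 8, which is how the 8346 bits written become an 8352 bit file.

```java
// Hypothetical helper for checking file sizes by hand.
class PaddingSketch {
    static final int BITS_PER_BYTE = 8;

    /** Rounds bitsWritten up to the next multiple of 8 (the size of the file on disk, in bits). */
    static long paddedBits(long bitsWritten) {
        return (bitsWritten + BITS_PER_BYTE - 1) / BITS_PER_BYTE * BITS_PER_BYTE;
    }

    /** Number of filler bits added after the real data to complete the last byte. */
    static long paddingBits(long bitsWritten) {
        return paddedBits(bitsWritten) - bitsWritten;
    }

    public static void main(String[] args) {
        long written = 8346; // Standard Count Format result for the small sample file
        System.out.println(paddedBits(written));                 // prints 8352
        System.out.println(paddingBits(written));                // prints 6
        System.out.println(paddedBits(written) / BITS_PER_BYTE); // prints 1044
    }
}
```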
The 2008 CIA Factbook is 9,637,228 bytes or 77,097,824 bits.
When I compress it with the Standard Count Format Header the compressed file is 48,230,392 bits.
When I compress it with the Standard Tree Format Header the compressed file is 48,223,298 bits.
When I run HuffMark on the BooksAndHTML directory I get these results:
My solution to the problem, which supports the
standard count headers and
the tree headers, consists of 4 classes in addition to the
with about 650 lines including code, blank lines, lines with single braces, and
comments. (lots of comments)
A jar file with 10 classes to use in the assignment. You will make
significant changes to
||Provided by me and you|
|Guide||The Huffman HOWTO page is a guide to help you complete the assignment. It gives an overview of the various parts of the assignment.||Provided by me|
|Documentation||Documentation for the provided, non standard Java classes. You will have to refer to the documentation a great deal.||Provided by me|
|Small Test File||
The small text file contains "Eerie eyes seen near lake."
SmallTxtTreeHeader.txt.hf. Small sample file to use when
incrementally developing your program.
smallTextsFreqsAndCodes.txt contains the expected frequencies and
Huffman codes for this file.
Here is the complete small text file with 1s and 0s shown as chars, not stored as bits:
You can use this ExplicitBitOutputWriter.java class to convert any file to a file with ASCII 0's and 1's. This is useful when looking at small files. You can compare files in Eclipse to find differences. See this page for instructions.
|Provided by me|
|Large Test File||
The 2008 CIA Factbook. The
ciaFactbook2008FreqsAndCodes.txt contains the expected frequencies
and Huffman codes for this file, plus the standard tree for that file.
ciaFactbook2008.txt.hf is the compressed file using the standard
ciaFactbook2008_stf.txt.hf is the compressed file using the standard
||Provided by me|
|Large Test Files||Provided by me|
|Test Framework and Files||
To be sure your program is correct use this tester file:
A11Test_Huffman.java. This is the exact same test harness we will
use when testing your submission, but we will use different files. Instead of using the
HuffViewer with the
GUI we use this stub version when running the test harness. I
recommend you create a separate package in your Eclipse project for
these test files OR a whole different project from your working code.
To use the harness you need to download the bevotest.jar file from John Thywissen (a former CS314 TA who wrote the testing framework we use to grade programs). bevotest.jar is a collection of code you need for the A11Test_Huffman to work. (The jar contains various .class files.) See this page for info on how to include a jar in your Eclipse project.
Finally you need the test files in this zip file. Download and unzip the file to your Eclipse project folder.
|Provided by me|
|Files for Analysis||calgary.zip, waterloo.zip, and BooksAndHTML.zip zip files with lots of large files to test your finished program. Unzip the files and use the HuffMark program / class to test your solution.||Links provided by me|
|All files||Here is a jar with most of the files above (except the howto, the documentation, and the bevo tester files) if you want to avoid downloading each individually. A11_Files.jar||Provided by me|
|Write Up||README.txt The write up shall include an estimate of how long you spent on the program and what your results were. The bulk of the write up will be an analysis of your empirical data. What kinds of files lead to lots of compression? What kinds of files had little or no compression? What happens when you try to compress a Huffman coded file?||Provided by you|
|Submission||Submit all of the .java source code files, including the ones I provide from HuffmanStarter.jar, and your README.txt in a jar named A11.jar.||Provided by you|
Submission Requirement: Because you are designing the classes for this program, we don't know what the class names will be. Therefore you shall submit all your files as a non executable jar file. jar is a program included in the standard edition of Java loaded on the computers in the Microlab and that you have on your computer if you downloaded Java. Some IDEs provide the ability to create jar files.
Here is a link to how to create jar files in Eclipse.
The archive tool we use in the class is called jar. (Java ARchive) I strongly recommend learning to use that tool well before the due date. In the past students have had trouble turning in the assignment on time because they did not allow enough time to learn to use jar. jar can be used to create executable Java programs and to archive multiple files into a single file. You will be using jar to create an archive, not to create an executable, so there is no need to include the .class files.
IMPORTANT: You must include the .java files in the jar, not the .class files. If your jar does not contain your source code (your .java files), you and your partner will get a 0 on the assignment. It is vital that you do not wait until the last minute to learn how to create jar files and that you verify the jar file contains your .java files before turning it in.
If you need help creating a jar see the instructor, a TA, or a proctor during lab hours.
When finished turn in a jar file named A11.jar that contains all the .java files needed for your program to be compiled and run. Include all the source code for all the classes you created, plus ALL of the provided .java files and README.txt. (Don't turn in the .class files.) Do not include any directory structure. Do not include a src folder. This is not an executable jar, simply an archive jar. We must be able to unjar your code and run the program without having to add ANY other files or make any changes.
Use the turnin program to turn the file in. If you work with a partner turn in only a single version of the code. Do not turn in the jar to both accounts. Pick one person's account and turn the file in to that account only.
Checklist: Did you remember to: