Data Compression

 

This exercise covers file compression using keyword encoding. Several files associated with this exercise are in the same directory.

wordList.cpp   A file containing a C++ program that produces a list of the unique words in a file and the number of times each appears.

words.dat        The output from program WordList with the words sorted by number of occurrences.

history.in         A data file containing 3436 non-blank characters, which was the input to the program.


Program WordList is case sensitive; words beginning with an uppercase letter are considered different from the same word beginning with a lowercase letter.

 
          letters[count] = tolower(letter);
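The fragment above appears to lowercase each letter as it is stored, which is one way to make the word count case insensitive. A minimal sketch of the same idea applied to a whole word follows. The helper name toLowerCopy is hypothetical and is not taken from wordList.cpp.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper (not from wordList.cpp): returns a lowercase copy of word.
std::string toLowerCopy(const std::string& word) {
    std::string result;
    result.reserve(word.size());
    // Reading each character as unsigned char avoids undefined behavior
    // when std::tolower receives a negative char value.
    for (unsigned char ch : word) {
        result.push_back(static_cast<char>(std::tolower(ch)));
    }
    return result;
}
```

With this helper, toLowerCopy("The") and toLowerCopy("the") both yield "the", so the two spellings would be counted as one word.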

Program WordList ignores words of fewer than three characters. Would it be better to ignore words of fewer than four characters? Recalculate the compression ratio without encoding words of fewer than four characters.