Programming Assignment 11

Extensive Vocabulary (Due 13 November 2009)

For this assignment, we would like to study the representative works of two authors and ask ourselves the question "Whose vocabulary is more extensive?"

The authors that we will be looking at are Charles Dickens and Thomas Hardy. The representative novels that we will analyze are A Tale of Two Cities and The Return of the Native, respectively. Here are the steps in our analysis.

Step I: Go to Project Gutenberg and download the plain text versions of both books. Call the first book Tale.txt and the second book Return.txt. Open the books in a text editor and delete the preamble at the beginning of each book and the license agreement at the end. The first book in Tale.txt should begin and end with these lines:

A Tale of Two Cities

by Charles Dickens

...

it is a far, far better rest that I go to than I have ever known."

The second book in Return.txt should begin and end with these lines:

THE RETURN OF THE NATIVE

by Thomas Hardy

...

kindly received, for the story of his life had become generally known.

Step II: The program that you will be writing will be called Books.py. The following is the suggested structure. You do not have to adhere to it. However, we will be looking for good documentation, good design, and adherence to the coding conventions discussed in class. Use meaningful variable names in your program.


# Removes punctuation marks from a string
def parseString (st):
  pass  # to be written

# Returns a dictionary of words and their frequencies
def getWordFreq (file):
  pass  # to be written

# Compares the distinct words in two dictionaries
def wordComparison (author1, freq1, author2, freq2):
  pass  # to be written


def main():
  # Enter names of the two books in electronic form
  book1 = raw_input ("Enter name of first book: ")
  book2 = raw_input ("Enter name of second book: ")
  print

  # Enter names of the two authors
  author1 = raw_input ("Enter last name of first author: ")
  author2 = raw_input ("Enter last name of second author: ")
  print 
  
  # Get the frequency of words used by the two authors
  wordFreq1 = getWordFreq (book1)
  wordFreq2 = getWordFreq (book2)

  # Compare the relative frequency of uncommon words used
  # by the two authors
  wordComparison (author1, wordFreq1, author2, wordFreq2)

main()

Step III: You will have to get the frequency of words in each text to start with. We have already developed a piece of code to do that. But here is some additional processing that you will have to do in the function getWordFreq().

Open the file for reading
Create an empty dictionary
Read line by line through the file
For each line, strip the end-of-line character and remove punctuation by calling parseString(). In parseString(), create a new empty string and go through the input string character by character. Replace each hyphen (-) with a space, keep only letters (using isalpha()) and whitespace (using isspace()), and drop everything else. Return the new string. Then split the cleaned line into words and update the frequency of each word in the dictionary.
After the dictionary is created, close the input file.
Remove all words that start with a capital letter, as follows:

Go through the words in the dictionary. For each word check if it starts with a capital letter.
If it does, check if the lower case version of that word exists in the dictionary. If it exists, then add the capitalized word's frequency to the lower case word's frequency.
Add the word starting with a capital letter to a list, whether or not its lower case version exists.
After you have checked for all capitalized words, remove all those words in the above list and their frequencies from the word frequency dictionary.
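
The steps above can be sketched as follows. This is only one possible arrangement, not the required solution. The helper names parse_string, count_words, and fold_capitalized are hypothetical and differ from the suggested skeleton; count_words takes any iterable of lines (an open file works) so that it can be tried without the full novels. The code avoids print and raw_input, so it should run unchanged under either Python 2 or Python 3.

```python
def parse_string(st):
    """Return st with hyphens turned into spaces and other punctuation dropped."""
    cleaned = ""
    for ch in st:
        if ch == "-":
            cleaned += " "            # a hyphen separates two words
        elif ch.isalpha() or ch.isspace():
            cleaned += ch             # keep letters and whitespace only
    return cleaned


def count_words(lines):
    """Build a word -> frequency dictionary from an iterable of text lines."""
    word_freq = {}
    for line in lines:
        for word in parse_string(line.rstrip("\n")).split():
            word_freq[word] = word_freq.get(word, 0) + 1
    return word_freq


def fold_capitalized(word_freq):
    """Merge capitalized words into their lower case entries, then delete them.

    Returns the list of removed words so that it can be examined later.
    """
    capitalized = []
    for word in word_freq:
        if word[0].isupper():
            lower = word.lower()
            if lower in word_freq:
                # Only values change here, so iterating over the dictionary is safe.
                word_freq[lower] += word_freq[word]
            capitalized.append(word)
    # Delete in a second pass, after the iteration has finished.
    for word in capitalized:
        del word_freq[word]
    return capitalized
```

For example, counting the two made-up lines "The well-known cat sat.\n" and "the Cat saw London\n" and then folding gives the = 2 and cat = 2, while London is dropped because it never appears in lower case.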

You should now have a dictionary of words in lower case and their frequencies. You will have removed all proper names of people and places. You will also have removed those words that appear only as the first word of a sentence, whether they occur once or many times. That number should be small compared to the total number of words that we are dealing with in those novels. You can always write the list of words beginning with a capital letter to a file and examine that file.

Step IV: In this step you will be working on the function wordComparison(). First you will get some statistics of the two novels separately and then you will compare the two together. For each novel compute and print the following pieces of information:

Print the number of distinct words used, i.e. the number of words used if you remove the duplicates. Note that this is just the length of the list of keys of the word frequency dictionary.
Compute and print the total number of words used by adding all the frequencies together.
Calculate and print the percentage of distinct words to the total number of words used.
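
A minimal sketch of these three computations, assuming the word frequency dictionary has already been built. The function name novel_stats is hypothetical; it returns the values rather than printing them, so the printing format is left to you.

```python
def novel_stats(word_freq):
    """Return (distinct words, total words, percentage of distinct to total)."""
    distinct = len(word_freq)            # one key per distinct word
    total = sum(word_freq.values())      # add all the frequencies together
    # 100.0 forces floating point division in Python 2 as well as Python 3.
    # This assumes the dictionary is not empty; otherwise total is zero.
    ratio = 100.0 * distinct / total
    return distinct, total, ratio
```

With the made-up dictionary {"the": 3, "cat": 1}, this returns 2 distinct words, 4 total words, and a ratio of 50.0.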

You will create two sets with the list of keys from the two word frequency dictionaries. Let us call these sets D and H for the two authors respectively. The set difference D - H represents all the words that Dickens used that Hardy did not. The set difference H - D represents all the words that Hardy used that Dickens did not. For each of these set differences print the following pieces of information:

The number of words in that set difference.
Compute the total frequency of the words in the set difference (D - H or H - D) and express it as a percentage of the total number of words in that novel, which you found earlier. [In the example below, we computed the sum of the frequencies of the 47 words that Dickens used that Hardy did not and expressed that as a percentage of the 116 words that Dickens used in all.]
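
The set difference computation can be sketched as below. The function name exclusive_words and the tiny dictionaries in the example are made up for illustration and are not taken from the novels. Calling the function with the arguments swapped gives the figures for the other author.

```python
def exclusive_words(freq1, freq2):
    """Describe the words that appear in freq1 but not in freq2.

    Returns (number of such words, their total frequency as a percentage
    of all words counted in freq1).
    """
    only_in_first = set(freq1) - set(freq2)      # e.g. D - H
    exclusive_total = sum(freq1[word] for word in only_in_first)
    total = sum(freq1.values())
    # Assumes freq1 is not empty; otherwise total would be zero.
    return len(only_in_first), 100.0 * exclusive_total / total


# Made-up example data, standing in for the two word frequency dictionaries
dickens = {"the": 3, "guillotine": 1}
hardy = {"the": 2, "heath": 2, "furze": 1}
dickens_only = exclusive_words(dickens, hardy)   # words in D - H
hardy_only = exclusive_words(hardy, dickens)     # words in H - D
```

Here dickens_only is (1, 25.0), since guillotine accounts for 1 of 4 words, and hardy_only is (2, 60.0), since heath and furze account for 3 of 5 words.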

Here are two sample files - dickens.txt and hardy.txt taken from the two novels. Your output for these two sample files should be of the following form:

Enter name of first book: dickens.txt
Enter name of second book: hardy.txt

Enter last name of first author: Dickens
Enter last name of second author: Hardy

Dickens
Total distinct words = 55
Total words (including duplicates) = 116
Ratio (% of total distinct words to total words) = 47.4137931034

Hardy
Total distinct words = 92
Total words (including duplicates) = 122
Ratio (% of total distinct words to total words) = 75.4098360656

Dickens used 47 words that Hardy did not use.
Relative frequency of words used by Dickens not in common with Hardy = 62.9310344828 

Hardy used 84 words that Dickens did not use.
Relative frequency of words used by Hardy not in common with Dickens = 77.0491803279

Here is the output of the frequencies of the words in the two excerpts - dickens.out.txt and hardy.out.txt. You can use these outputs to compare the frequencies that you get.

The above program will have a header of the following form:

#  File: Books.py

#  Description:

#  Student Name:

#  Student UT EID:

#  Course Name: CS 303E

#  Unique Number: 

#  Date Created:

#  Date Last Modified:

This is an exercise in Computer Science and not in linguistics. What we are trying to illustrate is how easy it is to perform these types of computation in Python that lead to interesting observations but not necessarily deep insights.

Use the turnin program to submit your Books.py file. The TAs should receive your work by 5 PM on Friday, 13 November 2009. There will be substantial penalties if you do not adhere to the guidelines. The TA in charge of this assignment is Aibo Tian (atian@cs.utexas.edu).

Your Python program should have the header with the proper documentation.
Your code must run before submission.
You should be submitting your file through the web based turnin program. We will not accept files e-mailed to us.
Here are the grading criteria.