CS303E Assignment 9

Due Friday, April 26th by 11pm

Warning: This assignment is algorithmically difficult. Start early, and be certain to create an algorithm before you begin programming. As a reminder, you can get help in office hours.

Provided data file: TexasCountyPop2010.txt


Pair Assignment: You may work in pairs for this assignment, but you must follow the Pair Programming Guidelines if you do. You may only work with another person in the same lecture as you. If you work in a pair, you will receive two extra credit points on your grade for this assignment.

If working in a pair the intent is that you spend at least 80% of your time working together on the problem. You should not do any coding without the other person present.

Pair programming means working on an assignment together, sitting at the same computer. One person drives (types at the keyboard) while the other person sits next to the driver, observing, commenting, and making suggestions. This form of programming can speed up the programming process by having a helper immediately available. It also tends to reduce errors in the code because there are two sets of eyes examining it: if the driver tries a questionable solution, the watcher will hopefully question the faulty logic and a better solution will be found. Pair programming does not mean taking an assignment, partitioning it into two pieces, and then having each person complete a piece.

If you start working with a partner, you must complete the assignment with that partner unless you both start from scratch. You cannot start working together, produce some code, and then finish the assignment separately. Shared code turned in by two separate individuals will be flagged by our plagiarism detection software. You may not copy code from other pairs, individuals, or other sources, including the web.


Background: Benford's law, also known as the first-digit law, states that, given a list of numbers from a real-world data set, the distribution of leading digits is often not uniform and is instead skewed towards 1. Consider, for example, the populations of the counties in Texas from the 2010 census. Most people would guess there are roughly equal numbers of populations that start with 1, 2, 3, 4, 5, 6, 7, 8, and 9. (We don't consider 0 a leading digit for this assignment.) Given there are 254 Texas counties, you might expect about 28 counties (254 / 9 ≈ 28.2) to have a leading digit of 1, about 28 to have a leading digit of 2, and so forth. Travis County (home to UT) has a population of 1,024,266, which has a leading digit of 1.

Our intuition is often wrong. The breakdown of leading digits for populations of Texas counties according to the 2010 census is:

Leading Digit Number of Counties Percentage
1 80 31.5
2 38 15.0
3 41 16.1
4 26 10.2
5 15 5.9
6 15 5.9
7 17 6.7
8 13 5.1
9 9 3.5

Not what you expected! Populations with a leading digit of 1 occur almost one third of the time, not one ninth as most people would guess. Benford's law does not hold for all data sets (for example, the heights of humans in inches), but it does hold for a surprisingly large number of real-world measurements.
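For comparison, Benford's law is usually stated as a formula: the expected fraction of values with leading digit d is log10(1 + 1/d), which is about 30.1% for a leading digit of 1. The sketch below is not part of the required program; the function name benfordExpected is made up for illustration.

```python
import math

def benfordExpected():
    """Return the expected percentages for leading digits 1 through 9."""
    # Benford's law: P(d) = log10(1 + 1/d). Multiply by 100 for a percentage.
    return [100 * math.log10(1 + 1 / d) for d in range(1, 10)]

# Print each digit next to its expected percentage, rounded to one decimal place.
for digit, pct in enumerate(benfordExpected(), start=1):
    print(digit, round(pct, 1))
```

Comparing these values with the county table above shows how closely the census data follows the law.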


Data Files:

Write a program that tests Benford's law for two files. One is the provided file of Texas county populations from the 2010 census, TexasCountyPop2010.txt.

The file format is one entry per line. The format of each line is:

[LABEL]\t[NUMBER]\n

[LABEL] is 1 or more characters. A label may contain any characters other than a tab or newline, including spaces. The label is followed by a single tab. [NUMBER] is an integer greater than 0. Numbers consist only of the digits 0 through 9 and may not start with 0. There are no commas or any other characters in the number other than the digits 0 through 9. Immediately after [NUMBER] is a newline character.
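As a rough illustration of the format, one entry can be pulled apart by removing the trailing newline and splitting on the tab. The helper name splitEntry is hypothetical and is not one of the required functions; this is only one possible approach.

```python
def splitEntry(entry):
    """Split one LABEL<tab>NUMBER entry into (label, number), both strings."""
    # rstrip("\n") removes only a trailing newline, so spaces inside the
    # label are left alone. The format guarantees exactly one tab per entry.
    label, number = entry.rstrip("\n").split("\t")
    return label, number

label, number = splitEntry("Travis County\t1024266\n")
print(label)      # the label, spaces included
print(number[0])  # the leading digit, as a one-character string
```

Keeping the number as a string makes the leading digit simply its first character.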

You must create the other file yourself based on a real-world data source. Name the file "StudentData.txt". The format must be the same as shown above. You cannot make up a data set: the file must be based on a real-world data set that consists of integers and has 250 or more entries.


Program: Name your program Benford.py. Complete the following functions:

getData(fileName): Creates and returns a list of strings with the entries from the data file with name fileName. The elements of the list are in the same order as they appear in the data file.
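A minimal sketch of getData, assuming the file exists and can be read as text. It is one possible approach, not the required one.

```python
def getData(fileName):
    """Return the lines of the named file as a list of strings, in file order."""
    with open(fileName, "r") as dataFile:
        # readlines keeps the trailing newline on each line, which the
        # description of getLeadDigitCounts allows for.
        return dataFile.readlines()
```

Calling getData("TexasCountyPop2010.txt") on the provided file should return a list with one string per county.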

getLeadDigitCounts(data): data is a list with the entries from a file. Each element in the list is a string with the label and the number separated by a single tab. There may be a newline character at the end of the string. This function returns a list of integers of length 9. The first element of the list (index 0) stores the number of elements in the list data that have a number with a leading digit of 1, the second element of the list (index 1) stores the number of elements in data that have a number with a leading digit of 2, and so forth.
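One way getLeadDigitCounts might be written is sketched below. It assumes every entry follows the file format exactly (one tab, number never starting with 0) and does no error checking.

```python
def getLeadDigitCounts(data):
    """Return a list of 9 counts; index 0 is for digit 1, index 8 for digit 9."""
    counts = [0] * 9
    for entry in data:
        # The number is everything after the single tab. Labels may contain
        # digits too, so the label's characters must not be examined.
        number = entry.rstrip("\n").split("\t")[1]
        # Leading digit d is counted at index d - 1.
        counts[int(number[0]) - 1] += 1
    return counts

print(getLeadDigitCounts(["Archer County\t9054\n", "Travis County\t1024266\n"]))
```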

showResults(counts): counts is a list of integers with length 9. It represents the count of leading digits. The function displays the total number of data points and for each leading digit the number of data points and the percentage of total data points with that leading digit rounded to one decimal place. Your output must match the output shown below.
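The sketch below shows one way to produce the table. The column widths (6 characters for the digit, 7 for the count) are an assumption inferred from the spacing in the sample run, and the code assumes counts contains at least one data point so the total is not zero.

```python
def showResults(counts):
    """Print the total and, for each leading digit, its count and percentage."""
    total = sum(counts)
    print("number of data points:", total)
    print()
    print("digit number percentage")
    for i, count in enumerate(counts):
        pct = 100 * count / total
        # Left-justify the digit and the count; one decimal place for the percentage.
        print(f"{i + 1:<6}{count:<7}{pct:.1f}")

# Counts taken from the Texas county table in the Background section.
showResults([80, 38, 41, 26, 15, 15, 17, 13, 9])
```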

showLeadingDigits(digit, data): digit is a string of length 1 that contains a digit character '1' to '9'. data is a list with the entries from a file. The function prints all of the entries in data whose number has a leading digit equal to the parameter named digit. They are printed in the order they appear in data. Your output must match the output shown below.
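A possible sketch of showLeadingDigits follows. Two things are assumptions read off the sample run: the label and number are printed separated by a single space, and the blank line before the heading is left to the caller.

```python
def showLeadingDigits(digit, data):
    """Print every entry in data whose number starts with the given digit."""
    print("Showing data with a leading", digit)
    for entry in data:
        label, number = entry.rstrip("\n").split("\t")
        # digit is a one-character string, so compare it to a character.
        if number[0] == digit:
            print(label, number)

showLeadingDigits("9", ["Archer County\t9054\n", "Travis County\t1024266\n"])
```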

processFile(name): name is a string that is the name of a file that matches the format of our data files. This function calls getData, getLeadDigitCounts, and showResults for the given file. The function then prompts the user for a digit and calls showLeadingDigits with that input. The program does not error check the input.

The main function must call processFile twice, once with TexasCountyPop2010.txt and once with your StudentData.txt file.

Here is a sample run of the program, but only showing the result for TexasCountyPop2010.txt.

number of data points: 254

digit number percentage
1     80     31.5
2     38     15.0
3     41     16.1
4     26     10.2
5     15     5.9
6     15     5.9
7     17     6.7
8     13     5.1
9     9      3.5

Enter leading digit: 9

Showing data with a leading 9
Archer County 9054
Bowie County 92565
Brewster County 9232
Dimmit County 9996
Jack County 9044
Mitchell County 9403
Roberts County 929
Stephens County 9630
Terrell County 984

Name your program Benford.py and your data file StudentData.txt. Your program must start with the following header:

# File: --name of file--
# Description: --a description of your program--
# Assignment Number:
#
# Name 1: --your name or first person's name--
# EID 1: --your UTEID or first person's UTEID--
# Name 2: --second person's name if applicable--
# EID 2: --second person's UTEID if applicable--
# Course Name: CS303E
#
# Unique Number 1: --your section number or first person section number--
# Unique Number 2: --second person's section number if applicable--
#
# Date created:
# Date last modified:
#
# Slip days used this assignment:
# Total slip days used:

Use the turnin program to submit your program and your data file. The files must be turned in by April 26th at 11pm. If you use slip days, please notify your TA when you turn in your file. Your program will be graded based on the following criteria:
Correctness: Does the program pass the provided and additional test cases?
Testing: Did the student provide the required file with the correct format, and run the program on this file?
Documentation: Are the functions and complicated code segments documented (especially the complicated code)?
Design/Efficiency: Points will be deducted for convoluted solutions and excessively long functions. Students are expected to modularize the solution.

If you used pair programming, both partners must have enough slip days left to cover any slip days you use. (e.g., If you use two, both partners must have at least two left to use.) Turn the assignment in using just one partner's turnin account. The grader will grade it and enter the grade for both partners.

Did you remember to: