Warning: This assignment is algorithmically difficult. Start early, and be certain to create an algorithm before you begin programming. As a reminder, you can get help in office hours.
Provided data file: TexasCountyPop2010.txt
Pair Assignment: You may work in pairs for this assignment, but you must follow the Pair Programming Guidelines if you do. You may only work with another person in the same lecture as you. If you work in a pair, you will receive two extra credit points on your grade for this assignment.
If working in a pair the intent is that you spend at least 80% of your time working together on the problem. You should not do any coding without the other person present.
Pair programming means working on an assignment
together, sitting at the same computer. One person drives (types at the
keyboard) while the other person sits next to the driver while
observing, commenting, and making suggestions to the driver. This form
of programming can speed up the programming process by having a helper
immediately available. It also tends to reduce errors in the code
because there are two sets of eyes examining the code. It also reduces
errors because if the driver tries a questionable solution, the watcher
will hopefully question the faulty logic and a better solution will be
found. Pair programming does not mean taking an assignment and
partitioning it into two pieces and then having each person complete a
piece.
If you start working with a partner you must complete the assignment with that partner unless you both start from scratch. You cannot start working together, produce some code, and then finish the assignment separately. Shared code turned in by two separate individuals will be flagged by our plagiarism detection software. You may not copy code from other pairs, individuals, or from other sources including the web.
Background: Benford's law, also known as the first-digit law, states: given a list of numbers from a real world data set, the distribution of leading digits is often not uniform and is instead skewed towards 1. Consider for example the populations of the counties in Texas from the 2010 census. Most people would guess there are roughly equal numbers of populations that start with 1, 2, 3, 4, 5, 6, 7, 8, and 9. (We don't consider 0 a leading digit for this assignment.) Given there are 254 Texas counties you might expect there to be about 28 counties (254 / 9 ≈ 28) that have a leading digit of 1, about 28 that have a leading digit of 2, and so forth. Travis County (home to UT) has a population of 1,024,266, a leading digit of 1.
Our intuition is often wrong. The breakdown of leading digits for populations of Texas counties according to the 2010 census is:
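| Leading Digit | Number of Counties | Percentage |
| 1 | 80 | 31.5% |
| 2 | 38 | 15.0% |
| 3 | 41 | 16.1% |
| 4 | 26 | 10.2% |
| 5 | 15 | 5.9% |
| 6 | 15 | 5.9% |
| 7 | 17 | 6.7% |
| 8 | 13 | 5.1% |
| 9 | 9 | 3.5% |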
Not what you expected! Populations with a leading digit of 1 occur almost one third of the time, not one ninth as most people would guess. Benford's law does not hold for all data sets (for example, the heights of humans in inches), but it does hold for a surprisingly large number of real world measurements.
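For comparison, the handout does not state the frequencies that Benford's law predicts. The usual statement of the law gives the probability of leading digit d as log10(1 + 1/d). The sketch below, which is not part of the required program, prints those expected percentages; the function name `benfordExpectedPercentages` is made up for illustration.

```python
import math

def benfordExpectedPercentages():
    """Return the expected percentage for each leading digit 1 through 9,
    using the common form of Benford's law: P(d) = log10(1 + 1/d)."""
    return [100 * math.log10(1 + 1 / d) for d in range(1, 10)]

# Print the digit next to its expected percentage, one decimal place.
for digit, percentage in enumerate(benfordExpectedPercentages(), start=1):
    print(digit, format(percentage, ".1f"))
```

The expected share for a leading 1 is about 30.1 percent, close to the 31.5 percent observed for the Texas county populations.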
Write a program that tests Benford's law for two files. One is the Texas Counties Populations from the 2010 census.
The file format is one entry per line. Each line consists of [Label], a single tab, and [NUMBER], followed by a newline.
[Label] is 1 or more characters. A label may contain any characters other than a tab or newline, including spaces. The label is followed by a single tab. [NUMBER] is an integer greater than 0. Numbers consist only of the digits 0 through 9 and may not start with 0. There are no commas or any other characters in a number other than the digits 0 through 9. Immediately after [NUMBER] is a newline character.
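As an illustration of the line format, here is a minimal sketch that splits one entry into its label and number. The helper name `parseEntry` is hypothetical and not one of the required functions. It assumes the entry follows the format above, so the last tab on the line separates the label from the number.

```python
def parseEntry(entry):
    """Split one line of the data file into (label, number).
    Both parts are returned as strings; the number is left as text
    so its first character can be used as the leading digit."""
    # Drop the trailing newline if there is one, then split at the tab.
    label, number = entry.rstrip("\n").rsplit("\t", 1)
    return label, number

label, number = parseEntry("Travis County\t1024266\n")
print(label)      # the label, spaces included
print(number[0])  # the leading digit as a one-character string
```

Because a valid number never starts with 0, the first character of the number is always one of '1' through '9'.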
You must create the other file yourself based on a real world data source. Name the file "StudentData.txt". The format must be the same as described above. You cannot make up a data set: you must find a real world data set and create a file based on it. The data set must consist of integers and have 250 or more entries.
Program: Name your program Benford.py. Complete the following functions:
Creates and returns a list of strings with the entries from the data
file with name
fileName. The elements of the list are in
the same order as they appear in the data file.
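One possible way to read the file is sketched below. The name of this function does not appear in the text above, so `readEntries` is a placeholder; use the name your instructor specifies. The sketch keeps the trailing newline on each string, which the description of getLeadDigitCounts allows.

```python
import os
import tempfile

def readEntries(fileName):
    """Return a list of strings, one per line of the file,
    in the same order as they appear in the file."""
    with open(fileName, "r") as infile:
        return infile.readlines()

# Demonstration with a small throwaway file instead of the real data file.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "sample.txt")
    with open(path, "w") as outfile:
        outfile.write("Roberts County\t929\nTerrell County\t984\n")
    entries = readEntries(path)

print(entries)
```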
getLeadDigitCounts(data): data is a list
with the entries from a file. Each element in the list is a string with
the label and the number separated by a single tab. There may be a
newline character at the end of the string. This function returns a
list of integers of length 9. The first element of the list (index 0)
stores the number of elements in the list data that have a number with
a leading digit of 1, the second element of the list (index 1) stores
the number of elements in data that have a number with a leading digit
of 2, and so forth.
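A minimal sketch of the counting step, assuming every entry follows the file format (label, tab, number, optional newline). It is one way to meet the description, not the only one.

```python
def getLeadDigitCounts(data):
    """Return a list of 9 integers. Index 0 holds the number of entries
    whose number starts with 1, index 1 the number that start with 2,
    and so forth up to index 8 for a leading 9."""
    counts = [0] * 9
    for entry in data:
        # The number is everything after the last tab, minus any newline.
        number = entry.rstrip("\n").rsplit("\t", 1)[1]
        leadDigit = int(number[0])
        counts[leadDigit - 1] += 1
    return counts

sample = ["Archer County\t9054\n", "Travis County\t1024266\n", "Roberts County\t929"]
print(getLeadDigitCounts(sample))
```

Subtracting 1 from the leading digit maps digits 1 through 9 onto list indices 0 through 8.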
showResults(counts): counts is a list of
integers with length 9. It represents the count of leading digits. The
function displays the total number of data points and for each leading
digit the number of data points and the percentage of total data points
with that leading digit rounded to one decimal place. Your output must
match the output shown below.
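The display step might be sketched as follows. The helper `formatResults` is hypothetical and exists only so the lines can be checked before printing. The separator between columns is an assumption: single spaces are used here, and the required output may use tabs or different spacing, so compare against the sample run before relying on it.

```python
def formatResults(counts):
    """Build the output lines for a list of 9 leading digit counts."""
    total = sum(counts)
    lines = ["number of data points: " + str(total),
             "digit number percentage"]
    for index, count in enumerate(counts):
        # Guard against an empty data set to avoid dividing by zero.
        percentage = 100 * count / total if total > 0 else 0.0
        lines.append("{} {} {:.1f}".format(index + 1, count, percentage))
    return lines

def showResults(counts):
    """Print the total and, for each digit, its count and percentage."""
    for line in formatResults(counts):
        print(line)

showResults([80, 38, 41, 26, 15, 15, 17, 13, 9])
```

The `.1f` format keeps one digit after the decimal point even when it is zero, as in 15.0.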
showLeadingDigits(digit, data): digit is a string of length 1 that contains a digit
character '1' to '9'. data is a list with the entries from a file. The
method prints all of the entries in data that have a leading digit
equal to the parameter named digit. They are printed in the order they
appear in data. Your output must match the output shown below.
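A sketch of the filtering step is shown below. The helper `entriesWithLeadDigit` is hypothetical and is separated out only to make the selection easy to test. The heading text is taken from the sample run; each entry is printed as stored, with its trailing newline removed.

```python
def entriesWithLeadDigit(digit, data):
    """Return the entries whose number starts with the given digit,
    in their original order, with trailing newlines removed."""
    matches = []
    for entry in data:
        cleaned = entry.rstrip("\n")
        number = cleaned.rsplit("\t", 1)[1]
        if number[0] == digit:
            matches.append(cleaned)
    return matches

def showLeadingDigits(digit, data):
    """Print every entry in data whose number has the given leading digit."""
    print("Showing data with a leading " + digit)
    for entry in entriesWithLeadDigit(digit, data):
        print(entry)

sample = ["Archer County\t9054\n", "Travis County\t1024266\n", "Roberts County\t929\n"]
showLeadingDigits("9", sample)
```

Comparing the first character of the number to digit as strings avoids converting either value to an integer.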
processFile(name): name is a string that
is the name of a file that matches the format of our data files. This
function reads the file, computes the leading digit counts, and calls
showResults for the given file. The function then prompts
the user for a digit and calls
showLeadingDigits for the
given input. The program does not error check the input.
The main function must call processFile
twice, once with TexasCountyPop2010.txt and once with your
StudentData.txt file.
Here is a sample run of the program, but only showing the result for TexasCountyPop2010.txt.
number of data points: 254
digit number percentage
1 80 31.5
2 38 15.0
3 41 16.1
4 26 10.2
5 15 5.9
6 15 5.9
7 17 6.7
8 13 5.1
9 9 3.5
Enter leading digit: 9
Showing data with a leading 9
Archer County 9054
Bowie County 92565
Brewster County 9232
Dimmit County 9996
Jack County 9044
Mitchell County 9403
Roberts County 929
Stephens County 9630
Terrell County 984
Name your program Benford.py and your data file StudentData.txt. Your program must start with the following header:
# File: --name of file--
# Description: --a description of your program--
# Assignment Number:
# Name 1: --your name or first person's name--
# EID 1: --your UTEID or first person's UTEID--
# Name 2: --second person's name if applicable--
# EID 2: --second person's UTEID if applicable--
# Course Name: CS303E
# Unique Number 1: --your section number or first person section number--
# Unique Number 2: --second person's section number if applicable--
# Date created:
# Date last modified:
# Slip days used this assignment:
# Total slip days used:
Use the turnin program to submit your program and your data file. The files must
be turned in by April 26th at 11pm. If you use slip days, please notify
your TA when you turn in your file. Your program will be graded based
on the following criteria:
Correctness: Does the program pass the provided and additional test cases?
Testing: Did the student provide the required file with the correct format, and run the program on this file?
Documentation: Are the functions and complicated code segments documented (especially the complicated code)?
Design/Efficiency: Points will be deducted for convoluted solutions and excessively long methods. Students are expected to modularize the solution.
If you used pair programming, both partners must have enough slip days left to cover any slip days you use. (e.g., If you use two, both partners must have at least two left to use.) Turn the assignment in using just one partner's turnin account. The grader will grade it and enter the grade for both partners.
Did you remember to: