
What do you do? If you choose the first good candidate you interview you may be missing the opportunity to hire a better candidate later. But if you pass on a good candidate in hopes that someone better will come along, you risk the possibility that no later candidate will be better. When should you stop interviewing and hire the current candidate? This is a classic problem in optimal stopping theory: what is the best strategy for maximizing your odds of picking the best candidate? Is there even a best strategy?
It turns out that there is. The optimal strategy is to reject the first 1/e (about 37%) of the candidates, and then choose the next candidate who is better than all those you have already seen. (The number e is Euler's number, an irrational number approximately equal to 2.71828. It is the base of the natural logarithms and appears often in calculus.) This strategy finds the absolute best candidate about 37% (that is, 1/e) of the time, and no alternative strategy does better. You can read the math behind it here: Math of the Secretary Problem.
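To get a feel for where the 37% figure comes from, here is a small sketch that evaluates the standard textbook formula for the success probability when the first r of n candidates are rejected: P(r) = (r/n) * (1/r + 1/(r+1) + ... + 1/(n-1)). The formula assumes candidates arrive in random order. The function name successProbability is made up for this illustration and is not part of the assignment.

```python
import math

def successProbability(n, r):
    """Chance of hiring the best of n candidates when the first r are
    rejected and the next candidate who beats all of them is hired."""
    if r == 0:
        # With no one rejected, the very first candidate is hired.
        return 1.0 / n
    # Standard result: (r/n) * sum of 1/j for j = r .. n-1.
    return (r / n) * sum(1.0 / j for j in range(r, n))

n = 100
# Try every possible cutoff and keep the one with the highest probability.
bestR = max(range(n), key=lambda r: successProbability(n, r))
bestP = successProbability(n, bestR)
print("best cutoff:", bestR)
print("success probability:", format(bestP, ".4f"))
print("1/e:", format(1 / math.e, ".4f"))
```

For 100 candidates the best cutoff lands at about 37% of the pool, and the success probability is slightly above 1/e.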
The input for your experiments is a file containing lines of data values. For example, suppose the file contains the line:
8.278 38.13 77.347 58.217 83.248 65.285 6.917 86.719 7.319 40.324
These are 10 random values between 0 and 100 expressed to 3 digits of accuracy. This line should be interpreted as follows:
To implement our strategy on this cohort of 10 applicants, we'd take the following steps:
Notice that, for the line of 10 values above, the strategy doesn't hire the best overall candidate. The benchmark is 77.347, so hiring stops with the first candidate better than the benchmark (number 4, counting from 0, with ranking 83.248), who is inferior to the very best candidate (number 7, with ranking 86.719). However, we did hire the second best candidate overall, so I guess we shouldn't complain!
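The walk-through above can be sketched in code as follows. This is only one possible reading of the strategy, not the reference solution: it assumes the number of candidates skipped is n/e rounded down (3 for a line of 10), and it returns the index of the hired candidate, counting from 0, or -1 if nobody beats the benchmark. The name chooseSecretary matches the function used in the sample session later in this handout; the body is a guess.

```python
import math

def chooseSecretary(ranks):
    """Return the index of the candidate hired, or -1 if no hire is made."""
    # Assumption: skip n/e candidates, rounded down.
    skip = int(len(ranks) / math.e)
    # The benchmark is the best score among the skipped candidates.
    benchmark = max(ranks[:skip]) if skip > 0 else float("-inf")
    # Hire the first later candidate who beats the benchmark.
    for i in range(skip, len(ranks)):
        if ranks[i] > benchmark:
            return i
    # Nobody after the skipped group was better than the benchmark.
    return -1

line = [8.278, 38.13, 77.347, 58.217, 83.248,
        65.285, 6.917, 86.719, 7.319, 40.324]
choice = chooseSecretary(line)
print(choice, line[choice])   # index 4, ranking 83.248
```

If the best candidate happens to fall within the skipped group, no later candidate can beat the benchmark and the function returns -1.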
We'll run n such experiments, where n is the number of lines in the data file. Keep three running totals, counting the experiments in which no hire was made (the experiment returned -1), the best candidate was hired, or someone other than the best candidate was hired. To see whether the best candidate was hired, compare the ranking/score of the candidate selected to the max value in the ranks; if those are equal, the best overall candidate was hired. Report the three statistics as percentages of the total number of experiments/lines. (One would expect that as the number of experiments grows, the percentage of times we've hired the best overall candidate would approach 1/e ≈ 37%.)
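One way the bookkeeping might look is sketched below. The helper names classifyOutcome and reportPercentages are invented for this illustration, and the choices fed in are hand-picked rather than computed, so the block stands on its own. The labels follow the sample output shown later in this handout.

```python
def classifyOutcome(ranks, choice):
    """Label one experiment as 'none', 'best', or 'other'."""
    if choice == -1:
        return "none"            # no hire was made
    if ranks[choice] == max(ranks):
        return "best"            # the best overall candidate was hired
    return "other"               # someone else was hired

def reportPercentages(outcomes):
    """Turn a list of outcome labels into formatted percentage lines."""
    labels = [("none", "No choice made"),
              ("best", "Best choice made"),
              ("other", "Bad choice made")]
    total = len(outcomes)
    report = []
    for key, label in labels:
        percent = 100.0 * outcomes.count(key) / total
        report.append(label + ": " + format(percent, ".2f") + "%")
    return report

line = [8.278, 38.13, 77.347, 58.217, 83.248,
        65.285, 6.917, 86.719, 7.319, 40.324]
# Hand-picked choices standing in for four experiments on the same line.
outcomes = [classifyOutcome(line, c) for c in (4, 7, -1, 4)]
report = reportPercentages(outcomes)
for entry in report:
    print(entry)
```

In the real program, each choice would come from running the strategy on one line of the data file.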
You should test your program on this file: SecretaryData1. It has data for 100 experiments. Download this to your own directory and use it to test your program. This test file was generated using the command: createDataFile( 100, "SecretaryData1" ). The code for the function createDataFile is below. However, it's suggested that you create your own much smaller file for use during development and testing, using that function to do so.
Program incrementally. I would strongly advise against writing your program all at once. Perhaps take the following steps.
8.278 38.13 77.347 58.217 83.248 65.285 6.917 86.719 7.319 40.324
For brevity, this specific line has 10 floats separated by spaces. Each line in the real file has between 25 and 100 data points. Each data point is a float between 0 and 100 with 3 decimals of precision. To create such a file, you can use the following function:
import random

def createDataFile(numberOfLines, fileName):
    """This creates a file of lines of data for the Secretary
    Problem. Each line contains between 25 and 100 data points, each a
    float between 0 and 100. The idea is that these are the rankings
    of secretaries. In the file each ranking has 3 digits of
    precision. I originally made the rankings integers, but this led
    too often to duplicate rankings. It's still possible, but
    unlikely, that two rankings could be identical.
    """
    # Open a new file for writing. If a file of this name exists,
    # this will overwrite it.
    outfile = open(fileName, "w")
    # We want to generate a file with numberOfLines lines of data
    for i in range(numberOfLines):
        # Each line has a random number of data points between 25 and
        # 100. This is the number of candidates in that experiment.
        lineLen = random.randint(25, 100)
        # This is a list of the data items, each a random float
        # between 0 and 100. These are the rankings of the
        # secretaries. It's unlikely that any two data points will be
        # identical, though that's possible.
        data = [ (random.random() * 100) for j in range(lineLen) ]
        for datum in data:
            # Write each datum to the file with 3 digits of precision
            # followed by a blank. All data from an experiment will
            # be on one line.
            outfile.write( format( datum, ".3f" ) + " " )
        # Close out the line (experiment) by writing a newline.
        outfile.write( "\n" )
    # Close the output file.
    outfile.close()
You are welcome to generate your own data file to play with. The TAs will
use a fixed file to test your program.
Here's some sample output running some of my individual functions in interactive mode:
>>> from SecretaryProblem import *
>>> createDataFile( 3, "TestData" ) # create data file (3 lines)
>>> linesOfData = readDataFile( "TestData" ) # parse into list of lists
>>> linesOfData # show the result
[[62.967, 43.776, 32.067, 3.105, 41.994, 3.705, 36.416, 46.918,
42.988, 23.06, 30.524, 31.299, 55.958, 34.511, 80.217, 49.994, 60.345,
83.509, 96.205, 47.311, 48.02, 67.646, 79.833, 64.119, 13.653, 3.666,
64.659, 87.195, 97.336, 91.348, 30.755, 24.378, 32.752, 53.031,
89.512, 7.998, 72.472, 47.435, 93.616, 29.002, 48.493, 31.665, 52.466,
21.829, 7.556, 12.981, 91.493, 1.495, 63.651, 42.827, 69.029, 13.168,
4.368, 2.373, 36.273, 4.258, 82.339, 66.537, 69.111, 5.952, 51.596,
60.097, 24.549, 68.826, 18.39, 29.68, 0.76, 85.05, 29.535, 32.025,
67.158, 85.347, 8.328, 90.482, 55.505, 19.486],
[66.636, 5.894, 7.521, 82.888, 10.359, 76.596, 4.265, 19.508, 2.881,
....
76.729, 53.795, 55.921, 8.9, 26.352, 56.8, 96.35, 6.275, 35.263, 22.114],
[25.1, 70.797, 68.634, 56.905, 60.645, 4.661, 81.913, 55.672, 76.095,
....
1.306, 53.247, 80.215, 49.465, 90.73, 24.746, 93.669, 96.773, 66.622]]
>>> len( linesOfData )
3
>>> dataLine1 = linesOfData[0]
>>> dataLine1
[62.967, 43.776, 32.067, 3.105, 41.994, 3.705, 36.416, 46.918, 42.988,
....
8.328, 90.482, 55.505, 19.486]
>>> len(dataLine1)
76
>>> choice = chooseSecretary( dataLine1 )
>>> choice
28
>>> dataLine1[choice]
97.336
>>> max( dataLine1 ) # We won on this one
97.336
>>>
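The session above calls readDataFile to parse the file into a list of lists of floats. Its code isn't shown in this handout, so the following is only a guess at how such a function might be written; the tiny demo file and the name demoPath are made up for the illustration.

```python
import os
import tempfile

def readDataFile(fileName):
    """Parse a data file into a list of lists of floats, one per line."""
    linesOfData = []
    with open(fileName, "r") as infile:
        for line in infile:
            # split() with no argument ignores the trailing blank and newline.
            fields = line.split()
            if fields:   # skip any empty lines
                linesOfData.append([float(field) for field in fields])
    return linesOfData

# Write a tiny two-line file in the same layout createDataFile uses.
demoPath = os.path.join(tempfile.mkdtemp(), "TinyData")
with open(demoPath, "w") as outfile:
    outfile.write("8.278 38.130 77.347 \n")
    outfile.write("58.217 83.248 \n")

parsed = readDataFile(demoPath)
print(parsed)
```

Note that a value written as 38.130 in the file shows up as 38.13 once it has been converted to a float.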
To test the main program you might do the following:
>>> main()
Enter data file name: BogusFileName
File doesn't exist: BogusFileName
>>> createDataFile( 1000, "SecretaryData" )
>>> main()
Enter data file name: SecretaryData
Results:
No choice made: 37.50%
Best choice made: 38.20%
Bad choice made: 24.30%
>>> createDataFile( 1000, "SecretaryData" )
>>> main()
Enter data file name: SecretaryData
Results:
No choice made: 36.30%
Best choice made: 40.10%
Bad choice made: 23.60%
>>>
Your file must compile and run before submission. It must also contain a header with the following format:
# Assignment: Project 3
# File: SecretaryProblem.py
# Student:
# UT EID:
# Course Name: CS303E
#
# Date Created:
# Description of Program:
In Python, there isn't a default entry point. If you want to start executing by calling main you have to have an explicit call main(). But since you are telling the system where to begin executing, you can call any function you like.
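As an illustration of that point, here is a minimal sketch with a stand-in main. The body of main is a placeholder, not the assignment's driver. The sketch uses the common `if __name__ == "__main__":` guard rather than a bare call; that is a choice made for this example, so that main runs when the file is executed as a script but not when it is imported, as in the interactive sessions above.

```python
def main():
    """Stand-in driver; the real one would prompt for a file name,
    run the experiments, and print the three percentages."""
    print("Running experiments...")
    return "done"

# Nothing runs automatically in Python; this is the explicit call.
# The guard keeps main() from firing on "from SecretaryProblem import *".
if __name__ == "__main__":
    main()
```

A bare call to main() on the last line would also work, but it would then run every time the file is imported.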