CS303E Project 3: Secretary Problem

Instructor: Dr. Bill Young
Due Date: Monday, December 1, 2025 at 11:59pm

The Secretary Problem

Suppose you ran a small company and needed to hire a new secretary. It's a tough job market and you get a lot of applicants, but you know how many candidates are in the hiring "pool." The candidates all have their strengths and weaknesses and you're an experienced interviewer who can rank applicants after a single interview. But for some reason, you have to interview candidates one at a time and make a decision on the spot about hiring any specific candidate. That is, once they walk out the door, they're gone. You can't "circle back" to any candidate. If you pass up a candidate, you can't bring them back later.

What do you do? If you choose the first good candidate you interview you may be missing the opportunity to hire a better candidate later. But if you pass on a good candidate in hopes that someone better will come along, you risk the possibility that no later candidate will be better. When should you stop interviewing and hire the current candidate? This is a classic problem in optimal stopping theory: what is the best strategy for maximizing your odds of picking the best candidate? Is there even a best strategy?

It turns out that there is. The optimal strategy is to reject the first 1/e (about 37%) of the candidates, and then choose the next candidate who is better than all those you have already seen. (The number e is Euler's number, an irrational number approximately equal to 2.71828. It is the base of the natural logarithms and appears often in calculus.) This strategy has a success rate of about 37% (or 1/e) of finding the absolute best candidate. And it is better than any alternative strategy. You can read the math behind it here: Math of the Secretary Problem.
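The 37% figure can be checked numerically. The sketch below uses the standard closed-form success probability for the classical problem, where the first k of n candidates are rejected and the next candidate better than all of them is hired: P(k) = (k/n) * (1/k + 1/(k+1) + ... + 1/(n-1)). The names successProbability and bestCutoff are made up for this illustration and are not part of the assignment.

```python
import math

def successProbability(n, k):
    """Probability of hiring the single best of n candidates when the
    first k are rejected outright (standard formula for the classical
    problem, assuming every ordering is equally likely)."""
    if k == 0:
        # With no rejections we hire the first candidate, who is the
        # best with probability 1/n.
        return 1 / n
    return (k / n) * sum(1 / j for j in range(k, n))

def bestCutoff(n):
    """Return the k in 0..n-1 that maximizes successProbability(n, k)."""
    return max(range(n), key=lambda k: successProbability(n, k))

n = 100
k = bestCutoff(n)
print(k, round(n / math.e))                     # compare the two cutoffs
print(format(successProbability(n, k), ".4f"))  # should be near 1/e
print(format(1 / math.e, ".4f"))
```

For small pools the exact optimum can differ slightly from round(n / e), but the success rate settles near 1/e as n grows.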

The Assignment

In this project, you'll be implementing this strategy for hiring a good secretary, running a series of experiments and testing how often it finds the best candidate from a pool of applicants.

The input for your experiments is a file containing lines of data values. For example, suppose the file contains the line:

8.278 38.13 77.347 58.217 83.248 65.285 6.917 86.719 7.319 40.324

These are 10 random values between 0 and 100, each given to at most 3 decimal places. This line should be interpreted as follows:

This describes a pool of applicants for a single "round" of hiring (that is, hiring one secretary from a pool of 10 applicants).
Each number represents the ranking of the respective candidate. E.g., for this pool the top ranked candidate is number 7 (0-based) with a score of 86.719 and the lowest is number 6 with a score of 6.917.
In actuality, you wouldn't know these numbers in advance. Think of each number as the ranking you'd assign to the candidate after their interview.

To implement our strategy on this cohort of 10 applicants, we'd take the following steps:

Read that line from our data file and store the data from that line in a list ranks.
Compute a "benchmark" by finding the best score in the first k applicants in ranks. For this line k = 4, because round( 10 * 1 / math.e ) == 4. This benchmark is 77.347 (the best ranking among the first 4 candidates).
Then begin interviewing the elements in ranks starting from applicant 4 (i.e., ranks[4:]).
If you encounter an applicant with score higher than the benchmark, hire that applicant and stop. That is, return the index of that candidate in the list.
Note that it's possible that you don't encounter any candidate with score higher than the benchmark. If so, return -1.
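The steps above can be sketched as follows. This is one possible reading of the algorithm, not the instructor's implementation; in particular, the handling of a pool so small that k is 0 is an assumption, since the assignment's pools always have at least 25 candidates.

```python
import math

def chooseSecretary(ranks):
    """Return the index of the hired candidate, or -1 if nobody is hired."""
    n = len(ranks)
    # Number of candidates to interview and reject outright.
    k = round(n / math.e)
    if k == 0:
        # Assumption: with no rejection phase (only possible for a
        # pool of 0 or 1), hire the first candidate if there is one.
        return 0 if n > 0 else -1
    # Best score seen among the rejected candidates.
    benchmark = max(ranks[:k])
    # Hire the first later candidate who beats the benchmark.
    for i in range(k, n):
        if ranks[i] > benchmark:
            return i
    return -1

ranks = [8.278, 38.13, 77.347, 58.217, 83.248,
         65.285, 6.917, 86.719, 7.319, 40.324]
print(chooseSecretary(ranks))    # 4
```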

It's also possible, but highly unlikely, that two candidates in the list have exactly the same score. We'll ignore that possibility.

Notice that given the line of 10 data values above, the strategy doesn't hire the best overall candidate. The benchmark is 77.347 and the hiring stops with the first candidate better than the benchmark (number 4 with ranking of 83.248), who is inferior to the very best candidate (number 7 with ranking of 86.719). However, we did hire the second best candidate overall, so I guess we shouldn't complain!

We'll run n such experiments, where n is the number of lines in the data file. Keep three running totals: the number of experiments in which no hire was made (the experiment returned -1), the number in which the best candidate was hired, and the number in which someone other than the best candidate was hired. To see whether the best candidate was hired, compare the ranking/score of the candidate selected to the max value in ranks; if those are equal, the best overall candidate was hired. Report the three statistics as percentages of the total number of experiments/lines. (One would expect that as the number of experiments grows, the percentage of times we've hired the best overall candidate would approach 1/e ≈ 37%.)
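A minimal sketch of the bookkeeping, assuming the index chosen for each pool has already been computed and is passed in as a separate list. The function name tallyResults and that calling convention are assumptions made for this illustration; the sketch also assumes at least one experiment.

```python
def tallyResults(linesOfData, choices):
    """Given the pools and the index chosen for each pool (-1 for no
    hire), return the three outcomes as percentages of all experiments:
    (no hire, best hired, someone else hired)."""
    noHire = best = other = 0
    for ranks, choice in zip(linesOfData, choices):
        if choice == -1:
            noHire += 1
        elif ranks[choice] == max(ranks):
            best += 1
        else:
            other += 1
    total = len(linesOfData)
    return (100 * noHire / total, 100 * best / total, 100 * other / total)

pools = [[5.0, 1.0, 9.0], [9.0, 1.0, 5.0], [5.0, 9.0, 7.0], [1.0, 5.0, 9.0]]
choices = [2, -1, 1, 1]   # hypothetical outcomes, picked by hand for the demo
print(tallyResults(pools, choices))   # (25.0, 50.0, 25.0)
```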

You should test your program on the SecretaryData1 file, which has data for 100 experiments. Download it to your own directory and use it to test your program. This test file was generated using the command: createDataFile( 100, "SecretaryData1" ). The code for the function createDataFile is below. However, it's suggested that you also use that function to create your own much smaller file for development and testing.

Program incrementally. I would strongly advise against writing your program all at once. Perhaps take the following steps.

1. Defer the file reading part. Get your basic secretary choice algorithm working on a single list of values you supply.
2. Write a function chooseSecretary( ranks ) that chooses a secretary from a list of candidate scores following the algorithm above. It should return the index of the choice or -1 if there is none.
3. Write a top level loop that takes a list of such lists and runs the secretary choice experiment on each, and reports the result.
4. Add the recording of statistics and a final reporting.
5. Write a function that reads in the data file and produces a list of lists, where each list contains the data on one line of the file.
6. Pass that list of lists to the function you wrote in step 3.
7. Pat yourself on the back. You're done!
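The file-reading step might look something like the sketch below. It assumes values are separated by whitespace and that blank lines should be skipped; neither detail is dictated by the assignment. The demonstration writes a tiny temporary file (named TinyData here purely for illustration) so the example runs on its own.

```python
import os
import tempfile

def readDataFile(fileName):
    """Read the file and return a list of lists of floats, one inner
    list per non-blank line."""
    linesOfData = []
    with open(fileName, "r") as infile:
        for line in infile:
            fields = line.split()      # split on any whitespace
            if fields:                 # skip blank lines (an assumption)
                linesOfData.append([float(field) for field in fields])
    return linesOfData

# Small demonstration using a temporary file.
path = os.path.join(tempfile.mkdtemp(), "TinyData")
with open(path, "w") as outfile:
    outfile.write("8.278 38.130 77.347 \n")
    outfile.write("58.217 83.248 \n")
print(readDataFile(path))   # [[8.278, 38.13, 77.347], [58.217, 83.248]]
```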

Creating a Data File:

Recall that a data file for this problem contains n lines of data that might look like this:

8.278 38.13 77.347 58.217 83.248 65.285 6.917 86.719 7.319 40.324

For brevity, this specific line has 10 floats separated by spaces. Each line in the real file has between 25 and 100 data points. Each data point is a float between 0 and 100 with 3 decimal places of precision. To create such a file, you can use the following function:

import random

def createDataFile(numberOfLines, fileName):
    """This creates a file of lines of data for the Secretary
    Problem. Each line contains between 25 and 100 data points, each a
    float between 0 and 100.  The idea is that these are the rankings
    of secretaries.  In the file each ranking has 3 digits of
    precision.  I originally made the rankings integers, but this led
    too often to duplicate rankings.  It's still possible, but
    unlikely, that two rankings could be identical.

    """
    # Open a new file for writing.  If a file of this name exists,
    # this will overwrite it.
    outfile = open(fileName, "w")
    
    # We want to generate a file with numberOfLines lines of data
    for i in range(numberOfLines):

        # Each line has a random number of data points between 25 and
        # 100.  This is the number of candidates in that experiment.
        lineLen = random.randint(25, 100)

        # This is a list of the data items, each a random float
        # between 0 and 100. These are the rankings of the
        # secretaries. It's unlikely that any two data points will be
        # identical, though that's possible.
        data = [ (random.random() * 100) for j in range(lineLen) ]
        
        for datum in data:
            # Write each datum to the file with 3 digits of precision
            # followed by a blank.  All data from an experiment will
            # be on one line.
            outfile.write( format( datum, ".3f" ) + " " )

        # Close out the line (experiment) by writing a newline.
        outfile.write( "\n" )
            
    # Close the output file.
    outfile.close()

You are welcome to generate your own data file to play with. The TAs will use a fixed file to test your program.
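During development it can also help to have a file whose answers you can work out by hand. The sketch below writes three hand-made pools of 10 candidates each (so k = 4 for every line). The function name writeKnownCases and the file name KnownCases are made up for this illustration. Unlike createDataFile, it does not write a trailing blank before each newline, which should not matter if the reader splits on whitespace.

```python
import os
import tempfile

def writeKnownCases(fileName):
    """Write three hand-made 10-candidate pools with known outcomes
    and return the number of lines written."""
    cases = [
        # Benchmark 77.347; expected choice is index 4 (not the best).
        [8.278, 38.13, 77.347, 58.217, 83.248,
         65.285, 6.917, 86.719, 7.319, 40.324],
        # Best candidate is among the first 4; expected result is -1.
        [90.0, 10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 85.0],
        # Benchmark 40.0; expected choice is index 6 (the best overall).
        [10.0, 20.0, 30.0, 40.0, 35.0, 25.0, 95.0, 50.0, 60.0, 70.0],
    ]
    with open(fileName, "w") as outfile:
        for ranks in cases:
            outfile.write(" ".join(format(r, ".3f") for r in ranks) + "\n")
    return len(cases)

path = os.path.join(tempfile.mkdtemp(), "KnownCases")
print(writeKnownCases(path))   # 3
```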

Expected Output:

Much of what you see below is calling the functions I wrote to implement this. Those were:

createDataFile( numberOfLines, fileName ): see that code above.
readDataFile( fileName ): read lines from the data file, parse them, and put them into a list of lists.
chooseSecretary( ranks ): given a list of candidate rankings, follow the algorithm to choose one (or not); return the index of the choice or -1 if no choice is made.
main(): put it all together. Ask the user for a data file name, parse the file into a list of lists, run the n experiments (where n is the number of lists), collect and report the statistics.

You don't have to use the same functions I wrote. In particular, you could read one line from the file, parse it into a list and run the experiment directly on that list, and then read the next line. I chose to parse the entire file into a list of lists instead just because I found it easier to separate the file handling stuff into a single function. You don't have to.

Here's some sample output running some of my individual functions in interactive mode:

>>> from SecretaryProblem import *
>>> createDataFile( 3, "TestData" )                    # create data file (3 lines)
>>> linesOfData = readDataFile( "TestData" )           # parse into list of lists
>>> linesOfData                                        # show the result
[[62.967, 43.776, 32.067, 3.105, 41.994, 3.705, 36.416, 46.918,
  42.988, 23.06, 30.524, 31.299, 55.958, 34.511, 80.217, 49.994, 60.345,
  83.509, 96.205, 47.311, 48.02, 67.646, 79.833, 64.119, 13.653, 3.666,
  64.659, 87.195, 97.336, 91.348, 30.755, 24.378, 32.752, 53.031,
  89.512, 7.998, 72.472, 47.435, 93.616, 29.002, 48.493, 31.665, 52.466,
  21.829, 7.556, 12.981, 91.493, 1.495, 63.651, 42.827, 69.029, 13.168,
  4.368, 2.373, 36.273, 4.258, 82.339, 66.537, 69.111, 5.952, 51.596,
  60.097, 24.549, 68.826, 18.39, 29.68, 0.76, 85.05, 29.535, 32.025,
  67.158, 85.347, 8.328, 90.482, 55.505, 19.486], 
 [66.636, 5.894, 7.521, 82.888, 10.359, 76.596, 4.265, 19.508, 2.881,
    ....
  76.729, 53.795, 55.921, 8.9, 26.352, 56.8, 96.35, 6.275, 35.263, 22.114], 
 [25.1, 70.797, 68.634, 56.905, 60.645, 4.661, 81.913, 55.672, 76.095,
    ....
  1.306, 53.247, 80.215, 49.465, 90.73, 24.746, 93.669, 96.773, 66.622]]
>>> len( linesOfData )
3
>>> dataLine1 = linesOfData[0]                              
>>> dataLine1
[62.967, 43.776, 32.067, 3.105, 41.994, 3.705, 36.416, 46.918, 42.988,
    ....
 8.328, 90.482, 55.505, 19.486]
>>> len(dataLine1)
76
>>> choice = chooseSecretary( dataLine1 )
>>> choice
28
>>> dataLine1[choice]
97.336
>>> max( dataLine1 )                                 # We won on this one
97.336
>>>

To test the main program you might do the following:

>>> main()
Enter data file name: BogusFileName
File doesn't exist: BogusFileName
>>> createDataFile( 1000, "SecretaryData" )
>>> main()
Enter data file name: SecretaryData
Results:
  No choice made:   37.50%
  Best choice made: 38.20%
  Bad choice made:  24.30%
>>> createDataFile( 1000, "SecretaryData" )
>>> main()
Enter data file name: SecretaryData
Results:
  No choice made:   36.30%
  Best choice made: 40.10%
  Bad choice made:  23.60%
>>>
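The transcript above shows two behaviors worth sketching: the message for a missing file and the two-decimal percentage report. The helpers below, fileIsReadable and formatResults, are hypothetical names used only for this illustration; checking with os.path.isfile is one option, and catching the exception raised by open would be another.

```python
import os

def fileIsReadable(fileName):
    """Return True if fileName names an existing file; otherwise print
    a message in the style of the sample output and return False."""
    if not os.path.isfile(fileName):
        print("File doesn't exist:", fileName)
        return False
    return True

def formatResults(noHirePct, bestPct, otherPct):
    """Build the report text, showing each percentage to 2 decimals."""
    lines = ["Results:",
             "  No choice made:   " + format(noHirePct, ".2f") + "%",
             "  Best choice made: " + format(bestPct, ".2f") + "%",
             "  Bad choice made:  " + format(otherPct, ".2f") + "%"]
    return "\n".join(lines)

print(formatResults(37.5, 38.2, 24.3))
```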

Turning in the Assignment:

The program should be in a file named SecretaryProblem.py. Submit the file via Canvas before the deadline shown at the top of this page. Upload your Python file to the assignment project3 under the assignments section. Make sure that you follow good coding style and comment your code as appropriate.

Your file must compile and run before submission. It must also contain a header with the following format:

# Assignment: Project 3
# File: SecretaryProblem.py
# Student: 
# UT EID:
# Course Name: CS303E
# 
# Date Created:
# Description of Program:

Programming Tips:

Your main function need not be called main. Calling your primary function main is just a convention. In some programming languages, such as C, you must have a function called main because that's how the system knows where to begin executing. This is called the entry point for the program. In C, you don't need an explicit call to main; the system generates one automatically (which it can do, since it knows the entry point is always called main).

In Python, there isn't a default entry point. If you want to start executing by calling main, you have to include an explicit call to main(). But since you are telling the system where to begin executing, you can call any function you like.
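One common Python idiom, not required by this assignment, is to guard the explicit call so that it runs only when the file is executed as a script. With the guard in place, importing the file in an interactive session (for example, from SecretaryProblem import *) defines the functions without starting the program as a side effect. The function name runExperiments below is arbitrary and used only for illustration.

```python
def runExperiments():
    """Stand-in for the program's primary function; any name works."""
    print("Running the secretary experiments...")
    return "done"

# __name__ is "__main__" only when this file is run directly,
# not when it is imported from another module or the interpreter.
if __name__ == "__main__":
    runExperiments()
```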