Netflix


Due: Thu, 26 Jun 2014, 10pm
70 pts, 7% of total grade.


Specification


Write a Python program to win the Netflix Prize.

Ignore the qualifying data. It's just there for explanation.

Run your program on the probe data instead, and produce an RMSE (root-mean-square error) of less than 1.0 with a runtime of less than 1 min.
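
As a point of reference, RMSE is the square root of the mean squared difference between predicted and actual ratings. The following is a minimal sketch only; the function name rmse and the sample ratings are hypothetical and are not part of the project skeleton.

```python
import math


def rmse(predictions, actuals):
    """Return the root-mean-square error of two equal-length sequences."""
    # Pre-condition: same length, and at least one rating to compare.
    assert len(predictions) == len(actuals) > 0
    total = sum((p - a) ** 2 for p, a in zip(predictions, actuals))
    return math.sqrt(total / len(actuals))


# Hypothetical predictions compared against hypothetical true ratings.
print(rmse([3.0, 4.0, 2.0], [3, 5, 1]))
```

An RMSE of 0 means every prediction was exact; larger values mean larger typical errors.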

Note: Because the Netflix files are very large, it's impractical to copy them to your local machine. Your best bet is to develop the solution on the CS machines.

Data


Training Data:

  • 17,770 movies
  • 480,189 customers
  • about 100,000,000 ratings
  • about 5,600 ratings per movie
  • about 200 ratings per customer

Qualifying Data:

  • 2,836,401 ratings, no customers from training data

Probe Data:

  • 1,425,333 ratings, all customers from training data

Files:

/u/downing/cs/netflix/README

/u/downing/cs/netflix/training_set/* (1.4 GB)
/u/downing/cs/netflix/training_set/mv_0002043.txt (162 KB)
17,770 files total (one per movie)

customer_id      rating    rating_date
1 - 2,649,429    1 - 5     Oct 1998 - Dec 2005
with gaps                  YYYY-MM-DD
480,189 total

2043:
...
1417435          3         2005-12-14
...
2312054          1         2004-12-21
...
462685           4         2005-07-21
...
7,872 records total
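
A cache (for example, the average rating per movie) could be built by reading each training file once. The sketch below makes some assumptions: the function name read_movie_file is hypothetical, and because the listing above does not show the field separator used in the real files, the parser accepts either commas or whitespace.

```python
import io
import re


def read_movie_file(stream):
    """Return (movie_id, [(customer_id, rating, rating_date), ...])."""
    movie_id = None
    records = []
    for line in stream:
        line = line.strip()
        if not line:
            continue
        if line.endswith(":"):
            # Header line such as "2043:" names the movie.
            movie_id = int(line[:-1])
        else:
            # Assumption: fields are separated by commas or whitespace.
            customer_id, rating, rating_date = re.split(r"[,\s]+", line)
            records.append((int(customer_id), int(rating), rating_date))
    return movie_id, records


# Three records taken from the listing above, in an in-memory file.
sample = io.StringIO(
    "2043:\n1417435,3,2005-12-14\n2312054,1,2004-12-21\n462685,4,2005-07-21\n")
movie_id, records = read_movie_file(sample)
mean_rating = sum(rating for _, rating, _ in records) / len(records)
print(movie_id, len(records), mean_rating)
```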

/u/downing/cs/netflix/movie_titles.txt (578 KB)
movie_id      movie_year     movie_title
1 - 17,770    1890 - 2005
...
2043          1953           Shane
...
10851         1948           Red River
...
16306         1960           Spartacus
...
17,770 records total

/u/downing/cs/netflix/qualifying.txt (52.5 MB)
customer_id      rating_date
1 - 2,649,429    Oct 1998 - Dec 2005
with gaps        YYYY-MM-DD
480,189 total
...
2043:
2012320          2005-06-07
436529           2005-12-28
843246           2005-12-15
...
2,836,401 records total

/u/downing/cs/netflix/probe.txt (10.8 MB)
RMSE = 0.9474
customer_id
1 - 2,649,429
with gaps
480,189 total
...
2043:
1417435
2312054
462685
...
10851:
1417435
2312054
462685
...
1,425,333 records total
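
The probe format above (a movie ID followed by a colon, then one customer ID per line) can be read with a small generator. This is a sketch under that assumption; the name read_probe is hypothetical.

```python
import io


def read_probe(stream):
    """Yield (movie_id, customer_id) pairs in input order."""
    movie_id = None
    for line in stream:
        line = line.strip()
        if not line:
            continue
        if line.endswith(":"):
            # A line such as "2043:" starts a new movie.
            movie_id = int(line[:-1])
        else:
            assert movie_id is not None, "customer line before any movie line"
            yield movie_id, int(line)


# A few lines in the same shape as probe.txt, in an in-memory file.
sample = io.StringIO("2043:\n1417435\n2312054\n10851:\n1417435\n")
pairs = list(read_probe(sample))
print(pairs)
```

In the real program the stream would be sys.stdin rather than an in-memory sample.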

Input


standard in (a single test case) and cache files

Mukund will keep a clone of the public test repo, with all of the cache files, at /u/mukund/netflix-tests, so that you can hardcode the pathnames of the cache files you choose to use.

Mukund will update the clone regularly.
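
One common baseline that uses such caches, offered here only as an illustration and not as the required approach, is to start from the movie's average rating and shift it by how far the customer's average sits from the overall average. The cache contents, their dict layout, and the function name predict below are all hypothetical; the real cache files may be organized differently.

```python
def predict(movie_id, customer_id, movie_mean, customer_mean, overall_mean):
    """Return a predicted rating clamped to the range 1.0 - 5.0."""
    # Fall back to the overall mean when an ID is missing from a cache.
    movie_avg = movie_mean.get(movie_id, overall_mean)
    customer_offset = customer_mean.get(customer_id, overall_mean) - overall_mean
    rating = movie_avg + customer_offset
    return min(5.0, max(1.0, rating))


# Hypothetical cache contents, standing in for data loaded from cache files.
movie_mean = {2043: 3.8, 10851: 4.1}
customer_mean = {1417435: 3.2, 2312054: 4.5}
overall_mean = 3.6

print(predict(2043, 1417435, movie_mean, customer_mean, overall_mean))
```

Loading the caches once at startup, rather than rereading the training set, is what makes the 1 min runtime limit realistic.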

Output


standard out (a single test case)

rating
1.0 - 5.0
...
2043:
3.4
4.1
1.9
...
10851:
4.3
1.4
2.8
...
RMSE: 0.95
1,425,333 records total
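
Writing the output in the shape shown above might look like the following sketch. It assumes one decimal place per rating and two for the RMSE, which is inferred from the sample and not stated as a rule; the name write_predictions is hypothetical. It uses str.format rather than f-strings so that it runs on Python 3.2.3.

```python
import io


def write_predictions(rated_pairs, rmse_value, out):
    """Write (movie_id, predicted_rating) pairs, grouped by movie, to out."""
    current_movie = None
    for movie_id, rating in rated_pairs:
        if movie_id != current_movie:
            # Emit the "movie_id:" line once per group of ratings.
            out.write("{}:\n".format(movie_id))
            current_movie = movie_id
        out.write("{:.1f}\n".format(rating))
    out.write("RMSE: {:.2f}\n".format(rmse_value))


# Hypothetical predictions written to an in-memory buffer.
buffer = io.StringIO()
write_predictions([(2043, 3.4), (2043, 4.1), (10851, 4.3)], 0.95, buffer)
print(buffer.getvalue(), end="")
```

In the real program the third argument would be sys.stdout.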

Explanation


Your program only processes a single test case via standard in and standard out.

With probe.txt being provided via standard in, your program produces the predictions being sought.

The output is large (10 MB), so rather than having you turn it in, the grader will run your program.

Analysis


These are additional descriptions of the underlying math:

Requirements


  1. Estimate time to completion.
  2. Create a private Git repository at GitHub, named cs373-netflix.
  3. Add these requirements to the issue tracker at GitHub, at least 10 issues.
    Add at least 10 more issues, one for each bug or feature, both open and closed with a good description and a label.
  4. Invite the grader to your private code repo.
  5. Clone your private code repo onto your local directory.
  6. Make at least 5 commits, one for each bug or feature.
    If you cannot describe your changes in a sentence, you are not committing often enough.
    Make meaningful commit messages that identify the corresponding issue in the issue tracker.
  7. Clone the public class repo onto your local directory.
    It is critical that you clone the public class repo into a different directory than the one you're using for your private code repo.
  8. Copy the code files from the clone of the public class repo to the clone of the private code repo.
  9. Write unit tests in TestNetflix.py that test corner cases and failure cases until you have an average of 3 tests for each function, confirm the expected failures, and add, commit, and push to the private code repo.
  10. Implement and debug the simplest possible solution in Netflix.py with assertions that check pre-conditions, post-conditions, argument validity, and return-value validity, until all tests pass, and add, commit, and push to the private code repo.
  11. Create 1000 lines of acceptance tests in RunNetflix.in and RunNetflix.out that test corner cases and failure cases, and add, commit, and push to the private code repo.
  12. Pass five other students' acceptance tests.
  13. Clone the public test repo onto your local directory.
    It is critical that you clone the public test repo into a different directory than the one you're using for your private code repo.
  14. Copy your unit tests and your acceptance tests to your clone of the public test repo, rename the files, do a git pull to synchronize your clone, and then add, commit and push to the public test repo.
    The files MUST be named <cs-username>-RunNetflix.in, <cs-username>-RunNetflix.out, <cs-username>-TestNetflix.py, and <cs-username>-TestNetflix.out in the public test repo.
  15. Implement (or reuse) and debug the simplest possible set of caches until all tests pass, and add, commit, and push to the private code repo.
  16. Run pydoc on Netflix.py to create Netflix.html, which documents the interfaces to your functions.
    Create inline comments if you need to explain the why of a particular implementation.
    Use a consistent coding convention with good variable names, good indentation, blank lines, and blank spaces.
  17. Create a log of your commits in Netflix.log.
  18. Obtain the git SHA with
    git rev-parse HEAD
  19. Fill in the Google Form.
  20. It is your responsibility to protect your code from the rest of the students in the class. If your code gets out, you are as guilty of academic dishonesty as the recipient.
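
Steps 9 and 10 above (unit tests for corner and failure cases, plus assertions on pre-conditions and post-conditions) can be sketched as follows. The function clamp_rating and the test class are hypothetical examples, not functions from the project skeleton.

```python
import unittest


def clamp_rating(value):
    """Clamp a predicted rating to the valid range 1.0 - 5.0."""
    # Pre-condition: the argument must be a number.
    assert isinstance(value, (int, float))
    result = min(5.0, max(1.0, float(value)))
    # Post-condition: the return value is a valid rating.
    assert 1.0 <= result <= 5.0
    return result


class TestClampRating(unittest.TestCase):
    def test_in_range(self):
        self.assertEqual(clamp_rating(3.4), 3.4)

    def test_corner_below_range(self):
        self.assertEqual(clamp_rating(0.2), 1.0)

    def test_failure_bad_argument(self):
        # A string violates the pre-condition, so the assertion fires.
        self.assertRaises(AssertionError, clamp_rating, "5")


suite = unittest.TestLoader().loadTestsFromTestCase(TestClampRating)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Three tests for one function matches the required average of 3 tests per function. Note that assertions are skipped when Python runs with the -O flag, so the failure test relies on running without it.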

Requirements for getting a non-zero grade.


  1. [  5 pts] GitHub private repo with grader invited as collaborator and a log of the commits.
  2. [  5 pts] GitHub issue tracker with issues from requirements and more.
  3. [15 pts] Standards-compliant Python 3.2.3 with an RMSE of less than 1.0 and a runtime of less than 1 min on probe.txt.
  4. [15 pts] Average of 3 unit tests per function with good coverage in the public test repo with the precise naming of the files.
  5. [15 pts] 1000 lines of acceptance tests in the public test repo with the precise naming of the files. Your code must successfully pass five other students' acceptance tests with an RMSE of less than 1.0.
  6. [10 pts] Pydoc documentation.
  7. [  5 pts] Google Form with time estimate.

Extra Credit


  • You can earn 5 bonus pts if you produce an RMSE of less than 0.9474.
  • You can earn another 5 bonus pts if you work with a partner using pair programming and vouch that you worked on the project together for more than 75% of the time.
    The pair must turn in only one solution. If two solutions are turned in, there will be a 10% penalty, and the later one will be graded.
  • Bonus pts will not increase the total score beyond the max score.

Files


  1. RunNetflix.in
  2. RunNetflix.out
  3. Netflix.html
  4. Netflix.log
  5. Netflix.py
  6. RunNetflix.py
  7. TestNetflix.out
  8. TestNetflix.py

Tools


Guides


Grader


Name            GitHub ID    GitHub Test Repository    Google Form
Mukund Rathi    004rathim    netflix-tests             Google Form