The student-side autograding was headed by Nick Hay, Brad Miller, and Pieter Abbeel.
In this project, you will design three classifiers: a naive Bayes classifier, a perceptron classifier and a large-margin (MIRA) classifier. You will test your classifiers on two image data sets: a set of scanned handwritten digit images and a set of face images in which edges have already been detected. Even with simple features, your classifiers will be able to do quite well on these tasks when given enough training data.
Optical character recognition (OCR) is the task of extracting text from image sources. The first data set on which you will run your classifiers is a collection of handwritten numerical digits (0-9). This is a very commercially useful technology, similar to the technique used by the US post office to route mail by zip codes. There are systems that can perform with over 99% classification accuracy (see LeNet-5 for an example system in action).
Face detection is the task of localizing faces within video or still images. The faces can be at any location and vary in size. There are many applications for face detection, including human-computer interaction and surveillance. You will attempt a simplified face detection task in which your system is presented with an image that has been pre-processed by an edge detection algorithm. The task is to determine whether the edge image is a face or not. Several systems in use perform quite well at the face detection task. One good system is the face detector by Schneiderman and Kanade.
The code for this project includes the following files and data, available as a zip file.
||Data file, including the digit and face data.|
Files you will edit
||The location where you will write your naive Bayes classifier.|
||The location where you will write your perceptron classifier.|
||The location where you will write your MIRA classifier.|
||The wrapper code that will call your classifiers. You will also write your enhanced feature extractor here. You will also use this code to analyze the behavior of your classifier.|
Files you should read but NOT edit
||Abstract super class for the classifiers you will write. (You should read this file carefully to see how the infrastructure is set up.)|
||I/O code to read in the classification data.|
||Code defining some useful tools. You may be familiar with some of these by now, and they will save you a lot of time.|
||A simple baseline classifier that just labels every instance as the most frequent class.|
What to submit: You will fill in portions of
(only) during the assignment, and submit them. If you do the
This assignment should be submitted via
with the assignment name
using these submission
Evaluation: Your code will be autograded for technical correctness. Please do not change the names of any provided functions or classes within the code, or you will wreak havoc on the autograder.
Academic Dishonesty: We will be checking your code against other submissions in the class for logical redundancy. If you copy someone else's code and submit it with minor changes, we will know. These cheat detectors are quite hard to fool, so please don't try. We trust you all to submit your own work only; please don't let us down. Instead, contact the course staff if you are having trouble.
Getting Help: You are not alone! If you find yourself stuck on something, contact the course staff for help. Office hours and Piazza are there for your support; please use them. If you can't make our office hours, let us know and we will schedule more. We want these projects to be rewarding and instructional, not frustrating and demoralizing. But we don't know when or how to help unless you ask. One more piece of advice: if you don't know what a variable does or what kind of values it takes, print it out.
To try out the classification pipeline, run
from the command line. This will classify the digit data using
the default classifier (mostFrequent), which blindly classifies every example
with the most frequent label.
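For intuition, a most-frequent-label baseline can be sketched in a few lines. This is an illustrative stand-in, not the project's own baseline file; the function names and the toy labels are hypothetical.

```python
from collections import Counter

def most_frequent_label(training_labels):
    """Return the label that occurs most often in the training set."""
    return Counter(training_labels).most_common(1)[0][0]

def classify_all(data, label):
    """Ignore the features entirely and give every datum the same label."""
    return [label for _ in data]

# Toy labels, for illustration only.
toy_training_labels = [1, 3, 1, 7, 1, 3]
guess = most_frequent_label(toy_training_labels)
predictions = classify_all(["datum A", "datum B"], guess)
```

Any classifier you write should beat this baseline, since it never looks at the input.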
As usual, you can learn more about the possible command line options by running:
python dataClassifier.py -h
We have defined some simple features for you. Later you will
design some better features. Our simple feature set includes one feature for
each pixel location, which can take values 0 or 1 (off or on).
The features are encoded as a
Counter where keys
are feature locations (represented as (column,row)) and values
are 0 or 1. The face recognition data set has value 1 only for
those pixels identified by a Canny edge detector.
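As a rough illustration of this encoding, the sketch below builds such a feature map from a tiny binary image. It uses `collections.Counter` from the standard library as a stand-in for the project's own Counter, and both the function name `pixel_features` and the list-of-rows input format are assumptions made for this example.

```python
from collections import Counter

def pixel_features(image_rows):
    """Map each (column, row) location to 1 if the pixel is on, else 0.

    image_rows is a list of rows, each a list of 0/1 pixel values
    (an assumed input format, not the project's datum class).
    """
    features = Counter()
    for row, pixels in enumerate(image_rows):
        for column, value in enumerate(pixels):
            features[(column, row)] = 1 if value else 0
    return features

toy_image = [
    [0, 1, 0],
    [1, 1, 0],
]
features = pixel_features(toy_image)
```

Note that the key order is (column, row), so `features[(1, 0)]` is the second pixel of the first row.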
Implementation Note: You'll find it easiest to hard-code the binary feature assumption. If you do, make sure you don't include any non-binary features. Or, you can write your code more generally, to handle arbitrary feature values, though this will probably involve a preliminary pass through the training set to find all possible feature values (and you'll need an "unknown" option in case you encounter a value in the test data you never saw during training).
A skeleton implementation of a naive Bayes classifier is
provided for you in
You will fill in the
trainAndTune function, the
function and the
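A naive Bayes classifier models a joint distribution over a
label $Y$ and a set of observed
random variables, or features, $\{F_1, F_2, \ldots, F_n\}$, using the assumption that
the full joint distribution can be factored as follows (features
are conditionally independent given the label):
\[ P(F_1, \ldots, F_n, Y) = P(Y) \prod_{i=1}^{n} P(F_i \mid Y) \]
To classify a datum, we can find the most probable label given the feature values for each pixel, using Bayes' theorem:
\[ P(y \mid f_1, \ldots, f_n) = \frac{P(y) \prod_{i=1}^{n} P(f_i \mid y)}{P(f_1, \ldots, f_n)} \]
\[ \arg\max_y P(y \mid f_1, \ldots, f_n) = \arg\max_y P(y) \prod_{i=1}^{n} P(f_i \mid y) \]
Because multiplying many probabilities together often results in underflow, we will instead compute log probabilities, which have the same argmax:
\[ \arg\max_y \log P(y, f_1, \ldots, f_n) = \arg\max_y \left( \log P(y) + \sum_{i=1}^{n} \log P(f_i \mid y) \right) \]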
To compute logarithms, use
built-in Python function.
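The underflow problem is easy to reproduce. The sketch below multiplies many small probabilities directly and then computes the same quantity as a sum of logs. It uses `math.log` from the standard library as one way to take logarithms; the helper name `log_joint` and the toy numbers are assumptions made for this example.

```python
import math

def log_joint(prior, conditionals):
    """Return log P(y) + sum_i log P(f_i | y) for one label."""
    return math.log(prior) + sum(math.log(p) for p in conditionals)

# 1000 features, each with conditional probability 0.01 under this label.
conditionals = [0.01] * 1000

product = 0.5
for p in conditionals:
    product *= p  # underflows to 0.0 long before the loop ends

log_score = log_joint(0.5, conditionals)  # stays finite
```

Because the direct product collapses to zero for every label, it cannot be used to compare labels, while the log scores remain comparable.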
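We can estimate $P(Y)$ directly
from the training data:
\[ \hat{P}(y) = \frac{c(y)}{n} \]
where $c(y)$ is the number of training instances with label $y$ and $n$ is the total number of training instances.
The other parameters to estimate are the conditional probabilities of our features given each label y: $P(F_i \mid Y = y)$. We do this for each possible feature value ($f_i \in \{0, 1\}$):
\[ \hat{P}(F_i = f_i \mid Y = y) = \frac{c(f_i, y)}{\sum_{f_i'} c(f_i', y)} \]
where $c(f_i, y)$ is the number of times feature $F_i$ took value $f_i$ in the training examples with label $y$.
In this project, we use Laplace smoothing, which adds k counts to every possible observation value:
\[ P(F_i = f_i \mid Y = y) = \frac{c(f_i, y) + k}{\sum_{f_i'} \left( c(f_i', y) + k \right)} \]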
If k=0, the probabilities are unsmoothed. As k grows larger, the probabilities are smoothed more and more. You can use your validation set to determine a good value for k. Note: don't smooth P(Y).
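The effect of k can be seen in a small sketch. It assumes binary features (two possible values per feature), so adding k counts to every value adds 2k to the denominator. The function name and its arguments are hypothetical and do not mirror the project's data structures.

```python
def laplace_estimate(count_value, count_label, k, num_values=2):
    """Smoothed estimate of P(F_i = f_i | Y = y).

    count_value: times the feature took value f_i among examples labeled y
    count_label: number of training examples labeled y
    k: smoothing strength; num_values: number of possible feature values
    """
    # float() keeps the division exact even under integer-division semantics.
    return float(count_value + k) / (count_label + k * num_values)

# A pixel that was never on in 10 training examples of some label:
unsmoothed = laplace_estimate(0, 10, k=0)   # zero probability
smoothed = laplace_estimate(0, 10, k=1)     # small but nonzero
heavy = laplace_estimate(0, 10, k=100)      # pulled close to 1/2
```

With k=0 a single unseen feature value drives the whole product to zero (and its log is undefined), which is the main practical reason to smooth.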
Question 1 (6 points)
In trainAndTune, estimate conditional
probabilities from the training data for each possible value
of k given in the list kgrid. Evaluate
accuracy on the held-out validation set for each k and choose
the value with the highest validation accuracy. In case of ties,
prefer the lowest value of k. Test your classifier with:
python dataClassifier.py -c naiveBayes --autotune
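The tie-breaking rule is easy to get wrong, so here is a minimal sketch of the selection step on its own. The helper `pick_best` and the accuracy table are hypothetical; in the assignment the accuracies would come from evaluating your classifier on the validation set.

```python
def pick_best(candidates, accuracy_of):
    """Return the candidate with the highest accuracy; ties go to the lowest value."""
    best_value, best_accuracy = None, None
    for value in sorted(candidates):
        accuracy = accuracy_of(value)
        # Strict '>' keeps the earlier (smaller) value when accuracies tie.
        if best_accuracy is None or accuracy > best_accuracy:
            best_value, best_accuracy = value, accuracy
    return best_value

# Hypothetical validation accuracies, for illustration only.
toy_accuracy = {0.001: 0.70, 0.01: 0.74, 0.1: 0.74, 1: 0.72}
best_k = pick_best(toy_accuracy.keys(), toy_accuracy.get)
```

Here 0.01 and 0.1 tie, and the smaller value wins. Remember to keep the parameters learned with the winning value, not those from the last value tried.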
Hints and observations:
calculateLogJointProbabilities uses the conditional probability tables constructed by
trainAndTune to compute the log posterior probability for each label y given a feature vector. The comments of the method describe the data structures of the input and output.
dataClassifier.py to explore the mistakes that your classifier is making. This is optional.
--autotune option. This will ensure that
kgrid has only one value, which you can change with
--autotune, which tries different values of k, you should get a validation accuracy of about 74% and a test accuracy of 65%.
To run the autograder for this question use the following command (as usual):
python autograder.py -q q1
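Another, better, tool for understanding the parameters is to
look at odds ratios. For each pixel
feature $F_i$ and classes $y_1, y_2$, consider the odds ratio:
\[ \mathrm{odds}(F_i = \mathrm{on}, y_1, y_2) = \frac{P(F_i = \mathrm{on} \mid y_1)}{P(F_i = \mathrm{on} \mid y_2)} \]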
The features that have the greatest impact at classification time are those with both a high probability (because they appear often in the data) and a high odds ratio (because they strongly bias one label versus another).
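Ranking features by odds ratio amounts to a sort. The sketch below works on a toy table of smoothed conditional probabilities. The table layout, keyed by (feature, label), and the function name are assumptions made for this example, not the project's own structures.

```python
def top_odds_features(cond_prob, label1, label2, features, n):
    """Return the n features with the highest P(F=on | label1) / P(F=on | label2).

    cond_prob maps (feature, label) to a smoothed P(F=on | label), so no
    denominator is zero (an assumed layout, for illustration only).
    """
    def odds(feature):
        return cond_prob[(feature, label1)] / cond_prob[(feature, label2)]

    return sorted(features, key=odds, reverse=True)[:n]

# Toy probabilities for three pixels and the labels 3 and 2.
toy_cond_prob = {
    ((0, 0), 3): 0.9, ((0, 0), 2): 0.1,
    ((1, 0), 3): 0.5, ((1, 0), 2): 0.5,
    ((2, 0), 3): 0.2, ((2, 0), 2): 0.8,
}
toy_features = [(0, 0), (1, 0), (2, 0)]
best = top_odds_features(toy_cond_prob, 3, 2, toy_features, n=2)
```

Smoothing matters here as well: an unsmoothed zero in the denominator would make the ratio undefined.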
Question 2 (2 points)
Fill in the function
label2). It should return a list of the 100 features
with highest odds ratios for
-o activates an odds ratio analysis.
Use the options
-1 label1 -2 label2 to specify
which labels to compare. Running the following command will show
you the 100 pixels that best distinguish between a 3 and a 2.
python dataClassifier.py -a -d digits -c naiveBayes -o -1 3 -2 2
Which of the following images best represents the distribution of these pixels:
Answer the question in answers.py, in the method q2, returning either 'a', 'b', 'c' or 'd'.
perceptron.py. You will fill in the
train function, and the
Unlike the naive Bayes classifier, a perceptron does not use
probabilities to make its decisions. Instead, it keeps a
weight vector $w^y$ for each
class $y$ ($y$ is an identifier, not an exponent).
Given a feature list $f$,
the perceptron computes the class $y$ whose weight vector is most similar
to the input vector $f$. Formally,
given a feature vector $f$ (in our
case, a map from pixel locations to indicators of whether they
are on), we score each class with:
\[ \mathrm{score}(f, y) = \sum_i f_i \, w^y_i \]
Using the addition, subtraction, and multiplication
functionality of the
Counter class in
the perceptron updates should be
relatively easy to code. Certain implementation issues have been
taken care of for you in
such as handling iterations
over the training data and ordering the update trials.
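A single training step can be sketched as follows. The update shown is the standard multi-class perceptron rule (on a mistake, add the feature vector to the true class's weights and subtract it from the guessed class's weights), which is an assumption about what this project intends. The sketch uses `collections.Counter` as a stand-in for the project's Counter; because the standard library's `+` and `-` operators drop non-positive entries, it updates key by key instead. All function names are hypothetical.

```python
from collections import Counter

def score(weights, features):
    """Dot product of one class's weight vector with the feature vector."""
    return sum(weights[key] * value for key, value in features.items())

def perceptron_step(weights_by_label, features, true_label):
    """Predict a label; on a mistake, move weights toward the true class."""
    # Iterating labels in sorted order makes ties go to the smallest label.
    guess = max(sorted(weights_by_label),
                key=lambda label: score(weights_by_label[label], features))
    if guess != true_label:
        for key, value in features.items():
            weights_by_label[true_label][key] += value
            weights_by_label[guess][key] -= value
    return guess

weights = {0: Counter(), 1: Counter()}
datum = Counter({(0, 0): 1, (1, 0): 0, (2, 0): 1})
first_guess = perceptron_step(weights, datum, true_label=1)   # wrong, so weights change
second_guess = perceptron_step(weights, datum, true_label=1)  # now correct
```

When the guess is already correct, the weights are left untouched.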
The code sets up the
weights data structure for you. Each
legal label needs its own
Counter full of weights.
Question 3 (4 points) Fill in the
Run your code with:
python dataClassifier.py -c perceptron
Hints and observations:
-i iterations option. Try different numbers of iterations and see how it influences the performance. In practice, you would use the performance on the validation set to figure out when to stop training, but you don't need to implement this stopping criterion for this assignment.
Question 4 (1 point) Fill in
It should return a list of the 100 features with highest feature
weights for that label. You can display the 100 pixels with the
largest weights using the command:
python dataClassifier.py -c perceptron -w
Use this command to look at the weights, and answer the following true/false question. Which of the following sequences of weights is most representative of the perceptron?
Answer the question in answers.py, in the method q4, returning either 'a' or 'b'.
mira.py. MIRA is an online learner which is closely related to both the support vector machine and perceptron classifiers. You will fill in the
mira.py. This method should train a MIRA classifier using each value of C in
Cgrid. Evaluate accuracy on the held-out validation set for each C and choose the C with the highest validation accuracy. In case of ties, prefer the lowest value of C. Test your MIRA implementation with:
python dataClassifier.py -c mira --autotune
Hints and observations:
self.max_iterations times during training.
self.weights, so that these weights can be used to test your classifier.
--autotune option from the command above.
--autotune should be in the 60's.
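For reference, one common form of the MIRA update is sketched below. It behaves like the perceptron update, but scales the step by tau = min(C, ((w_wrong - w_right) . f + 1) / (2 ||f||^2)), so C caps how far a single example can move the weights. Whether this is exactly the variant intended here is an assumption, and the function names and dict-based weight vectors are hypothetical.

```python
def mira_step_size(wrong_weights, right_weights, features, C):
    """Capped step size tau = min(C, ((w_wrong - w_right) . f + 1) / (2 ||f||^2)).

    Assumes at least one feature is nonzero, so the squared norm is not zero.
    """
    margin = sum((wrong_weights.get(k, 0.0) - right_weights.get(k, 0.0)) * v
                 for k, v in features.items())
    norm_sq = sum(v * v for v in features.values())
    return min(C, (margin + 1.0) / (2.0 * norm_sq))

def mira_update(wrong_weights, right_weights, features, C):
    """Apply one MIRA update in place after a misclassification."""
    tau = mira_step_size(wrong_weights, right_weights, features, C)
    for k, v in features.items():
        right_weights[k] = right_weights.get(k, 0.0) + tau * v
        wrong_weights[k] = wrong_weights.get(k, 0.0) - tau * v
    return tau

toy_features = {(0, 0): 1, (1, 0): 1}
tau_capped = mira_update({}, {}, toy_features, C=0.1)    # cap applies
wrong, right = {}, {}
tau_free = mira_update(wrong, right, toy_features, C=10.0)  # cap does not apply
```

A small C makes every update cautious, while a large C lets the step grow until the mistaken example is classified correctly with a margin.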
Building classifiers is only a small part of getting a good system working for a task. Indeed, the main difference between a good classification system and a bad one is usually not the classifier itself (e.g. perceptron vs. naive Bayes), but rather the quality of the features used. So far, we have used the simplest possible features: the identity of each pixel (being on/off).
To increase your classifier's accuracy further, you will need
to extract more useful features from the data. The
is your new playground. When analyzing your classifiers'
results, you should look at some of your errors and look for
characteristics of the input that would give the classifier
useful information about the label. You can add code to the
to inspect what your classifier is doing.
For instance, in the digit data, consider the number of separate,
connected regions of white pixels, which varies by digit type.
1, 2, 3, 5, 7 tend to have one contiguous region of white space
while the loops in 6, 8, 9 create more. The number of white
regions in a 4 depends on the writer. This is an example of a
feature that is not directly available to the classifier from
the per-pixel information. If your feature extractor adds new
features that encode these properties, the classifier will be
able to exploit them. Note that some features may require
non-trivial computation to extract, so write efficient and correct code.
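Counting connected regions is a standard flood-fill problem. The sketch below counts 4-connected regions of off pixels with a breadth-first search, treating "white" as off (value 0), which is an assumption about the data. The function name and the list-of-rows image format are also hypothetical.

```python
from collections import deque

def count_off_regions(image_rows):
    """Count 4-connected regions of off (0) pixels in a binary image."""
    height, width = len(image_rows), len(image_rows[0])
    seen = set()
    regions = 0
    for start_row in range(height):
        for start_col in range(width):
            if image_rows[start_row][start_col] or (start_col, start_row) in seen:
                continue
            # Found an unvisited off pixel: flood-fill its whole region.
            regions += 1
            seen.add((start_col, start_row))
            queue = deque([(start_col, start_row)])
            while queue:
                col, row = queue.popleft()
                for ncol, nrow in ((col + 1, row), (col - 1, row),
                                   (col, row + 1), (col, row - 1)):
                    if (0 <= ncol < width and 0 <= nrow < height
                            and not image_rows[nrow][ncol]
                            and (ncol, nrow) not in seen):
                        seen.add((ncol, nrow))
                        queue.append((ncol, nrow))
    return regions

loop_shape = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 0, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
stroke_shape = [
    [0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
]
loop_regions = count_off_regions(loop_shape)      # background plus the enclosed hole
stroke_regions = count_off_regions(stroke_shape)  # background only
```

Each pixel is visited once, so the cost is linear in the image size, which keeps feature extraction cheap even over thousands of images.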
Question 6 (6 points)
Add new features for the digit dataset in the
function in such a way that it works
with your implementation of the naive Bayes classifier:
this means that for this part, you are restricted to features
which can take a finite number of discrete
values (and if you have assumed that features are binary valued,
then you are restricted to binary features).
Note that you can encode a feature which takes 3 values [1,2,3]
by using 3
binary features, of which only one is on at a time, to indicate which
of the three possibilities you have. In theory, features aren't
conditionally independent as naive Bayes requires,
but your classifier can still work well in practice. We will
test your classifier with the following command:
python dataClassifier.py -d digits -c naiveBayes -f -a -t 1000
With the basic features (without the
-f option), your optimal choice of smoothing parameter should yield 82% on the validation set with a test performance of 78%. You will receive 3 points for implementing new feature(s) which yield any improvement at all. You will receive 3 additional points if your new feature(s) give you a test performance greater than or equal to 84% with the above command.
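The binary encoding of a multi-valued feature mentioned above can be sketched as follows. The helper name `one_hot` and the feature name "regions" are hypothetical, chosen only for this example.

```python
def one_hot(name, value, possible_values):
    """Encode one discrete feature as binary features, exactly one of them on."""
    return {(name, candidate): 1 if candidate == value else 0
            for candidate in possible_values}

# A feature that takes the value 2 out of the possible values [1, 2, 3].
encoded = one_hot("regions", 2, [1, 2, 3])
```

The resulting entries can be merged into the per-pixel feature map, as long as the new keys cannot collide with the (column, row) pixel keys.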
Mini Contest (3 points extra credit)
How well can you classify? Fill in code in
training and classification.
To run your classifier, use:
python dataClassifier.py -d digits -c minicontest -t 5000 -s 1000
When you specify the minicontest classifier, features are extracted using
contestFeatureExtractorDigit. You are free to implement any classifier you want. You might consider modifying Mira or NaiveBayes, for example. You should encode any tuning parameters directly in
minicontest.py. We will allow your classifier to train on 5000 examples, but will test you on a new set of 1000 digits. The 3 teams with the highest classification accuracy will receive 3, 2, and 1 points, respectively. Don't forget to describe what you've done in your comments. Note that there is no autograder module for the minicontest.
Congratulations! You're finished with the CS 343H projects.