Project 2 for CS 343: Artificial Intelligence
Learning for Text Categorization

Due: April 30, 2009

Note: While working on this project you will become familiar with additional files from the 'Information Retrieval' Java project and the Gnuplot plotting program. You are encouraged to start early so that you have time to ask questions as they arise.

Existing Categorization Framework

The goal of automatic text categorization is to assign text documents to one of the given categories using a classifier. First, the classifier is trained on a set of documents that have category labels. Then it can be used to assign categories to unlabeled documents. See the Powerpoint lecture slides from CS 378: Information Retrieval class for more details.

A basic text categorization framework is available in /u/mooney/cs343-code/ir/classifiers/. See the Javadoc documentation for this code. Right now, ir.classifiers has Naive Bayes and an incomplete Perceptron classifier in it. Both of them extend the abstract class Classifier.

The NaiveBayes classifier performs text categorization using the Naive Bayes method, and does Laplace smoothing. It also stores all probabilities internally in log values to prevent underflow problems.

The Perceptron classifier performs text categorization using a separate PerceptronUnit for each category. The category with the highest confidence value returned by its PerceptronUnit is chosen when text documents are classified.

The Yahoo-science document collection in /u/mooney/ir-code/corpora/yahoo-science/ contains 600 pages: 200 random samples each from the Yahoo indices for biology, physics, and chemistry. Right now, the code is set up to categorize the yahoo-science collection into 3 categories - bio, phys, chem. The NaiveBayes and Perceptron classes both have train() methods that take in a vector of training documents as Example objects. They also have a test() method that takes in a test Example and categorizes it as a bio, phys or chem document.

In vector-space representation a text document is stored as a sparse vector. The size of the vector equals the total number of words in the vocabulary over all documents; each token in the vocabulary is associated with a particular component of the vector. For a particular document, values of the components of its vector correspond to term frequencies - the number of occurrences of each token in this document. Since documents rarely include all words from the vocabulary, the vector is very sparse: most components are 0, corresponding to words that do not occur in the particular document.

Because storing the full vector consisting mostly of zeros for each document would be very inefficient, a document is represented inside the Example object by a HashMapVector. This class is a part of ir.vsr package, you will have to utilize its methods for this assignment. It represents a document vector as a hashtable, where keys are the tokens that occur in a document, and values store the term frequencies.

The CVLearningCurve class generates the learning curves with k-fold (default k=10) cross validation for a classifier. The classifier object is passed as an argument to the CVLearningCurve constructor. TestNaiveBayes and TestPerceptron classes create a NaiveBayes classifier object and a Perceptron classifier object respectively, and run 10-fold cross-validation on the corresponding classifier on the yahoo-science dataset.

The output of CVLearningCurve is two ".gplot" files that gnuplot can use to generate the k-fold learning curve (plot) in postscript. For example, running TestNaiveBayes writes "NaiveBayes.gplot" and "NaiveBayesTrain.gplot" files to the current working directory that contain learning curves for the test and training data, respectively. To run gnuplot on a script file, just execute "gnuplot filename.gplot > filename.ps". You can view the ".ps" file with "gv filename.ps" (Ghostview). See a sample trace of running the NaiveBayes classifier on the yahoo-science document collection, and the learning curve produced for the testing data.

Your Task

The Perceptron classifier is only partially implemented and your task is to complete it. As mentioned above, the Perceptron classifier operates through several PerceptronUnits, one per category. The output of the classify() method of each PerceptronUnit should be the amount by which the net input exceeds the threshold. This corresponds to confidence of the prediction of being positive. Your task is to fill in classify() and the trainCategory() methods for the ir.classifiers.PerceptronUnit class. As stated in the declaration of the trainCategory method, it trains the perceptron to only fire for examples in the given category, treating examples from all other categories as negative examples. Implement the Perceptron training method based on the pseudocode presented in class. The "-debug" flag to TestPerceptron allows you to insert conditional print statements to aid debugging (see the use of "-debug" in NaiveBayes).

First, copy the latest version of the ir.* packages from /u/mooney/cs343-code/ir/, the files have changed since Project 1.

Hint 0: You don't need to peruse all the code in ir.* packages. The classes that you should be familiar with for this assignment are
PerceptronUnit, Perceptron, Example, and HashMapVector.

Hint 1: Remember that each Example object in the training set stores the values of its features in a HashMapVector. It can be iterated as a list of Map.Entry's, where the key of each Map.Entry is the token, and the value is its frequency in the document represented by a Weight object. Thus, inputs to the perceptron are integer values, which is a natural extension of the binary-input perceptron presented in class.

Hint 2: The maximum number of training epochs parameter is currently set to 100 in PerceptronUnit class. A full run of TestPerceptron with the correct solution to this project takes approximately 10 minutes on my PC. You may consider reducing the number of default points or the number of folds in CVLearningCurve while debugging; be sure, however, to change it back for your actual data collection runs once you are done debugging.

Be sure to document your code well.

Next, train and test both NaiveBayes and Perceptron on the yahoo-science data by running TestNaiveBayes and TestPerceptron as shown above. You will obtain learning curves for both classifiers for training and testing data. As mentioned above, the curves can be plotted using the gnuplot utility by executing "gnuplot filename.gplot > filename.ps" and then viewing the Postscript file filename.ps by executing "gv filename.ps".

Combine these four curves into two comparitive plots manually: one for training data, one for testing data. To do this for the training curves, change the line in PerceptronTrain.gplot file from
plot 'PerceptronTrain.data' title "PerceptronTrain"
to
plot 'PerceptronTrain.data' title "PerceptronTrain", 'NaiveBayesTrain.data' title "NaiveBayesTrain"
Create the combined plot for testing data analogously. Name these plots "accuracy-train.ps" and "accuracy-test.ps". These graphs will allow you to clearly see the relative performance of the two sytems with respect to their ability to fit training data and to accurately predict the category of previously unseen documents.

Write a 1-2 page report on your results, containing insightful discussion of the comparative performance of the two algorithms. Discuss the differences in test and training accuracy given different numbers of training exmaples (i.e., at different points on the learning curve). Also discuss the relative training time and test time of the two methods (this is presented at the end of the traces). Hint: Perceptron's test accuracy should be reasonably comparable to NaiveBayes and its training accuracy should be uniformly better, if not, your implementation is buggy.

Submission

In submitting your solution, follow these instructions on using the turnin utility.
Along with that, follow these specific instructions: