Project 2 for CS 343: Artificial Intelligence
Learning for Text Categorization

Due: Nov. 24, 2010

Note: While working on this project you will become familiar with additional classes from the 'Information Retrieval' Java code-base and the Gnuplot plotting program. You are encouraged to start early so that you have time to ask questions as they arise.

Existing Text Categorization Framework

The goal of automatic text categorization is to assign a text document to one of a given set of categories using a classifier. First, the classifier is trained on a set of documents that a person has annotated with category labels. Then it can be used to automatically assign categories to unlabeled documents.

A basic text categorization framework is available in /u/mooney/cs343-code/ir/classifiers/. See the Javadoc documentation for this code. Right now, ir.classifiers has Naive Bayes and an incomplete Perceptron classifier in it. Both of them extend the abstract class Classifier.

The NaiveBayes classifier performs text categorization using the Naive Bayes method with Laplace smoothing. It stores all probabilities internally as log values to prevent underflow problems.
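To make the smoothing and log-probability idea concrete, here is a minimal sketch of an add-one (Laplace) smoothed conditional probability returned in log space. This is illustrative only; the names logProb, countInClass, classTotal, and vocabSize are made up for this sketch and are not the actual fields or methods of the NaiveBayes class.

```java
// Sketch of Laplace-smoothed log probability -- NOT the actual NaiveBayes code.
public class LaplaceSketch {
    /**
     * P(token | category) with add-one (Laplace) smoothing, returned as a log
     * so that products of many small probabilities become sums and never underflow.
     *
     * @param countInClass occurrences of the token in this category's training docs
     * @param classTotal   total token occurrences in this category
     * @param vocabSize    number of distinct tokens in the whole vocabulary
     */
    public static double logProb(int countInClass, int classTotal, int vocabSize) {
        return Math.log((countInClass + 1.0) / (classTotal + vocabSize));
    }

    public static void main(String[] args) {
        // An unseen token (count 0) still gets a small nonzero probability.
        System.out.println(logProb(0, 1000, 5000));
        System.out.println(logProb(50, 1000, 5000));
    }
}
```

Note that an unseen token gets probability 1/(classTotal + vocabSize) rather than 0, which is exactly why smoothing matters: a single zero probability would otherwise veto an entire category.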

The Perceptron classifier performs text categorization using a separate PerceptronUnit for each category. The category with the highest confidence value returned by its PerceptronUnit is chosen when text documents are classified.
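The one-vs-rest decision rule described above can be sketched as a simple argmax over per-category confidence scores. The UnitLike interface below is a stand-in invented for this sketch; it is not the actual PerceptronUnit class, whose confidence comes from its classify() method.

```java
// Sketch of one-vs-rest classification: choose the category whose unit is most
// confident. "UnitLike" is a hypothetical stand-in, NOT the real PerceptronUnit.
public class ArgmaxSketch {
    interface UnitLike {
        double confidence(double[] input); // e.g. net input minus threshold
    }

    /** Returns the index of the unit reporting the highest confidence. */
    public static int classify(UnitLike[] units, double[] input) {
        int best = 0;
        double bestScore = units[0].confidence(input);
        for (int i = 1; i < units.length; i++) {
            double score = units[i].confidence(input);
            if (score > bestScore) {
                bestScore = score;
                best = i;
            }
        }
        return best;
    }
}
```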

The Yahoo-science document collection in /u/mooney/ir-code/corpora/yahoo-science/ contains 897 pages: about 300 random samples each from the Yahoo indices for biology, physics, and chemistry. Right now, the code is set up to categorize the yahoo-science collection into 3 categories - bio, phys, chem. The NaiveBayes and Perceptron classes both have train() methods that take in a vector of training documents as Example objects. They also have a test() method that takes in a test Example and categorizes it as a bio, phys or chem document.

In vector-space representation a text document is stored as a sparse vector. The size of the vector equals the total number of words in the vocabulary over all documents; each token in the vocabulary is associated with a particular component of the vector. For a particular document, values of the components of its vector correspond to term frequencies - the number of occurrences of each token in this document. Since documents rarely include all words from the vocabulary, the vector is very sparse: most components are 0, corresponding to words that do not occur in the particular document.
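The sparse term-frequency idea can be illustrated with a plain HashMap: only tokens that occur in the document get an entry, and every absent token is implicitly 0. This is a sketch of the representation, not the actual HashMapVector class described below.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of a sparse term-frequency vector -- NOT the actual HashMapVector class.
public class SparseVectorSketch {
    /** Counts token occurrences; tokens absent from the map implicitly have count 0. */
    public static Map<String, Integer> termFrequencies(String[] tokens) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : tokens) {
            tf.merge(token, 1, Integer::sum); // start at 1, then increment
        }
        return tf;
    }

    public static void main(String[] args) {
        // Only "atom" and "quark" get entries; every other vocabulary word is implicitly 0.
        System.out.println(termFrequencies(new String[]{"atom", "quark", "atom"}));
    }
}
```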

Because storing the full vector, consisting mostly of zeros, for each document would be very inefficient, a document is represented inside the Example object by a HashMapVector. This class is part of the ir.vsr package; you will need to use its methods for this assignment. It represents a document vector as a hashtable whose keys are the tokens that occur in the document and whose values store the term frequencies.

The CVLearningCurve class generates the learning curves with k-fold (default k=10) cross validation for a classifier. The classifier object is passed as an argument to the CVLearningCurve constructor. TestNaiveBayes and TestPerceptron classes create a NaiveBayes classifier object and a Perceptron classifier object respectively, and run 10-fold cross-validation on the corresponding classifier on the yahoo-science dataset.
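The fold bookkeeping behind k-fold cross validation can be sketched in a few lines. This is a generic illustration of the idea (example i tests in fold i mod k and trains in all other folds); it is not how CVLearningCurve actually partitions the data.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of k-fold cross-validation bookkeeping -- NOT the actual CVLearningCurve code.
// Example i is held out for testing in fold (i % k) and used for training otherwise,
// so every example is tested exactly once across the k folds.
public class FoldSketch {
    /** Indices held out as test data for the given fold (0 <= fold < k), n examples total. */
    public static List<Integer> testIndices(int fold, int k, int n) {
        List<Integer> test = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            if (i % k == fold) {
                test.add(i);
            }
        }
        return test;
    }
}
```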

The output of CVLearningCurve is two ".gplot" files that gnuplot can use to generate the k-fold learning curve (plot) in PostScript. Gnuplot is located in /lusr/gnu/bin/ - you may need to add this directory to your PATH environment variable (or use the full path when executing the program). For example, running TestNaiveBayes writes "NaiveBayes.gplot" and "NaiveBayesTrain.gplot" files to the current working directory that contain learning curves for the test and training data, respectively. To run gnuplot on a script file, execute "gnuplot filename.gplot > filename.ps". You can view the resulting ".ps" file with "gv" (Ghostview). See a sample trace of running the NaiveBayes classifier on the yahoo-science document collection, and the learning curve produced for the testing data.

Your Task

The Perceptron classifier is only partially implemented and your task is to complete it. As mentioned above, the Perceptron classifier utilizes multiple PerceptronUnits, one per category. The real-valued output of the classify() method of each PerceptronUnit should be the amount by which the net input exceeds the threshold. This corresponds to the confidence in the prediction of being positive, and is negative if the class is negative. Your task is to fill in classify() and the trainCategory() methods for the ir.classifiers.PerceptronUnit class. As stated in the declaration of the trainCategory method, it trains the perceptron to only fire for examples in the given category, treating examples from all other categories as negative examples. Implement the Perceptron training method based on the pseudocode presented in class. The "-debug" flag to TestPerceptron allows you to insert conditional print statements to aid debugging (see the use of "-debug" in NaiveBayes). Be sure to document your code well.
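For reference, the standard perceptron training loop presented in class looks like the sketch below, written over plain double arrays with +1/-1 labels. This is the generic textbook algorithm, not the solution: your trainCategory() must work over Example objects and HashMapVector features, and the learning rate, epoch limit, and array representation here are illustrative choices.

```java
// Generic perceptron training loop (the standard algorithm from class, sketched on
// dense arrays) -- NOT the HashMapVector-based solution you need to write.
public class PerceptronSketch {
    /**
     * Trains weights so that (w . x + bias) > 0 for +1 examples and <= 0 for -1
     * examples. Returns an array of length d+1; the last slot is the bias weight.
     */
    public static double[] train(double[][] x, int[] label /* +1 or -1 */,
                                 double rate, int maxEpochs) {
        int d = x[0].length;
        double[] w = new double[d + 1];          // last slot is the bias weight
        for (int epoch = 0; epoch < maxEpochs; epoch++) {
            boolean mistake = false;
            for (int i = 0; i < x.length; i++) {
                double net = w[d];               // bias contribution
                for (int j = 0; j < d; j++) net += w[j] * x[i][j];
                int pred = net > 0 ? 1 : -1;
                if (pred != label[i]) {          // update weights only on errors
                    mistake = true;
                    for (int j = 0; j < d; j++) w[j] += rate * label[i] * x[i][j];
                    w[d] += rate * label[i];
                }
            }
            if (!mistake) break;                 // converged: every example correct
        }
        return w;
    }
}
```

The quantity net (before thresholding) is exactly the kind of real-valued confidence that classify() should return: positive when the unit predicts the category, negative otherwise.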

First, copy the latest version of the ir.* packages from /u/mooney/cs343-code/ir/; the files have changed since Project 1.

Hint 1: You don't need to peruse all the code in ir.* packages. The classes that you should be familiar with for this assignment are
PerceptronUnit, Perceptron, Example, and HashMapVector.

Hint 2: Remember that each Example object in the training set stores the values of its features in a HashMapVector. One can iterate through the Map.Entry objects in the hashmap's entrySet, where the key of each Map.Entry is the token and the value is its frequency in the document, represented by a Weight object. Thus, inputs to the perceptron are integer values, a natural extension of the binary-input perceptron presented in class.
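The iteration pattern Hint 2 describes looks roughly like the sketch below, using a plain Map<String, Double> in place of HashMapVector and a double in place of its Weight wrapper. The method name netInput and both parameter names are invented for this illustration.

```java
import java.util.Map;

// Sketch of iterating a sparse document vector via entrySet -- a plain Map stands
// in for HashMapVector, and a double stands in for its Weight wrapper.
public class EntrySetSketch {
    /** Weighted sum over only the tokens that actually occur in the document. */
    public static double netInput(Map<String, Double> docVector,
                                  Map<String, Double> weights) {
        double net = 0.0;
        for (Map.Entry<String, Double> entry : docVector.entrySet()) {
            String token = entry.getKey();   // the feature (token) name
            double count = entry.getValue(); // its frequency in this document
            net += weights.getOrDefault(token, 0.0) * count;
        }
        return net;
    }
}
```

Iterating only the document's entries, rather than the whole vocabulary, is what makes the sparse representation efficient: absent tokens contribute 0 and are simply skipped.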

Hint 3: The maximum number of training epochs is currently set to 100 in the PerceptronUnit class. A full run of TestPerceptron with a correct solution to this project takes approximately 15 minutes on my PC. You may consider reducing the default number of points or the number of folds in CVLearningCurve while debugging; be sure, however, to change them back for your actual data-collection runs once you are done debugging. If Java gives an out-of-memory error, try executing the "java" command with the "-Xmx256m" option, which allocates more memory to the virtual machine.

Next, train and test both NaiveBayes and Perceptron on the yahoo-science data by running TestNaiveBayes and TestPerceptron as shown above. You will obtain learning curves for both classifiers on both training and testing data. As mentioned above, the curves can be plotted using the gnuplot utility by executing "gnuplot filename.gplot > filename.ps" and then viewing the PostScript file with "gv".

Combine these four curves into two comparative plots manually: one for training data, one for testing data. To do this for the training curves, change the line in the PerceptronTrain.gplot file from

plot '' title "PerceptronTrain"

to

plot '' title "PerceptronTrain", '' title "NaiveBayesTrain"

Create the combined plot for testing data analogously. Name these plots "" and "". These graphs will allow you to clearly see the relative performance of the two systems with respect to their ability to fit training data and to accurately predict the category of previously unseen documents.

Write a 1-2 page report on your results, containing insightful discussion of the comparative performance of the two algorithms. Discuss the differences in test and training accuracy given different numbers of training examples (i.e., at different points on the learning curve). Also discuss the relative training time and test time of the two methods (presented at the end of the traces). Hint: Perceptron's test accuracy should be reasonably comparable to NaiveBayes's, and its training accuracy should be uniformly better; if not, your implementation is buggy.


In submitting your solution, follow these instructions on using the turnin utility.
For this project, submit a directory that contains:

  1. Report.* file: detailed writeup in txt or pdf format
  2. soln-trace: trace file of program execution
  3. Final PostScript gnuplot output (2 learning curves shown in each file)
  4. Your source code: Submit only the .java files that you have modified. If this includes other files besides and, be sure to make it clear in your Report where those files should go.