Project 4 for CS 371R:
Text Categorization using Naive Bayes, KNN and Rocchio

Due date: April 29, 2008

Existing Categorization Framework

As discussed in class, a basic text categorization framework is available in ir.classifiers. See the Javadoc for this code. Right now, ir.classifiers has the NaiveBayes classifier in it. The NaiveBayes class is created by extending the abstract class Classifier. The NaiveBayes classifier performs text categorization using the Naive Bayes method, and does Laplace smoothing. It also stores all probabilities internally in log values, to prevent underflow problems. The NaiveBayes class has a train method that takes in a vector of training documents as classified Example objects. The ir.classifiers package also has a BayesResult class, which holds the result of training a Naive Bayes classifier. The code is currently set up to categorize the yahoo-science document collection at /u/mooney/ir-code/corpora/yahoo-science/ into 3 categories - bio, phys, chem. It also has a test method that takes in a test Example and categorizes it as a bio, phys or chem document.

The CVLearningCurve class generates the learning curves with k-fold (default k=10) cross validation for a classifier. The TestNaiveBayes class creates a NaiveBayes classifier object and runs 10-fold cross-validation of the classifier on the yahoo-science dataset. The output of CVLearningCurve are two .gplot files, one for classification accuracy on the training examples, and another for accuracy on the test data. Gnuplot can use these files to generate the learning curves (plots) in Postscript. To run gnuplot on a file, just execute "gnuplot filename.gplot > filename.ps". You can view the .ps file with "gv filename.ps" (Ghostview). See a sample trace of running TestNaiveBayes on the yahoo-science document collection (using the command "java ir.classifiers.TestNaiveBayes"), and the learning curves for training and testing accuracy results produced using gnuplot.

Your Task

Your assignment is to create KNN and Rocchio classifiers by extending the Classifier abstract class. The KNN class should perform the simple K-nearest neighbor categorization algorithm, as discussed in class. [Hint: Using the InvertedIndex(List examples) constructor from InvertedIndex in ir.vsr would make it easier to implement KNN.]

The Rocchio class should perform Rocchio categorization. [Hint: to prevent longer documents from having more influence, you still want to normalize document vectors by the maximum weight of a token before adding (or subtracting) them to create a prototype. Since HashMapVector.multiply(double) is destructive, and document vectors are reused in Examples, you may use HashMapVector.addScaled(HashMapVector, double) to avoid the need to create a scaled copy when constructing prototypes.]

Besides the normal version of Rocchio, where the prototype vectors in each category are generated by adding the documents in that category, you should implement a modified version of Rocchio in which the prototype vectors in each category are generated by adding the documents in that category as well as subtracting the documents in all other categories.

You need to create TestKNN and TestRocchio classes to run 10-fold cross validation of the corresponding algorithms. TestRocchio should take a command-line parameter -neg that would invoke the modified version of Rocchio, while TestKNN should take a command-line parameter -K which would then invoke KNN with the value of K following -K. If the -K option is not present, categorization should be performed using the default K value of 5. You will use these two classes like TestNaiveBayes to generate learning curves for the two classifiers that you will create. You can then manually combine the learning curves of Naive Bayes, normal Rocchio, modified Rocchio and KNN for K values of 1, 3 and 5 into two gnuplot files, one for the training data and one for the testing data. Each of the final plots should have 6 learning curves - put these in files called final-results-train.ps and final-results-test.ps.

To recap, the command-line syntax for TestKNN and TestRocchio should be:

    java ir.classifiers.TestKNN [-K K]
    java ir.classifiers.TestRocchio [-neg]

In the writeup, include detailed discussions of:

  1. Comparative accuracy of the algorithms at different points on the learning curve for the training data.
  2. Comparative accuracy of the algorithms at different points on the learning curve for the testing data.
  3. Comparative running times of the algorithms in training and testing phases.

A significant portion of the total credit will be based on the writeup. Try to give a good analysis and interpretation of the results in the writeup.

Submission

In submitting your solution, follow the general course instructions on submitting projects on the course homepage.

Along with that, follow these specific instructions for Project 4: