Project 4 for CS 371R:
Text Categorization using Naive Bayes, KNN and Rocchio

Due date: November 22, 2021 at 11:59 p.m.

NOTE: You will need to get a fresh copy of the code from /u/mooney/ir-code/ir/.

Existing Categorization Framework

As discussed in class, a basic text categorization framework is available in ir.classifiers. See the Javadoc for this code. Currently, ir.classifiers contains one classifier, NaiveBayes, which extends the abstract class Classifier. The NaiveBayes classifier performs text categorization using the Naive Bayes method with Laplace smoothing, and stores all probabilities internally as log values to prevent underflow. Its train method takes a vector of training documents as classified Example objects, and its test method takes a test Example and categorizes it as a bio, phys, or chem document. The ir.classifiers package also includes a BayesResult class, which holds the result of training a Naive Bayes classifier. The code is currently set up to categorize the curlie-science document collection at /u/mooney/ir-code/corpora/curlie-science-old/ into three categories: bio, phys, and chem. We are using an older version of this data since the newly crawled data is giving unusual results for some reason.
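To illustrate the two implementation details mentioned above (Laplace smoothing and log-space probabilities), here is a minimal, self-contained sketch of a Laplace-smoothed Naive Bayes classifier. The class and method names are illustrative only, not the actual ir.classifiers API:

```java
import java.util.*;

/** Minimal sketch of Laplace-smoothed Naive Bayes in log space.
 *  Names are illustrative, not the actual ir.classifiers API. */
public class NaiveBayesSketch {
    // logPrior.get(c) = log P(c); logCond.get(c).get(w) = log P(w|c)
    private final Map<String, Double> logPrior = new HashMap<>();
    private final Map<String, Map<String, Double>> logCond = new HashMap<>();
    private final Set<String> vocab = new HashSet<>();

    /** Train from a map of category name -> list of tokenized documents. */
    public void train(Map<String, List<List<String>>> docsByClass) {
        int totalDocs = docsByClass.values().stream().mapToInt(List::size).sum();
        for (List<List<String>> docs : docsByClass.values())
            for (List<String> doc : docs) vocab.addAll(doc);
        for (Map.Entry<String, List<List<String>>> e : docsByClass.entrySet()) {
            String c = e.getKey();
            logPrior.put(c, Math.log((double) e.getValue().size() / totalDocs));
            Map<String, Integer> counts = new HashMap<>();
            int tokens = 0;
            for (List<String> doc : e.getValue())
                for (String w : doc) { counts.merge(w, 1, Integer::sum); tokens++; }
            Map<String, Double> cond = new HashMap<>();
            for (String w : vocab)   // Laplace (add-one) smoothing
                cond.put(w, Math.log((counts.getOrDefault(w, 0) + 1.0)
                                     / (tokens + vocab.size())));
            logCond.put(c, cond);
        }
    }

    /** argmax_c [log P(c) + sum_w log P(w|c)]; summing logs avoids underflow. */
    public String classify(List<String> doc) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String c : logPrior.keySet()) {
            double score = logPrior.get(c);
            for (String w : doc)
                if (vocab.contains(w)) score += logCond.get(c).get(w);
            if (score > bestScore) { bestScore = score; best = c; }
        }
        return best;
    }
}
```

Note that unknown words (not in the training vocabulary) are simply skipped at classification time, which is one common convention.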

The CVLearningCurve class generates learning curves with k-fold (default k=10) cross-validation for a classifier. The TestNaiveBayes class creates a NaiveBayes classifier object and runs 10-fold cross-validation of the classifier on the curlie-science dataset. The output of CVLearningCurve is two .gplot files: one for classification accuracy on the training examples and another for accuracy on the test data. It also prints run-time information for training and testing. To create a PDF plot file, execute the following command:

gnuplot filename.gplot | ps2pdf - filename.pdf

See a sample trace of running TestNaiveBayes on the curlie-science document collection (using the command "java ir.classifiers.TestNaiveBayes"), and the learning curves for training and testing accuracy produced using gnuplot. (Note: these may differ slightly each time you run the program because of randomization.)

Your Task

Your assignment is to create KNN and Rocchio classifiers by extending the Classifier abstract class. The KNN class should perform the simple K-nearest neighbor categorization algorithm using an inverted index for efficiency, as discussed in class. [Hint: Using the InvertedIndex(List examples) constructor from InvertedIndex in ir.vsr makes it easier to implement KNN.]
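For reference, the core KNN decision rule (majority vote among the K most similar training documents) can be sketched as follows. This sketch uses brute-force cosine similarity over simple term-weight maps for clarity; your actual implementation should instead retrieve neighbors efficiently via ir.vsr.InvertedIndex, and all names here are illustrative:

```java
import java.util.*;

/** Illustrative KNN voting sketch (brute force; the assignment's version
 *  should retrieve neighbors via ir.vsr.InvertedIndex instead). */
public class KnnSketch {
    /** Cosine similarity of two sparse term-weight vectors. */
    public static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, na = 0, nb = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
            na += e.getValue() * e.getValue();
        }
        for (double v : b.values()) nb += v * v;
        return (na == 0 || nb == 0) ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    /** Majority vote over the K training vectors most similar to the query. */
    public static String classify(List<Map<String, Double>> vecs,
                                  List<String> labels,
                                  Map<String, Double> query, int k) {
        Integer[] idx = new Integer[vecs.size()];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        // Sort training indices by decreasing similarity to the query
        Arrays.sort(idx, (i, j) ->
            Double.compare(cosine(query, vecs.get(j)), cosine(query, vecs.get(i))));
        Map<String, Integer> votes = new HashMap<>();
        for (int i = 0; i < Math.min(k, idx.length); i++)
            votes.merge(labels.get(idx[i]), 1, Integer::sum);
        return Collections.max(votes.entrySet(),
                               Map.Entry.comparingByValue()).getKey();
    }
}
```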

The Rocchio class should perform Rocchio categorization. [Hint: to prevent longer documents from having more influence, you still want to normalize document vectors by the maximum weight of a token before adding (or subtracting) them to create a prototype. Since HashMapVector.multiply(double) is destructive, and document vectors are reused in Examples, you may use HashMapVector.addScaled(HashMapVector, double) to avoid the need to create a scaled copy when constructing prototypes.]
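The normalization-then-accumulation idea from the hint above can be sketched as follows. The helper names (maxWeight, addScaled) mirror the hint but are stand-ins over plain maps, not the actual ir.vsr.HashMapVector API:

```java
import java.util.*;

/** Sketch of Rocchio prototype construction with max-weight normalization.
 *  Helper names mirror the hints but are illustrative, not the ir.vsr API. */
public class RocchioSketch {
    /** Largest absolute weight in the vector, used to normalize documents. */
    static double maxWeight(Map<String, Double> v) {
        double m = 0;
        for (double w : v.values()) m = Math.max(m, Math.abs(w));
        return m;
    }

    /** proto += scale * doc, without modifying doc (like addScaled). */
    static void addScaled(Map<String, Double> proto,
                          Map<String, Double> doc, double scale) {
        for (Map.Entry<String, Double> e : doc.entrySet())
            proto.merge(e.getKey(), e.getValue() * scale, Double::sum);
    }

    /** Build one prototype per category from its (positive) documents. */
    static Map<String, Map<String, Double>> prototypes(
            Map<String, List<Map<String, Double>>> docsByClass) {
        Map<String, Map<String, Double>> protos = new HashMap<>();
        for (Map.Entry<String, List<Map<String, Double>>> e : docsByClass.entrySet()) {
            Map<String, Double> proto = new HashMap<>();
            for (Map<String, Double> doc : e.getValue()) {
                double m = maxWeight(doc);
                if (m > 0) addScaled(proto, doc, 1.0 / m); // normalize, then add
            }
            protos.put(e.getKey(), proto);
        }
        return protos;
    }
}
```

Because addScaled folds the normalization factor into the accumulation, the shared document vectors are never mutated, which is exactly why the hint steers you away from the destructive multiply.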

Besides the normal version of Rocchio, where the prototype vectors in each category are generated by adding the documents in that category, you should implement a modified version of Rocchio in which the prototype vectors in each category are generated by adding the documents in that category as well as subtracting the documents in all other categories.
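The modified version differs only in the sign applied to out-of-class documents. A hedged sketch over plain maps (helper names illustrative, not the ir.vsr API):

```java
import java.util.*;

/** Sketch of the modified Rocchio prototype: add in-class documents,
 *  subtract out-of-class documents (both max-weight normalized). */
public class NegRocchioSketch {
    static double maxWeight(Map<String, Double> v) {
        double m = 0;
        for (double w : v.values()) m = Math.max(m, Math.abs(w));
        return m;
    }

    static void addScaled(Map<String, Double> proto,
                          Map<String, Double> doc, double scale) {
        for (Map.Entry<String, Double> e : doc.entrySet())
            proto.merge(e.getKey(), e.getValue() * scale, Double::sum);
    }

    /** Prototype for one category over the whole labeled collection. */
    static Map<String, Double> prototype(
            String category,
            Map<String, List<Map<String, Double>>> docsByClass) {
        Map<String, Double> proto = new HashMap<>();
        for (Map.Entry<String, List<Map<String, Double>>> e : docsByClass.entrySet()) {
            // +1 for the category's own documents, -1 for all others
            double sign = e.getKey().equals(category) ? 1.0 : -1.0;
            for (Map<String, Double> doc : e.getValue()) {
                double m = maxWeight(doc);
                if (m > 0) addScaled(proto, doc, sign / m);
            }
        }
        return proto;
    }
}
```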

You need to create TestKNN and TestRocchio classes to run 10-fold cross-validation of the corresponding algorithms on the curlie-science data. TestRocchio should take a command-line parameter -neg that invokes the modified version of Rocchio that subtracts vectors for negative classes, while TestKNN should take a command-line parameter -K followed by the value of K to use. If the -K option is not present, categorization should be performed using the default K value of 5. Use these two classes, like TestNaiveBayes, to generate learning curves for the two classifiers you create. Manually combine the learning curves of Naive Bayes, normal Rocchio, modified Rocchio, and KNN with K values of 1, 3, and 5 into two joint gnuplot files, one for the training data and one for the testing data. Each of the final plots should therefore have six learning curves. Put these final plots into your PDF report.

To recap, the command-line syntax for TestKNN and TestRocchio should be:

    java ir.classifiers.TestKNN [-K K]
    java ir.classifiers.TestRocchio [-neg]
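The flag handling described above can be sketched as follows; the flag names and default come from this assignment, while the parsing style itself is just one reasonable choice:

```java
/** Sketch of parsing the TestKNN / TestRocchio flags described above. */
public class ArgsSketch {
    /** Returns the value following -K, or the default of 5 if -K is absent. */
    public static int parseK(String[] args) {
        int k = 5;
        for (int i = 0; i < args.length - 1; i++)
            if (args[i].equals("-K")) k = Integer.parseInt(args[i + 1]);
        return k;
    }

    /** True if -neg is present (use the negative-subtracting Rocchio). */
    public static boolean parseNeg(String[] args) {
        for (String a : args)
            if (a.equals("-neg")) return true;
        return false;
    }
}
```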

In the writeup, include detailed discussions of the following aspects of the results, explaining, to the best of your ability, any differences in performance based on the properties of the individual algorithms. In particular, consider whether the relative results on training vs. testing for a particular method are indicative of overfitting (high training accuracy leading to comparatively low test accuracy) or underfitting (low training accuracy leading to comparatively low test accuracy).

  1. Comparative accuracy of the algorithms at different points on the learning curve for the training data.
  2. Comparative accuracy of the algorithms at different points on the learning curve for the testing data.
  3. Comparative running times of the algorithms in training and testing phases. Include a summary table of training and testing times for each algorithm, as reported by CVLearningCurve.

A significant portion of the total credit will be based on the writeup. Try to give a good analysis and interpretation of the results in the writeup. Include a short description of the algorithms you implemented and what you observed about their behaviors.

Submission

When submitting your solution, follow the general course instructions for submitting projects on the course homepage.

Along with that, follow these specific instructions for Project 4:

Please make sure that your code compiles and runs on the UTCS lab machines.

Grading Criteria