/u/mooney/ir-code/ir/.
ir.classifiers. See the Javadoc for this
code. Right now, ir.classifiers has the NaiveBayes
classifier in it. The NaiveBayes class is created by extending the
abstract class Classifier. The
NaiveBayes classifier performs text categorization using the Naive
Bayes method, and does Laplace smoothing. It also stores all probabilities
internally in log values, to prevent underflow problems. The
NaiveBayes class has a train method that takes in a
vector of training documents as classified Example objects.
The ir.classifiers package also has a BayesResult
class, which holds the result of training a Naive Bayes classifier.
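To make the smoothing and log-space ideas above concrete, here is a minimal sketch of add-one (Laplace) smoothing with log probabilities. It is not the code in ir.classifiers.NaiveBayes: the class LaplaceLogSketch and its methods are hypothetical, and the provided class may use a different smoothing constant.

```java
import java.util.Map;

// Hypothetical helper, not part of ir.classifiers.
class LaplaceLogSketch {

    /** Returns log((count + 1) / (total + vocabularySize)), the add-one estimate. */
    static double logProb(int tokenCount, int totalTokens, int vocabularySize) {
        return Math.log((tokenCount + 1.0) / (totalTokens + vocabularySize));
    }

    /**
     * Scores a document for one category as
     * log P(c) + sum over tokens of count * log P(token | c).
     * Summing logs avoids the underflow that multiplying many
     * small probabilities would cause.
     */
    static double logScore(double categoryPrior,
                           Map<String, Integer> docCounts,
                           Map<String, Integer> categoryCounts,
                           int totalTokens, int vocabularySize) {
        double score = Math.log(categoryPrior);
        for (Map.Entry<String, Integer> e : docCounts.entrySet()) {
            // Tokens never seen in this category get a count of 0,
            // which smoothing turns into a small nonzero probability.
            int seen = categoryCounts.getOrDefault(e.getKey(), 0);
            score += e.getValue() * logProb(seen, totalTokens, vocabularySize);
        }
        return score;
    }
}
```

The category with the largest log score is the predicted one; the scores never need to be converted back to probabilities for that comparison.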
The code is currently set up to categorize the
curlie-science document collection at
/u/mooney/ir-code/corpora/curlie-science-old/ into 3 categories -
bio, phys, chem. We are using an old version of this data since the newly crawled data is giving unusual results for some reason.
The NaiveBayes class
also has a test method that takes in a test Example
and categorizes it as a bio, phys, or chem document.
The CVLearningCurve
class generates the learning curves with k-fold (default k=10) cross validation for a
classifier. The TestNaiveBayes class creates a
NaiveBayes classifier object and runs 10-fold cross-validation of
the classifier on the curlie-science dataset. The output
of CVLearningCurve is two .gplot files, one for
classification accuracy on the training examples, and another for accuracy on
the test data. It also prints out information on run time for training and
testing. To create a PDF plot file, execute the following command:
gnuplot filename.gplot | ps2pdf - filename.pdf
See a sample trace of running TestNaiveBayes on the
curlie-science document collection (using the command "java
ir.classifiers.TestNaiveBayes"), and the learning curves for
testing and training
accuracy results produced using gnuplot. (Note: These may differ slightly each time
you run the program because of randomization.)
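For readers unfamiliar with k-fold cross-validation, the partitioning step can be sketched as follows. This is an illustration only and not the CVLearningCurve implementation; the class KFoldSketch, its method names, and the seed parameter are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical helper, not part of ir.classifiers.
class KFoldSketch {

    /** Shuffles a copy of the items and deals them round-robin into k folds. */
    static <T> List<List<T>> folds(List<T> items, int k, long seed) {
        List<T> shuffled = new ArrayList<>(items);
        Collections.shuffle(shuffled, new Random(seed));
        List<List<T>> result = new ArrayList<>();
        for (int i = 0; i < k; i++) {
            result.add(new ArrayList<>());
        }
        for (int i = 0; i < shuffled.size(); i++) {
            result.get(i % k).add(shuffled.get(i));
        }
        return result;
    }

    /** Training set for fold f: every example that is not in fold f. */
    static <T> List<T> trainingSet(List<List<T>> folds, int f) {
        List<T> train = new ArrayList<>();
        for (int i = 0; i < folds.size(); i++) {
            if (i != f) {
                train.addAll(folds.get(i));
            }
        }
        return train;
    }
}
```

Each fold serves once as the test set while the remaining folds form the training set. The shuffle is the source of the run-to-run variation mentioned in the note above.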
Your assignment is to create KNN and Rocchio classifiers by extending the
Classifier abstract class. The KNN class should
perform the simple K-nearest neighbor categorization algorithm using an
inverted index for efficiency, as discussed in class. [Hint: Using
the InvertedIndex(List examples) constructor from InvertedIndex in
ir.vsr makes it easier to implement KNN.]
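The idea of retrieving neighbors through an inverted index can be sketched as follows. This is a toy illustration under stated assumptions, not the required solution: it uses plain Map<String, Double> vectors and a hand-rolled postings map instead of the course's InvertedIndex and HashMapVector classes, and the class KnnSketch and all of its methods are hypothetical. Token weights are taken as given (in the project they would come from the ir.vsr weighting).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a KNN classifier built on an inverted index.
class KnnSketch {
    // token -> list of {docId, weight} postings
    private final Map<String, List<double[]>> postings = new HashMap<>();
    private final List<Integer> labels = new ArrayList<>();
    private final List<Double> lengths = new ArrayList<>();

    /** Indexes one training document with its category label. */
    void add(Map<String, Double> doc, int label) {
        int id = labels.size();
        labels.add(label);
        lengths.add(norm(doc));
        for (Map.Entry<String, Double> e : doc.entrySet()) {
            postings.computeIfAbsent(e.getKey(), t -> new ArrayList<>())
                    .add(new double[] {id, e.getValue()});
        }
    }

    /** Returns the majority category among the k most cosine-similar documents. */
    int classify(Map<String, Double> query, int k, int numCategories) {
        // Only documents sharing at least one token with the query are touched.
        Map<Integer, Double> dots = new HashMap<>();
        for (Map.Entry<String, Double> e : query.entrySet()) {
            List<double[]> hits = postings.get(e.getKey());
            if (hits == null) continue;
            for (double[] p : hits) {
                dots.merge((int) p[0], e.getValue() * p[1], Double::sum);
            }
        }
        double qLen = norm(query);
        List<double[]> scored = new ArrayList<>(); // {cosine, docId}
        for (Map.Entry<Integer, Double> e : dots.entrySet()) {
            double cos = e.getValue() / (qLen * lengths.get(e.getKey()));
            scored.add(new double[] {cos, e.getKey()});
        }
        scored.sort((a, b) -> Double.compare(b[0], a[0]));
        int[] votes = new int[numCategories];
        for (int i = 0; i < Math.min(k, scored.size()); i++) {
            votes[labels.get((int) scored.get(i)[1])]++;
        }
        int best = 0; // vote ties resolve to the lowest category index
        for (int c = 1; c < numCategories; c++) {
            if (votes[c] > votes[best]) best = c;
        }
        return best;
    }

    private static double norm(Map<String, Double> v) {
        double sum = 0.0;
        for (double w : v.values()) sum += w * w;
        return Math.sqrt(sum);
    }
}
```

The efficiency gain comes from the accumulation loop: similarity is computed only for training documents that appear in the postings of some query token, rather than for every training document.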
The Rocchio class should perform Rocchio categorization.
[Hint: to prevent longer documents from having more influence, you still
want to normalize document vectors by the maximum weight of a token before
adding (or subtracting) them to create a prototype. Since
HashMapVector.multiply(double) is destructive, and document
vectors are reused in Examples, you may use HashMapVector.addScaled(HashMapVector, double) to
avoid the need to create a scaled copy when constructing prototypes.]
Besides the normal version of Rocchio, where the prototype vectors in each category are generated by adding the documents in that category, you should implement a modified version of Rocchio in which the prototype vectors in each category are generated by adding the documents in that category as well as subtracting the documents in all other categories.
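The two prototype-building variants can be sketched as follows. This is an illustration under stated assumptions rather than the required implementation: vectors are plain Map<String, Double> objects instead of HashMapVector, the addScaled helper here is a hypothetical stand-in that only mirrors the idea of HashMapVector.addScaled, and the class RocchioSketch is invented for this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of Rocchio prototype construction.
class RocchioSketch {

    /** Adds scale * v into target without modifying v. */
    static void addScaled(Map<String, Double> target, Map<String, Double> v, double scale) {
        for (Map.Entry<String, Double> e : v.entrySet()) {
            target.merge(e.getKey(), scale * e.getValue(), Double::sum);
        }
    }

    /** Largest token weight in v, or 0.0 for an empty vector. */
    static double maxWeight(Map<String, Double> v) {
        double max = 0.0;
        for (double w : v.values()) max = Math.max(max, w);
        return max;
    }

    /**
     * Builds one prototype per category. Each document is scaled by
     * 1 / (its maximum token weight) and added to its own category;
     * if neg is true it is also subtracted from every other category.
     */
    static List<Map<String, Double>> prototypes(List<Map<String, Double>> docs,
                                                int[] labels,
                                                int numCategories,
                                                boolean neg) {
        List<Map<String, Double>> protos = new ArrayList<>();
        for (int c = 0; c < numCategories; c++) {
            protos.add(new HashMap<>());
        }
        for (int d = 0; d < docs.size(); d++) {
            double max = maxWeight(docs.get(d));
            if (max == 0.0) continue; // skip empty documents
            for (int c = 0; c < numCategories; c++) {
                if (c == labels[d]) {
                    addScaled(protos.get(c), docs.get(d), 1.0 / max);
                } else if (neg) {
                    addScaled(protos.get(c), docs.get(d), -1.0 / max);
                }
            }
        }
        return protos;
    }
}
```

Because the scaling happens inside addScaled, the document vectors themselves are never changed, which matters when the same vectors are reused across cross-validation folds.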
You need to create TestKNN and TestRocchio classes
to run 10-fold cross validation of the corresponding algorithms on the curlie-science data.
TestRocchio should take a command-line parameter -neg
that invokes the modified version of Rocchio that subtracts vectors for
negative classes, while TestKNN should take a command-line
parameter -K which invokes KNN with the value
of K that directly follows the -K flag. If the
-K option is not present, categorization should be performed using
the default K value of 5. Use these two classes in the same way as
TestNaiveBayes to generate learning curves for the two classifiers
that you will create. Manually combine the learning curves of
Naive Bayes, normal Rocchio, modified Rocchio and KNN for K values of 1,
3 and 5 into two joint gnuplot files, one for the training data and one for
the testing data. Each of the final plots should have 6 learning curves. Put
these final plots into your PDF report.
To recap, the command-line syntax for TestKNN and TestRocchio should be:
java ir.classifiers.TestKNN [-K K]
java ir.classifiers.TestRocchio [-neg]
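One possible way to read these flags is sketched below. The class CliSketch and its method names are hypothetical; in the project this logic would live in the main methods of TestKNN and TestRocchio, and the sketch does no error handling for a malformed K value.

```java
// Hypothetical helper showing one way to parse the -K and -neg flags.
class CliSketch {

    /** Returns the integer following -K, or the default of 5 if -K is absent. */
    static int parseK(String[] args) {
        for (int i = 0; i < args.length - 1; i++) {
            if (args[i].equals("-K")) {
                return Integer.parseInt(args[i + 1]);
            }
        }
        return 5;
    }

    /** Returns true if the -neg flag appears anywhere in the arguments. */
    static boolean parseNeg(String[] args) {
        for (String a : args) {
            if (a.equals("-neg")) {
                return true;
            }
        }
        return false;
    }
}
```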
In the writeup, include detailed discussions of the following aspects of the results, explaining, to the best of your ability, any differences in performance based on the properties of the individual algorithms. In particular, consider whether the relative results on training vs. testing for a particular method are indicative of overfitting (high training accuracy leading to comparatively low test accuracy) or underfitting (low training accuracy leading to comparatively low test accuracy).
CVLearningCurve.
A significant portion of the total credit will be based on the writeup. Try to give a good analysis and interpretation of the results in the writeup. Include a short description of the algorithms you implemented and what you observed about their behaviors.
In submitting your solution, follow the general course instructions on submitting projects on the course homepage.
Along with that, follow these specific instructions for Project 4:
Rocchio, which extends Classifier
KNN, which extends Classifier
TestRocchio
TestKNN
[PREFIX]_code.zip - Your code in a zip file (*.java and *.class files). Please do not modify the original Java files; instead, extend each class and override the appropriate methods.
[PREFIX]_report.pdf - A PDF report of your experiment as described above, with the 2 plots (6 learning curves each) referenced in the instructions.
[PREFIX]_trace.txt - Trace file of program execution (for all 6 runs) on curlie-science.
The files listed under "Turned In" on Canvas should be:
proj4_jd1234_code.zip
proj4_jd1234_report.pdf
proj4_jd1234_trace.txt
and the zip file should have at least the following contents:
$ unzip -l proj4_jd1234_code.zip
Archive:  proj4_jd1234_code.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
    21067  2015-09-14 12:57   ir/classifiers/Rocchio.java
    10049  2015-09-14 17:26   ir/classifiers/Rocchio.class
    21067  2015-09-14 12:57   ir/classifiers/KNN.java
    10049  2015-09-14 17:26   ir/classifiers/KNN.class
    21067  2015-09-14 12:57   ir/classifiers/TestRocchio.java
    10049  2015-09-14 17:26   ir/classifiers/TestRocchio.class
    21067  2015-09-14 12:57   ir/classifiers/TestKNN.java
    10049  2015-09-14 17:26   ir/classifiers/TestKNN.class
---------                     -------
    91106                     8 files
Please make sure that your code compiles and runs on the UTCS lab machines.