ir.classifiers. See the Javadoc for this
code. Right now, ir.classifiers has the NaiveBayes
classifier in it. The NaiveBayes class is created by extending the
abstract class Classifier. The
NaiveBayes classifier performs text categorization using the Naive
Bayes method, and does Laplace smoothing. It also stores all probabilities
internally in log values, to prevent underflow problems. The
NaiveBayes class has a train method that takes in a
vector of training documents as classified Example objects.
The ir.classifiers package also has a BayesResult
class, which holds the result of training a Naive Bayes classifier.
The code is currently set up to categorize the
yahoo-science document collection at
/u/mooney/ir-code/corpora/yahoo-science/ into 3 categories -
bio, phys, chem. It
also has a test method that takes in a test Example
and categorizes it as a bio, phys or chem document.
The CVLearningCurve
class generates the learning curves with k-fold (default k=10) cross validation for a
classifier. The TestNaiveBayes class creates a
NaiveBayes classifier object and runs 10-fold cross-validation of the classifier on
the yahoo-science dataset. The output of CVLearningCurve are two .gplot files, one
for classification accuracy on the training examples, and another for accuracy on the
test data. Gnuplot can use these files to generate the learning curves
(plots) in Postscript. To run gnuplot on a file, just execute "gnuplot filename.gplot
> filename.ps". You can view the .ps file with "gv filename.ps" (Ghostview). See a
sample trace of running TestNaiveBayes on the
yahoo-science document collection (using the command "java
ir.classifiers.TestNaiveBayes"), and the learning curves for training and testing
accuracy results produced using gnuplot.
Your assignment is to create KNN and Rocchio classifiers by extending the
Classifier abstract class. The KNN class should
perform the simple K-nearest neighbor categorization algorithm, as
discussed in class. [Hint: Using the InvertedIndex(List
examples) constructor from InvertedIndex in
ir.vsr would make it easier to implement KNN.]
The Rocchio class should perform Rocchio categorization.
[Hint: to prevent longer documents from having more influence, you still
want to normalize document vectors by the maximum weight of a token before
adding (or subtracting) them to create a prototype. Since
HashMapVector.multiply(double) is destructive, and document
vectors are reused in Examples, you may use HashMapVector.addScaled(HashMapVector, double) to
avoid the need to create a scaled copy when constructing prototypes.]
Besides the normal version of Rocchio, where the prototype vectors in each category are generated by adding the documents in that category, you should implement a modified version of Rocchio in which the prototype vectors in each category are generated by adding the documents in that category as well as subtracting the documents in all other categories.
You need to create TestKNN and TestRocchio classes
to run 10-fold cross validation of the corresponding algorithms.
TestRocchio should take a command-line parameter -neg
that would invoke the modified version of Rocchio, while TestKNN
should take a command-line parameter -K which would then invoke
KNN with the value of K following -K. If the
-K option is not present, categorization should be performed using
the default K value of 5. You will use these two classes like
TestNaiveBayes to generate learning curves for the two classifiers
that you will create. You can then manually combine the learning curves of
Naive Bayes, normal Rocchio, modified Rocchio and KNN for K values of 1,
3 and 5 into two gnuplot files, one for the training data and one for
the testing data. Each of the final plots should have 6 learning curves - put
these in files called final-results-train.ps and
final-results-test.ps.
To recap, the command-line syntax for TestKNN and TestRocchio should be:
java ir.classifiers.TestKNN [-K K]
java ir.classifiers.TestRocchio [-neg]
In the writeup, include detailed discussions of:
A significant portion of the total credit will be based on the writeup. Try to give a good analysis and interpretation of the results in the writeup.
In submitting your solution, follow the general course instructions on submitting projects on the course homepage.
Along with that, follow these specific instructions for Project 4:
/lusr/gnu/bin - add this to your PATH environment
variable.
proj4, which should
contain:
REPORT.*: detailed writeup in text or html or ps or pdf
soln-trace: trace file of program execution
final-results-train.ps: final Postscript gnuplot output for the training data (6 learning curves)
final-results-test.ps: final Postscript gnuplot output for the testing data (6 learning curves)
Java files: Any Java classes you modify or add.