In this homework you will explore the performance of Hidden Markov Models (HMMs) and Conditional Random Fields (CRFs) on the part-of-speech (POS) tagging task, using real-world data from the Penn Treebank. You will replicate some of the experiments from the original paper on CRFs (Lafferty et al., 2001), using less data to reduce computational demands. Follow the steps outlined below, then turn in any code you write electronically, along with a hard-copy write-up describing your methods, your experimental results, and your interpretation and explanation of those results. Specific information regarding how to submit the electronic part is here.
/u/mooney/cs388-code/mallet-0.4 and compile it using the built-in build.xml file
(run "ant" in the mallet-0.4 directory). The original Mallet doesn't have code for applying HMMs to sequence
labeling tasks, so be sure to use this version.
/projects/nlp/penn-treebank3/tagged/pos/atis/atis3.pos. It looks
something like this:

[ @0y0012sx-a-11/CD ]
======================================

List/VB
[ the/DT flights/NNS ]
from/IN
[ Baltimore/NNP ]
to/TO Seattle/NNP
[ that/WDT stop/VBP ]
in/IN
[ Minneapolis/NNP ]
======================================

[ @0y0022sx-d-5/CD ]
======================================

Does/VBZ
[ this/DT flight/NN ]
serve/VB
[ dinner/NN ]
======================================

List VB
the DT
flights NNS
from IN
Baltimore NNP
to TO
Seattle NNP
that WDT
stop VBP
in IN
Minneapolis NNP

Does VBZ
this DT
flight NN
serve VB
dinner NN

[ @0y0022sx-d-5/CD ] should be removed during conversion, since these identifiers are not useful English text.
The POS-tagged WSJ data is located in /projects/nlp/penn-treebank3/tagged/pos/wsj/
and is separated into multiple files called "sections".
Write preprocessing code to convert a Penn Treebank POS file into a file appropriate for Mallet.
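The conversion described above can be sketched as follows. This is a minimal illustration in Python (the assignment does not prescribe a language), not a reference solution. It assumes the bracketed ATIS layout shown earlier, treats lines made only of "=" characters as sentence boundaries, and drops identifier tokens that start with "@". The function name ptb_pos_to_mallet is hypothetical, and the WSJ files may need additional handling (for example, different sentence boundaries) that is not covered here.

```python
import re

# A separator line consists only of '=' characters (assumed to be at least three).
SEPARATOR = re.compile(r"={3,}")


def ptb_pos_to_mallet(lines):
    """Convert Penn Treebank .pos lines to 'word TAG' lines.

    Sentences are separated by one blank line in the output.
    """
    sentences = []
    current = []
    for raw in lines:
        line = raw.strip()
        if SEPARATOR.fullmatch(line):
            # Assumption: a separator line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        for token in line.split():
            if token in ("[", "]"):
                continue  # chunk brackets carry no tag information
            # Split on the LAST slash so words containing '/' stay intact.
            word, slash, tag = token.rpartition("/")
            if not slash or not word:
                continue  # not a word/TAG pair
            if word.startswith("@"):
                continue  # utterance identifier, not English text
            current.append((word, tag))
    if current:
        sentences.append(current)
    body = "\n\n".join(
        "\n".join(f"{word} {tag}" for word, tag in sentence)
        for sentence in sentences
    )
    return body + "\n"


if __name__ == "__main__":
    import sys

    # Usage: python convert.py < atis3.pos > atis3.mallet
    sys.stdout.write(ptb_pos_to_mallet(sys.stdin))
```

Splitting on the last slash is a deliberate choice: the tag always follows the final "/", so a word that itself contains a slash is not cut in the wrong place.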
SimpleTagger here. In the class version of the code, there are separate
HMMSimpleTagger and CRFSimpleTagger classes. You may wish to examine the source code of these classes for additional command-line arguments.
For ATIS you should train on 80% of the data and test on the remaining 20% using the commands:
$ java -cp "mallet-0.4/class:mallet-0.4/lib/mallet-deps.jar" \
    edu.umass.cs.mallet.base.fst.HMMSimpleTagger \
    --train true --model-file model_file \
    --training-proportion 0.8 \
    --test lab train_file

$ java -cp "mallet-0.4/class:mallet-0.4/lib/mallet-deps.jar" \
    edu.umass.cs.mallet.base.fst.CRFSimpleTagger \
    --train true --model-file model_file \
    --training-proportion 0.8 \
    --test lab train_file
HMMSimpleTagger and CRFSimpleTagger respectively.
The directory containing your copy of Mallet must be in your
CLASSPATH environment variable. train_file refers to the formatted training set
generated above, and model_file is the path where the trained model will be
stored. The argument "--test lab" tells the tagger to measure token labeling
accuracy during testing.
Change the training and testing code in Mallet to also measure accuracy specifically for out-of-vocabulary (OOV) items, as in Lafferty et al., 2001, Section 5.3. This requires storing all of the words encountered during training and efficiently checking whether each test word appeared in training, so that errors on OOV items can be recorded separately.
Both the HMM and CRF taggers use Mallet's TokenAccuracyEvaluator class to measure accuracy, logging results through the Java logger. The test method of this class is called during both training and testing. One strategy is to record every word seen in training in this class during the first training iteration and to consult that record during testing. Be aware that when training CRFs, the input sequence (the local variable input) contains objects of type FeatureVector, while the HMM's input contains Strings. Although Mallet does not make use of Java 1.5 features, you are welcome to use them in your code, provided you update the build.xml file.
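The bookkeeping behind OOV accuracy is independent of Mallet and can be sketched on its own. The Python class below is a hypothetical illustration (the name OovAccuracy and its methods are not part of Mallet); in the assignment, the same logic would live inside Mallet's Java evaluator. It assumes the word for each token is available as a plain string, which holds for the HMM input but would require recovering the word from the FeatureVector in the CRF case.

```python
class OovAccuracy:
    """Tracks token accuracy overall and on out-of-vocabulary (OOV) words."""

    def __init__(self):
        self.vocabulary = set()  # every word seen during training
        self.total = 0
        self.correct = 0
        self.oov_total = 0
        self.oov_correct = 0

    def observe_training(self, words):
        """Record the words of one training sequence."""
        self.vocabulary.update(words)

    def score(self, words, gold_tags, predicted_tags):
        """Update the counts with one tagged test sequence."""
        for word, gold, predicted in zip(words, gold_tags, predicted_tags):
            hit = gold == predicted
            self.total += 1
            self.correct += hit
            # Set membership is an average O(1) check per test word.
            if word not in self.vocabulary:
                self.oov_total += 1
                self.oov_correct += hit

    def accuracy(self):
        return self.correct / self.total if self.total else float("nan")

    def oov_accuracy(self):
        if not self.oov_total:
            return float("nan")
        return self.oov_correct / self.oov_total
```

A hash set keeps the per-token lookup cheap, which matters on WSJ-sized data. The vocabulary must be filled from the training portion only; adding test words to it would make every test token look in-vocabulary.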
ATIS is a small corpus, so experiments on it should run quickly. Average your
results over 10 random training/test splits of the data, changing the random
seed in each trial with the --random-seed parameter to produce a different
train/test split. The WSJ data is much larger, and training on the standard
20 sections requires about a week. Therefore, just train on section 00 and
test on section 01, which should take the CRF tagger about 4 hours.
Compare the test accuracy, training accuracy, and run time of the CRF and HMM
taggers on both ATIS and WSJ data. To measure run time, you can use the Unix
command time or condor's timing statistics from the job-completion e-mail.
You might also try training the HMM tagger on a larger number of
sections.
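One way to organize the 10 ATIS trials is to generate one command per seed and then average the accuracies you collect. The Python sketch below is only an illustration: the helper names tagger_command and summarize are hypothetical, the classpath is the relative one used in the commands above, and the flags are those named in this assignment. How to read the accuracies out of Mallet's log output is deliberately left out, since it depends on your modified evaluator.

```python
from statistics import mean, stdev

# Relative classpath as used in the commands above; adjust to your setup.
CLASSPATH = "mallet-0.4/class:mallet-0.4/lib/mallet-deps.jar"


def tagger_command(tagger_class, train_file, model_file, seed):
    """Build the argument list for one train/test run with a given seed."""
    return [
        "java", "-cp", CLASSPATH,
        f"edu.umass.cs.mallet.base.fst.{tagger_class}",
        "--train", "true",
        "--model-file", model_file,
        "--training-proportion", "0.8",
        "--random-seed", str(seed),
        "--test", "lab",
        train_file,
    ]


def summarize(accuracies):
    """Return (mean, sample standard deviation) over at least two trials."""
    return mean(accuracies), stdev(accuracies)


if __name__ == "__main__":
    # Print the ten HMM commands; each could be run by hand or via condor.
    for seed in range(10):
        cmd = tagger_command("HMMSimpleTagger", "train_file", "model_file", seed)
        print(" ".join(cmd))
```

Reporting the standard deviation alongside the mean makes it easier to judge whether a difference between the HMM and CRF taggers on ATIS is larger than the variation between splits.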
CRFSimpleTagger allows extra features to be included on each line, separated by spaces.
Therefore, your new input data might look like this:

leaving ing VBG
from IN
Baltimore caps NNP
making ing VBG
a DT
stop NN
in IN
Minneapolis caps s NNP

caps) as well as common English suffixes (e.g. -ing
and -s). Also include a feature for words containing a hyphen
("-") or words that start with a number. To find a good set of suffixes to
detect, just search the web to find a site with a good table of common English
suffixes. Rerun the earlier experiments on ATIS and WSJ with the CRF tagger
using these additional features.
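A feature extractor along these lines can be sketched as below. This is an illustration under stated assumptions, not a prescribed design: the feature names caps, ing, and s follow the example above, while hyphen, num, and the extra suffixes in SUFFIXES are hypothetical choices you should replace with your own list.

```python
# Illustrative suffix list only; the assignment asks you to choose your own.
SUFFIXES = ("ing", "ed", "ly", "s")


def orthographic_features(word):
    """Return the extra feature names for one word, in a fixed order."""
    features = []
    if word[:1].isupper():
        features.append("caps")  # word starts with a capital letter
    lowered = word.lower()
    for suffix in SUFFIXES:
        # Require the word to be longer than the suffix itself.
        if len(word) > len(suffix) and lowered.endswith(suffix):
            features.append(suffix)
    if "-" in word:
        features.append("hyphen")  # hypothetical feature name
    if word[:1].isdigit():
        features.append("num")  # hypothetical feature name
    return features


def format_line(word, tag):
    """Format one token as 'word feature... TAG' for CRFSimpleTagger."""
    return " ".join([word, *orthographic_features(word), tag])


if __name__ == "__main__":
    for word, tag in [("leaving", "VBG"), ("Minneapolis", "NNP"), ("stop", "NN")]:
        print(format_line(word, tag))
```

The tag stays last on each line, with the features between the word and the tag, matching the layout in the example above.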
Your condor submit file(s) may follow this template:

universe = vanilla
environment = CLASSPATH=path to Mallet/mallet-0.4/class:path to Mallet/mallet-0.4/lib/mallet-deps.jar
Initialdir = path to experiment
Executable = /lusr/bin/java
+Group = "GRAD"
+Project = "INSTRUCTIONAL"
+ProjectDescription = "CS388 Homework 1"
Log = path to experiment/experiment.log
Notification = complete
Notify_user = your email
Arguments = edu.umass.cs.mallet.base.fst.CRFSimpleTagger arguments to Mallet
Output = experiment.out
Error = experiment.err
Queue 1

Additional experiments can go here

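If you run many trials, the per-experiment part of the submit file can be generated rather than typed by hand. The Python sketch below is a hypothetical helper (queue_stanzas is not part of condor or Mallet) that emits one Arguments/Output/Error/Queue group per random seed, using only the keys that appear in the template above. The output file naming scheme is an assumption.

```python
def queue_stanzas(tagger_class, mallet_args, seeds):
    """Return submit-file text with one queued job per random seed."""
    stanzas = []
    for seed in seeds:
        stanzas.append("\n".join([
            # The seed flag goes before the remaining arguments so that the
            # training file stays last on the command line.
            f"Arguments = edu.umass.cs.mallet.base.fst.{tagger_class} "
            f"--random-seed {seed} {mallet_args}",
            f"Output = experiment.seed{seed}.out",
            f"Error = experiment.seed{seed}.err",
            "Queue 1",
        ]))
    return "\n\n".join(stanzas) + "\n"


if __name__ == "__main__":
    args = ("--train true --model-file model_file "
            "--training-proportion 0.8 --test lab train_file")
    print(queue_stanzas("CRFSimpleTagger", args, range(10)))
```

The generated text would replace the single Arguments/Output/Error/Queue group in the template, after the shared settings. Giving each job its own output and error file keeps the results of different seeds from overwriting one another.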
J. Lafferty, A. McCallum, and F. Pereira. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. Proceedings of the International Conference on Machine Learning (ICML-2001), 2001.
C. Sutton and A. McCallum. An Introduction to Conditional Random Fields for Relational Learning. In Introduction to Statistical Relational Learning, edited by Lise Getoor and Ben Taskar. MIT Press, 2006.