When learning PCFGs from data, we often want good performance from as few annotated examples as possible. Constructing large training corpora like the Penn Treebank requires significant expert human effort, which can be quite costly. In this homework we'll roughly replicate an experiment in (Hwa, 2000) on active learning for PCFG parsing, which attempts to reduce annotation cost.
Active learning puts more of the burden of data collection on the learning system. In particular, sample selection requires that the learner itself select the examples for annotation from a large sample of initially unannotated data. The goal is to pick training examples wisely in order to minimize the amount of data that needs to be annotated to achieve a desired level of performance. For statistical parsing, a training instance is a sentence and the annotation is a parse tree supplied by a linguistic expert. First, the system is trained on a small, randomly selected sample of annotated instances to get started. Next, using the current learned model, the system selects a small batch of the most useful examples and asks the expert to annotate them. It then retrains on all the annotated data. The system repeats this select-annotate-retrain cycle until the desired level of performance is reached or some resource limit is exhausted.
In experiments on sample selection, a corpus of completely annotated data is used to simulate active learning. First, a disjoint portion of data is set aside for testing. The remaining data is left for training, but is initially assumed to be unannotated. When the system requests annotation for a particular instance, the annotation for that instance is retrieved from the dataset. To measure performance, the accuracy of the current learned model is tested on the test set after every batch of labeled data is selected and the model is retrained. By comparing to the learning curve of a system that selects training examples randomly, the advantage of active learning can be ascertained.
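The simulation described above can be sketched roughly as follows. This is only a minimal illustration, not the required design: `train`, `evaluate`, and `select` are hypothetical callables standing in for whatever wraps your parser, and the pool is assumed to be a list of (words, gold tree) pairs whose trees stay hidden until the learner asks for them. The default selector is the random baseline.

```python
import random

def select_random(model, unlabeled, word_budget, rng):
    """Baseline selector: indices of random sentences totalling about word_budget words."""
    order = list(range(len(unlabeled)))
    rng.shuffle(order)
    picked, total = [], 0
    for i in order:
        if total >= word_budget:
            break
        picked.append(i)
        total += len(unlabeled[i][0])
    return picked

def simulate(pool, seed_size, word_budget, rounds, train, evaluate,
             select=select_random, seed=0):
    """Simulate sample selection on a fully annotated pool.

    pool:     list of (words, gold_tree) pairs
    train:    hypothetical callable, labeled examples -> model
    evaluate: hypothetical callable, model -> accuracy on the held-out test set
    select:   hypothetical callable returning indices into the unlabeled pool
    Returns a learning curve as (training words, accuracy) points.
    """
    rng = random.Random(seed)
    unlabeled = list(pool)
    rng.shuffle(unlabeled)
    # Small random seed set to get the first model started.
    labeled, unlabeled = unlabeled[:seed_size], unlabeled[seed_size:]
    model = train(labeled)
    curve = [(sum(len(w) for w, _ in labeled), evaluate(model))]
    for _ in range(rounds):
        if not unlabeled:
            break
        picked = set(select(model, unlabeled, word_budget, rng))
        # "Annotation" here just means revealing the stored gold tree.
        labeled.extend(unlabeled[i] for i in sorted(picked))
        unlabeled = [ex for i, ex in enumerate(unlabeled) if i not in picked]
        model = train(labeled)
        curve.append((sum(len(w) for w, _ in labeled), evaluate(model)))
    return curve
```

Running the same function once with the random selector and once with an uncertainty-based selector gives the two learning curves to compare.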
However, the number of sentences is not a fair measure of annotation effort, since longer sentences are clearly more difficult for a human to annotate. Active learners have an inherent bias toward selecting longer, more complex sentences, so just counting the number of training sentences would give them an unfair advantage. Therefore, the number of words or the number of phrases (i.e. the number of internal nodes in the gold-standard parse tree, called the number of "brackets" in (Hwa, 2000)) is a fairer measure of training set size. Hwa (2000) plots the number of brackets in the training set on the Y axis and parsing accuracy on the X axis. Learning curves that instead plot training set size (e.g. number of words or brackets) on the X axis and test accuracy on the Y axis are more conventional, so you should present your results in that form.
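As a rough sketch of how both size measures could be computed from a Penn Treebank-style bracketed string: the function name `tree_size` is hypothetical, and the code assumes that a "bracket" is a labeled node dominating at least one other node, so part-of-speech preterminals are not counted. It also counts every leaf as a word, including empty elements such as traces, which you may prefer to filter out.

```python
import re

TOKEN = re.compile(r"\(|\)|[^\s()]+")

def tree_size(bracketed):
    """Return (num_words, num_brackets) for one bracketed tree string."""
    words = 0
    brackets = 0
    stack = []  # one [label, has_subtree_child] entry per open node
    expect_label = False
    for tok in TOKEN.findall(bracketed):
        if tok == "(":
            if stack:
                stack[-1][1] = True  # the parent now has a subtree child
            stack.append([None, False])
            expect_label = True
        elif tok == ")":
            label, has_child = stack.pop()
            # Skip preterminals and the unlabeled wrapper some treebank files use.
            if label is not None and has_child:
                brackets += 1
            expect_label = False
        elif expect_label:
            stack[-1][0] = tok  # first token after "(" is the node label
            expect_label = False
        else:
            words += 1  # any other token is a leaf
    return words, brackets

print(tree_size("(S (NP (DT the) (NN dog)) (VP (VBZ barks)))"))  # (3, 3)
```

Summing either count over the current training set gives the X coordinate of one point on the learning curve.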
The simplest approach to selecting training examples is uncertainty sampling. The learner first tries to annotate all of the remaining unannotated training examples itself, using its probabilistic model to assign a certainty to its labeling of each example. It then selects for annotation those examples about which it is most uncertain. By obtaining feedback on these cases, it hopes to learn more than it would from labels on random sentences, about which its existing model may already be quite confident and from which it would therefore not learn much.
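The selection step can be sketched as below. This is a minimal illustration under assumptions: `select_most_uncertain` is a hypothetical name, the uncertainty scores are taken as given (higher means less certain), and the batch is capped by a word budget rather than a sentence count.

```python
def select_most_uncertain(sentences, uncertainty, word_budget):
    """Return indices of the most uncertain sentences, up to about word_budget words.

    sentences:   list of token lists from the unannotated pool
    uncertainty: one float per sentence, higher = less certain
    """
    # Rank the pool from most to least uncertain.
    order = sorted(range(len(sentences)), key=lambda i: uncertainty[i], reverse=True)
    picked, total = [], 0
    for i in order:
        if total >= word_budget:
            break
        picked.append(i)
        total += len(sentences[i])
    return picked

sentences = [["a"] * 3, ["b"] * 4, ["c"] * 2, ["d"] * 5]
print(select_most_uncertain(sentences, [0.1, 0.9, 0.5, 0.7], 8))  # [1, 3]
```

Because the loop stops only once the budget has been reached, the last sentence added may push the batch slightly over the budget.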
For statistical parsing, there are several ways of measuring the uncertainty in the automatic annotation of a sentence.
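To make the idea of an uncertainty score concrete, here is one illustrative possibility, which is not necessarily one of the measures this assignment intends. It assumes you can obtain log-probabilities for the k best parses of a sentence (how to get these from the parser is outside this sketch), renormalizes them into a distribution, and reports its entropy. The name `kbest_entropy` is hypothetical.

```python
import math

def kbest_entropy(log_probs):
    """Entropy in bits of the distribution over a sentence's k-best parses.

    log_probs: natural-log scores of the k best parses (assumed input).
    Higher entropy means probability is spread over many parses, so the
    model is less certain about the sentence.
    """
    m = max(log_probs)
    # Subtract the max before exponentiating to avoid underflow.
    weights = [math.exp(lp - m) for lp in log_probs]
    z = sum(weights)
    probs = [w / z for w in weights]
    return -sum(p * math.log2(p) for p in probs if p > 0)
```

Two equally likely parses give an entropy of 1 bit, while a single dominant parse gives an entropy near 0. Since longer sentences tend to have more competing parses, you may also want to consider normalizing the score by sentence length.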
Using the Stanford parser as the underlying statistically trained parser, compare uncertainty sampling for active selection using each of these different ways of measuring uncertainty to random sample selection. Follow the steps outlined below and then turn in any code you write electronically and a hard-copy write-up describing your methods, your experimental results, and your interpretation and explanation of these results. Specific information regarding how to submit the electronic part is here.
makeSerialized.csh to get an idea for how to interact with the
parser. The Stanford parser combines a powerful unlexicalized parser
with a lexicalized dependency parser. If you're interested, you can read more
about the unlexicalized parser here
and the full factored model here.
/projects/nlp/penn-treebank3/parsed/mrg/wsj/ and collect the labeled output:
java -cp "stanford-parser.jar:" -server -mx1500m edu.stanford.nlp.parser.lexparser.LexicalizedParser \
-evals "tsv" -goodPCFG \
-train /projects/nlp/penn-treebank3/parsed/mrg/wsj/ 200-270 \
-testTreebank /projects/nlp/penn-treebank3/parsed/mrg/wsj/ 2000-2100 \
> labeled_output.txt
labeled_output.txt will contain the parses of the WSJ sentences occurring in files 2000 to 2100 (roughly section 20) and will be in the Penn Treebank format. Training on 200-270 yields roughly 1000 sentences of input. The final output indicating performance will look something like this:
Testing on treebank done [108.0 sec].
pcfg LP/LR summary evalb: LP: 69.29 LR: 71.58 F1: 70.42 Exact: 11.59 N: 2069
dep DA summary evalb: LP: 0.0 LR: 0.0 F1: 0.0 Exact: 0.0 N: 0
factor LP/LR summary evalb: LP: 69.29 LR: 71.58 F1: 70.42 Exact: 11.59 N: 2069
factor Tag summary evalb: LP: 89.7 LR: 89.7 F1: 89.7 Exact: 14.35 N: 2069
factF1 factDA factEx pcfgF1 depDA factTA num
70.43 11.60 70.43 89.71 2069
pcfg LP/LR as it gives the scores for the unlexicalized pcfg parse.
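If you want to collect these scores automatically after each iteration, a small regular expression over the captured summary is enough. The sketch below assumes the summary line has exactly the shape shown in the sample output above; `pcfg_scores` is a hypothetical helper name.

```python
import re

PCFG_LINE = re.compile(
    r"pcfg LP/LR summary evalb:\s+LP:\s+([\d.]+)\s+LR:\s+([\d.]+)"
    r"\s+F1:\s+([\d.]+)\s+Exact:\s+([\d.]+)\s+N:\s+(\d+)"
)

def pcfg_scores(output_text):
    """Extract the pcfg LP/LR summary scores, or None if the line is absent."""
    m = PCFG_LINE.search(output_text)
    if m is None:
        return None
    lp, lr, f1, exact = (float(g) for g in m.groups()[:4])
    return {"LP": lp, "LR": lr, "F1": f1, "Exact": exact, "N": int(m.group(5))}

sample = "pcfg LP/LR summary evalb: LP: 69.29 LR: 71.58 F1: 70.42 Exact: 11.59 N: 2069"
print(pcfg_scores(sample)["F1"])  # 70.42
```

Pairing each extracted F1 with the size of the training set used in that iteration gives one point on the learning curve.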
If you would like to try the factored parser, replace the argument -goodPCFG with -goodFactored and limit the sentence length (to limit memory use) by adding -maxLength 40. Note that the factored parser is about five times slower than the PCFG parser alone.
ParserDemo.java class as an example, develop a simple command-line interface to the LexicalizedParser that includes support for active learning. Your package should train a parser on a given training set and evaluate it on a given test set, as with the bundled LexicalizedParser. Additionally, choose a random set of sentences from the "unlabeled" training pool whose word count totals approximately 1500 (this represents approximately 60 additional sentences of average length). Output the original training set plus the annotated versions of the randomly selected sentences as your next training set. Output the remaining "unlabeled" training instances as your next "unlabeled" training pool. Lastly, collect your results for this iteration, including at a minimum the following:
(Hwa, 2000), "Sample selection for statistical grammar induction." In the Proceedings of the 2000 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora.