CS 388 Natural Language Processing
Due: April 7, 2016
Homework 3: Statistical Parsing with
"Unsupervised" Domain Adaptation
When learning statistical parsers, the resulting parser can be quite specific to
the genre of the training corpus. A parser trained on the WSJ corpus will perform
significantly worse on the Brown corpus, which contains a wider variety of genres.
The task of domain adaptation or transfer learning is to adapt a
system trained on one "source" domain to perform better on a new "target"
domain. Typically, one assumes that there is sufficient labeled training data
in the source, but there is little or no labeled data for the target, since
gathering sufficient training data in every new domain is expensive and labor
intensive. In this assignment, you will look at the special case of
"unsupervised" domain adaptation, where one must adapt a parser trained on
sufficient labeled data in the source (e.g. WSJ) to a new domain (e.g. Brown)
where only unlabeled training data is available.
One approach to this problem is self-training, a form
of semi-supervised learning in which a system trained
on a "seed" set of labeled data is used to produce automatically labeled output (e.g. parse
trees) for an unlabeled set of "self-training" data; the resulting
"pseudo-supervised" data is then added to the seed training data and the system is retrained.
In this homework, we'll roughly replicate one of the experiments from an
ACL 2007 paper by Reichart and Rappoport (cited at the end of this assignment)
that explores the feasibility of self-training for
PCFGs. Note that this self-training approach is similar to semi-supervised
EM, except the system only uses "hard supervision" in the form of the most
confident output for the self-training data rather than "soft supervision" that
retrains on probabilistically labeled data, propagating the uncertainty in the
"pseudo-supervised" data thru the retraining process. Therefore, this approach is
sometimes called "hard EM".
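To make the distinction concrete in terms of a PCFG's rule counts: with a current model θ, the self-training data contributes counts in one of two ways (gold counts from the seed data are added in both cases):

soft EM:  count(r) = Σ_s Σ_t P(t | s; θ) · count_r(t)
hard EM:  count(r) = Σ_s count_r(argmax_t P(t | s; θ))

where s ranges over self-training sentences, t over candidate parses of s, and count_r(t) is the number of times rule r is used in parse t.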
When using self-training for unsupervised domain adaptation, the labeled seed
data is taken from the source domain, the unlabeled self-training data is taken
from the target domain, and the final system is tested in the target domain.
Reichart and Rappoport refer to this as the "OI" setting, where the seed data
is outside (O) the test domain, but the self-training data is inside (I) the
test domain. You are asked to roughly replicate some of the results from the
Reichart and Rappoport paper, using the Stanford parser instead of Collins'
parser. Follow the steps outlined below and then turn in any code you write
electronically and an electronic copy of a report describing your methods, your
experimental results, and your interpretation and explanation of these results.
Specific information regarding how to submit will be given in Canvas.
Additionally, FAQs and a set of tips for working with the Stanford Parser are provided for this assignment.
- Get a copy of the Stanford parser and consult its documentation for details. The FAQ page is a good starting place for info, but you may want to look through the included script
to get an idea of how to interact with the parser. The paper
referenced above makes use of Collins' parser, but we're going to
try to replicate the results using the Stanford parser, which uses
a different architecture combining a powerful unlexicalized parser with a lexicalized dependency parser. If you're interested, you can read more about the unlexicalized parser and the full factored model in the Klein and Manning papers cited at the end of this assignment.
- (Optional) To familiarize yourself with the parser, run the PCFG parser on the annotated WSJ corpus located in
/projects/nlp/penn-treebank3/parsed/mrg/wsj/ and collect the labeled output:
java -cp "stanford-parser.jar:slf4j-api.jar:" -server -mx1500m edu.stanford.nlp.parser.lexparser.LexicalizedParser
-evals "tsv" -goodPCFG
-train /projects/nlp/penn-treebank3/parsed/mrg/wsj/ 200-270
-testTreebank /projects/nlp/penn-treebank3/parsed/mrg/wsj/ 2000-2100
The labeled output will contain the parses of the WSJ sentences occurring in files 2000 to
2100 (roughly section 20) and will be in Penn Treebank format.
Training on 200-270 yields roughly 1000 sentences of input. The final
output indicating performance will look something like this:
Testing on treebank done [108.0 sec].
pcfg LP/LR summary evalb: LP: 70.55 LR: 72.81 F1: 71.66 Exact: 12.56 N: 2069
dep DA summary evalb: LP: 0.0 LR: 0.0 F1: 0.0 Exact: 0.0 N: 0
factor LP/LR summary evalb: LP: 70.55 LR: 72.81 F1: 71.66 Exact: 12.56 N: 2069
factor Tag summary evalb: LP: 89.71 LR: 89.71 F1: 89.7 Exact: 14.35 N: 2069
factF1 factDA factEx pcfgF1 depDA factTA num
71.67 12.57 71.67 89.71 2069
This gives the bracketed scoring results (described in your book), including labeled precision (LP), labeled recall (LR), and F1. Pay attention to the pcfg LP/LR line, as it gives the scores for the unlexicalized PCFG parse.
If you would like to try the factored parser, replace the -goodPCFG argument
with -goodFactored and limit the sentence length (to limit memory use) by adding
-maxLength 40. Note that the factored parser is about five times slower than the PCFG parser alone.
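For example, the factored run would look something like this (the same command as above with the substitutions just described; double-check the flags against your version of the parser):
java -cp "stanford-parser.jar:slf4j-api.jar:" -server -mx1500m edu.stanford.nlp.parser.lexparser.LexicalizedParser
-evals "tsv" -goodFactored -maxLength 40
-train /projects/nlp/penn-treebank3/parsed/mrg/wsj/ 200-270
-testTreebank /projects/nlp/penn-treebank3/parsed/mrg/wsj/ 2000-2100
Whichever variant you run, recall that F1 is the harmonic mean of labeled precision and recall, F1 = 2 * LP * LR / (LP + LR); for the sample output above, 2 * 70.55 * 72.81 / (70.55 + 72.81) ≈ 71.66, matching the pcfg summary line.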
- Using the
ParserDemo.java class as an example, develop a simple command line interface to the
LexicalizedParser that includes support for "unsupervised" domain adaptation. As with the bundled
command-line interface, your program should train the parser on a given seed set, use the
trained parser to parse the self-training set, combine the whole automatically
annotated self-training set with the seed set to retrain the parser, and
evaluate the retrained parser on the given test set (a rough sketch of this
pipeline is given after this step). Write preprocessing code to create seed, self-training and test
sets. You will use WSJ sections 02-22 as the labeled seed data and the Brown
corpus as the unlabeled self-training and test sets. To create the
seed set, start by extracting the first 1,000 sentences from WSJ
section 02. For the unlabeled self-training and test sets, you need to
split the Brown corpus, which is available alongside WSJ in the Penn Treebank
distribution. Brown is broken into genres by directory; use the first 90% of the
sentences in each genre as the unlabeled self-training set and the
rest as the test set. You can concatenate the self-training sentences into one
file and the test sentences into another to make it easy for your
parser. Use your command line interface to train, retrain, and evaluate your parser based
on F1 score. Now try increasing the size of the seed set to 2,000,
3,000, 4,000, 5,000, 7,000, 10,000, 13,000, 16,000, 20,000, 25,000,
30,000, 35,000 from WSJ sections 02-22 and generate a learning curve
of F1 score as a function of the number of sentences in the seed set. Also
plot a baseline by conducting a control experiment without
self-training on "unlabeled" Brown data. Compare your
results and see if self-training always helps improve performance as
you increase the seed set. Additionally, generate a normal
"in-domain" learning curve, where you use the same seed set
but test on WSJ section 23 without self-training. Comparing with your
baseline, does F1 score drop a lot when shifting from testing on WSJ
to testing on Brown? Generate a single graph that compares the three
learning curves: normal training and testing on WSJ, normal training
on WSJ and testing on Brown, and unsupervised domain adaptation by
normal training on WSJ, self-training on Brown, and then testing on Brown.
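To make the train / self-train / retrain / evaluate loop concrete, here is a minimal sketch in the spirit of ParserDemo.java. It is an outline under several assumptions, not a ready-made solution: the API calls (trainFromTreebank, apply, MemoryTreebank, EvaluateTreebank) are the ones described in the parser's documentation and example code, but verify them against the version you download, and readSentences is a hypothetical placeholder for the preprocessing you must write.

// SelfTrainingSketch.java -- a rough outline only, not a complete solution.
import java.util.List;

import edu.stanford.nlp.ling.HasWord;
import edu.stanford.nlp.parser.lexparser.EvaluateTreebank;
import edu.stanford.nlp.parser.lexparser.LexicalizedParser;
import edu.stanford.nlp.parser.lexparser.Options;
import edu.stanford.nlp.trees.MemoryTreebank;
import edu.stanford.nlp.trees.Tree;

public class SelfTrainingSketch {

  public static void main(String[] args) {
    String seedPath = args[0];       // labeled seed treebank (e.g. a WSJ slice)
    String selfTrainPath = args[1];  // unlabeled self-training sentences (e.g. Brown)
    String testPath = args[2];       // labeled test treebank (e.g. Brown test split)
    Options op = new Options();

    // 1. Train an initial parser on the labeled seed data.
    LexicalizedParser seedParser =
        LexicalizedParser.trainFromTreebank(seedPath, null, op);

    // 2. Parse the self-training sentences with the seed parser and pool the
    //    resulting "pseudo-supervised" trees with the gold seed trees.
    MemoryTreebank combined = new MemoryTreebank();
    combined.loadPath(seedPath);                        // gold seed trees
    for (List<HasWord> sentence : readSentences(selfTrainPath)) {
      Tree autoTree = seedParser.apply(sentence);       // the parser's best (most confident) parse
      combined.add(autoTree);
    }

    // 3. Retrain on seed + pseudo-labeled data and evaluate on the test treebank.
    LexicalizedParser retrained = LexicalizedParser.trainFromTreebank(combined, op);
    MemoryTreebank test = new MemoryTreebank();
    test.loadPath(testPath);
    new EvaluateTreebank(retrained).testOnTreebank(test);  // prints the LP/LR/F1 summaries
  }

  // Hypothetical helper: read the self-training file and return each sentence as a
  // List<HasWord> (e.g. by taking the yield of the Brown trees after discarding
  // their gold labels). Writing this preprocessing is part of the assignment.
  private static Iterable<List<HasWord>> readSentences(String path) {
    throw new UnsupportedOperationException("preprocessing left to you");
  }
}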
- Also investigate how increasing the size of the self-training set
improves F1 by varying the size of self-training set. You will use the first
10,000 labeled sentences from WSJ sections 02-22 as your seed set and the same
Brown test set as before. Now increase the self-training set from 1,000 to
2,000, 3,000, 4,000, 5,000, 7,000, 10,000, 13,000, 17,000, and 21,000 sentences
drawn from the Brown self-training data, and plot a learning curve showing how F1
score changes as you increase the self-training set (a small helper sketch for
slicing off the first n trees of a treebank appears after this step).
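For building the individual points on these learning curves, one convenient approach is to load a concatenated treebank file into memory and keep only its first n trees, as in the sketch below. FirstNTrees and its arguments are hypothetical names, and the sketch assumes MemoryTreebank implements List<Tree> (true in the parser versions we have seen; verify against yours).

import edu.stanford.nlp.trees.MemoryTreebank;

// Sketch: keep only the first n trees of a treebank file when building a
// learning-curve point.
public class FirstNTrees {
  public static MemoryTreebank firstN(String treebankFile, int n) {
    MemoryTreebank full = new MemoryTreebank();
    full.loadPath(treebankFile);                       // read all trees
    MemoryTreebank subset = new MemoryTreebank();
    subset.addAll(full.subList(0, Math.min(n, full.size())));
    return subset;
  }
}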
- Now try inverting the "source" and "target" and repeat the two preceding experiments. You
will use the 90% Brown split from before as the labeled seed set, WSJ sections 02-22 as the
self-training data, and WSJ section 23 as the test set. First, as before, you
will generate a single graph that compares the three learning curves showing
how F1 score changes as you increase the Brown seed set from 1,000 to 21,000:
normal training and testing on Brown, normal training on Brown and testing on
WSJ, and unsupervised domain adaptation by normal training on Brown,
self-training on WSJ, and then testing on WSJ. Second, use the first 10,000
labeled sentences from the 90% Brown data as your seed set. Plot a learning
curve as you increase the WSJ self-training set from 1,000 to 35,000. Plot the
same points on the learning curves as before.
Report: Your report should briefly describe your implementation and
contain a concise but detailed discussion of the experiments you ran, including
nicely formatted learning curves showing the relationship between F1 score and
the size of seed and self-training sets as discussed above. In your discussion,
be sure to address at least the following questions.
- How much does performance drop from in-domain testing to out-of-domain testing?
- How does unsupervised domain adaptation impact performance on the target domain?
- How does increasing the size of the seed and self-training sets
affect the relative performance?
- How does inverting the "source" and "target" impact your results and why?
- How do your results compare to the results described in the Reichart and
Rappoport paper for the OI setting?
Be sure to turn in an electronic copy of your report and any code you wrote, following the submission instructions given in Canvas. Also be sure to include a README file listing the specific commands you used for this assignment.
References
Roi Reichart and Ari Rappoport. Self-Training for Enhancement and Domain Adaptation of Statistical Parsers Trained on Small Datasets. ACL 2007.
Dan Klein and Christopher D. Manning. Fast Exact Inference with a Factored Model for Natural Language Parsing. NIPS 2002.
Dan Klein and Christopher D. Manning. Accurate Unlexicalized Parsing. ACL 2003.