CS 388 Natural Language Processing
Homework 3: Statistical Parsing with "Unsupervised" Domain Adaptation

Due: March 31, 2014

When learning statistical parsers, the resulting parser can be quite specific to the genre of the training corpus. A parser trained on WSJ will perform significantly worse on Brown, which contains a wider variety of genres. The task of domain adaptation or transfer learning is to adapt a system trained on one "source" domain to perform better on a new "target" domain. Typically, one assumes that there is sufficient labeled training data in the source but little or no labeled data for the target, since gathering sufficient training data in every new domain is expensive and labor intensive. In this assignment, you will look at the special case of "unsupervised" domain adaptation, where one must adapt a parser trained on sufficient labeled data in the source (e.g. WSJ) to a new domain (e.g. Brown) where only unlabeled training data is available.

One approach to this problem is self-training, a form of semi-supervised learning in which a system trained on a "seed" set of labeled data is used to produce automatically labeled output (e.g. parse trees) for an unlabeled set of "self-training" data; the resulting "pseudo-supervised" data is then added to the seed training data, and the system is retrained. In this homework, we'll roughly replicate one of the experiments from an ACL paper by Reichart and Rappoport that explores the feasibility of self-training for PCFGs. Note that this self-training approach is similar to semi-supervised EM, except that the system only uses "hard supervision" in the form of the most confident output for the self-training data rather than "soft supervision" that retrains on probabilistically labeled data, propagating the uncertainty in the "pseudo-supervised" data through the retraining process. Therefore, this approach is sometimes called "hard EM".

When using self-training for unsupervised domain adaptation, the labeled seed data is taken from the source domain, the unlabeled self-training data is taken from the target domain, and the final system is tested on the target domain. Reichart and Rappoport refer to this as the "OI" setting, where the seed data is outside (O) the test domain, but the self-training data is inside (I) the test domain. You are asked to roughly replicate some of the results from the Reichart and Rappoport paper, using the Stanford parser instead of the Collins parser. Follow the steps outlined below and then turn in electronically any code you write along with a report describing your methods, your experimental results, and your interpretation and explanation of these results. Specific information regarding how to submit is here.
  1. Get a copy of the Stanford parser here. The FAQ page is a good starting place for info, but you may want to look through the script makeSerialized.csh to get an idea of how to interact with the parser. The paper referenced above makes use of the Collins parser, but we're going to try to replicate its results using the Stanford parser, which uses a different architecture that combines a powerful unlexicalized PCFG parser with a lexicalized dependency parser. If you're interested, you can read more about the unlexicalized parser here and the full factored model here.
  2. (Optional) To familiarize yourself with the parser, run the PCFG parser on the annotated WSJ corpus located in /projects/nlp/penn-treebank3/parsed/mrg/wsj/ and collect the labeled output:
    java -cp "stanford-parser.jar:" -server -mx1500m edu.stanford.nlp.parser.lexparser.LexicalizedParser \
         -evals "tsv" -goodPCFG \
         -train /projects/nlp/penn-treebank3/parsed/mrg/wsj/ 200-270 \
         -testTreebank /projects/nlp/penn-treebank3/parsed/mrg/wsj/ 2000-2100 \
      >  labeled_output.txt
    
    The file labeled_output.txt will contain the parses of the WSJ sentences occurring in files 2000 to 2100 (roughly section 20) and will be in the Penn Treebank format. Training on 200-270 yields roughly 1000 sentences of input. The final output indicating performance will look something like this:
    Testing on treebank done [108.0 sec].
    pcfg LP/LR summary evalb: LP: 70.55 LR: 72.81 F1: 71.66 Exact: 12.56 N: 2069
    dep DA summary evalb: LP: 0.0 LR: 0.0 F1: 0.0 Exact: 0.0 N: 0
    factor LP/LR summary evalb: LP: 70.55 LR: 72.81 F1: 71.66 Exact: 12.56 N: 2069
    factor Tag summary evalb: LP: 89.71 LR: 89.71 F1: 89.7 Exact: 14.35 N: 2069
    factF1  factDA  factEx  pcfgF1  depDA   factTA  num
    71.67           12.57   71.67           89.71   2069
    
    This gives the bracketed scoring results (described in your book), including labeled precision (LP), labeled recall (LR), and F1. Pay attention to the line starting with pcfg LP/LR, as it gives the scores for the unlexicalized PCFG parser. If you would like to try the factored parser, replace the argument -goodPCFG with -goodFactored and limit the sentence length (to limit memory use) by adding -maxLength 40. Note that the factored parser is about five times slower than the PCFG parser alone.
  3. Using the ParserDemo.java class as an example, develop a simple command-line interface to the LexicalizedParser that includes support for "unsupervised" domain adaptation. As with the bundled LexicalizedParser, your program should train the parser on a given seed set, use the trained parser to parse the self-training set, combine the entire automatically annotated self-training set with the seed set to retrain the parser, and finally evaluate the retrained parser on the given test set. A schematic driver for this pipeline is sketched after this list.
  4. Write preprocessing code to create the seed, self-training, and test sets. You will use WSJ sections 02-22 as the labeled seed data and the Brown corpus as the unlabeled self-training and test data. To create the seed set, start by extracting the first 1,000 sentences from WSJ section 02. For the unlabeled self-training and test sets, you need to split Brown, which is available in /projects/nlp/penn-treebank3/parsed/mrg/brown/. Since Brown is broken into genres by directory, use the first 90% of the sentences in each genre as the unlabeled self-training set and the rest as the test set. You can concatenate the self-training sentences into one file and the test sentences into another to make them easy for your parser to read; sketches of this preprocessing are given after this list. Follow step 3 to train, retrain, and evaluate your parser based on F1 score. Now try increasing the size of the seed set to 2,000, 3,000, 4,000, 5,000, 7,000, 10,000, 13,000, 16,000, 20,000, 25,000, 30,000, and 35,000 sentences from WSJ sections 02-22 and generate a learning curve of F1 score as a function of the number of sentences in the seed set. Also plot a baseline by conducting a control experiment without self-training on the "unlabeled" Brown data. Compare your results and see whether self-training always improves performance as you increase the seed set. Additionally, generate a normal "in-domain" learning curve, where you use the same seed sets but test on WSJ section 23 without self-training. Compared with your baseline, does the F1 score drop substantially when shifting from testing on WSJ to testing on Brown? Generate a single graph that compares the three learning curves: normal training and testing on WSJ, normal training on WSJ and testing on Brown, and unsupervised domain adaptation by normal training on WSJ, self-training on Brown, and then testing on Brown.
  5. Also investigate how the size of the self-training set affects F1. You will use the first 10,000 labeled sentences from WSJ sections 02-22 as your seed set and the same Brown test set as before. Increase the self-training set from 1,000 to 2,000, 3,000, 4,000, 5,000, 7,000, 10,000, 13,000, 17,000, and 21,000 sentences of the Brown self-training data, and plot a learning curve showing how the F1 score changes as you increase the self-training set.
  6. Now try inverting the "source" and "target" domains and repeat steps 4 and 5. You will use the 90% of the Brown data that previously served as self-training data as the labeled seed set, WSJ sections 02-22 as the self-training data, and WSJ section 23 as the test set. First, as in step 4, generate a single graph that compares three learning curves showing how the F1 score changes as you increase the Brown seed set from 1,000 to 21,000 sentences: normal training and testing on Brown, normal training on Brown and testing on WSJ, and unsupervised domain adaptation by normal training on Brown, self-training on WSJ, and then testing on WSJ. Second, use the first 10,000 labeled sentences from the 90% Brown data as your seed set and plot a learning curve as you increase the WSJ self-training set from 1,000 to 35,000 sentences. Plot the same points on the learning curves as before.
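
One possible skeleton for the command-line driver described in step 3 is sketched below. This is only a sketch: SelfTrainingParser is a made-up class name, and the helpers loadTrees, trainParser, parseSentences, and evaluateF1 are hypothetical stubs that you should replace with the actual training, parsing, and evaluation calls you find in ParserDemo.java, makeSerialized.csh, and the LexicalizedParser source.

    // SelfTrainingParser.java -- schematic driver for step 3 (hypothetical helpers; fill in real API calls).
    import edu.stanford.nlp.parser.lexparser.LexicalizedParser;
    import edu.stanford.nlp.trees.Tree;
    import java.util.ArrayList;
    import java.util.List;

    public class SelfTrainingParser {
      public static void main(String[] args) {
        String seedFile = args[0];       // labeled seed treebank (source domain)
        String selfTrainFile = args[1];  // "unlabeled" self-training treebank (target domain)
        String testFile = args[2];       // labeled test treebank (target domain)

        // 1. Train on the labeled seed set alone.
        List<Tree> seed = loadTrees(seedFile);
        LexicalizedParser seedParser = trainParser(seed);

        // 2. Parse the self-training sentences; only the sentence yields are used, not the gold trees.
        List<Tree> pseudoLabeled = parseSentences(seedParser, loadTrees(selfTrainFile));

        // 3. Retrain once on the seed set plus the whole automatically parsed self-training set.
        List<Tree> combined = new ArrayList<Tree>(seed);
        combined.addAll(pseudoLabeled);
        LexicalizedParser adaptedParser = trainParser(combined);

        // 4. Evaluate both parsers on the target-domain test set; the first number is the
        //    no-self-training baseline needed for the control curves in steps 4 and 6.
        System.out.println("baseline F1:     " + evaluateF1(seedParser, testFile));
        System.out.println("self-trained F1: " + evaluateF1(adaptedParser, testFile));
      }

      // Hypothetical helpers -- replace the bodies with real Stanford parser calls.
      static List<Tree> loadTrees(String path) { throw new UnsupportedOperationException(); }
      static LexicalizedParser trainParser(List<Tree> trees) { throw new UnsupportedOperationException(); }
      static List<Tree> parseSentences(LexicalizedParser p, List<Tree> gold) { throw new UnsupportedOperationException(); }
      static double evaluateF1(LexicalizedParser p, String testPath) { throw new UnsupportedOperationException(); }
    }

Keeping the baseline (seed-only) score alongside the self-trained score makes it easy to fill in the control curves asked for in steps 4 and 6.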
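
For the Brown split in step 4, one way to write the 90%/10% self-training and test files is to reuse the parser's own treebank-reading classes, as sketched below. The use of DiskTreebank, loadPath, and pennPrint reflects my reading of recent parser releases, so verify the names against the version you download; the output file names are placeholders.

    // SplitBrown.java -- write the first 90% of each Brown genre to a self-training file
    // and the remaining 10% to a test file (treebank class names should be checked against
    // the parser version you are using; output file names are placeholders).
    import edu.stanford.nlp.trees.DiskTreebank;
    import edu.stanford.nlp.trees.Tree;
    import java.io.File;
    import java.io.PrintWriter;
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    public class SplitBrown {
      public static void main(String[] args) throws Exception {
        String brownDir = "/projects/nlp/penn-treebank3/parsed/mrg/brown";
        PrintWriter selfTrain = new PrintWriter("brown-selftrain.mrg");
        PrintWriter test = new PrintWriter("brown-test.mrg");

        File[] genres = new File(brownDir).listFiles();
        Arrays.sort(genres);                           // one subdirectory per genre, in a fixed order
        for (File genreDir : genres) {
          if (!genreDir.isDirectory()) continue;
          DiskTreebank tb = new DiskTreebank();
          tb.loadPath(genreDir.getPath());             // reads the .mrg files in this genre
          List<Tree> trees = new ArrayList<Tree>(tb);  // a Treebank is a Collection<Tree>
          int cut = (int) (0.9 * trees.size());        // first 90% -> self-training, rest -> test
          for (int i = 0; i < trees.size(); i++) {
            trees.get(i).pennPrint(i < cut ? selfTrain : test);
          }
        }
        selfTrain.close();
        test.close();
      }
    }

Both output files keep the gold trees; your driver should read only the sentence yields from the self-training file (treating it as unlabeled), while the test file keeps its labels for evaluation.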
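
The seed sets of increasing size for the learning curve in step 4 can be produced the same way, by reading WSJ sections 02-22 in order and writing out the first N trees for each point on the curve. The same caveats apply: the treebank classes and their loading order are assumptions to verify against your parser version, and the output file names are placeholders.

    // MakeSeeds.java -- write the first N trees of WSJ sections 02-22 to wsj-seed-N.mrg
    // for each seed size on the learning curve (same assumptions as the sketch above;
    // also assumes .mrg files within a section are visited in name order).
    import edu.stanford.nlp.trees.DiskTreebank;
    import edu.stanford.nlp.trees.Tree;
    import java.io.PrintWriter;
    import java.util.ArrayList;
    import java.util.List;

    public class MakeSeeds {
      public static void main(String[] args) throws Exception {
        String wsjDir = "/projects/nlp/penn-treebank3/parsed/mrg/wsj";
        DiskTreebank tb = new DiskTreebank();
        for (int sec = 2; sec <= 22; sec++) {
          tb.loadPath(String.format("%s/%02d", wsjDir, sec));  // sections 02 through 22, in order
        }
        List<Tree> trees = new ArrayList<Tree>(tb);

        int[] sizes = {1000, 2000, 3000, 4000, 5000, 7000, 10000,
                       13000, 16000, 20000, 25000, 30000, 35000};
        for (int n : sizes) {
          PrintWriter out = new PrintWriter("wsj-seed-" + n + ".mrg");
          for (int i = 0; i < n && i < trees.size(); i++) {
            trees.get(i).pennPrint(out);
          }
          out.close();
        }
      }
    }
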
Additionally, there are FAQs and a set of tips for working with the Stanford Parser for this assignment.

Report

Your report should briefly describe your implementation and contain a concise but detailed discussion of the experiments you ran, including nicely formatted learning curves showing the relationship between F1 score and the sizes of the seed and self-training sets, as discussed above. In your discussion, be sure to address at least the following questions.

Code

Be sure to turn in an electronic copy of your report and any code you wrote, following the submission steps here. Also include a README file listing the specific commands you used for this assignment.

References

Roi Reichart and Ari Rappoport. Self-Training for Enhancement and Domain Adaptation of Statistical Parsers Trained on Small Datasets. ACL 2007.

Dan Klein and Christopher D. Manning. Fast Exact Inference with a Factored Model for Natural Language Parsing. NIPS 2002.

Dan Klein and Christopher D. Manning. Accurate Unlexicalized Parsing. ACL 2003.