/u/mooney/cs388-code/nlp/lm/BigramModel.java. The code uses
simple linear interpolation with a unigram model for smoothing, just using
fixed weights (0.9 and 0.1) for combining the bigram and unigram model. During
training, it replaces the first occurrence of a token with < UNK> for
"unknown" in order to handle novel (out-of-vocabulary, OOV) words encountered
in testing (which are also replaced with < UNK>). It computes complete
sentence probability and perplexity in log space in order to avoid problems
with floating-point underflow when multiplying many small probabilities. Read
the comments in the code to familiarize yourself with this short program.
This model can be trained and tested on sentence-segmented data in the
Linguistic Data Consortium (LDC)'s Penn Treebank collection. See the part-of-speech
tagged (POS) data in /projects/nlp/penn-treebank3/tagged/pos/.
The class POSTaggedFile in /u/mooney/cs388-code/nlp/lm/
contains code for extracting tokenized, sentence-segmented text from
these POS-tagged files, see the comments in this file for details.
The main method for BigramModel trains and tests the
model on LDC tagged files. It takes a list of tagged files and/or directories
as arguments and its last argument (a real value between 0 and 1) is the
fraction of the sentences in this data to use for testing. See a trace file running this program. It shows a
trace of training and testing the model on the atis (airline
booking query), wsj (Wall Street Journal text) and
brown (Brown corpus of mixed-genre text) corpora using the first
90% of the sentences for training and the last 10% for testing. It shows
perplexity for both the training and held-out test data. The additional "Word
Perplexity" measure excludes the prediction for the end-of-sentence (</S>)
token, which will be used for the comparisons discussed below.
Your first task is to produce a "backward" bigram model that models the
generation of a sentence from right to left. Think of a very simple way to very
quickly use BigramModel to produce a
BackwardBigramModel that is identical except for the modeling
direction. The one significant difference between a left-to-right and a
right-to-left (backward) model is that the backward model's final prediction is
to predict sentence-start (< S>) rather than sentence-end (</S>).
Therefore, the "Word Perplexity" version of the evaluation metric that ignores
the prediction of sentence boundaries in either direction, is the fairest way
to compare these two models.
Test your backward model on the atis, wsj, and
brown corpora and produce a trace analogous to the BigramModel trace file for
BackwardBigramModel.
Next build a BidirectionalBigramModel that combines the forward
and backward model. When determining the probability of a word in the
sentence, it should use both the estimate from its forward and backward
contexts by linearly interpolating the probability predicted for that token by
the BigramModel and the BackwardBigramModel. By
default, just weight the prediction of both models equally when interpolating a
probability for each word/token in the sentence, although you may try different
weights if you believe the comparative results from the forward and backward
model warrant that they be weighted differently.
Test your bidirectional model on the atis, wsj, and
brown corpora and produce a trace analogous to the BigramModel trace file for
BidirectionalBigramModel.
Both your BackwardBigramModel and
BidirectionalBigramModel should take the same input and produce
the same style of output as BigramModel, except you only need to
show the "Word Perplexity" for the bidirectional model. That is, for the
bidirectional model you should not try to combine the probabilities for
predicting sentence-start and sentence-end since these are not comparable
tasks, so just don't include sentence boundary prediction in the perplexity
score at all.