CS 388 Natural Language Processing
Homework 3 FAQ

  1. What do you mean by "cache the value of the sample selection function"?

    If you sort by a function that relies on a parse of a sentence, compute that function once per sentence and store the result; otherwise you will re-parse two sentences for every comparison the sort performs. This matters most if you use Comparators. Of course, you cannot cache values across runs: the value of the sample selection function will change when you re-train your parser.
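
    As a rough illustration of this kind of caching, the sketch below scores each candidate exactly once and then sorts on the stored scores. The class and method names are hypothetical, and the scoring function is passed in as a parameter rather than tied to any particular parser call.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.ToDoubleFunction;

// Hypothetical helper, not part of the assignment code.
class CachedSelection {
    /**
     * Returns a copy of candidates sorted by descending score. The (possibly
     * expensive) scoring function is called exactly once per candidate.
     */
    static <T> List<T> sortByScore(List<T> candidates, ToDoubleFunction<T> score) {
        int n = candidates.size();

        // Compute every score up front; the comparator below only reads this array.
        double[] scores = new double[n];
        List<Integer> order = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            scores[i] = score.applyAsDouble(candidates.get(i));
            order.add(i);
        }

        // Sort candidate indices by their cached scores, highest first.
        order.sort(Comparator.comparingDouble((Integer i) -> scores[i]).reversed());

        List<T> sorted = new ArrayList<>(n);
        for (int i : order) {
            sorted.add(candidates.get(i));
        }
        return sorted;
    }
}
```

    The cache lives only inside one call, so it is rebuilt after every re-training step, which is what you want.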

  2. What does "-mx1500m" mean?

    It sets the maximum heap size of the Java VM to 1500 MB, roughly 1.5 GB (the same setting as -Xmx1500m). You may want to lower it if you're working on a machine with less memory, or raise it if you find your VM running out of memory.
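
    If you want to confirm that the setting took effect, a small check like the following prints the heap limit the VM is actually using. The class name is hypothetical, and the reported number may differ slightly from the value you requested.

```java
// Hypothetical helper for checking the effect of the heap-size flag.
class HeapCheck {
    /** Maximum heap, in megabytes, that this JVM will attempt to use. */
    static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + maxHeapMegabytes() + " MB");
    }
}
```

    Run it with the same memory flag you pass to your own program and compare the output.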

  3. What should I do about single-word sentences? What about sentences that can't be parsed?

    You'll run across all kinds of practical issues like this. Do something reasonable and describe it as part of your description of experimental design.

  4. How can I add Trees to a Treebank?

    The Treebank class doesn't permit this, so it might be difficult. Additionally, you'll want to remove the sentences that you added to your Treebank from your "unlabeled" candidate pool. One approach is to write out your next training bank and candidate pool to new files at the end of an iteration, terminate, and call your program again to process the next iteration. This has the advantage of letting you resume your training at any point. Reading in a Treebank of the size we're dealing with takes only a few seconds, so efficiency should not be a big concern with this approach.
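
    The file-per-iteration approach might look roughly like the sketch below. It makes several assumptions that are not part of the assignment: each tree is held as one bracketed string per list entry, the candidate list is already sorted with the most useful sentences first, and all class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; trees are assumed to be stored as bracketed strings.
class IterationFiles {
    /**
     * Moves the first k candidates into the training set.
     * Returns two lists: [0] the next training bank, [1] the remaining candidate pool.
     */
    static List<List<String>> split(List<String> train, List<String> candidates, int k) {
        int cut = Math.max(0, Math.min(k, candidates.size()));

        List<String> nextTrain = new ArrayList<>(train);
        nextTrain.addAll(candidates.subList(0, cut));

        // Selected sentences are dropped from the pool so they are never chosen twice.
        List<String> remaining = new ArrayList<>(candidates.subList(cut, candidates.size()));

        List<List<String>> result = new ArrayList<>();
        result.add(nextTrain);
        result.add(remaining);
        return result;
    }

    /** Writes both lists to disk, one tree per line, so the next run can pick them up. */
    static void writeNextIteration(List<String> train, List<String> candidates, int k,
                                   Path nextTrainFile, Path nextCandidateFile) {
        List<List<String>> next = split(train, candidates, k);
        try {
            Files.write(nextTrainFile, next.get(0));
            Files.write(nextCandidateFile, next.get(1));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

    Because each iteration leaves its own pair of files behind, restarting from any earlier iteration only requires pointing the program at that iteration's files.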

  5. What do you mean by "develop a simple command line interface to the LexicalizedParser class"?

    Your code should compile against LexicalizedParser and create an instance of it, not make a system call that invokes the Java interpreter. Your class should have a main method and take some arguments on the command line. For reference, the TA's code took the following arguments: --trainBank <file>, --candidateBank <file>, --testBank <file>, --nextTrainBank <file>, --nextCandidatePool <file>, --selectionFunction <random|treeEntropy|...>. For each iteration, the first three arguments were files that were read, and the next two were filenames that were written to. You do not necessarily need to implement your active learner this way. Please remember to list the actual commands you ran in your README file.
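
    One simple way to read arguments of this "--name value" shape is sketched below. This is not the TA's code: the class name is hypothetical, and the sketch only collects the options into a map without constructing a parser or reading any files.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical argument reader for options of the form "--name value".
class ActiveLearnerArgs {
    /** Collects "--name value" pairs into a map keyed by name (without the dashes). */
    static Map<String, String> parse(String[] args) {
        Map<String, String> options = new LinkedHashMap<>();
        for (int i = 0; i < args.length; i += 2) {
            // Every option must start with "--" and be followed by a value.
            if (!args[i].startsWith("--") || i + 1 >= args.length) {
                throw new IllegalArgumentException("Expected --name <value>, got: " + args[i]);
            }
            options.put(args[i].substring(2), args[i + 1]);
        }
        return options;
    }

    public static void main(String[] args) {
        Map<String, String> options = parse(args);
        System.out.println("Training bank: " + options.get("trainBank"));
        // Falling back to "random" here is an illustrative choice, not a requirement.
        System.out.println("Selection function: "
                + options.getOrDefault("selectionFunction", "random"));
    }
}
```

    From here, main would go on to build the Treebanks and train the parser using the values in the map.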

  6. Do I need to use FileFilter in my code?

    Unlikely. You can just construct your Treebanks from one file each, modifying the makeTreebank() code appropriately when you copy it into your class.

  7. The DemoParser reads in a serialized parser (englishPCFG.ser.gz). What do I do with that?

    Nothing. Get rid of it. You're training your own parsers in this assignment.