Statistical Natural Language Processing

Statistical techniques can help remove much of the ambiguity in natural language.

A type is a distinct word form, while a token is a single occurrence of a type in running text; for example, "the cat saw the dog" contains five tokens but only four types. N-grams are sequences of N consecutive words: unigrams (N = 1), bigrams (N = 2), trigrams (N = 3), etc. Statistics on the occurrences of n-grams can be gathered from text corpora.[Corpus (Latin for body) is singular; corpora is plural. A corpus is a collection of natural language text, sometimes analyzed and annotated by humans.]
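As a minimal sketch of these definitions, the following Python counts tokens, types, and n-grams in a toy sentence. The example text, the helper name ngrams, and the whitespace tokenization are all illustrative assumptions; a real corpus would need more careful tokenization.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Toy "corpus"; splitting on whitespace is a simplifying assumption.
text = "the cat sat on the mat and the dog sat on the rug"
tokens = text.split()      # every occurrence counts as a token
types = set(tokens)        # distinct word forms

unigram_counts = Counter(ngrams(tokens, 1))
bigram_counts = Counter(ngrams(tokens, 2))

print(len(tokens), "tokens,", len(types), "types")
print(bigram_counts.most_common(2))
```

Here the word "the" is one type but contributes four tokens.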

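Unigram counts give the frequencies of individual words. Bigrams begin to take context into account by relating each word to the one before it. Trigrams capture more context, but reliable statistics are harder to obtain for larger groups: the number of possible n-grams grows rapidly with N, so most of them occur rarely or never in a given corpus.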
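One way to see how bigrams use context is to estimate the probability of a word given the previous word by relative frequency, count(prev next) / count(prev). The sketch below assumes that simple estimate (no smoothing) and a toy sentence; the function name bigram_probs is hypothetical.

```python
from collections import Counter, defaultdict

def bigram_probs(tokens):
    """Estimate P(next | prev) as count(prev, next) / count(prev)."""
    pair_counts = Counter(zip(tokens, tokens[1:]))
    # Only tokens that have a successor can act as a "previous" word.
    prev_counts = Counter(tokens[:-1])
    probs = defaultdict(dict)
    for (prev, nxt), count in pair_counts.items():
        probs[prev][nxt] = count / prev_counts[prev]
    return probs

tokens = "the cat sat on the mat and the dog sat on the rug".split()
probs = bigram_probs(tokens)

print(probs["the"])   # four different followers, each equally likely here
print(probs["sat"])   # "sat" is always followed by "on" in this text
```

Any bigram that never appears in the text gets no entry at all, which illustrates why sparse counts become a problem as N grows.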

N-gram approximations to Shakespeare, where item N shows text generated from an N-gram model:[D. Jurafsky and J. Martin, Speech and Language Processing, Prentice-Hall, 2000.]

  1. Every enter now severally so, let
  2. What means, sir. I confess she? then all sorts, he is trim, captain.
  3. Sweet prince, Falstaff shall die. Harry of Monmouth's grave.
  4. They say all lovers swear more performance than they are wont to keep obliged faith unforfeited!
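Text like the approximations above can be produced by repeatedly sampling the next word according to how often it followed the previous N-1 words in the corpus. The following is a rough sketch of that idea, not the procedure used in the cited book; the toy corpus and the function names train and generate are assumptions for illustration.

```python
import random
from collections import Counter, defaultdict

def train(tokens, n):
    """Map each (n-1)-word context to a Counter of the words that follow it."""
    model = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        model[context][tokens[i + n - 1]] += 1
    return model

def generate(model, n, length, rng):
    """Sample up to `length` words, each conditioned on the previous n-1 words."""
    out = list(rng.choice(sorted(model)))   # start from a randomly chosen seen context
    while len(out) < length:
        context = tuple(out[len(out) - (n - 1):])
        followers = model.get(context)
        if not followers:                   # no continuation was ever observed: stop
            break
        words = list(followers)
        weights = list(followers.values())
        out.append(rng.choices(words, weights=weights)[0])
    return out

tokens = "the cat sat on the mat and the dog sat on the rug".split()
rng = random.Random(0)                      # fixed seed so runs are repeatable
for n in (1, 2, 3):
    model = train(tokens, n)
    print(n, " ".join(generate(model, n, 8, rng)))
```

With larger N the output copies longer stretches of the training text, which is consistent with the 4-gram example reading almost like genuine Shakespeare.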

