Problem 1)
P(people|<s>) = 0.9(1/5) + 0.1(3/20) = 0.195
P(topple|people) = 0.9(1/3) + 0.1(3/20) = 0.315
P(dictator|topple) = 0.9(1/3) + 0.1(3/20) = 0.315
P(</s>|dictator) = 0.9(1/3) + 0.1(5/20) = 0.325
P(<s> people topple dictator </s>)
= 0.195 * 0.315 * 0.315 * 0.325
= 0.006288
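The arithmetic above can be checked with a short script. This is a minimal sketch, assuming linear interpolation with weights 0.9 (bigram) and 0.1 (unigram) as the calculation suggests; the names `interpolate` and `factors` are illustrative and not part of the original solution.

```python
from fractions import Fraction

LAMBDA_BIGRAM = Fraction(9, 10)   # weight on the bigram estimate
LAMBDA_UNIGRAM = Fraction(1, 10)  # weight on the unigram estimate


def interpolate(bigram, unigram):
    """Linearly interpolate a bigram estimate with a unigram estimate."""
    return LAMBDA_BIGRAM * bigram + LAMBDA_UNIGRAM * unigram


# (bigram estimate, unigram estimate) for each factor, copied from the solution.
factors = [
    (Fraction(1, 5), Fraction(3, 20)),  # people given sentence start
    (Fraction(1, 3), Fraction(3, 20)),  # topple given people
    (Fraction(1, 3), Fraction(3, 20)),  # dictator given topple
    (Fraction(1, 3), Fraction(5, 20)),  # sentence end given dictator
]

smoothed = [interpolate(b, u) for b, u in factors]

# Multiply the interpolated factors to get the sentence probability.
sentence_prob = Fraction(1)
for p in smoothed:
    sentence_prob *= p

print([float(p) for p in smoothed])
print(round(float(sentence_prob), 6))
```

Exact fractions are used so that rounding only happens when printing.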
----------------------
Problem 2)
v1(1) = a01 * b1(A) = 0.2 * 0.6 = 0.12
v1(2) = a02 * b2(A) = 0.8 * 0.4 = 0.32
bt1(1) = 0
bt1(2) = 0
v2(1) = max(v1(1)*a11, v1(2)*a21) * b1(B)
= max(0.12*0.2 , 0.32*0.2) * 0.4
= 0.0256
v2(2) = max(v1(1)*a12, v1(2)*a22) * b2(B)
= max(0.12*0.6 , 0.32*0.6) * 0.6
= 0.1152
bt2(1) = argmax(0.12*0.2 , 0.32*0.2) = 2
bt2(2) = argmax(0.12*0.6 , 0.32*0.6) = 2
v3(sF) = max(v2(1)*a1F, v2(2)*a2F)
= max(0.0256*0.2, 0.1152 *0.2) = 0.02304
bt3(sF) = argmax(0.0256*0.2, 0.1152 *0.2) = 2
Most likely state sequence for AB = bt1(bt2(bt3(sF))), bt2(bt3(sF)), bt3(sF), SF
S0, S2, S2, SF
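The same computation can be sketched in code. The transition and emission values below are read off the arithmetic above (for example a12 = 0.6 and b1(B) = 0.4); the dictionary layout and the function name `viterbi` are assumptions made for illustration.

```python
START, FINAL = 0, "F"
STATES = [1, 2]

# Transition probabilities a[i][j], keyed as (from_state, to_state).
trans = {
    (0, 1): 0.2, (0, 2): 0.8,
    (1, 1): 0.2, (1, 2): 0.6, (1, "F"): 0.2,
    (2, 1): 0.2, (2, 2): 0.6, (2, "F"): 0.2,
}
# Emission probabilities b_state(symbol), keyed as (state, symbol).
emit = {(1, "A"): 0.6, (1, "B"): 0.4, (2, "A"): 0.4, (2, "B"): 0.6}


def viterbi(observations):
    """Return (probability, state path) of the most likely path."""
    # v[t][s]: probability of the best path ending in state s at time t.
    v = [{s: trans[(START, s)] * emit[(s, observations[0])] for s in STATES}]
    back = [{s: START for s in STATES}]
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in STATES:
            best_prev = max(STATES, key=lambda p: v[-1][p] * trans[(p, s)])
            scores[s] = v[-1][best_prev] * trans[(best_prev, s)] * emit[(s, obs)]
            pointers[s] = best_prev
        v.append(scores)
        back.append(pointers)
    # Transition into the final state, then follow the backpointers.
    last = max(STATES, key=lambda p: v[-1][p] * trans[(p, FINAL)])
    best_prob = v[-1][last] * trans[(last, FINAL)]
    path = [last]
    for pointers in reversed(back[1:]):
        path.append(pointers[path[-1]])
    path.reverse()
    return best_prob, [START] + path + [FINAL]


prob, path = viterbi("AB")
print(path, round(prob, 5))
```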
-----------------------
Problem 3)
(S (NP (N GUARD)) (VP (VP (V TESTS)) (PP (PREP LIKE) (NP (N GOLD)))))
;; A guard takes a test like gold would take a test.
(S (NP (ADJ GUARD) (NP (N TESTS))) (VP (V LIKE) (NP (N GOLD))))
;; Tests of a particular type, called guard tests, are fond of gold.
(S (VP (VP (V GUARD) (NP (N TESTS))) (PP (PREP LIKE) (NP (N GOLD))))) ; this is the "correct" parse
;; A command to protect tests as if they were gold.
(S (VP (V GUARD) (NP (NP (N TESTS)) (PP (PREP LIKE) (NP (N GOLD))))))
;; A command to protect only those tests which are similar to gold.
----------------------------------
Problem 4)
CNF changes:
S -> VP :
S -> KISS 0.4 x 0.3 x 0.5 = 0.06
S -> MARRY 0.4 x 0.3 x 0.5 = 0.06
S -> Verb NP 0.4 x 0.3 = 0.12
S -> VP PP 0.4 x 0.4 = 0.16
NP -> PropNoun:
NP -> DALLAS 0.6 x 0.2 = 0.12
NP -> MARY 0.6 x 0.2 = 0.12
NP -> BOB 0.6 x 0.3 = 0.18
NP -> AUSTIN 0.6 x 0.3 = 0.18
VP -> Verb:
VP -> KISS 0.3 x 0.5 = 0.15
VP -> MARRY 0.3 x 0.5 = 0.15
CKY chart (rows: first word of the span; columns: last word of the span)

       | Marry    | Bob                 | In       | Austin
-------+----------+---------------------+----------+-------------------------
Marry  | Verb 0.5 | S (Verb NP) 0.0108  |          | S (Verb NP1) 0.00031104
       | S 0.06   | VP2 (Verb NP) 0.027 |          | S2 (VP2 PP) 0.00031104
       | VP 0.15  |                     |          |
-------+----------+---------------------+----------+-------------------------
Bob    |          | PropNoun 0.3        |          | NP1 (NP PP) 0.005184
       |          | NP 0.18             |          |
-------+----------+---------------------+----------+-------------------------
In     |          |                     | Prep 0.4 | PP (Prep NP2) 0.072
-------+----------+---------------------+----------+-------------------------
Austin |          |                     |          | PropNoun 0.3
       |          |                     |          | NP2 0.18
-------+----------+---------------------+----------+-------------------------
Total probability
= 0.00031104 + 0.00031104
= 0.00062208
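The chart can be reproduced with a small probabilistic CKY routine that sums over parses. This is a sketch under stated assumptions: the lexical entries and the S rules come from the CNF changes and chart above, while the probabilities for VP -> Verb NP (0.3), NP -> NP PP (0.4) and PP -> Prep NP (1.0) are not stated in the solution and are backed out of the chart entries 0.027, 0.005184 and 0.072. Only the rules the chart uses are included.

```python
from collections import defaultdict

# Lexical rules after CNF conversion: word -> {symbol: probability}.
lexical = {
    "marry": {"Verb": 0.5, "VP": 0.15, "S": 0.06},
    "bob": {"PropNoun": 0.3, "NP": 0.18},
    "in": {"Prep": 0.4},
    "austin": {"PropNoun": 0.3, "NP": 0.18},
}
# Binary rules: (parent, left child, right child, probability).
binary = [
    ("S", "Verb", "NP", 0.12),
    ("S", "VP", "PP", 0.16),
    ("VP", "Verb", "NP", 0.3),  # assumption: 0.027 = p * 0.5 * 0.18
    ("NP", "NP", "PP", 0.4),    # assumption: 0.005184 = p * 0.18 * 0.072
    ("PP", "Prep", "NP", 1.0),  # assumption: 0.072 = p * 0.4 * 0.18
]


def inside_chart(words):
    """Fill chart[(start, end, symbol)] with the summed parse probability."""
    n = len(words)
    chart = defaultdict(float)
    for i, word in enumerate(words):
        for symbol, p in lexical[word.lower()].items():
            chart[(i, i + 1, symbol)] += p
    for width in range(2, n + 1):
        for start in range(n - width + 1):
            end = start + width
            for split in range(start + 1, end):
                for parent, left, right, p in binary:
                    chart[(start, end, parent)] += (
                        p * chart[(start, split, left)] * chart[(split, end, right)]
                    )
    return chart


chart = inside_chart("Marry Bob In Austin".split())
total = chart[(0, 4, "S")]
print(round(total, 8))
```

Summing rather than maximizing in the inner loop gives the total sentence probability over both parses; replacing `+=` with a max would give the probability of the single best parse instead.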
-------------------------------------------
Problem 5)
--------
What two words best characterize the presence of ambiguity in natural language?
Ubiquitous and explosive
--------
The following is a famous dialog from the disaster-movie spoof ``Airplane!'':
Rumack: You'd better tell the Captain we've got to land as soon as we can. This woman has to be gotten to a hospital.
Elaine Dickinson: A hospital? What is it?
Rumack: It's a big building with patients, but that's not important right
now.
Explain what specific type of ambiguity in language understanding
makes this humorous.
This is an instance of co-reference (anaphora) ambiguity: the pronoun "it"
actually refers to the medical problem that requires the woman to go to a
hospital, but the joke relies on the fact that it could also refer to the word
"hospital" itself.
--------
Do the same for this other famous ``Airplane!'' dialog:
Rumack: I won't deceive you, Mr. Striker. We're running out of time.
Ted Striker: Surely there must be something you can do.
Rumack: I'm doing everything I can... and stop calling me Shirley!
This is an instance of phonetic lexical ambiguity in speech recognition: the
words "surely" and "Shirley" are homophones, and the joke relies on interpreting
the first word as an instance of the second.
--------
What is ``smoothing'' and why is it critical to many areas of statistical language processing?
Smoothing is an adjustment to maximum likelihood parameter estimation that
assigns some probability mass to unseen events by discounting the probability
of seen events. This avoids assigning zero probability to rare events that are
absent from the training data but not impossible. A zero probability
estimate can leave a probabilistic model unable to interpret a
novel example containing an unseen event, which, by Zipf's law, is quite likely
in natural language.
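As a concrete illustration, the sketch below compares maximum likelihood estimates with add-one (Laplace) smoothing, which is only one simple smoothing scheme among many. The toy vocabulary and counts are hypothetical, as are the function names `mle` and `add_k`.

```python
from collections import Counter


def mle(counts, vocab):
    """Maximum likelihood estimate: relative frequency of each word."""
    total = sum(counts.values())
    return {w: counts[w] / total for w in vocab}


def add_k(counts, vocab, k=1.0):
    """Add-k smoothing: pretend every vocabulary word was seen k extra times."""
    total = sum(counts.values())
    return {w: (counts[w] + k) / (total + k * len(vocab)) for w in vocab}


# Hypothetical toy data: "dictator" is in the vocabulary but never observed.
vocab = ["people", "topple", "dictator"]
counts = Counter(["people", "people", "people", "topple"])

unsmoothed = mle(counts, vocab)
smoothed = add_k(counts, vocab)
print(unsmoothed)
print(smoothed)
```

The unseen word gets probability zero under MLE but a small positive probability after smoothing, paid for by lowering the estimates of the seen words.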
--------
What is the primary difference between a generative and a discriminative
probabilistic model?
A generative model specifies a stochastic procedure for generating the data
and therefore implicitly defines an entire joint distribution for the data,
whereas a discriminative model only models the conditional distribution of
the variables to be predicted given the other variables as evidence.
Generative models are more powerful but harder to estimate properly from
sparse data, making discriminative models better when the model is
intended to support a specific predictive task that can be trained for
directly, which requires estimating fewer parameters than needed to
specify the entire joint distribution.
--------
Why is dynamic programming so prevalent and critical in algorithms for natural
language processing?
Many NLP tasks require maximizing over a combinatorial number of alternatives
(e.g. all possible tag sequences or parse trees). Dynamic programming is
critical to efficiently solving these optimization problems by building
solutions to larger problems from solutions to smaller problems while avoiding
solving each subproblem more than once (i.e. storing the solution the first
time it is needed for future use).
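To make the combinatorial growth concrete, the sketch below counts the binary bracketings of an n-word sentence, using memoization so each subproblem is solved once. The function name `count_binary_trees` is illustrative, and counting unlabeled binary trees is a simplification chosen for this example rather than something taken from the question.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_binary_trees(n_words):
    """Number of distinct binary bracketings over n_words leaves."""
    if n_words == 1:
        return 1
    # Choose where the root splits the span; reuse cached counts for each side.
    return sum(
        count_binary_trees(k) * count_binary_trees(n_words - k)
        for k in range(1, n_words)
    )


for n in (4, 10, 20):
    print(n, count_binary_trees(n))
```

The counts grow too quickly to enumerate trees one by one, yet the cached recursion needs only one computation per span length. Chart-based algorithms such as Viterbi and CKY exploit the same structure.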
--------
What is the difference between maximum likelihood estimation (MLE) and maximum
a posteriori (MAP) training of statistical models?
MLE sets parameters to maximize the probability that the data D is produced by
the model M (argmax P(D | M)). MAP maximizes the probability of the model
given the data (argmax P(M | D)). By Bayes' theorem, P(M | D) is proportional
to P(D | M) P(M), so MAP equivalently maximizes the likelihood of the data
times a prior for the model. MLE is therefore MAP with a uniform prior across
all models.
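A coin-flip example makes the relationship concrete. This is a minimal sketch, assuming a Beta(alpha, beta) prior on the probability of heads; the example and the names `mle_heads` and `map_heads` are illustrative and not part of the original answer.

```python
def mle_heads(heads, tails):
    """MLE of P(heads): the observed relative frequency."""
    return heads / (heads + tails)


def map_heads(heads, tails, alpha, beta):
    """MAP estimate of P(heads) under a Beta(alpha, beta) prior.

    This is the mode of the Beta(heads + alpha, tails + beta) posterior,
    which is well defined when both posterior parameters are at least 1
    and their sum exceeds 2.
    """
    return (heads + alpha - 1) / (heads + tails + alpha + beta - 2)


heads, tails = 3, 0
mle_estimate = mle_heads(heads, tails)
map_uniform = map_heads(heads, tails, alpha=1, beta=1)   # uniform prior
map_informed = map_heads(heads, tails, alpha=2, beta=2)  # prior favoring 0.5
print(mle_estimate, map_uniform, map_informed)
```

With the uniform Beta(1, 1) prior the MAP estimate coincides with the MLE, while the Beta(2, 2) prior pulls the estimate away from the extreme value 1.0.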
--------
What is the easiest way to use semi-supervised learning to train a generative
model?
One can use EM to perform semi-supervised training of a generative model by
using the known labels of the supervised data to initialize the parameters of
the model and then, in each subsequent iteration, using these known labels as
well as the predicted labels on the unlabeled data to retrain the model in the
M step.
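The procedure can be sketched for a deliberately tiny generative model: a class prior plus one binary feature per class. Everything below is a hypothetical illustration, including the toy data, the add-one smoothing in the M step, and the function names; it uses soft (fractional) predicted labels for the unlabeled examples.

```python
CLASSES = (0, 1)


def m_step(weighted):
    """Re-estimate parameters from (x, {class: weight}) pairs with add-one smoothing."""
    prior, p_x1 = {}, {}
    total = sum(sum(w.values()) for _, w in weighted)
    for c in CLASSES:
        mass = sum(w[c] for _, w in weighted)
        ones = sum(w[c] for x, w in weighted if x == 1)
        prior[c] = (mass + 1) / (total + len(CLASSES))
        p_x1[c] = (ones + 1) / (mass + 2)
    return prior, p_x1


def e_step(x, prior, p_x1):
    """Posterior over classes for one unlabeled feature value x."""
    joint = {c: prior[c] * (p_x1[c] if x == 1 else 1 - p_x1[c]) for c in CLASSES}
    z = sum(joint.values())
    return {c: joint[c] / z for c in CLASSES}


def semi_supervised_em(labeled, unlabeled, iterations=10):
    # Known labels become fixed one-hot weights that are never re-estimated.
    hard = [(x, {c: float(c == label) for c in CLASSES}) for x, label in labeled]
    prior, p_x1 = m_step(hard)  # initialize from the labeled data only
    for _ in range(iterations):
        soft = [(x, e_step(x, prior, p_x1)) for x in unlabeled]  # E step
        prior, p_x1 = m_step(hard + soft)                        # M step
    return prior, p_x1


labeled = [(1, 1), (1, 1), (0, 0), (0, 0)]  # (feature, label) pairs
unlabeled = [1, 1, 1, 0]                    # features only
prior, p_x1 = semi_supervised_em(labeled, unlabeled)
print(prior, p_x1)
```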
--------
(Extra credit, 2pts)
Who was the leader of the team at IBM that built Watson, and who
was the UT PhD grad on his team?
Dave Ferrucci and James Fan
---------------
(Extra credit, 2pts) Who developed the first computer program for
parsing natural language and at what university?
Aravind Joshi, Univ. of Pennsylvania