Problem 1)

P(people|<s>)      = 0.9(1/5) + 0.1(3/20) = 0.195
P(topple|people)   = 0.9(1/3) + 0.1(3/20) = 0.315
P(dictator|topple) = 0.9(1/3) + 0.1(3/20) = 0.315
P(</s>|dictator)   = 0.9(1/3) + 0.1(5/20) = 0.325

P(<s> people topple dictator </s>) = 0.195 * 0.315 * 0.315 * 0.325 = 0.006288

----------------------
Problem 2)

v1(1) = a01 * b1(A) = 0.2 * 0.6 = 0.12
v1(2) = a02 * b2(A) = 0.8 * 0.4 = 0.32
bt1(1) = 0
bt1(2) = 0

v2(1) = max(v1(1)*a11, v1(2)*a21) * b1(B) = max(0.12*0.2, 0.32*0.2) * 0.4 = 0.0256
v2(2) = max(v1(1)*a12, v1(2)*a22) * b2(B) = max(0.12*0.6, 0.32*0.6) * 0.6 = 0.1152
bt2(1) = argmax(0.12*0.2, 0.32*0.2) = 2
bt2(2) = argmax(0.12*0.6, 0.32*0.6) = 2

v3(sF) = max(v2(1)*a1F, v2(2)*a2F) = max(0.0256*0.2, 0.1152*0.2) = 0.02304
bt3(sF) = argmax(0.0256*0.2, 0.1152*0.2) = 2

Most likely state sequence for AB = bt1(bt2(bt3(sF))), bt2(bt3(sF)), bt3(sF), SF
                                  = S0, S2, S2, SF

-----------------------
Problem 3)

(S (NP (N GUARD)) (VP (VP (V TESTS)) (PP (PREP LIKE) (NP (N GOLD)))))
;; A guard takes a test like gold would take a test.

(S (NP (ADJ GUARD) (NP (N TESTS))) (VP (V LIKE) (NP (N GOLD))))
;; Tests of a particular type, called guard tests, are fond of gold.

(S (VP (VP (V GUARD) (NP (N TESTS))) (PP (PREP LIKE) (NP (N GOLD))))) ; this is the "correct" parse
;; A command to protect tests as if they were gold.

(S (VP (V GUARD) (NP (NP (N TESTS)) (PP (PREP LIKE) (NP (N GOLD))))))
;; A command to protect only those tests which are similar to gold.
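As a cross-check on Problem 2, the Viterbi recurrence can be sketched in Python. The start, transition, final-state, and emission probabilities below are read off the worked arithmetic in the solution (for example a12 = 0.6 and b2(B) = 0.6). The dictionary names and the viterbi function are illustrative assumptions, not part of the original answer key.

```python
# Model parameters taken from the Problem 2 arithmetic (states 1 and 2).
START = {1: 0.2, 2: 0.8}                       # a01, a02
TRANS = {1: {1: 0.2, 2: 0.6},                  # a11, a12
         2: {1: 0.2, 2: 0.6}}                  # a21, a22
END = {1: 0.2, 2: 0.2}                         # a1F, a2F
EMIT = {1: {"A": 0.6, "B": 0.4},               # b1(A), b1(B)
        2: {"A": 0.4, "B": 0.6}}               # b2(A), b2(B)


def viterbi(observations):
    """Return (best state path, its probability) for an observation string."""
    states = list(START)
    # Initialization: v1(s) = a0s * bs(o1); backpointers point to start state 0.
    scores = [{s: START[s] * EMIT[s][observations[0]] for s in states}]
    backpointers = [{s: 0 for s in states}]

    # Recursion: vt(s) = max_r vt-1(r) * a_rs * bs(ot).
    for symbol in observations[1:]:
        previous = scores[-1]
        current, pointers = {}, {}
        for s in states:
            best_prev = max(states, key=lambda r: previous[r] * TRANS[r][s])
            pointers[s] = best_prev
            current[s] = previous[best_prev] * TRANS[best_prev][s] * EMIT[s][symbol]
        scores.append(current)
        backpointers.append(pointers)

    # Termination: transition into the final state sF.
    last = max(states, key=lambda r: scores[-1][r] * END[r])
    probability = scores[-1][last] * END[last]

    # Follow backpointers from the last state to recover the path.
    path = [last]
    for pointers in reversed(backpointers[1:]):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, probability


if __name__ == "__main__":
    print(viterbi("AB"))
```

For the observation sequence AB this should give the path S2, S2 with probability 0.02304 (up to floating-point rounding), matching the hand computation.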
----------------------------------
Problem 4)

CNF changes:

S -> VP:
  S -> KISS      0.4 x 0.3 x 0.5 = 0.06
  S -> MARRY     0.4 x 0.3 x 0.5 = 0.06
  S -> Verb NP   0.4 x 0.3 = 0.12
  S -> VP PP     0.4 x 0.4 = 0.16

NP -> PropNoun:
  NP -> DALLAS   0.6 x 0.2 = 0.12
  NP -> MARY     0.6 x 0.2 = 0.12
  NP -> BOB      0.6 x 0.3 = 0.18
  NP -> AUSTIN   0.6 x 0.3 = 0.18

VP -> Verb:
  VP -> KISS     0.3 x 0.5 = 0.15
  VP -> MARRY    0.3 x 0.5 = 0.15

Marry       | Bob                  | In        | Austin
------------+----------------------+-----------+-------------------------
Verb 0.5    | S (Verb NP) 0.0108   |           | S (Verb NP1) 0.00031104
S 0.06      | VP2 (Verb NP) 0.027  |           | S2 (VP2 PP) 0.00031104
VP 0.15     |                      |           |
------------+----------------------+-----------+-------------------------
            | PropNoun 0.3         |           | NP1 (NP PP) 0.005184
            | NP 0.18              |           |
------------+----------------------+-----------+-------------------------
            |                      | Prep 0.4  | PP (Prep NP2) 0.072
------------+----------------------+-----------+-------------------------
            |                      |           | PropNoun 0.3
            |                      |           | NP2 0.18
------------+----------------------+-----------+-------------------------

Total probability = 0.00031104 + 0.00031104 = 0.00062208

-------------------------------------------
Problem 5)

--------
What two words best characterize the presence of ambiguity in natural language?

Ubiquitous and explosive

--------
The following is a famous dialog from the disaster-movie spoof "Airplane!":

Rumack: You'd better tell the Captain we've got to land as soon as we can. This woman has to be gotten to a hospital.
Elaine Dickinson: A hospital? What is it?
Rumack: It's a big building with patients, but that's not important right now.

Explain what specific type of ambiguity in language understanding makes this humorous.

This is an instance of co-reference (anaphora) ambiguity. The pronoun "it" actually refers to the problem that requires the woman to go to a hospital, but the joke relies on the fact that it could also refer to the word "hospital" itself.

--------
Do the same for this other famous "Airplane!" dialog:

Rumack: I won't deceive you, Mr. Striker. We're running out of time.
Ted Striker: Surely there must be something you can do.
Rumack: I'm doing everything I can... and stop calling me Shirley!

This is an instance of phonetic lexical ambiguity in speech recognition. The words "surely" and "Shirley" are homophones, and the joke relies on interpreting the first word as an instance of the second.

--------
What is "smoothing" and why is it critical to many areas of statistical language processing?

Smoothing is an adjustment to maximum likelihood parameter estimation that assigns some probability mass to unseen events by discounting the probability of seen events. This avoids assigning zero probability to rare events that were not seen in training even though they are not impossible. A zero probability estimate can leave a probabilistic model unable to interpret a novel example containing an unseen event, which, due to Zipf's law, is quite likely in natural language.

--------
What is the primary difference between a generative and a discriminative probabilistic model?

A generative model specifies a stochastic procedure for generating the data and therefore implicitly defines an entire joint distribution for the data, whereas a discriminative model only models the conditional distribution of the variables to be predicted given the other variables as evidence. Generative models are more powerful but harder to estimate properly from sparse data. Discriminative models are therefore better when the model is intended to support a specific predictive task that can be trained for directly, since this requires estimating fewer parameters than are needed to specify the entire joint distribution.

--------
Why is dynamic programming so prevalent and critical in algorithms for natural language processing?

Many NLP tasks require maximizing over a combinatorial number of alternatives (e.g. all possible tag sequences or parse trees).
Dynamic programming is critical to solving these optimization problems efficiently: it builds solutions to larger problems from solutions to smaller problems while avoiding solving each subproblem more than once (i.e. it stores each solution the first time it is needed, for future use).

--------
What is the difference between maximum likelihood estimation (MLE) and maximum a posteriori (MAP) training of statistical models?

MLE sets parameters to maximize the probability that the data D is produced by the model M (argmax P(D | M)). MAP maximizes the probability of the model given the data (argmax P(M | D)). By Bayes' theorem, MAP equivalently maximizes the likelihood of the data times a prior for the model (argmax P(D | M) P(M)); therefore MLE is MAP with a uniform prior across all models.

--------
What is the easiest way to use semi-supervised learning to train a generative model?

One can use EM to perform semi-supervised training of a generative model by using the known labels of the supervised data to initialize the parameters of the model and then, in each subsequent iteration, using these known labels as well as the predicted labels on the unlabeled data to retrain the model in the M step.

--------
(Extra credit, 2pts) Who was the leader of the team at IBM that built Watson, and who was the UT PhD grad who was on his team?

Dave Ferrucci and James Fan

---------------
(Extra credit, 2pts) Who developed the first computer program for parsing natural language and at what university?

Aravind Joshi, Univ of Pennsylvania
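
--------
As a cross-check on the Problem 4 total, the sentence probability can be recomputed with a small inside-probability (summing CKY) sketch. The lexical and S-rule probabilities come from the CNF rules listed in the solution. The remaining rule probabilities (VP -> Verb NP 0.3, VP -> VP PP 0.4, NP -> NP PP 0.4, PP -> Prep NP 1.0) are inferred from the products and chart entries and should be treated as assumptions, as should all Python names used here.

```python
# Lexical rules after CNF conversion: word -> {nonterminal: probability}.
LEXICON = {
    "marry": {"Verb": 0.5, "VP": 0.15, "S": 0.06},
    "bob": {"PropNoun": 0.3, "NP": 0.18},
    "in": {"Prep": 0.4},
    "austin": {"PropNoun": 0.3, "NP": 0.18},
}

# Binary rules as (parent, left child, right child, probability).
# The last four probabilities are inferred from the chart arithmetic.
BINARY_RULES = [
    ("S", "Verb", "NP", 0.12),
    ("S", "VP", "PP", 0.16),
    ("VP", "Verb", "NP", 0.3),
    ("VP", "VP", "PP", 0.4),
    ("NP", "NP", "PP", 0.4),
    ("PP", "Prep", "NP", 1.0),
]


def inside_probability(words, start="S"):
    """Sum the probabilities of all parses of `words` rooted in `start`."""
    n = len(words)
    chart = {}  # (i, j, symbol) -> total probability of symbol over words[i:j]

    # Fill the diagonal from the lexicon.
    for i, word in enumerate(words):
        for symbol, prob in LEXICON[word.lower()].items():
            chart[(i, i + 1, symbol)] = prob

    # Combine adjacent spans bottom-up, shortest spans first.
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):
                for parent, left, right, prob in BINARY_RULES:
                    left_p = chart.get((i, k, left), 0.0)
                    right_p = chart.get((k, j, right), 0.0)
                    if left_p and right_p:
                        key = (i, j, parent)
                        chart[key] = chart.get(key, 0.0) + prob * left_p * right_p

    return chart.get((0, n, start), 0.0)


if __name__ == "__main__":
    print(inside_probability("Marry Bob in Austin".split()))
```

Under these assumptions the two S derivations each contribute 0.00031104, so the function should return approximately 0.00062208 for "Marry Bob in Austin".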