UTCS AI Colloquia - Chris Callison-Burch, Johns Hopkins University, "Large-scale paraphrasing for natural language understanding and generation"

Contact Name: Ray Mooney
ACE 2.302
Dec 7, 2012 11:00am - 12:00pm

Signup Schedule: http://apps.cs.utexas.edu/talkschedules/cgi/list_events.cgi

Talk Audience: UTCS Faculty, Grads, Undergrads, Other Interested Parties

Host: Ray Mooney

Talk Abstract: I will present my method for learning paraphrases - pairs of English expressions with equivalent meaning - from bilingual parallel corpora, which are more commonly used to train statistical machine translation systems. My method pairs English phrases like (thrown into jail, imprisoned) when they share an aligned foreign phrase like festgenommen. Because bitexts are large and because a phrase can be aligned to many different foreign phrases (including phrases in multiple foreign languages), the method extracts a diverse set of paraphrases. For thrown into jail, we not only learn imprisoned, but also arrested, detained, incarcerated, jailed, locked up, taken into custody, and thrown into prison, along with a set of incorrect or noisy paraphrases. I'll show a number of methods for filtering out the poor paraphrases: defining a paraphrase probability calculated from translation model probabilities, and re-ranking the candidate paraphrases using monolingual distributional similarity measures.
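
As a rough illustration of the pivoting idea in the abstract, the sketch below scores candidate paraphrases by summing, over every shared foreign phrase f, the product p(f | e1) * p(e2 | f). This marginalization over pivot phrases is one common way to build a paraphrase probability from translation model probabilities; it is an assumption here, not necessarily the exact formulation presented in the talk. The probability tables, the second German phrase (inhaftiert), and the function names are invented for the example.

```python
from collections import defaultdict

# Toy translation probabilities, invented purely for illustration.
# P_F_GIVEN_E[e][f] = p(f | e): foreign phrase f given English phrase e.
P_F_GIVEN_E = {
    "thrown into jail": {"festgenommen": 0.6, "inhaftiert": 0.4},
}
# P_E_GIVEN_F[f][e] = p(e | f): English phrase e given foreign phrase f.
P_E_GIVEN_F = {
    "festgenommen": {"thrown into jail": 0.2, "arrested": 0.5, "imprisoned": 0.3},
    "inhaftiert": {"thrown into jail": 0.25, "imprisoned": 0.5, "jailed": 0.25},
}


def paraphrase_probabilities(phrase, p_f_given_e, p_e_given_f):
    """Score candidates e2 for `phrase` as sum over f of p(f | phrase) * p(e2 | f)."""
    scores = defaultdict(float)
    for foreign, p_f in p_f_given_e.get(phrase, {}).items():
        for candidate, p_e in p_e_given_f.get(foreign, {}).items():
            if candidate != phrase:  # a phrase is not its own paraphrase
                scores[candidate] += p_f * p_e
    return dict(scores)


def rank_paraphrases(phrase, p_f_given_e, p_e_given_f):
    """Return (candidate, score) pairs sorted from most to least probable."""
    scores = paraphrase_probabilities(phrase, p_f_given_e, p_e_given_f)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for candidate, score in rank_paraphrases(
        "thrown into jail", P_F_GIVEN_E, P_E_GIVEN_F
    ):
        print(f"{candidate}\t{score:.2f}")
```

With real phrase tables, low-scoring candidates would be the noisy paraphrases that the filtering and re-ranking steps aim to remove.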

In addition to lexical and phrasal paraphrases, I'll show how the bilingual pivoting method can be extended to learn meaning-preserving syntactic transformations like the English possessive rule or dative shift. I'll describe a way of using synchronous context-free grammars (SCFGs) to represent these rules. This formalism allows us to re-use much of the machinery from statistical machine translation to perform sentential paraphrasing. We can adapt our "paraphrase grammars" to do monolingual text-to-text generation tasks like sentence compression or simplification.
I'll also briefly sketch future directions for adding a semantics to the paraphrases, which my lab will be doing for the upcoming DARPA DEFT program.
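
To make the synchronous-grammar representation more concrete, here is a minimal sketch of how a single paraphrase rule, such as the English possessive rule, might be stored as a pair of right-hand sides whose nonterminals are co-indexed. The class name, the nonterminal labels (NN_1, NP_2), and the realize helper are hypothetical choices for this example, not the representation used by the speaker's software.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class ParaphraseRule:
    """One synchronous rule: both sides share the same co-indexed nonterminals."""
    lhs: str                  # left-hand-side category, e.g. "NP"
    source: Tuple[str, ...]   # one English realization
    target: Tuple[str, ...]   # its paraphrase


def realize(side: Tuple[str, ...], fillers: Dict[str, str]) -> str:
    """Substitute each nonterminal with its filler; terminals pass through."""
    return " ".join(fillers.get(symbol, symbol) for symbol in side)


# Hypothetical possessive rule: NP -> < the NN_1 of NP_2 , NP_2 's NN_1 >
POSSESSIVE = ParaphraseRule(
    lhs="NP",
    source=("the", "NN_1", "of", "NP_2"),
    target=("NP_2", "'s", "NN_1"),
)

if __name__ == "__main__":
    fillers = {"NN_1": "capital", "NP_2": "France"}
    print(realize(POSSESSIVE.source, fillers))
    print(realize(POSSESSIVE.target, fillers))
```

Because both sides draw on the same fillers, whatever is derived under NN_1 and NP_2 on one side is reordered, not rewritten, on the other, which is the property that lets machine translation decoders be reused for sentential paraphrasing.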

Speaker Bio: Chris Callison-Burch is an Associate Research Professor in the Computer Science Department at Johns Hopkins University, where he has built a research group within the Center for Language and Speech Processing (CLSP). He has accepted a tenure-track faculty job at the University of Pennsylvania starting in September 2013. He received his PhD from the University of Edinburgh's School of Informatics and his bachelor's degree from Stanford University's Symbolic Systems Program. His research focuses on statistical machine translation, crowdsourcing, and broad-coverage semantics via paraphrasing. He has contributed to the research community by releasing open-source software like Moses and Joshua, and by organizing the shared tasks for the annual Workshop on Statistical Machine Translation (WMT). He is the Chair of the North American chapter of the Association for Computational Linguistics (NAACL) and serves on the editorial boards of Computational Linguistics and the Transactions of the ACL.