Abstract: Exponential growth in the biomedical literature and the breakdown of subdisciplinary boundaries in the post-genomic era have created a growing demand for sophisticated computational tools for managing and assimilating biological texts. Substantial resources are being deployed in the area, including: community-built, actively curated ontologies; diverse, manually annotated corpora; organized competitions; and increased research funding. In this talk, I will characterize biomedical language and texts, describe some of the unique resources available and the particular needs of the biomedical community, and then discuss some recent results from my laboratory.
About the Speaker: Larry Hunter is the director of the Center for Computational Pharmacology and the Computational Bioscience Program at the University of Colorado Health Sciences Center. He is an Associate Professor of Pharmacology and Preventive Medicine & Biometrics at the University of Colorado School of Medicine, an Associate Professor of Computer Science at the University of Colorado, Boulder, and an Associate Professor of Biology at the University of Colorado, Denver. He is a founder of the International Society for Computational Biology (ISCB) and of Molecular Mining Corporation.
Abstract: A novel and insightful view of a recurring problem in natural language processing will be presented, namely the problem of estimating a probability mass function (pmf) for a discrete random variable from a small sample. Formally, a pmf will be deemed admissible as an estimate if it assigns a higher likelihood to the observed value of a sufficient statistic than to any other value possible for the same sample size. The standard maximum likelihood estimate is trivially admissible by this definition, but so are many other pmfs. It will be shown that the principled selection of an estimate from this admissible family via criteria such as minimum divergence leads to inherently smooth estimates that make no prior assumptions about the unknown probability while still providing a way to incorporate prior domain knowledge when available. Widely prevalent practices such as discounting the probability of seen events, and ad hoc procedures such as back-off estimates of conditional pmfs, will be shown to be natural consequences of this viewpoint. Some empirical results in statistical language modeling will be presented to demonstrate the computational feasibility of the proposed methods.
About the Speaker: Sanjeev Khudanpur is an Assistant Professor in the Department of Electrical & Computer Engineering and a member of the Center for Language and Speech Processing at Johns Hopkins University. He obtained a B.Tech. from the Indian Institute of Technology, Bombay, in 1988, and a Ph.D. from the University of Maryland, College Park, in 1997, both in Electrical Engineering. His research is concerned with the application of information-theoretic and statistical methods to problems in human language technology, including automatic speech recognition, machine translation and information retrieval, and he is particularly interested in maximum entropy and related techniques for model estimation from sparse data.
Abstract: The Text REtrieval Conference (TREC) is a workshop series that develops the infrastructure for large-scale evaluation of retrieval technology. The TREC question answering track was introduced in 1999 to focus attention on the problem of returning exactly the answer in response to a question. While the initial question answering tracks focused on factoid questions such as "Where is the Taj Mahal?", later tracks have incorporated more difficult question types such as list questions ("What actors have played Tevye in 'Fiddler on the Roof'?") and definition/biographical questions ("What is a golden parachute?" or "Who is Vlad the Impaler?"). The question answering track was the first large-scale evaluation of open-domain question answering systems, and it has brought the benefits of the test collection evaluation used in other parts of TREC to bear on the question answering task. The track established a common task for the retrieval and natural language processing research communities, creating a renaissance in question answering research. This wave of research has driven significant progress in automatic natural language understanding as researchers have successfully incorporated sophisticated language processing into their question answering systems. This talk will review the history of the TREC QA track with a focus on the role appropriate evaluation methodologies play in fostering new technology.
About the Speaker: Ellen Voorhees is a Group Leader in the Information Access Division of the U.S. National Institute of Standards and Technology (NIST) where her primary responsibility to manage the Text REtrieval Conference (TREC) project. Her research interests include information retrieval and natural language processing, especially developing evaluation schemes to measure system effectiveness.