Yahoo Data Mining Series: Professor Charles Elkan/ University of California: "Accounting for Burstiness of Words in Text Mining," ACES 2.302, Wednesday, September 23, 11:00 am

Contact Name: 
Jenna Whitney
Sep 23, 2009 11:00am - 12:00pm

Type of Talk: Yahoo Data Mining Series

Speaker/ Affiliation: Professor Charles Elkan/ University of California

Date/Time: Wednesday, September 23, 2009/ 11:00 am

Location: ACES 2.302

Host: Joydeep Ghosh

Talk Title: "Accounting for Burstiness of Words in Text Mining"

Talk Abstract:
A fundamental property of language is that if a word is used once in a
document, it is likely to be used again. Statistical models of documents
applied in text mining must take this property into account in order to be
accurate. In this talk, I will describe how to model burstiness using a
probability distribution called the Dirichlet compound multinomial (DCM).
In particular, I will present a new topic model based on DCM distributions.
The central advantage of topic models is that they allow documents to
concern multiple themes, unlike standard clustering methods that assume
each document concerns a single theme. On both text and nontext datasets,
the new topic model achieves better held-out likelihood than standard
latent Dirichlet allocation (LDA).
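
As a rough illustration of the burstiness idea in the abstract, the sketch below compares the standard DCM (Polya) probability of a word-count vector with an ordinary multinomial. It is not the speaker's topic model. The function names, the four-word toy vocabulary, and the parameter values (alpha = 0.1 per word, uniform word probabilities) are assumptions chosen only for the example.

```python
from math import exp, lgamma, log


def dcm_log_prob(counts, alpha):
    """Log-probability of a count vector under a Dirichlet compound multinomial.

    Standard DCM formula:
      n! / prod(x_w!) * Gamma(A) / Gamma(n + A) * prod(Gamma(x_w + a_w) / Gamma(a_w))
    where n = sum of counts and A = sum of alpha.
    """
    n = sum(counts)
    a_total = sum(alpha)
    log_coef = lgamma(n + 1) - sum(lgamma(x + 1) for x in counts)
    log_norm = lgamma(a_total) - lgamma(n + a_total)
    log_terms = sum(lgamma(x + a) - lgamma(a) for x, a in zip(counts, alpha))
    return log_coef + log_norm + log_terms


def multinomial_log_prob(counts, probs):
    """Log-probability of a count vector under a plain multinomial."""
    n = sum(counts)
    log_coef = lgamma(n + 1) - sum(lgamma(x + 1) for x in counts)
    return log_coef + sum(x * log(p) for x, p in zip(counts, probs) if x > 0)


# Toy four-word vocabulary (hypothetical values, for illustration only).
alpha = [0.1] * 4        # small alpha values make the DCM favor repeated words
probs = [0.25] * 4       # uniform multinomial with the same mean word frequencies

bursty_doc = [4, 0, 0, 0]  # one word repeated four times
spread_doc = [1, 1, 1, 1]  # four different words, once each

for name, doc in [("bursty", bursty_doc), ("spread", spread_doc)]:
    print(name,
          "DCM:", exp(dcm_log_prob(doc, alpha)),
          "multinomial:", exp(multinomial_log_prob(doc, probs)))
```

With these assumed values, the DCM gives the repeated-word document a higher probability than the spread-out one, while the multinomial ranks them the other way around.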

Speaker Bio:
Charles Elkan is a professor in the Department of Computer Science and
Engineering at the University of California, San Diego. In 2005/06 he was
on sabbatical at MIT, and in 1998/99 he was visiting associate professor at
Harvard. He is known for his research in machine learning, data mining and
computational biology. The MEME algorithm he developed with his Ph.D.
student Tim Bailey has been used in over 1,000 publications in biology.