Yahoo Data Mining Series: Professor Charles Elkan/ University of California: "Accounting for Burstiness of Words in Text Mining," ACES 2.302, Wednesday, September 23, 11:00 am
Type of Talk: Yahoo Data Mining Series
Speaker/ Affiliation: Professor Charles Elkan/ University of California
Date/Time: Wednesday, September 23, 2009/ 11:00 am
Location: ACES 2.302
Host: Joydeep Ghosh
Talk Title: "Accounting for Burstiness of Words in Text Mining"
Talk Abstract:
A fundamental property of language is that if a word is used once in a document, it is likely to be used again. Statistical models of documents applied in text mining must take this property into account in order to be accurate. In this talk, I will describe how to model burstiness using a probability distribution called the Dirichlet compound multinomial (DCM). In particular, I will present a new topic model based on DCM distributions. The central advantage of topic models is that they allow documents to concern multiple themes, unlike standard clustering methods that assume each document concerns a single theme. On both text and nontext datasets, the new topic model achieves better held-out likelihood than standard latent Dirichlet allocation (LDA).
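As a rough illustration of the burstiness property described in the abstract, the sketch below compares a plain multinomial with the standard DCM (Polya) probability mass function on two toy documents. The four-word vocabulary, the count vectors, and the parameter values (uniform probabilities, alpha = 0.1 per word) are assumptions chosen for illustration and are not taken from the talk; the function names are likewise hypothetical.

```python
from math import lgamma, log, exp

def log_multinomial_coef(counts):
    """Log of n! / prod(x_w!) for a vector of word counts."""
    n = sum(counts)
    return lgamma(n + 1) - sum(lgamma(x + 1) for x in counts)

def multinomial_log_prob(counts, probs):
    """Log-probability of the counts under a multinomial with word probabilities probs."""
    return log_multinomial_coef(counts) + sum(
        x * log(p) for x, p in zip(counts, probs) if x > 0
    )

def dcm_log_prob(counts, alphas):
    """Log-probability of the counts under a Dirichlet compound multinomial.

    Standard DCM form:
    coef * Gamma(A) / Gamma(A + n) * prod_w Gamma(x_w + a_w) / Gamma(a_w),
    where A = sum of alphas and n = sum of counts.
    """
    n = sum(counts)
    a_total = sum(alphas)
    return (
        log_multinomial_coef(counts)
        + lgamma(a_total)
        - lgamma(a_total + n)
        + sum(lgamma(x + a) - lgamma(a) for x, a in zip(counts, alphas))
    )

# Toy documents of length 4 over a 4-word vocabulary (illustrative assumption).
bursty = [4, 0, 0, 0]   # one word repeated four times
spread = [1, 1, 1, 1]   # four different words, once each

uniform_probs = [0.25] * 4   # assumed multinomial parameters
small_alphas = [0.1] * 4     # assumed DCM parameters; small alphas favor repetition

for name, doc in [("bursty", bursty), ("spread", spread)]:
    print(
        name,
        "multinomial:", exp(multinomial_log_prob(doc, uniform_probs)),
        "DCM:", exp(dcm_log_prob(doc, small_alphas)),
    )
```

With these assumed parameters the multinomial assigns more probability to the spread-out document, while the DCM assigns more to the document that repeats a single word, which is the behavior the abstract calls burstiness.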
Speaker Bio:
Charles Elkan is a professor in the Department of Computer Science and Engineering at the University of California, San Diego. In 2005/06 he was on sabbatical at MIT, and in 1998/99 he was visiting associate professor at Harvard. He is known for his research in machine learning, data mining and computational biology. The MEME algorithm he developed with his Ph.D. student Tim Bailey has been used in over 1,000 publications in biology.