UTCS Corporate Connection/Yahoo!: Web Analysis Algorithms ACES 2.402 Wednesday November 12 2008 2:00 p.m.

Contact Name: Jenna Whitney
Date: Nov 12, 2008 2:00pm - 3:00pm

Type of Talk: UTCS Corporate Connection/FoCS

Speaker/Affiliation: Kunal Punera and Srujana Merugu/Yahoo!

Date/Time: Wednesday November 12 2008 2:00 p.m.

Location: ACES 2.402

Host: UTCS FoCS

Talk Title: Web Analysis Algorithms
Talk Abstract:
Traditionally, web analysis algorithms have operated with web pages as
the atomic units of information. Link analysis approaches like PageRank
and HITS were initially formulated on a graph where the web pages served
as the nodes. Similarly, search engines produced rankings of web pages,
and contextual advertising systems matched ads to the full web page
content. However, over the years, as the Web has evolved in fascinating
ways, this macroscopic view of the Web has become increasingly untenable.
Web pages today are commonly a mixture of topics, and links often serve
purposes other than just conferring authority. Furthermore, there is
increasing interest in services that address user information needs using
entities and structured data extracted from web pages. In this talk we
will describe two projects at Yahoo! Research that tackle some of the
problems raised above.

The first project involves segmenting and labeling parts of web pages,
a task fundamental to all work that employs fine-grained models of web
pages. This task has become increasingly challenging over the years as
Web use has matured and web page construction has evolved to reflect
that. In this talk we will present our work on segmenting web pages into
visually and semantically cohesive parts. Our approach is based on
formulating an appropriate optimization problem on weighted graphs, where
the weights capture whether two regions of the web page should be placed
together or apart in the segmentation; we present a learning framework to
learn these weights from manually labeled data in a principled manner.

The second project deals with managing a large number of sophisticated
information extraction pipelines across different application domains at
web scale with minimum human involvement. Three key value propositions
will be described: extensibility - the ease of swapping extraction
operators in and out; explainability - the ability to track the
provenance of extraction results; and social feedback support - the
facility for gathering and incorporating community-provided feedback.

Speaker Bio:
Kunal Punera is a Research Scientist with the Search & Data Mining
group at Yahoo! Research. He joined Yahoo! after obtaining his PhD
from the University of Texas at Austin in 2007. His research primarily
focuses on statistical learning and data mining, often applied in the
context of large-scale information retrieval systems. At Yahoo! he works
on developing new features for the web search engine and learning from
users' interaction with it. Recently he has also been involved in
applying learning-based approaches to various problems in the area of
computational advertising.

Srujana Merugu is a Research Scientist with the Machine Learning group
at Yahoo! Research. She joined Yahoo! after obtaining her PhD from the
University of Texas at Austin in 2006. Her research primarily focuses on
statistical learning and data mining. At Yahoo! she primarily works on
algorithms for information extraction from web artifacts incorporating
social feedback, as well as scalable modeling of multi-type data.