UTCS Colloquium: "Building Knowledge Bases from the Web," Rajeev Rastogi/Yahoo, ACES 6.304

Contact Name: Jenna Whitney
Date: Dec 3, 2010 11:00am - 12:00pm

There is a sign-up schedule for this event that can be found at
http://www.cs.utexas.edu/department/webevent/utcs/events/cgi/list_events.cgi

Type of Talk: UTCS Colloquium

Speaker/Affiliation: Rajeev Rastogi/Yahoo

Date/Time: Friday, December 3, 2010, 11:00 a.m.

Location: ACES 6.304

Host: Inderjit Dhillon

Talk Title: Building Knowledge Bases from the Web

Talk Abstract: The web is a vast repository of human knowledge. Extracting structured data from web pages can enable applications like comparison shopping, and lead to improved ranking and rendering of search results. In this talk, I will describe two efforts at Yahoo! Labs to extract records from pages at web scale. The first is a wrapper induction system that handles end-to-end extraction tasks, from clustering web pages to learning XPath extraction rules to relearning rules when sites change. The system has been deployed in production within Yahoo! to extract more than 200 million records from ~200 web sites. The second effort exploits content redundancy on the web to automatically extract records without human supervision. Starting with a seed database, we determine values in the pages of each new site that match attribute values in the seed records. We devise a new notion of similarity for matching templatized attribute content, and an Apriori-style algorithm that exploits templatized page structure to prune spurious attribute matches.

Speaker Bio: Rajeev Rastogi is the Vice President of Yahoo! Labs Bangalore, where he directs basic and applied research in the areas of web search, advertising, and cloud computing. Previously, Rajeev was at Bell Labs, where he was a Bell Labs Fellow and the founding Director of the Bell Labs Research Center in Bangalore. Rajeev is active in the fields of databases, data mining, and networking, and has served on the program committees of several conferences in these areas. He currently serves on the editorial board of the CACM, and has been an Associate Editor for IEEE Transactions on Knowledge and Data Engineering in the past. He has published over 125 papers, and filed over 70 patents, of which 40 have been issued. Rajeev received his B.Tech degree from IIT Bombay, and a PhD degree in Computer Science from the University of Texas, Austin.