UTCS Colloquia - Data Mining - AnHai Doan, University of Wisconsin-Madison, "Toward Hands-Off Crowdsourcing: Crowdsourced Entity Matching for the Masses"

Contact Name: Inderjit Dhillon
GDC 4.302
Oct 25, 2013 2:00pm - 3:00pm

Signup Schedule: http://apps.cs.utexas.edu/talkschedules/cgi/list_events.cgi

Talk Audience: UTCS Faculty, Grads, Undergrads, Other Interested Parties

Host: Inderjit Dhillon


Talk Abstract: Entity matching (EM) finds data records that refer to the same real-world entity. Recent work has applied crowdsourcing to EM and has clearly established the promise of this approach. This work, however, is limited in that it crowdsources only parts of the EM workflow, requiring a developer who knows how to code to execute the remaining parts. Consequently, this work does not scale to the growing need for EM at enterprises and crowdsourcing startups, and cannot handle scenarios where ordinary users (i.e., the masses) want to leverage crowdsourcing to match entities.

To address these problems, we propose the notion of hands-off crowdsourcing (HOC), which crowdsources the entire workflow of a task, thus requiring no developers. We show how HOC can represent a next logical direction for crowdsourcing research, scale up EM at enterprises and crowdsourcing startups, and open up crowdsourcing for the masses. We describe Corleone, a HOC solution for EM. We show how Corleone uses the crowd to generate blocking rules, applies active learning to learn matchers, estimates accuracy given severe skew, and identifies difficult-to-match pairs to which Corleone can apply more complex matchers. Finally, we discuss the implications of our work for executing crowdsourced RDBMS joins, cleaning learning models, and soliciting complex information types from crowd workers. If time permits, I will also touch on other ongoing crowdsourcing efforts at UW and WalmartLabs.

Speaker Bio: AnHai Doan is an Associate Professor in the database group at the University of Wisconsin-Madison. His current interests include crowdsourcing, knowledge bases, data integration, and information extraction. He received the ACM Doctoral Dissertation Award in 2003 and a Sloan Fellowship in 2007. AnHai was Chief Scientist of Kosmix, a social media startup acquired by Walmart in 2011. He currently also works as Chief Scientist of WalmartLabs, a research and development lab devoted to analyzing and integrating data for e-commerce. AnHai is a co-author of “Principles of Data Integration” (with Alon Halevy and Zack Ives), a textbook published by Morgan Kaufmann in 2012.