Building a Persistent Workforce on Mechanical Turk for Multilingual Data Collection (2011)

David L. Chen and William B. Dolan

Traditional methods of collecting translation and paraphrase data are prohibitively expensive, making the construction of large, new corpora difficult. While crowdsourcing offers a cheap alternative, quality control and scalability can become problematic. We discuss a novel annotation task that uses videos as the stimulus, which discourages cheating. In addition, our approach requires only monolingual speakers, thus making it easier to scale since more workers are qualified to contribute. Finally, we employ a multi-tiered payment system that helps retain good workers over the long term, resulting in a persistent, high-quality workforce. We present the results of one of the largest linguistic data collection efforts to date using Mechanical Turk, yielding 85K English sentences and more than 1K sentences for each of a dozen more languages.

View:

PDF

Citation:

In Proceedings of The 3rd Human Computation Workshop (HCOMP 2011), August 2011.

Presentation:

Slides (PPT)

People

David Chen

Ph.D. Alumni

cooldc [at] hotmail com

Areas of Interest

Natural Language Processing

Labs

Machine Learning