Collecting Multilingual Parallel Video Descriptions Using Mechanical Turk

David L. Chen, Department of Computer Science, University of Texas at Austin
William B. Dolan, Natural Language Processing Group, Microsoft Research

Description of the project

Traditional methods of collecting translation and paraphrase data can be prohibitively expensive, making construction of large, new corpora difficult. While crowdsourcing offers a cheap alternative, quality control and scalability can become problematic. In this project we introduce a novel annotation task that uses short video clips (usually less than 10 seconds) as the stimulus to elicit parallel linguistic responses from the annotators. Descriptions of the same video in the same language can then be used as paraphrases of each other while descriptions in different languages can be used as translations of each other.
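As a minimal sketch of how such descriptions could be turned into paraphrase and translation pairs, the Python below groups descriptions by video and language. The function name `build_pairs`, the `(video_id, language, description)` record layout, the clip identifiers, and the sample sentences are all assumptions made for illustration; they do not reflect the actual file format of the corpus.

```python
from collections import defaultdict
from itertools import combinations, product


def build_pairs(records):
    """Group descriptions by video, then pair them up.

    records: iterable of (video_id, language, description) tuples
    (a hypothetical layout, not the corpus's real format).
    """
    by_video = defaultdict(lambda: defaultdict(list))
    for video_id, language, text in records:
        by_video[video_id][language].append(text)

    paraphrases = []   # (language, text_a, text_b)
    translations = []  # (lang_a, text_a, lang_b, text_b)
    for by_lang in by_video.values():
        # Same video, same language: every pair is a paraphrase candidate.
        for language, texts in by_lang.items():
            for a, b in combinations(texts, 2):
                paraphrases.append((language, a, b))
        # Same video, different languages: every cross pair is a translation candidate.
        for (lang_a, texts_a), (lang_b, texts_b) in combinations(sorted(by_lang.items()), 2):
            for a, b in product(texts_a, texts_b):
                translations.append((lang_a, a, lang_b, b))
    return paraphrases, translations


# Invented sample records, for illustration only.
records = [
    ("clip_001", "English", "A man is slicing a cucumber."),
    ("clip_001", "English", "Someone cuts a vegetable."),
    ("clip_001", "German", "Ein Mann schneidet eine Gurke."),
    ("clip_002", "English", "A dog is running."),
]
paraphrases, translations = build_pairs(records)
```

Note that the number of candidate pairs grows quadratically with the number of descriptions per video, so a real pipeline would probably sample or filter pairs rather than enumerate all of them.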

Some of the advantages of this data collection method are:

Requires only monolingual speakers to create translation data
Creates more natural paraphrases that are unbiased by a source sentence
Discourages cheating, such as using an online translation service, since there are no source sentences to translate

Over a two-month period from July to September in 2010, we collected 85K English descriptions for 2,089 video clips as well as over a thousand descriptions for each of a dozen more languages. In addition to providing training and testing data for paraphrase and translation engines, this data also provides natural language descriptions for a significant amount of video data. The video clips generally depict a single, unambiguous action or event.

Publication and Talks

Building a Persistent Workforce on Mechanical Turk for Multilingual Data Collection [Abstract] [PDF] [Slides (PPT)]
David L. Chen and William B. Dolan
In the proceedings of The 3rd Human Computation Workshop (HCOMP 2011), San Francisco, CA, August, 2011

Collecting Highly Parallel Data for Paraphrase Evaluation [Abstract] [PDF] [Slides (PPT)]
David L. Chen and William B. Dolan
In the proceedings of The 49th Annual Meeting of the Association for Computational Linguistics (ACL), Portland, OR, June, 2011

Data

Overview

The data consists of 122K descriptions for 2089 video clips. Below is a breakdown of the number of annotations obtained for each language:

English     85550    Hindi    6245    Romanian  3998    Slovene  3584
Serbian      3420    Tamil    2789    Dutch     2735    German   2326
Macedonian   1915    Spanish  1883    Gujarati  1437    Russian  1243
French       1226    Italian   953    Georgian   907    Polish    544

Other languages with at least one annotation include: Chinese, Malayalam, Tagalog, Portuguese, Norwegian, Filipino, Estonian, Turkish, Arabic, Urdu, Hungarian, Indonesian, Malay, Bulgarian, Danish, Bosnian, Marathi, Swedish, and Albanian.

We have also included some of the video clips that were used to gather these descriptions. Unfortunately, due to the volatility of YouTube, some of the videos were removed before we could archive them. A total of 1970 out of 2089 video clips are included in the tarball below.

Fall 2021 Notes:

The links to download the video description dataset from MSR no longer work. Unfortunately, we were only able to reconstruct the English corpus.

Citations

Please use the following citation when referencing the source of the data:

@InProceedings{chen:acl11,
  title = "Collecting Highly Parallel Data for Paraphrase Evaluation",
  author = "David L. Chen and William B. Dolan",
  booktitle = "Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics (ACL-2011)",
  address = "Portland, OR",
  month = "June",
  year = 2011
}

Downloads

To download the reconstructed English descriptions of the videos, please visit:
Microsoft Research Video Description Corpus

Here is a tarball of most of the video files (.avi):
YouTubeClips.tar
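
Once downloaded, the archive can be inspected with Python's standard `tarfile` module. This is a minimal sketch: the helper name `list_avi_members` is hypothetical, and the internal directory layout of the tarball is not documented here, so the code only filters members by the `.avi` extension.

```python
import tarfile


def list_avi_members(tar_path):
    """Return the names of regular .avi files inside a tar archive."""
    with tarfile.open(tar_path, "r") as archive:
        return [
            member.name
            for member in archive.getmembers()
            if member.isfile() and member.name.lower().endswith(".avi")
        ]


# Example usage, assuming the tarball was saved in the current directory:
# clips = list_avi_members("YouTubeClips.tar")
# print(len(clips))
```

Since 1970 of the 2089 clips are said to be included, the length of the returned list offers a quick sanity check on the download.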

Contact Information

If you have any questions or comments, please contact David Chen or Bill Dolan.