Collecting Multilingual Parallel Video Descriptions Using Mechanical Turk

David L. Chen, Department of Computer Science, University of Texas at Austin
William B. Dolan, Natural Language Processing Group, Microsoft Research

Description of the project

Traditional methods of collecting translation and paraphrase data can be prohibitively expensive, making construction of large, new corpora difficult. While crowdsourcing offers a cheap alternative, quality control and scalability can become problematic. In this project we introduce a novel annotation task that uses short video clips (usually less than 10 seconds) as the stimulus to elicit parallel linguistic responses from the annotators. Descriptions of the same video in the same language can then be used as paraphrases of each other while descriptions in different languages can be used as translations of each other.
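The pairing scheme described above can be sketched in a few lines of Python. This is only an illustration, assuming each annotation is available as a `(video_id, language, text)` tuple; the function name `build_pairs`, the tuple layout, and the sample sentences are hypothetical and do not reflect the format of the released corpus.

```python
from collections import defaultdict
from itertools import combinations, product


def build_pairs(descriptions):
    """Group descriptions by video, then pair them up.

    descriptions: iterable of (video_id, language, text) tuples
    (an assumed layout, not the corpus file format).
    """
    by_video = defaultdict(lambda: defaultdict(list))
    for video_id, language, text in descriptions:
        by_video[video_id][language].append(text)

    paraphrases = []   # (language, text_a, text_b)
    translations = []  # (lang_a, text_a, lang_b, text_b)
    for per_language in by_video.values():
        # Same video, same language: candidate paraphrases.
        for language, texts in per_language.items():
            for a, b in combinations(texts, 2):
                paraphrases.append((language, a, b))
        # Same video, different languages: candidate translations.
        for (lang_a, texts_a), (lang_b, texts_b) in combinations(
            per_language.items(), 2
        ):
            for a, b in product(texts_a, texts_b):
                translations.append((lang_a, a, lang_b, b))
    return paraphrases, translations


# Made-up sample annotations for illustration only.
sample = [
    ("vid001", "English", "A man is playing a guitar."),
    ("vid001", "English", "Someone strums a guitar."),
    ("vid001", "German", "Ein Mann spielt Gitarre."),
    ("vid002", "English", "A cat jumps onto a table."),
]
paraphrases, translations = build_pairs(sample)
```

Because every description of a clip is paired with every other, the number of candidate pairs grows quadratically with the number of descriptions per video, so a real pipeline would likely sample or filter the pairs.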

Some of the advantages of this data collection method are:

  • Requires only monolingual speakers to create translation data
  • Creates more natural paraphrases that are unbiased by a source sentence
  • Discourages cheating, such as using an online translation service, since there are no source sentences to translate

Over a two-month period from July to September in 2010, we collected 85K English descriptions for 2,089 video clips as well as over a thousand descriptions for each of a dozen more languages. In addition to providing training and testing data for paraphrase and translation engines, this data also provides natural language descriptions for a significant amount of video data. The video clips generally depict a single, unambiguous action or event.

Publication and Talks

Building a Persistent Workforce on Mechanical Turk for Multilingual Data Collection [Abstract] [PDF] [Slides (PPT)]
David L. Chen and William B. Dolan
In the proceedings of The 3rd Human Computation Workshop (HCOMP 2011), San Francisco, CA, August, 2011

Collecting Highly Parallel Data for Paraphrase Evaluation [Abstract] [PDF] [Slides (PPT)]
David L. Chen and William B. Dolan
In the proceedings of The 49th Annual Meeting of the Association for Computational Linguistics (ACL), Portland, OR, June, 2011


Data

The data consists of 122K descriptions for 2089 video clips. Below is a breakdown of the number of annotations obtained for each language:

  Language      Annotations
  English       85550
  Hindi         6245
  Romanian      3998
  Slovene       3584
  Serbian       3420
  Tamil         2789
  Dutch         2735
  German        2326
  Macedonian    1915
  Spanish       1883
  Gujarati      1437
  Russian       1243
  French        1226
  Italian       953
  Georgian      907
  Polish        544
Other languages with at least one annotation include: Chinese, Malayalam, Tagalog, Portuguese, Norwegian, Filipino, Estonian, Turkish, Arabic, Urdu, Hungarian, Indonesian, Malay, Bulgarian, Danish, Bosnian, Marathi, Swedish, and Albanian.
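
As a quick sanity check on the breakdown above, the per-language counts can be tallied in Python. The counts are copied from the list above; the variable names are illustrative. Since the overall 122K figure is rounded, the share contributed by the unlisted languages can only be estimated from it.

```python
# Annotation counts per language, copied from the breakdown above.
counts = {
    "English": 85550, "Hindi": 6245, "Romanian": 3998, "Slovene": 3584,
    "Serbian": 3420, "Tamil": 2789, "Dutch": 2735, "German": 2326,
    "Macedonian": 1915, "Spanish": 1883, "Gujarati": 1437, "Russian": 1243,
    "French": 1226, "Italian": 953, "Georgian": 907, "Polish": 544,
}

listed_total = sum(counts.values())  # 120755 for the 16 listed languages
english_share = counts["English"] / listed_total

# Languages other than English with more than a thousand descriptions.
over_thousand = [
    lang for lang, n in counts.items() if lang != "English" and n > 1000
]

print(f"Listed total: {listed_total}")
print(f"English share of listed annotations: {english_share:.1%}")
print(f"Non-English languages with over 1000 descriptions: {len(over_thousand)}")
```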

We have also included some of the video clips that were used to gather these descriptions. Unfortunately, due to the volatility of YouTube, some of the videos were removed before we could archive them. A total of 1970 out of 2089 video clips are included in the tarball below.

Fall 2021 Notes:

The links to download the video description dataset from MSR no longer work. Unfortunately, we were only able to reconstruct the English corpus.


Please use the following citation when referencing the source of the data:

  title = "Collecting Highly Parallel Data for Paraphrase Evaluation",
  author = "David L. Chen and William B. Dolan",
  booktitle = "Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics (ACL-2011)",
  address = "Portland, OR",
  month = "June",
  year = 2011


To download the reconstructed English descriptions of the videos, please visit:
Microsoft Research Video Description Corpus

Here is a tarball of most of the video files (.avi):

Contact Information

If you have any questions or comments, please contact David Chen or Bill Dolan.