Watch, Listen & Learn: Co-training on Captioned Images and Videos

University of Texas at Austin
Department of Computer Sciences
Sonal Gupta Joohyun Kim Kristen Grauman and Raymond J. Mooney

Description of the project

Recognizing visual scenes and activities is challenging: often visual cues alone are ambiguous, and it is expensive to obtain manually labeled examples from which to learn. To cope with these constraints, we propose to leverage the text that often accompanies visual data to learn robust models of scenes and actions from partially labeled collections. Our approach uses co-training, a semi-supervised learning method that accommodates multi-modal views of data. To classify images, our method learns from captioned images of natural scenes; and to recognize human actions, it learns from videos of athletic events with commentary. We show that by exploiting both multi-modal representations and unlabeled data our approach learns more accurate image and video classifiers than standard baseline algorithms.
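The co-training loop described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the implementation used in the paper: it swaps in a simple nearest-centroid classifier for each view, uses synthetic two-view data in place of real visual and text features, and all function and variable names (`co_train`, `centroid_fit`, `visual`, `text`) are hypothetical.

```python
import random

def centroid_fit(X, y):
    """Fit a nearest-centroid classifier: one mean vector per class."""
    sums, counts = {}, {}
    for x, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {c: [v / counts[c] for v in s] for c, s in sums.items()}

def centroid_predict(model, x):
    """Return (label, confidence); confidence is the gap between the two nearest centroids."""
    dists = sorted((sum((a - b) ** 2 for a, b in zip(x, m)), c) for c, m in model.items())
    margin = dists[1][0] - dists[0][0] if len(dists) > 1 else 0.0
    return dists[0][1], margin

def co_train(view_a, view_b, labels, rounds=10, per_round=2):
    """labels[i] is a class label, or None if example i is unlabeled.
    Each round, every view's classifier labels its most confident
    unlabeled examples, and those labels are shared with the other view."""
    labels = list(labels)

    def fit(view):
        idx = [i for i, l in enumerate(labels) if l is not None]
        return centroid_fit([view[i] for i in idx], [labels[i] for i in idx])

    for _ in range(rounds):
        unlabeled = [i for i, l in enumerate(labels) if l is None]
        if not unlabeled:
            break
        newly = {}
        for view in (view_a, view_b):
            model = fit(view)
            scored = sorted(((centroid_predict(model, view[i]), i) for i in unlabeled),
                            key=lambda t: -t[0][1])
            for (label, _), i in scored[:per_round]:
                newly.setdefault(i, label)  # on conflict, the first view's label wins
        for i, label in newly.items():
            labels[i] = label
    return fit(view_a), fit(view_b), labels

# Synthetic stand-in data (an assumption): two well-separated classes, two views.
rng = random.Random(0)
truth = [0] * 10 + [1] * 10

def noisy(center):
    return [center + rng.uniform(-1, 1) for _ in range(3)]

visual = [noisy(5.0 * c) for c in truth]  # stands in for the visual view
text = [noisy(5.0 * c) for c in truth]    # stands in for the caption view
seed_labels = [c if i in (0, 1, 10, 11) else None for i, c in enumerate(truth)]

model_v, model_t, final_labels = co_train(visual, text, seed_labels)
print(sum(l is not None for l in final_labels), "of", len(final_labels), "examples labeled")
```

Starting from four labeled seeds, each round grows the labeled pool with the examples that either view is most confident about, so a strong cue in one view can supply training labels for the other.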

Publication

Watch, Listen & Learn: Co-training on Captioned Images and Videos [Abstract] [PDF] [Slides (PPT, Mac Office '04) (PDF)] [Poster (PDF)]
Sonal Gupta, Joohyun Kim, Kristen Grauman and Raymond J. Mooney
In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD 2008), Antwerp, Belgium, September 2008.

Citation

@InProceedings{gupta:ecml08,
  title = "Watch, Listen \& Learn: Co-training on Captioned Images and Videos",
  author = "Sonal Gupta and Joohyun Kim and Kristen Grauman and Raymond J. Mooney",
  booktitle = "Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD 2008)",
  address = "Antwerp, Belgium",
  month = "September",
  year = 2008
}

Data

Overview

Image Data

Our image data is taken from the Israel dataset introduced in Bekkerman & Jeon (CVPR 2007), which consists of images with short text captions. To evaluate the co-training approach, we used two classes from this data, Desert and Trees. These two classes were selected because they satisfy the sufficiency assumption of co-training, which requires that both views be effective at discriminating the classes (given sufficient labeled data). We refer to this set as the Desert-Trees dataset. The complete dataset contains 362 instances.

Video Data

We collected video clips of soccer and ice skating. One set of video clips is from the DVD titled '1998 Olympic Winter Games: Figure Skating Exhibition Highlights', which contains highlights of the figure skating competition at the 1998 Nagano Olympics. The other set of video clips shows soccer play and was acquired either from the DVD titled 'Strictly Soccer Individual Skills' or downloaded from YouTube. These videos mostly concentrate on the player in the middle of the screen, and the motions are usually repeated several times from different viewpoints. The soccer clips mainly show soccer-specific actions such as kicking and dribbling. There is significant variation in the size of the person across the clips.

The video clips are resized to 240x360 resolution and then manually divided into short clips. The clip length varies from 20 to 120 frames, though most are between 20 and 40 frames. While segmenting activities in video is itself a difficult problem, in this work we specifically focus on classifying pre-segmented clips. Each clip is labeled with one of four categories: kicking, dribbling, spinning and dancing. The first two are soccer activities and the last two are skating activities. The number of clips in each category is: dancing 59, spinning 47, dribbling 55 and kicking 60, for a total of 221 clips.

As the video clips were not originally captioned, we recruited two colleagues who were unaware of the goals of the project to supply the commentary for the soccer videos. The skating commentary was provided by two of the authors.

Download

For more information on the image dataset, please go here. Ron Bekkerman has provided the links and captions of the images in this file.

Here is a compressed file of some random videos from the dataset: videos.zip
Captions of some of the videos in our dataset are in this file: captions.txt

Results

Image: Co-training vs. Supervised SVM
Image: Co-training vs. Semi-Supervised EM
Image: Co-training vs. Transductive SVM in Semi-Supervised Setting
Video: Co-training vs. Supervised SVM
Video, when captions are available only during training: Co-training vs. Supervised SVM

Contact Information

If you have any questions or comments, please contact Sonal Gupta or Joohyun Kim.

If you are interested in reading more literature in this area, check out our reading group, CLAMP.