YouTube2Text: Recognizing and Describing Arbitrary Activities Using Semantic Hierarchies and Zero-shot Recognition (2013)

Sergio Guadarrama, Niveda Krishnamoorthy, Girish Malkarnenkar, Subhashini Venugopalan, Raymond Mooney, Trevor Darrell, Kate Saenko

Despite a recent push towards large-scale object recognition, activity recognition remains limited to narrow domains and small vocabularies of actions. In this paper, we tackle the challenge of recognizing and describing activities "in-the-wild". We present a solution that takes a short video clip and outputs a brief sentence that summarizes the main activity in the video, naming the actor, the action, and its object. Unlike previous work, our approach works on out-of-domain actions: it does not require training videos of the exact activity. If it cannot find an accurate prediction from a pre-trained model, it finds a less specific answer that is also plausible from a pragmatic standpoint. We use semantic hierarchies learned from the data to help choose an appropriate level of generalization, and priors learned from web-scale natural language corpora to penalize unlikely combinations of actors/actions/objects; we also use a web-scale language model to "fill in" novel verbs, i.e., verbs that do not appear in the training set. We evaluate our method on a large YouTube corpus and demonstrate that it generates short sentence descriptions of video clips better than baseline approaches.
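The idea of falling back to a less specific answer can be illustrated with a minimal sketch. This is not the paper's implementation: the toy hierarchy, the scores, the threshold, and the names `PARENT`, `leaves_under`, and `back_off` are all assumptions made up for illustration. Starting from the best-scoring leaf label, the sketch climbs a child-to-parent hierarchy until the summed classifier probability of the leaves under the current node reaches a chosen confidence threshold.

```python
# Hypothetical toy verb hierarchy (child -> parent); not taken from the paper.
PARENT = {
    "slice": "cut",
    "chop": "cut",
    "cut": "prepare",
    "stir": "prepare",
}


def leaves_under(node, parent=PARENT):
    """Return the set of leaf labels below `node` (or {node} if it is a leaf)."""
    children = {c for c, p in parent.items() if p == node}
    if not children:
        return {node}
    result = set()
    for child in children:
        result |= leaves_under(child, parent)
    return result


def back_off(leaf_scores, threshold, parent=PARENT):
    """Climb from the best leaf until the covered probability mass is enough.

    leaf_scores: dict mapping leaf label -> classifier probability (assumed).
    threshold:   minimum summed probability required to accept a node (assumed).
    Returns (label, mass); the root is returned if no node reaches the threshold.
    """
    node = max(leaf_scores, key=leaf_scores.get)
    while True:
        mass = sum(leaf_scores.get(leaf, 0.0) for leaf in leaves_under(node, parent))
        if mass >= threshold or node not in parent:
            return node, mass
        node = parent[node]


# Illustrative scores only; no single leaf is confident enough at threshold 0.7.
scores = {"slice": 0.4, "chop": 0.35, "stir": 0.25}
label, mass = back_off(scores, threshold=0.7)
print(label, round(mass, 2))
```

With these made-up scores, neither "slice" nor "chop" alone reaches the threshold, so the sketch settles on their shared parent "cut". A language prior over actor/action/object combinations, as the abstract describes, would be a separate re-scoring step and is not modeled here.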

Citation:

In Proceedings of the 14th International Conference on Computer Vision (ICCV-2013), pp. 2712--2719, Sydney, Australia, December 2013.

People

Niveda Krishnamoorthy	Masters Alumni	niveda [at] cs utexas edu
Girish Malkarnenkar	Masters Alumni	girish [at] cs utexas edu
Raymond J. Mooney	Faculty	mooney [at] cs utexas edu
Subhashini Venugopalan	Ph.D. Alumni	vsub [at] cs utexas edu

Areas of Interest

Computer Vision
Language and Vision
Natural Language Processing

Labs

Machine Learning