Generating Natural-Language Video Descriptions Using Text-Mined Knowledge (2013)

Niveda Krishnamoorthy, Girish Malkarnenkar, Raymond J. Mooney, Kate Saenko, Sergio Guadarrama

We present a holistic data-driven technique that generates natural-language descriptions for videos. We combine the output of state-of-the-art object and activity detectors with "real-world" knowledge to select the most probable subject-verb-object triplet for describing a video. We show that this knowledge, automatically mined from web-scale text corpora, enhances the triplet selection algorithm by providing it contextual information and leads to a four-fold increase in activity identification. Unlike previous methods, our approach can annotate arbitrary videos without requiring the expensive collection and annotation of a similar training video corpus. We evaluate our technique against a baseline that does not use text-mined knowledge and show that humans prefer our descriptions 61 percent of the time.
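
The triplet selection described in the abstract can be illustrated with a small sketch. This is not the paper's actual model: the function name `select_triplet`, the log-linear weighting, the `weight` and `floor` parameters, and all scores below are assumptions made for illustration. It only shows how detector confidences and a text-mined plausibility score might be combined to rank subject-verb-object candidates.

```python
from itertools import product
from math import log


def select_triplet(subjects, verbs, objects, text_prior, weight=0.5, floor=1e-6):
    """Return the highest-scoring (subject, verb, object) triplet.

    subjects, verbs, objects: dicts mapping a candidate word to a detector
        confidence in (0, 1]. These stand in for vision outputs (hypothetical).
    text_prior: dict mapping (s, v, o) to a plausibility in (0, 1], standing in
        for knowledge mined from text (hypothetical).
    weight: how much the text prior counts relative to the vision scores.
    floor: plausibility assumed for triplets absent from text_prior.
    """
    best, best_score = None, float("-inf")
    for (s, ps), (v, pv), (o, po) in product(
        subjects.items(), verbs.items(), objects.items()
    ):
        vision = log(ps) + log(pv) + log(po)
        language = log(text_prior.get((s, v, o), floor))
        score = (1 - weight) * vision + weight * language
        if score > best_score:
            best, best_score = (s, v, o), score
    return best


# Made-up detector confidences and text-mined plausibilities.
subjects = {"person": 0.9, "dog": 0.3}
verbs = {"ride": 0.4, "play": 0.45}
objects = {"motorbike": 0.8, "guitar": 0.2}
text_prior = {
    ("person", "ride", "motorbike"): 0.5,
    ("person", "play", "guitar"): 0.4,
}

vision_only = select_triplet(subjects, verbs, objects, text_prior, weight=0.0)
with_text = select_triplet(subjects, verbs, objects, text_prior, weight=0.5)
print(vision_only)  # ('person', 'play', 'motorbike')
print(with_text)    # ('person', 'ride', 'motorbike')
```

In this toy example the vision-only ranking picks an implausible verb-object pairing because each detector is scored independently, while adding the text prior shifts the choice to a triplet that is also likely in language. That is the kind of contextual correction the abstract attributes to text-mined knowledge.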

Citation:

In Proceedings of the 27th AAAI Conference on Artificial Intelligence (AAAI-2013), pp. 541--547, July 2013.

People

Niveda Krishnamoorthy	Masters Alumni	niveda [at] cs utexas edu
Girish Malkarnenkar	Masters Alumni	girish [at] cs utexas edu
Raymond J. Mooney	Faculty	mooney [at] cs utexas edu

Areas of Interest

Computer Vision, Language and Vision, Machine Learning, Natural Language Processing

Labs

Machine Learning