Language and Vision
To truly understand language, an intelligent system must be able to connect words, phrases, and sentences to its perception of objects and events in the world. Vision is the primary source of perception, and grounding language in vision is an important AI problem with many applications. Our group has focused particularly on automated video captioning: producing natural-language descriptions of short video clips using both graphical models and deep neural networks.
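The captioning pipeline described above can be sketched as two stages: encode the video's frames into a fixed-size representation, then decode that representation into a word sequence. Below is a minimal, purely illustrative NumPy sketch of that encode/decode pattern; the vocabulary, the random output weights, and the simplistic state update are all hypothetical stand-ins for the trained CNN and LSTM components used in the group's actual systems.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy vocabulary; a real captioner learns a large vocabulary from data.
vocab = ["<bos>", "<eos>", "a", "man", "is", "playing", "guitar"]
V, D = len(vocab), 8

# 1. Encode: mean-pool per-frame features into a single clip vector
#    (real systems extract these features with a CNN).
frame_feats = rng.standard_normal((30, D))   # 30 frames of D-dim features
clip_vec = frame_feats.mean(axis=0)

# 2. Decode: greedily emit words from a toy linear "decoder".
#    W_out is random here, standing in for trained decoder weights.
W_out = rng.standard_normal((D, V))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

caption, state = [], clip_vec
for _ in range(10):                          # cap caption length at 10 words
    probs = softmax(state @ W_out)
    word = vocab[int(probs.argmax())]
    if word == "<eos>":
        break
    caption.append(word)
    # A crude stand-in for the recurrent state update an LSTM decoder performs.
    state = 0.9 * state + 0.1 * rng.standard_normal(D)

print(" ".join(caption))
```

With random weights the output is of course meaningless; the point is only the structure: pooled visual features condition a step-by-step word generator that stops at an end-of-sentence token.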
Sonal Gupta Masters Alumni sonal [at] cs stanford edu
Niveda Krishnamoorthy Masters Alumni niveda [at] cs utexas edu
Calvin MacKenzie Undergraduate Alumni calvinm mackenzie [at] utexas edu
Tanvi S Motwani Masters Alumni tanvi [at] cs utexas edu
Nazneen Rajani Ph.D. Alumni nrajani [at] cs utexas edu
Stephen Roller Ph.D. Alumni roller [at] cs utexas edu
Subhashini Venugopalan Ph.D. Alumni vsub [at] cs utexas edu
Jialin Wu Ph.D. Student jialinwu [at] utexas edu
Harel Yedidsion Postdoctoral Fellow harel [at] cs utexas edu
Faithful Multimodal Explanation for Visual Question Answering 2019
Jialin Wu and Raymond J. Mooney, In Proceedings of the Second BlackboxNLP Workshop at ACL, pp. 103--112, Florence, Italy, August 2019.
Generating Question Relevant Captions to Aid Visual Question Answering 2019
Jialin Wu, Zeyuan Hu, Raymond J. Mooney, In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL), Florence, Italy, August 2019.
Hidden State Guidance: Improving Image Captioning using An Image Conditioned Autoencoder 2019
Jialin Wu and Raymond J. Mooney, To appear in Proceedings of the Visually Grounded Interaction and Language Workshop at NeurIPS, December 2019.
Self-Critical Reasoning for Robust Visual Question Answering 2019
Jialin Wu and Raymond J. Mooney, To appear in Proceedings of Neural Information Processing Systems (NeurIPS), December 2019.
Explainable Improved Ensembling for Natural Language and Vision 2018
Nazneen Rajani, PhD Thesis, Department of Computer Science, The University of Texas at Austin.
Joint Image Captioning and Question Answering 2018
Jialin Wu, Zeyuan Hu, and Raymond J. Mooney, In VQA Challenge and Visual Dialog Workshop at the 31st IEEE Conference on Computer Vision and Pattern Recognition (CVPR-18), June 2018.
Learning a Policy for Opportunistic Active Learning 2018
Aishwarya Padmakumar, Peter Stone, Raymond J. Mooney, In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP-18), Brussels, Belgium, November 2018.
Stacking With Auxiliary Features for Visual Question Answering 2018
Nazneen Fatema Rajani, Raymond J. Mooney, In Proceedings of the 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2217--2226, 2018.
Captioning Images with Diverse Objects 2017
Subhashini Venugopalan, Lisa Anne Hendricks, Marcus Rohrbach, Raymond Mooney, Trevor Darrell, and Kate Saenko, In Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR-17), pp. 5753--5761, 2017.
Ensembling Visual Explanations for VQA 2017
Nazneen Fatema Rajani, Raymond J. Mooney, In Proceedings of the NIPS 2017 workshop on Visually-Grounded Interaction and Language (ViGIL), December 2017.
Multi-Modal Word Synset Induction 2017
Jesse Thomason and Raymond J. Mooney, In Proceedings of the 26th International Joint Conference on Artificial Intelligence (IJCAI-17), pp. 4116--4122, Melbourne, Australia, 2017.
Natural-Language Video Description with Deep Recurrent Neural Networks 2017
Subhashini Venugopalan, PhD Thesis, Department of Computer Science, The University of Texas at Austin.
Using Explanations to Improve Ensembling of Visual Question Answering Systems 2017
Nazneen Fatema Rajani and Raymond J. Mooney, In Proceedings of the IJCAI 2017 Workshop on Explainable Artificial Intelligence (XAI), pp. 43--47, Melbourne, Australia, August 2017.
Deep Compositional Captioning: Describing Novel Object Categories without Paired Training Data 2016
Lisa Anne Hendricks, Subhashini Venugopalan, Marcus Rohrbach, Raymond Mooney, Kate Saenko, and Trevor Darrell, In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR-16), pp. 1--10, 2016.
Improving LSTM-based Video Description with Linguistic Knowledge Mined from Text 2016
Subhashini Venugopalan, Lisa Anne Hendricks, Raymond Mooney, and Kate Saenko, In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP-16), pp. 1961--1966, Austin, Texas, 2016.
Stacking With Auxiliary Features: Improved Ensembling for Natural Language and Vision 2016
Nazneen Fatema Rajani, PhD proposal, Department of Computer Science, The University of Texas at Austin.
Natural Language Video Description using Deep Recurrent Neural Networks 2015
Subhashini Venugopalan, PhD proposal, Department of Computer Science, The University of Texas at Austin.
Sequence to Sequence -- Video to Text 2015
Subhashini Venugopalan, Marcus Rohrbach, Jeff Donahue, Raymond J. Mooney, Trevor Darrell, and Kate Saenko, In Proceedings of the 2015 International Conference on Computer Vision (ICCV-15), Santiago, Chile, December 2015.
Translating Videos to Natural Language Using Deep Recurrent Neural Networks 2015
Subhashini Venugopalan, Huijuan Xu, Jeff Donahue, Marcus Rohrbach, Raymond Mooney, and Kate Saenko, In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics -- Human Language Technologies (NAACL HLT 2015), pp. 1494--1504, Denver, Colorado, June 2015.
Integrating Language and Vision to Generate Natural Language Descriptions of Videos in the Wild 2014
Jesse Thomason, Subhashini Venugopalan, Sergio Guadarrama, Kate Saenko, and Raymond Mooney, In Proceedings of the 25th International Conference on Computational Linguistics (COLING 2014), pp. 1218--1227, Dublin, Ireland, August 2014.
Integrating Visual and Linguistic Information to Describe Properties of Objects 2014
Calvin MacKenzie, Undergraduate Honors Thesis, Computer Science Department, University of Texas at Austin.
A Multimodal LDA Model Integrating Textual, Cognitive and Visual Modalities 2013
Stephen Roller and Sabine Schulte im Walde, In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP 2013), pp. 1146--1157, Seattle, WA, October 2013.
Generating Natural-Language Video Descriptions Using Text-Mined Knowledge 2013
Niveda Krishnamoorthy, Girish Malkarnenkar, Raymond J. Mooney, Kate Saenko, Sergio Guadarrama, In Proceedings of the 27th AAAI Conference on Artificial Intelligence (AAAI-2013), pp. 541--547, July 2013.
Generating Natural-Language Video Descriptions Using Text-Mined Knowledge 2013
Niveda Krishnamoorthy, Girish Malkarnenkar, Raymond J. Mooney, Kate Saenko, Sergio Guadarrama, In Proceedings of the NAACL HLT Workshop on Vision and Language (WVL '13), pp. 10--19, 2013.
YouTube2Text: Recognizing and Describing Arbitrary Activities Using Semantic Hierarchies and Zero-shot Recognition 2013
Sergio Guadarrama, Niveda Krishnamoorthy, Girish Malkarnenkar, Subhashini Venugopalan, Raymond Mooney, Trevor Darrell, Kate Saenko, In Proceedings of the 14th International Conference on Computer Vision (ICCV-2013), pp. 2712--2719, Sydney, Australia, December 2013.
Improving Video Activity Recognition using Object Recognition and Text Mining 2012
Tanvi S. Motwani and Raymond J. Mooney, In Proceedings of the 20th European Conference on Artificial Intelligence (ECAI-2012), pp. 600--605, August 2012.
Using Closed Captions as Supervision for Video Activity Recognition 2010
Sonal Gupta, Raymond J. Mooney, In Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence (AAAI-2010), pp. 1083--1088, Atlanta, GA, July 2010.
Activity Retrieval in Closed Captioned Videos 2009
Sonal Gupta, Masters Thesis, Department of Computer Sciences, University of Texas at Austin. 64 pages.
Using Closed Captions to Train Activity Recognizers that Improve Video Retrieval 2009
Sonal Gupta and Raymond Mooney, In Proceedings of the CVPR-09 Workshop on Visual and Contextual Learning from Annotated Images and Videos (VCL), Miami, FL, June 2009.
Watch, Listen & Learn: Co-training on Captioned Images and Videos 2008
Sonal Gupta, Joohyun Kim, Kristen Grauman, and Raymond Mooney, In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD), pp. 457--472, Antwerp, Belgium, September 2008.