Department of Computer Science

Machine Learning Research Group

Artificial Intelligence Lab, The University of Texas at Austin

Publications: Language and Vision

To truly understand language, an intelligent system must be able to connect words, phrases, and sentences to its perception of objects and events in the world. Vision is a primary source of perception, and grounding language in vision is an important AI problem with many applications. Our group has focused particularly on automated video captioning: producing natural language descriptions of short video clips using both graphical models and deep neural networks.
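
For readers new to the area, the sketch below illustrates the kind of encoder-decoder recurrent architecture behind several of the video captioning papers listed here (e.g., entry 28, "Sequence to Sequence -- Video to Text"). It is a minimal PyTorch illustration under assumed names and dimensions (VideoCaptioner, feat_dim, etc.), not the published implementations.

    # Minimal sketch of sequence-to-sequence video captioning:
    # encode a sequence of per-frame CNN features with one LSTM,
    # then decode a word sequence with a second LSTM conditioned
    # on the encoder's final state. All sizes are illustrative.
    import torch
    import torch.nn as nn

    class VideoCaptioner(nn.Module):
        def __init__(self, feat_dim=4096, hidden_dim=512, vocab_size=10000):
            super().__init__()
            self.encoder = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
            self.embed = nn.Embedding(vocab_size, hidden_dim)
            self.decoder = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
            self.out = nn.Linear(hidden_dim, vocab_size)

        def forward(self, frame_feats, captions):
            # frame_feats: (batch, frames, feat_dim); captions: (batch, len) word ids
            _, state = self.encoder(frame_feats)   # keep only the final (h, c) state
            dec_out, _ = self.decoder(self.embed(captions), state)
            return self.out(dec_out)               # (batch, len, vocab_size) logits

    # Example: two clips of 30 frame features, decoding 12-token captions.
    model = VideoCaptioner()
    logits = model(torch.randn(2, 30, 4096), torch.randint(0, 10000, (2, 12)))
    print(logits.shape)  # torch.Size([2, 12, 10000])

Training such a model minimizes cross-entropy between these logits and reference captions; the papers below differ in how they encode frames, mine linguistic knowledge, and handle novel objects.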
  1. Incorporating External Information for Visual Question Answering
    [Details] [PDF] [Slides (PDF)]
    Jialin Wu
PhD Thesis, Department of Computer Science, The University of Texas at Austin, August 2022.
  2. Zero-shot Video Moment Retrieval With Off-the-Shelf Models
    [Details] [PDF] [Poster]
    Anuj Diwan, Puyuan Peng, Raymond J. Mooney
    In Workshop on Transfer Learning for Natural Language Processing at NeurIPS 2022, December 2022.
  3. Entity-Focused Dense Passage Retrieval for Outside-Knowledge Visual Question Answering
    [Details] [PDF] [Poster] [Video]
    Jialin Wu, Raymond Mooney
    In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP), December 2022.
  4. Using Natural Language to Aid Task Specification in Sequential Decision Making Problems
    [Details] [PDF] [Slides (PDF)] [Video]
    Prasoon Goyal
PhD Thesis, Department of Computer Science, The University of Texas at Austin, July 2022.
  5. Multi-Modal Answer Validation for Knowledge-Based VQA
    [Details] [PDF] [Video]
    Jialin Wu, Jiasen Lu, Ashish Sabharwal, Roozbeh Mottaghi
    In Proceedings of the AAAI Conference on Artificial Intelligence, February 2022.
  6. Towards Automated Error Analysis: Learning to Characterize Errors
    [Details] [PDF] [Poster]
    Tong Gao, Shivang Singh, Raymond J. Mooney
Short version appears in the 35th International Florida Artificial Intelligence Research Society Conference (FLAIRS), May 2022.
  7. Incorporating Textual Resources to Improve Visual Question Answering
    [Details] [PDF] [Slides (PDF)]
    Jialin Wu
PhD Proposal, Department of Computer Science, The University of Texas at Austin, September 2021.
  8. Improving VQA and its Explanations by Comparing Competing Explanations
    [Details] [PDF] [Slides (PDF)]
    Jialin Wu, Liyan Chen, Raymond J. Mooney
    In The AAAI Conference on Artificial Intelligence (AAAI), Explainable Agency in Artificial Intelligence Workshop, February 2021.
  9. Dialog Policy Learning for Joint Clarification and Active Learning Queries
    [Details] [PDF] [Slides (PDF)] [Poster] [Video]
    Aishwarya Padmakumar, Raymond J. Mooney
    In The AAAI Conference on Artificial Intelligence (AAAI), February 2021.
  10. Dialog as a Vehicle for Lifelong Learning of Grounded Language Understanding Systems
    [Details] [PDF] [Slides (PDF)]
    Aishwarya Padmakumar
    PhD Thesis, Department of Computer Science, The University of Texas at Austin, August 2020.
  11. Self-Critical Reasoning for Robust Visual Question Answering
    [Details] [PDF] [Slides (PDF)] [Poster]
    Jialin Wu and Raymond J. Mooney
In Proceedings of Neural Information Processing Systems (NeurIPS), December 2019.
  12. Hidden State Guidance: Improving Image Captioning using An Image Conditioned Autoencoder
    [Details] [PDF] [Poster]
    Jialin Wu and Raymond J. Mooney
    In Proceedings of the Visually Grounded Interaction and Language Workshop at NeurIPS 2019, December 2019.
  13. Generating Question Relevant Captions to Aid Visual Question Answering
    [Details] [PDF] [Slides (PPT)]
    Jialin Wu, Zeyuan Hu, Raymond J. Mooney
    In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL), Florence, Italy, August 2019.
  14. Faithful Multimodal Explanation for Visual Question Answering
    [Details] [PDF] [Slides (PPT)]
    Jialin Wu and Raymond J. Mooney
In Proceedings of the Second BlackboxNLP Workshop at ACL, 103--112, Florence, Italy, August 2019.
  15. Learning a Policy for Opportunistic Active Learning
    [Details] [PDF]
    Aishwarya Padmakumar, Peter Stone, Raymond J. Mooney
    In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP-18), Brussels, Belgium, November 2018.
  16. Explainable Improved Ensembling for Natural Language and Vision
    [Details] [PDF] [Slides (PPT)] [Slides (PDF)]
    Nazneen Rajani
    PhD Thesis, Department of Computer Science, The University of Texas at Austin, July 2018.
  17. Joint Image Captioning and Question Answering
    [Details] [PDF] [Poster]
    Jialin Wu, Zeyuan Hu and Raymond J. Mooney
In VQA Challenge and Visual Dialog Workshop at the 31st IEEE Conference on Computer Vision and Pattern Recognition (CVPR-18), June 2018.
  18. Stacking With Auxiliary Features for Visual Question Answering
    [Details] [PDF] [Poster]
    Nazneen Fatema Rajani, Raymond J. Mooney
In Proceedings of the 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2217--2226, 2018.
  19. Ensembling Visual Explanations for VQA
    [Details] [PDF] [Poster]
    Nazneen Fatema Rajani, Raymond J. Mooney
In Proceedings of the NIPS 2017 Workshop on Visually-Grounded Interaction and Language (ViGIL), December 2017.
  20. Natural-Language Video Description with Deep Recurrent Neural Networks
    [Details] [PDF] [Slides (PDF)]
    Subhashini Venugopalan
    PhD Thesis, Department of Computer Science, The University of Texas at Austin, August 2017.
  21. Using Explanations to Improve Ensembling of Visual Question Answering Systems
    [Details] [PDF] [Poster]
    Nazneen Fatema Rajani and Raymond J. Mooney
In Proceedings of the IJCAI 2017 Workshop on Explainable Artificial Intelligence (XAI), 43--47, Melbourne, Australia, August 2017.
  22. Multi-Modal Word Synset Induction
    [Details] [PDF]
    Jesse Thomason and Raymond J. Mooney
    In Proceedings of the 26th International Joint Conference on Artificial Intelligence (IJCAI-17), 4116--4122, Melbourne, Australia, 2017.
  23. Captioning Images with Diverse Objects
    [Details] [PDF] [Slides (PDF)] [Poster]
Subhashini Venugopalan, Lisa Anne Hendricks, Marcus Rohrbach, Raymond Mooney, Trevor Darrell, Kate Saenko
    In Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR-17), 5753--5761, 2017.
  24. Stacking With Auxiliary Features: Improved Ensembling for Natural Language and Vision
    [Details] [PDF] [Slides (PDF)]
    Nazneen Fatema Rajani
PhD Proposal, Department of Computer Science, The University of Texas at Austin, November 2016.
  25. Improving LSTM-based Video Description with Linguistic Knowledge Mined from Text
    [Details] [PDF] [Poster]
Subhashini Venugopalan, Lisa Anne Hendricks, Raymond Mooney, Kate Saenko
    In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP-16), 1961--1966, Austin, Texas, 2016.
  26. Deep Compositional Captioning: Describing Novel Object Categories without Paired Training Data
    [Details] [PDF]
Lisa Anne Hendricks, Subhashini Venugopalan, Marcus Rohrbach, Raymond Mooney, Kate Saenko, Trevor Darrell
    In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR-16), 1--10, 2016.
  27. Natural Language Video Description using Deep Recurrent Neural Networks
    [Details] [PDF] [Slides (PDF)]
    Subhashini Venugopalan
PhD Proposal, Department of Computer Science, The University of Texas at Austin, November 2015.
  28. Sequence to Sequence -- Video to Text
    [Details] [PDF]
Subhashini Venugopalan, Marcus Rohrbach, Jeff Donahue, Raymond J. Mooney, Trevor Darrell, Kate Saenko
    In Proceedings of the 2015 International Conference on Computer Vision (ICCV-15), Santiago, Chile, December 2015.
  29. Translating Videos to Natural Language Using Deep Recurrent Neural Networks
    [Details] [PDF] [Slides (PDF)]
Subhashini Venugopalan, Huijuan Xu, Jeff Donahue, Marcus Rohrbach, Raymond Mooney, Kate Saenko
In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics -- Human Language Technologies (NAACL HLT 2015), 1494--1504, Denver, Colorado, June 2015.
  30. Integrating Language and Vision to Generate Natural Language Descriptions of Videos in the Wild
    [Details] [PDF]
Jesse Thomason, Subhashini Venugopalan, Sergio Guadarrama, Kate Saenko, Raymond Mooney
    In Proceedings of the 25th International Conference on Computational Linguistics (COLING 2014), 1218--1227, Dublin, Ireland, August 2014.
  31. Integrating Visual and Linguistic Information to Describe Properties of Objects
    [Details] [PDF]
    Calvin MacKenzie
Undergraduate Honors Thesis, Department of Computer Science, The University of Texas at Austin, 2014.
  32. YouTube2Text: Recognizing and Describing Arbitrary Activities Using Semantic Hierarchies and Zero-shot Recognition
    [Details] [PDF] [Poster]
    Sergio Guadarrama, Niveda Krishnamoorthy, Girish Malkarnenkar, Subhashini Venugopalan, Raymond Mooney, Trevor Darrell, Kate Saenko
    In Proceedings of the 14th International Conference on Computer Vision (ICCV-2013), 2712--2719, Sydney, Australia, December 2013.
  33. A Multimodal LDA Model Integrating Textual, Cognitive and Visual Modalities
    [Details] [PDF]
    Stephen Roller and Sabine Schulte im Walde
    In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP 2013), 1146--1157, Seattle, WA, October 2013.
  34. Generating Natural-Language Video Descriptions Using Text-Mined Knowledge
    [Details] [PDF] [Slides (PPT)]
    Niveda Krishnamoorthy, Girish Malkarnenkar, Raymond J. Mooney, Kate Saenko, Sergio Guadarrama
    In Proceedings of the NAACL HLT Workshop on Vision and Language (WVL '13), 10--19, Atlanta, Georgia, July 2013.
  35. Generating Natural-Language Video Descriptions Using Text-Mined Knowledge
    [Details] [PDF] [Slides (PPT)]
    Niveda Krishnamoorthy, Girish Malkarnenkar, Raymond J. Mooney, Kate Saenko, Sergio Guadarrama
    In Proceedings of the 27th AAAI Conference on Artificial Intelligence (AAAI-2013), 541--547, July 2013.
  36. Improving Video Activity Recognition using Object Recognition and Text Mining
    [Details] [PDF] [Slides (PPT)]
    Tanvi S. Motwani and Raymond J. Mooney
    In Proceedings of the 20th European Conference on Artificial Intelligence (ECAI-2012), 600--605, August 2012.
  37. Using Closed Captions as Supervision for Video Activity Recognition
    [Details] [PDF]
    Sonal Gupta, Raymond J. Mooney
    In Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence (AAAI-2010), 1083--1088, Atlanta, GA, July 2010.
  38. Activity Retrieval in Closed Captioned Videos
    [Details] [PDF]
    Sonal Gupta
Master's Thesis, Department of Computer Sciences, The University of Texas at Austin, August 2009. 64 pages.
  39. Using Closed Captions to Train Activity Recognizers that Improve Video Retrieval
    [Details] [PDF]
    Sonal Gupta and Raymond Mooney
    In Proceedings of the CVPR-09 Workshop on Visual and Contextual Learning from Annotated Images and Videos (VCL), Miami, FL, June 2009.
  40. Watch, Listen & Learn: Co-training on Captioned Images and Videos
    [Details] [PDF]
    Sonal Gupta, Joohyun Kim, Kristen Grauman and Raymond Mooney
In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD), 457--472, Antwerp, Belgium, September 2008.