UTCS Artificial Intelligence
Language and Vision
To truly understand language, an intelligent system must be able to connect words, phrases, and sentences to its perception of objects and events in the world. Vision is the primary source of such perception, and grounding language in vision is an important AI problem with many applications. Our group has focused particularly on automated video captioning: producing natural-language descriptions of short video clips using both graphical models and deep neural networks.
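To give a flavor of the captioning setup described above, here is a minimal toy sketch (not any of the group's actual models): per-frame visual features are pooled into a single video-level vector, and a decoder then emits words one at a time. The vocabulary, weights, and helper names (`mean_pool`, `greedy_decode`) are all invented for illustration; real systems use learned convolutional features and recurrent or transformer decoders.

```python
import numpy as np

# Toy video-captioning sketch: pool frame features, then greedily decode words.
# All weights and the vocabulary are random/invented -- illustration only.

rng = np.random.default_rng(0)
VOCAB = ["<bos>", "<eos>", "a", "person", "is", "playing", "guitar"]
FEAT_DIM = 8

def mean_pool(frame_features):
    """Collapse a (num_frames, FEAT_DIM) array into one video-level vector."""
    return frame_features.mean(axis=0)

def greedy_decode(video_vec, W, max_len=5):
    """Pick the highest-scoring word at each step, conditioned on the video
    vector and the previous word index (a crude stand-in for an RNN state)."""
    words, prev = [], VOCAB.index("<bos>")
    for _ in range(max_len):
        scores = W @ np.concatenate([video_vec, [prev / len(VOCAB)]])
        prev = int(np.argmax(scores))
        if VOCAB[prev] == "<eos>":
            break
        words.append(VOCAB[prev])
    return " ".join(words)

frames = rng.normal(size=(16, FEAT_DIM))         # 16 fake frame feature vectors
W = rng.normal(size=(len(VOCAB), FEAT_DIM + 1))  # toy decoder weights
print(greedy_decode(mean_pool(frames), W))
```

The mean-pooling step discards temporal order; sequence-to-sequence models such as those in the publications below instead encode the frames as a sequence before decoding.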
People
Tong Gao
Masters Alumni
gaotong [at] utexas edu
Sonal Gupta
Masters Alumni
sonal [at] cs stanford edu
Niveda Krishnamoorthy
Masters Alumni
niveda [at] cs utexas edu
Calvin MacKenzie
Undergraduate Alumni
calvinm mackenzie [at] utexas edu
Tanvi S Motwani
Masters Alumni
tanvi [at] cs utexas edu
Nazneen Rajani
Ph.D. Alumni
nrajani [at] cs utexas edu
Stephen Roller
Ph.D. Alumni
roller [at] cs utexas edu
Subhashini Venugopalan
Ph.D. Alumni
vsub [at] cs utexas edu
Jialin Wu
Ph.D. Alumni
jialinwu [at] utexas edu
Harel Yedidsion
Postdoctoral Fellow
harel [at] cs utexas edu
Publications
Entity-Focused Dense Passage Retrieval for Outside-Knowledge Visual Question Answering
2022
Jialin Wu, Raymond Mooney, In
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP)
, December 2022.
Incorporating External Information for Visual Question Answering
2022
Jialin Wu, PhD Thesis, Department of Computer Science, UT Austin.
Multi-Modal Answer Validation for Knowledge-Based VQA
2022
Jialin Wu, Jiasen Lu, Ashish Sabharwal, Roozbeh Mottaghi,
Proceedings of the AAAI Conference on Artificial Intelligence
(2022).
Towards Automated Error Analysis: Learning to Characterize Errors
2022
Tong Gao, Shivang Singh, Raymond J. Mooney,
Short version appears in the 19th International Florida Artificial Intelligence Research Society Conference (FLAIRS)
(2022).
Using Natural Language to Aid Task Specification in Sequential Decision Making Problems
2022
Prasoon Goyal, PhD Thesis, Department of Computer Science, UT Austin.
Zero-shot Video Moment Retrieval With Off-the-Shelf Models
2022
Anuj Diwan, Puyuan Peng, Raymond J. Mooney, In
Workshop on Transfer Learning for Natural Language Processing at NeurIPS 2022
, December 2022.
Dialog Policy Learning for Joint Clarification and Active Learning Queries
2021
Aishwarya Padmakumar, Raymond J. Mooney, In
The AAAI Conference on Artificial Intelligence (AAAI)
, February 2021.
Improving VQA and its Explanations by Comparing Competing Explanations
2021
Jialin Wu, Liyan Chen, Raymond J. Mooney, In
The AAAI Conference on Artificial Intelligence (AAAI), Explainable Agency in Artificial Intelligence Workshop
, arXiv:2006.15631, February 2021.
Incorporating Textual Resources to Improve Visual Question Answering
2021
Jialin Wu, Ph.D. Proposal.
Dialog as a Vehicle for Lifelong Learning of Grounded Language Understanding Systems
2020
Aishwarya Padmakumar, PhD Thesis, Department of Computer Science, The University of Texas at Austin.
Faithful Multimodal Explanation for Visual Question Answering
2019
Jialin Wu and Raymond J. Mooney, In
Proceedings of the Second BlackboxNLP Workshop at ACL
, pp. 103-112, Florence, Italy, August 2019.
Generating Question Relevant Captions to Aid Visual Question Answering
2019
Jialin Wu, Zeyuan Hu, Raymond J. Mooney, In
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL)
, Florence, Italy, August 2019.
Hidden State Guidance: Improving Image Captioning using An Image Conditioned Autoencoder
2019
Jialin Wu and Raymond J. Mooney, In
Proceedings of the Visually Grounded Interaction and Language Workshop at NeurIPS 2019
, December 2019.
Self-Critical Reasoning for Robust Visual Question Answering
2019
Jialin Wu and Raymond J. Mooney, In
Proceedings of Neural Information Processing Systems (NeurIPS)
, December 2019.
Explainable Improved Ensembling for Natural Language and Vision
2018
Nazneen Rajani, PhD Thesis, Department of Computer Science, The University of Texas at Austin.
Joint Image Captioning and Question Answering
2018
Jialin Wu, Zeyuan Hu, and Raymond J. Mooney, In
VQA Challenge and Visual Dialog Workshop at the 31st IEEE Conference on Computer Vision and Pattern Recognition (CVPR-18)
, June 2018.
Learning a Policy for Opportunistic Active Learning
2018
Aishwarya Padmakumar, Peter Stone, Raymond J. Mooney, In
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP-18)
, Brussels, Belgium, November 2018.
Stacking With Auxiliary Features for Visual Question Answering
2018
Nazneen Fatema Rajani, Raymond J. Mooney, In
Proceedings of the 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
, pp. 2217-2226, 2018.
Captioning Images with Diverse Objects
2017
Subhashini Venugopalan, Lisa Anne Hendricks, Marcus Rohrbach, Raymond Mooney, Trevor Darrell, and Kate Saenko, In
Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR-17)
, pp. 5753--5761, 2017.
Ensembling Visual Explanations for VQA
2017
Nazneen Fatema Rajani, Raymond J. Mooney, In
Proceedings of the NIPS 2017 workshop on Visually-Grounded Interaction and Language (ViGIL)
, December 2017.
Multi-Modal Word Synset Induction
2017
Jesse Thomason and Raymond J. Mooney, In
Proceedings of the 26th International Joint Conference on Artificial Intelligence (IJCAI-17)
, pp. 4116--4122, Melbourne, Australia, 2017.
Natural-Language Video Description with Deep Recurrent Neural Networks
2017
Subhashini Venugopalan, PhD Thesis, Department of Computer Science, The University of Texas at Austin.
Using Explanations to Improve Ensembling of Visual Question Answering Systems
2017
Nazneen Fatema Rajani and Raymond J. Mooney, In
Proceedings of the IJCAI 2017 Workshop on Explainable Artificial Intelligence (XAI)
, pp. 43-47, Melbourne, Australia, August 2017.
Deep Compositional Captioning: Describing Novel Object Categories without Paired Training Data
2016
Lisa Anne Hendricks, Subhashini Venugopalan, Marcus Rohrbach, Raymond Mooney, Kate Saenko, and Trevor Darrell, In
Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR-16)
, pp. 1--10, 2016.
Improving LSTM-based Video Description with Linguistic Knowledge Mined from Text
2016
Subhashini Venugopalan, Lisa Anne Hendricks, Raymond Mooney, and Kate Saenko, In
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP-16)
, pp. 1961--1966, Austin, Texas, 2016.
Stacking With Auxiliary Features: Improved Ensembling for Natural Language and Vision
2016
Nazneen Fatema Rajani, PhD proposal, Department of Computer Science, The University of Texas at Austin.
Natural Language Video Description using Deep Recurrent Neural Networks
2015
Subhashini Venugopalan, PhD proposal, Department of Computer Science, The University of Texas at Austin.
Sequence to Sequence -- Video to Text
2015
Subhashini Venugopalan, Marcus Rohrbach, Jeff Donahue, Raymond J. Mooney, Trevor Darrell, and Kate Saenko, In
Proceedings of the 2015 International Conference on Computer Vision (ICCV-15)
, Santiago, Chile, December 2015.
Translating Videos to Natural Language Using Deep Recurrent Neural Networks
2015
Subhashini Venugopalan, Huijuan Xu, Jeff Donahue, Marcus Rohrbach, Raymond Mooney, and Kate Saenko, In
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics -- Human Language Technologies (NAACL HLT 2015)
, pp. 1494--1504, Denver, Colorado, 2015.
Integrating Language and Vision to Generate Natural Language Descriptions of Videos in the Wild
2014
Jesse Thomason, Subhashini Venugopalan, Sergio Guadarrama, Kate Saenko, and Raymond Mooney, In
Proceedings of the 25th International Conference on Computational Linguistics (COLING 2014)
, pp. 1218--1227, Dublin, Ireland, August 2014.
Integrating Visual and Linguistic Information to Describe Properties of Objects
2014
Calvin MacKenzie, Undergraduate Honors Thesis, Computer Science Department, University of Texas at Austin.
A Multimodal LDA Model Integrating Textual, Cognitive and Visual Modalities
2013
Stephen Roller and Sabine Schulte im Walde, In
Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP 2013)
, pp. 1146--1157, Seattle, WA, October 2013.
Generating Natural-Language Video Descriptions Using Text-Mined Knowledge
2013
Niveda Krishnamoorthy, Girish Malkarnenkar, Raymond J. Mooney, Kate Saenko, Sergio Guadarrama, In
Proceedings of the 27th AAAI Conference on Artificial Intelligence (AAAI-2013)
, pp. 541--547, July 2013.
Generating Natural-Language Video Descriptions Using Text-Mined Knowledge
2013
Niveda Krishnamoorthy, Girish Malkarnenkar, Raymond J. Mooney, Kate Saenko, Sergio Guadarrama,
Proceedings of the NAACL HLT Workshop on Vision and Language (WVL '13)
(2013), pp. 10--19.
YouTube2Text: Recognizing and Describing Arbitrary Activities Using Semantic Hierarchies and Zero-shot Recognition
2013
Sergio Guadarrama, Niveda Krishnamoorthy, Girish Malkarnenkar, Subhashini Venugopalan, Raymond Mooney, Trevor Darrell, Kate Saenko, In
Proceedings of the 14th International Conference on Computer Vision (ICCV-2013)
, pp. 2712--2719, Sydney, Australia, December 2013.
Improving Video Activity Recognition using Object Recognition and Text Mining
2012
Tanvi S. Motwani and Raymond J. Mooney, In
Proceedings of the 20th European Conference on Artificial Intelligence (ECAI-2012)
, pp. 600--605, August 2012.
Using Closed Captions as Supervision for Video Activity Recognition
2010
Sonal Gupta, Raymond J. Mooney, In
Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence (AAAI-2010)
, pp. 1083--1088, Atlanta, GA, July 2010.
Activity Retrieval in Closed Captioned Videos
2009
Sonal Gupta, Masters Thesis, Department of Computer Sciences, University of Texas at Austin. 64 pages.
Using Closed Captions to Train Activity Recognizers that Improve Video Retrieval
2009
Sonal Gupta and Raymond Mooney, In
Proceedings of the CVPR-09 Workshop on Visual and Contextual Learning from Annotated Images and Videos (VCL)
, Miami, FL, June 2009.
Watch, Listen & Learn: Co-training on Captioned Images and Videos
2008
Sonal Gupta, Joohyun Kim, Kristen Grauman and Raymond Mooney, In
Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD)
, pp. 457--472, Antwerp, Belgium, September 2008.
Labs
Machine Learning