UTCS Artificial Intelligence
Language and Vision
To truly understand language, an intelligent system must be able to connect words, phrases, and sentences to its perception of objects and events in the world. Vision is the primary source of such perception, and grounding language in vision is an important AI problem with many applications. Our group has focused particularly on automated video captioning: producing natural-language descriptions of short video clips using both graphical models and deep neural networks.
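To give a flavor of the captioning setup described above, here is a minimal toy sketch (not any of the group's actual models): per-frame visual features are pooled into a single video-level vector, and a decoder then emits words one at a time. The vocabulary, weights, and helper names (`mean_pool`, `greedy_decode`) are all invented for illustration; real systems use learned convolutional features and recurrent or transformer decoders.

```python
import numpy as np

# Toy video-captioning sketch: pool frame features, then greedily decode words.
# All weights and the vocabulary are random/invented -- illustration only.

rng = np.random.default_rng(0)
VOCAB = ["<bos>", "<eos>", "a", "person", "is", "playing", "guitar"]
FEAT_DIM = 8

def mean_pool(frame_features):
    """Collapse a (num_frames, FEAT_DIM) array into one video-level vector."""
    return frame_features.mean(axis=0)

def greedy_decode(video_vec, W, max_len=5):
    """Pick the highest-scoring word at each step, conditioned on the video
    vector and the previous word index (a crude stand-in for an RNN state)."""
    words, prev = [], VOCAB.index("<bos>")
    for _ in range(max_len):
        scores = W @ np.concatenate([video_vec, [prev / len(VOCAB)]])
        prev = int(np.argmax(scores))
        if VOCAB[prev] == "<eos>":
            break
        words.append(VOCAB[prev])
    return " ".join(words)

frames = rng.normal(size=(16, FEAT_DIM))         # 16 fake frame feature vectors
W = rng.normal(size=(len(VOCAB), FEAT_DIM + 1))  # toy decoder weights
print(greedy_decode(mean_pool(frames), W))
```

The mean-pooling step discards temporal order; sequence-to-sequence models such as those in the publications below instead encode the frames as a sequence before decoding.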
People
Tong Gao
Masters Alumni
gaotong [at] utexas edu
Sonal Gupta
Masters Alumni
sonal [at] cs stanford edu
Niveda Krishnamoorthy
Masters Alumni
niveda [at] cs utexas edu
Calvin MacKenzie
Undergraduate Alumni
calvinm mackenzie [at] utexas edu
Tanvi S Motwani
Masters Alumni
tanvi [at] cs utexas edu
Nazneen Rajani
Ph.D. Alumni
nrajani [at] cs utexas edu
Stephen Roller
Ph.D. Alumni
roller [at] cs utexas edu
Subhashini Venugopalan
Ph.D. Alumni
vsub [at] cs utexas edu
Jialin Wu
Ph.D. Alumni
jialinwu [at] utexas edu
Harel Yedidsion
Postdoctoral Fellow
harel [at] cs utexas edu
Publications
Entity-Focused Dense Passage Retrieval for Outside-Knowledge Visual Question Answering
2022
Jialin Wu, Raymond Mooney, In
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP)
, December 2022.
Incorporating External Information for Visual Question Answering
2022
Jialin Wu, PhD Thesis, Department of Computer Science, UT Austin.
Multi-Modal Answer Validation for Knowledge-Based VQA
2022
Jialin Wu, Jiasen Lu, Ashish Sabharwal, Roozbeh Mottaghi,
Proceedings of the AAAI Conference on Artificial Intelligence
(2022).
Towards Automated Error Analysis: Learning to Characterize Errors
2022
Tong Gao, Shivang Singh, Raymond J. Mooney,
Short version appears in the 19th International Florida Artificial Intelligence Research Society Conference (FLAIRS)
(2022).
Using Natural Language to Aid Task Specification in Sequential Decision Making Problems
2022
Prasoon Goyal, PhD Thesis, Department of Computer Science, UT Austin.
Zero-shot Video Moment Retrieval With Off-the-Shelf Models
2022
Anuj Diwan, Puyuan Peng, Raymond J. Mooney, In
Workshop on Transfer Learning for Natural Language Processing at NeurIPS 2022
, December 2022.
Dialog Policy Learning for Joint Clarification and Active Learning Queries
2021
Aishwarya Padmakumar, Raymond J. Mooney, In
The AAAI Conference on Artificial Intelligence (AAAI)
, February 2021.
Improving VQA and its Explanations by Comparing Competing Explanations
2021
Jialin Wu, Liyan Chen, Raymond J. Mooney, In
The AAAI Conference on Artificial Intelligence (AAAI), Explainable Agency in Artificial Intelligence Workshop
, arXiv:2006.15631, February 2021.
Incorporating Textual Resources to Improve Visual Question Answering
2021
Jialin Wu, Ph.D. Proposal.
Dialog as a Vehicle for Lifelong Learning of Grounded Language Understanding Systems
2020
Aishwarya Padmakumar, PhD Thesis, Department of Computer Science, The University of Texas at Austin.
Faithful Multimodal Explanation for Visual Question Answering
2019
Jialin Wu and Raymond J. Mooney, In
Proceedings of the Second BlackboxNLP Workshop at ACL
, pp. 103-112, Florence, Italy, August 2019.
Generating Question Relevant Captions to Aid Visual Question Answering
2019
Jialin Wu, Zeyuan Hu, Raymond J. Mooney, In
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL)
, Florence, Italy, August 2019.
Hidden State Guidance: Improving Image Captioning using An Image Conditioned Autoencoder
2019
Jialin Wu and Raymond J. Mooney, In
Proceedings of the Visually Grounded Interaction and Language Workshop at NeurIPS 2019
, December 2019.
Self-Critical Reasoning for Robust Visual Question Answering
2019
Jialin Wu and Raymond J. Mooney, In
Proceedings of Neural Information Processing Systems (NeurIPS)
, December 2019.
Explainable Improved Ensembling for Natural Language and Vision
2018
Nazneen Rajani, PhD Thesis, Department of Computer Science, The University of Texas at Austin.
Joint Image Captioning and Question Answering
2018
Jialin Wu, Zeyuan Hu, and Raymond J. Mooney, In
VQA Challenge and Visual Dialog Workshop at the 31st IEEE Conference on Computer Vision and Pattern Recognition (CVPR-18)
, June 2018.
Learning a Policy for Opportunistic Active Learning
2018
Aishwarya Padmakumar, Peter Stone, Raymond J. Mooney, In
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP-18)
, Brussels, Belgium, November 2018.
Stacking With Auxiliary Features for Visual Question Answering
2018
Nazneen Fatema Rajani, Raymond J. Mooney, In
Proceedings of the 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
, pp. 2217-2226, 2018.
Captioning Images with Diverse Objects
2017
Subhashini Venugopalan, Lisa Anne Hendricks, Marcus Rohrbach, Raymond Mooney, Trevor Darrell, and Kate Saenko, In
Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR-17)
, pp. 5753--5761, 2017.
Ensembling Visual Explanations for VQA
2017
Nazneen Fatema Rajani, Raymond J. Mooney, In
Proceedings of the NIPS 2017 workshop on Visually-Grounded Interaction and Language (ViGIL)
, December 2017.
Multi-Modal Word Synset Induction
2017
Jesse Thomason and Raymond J. Mooney, In
Proceedings of the 26th International Joint Conference on Artificial Intelligence (IJCAI-17)
, pp. 4116--4122, Melbourne, Australia, 2017.
Natural-Language Video Description with Deep Recurrent Neural Networks
2017
Subhashini Venugopalan, PhD Thesis, Department of Computer Science, The University of Texas at Austin.
Using Explanations to Improve Ensembling of Visual Question Answering Systems
2017
Nazneen Fatema Rajani and Raymond J. Mooney, In
Proceedings of the IJCAI 2017 Workshop on Explainable Artificial Intelligence (XAI)
, pp. 43-47, Melbourne, Australia, August 2017.
Deep Compositional Captioning: Describing Novel Object Categories without Paired Training Data
2016
Lisa Anne Hendricks, Subhashini Venugopalan, Marcus Rohrbach, Raymond Mooney, Kate Saenko, and Trevor Darrell, In
Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR-16)
, pp. 1--10, 2016.
Improving LSTM-based Video Description with Linguistic Knowledge Mined from Text
2016
Subhashini Venugopalan, Lisa Anne Hendricks, Raymond Mooney, and Kate Saenko, In
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP-16)
, pp. 1961--1966, Austin, Texas, 2016.
Stacking With Auxiliary Features: Improved Ensembling for Natural Language and Vision
2016
Nazneen Fatema Rajani, PhD proposal, Department of Computer Science, The University of Texas at Austin.
Natural Language Video Description using Deep Recurrent Neural Networks
2015
Subhashini Venugopalan, PhD proposal, Department of Computer Science, The University of Texas at Austin.
Sequence to Sequence -- Video to Text
2015
Subhashini Venugopalan, Marcus Rohrbach, Jeff Donahue, Raymond J. Mooney, Trevor Darrell, and Kate Saenko, In
Proceedings of the 2015 International Conference on Computer Vision (ICCV-15)
, Santiago, Chile, December 2015.
Translating Videos to Natural Language Using Deep Recurrent Neural Networks
2015
Subhashini Venugopalan, Huijuan Xu, Jeff Donahue, Marcus Rohrbach, Raymond Mooney, and Kate Saenko, In
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics -- Human Language Technologies (NAACL HLT 2015)
, pp. 1494--1504, Denver, Colorado, 2015.
Integrating Language and Vision to Generate Natural Language Descriptions of Videos in the Wild
2014
Jesse Thomason, Subhashini Venugopalan, Sergio Guadarrama, Kate Saenko, and Raymond Mooney, In
Proceedings of the 25th International Conference on Computational Linguistics (COLING 2014)
, pp. 1218--1227, Dublin, Ireland, August 2014.
Integrating Visual and Linguistic Information to Describe Properties of Objects
2014
Calvin MacKenzie, Undergraduate Honors Thesis, Computer Science Department, University of Texas at Austin.
A Multimodal LDA Model Integrating Textual, Cognitive and Visual Modalities
2013
Stephen Roller and Sabine Schulte im Walde, In
Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP 2013)
, pp. 1146--1157, Seattle, WA, October 2013.
Generating Natural-Language Video Descriptions Using Text-Mined Knowledge
2013
Niveda Krishnamoorthy, Girish Malkarnenkar, Raymond J. Mooney, Kate Saenko, Sergio Guadarrama, In
Proceedings of the 27th AAAI Conference on Artificial Intelligence (AAAI-2013)
, pp. 541--547, July 2013.
Generating Natural-Language Video Descriptions Using Text-Mined Knowledge
2013
Niveda Krishnamoorthy, Girish Malkarnenkar, Raymond J. Mooney, Kate Saenko, Sergio Guadarrama,
Proceedings of the NAACL HLT Workshop on Vision and Language (WVL '13)
(2013), pp. 10--19.
YouTube2Text: Recognizing and Describing Arbitrary Activities Using Semantic Hierarchies and Zero-shot Recognition
2013
Sergio Guadarrama, Niveda Krishnamoorthy, Girish Malkarnenkar, Subhashini Venugopalan, Raymond Mooney, Trevor Darrell, Kate Saenko, In
Proceedings of the 14th International Conference on Computer Vision (ICCV-2013)
, pp. 2712--2719, Sydney, Australia, December 2013.
Improving Video Activity Recognition using Object Recognition and Text Mining
2012
Tanvi S. Motwani and Raymond J. Mooney, In
Proceedings of the 20th European Conference on Artificial Intelligence (ECAI-2012)
, pp. 600--605, August 2012.
Using Closed Captions as Supervision for Video Activity Recognition
2010
Sonal Gupta, Raymond J. Mooney, In
Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence (AAAI-2010)
, pp. 1083--1088, Atlanta, GA, July 2010.
Activity Retrieval in Closed Captioned Videos
2009
Sonal Gupta, Masters Thesis, Department of Computer Sciences, University of Texas at Austin. 64 pages.
Using Closed Captions to Train Activity Recognizers that Improve Video Retrieval
2009
Sonal Gupta and Raymond Mooney, In
Proceedings of the CVPR-09 Workshop on Visual and Contextual Learning from Annotated Images and Videos (VCL)
, Miami, FL, June 2009.
Watch, Listen & Learn: Co-training on Captioned Images and Videos
2008
Sonal Gupta, Joohyun Kim, Kristen Grauman and Raymond Mooney, In
Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD)
, pp. 457--472, Antwerp, Belgium, September 2008.
Labs
Machine Learning