Department of Computer Science

Machine Learning Research Group

University of Texas at Austin Artificial Intelligence Lab

Publications: 2019

  1. Self-Critical Reasoning for Robust Visual Question Answering
    [Details] [PDF] [Slides (PDF)] [Poster]
    Jialin Wu and Raymond J. Mooney
    In Proceedings of Neural Information Processing Systems (NeurIPS), December 2019.
    Visual Question Answering (VQA) deep-learning systems tend to capture superficial statistical correlations in the training data because of strong language priors, and fail to generalize to test data with a significantly different question-answer (QA) distribution. To address this issue, we introduce a self-critical training objective that ensures that visual explanations of correct answers match the most influential image regions more than other competitive answer candidates. The influential regions are either determined from human visual/textual explanations or automatically from just significant words in the question and answer. We evaluate our approach on the VQA generalization task using the VQA-CP dataset, achieving a new state of the art: 49.5% using textual explanations and 48.5% using automatically annotated regions. (A toy sketch of this objective follows the entry.)
    ML ID: 380
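    The following is a minimal, hypothetical PyTorch sketch of the kind of self-critical objective the abstract describes; the model interface, the gradient-based influence measure, and all names are illustrative assumptions, not the authors' released code.

    ```python
    import torch

    def influence(model, region_feats, question, answer_idx):
        """Gradient-based influence of each image region on one answer's score."""
        feats = region_feats.clone().requires_grad_(True)
        score = model(feats, question)[answer_idx]   # scalar logit for this answer
        (grad,) = torch.autograd.grad(score, feats, create_graph=True)
        return (grad * feats).sum(dim=-1)            # one influence value per region

    def self_critical_loss(model, region_feats, question, correct, rivals, region_mask):
        """Hinge penalty whenever a competing wrong answer is influenced more by
        the human-identified regions (region_mask) than the correct answer is."""
        inf_correct = (influence(model, region_feats, question, correct) * region_mask).sum()
        loss = torch.zeros(())
        for rival in rivals:  # top-scoring incorrect answer candidates
            inf_rival = (influence(model, region_feats, question, rival) * region_mask).sum()
            loss = loss + torch.relu(inf_rival - inf_correct)
        return loss
    ```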
  2. Hidden State Guidance: Improving Image Captioning Using an Image Conditioned Autoencoder
    [Details] [PDF] [Poster]
    Jialin Wu and Raymond J. Mooney
    In Proceedings of the Visually Grounded Interaction and Language Workshop at NeurIPS 2019, December 2019.
    Most RNN-based image captioning models receive supervision on the output words to mimic human captions. Therefore, the hidden states can only receive noisy gradient signals via layers of back-propagation through time, leading to less accurate generated captions. Consequently, we propose a novel framework, Hidden State Guidance (HSG), that matches the hidden states in the caption decoder to those in a teacher decoder trained on the easier task of autoencoding the captions conditioned on the image. During training with the REINFORCE algorithm, the conventional rewards are sentence-based evaluation metrics distributed equally to each generated word, regardless of its relevance. HSG instead provides a word-level reward that helps the model learn better hidden representations. Experimental results demonstrate that HSG clearly outperforms various state-of-the-art caption decoders using raw images, detected objects, or scene graph features as inputs. (An illustrative sketch of the word-level signal follows the entry.)
    ML ID: 379
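    As a rough illustration only, here is how the word-level signal could look in PyTorch; the teacher states, tensor shapes, and the integration with the sentence-level reward are assumptions, not the paper's actual implementation.

    ```python
    import torch.nn.functional as F

    def hsg_word_rewards(student_states, teacher_states):
        """One reward per generated word: larger when the caption decoder's hidden
        state is closer to the teacher autoencoder's state at the same step.
        Both tensors: (seq_len, hidden_dim)."""
        dist = F.mse_loss(student_states, teacher_states, reduction="none").mean(dim=-1)
        return -dist  # small distance -> large word-level reward

    # These per-word rewards would be added to the usual sentence-level REINFORCE
    # reward (e.g., CIDEr), which by itself is spread uniformly over all words.
    ```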
  3. Optimal Use of Verbal Instructions for Multi-Robot Human Navigation Guidance
    [Details] [PDF] [Slides (PDF)] [Video]
    Harel Yedidsion, Jacqueline Deans, Connor Sheehan, Mahathi Chillara, Justin Hart, Peter Stone, and Raymond J. Mooney
    In Proceedings of the Eleventh International Conference on Social Robotics, 133-143, 2019. Springer.
    Efficiently guiding humans in indoor environments is a challenging open problem. Due to recent advances in mobile robotics and natural language processing, it is now possible to consider doing so with the help of mobile, verbally communicating robots. In the past, stationary verbal robots have been used for this purpose at Microsoft Research, and mobile non-verbal robots have been used at UT Austin in its multi-robot human guidance system. This paper extends that mobile multi-robot human guidance research by adding natural language instructions, which are dynamically generated based on the robots’ path planner, and by implementing and testing the system on real robots. Generating natural language instructions from the robots’ plan opens up a variety of optimization opportunities, such as deciding where to place the robots, where to lead humans, and where to verbally instruct them. We present experimental results of the full multi-robot human guidance system and show that it is more effective than two baseline systems: one which only provides humans with verbal instructions, and another which only uses a single robot to lead users to their destinations.
    ML ID: 378
  4. A Framework for Writing Trigger-Action Todo Comments in Executable Format
    [Details] [PDF] [Slides (PPT)]
    Pengyu Nie, Rishabh Rai, Junyi Jessy Li, Sarfraz Khurshid, Raymond J. Mooney, and Milos Gligoric
    In Proceedings of the ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE), Tallinn, Estonia, August 2019. Distinguished Paper Award.
    Natural language elements, e.g., todo comments, are frequently used to communicate among developers and to describe tasks that need to be performed (actions) when specific conditions hold on artifacts related to the code repository (triggers), e.g., from the Apache Struts project: “remove expectedJDK15 and if() after switching to Java 1.6”. As projects evolve, development processes change, and development teams reorganize, these comments, because of their informal nature, frequently become irrelevant or forgotten. We present the first framework, dubbed TrigIt, to specify trigger-action todo comments in executable format, so that actions are executed automatically when their triggers evaluate to true. TrigIt specifications are written in the host language (e.g., Java) and are evaluated as part of the build process. Triggers are specified as query statements over abstract syntax trees, abstract representations of build configuration scripts, issue tracking systems, and system clock time. Actions are either notifications to developers or code transformation steps. We implemented TrigIt for the Java programming language and migrated 44 existing trigger-action comments from several popular open-source projects. A user study showed that users find TrigIt easy to learn and use. TrigIt has the potential to enforce more discipline in writing and maintaining comments in large code repositories. (A toy analog of the trigger-action idea follows the entry.)
    ML ID: 377
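    TrigIt itself targets Java and expresses triggers as queries over ASTs, build scripts, issue trackers, and the system clock; the tiny Python analog below is only meant to convey the trigger-action idea and uses invented names throughout.

    ```python
    import sys
    import warnings

    def todo(trigger, message):
        """A trigger-action todo: evaluated at build/test time, firing the action
        (here, a warning to developers) once the trigger evaluates to true."""
        if trigger():
            warnings.warn(f"TODO is now actionable: {message}")

    # Analog of the Apache Struts example "remove expectedJDK15 and if() after
    # switching to Java 1.6", restated for a hypothetical Python project:
    todo(trigger=lambda: sys.version_info >= (3, 12),
         message="remove the legacy JSON fallback after switching to Python 3.12")
    ```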
  5. Using Natural Language for Reward Shaping in Reinforcement Learning
    [Details] [PDF] [Slides (PDF)] [Poster]
    Prasoon Goyal, Scott Niekum, and Raymond J. Mooney
    In Proceedings of the 28th International Joint Conference on Artificial Intelligence (IJCAI), Macao, China, August 2019.
    Recent reinforcement learning (RL) approaches have shown strong performance in complex domains such as Atari games, but are often highly sample inefficient. A common approach to reduce interaction time with the environment is to use reward shaping, which involves carefully designing reward functions that provide the agent with intermediate rewards for progress towards the goal. However, designing appropriate shaping rewards is known to be difficult as well as time-consuming. In this work, we address this problem by using natural language instructions to perform reward shaping. We propose the LanguagE-Action Reward Network (LEARN), a framework that maps free-form natural language instructions to intermediate rewards based on actions taken by the agent. These intermediate language-based rewards can seamlessly be integrated into any standard reinforcement learning algorithm. We experiment with Montezuma’s Revenge from the Arcade Learning Environment, a popular benchmark in RL. Our experiments on a diverse set of 15 tasks demonstrate that, for the same number of interactions with the environment, language-based rewards lead to successful task completion 60% more often on average than learning without language. (A minimal sketch of the shaping step follows the entry.)
    ML ID: 376
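    A minimal sketch of the shaping step, assuming a trained language_net that scores how well the agent's recent behavior matches the instruction; the network, the weight lam, and all names are illustrative stand-ins, not LEARN's released code.

    ```python
    def shaped_reward(env_reward, recent_actions, instruction, language_net, lam=0.1):
        """Augment the (often sparse) environment reward with an intermediate
        language-based reward; the sum drops into any standard RL algorithm."""
        return env_reward + lam * language_net(recent_actions, instruction)

    # Inside an ordinary training loop, the agent would simply be updated with
    # shaped_reward(...) in place of the raw environment reward.
    ```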
  6. Generating Question Relevant Captions to Aid Visual Question Answering
    [Details] [PDF] [Slides (PPT)]
    Jialin Wu, Zeyuan Hu, and Raymond J. Mooney
    In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL), Florence, Italy, August 2019.
    Visual question answering (VQA) and image captioning require a shared body of general knowledge connecting language and vision. We present a novel approach to improve VQA performance that exploits this connection by jointly generating captions that are targeted to help answer a specific visual question. The model is trained using an existing caption dataset by automatically determining question-relevant captions with an online gradient-based method. Experimental results on the VQA v2 challenge demonstrate that our approach obtains state-of-the-art VQA performance (e.g., 68.4% on the Test-standard set using a single model) by simultaneously generating question-relevant captions. (An illustrative sketch of the gradient-based selection follows the entry.)
    ML ID: 375
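    One plausible reading of the online gradient-based selection, sketched in PyTorch: keep the caption whose captioning-loss gradient on the shared parameters points in the same direction as the VQA answer loss. The exact criterion in the paper may differ, and every name here is an assumption.

    ```python
    import torch

    def caption_relevance(shared_params, vqa_loss, caption_loss):
        """Cosine similarity between the two losses' gradients on shared parameters."""
        g_vqa = torch.autograd.grad(vqa_loss, shared_params, retain_graph=True)
        g_cap = torch.autograd.grad(caption_loss, shared_params, retain_graph=True)

        def _flat(grads):
            return torch.cat([g.reshape(-1) for g in grads])

        return torch.cosine_similarity(_flat(g_vqa), _flat(g_cap), dim=0)

    # Training would then keep, per image, the highest-scoring candidate caption
    # as an additional supervision target for the caption decoder.
    ```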
  7. Faithful Multimodal Explanation for Visual Question Answering
    [Details] [PDF] [Slides (PPT)]
    Jialin Wu and Raymond J. Mooney
    In Proceedings of the Second BlackboxNLP Workshop at ACL, 103-112, Florence, Italy, August 2019.
    AI systems’ ability to explain their reasoning is critical to their utility and trustworthiness. Deep neural networks have enabled significant progress on many challenging problems such as visual question answering (VQA). However, most of them are opaque black boxes with limited explanatory capability. This paper presents a novel approach to developing a high-performing VQA system that can elucidate its answers with integrated textual and visual explanations that faithfully reflect important aspects of its underlying reasoning process while capturing the style of comprehensible human explanations. Extensive experimental evaluation demonstrates the advantages of this approach compared to competing methods using both automated metrics and human evaluation.
    ML ID: 374
  8. Do Human Rationales Improve Machine Explanations?
    [Details] [PDF] [Poster]
    Julia Strout, Ye Zhang, and Raymond J. Mooney
    In Proceedings of the Second BlackboxNLP Workshop at ACL, 56-62, Florence, Italy, August 2019.
    Work on “learning with rationales” shows that humans providing explanations to a machine learning system can improve the system’s predictive accuracy. However, this work has not been connected to work in “explainable AI,” which concerns machines explaining their reasoning to humans. In this work, we show that learning with rationales can also improve the quality of the machine’s explanations as evaluated by human judges. Specifically, we present experiments showing that, for CNN-based text classification, explanations generated using “supervised attention” are judged superior to explanations generated using normal unsupervised attention. (A brief sketch of supervised attention follows the entry.)
    ML ID: 373
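    A minimal PyTorch sketch of supervised attention as described: the model's attention over tokens is pulled toward a human rationale mask by an auxiliary loss added to the usual classification loss. The KL form and all names are assumptions.

    ```python
    import torch.nn.functional as F

    def supervised_attention_loss(attn_logits, rationale_mask, alpha=1.0):
        """attn_logits: (seq_len,) unnormalized attention scores.
        rationale_mask: (seq_len,) with 1.0 on tokens a human marked as evidence."""
        target = rationale_mask / rationale_mask.sum()   # rationale as a distribution
        log_attn = F.log_softmax(attn_logits, dim=0)
        return alpha * F.kl_div(log_attn, target, reduction="sum")

    # total_loss = classification_loss + supervised_attention_loss(logits, mask)
    ```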
  9. AInix: An open platform for natural language interfaces to shell commands
    [Details] [PDF]
    David Gros
    May 2019. Undergraduate Honors Thesis, Computer Science Department, University of Texas at Austin.
    This report discusses initial work on the AInix Platform, which is designed to let developers add natural language interfaces to Unix-like shell commands. It can be used with the aish shell, which allows users to intermix natural language with shell commands. We create a high-level way of specifying semantic parsing grammars and collect a dataset of basic shell commands. We experiment with seq2seq models, abstract syntax networks (ASNs), and embedding-based nearest-neighbor models, and find the highest accuracy is achieved with the seq2seq models and ASNs. While not as accurate, nearest-neighbor models can achieve decent performance when their embedders are pretrained on large-scale code-related text. (A toy sketch of the nearest-neighbor approach follows the entry.)
    ML ID: 372
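    A toy version of the nearest-neighbor idea: embed the user's request with a pretrained text encoder and return the shell command paired with the closest example in the collected dataset. Here embed and dataset are stand-ins, not the AInix platform's actual API.

    ```python
    import numpy as np

    def nearest_command(query, dataset, embed):
        """dataset: list of (utterance, shell_command) pairs;
        embed: maps text to a unit-norm vector (pretrained on code-related text)."""
        q = embed(query)
        sims = [float(np.dot(q, embed(utterance))) for utterance, _ in dataset]
        return dataset[int(np.argmax(sims))][1]

    # e.g., nearest_command("show hidden files too", data, embed) might return "ls -a"
    ```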
  10. Improving Grounded Natural Language Understanding through Human-Robot Dialog
    [Details] [PDF]
    Jesse Thomason, Aishwarya Padmakumar, Jivko Sinapov, Nick Walker, Yuqian Jiang, Harel Yedidsion, Justin Hart, Peter Stone, and Raymond J. Mooney
    In IEEE International Conference on Robotics and Automation (ICRA), Montreal, Canada, May 2019.
    Natural language understanding for robotics can require substantial domain- and platform-specific engineering. For example, for mobile robots to pick and place objects in an environment to satisfy human commands, we can specify the language humans use to issue such commands, and connect concept words like “red can” to physical object properties. One way to alleviate this engineering for a new domain is to enable robots in human environments to adapt dynamically -- continually learning new language constructions and perceptual concepts. In this work, we present an end-to-end pipeline for translating natural language commands to discrete robot actions, and use clarification dialogs to jointly improve language parsing and concept grounding. We train and evaluate this agent in a virtual setting on Amazon Mechanical Turk, and we transfer the learned agent to a physical robot platform to demonstrate it in the real world. (A toy sketch of the dialog loop follows the entry.)
    ML ID: 371
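    Purely as an illustration of the clarification-dialog loop described above; the parser interface, the confidence threshold, and the online update are all invented for this sketch.

    ```python
    def command_dialog(utterance, parser, ask_user, threshold=0.8):
        """Parse a command; while confidence is low, ask for clarification and
        feed the exchange back as training signal for parsing and grounding."""
        action, confidence = parser.parse(utterance)
        while confidence < threshold:
            reply = ask_user(f"I understood: {action}. Can you rephrase your request?")
            parser.add_training_pair(reply, action)   # hypothetical online update
            action, confidence = parser.parse(reply)
        return action
    ```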