Department of Computer Science

Machine Learning Research Group

University of Texas at Austin Artificial Intelligence Lab

Publications: 2018

  1. Explainable Improved Ensembling for Natural Language and Vision
    [Details] [PDF] [Slides (PPT)] [Slides (PDF)]
    Nazneen Rajani
    PhD Thesis, Department of Computer Science, The University of Texas at Austin, July 2018.
    Ensemble methods are well-known in machine learning for improving prediction accuracy. However, they do not adequately discriminate among underlying component models. The measure of how good a model is can sometimes be estimated from “why” it made a specific prediction. We propose a novel approach called Stacking With Auxiliary Features (SWAF) that effectively leverages component models by integrating such relevant information from context to improve ensembling. Using auxiliary features, our algorithm learns to rely on systems that not just agree on an output prediction but also the source or origin of that output. We demonstrate our approach to challenging structured prediction problems in Natural Language Processing and Vision including Information Extraction, Object Detection, and Visual Question Answering. We also present a variant of SWAF for combining systems that do not have training data in an unsupervised ensemble with systems that do have training data. Our combined approach obtains a new state-of-the-art, beating our prior performance on Information Extraction. The state-of-the-art systems on many AI applications are ensembles of deep-learning models. These models are hard to interpret and can sometimes make odd mistakes. Explanations make AI systems more transparent and also justify their predictions. We propose a scalable approach to generate visual explanations for ensemble methods using the localization maps of the component systems. Crowdsourced human evaluation on two new metrics indicates that our ensemble’s explanation significantly qualitatively outperforms individual systems’ explanations.
    ML ID: 363
  2. Joint Image Captioning and Question Answering
    [Details] [PDF] [Poster]
    Jialin Wu, Zeyuan Hu and Raymond J. Mooney
    In VQA Challenge and Visual Dialog Workshop at the 31st IEEE Conference on Computer Vision and Pattern Recognition (CVPR-18) , 2018.
    Answering visual questions need acquire daily common knowledge and model the semantic connection among different parts in images, which is too difficult for VQA systems to learn from images with the only supervision from answers. Meanwhile, image captioning systems with beam search strategy tend to generate similar captions and fail to diversely describe images. To address the aforementioned issues, we present a system to have these two tasks compensate with each other, which is capable of jointly producing image captions and answering visual questions. In particular, we utilize question and image features to generate question-related captions and use the generated captions as additional features to provide new knowledge to the VQA system. For image captioning, our system attains more informative results in term of the relative improvements on VQA tasks as well as competitive results using automated metrics. Applying our system to the VQA tasks, our results on VQA v2 dataset achieve 65.8% using generated captions and 69.1% using annotated captions in validation set and 68.4% in the test-standard set. Further, an ensemble of 10 models results in 69.7% in the test-standard split.
    ML ID: 362
  3. Continually Improving Grounded Natural Language Understanding through Human-Robot Dialog
    [Details] [PDF] [Slides (PPT)] [Slides (PDF)]
    Jesse Thomason
    PhD Thesis, Department of Computer Science, The University of Texas at Austin, April 2018.
    As robots become ubiquitous in homes and workplaces such as hospitals and factories, they must be able to communicate with humans. Several kinds of knowledge are required to understand and respond to a human's natural language commands and questions. If a person requests an assistant robot to "take me to Alice's office", the robot must know that Alice is a person who owns some unique office, and that "take me" means it should navigate there. Similarly, if a person requests "bring me the heavy, green mug", the robot must have accurate mental models of the physical concepts "heavy", "green", and "mug". To avoid forcing humans to use key phrases or words robots already know, this thesis focuses on helping robots understanding new language constructs through interactions with humans and with the world around them. To understand a command in natural language, a robot must first convert that command to an internal representation that it can reason with. Semantic parsing is a method for performing this conversion, and the target representation is often semantic forms represented as predicate logic with lambda calculus. Traditional semantic parsing relies on hand-crafted resources from a human expert: an ontology of concepts, a lexicon connecting language to those concepts, and training examples of language with abstract meanings. One thrust of this thesis is to perform semantic parsing with sparse initial data. We use the conversations between a robot and human users to induce pairs of natural language utterances with the target semantic forms a robot discovers through its questions, reducing the annotation effort of creating training examples for parsing. We use this data to build more dialog-capable robots in new domains with much less expert human effort. Meanings of many language concepts are bound to the physical world. Understanding object properties and categories, such as "heavy", "green", and "mug" requires interacting with and perceiving the physical world. Embodied robots can use manipulation capabilities, such as pushing, picking up, and dropping objects to gather sensory data about them. This data can be used to understand non-visual concepts like "heavy" and "empty" (e.g. "get the empty carton of milk from the fridge"), and assist with concepts that have both visual and non-visual expression (e.g. "tall" things look big and also exert force sooner than "short" things when pressed down on). A second thrust of this thesis focuses on strategies for learning these concepts using multi-modal sensory information. We use human-in-the-loop learning to get labels between concept words and actual objects in the environment. We also explore ways to tease out polysemy and synonymy in concept words such as "light", which can refer to a weight or a color, the latter sense being synonymous with "pale". Additionally, pushing, picking up, and dropping objects to gather sensory information is prohibitively time-consuming, so we investigate strategies for using linguistic information and human input to expedite exploration when learning a new concept. Finally, we build an integrated agent with both parsing and perception capabilities that learns from conversations with users to improve both components over time. We demonstrate that parser learning from conversations can be combined with multi-modal perception using predicate-object labels gathered through opportunistic active learning during those conversations to improve performance for understanding natural language commands from humans. Human users also qualitatively rate this integrated learning agent as more usable after it has improved from conversation-based learning.
    ML ID: 361
  4. Stacking With Auxiliary Features for Visual Question Answering
    [Details] [PDF] [Poster]
    Nazneen Fatema Rajani, Raymond J. Mooney
    In Proceedings of the 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2217-2226, 2018.
    Visual Question Answering (VQA) is a well-known and challenging task that requires systems to jointly reason about natural language and vision. Deep learning models in various forms have been the standard for solving VQA. However, some of these VQA models are better at certain types of image-question pairs than other models. Ensembling VQA models intelligently to leverage their diverse expertise is, therefore, advantageous. Stacking With Auxiliary Features (SWAF) is an intelligent ensembling technique which learns to combine the results of multiple models using features of the current problem as context. We propose four categories of auxiliary features for ensembling for VQA. Three out of the four categories of features can be inferred from an image-question pair and do not require querying the component models. The fourth category of auxiliary features uses model-specific explanations. In this paper, we describe how we use these various categories of auxiliary features to improve performance for VQA. Using SWAF to effectively ensemble three recent systems, we obtain a new state-of-the-art. Our work also highlights the advantages of explainable AI models.
    ML ID: 360
  5. Natural Language Processing and Program Analysis for Supporting Todo Comments as Software Evolves
    [Details] [PDF]
    Pengyu Nie, Junyi Jessy Li, Sarfraz Khurshid, Raymond Mooney, Milos Gligoric
    In In Proceedings of the AAAI Workshop on NLP for Software Engineering, February 2018.
    Natural language elements (e.g., API comments, todo comments) form a substantial part of software repositories. While developers routinely use many natural language elements (e.g., todo comments) for communication, the semantic content of these elements is often neglected by software engineering techniques and tools. Additionally, as software evolves and development teams re-organize, these natural language elements are frequently forgotten, or just become outdated, imprecise and irrelevant. We envision several techniques, which combine natural language processing and program analysis, to help developers maintain their todo comments. Specifically, we propose techniques to synthesize code from comments, make comments executable, answer questions in comments, improve comment quality, and detect dangling comments.
    ML ID: 358
  6. Guiding Exploratory Behaviors for Multi-Modal Grounding of Linguistic Descriptions
    [Details] [PDF] [Slides (PDF)]
    Jesse Thomason, Jivko Sinapov, Raymond Mooney, Peter Stone
    In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18) , February 2018.
    A major goal of grounded language learning research is to enable robots to connect language predicates to a robot’s physical interactive perception of the world. Coupling object exploratory behaviors such as grasping, lifting, and looking with multiple sensory modalities (e.g., audio, haptics, and vision) enables a robot to ground non-visual words like “heavy” as well as visual words like “red”. A major limitation of existing approaches to multi-modal language grounding is that a robot has to exhaustively explore training objects with a variety of actions when learning a new such language predicate. This paper proposes a method for guiding a robot’s behavioral exploration policy when learning a novel predicate based on known grounded predicates and the novel predicate’s linguistic relationship to them. We demonstrate our approach on two datasets in which a robot explored large sets of objects and was tasked with learning to recognize whether novel words applied to those objects.
    ML ID: 357
  7. Multi-modal Predicate Identification using Dynamically Learned Robot Controllers
    [Details] [PDF]
    Saeid Amiri and Suhua Wei and Shiqi Zhang and Jivko Sinapov and Jesse Thomason and Peter Stone
    In Proceedings of the 27th International Joint Conference on Artificial Intelligence (IJCAI-18), Stockholm, Sweden, July 2018.
    Intelligent robots frequently need to explore the objects in their working environments. Modern sensors have enabled robots to learn object properties via perception of multiple modalities. However, object exploration in the real world poses a challenging trade-off between information gains and exploration action costs. Mixed observability Markov decision process (MOMDP) is a framework for planning under uncertainty, while accounting for both fully and partially observable components of the state. Robot perception frequently has to face such mixed observability. This work enables a robot equipped with an arm to dynamically construct query-oriented MOMDPs for multi-modal predicate identification (MPI) of objects. The robot's behavioral policy is learned from two datasets collected using real robots. Our approach enables a robot to explore object properties in a way that is significantly faster while improving accuracies in comparison to existing methods that rely on hand-coded exploration strategies.