Continually Improving Grounded Natural Language Understanding through Human-Robot Dialog (2018)
As robots become ubiquitous in homes and workplaces such as hospitals and factories, they must be able to communicate with humans. Several kinds of knowledge are required to understand and respond to a human's natural language commands and questions. If a person asks an assistant robot to "take me to Alice's office", the robot must know that Alice is a person who owns a unique office, and that "take me" means it should navigate there. Similarly, if a person requests "bring me the heavy, green mug", the robot must have accurate mental models of the physical concepts "heavy", "green", and "mug". To avoid forcing humans to use only key phrases or words a robot already knows, this thesis focuses on helping robots understand new language constructs through interactions with humans and with the world around them.

To understand a command in natural language, a robot must first convert that command into an internal representation it can reason with. Semantic parsing performs this conversion, and the target representation is often a semantic form expressed in predicate logic with lambda calculus. Traditional semantic parsing relies on hand-crafted resources from a human expert: an ontology of concepts, a lexicon connecting language to those concepts, and training examples pairing language with abstract meanings. One thrust of this thesis is to perform semantic parsing with sparse initial data. We use conversations between a robot and human users to induce pairs of natural language utterances and the target semantic forms the robot discovers through its questions, reducing the annotation effort of creating training examples for parsing. We use this data to build more dialog-capable robots in new domains with much less expert human effort.

The meanings of many language concepts are bound to the physical world. Understanding object properties and categories such as "heavy", "green", and "mug" requires interacting with and perceiving the physical world. Embodied robots can use manipulation capabilities, such as pushing, picking up, and dropping objects, to gather sensory data about them. This data can be used to understand non-visual concepts like "heavy" and "empty" (e.g. "get the empty carton of milk from the fridge") and to assist with concepts that have both visual and non-visual expression (e.g. "tall" things look big and also exert force sooner than "short" things when pressed down on). A second thrust of this thesis focuses on strategies for learning these concepts from multi-modal sensory information. We use human-in-the-loop learning to gather labels connecting concept words to actual objects in the environment. We also explore ways to tease apart polysemy and synonymy in concept words such as "light", which can refer to a weight or a color, the latter sense being synonymous with "pale". Because pushing, picking up, and dropping objects to gather sensory information is prohibitively time-consuming, we additionally investigate strategies that use linguistic information and human input to expedite exploration when learning a new concept.
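As a rough illustration of this second thrust, the sketch below trains one binary grounding classifier per concept word on multi-modal feature vectors, with labels supplied by a human in the loop. It is a minimal example rather than the thesis's implementation; the ConceptModel class, the feature dimensionality, and the placeholder data are all assumptions.

    # Minimal sketch (not the thesis's implementation): one binary classifier per
    # concept word, trained on multi-modal features gathered through exploratory
    # behaviors, with labels supplied by humans during dialog.
    import numpy as np
    from sklearn.svm import SVC

    class ConceptModel:
        """Binary grounding model for a single concept word, e.g. 'heavy'."""
        def __init__(self, word):
            self.word = word
            self.clf = SVC(kernel="linear", probability=True)
            self.features, self.labels = [], []

        def add_label(self, object_features, is_positive):
            # object_features: concatenated behavior/modality features for one
            # object, e.g. audio from dropping, haptics from lifting, color histograms.
            self.features.append(object_features)
            self.labels.append(1 if is_positive else 0)

        def train(self):
            if len(set(self.labels)) > 1:  # need both positive and negative examples
                self.clf.fit(np.vstack(self.features), np.array(self.labels))

        def predict(self, object_features):
            # probability that the object is an instance of the concept word
            return self.clf.predict_proba(object_features.reshape(1, -1))[0, 1]

    # Hypothetical usage with random placeholder vectors standing in for real
    # sensory features; labels would come from questions like
    # "Would you use the word 'heavy' to describe this object?"
    rng = np.random.default_rng(0)
    heavy = ConceptModel("heavy")
    for _ in range(5):
        heavy.add_label(rng.normal(size=64), is_positive=True)
        heavy.add_label(rng.normal(size=64), is_positive=False)
    heavy.train()
    print(heavy.predict(rng.normal(size=64)))

A linear classifier is used here only for brevity; any probabilistic model over the concatenated behavior-modality features would fit the same interface.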
Finally, we build an integrated agent with both parsing and perception capabilities that learns from conversations with users to improve both components over time. We demonstrate that parser learning from conversations can be combined with multi-modal perception, using predicate-object labels gathered through opportunistic active learning during those conversations, to improve performance on understanding natural language commands from humans. Human users also qualitatively rate this integrated learning agent as more usable after it has improved through conversation-based learning.
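To make the combination concrete, the snippet below shows the kind of data a single successful conversation can yield for both components: an (utterance, semantic form) pair for the parser and predicate-object labels for the perception models. The logical form syntax and the object identifier are hypothetical, not the thesis's exact grammar.

    # Illustrative example (hypothetical logical form syntax and object id):
    # one conversation can produce training data for both the parser and the
    # perception models.
    utterance = "bring me the heavy green mug"
    semantic_form = "bring(the(lambda x: heavy(x) and green(x) and mug(x)), speaker)"
    perceptual_predicates = ["heavy", "green", "mug"]

    # A confirmed command yields a new parser training example ...
    parser_training_pair = (utterance, semantic_form)
    # ... while opportunistic questions such as "Would you use the word 'heavy'
    # for this object?" yield (predicate, object, label) triples for perception.
    perception_label = ("heavy", "object_12", True)

    print(parser_training_pair)
    print(perception_label)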
View:
PDF
Citation:
PhD Thesis, Department of Computer Science, The University of Texas at Austin.
Jesse Thomason, Ph.D. Alumni, thomason DOT jesse AT gmail