Department of Computer Science

Machine Learning Research Group

University of Texas at Austin Artificial Intelligence Lab

Publications: 2026

  1. Do Images Speak Louder than Words? Investigating the Effect of Textual Misinformation in VLMs
    [Details] [PDF]
    Chi Zhang, Wenxuan Ding, Jiale Liu, Mingrui Wu, Qingyun Wu and Ray Mooney
    In European Chapter of the Association for Computational Linguistics (EACL), March 2026.
    Vision-Language Models (VLMs) have shown strong multimodal reasoning capabilities on Visual Question Answering (VQA) benchmarks. However, their robustness against textual misinformation remains under-explored. While existing research has studied the effect of misinformation in text-only domains, it is not clear how VLMs arbitrate between contradictory information from different modalities. To bridge this gap, we first propose the CONTEXT-VQA (i.e., Conflicting Text) dataset, consisting of image-question pairs together with systematically generated persuasive prompts that deliberately conflict with the visual evidence. We then design and execute a thorough evaluation framework to benchmark the susceptibility of various models to these conflicting multimodal inputs. Comprehensive experiments over 11 state-of-the-art VLMs reveal that these models are indeed vulnerable to misleading textual prompts, often overriding clear visual evidence in favor of the conflicting text, and show an average performance drop of over 48.2 percent after only one round of persuasive conversation. Our findings highlight a critical limitation in current VLMs and underscore the need for improved robustness against textual manipulation.
    ML ID: 446
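    The evaluation protocol sketched in this abstract pairs each image-question example with a persuasive prompt that contradicts the image, then compares model accuracy with and without that prompt. Below is a minimal Python sketch of such a comparison; the ConflictingTextExample fields, the ask_vlm callable, and measure_performance_drop are illustrative assumptions, not the paper's released code or the actual CONTEXT-VQA format.

        from dataclasses import dataclass
        from typing import Callable, List

        @dataclass
        class ConflictingTextExample:
            image_path: str           # path to the image (hypothetical field names)
            question: str             # VQA question about the image
            answer: str               # ground-truth short answer
            conflicting_prompt: str   # persuasive text that contradicts the image

        def measure_performance_drop(
            examples: List[ConflictingTextExample],
            ask_vlm: Callable[[str, str], str],
        ) -> float:
            """Relative accuracy drop (percent) after one round of conflicting text.

            ask_vlm(image_path, text) is assumed to return the model's short answer.
            """
            clean_correct = 0
            misled_correct = 0
            for ex in examples:
                # Baseline pass: the image plus the question alone.
                if ask_vlm(ex.image_path, ex.question).strip().lower() == ex.answer.lower():
                    clean_correct += 1
                # "Persuaded" pass: the conflicting text is prepended to the question.
                misled_query = f"{ex.conflicting_prompt}\n{ex.question}"
                if ask_vlm(ex.image_path, misled_query).strip().lower() == ex.answer.lower():
                    misled_correct += 1
            if clean_correct == 0:
                return 0.0
            return 100.0 * (clean_correct - misled_correct) / clean_correct

    Under a protocol along these lines, the reported 48.2 percent figure would correspond to the relative accuracy lost after a single persuasive turn.
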
  2. Mixed-Initiative Dialog for Human-Robot Collaborative Manipulation
    [Details] [PDF]
    Albert Yu, Chengshu Li, Luca Macesanu, Arnav Balaji, Ruchira Ray, Raymond Mooney, Roberto Martín-Martín
    In International Conference on Robotics and Automation (ICRA), June 2026.
    Effective robotic systems for long-horizon human-robot collaboration must adapt to a wide range of human partners, whose physical behavior, willingness to assist, and understanding of the robot's capabilities may change over time. This demands a tightly coupled communication loop that grants both agents the flexibility to propose, accept, or decline requests as they coordinate toward completing the task effectively. We propose MICoBot, a system that enables the human and robot, both using natural language, to take initiative in formulating, accepting, or rejecting proposals on who can best complete different steps of a task. To handle diverse, task-directed dialog and find successful collaborative strategies that minimize human effort, MICoBot makes decisions at three levels: (1) a meta-planner considers human dialog to formulate and code a high-level collaboration strategy, (2) a planner optimally allocates the remaining steps to either agent based on the robot's capabilities (measured by a simulation-pretrained affordance model) and the human's estimated willingness to help, and (3) an action executor decides the low-level actions to perform or words to say to the human. In physical robot trials with 18 unique human participants, MICoBot significantly improves task success and user experience over a pure LLM baseline and standard agent allocation models. See additional videos and materials at our project site: https://robin-lab.cs.utexas.edu/MicoBot/.
    ML ID: 441
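    The second level of the decision hierarchy described above allocates each remaining step to the robot or the human by weighing the robot's estimated chance of success (from the affordance model) against the human's estimated willingness to help. The sketch below is a simplified, hypothetical version of such an allocation rule; the Step fields, cost terms, and allocate_steps function are assumptions for illustration, not MICoBot's actual planner.

        from dataclasses import dataclass
        from typing import List

        @dataclass
        class Step:
            name: str
            robot_success_prob: float   # e.g., an estimate from a simulation-pretrained affordance model
            human_effort_cost: float    # relative effort of asking the human to do this step

        def allocate_steps(
            steps: List[Step],
            human_willingness: float,    # estimated probability the human accepts a request, in [0, 1]
            failure_penalty: float = 1.0,
        ) -> List[str]:
            """Assign each step to 'robot' or 'human' by comparing simple expected costs."""
            assignment = []
            for step in steps:
                # Expected cost of the robot attempting the step: penalty scaled by failure probability.
                robot_cost = failure_penalty * (1.0 - step.robot_success_prob)
                # Expected cost of asking the human: their effort when they accept,
                # plus a penalty for the chance the request is declined.
                human_cost = (human_willingness * step.human_effort_cost
                              + (1.0 - human_willingness) * failure_penalty)
                assignment.append("robot" if robot_cost <= human_cost else "human")
            return assignment

    In this toy formulation, steps with low robot affordance are routed to the human only when the human seems willing and the request is low-effort; in the full system this allocation sits between the meta-planner's dialog-derived strategy and the executor's low-level actions and utterances.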