Publications: 2025
- Reasoning about Actions with Large Multimodal Models
[Details] [PDF] [Slides (PDF)]
Vanya Cohen
October 2025. Ph.D. Proposal. Large multimodal models have become central for solving sequential decision-making tasks, enabling improved learning in diverse areas such as home robotics and automated software development. However, leveraging these models for sequential decision-making requires robust action reasoning capabilities, which remain a significant challenge. This thesis aims to improve and evaluate action reasoning in large multimodal models. First, we introduce a method to improve the parsing of instructional texts into action sequences by integrating external symbolic planners and planning domains during autoregressive language model decoding. Next, we propose a method that leverages the compositional structure of language instructions to improve the generalization and sample efficiency of acquiring new tasks with reinforcement learning. Last, we propose a new benchmark to evaluate the understanding of dependencies between actions described in instructional texts. Future work will focus on evaluating the world-modeling limitations of frontier models: current models struggle to reason about the effects of actions in multimodal entity-state tracking tasks, and we aim to extend entity-state tracking evaluations to embodied domains. From this benchmark, we derive a post-training method for improving the entity-state reasoning abilities of language models. Together, these contributions enhance the understanding of how models reason about actions and provide insights toward their improvement for real-world sequential decision-making problems. (A sketch of planner-constrained action decoding follows this entry.)
ML ID: 445
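The first contribution above integrates external symbolic planners into autoregressive decoding. Below is a minimal, self-contained Python sketch of that general idea, not the proposal's actual method: candidate actions proposed by a (mocked) language model are filtered by a toy planning domain's preconditions before the highest-scoring valid action is appended. The domain, actions, and scoring function are illustrative assumptions.

```python
# Toy sketch of planner-constrained action decoding (not the thesis method).
from dataclasses import dataclass

@dataclass(frozen=True)
class Action:
    name: str
    preconditions: frozenset  # facts that must hold before the action
    effects_add: frozenset    # facts the action makes true
    effects_del: frozenset    # facts the action makes false

def mock_lm_scores(history, candidates):
    """Stand-in for a language model: prefers actions not used yet."""
    return {a.name: 1.0 / (1 + history.count(a.name)) for a in candidates}

def constrained_decode(state, candidates, goal, max_steps=10):
    """Greedy decoding over actions, keeping only symbolically valid ones."""
    history = []
    for _ in range(max_steps):
        valid = [a for a in candidates if a.preconditions <= state]
        if not valid:
            break
        scores = mock_lm_scores(history, valid)
        best = max(valid, key=lambda a: scores[a.name])
        state = (state - best.effects_del) | best.effects_add
        history.append(best.name)
        if goal <= state:
            break
    return history, state

if __name__ == "__main__":
    actions = [
        Action("boil_water", frozenset({"have_water"}), frozenset({"water_boiled"}), frozenset()),
        Action("add_pasta", frozenset({"water_boiled"}), frozenset({"pasta_cooking"}), frozenset()),
        Action("drain", frozenset({"pasta_cooking"}), frozenset({"pasta_done"}), frozenset()),
    ]
    plan, final = constrained_decode(frozenset({"have_water"}), actions, frozenset({"pasta_done"}))
    print(plan)  # ['boil_water', 'add_pasta', 'drain']
```

In a real system, the mock scorer would be replaced by probabilities from the language model, and the toy domain by a full planning domain checked by an external planner.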
- Augmenting Robotic Capabilities through Natural Language
[Details] [PDF] [Slides (PDF)]
Albert Yu
October 2025. Ph.D. Proposal. Despite rapid advances in language and vision models, current robots still lag far behind human physical capabilities due to the relative scarcity of real-world data compared to online text and images. How can we leverage abundant language data to advance robotic capabilities? Language provides semantic structure that facilitates the understanding of diverse data, improving sample efficiency in scarce-data regimes. It also provides a natural communicative medium when interacting with and learning from humans.
To leverage the first benefit of language, we take inspiration from how humans teach each other in video tutorials, through simultaneous video and language streams, to more efficiently teach robots new skills. We then show that language can bridge wide visual sim2real gaps, enabling robots to learn tasks from just a few real-world demonstrations by leveraging knowledge from imperfect simulation data. To leverage the second benefit of language, we explore how bidirectional dialog can enable robots to solve complex manipulation tasks by communicating and collaborating with a wide distribution of human collaborators in the real world. We develop a robotic framework that requests and proactively offers help through mixed-initiative, free-form dialog, enabling the robot to adapt to changing human preferences and allowing each agent's physical capabilities to be used strategically. Finally, we discuss avenues of future work, such as how human-robot collaboration can be facilitated through dialog-based replanning, how both agents can improve through bidirectional feedback, and how language-based guidelines extracted from manuals can enable robots to behave more safely and learn more quickly. (A sketch of language-conditioned sim-to-real finetuning follows this entry.)
ML ID: 444
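One thread of the proposal above uses language to bridge the sim-to-real gap so that only a few real demonstrations are needed. The PyTorch sketch below is a rough, hypothetical illustration of that recipe, not the proposal's framework: a language-conditioned policy is behavior-cloned on plentiful simulated data and then finetuned on a handful of real-world demonstrations. All dimensions, datasets, and hyperparameters are placeholders.

```python
# Schematic sketch: sim pretraining + few-shot real finetuning of a
# language-conditioned policy (placeholder data and dimensions).
import torch
import torch.nn as nn

class LanguageConditionedPolicy(nn.Module):
    def __init__(self, obs_dim=64, lang_dim=32, act_dim=7):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + lang_dim, 128), nn.ReLU(),
            nn.Linear(128, act_dim),
        )

    def forward(self, obs, lang):
        return self.net(torch.cat([obs, lang], dim=-1))

def behavior_clone(policy, dataset, epochs, lr):
    """Simple behavior cloning on (observation, language, action) batches."""
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    for _ in range(epochs):
        for obs, lang, act in dataset:
            loss = nn.functional.mse_loss(policy(obs, lang), act)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return policy

# Placeholder datasets: many simulated batches, few real ones.
sim_data = [(torch.randn(8, 64), torch.randn(8, 32), torch.randn(8, 7)) for _ in range(100)]
real_data = [(torch.randn(8, 64), torch.randn(8, 32), torch.randn(8, 7)) for _ in range(5)]

policy = LanguageConditionedPolicy()
behavior_clone(policy, sim_data, epochs=1, lr=1e-3)    # large-scale sim pretraining
behavior_clone(policy, real_data, epochs=10, lr=1e-4)  # few-shot real-world finetuning
```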
- Enhancing Competitive-level Code Generation by Utilizing Natural Language Reasoning
[Details] [PDF] [Slides (PDF)]
Jierui Li
September 2025. Ph.D. Proposal. Recent progress in large language models (LLMs) has shown strong performance in code generation. Models trained with long reasoning chains achieve promising results on complex competitive programming (CP) tasks. However, it remains unclear where the main bottlenecks in solving such problems lie. This dissertation studies these obstacles and explores how leveraging LLMs' natural language reasoning abilities can improve code generation for CP.
This proposal highlights three completed contributions. (1) Explanation and Distilling: LLMs are effective at explaining solution code (Li et al., 2023), and their ability to implement a verbal solution is stronger than their ability to solve a problem directly. Based on this, we developed a supervised finetuning method that distills LLM-generated explanations into chain-of-thought-style problem-solving steps (Li and Mooney, 2024). (2) Agent-Guided CodeTree Search: We introduced CodeTree (Li et al., 2025), an agent system for code generation that iteratively thinks, solves, reflects, refines, and verifies through an auto-expanded tree search until reaching the final solution. (3) AlgoSimBench Benchmark: We built AlgoSimBench (Li and Mooney, 2025), a benchmark for evaluating LLMs' ability to identify algorithmically similar problems. We found that using attempted solutions to match problems improves both end-to-end LLM selection and cosine-similarity-based retrieval. Finally, we outline two directions for future work. (1) Task-Aware Code Representation: develop a zero-shot code embedding method that weighs tokens based on the task-specific prompt, focusing the representation on distinct aspects such as algorithm, functionality, and semantics. (2) Retriever–LLM Training: investigate why Retrieval-Augmented Generation (RAG) shows limited improvement in coding tasks, with two hypotheses: (a) retrievers fail to find useful context, and (b) LLMs struggle to use retrieved information effectively. To address this, we plan to jointly train retrievers and LLMs on context-dependent coding tasks. (A sketch of the explanation-distillation idea follows this entry.)
ML ID: 443
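The "Explanation and Distilling" contribution above turns LLM-generated explanations of reference solutions into chain-of-thought finetuning data. The sketch below illustrates that data-construction step under simplifying assumptions; `query_llm` is a hypothetical stand-in for an actual model call, and the real pipeline (Li and Mooney, 2024) involves more filtering and formatting than shown.

```python
# Illustrative sketch of explanation-to-CoT distillation data construction.
def query_llm(prompt: str) -> str:
    """Placeholder for an LLM call; returns a canned explanation here."""
    return "Explanation: compare the number of elements to the number of distinct elements."

def build_distillation_example(problem: str, reference_solution: str) -> dict:
    """Turn a solved problem into a chain-of-thought finetuning example."""
    explanation = query_llm(
        f"Problem:\n{problem}\n\nSolution code:\n{reference_solution}\n\n"
        "Explain, step by step, the idea behind this solution."
    )
    # The target output is the explanation followed by the code, so the
    # finetuned model learns to reason in natural language before coding.
    return {"input": problem, "target": explanation + "\n\n" + reference_solution}

example = build_distillation_example(
    "Given an array, determine whether it contains duplicates.",
    "def has_dup(a):\n    return len(a) != len(set(a))",
)
print(example["target"])
```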
- AlgoSimBench: Identifying Algorithmically Similar Problems for Competitive Programming
[Details] [PDF]
Jierui Li and Raymond Mooney
Preprint, July 2025. Recent progress in LLMs, such as reasoning models, has demonstrated strong abilities to solve complex competitive programming problems, often rivaling top human competitors. However, it remains underexplored whether these abilities generalize to relevant domains that are less seen during training. To address this, we introduce AlgoSimBench, a new benchmark designed to assess LLMs' ability to identify algorithmically similar problems (ASPs), i.e., problems that can be solved using similar algorithmic approaches. AlgoSimBench consists of 1317 problems, annotated with 231 distinct fine-grained algorithm tags, from which we curate 402 multiple-choice questions (MCQs), where each question presents one algorithmically similar problem alongside three textually similar but algorithmically dissimilar distractors. Our evaluation reveals that LLMs struggle to identify ASPs, with the best-performing model (o3-mini) achieving only 65.9 percent accuracy on the MCQ task. To address this challenge, we propose attempted solution matching (ASM), a novel method for improving problem-similarity detection. On our MCQ task, ASM yields an absolute accuracy improvement of 6.7 percent to 11.7 percent across different models. We also evaluated code embedding models and retrieval methods on similar-problem identification. While the adversarial selection of problems degrades their performance to below random, we found that simply summarizing the problem to remove narrative elements eliminates the effect, and that combining ASM with a keyword-prioritized method, BM25, can yield up to 52.2 percent accuracy. Code and data are available at https://github.com/lijierui/AlgoSimBench. (A sketch of attempted solution matching with BM25 follows this entry.)
ML ID: 442
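A rough sketch of the attempted solution matching (ASM) idea described above: each problem is represented by an LLM-drafted attempted solution, and candidates are ranked by BM25 overlap between attempts rather than between problem statements. This is not the released AlgoSimBench code; it uses the third-party rank_bm25 package as one convenient BM25 implementation, and `attempt_solution` is a placeholder for a real LLM call.

```python
# Simplified sketch of ASM + BM25 ranking (not the AlgoSimBench release).
from rank_bm25 import BM25Okapi

def attempt_solution(problem: str) -> str:
    """Stand-in for prompting an LLM to draft a solution outline; here it just
    echoes the problem text so the example runs without model access."""
    return "solution outline for: " + problem

def rank_candidates(query_problem: str, candidate_problems: list[str]) -> list[int]:
    """Rank candidates by BM25 similarity between attempted solutions."""
    query_attempt = attempt_solution(query_problem).split()
    candidate_attempts = [attempt_solution(p).split() for p in candidate_problems]
    bm25 = BM25Okapi(candidate_attempts)
    scores = bm25.get_scores(query_attempt)
    return sorted(range(len(candidate_problems)), key=lambda i: -scores[i])

candidates = ["count pairs with a given sum", "find the shortest path in a maze",
              "longest common subsequence of two strings"]
print(rank_candidates("count subarrays whose sum equals k", candidates))
```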
- Mixed-Initiative Dialog for Human-Robot Collaborative Manipulation
[Details] [PDF]
Albert Yu, Chengshu Li, Luca Macesanu, Arnav Balaji, Ruchira Ray, Raymond Mooney, Roberto Martín-Martín
Preprint, August 2025. Effective robotic systems for long-horizon human-robot collaboration must adapt to a wide range of human partners, whose physical behavior, willingness to assist, and understanding of the robot's capabilities may change over time. This demands a tightly coupled communication loop that grants both agents the flexibility to propose, accept, or decline requests as they coordinate toward completing the task effectively. We apply a Mixed-Initiative dialog paradigm to Collaborative human-roBot teaming and propose MICoBot, a system that handles the common scenario where both agents, using natural language, take initiative in formulating, accepting, or rejecting proposals on who can best complete different steps of a task. To handle diverse, task-directed dialog and find successful collaborative strategies that minimize human effort, MICoBot makes decisions at three levels: (1) a meta-planner considers human dialog to formulate and code a high-level collaboration strategy, (2) a planner optimally allocates the remaining steps to either agent based on the robot's capabilities (measured by a simulation-pretrained affordance model) and the human's estimated availability to help, and (3) an action executor decides the low-level actions to perform or the words to say to the human. Our extensive evaluations in simulation and the real world -- on a physical robot with 18 unique human participants over 27 hours -- demonstrate the ability of our method to effectively collaborate with diverse human users, yielding significantly better task success and user experience than a pure LLM baseline and other agent-allocation models. See additional videos and materials at https://robin-lab.cs.utexas.edu/MicoBot/. (A sketch of the step-allocation idea follows this entry.)
ML ID: 441
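A toy illustration of the second decision level described above (the step allocator), not the MICoBot implementation: each remaining step goes to whichever agent has higher expected utility, where the robot's utility is its affordance-model success estimate and the human's utility is discounted by the effort of asking for help. The numbers and the scoring rule are invented for the example.

```python
# Toy step allocation between robot and human (illustrative assumptions only).
def allocate_steps(steps, robot_affordance, human_availability, help_cost=0.3):
    """Greedily assign each step to whichever agent has higher expected utility."""
    plan = {}
    for step in steps:
        robot_utility = robot_affordance[step]                  # estimated P(robot succeeds)
        human_utility = human_availability * (1.0 - help_cost)  # discounted by effort of asking
        plan[step] = "robot" if robot_utility >= human_utility else "human"
    return plan

steps = ["pick_mug", "open_cabinet", "place_mug_on_high_shelf"]
robot_affordance = {"pick_mug": 0.9, "open_cabinet": 0.6, "place_mug_on_high_shelf": 0.2}
print(allocate_steps(steps, robot_affordance, human_availability=0.8))
# {'pick_mug': 'robot', 'open_cabinet': 'robot', 'place_mug_on_high_shelf': 'human'}
```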
- Text-Guided Interactive Scene Synthesis with Scene Prior Guidance
[Details] [PDF]
Shaoheng Fang, Haitao Yang, Raymond Mooney, Qixing Huang
In European Association for Computer Graphics, May 2025. 3D scene synthesis using natural language instructions has become a popular direction in computer graphics, with significant recent progress made by data-driven generative models. However, previous methods have mainly focused on one-time scene generation, lacking the interactive capability to generate, update, or correct scenes according to user instructions. To overcome this limitation, this paper focuses on text-guided interactive scene synthesis. First, we introduce the SceneMod dataset, which comprises 168k paired scenes with textual descriptions of the modifications. To support the interactive scene synthesis task, we propose a two-stage diffusion generative model that integrates scene-prior guidance into the denoising process to explicitly enforce physical constraints and foster more realistic scenes. Experimental results demonstrate that our approach outperforms baseline methods in text-guided scene synthesis tasks. Our system expands the scope of data-driven scene synthesis tasks and provides a novel, more flexible tool for users and designers in 3D scene generation. Code and dataset are available at https://github.com/bshfang/SceneMod. (A sketch of scene-prior-guided denoising follows this entry.)
ML ID: 439
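A schematic sketch of what "scene-prior guidance in the denoising process" could look like, assuming a classifier-guidance-style update; this is not the SceneMod model. After each (placeholder) denoising step, the object layout is nudged down the gradient of a differentiable penalty that discourages overlapping objects.

```python
# Schematic guidance step: penalize object overlap during denoising (placeholder denoiser).
import torch

def overlap_penalty(centers, min_dist=1.0):
    """Penalize pairs of object centers that are closer than min_dist."""
    d = torch.cdist(centers, centers)                   # pairwise distances
    mask = ~torch.eye(len(centers), dtype=torch.bool)   # ignore self-distances
    return torch.clamp(min_dist - d[mask], min=0).pow(2).sum()

def guided_denoise(x, steps=50, guidance_scale=0.1):
    for _ in range(steps):
        x = x - 0.05 * torch.randn_like(x)        # placeholder for a learned denoiser update
        x = x.detach().requires_grad_(True)
        penalty = overlap_penalty(x)
        grad, = torch.autograd.grad(penalty, x)
        x = (x - guidance_scale * grad).detach()  # scene-prior guidance step
    return x

layout = guided_denoise(torch.randn(8, 2))  # 8 objects with 2-D centers
print(overlap_penalty(layout))
```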
- CodeTree: Agent-guided Tree Search for Code Generation with Large Language Models
[Details] [PDF]
Jierui Li, Hung Le, Yingbo Zhou, Caiming Xiong, Silvio Savarese, Doyen Sahoo
In Nations of the Americas Chapter of the Association for Computational Linguistics (NAACL), April 2025. Pre-trained on massive amounts of code and text data, large language models (LLMs) have demonstrated remarkable achievements in performing code generation tasks. With additional execution-based feedback, these models can act as agents with capabilities to self-refine and improve generated code autonomously. However, on challenging coding tasks with an extremely large search space, current agentic approaches still struggle with multi-stage planning, generating, and debugging. To address this problem, we propose CodeTree, a framework for LLM agents to efficiently explore the search space in different stages of the code generation process. Specifically, we adopted a unified tree structure to explicitly explore different coding strategies, generate corresponding coding solutions, and subsequently refine the solutions. In each stage, critical decision-making (ranking, termination, expansion) of the exploration process is guided by both environmental execution-based feedback and LLM-agent-generated feedback. We comprehensively evaluated CodeTree on 7 code generation benchmarks and demonstrated significant performance gains of CodeTree over strong baselines. Using GPT-4o as the base model, we consistently achieved top results of 95.1 on HumanEval, 98.7 on MBPP, and 43.0 on CodeContests. On the challenging SWEBench benchmark, our approach also led to significant performance gains. (A sketch of the tree-search loop follows this entry.)
ML ID: 438
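A condensed sketch of the kind of agent-guided tree search described above, not the actual CodeTree system: candidate programs form a search tree, execution feedback on visible tests scores each node, the best node is expanded by a refine step, and search terminates once a candidate passes all tests. The `propose` and `refine` functions stand in for LLM calls and are hard-coded here so the example runs.

```python
# Condensed tree-search-over-code sketch with execution feedback (LLM calls mocked).
import heapq

def run_tests(code: str, tests) -> float:
    """Fraction of tests passed; execution feedback used to rank nodes."""
    namespace = {}
    try:
        exec(code, namespace)
        return sum(t(namespace) for t in tests) / len(tests)
    except Exception:
        return 0.0

def propose(problem):              # placeholder for an LLM "solve" step
    return ["def add(a, b):\n    return a - b"]

def refine(problem, code, score):  # placeholder for an LLM "reflect + refine" step
    return [code.replace("a - b", "a + b")]

def code_tree_search(problem, tests, max_expansions=10):
    frontier = [(-run_tests(c, tests), c) for c in propose(problem)]
    heapq.heapify(frontier)
    for _ in range(max_expansions):
        neg_score, code = heapq.heappop(frontier)
        if neg_score == -1.0:
            return code            # all tests pass: terminate the search
        for child in refine(problem, code, -neg_score):
            heapq.heappush(frontier, (-run_tests(child, tests), child))
    return None

tests = [lambda ns: ns["add"](2, 3) == 5, lambda ns: ns["add"](0, 0) == 0]
print(code_tree_search("Write add(a, b).", tests))
```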
- MET-Bench: Multimodal Entity Tracking for Evaluating the Limitations of Vision-Language and Reasoning Models
[Details] [PDF]
Vanya Cohen, Raymond Mooney
Preprint, January 2025. Entity tracking is a fundamental challenge in natural language understanding, requiring models to maintain coherent representations of entities. Previous work has benchmarked entity tracking performance in purely text-based tasks. We introduce MET-Bench, a multimodal entity tracking benchmark designed to evaluate the ability of vision-language models to track entity states across modalities. Using two structured domains, Chess and the Shell Game, we assess how effectively current models integrate textual and image-based state updates. Our findings reveal a significant performance gap between text-based and image-based tracking, and show that this gap stems from deficits in visual reasoning rather than perception. We further show that explicit text-based reasoning strategies improve performance, yet substantial limitations remain, especially in long-horizon multimodal scenarios. Our results highlight the need for improved multimodal representations and reasoning techniques to bridge the gap between textual and visual entity tracking. (A sketch of Shell Game entity tracking follows this entry.)
ML ID: 437
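A small sketch of the Shell Game style of entity tracking that MET-Bench evaluates, under the assumption of three cups and textual swap updates; this is not the benchmark's harness. `model_predict` is a placeholder that guesses randomly, giving the chance baseline a real vision-language model would be compared against.

```python
# Shell Game entity-tracking evaluation sketch (random-guess placeholder model).
import random

def apply_swaps(start_cup: int, swaps) -> int:
    """Ground-truth tracker: follow the ball through each (i, j) cup swap."""
    cup = start_cup
    for i, j in swaps:
        if cup == i:
            cup = j
        elif cup == j:
            cup = i
    return cup

def model_predict(start_cup, swaps) -> int:
    """Placeholder model: guesses randomly, as a trivial baseline."""
    return random.randrange(3)

def evaluate(num_episodes=1000, num_swaps=10):
    correct = 0
    for _ in range(num_episodes):
        start = random.randrange(3)
        swaps = [tuple(random.sample(range(3), 2)) for _ in range(num_swaps)]
        correct += model_predict(start, swaps) == apply_swaps(start, swaps)
    return correct / num_episodes

print(f"accuracy: {evaluate():.3f}")  # about 0.33 for the random baseline
```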
- Temporally Streaming Audio-Visual Synchronization for Real-World Videos
[Details] [PDF]
Jordan Voas, Wei-Cheng Tseng, Layne Berry, Xixi Hu, Puyuan Peng, James Stuedemann, and David Harwath
In IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), February 2025. We introduce RealSync, a novel dataset designed to significantly enhance the training and evaluation of models for audio-visual synchronization (AV Sync) tasks. Sourced from high-quality YouTube channels, RealSync covers a wide range of content domains, providing improved scale, diversity, and alignment with broadcast content compared to existing datasets. It features extended-length video samples, catering to the critical need for more comprehensive, real-world training and evaluation materials. Alongside this dataset, we present StreamSync, a model tailored for real-world AV Sync applications. StreamSync is designed to be backbone-agnostic and incorporates a streaming mechanism that processes consecutive video segments dynamically, iteratively refining synchronization predictions. This approach enables StreamSync to outperform existing models, offering superior synchronization accuracy with minimal computational cost per iteration. Together, our dataset and the StreamSync model establish a new benchmark for AV Sync research, promising to drive the development of more robust and practical AV Sync methods. https://github.com/jvoas655/StreamSync (A sketch of streaming offset refinement follows this entry.)
ML ID: 435
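An illustrative guess at what a streaming refinement loop can look like, not the StreamSync architecture: segments arrive one at a time, a per-segment audio-visual offset is predicted (here by a noisy placeholder), and a running estimate is updated after each segment with an exponential moving average.

```python
# Streaming AV-offset refinement sketch (noisy placeholder per-segment predictor).
import random

def predict_segment_offset(segment_index: int, true_offset_ms: float) -> float:
    """Placeholder per-segment AV-offset prediction with measurement noise."""
    return true_offset_ms + random.gauss(0, 40)

def streaming_sync(num_segments: int = 20, true_offset_ms: float = 120.0,
                   smoothing: float = 0.3) -> float:
    """Iteratively refine the offset estimate as segments stream in."""
    estimate = 0.0
    for t in range(num_segments):
        observed = predict_segment_offset(t, true_offset_ms)
        estimate = (1 - smoothing) * estimate + smoothing * observed  # EMA update
    return estimate

print(f"refined offset estimate: {streaming_sync():.1f} ms (true: 120.0 ms)")
```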