Reasoning about Actions with Large Multimodal Models (2025)
Large multimodal models have become central to solving sequential decision-making tasks, enabling improved learning in diverse areas such as home robotics and automated software development. However, leveraging these models for sequential decision-making requires robust action reasoning capabilities, which remain a significant challenge. This thesis aims to improve and evaluate action reasoning in large multimodal models. First, we introduce a method that improves the parsing of instructional texts into action sequences by integrating external symbolic planners and planning domains into autoregressive language model decoding. Next, we propose a method that leverages the compositional structure of language instructions to improve generalization and sample efficiency when acquiring new tasks with reinforcement learning. Finally, we propose a new benchmark for evaluating the understanding of dependencies between actions described in instructional texts. Future work will focus on evaluating the world-modeling limitations of frontier models: current models struggle to reason about the effects of actions in multimodal entity-state tracking tasks, and we aim to extend entity-state tracking evaluations to embodied domains. From this benchmark, we derive a post-training method for improving the entity-state reasoning abilities of language models. Together, these contributions advance the understanding of how models reason about actions and provide insights toward improving them for real-world sequential decision-making problems.
View: PDF
Citation: Ph.D. Proposal.
Presentation: Slides (PDF)
Vanya Cohen, Ph.D. Student, vanya [at] utexas [dot] edu