Peter Stone's Selected Publications



MEReQ: Max-Ent Residual-Q Inverse RL for Sample-Efficient Alignment from Intervention

MEReQ: Max-Ent Residual-Q Inverse RL for Sample-Efficient Alignment from Intervention.
Yuxin Chen, Chen Tang, Jianglan Wei, Chenran Li, Ran Tian, Xiang Zhang, Wei Zhan, Peter Stone, and Masayoshi Tomizuka.
In Conference on Robot Learning (CoRL), September 2025.

Download

[PDF] (18.9 MB)

Abstract

Aligning robot behavior with human preferences is crucial for deploying embodied AI agents in human-centered environments. A promising solution is interactive imitation learning from human intervention, where a human expert observes the policy's execution and provides interventions as feedback. However, existing methods often fail to utilize the prior policy efficiently to facilitate learning, thus hindering sample efficiency. In this work, we introduce Maximum-Entropy Residual-Q Inverse Reinforcement Learning (MEReQ), designed for sample-efficient alignment from human intervention. Instead of inferring the complete human behavior characteristics, MEReQ infers a residual reward function that captures the discrepancy between the human expert's and the prior policy's underlying reward functions. It then employs Residual Q-Learning (RQL) to align the policy with human preferences using this residual reward function. Extensive evaluations on simulated and real-world tasks demonstrate that MEReQ achieves sample-efficient policy alignment from human intervention compared to other baselines.
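
The abstract's two-step recipe (infer a residual reward from interventions, then run Residual Q-Learning on it) can be illustrated with a small tabular sketch. Everything below is an illustrative assumption rather than the paper's implementation: the toy MDP, the names q_prior, q_res, and r_res, and the use of a random placeholder residual reward standing in for one inferred by max-ent IRL from intervention data.

    # Illustrative tabular sketch of the residual-Q idea from the abstract.
    # Assumptions (not from the paper): a tiny random MDP, a frozen prior
    # soft-Q table `q_prior`, and a placeholder residual reward `r_res`
    # standing in for one inferred from human interventions via max-ent IRL.
    import numpy as np

    n_states, n_actions = 5, 3
    gamma, lr, temp = 0.95, 0.1, 1.0
    rng = np.random.default_rng(0)

    q_prior = rng.normal(size=(n_states, n_actions))            # frozen prior policy's soft Q-values
    r_res = rng.normal(scale=0.1, size=(n_states, n_actions))   # hypothetical inferred residual reward
    q_res = np.zeros((n_states, n_actions))                      # residual Q-table to be learned
    transitions = rng.integers(n_states, size=(n_states, n_actions))  # deterministic toy dynamics

    def soft_value(q, s):
        """Soft (log-sum-exp) state value used in maximum-entropy RL."""
        return temp * np.log(np.exp(q[s] / temp).sum())

    def residual_backup(s, a, s_next):
        """One residual soft-Q backup.

        With Q_total = Q_prior + Q_res, subtracting the prior's soft Bellman
        equation from the total's leaves a target that depends only on the
        residual reward and the gap between the two soft values, so the
        prior's (unknown) reward never has to be recovered.
        """
        v_total_next = soft_value(q_prior + q_res, s_next)
        v_prior_next = soft_value(q_prior, s_next)
        target = r_res[s, a] + gamma * (v_total_next - v_prior_next)
        q_res[s, a] += lr * (target - q_res[s, a])

    # Roll out the combined soft policy and update the residual table.
    s = 0
    for _ in range(2000):
        logits = (q_prior[s] + q_res[s]) / temp
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        a = rng.choice(n_actions, p=probs)
        s_next = int(transitions[s, a])
        residual_backup(s, a, s_next)
        s = s_next

    print("learned residual Q-values:\n", q_res.round(2))

The sketch only conveys why the residual decomposition saves samples: the prior policy's knowledge enters through its fixed Q-values, so learning is driven solely by the (typically small) residual reward rather than by re-learning the full task reward.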

BibTeX Entry

@InProceedings{mereq_corl2025,
  author   = {Yuxin Chen and Chen Tang and Jianglan Wei and Chenran Li and Ran Tian and Xiang Zhang and Wei Zhan and Peter Stone and Masayoshi Tomizuka},
  title    = "{MEReQ}: Max-Ent Residual-{Q} Inverse {RL} for Sample-Efficient Alignment from Intervention",
  booktitle = {Conference on Robot Learning (CoRL)},
  year     = {2025},
  month    = {September},
  location = {Seoul, Korea},
  abstract = {Aligning robot behavior with human preferences is crucial for deploying embodied
AI agents in human-centered environments. A promising solution is interactive
imitation learning from human intervention, where a human expert observes the
policy's execution and provides interventions as feedback. However, existing
methods often fail to utilize the prior policy efficiently to facilitate
learning, thus hindering sample efficiency. In this work, we introduce
Maximum-Entropy Residual-Q Inverse Reinforcement Learning, designed for
sample-efficient alignment from human intervention. Instead of inferring the
complete human behavior characteristics, MEReQ infers a residual reward function
that captures the discrepancy between the human expert's and the prior policy's
underlying reward functions. It then employs Residual Q-Learning (RQL) to align
the policy with human preferences using this residual reward function. Extensive
evaluations on simulated and real-world tasks demonstrate that MEReQ achieves
sample-efficient policy alignment from human intervention compared to other
baselines.
  },
}

Generated by bib2html.pl (written by Patrick Riley) on Sat Oct 04, 2025 14:51:47