Peter Stone's Selected Publications



The Perils of Trial-and-Error Reward Design: Misdesign through Overfitting and Invalid Task Specifications

The Perils of Trial-and-Error Reward Design: Misdesign through Overfitting and Invalid Task Specifications.
Serena Booth, W Bradley Knox, Julie Shah, Scott Niekum, Peter Stone, and Alessandro Allievi.
In Proceedings of the 37th AAAI Conference on Artificial Intelligence (AAAI), Feb 2023.
Project page with slides and video.

Download

[PDF] (1.4MB)

Abstract

In reinforcement learning (RL), a reward function that aligns exactly with a task's true performance metric is often sparse. For example, a true task metric might encode a reward of 1 upon success and 0 otherwise. These sparse task metrics can be hard to learn from, so in practice they are often replaced with alternative dense reward functions. These dense reward functions are typically designed by experts through an ad hoc process of trial and error. In this process, experts manually search for a reward function that improves performance with respect to the task metric while also enabling an RL algorithm to learn faster. One question this process raises is whether the same reward function is optimal for all algorithms, or, put differently, whether the reward function can be overfit to a particular algorithm. In this paper, we study the consequences of this wide yet unexamined practice of trial-and-error reward design. We first conduct computational experiments that confirm that reward functions can be overfit to learning algorithms and their hyperparameters. To broadly examine ad hoc reward design, we also conduct a controlled observation study which emulates expert practitioners' typical reward design experiences. Here, we similarly find evidence of reward function overfitting. We also find that experts' typical approach to reward design---of adopting a myopic strategy and weighing the relative goodness of each state-action pair---leads to misdesign through invalid task specifications, since RL algorithms use cumulative reward rather than rewards for individual state-action pairs as an optimization target. Code, data: https://github.com/serenabooth/reward-design-perils.
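
To make the abstract's distinction concrete, the sketch below is an illustrative example only (it is not taken from the paper or its repository): it contrasts a sparse "true" task metric with a hand-designed dense reward on a hypothetical 1-D chain task, and shows that training optimizes the cumulative discounted return rather than the reward of any individual state-action pair. All function and variable names are assumptions made for illustration.

# Hypothetical sketch: sparse task metric vs. ad hoc dense reward,
# and the cumulative return that an RL algorithm actually optimizes.

def true_task_metric(trajectory, goal):
    """Sparse metric: 1 if the episode ends at the goal, 0 otherwise."""
    return 1.0 if trajectory[-1] == goal else 0.0

def dense_reward(state, next_state, goal):
    """A designer's ad hoc shaping term: reward progress toward the goal.
    Judging each state-action pair in isolation like this can speed up
    learning, but it can also misstate the task."""
    return abs(state - goal) - abs(next_state - goal)

def discounted_return(rewards, gamma=0.99):
    """The optimization target of RL: cumulative discounted reward."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Toy 1-D chain: the agent moves toward state 5, which is the goal.
goal = 5
trajectory = [0, 1, 2, 3, 4, 5]
shaped = [dense_reward(s, s2, goal) for s, s2 in zip(trajectory, trajectory[1:])]

print("true metric:", true_task_metric(trajectory, goal))   # 1.0 (success)
print("shaped return:", discounted_return(shaped))          # what training sees

In this toy setting the shaped reward happens to agree with the sparse metric, but the paper's point is that such agreement is checked by trial and error against a particular algorithm and hyperparameters, which is where overfitting and invalid specifications can creep in.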

BibTeX Entry

@InProceedings{booth2023perils,
  title={The Perils of Trial-and-Error Reward Design: Misdesign through Overfitting and Invalid Task Specifications},
  author={Serena Booth and W Bradley Knox and Julie Shah and Scott Niekum and Peter Stone and Alessandro Allievi},
  booktitle = {Proceedings of the 37th AAAI Conference on Artificial Intelligence (AAAI)},
  location = {Washington, D.C.},
  month = {Feb},
  year = {2023},
  abstract={In reinforcement learning (RL), a reward function that aligns exactly with a task's true performance metric is often sparse. For example, a true task metric might encode a reward of 1 upon success and 0 otherwise. These sparse task metrics can be hard to learn from, so in practice they are often replaced with alternative dense reward functions. These dense reward functions are typically designed by experts through an ad hoc process of trial and error. In this process, experts manually search for a reward function that improves performance with respect to the task metric while also enabling an RL algorithm to learn faster. One question this process raises is whether the same reward function is optimal for all algorithms, or, put differently, whether the reward function can be overfit to a particular algorithm. In this paper, we study the consequences of this wide yet unexamined practice of trial-and-error reward design. We first conduct computational experiments that confirm that reward functions can be overfit to learning algorithms and their hyperparameters. To broadly examine ad hoc reward design, we also conduct a controlled observation study which emulates expert practitioners' typical reward design experiences. Here, we similarly find evidence of reward function overfitting. We also find that experts' typical approach to reward design---of adopting a myopic strategy and weighing the relative goodness of each state-action pair---leads to misdesign through invalid task specifications, since RL algorithms use cumulative reward rather than rewards for individual state-action pairs as an optimization target. Code, data: https://github.com/serenabooth/reward-design-perils.
},
  wwwnote={<a href="https://slbooth.com/Reward_Design_Perils/">Project page</a> with slides and video.},
}

Generated by bib2html.pl (written by Patrick Riley) on Wed Apr 17, 2024 18:42:51