On-Policy vs. Off-Policy Updates for Deep Reinforcement Learning (2016)
Matthew Hausknecht and Peter Stone
Temporal-difference-based deep reinforcement learning methods have typically been driven by off-policy, bootstrap Q-Learning updates. In this paper, we investigate the effects of using on-policy, Monte Carlo updates. Our empirical results show that for the DDPG algorithm in a continuous action space, mixing on-policy and off-policy update targets yields better performance and stability than using either one exclusively. The same technique applied to DQN in a discrete action space drastically slows down learning. Our findings raise questions about the nature of on-policy and off-policy bootstrap and Monte Carlo updates and their relationship to deep reinforcement learning methods.
In Deep Reinforcement Learning: Frontiers and Challenges, IJCAI Workshop, New York, July 2016.
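
The abstract above describes mixing on-policy Monte Carlo targets with off-policy bootstrap targets but does not give the mixing rule. The sketch below is one plausible reading, not the authors' implementation: it assumes a linear interpolation controlled by a hypothetical coefficient `beta`, where `beta = 1` gives pure Monte Carlo returns and `beta = 0` gives pure one-step bootstrap targets. The function names and the `next_q_values` argument (the critic's estimate at each next state, 0 at a terminal state) are also assumptions made for illustration.

```python
def monte_carlo_returns(rewards, gamma):
    """On-policy target: discounted return from each step to episode end."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def bootstrap_targets(rewards, next_q_values, gamma):
    """Off-policy one-step target: r_t + gamma * Q(s_{t+1}, .).

    next_q_values[t] is assumed to be the critic's value estimate at the
    next state (e.g. a max over actions for DQN), and 0.0 if terminal.
    """
    return [r + gamma * q for r, q in zip(rewards, next_q_values)]


def mixed_targets(rewards, next_q_values, gamma, beta):
    """Assumed mixing rule: beta * Monte Carlo + (1 - beta) * bootstrap."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    mc = monte_carlo_returns(rewards, gamma)
    td = bootstrap_targets(rewards, next_q_values, gamma)
    return [beta * m + (1.0 - beta) * b for m, b in zip(mc, td)]


if __name__ == "__main__":
    # Toy three-step episode; the numbers are illustrative only.
    rewards = [1.0, 0.0, 2.0]
    next_q_values = [4.0, 2.0, 0.0]  # last transition is terminal
    print(mixed_targets(rewards, next_q_values, gamma=0.5, beta=0.5))
```

Under this reading, the targets for every transition in an episode can only be computed once the episode has finished, since the Monte Carlo term needs the full sequence of rewards.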

Peter Stone Faculty pstone [at] cs utexas edu