
Data Efficient Reinforcement Learning with Off-policy and Simulated Data

Josiah Hanna

[Full Dissertation]  [Defense Presentation]  [Download Slides] 

Abstract

Learning from interaction with the environment -- trying untested actions, observing successes and failures, and tying effects back to causes -- is one of the first capabilities we think of when considering autonomous agents. Reinforcement learning (RL) is the area of artificial intelligence research that has the goal of allowing autonomous agents to learn in this way. Despite much recent success, many modern reinforcement learning algorithms are still limited by their need for large amounts of experience before useful skills are learned. Two possible approaches to improving data efficiency are to make better use of experience collected under past behaviors (known as off-policy data) and to make better use of simulated data sources. This dissertation investigates the use of such auxiliary data by answering the question, "How can a reinforcement learning agent leverage off-policy and simulated data to evaluate and improve upon the expected performance of a policy?"

This dissertation first considers how to directly use off-policy data in reinforcement learning through importance sampling. When used in reinforcement learning, importance sampling is limited by high variance that leads to inaccurate estimates. This dissertation addresses this limitation in two ways. First, it introduces the behavior policy gradient algorithm, which adapts the data collection policy toward a policy whose data yields low-variance importance sampling evaluation of a fixed policy. Second, it introduces the family of regression importance sampling estimators, which re-weight already collected off-policy data so as to lower the variance of importance sampling evaluation of a fixed policy. In addition to evaluation of a fixed policy, we apply the behavior policy gradient algorithm and regression importance sampling to batch policy gradient policy improvement. In the case of regression importance sampling, this application leads to the sampling error corrected policy gradient estimator, which improves the data efficiency of batch policy gradient algorithms.
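To make the distinction concrete, the sketch below contrasts ordinary importance sampling with the regression importance sampling idea in a simplified one-step, discrete-action setting: ordinary importance sampling weights each observed reward by the ratio of target-policy to true behavior-policy probabilities, while regression importance sampling replaces the true behavior probabilities with their empirical estimates from the same data. This is an illustrative sketch only, not code from the dissertation; the function names and the bandit-style setup are assumptions made for brevity.

# Illustrative sketch: ordinary importance sampling (OIS) vs. regression
# importance sampling (RIS) for off-policy evaluation in a one-step,
# discrete-action setting.
import numpy as np

def ois_estimate(actions, rewards, behavior_probs, target_probs):
    # Weight each observed reward by pi_target(a) / pi_behavior(a).
    weights = np.array([target_probs[a] / behavior_probs[a] for a in actions])
    return np.mean(weights * np.asarray(rewards))

def ris_estimate(actions, rewards, target_probs, num_actions):
    # Same weighting, but the behavior policy probabilities in the denominator
    # are estimated empirically (by counting) from the collected data itself.
    actions = np.asarray(actions)
    empirical_probs = np.bincount(actions, minlength=num_actions) / len(actions)
    weights = np.array([target_probs[a] / empirical_probs[a] for a in actions])
    return np.mean(weights * np.asarray(rewards))

# Toy example: behavior policy is uniform over 2 actions, target policy prefers action 1.
rng = np.random.default_rng(0)
behavior = np.array([0.5, 0.5])
target = np.array([0.2, 0.8])
actions = rng.choice(2, size=100, p=behavior)
rewards = np.where(actions == 1, 1.0, 0.0)  # action 1 always pays 1, action 0 pays 0
print(ois_estimate(actions, rewards, behavior, target))
print(ris_estimate(actions, rewards, target, num_actions=2))

In this toy example both estimators target the same quantity (the target policy's expected reward, 0.8), but the regression estimator corrects for the sampling error in how often each action actually appeared in the data.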

Towards the goal of learning from simulated experience, this dissertation introduces an algorithm -- the grounded action transformation algorithm -- that takes small amounts of real-world data and modifies the simulator such that skills learned in simulation are more likely to carry over to the real world. Key to this approach is the idea of local simulator modification -- the simulator is automatically altered to better model the real world for the actions the data collection policy would take in the states it would visit. Local modification necessitates an iterative approach: the simulator is modified, the policy improved, and then more data is collected for further modification.
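The iterative structure described above can be summarized in a short schematic loop. The sketch below is not the dissertation's implementation; every callable (the simulator, the real-world data source, the grounding procedure, and the policy improvement routine) is a hypothetical placeholder, and only the alternation between real-world data collection, local simulator grounding, and in-simulation policy improvement is meant to reflect the described approach.

# Schematic sketch of the iterative grounded-learning loop.
# All callables are hypothetical placeholders for real components.
def grounded_learning_loop(policy, simulator, real_world, fit_grounding,
                           improve_policy, num_iterations=5, real_episodes_per_iter=10):
    for _ in range(num_iterations):
        # 1. Collect a small batch of real-world trajectories with the current policy.
        real_trajectories = [real_world.rollout(policy)
                             for _ in range(real_episodes_per_iter)]

        # 2. Ground the simulator locally: adjust it so that, for the states and
        #    actions the current policy visits, simulated transitions better match
        #    the observed real-world transitions.
        grounded_simulator = fit_grounding(simulator, real_trajectories)

        # 3. Improve the policy entirely in the grounded simulator.
        policy = improve_policy(policy, grounded_simulator)
    return policy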

Finally, in addition to examining them each independently, this dissertation also considers the possibility of combining the use of simulated data with importance-sampled off-policy data. We combine these sources of auxiliary data through control variate techniques that use simulated data to lower the variance of off-policy policy value estimation. Combining these sources of auxiliary data allows us to introduce two algorithms -- weighted doubly robust bootstrap and model-based bootstrap -- for the problem of lower-bounding the performance of an untested policy.
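As a rough illustration of the lower-bounding idea, the sketch below shows a generic percentile-bootstrap lower bound: trajectories are resampled with replacement, a chosen off-policy estimator (for example, a weighted doubly robust or model-based estimator) is applied to each resample, and a lower percentile of the resulting estimates serves as the confidence lower bound. This is a minimal sketch under those assumptions, not the dissertation's algorithms; the function name and the estimator interface are illustrative.

# Illustrative sketch: percentile-bootstrap lower bound on a policy's value,
# parameterized by an arbitrary off-policy estimator applied to each resample.
import numpy as np

def bootstrap_lower_bound(trajectories, estimator, confidence=0.95,
                          num_resamples=2000, seed=0):
    rng = np.random.default_rng(seed)
    trajectories = list(trajectories)
    n = len(trajectories)
    estimates = []
    for _ in range(num_resamples):
        # Resample trajectories with replacement and re-run the estimator.
        resample = [trajectories[i] for i in rng.integers(0, n, size=n)]
        estimates.append(estimator(resample))
    # Return the lower percentile of the bootstrap distribution of estimates.
    return np.percentile(estimates, 100 * (1 - confidence))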

Full Dissertation

[pdf]

Defense Presentation

Download Slides

[pdf (23M)]     [powerpoint (81M)]