# Reinforcement Learning in a 3D Physics Simulator

### Authors: Varun Jain, Aditya Rawal

Computer Science Department
The University of Texas at Austin

## Experiment 1: Push Ball to Goal

In this experiment, there are two spheres in the world: one is the agent and the other is a non-agent. The agent has to learn to push the non-agent to the goal, which is the rightmost edge of the world. If either the agent or the non-agent falls off the world from any other edge, the agent receives a negative reward. Thus, the learner has to control its movement so as to apply an appropriate force at the correct angle to push the spherical non-agent to the goal. The agent also incurs a negative reward at every time step.
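The reward structure described above can be sketched as follows. This is a minimal illustration, not the authors' code: the world bounds, the reward magnitudes, and the names `World`, `off_world`, and `step_reward` are all assumptions, as is the choice to treat the agent leaving by any edge as a fall.

```python
from dataclasses import dataclass

# Illustrative values only; the original reward magnitudes are not stated.
STEP_PENALTY = -1.0
FALL_PENALTY = -100.0
GOAL_REWARD = 100.0


@dataclass
class World:
    # Hypothetical rectangular world; the goal is the edge at x_max.
    x_min: float = 0.0
    x_max: float = 10.0
    y_min: float = 0.0
    y_max: float = 10.0


def off_world(pos, world):
    """True if a body at `pos` has left the world through any edge."""
    x, y = pos
    return x < world.x_min or x > world.x_max or y < world.y_min or y > world.y_max


def step_reward(agent_pos, ball_pos, world):
    """Return (reward, done) for one time step."""
    bx, by = ball_pos
    # Success: the non-agent crosses the rightmost edge.
    if bx > world.x_max and world.y_min <= by <= world.y_max:
        return GOAL_REWARD, True
    # Failure: either body leaves the world (assumption: any edge for the agent).
    if off_world(agent_pos, world) or off_world(ball_pos, world):
        return FALL_PENALTY, True
    # Otherwise a small per-step cost, which favors reaching the goal quickly.
    return STEP_PENALTY, False
```

The per-step penalty is what makes a faster push preferable to a slower one, since every extra step lowers the episode return.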

The video below shows the policy learned by the agent over 500 episodes of learning using the Sarsa(lambda) algorithm, with the value function approximated by radial basis functions. The agent learned to push the non-agent to the goal as fast as possible.
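For readers unfamiliar with the method, here is a minimal sketch of linear Sarsa(lambda) with accumulating eligibility traces over Gaussian radial basis features. It is a generic textbook form, not the authors' implementation: the class name `SarsaLambdaRBF`, the helper `rbf_features`, the epsilon-greedy exploration, and all hyperparameter values are assumptions.

```python
import math
import random


def rbf_features(state, centers, width):
    """Gaussian RBF activation of `state` around each center."""
    return [
        math.exp(-sum((s - c) ** 2 for s, c in zip(state, center)) / (2.0 * width ** 2))
        for center in centers
    ]


class SarsaLambdaRBF:
    """Linear Sarsa(lambda) with accumulating traces (hyperparameters are assumed)."""

    def __init__(self, n_features, n_actions, alpha=0.1, gamma=0.99, lam=0.9, epsilon=0.1):
        self.n_actions = n_actions
        self.alpha, self.gamma, self.lam, self.epsilon = alpha, gamma, lam, epsilon
        # One weight vector and one trace vector per discrete action.
        self.weights = [[0.0] * n_features for _ in range(n_actions)]
        self.traces = [[0.0] * n_features for _ in range(n_actions)]

    def q(self, features, action):
        """Action value as a linear combination of the features."""
        return sum(w * f for w, f in zip(self.weights[action], features))

    def select_action(self, features):
        """Epsilon-greedy action selection."""
        if random.random() < self.epsilon:
            return random.randrange(self.n_actions)
        values = [self.q(features, a) for a in range(self.n_actions)]
        return values.index(max(values))

    def start_episode(self):
        """Reset all eligibility traces to zero."""
        for row in self.traces:
            for i in range(len(row)):
                row[i] = 0.0

    def update(self, features, action, reward, next_features, next_action, done):
        """Apply one Sarsa(lambda) update and return the TD error."""
        target = reward if done else reward + self.gamma * self.q(next_features, next_action)
        delta = target - self.q(features, action)
        for a in range(self.n_actions):
            for i, f in enumerate(features):
                # Decay every trace, then add the gradient for the action taken.
                self.traces[a][i] *= self.gamma * self.lam
                if a == action:
                    self.traces[a][i] += f
                self.weights[a][i] += self.alpha * delta * self.traces[a][i]
        return delta
```

In a training loop, `start_episode` would be called at the beginning of each episode, and `update` once per step with the features of the current and next state.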

In this experiment, the task of pushing the ball has been made more difficult by positioning the non-agent further away from the goal. Still, the agent learns to go around the non-agent and push it from the left all the way to the goal on the right.

## Experiment 2: Challenging the Human

Here, an agent (sphere) is in a competitive situation, where it has to chase a non-agent (box) controlled by a human. The agent was trained by initialising the agent and the non-agent at random positions. During training, the agent learned to catch the box by following the shortest path. After 10000 episodes of training, the agent has learned enough to catch a human-controlled non-agent. The video below shows the result of using Sarsa(lambda) with a radial basis function approximator. The agent beats the human most of the time.

## Experiment 3: Division of Labor

This experiment is intended to understand how a single agent can coordinate the actions of its multiple parts to achieve a desired goal. The experiment is a collaborative task in which two spheres are separated from the goal by a wall. On the same side of the wall as the spheres, there is a control area (a specific region of the world); entering it makes the wall disappear. The control area is in the bottom left corner of the world. If even one of the spheres reaches the goal, the task ends for both spheres and they receive equal reward.
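The task rules above can be sketched as below. This is a sketch under stated assumptions: the region coordinates, the reward values, the per-step penalty, and the choice that the wall is absent only while a sphere is inside the control area are not given in the text, and the names `Rect`, `wall_is_present`, and `shared_reward` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Rect:
    x_min: float
    x_max: float
    y_min: float
    y_max: float

    def contains(self, pos):
        x, y = pos
        return self.x_min <= x <= self.x_max and self.y_min <= y <= self.y_max


# Hypothetical layout: control area in the bottom left corner, goal on the right.
CONTROL_AREA = Rect(0.0, 2.0, 0.0, 2.0)
GOAL_AREA = Rect(9.0, 10.0, 0.0, 10.0)
GOAL_REWARD = 100.0   # assumed value
STEP_PENALTY = -1.0   # assumed; the text does not state a step cost here


def wall_is_present(sphere_positions):
    """Assumption: the wall is gone while any sphere is inside the control area."""
    return not any(CONTROL_AREA.contains(p) for p in sphere_positions)


def shared_reward(sphere_positions):
    """Return (per-sphere rewards, done); every sphere gets the same reward."""
    done = any(GOAL_AREA.contains(p) for p in sphere_positions)
    reward = GOAL_REWARD if done else STEP_PENALTY
    return [reward] * len(sphere_positions), done
```

Because the reward is shared, a sphere that stays in the control area is rewarded just as much as the one that reaches the goal.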

As shown in the video, the agent has learned to divide the task clearly between the two spheres. One of the spheres learned to always go to the control area to make the wall disappear, while the other sphere heads towards the goal.

## Experiment 4: Joint Action Learner

A single learner controls the joint actions of the two spheres in the world. The goal is to push the large box to the right edge of the world. The box is heavy and the friction coefficient of the bodies is such that it is very difficult for one sphere to push the box.

In order to reduce learning time and utilize domain knowledge, a technique called learning from easy missions was used for training the agent. The weight of the box was reduced at the start and slowly increased after every 1000 episodes. As the video below shows, the agent gets stuck in a local optimum in which only one sphere is involved in pushing the box.
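The schedule described above can be sketched as a simple staircase function. Only the 1000-episode interval comes from the text; the starting mass, the target mass, the increment, and the function name `box_mass_for_episode` are illustrative assumptions.

```python
def box_mass_for_episode(episode, start_mass=1.0, target_mass=10.0,
                         mass_step=1.0, episodes_per_stage=1000):
    """Box mass to use for a given episode index (0-based).

    The mass starts low and grows by `mass_step` after every
    `episodes_per_stage` episodes until it reaches `target_mass`.
    All mass values are assumed, not taken from the original experiment.
    """
    stage = episode // episodes_per_stage
    return min(target_mass, start_mass + stage * mass_step)


# Example: the mass seen at a few points during training.
for ep in (0, 999, 1000, 5000, 20000):
    print(ep, box_mass_for_episode(ep))
```

Starting with a light box lets the learner reach the goal, and therefore see a positive reward, early in training, before the task becomes hard.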

### Optimal policy using Joint Action Learner

The optimal policy obtained after convergence is interesting. The learner uses the two spheres to push the box from the same end and generate a torque about its center of mass, instead of pushing it at the center or simultaneously at its two ends. The box acts like a lever and it is easily pushed into the goal.
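The torque argument can be checked with a small planar calculation. This is an illustrative sketch, not a model of the simulator: the box dimensions, contact points, and force values are made up, and `torque_z` and `net_torque` are hypothetical helper names. The box's long axis is taken along y, and each sphere pushes in the +x direction.

```python
def torque_z(contact_offset, force):
    """Planar torque about the center of mass: tau_z = r_x * F_y - r_y * F_x."""
    rx, ry = contact_offset
    fx, fy = force
    return rx * fy - ry * fx


def net_torque(contact_offsets, force):
    """Sum of torques when the same force is applied at each contact point."""
    return sum(torque_z(offset, force) for offset in contact_offsets)


push = (2.0, 0.0)  # assumed force from each sphere, in the +x direction

# Contact offsets are measured from the box's center of mass.
same_end = [(-0.5, 1.5), (-0.5, 1.0)]      # both spheres near one end
opposite_ends = [(-0.5, 1.5), (-0.5, -1.5)]  # one sphere at each end
center = [(-0.5, 0.0)]                      # a single push at the center

print(net_torque(same_end, push))       # non-zero: the box rotates
print(net_torque(opposite_ends, push))  # torques cancel
print(net_torque(center, push))         # no torque
```

Pushes at the same end produce torques of the same sign, so they add up, whereas pushes at the center or symmetrically at both ends produce no net torque.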

## Experiment 5: Independent Learner

The independent learners also converge to a policy similar to that of the joint action learner, as the video below shows. However, the actions of the two spheres, each controlled by a different learner, are less coordinated than under the joint action learner. This is because each agent is not aware of the other agent's position.

## Experiment 6: Cooperative Agents

Each cooperative agent has information about the other cooperative agent's position. Hence, they can coordinate easily. The video below shows the policy learned by the cooperative agents.
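One way to picture the difference between independent and cooperative learners is through the observation each one receives. The sketch below assumes that the state consists only of planar positions, which the text does not confirm; the function names are hypothetical.

```python
def independent_observation(own_pos, box_pos):
    """Independent learner: sees its own sphere and the box only (assumed layout)."""
    return (*own_pos, *box_pos)


def cooperative_observation(own_pos, other_pos, box_pos):
    """Cooperative learner: additionally sees the other agent's position."""
    return (*own_pos, *other_pos, *box_pos)


# Example with made-up positions.
sphere_a, sphere_b, box = (1.0, 2.0), (1.5, 4.0), (5.0, 3.0)
print(independent_observation(sphere_a, box))
print(cooperative_observation(sphere_a, sphere_b, box))
```

With the extra coordinates, a cooperative agent can condition its action on where its partner is, which an independent learner cannot do.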