The goal of the agent is to walk to the end of the sidewalk as fast as possible while avoiding the obstacles. The agent is able to step on a square that has an obstacle on it, but it will receive a punishment (i.e., a negative reward). The policy the agent learns should work on a variety of sidewalks, where the shape of the sidewalk is always the same but the positions and number of obstacles vary.
The agent should learn two modules: one that tries to reach the end of the sidewalk, and one that tries to avoid obstacles. Each module should have its own state representation and reward function. It is your job to design the state representations and reward functions and to describe them in your report. Design the state spaces so that they minimize the number of possible states while still providing enough information to discriminate between situations that call for different actions. The action space is the same for both modules (move up, down, left, right). If the agent attempts to walk off the sidewalk (i.e., beyond the bounds of the rectangle), it simply stays in the same spot.

Here's a hint: the positions of nearby objects should be encoded in relative coordinates (i.e., relative to the agent's position). If you encode the absolute positions of the obstacles, there is no way to generalize across a variety of obstacle layouts. Representing only the location of the closest obstacle makes the agent myopic, but it is probably good enough for this assignment. It is possible that this could occasionally send the agent into an infinite loop, so that is something to watch out for. You might want to impose a timeout to deal with this situation.
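To make the hint concrete, here is one possible pair of state encodings, written as a sketch in Python. All names (`avoid_state`, `approach_state`, the `window` size, the tuple conventions) are hypothetical choices, not part of the assignment; you must design and justify your own representations.

```python
# Sketch of two state encodings (all names and choices are illustrative).
# The avoid module sees only the nearest obstacle's position relative to
# the agent, clipped to a small window; the approach module needs just
# enough information to know how far along the sidewalk it is.

def nearest_obstacle(agent, obstacles):
    """Return the obstacle closest to the agent (Manhattan distance)."""
    return min(obstacles, key=lambda o: abs(o[0] - agent[0]) + abs(o[1] - agent[1]))

def avoid_state(agent, obstacles, window=2):
    """Relative (dx, dy) of the nearest obstacle, clipped to +/- window.

    Clipping keeps the state space small: every obstacle farther than
    `window` squares away looks the same to the agent.
    """
    if not obstacles:
        return None  # a single "no obstacle nearby" state
    ox, oy = nearest_obstacle(agent, obstacles)
    dx = max(-window, min(window, ox - agent[0]))
    dy = max(-window, min(window, oy - agent[1]))
    return (dx, dy)

def approach_state(agent, goal_row):
    """The approach module may need no more than the agent's position
    along the sidewalk (here, its row); columns are interchangeable."""
    return agent[0]
```

Note how both encodings deliberately discard information (absolute obstacle positions, all but the nearest obstacle) to keep the number of possible states small.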
Each module will be trained independently using Q-learning, with an epsilon-greedy strategy to pick the action at each time step: given the current state, pick the best action (i.e., the one with the highest Q-value) with probability epsilon (you get to choose this value, but 0.9 seems reasonable), and a uniformly random action with probability 1 - epsilon. (Note that this is the reverse of the common convention, in which epsilon denotes the exploration probability.) Training consists of a sequence of episodes and ends when a convergence criterion is reached. An episode is not a single action -- it is a sequence of actions that ends when some termination criterion is reached.
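The core loop can be sketched as tabular Q-learning. This follows the epsilon convention used above (epsilon is the probability of the greedy action); the parameter values and helper names are illustrative, not required.

```python
import random
from collections import defaultdict

# Sketch of tabular Q-learning with the epsilon-greedy rule as specified
# above (here epsilon is the probability of the GREEDY action). The
# learning rate alpha and discount gamma are placeholder values.

ACTIONS = ["up", "down", "left", "right"]

def make_q():
    return defaultdict(float)          # Q[(state, action)] defaults to 0

def choose_action(Q, state, epsilon=0.9):
    if random.random() < epsilon:      # exploit with probability epsilon
        return max(ACTIONS, key=lambda a: Q[(state, a)])
    return random.choice(ACTIONS)      # explore with probability 1 - epsilon

def update(Q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """One Q-learning backup:
    Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
```

Each module keeps its own Q-table; only the state passed in and the reward signal differ between the two.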
For the module that tries to get to the other side (the "approach" module), in each episode the agent starts at a random spot on one side and explores, updating Q-values, until it reaches some spot on the other side, at which point the episode ends. Be consistent about which side is the start side and which is the end side.
For the obstacle-avoidance module, each episode consists of wandering around for some fixed number of steps. At the start of each episode, change the layout of the obstacles.
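The two episode structures might look like the following self-contained sketch. The grid dimensions, obstacle count, timeout, and episode length are all arbitrary placeholders; the learning updates are omitted to keep the episode shapes visible.

```python
import random

# Sketch of the two episode structures (all specific numbers are
# illustrative): the approach episode runs until the agent reaches the
# far side, with a timeout per the infinite-loop warning above, while
# the avoid episode wanders for a fixed number of steps on a fresh
# obstacle layout.

ROWS, COLS = 10, 4
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(pos, action):
    """Move within the sidewalk; walking off the edge leaves you in place."""
    r, c = pos[0] + ACTIONS[action][0], pos[1] + ACTIONS[action][1]
    return (r, c) if 0 <= r < ROWS and 0 <= c < COLS else pos

def approach_episode(policy, max_steps=500):
    """One approach episode: random start on row 0, ends on the last row."""
    pos = (0, random.randrange(COLS))
    for t in range(max_steps):          # timeout guards against loops
        pos = step(pos, policy(pos))
        if pos[0] == ROWS - 1:
            return t + 1                # steps taken to reach the far side
    return max_steps                    # timed out

def avoid_episode(policy, n_steps=100):
    """One avoidance episode: fixed length, fresh obstacle layout."""
    obstacles = {(random.randrange(ROWS), random.randrange(COLS)) for _ in range(5)}
    pos, bumps = (0, random.randrange(COLS)), 0
    for _ in range(n_steps):
        pos = step(pos, policy(pos))
        bumps += pos in obstacles       # this is where a penalty would apply
    return bumps
```

In your actual training loops, each `step` would be followed by a Q-learning update using the corresponding module's state representation and reward.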
Here is how to combine the two modules: given the current state, pick the action with the greatest weighted average Q-value across the two modules. The weights you pick control how strongly each module influences the agent's behavior. There is one problem with this strategy, however: the approach module's Q-values grow with proximity to the goal, so that module will have undue influence. You will need to come up with a normalization strategy to deal with this problem.
It is your job to come up with a way of convincingly demonstrating that the agent eventually learns a good walking policy and that it improves over time. This takes some thought.
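One common approach (offered only as a suggestion) is to freeze the policy periodically during training, run it greedily on several fresh obstacle layouts, and record average steps-to-goal and obstacle hits. Plotted against training episodes, downward trends in both curves are evidence of learning.

```python
# Sketch of periodic evaluation during training. `run_episode` is any
# hypothetical callable that runs one greedy episode on a fresh layout
# and returns a metric (steps to goal, obstacle hits, total reward, ...).

def evaluate(run_episode, n_trials=20):
    """Average a per-episode metric over several fresh trials, so that
    one lucky or unlucky obstacle layout does not dominate the curve."""
    return sum(run_episode() for _ in range(n_trials)) / n_trials

# e.g. collect (episode_number, evaluate(greedy_episode)) pairs every k
# training episodes and plot them against episode_number.
```

Averaging over multiple layouts per evaluation point also demonstrates generalization, since the agent never trains and tests on the same obstacle configuration.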
Write and submit a report that includes the following: an explanation of everything that was not directly specified by the assignment (such as the state representations, reward functions, and various parameters), and your demonstration that the algorithm works. The more convincing the better. Provide some commentary on all graphs. If something does not seem to work correctly, explain what you think the problem might be. The report should be in PDF format and should contain your name, email, and EID. The command to submit will be something like: turnin --submit jcooper hw5 .