In this section, we present the results of our experiments. We begin by finding an appropriate memory size to use for this task. Then we explore our agent's ability to learn time-varying and nondeterministic defender behavior, introducing a more sophisticated memory storage technique.
While examining the results, keep in mind that even if the agent used the functions and to decide whether to shoot or to pass, the success rate would be significantly less than 100% (it would differ for different defender speeds): there were many defender starting positions for which neither shooting nor passing led to a goal (see Figure 2).
Figure 2: Depending on the defender's starting position (solid rectangle), the agent can score by a) shooting, b) passing, c) neither, or d) both.
For example, in our experiments with the defender moving at a constant speed of 50, an agent acting optimally scores 73.6% of the time, whereas an agent acting randomly scores only 41.3% of the time. These values serve as reference points for evaluating our learning agent's performance. We indicate the scoring rate of an optimally acting agent on our graphs.
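To make the two reference points concrete, the following minimal sketch shows how such baselines could be computed from a table of per-position outcomes. It assumes that, at a fixed defender speed, each defender starting position deterministically decides whether a shot and whether a pass would score, and that the random agent chooses between the two actions with equal probability. The `Outcome` class, the function names, and the four-entry toy table are hypothetical illustrations, not our experimental data.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Outcome:
    """Hypothetical record for one defender starting position."""
    shoot_scores: bool  # would shooting lead to a goal?
    pass_scores: bool   # would passing lead to a goal?


def optimal_rate(outcomes: Sequence[Outcome]) -> float:
    """Scoring rate of an agent that always picks a scoring action if one exists."""
    if not outcomes:
        raise ValueError("need at least one outcome")
    return sum(o.shoot_scores or o.pass_scores for o in outcomes) / len(outcomes)


def random_rate(outcomes: Sequence[Outcome]) -> float:
    """Expected scoring rate of an agent that shoots or passes with probability 1/2 each."""
    if not outcomes:
        raise ValueError("need at least one outcome")
    return sum(0.5 * o.shoot_scores + 0.5 * o.pass_scores for o in outcomes) / len(outcomes)


# Toy table (made up): one position for each of the four cases in Figure 2.
toy = [
    Outcome(shoot_scores=True, pass_scores=False),   # a) shooting only
    Outcome(shoot_scores=False, pass_scores=True),   # b) passing only
    Outcome(shoot_scores=False, pass_scores=False),  # c) neither
    Outcome(shoot_scores=True, pass_scores=True),    # d) both
]

print(optimal_rate(toy))  # 0.75
print(random_rate(toy))   # 0.5
```

Under these assumptions, the optimal rate is capped by the fraction of positions in case c), while the random agent loses a further half of the positions in cases a) and b).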
To increase the optimal success rate, we also experimented with allowing the agent not to act when, for a given defender position, neither shooting nor passing was likely to work. The agent then scored 100% of the time by waiting until the defender moved into a position in which scoring was possible. In this setup, however, we had difficulty collecting meaningful results, since the agent learned how to score when the defender was in a single position and then acted only when the defender was near that position. Therefore, we required the agent to act immediately.