Imagine yourself again as the agent in this setup. Suppose that you have a very important match to play tomorrow, and in your diligent way, you decide to learn for a single ball position whether it is best to pass or to shoot. You convince your teammate to come and stand in the place where you plan to pass, and then you convince your goalie to start in each of several different positions for two attempts: one shot and one pass. Of course the goalie will move to try to block the ball, but being a consistent goalie, you know that she will always move in the same way. Therefore you need only try shooting and passing once for each starting position. After a short amount of time, you have learned perfectly whether you should shoot or pass when the goalie is in a given position (with only a little error due to the necessity of rounding the goalie's position to the nearest position used for training).
The next day you turn up at the big game confident that if the ball is in your chosen spot, you will be able to choose correctly whether to shoot or to pass. You know that the opposing goalie is just as consistent as your own, so you believe that everything you learned yesterday should apply. But alas, your first attempt at a shot--one that you were sure would score--is blocked by the opposing goalie. What happened? The goalie's behavior is still deterministic, but it has changed completely: the new goalie is slower than your own. If you keep acting based on your experiences in practice, you are not going to score much, so you had better start adapting your memory to the current situation.
The memory-based technique we have been using so far works well when the defender's motion is deterministic and remains unchanged over time. However, if some noise is added to the defender's motion or if the defender changes its speed over time, then we need to use a more powerful technique. The technique we have used to this point converges monotonically since it assumes that once the correct action has been learned perfectly at a memory location, it need never change. If there is a training example with the defender starting exactly at position x, then no number of nearby conflicting examples will alter the value in Mem[x].
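For contrast, the monotonic behavior of the basic technique can be sketched as follows (a minimal sketch; the slot representation and function name are illustrative assumptions, not taken from the text):

```python
# Sketch of the basic (non-adaptive) memory: a slot trained once keeps
# its value, and later conflicting examples never alter it.
# Names and representation are illustrative assumptions.

M = 18                    # number of memory slots
mem = [None] * M          # None marks an untrained slot

def store_basic(mem, slot, result):
    """Store `result` (+1 success, -1 failure) only into an empty slot;
    once Mem[x] is set, conflicting examples are ignored."""
    if mem[slot] is None:
        mem[slot] = result

store_basic(mem, 4, +1)   # training example: shot from slot 4 scored
store_basic(mem, 4, -1)   # later conflicting example: has no effect
print(mem[4])             # prints 1
```

Under this rule, the value written by the first exact training example survives indefinitely, which is precisely why the technique cannot track a defender whose behavior changes.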
In our current scenario, memory needs to be able to adapt in response to new conflicting examples. In order to accommodate this requirement, we change our method of storing experiences to memory. We continue to scale the result of an experience at defender position x when storing it to Mem[x'], but rather than only storing the result of the experience with x closest to x', we now let each experience at x affect Mem[x'] in proportion to the distance |x - x'|. In particular, Mem[x'] keeps running sums of the magnitudes of scaled results, Mem[x'].total-a-results, and of scaled positive results, Mem[x'].positive-a-results, affecting x', where ``a'' stands for ``s'' or ``p'' as before. Then at any given time, P_a(x') = 2 * (Mem[x'].positive-a-results / Mem[x'].total-a-results) - 1. The ``-1'' is for the lower bound of our probability range [-1, 1], and the ``2'' is to scale the result to this range. Call this our adaptive memory storage technique.
For example, a successful pass with scaled result 0.5 at Mem[x] (and at its neighbor Mem[y]) would set both total-p-results and positive-p-results for Mem[x] (and Mem[y]) to 0.5, and consequently P_p(x) (and P_p(y)) to 1.0. But then a failed pass with scaled result .75 at Mem[x] would increment total-p-results for Mem[x] by .75, while leaving positive-p-results unchanged. Thus P_p(x) becomes 2 * (.5 / 1.25) - 1 = -.2.
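The update rule and the worked example above can be sketched in code (a minimal sketch; the class and method names, and the choice to pass the distance-based weight in explicitly, are assumptions not given in the text):

```python
class AdaptiveMemory:
    """One entry per discretized defender position; each entry keeps
    running sums of scaled results for each action a in {'s', 'p'}."""

    def __init__(self, size):
        self.slots = [
            {a: {"total": 0.0, "positive": 0.0} for a in ("s", "p")}
            for _ in range(size)
        ]

    def store(self, x, action, success, weight):
        """Add one experience: `weight` is the result's magnitude after
        scaling by the distance between the experience and position x."""
        entry = self.slots[x][action]
        entry["total"] += weight         # running sum of scaled magnitudes
        if success:
            entry["positive"] += weight  # running sum of scaled positives

    def p(self, x, action):
        """P_a(x) = 2 * (positive / total) - 1, in the range [-1, 1]."""
        entry = self.slots[x][action]
        if entry["total"] == 0.0:
            return 0.0                   # no evidence yet
        return 2.0 * entry["positive"] / entry["total"] - 1.0

mem = AdaptiveMemory(360)
mem.store(10, "p", success=True, weight=0.5)    # successful pass, scaled by 0.5
print(mem.p(10, "p"))                           # 1.0
mem.store(10, "p", success=False, weight=0.75)  # failed pass, scaled by .75
print(round(mem.p(10, "p"), 3))                 # 2*(.5/1.25) - 1 = -0.2
```

Unlike the basic technique, repeated conflicting examples keep shifting P_a(x), so stale training is gradually unlearned.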
This method of storing to memory is effective both for time-varying concepts and for concepts involving random noise. It performs better than the basic memory storage technique described earlier because it is able to deal with conflicting examples within the range of the same memory slot.
Figure 3 demonstrates the effectiveness of adaptive memory when the defender's speed changes.
Figure 3: For all trials shown in these graphs, the agent began with a memory trained for a defender moving at constant speed 50. Adaptive memory outperforms basic memory for memories of size 360 (left) and size 18 (right). Since the basic memory does not change over time, the next 1000 trials produced the same results as the first 1000, and therefore are not plotted.
In all of the experiments represented in these graphs, the agent started with a memory trained by attempting a single pass and a single shot with the defender starting at each position x for which Mem[x] is defined and moving in its circle at speed 50. We tested the agent's performance with the defender moving at various (constant) speeds.
Notice that in both graphs of Figure 3, basic memory causes performance to degrade as the defender's speed moves farther from 50. At the extremes, performance even becomes worse than random action, which leads to roughly a 40% success rate. In contrast, with adaptive memory, the agent is able to unlearn the training that no longer applies: it re-learns the new setup. During the first 1000 trials the agent suffers from having practiced in a different situation (especially for the less generalized memory, M = 360), but it is then able to approach optimal behavior over the next 1000 trials. Remember that optimal behavior, represented in the graph, leads to roughly a 70% success rate, since at many starting positions, neither passing nor shooting is successful. As in Table 1, we can see that the smaller memory converges to optimal behavior more quickly than does the larger memory.
From these results we conclude that our adaptive memory can effectively deal with time-varying concepts. It can also perform well when the defender's motion is nondeterministic, as we show next.