The low-level skill that we focus on in this article is the ability to shoot a moving ball. Although a single agent does the shooting, we consider it a multiagent learning scenario since the ball is typically moving as the result of a pass from a teammate: the skill is possible only because other agents are present. Furthermore, as discussed in Section 5, this skill is necessary for the creation of higher-level multiagent behaviors. However, in order to use the skill in a variety of situations, it must be as robust and situation-independent as possible. In this section, we describe in detail how we created a robust, learned behavior in a multiagent scenario.
In all of our experiments there are two agents: a passer and a shooter. The passer accelerates as fast as possible towards a stationary ball in order to propel it between the shooter and the goal. The resulting speed of the ball is determined by the passer's initial distance from the ball. The shooter's task is to time its acceleration so that it intercepts the ball's path and redirects it into the goal. We constrain the shooter to accelerate at a fixed constant rate (while steering along a fixed line) once it has decided to begin its approach. Thus the behavior to be learned consists of the decision of when to begin moving: at each action opportunity the shooter either starts or waits. Once the shooter has started, the decision may not be retracted. The key issue is that the shooter must make its decision based on the observed field: the ball's and the shooter's coordinates, reported at a (simulated) rate of 60 Hz. The method by which the shooter makes this decision is called its shooting policy.
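The start-or-wait decision described above can be sketched as a short loop over the observations. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the names `ShootingPolicy`, `first_start_tick`, and `distance_threshold_policy` are hypothetical, and the distance-threshold rule is only a placeholder standing in for a learned shooting policy.

```python
import math
from typing import Callable, Iterable, Optional, Tuple

Point = Tuple[float, float]
# Hypothetical type: a shooting policy maps the observed ball and shooter
# coordinates to True (start) or False (wait).
ShootingPolicy = Callable[[Point, Point], bool]


def first_start_tick(
    observations: Iterable[Tuple[Point, Point]],
    policy: ShootingPolicy,
) -> Optional[int]:
    """Return the index of the first observation at which the policy starts.

    Each observation is a (ball, shooter) pair of coordinates, one per action
    opportunity. Returns None if the policy never decides to start.
    """
    for tick, (ball, shooter) in enumerate(observations):
        if policy(ball, shooter):
            # The decision cannot be retracted, so the policy is not
            # consulted again after the first "start".
            return tick
    return None


def distance_threshold_policy(threshold: float) -> ShootingPolicy:
    """Placeholder policy (an assumption): start once the ball is close enough."""

    def policy(ball: Point, shooter: Point) -> bool:
        distance = math.hypot(ball[0] - shooter[0], ball[1] - shooter[1])
        return distance < threshold

    return policy
```

Any other rule with the same signature, including a learned one, could be passed to `first_start_tick` in place of the placeholder.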
Throughout our experiments, the shooter's initial position varies randomly within a continuous range: its initial heading varies over 70 degrees and its initial x and y coordinates vary independently over 40 units as shown in Figure 3(a). The two shooters pictured show the extreme possible starting positions, both in terms of heading and location.
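As a rough sketch of this randomization, the shooter's starting pose could be drawn as follows. The text gives only the widths of the ranges (70 degrees and 40 units), so the lower bounds `x_min`, `y_min`, and `heading_min_deg` are hypothetical parameters, and the uniform distribution is an assumption rather than something stated here.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class ShooterStart:
    x: float
    y: float
    heading_deg: float


def sample_shooter_start(
    rng: random.Random,
    x_min: float,            # hypothetical lower bound of the x range
    y_min: float,            # hypothetical lower bound of the y range
    heading_min_deg: float,  # hypothetical lower bound of the heading range
    xy_range: float = 40.0,           # width of the x and y ranges (from the text)
    heading_range_deg: float = 70.0,  # width of the heading range (from the text)
) -> ShooterStart:
    """Draw x, y, and heading independently; uniform sampling is an assumption."""
    return ShooterStart(
        x=rng.uniform(x_min, x_min + xy_range),
        y=rng.uniform(y_min, y_min + xy_range),
        heading_deg=rng.uniform(heading_min_deg, heading_min_deg + heading_range_deg),
    )
```

Passing a seeded `random.Random` instance keeps a series of trials reproducible.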
Since the ball's momentum is initially across the front of the goal, the shooter must compensate by aiming wide of the goal when making contact with the ball (see Figure 3(b)). Before beginning its approach, the shooter chooses a point wide of the goal at which to aim. Once it decides to start, it steers along an imaginary line between this point and the shooter's initial position, continually adjusting its heading until it is moving in the right direction along this line (see Appendix B). The line along which the shooter steers is the steering line. The method by which the shooter chooses the steering line is called its aiming policy.
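The geometry of the steering line can be illustrated with two small helpers. This sketch does not reproduce the steering procedure of Appendix B; it only computes the direction of the line from the shooter's initial position to the aim point and how far a given position lies from that line. The function names and the sign convention are assumptions made for illustration.

```python
import math
from typing import Tuple

Point = Tuple[float, float]


def steering_line_heading_deg(start: Point, aim: Point) -> float:
    """Direction of the line from start to aim, in degrees from the +x axis."""
    return math.degrees(math.atan2(aim[1] - start[1], aim[0] - start[0]))


def cross_track_offset(start: Point, aim: Point, pos: Point) -> float:
    """Signed perpendicular distance of pos from the line through start and aim.

    Positive values lie to the left of the line when facing from start to aim
    (a convention chosen for this sketch).
    """
    dx, dy = aim[0] - start[0], aim[1] - start[1]
    length = math.hypot(dx, dy)
    if length == 0.0:
        raise ValueError("start and aim must be distinct points")
    # 2D cross product of the line direction with the vector from start to pos.
    return (dx * (pos[1] - start[1]) - dy * (pos[0] - start[0])) / length
```

A steering controller could, for example, adjust the heading to drive this offset towards zero while matching the line's direction.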
Figure 3: (a) The initial position for the experiments in this paper. The agent in the lower part of the picture, the passer, accelerates full speed ahead until it hits the ball. Another agent, the shooter, then attempts to redirect the ball into the goal on the left. The two agents in the top of the figure illustrate the extremes of the range of angles of the shooter's initial position. The square behind these two agents indicates the range of the initial position of the center of the shooter. (b) A diagram illustrating the paths of the ball and the agents during a typical trial.
The task of learning a shooting policy has several parameters that can control its level of difficulty. First, the ball can be moving at the same speed for all training examples or at different speeds. Second, the ball can be coming with the same trajectory or with different trajectories. Third, the goal can always be in the same place during testing as during training, or it can change locations (think of this parameter as the possibility of aiming for different parts of the goal). Fourth, the training and testing can occur all in the same location, or the testing can be moved to a different action quadrant: a symmetrical location on the field. To acquire a robust behavior, we perform and report a series of experiments in which we increase the difficulty of our task incrementally. We develop a learning agent to test how far training in a limited scenario can extend. Table 1 indicates how the article is organized according to the parameters varied throughout our experiments. Figure 4 illustrates some of these variations.
Table 1: The parameters that control the difficulty of the task and the sections in which they are varied.
Figure 4: Variations to the initial setup: (a) The initial position in the opposite corner of the field (a different action quadrant); (b) Varied ball trajectory: the line indicates the ball's possible initial positions, while the passer always starts directly behind the ball facing towards a fixed point. The passer's initial distance from the ball controls the speed at which the ball is passed; (c) Varied goal position: the placements of the higher and lower goals are both pictured at once.
We use a supervised learning technique to learn the task at hand. Throughout the article, all training is done with simulated sensor noise of up to about 2 units for x and y and 2 degrees for the heading. All reported success rates are based on at least 1000 trials.
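One way to picture the simulated sensor noise is as a bounded perturbation of each reading. The sketch below assumes independent uniform noise within the stated bounds; the distribution is not given in the text, and the function name `add_sensor_noise` is hypothetical.

```python
import random
from typing import Tuple


def add_sensor_noise(
    x: float,
    y: float,
    heading_deg: float,
    rng: random.Random,
    max_xy_noise: float = 2.0,           # bound in field units (from the text)
    max_heading_noise_deg: float = 2.0,  # bound in degrees (from the text)
) -> Tuple[float, float, float]:
    """Return a noisy (x, y, heading) reading.

    Uniform, independent noise is an assumption made for this sketch.
    """
    return (
        x + rng.uniform(-max_xy_noise, max_xy_noise),
        y + rng.uniform(-max_xy_noise, max_xy_noise),
        heading_deg + rng.uniform(-max_heading_noise_deg, max_heading_noise_deg),
    )
```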