Learning to Acquire the Ball

It is sometimes advantageous to get the ball into a precise position relative to the robot so that the effects of subsequent actions taken by the robot will be more predictable. For example, being able to assume that the ball will start under the robot's chin makes the design of effective kicks much more tractable. Thus, our team adopted the following strategy: when the Aibo is walking to a ball with the intent of kicking it and gets "close enough," it first slows down to allow for more precise positioning, and then it lowers its head to capture the ball under its chin (this is the "capturing motion").

Video: training with the initial policy (2.1 MB MPEG)

However, executing this motion so that the ball is not knocked away in the process is a challenge: if the head is lowered when the ball is too far away, the head may knock the ball away, but if it is not lowered in time, the body of the robot may bump the ball away. Naturally, the time at which the head should be lowered depends on how fast the robot is walking, so the amount that the robot slows down when close to the ball must be tuned simultaneously with the timing of the capturing motion. For the same reason, every time the speed of the base gait changes (due to a newly-designed gait or a different walking surface, for example), the approach must be retuned. This is a time-consuming task to perform by hand.
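The coupling described above can be made concrete with a small sketch. The parameter names and threshold values below are purely illustrative, not taken from the actual Aibo controller; the point is only that the distance at which the head is lowered and the speed of the final approach are separate parameters that must be tuned as a pair.

```python
def acquisition_step(ball_distance_mm: float, params: dict) -> str:
    """Choose the action for one control step of the ball approach.

    The two entries of `params` are the quantities that must be tuned
    together: where to begin the slow approach, and where to trigger
    the capturing motion. Both depend on how fast the robot walks.
    """
    if ball_distance_mm <= params["capture_distance_mm"]:
        return "lower_head"        # trigger the capturing motion
    if ball_distance_mm <= params["slowdown_distance_mm"]:
        return "walk_slow"         # approach at the tuned reduced speed
    return "walk_full_speed"


# Illustrative starting values; these are exactly the numbers that
# would need retuning whenever the underlying gait changes.
params = {
    "slowdown_distance_mm": 300.0,
    "capture_distance_mm": 80.0,
}
```

If the gait becomes faster, the robot covers more ground per control step, so both thresholds effectively shift and the pair must be retuned together.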

Video: training with the best learned policy (2.2 MB MPEG)

Therefore, we have designed a training scenario in which the robot can learn the best parameters for this action on its own, with no human intervention other than battery changes. The following two videos depict the training process and show the difference between the initial behavior and the learned behavior. (Note that the "pushing" motion executed by the robot is not intended to kick the ball away; it serves only as a visual indicator, so that human observers can see when the robot believes it has bumped the ball away and therefore is counting the trial as a failure.)

In the first video, we show the training process, during which the robot executes the initial policy. It captures the ball successfully on its first attempt, but then fails several times in succession. In the second video, the robot is training using the best policy learned by the policy gradient algorithm. It captures the ball successfully much more often.
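As a rough illustration of the kind of search involved, the sketch below implements one iteration of a generic finite-difference policy gradient update: perturb the current parameter vector several times, score each perturbed policy (here, the score would be something like the capture success rate over a batch of trials), estimate a gradient per dimension from the grouped scores, and step in that direction. This is a minimal, generic sketch, not the exact algorithm from the paper; all names are hypothetical.

```python
import random


def policy_gradient_step(theta, evaluate, epsilon, step_size, n_perturb=8):
    """One iteration of a finite-difference policy gradient search.

    theta:     list of policy parameters (e.g. slow-walk speed,
               capture-trigger distance)
    evaluate:  maps a parameter vector to a scalar score
               (e.g. fraction of successful captures in a batch of trials)
    """
    # Randomly perturb each parameter by -epsilon, 0, or +epsilon,
    # and score every perturbed policy.
    perturbations = [
        [random.choice((-epsilon, 0.0, epsilon)) for _ in theta]
        for _ in range(n_perturb)
    ]
    scores = [evaluate([t + d for t, d in zip(theta, delta)])
              for delta in perturbations]

    def avg(xs):
        return sum(xs) / len(xs) if xs else 0.0

    # For each dimension, compare average scores of the +epsilon, zero,
    # and -epsilon groups to estimate the partial derivative.
    gradient = []
    for i in range(len(theta)):
        plus = [s for s, d in zip(scores, perturbations) if d[i] > 0]
        zero = [s for s, d in zip(scores, perturbations) if d[i] == 0]
        minus = [s for s, d in zip(scores, perturbations) if d[i] < 0]
        if not plus or not minus or (
                zero and avg(zero) >= max(avg(plus), avg(minus))):
            gradient.append(0.0)   # no clear improvement in this dimension
        else:
            gradient.append(avg(plus) - avg(minus))

    # Normalize the estimate and take a fixed-size step along it.
    norm = sum(g * g for g in gradient) ** 0.5
    if norm == 0.0:
        return theta
    return [t + step_size * g / norm for t, g in zip(theta, gradient)]
```

Because each evaluation is just a batch of physical trials scored by success or failure, the robot can run this loop unattended, which is what makes the hands-off training scenario above possible.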

Full details of our approach are available in the following paper:
