Igor Karpov (ikarpov@cs.utexas.edu)

Mathematica Notebook of this page.

This is for Exercise 7.3 in Sutton and Barto's Reinforcement Learning: An Introduction.

In section 7.1 of the book, two methods of n-step TD prediction are discussed. One method, **on-line n-step TD prediction**, updates the value function after each step in the episode. The other method, **off-line n-step TD prediction**, computes a cumulative TD update for the state value function over the course of the entire episode. In Figure 7.2 of the book, experimental results are presented showing the average root mean square error of the learned value function for different values of α and n for both methods.

It turns out that in off-line n-step TD prediction, RMS error grows with α much more sharply for n=3, 5, 7, and 9 than for n=1, 2, 4, 6, and 8. Exercise 7.3 asks the reader to explain this effect. The following is an intuitive statistical explanation of what might be happening.
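To make the setup concrete, here is a minimal Python sketch of the experiment (an illustrative re-implementation, not the original Mathematica code), assuming the 19-state random walk of Figure 7.2 with rewards −1 and +1 at the left and right terminals and γ=1; the names `generate_walk` and `offline_n_step_td` are my own:

```python
import random

N_STATES = 19   # non-terminal states 0..18; the walk starts in the middle
START = 9

def generate_walk(rng):
    """Unbiased random walk: move left/right with equal probability until
    stepping off either end. Returns the visited states and the terminal
    reward (-1 off the left end, +1 off the right end; all other rewards 0)."""
    states, s = [START], START
    while True:
        s += rng.choice((-1, 1))
        if s < 0:
            return states, -1.0
        if s >= N_STATES:
            return states, 1.0
        states.append(s)

def offline_n_step_td(V, walk, reward, n, alpha):
    """One episode of OFF-LINE n-step TD prediction with gamma = 1: every
    n-step return is computed from the value function as it stood at the
    start of the episode, and the summed corrections are applied only after
    the whole episode has been processed."""
    T = len(walk)
    delta = [0.0] * N_STATES
    for t in range(T):
        # Corrected n-step truncated return. All intermediate rewards are 0,
        # so it collapses to V(s_{t+n}), or to the terminal reward when the
        # walk terminates within n steps.
        G = V[walk[t + n]] if t + n < T else reward
        delta[walk[t]] += alpha * (G - V[walk[t]])
    for s, d in enumerate(delta):
        V[s] += d
    return V

# Example: 10 episodes of off-line 3-step TD at alpha = 0.4
rng = random.Random(0)
V = [0.0] * N_STATES
for _ in range(10):
    walk, reward = generate_walk(rng)
    offline_n_step_td(V, walk, reward, 3, 0.4)
```

The on-line variant would instead apply each correction to V immediately, so later returns within the same episode would already see the updated values.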

If we look at the average value function learned after 10 episodes using off-line 2-step TD prediction, we see a reasonable approximation (Figure 1).

If we look at the average value function learned after 10 episodes using off-line 3-step TD prediction, we see a "resonance effect" in the middle (Figure 2).

Apparently, under certain conditions and for higher values of α, the value function after 10 episodes is a poor approximation of the true value (RMS error is high). While trying to re-create what was happening, I accidentally introduced a bug into my code which made it perform n-step off-line TD prediction on the same (long) walk 10 times. For odd values of n, I sometimes saw the following effect:

Frequency of states encountered during a particular walk, starting at state 9. We can see that the walk spent a long time around states 7,8 and 13,14.

Value function after one series of off-line 3-step TD updates on the same walk

Value function after two series of off-line 3-step TD updates on the same walk

Value function after three series of off-line 3-step TD updates on the same walk

Value function after four series of off-line 3-step TD updates on the same walk

Value function after five series of off-line 3-step TD updates on the same walk

Value function after six series of off-line 3-step TD updates on the same walk. **Notice that something is going wrong around states 12 and 13**.

Value function after seven series of off-line 3-step TD updates on the same walk. **Notice that something is going wrong around states 12 and 13**.

Value function after eight series of off-line 3-step TD updates on the same walk. **Notice that something is going wrong around states 12 and 13**.

Value function after nine series of off-line 3-step TD updates on the same walk. **Notice that something is going wrong around states 12 and 13**.

Value function after ten series of off-line 3-step TD updates on the same walk. **Notice that something is going wrong around states 12 and 13, and in general the value function is now in poor shape compared to its true value.**

Around the sixth or seventh series of updates, we start to see that something is going wrong. The state values overshoot their mark and then repeatedly overcompensate. If we were to continue this experiment, the value function would grow without bound.

What is actually happening here is in some sense similar to wave resonance: an error signal feeds into itself due to the configuration of the walk (see the mathematical explanation below). It is interesting that the peak of the "resonance" seems to coincide with the states that are visited most frequently. When I tried this with even n or with n=1, I did not notice the effect nearly as often or as strongly, consistent with the book's results.

This effect can also be seen when we train with 10 different random walks, although it is less pronounced and less frequent.

Value function after one episode of off-line 3-step TD updates on a random walk

Value function after two episodes of off-line 3-step TD updates on a random walk

Value function after three episodes of off-line 3-step TD updates on a random walk

Value function after four episodes of off-line 3-step TD updates on a random walk

Value function after five episodes of off-line 3-step TD updates on a random walk

Value function after six episodes of off-line 3-step TD updates on a random walk

Value function after seven episodes of off-line 3-step TD updates on a random walk

Value function after eight episodes of off-line 3-step TD updates on a random walk

Value function after nine episodes of off-line 3-step TD updates on a random walk

Value function after ten episodes of off-line 3-step TD updates on a random walk

We can see from the sequence of value functions above that a similar resonance effect can appear even when we train on a new random walk each time. What is actually going on in this effect?

To see what is happening, consider the n-step TD update:

$$\Delta V_t(s_t) = \alpha \left[ R^{(n)}_t - V_t(s_t) \right]$$

In the off-line case, V(s) is held constant during the episode, so each update becomes:

$$\Delta V(s_t) = \alpha \left[ R^{(n)}_t - V(s_t) \right]$$

The corrected n-step truncated return in this case is going to be

$$R^{(n)}_t = r_{t+1} + \gamma r_{t+2} + \dots + \gamma^{n-1} r_{t+n} + \gamma^n V(s_{t+n}) = V(s_{t+n})$$

because all the rewards along the way are 0 and γ=1. If we consider a walk which contains R n-step sequences A->...->B and R n-step sequences B->...->A, then the contributions to the updates from these states are:

$$\Delta V[A] = R\alpha \left( V[B] - V[A] \right), \qquad \Delta V[B] = R\alpha \left( V[A] - V[B] \right)$$

After the next episode, which in this (imaginary) example contains the same number of repetitions, the values would be

$$V_2[A] = V[A] + 2R\alpha (R\alpha - 1)\left( V[A] - V[B] \right), \qquad V_2[B] = V[B] + 2R\alpha (R\alpha - 1)\left( V[B] - V[A] \right)$$

**Note that if A=B, all the updates go to zero. If the two states are different and $2R\alpha(R\alpha - 1) > 1$, then it is possible for V[A] and V[B] to reinforce each other without bound (or resonate).** Intuitively, the conditions for this effect are a high enough rate of n-step mirror sequences and a high enough α.
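This two-state model is easy to check numerically. The sketch below (an illustrative Python model, not from the notebook) iterates one off-line episode containing R mirror sequences A->...->B and B->...->A, with γ=1 and all rewards 0:

```python
def resonance_episode(vA, vB, R, alpha):
    """One off-line episode containing R n-step sequences A->...->B and R
    mirror sequences B->...->A (gamma = 1, all intermediate rewards 0).
    Off-line means both deltas are computed from the OLD values and are
    applied together only at the end of the episode."""
    dA = R * alpha * (vB - vA)  # each A->...->B sequence pulls V[A] toward the old V[B]
    dB = R * alpha * (vA - vB)  # each B->...->A sequence pulls V[B] toward the old V[A]
    return vA + dA, vB + dB

# Divergent case: R*alpha = 1.5, so 2*R*alpha*(R*alpha - 1) = 1.5 > 1.
vA, vB = 0.2, 0.0
gaps = []
for _ in range(10):
    gaps.append(abs(vA - vB))
    vA, vB = resonance_episode(vA, vB, R=3, alpha=0.5)
```

Each episode multiplies the gap |V[A] − V[B]| by |1 − 2Rα| = 2 here, so the two values overshoot each other by more and more every episode. With R=1 and α=0.25 the same recursion shrinks the gap by half per episode instead, which is the well-behaved regime.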

How do we explain this effect in the real task? The resonance seems to appear when the walk contains many n-step sequences A->...->B and B->...->A at the same time for pairs of states A~=B, especially when these states are close by. How likely is this to happen? The plots below show how likely the value of a state d steps away is to be used for the truncation term.

Distribution of distances to the truncation state in the corrected 1-step truncated return

Distribution of distances to the truncation state in the corrected 2-step truncated return

Distribution of distances to the truncation state in the corrected 3-step truncated return

Distribution of distances to the truncation state in the corrected 4-step truncated return

Distribution of distances to the truncation state in the corrected 5-step truncated return

Distribution of distances to the truncation state in the corrected 6-step truncated return

Distribution of distances to the truncation state in the corrected 7-step truncated return

Distribution of distances to the truncation state in the corrected 8-step truncated return

Distribution of distances to the truncation state in the corrected 9-step truncated return

The above plots show that while for even values of n a considerable proportion of the n-step sequences terminate in the same state they started from, this cannot happen for odd values of n. For odd values of n, more n-step sequences terminate in a nearby state instead.
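The parity claim can be verified directly: each step of the walk flips the parity of the position, so the displacement after n steps always has the same parity as n. A small sampling sketch (the helper `displacement_counts` is my own, not from the notebook):

```python
import random

def displacement_counts(n, trials=10_000, seed=1):
    """Count signed displacements after n steps of an unbiased +/-1 walk.
    Every step flips the parity of the position, so after n steps the
    displacement always has the same parity as n; in particular, for odd n
    the sequence can never end in the state it started from (displacement 0)."""
    rng = random.Random(seed)
    counts = {}
    for _ in range(trials):
        d = sum(rng.choice((-1, 1)) for _ in range(n))
        counts[d] = counts.get(d, 0) + 1
    return counts
```

For even n the displacement 0 shows up with substantial probability (e.g. 1/2 for n=2), while for odd n it never appears, matching the plots above.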

Here is the number of other-state resonances for n from 1 to 9. An "other-state" resonance occurs when a walk contains both an A->B transition and a B->A transition for some A, B where A~=B. Note that the states closer to the middle are more likely to have resonances with other states. Also note that there are more "other-state" resonances for odd values of n (at the top of the graph).

The number of resonance n-step sequences (A->B transitions for which there is a B->A transition, A~=B) for each state A, n= 1.

The number of resonance n-step sequences (A->B transitions for which there is a B->A transition, A~=B) for each state A, n= 2.

The number of resonance n-step sequences (A->B transitions for which there is a B->A transition, A~=B) for each state A, n= 3.

The number of resonance n-step sequences (A->B transitions for which there is a B->A transition, A~=B) for each state A, n= 4.

The number of resonance n-step sequences (A->B transitions for which there is a B->A transition, A~=B) for each state A, n= 5.

The number of resonance n-step sequences (A->B transitions for which there is a B->A transition, A~=B) for each state A, n= 6.

The number of resonance n-step sequences (A->B transitions for which there is a B->A transition, A~=B) for each state A, n= 7.

The number of resonance n-step sequences (A->B transitions for which there is a B->A transition, A~=B) for each state A, n= 8.

The number of resonance n-step sequences (A->B transitions for which there is a B->A transition, A~=B) for each state A, n= 9.

The average total number of resonances with other states for different values of n for each walk. Note that the number of resonances is always higher for odd n; however, this difference decreases over time. We can see a summary of this on the plot below. If a larger number of resonances corresponds to a higher likelihood of the resonance effect, this is consistent with the results from the book, except that it does not explain why n=1 does not exhibit the same behavior.

From the graph above, we can see that, on average, there is only about 1 resonance per state for n even, while for n odd this number is higher.

So far we have not seen what distinguishes n=1 from the other odd values of n. To explain the difference, we can look at the average number of non-zero updates to each state during the first 10 episodes:

Number of non-zero updates over 10 episodes for n=1

Number of non-zero updates over 10 episodes for n=2

Number of non-zero updates over 10 episodes for n=3

Number of non-zero updates over 10 episodes for n=4

Number of non-zero updates over 10 episodes for n=5

Number of non-zero updates over 10 episodes for n=6

Number of non-zero updates over 10 episodes for n=7

Number of non-zero updates over 10 episodes for n=8

Number of non-zero updates over 10 episodes for n=9

The graphs above are labeled with the value of n at the top. As we can see, the non-zero TD error does not reach the middle of the state space, where resonance is most likely to occur. If this explanation is correct, then this difference should go away as we increase the number of training episodes.
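Why the non-zero error stays near the ends is easy to see: with a zero-initialized value function, all intermediate rewards zero, and values frozen during an off-line episode, only returns that reach the terminal reward are non-zero, so at most the last n visited states can receive a non-zero update in the first episode. A small sketch (the helper `nonzero_updates` is my own, assuming the 19-state walk):

```python
N_STATES = 19

def nonzero_updates(walk, reward, n, alpha=0.1):
    """States receiving a non-zero off-line n-step TD update in one episode,
    starting from an all-zero value function (gamma = 1, rewards 0 except at
    termination). V is frozen during the episode, so every truncated return
    that does not reach the terminal reward is 0 and produces no update."""
    V = [0.0] * N_STATES
    T = len(walk)
    updated = set()
    for t in range(T):
        G = V[walk[t + n]] if t + n < T else reward
        if alpha * (G - V[walk[t]]) != 0.0:
            updated.add(walk[t])
    return updated

# A walk that drifts from the middle straight off the right end:
walk = list(range(9, 19))   # states 9, 10, ..., 18, then terminate with reward +1
```

With n=1 only the final state before termination is updated, while with n=3 the last three states are, so larger n propagates the terminal reward inward faster per episode.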

What happens in Exercise 7.3 seems to be a combination of several things.

First, repeated A->B, B->A n-step sequences in the random walk can lead to mutual "overcorrection" of the TD error in the off-line n-step TD update if they happen frequently enough, if α is high enough, and if γ is close enough to 1. After several episodes of off-line training, the compound effect of this overcorrection can make the value function grow in a "standing wave" or "resonance" pattern.

Second, this effect does not happen when a state transitions to itself in n steps; transitions to self can only happen when n is even, which makes the n-step back-and-forth repeat sequences much more likely in the odd case.

Third, the highest probability of n-step resonance falls in the middle of the random-walk state space. Since for n=1 it takes a larger number of episodes to propagate the non-zero reward back to the middle states, n=1 does not show the resonance anomaly during the first 10 episodes as the other odd values of n do.

While much of the reason off-line TD fails on this task has to do with rather unique conditions, the problem of repeated state sequences is a general one and could lead to similar resonance effects in other situations. It is easy to combat with a reasonable discount factor.
