ALAFF Technical details

Subsection 8.3.4 Technical details

This unit is probably the most technically difficult unit in the course. We give the details here for completeness, but you will likely live a happy and productive research life without worrying about them too much... The important part is the final observation: that the next search direction computed by the Conjugate Gradient Method is a linear combination of the current residual (the direction of steepest descent) and the last search direction.

Let's look more carefully at \(p^{(k)} \) that satisfies

\begin{equation*} \| r^{(k)} - p^{(k)} \|_2 = \min_{p \perp \Span( Ap^{(0)}, \ldots, Ap^{(k-1)} )} \| r^{(k)} - p \|_2 . \end{equation*}

Notice that

\begin{equation*} r^{(k)} = v + p^{(k)} \end{equation*}

where \(v \) is the orthogonal projection of \(r^{(k)} \) onto \(\Span( Ap^{(0)}, \ldots, Ap^{(k-1)} )\)

\begin{equation*} \| r^{(k)} - v \|_2 = \min_{w \in \Span( Ap^{(0)}, \ldots, Ap^{(k-1)} )} \| r^{(k)} - w \|_2 \end{equation*}

which can also be formulated as \(v = A P^{(k-1)} z^{(k)} \text{,}\) where

\begin{equation*} \| r^{(k)} - A P^{(k-1)} z^{(k)} \|_2 = \min_{z \in \R^k} \| r^{(k)} - A P^{(k-1)} z \|_2 . \end{equation*}

This can be recognized as a standard linear least squares problem. This allows us to make a few important observations:

Theorem 8.3.4.1.

In Figure 8.3.3.1,

\(P^{(k-1)\,T} r^{(k)} = 0. \)
\(\Span( p^{(0)}, \ldots, p^{(k-1)} ) = \Span( r^{(0)}, \ldots, r^{(k-1)} ) = \Span( b, A b, \ldots, A^{k-1} b ). \)

Proof.

Proving that

\begin{equation*} P^{(k-1)\,T} r^{(k)} = 0. \end{equation*}

starts by considering that

\begin{equation*} \begin{array}{l} f( P^{(k-1)} y ) \\ ~~~=~~~~ \\ \frac{1}{2} ( P^{(k-1)} y )^T A ( P^{(k-1)} y ) - ( P^{(k-1)} y )^T b \\ ~~~= ~~~~\\ \frac{1}{2} y^T ( P^{(k-1)\,T} A P^{(k-1)} ) y - y^T P^{(k-1)\,T} b \end{array} \end{equation*}

is minimized by \(y_0 \) that satisfies

\begin{equation*} ( P^{(k-1)\,T} A P^{(k-1)} ) y_0 = P^{(k-1)\,T} b. \end{equation*}

Since \(x^{(k)} \) minimizes

\begin{equation*} \min_{ x \in \Span( p^{(0)}, \ldots , p^{(k-1)} )} f( x ) \end{equation*}

we conclude that \(x = P^{(k-1)} y_0 \text{.}\) But then

\begin{equation*} 0 = P^{(k-1)\,T} b - \left( P^{(k-1)\,T} A x^{(k)} \right) = P^{(k-1)\,T} \left( b - A x^{(k)} \right) = P^{(k-1)\,T} r^{(k)}. \end{equation*}
Show that \(\Span( p^{(0)}, \ldots, p^{(k-1)} ) = \Span( r^{(0)}, \ldots, r^{(k-1)} ) = \Span( b, A b, \ldots, A^{k-1} b ).\)

Proof by induction on \(k \text{.}\)
- Base case: \(k = 1 \text{.}\)
  
  The result clearly holds since \(p^{(0)} = r^{(0)} = b \text{.}\)
- Inductive Hypothesis: Assume the result holds for \(n \leq k \text{.}\)
  
  Show that the result holds for \(k = n+1 \text{.}\)
  - If \(k = n + 1 \) then \(r^{(k-1}) = r^{(n)} = r^{(n-1)} - \alpha_{n-1} A p^{(n-1)} \text{.}\) By I.H.
    
    \begin{equation*} r^{(n-1)} \in \Span( b, A b , \ldots , A^{n-1} b ) \end{equation*}
    
    and
    
    \begin{equation*} p^{(n-1)} \in \Span( b, A b , \ldots , A^{n-1} b )\text{.} \end{equation*}
    
    But then
    
    \begin{equation*} A p^{(n-1)} \in \Span( A b, A^2 b , \ldots , A^{n} b ) \end{equation*}
    
    and hence
    
    \begin{equation*} r^{(n)} \in \Span( b, A b, A^2 b , \ldots , A^{n} b ). \end{equation*}
  - \(p^{(n)} = r^{(n)} - A P^{(n-1)} y_0 \) and hence
    
    \begin{equation*} p^{(n)} \in \Span( b, A b, A^2 b , \ldots , A^{n} b ) \end{equation*}
    
    since
    
    \begin{equation*} r^{(n)} \in \Span( b, A b, A^2 b , \ldots , A^{n} b ) \end{equation*}
    
    and
    
    \begin{equation*} A P^{n-1} y_0 \in \Span( A b, A^2 b , \ldots , A^{n} b ). \end{equation*}
- We complete the inductive step by noting that all three subspaces have the same dimension and hence must be the same subspace.
- By the Principle of Mathematical Induction, the result holds.

Definition 8.3.4.2. Krylov subspace.

The subspace

\begin{equation*} {\cal K}_k( A, b ) = \Span( b, A b, \ldots, A^{k-1} b ) \end{equation*}

is known as the order-k Krylov subspace.

The next technical detail regards the residuals that are computed by the Conjugate Gradient Method. They are mutually orthogonal, and hence we, once again, conclude that the method must compute the solution (in exact arithmetic) in at most \(n \) iterations. It will also play an important role in reducing the number of matrix-vector multiplications needed to implement the final version of the Conjugate Gradient Method.

Theorem 8.3.4.3.

The residual vectors \(r^{(k)} \) are mutually orthogonal.

Proof.

In Theorem 8.3.4.1 we established that

\begin{equation*} \Span( p^{(0)}, \ldots, p^{(j-1)} ) = \Span( r^{(0)}, \ldots, r^{(j-1)} ) \end{equation*}

and hence

\begin{equation*} \Span( r^{(0)}, \ldots, r^{(j-1)} ) \subset \Span( p^{(0)}, \ldots, p^{(k-1)} ) = \end{equation*}

for \(j \lt k \text{.}\) Hence \(r^{(j)} = P^{(k-1)} t^{(j)} \) for some vector \(t^{(j)} \in \R^k \text{.}\) Then

\begin{equation*} r^{(k)\,T} r^{(j)} = r^{(k)\,T} P^{(k-1)} t^{(j)} = 0. \end{equation*}

Since this holds for all \(k \) and \(j \lt k \text{,}\) the desired result is established.

Next comes the most important result. We established that

\begin{equation} p^{(k)} = r^{(k)} - A P^{(k-1)} z^{(k)}\label{chapter08-cg-eq-1}\tag{8.3.3} \end{equation}

where \(z^{(k)} \) solves

\begin{equation*} \min_{z \in \R^k} \| r^{(k)} - A P^{(k-1)} z \|_2 . \end{equation*}

What we are going to show is that in fact the next search direction equals a linear combination of the current residual and the previous search direction.

Theorem 8.3.4.4.

For \(k \geq 1 \text{,}\) the search directions generated by the Conjugate Gradient Method satisfy

\begin{equation*} p^{(k)} = r^{(k)} + \gamma_{k} p^{(k-1)} \end{equation*}

for some constant \(\gamma_k \text{.}\)

Proof.

This proof has a lot of very technical details. No harm done if you only pay cursory attention to those details.

Partition \(z^{(k-1)} = \left( \begin{array}{c} z_0 \\ \zeta_1 \end{array} \right) \) and recall that \(r^{(k)} = r^{(k-1)} - \gamma_{k-1} A p^{(k-1)} \) so that

\begin{equation*} \begin{array}{l} p^{(k)} \\ ~~~=~~~~ \lt \mbox{ (8.3.3) } \gt \\ r^{(k)} - A P^{(k-1)} z^{(k-1)} \\ ~~~=~~~~ \lt z^{(k-1)} = \left( \begin{array}{c} z_0 \\ \zeta_1 \end{array} \right) \gt \\ r^{(k)} - A P^{(k-2)} z_0 + \zeta_1 A p^{(k-1)} \\ ~~~=~~~~ \lt \gt \\ r^{(k)} - \left( A P^{(k-2)} z_0 + \zeta_1( r^{(k)} - r^{(k-1)} ) / \alpha_{k-1} \right) \\ ~~~=~~~~ \lt \gt \\ \left( 1 - \frac{\zeta_1}{\alpha_{k-1}} \right) r^{(k)} + \begin{array}[t]{c} \underbrace{ \left( \frac{\zeta_1}{\alpha_{k-1} }r^{(k-1)} -A P^{(k-2)} z_0 \right) } \\ s^{(k)} \end{array} \\ ~~~=~~~~ \lt \gt \\ \left( 1 - \frac{\zeta_1}{\alpha_{k-1}} \right) r^{(k)} + s^{(k)}. \end{array} \end{equation*}

We notice that \(r^{(k)} \) and \(s^{(k)} \) are orthogonal. Hence

\begin{equation*} \| p^{(k)} \|_2^2 = \left( 1 + \frac{\zeta_1}{\alpha_{k-1}} \right) \| r^{(k)} \|_2^2 + \| s^{(k)} \|_2^2 \end{equation*}

and minimizing \(p^{(k)} \) means minimizing the two separate parts. Since \(r^{(k)} \) is fixed, this means minimizing \(\| s^{(k)} \|_2^2 \text{.}\) An examination of \(s^{(k)} \) exposes that

\begin{equation*} s^{(k)} = \frac{\zeta_1}{\alpha_{k-1} }r^{(k-1)} -A P^{(k-2)} z_0 = - \frac{\zeta_1}{\alpha_{k-1} } \left( r^{(k-1)} - A P^{(k-2)} w_0 \right) \end{equation*}

where \(w_0 = -( \alpha_{k-1} / \zeta_1 ) z_0 \text{.}\) We recall that

\begin{equation*} \| r^{(k-1)} - p^{(k-1)} \|_2 = \min_{p \perp \Span( p^{(0)}, \ldots , p^{(k-2)} )} \| r^{(k-1)} - A p \|_2 \end{equation*}

and hence we conclude that \(s_k \) is a vector the direction of \(p^{(k-1)} \text{.}\) Since we are only interested in the direction of \(p^{(k)} \text{,}\) \(\frac{\zeta_1}{\alpha_{k-1}} \) is not relevant. The upshot of this lengthy analysis is that

\begin{equation*} p^{(k)} = r^{(k)} + \gamma_{k} p^{(k-1)}. \end{equation*}

This implies that while the Conjugate Gradient Method is an A-conjugate method and hence leverages a "memory" of all previous search directions,

\begin{equation*} f( x^{(k)}) = \min_{x \in \Span( p^{(0)}, \ldots , p^{(k-1)} )} f( x ), \end{equation*}

only the last search direction is needed to compute the current one. This reduces the cost of computing the current search direction and means we don't have to store all previous ones.

Remark 8.3.4.5.

This is a very, very, very big deal...