ALAFF Blocked Householder QR factorization

Subsection 3.4.1 Blocked Householder QR factorization

Subsubsection 3.4.1.1 Casting computation in terms of matrix-matrix multiplication

Modern processors have very fast processors with very fast floating point units (which perform the multiply/adds that are the bread and butter of our computations), but very slow memory. Without getting into details, the reason is that modern memories are large and hence are physically far from the processor, with limited bandwidth between the two. To overcome this, smaller "cache" memories are closer to the CPU of the processor. In order to achieve high performance (efficient use of the fast processor), the strategy is to bring data into such a cache and perform a lot of computations with this data before writing a result out to memory.

Operations like a dot product of vectors or an "axpy" (\(y := \alpha x + y \)) perform \(O( m ) \) computation with \(O( m ) \) data and hence don't present much opportunity for reuse of data. Similarly, matrix-vector multiplication and rank-1 update operations perform \(O( m^2 ) \) computation with \(O( m^2 ) \) data, again limiting the opportunity for reuse. In contrast, matrix-matrix multiplication performs \(O( m^3 ) \) computation with \(O( m^2 ) \) data, and hence there is an opportunity to reuse data.

The goal becomes to rearrange computation so that most computation is cast in terms of matrix-matrix multiplication-like operations. Algorithms that achieve this are called blocked algorithms.

It is probably best to return to this enrichment after you have encountered simpler algorithms and their blocked variants later in the course, since Householder QR factorization is one of the more difficult operations to cast in terms of matrix-matrix multiplication.

Subsubsection 3.4.1.2 Accumulating Householder transformations

Given a sequence of Householder transformations, computed as part of Householder QR factorization, these Householder transformations can be accumulated into a new transformation: If \(H_0, \cdots , H_{k-1} \) are Householder transformations, then

\begin{equation*} H_0 H_1 \cdots H_{k-1} = I - U T^{-1} U^H, \end{equation*}

where \(T \) is an upper triangular matrix. If \(U \) stores the Householder vectors that define \(H_0, \ldots, H_{k-1}\) (with "1"s explicitly on its diagonal) and \(t \) holds the scalars \(\tau_0, \ldots, \tau_{k-1} \text{,}\) then

T := FormT( U, t )

computes the desired matrix \(T \text{.}\) Now, applying this UT transformation to a matrix \(B \) yields

\begin{equation*} ( I - U T^{-1} U^H ) B = B - U ( T^{-1} ( U^H B ) ), \end{equation*}

which demontrates that this operations requires the matrix-matrix multiplication \(W := U^H B \text{,}\) the triangular matrix-matrix multiplication \(W := T^{-1} W \) and the matrix-matrix multipication \(B - U W \text{,}\) each of which can attain high performance.

In [23] we call the transformation \(I - U T^{-1} U^H \) that equals the accumulated Householder transformations the UT transform and prove that \(T \) can instead by computed as

\begin{equation*} T = \triu( U^H U ) \end{equation*}

(the upper triangular part of \(U^H U \)) followed by either dividing the diagonal elements by two or setting them to \(\tau_0 , \ldots, \tau_{k-1} \) (in order). In that paper, we point out similar published results [8] [35] [46] [32].

Subsubsection 3.4.1.3 A blocked algorithm

A QR factorization that exploits the insights that yielded the UT transform can now be described:

Partition

\begin{equation*} A \rightarrow \FlaTwoByTwoSingleLine { A_{11} }{ A_{12} } { A_{21} }{ A_{22} } \end{equation*}

where \(A_{11} \) is \(b \times b \text{.}\)
We can use the unblocked algorithm in Subsection 3.3.4 to factor the panel \(\FlaTwoByOneSingleLine { A_{11} } { A_{21} }\)

\begin{equation*} [ \FlaTwoByOneSingleLine { A_{11} } { A_{21} }, t_1 ] := {\rm HouseQR\_unb\_var1}( \FlaTwoByOneSingleLine { A_{11} } { A_{21} } ), \end{equation*}

overwriting the entries below the diagonal with the Householder vectors \(\FlaTwoByOneSingleLine { U_{11} } { U_{21} }\) (with the ones on the diagonal implicitly stored) and the upper triangular part with \(R_{11} \text{.}\)
Form \(T_{11} \) from the Householder vectors using the procedure described earlier in this unit:

\begin{equation*} T_{11} := {\rm FormT}( \FlaTwoByOneSingleLine { A_{11} } { A_{21} } ) \end{equation*}
Now we need to also apply the Householder transformations to the rest of the columns:

\begin{equation*} \begin{array}{l} \FlaTwoByOneSingleLine { A_{12} } { A_{22} } \\ ~~~=~~~~ \\ \left( I - \FlaTwoByOneSingleLine { U_{11} } { U_{21} } T_{11}^{-1} \FlaTwoByOneSingleLine { U_{11} } { U_{21} }^H \right)^H \FlaTwoByOneSingleLine { A_{12} } { A_{22} } \\ ~~~=~~~~ \\ \FlaTwoByOneSingleLine { A_{12} } { A_{22} } - \FlaTwoByOneSingleLine { U_{11} } { U_{21} } W_{12} \\ ~~~=~~~~ \\ \FlaTwoByOneSingleLine { A_{12} - U_{11} W_{12} } { A_{22} - U_{21} W_{12} } , \end{array} \end{equation*}

where

\begin{equation*} W_{12} = T_{11}^{-H} ( U_{11}^H A_{12} + U_{21}^H A_{22} ). \end{equation*}

This motivates the blocked algorithm in Figure 3.4.1.1.

\begin{equation*} \newcommand{\routinename}{ \left[ A, t \right] := {\rm HouseQR\_blk\_var1}( A, t ) } \newcommand{\guard}{ m( A_{TL} ) \lt m( A ) } \newcommand{\partitionings}{ A \rightarrow \FlaTwoByTwo{A_{TL}}{A_{TR}} {A_{BL}}{A_{BR}} , t \rightarrow \FlaTwoByOne{t_{T}} {t_{B}} } \newcommand{\partitionsizes}{ A_{TL} {\rm ~is~} 0 \times 0 , t_{T} {\rm ~has~} 0 {\rm ~rows} } \newcommand{\blocksize}{b} \newcommand{\repartitionings}{ \FlaTwoByTwo{A_{TL}}{A_{TR}} {A_{BL}}{A_{BR}} \rightarrow \FlaThreeByThreeBR{A_{00}}{A_{01}}{A_{02}} {A_{10}}{A_{11}}{A_{12}} {A_{20}}{A_{21}}{A_{22}} , \FlaTwoByOne{ t_T } { t_B } \rightarrow \FlaThreeByOneB{t_0} {t_1} {t_2} } \newcommand{\repartitionsizes}{ A_{11} {\rm ~is~} b \times b , t_1 {\rm ~has~} b {\rm ~rows} } \newcommand{\moveboundaries}{ \FlaTwoByTwo{A_{TL}}{A_{TR}} {A_{BL}}{A_{BR}} \leftarrow \FlaThreeByThreeTL{A_{00}}{A_{01}}{A_{02}} {A_{10}}{A_{11}}{A_{12}} {A_{20}}{A_{21}}{A_{22}} , \FlaTwoByOne{ t_T } { t_B } \leftarrow \FlaThreeByOneT{t_0} {t_1} {t_2} } \newcommand{\update}{ \begin{array}{l} [ \FlaTwoByOneSingleLine { A_{11} } { A_{21} }, t_1 ] := {\rm HQR\_unb\_var1}( \FlaTwoByOneSingleLine { A_{11} } { A_{21} } ) \\ T_{11} := {\rm FormT}( \FlaTwoByOneSingleLine { A_{11} } { A_{21} }, t_1 ) \\ W_{12} := T_{11}^{-H} ( U_{11}^H A_{12} + U_{21}^H A_{22} ) \\ \FlaTwoByOneSingleLine { A_{12} } { A_{22} } := \FlaTwoByOneSingleLine { A_{12} - U_{11} W_{12} } { A_{22} - U_{21} W_{12} } \end{array} } \FlaBlkAlgorithm \end{equation*}

Figure 3.4.1.1. Blocked Householder transformation based QR factorization.

Details can be found in [23].

Subsubsection 3.4.1.4 The \(WY \) transform

An alternative (and more usual) way of expressing a Householder transform is

\begin{equation*} I - \beta v v^H , \end{equation*}

where \(\beta = 2 / v^H v \) (\(= 1 / \tau \text{,}\) where \(\tau \) is as discussed before). This leads to an alternative accumulation of Householder transforms known as the compact WY transform [35]:

\begin{equation*} I - U S U^H \end{equation*}

where upper triangular matrix \(S \) relates to the matrix \(T \) in the UT transform via \(S= T^{-1} \text{.}\) Obviously, \(T \) can be computed first and then inverted via the insights in the next exercise. Alternatively, inversion of matrix \(T \) can be incorporated into the algorithm that computes \(T \) (which is what is done in the implementation in LAPACK [1]).