## Unit 12.3.2 Blocked LU factorization

###### Homework 12.3.2.1.

Consider the LU factorization $A = L U \text{,}$ where $A$ is an $m \times m$ matrix, discussed in Unit 5.1.1 and subsequent units. The right-looking algorithm for computing it is given in Figure 12.3.2.1.

For simplicity, we do not consider pivoting yet.

• How many floating point operations (flops) are required to compute this operation?

• If the matrix is initially stored in main memory and is written back to main memory, how many reads and writes (memops) are required? (Give a simple lower bound.)

• What is the ratio of flops to memops for this lower bound?

Solution
• How many floating point operations are required to compute this operation?

From Homework 5.2.2.1, we know that this right-looking algorithm requires $\frac{2}{3} m^3$ flops.

• If the matrix is initially stored in main memory and is written back to main memory, how many reads and writes (memops) are required? (Give a simple lower bound.)

We observe that every element of $A$ must be read and written. Thus, a lower bound is $2 m^2$ memops.

• What is the ratio of flops to memops for this lower bound?

\begin{equation*} \frac{\frac{2}{3} m^3 \mbox{ flops}}{ 2 m^2 \mbox{ memops}} \approx \frac{m}{3} \frac{\mbox{flops}}{\mbox{memops}}. \end{equation*}

The insight is that the ratio of computation to memory operations, ${m}/{3} \text{,}$ does not preclude high performance. However, the algorithm in Figure 12.3.2.1 casts most computation in terms of rank-1 updates and hence will not achieve high performance. The reason is that the performance of each of those individual updates is limited by the unfavorable ratio of floating point operations to memory operations.

Just like it is possible to rearrange the computations for a matrix-matrix multiplication in order to reuse data that is brought into caches, one could carefully rearrange the computations needed to perform an LU factorization. While one can do so for an individual operation, think of the effort that this would involve for every operation in (dense) linear algebra. Not only that, this effort would likely have to be repeated for new architectures as they become available.

###### Remark 12.3.2.2.

The key observation is that if we can cast most computation for the LU factorization in terms of matrix-matrix operations (level-3 BLAS functionality) and we link an implementation to a high-performance library with BLAS functionality, then we can achieve portable high performance for our LU factorization algorithm.

Let us examine how to cast most computation for the LU factorization in terms of matrix-matrix operations. You may want to start by reviewing the discussion of how to derive the (unblocked) algorithm in Figure 12.3.2.1, in Unit 5.2.2, which we repeat below on the left. The derivation of a so-called blocked algorithm is given to its right.

Partition

\begin{equation*} A \!\rightarrow\! \left( \begin{array}{c c} \alpha_{11} \amp a_{12}^T \\ a_{21} \amp A_{22} \end{array} \right), L \!\rightarrow\! \left( \begin{array}{c c} 1 \amp 0 \\ l_{21} \amp L_{22} \end{array} \right), \end{equation*}
\begin{equation*} \mbox{ and } U \!\rightarrow\! \left( \begin{array}{c c} \upsilon_{11} \amp u_{12}^T \\ 0 \amp U_{22} \end{array} \right). \end{equation*}

Partition

\begin{equation*} A \!\rightarrow\! \left( \begin{array}{c c} A_{11} \amp A_{12} \\ A_{21} \amp A_{22} \end{array} \right), L \!\rightarrow\! \left( \begin{array}{c c} L_{11} \amp 0 \\ L_{21} \amp L_{22} \end{array} \right), \end{equation*}
\begin{equation*} \mbox{ and } U \!\rightarrow\! \left( \begin{array}{c c} U_{11} \amp U_{12} \\ 0 \amp U_{22} \end{array} \right), \end{equation*}

where $A_{11} \text{,}$ $L_{11} \text{,}$ and $U_{11}$ are $b \times b \text{.}$

Plug partitioned matrices into $A = L U \text{.}$

\begin{equation*} \begin{array}{l} \left( \begin{array}{c c} \alpha_{11} \amp a_{12}^T \\ a_{21} \amp A_{22} \end{array} \right) \\ = \begin{array}[t]{c} \underbrace{ \left( \begin{array}{c c} 1 \amp 0 \\ l_{21} \amp L_{22} \end{array} \right) \left( \begin{array}{c c} \upsilon_{11} \amp u_{12}^T \\ 0 \amp U_{22} \end{array} \right) } \\ \left( \begin{array}{c c} \upsilon_{11} \amp u_{12}^T \\ \upsilon_{11} l_{21} \amp l_{21} u_{12}^T + L_{22} U_{22} \end{array} \right) \end{array} . \end{array} \end{equation*}

Plug partitioned matrices into $A = L U \text{.}$

\begin{equation*} \begin{array}{l} \left( \begin{array}{c c} A_{11} \amp A_{12} \\ A_{21} \amp A_{22} \end{array} \right) \\ = \begin{array}[t]{c} \underbrace{ \left( \begin{array}{c c} L_{11} \amp 0 \\ L_{21} \amp L_{22} \end{array} \right) \left( \begin{array}{c c} U_{11} \amp U_{12} \\ 0 \amp U_{22} \end{array} \right) } \\ \left( \begin{array}{c c} L_{11} U_{11} \amp L_{11} U_{12} \\ L_{21} U_{11} \amp L_{21} U_{12} + L_{22} U_{22} \end{array} \right) \end{array} . \end{array} \end{equation*}

Equate the submatrices and manipulate

• $\alpha_{11} := \upsilon_{11} = \alpha_{11}$ (No-op).

• $a_{12}^T := u_{12}^T = a_{12}^T$ (No-op).

• $a_{21} := l_{21} = a_{21}/\upsilon_{11} = a_{21} / \alpha_{11} \text{.}$

• $A_{22} := A_{22} - l_{21} u_{12}^T = A_{22} - a_{21} a_{12}^T \text{.}$

Equate the submatrices and manipulate

• $A_{11} := {L \backslash U}_{11}$ (overwrite $A_{11}$ with its LU factorization).

• $A_{12} := U_{12} = L_{11}^{-1} A_{12}$ (triangular solve with multiple right-hand sides).

• $A_{21} := L_{21} = A_{21} U_{11}^{-1}$ (triangular solve with multiple right-hand sides).

• $A_{22} := A_{22} - L_{21} U_{12} = A_{22} - A_{21} A_{12}$ (matrix-matrix multiplication).

The derivation on the left yields the algorithm in Figure 12.3.2.1.
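For concreteness, that unblocked right-looking algorithm can be sketched as follows. This is a Python illustration with names of our own choosing, not the FLAME notation used in the text; a practical implementation would instead use optimized BLAS routines.

```python
def lu_unb_right_looking(A):
    """Unblocked right-looking LU factorization without pivoting (sketch).

    Overwrites the m x m matrix A (a list of lists of floats) with its
    unit lower triangular factor L (strictly below the diagonal) and its
    upper triangular factor U (on and above the diagonal), so that A = L U.
    """
    m = len(A)
    for k in range(m):
        # a21 := a21 / alpha11
        for i in range(k + 1, m):
            A[i][k] /= A[k][k]
        # A22 := A22 - a21 * a12^T   (rank-1 update)
        for i in range(k + 1, m):
            for j in range(k + 1, m):
                A[i][j] -= A[i][k] * A[k][j]
    return A
```

Notice that all the work after the division is in rank-1 updates, which, as observed above, limits performance.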

The derivation on the right yields the algorithm in Figure 12.3.2.3.

Let us comment on each of the operations in Figure 12.3.2.3.

• $A_{11} := {L \backslash U}_{11} = LU( A_{11} )$ indicates that we need to compute the LU factorization of $A_{11} \text{,}$ overwriting that matrix with the unit lower triangular matrix $L_{11}$ and upper triangular matrix $U_{11} \text{.}$ Since $A_{11}$ is $b \times b$ and we usually take $b \ll m \text{,}$ not much of the total computation is in the operation, and it can therefore be computed with, for example, an unblocked algorithm. Also, if it is small enough, it will fit in one of the smaller caches and hence the memory overhead of performing the rank-1 updates will not be as dramatic.

• $A_{12} := U_{12} = L_{11}^{-1} A_{12}$ is an instance of solving $L X = B$ with unit lower triangular matrix $L$ for $X \text{.}$ This is referred to as a triangular solve with multiple right-hand sides (TRSM) since we can partition $B$ and $X$ by columns so that

\begin{equation*} L \begin{array}[t]{c} \underbrace{ \left( \begin{array}{c | c | c} x_0 \amp x_1 \amp \cdots \end{array} \right) }\\ \left( \begin{array}{c | c | c} L x_0 \amp L x_1 \amp \cdots \end{array} \right) \end{array} = \left( \begin{array}{c | c | c} b_0 \amp b_1 \amp \cdots \end{array} \right), \end{equation*}

and hence for each column of the right-hand side, $b_j \text{,}$ we need to solve a triangular system, $L x_j = b_j \text{.}$

• $A_{21} := L_{21} = A_{21} U_{11}^{-1}$ is an instance of solving $X U = B \text{,}$ where $U$ is upper triangular. We notice that if we partition $X$ and $B$ by rows, then

\begin{equation*} \begin{array}[t]{c} \underbrace{ \left( \begin{array}{c} \widetilde x_0^T \\ \hline \widetilde x_1^T \\ \hline \vdots \end{array} \right) U } \\ \left( \begin{array}{c} \widetilde x_0^T U\\ \hline \widetilde x_1^T U \\ \hline \vdots \end{array} \right) \end{array} = \left( \begin{array}{c} \widetilde b_0^T \\ \hline \widetilde b_1^T \\ \hline \vdots \end{array} \right) \end{equation*}

and we recognize that each row, $\widetilde x_i^T \text{,}$ is computed from $\widetilde x_i^T U = \widetilde b_i^T$ or, equivalently, by solving $U^T (\widetilde x_i^T)^T = (\widetilde b_i^T)^T \text{.}$ We observe it is also a triangular solve with multiple right-hand sides (TRSM).

• The update $A_{22} := A_{22} - A_{21} A_{12}$ is an instance of $C := \alpha A B + \beta C \text{,}$ where the $k$ (inner) size is small. This is often referred to as a rank-k update.
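The column-by-column view of the first triangular solve with multiple right-hand sides can be sketched as follows (a Python illustration with names of our own choosing; a real implementation would call the level-3 BLAS routine trsm):

```python
def trsm_unit_lower(L, B):
    """Solve L X = B for X, overwriting B, where L is unit lower
    triangular (its diagonal is implicitly 1 and is not referenced).

    Each column b_j of the m x n matrix B is an independent right-hand
    side, solved by forward substitution (sketch)."""
    m = len(L)
    n = len(B[0])
    for j in range(n):              # for each right-hand side b_j ...
        for i in range(m):          # ... forward substitution
            for k in range(i):
                B[i][j] -= L[i][k] * B[k][j]
            # the diagonal of L equals 1, so no division is needed
    return B
```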

In the following homework, you will determine that most computation is now cast in terms of the rank-k update (matrix-matrix multiplication).
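The blocked right-looking algorithm, with the four updates written out as explicit loops, can be sketched as follows. This is a Python illustration with names of our own choosing; in practice the updates of $A_{12} \text{,}$ $A_{21} \text{,}$ and $A_{22}$ would be calls to the level-3 BLAS routines trsm and gemm.

```python
def lu_blk_right_looking(A, b):
    """Blocked right-looking LU factorization without pivoting (sketch).

    A is an m x m list of lists with m a multiple of the block size b;
    A is overwritten with its L and U factors (L has an implicit unit
    diagonal)."""
    m = len(A)
    for k in range(0, m, b):
        e = k + b
        # A11 := { L \ U }11  (unblocked LU of the b x b diagonal block)
        for p in range(k, e):
            for i in range(p + 1, e):
                A[i][p] /= A[p][p]
                for j in range(p + 1, e):
                    A[i][j] -= A[i][p] * A[p][j]
        # A12 := inv(L11) * A12  (TRSM with unit lower triangular L11)
        for j in range(e, m):
            for i in range(k, e):
                for p in range(k, i):
                    A[i][j] -= A[i][p] * A[p][j]
        # A21 := A21 * inv(U11)  (TRSM with upper triangular U11, from the right)
        for i in range(e, m):
            for p in range(k, e):
                for q in range(k, p):
                    A[i][p] -= A[i][q] * A[q][p]
                A[i][p] /= A[p][p]
        # A22 := A22 - A21 * A12  (rank-b update: matrix-matrix multiplication)
        for i in range(e, m):
            for j in range(e, m):
                for p in range(k, e):
                    A[i][j] -= A[i][p] * A[p][j]
    return A
```

With $b = 1$ this reduces to the unblocked right-looking algorithm, which is one way to sanity-check an implementation.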

###### Homework 12.3.2.2.

For the algorithm in Figure 12.3.2.3, analyze the (approximate) number of flops that are performed by the LU factorization of $A_{11}$ and updates of $A_{21}$ and $A_{12} \text{,}$ aggregated over all iterations. You may assume that the size of $A \text{,}$ $m \text{,}$ is an integer multiple of the block size $b \text{,}$ so that $m = K b \text{.}$ Next, determine the ratio of the flops spent in the indicated operations to the total flops.

Solution

During the $k$th iteration, when $A_{00}$ is $(kb) \times (kb) \text{,}$ we perform the following number of flops in these operations:

• $A_{11} := LU( A_{11} ) \text{:}$ approximately $\frac{2}{3} b^3$ flops.

• $A_{12} := L_{11}^{-1} A_{12} \text{:}$ During the $k$th iteration, $A_{00}$ is $(kb) \times (kb) \text{,}$ $A_{11}$ is $b \times b \text{,}$ and $A_{12}$ is $b \times ((K - k - 1 )b) \text{.}$ (It helps to draw a picture.) Hence, the total computation spent in the operation is approximately $((K-k-1)b) b^2 = (K-k-1)b^3$ flops.

• $A_{21} := A_{21} U_{11}^{-1} \text{:}$ During the $k$th iteration, $A_{00}$ is $(kb) \times (kb) \text{,}$ $A_{11}$ is $b \times b \text{,}$ and $A_{21}$ is $((K - k - 1 )b) \times b \text{.}$ Hence, the total computation spent in the operation is approximately $((K-k-1)b) b^2 = (K-k-1)b^3$ flops.

If we sum this over all $K$ iterations, we find that the total equals

\begin{equation*} \begin{array}{l} \sum_{k=0}^{K-1} \left[ \frac{2}{3} b^3 + 2 ( K - k - 1 ) b^3 \right] \\ ~~~ = ~~~~ \\ \left[ \frac{2}{3} K + 2 \sum_{k=0}^{K-1} ( K - k - 1 ) \right] b^3 \\ ~~~ = ~~~~ \\ \left[ \frac{2}{3} K + 2 \sum_{j=0}^{K-1} j \right] b^3 \\ ~~~ \approx ~~~~ \\ \left[ \frac{2}{3} K + K^2 \right] b^3 ~~~ = ~~~~ \\ \frac{2}{3} b^2 m + b m^2. \end{array} \end{equation*}

Thus, the ratio of flops spent in these operations to the total cost of the LU factorization is

\begin{equation*} \frac{ \frac{2}{3} b^2 m + b m^2 } { \frac{2}{3} m^3 } = \left(\frac{b}{m}\right)^2 + \frac{3}{2} \frac{b}{m}. \end{equation*}

From this last exercise, we learn that if $b$ is fixed and $m$ gets large, essentially all computation is in the update $A_{22} := A_{22} - A_{21} A_{12} \text{,}$ which we know can attain high performance.
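One can check this numerically, using the per-iteration flop counts from the solution above (a Python sketch with names of our own choosing):

```python
def flop_fraction(m, b):
    """Fraction of the (approximately) (2/3) m^3 LU factorization flops
    spent in the factorization of A11 and the two triangular solves,
    summed over all K = m / b iterations."""
    K = m // b
    flops = sum(2.0 / 3.0 * b**3 + 2 * (K - k - 1) * b**3 for k in range(K))
    return flops / (2.0 / 3.0 * m**3)
```

For example, with $m = 2000$ and $b = 100$ this fraction is under $8$ percent, in line with the approximation $\left( b/m \right)^2 + \frac{3}{2} \, b/m \text{.}$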

###### Homework 12.3.2.3.

In directory Assignments/Week12/matlab/, you will find the following files:

•  LU_right_looking.m: an implementation of the unblocked right-looking LU factorization.

• LU_blk_right_looking.m: A code skeleton for a function that implements a blocked LU factorization.

[ A_out ] = LU_blk_right_looking( A, nb_alg )


performs a blocked LU factorization with block size $b$ equal to nb_alg, overwriting A_out with $L$ and $U \text{.}$

• time_LU.m: A test routine that tests the various implementations.

These resources give you the tools to implement and test the blocked LU factorization.

1. Translate the blocked LU factorization in Figure 12.3.2.3 into code by completing LU_blk_right_looking.m.

2. Test your implementation by executing time_LU.m.

3. Once you get the right answer, try different problem sizes to see how the different implementations perform. Try $m = 500, 1000, 1500, 2000 \text{.}$

On the discussion forum, discuss what you think are some of the reasons behind the performance you observe.

Hint

You can extract $L_{11}$ and $U_{11}$ from A11 with

L11 = tril( A11,-1 ) + eye( size( A11 ) );
A11 = triu( A11 );


Don't invert $L_{11}$ and $U_{11} \text{.}$ In the command window execute

help /
help \


to read up on how those operators allow you to solve with a matrix.

Solution

Notice that your blocked algorithm gets MUCH better performance than does the unblocked algorithm. However, Matlab's native LU factorization does better yet: the call lu( A ) links to a high-performance implementation of the LAPACK interface, which we will discuss later.

###### Homework 12.3.2.4.

For this exercise you will implement the unblocked and blocked algorithms in C. To do so, you need to install the BLAS-like Library Instantiation Software (BLIS) (see Subsubsection 0.2.4.1) and the libflame library (see Subsubsection 0.2.4.2). Even if you have not previously programmed in C, you should be able to follow along.

In Assignments/Week12/LU/FLAMEC/, you will find the following files:

•  LU_unb_var5.c : a skeleton for implementing the unblocked right-looking LU factorization (which we often call "Variant 5").

•  LU_blk_var5.c : a skeleton for implementing the blocked right-looking LU factorization.

• REF_LU.c: A simple triple-nested loop implementation of the algorithm.

• driver.c: A "driver" for testing various implementations. This driver creates matrices of different sizes and checks the performance and correctness of the result.

• Makefile: A "makefile" that compiles, links, and executes. To learn about Makefiles, you may want to check Wikipedia. (search for "Make" and choose "Make (software)".)

Our libflame library allows a simple translation from algorithms that have been typeset with the FLAME notation (like those in Figure 12.3.2.1 and Figure 12.3.2.3). The BLAS functionality discussed earlier in this week is made available through an interface that hides details like size, stride, etc. A Quick Reference Guide for this interface can be found at http://www.cs.utexas.edu/users/flame/pubs/FLAMEC-BLAS-Quickguide.pdf.

With these resources, complete

• LU_unb_var5.c and

• LU_blk_var5.c.

You can skip the testing of LU_blk_var5 by changing the appropriate TRUE to FALSE in driver.c.

Once you have implemented one or both, you can test by executing make test in a terminal session (provided you are in the correct directory). This will compile and link, yielding the executable driver.x, and will then execute this driver. The output is redirected to the file output.m. Here are some typical lines in the output:

data_REF( 10, 1:2 ) = [ 2000 2.806644e+00 ];
data_FLAME( 10, 1:2 ) = [ 2000 4.565624e+01 ];
data_unb_var5( 10, 1:3 ) = [ 2000 3.002356e+00  4.440892e-16];
data_blk_var5( 10, 1:3 ) = [ 2000 4.422223e+01  7.730705e-12];


• The first line reports the performance of the reference implementation (a simple triple-nested loop implementation of LU factorization without pivoting). The last number is the rate of computation in GFLOPS.

• The second line reports the performance of the LU factorization without pivoting that is part of the libflame library. Again, the last number reports the rate of computation in GFLOPS.

• The last two lines report the rate of performance (middle number) and difference of the result relative to the reference implementation (last number), for your unblocked and blocked implementations. It is important to check that the last number is small. For larger problem sizes, the reference implementation is not executed, and this difference is not relevant.

The fact that the blocked version shows a larger difference than does the unblocked is not significant here. Both have roundoff error in the result (as does the reference implementation) and we cannot tell which is more accurate.

You can cut and paste the output.m file into Matlab to see the performance data presented as a graph.

###### Ponder This 12.3.2.5.

In Unit 5.5.1, we discuss five unblocked algorithms for computing LU factorization. Can you derive the corresponding blocked algorithms? You can use the materials in Assignments/Week12/LU/FLAMEC/ to implement and test all five.