Unit 12.3.2 Blocked LU factorization
Homework 12.3.2.1.
Consider the LU factorization \(A = L U \text{,}\) where \(A \) is an \(m \times m \) matrix, discussed in Unit 5.1.1 and subsequent units. The right-looking algorithm for computing it is given in Figure 12.3.2.1.
For simplicity, we do not consider pivoting yet. How many floating point operations (flops) are required to compute this operation?
If the matrix is initially stored in main memory and is written back to main memory, how many reads and writes (memops) are required? (Give a simple lower bound.)
What is the ratio of flops to memops for this lower bound?

How many floating point operations are required to compute this operation?
From Homework 5.2.2.1, we know that this right-looking algorithm requires \(\frac{2}{3} m^3 \) flops.
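We can sanity-check this count numerically: iteration \(k\) of the right-looking algorithm spends \(m-k-1\) flops on the division that forms \(l_{21}\) and \(2(m-k-1)^2\) flops on the rank-1 update. The following short script (our own illustration, not part of the course materials) tallies these per-iteration costs and compares them with the \(\frac{2}{3} m^3\) estimate:

```python
# Sanity check: tally the flops of the right-looking LU algorithm
# and compare against the (2/3) m^3 estimate.
def lu_flops(m):
    total = 0
    for k in range(m):
        total += m - k - 1              # divisions: l21 := a21 / alpha11
        total += 2 * (m - k - 1) ** 2   # rank-1 update: A22 -= a21 * a12^T
    return total

for m in (100, 500, 1000):
    print(m, lu_flops(m), 2 * m**3 / 3, lu_flops(m) / (2 * m**3 / 3))
```

As \(m\) grows, the ratio in the last column approaches 1, confirming that the lower-order terms become negligible.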

If the matrix is initially stored in main memory and is written back to main memory, how many reads and writes (memops) are required? (Give a simple lower bound.)
We observe that every element of \(A\) must be read and written. Thus, a lower bound is \(2 m^2 \) memops.

What is the ratio of flops to memops?
\begin{equation*} \frac{\frac{2}{3} m^3 \mbox{ flops}}{ 2 m^2 \mbox{ memops}} \approx \frac{m}{3} \frac{\mbox{flops}}{\mbox{memops}}. \end{equation*}
The insight is that the ratio of computation to memory operations, \({m}/{3} \text{,}\) does not preclude high performance. However, the algorithm in Figure 12.3.2.1 casts most computation in terms of rank-1 updates and hence will not achieve high performance: the performance of each of those individual updates is limited by the unfavorable ratio of floating point operations to memory operations.
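To make the role of the rank-1 update concrete, here is a sketch of the unblocked right-looking algorithm in Python/NumPy (the section's own code is in Matlab and C; this translation and its names are ours):

```python
import numpy as np

def lu_right_looking(A):
    """Unblocked right-looking LU factorization without pivoting.
    Overwrites A with L (unit lower triangular, stored below the
    diagonal) and U (upper triangular).  Assumes all pivots are
    nonzero, since we do not pivot."""
    m = A.shape[0]
    for k in range(m):
        A[k+1:, k] /= A[k, k]                              # a21 := a21 / alpha11
        A[k+1:, k+1:] -= np.outer(A[k+1:, k], A[k, k+1:])  # A22 := A22 - a21 a12^T
    return A

# Check: L U should reproduce the original matrix.  A diagonally
# dominant matrix keeps the pivots safely nonzero.
rng = np.random.default_rng(0)
A = rng.standard_normal((6, 6)) + 6 * np.eye(6)
A0 = A.copy()
lu_right_looking(A)
L = np.tril(A, -1) + np.eye(6)
U = np.triu(A)
print(np.allclose(L @ U, A0))  # True
```

Notice that all but a vanishing fraction of the work happens in the `np.outer` rank-1 update, which is exactly the operation with the unfavorable flops-to-memops ratio.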
Just like it is possible to rearrange the computations for a matrix-matrix multiplication in order to reuse data that is brought into caches, one could carefully rearrange the computations needed to perform an LU factorization. While one can do so for an individual operation, think of the effort that this would involve for every operation in (dense) linear algebra. Worse, this effort would likely have to be repeated for each new architecture as it becomes available.
Remark 12.3.2.2.
The key observation is that if we can cast most computation for the LU factorization in terms of matrix-matrix operations (level-3 BLAS functionality) and we link an implementation to a high-performance library with BLAS functionality, then we can achieve portable high performance for our LU factorization algorithm.
Let us examine how to cast most computation for the LU factorization in terms of matrix-matrix operations. You may want to start by reviewing the discussion of how to derive the (unblocked) algorithm in Figure 12.3.2.1, in Unit 5.2.2, which we repeat below, followed by the derivation of the so-called blocked algorithm.
First, the unblocked derivation. Partition
\begin{equation*}
A \rightarrow \left( \begin{array}{c | c} \alpha_{11} \amp a_{12}^T \\ \hline a_{21} \amp A_{22} \end{array} \right), \quad
L \rightarrow \left( \begin{array}{c | c} 1 \amp 0 \\ \hline l_{21} \amp L_{22} \end{array} \right), \quad
U \rightarrow \left( \begin{array}{c | c} \upsilon_{11} \amp u_{12}^T \\ \hline 0 \amp U_{22} \end{array} \right).
\end{equation*}
Plug the partitioned matrices into \(A = L U \text{.}\) Equate the submatrices and manipulate:
\(\alpha_{11} := \upsilon_{11} = \alpha_{11} \) (no-op).
\(a_{12}^T := u_{12}^T = a_{12}^T \) (no-op).
\(a_{21} := l_{21} = a_{21}/\upsilon_{11} = a_{21} / \alpha_{11} \text{.}\)
\(A_{22} := A_{22} - l_{21} u_{12}^T = A_{22} - a_{21} a_{12}^T \text{.}\)
Next, the blocked derivation. Partition
\begin{equation*}
A \rightarrow \left( \begin{array}{c | c} A_{11} \amp A_{12} \\ \hline A_{21} \amp A_{22} \end{array} \right), \quad
L \rightarrow \left( \begin{array}{c | c} L_{11} \amp 0 \\ \hline L_{21} \amp L_{22} \end{array} \right), \quad
U \rightarrow \left( \begin{array}{c | c} U_{11} \amp U_{12} \\ \hline 0 \amp U_{22} \end{array} \right),
\end{equation*}
where \(A_{11} \text{,}\) \(L_{11} \text{,}\) and \(U_{11}\) are \(b \times b \text{.}\) Plug the partitioned matrices into \(A = L U \text{.}\) Equate the submatrices and manipulate:
\(A_{11} := {L \backslash U}_{11} \) (overwrite \(A_{11} \) with its LU factorization).
\(A_{12} := U_{12} = L_{11}^{-1} A_{12} \) (triangular solve with multiple right-hand sides).
\(A_{21} := L_{21} = A_{21} U_{11}^{-1} \) (triangular solve with multiple right-hand sides).
\(A_{22} := A_{22} - L_{21} U_{12} = A_{22} - A_{21} A_{12} \) (matrix-matrix multiplication).
The first derivation yields the algorithm in Figure 12.3.2.1. The second yields the algorithm in Figure 12.3.2.3.
Let us comment on each of the operations in Figure 12.3.2.3.
\(A_{11} := {L \backslash U}_{11} = LU( A_{11} )\) indicates that we need to compute the LU factorization of \(A_{11} \text{,}\) overwriting that matrix with the unit lower triangular matrix \(L_{11} \) and upper triangular matrix \(U_{11} \text{.}\) Since \(A_{11} \) is \(b \times b \) and we usually take \(b \ll m \text{,}\) not much of the total computation is in this operation, and it can therefore be computed with, for example, an unblocked algorithm. Also, if \(A_{11}\) is small enough, it will fit in one of the smaller caches, and hence the memory overhead of performing the rank-1 updates will not be as dramatic.

\(A_{12} := U_{12} = L_{11}^{-1} A_{12} \) is an instance of solving \(L X = B \) with unit lower triangular matrix \(L \) for \(X \text{.}\) This is referred to as a triangular solve with multiple right-hand sides (TRSM) since we can partition \(B \) and \(X \) by columns so that
\begin{equation*} L \begin{array}[t]{c} \underbrace{ \left( \begin{array}{c | c | c} x_0 \amp x_1 \amp \cdots \end{array} \right) }\\ \left( \begin{array}{c | c | c} L x_0 \amp L x_1 \amp \cdots \end{array} \right) \end{array} = \left( \begin{array}{c | c | c} b_0 \amp b_1 \amp \cdots \end{array} \right), \end{equation*}and hence for each column of the right-hand side, \(b_j \text{,}\) we need to solve a triangular system, \(L x_j = b_j \text{.}\)
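To see the column-by-column interpretation in action, here is a small forward-substitution sketch (our own illustration; the function name `trsm_lower_unit` is hypothetical) that solves \(L X = B\) one right-hand side at a time:

```python
import numpy as np

def trsm_lower_unit(L, B):
    """Solve L X = B column by column, where L is unit lower triangular.
    Each column x_j is found by forward substitution on L x_j = b_j;
    no division is needed because the diagonal of L is all ones."""
    m, n = B.shape
    X = B.copy()
    for j in range(n):                       # one triangular solve per column
        for i in range(m):
            X[i, j] -= L[i, :i] @ X[:i, j]   # subtract known terms of row i
    return X

rng = np.random.default_rng(1)
L = np.tril(rng.standard_normal((5, 5)), -1) + np.eye(5)
B = rng.standard_normal((5, 3))
X = trsm_lower_unit(L, B)
print(np.allclose(L @ X, B))  # True
```

A production TRSM blocks these solves for cache reuse, but the mathematical content is exactly this loop.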

\(A_{21} := L_{21} = A_{21} U_{11}^{-1} \) is an instance of solving \(X U = B \text{,}\) where \(U \) is upper triangular. We notice that if we partition \(X \) and \(B \) by rows, then
\begin{equation*} \begin{array}[t]{c} \underbrace{ \left( \begin{array}{c} \widetilde x_0^T \\ \hline \widetilde x_1^T \\ \hline \vdots \end{array} \right) U } \\ \left( \begin{array}{c} \widetilde x_0^T U\\ \hline \widetilde x_1^T U \\ \hline \vdots \end{array} \right) \end{array} = \left( \begin{array}{c} \widetilde b_0^T \\ \hline \widetilde b_1^T \\ \hline \vdots \end{array} \right) \end{equation*}and we recognize that each row, \(\widetilde x_i^T \text{,}\) is computed from \(\widetilde x_i^T U = \widetilde b_i^T \) or, equivalently, by solving \(U^T (\widetilde x_i^T)^T = (\widetilde b_i^T)^T \text{.}\) We observe it is also a triangular solve with multiple right-hand sides (TRSM).
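Similarly, solving \(X U = B\) row by row can be sketched as follows (again our own illustration with a hypothetical function name):

```python
import numpy as np

def trsm_upper_right(U, B):
    """Solve X U = B row by row, where U is upper triangular.
    Row x_i^T satisfies x_i^T U = b_i^T, i.e. U^T x_i = b_i, which is
    a forward substitution with the lower triangular matrix U^T."""
    m, n = B.shape
    X = B.copy()
    for i in range(m):
        for j in range(n):
            X[i, j] = (X[i, j] - X[i, :j] @ U[:j, j]) / U[j, j]
    return X

rng = np.random.default_rng(2)
U = np.triu(rng.standard_normal((4, 4))) + 4 * np.eye(4)  # well-conditioned U
B = rng.standard_normal((3, 4))
X = trsm_upper_right(U, B)
print(np.allclose(X @ U, B))  # True
```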
The update \(A_{22} := A_{22} - A_{21} A_{12} \) is an instance of \(C := \alpha A B + \beta C \) (with \(\alpha = -1 \) and \(\beta = 1 \)), where the \(k \) (inner) size is small. This is often referred to as a rank-k update.
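Putting the four operations together, here is a Python/NumPy sketch of the blocked right-looking algorithm (our own illustrative translation, with names of our choosing; `np.linalg.solve` stands in for true TRSMs, and pivoting is still omitted):

```python
import numpy as np

def lu_blk_right_looking(A, b):
    """Blocked right-looking LU factorization without pivoting (a sketch).
    Overwrites A with L and U; assumes all pivots are nonzero."""
    m = A.shape[0]
    for k in range(0, m, b):
        e = min(k + b, m)
        # A11 := L\U_11: factor the diagonal block with the unblocked algorithm.
        for j in range(k, e):
            A[j+1:e, j] /= A[j, j]
            A[j+1:e, j+1:e] -= np.outer(A[j+1:e, j], A[j, j+1:e])
        if e < m:
            L11 = np.tril(A[k:e, k:e], -1) + np.eye(e - k)
            U11 = np.triu(A[k:e, k:e])
            # A12 := L11^{-1} A12 and A21 := A21 U11^{-1}: the two TRSMs.
            A[k:e, e:] = np.linalg.solve(L11, A[k:e, e:])
            A[e:, k:e] = np.linalg.solve(U11.T, A[e:, k:e].T).T
            # A22 := A22 - A21 A12: the rank-b update (matrix-matrix multiply).
            A[e:, e:] -= A[e:, k:e] @ A[k:e, e:]
    return A

# Check on a diagonally dominant matrix (safe without pivoting):
rng = np.random.default_rng(3)
A = rng.standard_normal((8, 8)) + 8 * np.eye(8)
A0 = A.copy()
lu_blk_right_looking(A, 3)
L = np.tril(A, -1) + np.eye(8)
U = np.triu(A)
print(np.allclose(L @ U, A0))  # True
```

The last line inside the loop, the rank-b update, is where a high-performance implementation spends essentially all of its time.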
In the following homework, you will determine that most computation is now cast in terms of the rank-k update (matrix-matrix multiplication).
Homework 12.3.2.2.
For the algorithm in Figure 12.3.2.3, analyze the (approximate) number of flops that are performed by the LU factorization of \(A_{11} \) and updates of \(A_{21} \) and \(A_{12} \text{,}\) aggregated over all iterations. You may assume that the size of \(A \text{,}\) \(m \text{,}\) is an integer multiple of the block size \(b \text{,}\) so that \(m = K b \text{.}\) Next, determine the ratio of the flops spent in the indicated operations to the total flops.
During the \(k \)th iteration, when \(A_{00} \) is \((kb) \times (kb) \text{,}\) we perform the following number of flops in these operations:
\(A_{11} := LU( A_{11} ) \text{:}\) approximately \(\frac{2}{3} b^3 \) flops.
\(A_{12} := L_{11}^{-1} A_{12} \text{:}\) During the \(k\)th iteration, \(A_{00} \) is \((kb) \times (kb) \text{,}\) \(A_{11} \) is \(b \times b \text{,}\) and \(A_{12} \) is \(b \times ((K - k - 1 )b) \text{.}\) (It helps to draw a picture.) Hence, the total computation spent in the operation is approximately \(((K-k-1)b) b^2 = (K-k-1)b^3 \) flops.
\(A_{21} := A_{21} U_{11}^{-1} \text{:}\) During the \(k\)th iteration, \(A_{00} \) is \((kb) \times (kb) \text{,}\) \(A_{11} \) is \(b \times b \text{,}\) and \(A_{21} \) is \(((K - k - 1 )b) \times b \text{.}\) Hence, the total computation spent in the operation is approximately \(((K-k-1)b) b^2 = (K-k-1)b^3 \) flops.
If we sum this over all \(K \) iterations, we find that the total equals approximately
\begin{equation*}
\sum_{k=0}^{K-1} \left( \frac{2}{3} b^3 + 2 ( K - k - 1 ) b^3 \right) = \frac{2}{3} K b^3 + 2 \frac{K(K-1)}{2} b^3 \approx \frac{2}{3} m b^2 + m^2 b \mbox{ flops}.
\end{equation*}
Thus, the ratio of the flops spent in these operations to the total cost of the LU factorization is
\begin{equation*}
\frac{\frac{2}{3} m b^2 + m^2 b}{\frac{2}{3} m^3} = \left( \frac{b}{m} \right)^2 + \frac{3}{2} \frac{b}{m}.
\end{equation*}
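The aggregated flop counts can be checked numerically; this short script (ours, for illustration) sums the per-iteration costs listed above and compares the non-GEMM fraction with \(\frac{3}{2} \frac{b}{m}\):

```python
# For m = K b, sum the flops of the LU of A11 and the two TRSMs over
# all K iterations, and compare with the total (2/3) m^3 flops.
def non_gemm_flops(m, b):
    K = m // b
    total = 0
    for k in range(K):
        total += 2 * b**3 // 3            # LU of the b x b block A11
        total += 2 * (K - k - 1) * b**3   # the two TRSMs, (K-k-1) b^3 each
    return total

m, b = 2000, 40
frac = non_gemm_flops(m, b) / (2 * m**3 / 3)
print(frac, 1.5 * b / m)  # the fraction is close to (3/2)(b/m)
```

For these sizes the non-GEMM work is about 3% of the total, so essentially all flops land in the rank-b update.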
From this last exercise, we learn that if \(b \) is fixed and \(m \) gets large, essentially all computation is in the update \(A_{22} := A_{22} - A_{21} A_{12} \text{,}\) which we know can attain high performance.
Homework 12.3.2.3.
In directory Assignments/Week12/matlab/, you will find the following files:
LU_right_looking.m: an implementation of the unblocked right-looking LU factorization.
LU_blk_right_looking.m: a code skeleton for a function that implements a blocked LU factorization. [ A_out ] = LU_blk_right_looking( A, nb_alg ) performs a blocked LU factorization with block size \(b \) equal to nb_alg, overwriting A_out with \(L \) and \(U \text{.}\)
time_LU.m: a test routine that tests the various implementations.
These resources give you the tools to implement and test the blocked LU factorization.
Translate the blocked LU factorization in Figure 12.3.2.3 into code by completing LU_blk_right_looking.m. Test your implementation by executing time_LU.m. Once you get the right answer, try different problem sizes to see how the different implementations perform. Try \(m = 500, 1000, 1500, 2000 \text{.}\)
On the discussion forum, discuss what you think are some of the reasons behind the performance you observe.
You can extract \(L_{11}\) and \(U_{11} \) from A11 with
L11 = tril( A11, -1 ) + eye( size( A11 ) ); U11 = triu( A11 );
Don't invert \(L_{11} \) and \(U_{11} \text{.}\) In the command window, execute help / and help \ to read up on how those operators allow you to solve with a matrix.
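For readers following along in Python rather than Matlab, the same extraction and "solve, don't invert" idea looks like this (an illustrative sketch; all names are ours):

```python
import numpy as np

# Factor a small block in place so A11 holds L\U, mimicking the
# state after an unblocked in-place LU factorization.
rng = np.random.default_rng(4)
A11 = rng.standard_normal((4, 4)) + 4 * np.eye(4)
for j in range(4):
    A11[j+1:, j] /= A11[j, j]
    A11[j+1:, j+1:] -= np.outer(A11[j+1:, j], A11[j, j+1:])

L11 = np.tril(A11, -1) + np.eye(4)   # analog of tril( A11, -1 ) + eye( ... )
U11 = np.triu(A11)                   # analog of triu( A11 )

# Apply L11^{-1} via a solve rather than forming the inverse:
B = rng.standard_normal((4, 2))
X = np.linalg.solve(L11, B)
print(np.allclose(L11 @ X, B))  # True
```

Solving instead of inverting is both cheaper and numerically safer, which is the point of the hint above.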
Notice that your blocked algorithm gets MUCH better performance than does the unblocked algorithm. However, Matlab's native LU factorization does much better still. The call lu( A ) links to a high-performance implementation of the LAPACK interface, which we will discuss later.
Homework 12.3.2.4.
For this exercise you will implement the unblocked and blocked algorithms in C. To do so, you need to install the BLAS-like Library Instantiation Software (BLIS) (see Subsubsection 0.2.4.1) and the libflame library (see Subsubsection 0.2.4.2). Even if you have not previously programmed in C, you should be able to follow along.
In Assignments/Week12/LU/FLAMEC/, you will find the following files:
LU_unb_var5.c: a skeleton for implementing the unblocked right-looking LU factorization (which we often call "Variant 5").
LU_blk_var5.c: a skeleton for implementing the blocked right-looking LU factorization.
REF_LU.c: a simple triple-nested loop implementation of the algorithm.
driver.c: a "driver" for testing various implementations. This driver creates matrices of different sizes and checks the performance and correctness of the result.
Makefile: a "makefile" that compiles, links, and executes. To learn about makefiles, you may want to check Wikipedia (search for "Make" and choose "Make (software)").
Our libflame library allows a simple translation into code of algorithms that have been typeset with the FLAME notation (like those in Figure 12.3.2.1 and Figure 12.3.2.3). The BLAS functionality discussed earlier in this week is made available through an interface that hides details like size, stride, etc. A Quick Reference Guide for this interface can be found at http://www.cs.utexas.edu/users/flame/pubs/FLAMECBLASQuickguide.pdf.
With these resources, complete LU_unb_var5.c and LU_blk_var5.c.
You can skip the testing of LU_blk_var5 by changing the appropriate TRUE to FALSE in driver.c.
Once you have implemented one or both, you can test by executing make test in a terminal session (provided you are in the correct directory). This will compile and link, yielding the executable driver.x, and will then execute this driver. The output is redirected to the file output.m. Here are some typical lines in the output:
data_REF( 10, 1:2 ) = [ 2000 2.806644e+00 ];
data_FLAME( 10, 1:2 ) = [ 2000 4.565624e+01 ];
data_unb_var5( 10, 1:3 ) = [ 2000 3.002356e+00 4.440892e-16 ];
data_blk_var5( 10, 1:3 ) = [ 2000 4.422223e+01 7.730705e-12 ];
The first line reports the performance of the reference implementation (a simple triplenested loop implementation of LU factorization without pivoting). The last number is the rate of computation in GFLOPS.
The second line reports the performance of the LU factorization without pivoting that is part of the libflame library. Again, the last number reports the rate of computation in GFLOPS.

The last two lines report the rate of performance (middle number) and difference of the result relative to the reference implementation (last number), for your unblocked and blocked implementations. It is important to check that the last number is small. For larger problem sizes, the reference implementation is not executed, and this difference is not relevant.
The fact that the blocked version shows a larger difference than does the unblocked is not significant here. Both have roundoff error in the result (as does the reference implementation) and we cannot tell which is more accurate.
You can cut and paste the output.m file into Matlab to see the performance data presented as a graph.
Ponder This 12.3.2.5.
In Unit 5.5.1, we discuss five unblocked algorithms for computing the LU factorization. Can you derive the corresponding blocked algorithms? You can use the materials in Assignments/Week12/LU/FLAMEC/ to implement and test all five.