## Unit 4.5.1 Casting computation in terms of matrix-matrix multiplication

A key characteristic of matrix-matrix multiplication that allows it to achieve high performance is that it performs $O( n^3 )$ computations with $O( n^2 )$ data (if $m = n = k$). This means that if $n$ is large, it may be possible to amortize the cost of moving a data item between memory and faster memory over $O(n)$ computations.

There are many computations that also have this property. Some of these are minor variations on matrix-matrix multiplications, like those that are part of the level-3 BLAS mentioned in Unit 1.5.1. Others are higher level operations with matrices, like the LU factorization of an $n \times n$ matrix we discuss later in this unit. For each of these, one could in principle apply techniques you experienced in this course. But you can imagine that this could get really tedious.

In the 1980s, it was observed that many such operations can be formulated so that most computations are cast in terms of a matrix-matrix multiplication [7] [15] [1]. This meant that software libraries with broad functionality could be created without the burden of going through all the steps to optimize each routine at a low level.

Let us illustrate how this works for the LU factorization, which, given an $n \times n$ matrix $A \text{,}$ computes a unit lower triangular matrix $L$ and an upper triangular matrix $U$ such that $A = L U \text{.}$ If we partition

\begin{equation*} A \rightarrow \left( \begin{array} {c c } A_{11} \amp A_{12} \\ A_{21} \amp A_{22} \end{array} \right), L \rightarrow \left( \begin{array} {c c } L_{11} \amp 0 \\ L_{21} \amp L_{22} \end{array} \right), \mbox{ and } U \rightarrow \left( \begin{array} {c c } U_{11} \amp U_{12} \\ 0 \amp U_{22} \end{array} \right), \end{equation*}

where $A_{11} \text{,}$ $L_{11} \text{,}$ and $U_{11}$ are $b \times b \text{,}$ then $A = L U$ means that

\begin{equation*} \begin{array}{r c l} \left( \begin{array} {c c } A_{11} \amp A_{12} \\ A_{21} \amp A_{22} \end{array} \right) \amp = \amp \left( \begin{array} {c c } L_{11} \amp 0 \\ L_{21} \amp L_{22} \end{array} \right) \left( \begin{array} {c c } U_{11} \amp U_{12} \\ 0 \amp U_{22} \end{array} \right) \\ \amp = \amp \left( \begin{array} {c c } L_{11} U_{11} \amp L_{11} U_{12} \\ L_{21} U_{11} \amp L_{21} U_{12} + L_{22} U_{22} \end{array} \right). \end{array} \end{equation*}

If one manipulates this, one finds that

\begin{equation*} \begin{array}{rcl} A_{11} \amp = \amp L_{11} U_{11} \\ A_{12} \amp = \amp L_{11} U_{12} \\ A_{21} \amp = \amp L_{21} U_{11} \\ A_{22} - L_{21} U_{12} \amp = \amp L_{22} U_{22} . \end{array} \end{equation*}

These equalities lead to the steps:

• Compute the LU factorization of the smaller matrix $A_{11} \rightarrow L_{11} U_{11} \text{.}$

Upon completion, we know unit lower triangular $L_{11}$ and upper triangular matrix $U_{11} \text{.}$

• Solve $L_{11} U_{12} = A_{12}$ for $U_{12} \text{,}$ overwriting $A_{12} \text{.}$ This is known as a "triangular solve with multiple right-hand sides," an operation supported by the level-3 BLAS.

Upon completion, we know $U_{12} \text{.}$

• Solve $L_{21} U_{11} = A_{21}$ for $L_{21} \text{,}$ overwriting $A_{21} \text{.}$ This is a different case of "triangular solve with multiple right-hand sides," also supported by the level-3 BLAS.

Upon completion, we know $L_{21} \text{.}$

• Update $A_{22} := A_{22} - L_{21} U_{12} \text{.}$

• Since before this update $A_{22} - L_{21} U_{12} = L_{22} U_{22} \text{,}$ this process can now continue with $A = A_{22} \text{.}$
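As a concrete illustration, the steps above can be sketched in NumPy. This is a minimal sketch, not the implementation used in the course: the function name `lu_blocked` is hypothetical, pivoting is assumed to be unnecessary, and `np.linalg.solve` stands in for the level-3 BLAS triangular solve with multiple right-hand sides (`trsm`).

```python
import numpy as np

def lu_blocked(A, b=2):
    """Blocked LU factorization without pivoting (a sketch).

    Returns a matrix holding L below the diagonal (unit diagonal implicit)
    and U on and above the diagonal."""
    A = A.copy()
    n = A.shape[0]
    for k in range(0, n, b):
        e = min(k + b, n)
        # Factor the diagonal block A11 -> L11 U11 (unblocked, since b is small).
        for j in range(k, e):
            A[j+1:e, j] /= A[j, j]
            A[j+1:e, j+1:e] -= np.outer(A[j+1:e, j], A[j, j+1:e])
        L11 = np.tril(A[k:e, k:e], -1) + np.eye(e - k)
        U11 = np.triu(A[k:e, k:e])
        # Triangular solve with multiple right-hand sides: L11 U12 = A12.
        A[k:e, e:] = np.linalg.solve(L11, A[k:e, e:])
        # Triangular solve: L21 U11 = A21, solved as U11^T L21^T = A21^T.
        A[e:, k:e] = np.linalg.solve(U11.T, A[e:, k:e].T).T
        # Rank-b update A22 := A22 - L21 U12: the matrix-matrix multiplication.
        A[e:, e:] -= A[e:, k:e] @ A[k:e, e:]
    return A
```

A high-performance implementation would instead call the BLAS routines `trsm` and `gemm` for the triangular solves and the update, which is where the speed comes from.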

These steps are presented as an algorithm in Figure 4.5.1, using the "FLAME" notation developed as part of our research and used in our other two MOOCs (Linear Algebra: Foundations to Frontiers [21] and LAFF-On Programming for Correctness [20]). This leaves matrix $A$ overwritten with the unit lower triangular matrix $L$ below the diagonal (with the ones on the diagonal implicit) and $U$ on and above the diagonal.

The important insight is that if $A$ is $n \times n \text{,}$ then factoring $A_{11}$ requires approximately $\frac{2}{3} b^3$ flops, computing $U_{12}$ and $L_{21}$ each requires approximately $b^2 (n-b)$ flops, and updating $A_{22} := A_{22} - L_{21} U_{12}$ requires approximately $2 b (n-b)^2$ flops. If $b \ll n \text{,}$ then most of the computation is in the update $A_{22} := A_{22} - L_{21} U_{12} \text{,}$ which we recognize as a matrix-matrix multiplication (except that the result of the matrix-matrix multiplication is subtracted from, rather than added to, the matrix). Thus, most of the computation is cast in terms of a matrix-matrix multiplication. If that multiplication is fast, then the overall computation is fast once $n$ is large.
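To make the flop counts concrete, here is a short calculation for one (hypothetical, illustrative) choice of sizes, $n = 1000$ and $b = 32 \text{:}$

```python
# Illustrative sizes only; real block sizes are tuned to the cache hierarchy.
n, b = 1000, 32
factor_A11 = (2 / 3) * b**3        # LU factorization of the b x b block A11
trsm = 2 * b**2 * (n - b)          # the two triangular solves (U12 and L21)
gemm = 2 * b * (n - b)**2          # the update A22 := A22 - L21 U12
total = factor_A11 + trsm + gemm
print(gemm / total)                # fraction of flops in the matrix-matrix multiply
```

Even for this modest $n \text{,}$ well over 95% of the flops land in the update, so the performance of the matrix-matrix multiplication dominates.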

If you don't know what an LU factorization is, you will want to have a look at Week 6 of our MOOC titled "Linear Algebra: Foundations to Frontiers" [21]. There, it is shown that (Gaussian) elimination, which allows one to solve systems of linear equations, is equivalent to LU factorization.

Algorithms like the one illustrated here for LU factorization are referred to as blocked algorithms. Unblocked algorithms are blocked algorithms for which the block size has been chosen to equal 1. Typically, for the smaller subproblem (the factorization of $A_{11}$ in our example), an unblocked algorithm is used. A large number of unblocked and blocked algorithms for different linear algebra operations are discussed in the notes that we are preparing for our graduate-level course "Advanced Linear Algebra: Foundations to Frontiers" [19], which will be offered as part of the UT-Austin Online Masters in Computer Science. We intend to offer it as an auditable course on edX as well.
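For contrast, an unblocked (right-looking) LU factorization is what the blocked algorithm reduces to when $b = 1 \text{:}$ each iteration scales one column of $L$ and performs a rank-1 update. A minimal sketch, again assuming no pivoting is needed (the function name is hypothetical):

```python
import numpy as np

def lu_unblocked(A):
    """Unblocked right-looking LU without pivoting: the blocked algorithm with b = 1.

    Overwrites (a copy of) A with L below the diagonal and U on and above it."""
    A = A.copy()
    n = A.shape[0]
    for j in range(n):
        A[j+1:, j] /= A[j, j]                               # l21 := a21 / alpha11
        A[j+1:, j+1:] -= np.outer(A[j+1:, j], A[j, j+1:])   # rank-1 update of A22
    return A
```

Because the rank-1 update performs $O(n^2)$ flops on $O(n^2)$ data, it cannot amortize data movement the way the rank-$b$ update in the blocked algorithm can, which is exactly why blocked algorithms are preferred when $n$ is large.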

A question you may ask yourself is "What block size, $b \text{,}$ should be used in the algorithm (when $n$ is large)?"