Unit 4.5.1 Casting computation in terms of matrix-matrix multiplication
A key characteristic of matrix-matrix multiplication that allows it to achieve high performance is that it performs \(O( n^3 ) \) computations with \(O( n^2 ) \) data (if \(m = n = k \)). This means that if \(n \) is large, it may be possible to amortize the cost of moving a data item between memory and faster memory over \(O(n) \) computations.
There are many computations that also have this property. Some of these are minor variations on matrix-matrix multiplication, like those that are part of the level-3 BLAS mentioned in Unit 1.5.1. Others are higher-level operations with matrices, like the LU factorization of an \(n \times n \) matrix we discuss later in this unit. For each of these, one could in principle apply techniques you experienced in this course. But you can imagine that this could get really tedious.
In the 1980s, it was observed that many such operations can be formulated so that most computations are cast in terms of a matrix-matrix multiplication [7] [15] [1]. This meant that software libraries with broad functionality could be created without the burden of going through all the steps to optimize each routine at a low level.
Let us illustrate how this works for the LU factorization which, given an \(n \times n \) matrix \(A \text{,}\) computes a unit lower triangular matrix \(L \) and an upper triangular matrix \(U \) such that \(A = L U \text{.}\) If we partition
\begin{equation*}
A = \left( \begin{array}{cc} A_{11} & A_{12} \\ A_{21} & A_{22} \end{array} \right), \quad
L = \left( \begin{array}{cc} L_{11} & 0 \\ L_{21} & L_{22} \end{array} \right), \quad \mbox{and} \quad
U = \left( \begin{array}{cc} U_{11} & U_{12} \\ 0 & U_{22} \end{array} \right),
\end{equation*}
where \(A_{11} \text{,}\) \(L_{11} \text{,}\) and \(U_{11} \) are \(b \times b \text{,}\) then \(A = L U \) means that
\begin{equation*}
\left( \begin{array}{cc} A_{11} & A_{12} \\ A_{21} & A_{22} \end{array} \right)
=
\left( \begin{array}{cc} L_{11} & 0 \\ L_{21} & L_{22} \end{array} \right)
\left( \begin{array}{cc} U_{11} & U_{12} \\ 0 & U_{22} \end{array} \right)
=
\left( \begin{array}{cc} L_{11} U_{11} & L_{11} U_{12} \\ L_{21} U_{11} & L_{21} U_{12} + L_{22} U_{22} \end{array} \right).
\end{equation*}
If one manipulates this, one finds that
\begin{equation*}
A_{11} = L_{11} U_{11}, \quad
A_{12} = L_{11} U_{12}, \quad
A_{21} = L_{21} U_{11}, \quad \mbox{and} \quad
A_{22} - L_{21} U_{12} = L_{22} U_{22},
\end{equation*}
which then leads to the steps

Compute the LU factorization of the smaller matrix \(A_{11} \rightarrow L_{11} U_{11} \text{.}\)
Upon completion, we know unit lower triangular \(L_{11}\) and upper triangular matrix \(U_{11} \text{.}\)

Solve \(L_{11} U_{12} = A_{12} \) for \(U_{12} \text{.}\) This is known as a "triangular solve with multiple right-hand sides," which is an operation supported by the level-3 BLAS.
Upon completion, we know \(U_{12} \text{.}\)

Solve \(L_{21} U_{11} = A_{21} \) for \(L_{21} \text{.}\) This is a different case of "triangular solve with multiple right-hand sides," also supported by the level-3 BLAS.
Upon completion, we know \(L_{21} \text{.}\)
Update \(A_{22} := A_{22} - L_{21} U_{12} \text{.}\)
Since before this update \(A_{22} - L_{21} U_{12} = L_{22} U_{22} \text{,}\) this process can now continue with \(A = A_{22} \text{.}\)
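The steps above can be sketched in code. What follows is a minimal NumPy sketch of a blocked right-looking LU factorization without pivoting; the function name lu_blocked and the default block size are our illustrative choices, not part of any library.

```python
import numpy as np

def lu_blocked(A, b=4):
    """Blocked right-looking LU factorization without pivoting (a sketch).

    Overwrites A with L below the diagonal (unit diagonal implicit)
    and U on and above the diagonal.  The block size b is illustrative.
    """
    n = A.shape[0]
    for k in range(0, n, b):
        kb = min(b, n - k)
        # Step 1: factor A11 -> L11 U11 with an unblocked algorithm.
        for j in range(k, k + kb):
            A[j+1:k+kb, j] /= A[j, j]                    # column of L11
            A[j+1:k+kb, j+1:k+kb] -= np.outer(A[j+1:k+kb, j],
                                              A[j, j+1:k+kb])
        if k + kb < n:
            L11 = np.tril(A[k:k+kb, k:k+kb], -1) + np.eye(kb)
            U11 = np.triu(A[k:k+kb, k:k+kb])
            # Step 2: solve L11 U12 = A12 for U12
            # (triangular solve with multiple right-hand sides).
            A[k:k+kb, k+kb:] = np.linalg.solve(L11, A[k:k+kb, k+kb:])
            # Step 3: solve L21 U11 = A21 for L21
            # (transposed to a standard triangular solve).
            A[k+kb:, k:k+kb] = np.linalg.solve(U11.T, A[k+kb:, k:k+kb].T).T
            # Step 4: update A22 := A22 - L21 U12,
            # the matrix-matrix multiplication where most flops occur.
            A[k+kb:, k+kb:] -= A[k+kb:, k:k+kb] @ A[k:k+kb, k+kb:]
    return A
```

In an optimized library, the last line is where a tuned dgemm-like routine would be invoked, which is exactly why casting the bulk of the work in this form pays off.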
These steps are presented as an algorithm in Figure 4.5.1, using the "FLAME" notation developed as part of our research and used in our other two MOOCs (Linear Algebra: Foundations to Frontiers [21] and LAFF-On Programming for Correctness [20]). This leaves matrix \(A \) overwritten with the unit lower triangular matrix \(L \) below the diagonal (with the ones on the diagonal implicit) and \(U \) on and above the diagonal.
The important insight is that if \(A \) is \(n \times n \text{,}\) then factoring \(A_{11} \) requires approximately \(\frac{2}{3} b^3 \) flops, computing \(U_{12} \) and \(L_{21} \) each requires approximately \(b^2 (n-b)\) flops, and updating \(A_{22} := A_{22} - L_{21} U_{12}\) requires approximately \(2 b (n-b)^2 \) flops. If \(b \ll n \text{,}\) then most computation is in the update \(A_{22} := A_{22} - L_{21} U_{12} \text{,}\) which we recognize as a matrix-matrix multiplication (except that the result of the matrix-matrix multiplication is subtracted from, rather than added to, the matrix). Thus, most of the computation is cast in terms of a matrix-matrix multiplication. If it is fast, then the overall computation is fast once \(n \) is large.
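To make the flop-count argument concrete, one can tabulate the four terms for representative values; the choices \(n = 1000\) and \(b = 32\) below are illustrative, not prescriptive.

```python
# Approximate flop counts for one blocked LU step; n and b are
# illustrative values chosen only to make the fractions concrete.
n, b = 1000, 32

factor_A11 = (2 / 3) * b**3       # LU factorization of the b x b block A11
solve_U12  = b**2 * (n - b)       # triangular solve for U12
solve_L21  = b**2 * (n - b)       # triangular solve for L21
update_A22 = 2 * b * (n - b)**2   # matrix-matrix update of A22

total = factor_A11 + solve_U12 + solve_L21 + update_A22
print(f"fraction of flops in the A22 update: {update_A22 / total:.3f}")
```

For these values, well over ninety percent of the flops land in the matrix-matrix update, which is the point of the argument.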
If you don't know what an LU factorization is, you will want to have a look at Week 6 of our MOOC titled "Linear Algebra: Foundations to Frontiers" [21]. There, it is shown that (Gaussian) elimination, which allows one to solve systems of linear equations, is equivalent to LU factorization.
Algorithms like the one illustrated here for LU factorization are referred to as blocked algorithms. Unblocked algorithms are blocked algorithms for which the block size has been chosen to equal 1. Typically, for the smaller subproblem (the factorization of \(A_{11} \) in our example), an unblocked algorithm is used. A large number of unblocked and blocked algorithms for different linear algebra operations are discussed in the notes that we are preparing for our graduate-level course "Advanced Linear Algebra: Foundations to Frontiers," [19], which will be offered as part of the UT-Austin Online Masters in Computer Science. We intend to offer it as an auditable course on edX as well.
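For illustration, an unblocked algorithm (block size \(b = 1\)) of the kind that would be applied to \(A_{11}\) can be sketched as follows; the function name lu_unblocked is ours, not a library routine.

```python
import numpy as np

def lu_unblocked(A):
    """Unblocked LU factorization (block size b = 1), no pivoting (a sketch).

    Overwrites A with L below the diagonal (unit diagonal implicit)
    and U on and above the diagonal.  Each iteration scales the current
    column by the pivot and applies a rank-1 update to the trailing matrix.
    """
    n = A.shape[0]
    for j in range(n):
        A[j+1:, j] /= A[j, j]                              # column of L
        A[j+1:, j+1:] -= np.outer(A[j+1:, j], A[j, j+1:])  # rank-1 update
    return A
```

Note that the rank-1 update here touches \(O((n-j)^2)\) data to perform \(O((n-j)^2)\) flops, which is precisely why the unblocked algorithm alone cannot amortize data movement the way the blocked algorithm's matrix-matrix update can.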
A question you may ask yourself is "What block size, \(b \text{,}\) should be used in the algorithm (when \(n \) is large)?"