## Unit 12.3.3 Other high-performance dense linear algebra algorithms

Throughout this course, we have pointed to papers that discuss the high-performance implementation of various operations. Here we review some of these.

### Subsubsection 12.3.3.1 High-performance QR factorization

We saw in Week 3 that the algorithm of choice for computing the QR factorization is based on Householder transformations. The reason is that this casts the computation in terms of the application of unitary transformations, which do not amplify error. In Unit 3.4.1, we discussed how the Householder QR factorization algorithm can be cast in terms of matrix-matrix operations.

An important point to note is that in order to cast the computation in terms of matrix-matrix multiplication, one has to form the "block Householder transformation"

\begin{equation*}
H_0 H_1 \cdots H_{b-1} = I - U T^{-1} U^H ,
\end{equation*}

where the columns of \(U \) hold the Householder vectors and \(T \) is a \(b \times b \) upper triangular matrix.
When the original matrix is \(m \times n \text{,}\) this requires \(O( m b^2 ) \) floating point operations to be performed to compute the upper triangular matrix \(T \) in each iteration, which adds \(O( m n b ) \) to the total cost of the QR factorization. This is computation that an unblocked algorithm does not perform. In return, the bulk of the computation is performed much faster, which in the balance benefits performance. Details can be found in, for example,
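The \(O( m b^2 ) \) cost of forming \(T \) can be made concrete with a short sketch. The following is our own illustration (the function name `form_T` is ours), assuming the convention that each Householder transformation is \(H_j = I - u_j u_j^H / \tau_j \) with \(\tau_j = u_j^H u_j / 2 \) and that the Householder vectors, with unit first element, are stored as the columns of a unit lower trapezoidal matrix \(U \):

```python
import numpy as np

def form_T(U):
    """Form the upper triangular T in the block Householder
    transformation I - U inv(T) U^H, where the columns of U are
    Householder vectors (unit first element, zeros above it).

    The strictly upper triangular part of T equals that of U^H U,
    and the diagonal entry is tau_j = u_j^H u_j / 2.
    Cost: O(m b^2) flops for an m x b matrix U.
    """
    m, b = U.shape
    T = np.triu(U.conj().T @ U)          # O(m b^2) flops dominate
    for j in range(b):
        T[j, j] = (U[:, j].conj() @ U[:, j]) / 2.0
    return T
```

One can check the identity by multiplying out the individual Householder transformations and comparing with \(I - U T^{-1} U^H \).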

[23] Thierry Joffrain, Tze Meng Low, Enrique S. Quintana-Ortí, Robert van de Geijn, Field G. Van Zee, Accumulating Householder transformations, revisited, ACM Transactions on Mathematical Software, Vol. 32, No. 2, 2006.

Casting the Rank-Revealing QR factorization, discussed in Unit 4.5.2, in terms of matrix-matrix multiplications is trickier. In order to determine what column should be swapped at each step, the rest of the matrix has to be at least partially updated. One solution to this is to use a randomized algorithm, as discussed in

[25] Per-Gunnar Martinsson, Gregorio Quintana-Ortí, Nathan Heavner, Robert van de Geijn, Householder QR Factorization With Randomization for Column Pivoting (HQRRP), SIAM Journal on Scientific Computing, Vol. 39, No. 2, 2017.

### Subsubsection 12.3.3.2 Optimizing reduction to condensed form

In Unit 10.3.1 and Unit 11.2.3, we discussed algorithms for reducing a matrix to tridiagonal and bidiagonal form, respectively. These are special cases of the reduction of a matrix to condensed form. The algorithms in those sections cast most of the computation in terms of matrix-vector multiplication and rank-1 or rank-2 updates, which are matrix-vector operations that do not attain high performance. In enrichments in those chapters, we point to papers that cast some of the computation in terms of matrix-matrix multiplication. Here, we discuss the basic issues.

When computing the LU, Cholesky, or QR factorization, it is possible to factor a panel of columns before applying an accumulation of the transformations that are encountered to the rest of the matrix. It is this that allows the computation to be mostly cast in terms of matrix-matrix operations. When computing the reduction to tridiagonal or bidiagonal form, this is not possible. The reason is that if we have just computed a unitary transformation, this transformation must be applied to the rest of the matrix both from the left and from the right. What this means is that the next transformation to be computed depends on an update that involves the rest of the matrix. This in turn means that inherently a matrix-vector operation (involving the "rest of the matrix") must be performed at every step, at a cost of \(O ( m^2 ) \) computation per iteration (if the matrix is \(m \times m \)). The conclusion is that \(O( m^3 ) \) computation is cast in terms of matrix-vector operations, which is of the same order as the total computation.
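This dependence is visible in an unblocked reduction to tridiagonal form. The following is a textbook-style sketch in our own notation (the helper names `house` and `tridiagonalize` are ours); each iteration performs a symmetric matrix-vector multiply and a rank-2 update with the trailing submatrix, which is exactly the \(O( (m-j)^2 ) \) matrix-vector work per step described above:

```python
import numpy as np

def house(x):
    """Householder vector v and scalar beta such that
    (I - beta v v^T) x = alpha e_0, for nonzero real x."""
    s = 1.0 if x[0] >= 0 else -1.0
    alpha = -s * np.linalg.norm(x)
    v = x.copy()
    v[0] -= alpha                     # v = x - alpha e_0
    beta = 2.0 / (v @ v)
    return v, beta, alpha

def tridiagonalize(A):
    """Unblocked reduction of a real symmetric A to tridiagonal form
    via Householder similarity transformations (a sketch).

    Each iteration does a symmetric matrix-vector multiply and a
    rank-2 update with the trailing submatrix: O((m-j)^2) flops per
    step, O(m^3) total, all in matrix-vector (level-2) operations.
    """
    A = A.copy()
    m = A.shape[0]
    for j in range(m - 2):
        v, beta, alpha = house(A[j+1:, j])
        p = beta * (A[j+1:, j+1:] @ v)            # symmetric matvec
        w = p - 0.5 * beta * (p @ v) * v
        A[j+1:, j+1:] -= np.outer(v, w) + np.outer(w, v)  # rank-2 update
        A[j+1, j] = A[j, j+1] = alpha             # new subdiagonal entry
        A[j+2:, j] = A[j, j+2:] = 0.0             # zeros introduced by H
    return A
```

Because the symmetric matrix-vector multiply needs the fully updated trailing submatrix before the next Householder vector can be computed, the update cannot simply be delayed and accumulated as in LU or QR.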

While this is bad news, there is still a way to cast about half the computation in terms of matrix-matrix multiplication for the reduction to tridiagonal form. Notice that this means the computation is sped up by at most a factor of two, since even if the part that is cast in terms of matrix-matrix multiplication takes no time at all relative to the rest of the computation, this only cuts the time to completion in half.

The reduction to bidiagonal form is trickier yet. It requires the fusing of a matrix-vector multiplication with a rank-1 update in order to reuse data that is already in cache.
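The idea behind such fusing can be illustrated with a deliberately simple sketch (the function name and loop structure are ours; a real implementation fuses at the level of cache blocks and registers inside optimized kernels). The point is that the rank-1 update and the subsequent matrix-vector multiply can share a single pass over the matrix, so each element is brought into cache only once:

```python
import numpy as np

def fused_rank1_matvec(A, u, w, x):
    """Illustrative fusion: apply the rank-1 update A := A - u w^T and
    simultaneously compute y = A^T x with the *updated* A, touching
    each column of A only once.

    Unfused, the two operations would each stream through all of A,
    doubling the memory traffic for the same flop count.
    """
    m, n = A.shape
    y = np.zeros(n)
    for j in range(n):                # one pass over the columns of A
        A[:, j] -= w[j] * u           # rank-1 update of column j
        y[j] = A[:, j] @ x            # matvec contribution of column j
    return y
```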

Details can be found in, for example,

[45] Field G. Van Zee, Robert A. van de Geijn, Gregorio Quintana-Ortí, G. Joseph Elizondo, Families of Algorithms for Reducing a Matrix to Condensed Form, ACM Transactions on Mathematical Software (TOMS), Vol. 39, No. 1, 2012.

### Subsubsection 12.3.3.3 Optimizing the implicitly shifted QR algorithm

Optimizing the QR algorithm for computing the Spectral Decomposition or Singular Value Decomposition gets even trickier. Part of the cost is in the reduction to condensed form, which we already have noted exhibits limited opportunity for casting computation in terms of matrix-matrix multiplication. Once the algorithm proceeds to the implicitly shifted QR algorithm, most of the computation is in the accumulation of the eigenvectors or singular vectors. In other words, it is in the application of the Givens' rotations from the right to the columns of a matrix \(Q \) in which the eigenvectors are being computed. Let us look at one such application to two columns, \(q_i\) and \(q_j \text{:}\)

\begin{equation*}
\left( \begin{array}{c | c} q_i & q_j \end{array} \right) := \left( \begin{array}{c | c} q_i & q_j \end{array} \right) \left( \begin{array}{c c} \gamma & - \sigma \\ \sigma & \gamma \end{array} \right) .
\end{equation*}
The update of each column is a vector-vector operation, requiring \(O( m ) \) computation with \(O( m ) \) data (if the vectors are of size \(m \)). We have reasoned that for such an operation it is the cost of accessing memory that dominates. In
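In code, one such application looks as follows (our own sketch, using the rotation convention above; the function name is ours). The two columns are read and written once each, so the \(O( m ) \) flops come with \(O( m ) \) memory traffic:

```python
import numpy as np

def apply_givens_columns(Q, i, j, gamma, sigma):
    """Apply a Givens rotation from the right to columns i and j of Q:
    [q_i, q_j] := [q_i, q_j] [[gamma, -sigma], [sigma, gamma]].

    O(m) flops on O(m) data: the cost of moving the two columns
    through the memory hierarchy dominates the arithmetic.
    """
    qi = Q[:, i].copy()
    Q[:, i] = gamma * qi + sigma * Q[:, j]
    Q[:, j] = -sigma * qi + gamma * Q[:, j]
```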

[44] Field G. Van Zee, Robert A. van de Geijn, Gregorio Quintana-Ortí, Restructuring the Tridiagonal and Bidiagonal QR Algorithms for Performance, ACM Transactions on Mathematical Software (TOMS), Vol. 40, No. 3, 2014.

we discuss how the rotations from many Francis Steps can be saved up and applied to \(Q \) at the same time. By carefully orchestrating this so that data in cache can be reused, the performance can be improved to rival that attained by a matrix-matrix operation.
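The key observation can be sketched as follows (a much-simplified illustration of ours, not the paper's interface: the function name, the representation of a "sweep" of rotations, and the blocking factor are all our own). Because the rotations act on columns of \(Q \), disjoint blocks of rows can be updated independently; by applying all saved sweeps to one row block before moving on, that block is read from main memory once rather than once per sweep:

```python
import numpy as np

def apply_saved_rotations(Q, sweeps, row_block=64):
    """Apply saved sweeps of Givens rotations to Q from the right.

    sweeps: a list of sweeps, each a list of (i, gamma, sigma)
    rotations acting on columns i and i+1, in order.  Q is updated
    one block of rows at a time so that the block stays in cache
    while all sweeps are applied to it, amortizing memory traffic.
    """
    m = Q.shape[0]
    for r0 in range(0, m, row_block):
        blk = Q[r0:r0 + row_block, :]          # a view into Q
        for sweep in sweeps:                   # rotations of one Francis step
            for (i, gamma, sigma) in sweep:
                ci = blk[:, i].copy()
                blk[:, i] = gamma * ci + sigma * blk[:, i + 1]
                blk[:, i + 1] = -sigma * ci + gamma * blk[:, i + 1]
```

The restructured algorithms in [44] go considerably further, but the row-blocking idea conveys why the accumulated application can approach matrix-matrix-multiplication-like performance.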