ALAFF Basics of optimizing matrix-matrix multiplication

Subsection 12.2.3 Basics of optimizing matrix-matrix multiplication

Let us again consider the computation

\begin{equation*} C := A B + C, \end{equation*}

where \(A \text{,}\) \(B \text{,}\) and \(C \) are \(m \times m \) matrices. If \(m \) is small enough, then we can read the three matrices into the L1 cache, perform the operation, and write the updated matrix \(C \) back to memory. In this case,

During the computation, the matrices are in a fast memory (the L1 cache), which can keep up with the speed of floating point computation and
The cost of moving each floating point number from main memory into the L1 cache is amortized over \(m/2 \) floating point computations.

If \(m \) is large enough, then the cost of moving the data becomes insignificant. (If carefully orchestrated, some of the movement of data can even be overlapped with computation, but that is beyond our discussion.)

We immediately notice there is a tension: \(m \) must be small so that all three matrices can fit in the L1 cache. Thus, this only works for relatively small matrices. However, for small matrices, the ratio \(m / 2 \) may not be favorable enough to offset the very slow main memory.

Fortunately, matrix-matrix multiplication can be orchestrated by partitioning the matrices that are involved into submatrices, and computing with these submatrices instead. We recall that if we partition

\begin{equation*} C = \left( \begin{array}{c | c | c | c } C_{0,0} \amp C_{0,1} \amp \cdots \amp C_{0,N-1} \\ \hline C_{1,0} \amp C_{1,1} \amp \cdots \amp C_{1,N-1} \\ \hline \vdots \amp \vdots \amp \amp \vdots \\ \hline C_{M-1,0} \amp C_{M-1,1} \amp \cdots \amp C_{M-1,N-1} \end{array} \right), \end{equation*}

\begin{equation*} A = \left( \begin{array}{c | c | c | c } A_{0,0} \amp A_{0,1} \amp \cdots \amp A_{0,K-1} \\ \hline A_{1,0} \amp A_{1,1} \amp \cdots \amp A_{1,K-1} \\ \hline \vdots \amp \vdots \amp \amp \vdots \\ \hline A_{M-1,0} \amp A_{M-1,1} \amp \cdots \amp A_{M-1,K-1} \end{array} \right), \end{equation*}

and

\begin{equation*} B = \left( \begin{array}{c | c | c | c } B_{0,0} \amp B_{0,1} \amp \cdots \amp B_{0,N-1} \\ \hline B_{1,0} \amp B_{1,1} \amp \cdots \amp B_{1,N-1} \\ \hline \vdots \amp \vdots \amp \amp \vdots \\ \hline B_{K-1,0} \amp B_{K-1,1} \amp \cdots \amp B_{K-1,N-1} \end{array} \right), \end{equation*}

where \(C_{i,j} \) is \(m_i \times n_j \text{,}\) \(A_{i,p} \) is \(m_i \times k_p \text{,}\) and \(B_{p,j} \) is \(k_p \times n_j \text{,}\) with \(\sum_{i=0}^{M-1} m_i = m \text{,}\) \(\sum_{j=0}^{N-1} n_i = n \text{,}\) and \(\sum_{p=0}^{K-1} k_i = k \text{,}\) then

\begin{equation*} C_{i,j} := \sum_{p=0}^{K-1} A_{i,p} B_{p,j} + C_{i,j} . \end{equation*}

If we choose each \(m_i \text{,}\) \(n_j \text{,}\) and \(k_p\) small enough, then the submatrices fit in the L1 cache. This still leaves us with the problem that these sizes must be reasonably small if the ratio of flops to memops is to be sufficient. The answer to that is to block for multiple levels of caches.