Section 2.5 When Optimal Means Optimal
¶Unit 2.5.1 Reasoning about optimality
¶Early in our careers, we learned that if you say that an implementation is optimal, you better prove that it is optimal.
In our empirical studies, graphing the measured performance, we can compare our achieved results to the theoretical peak. Obviously, if we achieved theoretical peak performance, then we would know that the implementation is optimal. The problem is that we rarely achieve the theoretical peak performance as computed so far (multiplying the clock rate by the number of floating point operations that can be performed per clock cycle).
In order to claim optimality, one must carefully model an architecture and compute, through analysis, the exact limit of what it theoretical can achieve. Then, one can check achieved performance against the theoretical limit, and make a claim. Usually, this is also not practical.
In practice, one creates a model of computation for a simplified architecture. With that, one then computes a theoretical limit on performance. The next step is to show that the theoretical limit can be (nearly) achieved by an algorithm that executes on that simplified architecture. This then says something about the optimality of the algorithm under idealized circumstances. By finally comparing and contrasting the simplified architecture with an actual architecture, and the algorithm that targets the simplified architecture with an actual algorithm designed for the actual architecture, one can reason about the optimality, or lack thereof, of the practical algorithm.
In this section, we will illustrate this process with matrixmatrix multiplication.
Unit 2.5.2 A simple model
¶Let us give a simple model of computation that matches what we have assumed so far when programming matrixmatrix multiplication:
We wish to compute \(C := A B + C \) where \(C \text{,}\) \(A \text{,}\) and \(C \) are \(m \times n \text{,}\) \(m \times k \text{,}\) and \(k \times n \text{,}\) respectively.
The computation is cast in terms of FMAs.
Our machine has two layers of memory: fast memory (registers) and slow memory (main memory).
Initially, data reside in main memory.
To compute a FMA, all three operands must be in fast memory.
Fast memory can hold at most \(S\) floats.
Slow memory is large enough that its size is not relevant to this analysis.
Computation cannot be overlapped with data movement.
Notice that this model matches pretty well how we have viewed our processor so far.
Unit 2.5.3 Minimizing data movement
¶We have seen that matrixmatrix multiplication requires \(m \times n \times k \) FMA operations, or \(2 m n k \) flops. Executing floating point operations constitutes useful computation. Moving data between slow memory and fast memory is overheadm since we assume it cannot be overlapped. Hence, under our simplified model, if an algorithm only performs the minimum number of flops (namely \(2 m n k \)), minimizes the time spent moving data between memory layers, and we at any given time are either performing useful computation (flops) or moving data, then we can argue that (under our model) the algorithm is optimal.
We now focus the argument by reasoning about a lower bound on the number of data that must be moved from fast memory to slow memory. We will build the argument with a sequence of observations.
Consider the loop
One can view the computations
as a cube of points in 3D, \(( i,j,k ) \) for \(0 \leq i \lt m \text{,}\) \(0 \leq j \lt n \text{,}\) \(0 \leq p \lt k \text{.}\) The set of all possible algorithms that execute each such update only once can be viewed as an arbitrary ordering on that set of points. We can this as indexing the set of all such triples with \(i_r \text{,}\) \(j_r \text{,}\) \(p_r \text{,}\) \(0 \leq r \lt m \times n \times k \text{:}\)
so that the algorithm that computes \(C := A B + C \) can then be written as
Obviously, this puts certain restrictions on \(i_r \text{,}\) \(j_r \text{,}\) and \(p_r\text{.}\) Articulating those exactly is not important right now.
We now partition the ordered set \(0, \ldots , mnk1 \) into ordered contiguous subranges (phases) each of which requires \(S + M \) distinct elements from \(A \text{,}\) \(B \text{,}\) and \(C \) (e.g., \(S \) elements from \(A \text{,}\) \(M/3 \) elements of \(B \text{,}\) and \(2M/3 \) elements of \(C \)), except for the last phase, which will contain fewer. (Strictly speaking, it is a bit more complicated than splitting the range of the iterations, since the last FMA may require anywhere from \(0\) to \(3 \) new elements to be loaded from slow memory. Fixing this is a matter of thinking of the loads that are required as separate from the computation (as our model does) and then splitting the operations  loads, FMAs, and stores  into phases rather than the range. This does not change our analysis.)
Recall that \(S \) equals the number of floats that fit in fast memory. A typical phase will start with \(S \) elements in fast memory, and will in addition read \(M \) elements from slow memory (except for the final phase). % We will call such an ordered subset a phase of triples. Such a typical phase will start at \(r = R \) and consists of \(F \) triples, \(( i_R, j_R, p_R ) \) through \(( i_{R+F1}, j_{R+F1}, p_{R+F1} )\text{.}\) Let us denote the set of these triples by \(\mathbf{D} \text{.}\) These represent \(F\) FMAs being performed in our algorithm. Even the first phase needs to read at least \(M \) elements since it will require \(S + M \) elements to be read.
The key question now becomes what the upper bound on the number of FMAs is that can be performed with \(S+M \) elements. Let us denote this bound by \(F_{\rm max} \text{.}\) If we know \(F_{\rm max} \) as a function of \(S + M \text{,}\) then we know that at least \(\frac{m n k}{F_{\rm max}} 1 \) phases of FMAs need to be executed, where each of those phases requires at least \(S \) reads from slow memory. The total number of reads requires for any algorithm is thus at least
To find \(F_{\rm max} \) we make a few observations:
A typical triple \(( i_r,j_r,p_r ) \in \mathbf{D} \) represents a FMA that requires one element from each of matrices \(C \text{,}\) \(A \text{,}\) and \(B\text{:}\) \(\gamma_{i,j} \text{,}\) \(\alpha_{i,p}\text{,}\) and \(\beta_{p,j} \text{.}\)
The set of all elements from \(C \text{,}\) \(\gamma_{i_r, j_r} \text{,}\) that are needed for the computations represented by the triples in \(\mathbf{D}\) are indexed with the tuples \(\mathbf{C_D} = \{ ( i_r, j_r ) ~\vert R \leq r \lt R+F \} \text{.}\) If we think of the triples in \(\mathbf{D}\) as points in 3D, then \(\mathbf{C_D} \) is the projection of those points onto the \(i,j \) plane. Its size, \(\vert \mathbf{C_D} \vert \text{,}\) tells us how many elements of \(C \) must at some point be in fast memory during that phase of computations.
The set of all elements from \(A \text{,}\) \(\alpha_{i_r, p_r} \text{,}\) that are needed for the computations represented by the triples in \(\mathbf{D}\) are indexed with the tuples \(\mathbf{A_D} = \{ ( i_r, p_r ) ~\vert R \leq r \lt R+F \} \text{.}\) If we think of the triples \(( i_r, j_r, p_r ) \) as points in 3D, then \(\mathbf{A_D} \) is the projection of those points onto the \(i,p \) plane. Its size, \(\vert \mathbf{A_D} \vert \text{,}\) tells us how many elements of \(A \) must at some point be in fast memory during that phase of computations.
The set of all elements from \(C \text{,}\) \(\beta_{p_r, j_r} \text{,}\) that are needed for the computations represented by the triples in \(\mathbf{D} \) are indexed with the tuples \(\mathbf{B_D} = \{ ( p_r, j_r ) ~\vert R \leq r \lt R+F \} \text{.}\) If we think of the triples \(( i_r, j_r, p_r ) \) as points in 3D, then \(\mathbf{B_D} \) is the projection of those points onto the \(p,j \) plane. Its size, \(\vert \mathbf{B_D} \vert \text{,}\) tells us how many elements of \(B \) must at some point be in fast memory during that phase of computations.
Now, there is an result known as the discrete LoomisWhitney inequality that tells us that in our situation \(\vert \mathbf{D} \vert \leq \sqrt{\vert \mathbf{C_D} \vert \vert \mathbf{A_D} \vert \vert \mathbf{B_D} \vert }\text{.}\) In other words, \(F_{\rm max} \leq \sqrt{\vert \mathbf{C_D} \vert \vert \mathbf{A_D} \vert \vert \mathbf{B_D} \vert }\text{.}\) The name of the game now becomes to find the largest value \(F_{\rm max} \) that satisfies
An application known as Lagrange multipliers yields the solution
With that largest \(F_{\rm max} \) we can then establish a lower bound on the number of memory reads given by Equation~((2.5.1)):
Now, \(M \) is a free variable. To come up with the sharpest (best) lower bound, we want the largest lower bound. It turns out that, using techniques from calculus, one can show that \(M = 2 S \) maximizes the lower bound. Thus, the best lower bound our analysis yields is given by
Unit 2.5.4 A nearly optimal algorithm
¶We now discuss a (nearly) optimal algorithm for our simplified architecture. Recall that we assume fast memory can hold \(S \) elements. For simplicity, assume \(S \) is a perfect square. Partition
where \(C_{i,j} \) is \(m_R \times n_R \text{,}\) \(A_{i}\) is \(m_R \times k \text{,}\) and \(B_j \) is \(k \times n_R \text{.}\) Here we choose \(m_R \times n_R = (\sqrt{S}1) \times \sqrt{S} \) so that fast memory can hold one submatrix \(C_{i,j} \text{,}\) one column of \(A_i \text{,}\) and one element of \(B_j \text{:}\) \(m_R \times n_R + m_R + 1 = (\sqrt{S}1) \times \sqrt{S} + \sqrt{S}1 + 1 = S \text{.}\)
When computing \(C := A B + C \text{,}\) we recognize that \(C_{i,j} := A_i B_j + C_{i,j} \text{.}\) Now, let's further partition
We now recognize that \(C_{i,j} := A_i B_j + C_{i,j} \) can be computed as
the by now very familiar sequence of rank1 updates that makes up the microkernel discussed in Section 2.3. The following loop exposes the computation \(C := A B + C \text{,}\) including the loads and stores from and to slow memory:
For simplicity, here \(M = m/m_r \) and \(N = n/n_r \text{.}\)
On the surface, this seems to require \(C_{i,j} \text{,}\) \(a_{i,p} \text{,}\) and \(b_{p,j}^T \) to be in fast memory at the same time, placing \(m_R \times n_R + m_R + n_r = S + \sqrt{S}1 \) floats in fast memory. However, we have seen before that the rank1 update \(C_{i,j} := a_{i,p} b_{p,j}^T + C_{i,j} \) can be implemented as a loop around axpy operations, so that the elements of \(b_{p,j}^T \) only need to be in fast memory one at a time.
Let us now analyze the number of memory operations incurred by this algorithm:
Loading and storing all \(C_{i,j} \) incurs \(m n\) loads and \(m n \) stores, for a total of \(2 m n \) memory operations. (Each such block is loaded once and stored once, meaning every element of \(C\) is loaded once and stored once.)

Loading all \(a_{i,p} \) requires
\begin{equation*} M N k m_R = (M m_R ) N k = m \frac{n}{n_R} k = \frac{m n k}{\sqrt{S}} \end{equation*}memory operations.

Loading all \(b_{p,j}^T \) requires
\begin{equation*} M N k n_R = M (N n_R) k = \frac{m}{m_R} n k = \frac{m n k}{\sqrt{S}1} \end{equation*}memory operations.
The total number of memory operations is hence
We can now compare this to the lower bound from the last unit:
The cost of reading and writing elements of \(C \text{,}\) \(2 m n \text{,}\) contributes a lower order term, as does \(\frac{mnk}{S  \sqrt{S}} \) if \(S \) (the size of fast memory) is reasonably large. Thus, the proposed algorithm is nearly optimal with regards to the amount of data that is moved between slow memory and fast memory.
Unit 2.5.5 Discussion
¶What we notice is that the algorithm presented in the last unit is quite similar to the algorithm that in the end delivered good performance in Section 2.3. Both utilize most of fast memory (registers in Section 2.3) with a submatrix of \(C \text{.}\) Both organize the computation in terms of a kernel that performs rank1 updates of that submatrix of \(C \text{.}\)
What is different is that the theory indicates that the number of memory operations are minimized if the block of \(C \) is chosen to be square. In Section 2.3, the best performance was observed with a kernel that chose the submatrix of \(C\) in registers to be \(12 \times 4 \text{.}\) The reason is that loading elements of \(A \) into registers (four at a time) is cheaper (on a perelement basis) than broadcasting elements of \(B \text{.}\) While in Unit 2.4.4we try to minimize memory operations, in Section 2.3 we are trying to minimize time spent in those memory operations.