## Unit2.5.1Lower bound on data movement

The discussion in this enrichment was inspired by the paper [23]

Tyler Michael Smith, Bradley Lowery, Julien Langou, Robert A. van de Geijn. Tight I/O Lower Bound for Matrix Multiplication. Submitted to ACM Transactions on Mathematical Software. Draft available from arXiv.org.

For more details, we encourage you to read that paper.

Early in our careers, we learned that if you say that an implementation is optimal, you better prove that it is optimal.

In our empirical studies, graphing the measured performance, we can compare our achieved results to the theoretical peak. Obviously, if we achieved theoretical peak performance, then we would know that the implementation is optimal. The problem is that we rarely achieve the theoretical peak performance as computed so far (multiplying the clock rate by the number of floating point operations that can be performed per clock cycle).

In order to claim optimality, one must carefully model an architecture and compute, through analysis, the exact limit of what it theoretical can achieve. Then, one can check achieved performance against the theoretical limit, and make a claim. Usually, this is also not practical.

In practice, one creates a model of computation for a simplified architecture. With that, one then computes a theoretical limit on performance. The next step is to show that the theoretical limit can be (nearly) achieved by an algorithm that executes on that simplified architecture. This then says something about the optimality of the algorithm under idealized circumstances. By finally comparing and contrasting the simplified architecture with an actual architecture, and the algorithm that targets the simplified architecture with an actual algorithm designed for the actual architecture, one can reason about the optimality, or lack thereof, of the practical algorithm.

### Subsubsection2.5.1.2A simple model

Let us give a simple model of computation that matches what we have assumed so far when programming matrix-matrix multiplication:

• We wish to compute $C := A B + C$ where $C \text{,}$ $A \text{,}$ and $C$ are $m \times n \text{,}$ $m \times k \text{,}$ and $k \times n \text{,}$ respectively.

• The computation is cast in terms of FMAs.

• Our machine has two layers of memory: fast memory (registers) and slow memory (main memory).

• Initially, data reside in main memory.

• To compute a FMA, all three operands must be in fast memory.

• Fast memory can hold at most $S$ floats.

• Slow memory is large enough that its size is not relevant to this analysis.

• Computation cannot be overlapped with data movement.

Notice that this model matches pretty well how we have viewed our processor so far.

### Subsubsection2.5.1.3Minimizing data movement

We have seen that matrix-matrix multiplication requires $m \times n \times k$ FMA operations, or $2 m n k$ flops. Executing floating point operations constitutes useful computation. Moving data between slow memory and fast memory is overhead since we assume it cannot be overlapped. Hence, under our simplified model, if an algorithm only performs the minimum number of flops (namely $2 m n k$), minimizes the time spent moving data between memory layers, and we at any given time are either performing useful computation (flops) or moving data, then we can argue that (under our model) the algorithm is optimal.

We now focus the argument by reasoning about a lower bound on the number of data that must be moved from fast memory to slow memory. We will build the argument with a sequence of observations.

Consider the loop

\begin{equation*} \begin{array}{l} {\bf for~} p := 0, \ldots , k-1 \\ ~~~ {\bf for~} j := 0, \ldots , n-1 \\ ~~~ ~~~ {\bf for~} i := 0, \ldots , m-1 \\ ~~~ ~~~ ~~~ \gamma_{i,j} := \alpha_{i,p} \beta_{p,j} + \gamma_{i,j} \\ ~~~ ~~~ {\bf end} \\ ~~~ {\bf end} \\ {\bf end} \end{array} \end{equation*}

One can view the computations

\begin{equation*} \begin{array}{l} ~~~ ~~~ \gamma_{i,j} := \alpha_{i,p} \beta_{p,j} + \gamma_{i,j} \end{array} \end{equation*}

as a cube of points in 3D, $( i,j,k )$ for $0 \leq i \lt m \text{,}$ $0 \leq j \lt n \text{,}$ $0 \leq p \lt k \text{.}$ The set of all possible algorithms that execute each such update only once can be viewed as an arbitrary ordering on that set of points. We can this as indexing the set of all such triples with $i_r \text{,}$ $j_r \text{,}$ $p_r \text{,}$ $0 \leq r \lt m \times n \times k \text{:}$

\begin{equation*} ( i_r, j_r, k_r ). \end{equation*}

so that the algorithm that computes $C := A B + C$ can then be written as

\begin{equation*} \begin{array}{l} {\bf for~} r := 0, \ldots , mnk-1 \\ ~~~ \gamma_{i_r,j_r} := \alpha_{i_r,p_r} \beta_{p_r,j_r} + \gamma_{i_r,j_r} \\ {\bf end} \end{array} \end{equation*}

Obviously, this puts certain restrictions on $i_r \text{,}$ $j_r \text{,}$ and $p_r\text{.}$ Articulating those exactly is not important right now.

We now partition the ordered set $0, \ldots , mnk-1$ into ordered contiguous subranges (phases) each of which requires $S + M$ distinct elements from $A \text{,}$ $B \text{,}$ and $C$ (e.g., $S$ elements from $A \text{,}$ $M/3$ elements of $B \text{,}$ and $2M/3$ elements of $C$), except for the last phase, which will contain fewer. (Strictly speaking, it is a bit more complicated than splitting the range of the iterations, since the last FMA may require anywhere from $0$ to $3$ new elements to be loaded from slow memory. Fixing this is a matter of thinking of the loads that are required as separate from the computation (as our model does) and then splitting the operations - loads, FMAs, and stores - into phases rather than the range. This does not change our analysis.)

Recall that $S$ equals the number of floats that fit in fast memory. A typical phase will start with $S$ elements in fast memory, and will in addition read $M$ elements from slow memory (except for the final phase). % We will call such an ordered subset a phase of triples. Such a typical phase will start at $r = R$ and consists of $F$ triples, $( i_R, j_R, p_R )$ through $( i_{R+F-1}, j_{R+F-1}, p_{R+F-1} )\text{.}$ Let us denote the set of these triples by $\mathbf{D} \text{.}$ These represent $F$ FMAs being performed in our algorithm. Even the first phase needs to read at least $M$ elements since it will require $S + M$ elements to be read.

The key question now becomes what the upper bound on the number of FMAs is that can be performed with $S+M$ elements. Let us denote this bound by $F_{\rm max} \text{.}$ If we know $F_{\rm max}$ as a function of $S + M \text{,}$ then we know that at least $\frac{m n k}{F_{\rm max}} -1$ phases of FMAs need to be executed, where each of those phases requires at least $S$ reads from slow memory. The total number of reads required for any algorithm is thus at least

$$\left(\frac{m n k}{F_{\rm max}} -1 \right) M.\label{eq-lower}\tag{2.5.1}$$

To find $F_{\rm max}$ we make a few observations:

• A typical triple $( i_r,j_r,p_r ) \in \mathbf{D}$ represents a FMA that requires one element from each of matrices $C \text{,}$ $A \text{,}$ and $B\text{:}$ $\gamma_{i,j} \text{,}$ $\alpha_{i,p}\text{,}$ and $\beta_{p,j} \text{.}$

• The set of all elements from $C \text{,}$ $\gamma_{i_r, j_r} \text{,}$ that are needed for the computations represented by the triples in $\mathbf{D}$ are indexed with the tuples $\mathbf{C_D} = \{ ( i_r, j_r ) ~\vert R \leq r \lt R+F \} \text{.}$ If we think of the triples in $\mathbf{D}$ as points in 3D, then $\mathbf{C_D}$ is the projection of those points onto the $i,j$ plane. Its size, $\vert \mathbf{C_D} \vert \text{,}$ tells us how many elements of $C$ must at some point be in fast memory during that phase of computations.

• The set of all elements from $A \text{,}$ $\alpha_{i_r, p_r} \text{,}$ that are needed for the computations represented by the triples in $\mathbf{D}$ are indexed with the tuples $\mathbf{A_D} = \{ ( i_r, p_r ) ~\vert R \leq r \lt R+F \} \text{.}$ If we think of the triples $( i_r, j_r, p_r )$ as points in 3D, then $\mathbf{A_D}$ is the projection of those points onto the $i,p$ plane. Its size, $\vert \mathbf{A_D} \vert \text{,}$ tells us how many elements of $A$ must at some point be in fast memory during that phase of computations.

• The set of all elements from $C \text{,}$ $\beta_{p_r, j_r} \text{,}$ that are needed for the computations represented by the triples in $\mathbf{D}$ are indexed with the tuples $\mathbf{B_D} = \{ ( p_r, j_r ) ~\vert R \leq r \lt R+F \} \text{.}$ If we think of the triples $( i_r, j_r, p_r )$ as points in 3D, then $\mathbf{B_D}$ is the projection of those points onto the $p,j$ plane. Its size, $\vert \mathbf{B_D} \vert \text{,}$ tells us how many elements of $B$ must at some point be in fast memory during that phase of computations.

Now, there is a result known as the discrete Loomis-Whitney inequality that tells us that in our situation $\vert \mathbf{D} \vert \leq \sqrt{\vert \mathbf{C_D} \vert \vert \mathbf{A_D} \vert \vert \mathbf{B_D} \vert }\text{.}$ In other words, $F_{\rm max} \leq \sqrt{\vert \mathbf{C_D} \vert \vert \mathbf{A_D} \vert \vert \mathbf{B_D} \vert }\text{.}$ The name of the game now becomes to find the largest value $F_{\rm max}$ that satisfies

\begin{equation*} {\rm maximize~} F_{\rm max} {\rm ~such~that~} \left\{ \begin{array}{l} F_{\rm max} \leq \sqrt{\vert \mathbf{C_D} \vert \vert \mathbf{A_D} \vert \vert \mathbf{B_D} \vert } \\ \vert \mathbf{C_D} \vert \gt 0, \vert \mathbf{A_D} \vert \gt 0, \vert \mathbf{B_D} \vert \gt 0 \\ \vert \mathbf{C_D} \vert + \vert \mathbf{A_D} \vert + \vert \mathbf{B_D} \vert = S + M. \end{array} \right. \end{equation*}

An application known as Lagrange multipliers yields the solution

\begin{equation*} \vert \mathbf{C_D} \vert = \vert \mathbf{A_D} \vert = \vert \mathbf{B_D} \vert = \frac{S+M}{3} \quad \mbox{and} \quad F_{\rm max} = \frac{( S + M )\sqrt{S+M}}{3 \sqrt{3}}. \end{equation*}

With that largest $F_{\rm max}$ we can then establish a lower bound on the number of memory reads given by (2.5.1):

\begin{equation*} \left(\frac{m n k}{F_{\rm max}} -1 \right) M = \left(3 \sqrt{3} \frac{m n k}{( S + M )\sqrt{S+M}} -1 \right) M. \end{equation*}

Now, $M$ is a free variable. To come up with the sharpest (best) lower bound, we want the largest lower bound. It turns out that, using techniques from calculus, one can show that $M = 2 S$ maximizes the lower bound. Thus, the best lower bound our analysis yields is given by

\begin{equation*} \left(3 \sqrt{3} \frac{m n k}{( 3 S )\sqrt{3 S}} -1 \right) (2S) = 2 \frac{m n k}{\sqrt{S}} - 2S. \end{equation*}

### Subsubsection2.5.1.4A nearly optimal algorithm

We now discuss a (nearly) optimal algorithm for our simplified architecture. Recall that we assume fast memory can hold $S$ elements. For simplicity, assume $S$ is a perfect square. Partition

\begin{equation*} C = \left( \begin{array}{c c c} C_{0,0} \amp C_{0,1} \amp \cdots \\ C_{1,0} \amp C_{1,1} \amp \cdots \\ \vdots \amp \vdots \amp \end{array} \right), \quad A = \left(\begin{array}{c} A_0 \\ A_1 \\ \vdots \end{array} \right), \quad \mbox{and} \quad B = \left( \begin{array}{c c c} B_{0} \amp B_{1} \amp \cdots \end{array} \right). \end{equation*}

where $C_{i,j}$ is $m_R \times n_R \text{,}$ $A_{i}$ is $m_R \times k \text{,}$ and $B_j$ is $k \times n_R \text{.}$ Here we choose $m_R \times n_R = (\sqrt{S}-1) \times \sqrt{S}$ so that fast memory can hold one submatrix $C_{i,j} \text{,}$ one column of $A_i \text{,}$ and one element of $B_j \text{:}$ $m_R \times n_R + m_R + 1 = (\sqrt{S}-1) \times \sqrt{S} + \sqrt{S}-1 + 1 = S \text{.}$

When computing $C := A B + C \text{,}$ we recognize that $C_{i,j} := A_i B_j + C_{i,j} \text{.}$ Now, let's further partition

\begin{equation*} A_i = \left(\begin{array}{c c c} a_{i,0} \amp a_{i,1} \amp \cdots \end{array} \right) \quad \mbox{and} \quad B_j = \left(\begin{array}{c} b_{0,j}^T \\ b_{1,j}^T \\ \vdots \end{array} \right). \end{equation*}

We now recognize that $C_{i,j} := A_i B_j + C_{i,j}$ can be computed as

\begin{equation*} C_{i,j} := a_{i,0} b_{0,j}^T + a_{i,1} b_{1,j}^T + \cdots, \end{equation*}

the by now very familiar sequence of rank-1 updates that makes up the micro-kernel discussed in Unit 2.4.1. The following loop exposes the computation $C := A B + C \text{,}$ including the loads and stores from and to slow memory:

\begin{equation*} \begin{array}{l} {\bf for~} j := 0, \ldots , N-1 \\ ~~~ {\bf for~} i := 0, \ldots , M-1 \\ ~~~ ~~~ {\rm Load~} C_{i,j} {\rm ~into~fast~memory} \\ ~~~ ~~~ {\bf for~} p := 0, \ldots , k-1 \\ ~~~ ~~~ ~~~ {\rm Load~} a_{i,p} {\rm ~and~} b_{p,j}^T {\rm ~into~fast~memory} \\ ~~~ ~~~ ~~~ C_{i,j} := a_{i,p} b_{p,j}^T + C_{i,j} \\ ~~~ ~~~ {\bf end} \\ ~~~ ~~~ {\rm Store~} C_{i,j} {\rm ~to~slow~memory} \\ ~~~ {\bf end} \\ {\bf end} \end{array} \end{equation*}

For simplicity, here $M = m/m_r$ and $N = n/n_r \text{.}$

On the surface, this seems to require $C_{i,j} \text{,}$ $a_{i,p} \text{,}$ and $b_{p,j}^T$ to be in fast memory at the same time, placing $m_R \times n_R + m_R + n_r = S + \sqrt{S}-1$ floats in fast memory. However, we have seen before that the rank-1 update $C_{i,j} := a_{i,p} b_{p,j}^T + C_{i,j}$ can be implemented as a loop around axpy operations, so that the elements of $b_{p,j}^T$ only need to be in fast memory one at a time.

Let us now analyze the number of memory operations incurred by this algorithm:

• Loading and storing all $C_{i,j}$ incurs $m n$ loads and $m n$ stores, for a total of $2 m n$ memory operations. (Each such block is loaded once and stored once, meaning every element of $C$ is loaded once and stored once.)

• Loading all $a_{i,p}$ requires

\begin{equation*} M N k m_R = (M m_R ) N k = m \frac{n}{n_R} k = \frac{m n k}{\sqrt{S}} \end{equation*}

memory operations.

• Loading all $b_{p,j}^T$ requires

\begin{equation*} M N k n_R = M (N n_R) k = \frac{m}{m_R} n k = \frac{m n k}{\sqrt{S}-1} \end{equation*}

memory operations.

The total number of memory operations is hence

\begin{equation*} 2 m n + \frac{m n k}{\sqrt{S}} + \frac{m n k}{\sqrt{S}-1} = 2 \frac{m n k}{\sqrt{S}} + 2 m n + \frac{mnk}{S - \sqrt{S}}. \end{equation*}

We can now compare this to the lower bound from the last unit:

\begin{equation*} 2 \frac{mnk}{\sqrt{S}} - 2S. \end{equation*}

The cost of reading and writing elements of $C \text{,}$ $2 m n \text{,}$ contributes a lower order term, as does $\frac{mnk}{S - \sqrt{S}}$ if $S$ (the size of fast memory) is reasonably large. Thus, the proposed algorithm is nearly optimal with regards to the amount of data that is moved between slow memory and fast memory.

### Subsubsection2.5.1.5Discussion

What we notice is that the algorithm presented in the last unit is quite similar to the algorithm that in the end delivered good performance in [23]. It utilizes most of fast memory (registers in [23]) with a submatrix of $C \text{.}$ Both organize the computation in terms of a kernel that performs rank-1 updates of that submatrix of $C \text{.}$

The theory suggests that the number of memory operations are minimized if the block of $C$ is chosen to be (roughly) square. In Unit 2.4.1, the best performance was observed with a kernel that chose the submatrix of $C$ in registers to be $8 \times 6 \text{,}$ so for this architecture, the theory is supported by practice. For other architectures, there may be issues that skew the aspect ratio to be less square.